
Subject  Author
* The Impending Return of Concertina III  Quadibloc
`* Re: The Impending Return of Concertina III  Quadibloc
 `* Re: The Impending Return of Concertina III  Quadibloc
  +* Re: The Impending Return of Concertina III  Robert Finch
  |+* Re: The Impending Return of Concertina III  Quadibloc
  ||`- Re: The Impending Return of Concertina III  Quadibloc
  |`* Re: The Impending Return of Concertina III  BGB
  | +* Re: The Impending Return of Concertina III  Quadibloc
  | |+- Re: The Impending Return of Concertina III  Quadibloc
  | |`- Re: The Impending Return of Concertina III  BGB
  | `* Re: The Impending Return of Concertina III  MitchAlsup1
  |  +* Re: The Impending Return of Concertina III  Brian G. Lucas
  |  |`* Re: The Impending Return of Concertina III  Chris M. Thomasson
  |  | `* Re: The Impending Return of Concertina III  Scott Lurndal
  |  |  `- Re: The Impending Return of Concertina III  Chris M. Thomasson
  |  `* Re: The Impending Return of Concertina III  BGB
  |   `* Re: The Impending Return of Concertina III  MitchAlsup1
  |    `* Re: The Impending Return of Concertina III  BGB
  |     +- Re: The Impending Return of Concertina III  MitchAlsup1
  |     `* Re: The Impending Return of Concertina III  MitchAlsup1
  |      `* Re: The Impending Return of Concertina III  BGB
  |       `* Re: The Impending Return of Concertina III  MitchAlsup1
  |        +- Re: The Impending Return of Concertina III  BGB
  |        `* Re: The Impending Return of Concertina III  Robert Finch
  |         +- Re: The Impending Return of Concertina III  MitchAlsup1
  |         `* Re: The Impending Return of Concertina III  BGB
  |          `* Re: The Impending Return of Concertina III  Robert Finch
  |           `* Re: The Impending Return of Concertina III  MitchAlsup1
  |            `- Re: The Impending Return of Concertina III  BGB
  `- Re: The Impending Return of Concertina III  Quadibloc

The Impending Return of Concertina III

<uone2m$14id5$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37060&group=comp.arch#37060
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 04:07:50 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <uone2m$14id5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Jan 2024 04:07:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0309fe518fd96c2ea903419c2efde436";
logging-data="1198501"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18XCjr2STU3Cv63UGTPoBAMsFpov7U5MWY="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:0gZE4svoDwI9hbaMMRDf7pMelLM=
 by: Quadibloc - Tue, 23 Jan 2024 04:07 UTC

As I have noted, the original Concertina architecture was not a
serious proposal for a computer architecture, but merely a
description of an architecture intended to illustrate how
computers work.

Concertina II was a step above that; somewhat serious, but
not fully so; still too idiosyncratic to be taken seriously
as an alternative.

But in a discussion of Concertina II - or, rather, in a thread
that started with Concertina II, but went on to discussing
other things - it was noted that RISC-V is badly flawed.

In that case, an alternative is needed. I need to go beyond
Concertina II - with which I am satisfied now as meeting its
goals, finally - to something that could be considered genuinely
serious.

At the moment, only a link to Concertina III is present on my
main page, no content is yet present.

John Savard

Re: The Impending Return of Concertina III

<uonn2g$15qes$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37061&group=comp.arch#37061
Path: i2pn2.org!i2pn.org!news.niel.me!news.gegeweb.eu!gegeweb.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 06:41:20 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <uonn2g$15qes$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Jan 2024 06:41:20 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0309fe518fd96c2ea903419c2efde436";
logging-data="1239516"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19I+qXt5wHJw9xvoIZjaFAnSXkTd7VKZkE="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:7/aDFFNqzZ/zm/KIXM7WXhrk2ss=
 by: Quadibloc - Tue, 23 Jan 2024 06:41 UTC

On Tue, 23 Jan 2024 04:07:50 +0000, I wrote:

> At the moment, only a link to Concertina III is present on my
> main page, no content is yet present.

The first few pages, with diagrams of this ultimate simplification
of Concertina II, are now present, starting at

http://www.quadibloc.com/arch/ct19int.htm

I've gone to 15-bit displacements, in order to avoid compromising
addressing modes, while allowing 16-bit instructions without
switching to an alternate instruction set.

Possibly using only three base registers is also sufficiently
non-violent to the addressing modes that I should have done that
instead, so I will likely give consideration to that option in
the days ahead.

Unfortunately, since I have been convinced that pseudo-immediate
values are a necessity, I could not get rid of block structure,
which is, of course, as noted, the major impediment to this ISA
being considered for widespread adoption.

John Savard

Re: The Impending Return of Concertina III

<uoo255$17ka9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37063&group=comp.arch#37063
Path: i2pn2.org!i2pn.org!news.furie.org.uk!nntp.terraraq.uk!news.gegeweb.eu!gegeweb.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 09:50:29 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <uoo255$17ka9$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Jan 2024 09:50:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0309fe518fd96c2ea903419c2efde436";
logging-data="1298761"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX181eNKThSW3WqXjtMQ5RZ53cWNBzoS6Rnk="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:AcFVOTavYs9TU/G6pgLALfDlXL4=
 by: Quadibloc - Tue, 23 Jan 2024 09:50 UTC

On Tue, 23 Jan 2024 06:41:20 +0000, I wrote:

> I've gone to 15-bit displacements, in order to avoid compromising
> addressing modes, while allowing 16-bit instructions without
> switching to an alternate instruction set.
>
> Possibly using only three base registers is also sufficiently
> non-violent to the addressing modes that I should have done that
> instead, so I will likely give consideration to that option in
> the days ahead.

I have indeed decided that using three base registers for the
basic load-store instructions is much preferable to shortening the
length of the displacement even by one bit.

John Savard

Re: The Impending Return of Concertina III

<uooa4p$1900g$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37064&group=comp.arch#37064
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 07:06:47 -0500
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <uooa4p$1900g$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Jan 2024 12:06:49 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="5030bca1e83005162f1355f68ed0de5f";
logging-data="1343504"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18AmE7TRPj7wfkaMqFAbEVfoN5esQ25QVQ="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:R+19/3gqiz5gntDFjL97a6HQwsE=
Content-Language: en-US
In-Reply-To: <uoo255$17ka9$1@dont-email.me>
 by: Robert Finch - Tue, 23 Jan 2024 12:06 UTC

On 2024-01-23 4:50 a.m., Quadibloc wrote:
> On Tue, 23 Jan 2024 06:41:20 +0000, I wrote:
>
>> I've gone to 15-bit displacements, in order to avoid compromising
>> addressing modes, while allowing 16-bit instructions without
>> switching to an alternate instruction set.
>>
>> Possibly using only three base registers is also sufficiently
>> non-violent to the addressing modes that I should have done that
>> instead, so I will likely give consideration to that option in
>> the days ahead.
>
> I have indeed decided that using three base registers for the
> basic load-store instructions is much preferable to shortening the
> length of the displacement even by one bit.
>
> John Savard

Packing and unpacking DFP numbers does not take a lot of logic, assuming
one of the common DPD packing methods. The number of registers handling
DFP values could be doubled if they were unpacked and packed for each
operation. Since DFP arithmetic has a high latency anyway, for example
Q+ the DFP unit unpacks, performs the operation, then repacks the DFP
number. So, registers only need be 128-bit.

256 bits seems a little narrow for a vector register. I have seen
several other architectures with vector registers supporting 16+ 32-bit
values, or a length of 512-bits. This is also the width of a typical
cache line.

Having the base register implicitly encoded in the instruction is a way
to reduce the number of bits used to represent the base register. There
seems to be a lot of different base register usages. Will not that make
the compiler more difficult to write?

Does array addressing mode have memory indirect addressing? It seems
like a complex mode to support.

Block headers are tricky to use. They need to follow the output of the
instructions in the assembler so that the assembler has time to generate
the appropriate bits for the header. The entire instruction block needs
to be flushed at the end of a function.

Re: The Impending Return of Concertina III

<uoodn5$19c1l$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37065&group=comp.arch#37065
Path: i2pn2.org!i2pn.org!news.niel.me!news.gegeweb.eu!gegeweb.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 13:07:49 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 100
Message-ID: <uoodn5$19c1l$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Jan 2024 13:07:49 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0309fe518fd96c2ea903419c2efde436";
logging-data="1355829"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/5Ko3pxl7Mo2xWZk6BWX+3Z7sa3KpPX/A="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:k7lFOkKA13rc3s/J9bUW05Z2jw8=
 by: Quadibloc - Tue, 23 Jan 2024 13:07 UTC

On Tue, 23 Jan 2024 07:06:47 -0500, Robert Finch wrote:

> Packing and unpacking DFP numbers does not take a lot of logic, assuming
> one of the common DPD packing methods.

Well, I'm thinking of the method used by IBM. It is true that method
was designed to use a minimal amount of logic.

> The number of registers handling
> DFP values could be doubled if they were unpacked and packed for each
> operation.

Not doubled, only increased from 24 to 32.

> Since DFP arithmetic has a high latency anyway, for example
> Q+ the DFP unit unpacks, performs the operation, then repacks the DFP
> number. So, registers only need be 128-bit.

I don't believe in wasting any time. And the latency of DFP operations
can be reduced; it is possible to design a Wallace Tree multiplier for
BCD arithmetic.

> 256 bits seems a little narrow for a vector register.

The original Concertina architecture, which had short vector registers
of that size, was designed before AVX-512 was invented. Rather than attempting
to keep revising the size of the short vector registers to keep up, the
ISA also includes long vector registers.

These are patterned after the vector registers of the Cray-1, and have room
for 64 double-precision floating-point numbers each.

> I have seen
> several other architectures with vector registers supporting 16+ 32-bit
> values, or a length of 512-bits. This is also the width of a typical
> cache line.

> Having the base register implicitly encoded in the instruction is a way
> to reduce the number of bits used to represent the base register.

Instead of base registers, then, there would be a code segment register
and a data segment register, like on x86. But then how do I access data
belonging to another subroutine? Without variable length instructions,
segment prefixes like on x86 aren't an option. (There actually are
instruction prefixes in the ISA, but they're not intended to be
_common_!)

> There
> seems to be a lot of different base register usages. Will not that make
> the compiler more difficult to write?

I suppose it could. The idea is basically that a program would pick
one memory model and stick with it - a normal program would use the
base registers connected with 16-bit displacements for everything...
except that, where different routines share access to a small area of
memory, then that pointer can be put in a base register for 12-bit
displacements.

> Does array addressing mode have memory indirect addressing? It seems
> like a complex mode to support.

It does indeed use indirect addressing. The idea is that if your
program has a large number of arrays which are over 64K in size,
it shouldn't be necessary to either consume a base register for
each array, or freshly load a base register with the array address
every time it's referenced.

Using the mode is simple enough; basically, the address in the
instruction is effectively the name of the array instead of its
address, and the array is indexed normally.

Of course, there's the overhead of indirection on every access.
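
The cost of that indirection can be made concrete with a small model. This is only a sketch under stated assumptions: the `Memory` class, the slot address, and the 8-byte element size are hypothetical illustrations, not details of the Concertina encoding.

```python
class Memory:
    """Toy word-addressed memory that counts how many reads it serves."""
    def __init__(self):
        self.words = {}
        self.reads = 0

    def load(self, addr):
        self.reads += 1
        return self.words[addr]

def load_array_mode(mem, slot_addr, index, elem_size=8):
    """Array mode: the instruction names a slot that holds the array's address."""
    base = mem.load(slot_addr)                  # first read: fetch the array address
    return mem.load(base + index * elem_size)   # second read: the element itself

def load_absolute(mem, array_addr, index, elem_size=8):
    """Absolute mode: the array address is carried with the instruction."""
    return mem.load(array_addr + index * elem_size)
```

In this model the array mode always costs two memory reads per element, while an address carried with the instruction costs one.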

So in Concertina II, I had added a new addressing mode which
simply uses the same feature that allows immediate values to
tack a 64-bit absolute address on to an instruction. (Since it
looks like a 64-bit number, the linking loader can relocate it.)
That fancy feature, though, was too much complication for this
stripped-down ISA.

> Block headers are tricky to use. They need to follow the output of the
> instructions in the assembler so that the assembler has time to generate
> the appropriate bits for the header. The entire instruction block needs
> to be flushed at the end of a function.

I don't see an alternative, though, to block structure to allow instructions
to have, in the instruction stream, immediate values of any length, and yet
allow instructions to be rapidly decoded in parallel as if they were all
32 bits long.

And block structure also allows instruction parallelism to be explicitly
indicated.

If you decide not to use the block header feature, though, what you have
left is still a perfectly good ISA. So people can support the architecture
with a basic compiler which doesn't make full use of the chip's features,
and then a fancier compiler which produces more optimal code can make the
effort to handle the block headers.
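
A rough model of the idea, assuming a hypothetical block of eight 32-bit slots in which a header field says how many slots hold instructions and the rest hold immediate data. The slot count, the `n_code` field, and both function names are assumptions made for illustration, not the actual Concertina block format.

```python
SLOTS_PER_BLOCK = 8  # assumption: a 256-bit block made of eight 32-bit slots

def split_block(slots, n_code):
    """Split one block into (instruction slots, immediate-data slots).

    n_code stands in for a header field giving the number of instruction slots.
    """
    if len(slots) != SLOTS_PER_BLOCK or not 0 <= n_code <= SLOTS_PER_BLOCK:
        raise ValueError("malformed block")
    return slots[:n_code], slots[n_code:]

def fetch_immediate(slots, first_slot, n_slots):
    """Build an immediate of n_slots * 32 bits from consecutive slots,
    most significant slot first."""
    value = 0
    for word in slots[first_slot:first_slot + n_slots]:
        value = (value << 32) | (word & 0xFFFFFFFF)
    return value
```

Every instruction slot stays 32 bits wide, so all of them can be decoded at once, while an instruction that needs a long constant only points at the slots that hold it.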

John Savard

Re: The Impending Return of Concertina III

<uoogr3$1a25p$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37066&group=comp.arch#37066
Path: i2pn2.org!i2pn.org!news.chmurka.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 14:01:07 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <uoogr3$1a25p$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uoodn5$19c1l$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Jan 2024 14:01:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0309fe518fd96c2ea903419c2efde436";
logging-data="1378489"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18JqHMIdYSK8aCHBJmKs+Ngz6faVQtoU+U="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:RIk9TYSf0aaiLKA9LdEoG9mHeiU=
 by: Quadibloc - Tue, 23 Jan 2024 14:01 UTC

On Tue, 23 Jan 2024 13:07:49 +0000, Quadibloc wrote:

> So in Concertina II, I had added a new addressing mode which
> simply uses the same feature that allows immediate values to
> tack a 64-bit absolute address on to an instruction. (Since it
> looks like a 64-bit number, the linking loader can relocate it.)
> That fancy feature, though, was too much complication for this
> stripped-down ISA.

This discussion has convinced me that this addressing mode,
although relegated to an alternate instruction set in Concertina II,
is important enough for maximizing performance that it does need
to be included in Concertina III, and the appropriate changes
have been made.

John Savard

Re: The Impending Return of Concertina III

<uop5li$1du2c$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37069&group=comp.arch#37069
Path: i2pn2.org!i2pn.org!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 13:56:32 -0600
Organization: A noiseless patient Spider
Lines: 170
Message-ID: <uop5li$1du2c$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Jan 2024 19:56:35 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9acc16816d67b8aaced892bf499b0b92";
logging-data="1505356"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19TRBBZuuTDSntXhbhqpBpB"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:GTJuR78wH9S/MHnKRu6fC4ejMmU=
In-Reply-To: <uooa4p$1900g$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 23 Jan 2024 19:56 UTC

On 1/23/2024 6:06 AM, Robert Finch wrote:
> On 2024-01-23 4:50 a.m., Quadibloc wrote:
>> On Tue, 23 Jan 2024 06:41:20 +0000, I wrote:
>>
>>> I've gone to 15-bit displacements, in order to avoid compromising
>>> addressing modes, while allowing 16-bit instructions without
>>> switching to an alternate instruction set.
>>>
>>> Possibly using only three base registers is also sufficiently
>>> non-violent to the addressing modes that I should have done that
>>> instead, so I will likely give consideration to that option in
>>> the days ahead.
>>
>> I have indeed decided that using three base registers for the
>> basic load-store instructions is much preferable to shortening the
>> length of the displacement even by one bit.
>>
>> John Savard
>
> Packing and unpacking DFP numbers does not take a lot of logic, assuming
> one of the common DPD packing methods. The number of registers handling
> DFP values could be doubled if they were unpacked and packed for each
> operation. Since DFP arithmetic has a high latency anyway, for example
> Q+ the DFP unit unpacks, performs the operation, then repacks the DFP
> number. So, registers only need be 128-bit.
>

In my case, had experimented with BCD instructions and DPD pack/unpack.
They were operating with 16 digits packed BCD (64-bits) or 15-digits in
50 bits (DPD). The ops could daisy-chain to support 32-digit or 48-digit
calculations.
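
The digit counts above follow from the encodings: packed BCD spends 4 bits per digit, and DPD spends 10 bits per group of 3 digits. A minimal sketch of the packed-BCD side, with hypothetical helper names (the DPD declet encoding itself is not reproduced here):

```python
def pack_bcd(digits):
    """Pack decimal digits (most significant first) at 4 bits per digit."""
    value = 0
    for d in digits:
        if not 0 <= d <= 9:
            raise ValueError("not a decimal digit")
        value = (value << 4) | d
    return value

def unpack_bcd(value, ndigits):
    """Recover ndigits decimal digits (most significant first) from packed BCD."""
    return [(value >> (4 * i)) & 0xF for i in reversed(range(ndigits))]

def dpd_bits(ndigits):
    """Bits needed by DPD, which stores every 3 digits in one 10-bit declet."""
    if ndigits % 3 != 0:
        raise ValueError("this sketch only handles whole declets")
    return (ndigits // 3) * 10
```

Sixteen digits packed this way fill exactly 64 bits, and `dpd_bits(15)` gives the 50 bits quoted above.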

Had the issue that there was a lack of a compelling use case (to justify
the added cost).

Even with this, a format like Decimal128 is still going to be slower
than Binary128, and BCD ops had few other obvious compelling use-cases.

In practice, the main feature that these instructions added was the
realization that they could be used for faster Binary<->Decimal
conversion. But, even then, "potentially shaves some clock cycles off
the printf's" isn't all that compelling.

And, if one wants decimal floating point, something more akin to the
format that .NET had used makes more sense (using 32-bit chunks to
represent linear values in the range of 000000000..999999999).
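
A sketch of arithmetic on such chunks, assuming only what is described here: each 32-bit chunk holds a value in 0..999999999, with the least significant chunk first. The helper names are hypothetical, and this is not claimed to match any particular library's layout.

```python
BASE = 10**9  # each chunk holds 0..999999999, which fits in 32 bits

def to_limbs(n):
    """Split a non-negative integer into base-10**9 chunks, least significant first."""
    limbs = []
    while True:
        n, r = divmod(n, BASE)
        limbs.append(r)
        if n == 0:
            return limbs

def add_limbs(a, b):
    """Add two chunked numbers with a decimal carry between chunks."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = carry + (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
        carry, chunk = divmod(s, BASE)
        out.append(chunk)
    if carry:
        out.append(carry)
    return out

def from_limbs(limbs):
    """Rebuild the integer from its chunks."""
    n = 0
    for limb in reversed(limbs):
        n = n * BASE + limb
    return n
```

Each chunk is handled with ordinary binary arithmetic, and only the carry between chunks is decimal.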

> 256 bits seems a little narrow for a vector register. I have seen
> several other architectures with vector registers supporting 16+ 32-bit
> values, or a length of 512-bits. This is also the width of a typical
> cache line.
>

My case:
Narrow SIMD, 64-bit or 2x 64-bit.

Most data fits nicely in 2 or 4 element vectors, but is harder pressed
to make effective use of wider vectors unless one is effectively
SIMD'ing the SIMD operations.

Though, I guess some amount of vector stuff seems to try to present an
abstraction of looping over arrays, rather than say: "Here is a 3D
vector, calculate a dot or cross product, ..."

> Having the base register implicitly encoded in the instruction is a way
> to reduce the number of bits used to represent the base register. There
> seems to be a lot of different base register usages. Will not that make
> the compiler more difficult to write?
>

Yes. Short of 16-bit ops or similar, personally I would advise against
this sort of thing.

Better to have instructions that can access all of the registers at the
same time.

> Does array addressing mode have memory indirect addressing? It seems
> like a complex mode to support.
>

IME, the main address modes are:
(Rm, Disp) // ~ 66% +/- 10%
(Rm, Ro*FixSc) // ~ 33% +/- 10%
Where: FixSc matches the element size.
Pretty much everything else falls into the noise.

RISC-V only has the former, but kinda shoots itself in the foot:
GCC is good at eliminating most SP relative loads/stores;
That means, the nominal percentage of indexed is even higher...

As a result, the code is basically left doing excessive amounts of
shifts and adds, which (vs BJX2) effectively dethrone the memory
load/store ops for top-place.
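
The two address modes listed above, and the shift-and-add sequence needed when only the displacement form exists, can be sketched as follows. The function names and the 64-bit wraparound are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1  # assume 64-bit address arithmetic that wraps

def ea_disp(rm, disp):
    """(Rm, Disp): base register plus signed displacement."""
    return (rm + disp) & MASK64

def ea_indexed(rm, ro, elem_size):
    """(Rm, Ro*FixSc): base plus index scaled by the element size."""
    return (rm + ro * elem_size) & MASK64

def ea_shift_add(rm, ro, log2_size):
    """The same indexed address built from a shift and an add,
    as needed when only the (Rm, Disp) form is available."""
    tmp = (ro << log2_size) & MASK64   # SLLI tmp, ro, log2_size
    tmp = (tmp + rm) & MASK64          # ADD  tmp, tmp, rm
    return ea_disp(tmp, 0)             # load/store at 0(tmp)
```

The indexed form does in one memory instruction what the fallback does in three.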

Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
also shoots itself in the foot. Because, not only has one hit the limits
of the ALU and LD/ST ops, there are no cheap fallbacks for intermediate
range constants.
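
For reference, the usual way past the 12-bit limit is a LUI/ADDI pair, with the upper part adjusted because ADDI sign-extends its immediate. A minimal sketch of that split (helper names are hypothetical):

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def fits_imm12(value):
    """True if value fits the 12-bit signed immediate of ALU and load/store ops."""
    return -2048 <= value <= 2047

def split_hi_lo(value):
    """Split a 32-bit constant into (upper 20 bits for LUI, signed low 12 for ADDI)."""
    lo = sign_extend(value, 12)
    hi = ((value - lo) >> 12) & 0xFFFFF   # compensates for the sign-extended low part
    return hi, lo

def rebuild(hi, lo):
    """What LUI hi followed by ADDI lo leaves in the register (as a signed 32-bit value)."""
    return sign_extend((hi << 12) + lo, 32)
```

Any constant outside -2048..2047 therefore costs at least two instructions, however small the overshoot.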

If my compiler, with its arguably poor optimizer and barely functional
register allocation, is beating GCC for performance (when targeting
RISC-V), I don't really consider this a win for some of RISC-V's design
choices.

And, if GCC in its great wisdom, is mostly loading constants from memory
(having apparently offloaded most of them into the ".data" section),
this is also not a good sign.

Also, needing to use shift-pairs to sign and zero extend things is a bit
weak as well, ...

Though, theoretically, one can at least sign-extend 'int' with:
ADDIW Xd, Xs, 0
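
A small model of both approaches on 64-bit registers, shown for 16-bit values with the shift pair and for 'int' with ADDIW. The Python helpers are hypothetical stand-ins for the instructions named in the comments.

```python
MASK64 = (1 << 64) - 1

def to_signed64(x):
    """View a 64-bit register value as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def slli(x, sh):
    return (x << sh) & MASK64              # logical shift left

def srli(x, sh):
    return (x & MASK64) >> sh              # logical shift right

def srai(x, sh):
    return (to_signed64(x) >> sh) & MASK64  # arithmetic shift right

def sext16_shift_pair(x):
    """Sign-extend the low 16 bits: SLLI 48 then SRAI 48."""
    return srai(slli(x, 48), 48)

def zext16_shift_pair(x):
    """Zero-extend the low 16 bits: SLLI 48 then SRLI 48."""
    return srli(slli(x, 48), 48)

def addiw_zero(x):
    """ADDIW Xd, Xs, 0: keep the low 32 bits and sign-extend them to 64."""
    low = x & 0xFFFFFFFF
    return (low | 0xFFFFFFFF00000000) if low >> 31 else low
```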

Another minor annoyance:
Bxx Rs, Rt, Disp //Compare two registers and branch
Is needlessly expensive (both for encoding space and logic cost).
Much of the time, Rs or Rt are X0 anyways.
Meanwhile:
Bxx Rs, Disp //Compare with zero and branch
Has much of the benefit, but at a lower cost.

So, say, "if(x>10)" goes from, say:
LI X8, 10
BGT X10, X8, label
To, say:
SLTI X8, X10, 11
BEQ X8, label
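
That the two sequences branch on the same condition can be checked with a small model. The functions are hypothetical and simply mirror the instruction pairs above.

```python
def branch_two_reg(x):
    """LI X8, 10 ; BGT X10, X8, label"""
    x8 = 10
    return x > x8                  # taken when x > 10

def branch_vs_zero(x):
    """SLTI X8, X10, 11 ; BEQ X8, label (compare with zero)"""
    x8 = 1 if x < 11 else 0        # SLTI sets X8 to 1 when x < 11
    return x8 == 0                 # taken when X8 is zero, i.e. x >= 11
```

For integers, x >= 11 and x > 10 are the same test, so both forms take the branch for the same inputs.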

Also, as a random annoyance, RISC-V's instruction layout is very
difficult to decipher from a hexadecimal view. One basically needs to
dump it in binary to make it viable to mentally parse and lookup
instructions, which sucks.
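
To illustrate, here is a sketch that pulls the fields out of an R-type RISC-V instruction word. The 5-bit register fields start at bits 7, 15 and 20, so none of them lines up with a 4-bit hex digit. The function name is hypothetical.

```python
def decode_rtype(insn):
    """Extract the fields of a 32-bit R-type RISC-V instruction."""
    return {
        "opcode": insn & 0x7F,           # bits 6:0
        "rd":     (insn >> 7) & 0x1F,    # bits 11:7
        "funct3": (insn >> 12) & 0x7,    # bits 14:12
        "rs1":    (insn >> 15) & 0x1F,   # bits 19:15
        "rs2":    (insn >> 20) & 0x1F,   # bits 24:20
        "funct7": (insn >> 25) & 0x7F,   # bits 31:25
    }
```

For example, `decode_rtype(0x00A50533)` yields opcode 0x33 with rd, rs1 and rs2 all equal to 10 (ADD X10, X10, X10), yet register 10 shows up in the hex as the digits A, 5 and 5 because of the odd field alignment.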

I will count this one in BJX2's favor, in that it isn't quite such a
horrid level of suck to mentally decode instructions presented in
hexadecimal form.

Granted, BJX2 has some design flaws as well.

But, as noted, in a "head to head" comparison BJX2 is seemingly holding
up fairly OK (despite my compiler's level of suck).

But, this is part of why I had kept putting RISC-V support on the back
shelf so much. Like, yes, it is a more popular ISA, and wasn't too hard
to support with my pipeline, but... It just kinda sucks as well...

Like, it isn't due to issues of lacking fancy features, so much as all
the areas where it "shoots itself in the foot" with more basic features.

> Block headers are tricky to use. They need to follow the output of the
> instructions in the assembler so that the assembler has time to generate
> the appropriate bits for the header. The entire instruction block needs
> to be flushed at the end of a function.
>

Agreed. Would not be in favor of block-headers or block structuring.
Linear instruction formats are preferable, preferably in 32-bit chunks.

Re: The Impending Return of Concertina III

<uop9ch$1eg72$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37070&group=comp.arch#37070
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 21:00:01 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <uop9ch$1eg72$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Jan 2024 21:00:01 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0309fe518fd96c2ea903419c2efde436";
logging-data="1523938"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Zmjq+8baXQBaIP+ezhriPvfznMo0WUXA="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:Xv6B4oZ45be3AT/GjSHvOKMfHS0=
 by: Quadibloc - Tue, 23 Jan 2024 21:00 UTC

On Tue, 23 Jan 2024 13:56:32 -0600, BGB wrote:

> Agreed. Would not be in favor of block-headers or block structuring.
> Linear instruction formats are preferable, preferably in 32-bit chunks.

The good news is that, although Concertina III still has block structure,
it gives you a choice. The ISA is similar to a RISC architecture, but
with a number of added features, if you just use 32-bit instructions.

On Concertina II, you need to use block structure for:

- 17-bit instructions
- Immediate constants other than 8-bit or 16-bit
- Absolute array addresses
- Instruction prefixes
- Explicit indication of parallelism
- Instruction predication

On Concertina III, you need to use block structure for immediate constants other
than 8 bit, but the 16-bit instructions and the absolute array addresses are
available without block structure.

As it stands, Concertina III doesn't have instruction predication at all, which
is a deficiency I will need to see if I can remedy.

John Savard

Re: The Impending Return of Concertina III

<uopb09$1eqrg$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37073&group=comp.arch#37073
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 21:27:38 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <uopb09$1eqrg$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me> <uop9ch$1eg72$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Jan 2024 21:27:38 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0309fe518fd96c2ea903419c2efde436";
logging-data="1534832"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+LAezDmy/uezQ3yWkYbkmZEc17AHFmsPQ="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:rfeo+5eqoSutKYRshrZGEwCuVws=
 by: Quadibloc - Tue, 23 Jan 2024 21:27 UTC

On Tue, 23 Jan 2024 21:00:01 +0000, Quadibloc wrote:

> the absolute array addresses are
> available without block structure.

No; they may not be in an alternate instruction set, but
they still are like pseudo-immediates, so they do need
the block structure.

John Savard

Re: The Impending Return of Concertina III

<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37075&group=comp.arch#37075
Date: Tue, 23 Jan 2024 22:10:21 +0000
Subject: Re: The Impending Return of Concertina III
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$1t/l20o3f.aN7waPRydL..6c5yAj5eqii4haXpQ3Ak1PZQdipshYG
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me> <uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me> <uop5li$1du2c$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
 by: MitchAlsup1 - Tue, 23 Jan 2024 22:10 UTC

BGB wrote:

> On 1/23/2024 6:06 AM, Robert Finch wrote:
>>

> IME, the main address modes are:
> (Rm, Disp) // ~ 66% +/- 10%
> (Rm, Ro*FixSc) // ~ 33% +/- 10%
> Where: FixSc matches the element size.
> Pretty much everything else falls into the noise.

With dynamically linked libraries one needs:: k is constant at link time

LD Rd,[IP,GOT[k]] // get a pointer to the external variable
and
CALX [IP,GOT[k]] // call external entry point

But now that you have the above you can easily get::

CALX [IP,Ri<<3,Table] // call indexed method
// can also be used for threaded JITs
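
A rough model of the indexed call above, reading [IP,Ri<<3,Table] as "load the target from IP + (Ri << 3) + Table". That reading, the 8-byte entry size, and the names below are assumptions for illustration, not the My 66000 definition.

```python
def indexed_entry_address(ip, ri, table_disp):
    """Address of entry Ri in a table of 8-byte pointers located relative to IP."""
    return ip + (ri << 3) + table_disp

def call_target_indexed(memory, ip, ri, table_disp):
    """Fetch the call target from the table; memory maps address -> stored pointer."""
    return memory[indexed_entry_address(ip, ri, table_disp)]
```

The single memory-reference form replaces a separate shift, add, load and indirect call.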

> RISC-V only has the former, but kinda shoots itself in the foot:
> GCC is good at eliminating most SP relative loads/stores;
> That means, the nominal percentage of indexed is even higher...

A funny thing happens when you get rid of the "extra instructions"
most RISC ISAs cause you to have in your instruction stream::
a) the number of instructions goes down
b) you get rid of the easy instructions
c) leaving all the complicated ones remaining

> As a result, the code is basically left doing excessive amounts of
> shifts and adds, which (vs BJX2) effectively dethrone the memory
> load/store ops for top-place.

These are the easy instructions that are not necessary when ISA is
properly conceived.

> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
> also shoots itself in the foot. Because, not only has one hit the limits
> of the ALU and LD/ST ops, there are no cheap fallbacks for intermediate
> range constants.

My 66000 has constants of all sizes for all instructions.

> If my compiler, with its arguably poor optimizer and barely functional
> register allocation, is beating GCC for performance (when targeting
> RISC-V), I don't really consider this a win for some of RISC-V's design
> choices.

When you benchmark against a strawman, cows get to eat.

> And, if GCC in its great wisdom, is mostly loading constants from memory
> (having apparently offloaded most of them into the ".data" section),
> this is also not a good sign.

Loading constants:
a) pollutes the data cache
b) wastes energy
c) wastes instructions

> Also, needing to use shift-pairs to sign and zero extend things is a bit
> weak as well, ...

See cows eat above.

>

> Also, as a random annoyance, RISC-V's instruction layout is very
> difficult to decipher from a hexadecimal view. One basically needs to
> dump it in binary to make it viable to mentally parse and lookup
> instructions, which sucks.

When you consume 3/4ths of the instruction space for 16-bit instructions;
you create stress in other areas of the ISA.

Re: The Impending Return of Concertina III

<uopkj5$1g9ab$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37076&group=comp.arch#37076

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bagel99@gmail.com (Brian G. Lucas)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 18:11:17 -0600
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <uopkj5$1g9ab$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 24 Jan 2024 00:11:17 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="33266123b173f0b359df51ea79953c6e";
logging-data="1582411"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/GC2MB72W+kvCK9YmS9WhE"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.9.1
Cancel-Lock: sha1:+SuCGeu57ystwzGAiYVxmpqygi0=
In-Reply-To: <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
Content-Language: en-US
 by: Brian G. Lucas - Wed, 24 Jan 2024 00:11 UTC

On 1/23/24 16:10, MitchAlsup1 wrote:
>
> When you benchmark against a strawman, cows get to eat.

Not a farm boy I'll bet. Cows eat hay, but not straw.

brian

Re: The Impending Return of Concertina III

<uopmk8$1geee$3@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37077&group=comp.arch#37077

Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 16:45:59 -0800
Organization: A noiseless patient Spider
Lines: 12
Message-ID: <uopmk8$1geee$3@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
<uopkj5$1g9ab$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Jan 2024 00:46:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="eb65a73d5fc6fb0010bcf2681c155871";
logging-data="1587662"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Mx1EkfzJFRWzktSOqnxX/7toa2IPDMhU="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:hKDifzIPSmn8exbPq3m39244riE=
In-Reply-To: <uopkj5$1g9ab$1@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Wed, 24 Jan 2024 00:45 UTC

On 1/23/2024 4:11 PM, Brian G. Lucas wrote:
> On 1/23/24 16:10, MitchAlsup1 wrote:
>>
>> When you benchmark against a strawman, cows get to eat.
>
> Not a farm boy I'll bet.  Cows eat hay, but not straw.

https://en.wikipedia.org/wiki/Nord_and_Bert_Couldn%27t_Make_Head_or_Tail_of_It

Re: The Impending Return of Concertina III

<uoq8v4$1mnaf$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37078&group=comp.arch#37078

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!news.hispagatos.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Tue, 23 Jan 2024 23:58:56 -0600
Organization: A noiseless patient Spider
Lines: 236
Message-ID: <uoq8v4$1mnaf$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Jan 2024 05:59:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="77eb85471f1ee329bbdf40286549d717";
logging-data="1793359"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18fOdiJhi3c3CdCOAUwxRZR"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:+xXOBRL6hG9taI/u5QGvZFjlPIk=
Content-Language: en-US
In-Reply-To: <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
 by: BGB - Wed, 24 Jan 2024 05:58 UTC

On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/23/2024 6:06 AM, Robert Finch wrote:
>>>
>
>> IME, the main address modes are:
>>    (Rm, Disp)       // ~ 66%  +/- 10%
>>    (Rm, Ro*FixSc)   // ~ 33%  +/- 10%
>>      Where: FixSc matches the element size.
>> Pretty much everything else falls into the noise.
>
> With dynamically linked libraries one needs:: k is constant at link time
>
>     LD    Rd,[IP,GOT[k]]     // get a pointer to the external variable
> and
>     CALX  [IP,GOT[k]]        // call external entry point
>
> But now that you have the above you can easily get::
>
>     CALX  [IP,Ri<<3,Table]   // call indexed method
>                              // can also be used for threaded JITs
>

These are unlikely to be particularly common cases *except* if using a
GOT or similar. However, if one does not use a GOT, then this is less of
an issue.

Granted, this does mean if importing variables is supported, yes, it
will come with a penalty. It is either this or add a mechanism where one
can use an absolute addressing mode and then fix-up every instance of
the variable during program load.

Say:
MOV Abs64, R4
MOV.Q (R4), R8

Though, neither ELF nor PE/COFF has a mechanism for doing this.

Not currently a huge issue, as this would first require the ability to
import/export variables in DLLs.

>> RISC-V only has the former, but kinda shoots itself in the foot:
>>    GCC is good at eliminating most SP relative loads/stores;
>>    That means, the nominal percentage of indexed is even higher...
>
> A funny thing happens when you get rid of the "extra instructions"
> most RISC ISAs cause you to have in your instruction stream::
> a) the number of instructions goes down
> b) you get rid of the easy instructions
> c) leaving all the complicated ones remaining
>

Possibly.
RISC-V is at a stage where execution is dominated by ALU ops;
BJX2 is at a stage where it is mostly dominated by memory Load/Store.

Being Ld/St bound seems like it would be worse, but part of this is
because it isn't burning quite so many ALU instructions on things like
address calculations.

Technically, part of the role had been moved over to LEA, but the LEA
ops are a bit further down the ranking.

>> As a result, the code is basically left doing excessive amounts of
>> shifts and adds, which (vs BJX2) effectively dethrone the memory
>> load/store ops for top-place.
>
> These are the easy instructions that are not necessary when ISA is
> properly conceived.
>

Yeah.

>> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
>> also shoots itself in the foot. Because, not only has one hit the
>> limits of the ALU and LD/ST ops, there are no cheap fallbacks for
>> intermediate range constants.
>
> My 66000 has constants of all sizes for all instructions.
>

At present:
BJX2: 9u ALU and LD/ST, 10u/10s in XG2
Though, scaled 9u can give 2K / 4K for L/Q.
The Disp10s might have been better in retrospect as 10u.
RV64: 12s, unscaled for LD/ST
This gives a slight advantage for ALU ops in RV64.
BJX2:
Can load Imm17s into R0-R31 in Baseline, R0..R63 in XG2;
Can load Imm25s into R0.
RV64:
No single-op option larger than 12-bits;
LUI and AUIPC don't really count here.

RV64 can encode 32 bit constant in a 2-op sequence;
BJX2 can encode an arbitrary 33-bit immed with a 64-bit encoding, or a
64-bit constant in a 96-bit encoding.

RV64IMA has no way to encode a 64-bit constant in fewer than 6 ops.

Seems like GCC's solution to a lot of this is "yeah, just use memory
loads for everything" (though still using 2-op sequences for PC-relative
address generation).

>> If my compiler, with its arguably poor optimizer and barely functional
>> register allocation, is beating GCC for performance (when targeting
>> RISC-V), I don't really consider this a win for some of RISC-V's
>> design choices.
>
> When you benchmark against a strawman, cows get to eat.
>

Yeah.

Would probably be a somewhat different situation against a similar
clocked ARMv8 core.

Though, some people were claiming that RISC-V can match ARMv8
performance?...

I would expect ARMv8 to beat RV64 for similar reasons to how BJX2 can
beat RV64, but with ARMv8 also having the advantage of a more capable
compiler.

Then again, I can note that generally BGBCC also uses stack canaries:
On function entry, it puts a magic number on the stack;
On function return, it reads the value and makes sure it is intact,
if not intact, it triggers a breakpoint.

Well, also some boilerplate tasks:
Saving/reloading GBR, and going through a ritual to reload GBR as-needed
(say, in case the function is called from somewhere where GBR was set up
for a different program image);
Also uses an instruction that enables/disables WEX support in the CPU
based on the requested WEX profile;
....

There was also some amount of optional boilerplate (per function) to
facilitate exception unwinding (and the possibility of using try/catch
blocks). But, I am generally disabling this on the command-line
("-fnoexcept") as it is N/A for C. If enabled, every function needs this
boilerplate, or else it will not be possible to unwind through these
stack-frames on an exception.

These things eat a small amount of code-space and clock-cycles,
generally GCC doesn't seem to do any of this.

I am guessing also maybe it has some other way to infer that it doesn't
need to have exception-unwinding for plain C programs?...

>> And, if GCC in its great wisdom, is mostly loading constants from
>> memory (having apparently offloaded most of them into the ".data"
>> section), this is also not a good sign.
>
> Loading constants:
> a) pollutes the data cache
> b) wastes energy
> c) wastes instructions
>

Yes.

But, I guess it does improve code density in this case... Because the
constants are "somewhere else" and thus don't contribute to the size of
'.text'; the program just puts a few kB worth of constants into '.data'
instead...

Does make the code density slightly less impressive.

Granted, one can argue the same of prolog/epilog compression in my case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly repetitive).

>> Also, needing to use shift-pairs to sign and zero extend things is a
>> bit weak as well, ...
>
> See cows eat above.
>
>>
>
>> Also, as a random annoyance, RISC-V's instruction layout is very
>> difficult to decipher from a hexadecimal view. One basically needs to
>> dump it in binary to make it viable to mentally parse and lookup
>> instructions, which sucks.
>
> When you consume 3/4ths of the instruction space for 16-bit instructions;
> you create stress in other areas of the ISA.

BJX2 Baseline originally burned 7/8 of the encoding space for 16-bit
ops.

For XG2, this space was reclaimed, generally for:
Expand register fields to 6-bits;
Expand Disp and Imm fields;
Imm9/Disp9 -> Imm10/Disp10 (3RI)
Imm10 -> Imm12 (2RI).
Expand BRA/BSR from 20 to 23 bits.
IOW: XG2 now has +/- 8MB for branch ops.
....

Bigger difference I think for mental decoding has to do with how bits
were organized. Most things were organized around 4-bit nybbles, and
immediate fields are mostly contiguous, and still organized around 4-bit
nybbles. Result is generally that it is much easier to visually match
the opcode and extract the register fields.

With RISC-V, absent dumping the whole instruction in binary, this is
very difficult.

This was a bit painful when trying to debug Doom booting in RISC-V mode
in my Verilog core via "$display()" statements.

But, luckily, did at least eventually get it working.
So, at least to the limited extent of being able to boot directly into
Doom and similar, RISC-V mode does seem to be working...

Re: The Impending Return of Concertina III

<uoqgus$1nqs6$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37080&group=comp.arch#37080

Path: i2pn2.org!i2pn.org!news.hispagatos.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Wed, 24 Jan 2024 02:15:21 -0600
Organization: A noiseless patient Spider
Lines: 168
Message-ID: <uoqgus$1nqs6$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me> <uop9ch$1eg72$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 24 Jan 2024 08:15:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="77eb85471f1ee329bbdf40286549d717";
logging-data="1829766"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19BXiAZtqHIPbPF7T6gTafE"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:enkTE/bp6C8TQ2Dt4X1zMA6jmPk=
In-Reply-To: <uop9ch$1eg72$1@dont-email.me>
Content-Language: en-US
 by: BGB - Wed, 24 Jan 2024 08:15 UTC

On 1/23/2024 3:00 PM, Quadibloc wrote:
> On Tue, 23 Jan 2024 13:56:32 -0600, BGB wrote:
>
>> Agreed. Would not be in favor of block-headers or block structuring.
>> Linear instruction formats are preferable, preferably in 32-bit chunks.
>
> The good news is that, although Concertina III still has block structure,
> it gives you a choice. The ISA is similar to a RISC architecture, but
> with a number of added features, if you just use 32-bit instructions.
>
> On Concertina II, you need to use block structure for:
>
> - 17-bit instructions
> - Immediate constants other than 8-bit or 16-bit
> - Absolute array addresses
> - Instruction prefixes
> - Explicit indication of parallelism
> - Instruction predication
>
> On Concertina III, you need to use block structure for immediate constants other
> than 8 bit, but the 16-bit instructions and the absolute array addresses are
> available without block structure.
>
> As it stands, Concertina III doesn't have instruction predication at all, which
> is a deficiency I will need to see if I can remedy.
>

Hmm, if I were to try to design something "new", optimizing for current
thoughts / observations.

Current leaning, say (flat register space):
64x 64-bit GPRs.
R0: ZR / PC(Ld/St)
R1: LR / TP(Ld/St)
R2: SP
R3: GP
R4 ..R7: Scratch, A0..A3
R8 ..R15: Preserve
R16..R19: Scratch
R16/R17, Return Value, addr for struct return.
R18: 'this'
R19: LR2 (prolog/epilog compression)
R20..R23: Scratch, A4..A7
R24..R31: Preserve
R32..R35: Scratch
R36..R39: Scratch, A8..A11 ?
R40..R47: Preserve
R48..R51: Scratch
R52..R55: Scratch, A12..A15 ?
R56..R63: Preserve

While not perfect, this would limit changes vs my existing ABI.

This was a roadblock towards trying to add RISC-V support to BGBCC, as
the ABI is different enough that it would require significant changes to
the backend.

ZR (Zero) and LR would be visible to "Normal" ops, but not as a
base-address for Load/Store, where they would be reinterpreted as PC and
TP (Task Pointer), where TP is assumed read-only for usermode programs.

With 32-bit instruction words, say:
ppzz-zzzz-zznn-nnnn-iiii-iiii-iiii-iiii // 2RI Imm16
ppzz-zzzz-zznn-nnnn-ssss-ssii-iiii-iiii // 3RI Imm10
ppzz-zzzz-zznn-nnnn-zzzz-iiii-iiii-iiii // 2RI Imm12
ppzz-zzzz-zznn-nnnn-ssss-sstt-tttt-zzzz // 3R (? 3RI Imm6)
ppzz-zzzz-zznn-nnnn-ssss-sszz-zzzz-zzzz // 2R
ppzz-zzzz-iiii-iiii-iiii-iiii-iiii-iiii // Imm24

pp:
00: PredT
01: PredF
10: Scalar
11: WEX

There would be a 'T' status bit, which would be designated exclusively
for predication.

pp00:
pp00-zzzz-zznn-nnnn-ssss-sstt-tttt-zzzz // 3R Space

pp00-0000-zznn-nnnn-ssss-sstt-tttt-0000 // LDS{B/W/L/Q} (Rs, Rt)
pp00-0000-zznn-nnnn-ssss-sstt-tttt-0001 // LDU{B/W/L/Q} (Rs, Rt)
pp00-0000-zznn-nnnn-ssss-sstt-tttt-0010 // ST{B/W/L/Q} (Rs, Rt)
pp00-0000-zznn-nnnn-ssss-sstt-tttt-0011 // LEA{B/W/L/Q} (Rs, Rt)

Note that Rs would always be the base register for load/store, for
stores, Rn would serve as the value-source, for loads the destination.
Here, Rt would serve as an index, scaled by the access size.

pp00-0001-zznn-nnnn-ssss-sstt-tttt-000z // ALU
ADD/SUB/SHAD/SHLD/MUL/AND/OR/XOR
pp00-0001-zznn-nnnn-ssss-sstt-tttt-001z // ALU
ADDSL/SUBSL/MULSL/SHADL/ADDUL/SUBUL/MULUL/SHLDL

pp01:
pp01-zzzz-zznn-nnnn-ssss-ssii-iiii-iiii // LD/ST Disp10
(Will likely assume scaled zero-extended LD/ST displacements)

Could maybe provide Imm6n LD/ST ops for a limited range of negative
displacements, but negative displacements are rarely used in general
(and typically much smaller than positive displacements).

pp10:
pp10-zzzz-zznn-nnnn-ssss-ssii-iiii-iiii // ALU Imm10
(Most ALU ops will have zero-extended immediate values)

pp10-000z-zznn-nnnn-ssss-ssii-iiii-iiii // ALU Rs, Imm10, Rn
ADD/SUB/SHAD/SHLD/MUL/AND/OR/XOR

pp10-001z-zznn-nnnn-ssss-ssii-iiii-iiii // ALU Rs, Imm10, Rn
ADDSL/SUBSL/MULSL/-/ADDUL/SUBUL/MULUL/-

pp11-0zzz:
pp11-0zzz-zznn-nnnn-zzzz-iiii-iiii-iiii // 2RI Imm12
(Some ops may be effectively Imm13s)

pp11-10zz:
pp11-10zz-zznn-nnnn-ssss-sszz-zzzz-zzzz // 2R

pp11-110z:
pp11-1100-00nn-nnnn-iiii-iiii-iiii-iiii // LI Imm16u, Rn //0..65535
pp11-1100-01nn-nnnn-iiii-iiii-iiii-iiii // LI Imm16n, Rn //-65536..-1
pp11-1100-10nn-nnnn-iiii-iiii-iiii-iiii // ADD Imm16u, Rn
pp11-1100-11nn-nnnn-iiii-iiii-iiii-iiii // ADD Imm16n, Rn
pp11-1101-00nn-nnnn-iiii-iiii-iiii-iiii // FLDCH Imm16u, Rn //Fp16
pp11-1101-01nn-nnnn-iiii-iiii-iiii-iiii // ? LEA.Q (GP, Imm16u), Rn
pp11-1101-10nn-nnnn-iiii-iiii-iiii-iiii // -
pp11-1101-11nn-nnnn-iiii-iiii-iiii-iiii // -

pp11-111z:
0011-1110-iiii-iiii-iiii-iiii-iiii-iiii // BT Disp24
0111-1110-iiii-iiii-iiii-iiii-iiii-iiii // BF Disp24
1011-1110-iiii-iiii-iiii-iiii-iiii-iiii // BRA Disp24
1111-1110-iiii-iiii-iiii-iiii-iiii-iiii // Jumbo-Imm

0011-1111-iiii-iiii-iiii-iiii-iiii-iiii // -
0111-1111-iiii-iiii-iiii-iiii-iiii-iiii // -
1011-1111-iiii-iiii-iiii-iiii-iiii-iiii // BSR Disp24
1111-1111-iiii-iiii-iiii-iiii-iiii-iiii // Jumbo-Op

This would sacrifice a few cases that exist in BJX2, but had mostly
fallen into disuse as a consequence of the existence of Jumbo prefixes.

Here, one can assume that Jumbo prefixes will exist, so the relative
loss of not having a dedicated "load 24 bits into a fixed register" case
is less.

Would also assume jumbo prefixes will deal with things like loading
function pointer addresses, etc.

....

> John Savard

Re: The Impending Return of Concertina III

<io9sN.327978$p%Mb.91865@fx15.iad>

https://news.novabbs.org/devel/article-flat.php?id=37083&group=comp.arch#37083

Path: i2pn2.org!rocksolid2!news.neodome.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx15.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: The Impending Return of Concertina III
Newsgroups: comp.arch
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me> <uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me> <uop5li$1du2c$1@dont-email.me> <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org> <uopkj5$1g9ab$1@dont-email.me> <uopmk8$1geee$3@dont-email.me>
Lines: 13
Message-ID: <io9sN.327978$p%Mb.91865@fx15.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Wed, 24 Jan 2024 14:45:34 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 24 Jan 2024 14:45:34 GMT
X-Received-Bytes: 1324
 by: Scott Lurndal - Wed, 24 Jan 2024 14:45 UTC

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>On 1/23/2024 4:11 PM, Brian G. Lucas wrote:
>> On 1/23/24 16:10, MitchAlsup1 wrote:
>>>
>>> When you benchmark against a strawman, cows get to eat.
>>
>> Not a farm boy I'll bet.  Cows eat hay, but not straw.
>
>https://en.wikipedia.org/wiki/Nord_and_Bert_Couldn%27t_Make_Head_or_Tail_of_It

Although a strawman can be made from hay or leaves and twigs, or any
other stuffing, straw, as a waste product from grain production,
is traditional.

Re: The Impending Return of Concertina III

<f69eaadf31222abccef981153e67479b@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37090&group=comp.arch#37090

Date: Wed, 24 Jan 2024 20:23:56 +0000
Subject: Re: The Impending Return of Concertina III
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$pOPjmDVStLZtNZSkfqZ2iOXwkkEIcH1UFtwwFu3DLKufyxqqRuSZO
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me> <uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me> <uop5li$1du2c$1@dont-email.me> <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org> <uoq8v4$1mnaf$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <f69eaadf31222abccef981153e67479b@www.novabbs.org>
 by: MitchAlsup1 - Wed, 24 Jan 2024 20:23 UTC

BGB wrote:

> On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
>>
>>> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
>>> also shoots itself in the foot. Because, not only has one hit the
>>> limits of the ALU and LD/ST ops, there are no cheap fallbacks for
>>> intermediate range constants.
>>
>> My 66000 has constants of all sizes for all instructions.
>>
------------------------
>>> And, if GCC in its great wisdom, is mostly loading constants from
>>> memory (having apparently offloaded most of them into the ".data"
>>> section), this is also not a good sign.
>>
>> Loading constants:
>> a) pollutes the data cache
>> b) wastes energy
>> c) wastes instructions
>>

> Yes.

> But, I guess it does improve code density in this case... Because the
> constants are "somewhere else" and thus don't contribute to the size of
> '.text'; the program just puts a few kB worth of constants into '.data'
> instead...

Consider the store of a constant to a constant address::

array[7] = bigFPconstant;

RISC-V
.text
auipc Ra,high(&bigFPconstant)
ldd Rd,[Ra+low(&bigFPconstant)]
auipc Ra,high(&array+56)
std Rd,[Ra+low(&array+56)]
.data
double bigFPconstant

4 instructions 6 words of memory 2 registers

My 66000:
STD #bigFPconstant,[IP,,&array+56]

1 instruction 4 words of memory all in .text 0 registers

Also note: RISC-V has no real way to support 64-bit displacements other
than resorting to LDs of pointers (ala GOT and similar).

> Does make the code density slightly less impressive.

> Granted, one can argue the same of prolog/epilog compression in my case:
> Save some space on prolog/epilog by calling or branching to prior
> versions (since the code to save and restore GPRs is fairly repetitive).

ENTER and EXIT eliminate the additional control transfers and can allow
FETCH of the return address to start before the restores are finished.

Re: The Impending Return of Concertina III

<uot19f$27cov$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37102&group=comp.arch#37102

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Thu, 25 Jan 2024 01:06:19 -0600
Organization: A noiseless patient Spider
Lines: 163
Message-ID: <uot19f$27cov$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
<uoq8v4$1mnaf$1@dont-email.me>
<f69eaadf31222abccef981153e67479b@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 25 Jan 2024 07:06:23 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="24084832507996451ef1c94cc93bb096";
logging-data="2339615"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ZfGmOWYji7eBNf3yqQGmw"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:4EU0RKB/HM1aR6gt80HHWKZ1At4=
In-Reply-To: <f69eaadf31222abccef981153e67479b@www.novabbs.org>
Content-Language: en-US
 by: BGB - Thu, 25 Jan 2024 07:06 UTC

On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
>>>
>>>> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
>>>> also shoots itself in the foot. Because, not only has one hit the
>>>> limits of the ALU and LD/ST ops, there are no cheap fallbacks for
>>>> intermediate range constants.
>>>
>>> My 66000 has constants of all sizes for all instructions.
>>>
> ------------------------
>>>> And, if GCC in its great wisdom, is mostly loading constants from
>>>> memory (having apparently offloaded most of them into the ".data"
>>>> section), this is also not a good sign.
>>>
>>> Loading constants:
>>> a) pollutes the data cache
>>> b) wastes energy
>>> c) wastes instructions
>>>
>
>> Yes.
>
>> But, I guess it does improve code density in this case... Because the
>> constants are "somewhere else" and thus don't contribute to the size
>> of '.text'; the program just puts a few kB worth of constants into
>> '.data' instead...
>
> Consider the store of a constant to a constant address::
>
>     array[7] = bigFPconstant;
>
> RISC-V
> .text
>     auipc     Ra,high(&bigFPconstant)
>     ldd       Rd,[Ra+low(&bigFPconstant)]
>     auipc     Ra,high(&array+48)
>     std       Rd,[Ra+low(&array+48)]
> .data
>     double    bigFPconstant
>
> 4 instructions 6 words of memory 2 registers
>
> My 66000:
>     STD       #bigFPconstant,[IP,,&array+48]
>
> 1 instruction 4 words of memory all in .text 0 registers
>

This scenario would be two instructions in my case.

I suspect the situation isn't quite *that* bad for RISC-V, mostly
because from the instruction dumps, it looks like it lumps constants
together into tables and then loads them from the table, able to use a
shared base register (and, in some cases, GP).

Say:
GP is initialized to 2K past the start of '.data';
Seems to cluster common constants at negative addresses relative to GP,
common local variables at positive addresses.

Then seemingly falls back to AUIPC+LD/ST outside of +/- 2K, with other
constants being held in tables (maybe GOT, hard to really tell from
disassembly, or looking at the back-track in machine-code).

Or, at least, this is how it seemed to work when debugging stuff.

But, yeah, looks like, besides adding indexed load/store to a
"wishlist", something like a 17-bit constant load would also be a high
priority.

From my own possible extension list:

* 00110ss-ooooo-mmmmm-ttt-nnnnn-01-01111 Lt Rn, (Rm, Ro)
** 00110ss-ttttt-mmmmm-000-nnnnn-01-01111 ? LB Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-001-nnnnn-01-01111 ? LH Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-010-nnnnn-01-01111 ? LW Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-011-nnnnn-01-01111 ? LD Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-100-nnnnn-01-01111 ? LBU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-101-nnnnn-01-01111 ? LHU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-110-nnnnn-01-01111 ? LWU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-111-nnnnn-01-01111 ? LX Rn, (Rm, Rt*Sc)

* 00111ss-ooooo-mmmmm-ttt-nnnnn-01-01111 St (Rm, Ro), Rn
** 00111ss-ttttt-mmmmm-000-nnnnn-01-01111 ? SB (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-001-nnnnn-01-01111 ? SH (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-010-nnnnn-01-01111 ? SW (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-011-nnnnn-01-01111 ? SD (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-100-nnnnn-01-01111 ? SBU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-101-nnnnn-01-01111 ? SHU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-110-nnnnn-01-01111 ? SWU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-111-nnnnn-01-01111 ? SX (Rm, Rt*Sc), Rn

Shoved into a hole in the AMO space.

* iiiiiii-iiiii-iiiii-111-nnnnn-00-11011 ? LI Rn, Imm17s
// In the space ANDIW would have existed in, if it existed.

Granted, the harder part here is confirming which encodings may or may
not be in use, as there doesn't seem to be any public opcode list or
registry.

> Also note: RISC-V has no real way to support 64-bit displacements other
> than resorting to LDs of pointers (ala GOT and similar).
>

Yeah.
It has ended up in a situation where GOT is seemingly the "best" option.

>> Does make the code density slightly less impressive.
>
>> Granted, one can argue the same of prolog/epilog compression in my case:
>> Save some space on prolog/epilog by calling or branching to prior
>> versions (since the code to save and restore GPRs is fairly repetitive).
>
> ENTER and EXIT eliminate the additional control transfers and can allow
> FETCH of the return address to start before the restores are finished.

Possible, but branches are cheaper to implement in hardware, and would
have been implemented already...

Granted, it is a similar thing to the recent addition of a memcpy()
slide for intermediate-sized memcpy.

Where, if one expresses the slide in reverse order, copying any multiple
of N bytes can be expressed as a branch into the slide (with less
overhead than a loop).

But, I guess in theory, the memcpy slide could be implemented in plain C
with a switch.
   uint64_t *dst, *src;
   uint64_t li0, li1, li2, li3;
   ... copy final bytes ...
   switch(sz>>5)
   {
     ...
     case 2:
       li0=src[4]; li1=src[5];
       li2=src[6]; li3=src[7];
       dst[4]=li0; dst[5]=li1;
       dst[6]=li2; dst[7]=li3;
     case 1:
       li0=src[0]; li1=src[1];
       li2=src[2]; li3=src[3];
       dst[0]=li0; dst[1]=li1;
       dst[2]=li2; dst[3]=li3;
     case 0:
       break;
   }

Like, in theory one could have a special hardware feature, but a plain
software solution is reasonably effective.
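
For reference, a compilable version of the switch-based slide might look roughly like the sketch below. It is cut down to three 32-byte blocks (SLIDE_MAX_BLOCKS is a made-up limit; the slide described above is much longer), and the names copy_block32 and copy_slide are illustrative, not taken from any actual runtime. The 8-byte memcpy() calls stand in for the 64-bit loads and stores so that the code stays valid C for unaligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLIDE_MAX_BLOCKS 3  /* hypothetical; a real slide would be longer */

/* Copy one 32-byte block: four 64-bit loads, then four 64-bit stores.
 * An 8-byte memcpy() is used for each access so unaligned pointers are
 * fine; compilers typically reduce each one to a single load or store. */
static void copy_block32(unsigned char *d, const unsigned char *s)
{
    uint64_t li0, li1, li2, li3;
    memcpy(&li0, s, 8);      memcpy(&li1, s + 8, 8);
    memcpy(&li2, s + 16, 8); memcpy(&li3, s + 24, 8);
    memcpy(d, &li0, 8);      memcpy(d + 8, &li1, 8);
    memcpy(d + 16, &li2, 8); memcpy(d + 24, &li3, 8);
}

/* Copy sz bytes, sz < 32 * (SLIDE_MAX_BLOCKS + 1), without a loop.
 * The buffers must not overlap. */
void copy_slide(void *dstv, const void *srcv, size_t sz)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;
    size_t blocks = sz >> 5;

    assert(blocks <= SLIDE_MAX_BLOCKS);

    /* Copy the final (sz % 32) bytes, which sit past the whole blocks. */
    memcpy(dst + (blocks << 5), src + (blocks << 5), sz & 31);

    /* Enter the slide at the highest block; each case falls through. */
    switch (blocks) {
    case 3: copy_block32(dst + 64, src + 64); /* fall through */
    case 2: copy_block32(dst + 32, src + 32); /* fall through */
    case 1: copy_block32(dst, src);           /* fall through */
    case 0: break;
    }
}
```

Because the blocks are listed highest first, the switch jumps to the entry for the block count and runs straight down to the end, which is the "branch into the slide" idea expressed in C.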

Re: The Impending Return of Concertina III

<a7a7d8b33307ee9f5be799eb138caa1d@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37119&group=comp.arch#37119

Date: Thu, 25 Jan 2024 17:18:08 +0000
Subject: Re: The Impending Return of Concertina III
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$b8ACbeyk8NaeWgs4RyWPCu.lxk/CTpLIMyk0eY6LeR1JbWTeTxAOi
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me> <uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me> <uop5li$1du2c$1@dont-email.me> <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org> <uoq8v4$1mnaf$1@dont-email.me> <f69eaadf31222abccef981153e67479b@www.novabbs.org> <uot19f$27cov$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <a7a7d8b33307ee9f5be799eb138caa1d@www.novabbs.org>
 by: MitchAlsup1 - Thu, 25 Jan 2024 17:18 UTC

BGB wrote:

> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> Granted, one can argue the same of prolog/epilog compression in my case:
>>> Save some space on prolog/epilog by calling or branching to prior
>>> versions (since the code to save and restore GPRs is fairly repetitive).
>>
>> ENTER and EXIT eliminate the additional control transfers and can allow
>> FETCH of the return address to start before the restores are finished.

> Possible, but branches are cheaper to implement in hardware, and would
> have been implemented already...

Are you intentionally misreading what I wrote ??

There is a se

> Granted, it is a similar thing to the recent addition of a memcpy()
> slide for intermediate-sized memcpy.

> Where, if one expresses the slide in reverse order, copying any multiple
> of N bytes can be expressed as a branch into the slide (with less
> overhead than a loop).

> But, I guess in theory, the memcpy slide could be implemented in plain C
> with a switch.
>    uint64_t *dst, *src;
>    uint64_t li0, li1, li2, li3;
>    ... copy final bytes ...
>    switch(sz>>5)
>    {
>      ...
>      case 2:
>        li0=src[4]; li1=src[5];
>        li2=src[6]; li3=src[7];
>        dst[4]=li0; dst[5]=li1;
>        dst[6]=li2; dst[7]=li3;
>      case 1:
>        li0=src[0]; li1=src[1];
>        li2=src[2]; li3=src[3];
>        dst[0]=li0; dst[1]=li1;
>        dst[2]=li2; dst[3]=li3;
>      case 0:
>        break;
>    }

> Like, in theory one could have a special hardware feature, but a plain
> software solution is reasonably effective.

Re: The Impending Return of Concertina III

<22994cf4121cbd9f56183ee3fe85888e@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37120&group=comp.arch#37120

Date: Thu, 25 Jan 2024 17:26:50 +0000
Subject: Re: The Impending Return of Concertina III
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$0oQisXnnOQsva.6roNMHEORJ.g5ROP29Nk0jkOzO89THkWpf2AwSO
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me> <uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me> <uop5li$1du2c$1@dont-email.me> <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org> <uoq8v4$1mnaf$1@dont-email.me> <f69eaadf31222abccef981153e67479b@www.novabbs.org> <uot19f$27cov$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <22994cf4121cbd9f56183ee3fe85888e@www.novabbs.org>
 by: MitchAlsup1 - Thu, 25 Jan 2024 17:26 UTC

BGB wrote:

> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> Granted, one can argue the same of prolog/epilog compression in my case:
>>> Save some space on prolog/epilog by calling or branching to prior
>>> versions (since the code to save and restore GPRs is fairly repetitive).
>>
>> ENTER and EXIT eliminate the additional control transfers and can allow
>> FETCH of the return address to start before the restores are finished.

> Possible, but branches are cheaper to implement in hardware, and would
> have been implemented already...

Are you intentionally misreading what I wrote ??

Epilogue is a sequence of loads leading to a jump to the return address.

Your ISA cannot jump to the return address while performing the loads
so FETCH does not get the return address and can't start fetching
instructions until the jump is performed.

Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
the return address from the stack and fetch the instructions at the
return address while still loading the preserved registers (that were
saved) so that the instructions are ready for execution by the time
the last LD is performed.

In addition, if one is performing an EXIT and fetch runs into a CALL;
it can fetch the Called address and if there is an ENTER instruction
there, it can cancel the remainder of EXIT and cancel some of ENTER
because the preserved registers are already on the stack where they are
supposed to be.

Doing these with STs and LDs cannot save those cycles.

> Granted, it is a similar thing to the recent addition of a memcpy()
> slide for intermediate-sized memcpy.

> Where, if one expresses the slide in reverse order, copying any multiple
> of N bytes can be expressed as a branch into the slide (with less
> overhead than a loop).

> But, I guess in theory, the memcpy slide could be implemented in plain C
> with a switch.
>    uint64_t *dst, *src;
>    uint64_t li0, li1, li2, li3;
>    ... copy final bytes ...
>    switch(sz>>5)
>    {
>      ...
>      case 2:
>        li0=src[4]; li1=src[5];
>        li2=src[6]; li3=src[7];
>        dst[4]=li0; dst[5]=li1;
>        dst[6]=li2; dst[7]=li3;
>      case 1:
>        li0=src[0]; li1=src[1];
>        li2=src[2]; li3=src[3];
>        dst[0]=li0; dst[1]=li1;
>        dst[2]=li2; dst[3]=li3;
>      case 0:
>        break;
>    }

Looks like Duff's device.

But why not just::

MM Rto,Rfrom,Rcount

> Like, in theory one could have a special hardware feature, but a plain
> software solution is reasonably effective.

Re: The Impending Return of Concertina III

<uoubqf$2e3oa$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37122&group=comp.arch#37122

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Thu, 25 Jan 2024 13:12:11 -0600
Organization: A noiseless patient Spider
Lines: 154
Message-ID: <uoubqf$2e3oa$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
<uoq8v4$1mnaf$1@dont-email.me>
<f69eaadf31222abccef981153e67479b@www.novabbs.org>
<uot19f$27cov$1@dont-email.me>
<22994cf4121cbd9f56183ee3fe85888e@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 25 Jan 2024 19:12:15 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="24084832507996451ef1c94cc93bb096";
logging-data="2559754"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Exbnl6mQyA4tIQ3apX4an"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:rxEOZTqIwPNc7XVrAd7nVwQnXAU=
In-Reply-To: <22994cf4121cbd9f56183ee3fe85888e@www.novabbs.org>
Content-Language: en-US
 by: BGB - Thu, 25 Jan 2024 19:12 UTC

On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> Granted, one can argue the same of prolog/epilog compression in my
>>>> case:
>>>> Save some space on prolog/epilog by calling or branching to prior
>>>> versions (since the code to save and restore GPRs is fairly
>>>> repetitive).
>>>
>>> ENTER and EXIT eliminate the additional control transfers and can allow
>>> FETCH of the return address to start before the restores are finished.
>
>> Possible, but branches are cheaper to implement in hardware, and would
>> have been implemented already...
>
> Are you intentionally misreading what I wrote ??
>

?? I don't understand.

> Epilogue is a sequence of loads leading to a jump to the return address.
>
> Your ISA cannot jump to the return address while performing the loads
> so FETCH does not get the return address and can't start fetching
> instructions until the jump is performed.
>

You can put the load for the return address before the other loads.
Then, if the epilog is long enough (so that this load is no longer in
flight once it hits the final jump), the branch predictor will lead
it to start loading the post-return instructions before the jump is reached.

This is likely a non-issue as I see it.

It is only really an issue if one demands that reloading the return
address be done as one of the final instructions in the epilog, and not
one of the first instructions.

Granted, one would have to do it as one of the final ops, if it were
implemented as a slide, but it is not. There are "practical reasons" why
a slide would not be a workable strategy in this case.

So, generally, these parts of the prolog/epilog sequences are emitted
for every combination of saved/restored registers that had been encountered.

Though, granted, when used, it does mean that any such function needs to
effectively do two sets of stack-pointer adjustments:
One set for the save/restore area (in the reused part);
One set for the function (for its data and local/temporary variables
and similar).

> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
> the return address from the stack and fetch the instructions at the
> return address while still loading the preserved registers (that were
> saved) so that the instructions are ready for execution by the time
> the last LD is performed.
>
> In addition, If one is performing an EXIT and fetch runs into a CALL;
> it can fetch the Called address and if there is an ENTER instruction
> there, it can cancel the remainder of EXIT and cancel some of ENTER
> because the preserved registers are already on the stack where they are
> supposed to be.
>
> Doing these with STs and LDs cannot save those cycles.
>

I don't see why not, the branch-predictor can still do its thing
regardless of whether or not LD/ST ops were used.

And, having the instructions in the pipeline a few cycles earlier will
buy nothing if they still can't execute until after the data is reloaded.

Similarly, one can't go much wider than the existing 128-bit loads/stores
absent adding more register ports, so...

The main thing something like ENTER/EXIT could save would be some code
space.

>> Granted, it is a similar thing to the recent addition of a memcpy()
>> slide for intermediate-sized memcpy.
>
>> Where, if one expresses the slide in reverse order, copying any
>> multiple of N bytes can be expressed as a branch into the slide (with
>> less overhead than a loop).
>
>
>> But, I guess in theory, the memcpy slide could be implemented in plain
>> C with a switch.
>>    uint64_t *dst, *src;
>>    uint64_t li0, li1, li2, li3;
>>    ... copy final bytes ...
>>    switch(sz>>5)
>>    {
>>      ...
>>      case 2:
>>        li0=src[4]; li1=src[5];
>>        li2=src[6]; li3=src[7];
>>        dst[4]=li0; dst[5]=li1;
>>        dst[6]=li2; dst[7]=li3;
>>      case 1:
>>        li0=src[0]; li1=src[1];
>>        li2=src[2]; li3=src[3];
>>        dst[0]=li0; dst[1]=li1;
>>        dst[2]=li2; dst[3]=li3;
>>      case 0:
>>        break;
>>    }
>
> Looks like Duff's device.
>

Kinda, but generally without the loop and egregious abuse of C syntax.

Would get kinda bulky to express 1K or so worth of memory copy as a big
"switch()" block.

For anything past a certain size limit, will need to use a loop though.

> But why not just::
>
>       MM    Rto,Rfrom,Rcount
>

Would need special hardware support for this (namely, hardware to fake a
series of loads/stores in the pipeline).

Potentially burning a few K of code-space for a big copy-slide is at
least a reasonable tradeoff in that no special hardware facilities are
needed.

Partly, as there needs to be two sets of copy-slides:
One that deals with aligned copy;
One that can deal with unaligned copy.

Though, these are generally not used for size-optimized binaries, since
there size is the priority (and always using a loop-based generic copy is
smaller).

>> Like, in theory one could have a special hardware feature, but a plain
>> software solution is reasonably effective.

Re: The Impending Return of Concertina III

<uoucbq$2e5kj$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37124&group=comp.arch#37124

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Thu, 25 Jan 2024 11:21:30 -0800
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <uoucbq$2e5kj$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
<uopkj5$1g9ab$1@dont-email.me> <uopmk8$1geee$3@dont-email.me>
<io9sN.327978$p%Mb.91865@fx15.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 25 Jan 2024 19:21:31 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7f23c2054aa5a15fad4661d44b257fda";
logging-data="2561683"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+XA4BQRVYQnalZ9qud7Vhtg+QdezzDOds="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ENUsFFQ9OwqXLfga+6iFFRkQSHA=
Content-Language: en-US
In-Reply-To: <io9sN.327978$p%Mb.91865@fx15.iad>
 by: Chris M. Thomasson - Thu, 25 Jan 2024 19:21 UTC

On 1/24/2024 6:45 AM, Scott Lurndal wrote:
> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>> On 1/23/2024 4:11 PM, Brian G. Lucas wrote:
>>> On 1/23/24 16:10, MitchAlsup1 wrote:
>>>>
>>>> When you benchmark against a strawman, cows get to eat.
>>>
>>> Not a farm boy I'll bet.  Cows eat hay, but not straw.
>>
>> https://en.wikipedia.org/wiki/Nord_and_Bert_Couldn%27t_Make_Head_or_Tail_of_It
>
> Although a strawman can be made from hay or leaves and twigs, or any
> other stuffing, straw, as a waste product from grain production,
> is traditional.

Indeed. Fwiw, I will never forget when I overheard a farmer talk about
how some of his fence lines were infested with Jimsonweed.

Re: The Impending Return of Concertina III

<a7b6bc9ec5550edc6bc4d7d120ea616e@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37125&group=comp.arch#37125

Date: Thu, 25 Jan 2024 21:25:26 +0000
Subject: Re: The Impending Return of Concertina III
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$gyaVdBS567xlbEob8wMEeuwng.UtQcJH/eU2r9BTNjMs887Z3hpUK
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me> <uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me> <uop5li$1du2c$1@dont-email.me> <dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org> <uoq8v4$1mnaf$1@dont-email.me> <f69eaadf31222abccef981153e67479b@www.novabbs.org> <uot19f$27cov$1@dont-email.me> <22994cf4121cbd9f56183ee3fe85888e@www.novabbs.org> <uoubqf$2e3oa$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <a7b6bc9ec5550edc6bc4d7d120ea616e@www.novabbs.org>
 by: MitchAlsup1 - Thu, 25 Jan 2024 21:25 UTC

BGB wrote:

> On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>>> BGB wrote:
>>>>
>>>>> Granted, one can argue the same of prolog/epilog compression in my
>>>>> case:
>>>>> Save some space on prolog/epilog by calling or branching to prior
>>>>> versions (since the code to save and restore GPRs is fairly
>>>>> repetitive).
>>>>
>>>> ENTER and EXIT eliminate the additional control transfers and can allow
>>>> FETCH of the return address to start before the restores are finished.
>>
>>> Possible, but branches are cheaper to implement in hardware, and would
>>> have been implemented already...
>>
>> Are you intentionally misreading what I wrote ??
>>

> ?? I don't understand.

>> Epilogue is a sequence of loads leading to a jump to the return address.
>>
>> Your ISA cannot jump to the return address while performing the loads
>> so FETCH does not get the return address and can't start fetching
>> instructions until the jump is performed.
>>

> You can put the load for the return address before the other loads.
> Then, if the epilog is long enough (so that this load is no-longer in
> flight once it hits the final jump), the branch-predictor will lead
> it to start loading the post-return instructions before the jump is reached.

Yes, you can read RA early.
What you cannot do is JMP early so the FETCH stage fetches instructions
at return address early.
{{If you JMP early, then the rest of the LDs won't happen}}

> This is likely a non-issue as I see it.

> It is only really an issue if one demands that reloading the return
> address be done as one of the final instructions in the epilog, and not
> one of the first instructions.

I make no such demand--I merely demand the JMP RA is the last instruction.

> Granted, one would have to do it as one of the final ops, if it were
> implemented as a slide, but it is not. There are "practical reasons" why
> a slide would not be a workable strategy in this case.

> So, generally, these parts of the prolog/epilog sequences are emitted
> for every combination of saved/restored registers that had been encountered.

> Though, granted, when used, does mean that any such function needs to
> effectively do two sets of stack-pointer adjustments:
> One set for the save/restore area (in the reused part);
> One part for the function (for its data and local/temporary variables
> and similar).

>> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
>> the return address from the stack and fetch the instructions at the
>> return address while still loading the preserved registers (that were
>> saved) so that the instructions are ready for execution by the time
>> the last LD is performed.
>>
>> In addition, If one is performing an EXIT and fetch runs into a CALL;
>> it can fetch the Called address and if there is an ENTER instruction
>> there, it can cancel the remainder of EXIT and cancel some of ENTER
>> because the preserved registers are already on the stack where they are
>> supposed to be.
>>
>> Doing these with STs and LDs cannot save those cycles.
>>

> I don't see why not, the branch-predictor can still do its thing
> regardless of whether or not LD/ST ops were used.

Consider::

main:
     ...
     CALL   funct1
     CALL   funct2

funct2:
     SUB    SP,SP,stackArea2
     ST     R0,[SP,offset20]
     ST     R30,[SP,offset230]
     ST     R29,[SP,offset229]
     ST     R28,[SP,offset228]
     ST     R27,[SP,offset227]
     ST     R26,[SP,offset226]
     ST     R25,[SP,offset225]
     ...

funct1:
     ...
     LD     R0,[SP,offset10]
     LD     R30,[SP,offset130]
     LD     R29,[SP,offset129]
     LD     R28,[SP,offset128]
     LD     R27,[SP,offset127]
     LD     R26,[SP,offset126]
     LD     R25,[SP,offset125]
     LD     R24,[SP,offset124]
     LD     R23,[SP,offset123]
     LD     R22,[SP,offset122]
     LD     R21,[SP,offset121]
     ADD    SP,SP,stackArea1
     JMP    R0

The above would have to observe that all offset1's are equal to all
offset2's in order to short circuit the data movements. A single::

     LD     R26,[SP,someotheroffset]

ruins the short circuit.

Whereas:

funct2:
     ENTER   R25,R0,stackArea2
     ...

funct1:
     ...
     EXIT    R21,R0,stackArea1

will have registers R0,R25..R30 in the same positions on the stack
guaranteed by ISA definition!!
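
To make the counting concrete, here is a toy C model of the argument. It is only a sketch: the slot layout (R0 on top, then R30 downward) is a placeholder, since the actual My 66000 save-area layout is not spelled out in this thread, and the function names are invented. The one property it relies on is the one claimed above, that a register's slot depends only on its register number.

```c
#include <assert.h>

/* Hypothetical fixed layout: R0 in slot 0 at the top of the save area,
 * then R30, R29, ... downward. The slot depends only on the register
 * number, never on which other registers are saved. */
unsigned slot_of(unsigned reg)
{
    assert(reg <= 30);
    return reg == 0 ? 0u : 31u - reg;
}

/* Size of the set {R0, Rfirst..R30}, the shape used in the example. */
unsigned saved_count(unsigned first)
{
    assert(first >= 1 && first <= 30);
    return (30u - first + 1u) + 1u;
}

/* Registers common to an EXIT starting at exit_first and an ENTER
 * starting at enter_first. With fixed slots these are the ones whose
 * stack copies are already in place when ENTER follows EXIT. */
unsigned shared_count(unsigned exit_first, unsigned enter_first)
{
    unsigned lo = exit_first > enter_first ? exit_first : enter_first;
    return saved_count(lo);
}
```

For the example, shared_count(21, 25) is 7 (R25..R30 plus R0), out of the 11 registers the EXIT covers. With separate LDs and STs the same overlap only exists if every offset happens to line up.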

Re: The Impending Return of Concertina III

<uov49d$2hovf$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37126&group=comp.arch#37126

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Thu, 25 Jan 2024 20:09:46 -0600
Organization: A noiseless patient Spider
Lines: 203
Message-ID: <uov49d$2hovf$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
<uoq8v4$1mnaf$1@dont-email.me>
<f69eaadf31222abccef981153e67479b@www.novabbs.org>
<uot19f$27cov$1@dont-email.me>
<22994cf4121cbd9f56183ee3fe85888e@www.novabbs.org>
<uoubqf$2e3oa$1@dont-email.me>
<a7b6bc9ec5550edc6bc4d7d120ea616e@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 26 Jan 2024 02:09:49 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a965100c7548781ec1acdc21732a3958";
logging-data="2679791"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/3QlXIEk6m0U2OctDiY7qa"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:VFMxD7EtkOmrfX9wI+3Gbr0AZb0=
Content-Language: en-US
In-Reply-To: <a7b6bc9ec5550edc6bc4d7d120ea616e@www.novabbs.org>
 by: BGB - Fri, 26 Jan 2024 02:09 UTC

On 1/25/2024 3:25 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>>>> BGB wrote:
>>>>>
>>>>>> Granted, one can argue the same of prolog/epilog compression in my
>>>>>> case:
>>>>>> Save some space on prolog/epilog by calling or branching to prior
>>>>>> versions (since the code to save and restore GPRs is fairly
>>>>>> repetitive).
>>>>>
>>>>> ENTER and EXIT eliminate the additional control transfers and can
>>>>> allow
>>>>> FETCH of the return address to start before the restores are finished.
>>>
>>>> Possible, but branches are cheaper to implement in hardware, and
>>>> would have been implemented already...
>>>
>>> Are you intentionally misreading what I wrote ??
>>>
>
>> ?? I don't understand.
>
>
>
>>> Epilogue is a sequence of loads leading to a jump to the return address.
>>>
>>> Your ISA cannot jump to the return address while performing the loads
>>> so FETCH does not get the return address and can't start fetching
>>> instructions until the jump is performed.
>>>
>
>> You can put the load for the return address before the other loads.
>> Then, if the epilog is long enough (so that this load is no-longer in
>> flight once it hits the final jump), the branch-predictor will lead
>> it to start loading the post-return instructions before the jump is reached.
>
> Yes, you can read RA early.
> What you cannot do is JMP early so the FETCH stage fetches instructions
> at return address early.
> {{If you JMP early, then the rest of the LDs won't happen}}
>
>> This is likely a non-issue as I see it.
>
>> It is only really an issue if one demands that reloading the return
>> address be done as one of the final instructions in the epilog, and
>> not one of the first instructions.
>
> I make no such demand--I merely demand the JMP RA is the last instruction.
>

In my case, both LR and R1 are forwarded to the branch-predictor via
side-channels, so the values are visible as soon as they cross the WB stage.

Once this happens, they can be predicted in the same way as normal
constant-displacement branches (IOW: it can see through the "RTS" or
"JMP R1" instruction).

This is N/A if using a different register.
In RV64 Mode, LR is mapped to X1 and R1/DHR to X5.
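
The timing condition can be written down as a toy C check. This is a sketch under simplifying assumptions that are not from the actual core: one instruction issues per cycle, and the load has a single fixed latency until its value has crossed WB. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Returns 1 if the return-address load has crossed write-back by the
 * time the return instruction is seen, so the target is visible to the
 * predictor. Indices are positions in the epilog; one instruction per
 * cycle is assumed, and load_latency is a made-up fixed cycle count. */
int return_target_visible(unsigned lr_load_index,
                          unsigned return_index,
                          unsigned load_latency)
{
    assert(lr_load_index < return_index);
    return (return_index - lr_load_index) >= load_latency;
}
```

With the return at index 7 and a 3-cycle latency, reloading LR at index 0 gives a visible target, while reloading it at index 6 does not. That is the sense in which the reload wants to be among the first epilog instructions rather than the last.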

So, theoretically the same optimization can be used for RV64, though at
the moment, the branch predictor doesn't yet match RV instructions.

Note that this does not affect performance estimates via my emulator,
which had assumed the RV branches would be branch predicted (though, in
the Verilog core, at present the actual RV code will run slower than the
emulator predicts...).

As I look at Doom running in the Verilog simulation and can observe that
for RISC-V at the moment it is running at roughly 8-11 fps...
Well, and with a lot of sprites going on, 5 fps.

So, I have RV64 running in the Verilog simulation, but it appears to be
performing a bit worse than my emulator predicts.

TBD how much has to do with a current lack of RV support in the branch
predictor.

>> Granted, one would have to do it as one of the final ops, if it were
>> implemented as a slide, but it is not. There are "practical reasons"
>> why a slide would not be a workable strategy in this case.
>
>> So, generally, these parts of the prolog/epilog sequences are emitted
>> for every combination of saved/restored registers that had been
>> encountered.
>
>> Though, granted, when used, does mean that any such function needs to
>> effectively do two sets of stack-pointer adjustments:
>> One set for the save/restore area (in the reused part);
>> One part for the function (for its data and local/temporary variables
>> and similar).
>
>
>>> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
>>> the return address from the stack and fetch the instructions at the
>>> return address while still loading the preserved registers (that were
>>> saved) so that the instructions are ready for execution by the time
>>> the last LD is performed.
>>>
>>> In addition, If one is performing an EXIT and fetch runs into a CALL;
>>> it can fetch the Called address and if there is an ENTER instruction
>>> there, it can cancel the remainder of EXIT and cancel some of ENTER
>>> because the preserved registers are already on the stack where they
>>> are supposed to be.
>>>
>>> Doing these with STs and LDs cannot save those cycles.
>>>
>
>> I don't see why not, the branch-predictor can still do its thing
>> regardless of whether or not LD/ST ops were used.
>
> Consider::
>
> main:
>      ...
>      CALL   funct1
>      CALL   funct2
>
> funct2:
>      SUB    SP,SP,stackArea2
>      ST     R0,[SP,offset20]
>      ST     R30,[SP,offset230]
>      ST     R29,[SP,offset229]
>      ST     R28,[SP,offset228]
>      ST     R27,[SP,offset227]
>      ST     R26,[SP,offset226]
>      ST     R25,[SP,offset225]
>      ...
>
> funct1:
>      ...
>      LD     R0,[SP,offset10]
>      LD     R30,[SP,offset130]
>      LD     R29,[SP,offset129]
>      LD     R28,[SP,offset128]
>      LD     R27,[SP,offset127]
>      LD     R26,[SP,offset126]
>      LD     R25,[SP,offset125]
>      LD     R24,[SP,offset124]
>      LD     R23,[SP,offset123]
>      LD     R22,[SP,offset122]
>      LD     R21,[SP,offset121]
>      ADD    SP,SP,stackArea1
>      JMP    R0
>
> The above would have to observe that all offset1's are equal to all
> offset2's in order to short circuit the data movements. A single::
>
>      LD     R26,[SP,someotheroffset]
>
> ruins the short circuit.
>
> Whereas:
>
> funct2:
>      ENTER   R25,R0,stackArea2
>      ...
>
> funct1:
>      ...
>      EXIT    R21,R0,stackArea1
>
> will have registers R0,R25..R30 in the same positions on the stack
> guaranteed by ISA definition!!

OK.

This would be something other than a branch-predictor concern.

In the return case, all the branch predictor cares about is whether LR
or R1 is still in-flight at the moment the "RTS" or "JMP R1" is
encountered, but need not pattern-match the Loads/Stores to get there.

MOV.X instruction also saves/restores 2 registers, doesn't care about
what happens with the values, or how it relates to other instructions.

I guess, if one wanted to pattern match two MOV.X's, say, into a "MOV.Y"
(hypothetical 256-bit LD/ST), one would care that the offsets and
registers pair up.

This isn't currently done (since I don't have the register ports for this).

I guess it could be possible to detect this case for MOV.Q pairs and
effectively merge them into a MOV.X operation. Similar for LD pairs.

But, pattern matching instructions (AKA: "fusion") won't be cheap
either. For now though, I will ignore this possibility.
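
For what "the offsets and registers pair up" could mean in practice, a minimal C sketch of the check is below. The pairing rules are assumptions for illustration (same base register, second displacement exactly 8 bytes above the first, data registers forming an even/odd pair), and the struct and function names are made up; no actual decoder is being described.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned base;      /* base register number */
    int32_t  disp;      /* byte displacement */
    unsigned reg;       /* data register number */
    int      is_store;  /* 1 = store, 0 = load */
} MemOp64;

/* Could two adjacent 64-bit ops be merged into one 128-bit op?
 * Hypothetical rules: same direction, same base, adjacent offsets,
 * and an even/odd register pair. */
int can_fuse_pair(MemOp64 a, MemOp64 b)
{
    return a.is_store == b.is_store
        && a.base == b.base
        && b.disp == a.disp + 8
        && (a.reg % 2) == 0 && b.reg == a.reg + 1
        /* A first load that overwrites the base register would change
         * the second op's address, so that case is not fusable. */
        && (a.is_store || a.reg != a.base);
}
```

Even this small check needs several comparators across two instruction slots, which gives some feel for why fusion is not free.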

Re: The Impending Return of Concertina III

<uovq1g$2od98$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37127&group=comp.arch#37127

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Fri, 26 Jan 2024 08:21:04 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 12
Message-ID: <uovq1g$2od98$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 26 Jan 2024 08:21:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="bfe2c8ff40b87c232461d75fef535778";
logging-data="2897192"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19tm44Xtbj/wZbiHwKIxOUqay9G4XnUWVI="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4 gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:DifMZg/FP3H5GMME4YgN7a0IU3w=
 by: Quadibloc - Fri, 26 Jan 2024 08:21 UTC

On Tue, 23 Jan 2024 09:50:29 +0000, Quadibloc wrote:

> I have indeed decided that using three base registers for the
> basic load-store instructions is much preferable to shortening the
> length of the displacement even by one bit.

Another change has been made to Concertina III, based on the work
done for Concertina IV. The instruction prefix has been eliminated
as a possible meaning of the header word; instead, instruction
predication can be specified by the header.

John Savard

Re: The Impending Return of Concertina III

<up0obg$2un46$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37130&group=comp.arch#37130

Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: The Impending Return of Concertina III
Date: Fri, 26 Jan 2024 11:58:24 -0500
Organization: A noiseless patient Spider
Lines: 168
Message-ID: <up0obg$2un46$1@dont-email.me>
References: <uone2m$14id5$1@dont-email.me> <uonn2g$15qes$1@dont-email.me>
<uoo255$17ka9$1@dont-email.me> <uooa4p$1900g$1@dont-email.me>
<uop5li$1du2c$1@dont-email.me>
<dc0a43d861a45b6dafa9e2380baa8b46@www.novabbs.org>
<uoq8v4$1mnaf$1@dont-email.me>
<f69eaadf31222abccef981153e67479b@www.novabbs.org>
<uot19f$27cov$1@dont-email.me>
<22994cf4121cbd9f56183ee3fe85888e@www.novabbs.org>
<uoubqf$2e3oa$1@dont-email.me>
<a7b6bc9ec5550edc6bc4d7d120ea616e@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 26 Jan 2024 16:58:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="4055cc1d6c7c5cc6e4e1965c8d3a71f8";
logging-data="3103878"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Qlk4VI5fAqwKbt/EkTwG/TYzeAkcAPvI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:NeRfg3hQnh2xPzTzN5NM36lmdgc=
In-Reply-To: <a7b6bc9ec5550edc6bc4d7d120ea616e@www.novabbs.org>
Content-Language: en-US
 by: Robert Finch - Fri, 26 Jan 2024 16:58 UTC

On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>>>> BGB wrote:
>>>>>
>>>>>> Granted, one can argue the same of prolog/epilog compression in my
>>>>>> case:
>>>>>> Save some space on prolog/epilog by calling or branching to prior
>>>>>> versions (since the code to save and restore GPRs is fairly
>>>>>> repetitive).
>>>>>
>>>>> ENTER and EXIT eliminate the additional control transfers and can
>>>>> allow
>>>>> FETCH of the return address to start before the restores are finished.
>>>
>>>> Possible, but branches are cheaper to implement in hardware, and
>>>> would have been implemented already...
>>>
>>> Are you intentionally misreading what I wrote ??
>>>
>
>> ?? I don't understand.
>
>
>
>>> Epilogue is a sequence of loads leading to a jump to the return address.
>>>
>>> Your ISA cannot jump to the return address while performing the loads
>>> so FETCH does not get the return address and can't start fetching
>>> instructions until the jump is performed.
>>>
>
>> You can put the load for the return address before the other loads.
>> Then, if the epilog is long enough (so that this load is no longer in
>> flight once it hits the final jump), the branch-predictor will lead it
>> to start loading the post-return instructions before the jump is reached.
>
> Yes, you can read RA early.
> What you cannot do is JMP early so the FETCH stage fetches instructions
> at return address early.
> {{If you JMP early, then the rest of the LDs won't happen}}
>
>> This is likely a non-issue as I see it.
>
>> It is only really an issue if one demands that reloading the return
>> address be done as one of the final instructions in the epilog, and
>> not one of the first instructions.
>
> I make no such demand--I merely demand the JMP RA is the last instruction.
>
>> Granted, one would have to do it as one of the final ops, if it were
>> implemented as a slide, but it is not. There are "practical reasons"
>> why a slide would not be a workable strategy in this case.
>
>> So, generally, these parts of the prolog/epilog sequences are emitted
>> for every combination of saved/restored registers that had been
>> encountered.
>
>> Though, granted, when used, does mean that any such function needs to
>> effectively do two sets of stack-pointer adjustments:
>> One set for the save/restore area (in the reused part);
>> One part for the function (for its data and local/temporary variables
>> and similar).
>
>
>>> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
>>> the return address from the stack and fetch the instructions at the
>>> return address while still loading the preserved registers (that were
>>> saved) so that the instructions are ready for execution by the time
>>> the last LD is performed.
>>>
>>> In addition, If one is performing an EXIT and fetch runs into a CALL;
>>> it can fetch the Called address and if there is an ENTER instruction
>>> there, it can cancel the remainder of EXIT and cancel some of ENTER
>>> because the preserved registers are already on the stack where they
>>> are supposed to be.
>>>
>>> Doing these with STs and LDs cannot save those cycles.
>>>
>
>> I don't see why not, the branch-predictor can still do its thing
>> regardless of whether or not LD/ST ops were used.
>
> Consider::
>
> main:
>      ...
>      CALL   funct1
>      CALL   funct2
>
> funct2:
>      SUB    SP,SP,stackArea2
>      ST     R0,[SP,offset20]
>      ST     R30,[SP,offset230]
>      ST     R29,[SP,offset229]
>      ST     R28,[SP,offset228]
>      ST     R27,[SP,offset227]
>      ST     R26,[SP,offset226]
>      ST     R25,[SP,offset225]
>      ...
>
> funct1:
>      ...
>      LD     R0,[SP,offset10]
>      LD     R30,[SP,offset130]
>      LD     R29,[SP,offset129]
>      LD     R28,[SP,offset128]
>      LD     R27,[SP,offset127]
>      LD     R26,[SP,offset126]
>      LD     R25,[SP,offset125]
>      LD     R24,[SP,offset124]
>      LD     R23,[SP,offset123]
>      LD     R22,[SP,offset122]
>      LD     R21,[SP,offset121]
>      ADD    SP,SP,stackArea1
>      JMP    R0
>
> The above would have to observe that all offset1's are equal to all
> offset2's in order to short circuit the data movements. A single::
>
>      LD     R26,[SP,someotheroffset]
>
> ruins the short circuit.
>
> Whereas:
>
> funct2:
>      ENTER   R25,R0,stackArea2
>      ...
>
> funct1:
>      ...
>      EXIT    R21,R0,stackArea1
>
> will have registers R0,R25..R30 in the same positions on the stack
> guaranteed by ISA definition!!
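
The check described in the quoted text can be sketched as a toy model. This is only an illustration of the bookkeeping, not of My 66000 or any real pipeline: the function name, the dictionary representation, the 8-byte slot size and the base address are all assumptions made up for the example.

```python
def can_short_circuit(epilogue_loads, prologue_stores):
    """Toy check: both arguments map a register name to the absolute
    stack address it is loaded from / stored to (hypothetical model).

    The data movement can only be skipped if every register the next
    prologue stores was just loaded from exactly the same address.
    """
    shared = set(epilogue_loads) & set(prologue_stores)
    return bool(shared) and all(
        epilogue_loads[r] == prologue_stores[r] for r in shared
    )


# Assumed layout: slot address depends only on the register number.
BASE = 0x1000  # hypothetical address
loads = {f"R{n}": BASE + 8 * n for n in [0] + list(range(21, 31))}
stores = {f"R{n}": BASE + 8 * n for n in [0] + list(range(25, 31))}

print(can_short_circuit(loads, stores))          # all offsets agree

# A single load from some other offset breaks the match.
odd_loads = dict(loads, R26=0x2000)
print(can_short_circuit(odd_loads, stores))
```

With explicit LD/ST sequences the addresses have to be compared one by one, as above; with ENTER/EXIT the layout is fixed by the ISA, so no such comparison is needed.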

I like the ENTER / EXIT instructions and safe stack idea, and have
incorporated them into Q+ as ENTER and LEAVE. EXIT makes me think of
program exit(). They can improve code density. I gather that the stack
used for ENTER and EXIT is not the same stack as is available for the
rest of the app. This means managing two stack pointers, the regular
stack and the safe stack. Q+ could have the safe stack pointer as a
register that is not even accessible by the app and not part of the GPR
file.

For ENTER/LEAVE Q+ has the number of registers to save specified as a
four-bit number and saves only the saved registers, link register and
frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2,
the frame-pointer, link register and allocate 64 bytes plus the return
block on the stack. The return block contains the frame-pointer, link
register and two slots that are zeroed out intended for exception
handlers. The saved registers are limited to s0 to s9.
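
A minimal sketch of the ENTER/LEAVE behaviour described above, as a small simulation. Several things here are assumptions and not taken from Q+: 8-byte registers, a downward-growing stack, the order of the words inside the return block, the point at which the frame pointer is updated, and the use of a single stack (the safe stack is not modelled). The class and method names are made up for the example.

```python
WORD = 8  # assumed register width in bytes


class Machine:
    """Hypothetical model of a register file plus sparse stack memory."""

    def __init__(self, sp=0x1000):
        self.regs = {f"s{i}": 0 for i in range(10)}  # s0..s9
        self.regs.update(fp=0, lr=0, sp=sp)
        self.mem = {}  # address -> stored word

    def _push(self, value):
        self.regs["sp"] -= WORD
        self.mem[self.regs["sp"]] = value

    def _pop(self):
        value = self.mem[self.regs["sp"]]
        self.regs["sp"] += WORD
        return value

    def enter(self, nregs, local_bytes):
        if not 0 <= nregs <= 10:
            raise ValueError("only s0..s9 can be saved")
        # Return block: frame pointer, link register, two zeroed slots
        # (slot order is an assumption).
        self._push(self.regs["fp"])
        self._push(self.regs["lr"])
        self._push(0)
        self._push(0)
        self.regs["fp"] = self.regs["sp"]
        # Saved registers s0 .. s(nregs-1).
        for i in range(nregs):
            self._push(self.regs[f"s{i}"])
        # Space for locals.
        self.regs["sp"] -= local_bytes

    def leave(self, nregs, local_bytes):
        self.regs["sp"] += local_bytes
        for i in reversed(range(nregs)):
            self.regs[f"s{i}"] = self._pop()
        self._pop()  # discard the two exception-handler slots
        self._pop()
        self.regs["lr"] = self._pop()
        self.regs["fp"] = self._pop()
        return self.regs["lr"]  # address to return to
```

Under these assumptions, "ENTER 3,64" moves the stack pointer down by 4 words for the return block, 3 words for s0 to s2, and 64 bytes for locals, and the matching LEAVE undoes all of it.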

Q+ also has PUSHA / POPA instructions to push or pop all the
registers, meant for interrupt handlers. PUSH and POP instructions by
themselves can push or pop up to five registers.

Some thought has been given towards modifying ENTER and LEAVE to support
interrupt handlers, rather than have separate PUSHA / POPA instructions.
ENTER 15,0 would save all the registers, and LEAVE 15,0 would restore
them all and return using an interrupt return.
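
One way the four-bit count might be decoded if this change were made, as a rough sketch. The function name and return shape are hypothetical, the "<all registers>" entry is a placeholder and not a real register list, and treating counts 11 to 14 as an error is an assumption since nothing has been decided for them.

```python
SAVED = [f"s{i}" for i in range(10)]  # s0..s9


def decode_enter(count):
    """Return (registers_to_save, uses_interrupt_return) for a 4-bit count."""
    if not 0 <= count <= 15:
        raise ValueError("count is a four-bit field")
    if count == 15:
        # Proposed interrupt-handler form: save everything, and the
        # matching LEAVE returns with an interrupt return.
        return ["<all registers>"], True
    if count > len(SAVED):
        # Assumption: 11..14 have no defined meaning yet.
        raise ValueError("counts 11..14 are not defined")
    # Ordinary form: s0..s(count-1) plus frame pointer and link register.
    return SAVED[:count] + ["fp", "lr"], False


print(decode_enter(3))   # ordinary function entry
print(decode_enter(15))  # interrupt-handler entry
```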
