Re: Stealing a Great Idea from the 6600

Stealing a Great Idea from the 6600

<lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38312&group=comp.arch#38312

Newsgroups: comp.arch

by: John Savard - Wed, 17 Apr 2024 21:19 UTC

Not that I expect Mitch Alsup to approve!

The 6600 had several I/O processors with a 12-bit word length that
were really one processor, basically using SMT.

Well, if I have a processor with an ISA that involves register banks
of 32 registers each... an alternate instruction set involving
register banks of 8 registers each would let me allocate either one
compute thread or four threads with the I/O processor instruction set.

And what would the I/O processor instruction set look like?

Think of the PDP-11 or the 9900 but give more importance to
floating-point. So I've come up with this format for a part of the
instruction set:

0 : 1 bit
(First two bits of opcode: 00, 01, or 10 but not 11): 2 bits
(remainder of opcode): 5 bits
(mode, not 11): 2 bits
(destination register): 3 bits
(source register): 3 bits

is the format of register-to-register instructions;

but memory-to-register instructions are load-store:

0: 1 bit
(first two bits of opcode: 00, 01, or 10 but not 11): 2 bits
(remainder of load/store opcode): 3 bits
(base register): 2 bits
(mode: 11): 2 bits
(destination register): 3 bits
(index register): 3 bits

(displacement): 16 bits

If the index register is zero, the instruction refers to memory, but
is not indexed, as usual.

An almost complete instruction set, using 3/8 of the available opcode
space. Subroutine call and branch instructions, of course, are still
also needed.

John Savard
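
A minimal sketch of the two formats described in the post above, to make the bit budget concrete. The field widths and the "mode 11 means memory" rule come from the post; the function names, the dictionary keys, the treatment of the displacement as a raw 16-bit field, and the placement of the first halfword in the high bits of the 32-bit word are assumptions made only for illustration.

```python
# Bit layouts follow the post above; function and field names are hypothetical.

def encode_rr(opcode, mode, dst, src):
    """16-bit register-to-register form: 0 | opcode(7) | mode(2) | dst(3) | src(3)."""
    assert 0 <= opcode < 0b1100000   # leading two opcode bits are 00, 01 or 10
    assert 0 <= mode < 0b11          # mode 11 marks the memory form instead
    assert 0 <= dst < 8 and 0 <= src < 8
    return (opcode << 8) | (mode << 6) | (dst << 3) | src


def encode_mem(opcode, base, dst, index, disp):
    """32-bit load/store form: 0 | opcode(5) | base(2) | 11 | dst(3) | index(3), then disp(16)."""
    assert 0 <= opcode < 0b11000     # leading two opcode bits are 00, 01 or 10
    assert 0 <= base < 4 and 0 <= dst < 8 and 0 <= index < 8
    assert 0 <= disp < (1 << 16)     # raw field; signedness is not stated in the post
    first = (opcode << 10) | (base << 8) | (0b11 << 6) | (dst << 3) | index
    return (first << 16) | disp      # assumption: first halfword in the high bits


def in_described_space(halfword):
    """True if the first halfword starts with 0 and its next two bits are not 11."""
    return (halfword >> 15) == 0 and ((halfword >> 13) & 0b11) != 0b11


def decode(halfword, disp=None):
    """Split a first halfword into fields; disp is the second halfword, if any."""
    if not in_described_space(halfword):
        raise ValueError("encoding is outside the formats described here")
    mode = (halfword >> 6) & 0b11
    dst = (halfword >> 3) & 0b111
    low = halfword & 0b111
    if mode != 0b11:
        return {"form": "rr", "opcode": (halfword >> 8) & 0x7F,
                "mode": mode, "dst": dst, "src": low}
    # Memory form: an index field of zero means "not indexed".
    return {"form": "mem", "opcode": (halfword >> 10) & 0x1F,
            "base": (halfword >> 8) & 0b11, "dst": dst,
            "index": low, "indexed": low != 0, "disp": disp}


# Fraction of all 16-bit first halfwords covered by the two formats.
covered = sum(in_described_space(h) for h in range(1 << 16))
print(covered, "of", 1 << 16)        # 3 * 2**13 of 2**16, i.e. 3/8
```

The count at the end reproduces the "3/8 of the available opcode space" figure: the leading 0 bit halves the space, and excluding the 11 pattern in the next two bits keeps three quarters of that half.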

Re: Stealing a Great Idea from the 6600

<CuXTN.2184$Dfwf.1335@fx12.iad>

https://news.novabbs.org/devel/article-flat.php?id=38313&group=comp.arch#38313

Newsgroups: comp.arch

by: Scott Lurndal - Wed, 17 Apr 2024 21:50 UTC

John Savard <quadibloc@servername.invalid> writes:
>Not that I expect Mitch Alsup to approve!
>
>The 6600 had several I/O processors with a 12-bit word length that
>were really one processor, basically using SMT.
>
>Well, if I have a processor with an ISA that involves register banks
>of 32 registers each... an alternate instruction set involving
>register banks of 8 registers each would let me allocate either one
>compute thread or four threads with the I/O processor instruction set.
>
>And what would the I/O processor instruction set look like?

On the Burroughs B4900, it looked a lot like an 8085. In fact,
it was an 8085.

>
>Think of the PDP-11 or the 9900 but give more importance to
>floating-point.

Why on earth would an I/O processor use floating point?

Re: Stealing a Great Idea from the 6600

<e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38314&group=comp.arch#38314

Newsgroups: comp.arch

by: MitchAlsup1 - Wed, 17 Apr 2024 23:32 UTC

While I much admire CDC 6600 PPs and how much work those puppies did
allowing the big number crunchers to <well> crunch numbers::

With modern technology allowing 32-128 CPUs on a single die--there is
no reason to limit the width of a PP to 12-bits (1965:: yes there was
ample reason:: 2024 no reason whatsoever.) There is little reason to
even do 32-bit PPs when it cost so little more to get a 64-bit core.

In 2005-6 I was looking into a Verilog full x86-64 core {less FP} so
that those smaller CPUs could run ISRs and kernel codes to offload the
big CPUs from I/O duties. Done in Verilog meant anyone could compile it
onto another die so the I/O CPUs were out on the PCIe tree nanoseconds
away from the peripherals rather than microseconds away. Close enough
to perform the DMA activities on behalf of the devices; and consuming
interrupts so the bigger cores did not see any of them (except timer).

As Scott stated:: there does not seem to be any reason to need FP on a
core only doing I/O and kernel queueing services.

Re: Stealing a Great Idea from the 6600

<in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38315&group=comp.arch#38315

Newsgroups: comp.arch

by: John Savard - Thu, 18 Apr 2024 03:14 UTC

On Wed, 17 Apr 2024 23:32:20 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

>With modern technology allowing 32-128 CPUs on a single die--there is
>no reason to limit the width of a PP to 12-bits (1965:: yes there was
>ample reason:: 2024 no reason whatsoever.) There is little reason to
>even do 32-bit PPs when it cost so little more to get a 64-bit core.

Well, I'm not. The PP instruction set I propose uses 16-bit and 32-bit
instructions, and so uses the same bus as the main instruction set.

>As Scott stated:: there does not seem to be any reason to need FP on a
>core only doing I/O and kernel queueing services.

That's true.

This isn't about cores, though. Instead, a core running the main ISA
of the processor will simply have the option to replace one
regular-ISA thread by four threads which use 8 registers instead of
32, allowing SMT with more threads.

So we're talking about the same core. The additional threads will get
to execute instructions 1/4 as often as regular threads, so their
performance is reduced, matching an ISA that gives them fewer
registers.

Since the design is reminiscent of the 6600 PPs, these threads might
be used for I/O tasks, but nothing stops them from being used for
other purposes for which access to the FP capabilities of the chip may
be relevant.

John Savard
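
A small sketch of the "1/4 as often" claim in the post above, assuming (purely for illustration) a strict round-robin issue policy in which each thread slot gets an equal turn and a slot holding four reduced-ISA threads rotates among them. Real SMT cores issue opportunistically rather than in fixed turns; the slot and thread names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def issue_order(slots, cycles):
    """Return which thread issues on each cycle.

    slots is a list of thread slots; each slot is a list holding either one
    full-ISA thread or several reduced-ISA threads that share that slot.
    """
    rotors = [cycle(threads) for threads in slots]
    order = []
    for c in range(cycles):
        # Slots take turns; within a slot, the sharing threads take turns.
        order.append(next(rotors[c % len(rotors)]))
    return order


# One slot kept as a full-ISA thread, one slot split into four light threads.
slots = [["full0"], ["light0", "light1", "light2", "light3"]]
counts = Counter(issue_order(slots, 16))
for name in sorted(counts):
    print(name, counts[name])
```

Over 16 cycles the full-ISA thread issues 8 times and each light thread issues twice, which is the 4:1 ratio the post describes under this simplified policy.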

Re: Stealing a Great Idea from the 6600

<uvq4c7$229m8$3@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38316&group=comp.arch#38316

Newsgroups: comp.arch

by: Lawrence D'Oliveiro - Thu, 18 Apr 2024 03:34 UTC

On Wed, 17 Apr 2024 15:19:03 -0600, John Savard wrote:

> The 6600 had several I/O processors with a 12-bit word length that were
> really one processor, basically using SMT.

Originally these “PPUs” (“Peripheral Processor Units”) were for running
the OS, while the main CPU was primarily dedicated to running user
programs.

Apparently this idea did not work out so well, and in later versions of the
OS, more code ran on the CPU instead of the PPUs.

Re: Stealing a Great Idea from the 6600

<71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38317&group=comp.arch#38317

Newsgroups: comp.arch

by: MitchAlsup1 - Thu, 18 Apr 2024 16:55 UTC

John Savard wrote:

> On Wed, 17 Apr 2024 23:32:20 +0000, mitchalsup@aol.com (MitchAlsup1)
> wrote:

>>With modern technology allowing 32-128 CPUs on a single die--there is
>>no reason to limit the width of a PP to 12-bits (1965:: yes there was
>>ample reason:: 2024 no reason whatsoever.) There is little reason to
>>even do 32-bit PPs when it cost so little more to get a 64-bit core.

> Well, I'm not. The PP instruction set I propose uses 16-bit and 32-bit
> instructions, and so uses the same bus as the main instruction set.

>>As Scott stated:: there does not seem to be any reason to need FP on a
>>core only doing I/O and kernel queueing services.

> That's true.

> This isn't about cores, though. Instead, a core running the main ISA
> of the processor will simply have the option to replace one
> regular-ISA thread by four threads which use 8 registers instead of
> 32, allowing SMT with more threads.

The hard thing is to run the Operating System in the PPs using the same
compiled code in either a big core or in a little core. The big cores
are on a CPU centric die, the little ones out on device oriented dies.
In 7nm a MIPS R2000 is less than 0.07mm^2 using std cells. At this size
every device can have its own core.

> So we're talking about the same core. The additional threads will get
> to execute instructions 1/4 as often as regular threads, so their
> performance is reduced, matching an ISA that gives them fewer
> registers.

I knew you were talking about it that way, I was trying to get you to
change your mind and use the same ISA in the device cores as you use
in the CPU cores so you can run the same OS code and even a bit of the
device drivers as well.

> Since the design is reminiscent of the 6600 PPs, these threads might
> be used for I/O tasks, but nothing stops them from being used for
> other purposes for which access to the FP capabilities of the chip may
> be relevant.

Yes, exactly, and it is for those other purposes that you want these
device cores to operate on the same ISA as the big cores. This way if
anything goes wrong, you can simply lob the code back to a CPU centric
core and finish the job.

> John Savard

Re: Stealing a Great Idea from the 6600

<25b984ada404192243d78f6a78f45709@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38318&group=comp.arch#38318

Newsgroups: comp.arch

by: MitchAlsup1 - Thu, 18 Apr 2024 16:59 UTC

Lawrence D'Oliveiro wrote:

> On Wed, 17 Apr 2024 15:19:03 -0600, John Savard wrote:

>> The 6600 had several I/O processors with a 12-bit word length that were
>> really one processor, basically using SMT.

> Originally these “PPUs” (“Peripheral Processor Units”) were for running
> the OS,

Including polling DMA performed by the PPs.

> while the main CPU was primarily dedicated to running user
> programs.

> Apparently this idea did not work out so well, and in later versions of the
> OS, more code ran on the CPU instead of the PPUs.

Imagine that: ten 12-bit CPUs, each running at 1/10 the frequency of the
main CPU, having a hard time performing OS workloads while the 50× faster
CPU cores perform user workloads.

Re: Stealing a Great Idea from the 6600

<1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38326&group=comp.arch#38326

Newsgroups: comp.arch

by: John Savard - Fri, 19 Apr 2024 05:42 UTC

On Thu, 18 Apr 2024 16:55:37 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

>Yes, exactly, and it is for those other purposes that you want these
>device cores to operate on the same ISA as the big cores. This way if
>anything goes wrong, you can simply lob the code back to a CPU centric
>core and finish the job.

If the design has P-cores and E-cores, both will have the same *pair*
of ISAs.

Code written in the big ISA will run on both kinds of core, and code
written in the little ISA will also run on both kinds of core, but use
less resources on whichever core it is placed.

So I won't have _that_ problem.

Each core can just switch between compute duty with N threads, and I/O
service duty with 4*N threads - or anywhere in between.

John Savard
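
The "N threads, 4*N threads, or anywhere in between" remark in the post above can be enumerated directly. This is only an arithmetic sketch: it assumes each of a core's N thread slots is independently either one full-ISA thread or four reduced-ISA threads, and the function name and the choice of N = 2 are hypothetical.

```python
def thread_mixes(n_slots, light_per_slot=4):
    """List every (full-ISA threads, reduced-ISA threads) mix for one core.

    k slots are converted to reduced-ISA duty; the other n_slots - k stay full.
    """
    return [(n_slots - k, light_per_slot * k) for k in range(n_slots + 1)]


# A core with two thread slots (N = 2).
for full, light in thread_mixes(2):
    print(full, "full-ISA threads,", light, "reduced-ISA threads")
```

For N = 2 this gives (2, 0), (1, 4) and (0, 8): pure compute duty, a mixed configuration, and pure I/O service duty.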

Re: Stealing a Great Idea from the 6600

<oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38327&group=comp.arch#38327

Newsgroups: comp.arch

by: John Savard - Fri, 19 Apr 2024 07:38 UTC

On Thu, 18 Apr 2024 23:42:15 -0600, John Savard
<quadibloc@servername.invalid> wrote:

>Each core can just switch between compute duty with N threads, and I/O
>service duty with 4*N threads - or anywhere in between.

So I hope it is clear now I'm talking about SMT threads, not cores.
Threads are orthogonal to cores.

But I did make one oversimplification that could be confusing.

The full instruction set assumes banks of 32 registers, one each for
integer and floats, the reduced instruction set assumes banks of 8
registers, one each for integer and floats.

So one thread of the full ISA can be replaced by four threads of the
reduced ISA, both use the same number of registers.

That's all right for an in-order design. But in real life, computers
are out-of-order. So the *rename* registers would have to be split up.

Since the reduced ISA threads are four times greater in number, their
instructions have four times longer to finish executing before their
thread gets a chance to execute again. So presumably reduced ISA
threads will need less aggressive OoO, and 1/4 the rename registers
might be adequate, but there's obviously no guarantee that this would
indeed be an ideal fit.

John Savard
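
A sketch of the register-bank arithmetic in the post above, for the in-order case: four reduced-ISA threads with 8 registers each occupy exactly the 32 registers that one full-ISA thread would use in a given bank (integer or float). The linear slot-major layout, the function names, and the figure of 96 rename registers are assumptions for illustration only; the post itself notes there is no guarantee that a 1/4 split of rename registers is a good fit.

```python
FULL_REGS = 32    # architectural registers per bank in the full ISA
LIGHT_REGS = 8    # architectural registers per bank in the reduced ISA
LIGHT_PER_SLOT = FULL_REGS // LIGHT_REGS   # four light threads per full thread


def physical_index(slot, reg, light_thread=None):
    """Map an architectural register to an index within one register bank.

    slot is the full-thread slot; light_thread (0..3) selects which of the
    reduced-ISA threads sharing that slot is meant, or None for a full thread.
    """
    if light_thread is None:
        assert 0 <= reg < FULL_REGS
        return slot * FULL_REGS + reg
    assert 0 <= light_thread < LIGHT_PER_SLOT and 0 <= reg < LIGHT_REGS
    return slot * FULL_REGS + light_thread * LIGHT_REGS + reg


def rename_share(rename_regs_per_slot, light=False):
    """Naive equal split of a slot's rename registers among its light threads."""
    return rename_regs_per_slot // LIGHT_PER_SLOT if light else rename_regs_per_slot


# The four light threads in slot 1 cover the same indices as one full thread there.
light_regs = {physical_index(1, r, t) for t in range(LIGHT_PER_SLOT) for r in range(LIGHT_REGS)}
full_regs = {physical_index(1, r) for r in range(FULL_REGS)}
print(light_regs == full_regs)
print(rename_share(96, light=True))   # 96 is a made-up rename-pool size
```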

Re: Stealing a Great Idea from the 6600

<dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38328&group=comp.arch#38328

Newsgroups: comp.arch

by: MitchAlsup1 - Fri, 19 Apr 2024 18:40 UTC

John Savard wrote:

> On Thu, 18 Apr 2024 23:42:15 -0600, John Savard
> <quadibloc@servername.invalid> wrote:

>>Each core can just switch between compute duty with N threads, and I/O
>>service duty with 4*N threads - or anywhere in between.

> So I hope it is clear now I'm talking about SMT threads, not cores.
> Threads are orthogonal to cores.

That was already clear.

> But I did make one oversimplification that could be confusing.

> The full instruction set assumes banks of 32 registers, one each for
> integer and floats, the reduced instruction set assumes banks of 8
> registers, one each for integer and floats.

> So one thread of the full ISA can be replaced by four threads of the
> reduced ISA, both use the same number of registers.

So how does a 32-register thread "call" an 8 register thread ?? or vice
versa ??

What ABI model does the compiler use ??

When an 8-register thread takes an exception, is it handled by a 8-reg
thread or a 32-register thread ??

> That's all right for an in-order design. But in real life, computers
> are out-of-order. So the *rename* registers would have to be split up.

In K9 we unified the x86 register files into a single file to simplify
HW maintenance of the OoO state.

> Since the reduced ISA threads are four times greater in number, their
> instructions have four times longer to finish executing before their
> thread gets a chance to execute again.

Now all that forwarding logic is wasting its gates of delay and area
without adding any performance.

Now all those instruction schedulers are sitting around doing nothing.

> So presumably reduced ISA
> threads will need less aggressive OoO, and 1/4 the rename registers
> might be adequate, but there's obviously no guarantee that this would
> indeed be an ideal fit.

LoL.

> John Savard

Re: Stealing a Great Idea from the 6600

<b1q62jhfp2qi2gjbnqd4kk14boderokara@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38329&group=comp.arch#38329

Newsgroups: comp.arch

by: John Savard - Sat, 20 Apr 2024 07:06 UTC

On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

>So how does a 32-register thread "call" an 8 register thread ?? or vice
>versa ??

That sort of thing would be done by supervisor mode instructions,
similar to the ones used to start additional threads on a given core,
or start threads on a new core.

Since the lightweight ISA has the benefit of having fewer registers
allocated, it's not the same as, say, a "thumb mode" which offers
more compact code as its benefit. Instead, this is for use in classes
of threads that are separate from ordinary code.

I/O processing threads being one example of this.

The intent of this kind of lightweight ISA is to reduce the temptation
to decide "oh, we've got to put special smaller cores in the SoC/on
the motherboard to perform this specialized task, because the main CPU
is overkill". Because now you're using a smaller slice of the main
CPU, so it's not a waste to do it there any more.

John Savard

Re: Stealing a Great Idea from the 6600

<acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38330&group=comp.arch#38330

Newsgroups: comp.arch

by: John Savard - Sat, 20 Apr 2024 07:09 UTC

On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
>John Savard wrote:

>> So presumably reduced ISA
>> threads will need less aggressive OoO, and 1/4 the rename registers
>> might be adequate, but there's obviously no guarantee that this would
>> indeed be an ideal fit.
>
>LoL.

Well, yes. The fact that pretty much all serious high-performance
designs these days _are_ OoO basically means that my brilliant idea is
DoA.

Of course, instead of replacing 1 full-ISA thread with 4 light-ISA
threads, one could use a different number, based on what is optimum
for a given implementation. But that ratio would now vary from one
chip to another, being model-dependent.

So it's not *totally* destroyed, but this is still a major blow.

John Savard

Re: Stealing a Great Idea from the 6600

<kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38331&group=comp.arch#38331

Newsgroups: comp.arch

by: John Savard - Sat, 20 Apr 2024 07:12 UTC

On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
<quadibloc@servername.invalid> wrote:

>On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
>wrote:
>>John Savard wrote:
>
>>> So presumably reduced ISA
>>> threads will need less aggressive OoO, and 1/4 the rename registers
>>> might be adequate, but there's obviously no guarantee that this would
>>> indeed be an ideal fit.
>>
>>LoL.
>
>Well, yes. The fact that pretty much all serious high-performance
>designs these days _are_ OoO basically means that my brilliant idea is
>DoA.
>
>Of course, instead of replacing 1 full-ISA thread with 4 light-ISA
>threads, one could use a different number, based on what is optimum
>for a given implementation. But that ratio would now vary from one
>chip to another, being model-dependent.
>
>So it's not *totally* destroyed, but this is still a major blow.

And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.

That also happened to the captain of the _Titanic_.

John Savard

Re: Stealing a Great Idea from the 6600

<9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38332&group=comp.arch#38332

Newsgroups: comp.arch

Date: Sat, 20 Apr 2024 17:07:11 +0000
Subject: Re: Stealing a Great Idea from the 6600
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$BYbIV31PXrDsz9e49RpRo.4uRw0NFpcjFxwORvRCPT4hkieO32E3e
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org> <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com> <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com>
Organization: Rocksolid Light
Message-ID: <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org>

by: MitchAlsup1 - Sat, 20 Apr 2024 17:07 UTC

John Savard wrote:

> On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
> <quadibloc@servername.invalid> wrote:

> And, hey, I'm not the first guy to get sunk because of forgetting what
> lies under the tip of the iceberg that's above the water.

> That also happened to the captain of the _Titanic_.

Concer-tina-tanic !?!

> John Savard

Re: Stealing a Great Idea from the 6600

<v017mg$3rcg9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38333&group=comp.arch#38333

Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sat, 20 Apr 2024 15:14:07 -0500
Organization: A noiseless patient Spider
Lines: 206
Message-ID: <v017mg$3rcg9$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>
<e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org>
<in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com>
<71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org>
<1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com>
<oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com>
<dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org>
<acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com>
<kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com>
<9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 20 Apr 2024 22:14:08 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e2be22f641367eea2e3a03e20b93105a";
logging-data="4043273"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19fIpjSTxCqPmxYS63YSowVf8A0OyWpEbo="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:5rL9RF7zJgmhbjD/Y+I3JBwG5JI=
Content-Language: en-US
In-Reply-To: <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org>

by: BGB - Sat, 20 Apr 2024 20:14 UTC

On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
> John Savard wrote:
>
>> On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
>> <quadibloc@servername.invalid> wrote:
>
>
>> And, hey, I'm not the first guy to get sunk because of forgetting what
>> lies under the tip of the iceberg that's above the water.
>
>> That also happened to the captain of the _Titanic_.
>
> Concer-tina-tanic !?!
>

Seems about right.
Seems like a whole lot of flailing with designs that seem needlessly
complicated...

Meanwhile, I have looked around and noted:
In some ways, RISC-V is sort of like MIPS with the field order reversed,
and (ironically) actually smaller immediate fields (MIPS was using a lot
of Imm16 fields, whereas RISC-V mostly used Imm12).

But, MIPS seemed to have more wonk:
A mode with 32x 32-bit GPRs;
A mode with 32x 64-bit GPRs;
Apparently a mode with 32x 32-bit GPRs that can be paired to 16x 64-bits
as needed for 64-bit operations?...
Integer operations (on 64-bit registers) that give UB or trap if values
are outside of signed Int32 range;
Other operations that sign-extend the values but are ironically called
"unsigned" (apparently, similar wonk to RISC-V by having sign-extended
Unsigned Int);
Branch operations are bit-sliced;
....

I had preferred a different strategy in some areas:
Assume non-trapping operations by default;
Sign-extend signed values, zero-extend unsigned values.

Though, this is partly the source of some operations in my case assuming
33 bit sign-extended: This can represent both the signed and unsigned
32-bit ranges.
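
To make the 33-bit point concrete, here is a small sketch in Python (used only as executable pseudocode; the helper names are invented for illustration and are not taken from any ISA). It checks that every signed and every unsigned 32-bit value fits in a 33-bit two's-complement field, while 32 bits alone is not enough for the unsigned range.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    mask = (1 << bits) - 1
    value &= mask
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def fits_signed(value, bits):
    """True if value is representable as a `bits`-bit two's-complement integer."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1
UINT32_MAX = (1 << 32) - 1

# Both 32-bit ranges fit in a 33-bit sign-extended field.
both_fit = all(fits_signed(v, 33) for v in (INT32_MIN, INT32_MAX, 0, UINT32_MAX))
```

The 33rd bit acts as the sign, so an unsigned value with bit 31 set still reads back as positive.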

One could argue that sign-extending both could save 1 bit in some cases.
But, this creates wonk in other cases, such as requiring an explicit
zero extension for "unsigned int" to "long long" casts; and more cases
where separate instructions are needed for Int32 and Int64 cases (say,
for example, RISC-V needed around 4x as many Int<->Float conversion
operators due to its design choices in this area).
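
A minimal sketch of the cast issue, again in Python as executable pseudocode with invented helper names. It assumes the convention (as on RV64) that a 32-bit value is held sign-extended in a 64-bit register even when its C type is "unsigned int", and contrasts it with the zero-extending convention.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def reg_sign_extended(u32):
    """64-bit register image of a 32-bit value kept sign-extended."""
    u32 &= MASK32
    if u32 & 0x80000000:
        return u32 | (MASK64 ^ MASK32)  # copy bit 31 into bits 63..32
    return u32


def reg_zero_extended(u32):
    """64-bit register image of a 32-bit value kept zero-extended."""
    return u32 & MASK32


def to_u64_from_sign_extended(reg):
    """Explicit zero-extension step needed for 'unsigned int' -> 'long long'."""
    return reg & MASK32
```

With the zero-extending convention the register image already is the 64-bit value, so the cast costs nothing; with the sign-extending one, the extra masking step is required whenever bit 31 is set.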

Say:
RV64:
Int32<->Binary32, UInt32<->Binary32
Int64<->Binary32, UInt64<->Binary32
Int32<->Binary64, UInt32<->Binary64
Int64<->Binary64, UInt64<->Binary64
BJX2:
Int64<->Binary64, UInt64<->Binary64

With the Uint64 case mostly added because otherwise one needs a wonky
edge case to deal with this (but is rare in practice).

The separate 32-bit cases were avoided by tending to normalize
everything to Binary64 in registers (with Binary32 only existing in SIMD
form or in memory).

Annoyingly, I did end up needing to add logic for all of these cases to
deal with RV64G.

Currently no plans to implement RISC-V's Privileged ISA stuff, mostly
because it would likely be unreasonably expensive. It is in theory
possible to write an OS to run in RISC-V mode, but it would need to deal
with the different OS level and hardware-level interfaces (in much the
same way, as I needed to use a custom linker script for GCC, as my stuff
uses a different memory map from the one GCC had assumed; namely that of
RAM starting at the 64K mark, rather than at the 16MB mark).

In some cases in my case, there are distinctions between 32-bit and
64-bit compare-and-branch ops. I am left thinking this distinction may
be unnecessary, and one may only need 64 bit compare and branch.

In the emulator, the current difference ended up mostly that the 32-bit
version sees if the 32-bit and 64-bit version would give a different
result and faulting if so, since this generally means that there is a
bug elsewhere (such as other code is producing out-of-range values).
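
The check described above could look roughly like this. This is a sketch only, not the actual emulator code; the function and exception names are hypothetical, and a signed greater-than compare is picked as the example.

```python
class RangeFault(Exception):
    """Raised when the 32-bit and 64-bit views of a compare disagree."""


def s32(x):
    """Low 32 bits of x as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def s64(x):
    """Low 64 bits of x as a signed integer."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x & (1 << 63) else x


def cmpgt_32_checked(a, b):
    """32-bit signed compare that faults if a 64-bit compare would differ."""
    r32 = s32(a) > s32(b)
    r64 = s64(a) > s64(b)
    if r32 != r64:
        raise RangeFault("register holds a value outside the Int32 range")
    return r32
```

If the registers always hold properly extended 32-bit values, the two results agree and the fault never fires.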

For a few newer cases (such as the 3R compare ops, which produce a 1-bit
output in a register), had only defined 64-bit versions.

One could just ignore the distinction between 32 and 64 bit compare in
hardware, but had still burnt the encoding space on this. In a new ISA
design, I would likely drop the existence of 32-bit compare and use
exclusively 64-bit compare.

In many cases, the distinction between 32-bit and 64-bit operations, or
between 2R and 3R cases, had ended up less significant than originally
thought (and now have ended up gradually deprecating and disabling some
of the 32-bit 2R encodings mostly due to "lack of relevance").

Though, admittedly, part of the reason for a lot of separate 2R cases
existing was that I had initially had the impression that there may have
been a performance cost difference between 2R and 3R instructions. This
ended up not really the case, as the various units ended up typically
using 3R internally anyways.

So, say, one needs an ALU with, say:
2 inputs, one output;
Ability to bit-invert the second input
along with inverting carry-in, ...
Ability to sign or zero extend the output.
So, say, operations:
ADD / SUB (Add, 64-bit)
ADDSL / SUBSL (Add, 32-bit, sign extend)
ADDUL / SUBUL (Add, 32-bit, zero extend)
AND
OR
XOR
CMPEQ
CMPNE
CMPGT (CMPLT implicit)
CMPGE (CMPLE implicit)
CMPHI (unsigned GT)
CMPHS (unsigned GE)
....
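
As an illustration of the adder part of that list, a behavioural sketch in Python (executable pseudocode, not RTL; the function names merely echo the mnemonics above, and the exact semantics are an assumption). Subtraction reuses the adder by inverting the second input and the carry-in, and the 32-bit forms only differ in how the low 32 bits of the result are extended.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def alu_add(a, b, subtract=False):
    """64-bit ADD/SUB: SUB inverts the second input and sets carry-in to 1."""
    b_in = (~b & MASK64) if subtract else (b & MASK64)
    carry_in = 1 if subtract else 0
    return ((a & MASK64) + b_in + carry_in) & MASK64


def extend32(result, signed):
    """Sign- or zero-extend the low 32 bits of a result to 64 bits."""
    low = result & MASK32
    if signed and low & 0x80000000:
        return low | (MASK64 ^ MASK32)
    return low


def addsl(a, b):
    return extend32(alu_add(a, b), signed=True)


def addul(a, b):
    return extend32(alu_add(a, b), signed=False)


def subsl(a, b):
    return extend32(alu_add(a, b, subtract=True), signed=True)


def subul(a, b):
    return extend32(alu_add(a, b, subtract=True), signed=False)
```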

Where, internally compare works by performing a subtract and then
producing a result based on some status bits (Z,C,S,O). As I see it,
ideally these bits should not be exposed at the ISA level though (much
pain and hair results from the existence of architecturally visible ALU
status-flag bits).
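
A sketch of how the compare results can be derived from a subtract and the four status bits, in Python as executable pseudocode. The flag definitions follow the usual two's-complement conventions (carry set meaning "no borrow"); that choice, and the helper names, are assumptions rather than a description of any particular core.

```python
MASK64 = (1 << 64) - 1
SIGN64 = 1 << 63


def sub_flags(a, b):
    """Compute a - b as a + ~b + 1 and return the flags (Z, C, S, O)."""
    a &= MASK64
    b &= MASK64
    wide = a + (b ^ MASK64) + 1
    r = wide & MASK64
    z = r == 0                            # result is zero
    c = wide > MASK64                     # carry out, i.e. no borrow
    s = bool(r & SIGN64)                  # sign of the result
    o = bool((a ^ b) & (a ^ r) & SIGN64)  # signed overflow of a - b
    return z, c, s, o


def cmpeq(a, b):
    return sub_flags(a, b)[0]


def cmpne(a, b):
    return not sub_flags(a, b)[0]


def cmpge(a, b):
    z, c, s, o = sub_flags(a, b)
    return s == o


def cmpgt(a, b):
    z, c, s, o = sub_flags(a, b)
    return (not z) and s == o


def cmphs(a, b):
    return sub_flags(a, b)[1]


def cmphi(a, b):
    z, c, s, o = sub_flags(a, b)
    return c and not z
```

The flags exist only as intermediate values inside each function, which mirrors the idea of keeping them out of the architecturally visible state.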

Some other features could still be debated though, along with how much
simplification could be possible.

If I did a new design, would probably still keep predication and jumbo
prefixes.

Explicit bundling vs superscalar could be argued either way, as
superscalar isn't as expensive as initially thought, but in a simpler
form is comparably weak (the compiler has an advantage that it can
invest more expensive analysis into this, reorder instructions, etc; but
this only goes so far as the compiler understands the CPU's pipeline,
ties the code to a specific pipeline structure, and becomes effectively
moot with OoO CPU designs).

So, a case could be made that a "general use" ISA be designed without
the use of explicit bundling. In my case, using the bundle flags also
requires the code to use an instruction to signal to the CPU what
configuration of pipeline it expects to run on, with the CPU able to
fall back to scalar (or superscalar) execution if it does not match.

For the most part, thus far nearly everything has ended up as "Mode 2",
namely:
3 lanes;
Lane 1 does everything;
Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
Lane 3 only does Basic ALU ops and a few CONV ops and similar.
Lane 3 originally also did Shift, dropped to reduce cost.
Mem ops may eat Lane 3, ...

Where, say:
Mode 0 (Default):
Only scalar code is allowed, CPU may use superscalar (if available).
Mode 1:
2 lanes:
Lane 1 does everything;
Lane 2 does ALU, Shift, and CONV.
Mem ops take up both lanes.
Effectively scalar for Load/Store.
Later defined that 128-bit MOV.X is allowed in a Mode 1 core.
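
A very rough sketch of the Mode 2 lane rules as a capability table, in Python. Everything here is a simplification for illustration: the op-class names are hypothetical (not BJX2 mnemonics), the bundle is assumed to list lane 1 first, "a few CONV ops" is modelled as all CONV ops, and "Mem ops may eat Lane 3" is modelled as always eating it.

```python
# Hypothetical op classes per lane for "Mode 2"; lane 1 is index 0.
MODE2_LANES = [
    {"ALU", "SHIFT", "CONV", "MEM", "FPU", "BRANCH"},  # Lane 1: everything
    {"ALU", "SHIFT", "CONV"},                          # Lane 2
    {"ALU", "CONV"},                                   # Lane 3
]


def fits_mode2(bundle):
    """Return True if a list of op classes can issue as one Mode 2 bundle."""
    if len(bundle) > len(MODE2_LANES):
        return False
    # Simplifying assumption: a memory op always consumes lane 3.
    if "MEM" in bundle and len(bundle) == 3:
        return False
    return all(op in lanes for op, lanes in zip(bundle, MODE2_LANES))
```

A compiler-side bundler would run a check of this kind before setting the bundle flags, falling back to scalar encoding when it fails.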

Had defined wider modes, and ones that allow dual-lane IO and FPU
instructions, but these haven't seen use (too expensive to support in
hardware).

Had ended up with the ambiguous "extension" to the Mode 2 rules of
allowing an FPU instruction to be executed from Lane 2 if there was not
an FPU instruction in Lane 1, or allowing co-issuing certain FPU
instructions if they effectively combine into a corresponding SIMD op.

In my current configurations, there is only a single memory access port.
A second memory access port would help with performance, but is
comparably a rather expensive feature (and doesn't help enough to
justify its fairly steep cost).

For lower-end cores, a case could be made for assuming a 1-wide CPU with
a 2R1W register file, but designing the whole ISA around this limitation
and not allowing for anything more is limiting (and mildly detrimental
to performance). If we can assume cores with an FPU, we can probably
also assume cores with more than two register read ports available.

....

Re: Stealing a Great Idea from the 6600

<da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38334&group=comp.arch#38334

Newsgroups: comp.arch

Date: Sat, 20 Apr 2024 22:03:21 +0000
Subject: Re: Stealing a Great Idea from the 6600
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$tyQLyfhCYEQfIu4I8aUGNuHWpa1KT5UfbKOWym.eEUxwUDco1.BB6
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org> <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com> <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com> <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org> <v017mg$3rcg9$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org>

by: MitchAlsup1 - Sat, 20 Apr 2024 22:03 UTC

BGB wrote:

> On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
>> John Savard wrote:
>>
>>> On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
>>> <quadibloc@servername.invalid> wrote:
>>
>>
>>> And, hey, I'm not the first guy to get sunk because of forgetting what
>>> lies under the tip of the iceberg that's above the water.
>>
>>> That also happened to the captain of the _Titanic_.
>>
>> Concer-tina-tanic !?!
>>

> Seems about right.
> Seems like a whole lot of flailing with designs that seem needlessly
> complicated...

> Meanwhile, has looked around and noted:
> In some ways, RISC-V is sort of like MIPS with the field order reversed,

They, in effect, Little-Endian-ed the fields.

> and (ironically) actually smaller immediate fields (MIPS was using a lot
> of Imm16 fields, whereas RISC-V mostly used Imm12).

Yes, RISC-V took a step back with the 12-bit immediates. My 66000, on
the other hand, only has 12-bit immediates for shift instructions--
allowing all shifts to reside in one Major OpCode; the rest inst[31]=1
have 16-bit immediates (universally sign extended).

> But, seemed to have more wonk:
> A mode with 32x 32-bit GPRs; // unnecessary
> A mode with 32x 64-bit GPRs;
> Apparently a mode with 32x 32-bit GPRs that can be paired to 16x 64-bits
> as needed for 64-bit operations?...

Repeating the mistake I made on Mc 88100....

> Integer operations (on 64-bit registers) that give UB or trap if values
> are outside of signed Int32 range;

Isn't it just wonderful ??

> Other operations that sign-extend the values but are ironically called
> "unsigned" (apparently, similar wonk to RISC-V by having sign-extended
> Unsigned Int);
> Branch operations are bit-sliced;
> ....

> I had preferred a different strategy in some areas:
> Assume non-trapping operations by default;

Assume trap/"do the expected thing" under a user accessible flag.

> Sign-extend signed values, zero-extend unsigned values.

Another mistake I made in Mc 88100.

Do you sign extend the 16-bit displacement on an unsigned LD ??

> Though, this is partly the source of some operations in my case assuming
> 33 bit sign-extended: This can represent both the signed and unsigned
> 32-bit ranges.

These are some of the reasons My 66000 is 64-bit register/calculation only.

> One could argue that sign-extending both could save 1 bit in some cases.
> But, this creates wonk in other cases, such as requiring an explicit
> zero extension for "unsigned int" to "long long" casts; and more cases
> where separate instructions are needed for Int32 and Int64 cases (say,
> for example, RISC-V needed around 4x as many Int<->Float conversion
> operators due to its design choices in this area).

It also gets difficult when you consider EADD Rd,Rdouble,Rexponent ??
is it a FP calculation or an integer calculation ?? If Rdouble is a
constant is the constant FP or int, if Rexponent is a constant is it
double or int,..... Does it raise FP overflow or integer overflow ??

> Say:
> RV64:
> Int32<->Binary32, UInt32<->Binary32
> Int64<->Binary32, UInt64<->Binary32
> Int32<->Binary64, UInt32<->Binary64
> Int64<->Binary64, UInt64<->Binary64
> BJX2:
> Int64<->Binary64, UInt64<->Binary64
My 66000:
int64_t -> { uint64_t, float, double }
uint64_t -> { int64_t, float, double }
float -> { uint64_t, int64_t, double }
double -> { uint64_t, int64_t, float }

> With the Uint64 case mostly added because otherwise one needs a wonky
> edge case to deal with this (but is rare in practice).

> The separate 32-bit cases were avoided by tending to normalize
> everything to Binary64 in registers (with Binary32 only existing in SIMD
> form or in memory).

I saved LD and ST instructions by leaving float 32-bits in the registers.

> Annoyingly, I did end up needing to add logic for all of these cases to
> deal with RV64G.

No rest for the wicked.....

> Currently no plans to implement RISC-V's Privileged ISA stuff, mostly
> because it would likely be unreasonably expensive.

The sea of control registers or the sequencing model applied thereon ??
My 66000 allows access to all control registers via memory mapped I/O
space.

> It is in theory
> possible to write an OS to run in RISC-V mode, but it would need to deal
> with the different OS level and hardware-level interfaces (in much the
> same way, as I needed to use a custom linker script for GCC, as my stuff
> uses a different memory map from the one GCC had assumed; namely that of
> RAM starting at the 64K mark, rather than at the 16MB mark).

> In some cases in my case, there are distinctions between 32-bit and
> 64-bit compare-and-branch ops. I am left thinking this distinction may
> be unnecessary, and one may only need 64 bit compare and branch.

No 32-bit stuff, thereby no 32-bit distinctions needed.

> In the emulator, the current difference ended up mostly that the 32-bit
> version sees if the 32-bit and 64-bit version would give a different
> result and faulting if so, since this generally means that there is a
> bug elsewhere (such as other code is producing out-of-range values).

Saving vast amounts of power {{{not}}}

> For a few newer cases (such as the 3R compare ops, which produce a 1-bit
> output in a register), had only defined 64-bit versions.

Oh what a tangled web we.......

> One could just ignore the distinction between 32 and 64 bit compare in
> hardware, but had still burnt the encoding space on this. In a new ISA
> design, I would likely drop the existence of 32-bit compare and use
> exclusively 64-bit compare.

> In many cases, the distinction between 32-bit and 64-bit operations, or
> between 2R and 3R cases, had ended up less significant than originally
> thought (and now have ended up gradually deprecating and disabling some
> of the 32-bit 2R encodings mostly due to "lack of relevance").

I deprecated all of them.

> Though, admittedly, part of the reason for a lot of separate 2R cases
> existing was that I had initially had the impression that there may have
> been a performance cost difference between 2R and 3R instructions. This
> ended up not really the case, as the various units ended up typically
> using 3R internally anyways.

> So, say, one needs an ALU with, say:
> 2 inputs, one output;
you forgot carry, and inversion to perform subtraction.
> Ability to bit-invert the second input
> along with inverting carry-in, ...
> Ability to sign or zero extend the output.

So, My 66000 integer adder has 3 carry inputs, and I discovered a way to
perform these that takes no more gates of delay than the typical 1-carry
in 64-bit integer adder. This gives me a = -b -c; for free.
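
The arithmetic behind a = -b - c can be checked with a short Python sketch. This only demonstrates the two's-complement identity -b - c = ~b + ~c + 2 (mod 2^64), with the "+2" supplied as two unit carry-ins; how the three carry inputs are actually wired in the adder is not described here, and the function names are invented.

```python
MASK64 = (1 << 64) - 1


def add3c(x, y, c0=0, c1=0, c2=0):
    """64-bit add of x and y with up to three 1-bit carry inputs."""
    return ((x & MASK64) + (y & MASK64) + c0 + c1 + c2) & MASK64


def neg_b_minus_c(b, c):
    """-b - c computed as ~b + ~c with two carry-ins set."""
    return add3c(~b & MASK64, ~c & MASK64, 1, 1)
```

Each operand inversion needs its own +1 to become a negation, which is why a single carry-in is not enough for this case.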

> So, say, operations:
> ADD / SUB (Add, 64-bit)
> ADDSL / SUBSL (Add, 32-bit, sign extend) // nope
> ADDUL / SUBUL (Add, 32-bit, zero extend) // nope
> AND
> OR
> XOR
> CMPEQ // 1 ICMP inst
> CMPNE
> CMPGT (CMPLT implicit)
> CMPGE (CMPLE implicit)
> CMPHI (unsigned GT)
> CMPHS (unsigned GE)
> ....

> Where, internally compare works by performing a subtract and then
> producing a result based on some status bits (Z,C,S,O). As I see it,
> ideally these bits should not be exposed at the ISA level though (much
> pain and hair results from the existence of architecturally visible ALU
> status-flag bits).

I agree that these flags should not be exposed through ISA; and I did not.
On the other hand multi-precision arithmetic demands at least carry {or
some other means which is even more powerful--such as CARRY.....}
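
Why multi-precision arithmetic needs a carry can be seen in a plain limb-by-limb add, sketched here in Python. This is the generic textbook scheme with hypothetical helper names; it does not attempt to model the CARRY mechanism mentioned above.

```python
MASK64 = (1 << 64) - 1


def add_with_carry(a, b, carry_in=0):
    """Add two 64-bit limbs plus a carry; return (sum limb, carry out)."""
    total = (a & MASK64) + (b & MASK64) + carry_in
    return total & MASK64, total >> 64


def add_multiword(x, y):
    """Add two equal-length little-endian lists of 64-bit limbs."""
    result = []
    carry = 0
    for a, b in zip(x, y):
        limb, carry = add_with_carry(a, b, carry)
        result.append(limb)
    return result, carry
```

The carry out of each limb is the only information that has to travel between iterations, so the ISA needs some way to hand that one bit (or more, for multiply) to the next instruction.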

> Some other features could still be debated though, along with how much
> simplification could be possible.

> If I did a new design, would probably still keep predication and jumbo
> prefixes.

I kept predication but not the way most predication works.
My work on Mc 88120 and K9 taught me the futility of things in the
instruction stream that provide artificial boundaries. I have a suspicion
that if you have the FPGA capable of allowing you to build an 8-wide
machine, you would do the jumbo stuff differently, too.

> Explicit bundling vs superscalar could be argued either way, as
> superscalar isn't as expensive as initially thought, but in a simpler
> form is comparably weak (the compiler has an advantage that it can
> invest more expensive analysis into this, reorder instructions, etc; but
> this only goes so far as the compiler understands the CPU's pipeline,

Compilers are notoriously unable to outguess a good branch predictor.

Re: Stealing a Great Idea from the 6600

<sdl82jpkpf1t0ctr8sgqm5bvqqireg08j5@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38335&group=comp.arch#38335

Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (John Savard)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sat, 20 Apr 2024 17:59:12 -0600
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <sdl82jpkpf1t0ctr8sgqm5bvqqireg08j5@4ax.com>
References: <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com> <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com> <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org> <v017mg$3rcg9$1@dont-email.me> <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 21 Apr 2024 01:59:14 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="927163daac3c1197b41694d2d6822608";
logging-data="4134060"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1//CVtHGwrax7VPFcPU0p/dD0kNQRP2POU="
Cancel-Lock: sha1:4poenvelVYddkyMmJvoOMMOVYCQ=
X-Newsreader: Forte Free Agent 3.3/32.846

by: John Savard - Sat, 20 Apr 2024 23:59 UTC

On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
>BGB wrote:

>> Sign-extend signed values, zero-extend unsigned values.

>Another mistake I made in Mc 88100.

As that is a mistake the IBM 360 made, I make it too. But I make it
the way the 360 did: there are no signed and unsigned values, in the
sense of a Burroughs machine, there are just Load, Load Unsigned - and
Insert - instructions.

Index and base register values are assumed to be unsigned.

John Savard

Re: Stealing a Great Idea from the 6600

<onl82j9k5llpmtn8fdn6qkdbkp258d3r6b@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38336&group=comp.arch#38336

Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (John Savard)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sat, 20 Apr 2024 18:01:49 -0600
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <onl82j9k5llpmtn8fdn6qkdbkp258d3r6b@4ax.com>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org> <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com> <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <b1q62jhfp2qi2gjbnqd4kk14boderokara@4ax.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 21 Apr 2024 02:01:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="927163daac3c1197b41694d2d6822608";
logging-data="4134060"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19DD/fE+hRHsQFluC7FGwLJ7VJKopC2Cp8="
Cancel-Lock: sha1:Qy8ZlPdhtDM8tK7Ci2OC3ys23xA=
X-Newsreader: Forte Free Agent 3.3/32.846

by: John Savard - Sun, 21 Apr 2024 00:01 UTC

On Sat, 20 Apr 2024 01:06:33 -0600, John Savard
<quadibloc@servername.invalid> wrote:

>On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
>wrote:
>
>>So how does a 32-register thread "call" an 8 register thread ?? or vice
>>versa ??
>
>That sort of thing would be done by supervisor mode instructions,
>similar to the ones used to start additional threads on a given core,
>or start threads on a new core.
>
>Since the lightweight ISA has the benefit of having fewer registers
>allocated, it's not the same as, say, a "thumb mode" which offers
>more compact code as its benefit. Instead, this is for use in classes
>of threads that are separate from ordinary code.
>
>I/O processing threads being one example of this.

Of course, though, there's nothing preventing using the lightweight
ISA as the basis for something that _could_ interoperate with the full
ISA. Keep all 32 registers in each bank, and have a sliding 8-register
window, or use bundles of instructions, say up to seven instructions,
using one of three groups of eight integer registers and one of four
groups of floating-point registers. (The fourth group of integer
registers is the base registers.)
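
One way to picture the sliding-window variant, as a Python sketch. The mapping below (a window base added to a 3-bit register field, wrapping around the 32-register bank) is purely an assumption made for illustration; the names are hypothetical and nothing here is a worked-out design.

```python
BANK_SIZE = 32    # registers in the full bank
WINDOW_SIZE = 8   # registers visible to the lightweight ISA


def window_to_bank(window_base, reg_field):
    """Map a 3-bit register field to a register number in the 32-entry bank."""
    if not 0 <= reg_field < WINDOW_SIZE:
        raise ValueError("register field must fit in 3 bits")
    return (window_base + reg_field) % BANK_SIZE
```

Moving the window base is then what lets lightweight code and full-ISA code see the same physical registers.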

John Savard

Re: Stealing a Great Idea from the 6600

<gul82jlmud2gglbf1siupn180r3f5o3qo5@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38337&group=comp.arch#38337

Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (John Savard)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sat, 20 Apr 2024 18:06:22 -0600
Organization: A noiseless patient Spider
Lines: 25
Message-ID: <gul82jlmud2gglbf1siupn180r3f5o3qo5@4ax.com>
References: <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com> <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org> <v017mg$3rcg9$1@dont-email.me> <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org> <sdl82jpkpf1t0ctr8sgqm5bvqqireg08j5@4ax.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 21 Apr 2024 02:06:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="927163daac3c1197b41694d2d6822608";
logging-data="4134060"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19LVJS7Q/hNxWIhVfdIVCGGQdQ7LhfOjyU="
Cancel-Lock: sha1:WB0k4209DI+KEXpFkXBCUtwUHAE=
X-Newsreader: Forte Free Agent 3.3/32.846

by: John Savard - Sun, 21 Apr 2024 00:06 UTC

On Sat, 20 Apr 2024 17:59:12 -0600, John Savard
<quadibloc@servername.invalid> wrote:

>On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
>wrote:
>>BGB wrote:
>
>>> Sign-extend signed values, zero-extend unsigned values.
>
>>Another mistake I made in Mc 88100.
>
>As that is a mistake the IBM 360 made, I make it too. But I make it
>the way the 360 did: there are no signed and unsigned values, in the
>sense of a Burroughs machine, there are just Load, Load Unsigned - and
>Insert - instructions.

Since there was only one set of arithmetic instructions, that meant
that when you wrote code to operate on unsigned values, you had to
remember that the normal names of the condition code values were
oriented around signed arithmetic.

So during unsigned arithmetic, "overflow" didn't _mean_ overflow.
Instead, carry was overflow.
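
A small Python sketch of that distinction for a generic 32-bit two's-complement add (not a model of any specific machine's condition codes; the function name is made up). The same addition produces both a signed-overflow indication and a carry out, and which one means "the result did not fit" depends on how the programmer reads the operands.

```python
MASK32 = (1 << 32) - 1
SIGN32 = 1 << 31


def add32_flags(a, b):
    """Return (result, carry, signed_overflow) for a 32-bit add."""
    a &= MASK32
    b &= MASK32
    wide = a + b
    r = wide & MASK32
    carry = wide > MASK32  # unsigned result did not fit
    # Signed overflow: operands share a sign, result has the other sign.
    overflow = bool(~(a ^ b) & (a ^ r) & SIGN32)
    return r, carry, overflow
```

For example, 0x7FFFFFFF + 1 sets signed overflow but is a perfectly good unsigned sum, while 0xFFFFFFFF + 1 carries out but is a correct signed sum (-1 + 1 = 0).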

John Savard

Re: Stealing a Great Idea from the 6600

<44fdd1209496c66ba18e425370a8b50d@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38338&group=comp.arch#38338

Newsgroups: comp.arch

Date: Sun, 21 Apr 2024 00:43:21 +0000
Subject: Re: Stealing a Great Idea from the 6600
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$zBk7lnfTNM6rsbJGQbEFmuXzorFiPTA32MPcmpNiDaGD7QBx5xdQu
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com> <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com> <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org> <v017mg$3rcg9$1@dont-email.me> <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org> <sdl82jpkpf1t0ctr8sgqm5bvqqireg08j5@4ax.com>
Organization: Rocksolid Light
Message-ID: <44fdd1209496c66ba18e425370a8b50d@www.novabbs.org>

by: MitchAlsup1 - Sun, 21 Apr 2024 00:43 UTC

John Savard wrote:

> On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
> wrote:
>>BGB wrote:

>>> Sign-extend signed values, zero-extend unsigned values.

>>Another mistake I made in Mc 88100.

> As that is a mistake the IBM 360 made, I make it too. But I make it
> the way the 360 did: there are no signed and unsigned values, in the
> sense of a Burroughs machine, there are just Load, Load Unsigned - and
> Insert - instructions.

> Index and base register values are assumed to be unsigned.

I would use the term signless as opposed to unsigned.

Address arithmetic is ADD only and does not care about signs or
overflow. There is no concept of a negative base register or a
negative index register (or, for that matter, a negative displacement),
overflow, underflow, carry, ...
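
In other words, address generation is just modular addition. A minimal Python sketch, assuming a 64-bit address space and an unscaled index (both assumptions, as is the function name):

```python
MASK64 = (1 << 64) - 1


def effective_address(base, index, displacement):
    """Signless address arithmetic: a plain add, wrapped to 64 bits."""
    return (base + index + displacement) & MASK64
```

A displacement whose bit pattern is 0xFFFFFFFFFFFFFFF8 simply lands 8 bytes below the base after wrap-around; no sign, overflow, or carry condition is ever examined.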

> John Savard

Re: Stealing a Great Idea from the 6600

<v02eij$6d5b$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38340&group=comp.arch#38340

Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sun, 21 Apr 2024 02:17:39 -0500
Organization: A noiseless patient Spider
Lines: 727
Message-ID: <v02eij$6d5b$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>
<e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org>
<in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com>
<71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org>
<1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com>
<oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com>
<dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org>
<acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com>
<kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com>
<9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org>
<v017mg$3rcg9$1@dont-email.me>
<da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 21 Apr 2024 09:17:41 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="eb8a718465b00007c507332e9ba6007d";
logging-data="210091"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18cgsk1ZT4CtNrI/Ypm6riPsmKNBlRuu5U="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Q6Qh/Lb8V4a89vGdA76oZZp3h3o=
Content-Language: en-US
In-Reply-To: <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org>

by: BGB - Sun, 21 Apr 2024 07:17 UTC

On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
>>> John Savard wrote:
>>>
>>>> On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
>>>> <quadibloc@servername.invalid> wrote:
>>>
>>>
>>>> And, hey, I'm not the first guy to get sunk because of forgetting what
>>>> lies under the tip of the iceberg that's above the water.
>>>
>>>> That also happened to the captain of the _Titanic_.
>>>
>>> Concer-tina-tanic !?!
>>>
>
>> Seems about right.
>> Seems like a whole lot of flailing with designs that seem needlessly
>> complicated...
>
>
>
>> Meanwhile, has looked around and noted:
>> In some ways, RISC-V is sort of like MIPS with the field order reversed,
>
> They, in effect, Little-Endian-ed the fields.
>

Yeah.

>> and (ironically) actually smaller immediate fields (MIPS was using a
>> lot of Imm16 fields, whereas RISC-V mostly used Imm12).
>
> Yes, RISC-V took a step back with the 12-bit immediates. My 66000, on
> the other hand, only has 12-bit immediates for shift instructions--
> allowing all shifts to reside in one Major OpCode; the rest inst[31]=1
> have 16-bit immediates (universally sign extended).
>

I had gone further and used mostly 9/10 bit fields (mostly expanded to
10/12 in XG2).

I don't really think this is a bad choice in a statistical sense (as it
so happens, most of the immediate values can fit into a 9-bit field,
without going too far into "diminishing returns" territory).

Ended up with some inconsistency when expanding to 10 bits:
Displacements went 9u -> 10s
ADD/SUB: 9u/9n -> 10u/10n
AND: 9u -> 10s
OR,XOR: 9u -> 10u

AND was initially 9u -> 10u (like OR and XOR), but changed over at the
last minute:
Negative masks were far more common than 10-bit masks;
At the moment, the change didn't seem to break anything;
I didn't really have any other encoding space to put this.
The main "sane" location to put it was already taken by RSUB;
The Imm9 space is basically already full.

With OR and XOR, negative masks are essentially absent, so switching
these to signed would not make sense; even if this breaks the symmetry
between AND/OR/XOR.
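
As a rough illustration of the trade-off above (a sketch only, not the actual BJX2 encoder; the helper names are made up for this example), the following checks whether a constant fits an N-bit zero-extended or sign-extended immediate. A typical negative AND mask such as ~7 only fits the signed form, while a positive mask that uses all 10 bits only fits the unsigned form.

```python
def fits_unsigned(value: int, bits: int) -> bool:
    """True if value is representable as a zero-extended `bits`-wide immediate."""
    return 0 <= value < (1 << bits)


def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a sign-extended `bits`-wide immediate."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


# Clearing the low 3 bits (e.g. aligning down to 8) uses the mask ~7 == -8.
align_mask = ~7
# A positive mask that needs all 10 bits.
wide_mask = 0x3FF

print(fits_signed(align_mask, 10), fits_unsigned(align_mask, 10))  # True False
print(fits_signed(wide_mask, 10), fits_unsigned(wide_mask, 10))    # False True
```

With one 10-bit field, only one of the two cases can be encoded directly, so the choice comes down to which kind of mask shows up more often for a given operation.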

>> But, seemed to have more wonk:
>> A mode with 32x 32-bit GPRs; // unnecessary
>> A mode with 32x 64-bit GPRs;
>> Apparently a mode with 32x 32-bit GPRs that can be paired to 16x
>> 64-bits as needed for 64-bit operations?...
>
> Repeating the mistake I made on Mc 88100....
>

I had seen a video talking about the Nintendo 64, and it was saying that
the 2x paired 32-bit register mode was used more often than the native
64-bit mode, as the native 64-bit mode was slower as apparently it
couldn't fully pipeline the 64-bit ops, so using it in this mode came at
a performance hit (vs using it to run glorified 32-bit code).

>> Integer operations (on 64-bit registers) that give UB or trap if
>> values are outside of signed Int32 range;
>
> Isn't it just wonderful ??
>

No direct equivalent in my case, nor any desire to add these.

Preferable I think if the behavior of instructions is consistent across
implementations, though OTOH I can't claim strict 1:1 between my Verilog
implementation and emulator, but at least I try to keep things consistent.

Though, things fall short of strict 100% consistency between the Verilog
implementation and emulator (usually in cases where the emulator will
trap, but the Verilog implementation will "do whatever").

Though, in part, this is because the emulator serves the secondary
purpose of linting the compiler output.

Though, partly it is a case of, not even trapping is entirely free.

>> Other operations that sign-extend the values but are ironically called
>> "unsigned" (apparently, similar wonk to RISC-V by having
>> signed-extended Unsigned Int);
>> Branch operations are bit-sliced;
>> ....
>
>
>> I had preferred a different strategy in some areas:
>> Assume non-trapping operations by default;
>
> Assume trap/"do the expected thing" under a user accessible flag.

Most are defined in ways that I feel are sensible.

For ALU this means one of:
64-bit result;
Sign-extended from 32-bit result;
Zero extended from 32-bit result.
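
A minimal sketch of those three result conventions, modeling a 64-bit register as a Python integer bit pattern. The function names (`add64`, `adds_l`, `addu_l`) are hypothetical and are not BJX2 mnemonics.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def add64(a: int, b: int) -> int:
    """Full 64-bit result (wraps modulo 2**64)."""
    return (a + b) & MASK64


def adds_l(a: int, b: int) -> int:
    """32-bit result, sign-extended into the 64-bit register."""
    r = (a + b) & MASK32
    if r & 0x80000000:
        r |= MASK64 ^ MASK32  # copy the sign bit into bits 63..32
    return r


def addu_l(a: int, b: int) -> int:
    """32-bit result, zero-extended into the 64-bit register."""
    return (a + b) & MASK32


# Signed overflow case: the two 32-bit forms disagree in the upper half.
print(hex(add64(0x7FFFFFFF, 1)))   # 0x80000000
print(hex(adds_l(0x7FFFFFFF, 1)))  # 0xffffffff80000000
print(hex(addu_l(0x7FFFFFFF, 1)))  # 0x80000000

# Unsigned wrap case: only the full 64-bit form keeps the carry.
print(hex(add64(0xFFFFFFFF, 1)))   # 0x100000000
print(hex(addu_l(0xFFFFFFFF, 1)))  # 0x0
```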

>> Sign-extend signed values, zero-extend unsigned values.
>
> Another mistake I made in Mc 88100.
>
> Do you sign extend the 16-bit displacement on an unsigned LD ??
>

In my case; for the Baseline encoding, Ld/St displacements were unsigned
only.

For XG2, they are signed. It was a tight call, but the sign-extended
case won out by an admittedly thin margin in this case.

Granted, this means that the Load/Store ops with Disp5u/Disp6s
encodings are mostly redundant in XG2, but are the only way to directly
encode negative displacements in Baseline+XGPR (in pure Baseline,
negative Ld/St displacements being N/E).

But, as for values in registers, I personally feel that my scheme (as a
direct extension of the scheme that C itself seems to use) works better
than the one used by MIPS and RISC-V, which seems needlessly wonky with
a bunch of edge cases (that end up ultimately requiring the ISA to care
more about the size and type of the value rather than less).

Then again, x86-64 and ARM64 went the other direction (always zero
extending the 32-bit values).

Then again, it seems like a case where spending more in one area can
save cost in others.

>> Though, this is partly the source of some operations in my case
>> assuming 33 bit sign-extended: This can represent both the signed and
>> unsigned 32-bit ranges.
>
> These are some of the reasons My 66000 is 64-bit register/calculation only.
>

It is a tradeoff.

Many operations are full 64-bit.

Load/Store and Branch displacements have tended to be 33 bit to save
cost over 48 bit displacements (with a 48-bit address space, with
16-bits for optional type-tags or similar).

Though, this does theoretically add a penalty if "size_t" or "long" or
similar is used as an array index (rather than "int" or smaller), since
in this case the compiler will need to fall back to ALU operations to
perform the index operation (similar to what typically happens for array
indexing on RISC-V).

Mostly not a huge issue, as pretty much all the code seems to use 'int'
for array indices.
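
To make the range argument concrete, here is a small sketch (the name `fits_disp33` is an assumption for this example) showing that a 33-bit sign-extended field covers both the signed and unsigned 32-bit ranges, while a larger 64-bit index does not fit and would need the ALU fallback described above.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1
UINT32_MAX = (1 << 32) - 1


def fits_disp33(index: int) -> bool:
    """True if index fits a 33-bit sign-extended field."""
    return -(1 << 32) <= index < (1 << 32)


# Every 'int' and 'unsigned int' value is directly usable as an index.
for v in (INT32_MIN, -1, 0, INT32_MAX, UINT32_MAX):
    assert fits_disp33(v)

# A 64-bit index past the 33-bit range would need the ALU fallback.
print(fits_disp33(UINT32_MAX), fits_disp33(1 << 33))  # True False
```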

Optionally, can enable the use of 48-bit displacements, but not really
worth much if they are not being used (similar issue for the 96-bit
addressing thing).

Even 48-bits is overkill when one can fit the entirety of both RAM and
secondary storage into the address space.

Kind of a very different situation from 16-bit days, where people were
engaging in wonk to try to fit in more RAM than they had address space...

Well, nevermind a video where a guy managed to get a 486 PC working with
no SIMM's, only apparently some on-board RAM on the MOBO, and some ISA
RAM-expansion cards (apparently intended for the 80286).

Apparently he was getting Doom framerates (in real-time) almost on-par
with what I am seeing in Verilog simulations (roughly 11 seconds per
frame at the moment; simulation running approximately 250x slower than
real-time).

>> One could argue that sign-extending both could save 1 bit in some
>> cases. But, this creates wonk in other cases, such as requiring an
>> explicit zero extension for "unsigned int" to "long long" casts; and
>> more cases where separate instructions are needed for Int32 and Int64
>> cases (say, for example, RISC-V needed around 4x as many Int<->Float
>> conversion operators due to its design choices in this area).
>
> It also gets difficult when you consider EADD Rd,Rdouble,Rexponent ??
> is it a FP calculation or an integer calculation ?? If Rdouble is a
> constant is the constant FP or int, if Rexponent is a constant is it
> double or int,..... Does it raise FP overflow or integer overflow ??
>

Dunno, neither RISC-V nor BJX2 has this...

>> Say:
>>    RV64:
>>      Int32<->Binary32, UInt32<->Binary32
>>      Int64<->Binary32, UInt64<->Binary32
>>      Int32<->Binary64, UInt32<->Binary64
>>      Int64<->Binary64, UInt64<->Binary64
>>    BJX2:
>>      Int64<->Binary64, UInt64<->Binary64
>     My 66000:
>       int64_t -> { uint64_t, float,   double }
>       uint64_t -> { int64_t, float,   double }
>       float    -> { uint64_t, int64_t, double }
>       double   -> { uint64_t, int64_t, float }
>

Re: Stealing a Great Idea from the 6600

<152f8504112a37d8434c663e99cb36c5@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38345&group=comp.arch#38345

Date: Sun, 21 Apr 2024 18:57:27 +0000
Subject: Re: Stealing a Great Idea from the 6600
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$th7kCerwKtka4wFIu7ZxO.yFSDMb9bJmqEZYr7PZg92/LIYEeou4a
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org> <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com> <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com> <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org> <v017mg$3rcg9$1@dont-email.me> <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org> <v02eij$6d5b$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <152f8504112a37d8434c663e99cb36c5@www.novabbs.org>

by: MitchAlsup1 - Sun, 21 Apr 2024 18:57 UTC

BGB wrote:

> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>> Compilers are notoriously unable to outguess a good branch predictor.
>>

> Errm, assuming the compiler is capable of things like general-case
> inlining and loop-unrolling.

> I was thinking of simpler things, like shuffling operators between
> independent (sub)expressions to limit the number of register-register
> dependencies.

> Like, in-order superscalar isn't going to do crap if nearly every
> instruction depends on every preceding instruction. Even pipelining
> can't help much with this.

Pipelining CREATED this (back to back dependencies). No amount of
pipelining can eradicate RAW data dependencies.

> The compiler can shuffle the instructions into an order to limit the
> number of register dependencies and better fit the pipeline. But, then,
> most of the "hard parts" are already done (so it doesn't take much more
> for the compiler to flag which instructions can run in parallel).

Compiler scheduling works for exactly 1 pipeline implementation and
is suboptimal for all others.

> Meanwhile, a naive superscalar may miss cases that could be run in
> parallel, if it is evaluating the rules "coarsely" (say, evaluating what
> is safe or not safe to run things in parallel based on general groupings
> of opcodes rather than the rules of specific opcodes; or, say,
> false-positive register alias if, say, part of the Imm field of a 3RI
> instruction is interpreted as a register ID, ...).

> Granted, seemingly even a naive approach is able to get around 20% ILP
> out of "GCC -O3" output for RV64G...

> But, the GCC output doesn't seem to be quite as weak as some people are
> claiming either.

>>> ties the code to a specific pipeline structure, and becomes
>>> effectively moot with OoO CPU designs).
>>
>> OoO exists, in a practical sense, to abstract the pipeline out of the
>> compiler; or conversely, to allow multiple implementations to run the
>> same compiled code optimally on each implementation.
>>

> Granted, but OoO isn't cheap.

But it does get the job done.

>>> So, a case could be made that a "general use" ISA be designed without
>>> the use of explicit bundling. In my case, using the bundle flags also
>>> requires the code to use an instruction to signal to the CPU what
>>> configuration of pipeline it expects to run on, with the CPU able to
>>> fall back to scalar (or superscalar) execution if it does not match.
>>
>> Sounds like a bridge too far for your 8-wide GBOoO machine.
>>

> For sake of possible fancier OoO stuff, I upheld a basic requirement for
> the instruction stream:
> The semantics of the instructions as executed in bundled order needs to
> be equivalent to that of the instructions as executed in sequential order.

> In this case, the OoO CPU can entirely ignore the bundle hints, and
> treat "WEXMD" as effectively a NOP.

> This would have broken down for WEX-5W and WEX-6W (where enforcing a
> parallel==sequential constraint effectively becomes unworkable, and/or
> renders the wider pipeline effectively moot), but these designs are
> likely dead anyways.

> And, with 3-wide, the parallel==sequential order constraint remains in
> effect.

>>> For the most part, thus far nearly everything has ended up as "Mode
>>> 2", namely:
>>>    3 lanes;
>>>      Lane 1 does everything;
>>>      Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
>>>      Lane 3 only does Basic ALU ops and a few CONV ops and similar.
>>>        Lane 3 originally also did Shift, dropped to reduce cost.
>>>      Mem ops may eat Lane 3, ...
>>
>> Try 6-lanes:
>>    1,2,3 Memory ops + integer ADD and Shifts
>>    4     FADD   ops + integer ADD and FMisc
>>    5     FMAC   ops + integer ADD
>>    6     CMP-BR ops + integer ADD
>>

> As can be noted, my thing is more a "LIW" rather than a "true VLIW".

Mine is neither LIW nor VLIW, but it definitely is LBIO through GBOoO.

> So, MEM/BRA/CMP/... all end up in Lane 1.

> Lanes 2/3 effectively end up being used to fold over most of the ALU ops,
> turning Lane 1 mostly into a wall of Load and Store instructions.

>>> Where, say:
>>>    Mode 0 (Default):
>>>      Only scalar code is allowed, CPU may use superscalar (if available).
>>>    Mode 1:
>>>      2 lanes:
>>>        Lane 1 does everything;
>>>        Lane 2 does ALU, Shift, and CONV.
>>>      Mem ops take up both lanes.
>>>        Effectively scalar for Load/Store.
>>>        Later defined that 128-bit MOV.X is allowed in a Mode 1 core.
>> Modeless.
>>

>>> Had defined wider modes, and ones that allow dual-lane IO and FPU
>>> instructions, but these haven't seen use (too expensive to support in
>>> hardware).
>>
>>> Had ended up with the ambiguous "extension" to the Mode 2 rules of
>>> allowing an FPU instruction to be executed from Lane 2 if there was
>>> not an FPU instruction in Lane 1, or allowing co-issuing certain FPU
>>> instructions if they effectively combine into a corresponding SIMD op.
>>
>>> In my current configurations, there is only a single memory access port.
>>
>> This should imply that your 3-wide pipeline is running at 90%-95%
>> memory/cache saturation.
>>

> If you mean that execution is mostly running end-to-end memory
> operations, yeah, this is basically true.

> Comparably, RV code seems to end up running a lot of non-memory ops in
> Lane 1, whereas BJX2 is mostly running lots of memory ops, with Lane 2
> handling most of the ALU ops and similar (and Lane 3, occasionally).

One of the things that I notice with My 66000 is when you get all the
constants you ever need at the calculation OpCodes, you end up with
FEWER instructions that "go random places" such as instructions that
<well> paste constants together. This leaves you with a data dependent
string of calculations with occasional memory references. That is::
universal constants get rid of the easy to pipeline extra instructions
leaving the meat of the algorithm exposed.

>>
>> If you design around the notion of a 3R1W register file, FMAC and INSERT
>> fall out of the encoding easily. Done right, one can switch it into a 4R
>> or 4W register file for ENTER and EXIT--lessening the overhead of call/ret.
>>

> Possibly.

> It looks like some savings could be possible in terms of prologs and
> epilogs.

> As-is, these are generally like:
> MOV LR, R18
> MOV GBR, R19
> ADD -192, SP
> MOV.X R18, (SP, 176) //save GBR and LR
> MOV.X ... //save registers

Why not an instruction that saves LR and GBR without wasting instructions
to place them side by side prior to saving them ??

> WEXMD 2 //specify that we want 3-wide execution here

> //Reload GBR, *1
> MOV.Q (GBR, 0), R18
> MOV 0, R0 //special reloc here
> MOV.Q (GBR, R0), R18
> MOV R18, GBR

It is gorp like that that led me to do it in HW with ENTER and EXIT.
Save registers to the stack, setup FP if desired, allocate stack on SP,
and decide if EXIT also does RET or just reloads the file. This would
require 2 free registers if done in pure SW, along with several MOVs...

> //Generate Stack Canary, *2
> MOV 0x5149, R18 //magic number (randomly generated)
> VSKG R18, R18 //Magic (combines input with SP and magic numbers)
> MOV.Q R18, (SP, 144)

> ...
> function-specific stuff
> ...

> MOV 0x5149, R18
> MOV.Q (SP, 144), R19
> VSKC R18, R19 //Validate canary
> ...

> *1: This part ties into the ABI, and mostly exists so that each PE image
> can get GBR reloaded back to its own ".data"/".bss" sections (with

Universal displacements make GBR unnecessary as a memory reference can
be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you
can read GOT[#i] directly without a pointer to it.

> multiple program instances in a single address space). But, does mean
> that pretty much every non-leaf function ends up needing to go through
> this ritual.

Universal constants solve the underlying issue.

> *2: Pretty much any function that has local arrays or similar, serves to
> protect register save area. If the magic number can't regenerate a
> matching canary at the end of the function, then a fault is generated.
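
The canary scheme quoted above can be sketched roughly as follows. The thread only says that VSKG "combines input with SP and magic numbers", so the mixing function, `PROCESS_SECRET`, and the helper names below are all assumptions made for illustration and not the real instruction semantics. The check returns a boolean here, whereas the hardware described would raise a fault.

```python
MASK64 = (1 << 64) - 1
PROCESS_SECRET = 0x9E3779B97F4A7C15  # hypothetical per-process magic number


def make_canary(magic: int, sp: int) -> int:
    """Stand-in for VSKG: mix the per-function magic with SP and a secret."""
    x = (magic ^ sp ^ PROCESS_SECRET) & MASK64
    x = (x * 0xBF58476D1CE4E5B9) & MASK64  # odd multiplier, so this is invertible
    return x ^ (x >> 31)


def check_canary(magic: int, sp: int, stored: int) -> bool:
    """Stand-in for VSKC: regenerate the canary and compare with the stack slot."""
    return make_canary(magic, sp) == stored


sp = 0x0000_7FFF_0000_1000
frame = {144: make_canary(0x5149, sp)}       # prolog: MOV.Q R18, (SP, 144)
print(check_canary(0x5149, sp, frame[144]))  # True

frame[144] ^= 0x41                           # simulated overflow of a local array
print(check_canary(0x5149, sp, frame[144]))  # False
```

Mixing in SP means a canary copied from one frame does not validate in a frame at a different stack address.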

Re: Stealing a Great Idea from the 6600

<v045in$hqoj$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38350&group=comp.arch#38350

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sun, 21 Apr 2024 17:56:21 -0500
Organization: A noiseless patient Spider
Lines: 502
Message-ID: <v045in$hqoj$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>
<e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org>
<in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com>
<71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org>
<1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com>
<oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com>
<dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org>
<acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com>
<kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com>
<9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org>
<v017mg$3rcg9$1@dont-email.me>
<da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org>
<v02eij$6d5b$1@dont-email.me>
<152f8504112a37d8434c663e99cb36c5@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 22 Apr 2024 00:56:24 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="7b1e3ac212388cea6886df46e04c8fee";
logging-data="584467"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/dTiZJRtPbnAhpooR3G5aIarkjKEqCUcw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:+7J22inMRR2yeKhh/6pDKC46vxQ=
In-Reply-To: <152f8504112a37d8434c663e99cb36c5@www.novabbs.org>
Content-Language: en-US

by: BGB - Sun, 21 Apr 2024 22:56 UTC

On 4/21/2024 1:57 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>> Compilers are notoriously unable to outguess a good branch predictor.
>>>
>
>> Errm, assuming the compiler is capable of things like general-case
>> inlining and loop-unrolling.
>
>> I was thinking of simpler things, like shuffling operators between
>> independent (sub)expressions to limit the number of register-register
>> dependencies.
>
>> Like, in-order superscalar isn't going to do crap if nearly every
>> instruction depends on every preceding instruction. Even pipelining
>> can't help much with this.
>
> Pipelining CREATED this (back to back dependencies). No amount of
> pipelining can eradicate RAW data dependencies.
>

Pretty much, this is the problem.

But, when one converts from expressions to instructions, either via
directly walking the AST, or by going to RPN and then generating
instructions from the RPN, the generated code has this problem
pretty badly.

Seemingly the only real fix is to try to shuffle things around, at the
3AC or machine-instruction level, or both, to try to reduce the number
of RAW dependencies.
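
As a small sketch of the idea (the `Op` type and the temp names are made up for this example, and this is not how BGBCC represents 3AC), the following counts back-to-back RAW dependencies in the code a naive tree walk would emit for (a+b)*c + (d+e)*f, and again after interleaving the two independent subexpressions.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    dest: str
    srcs: Tuple[str, ...]


def adjacent_raw_hazards(ops: List[Op]) -> int:
    """Count instructions that read the result of the immediately preceding one."""
    return sum(1 for prev, cur in zip(ops, ops[1:]) if prev.dest in cur.srcs)


# (a+b)*c + (d+e)*f, in the order a naive AST/RPN walk emits it.
naive = [
    Op("t0", ("a", "b")),
    Op("t1", ("t0", "c")),
    Op("t2", ("d", "e")),
    Op("t3", ("t2", "f")),
    Op("t4", ("t1", "t3")),
]

# Same ops, with the two independent subexpressions interleaved.
shuffled = [naive[0], naive[2], naive[1], naive[3], naive[4]]

print(adjacent_raw_hazards(naive))     # 3
print(adjacent_raw_hazards(shuffled))  # 1
```

The final add still depends on the op right before it, which no reordering of this expression alone can avoid.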

Though, this is an area where "things could have been done better" in
BGBCC. Though, mostly it would be in the backend.

Ironically, the approach of first compiling everything into an RPN
bytecode, then generating 3AC and machine code from the RPN, seems to
work reasonably OK. Even if the bytecode itself is kinda weird.

Though, one area that could be improved is the memory overhead of BGBCC,
where generally BGBCC uses too much RAM to really be viable to have
TestKern be self-hosting.

>> The compiler can shuffle the instructions into an order to limit the
>> number of register dependencies and better fit the pipeline. But,
>> then, most of the "hard parts" are already done (so it doesn't take
>> much more for the compiler to flag which instructions can run in
>> parallel).
>
> Compiler scheduling works for exactly 1 pipeline implementation and
> is suboptimal for all others.
>

Possibly true.

But, can note, even crude shuffling is better than no shuffling in this
case. And, the shuffling needed to make an in-order superscalar not
perform like crap, also happens to map over well to a LIW (and is the
main hard part of the problem).
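
A rough sketch of why the same shuffling maps over to LIW bundling: the greedy packer below walks instructions in order and starts a new bundle whenever an op reads or writes a register already written in the current bundle, which keeps parallel execution equivalent to sequential execution. The conflict rules here are a conservative assumption, and lane restrictions (such as memory ops only in Lane 1) are ignored; the actual WEX rules are not spelled out in this thread.

```python
from typing import List, Tuple

Op = Tuple[str, Tuple[str, ...]]  # (dest, srcs)


def bundle_in_order(ops: List[Op], width: int = 3) -> List[List[Op]]:
    """Greedily pack ops into bundles with no intra-bundle RAW or WAW hazards."""
    bundles: List[List[Op]] = []
    current: List[Op] = []
    written = set()
    for dest, srcs in ops:
        conflict = dest in written or any(s in written for s in srcs)
        if current and (conflict or len(current) == width):
            bundles.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


naive = [
    ("t0", ("a", "b")), ("t1", ("t0", "c")),
    ("t2", ("d", "e")), ("t3", ("t2", "f")),
    ("t4", ("t1", "t3")),
]
shuffled = [naive[0], naive[2], naive[1], naive[3], naive[4]]

print(len(bundle_in_order(naive)))     # 4
print(len(bundle_in_order(shuffled)))  # 3
```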

>> Meanwhile, a naive superscalar may miss cases that could be run in
>> parallel, if it is evaluating the rules "coarsely" (say, evaluating
>> what is safe or not safe to run things in parallel based on general
>> groupings of opcodes rather than the rules of specific opcodes; or,
>> say, false-positive register alias if, say, part of the Imm field of a
>> 3RI instruction is interpreted as a register ID, ...).
>
>
>> Granted, seemingly even a naive approach is able to get around 20% ILP
>> out of "GCC -O3" output for RV64G...
>
>> But, the GCC output doesn't seem to be quite as weak as some people
>> are claiming either.
>
>
>>>> ties the code to a specific pipeline structure, and becomes
>>>> effectively moot with OoO CPU designs).
>>>
>>> OoO exists, in a practical sense, to abstract the pipeline out of the
>>> compiler; or conversely, to allow multiple implementations to run the
>>> same compiled code optimally on each implementation.
>>>
>
>> Granted, but OoO isn't cheap.
>
> But it does get the job done.
>

But... Also makes the CPU too big and expensive to fit into most
consumer/hobbyist grade FPGAs.

They can do in-order designs pretty OK though.

People were doing some impressive looking things over on the Altera side
of things, but it is harder to do a direct comparison between Cyclone V
and Artix / Spartan.

Some stuff I was skimming though implied that I guess the free version
of Quartus is more limited vs Vivado, and one effectively needs to pay
for the commercial version to make full use of the FPGA (whereas Vivado
allows mostly full use of the FPGA, but not any FPGA's larger than a
certain cutoff).

Well, and the non-free version of Vivado costs well more than I could
justify spending on a hobby project.

>>>> So, a case could be made that a "general use" ISA be designed
>>>> without the use of explicit bundling. In my case, using the bundle
>>>> flags also requires the code to use an instruction to signal to the
>>>> CPU what configuration of pipeline it expects to run on, with the
>>>> CPU able to fall back to scalar (or superscalar) execution if it
>>>> does not match.
>>>
>>> Sounds like a bridge too far for your 8-wide GBOoO machine.
>>>
>
>> For sake of possible fancier OoO stuff, I upheld a basic requirement
>> for the instruction stream:
>> The semantics of the instructions as executed in bundled order needs
>> to be equivalent to that of the instructions as executed in sequential
>> order.
>
>> In this case, the OoO CPU can entirely ignore the bundle hints, and
>> treat "WEXMD" as effectively a NOP.
>
>
>> This would have broken down for WEX-5W and WEX-6W (where enforcing a
>> parallel==sequential constraint effectively becomes unworkable, and/or
>> renders the wider pipeline effectively moot), but these designs are
>> likely dead anyways.
>
>> And, with 3-wide, the parallel==sequential order constraint remains in
>> effect.
>
>
>>>> For the most part, thus far nearly everything has ended up as "Mode
>>>> 2", namely:
>>>>    3 lanes;
>>>>      Lane 1 does everything;
>>>>      Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
>>>>      Lane 3 only does Basic ALU ops and a few CONV ops and similar.
>>>>        Lane 3 originally also did Shift, dropped to reduce cost.
>>>>      Mem ops may eat Lane 3, ...
>>>
>>> Try 6-lanes:
>>>     1,2,3 Memory ops + integer ADD and Shifts
>>>     4     FADD   ops + integer ADD and FMisc
>>>     5     FMAC   ops + integer ADD
>>>     6     CMP-BR ops + integer ADD
>>>
>
>> As can be noted, my thing is more a "LIW" rather than a "true VLIW".
>
> Mine is neither LIW nor VLIW, but it definitely is LBIO through GBOoO.
>

I aimed for Scalar and LIW.

On the XC7S25 and XC7A35T, can't really do much more than a simple
scalar core (it is a pain enough even trying to fit an FPU into the thing).

On the XC7S50 (~ 33k LUT), it is more a challenge of trying to fit both
a 3-wide core and an FP-SIMD unit (fitting the CPU onto it is a little
easier if one skips the existence of FP-SIMD, or can accept slower SIMD
implemented by pipelining the elements through the FPU).

I had been looking into a configuration for the XC7S50 which had dropped
down to a more limited 2-wide configuration (with a 4R2W register file),
but keeping the SIMD unit intact. Mostly trying to optimize this case
for doing lots of SIMD math for NN workloads.

This is vaguely similar to a past considered "GPU Profile", but
ultimately ended up implementing the rasterizer module instead (which is
cheaper and a little faster at this task than a CPU core would have
been, albeit less flexible).

Doing in-order superscalar for BJX2 could be possible, but haven't put
much effort into this thus far, as the "WEX-3W" profile currently hits
this nail pretty well.

Did end up going with superscalar for RISC-V, mostly as no other option.

It is, however, a fairly narrow window...

For smaller targets, need to fall back to scalar, and for wider, part of
the ISA design becomes effectively moot.

>> So, MEM/BRA/CMP/... all end up in Lane 1.
>
>> Lanes 2/3 effectively end up being used to fold over most of the ALU ops,
>> turning Lane 1 mostly into a wall of Load and Store instructions.
>
>
>>>> Where, say:
>>>>    Mode 0 (Default):
>>>>      Only scalar code is allowed, CPU may use superscalar (if
>>>> available).
>>>>    Mode 1:
>>>>      2 lanes:
>>>>        Lane 1 does everything;
>>>>        Lane 2 does ALU, Shift, and CONV.
>>>>      Mem ops take up both lanes.
>>>>        Effectively scalar for Load/Store.
>>>>        Later defined that 128-bit MOV.X is allowed in a Mode 1 core.
>>> Modeless.
>>>
>
>
>>>> Had defined wider modes, and ones that allow dual-lane IO and FPU
>>>> instructions, but these haven't seen use (too expensive to support
>>>> in hardware).
>>>
>>>> Had ended up with the ambiguous "extension" to the Mode 2 rules of
>>>> allowing an FPU instruction to be executed from Lane 2 if there was
>>>> not an FPU instruction in Lane 1, or allowing co-issuing certain FPU
>>>> instructions if they effectively combine into a corresponding SIMD op.
>>>
>>>> In my current configurations, there is only a single memory access
>>>> port.
>>>
>>> This should imply that your 3-wide pipeline is running at 90%-95%
>>> memory/cache saturation.
>>>
>
>> If you mean that execution is mostly running end-to-end memory
>> operations, yeah, this is basically true.
>
>
>> Comparably, RV code seems to end up running a lot of non-memory ops in
>> Lane 1, whereas BJX2 is mostly running lots of memory ops, with Lane 2
>> handling most of the ALU ops and similar (and Lane 3, occasionally).
>
> One of the things that I notice with My 66000 is when you get all the
> constants you ever need at the calculation OpCodes, you end up with
> FEWER instructions that "go random places" such as instructions that
> <well> paste constants together. This leaves you with a data dependent
> string of calculations with occasional memory references. That is::
> universal constants get rid of the easy to pipeline extra instructions
> leaving the meat of the algorithm exposed.
>

Re: Stealing a Great Idea from the 6600

<631f946ee0323ccaa31fae0d7e30e2d5@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38351&group=comp.arch#38351

Date: Sun, 21 Apr 2024 23:31:55 +0000
Subject: Re: Stealing a Great Idea from the 6600
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$3hHgQuDqJe9uDp6Y6ZNcFuM/df9vWBOhYT.CjkK8aICACUin4IVuK
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org> <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com> <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org> <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com> <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org> <v017mg$3rcg9$1@dont-email.me> <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org> <v02eij$6d5b$1@dont-email.me> <152f8504112a37d8434c663e99cb36c5@www.novabbs.org> <v045in$hqoj$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <631f946ee0323ccaa31fae0d7e30e2d5@www.novabbs.org>

by: MitchAlsup1 - Sun, 21 Apr 2024 23:31 UTC

BGB wrote:

> On 4/21/2024 1:57 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>> One of the things that I notice with My 66000 is when you get all the
>> constants you ever need at the calculation OpCodes, you end up with
>> FEWER instructions that "go random places" such as instructions that
>> <well> paste constants together. This leaves you with a data dependent
>> string of calculations with occasional memory references. That is::
>> universal constants get rid of the easy to pipeline extra instructions
>> leaving the meat of the algorithm exposed.
>>

> Possibly true.

> RISC-V tends to have a lot of extra instructions due to lack of big
> constants and lack of indexed addressing.

You forgot the "everyone and his brother" design of the ISA.

> And, BJX2 has a lot of frivolous register-register MOV instructions.

I empower you to get rid of them....
<snip>
>>>> If you design around the notion of a 3R1W register file, FMAC and INSERT
>>>> fall out of the encoding easily. Done right, one can switch it into a 4R
>>>> or 4W register file for ENTER and EXIT--lessening the overhead of
>>>> call/ret.
>>>>
>>
>>> Possibly.
>>
>>> It looks like some savings could be possible in terms of prologs and
>>> epilogs.
>>
>>> As-is, these are generally like:
>>>    MOV    LR, R18
>>>    MOV    GBR, R19
>>>    ADD    -192, SP
>>>    MOV.X R18, (SP, 176) //save GBR and LR
>>>    MOV.X ... //save registers
>>
>> Why not an instruction that saves LR and GBR without wasting instructions
>> to place them side by side prior to saving them ??
>>

> I have an optional MOV.C instruction, but would need to restructure the
> code for generating the prologs to make use of them in this case.

> Say:
> MOV.C GBR, (SP, 184)
> MOV.C LR, (SP, 176)

> Though, MOV.C is considered optional.

> There is a "MOV.C Lite" option, which saves some cost by only allowing
> it for certain CR's (mostly LR and GBR), which also sort of overlaps
> with (and is needed by) RISC-V mode, because these registers are in GPR
> land for RV.

> But, in any case, current compiler output shuffles them to R18 and R19
> before saving them.

>>>    WEXMD 2 //specify that we want 3-wide execution here
>>
>>>    //Reload GBR, *1
>>>    MOV.Q (GBR, 0), R18
>>>    MOV    0, R0 //special reloc here
>>>    MOV.Q (GBR, R0), R18
>>>    MOV    R18, GBR
>>

> Correction:
> >> MOV.Q (R18, R0), R18

>> It is gorp like that that led me to do it in HW with ENTER and EXIT.
>> Save registers to the stack, setup FP if desired, allocate stack on SP,
>> and decide if EXIT also does RET or just reloads the file. This would
>> require 2 free registers if done in pure SW, along with several MOVs...
>>

> Possibly.
> The partial reason it loads into R0 and uses R0 as an index, was that I
> defined this mechanism before jumbo prefixes existed, and hadn't updated
> it to allow for jumbo prefixes.

No time like the present...

> Well, and if I used a direct displacement for GBR (which, along with PC,
> is always BYTE Scale), this would have created a hard limit of 64 DLL's
> per process-space (I defined it as Disp24, which allows a more
> reasonable hard upper limit of 2M DLLs per process-space).

In my case, restricting myself to 32-bit IP relative addressing, GOT can
be anywhere within ±2GB of the accessing instruction and can be as big as
one desires.
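
The limits quoted above can be checked with a few lines of arithmetic. This sketch assumes 8-byte table entries and an unsigned, byte-scaled displacement; the 9-bit width of the short form is inferred here from the stated limit of 64 and is not given in the post. The ±2GB figure is simply the range of a signed 32-bit displacement.

```python
ENTRY_BYTES = 8  # assumption: one 64-bit pointer per table entry


def slots_reachable(disp_bits, entry_bytes=ENTRY_BYTES):
    """Entries addressable by an unsigned, byte-scaled displacement."""
    return (1 << disp_bits) // entry_bytes


def signed_reach(disp_bits):
    """(lowest, highest) byte offset of a signed displacement."""
    half = 1 << (disp_bits - 1)
    return -half, half - 1


print(slots_reachable(9))    # 64       (assumed 9-bit short displacement)
print(slots_reachable(24))   # 2097152  (Disp24, i.e. 2M entries)
print(signed_reach(32))      # (-2147483648, 2147483647), i.e. about +/-2GB
```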

> Granted, nowhere near even the limit of 64 as of yet. But, I had noted
> that Windows programs would often easily exceed this limit, with even a
> fairly simple program pulling in a fairly large number of random DLLs,
> so in any case, a larger limit was needed.

Due to the way linkages work in My 66000, each DLL gets its own GOT.
So there is essentially no bounds on how many can be present/in-use.
A LD of a GOT[entry] gets a pointer to the external variable.
A CALX of GOT[entry] is a call through the GOT table using std ABI.
{{There is no PLT}}

> One potential optimization here is that the main EXE will always be 0 in
> the process, so this sequence could be reduced to, potentially:
> MOV.Q (GBR, 0), R18
> MOV.C (R18, 0), GBR

> Early on, I did not have the constraint that main EXE was always 0, and
> had initially assumed it would be treated equivalently to a DLL.
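
The GBR reload sequence discussed above can be modelled roughly as follows. This is only a sketch of the idea as described in the thread: slot 0 of each data section points at a per-instance table, and the table is indexed by image number (0 for the main EXE). The class and function names are hypothetical, and Python dicts stand in for memory.

```python
class Instance:
    """One logical program instance: private data sections, shared code."""

    def __init__(self, image_names):
        # table[i] is the data section that image i uses in this instance
        self.table = []
        for name in image_names:
            # slot 0 of every data section points back at the instance table
            self.table.append({0: self.table, "image": name, "globals": {}})


def reload_gbr(gbr, image_index):
    """Model of the prolog: follow slot 0 to the table, then index by image."""
    table = gbr[0]               # MOV.Q (GBR, 0), R18
    return table[image_index]    # MOV.Q (R18, R0), R18 ; MOV R18, GBR


a = Instance(["main.exe", "foo.dll"])
b = Instance(["main.exe", "foo.dll"])

# Code in foo.dll (image 1), entered while GBR still points at main's data:
gbr_in_a = reload_gbr(a.table[0], 1)
gbr_in_b = reload_gbr(b.table[0], 1)
print(gbr_in_a is a.table[1], gbr_in_b is b.table[1])   # True True
```

The same function body finds a different data section depending on which instance called it, which is what PC-relative addressing alone cannot provide without duplicating or remapping the image.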

>>>    //Generate Stack Canary, *2
>>>    MOV    0x5149, R18 //magic number (randomly generated)
>>>    VSKG   R18, R18 //Magic (combines input with SP and magic numbers)
>>>    MOV.Q R18, (SP, 144)
>>
>>>    ...
>>>    function-specific stuff
>>>    ...
>>
>>>    MOV    0x5149, R18
>>>    MOV.Q (SP, 144), R19
>>>    VSKC   R18, R19 //Validate canary
>>>    ...
>>
>>
>>> *1: This part ties into the ABI, and mostly exists so that each PE
>>> image can get GBR reloaded back to its own ".data"/".bss" sections (with
>>
>> Universal displacements make GBR unnecessary as a memory reference can
>> be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you
>> can read GOT[#i] directly without a pointer to it.
>>

> If I were doing a more conventional ABI, I would likely use (PC,
> Disp33s) for accessing global variables.

Even those 128GB away ??

> Problem is:
> What if one wants multiple logical instances of a given PE image in a
> single address space?

Not a problem when each PE has a different set of mapping tables (at least
the entries pointing at GOTs[*]).

> PC REL breaks in this case, unless you load N copies of each PE image,
> which is a waste of memory (well, or use COW mappings, mandating the use
> of an MMU).

> ELF FDPIC had used a different strategy, but then effectively turned
> each function call into something like (in SH):
> MOV R14, R2 //R14=GOT
> MOV disp, R0 //offset into GOT
> ADD R0, R2 //adjust by offset
> //R2=function pointer
> MOV.L (R2, 0), R1 //function address
> MOV.L (R2, 4), R3 //GOT
> JSR R1

Which I do with::

CALX [IP,R0,#GOT+index<<3-.]

> In the callee:
> ... save registers ...
> MOV R3, R14 //put GOT into a callee-save register
> ...

> In the BJX2 ABI, had rolled this part into the callee, reasoning that
> handling it in the callee (per-function) was less overhead than handling
> it in the caller (per function call).
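
For comparison, the FDPIC approach described above treats a function pointer as a two-word descriptor: the code address plus the GOT that the callee should use. A minimal sketch, with all names made up for illustration and a dict standing in for each GOT:

```python
def make_descriptor(func, got):
    """FDPIC-style function 'pointer': (code address, callee's GOT)."""
    return (func, got)


def call_via_descriptor(descriptor, *args):
    func, got = descriptor       # two loads: function address, then GOT
    return func(got, *args)      # the GOT travels as a hidden argument


def get_counter(got):
    # The callee reaches its globals only through the GOT it was handed.
    return got["counter"]


got_a = {"counter": 1}   # instance A's data
got_b = {"counter": 2}   # instance B's data, same code

print(call_via_descriptor(make_descriptor(get_counter, got_a)))   # 1
print(call_via_descriptor(make_descriptor(get_counter, got_b)))   # 2
```

Here the cost is paid at every call site, whereas the BJX2 scheme above moves it into the callee's prolog.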

> Though, on the RISC-V side, it has the relative advantage of compiling
> for absolute addressing, albeit still loses in terms of performance.

Compiling and linking to absolute addresses works "really well" when one
needs to place different sections in different memory every time the
application/kernel runs due to malicious codes trying to steal everything.
ASLR.....

> I don't imagine an FDPIC version of RISC-V would win here, but this is
> only assuming there exists some way to get GCC to output FDPIC binaries
> (most I could find, was people debating whether to add FDPIC support for
> RISC-V).

> PIC or PIE would also sort of work, but these still don't really allow
> for multiple program instances in a single address space.

Once you share the code and some of the data, the overhead of using different
mappings for special stuff {GOT, local thread data,...} is

>>> multiple program instances in a single address space). But, does mean
>>> that pretty much every non-leaf function ends up needing to go through
>>> this ritual.
>>
>> Universal constant solves the underlying issue.
>>

> I am not so sure that they could solve the "map multiple instances of
> the same binary into a single address space" issue, which is sort of the
> whole thing for why GBR is being used.

> Otherwise, I would have been using PC-REL...

>>> *2: Used in pretty much any function that has local arrays or similar; it
>>> serves to protect the register save area. If the magic number can't regenerate a
>>> matching canary at the end of the function, then a fault is generated.
>>
>> My 66000 can place the callee save registers in a place where user cannot
>> access them with LDs or modify them with STs. So malicious code cannot
>> damage the contract between ABI and core.
>>

> Possibly. I am using a conventional linear stack.
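
The canary scheme in note *2 can be sketched like this. The real VSKG/VSKC mixing function is not given in the thread, so the mix below is an arbitrary stand-in, and `PROCESS_KEY`, `make_canary` and `check_canary` are hypothetical names. Only the shape matches the description: combine a per-function magic number with SP and hidden state, store the result next to the save area, and regenerate and compare it on exit.

```python
import secrets

MASK64 = (1 << 64) - 1
PROCESS_KEY = secrets.randbits(64)   # hypothetical hidden per-process state


def make_canary(magic, sp, key=PROCESS_KEY):
    """Stand-in for VSKG: mix the magic number with SP and a secret key."""
    x = (magic ^ sp ^ key) & MASK64
    x = (x * 0x9E3779B97F4A7C15) & MASK64   # arbitrary odd multiplier to spread bits
    return x ^ (x >> 29)


def check_canary(magic, sp, stored, key=PROCESS_KEY):
    """Stand-in for VSKC: fault if the stored value cannot be regenerated."""
    if make_canary(magic, sp, key) != stored:
        raise RuntimeError("stack canary mismatch")


sp = 0x7FFF0000
frame = {144: make_canary(0x5149, sp)}   # MOV.Q R18, (SP, 144)
check_canary(0x5149, sp, frame[144])     # intact frame: passes silently
```

An overflow from a local array that runs into the save area would also overwrite the slot at (SP, 144), so the check at function exit raises instead of returning through corrupted registers.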

Re: Stealing a Great Idea from the 6600

<e8eb2j1ftsikv6j4eeaksm8lkhc31fuipi@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=38353&group=comp.arch#38353

From: quadibloc@servername.invalid (John Savard)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sun, 21 Apr 2024 19:16:04 -0600
Organization: A noiseless patient Spider
Lines: 31
Message-ID: <e8eb2j1ftsikv6j4eeaksm8lkhc31fuipi@4ax.com>
References: <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com> <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com> <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org> <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com> <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com> <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org> <v017mg$3rcg9$1@dont-email.me> <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org> <v02eij$6d5b$1@dont-email.me> <152f8504112a37d8434c663e99cb36c5@www.novabbs.org>

by: John Savard - Mon, 22 Apr 2024 01:16 UTC

On Sun, 21 Apr 2024 18:57:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
>BGB wrote:

>> Like, in-order superscalar isn't going to do crap if nearly every
>> instruction depends on every preceding instruction. Even pipelining
>> can't help much with this.

>Pipelining CREATED this (back to back dependencies). No amount of
>pipelining can eradicate RAW data dependencies.

This is quite true. However, in case an unsophisticated individual
might read this thread, I think that I shall clarify.

Without pipelining, it is not a problem if each instruction depends on
the one immediately previous, and so people got used to writing
programs that way, as it was simple to write the code to do one thing
before starting to write the code to begin doing another thing.

This remained true when the simplest original form of pipelining was
brought in - where fetching one instruction from memory was overlapped
with decoding the previous instruction, and executing the instruction
before that.

It's only when what was originally called "superpipelining" came
along, where the execute stages of multiple successive instructions
could be overlapped, that it was necessary to do something about
dependencies in order to take advantage of the speedup that could
provide.
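
A toy model makes the point concrete. This is a sketch under simplifying assumptions (in-order issue, every operation has the same latency, no memory effects), not a model of any real machine; `cycles` and the two sample programs are made up for the example.

```python
def cycles(program, width=1, latency=1):
    """In-order issue of up to `width` instructions per cycle, stalling on RAW hazards.

    program: list of (dest, [sources]). Returns the cycle at which the
    last result becomes available.
    """
    ready = {}     # register -> cycle its value becomes available
    cycle = 0      # current issue cycle
    issued = 0     # instructions already issued in this cycle
    finish = 0
    for dest, srcs in program:
        earliest = max([ready.get(s, 0) for s in srcs], default=0)
        if earliest > cycle:
            cycle, issued = earliest, 0      # stall until operands are ready
        elif issued == width:
            cycle, issued = cycle + 1, 0     # issue slots used up this cycle
        issued += 1
        ready[dest] = cycle + latency
        finish = max(finish, ready[dest])
    return finish


chain = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
independent = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r0"]), ("r4", ["r0"])]

print(cycles(chain, 1, 1), cycles(independent, 1, 1))   # 4 4
print(cycles(chain, 1, 3), cycles(independent, 1, 3))   # 12 6
print(cycles(chain, 2, 1), cycles(independent, 2, 1))   # 4 2
```

With single-cycle execution and one instruction per cycle, the dependent chain costs nothing extra. Once execution is overlapped (latency 3) or issue is widened (width 2), only the independent sequence gets faster.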

John Savard
