Rocksolid Light

Welcome to Rocksolid Light

mail  files  register  newsreader  groups  login

Message-ID:  

Happiness is a hard disk.


devel / comp.arch / Microarch Club

SubjectAuthor
* Microarch ClubGeorge Musk
`* Re: Microarch ClubBGB-Alt
 `* Re: Microarch ClubMitchAlsup1
  `* Re: Microarch ClubBGB
   `* Re: Microarch ClubMitchAlsup1
    `* Re: Microarch ClubBGB-Alt
     +* Re: Microarch ClubMichael S
     |`* Re: Microarch ClubBGB
     | `* Re: Microarch ClubMitchAlsup1
     |  +* Re: Microarch ClubScott Lurndal
     |  |+- Re: Microarch ClubMitchAlsup1
     |  |`* Re: Microarch ClubTerje Mathisen
     |  | +- Re: Microarch ClubScott Lurndal
     |  | `* Re: Microarch ClubMichael S
     |  |  `* Re: Microarch ClubTerje Mathisen
     |  |   `- Re: Microarch ClubMichael S
     |  `* Re: Microarch ClubMichael S
     |   `* Re: Microarch ClubBGB-Alt
     |    `* Re: Microarch ClubMitchAlsup1
     |     `- Re: Microarch ClubBGB
     `* Re: Microarch ClubMitchAlsup1
      `* Re: Microarch ClubBGB
       `* Re: Microarch ClubMitchAlsup1
        `- Re: Microarch ClubBGB

1
Microarch Club

<uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38115&group=comp.arch#38115

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!nnrp.usenet.blueworldhosting.com!.POSTED!not-for-mail
From: grgmusk@skiff.com (George Musk)
Newsgroups: comp.arch
Subject: Microarch Club
Date: Thu, 21 Mar 2024 19:34:50 -0000 (UTC)
Organization: BWH Usenet Archive (https://usenet.blueworldhosting.com)
Message-ID: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 21 Mar 2024 19:34:50 -0000 (UTC)
Injection-Info: nnrp.usenet.blueworldhosting.com;
logging-data="74224"; mail-complaints-to="usenet@blueworldhosting.com"
User-Agent: Pan/0.154 (Izium; 517acf4)
Cancel-Lock: sha1:XvmHkL8yrsHPWgPgdfcN0ZBFMXw= sha256:fIMKEgf/fgQ/qeUst3eUADZyxVRsoy489b7ARSoOx+U=
sha1:9FkLmx1V+qpU/pRya84xUoc5KBE= sha256:UsvJxxyA7HLrrOC9J+1QGoY0M/scv2FN9ap0QYFGx/4=
 by: George Musk - Thu, 21 Mar 2024 19:34 UTC

Thought this may be interesting:
https://microarch.club/
https://www.youtube.com/@MicroarchClub/videos

Re: Microarch Club

<utsrft$1b76a$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38149&group=comp.arch#38149

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Mon, 25 Mar 2024 16:48:43 -0500
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <utsrft$1b76a$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Mar 2024 22:48:45 +0100 (CET)
Injection-Info: dont-email.me; posting-host="45f9dcc3d866b8e585773ea070097f8e";
logging-data="1416394"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18bz0HGcxFm+WzFOMew2Ad6wyHg+AT3KWo="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:WvR0yJHqPMenHiLJKKSUlikYxt0=
In-Reply-To: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
Content-Language: en-US
 by: BGB-Alt - Mon, 25 Mar 2024 21:48 UTC

On 3/21/2024 2:34 PM, George Musk wrote:
> Thought this may be interesting:
> https://microarch.club/
> https://www.youtube.com/@MicroarchClub/videos

At least sort of interesting...

I guess one of the guys on there did a manycore VLIW architecture with
the memory local to each of the cores. Seems like an interesting
approach, though not sure how well it would work on a general purpose
workload. This is also closer to what I had imagined when I first
started working on this stuff, but it had drifted more towards a
slightly more conventional design.

But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
L2, seemed like a better option than cores with a very large shared L1
cache.

I am not sure that abandoning a global address space is such a great
idea, as a lot of the "merits" can be gained instead by using weak
coherence models (possibly with a shared 256K or 512K or so for each
group of 4 cores, at which point it goes out to a higher latency global
bus). In this case, the division into independent memory regions could
be done in software.

It is unclear if my approach is "sufficiently minimal". There is more
complexity than I would like in my ISA (and effectively turning it into
the common superset of both my original design and RV64G, doesn't really
help matters here).

If going for a more minimal core optimized for perf/area, some stuff
might be dropped. Would likely drop integer and floating-point divide
again. Might also make sense to add an architectural zero register, and
eliminate some number of encodings which exist merely because of the
lack of a zero register (though, encodings are comparably cheap, as the
internal uArch has a zero register, and effectively treats immediate
values as a special register as well, ...). Some of the debate is more
related to the logic cost of dealing with some things in the decoder.

Though, would likely still make a few decisions differently from those
in RISC-V. Things like indexed load/store, predicated ops (with a
designated flag bit), and large-immediate encodings, help enough with
performance (relative to cost) to be worth keeping (though, mostly
because the alternatives are not so good in terms of performance).

Staying with 3-wide also makes sense, as going to 1-wide or 2-wide will
not save much if one still wants things like FP-SIMD and unaligned
memory access (when as-is, all the 3rd lane can really do is basic ALU
ops and similar; and the savings are negligible if one has a register
file with 6-ports, which in turn is needed to fully provision stuff for
the 2-wide cases, ...).

Practically the manycore idea doesn't go very far on FPGAs, as to have
"useful" cores, one can't fit more than a few cores on any of the
"affordable" FPGAs...

But, maybe other people will have different ideas.

....

Re: Microarch Club

<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38150&group=comp.arch#38150

  copy link   Newsgroups: comp.arch
Date: Mon, 25 Mar 2024 22:17:03 +0000
Subject: Re: Microarch Club
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$CfG8MEZfWDKPaqyjnTBz6.WDUVoRyE7VHHWsRMIGMHmsKtl9.s1xG
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
 by: MitchAlsup1 - Mon, 25 Mar 2024 22:17 UTC

BGB-Alt wrote:

> On 3/21/2024 2:34 PM, George Musk wrote:
>> Thought this may be interesting:
>> https://microarch.club/
>> https://www.youtube.com/@MicroarchClub/videos

> At least sort of interesting...

> I guess one of the guys on there did a manycore VLIW architecture with
> the memory local to each of the cores. Seems like an interesting
> approach, though not sure how well it would work on a general purpose
> workload. This is also closer to what I had imagined when I first
> started working on this stuff, but it had drifted more towards a
> slightly more conventional design.

> But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
> L2, seemed like a better option than cores with a very large shared L1
> cache.

You appear to be "starting to get it"; congratulations.

> I am not sure that abandoning a global address space is such a great
> idea, as a lot of the "merits" can be gained instead by using weak
> coherence models (possibly with a shared 256K or 512K or so for each
> group of 4 cores, at which point it goes out to a higher latency global
> bus). In this case, the division into independent memory regions could
> be done in software.

Most of the last 50 years has been towards a single global address space.

> It is unclear if my approach is "sufficiently minimal". There is more
> complexity than I would like in my ISA (and effectively turning it into
> the common superset of both my original design and RV64G, doesn't really
> help matters here).

> If going for a more minimal core optimized for perf/area, some stuff
> might be dropped. Would likely drop integer and floating-point divide

I think this is pound foolish even if penny wise.

> again. Might also make sense to add an architectural zero register, and
> eliminate some number of encodings which exist merely because of the
> lack of a zero register (though, encodings are comparably cheap, as the

I got an effective zero register without having to waste a register
name to "get it". My 66000 gives you 32 registers of 64-bits each and
you can put any bit pattern in any register and treat it as you like.
Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
available.

> internal uArch has a zero register, and effectively treats immediate
> values as a special register as well, ...). Some of the debate is more
> related to the logic cost of dealing with some things in the decoder.

The problem is universal constants. RISCs being notably poor in their
support--however this is better than addressing modes which require
µCode.

> Though, would likely still make a few decisions differently from those
> in RISC-V. Things like indexed load/store,

Absolutely

> predicated ops (with a
> designated flag bit),

Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}

> and large-immediate encodings,

Nothing else is so poorly served in typical ISAs.

> help enough with
> performance (relative to cost)

+40%

> to be worth keeping (though, mostly
> because the alternatives are not so good in terms of performance).

Damage to pipeline ability less than -5%.

Re: Microarch Club

<uttfk3$1j3o3$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38156&group=comp.arch#38156

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Mon, 25 Mar 2024 22:32:13 -0500
Organization: A noiseless patient Spider
Lines: 276
Message-ID: <uttfk3$1j3o3$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 26 Mar 2024 04:32:20 +0100 (CET)
Injection-Info: dont-email.me; posting-host="fa3546eee5471627f4b184ff03e68974";
logging-data="1675011"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX185V4lZ7w/Bh+QL7n78TOjsz+NBcunrftA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ARkiN822WxqClDyhJr9/pieYfKA=
In-Reply-To: <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
Content-Language: en-US
 by: BGB - Tue, 26 Mar 2024 03:32 UTC

On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 3/21/2024 2:34 PM, George Musk wrote:
>>> Thought this may be interesting:
>>> https://microarch.club/
>>> https://www.youtube.com/@MicroarchClub/videos
>
>> At least sort of interesting...
>
>> I guess one of the guys on there did a manycore VLIW architecture with
>> the memory local to each of the cores. Seems like an interesting
>> approach, though not sure how well it would work on a general purpose
>> workload. This is also closer to what I had imagined when I first
>> started working on this stuff, but it had drifted more towards a
>> slightly more conventional design.
>
>
>> But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
>> L2, seemed like a better option than cores with a very large shared L1
>> cache.
>
> You appear to be "starting to get it"; congratulations.
>

I had experimented with stuff before, and "big L1 caches" seemed to be
in most regards worse. Hit rate goes into diminishing return territory,
and timing isn't too happy either.

At least for my workloads, 32K seemed like the local optimum.

Say, checking hit rates (in Doom):
8K: 67%, 16K: 78%,
32K: 85%, 64K: 87%
128K: 88%
This being for a direct-mapped cache configuration with even/odd paired
16-byte cache lines.

Other programs seem similar.

For a direct-mapped L1 cache, there is an issue with conflict misses,
where I was able to add in a small cache to absorb ~ 1-2% that was due
to conflict misses, which also had the (seemingly more obvious) effect
of reducing L2 misses (from a direct-mapped L2 cache). Though, it is
likely that a set-associative L2 cache could have also addressed this
issue (but likely with a higher cost impact).

>> I am not sure that abandoning a global address space is such a great
>> idea, as a lot of the "merits" can be gained instead by using weak
>> coherence models (possibly with a shared 256K or 512K or so for each
>> group of 4 cores, at which point it goes out to a higher latency
>> global bus). In this case, the division into independent memory
>> regions could be done in software.
>
> Most of the last 50 years has been towards a single global address space.
>

Yeah.

From what I can gather, the guy in the video had an architecture which
gives each CPU its own 128K and needs explicit message passing to access
outside of this (and faking a global address space in software, at a
significant performance penalty). As I see it, this does not seem like
such a great idea...

Something like weak coherence can get most of the same savings, with
much less impact on how one writes code (albeit, it does mean that mutex
locking may still be painfully slow).

But, this does mean it is better to try to approach software in a way
that neither requires TSO semantics nor frequent mutex locking.

>> It is unclear if my approach is "sufficiently minimal". There is more
>> complexity than I would like in my ISA (and effectively turning it
>> into the common superset of both my original design and RV64G, doesn't
>> really help matters here).
>
>> If going for a more minimal core optimized for perf/area, some stuff
>> might be dropped. Would likely drop integer and floating-point divide
>
> I think this is pound foolish even if penny wise.
>

The "shift and add" unit isn't free, and the relative gains are small.
For integer divide, granted, it is faster than the pure software version
in the general case. For FPU divide, N-R is faster, but shift-add can
give an exact result. Most other / faster hardware divide strategies
seem to be more expensive than a shift-and-add unit.

My confidence in hardware divide isn't too high, noting for example that
the AMD K10 and Bulldozer/15h had painfully slow divide operations (to
such a degree that doing it in software was often faster). This implies
that divide cost/performance is still not really a "solved" issue, even
if one has the resources to throw at it.

One can avoid the cost of the shift-and-add unit via "trap and emulate",
but then the performance is worse.

Say, "we have an instruction, but it is a boat anchor" isn't an ideal
situation (unless to be a placeholder for if/when it is not a boat anchor).

>> again. Might also make sense to add an architectural zero register,
>> and eliminate some number of encodings which exist merely because of
>> the lack of a zero register (though, encodings are comparably cheap,
>> as the
>
> I got an effective zero register without having to waste a register name
> to "get it". My 66000 gives you 32 registers of 64-bits each and you can
> put any bit pattern in any register and treat it as you like.
> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
> available.
>

I guess offloading this to the compiler can also make sense.

Least common denominator would be, say, not providing things like NEG
instructions and similar (pretending as-if one had a zero register), and
if a program needs to do a NEG or similar, it can load 0 into a register
itself.

In the extreme case (say, one also lacks a designated "load immediate"
instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to
zero a register...

Say:
XOR R14, R14, R14 //Designate R14 as pseudo-zero...
...
ADD R14, 0x123, R8 //Load 0x123 into R8

Though, likely still makes sense in this case to provide some
"convenience" instructions.

>> internal uArch has a zero register, and effectively treats immediate
>> values as a special register as well, ...). Some of the debate is more
>> related to the logic cost of dealing with some things in the decoder.
>
> The problem is universal constants. RISCs being notably poor in their
> support--however this is better than addressing modes which require
> µCode.
>

Yeah.

I ended up with jumbo-prefixes. Still not perfect, and not perfectly
orthogonal, but mostly works.

Allows, say:
ADD R4, 0x12345678, R6

To be performed in potentially 1 clock-cycle and with a 64-bit encoding,
which is better than, say:
LUI X8, 0x12345
ADD X8, X8, 0x678
ADD X12, X10, X8

Though, for jumbo-prefixes, did end up adding a special case in the
compile where it will try to figure out if a constant will be used
multiple times in a basic-block and, if so, will load it into a register
rather than use a jumbo-prefix form.

It could maybe make sense to have function-scale static-assigned
constants, but have not done so yet.

Though, it appears as if one of the "top contenders" here would be 0,
mostly because things like:
foo->x=0;
And:
bar[i]=0;
Are semi-common, and as-is end up needing to load 0 into a register each
time they appear.

Had already ended up with a similar sort of special case to optimize
"return 0;" and similar, mostly because this was common enough that it
made more sense to have a special case:
BRA .lbl_ret //if function does not end with "return 0;"
.lbl_ret_zero:
MOV 0, R2
.lbl_ret:
... epilog ...

For many functions, which allowed "return 0;" to be emitted as:
BRA .lbl_ret_zero
Rather than:
MOV 0, R2
BRA .lbl_ret
Which on average ended up as a net-win when there are more than around 3
of them per function.

Though, another possibility could be to allow constants to be included
in the "statically assign variables to registers" logic (as-is, they are
excluded except in "tiny leaf" functions).

>> Though, would likely still make a few decisions differently from those
>> in RISC-V. Things like indexed load/store,
>
> Absolutely
>
>>                                            predicated ops (with a
>> designated flag bit),
>
> Predicated then and else clauses which are branch free.
> {{Also good for constant time crypto in need of flow control...}}
>

I have per instruction predication:
CMPxx ...
OP?T //if-true
OP?F //if-false
Or:
OP?T | OP?F //both in parallel, subject to encoding and ISA rules

Performance gains are modest, but still noticeable (part of why
predication ended up as a core ISA feature). Effect on pipeline seems to
be small in its current form (it is handled along with register fetch,
mostly turning non-executed instructions into NOPs during the EX stages).

For the most part, 1-bit seems sufficient.
More complex schemes generally ran into issues (had experimented with
allowing a second predicate bit, or handling predicates as a
stack-machine, but these ideas were mostly dead on arrival).


Click here to read the complete article
Re: Microarch Club

<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38172&group=comp.arch#38172

  copy link   Newsgroups: comp.arch
Date: Tue, 26 Mar 2024 19:16:07 +0000
Subject: Re: Microarch Club
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$bgAXkHXpiaHjfxTbP7YnMe8TaohYzcg85de48NPN5DimBNsR.lnWG
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
 by: MitchAlsup1 - Tue, 26 Mar 2024 19:16 UTC

BGB wrote:

> On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
>> BGB-Alt wrote:
>>

> Say, "we have an instruction, but it is a boat anchor" isn't an ideal
> situation (unless to be a placeholder for if/when it is not a boat anchor).

If the boat anchor is a required unit of functionality, and I believe
IDIV and FPDIV is, it should be defined in ISA and if you can't afford
it find some way to trap rapidly so you can fix it up without excessive
overhead. Like a MIPS TLB reload. If you can't get trap and emulate at
sufficient performance, then add the HW to perform the instruction.

>>> again. Might also make sense to add an architectural zero register,
>>> and eliminate some number of encodings which exist merely because of
>>> the lack of a zero register (though, encodings are comparably cheap,
>>> as the
>>
>> I got an effective zero register without having to waste a register name
>> to "get it". My 66000 gives you 32 registers of 64-bits each and you can
>> put any bit pattern in any register and treat it as you like.
>> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
>> available.
>>

> I guess offloading this to the compiler can also make sense.

> Least common denominator would be, say, not providing things like NEG
> instructions and similar (pretending as-if one had a zero register), and
> if a program needs to do a NEG or similar, it can load 0 into a register
> itself.

> In the extreme case (say, one also lacks a designated "load immediate"
> instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to
> zero a register...

MOV Rd,#imm16

Cost 1 instruction of 32-bits in size and can be performed in 0 cycles

> Say:
> XOR R14, R14, R14 //Designate R14 as pseudo-zero...
> ...
> ADD R14, 0x123, R8 //Load 0x123 into R8

> Though, likely still makes sense in this case to provide some
> "convenience" instructions.

>>> internal uArch has a zero register, and effectively treats immediate
>>> values as a special register as well, ...). Some of the debate is more
>>> related to the logic cost of dealing with some things in the decoder.
>>
>> The problem is universal constants. RISCs being notably poor in their
>> support--however this is better than addressing modes which require
>> µCode.
>>

> Yeah.

> I ended up with jumbo-prefixes. Still not perfect, and not perfectly
> orthogonal, but mostly works.

> Allows, say:
> ADD R4, 0x12345678, R6

> To be performed in potentially 1 clock-cycle and with a 64-bit encoding,
> which is better than, say:
> LUI X8, 0x12345
> ADD X8, X8, 0x678
> ADD X12, X10, X8

This strategy completely fails when the constant contains more than 32-bits

FDIV R9,#3.141592653589247,R17

When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.

> Though, for jumbo-prefixes, did end up adding a special case in the
> compile where it will try to figure out if a constant will be used
> multiple times in a basic-block and, if so, will load it into a register
> rather than use a jumbo-prefix form.

This is a delicate balance:: while each use of the constant takes a
unit or 2 of the instruction stream, each use cost 0 more instructions.
The breakeven point in My 66000 is about 4 uses in a small area (loop)
means that it should be hoisted into a register.

> It could maybe make sense to have function-scale static-assigned
> constants, but have not done so yet.

> Though, it appears as if one of the "top contenders" here would be 0,
> mostly because things like:
> foo->x=0;
> And:
> bar[i]=0;

I see no need for a zero register:: the following are 1 instruction !

ST #0,[Rfoo,offset(x)]

ST #0,[Rbar,Ri]

> Are semi-common, and as-is end up needing to load 0 into a register each
> time they appear.

> Had already ended up with a similar sort of special case to optimize
> "return 0;" and similar, mostly because this was common enough that it
> made more sense to have a special case:
> BRA .lbl_ret //if function does not end with "return 0;"
> .lbl_ret_zero:
> MOV 0, R2
> .lbl_ret:
> ... epilog ...

> For many functions, which allowed "return 0;" to be emitted as:
> BRA .lbl_ret_zero
> Rather than:
> MOV 0, R2
> BRA .lbl_ret
> Which on average ended up as a net-win when there are more than around 3
> of them per function.

Special defined tails......

> Though, another possibility could be to allow constants to be included
> in the "statically assign variables to registers" logic (as-is, they are
> excluded except in "tiny leaf" functions).

>>> Though, would likely still make a few decisions differently from those
>>> in RISC-V. Things like indexed load/store,
>>
>> Absolutely
>>
>>>                                            predicated ops (with a
>>> designated flag bit),
>>
>> Predicated then and else clauses which are branch free.
>> {{Also good for constant time crypto in need of flow control...}}
>>

> I have per instruction predication:
> CMPxx ...
> OP?T //if-true
> OP?F //if-false
> Or:
> OP?T | OP?F //both in parallel, subject to encoding and ISA rules

CMP Rt,Ra,#whatever
PLE Rt,TTTTTEEE
// This begins the then-clause 5Ts -> 5 instructions
OP1
OP2
OP3
OP4
OP5
// this begins the else-clause 3Es -> 3 instructions
OP6
OP7
OP8
// we are now back join point.

Notice no internal flow control instructions.

> Performance gains are modest, but still noticeable (part of why
> predication ended up as a core ISA feature). Effect on pipeline seems to
> be small in its current form (it is handled along with register fetch,
> mostly turning non-executed instructions into NOPs during the EX stages).

The effect is that one uses Predication whenever you will have already
fetched instructions at the join point by the time you have determined
the predicate value {then, else} clauses. The PARSE and DECODE do the
flow control without bothering FETCH.

> For the most part, 1-bit seems sufficient.

How do you do && and || predication with 1 bit ??

> More complex schemes generally ran into issues (had experimented with
> allowing a second predicate bit, or handling predicates as a
> stack-machine, but these ideas were mostly dead on arrival).

Also note: the instructions in the then and else clauses know NOTHING
about being under a predicate mask (or not) Thus, they waste no bit
while retaining the ability to run under predication.

>>>                       and large-immediate encodings,
>>
>> Nothing else is so poorly served in typical ISAs.
>>

> Probably true.

>>>                                                      help enough with
>>> performance (relative to cost)
>>
>> +40%
>>

> I am mostly seeing around 30% or so, for Doom and similar.
> A few other programs still being closer to break-even at present.

> Things are a bit more contentious in terms of code density:
> With size-minimizing options to GCC:
> ".text" is slightly larger with BGBCC vs GCC (around 11%);
> However, the GCC output has significantly more ".rodata".

A lot of this .rodata becomes constants in .text with universal constants.

> A reasonable chunk of the code-size difference could be attributed to
> jumbo prefixes making the average instruction size slightly bigger.

Size is one thing and it primarily diddles in cache footprint statstics.
Instruction count is another and primarily diddles in pipeline cycles
to execute statistics.
Fewer instruction wins almost all the time.

> More could be possible with more compiler optimization effort.
> Currently, a few recent optimization cases are disabled as they seem to
> be causing bugs that I haven't figured out yet.

>>>                                to be worth keeping (though, mostly
>>> because the alternatives are not so good in terms of performance).
>>
>> Damage to pipeline ability less than -5%.

> Yeah.

Re: Microarch Club

<utvggu$2cgkl$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38173&group=comp.arch#38173

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!newsfeed.bofh.team!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Tue, 26 Mar 2024 16:59:57 -0500
Organization: A noiseless patient Spider
Lines: 431
Message-ID: <utvggu$2cgkl$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 26 Mar 2024 21:59:59 +0100 (CET)
Injection-Info: dont-email.me; posting-host="9e0688b238126de83c4cd3272be1498b";
logging-data="2507413"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18KKWTXyCMsqJiD8XSHFnIQxZT2Aw1I74I="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:l35LhN2DlM7ntTYKtkskbThoVJM=
In-Reply-To: <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
Content-Language: en-US
 by: BGB-Alt - Tue, 26 Mar 2024 21:59 UTC

On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>>
>
>> Say, "we have an instruction, but it is a boat anchor" isn't an ideal
>> situation (unless to be a placeholder for if/when it is not a boat
>> anchor).
>
> If the boat anchor is a required unit of functionality, and I believe
> IDIV and FPDIV is, it should be defined in ISA and if you can't afford
> it find some way to trap rapidly so you can fix it up without excessive
> overhead. Like a MIPS TLB reload. If you can't get trap and emulate at
> sufficient performance, then add the HW to perform the instruction.
>

Though, 32-bit ARM managed OK without integer divide.

In my case, I ended up supporting it mostly for sake of the RV64 'M'
extension, but it is in this case a little faster than a pure software
solution (unlike on the K10 and Piledriver).

Still costs around 1.5 kLUTs though for 64-bit MUL/DIV support, and a
little more to route FDIV through it.

Cheapest FPU approach is still the "ADD/SUB/MUL only" route.

>>>> again. Might also make sense to add an architectural zero register,
>>>> and eliminate some number of encodings which exist merely because of
>>>> the lack of a zero register (though, encodings are comparably cheap,
>>>> as the
>>>
>>> I got an effective zero register without having to waste a register
>>> name to "get it". My 66000 gives you 32 registers of 64-bits each and
>>> you can put any bit pattern in any register and treat it as you like.
>>> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
>>> available.
>>>
>
>> I guess offloading this to the compiler can also make sense.
>
>> Least common denominator would be, say, not providing things like NEG
>> instructions and similar (pretending as-if one had a zero register),
>> and if a program needs to do a NEG or similar, it can load 0 into a
>> register itself.
>
>> In the extreme case (say, one also lacks a designated "load immediate"
>> instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy
>> to zero a register...
>
>     MOV   Rd,#imm16
>
> Cost 1 instruction of 32-bits in size and can be performed in 0 cycles
>

Though, RV had skipped this:
ADD Xd, Zero, Imm12s
Or:
LUI Xd, ImmHi20
ADD Xd, Xd, ImmLo12s

One can argue for this on the basis of not needing an immediate-load
instruction (nor a MOV instruction, nor NEG, nor ...).

Though, yeah, in my case I ended up with more variety here:
LDIZ Imm10u, Rn //10-bit, zero-extend, Imm12u (XG2)
LDIN Imm10n, Rn //10-bit, one-extend, Imm12n (XG2)
LDIMIZ Imm10u, Rn //Rn=Imm10u<<16 (newish)
LDIMIN Imm10n, Rn //Rn=Imm10n<<16 (newish)
LDIHI Imm10u, Rn //Rn=Imm10u<<22
LDIQHI Imm10u, Rn //Rn=Imm10u<<54

LDIZ Imm16u, Rn //16-bit, zero-extend
LDIN Imm16n, Rn //16-bit, one-extend

Then 64-bit jumbo forms:
LDI Imm33s, Rn //33-bit, sign-extend
LDIHI Imm33s, Rn //Rn=Imm33s<<16
LDIQHI Imm33s, Rn //Rn=Imm33s<<32

Then, 96 bit:
LDI Imm64, Rn //64-bit, sign-extend

And, some special cases:
FLDCH Imm16u, Rn //Binary16->Binary64

One could argue though that this is wild extravagance...

The recent addition of LDIMIx was mostly because otherwise one needed a
64-bit encoding to load constants like 262144 or similar (and a lot of
bit-masks).

At one point I did evaluate a more ARM32-like approach (effectively
using a small value and a rotate). But, this cost more than the other
options (would have required the great evil of effectively being able to
feed two immediate values into the integer-shift unit, whereas many of
the others could be routed through logic I already have for other ops).

Though, one can argue that the drawback is that one does end up with
more instructions in the ISA listing.

>> Say:
>>    XOR R14, R14, R14  //Designate R14 as pseudo-zero...
>>    ...
>>    ADD R14, 0x123, R8  //Load 0x123 into R8
>
>> Though, likely still makes sense in this case to provide some
>> "convenience" instructions.
>
>
>>>> internal uArch has a zero register, and effectively treats immediate
>>>> values as a special register as well, ...). Some of the debate is
>>>> more related to the logic cost of dealing with some things in the
>>>> decoder.
>>>
>>> The problem is universal constants. RISCs being notably poor in their
>>> support--however this is better than addressing modes which require
>>> µCode.
>>>
>
>> Yeah.
>
>> I ended up with jumbo-prefixes. Still not perfect, and not perfectly
>> orthogonal, but mostly works.
>
>> Allows, say:
>>    ADD R4, 0x12345678, R6
>
>> To be performed in potentially 1 clock-cycle and with a 64-bit
>> encoding, which is better than, say:
>>    LUI X8, 0x12345
>>    ADD X8, X8, 0x678
>>    ADD X12, X10, X8
>
> This strategy completely fails when the constant contains more than 32-bits
>
>     FDIV   R9,#3.141592653589247,R17
>
> When you have universal constants (including 5-bit immediates), you rarely
> need a register containing 0.
>

The jumbo prefixes at least allow for a 64-bit constant load, but as-is
not for 64-bit immediate values to 3RI ops. The latter could be done,
but would require 128-bit fetch and decode, which doesn't seem worth it.

There is the limbo feature of allowing for 57-bit immediate values, but
this is optional.

OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Typical GCC response on RV64 seems to be to turn nearly all of the
big-constant cases into memory loads, which kinda sucks.

Even something like a "LI Xd, Imm17s" instruction, would notably reduce
the number of constants loaded from memory (as GCC seemingly prefers to
use a LHU or LW or similar rather than encode it using LUI+ADD).

I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
them enabled in the CPU core (they involved the non-zero cost of
repacking them into Binary16 in ID1 and then throwing a
Binary16->Binary64 converter into the ID2 stage).

Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
can leverage a more generic Binary16->Binary64 converter path).

For FPU compare with zero, can almost leverage the integer compare ops,
apart from the annoying edge cases of -0.0 and NaN leading to "not
strictly equivalent" behavior (though, an ASM programmer could more
easily get away with this). But, not common enough to justify adding FPU
specific ops for this.

>> Though, for jumbo-prefixes, did end up adding a special case in the
>> compile where it will try to figure out if a constant will be used
>> multiple times in a basic-block and, if so, will load it into a
>> register rather than use a jumbo-prefix form.
>
> This is a delicate balance:: while each use of the constant takes a
> unit or 2 of the instruction stream, each use cost 0 more instructions.
> The breakeven point in My 66000 is about 4 uses in a small area (loop)
> means that it should be hoisted into a register.
>

IIRC, I had set it as x>2, since 2 seemed break-even, 3+ favoring using
a register, and 1 favoring a constant.

This did require adding logic to look forward over the current basic
block to check for additional usage, which does seem a little like a
hack. But, similar logic had already been used by the register allocator
to try to prioritize which values should be evicted.

>> It could maybe make sense to have function-scale static-assigned
>> constants, but have not done so yet.
>
>> Though, it appears as if one of the "top contenders" here would be 0,
>> mostly because things like:
>>    foo->x=0;
>> And:
>>    bar[i]=0;
>
> I see no need for a zero register:: the following are 1 instruction !
>
>    ST     #0,[Rfoo,offset(x)]
>
>    ST     #0,[Rbar,Ri]
>

In my case, there is no direct equivalent in the core ISA.

But, in retrospect, a zero-register is probably not worth it if one is
not trying for RISC-V style minimalism in the core ISA (nevermind if
they seemingly promptly disregard minimalism as soon as one gets outside
of 'I', but then fail to address the shortcomings of said minimalism in
'I').


Click here to read the complete article
Re: Microarch Club

<20240327012715.0000125c@yahoo.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38174&group=comp.arch#38174

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Wed, 27 Mar 2024 00:27:15 +0200
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <20240327012715.0000125c@yahoo.com>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 26 Mar 2024 22:27:18 +0100 (CET)
Injection-Info: dont-email.me; posting-host="fdc93945bf4086afcea95294ad40c436";
logging-data="2512648"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18d3p8/74fLvkXKmsP5TwwMNAg0tinx7kA="
Cancel-Lock: sha1:/izJ3hTvS7/GCcFIRqI47KmekF0=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Tue, 26 Mar 2024 22:27 UTC

On Tue, 26 Mar 2024 16:59:57 -0500
BGB-Alt <bohannonindustriesllc@gmail.com> wrote:

> On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
> > BGB wrote:
> >
> >> On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
> >>> BGB-Alt wrote:
> >>>
> >
> >> Say, "we have an instruction, but it is a boat anchor" isn't an
> >> ideal situation (unless to be a placeholder for if/when it is not
> >> a boat anchor).
> >
> > If the boat anchor is a required unit of functionality, and I
> > believe IDIV and FPDIV is, it should be defined in ISA and if you
> > can't afford it find some way to trap rapidly so you can fix it up
> > without excessive overhead. Like a MIPS TLB reload. If you can't
> > get trap and emulate at sufficient performance, then add the HW to
> > perform the instruction.
>
> Though, 32-bit ARM managed OK without integer divide.
>

For slightly less then 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)

Re: Microarch Club

<3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38175&group=comp.arch#38175

  copy link   Newsgroups: comp.arch
Date: Wed, 27 Mar 2024 00:02:05 +0000
Subject: Re: Microarch Club
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$MP4yyA3Q/mGTF79uwjlLEuWMQSrb/myYSpnyznxIgQk0pmgdpu8yy
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
 by: MitchAlsup1 - Wed, 27 Mar 2024 00:02 UTC

BGB-Alt wrote:

> On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> I ended up with jumbo-prefixes. Still not perfect, and not perfectly
>>> orthogonal, but mostly works.
>>
>>> Allows, say:
>>>    ADD R4, 0x12345678, R6
>>
>>> To be performed in potentially 1 clock-cycle and with a 64-bit
>>> encoding, which is better than, say:
>>>    LUI X8, 0x12345
>>>    ADD X8, X8, 0x678
>>>    ADD X12, X10, X8
>>
>> This strategy completely fails when the constant contains more than 32-bits
>>
>>     FDIV   R9,#3.141592653589247,R17
>>
>> When you have universal constants (including 5-bit immediates), you rarely
>> need a register containing 0.
>>

> The jumbo prefixes at least allow for a 64-bit constant load, but as-is
> not for 64-bit immediate values to 3RI ops. The latter could be done,
> but would require 128-bit fetch and decode, which doesn't seem worth it.

> There is the limbo feature of allowing for 57-bit immediate values, but
> this is optional.

> OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
> Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Which the LLVM compiler for RISC-V does not do, instead it uses a AUPIC
and a LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory when universal constants do this in
1 instruction and 2 words in the code stream to do this.

> Typical GCC response on RV64 seems to be to turn nearly all of the
> big-constant cases into memory loads, which kinda sucks.

This is typical when the underlying architecture is not very extensible
to 64-bit virtual address spaces; they have to waste a portion of the
32-bit space to get access to all the 64-bit space. Universal constants
makes this problem vanish.

> Even something like a "LI Xd, Imm17s" instruction, would notably reduce
> the number of constants loaded from memory (as GCC seemingly prefers to
> use a LHU or LW or similar rather than encode it using LUI+ADD).

Reduce when compared to RISC-V but increased when compared to My 66000.
My 66000 has (at 99& level) uses no instructions to fetch or create
constants, nor does it waste any register (or registers) to hold use
once constants.

> I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
> S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
> them enabled in the CPU core (they involved the non-zero cost of
> repacking them into Binary16 in ID1 and then throwing a
> Binary16->Binary64 converter into the ID2 stage).

> Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
> can leverage a more generic Binary16->Binary64 converter path).

Sometimes I see a::

CVTSD R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in
register R2 so it can be accesses as an argument in the subroutine call
to happen in a few instructions.

Mostly, a floating point immediate is available from a 32-bit constant
container. When accesses in a float calculation it is used as IEEE32
when accessed by a 6double calculation IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.

> For FPU compare with zero, can almost leverage the integer compare ops,
> apart from the annoying edge cases of -0.0 and NaN leading to "not
> strictly equivalent" behavior (though, an ASM programmer could more
> easily get away with this). But, not common enough to justify adding FPU
> specific ops for this.

Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinites become free !!

Having structured a Compare-to-zero circuit based on the fields in double;
You can compose the terns to do all signed and unsigned integers and get
a gate count, then the number of gates you add to cover all 10 cases of
floating point is 12% gate count over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.

-----------------------

> Seems that generally 0 still isn't quite common enough to justify having
> one register fewer for variables though (or to have a designated zero
> register), but otherwise it seems there is not much to justify trying to
> exclude the "implicit zero" ops from the ISA listing.

It is common enough,
But there are lots of ways to get a zero where you want it for a return.

>>
>>>>> Though, would likely still make a few decisions differently from
>>>>> those in RISC-V. Things like indexed load/store,
>>>>
>>>> Absolutely
>>>>
>>>>>                                            predicated ops (with a
>>>>> designated flag bit),
>>>>
>>>> Predicated then and else clauses which are branch free.
>>>> {{Also good for constant time crypto in need of flow control...}}
>>>>
>>
>>> I have per instruction predication:
>>>    CMPxx ...
>>>    OP?T  //if-true
>>>    OP?F  //if-false
>>> Or:
>>>    OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
>>
>>     CMP  Rt,Ra,#whatever
>>     PLE  Rt,TTTTTEEE
>>     // This begins the then-clause 5Ts -> 5 instructions
>>     OP1
>>     OP2
>>     OP3
>>     OP4
>>     OP5
>>     // this begins the else-clause 3Es -> 3 instructions
>>     OP6
>>     OP7
>>     OP8
>>     // we are now back join point.
>>
>> Notice no internal flow control instructions.
>>

> It can be similar in my case, with the ?T / ?F encoding scheme.

Except you eat that/those bits in OpCode encoding.

> While poking at it, did go and add a check to exclude large struct-copy
> operations from predication, as it is slower to turn a large struct copy
> into NOPs than to branch over it.

> Did end up leaving struct-copies where sz<=64 as allowed though (where a
> 64 byte copy at least has the merit of achieving full pipeline
> saturation and being roughly break-even with a branch-miss, whereas a
> 128 byte copy would cost roughly twice as much as a branch miss).

I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40%of the power of the caches avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a line
boundary and you access TLB only after crossing a page boundary.

>>> Performance gains are modest, but still noticeable (part of why
>>> predication ended up as a core ISA feature). Effect on pipeline seems
>>> to be small in its current form (it is handled along with register
>>> fetch, mostly turning non-executed instructions into NOPs during the
>>> EX stages).
>>
>> The effect is that one uses Predication whenever you will have already
>> fetched instructions at the join point by the time you have determined
>> the predicate value {then, else} clauses. The PARSE and DECODE do the
>> flow control without bothering FETCH.
>>

> Yeah, though in my pipeline, it is still a tradeoff of the relative cost
> of a missed branch, vs the cost of sliding over both the THEN and ELSE
> branches as a series of NOPs.

>>> For the most part, 1-bit seems sufficient.
>>
>> How do you do && and || predication with 1 bit ??
>>

> Originally, it didn't.
> Now I added some 3R and 3RI CMPxx encodings.

> This allows, say:
> CMPGT R8, R10, R4
> CMPGT R8, R11, R5
> TST R4, R5
> ....

All I had to do was to make the second predication overwrite the first
predication's mask, and the compiler did the rest.

Re: Microarch Club

<uu1o2p$30cnr$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38182&group=comp.arch#38182

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Wed, 27 Mar 2024 13:21:04 -0500
Organization: A noiseless patient Spider
Lines: 343
Message-ID: <uu1o2p$30cnr$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me>
<3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 27 Mar 2024 18:21:14 +0100 (CET)
Injection-Info: dont-email.me; posting-host="1724f6b222b7d6e5637447713cb23601";
logging-data="3158779"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+rG3KEYjMp0HKOhQuK3ce4plqkH/NZkN4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:S0vCQiaILZ2kZhi23n1GrJmkUxc=
Content-Language: en-US
In-Reply-To: <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
 by: BGB - Wed, 27 Mar 2024 18:21 UTC

On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> I ended up with jumbo-prefixes. Still not perfect, and not perfectly
>>>> orthogonal, but mostly works.
>>>
>>>> Allows, say:
>>>>    ADD R4, 0x12345678, R6
>>>
>>>> To be performed in potentially 1 clock-cycle and with a 64-bit
>>>> encoding, which is better than, say:
>>>>    LUI X8, 0x12345
>>>>    ADD X8, X8, 0x678
>>>>    ADD X12, X10, X8
>>>
>>> This strategy completely fails when the constant contains more than
>>> 32-bits
>>>
>>>      FDIV   R9,#3.141592653589247,R17
>>>
>>> When you have universal constants (including 5-bit immediates), you
>>> rarely
>>> need a register containing 0.
>>>
>
>> The jumbo prefixes at least allow for a 64-bit constant load, but
>> as-is not for 64-bit immediate values to 3RI ops. The latter could be
>> done, but would require 128-bit fetch and decode, which doesn't seem
>> worth it.
>
>> There is the limbo feature of allowing for 57-bit immediate values,
>> but this is optional.
>
>
>> OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
>> Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.
>
> Which the LLVM compiler for RISC-V does not do, instead it uses a AUPIC
> and a LD to get the value from data memory within ±2GB of IP. This takes
> 3 instructions and 2 words in memory when universal constants do this in
> 1 instruction and 2 words in the code stream to do this.
>

I was mostly testing with GCC for RV64, but, yeah, it just does memory
loads.

>> Typical GCC response on RV64 seems to be to turn nearly all of the
>> big-constant cases into memory loads, which kinda sucks.
>
> This is typical when the underlying architecture is not very extensible
> to 64-bit virtual address spaces; they have to waste a portion of the
> 32-bit space to get access to all the 64-bit space. Universal constants
> makes this problem vanish.
>

Yeah.

It at least seems worthwhile to have non-suck fallback strategies.

>> Even something like a "LI Xd, Imm17s" instruction, would notably
>> reduce the number of constants loaded from memory (as GCC seemingly
>> prefers to use a LHU or LW or similar rather than encode it using
>> LUI+ADD).
>
> Reduce when compared to RISC-V but increased when compared to My 66000.
> My 66000 has (at 99& level) uses no instructions to fetch or create
> constants, nor does it waste any register (or registers) to hold use
> once constants.
>

Yeah, but the issue here is mostly with RISC-V and its lack of constant
load.

Or burning lots of encoding space (on LUI and AUIPC), but still lacks
good general-purpose options.

FWIW, if they had:
LI Xd, Imm17s
SHORI Xd, Imm16u //Xd=(Xd<<16)|Imm16u

Would have still allowed a 32-bit constant in 2 ops, but would have also
allowed 64-bit in 4 ops (within the limits of fixed-length 32-bit
instructions); while also needing significantly less encoding space
(could fit both of them into the remaining space in the OP-IMM-32 block).

>> I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
>> S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
>> them enabled in the CPU core (they involved the non-zero cost of
>> repacking them into Binary16 in ID1 and then throwing a
>> Binary16->Binary64 converter into the ID2 stage).
>
>> Generally, the "FLDCH Imm16, Rn" instruction works well enough here
>> (and can leverage a more generic Binary16->Binary64 converter path).
>
> Sometimes I see a::
>
>     CVTSD     R2,#5
>
> Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
> in register R2 so it can be accesses as an argument in the subroutine call
> to happen in a few instructions.
>

I had looked into, say:
FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.

Unpacking imm5fp to Binary16 being, essentially:
aee.fff -> 0.aAAee.fff0000000

OTOH, can note that a majority of typical floating point constants can
be represented exactly in Binary16 (well, excluding "0.1" or similar),
so it works OK as an immediate format.

This allows a single 32-bit op to be used for constant loads (nevermind
if one needs a 96 bit encoding for 0.1, or PI, or ...).

IIRC, I originally had it in the CONV2 path, which would give it a
2-cycle latency and only allow it in Lane 1.

I later migrated the logic to the "MOV_IR" path which also deals with
"Rn=(Rn<<16)|Imm" and similar, and currently allows 1-cycle Binary16
immediate-loads in all 3 lanes.

Though, BGBCC still assumes it is Lane-1 only unless the FPU Immediate
extension is enabled (as with the other FP converters).

> Mostly, a floating point immediate is available from a 32-bit constant
> container. When accesses in a float calculation it is used as IEEE32
> when accessed by a 6double calculation IEEE32->IEEE64 promotion is
> performed in the constant delivery path. So, one can use almost any
> floating point constant that is representable in float as a double
> without eating cycles and while saving code footprint.
>

Don't currently have the encoding space for this.

Could in theory pull off truncated Binary32 an Imm29s form, but not
likely worth it. Would also require putting a converted in the ID2
stage, so not free.

In this case, the issue is more one of LUT cost to support these cases.

>> For FPU compare with zero, can almost leverage the integer compare
>> ops, apart from the annoying edge cases of -0.0 and NaN leading to
>> "not strictly equivalent" behavior (though, an ASM programmer could
>> more easily get away with this). But, not common enough to justify
>> adding FPU specific ops for this.
>
> Actually, the edge/noise cases are not that many gates.
> a) once you are separating out NaNs, infinities are free !!
> b) once you are checking denorms for zero, infinites become free !!
>
> Having structured a Compare-to-zero circuit based on the fields in double;
> You can compose the terns to do all signed and unsigned integers and get
> a gate count, then the number of gates you add to cover all 10 cases of
> floating point is 12% gate count over the simple integer version. Also
> note:: this circuit is about 10% of the gate count of an integer adder.
>

I could add them, but, is it worth it?...

In this case, it is more a question of encoding space than logic cost.

It is semi-common in FP terms, but likely not common enough to justify
dedicated compare-and-branch ops and similar (vs the minor annoyance at
the integer ops not quite working correctly due to edge cases).

> -----------------------
>
>> Seems that generally 0 still isn't quite common enough to justify
>> having one register fewer for variables though (or to have a
>> designated zero register), but otherwise it seems there is not much to
>> justify trying to exclude the "implicit zero" ops from the ISA listing.
>
>
> It is common enough,
> But there are lots of ways to get a zero where you want it for a return.
>

I think the main use case for a zero register is mostly that it allows
using it as a special case for pseudo-ops. I guess, not quite the same
if it is a normal GPR that just so happens to be 0.

Recently ended up fixing a bug where:
y=-x;
Was misbehaving with "unsigned int":
"NEG" produces a value which falls outside of UInt range;
But, "NEG; EXTU.L" is a 2-op sequence.
It had the EXT for SB/UB/SW/UW, but not for UL.
For SL, bare NEG almost works, apart from ((-1)<<31).
Could encode it as:
SUBU.L Zero, Rs, Rn
But, without a zero register, the compiler needs to special-case
provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
seem common enough to justify this).

....

>>>
>>>>>> Though, would likely still make a few decisions differently from
>>>>>> those in RISC-V. Things like indexed load/store,
>>>>>
>>>>> Absolutely
>>>>>
>>>>>>                                            predicated ops (with a
>>>>>> designated flag bit),
>>>>>
>>>>> Predicated then and else clauses which are branch free.
>>>>> {{Also good for constant time crypto in need of flow control...}}
>>>>>
>>>
>>>> I have per instruction predication:
>>>>    CMPxx ...
>>>>    OP?T  //if-true
>>>>    OP?F  //if-false
>>>> Or:
>>>>    OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
>>>
>>>      CMP  Rt,Ra,#whatever
>>>      PLE  Rt,TTTTTEEE
>>>      // This begins the then-clause 5Ts -> 5 instructions
>>>      OP1
>>>      OP2
>>>      OP3
>>>      OP4
>>>      OP5
>>>      // this begins the else-clause 3Es -> 3 instructions
>>>      OP6
>>>      OP7
>>>      OP8
>>>      // we are now back join point.
>>>
>>> Notice no internal flow control instructions.
>>>
>
>> It can be similar in my case, with the ?T / ?F encoding scheme.
>
> Except you eat that/those bits in OpCode encoding.
>


Click here to read the complete article
Re: Microarch Club

<uu1op0$30i4b$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38184&group=comp.arch#38184

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Wed, 27 Mar 2024 13:32:56 -0500
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <uu1op0$30i4b$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 27 Mar 2024 18:33:05 +0100 (CET)
Injection-Info: dont-email.me; posting-host="1724f6b222b7d6e5637447713cb23601";
logging-data="3164299"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19z3T6jf+lskyoNmb/yGZyS4hiZhorEbeU="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:VbU9XrtaFaTU9rnJU39Q/6lyrHU=
Content-Language: en-US
In-Reply-To: <20240327012715.0000125c@yahoo.com>
 by: BGB - Wed, 27 Mar 2024 18:32 UTC

On 3/26/2024 5:27 PM, Michael S wrote:
> On Tue, 26 Mar 2024 16:59:57 -0500
> BGB-Alt <bohannonindustriesllc@gmail.com> wrote:
>
>> On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
>>>>> BGB-Alt wrote:
>>>>>
>>>
>>>> Say, "we have an instruction, but it is a boat anchor" isn't an
>>>> ideal situation (unless to be a placeholder for if/when it is not
>>>> a boat anchor).
>>>
>>> If the boat anchor is a required unit of functionality, and I
>>> believe IDIV and FPDIV is, it should be defined in ISA and if you
>>> can't afford it find some way to trap rapidly so you can fix it up
>>> without excessive overhead. Like a MIPS TLB reload. If you can't
>>> get trap and emulate at sufficient performance, then add the HW to
>>> perform the instruction.
>>
>> Though, 32-bit ARM managed OK without integer divide.
>>
>
> For slightly less then 20 years ARM managed OK without integer divide.
> Then in 2004 they added integer divide instruction in ARMv7 (including
> ARMv7-M variant intended for small microcontroller cores like
> Cortex-M3) and for the following 20 years instead of merely OK they are
> doing great :-)
>

OK.

I think both modern ARM and AMD Zen went over to "actually fast" integer
divide.

I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.

On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
This works out faster than a generic shift-subtract divider (or using a
runtime call which then sorts out what to do).

A special case allows turning small divisors internally into
divide-by-reciprocal, which allows for a 3-cycle divide special case.
But, this is a LUT cost tradeoff.

It could be possible in theory to support a general 3-cycle integer
divide, albeit if one can accept inexact results (would be faster than
the software-based lookup table strategy).

But, it is debatable. Pure minimalism would likely favor leaving out
divide (and a bunch of other stuff). Usual rationale being, say, to try
to fit the entire ISA listing on a single page of paper or similar (vs
having a listing with several hundred defined encodings).

Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
listings that are considerably larger (thousands of encodings).

....

Re: Microarch Club

<727d2b0fa197393309db7fefd76591e0@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38189&group=comp.arch#38189

  copy link   Newsgroups: comp.arch
Date: Wed, 27 Mar 2024 21:03:52 +0000
Subject: Re: Microarch Club
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$YVg3DMJ4HF87oVtGNgD0L.Oa5ieBOM/VYBaar7D46zVmTnrSLZB9C
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me> <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org> <uu1o2p$30cnr$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <727d2b0fa197393309db7fefd76591e0@www.novabbs.org>
 by: MitchAlsup1 - Wed, 27 Mar 2024 21:03 UTC

BGB wrote:

> On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
>>
>> Sometimes I see a::
>>
>>     CVTSD     R2,#5
>>
>> Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
>> in register R2 so it can be accesses as an argument in the subroutine call
>> to happen in a few instructions.
>>

> I had looked into, say:
> FADD Rm, Imm5fp, Rn
> Where, despite Imm5fp being severely limited, it had an OK hit rate.

> Unpacking imm5fp to Binary16 being, essentially:
> aee.fff -> 0.aAAee.fff0000000

realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
used fp constants--but does include 0, 1, 2, and 10. Also, realistically
the missing cases are the 0.5s.

> OTOH, can note that a majority of typical floating point constants can
> be represented exactly in Binary16 (well, excluding "0.1" or similar),
> so it works OK as an immediate format.

> This allows a single 32-bit op to be used for constant loads (nevermind
> if one needs a 96 bit encoding for 0.1, or PI, or ...).

>> Mostly, a floating point immediate is available from a 32-bit constant
>> container. When accesses in a float calculation it is used as IEEE32
>> when accessed by a 6double calculation IEEE32->IEEE64 promotion is
>> performed in the constant delivery path. So, one can use almost any
>> floating point constant that is representable in float as a double
>> without eating cycles and while saving code footprint.
>>

> Don't currently have the encoding space for this.

> Could in theory pull off truncated Binary32 an Imm29s form, but not
> likely worth it. Would also require putting a converted in the ID2
> stage, so not free.

> In this case, the issue is more one of LUT cost to support these cases.

>>> For FPU compare with zero, can almost leverage the integer compare
>>> ops, apart from the annoying edge cases of -0.0 and NaN leading to
>>> "not strictly equivalent" behavior (though, an ASM programmer could
>>> more easily get away with this). But, not common enough to justify
>>> adding FPU specific ops for this.
>>
>> Actually, the edge/noise cases are not that many gates.
>> a) once you are separating out NaNs, infinities are free !!
>> b) once you are checking denorms for zero, infinites become free !!
>>
>> Having structured a Compare-to-zero circuit based on the fields in double;
>> You can compose the terns to do all signed and unsigned integers and get
>> a gate count, then the number of gates you add to cover all 10 cases of
>> floating point is 12% gate count over the simple integer version. Also
>> note:: this circuit is about 10% of the gate count of an integer adder.
>>

> I could add them, but, is it worth it?...

Whether to add them or not is on you.
I found things like this to be more straw on the camel's back
{where the camel collapses to a unified register file model.}

> In this case, it is more a question of encoding space than logic cost.

> It is semi-common in FP terms, but likely not common enough to justify
> dedicated compare-and-branch ops and similar (vs the minor annoyance at
> the integer ops not quite working correctly due to edge cases).

My model requires about ½ the instruction count when processing FP
comparisons compared to RISC-V (big, no; around 5% in FP code and 0
elsewhere.} Where it wins big is compare against a non-zero FP constant.
My 66000 uses 1 instructions {FCMP, BB} whereas RISC=V uses 4 {AUPIC,
LD, FCMP, BC}

>> -----------------------
>>
>>> Seems that generally 0 still isn't quite common enough to justify
>>> having one register fewer for variables though (or to have a
>>> designated zero register), but otherwise it seems there is not much to
>>> justify trying to exclude the "implicit zero" ops from the ISA listing.
>>
>>
>> It is common enough,
>> But there are lots of ways to get a zero where you want it for a return.
>>

> I think the main use case for a zero register is mostly that it allows
> using it as a special case for pseudo-ops. I guess, not quite the same
> if it is a normal GPR that just so happens to be 0.

> Recently ended up fixing a bug where:
> y=-x;
> Was misbehaving with "unsigned int":
> "NEG" produces a value which falls outside of UInt range;
> But, "NEG; EXTU.L" is a 2-op sequence.
> It had the EXT for SB/UB/SW/UW, but not for UL.
> For SL, bare NEG almost works, apart from ((-1)<<31).
> Could encode it as:
> SUBU.L Zero, Rs, Rn

ADD Rn,#0,-Rs

But notice::
y = -x;
a = b + y;
can be performed as if it had been written::
y = -x;
a = b + (-x);
Which is encoded as::

ADD Ry,#0,-Rx
ADD Ra,Rb,-Rx

> But, without a zero register,

#0 is not a register, but its value is 0x0000000000000000 anyway.

You missed the point entirely, if you can get easy access to #0
then you no longer need a register to hold this simple bit pattern.
In fact a large portion of My 66000 ISA over RISC-V comes from this
mechanism.

> the compiler needs to special-case

The compiler needs easy access to #0 and the compiler needs to know
that #0 exists, but the compiler does not need to know if some register
contains that same bit pattern.

> provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
> seem common enough to justify this).

> ....

> It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

I burned 0 per instruction, but you can claim I burned 1 instruction PRED and
6.4 bits of that instruction are used to create masks that project upon up to
8 following instructions.

> Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
> Per-instruction predication;
> WEX / Bundle encoding;
> Jumbo prefixes;
> ...

> But, maybe otherwise could have justified slightly bigger immediate
> fields, dunno.

>>> While poking at it, did go and add a check to exclude large
>>> struct-copy operations from predication, as it is slower to turn a
>>> large struct copy into NOPs than to branch over it.
>>
>>> Did end up leaving struct-copies where sz<=64 as allowed though (where
>>> a 64 byte copy at least has the merit of achieving full pipeline
>>> saturation and being roughly break-even with a branch-miss, whereas a
>>> 128 byte copy would cost roughly twice as much as a branch miss).
>>
>> I decided to bite the bullet and have LDM, STM and MM so the compiler does
>> not have to do any analysis. This puts the onus on the memory unit designer
>> to process these at least as fast as a series of LDs and STs. Done right
>> this saves ~40%of the power of the caches avoiding ~70% of tag accesses
>> and 90% of TLB accesses. You access the tag only when/after crossing a
>> line boundary and you access TLB only after crossing a page boundary.

> OK.

> In my case, it was more a case of noting that sliding over, say, 1kB
> worth of memory loads/stores, is slower than branching around it.

This is why My 66000 predication has use limits. Once you can get where
you want faster with a branch, then a branch is what you should use.
I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
cycle and that the minimum DECODE time is 2 cycles, that Predication
wins when the number of instructions <= FWidth × Dcycles = 8.
Use predication and save cycles by not disrupting the front end.
Use branching and save cycles by disrupting the front end.

>>
>> All I had to do was to make the second predication overwrite the first
>> predication's mask, and the compiler did the rest.

> Not so simple in my case, but the hardware is simpler, since it just
> cares about the state of 1 bit (which is explicitly saved/restored along
> with the rest of the status-register if an interrupt occurs).

Simpler than 8-flip flops used as a shift right register ??

Re: Microarch Club

<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38190&group=comp.arch#38190

  copy link   Newsgroups: comp.arch
Date: Wed, 27 Mar 2024 21:14:01 +0000
Subject: Re: Microarch Club
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$pXuL8EOenb6LTFUdulQC2eQVT/qctqSuo.6/JpcU22r9NniDVZ8gm
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com> <uu1op0$30i4b$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
 by: MitchAlsup1 - Wed, 27 Mar 2024 21:14 UTC

BGB wrote:

> On 3/26/2024 5:27 PM, Michael S wrote:
>>
>>
>> For slightly less then 20 years ARM managed OK without integer divide.
>> Then in 2004 they added integer divide instruction in ARMv7 (including
>> ARMv7-M variant intended for small microcontroller cores like
>> Cortex-M3) and for the following 20 years instead of merely OK they are
>> doing great :-)
>>

> OK.

The point is they are doing better now after adding IDIV and FDIV.

> I think both modern ARM and AMD Zen went over to "actually fast" integer
> divide.

> I think for a long time, the de-facto integer divide was ~ 36-40 cycles
> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
> can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division (including
SRT variants).

What I don't get is the reluctance for using the FP multiplier as a fast
divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
and with special casing of leading 1s and 0s average around 10-cycles ???

I submit that at 10-cycles for average latency, the need to invent screwy
forms of even faster division fall by the wayside {accurate or not}.

NOTE well:: The size of the FMUL (or FMAC) unit does increase, but its
increase is less than that of an STR divisor unit.

> On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
> This works out faster than a generic shift-subtract divider (or using a
> runtime call which then sorts out what to do).

This is because you are using a linear iteration, try using a quadratic
convergent iteration instead. OH but you CAN'T because your multiplier
tree does not give accurate lower order bits.

> A special case allows turning small divisors internally into
> divide-by-reciprocal, which allows for a 3-cycle divide special case.
> But, this is a LUT cost tradeoff.

> It could be possible in theory to support a general 3-cycle integer
> divide, albeit if one can accept inexact results (would be faster than
> the software-based lookup table strategy).

> But, it is debatable. Pure minimalism would likely favor leaving out
> divide (and a bunch of other stuff). Usual rationale being, say, to try
> to fit the entire ISA listing on a single page of paper or similar (vs
> having a listing with several hundred defined encodings).

> Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
> listings that are considerably larger (thousands of encodings).

> ....

Re: Microarch Club

<Az0NN.724623$xHn7.37631@fx14.iad>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38192&group=comp.arch#38192

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.quux.org!tncsrv06.tnetconsulting.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx14.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Microarch Club
Newsgroups: comp.arch
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com> <uu1op0$30i4b$1@dont-email.me> <c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
Lines: 40
Message-ID: <Az0NN.724623$xHn7.37631@fx14.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Wed, 27 Mar 2024 21:53:36 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 27 Mar 2024 21:53:36 GMT
X-Received-Bytes: 2413
 by: Scott Lurndal - Wed, 27 Mar 2024 21:53 UTC

mitchalsup@aol.com (MitchAlsup1) writes:
>BGB wrote:
>
>> On 3/26/2024 5:27 PM, Michael S wrote:
>>>
>>>
>>> For slightly less then 20 years ARM managed OK without integer divide.
>>> Then in 2004 they added integer divide instruction in ARMv7 (including
>>> ARMv7-M variant intended for small microcontroller cores like
>>> Cortex-M3) and for the following 20 years instead of merely OK they are
>>> doing great :-)
>>>
>
>> OK.
>
>The point is they are doing better now after adding IDIV and FDIV.
>
>> I think both modern ARM and AMD Zen went over to "actually fast" integer
>> divide.
>
>> I think for a long time, the de-facto integer divide was ~ 36-40 cycles
>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
>> can get from a shift-add unit.
>
>While those numbers are acceptable for shift-subtract division (including
>SRT variants).
>
>What I don't get is the reluctance for using the FP multiplier as a fast
>divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
>FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
>and with special casing of leading 1s and 0s average around 10-cycles ???

Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
(where s is the number of significant digits in the quotient).

https://www.quinapalus.com/cm7cycles.html

>
>I submit that at 10-cycles for average latency, the need to invent screwy
>forms of even faster division fall by the wayside {accurate or not}.

Re: Microarch Club

<20240328020605.00002786@yahoo.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38193&group=comp.arch#38193

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Thu, 28 Mar 2024 01:06:05 +0200
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <20240328020605.00002786@yahoo.com>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me>
<20240327012715.0000125c@yahoo.com>
<uu1op0$30i4b$1@dont-email.me>
<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 27 Mar 2024 23:06:09 +0100 (CET)
Injection-Info: dont-email.me; posting-host="42b24005b54905b687cc0e6ed6e01981";
logging-data="3244652"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/rOCuHih1lnFOPYDVK5wGKXmfd8MQIhcE="
Cancel-Lock: sha1:2jPZbu+MtR6L5fIRGiMgHPNoytI=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Wed, 27 Mar 2024 23:06 UTC

On Wed, 27 Mar 2024 21:14:01 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
>
> What I don't get is the reluctance for using the FP multiplier as a
> fast divisor (IBM 360/91). AMD Opteron used this means to achieve
> 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under
> 20-cycles ?? and with special casing of leading 1s and 0s average
> around 10-cycles ???
>
> I submit that at 10-cycles for average latency, the need to invent
> screwy forms of even faster division fall by the wayside {accurate or
> not}.
>

All modern performance-oriented cores from Intel, AMD, ARM and Apple
have fast integer dividers and typically even faster FP dividers.
The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake
(launched in 2015) and AMD Zen2 (launched in 2019), but the later is
slow only in the worst case, the best case is o.k.
I'd guess, when Skylake was designed nobody at Intel could imagine that
it and its variations would be manufactured in huge volumes up to 2021.

Re: Microarch Club

<4f63a339527a85e67bcd85c6f5388bfa@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38194&group=comp.arch#38194

  copy link   Newsgroups: comp.arch
Date: Wed, 27 Mar 2024 23:11:34 +0000
Subject: Re: Microarch Club
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$eyg1twdVoEdwLYVO5GYvm.5667cnd2c/EIFm4YAt2FF5FWGA.70Jm
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com> <uu1op0$30i4b$1@dont-email.me> <c7065593299c0defd89eaac999e79bbb@www.novabbs.org> <Az0NN.724623$xHn7.37631@fx14.iad>
Organization: Rocksolid Light
Message-ID: <4f63a339527a85e67bcd85c6f5388bfa@www.novabbs.org>
 by: MitchAlsup1 - Wed, 27 Mar 2024 23:11 UTC

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup1) writes:
>>BGB wrote:
>>
>>> On 3/26/2024 5:27 PM, Michael S wrote:
>>>>
>>>>
>>>> For slightly less then 20 years ARM managed OK without integer divide.
>>>> Then in 2004 they added integer divide instruction in ARMv7 (including
>>>> ARMv7-M variant intended for small microcontroller cores like
>>>> Cortex-M3) and for the following 20 years instead of merely OK they are
>>>> doing great :-)
>>>>
>>
>>> OK.
>>
>>The point is they are doing better now after adding IDIV and FDIV.
>>
>>> I think both modern ARM and AMD Zen went over to "actually fast" integer
>>> divide.
>>
>>> I think for a long time, the de-facto integer divide was ~ 36-40 cycles
>>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
>>> can get from a shift-add unit.
>>
>>While those numbers are acceptable for shift-subtract division (including
>>SRT variants).
>>
>>What I don't get is the reluctance for using the FP multiplier as a fast
>>divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
>>FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
>>and with special casing of leading 1s and 0s average around 10-cycles ???

> Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
> (where s is the number of significant digits in the quotient).

I submit that a 5+2×ln8(s) is faster still.
32-bits = 15 cycles <not so much faster>
64-bits = 17 cycles <A lot faster>

{Log base 8, where one uses Newton-Raphson or Goldschmidt to get 8 significant
digits (9.2 bits are correct) and double the significant bits each iteration
(2-cycles). }

5 comes from looking at numerator and denominator to find the first bit of
significance, and then shifting numerator and denominator so that the FDIV
algorithm can work.

> https://www.quinapalus.com/cm7cycles.html

>>
>>I submit that at 10-cycles for average latency, the need to invent screwy
>>forms of even faster division fall by the wayside {accurate or not}.

Re: Microarch Club

<uu39sg$3fb7n$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38197&group=comp.arch#38197

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Thu, 28 Mar 2024 09:31:11 +0100
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <uu39sg$3fb7n$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com>
<uu1op0$30i4b$1@dont-email.me>
<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
<Az0NN.724623$xHn7.37631@fx14.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 28 Mar 2024 08:31:12 +0100 (CET)
Injection-Info: dont-email.me; posting-host="e729d923a516023ee51249bcca99f97d";
logging-data="3648759"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/stoWszxzahYOUUAUNCxq98LEMN+nAQwkI1z/EA8YExw=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18.1
Cancel-Lock: sha1:qUsJthcDOaI/jlPdGgGXglPtA88=
In-Reply-To: <Az0NN.724623$xHn7.37631@fx14.iad>
 by: Terje Mathisen - Thu, 28 Mar 2024 08:31 UTC

Scott Lurndal wrote:
> mitchalsup@aol.com (MitchAlsup1) writes:
>> BGB wrote:
>>
>>> On 3/26/2024 5:27 PM, Michael S wrote:
>>>>
>>>>
>>>> For slightly less then 20 years ARM managed OK without integer divide.
>>>> Then in 2004 they added integer divide instruction in ARMv7 (including
>>>> ARMv7-M variant intended for small microcontroller cores like
>>>> Cortex-M3) and for the following 20 years instead of merely OK they are
>>>> doing great :-)
>>>>
>>
>>> OK.
>>
>> The point is they are doing better now after adding IDIV and FDIV.
>>
>>> I think both modern ARM and AMD Zen went over to "actually fast" integer
>>> divide.
>>
>>> I think for a long time, the de-facto integer divide was ~ 36-40 cycles
>>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
>>> can get from a shift-add unit.
>>
>> While those numbers are acceptable for shift-subtract division (including
>> SRT variants).
>>
>> What I don't get is the reluctance for using the FP multiplier as a fast
>> divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
>> FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
>> and with special casing of leading 1s and 0s average around 10-cycles ???
>
> Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
> (where s is the number of significant digits in the quotient).
>
> https://www.quinapalus.com/cm7cycles.html

That looks a lot like an SRT divisor with early out?

Having variable timing DIV means that any crypto operating (including
hashes?) where you use modulo operations, said modulus _must_ be a known
constant, otherwise information about will leak from the timings, right?
>
>>
>> I submit that at 10-cycles for average latency, the need to invent screwy
>> forms of even faster division fall by the wayside {accurate or not}.

I agree.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Microarch Club

<GMeNN.99575$24ld.73090@fx07.iad>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38198&group=comp.arch#38198

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.neodome.net!npeer.as286.net!npeer-ng0.as286.net!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx07.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Microarch Club
Newsgroups: comp.arch
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com> <uu1op0$30i4b$1@dont-email.me> <c7065593299c0defd89eaac999e79bbb@www.novabbs.org> <Az0NN.724623$xHn7.37631@fx14.iad> <uu39sg$3fb7n$1@dont-email.me>
Lines: 50
Message-ID: <GMeNN.99575$24ld.73090@fx07.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Thu, 28 Mar 2024 14:03:18 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Thu, 28 Mar 2024 14:03:18 GMT
X-Received-Bytes: 3008
 by: Scott Lurndal - Thu, 28 Mar 2024 14:03 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>Scott Lurndal wrote:
>> mitchalsup@aol.com (MitchAlsup1) writes:
>>> BGB wrote:
>>>
>>>> On 3/26/2024 5:27 PM, Michael S wrote:
>>>>>
>>>>>
>>>>> For slightly less then 20 years ARM managed OK without integer divide.
>>>>> Then in 2004 they added integer divide instruction in ARMv7 (including
>>>>> ARMv7-M variant intended for small microcontroller cores like
>>>>> Cortex-M3) and for the following 20 years instead of merely OK they are
>>>>> doing great :-)
>>>>>
>>>
>>>> OK.
>>>
>>> The point is they are doing better now after adding IDIV and FDIV.
>>>
>>>> I think both modern ARM and AMD Zen went over to "actually fast" integer
>>>> divide.
>>>
>>>> I think for a long time, the de-facto integer divide was ~ 36-40 cycles
>>>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
>>>> can get from a shift-add unit.
>>>
>>> While those numbers are acceptable for shift-subtract division (including
>>> SRT variants).
>>>
>>> What I don't get is the reluctance for using the FP multiplier as a fast
>>> divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
>>> FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
>>> and with special casing of leading 1s and 0s average around 10-cycles ???
>>
>> Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
>> (where s is the number of significant digits in the quotient).
>>
>> https://www.quinapalus.com/cm7cycles.html
>
>That looks a lot like an SRT divisor with early out?
>
>Having variable timing DIV means that any crypto operating (including
>hashes?) where you use modulo operations, said modulus _must_ be a known
>constant, otherwise information about will leak from the timings, right?

Perhaps, yet I suspect that the m7 isn't generally used for crypto,
nor used in an SMP implementation where an observer can monitor
a shared cache.

Re: Microarch Club

<20240328233841.00007f41@yahoo.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38201&group=comp.arch#38201

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Thu, 28 Mar 2024 22:38:41 +0200
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <20240328233841.00007f41@yahoo.com>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me>
<20240327012715.0000125c@yahoo.com>
<uu1op0$30i4b$1@dont-email.me>
<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
<Az0NN.724623$xHn7.37631@fx14.iad>
<uu39sg$3fb7n$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 28 Mar 2024 20:38:46 +0100 (CET)
Injection-Info: dont-email.me; posting-host="f7a00dabb4786e83c68ae0e574964a78";
logging-data="3992280"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/RJz2vHuOfZJ4FZAIm9NK4xo21mS6PWno="
Cancel-Lock: sha1:AM5wk8+7p4JeED01sli4NxQ+Q7A=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Thu, 28 Mar 2024 20:38 UTC

On Thu, 28 Mar 2024 09:31:11 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Scott Lurndal wrote:
> > mitchalsup@aol.com (MitchAlsup1) writes:
> >> BGB wrote:
> >>
> >>> On 3/26/2024 5:27 PM, Michael S wrote:
> >>>>
> >>>>
> >>>> For slightly less then 20 years ARM managed OK without integer
> >>>> divide. Then in 2004 they added integer divide instruction in
> >>>> ARMv7 (including ARMv7-M variant intended for small
> >>>> microcontroller cores like Cortex-M3) and for the following 20
> >>>> years instead of merely OK they are doing great :-)
> >>>>
> >>
> >>> OK.
> >>
> >> The point is they are doing better now after adding IDIV and FDIV.
> >>
> >>> I think both modern ARM and AMD Zen went over to "actually fast"
> >>> integer divide.
> >>
> >>> I think for a long time, the de-facto integer divide was ~ 36-40
> >>> cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
> >>> on-par with what I can get from a shift-add unit.
> >>
> >> While those numbers are acceptable for shift-subtract division
> >> (including SRT variants).
> >>
> >> What I don't get is the reluctance for using the FP multiplier as
> >> a fast divisor (IBM 360/91). AMD Opteron used this means to
> >> achieve 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV
> >> not be under 20-cycles ?? and with special casing of leading 1s
> >> and 0s average around 10-cycles ???
> >
> > Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
> > cycles (where s is the number of significant digits in the
> > quotient).
> >
> > https://www.quinapalus.com/cm7cycles.html
>
> That looks a lot like an SRT divisor with early out?
>
> Having variable timing DIV means that any crypto operating (including
> hashes?) where you use modulo operations, said modulus _must_ be a
> known constant, otherwise information about will leak from the
> timings, right?

Are you aware of any professional crypto algorithm, including hashes,
that uses modulo operations by modulo that is neither power-of-two nor
at least 192-bit wide?

Re: Microarch Club

<uu6cp0$9s5h$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38203&group=comp.arch#38203

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Fri, 29 Mar 2024 13:38:55 +0100
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <uu6cp0$9s5h$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com>
<uu1op0$30i4b$1@dont-email.me>
<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
<Az0NN.724623$xHn7.37631@fx14.iad> <uu39sg$3fb7n$1@dont-email.me>
<20240328233841.00007f41@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 29 Mar 2024 12:38:56 +0100 (CET)
Injection-Info: dont-email.me; posting-host="5ef041179388cbe579ddcdf75eae0f68";
logging-data="323761"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/VJzHP1nrb0l8OoKkWKAR65SnfKyeDHbYoMcSMMXjcEw=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18.1
Cancel-Lock: sha1:kisiEA3Bacqi1sqMLs+M0jk0MR4=
In-Reply-To: <20240328233841.00007f41@yahoo.com>
 by: Terje Mathisen - Fri, 29 Mar 2024 12:38 UTC

Michael S wrote:
> On Thu, 28 Mar 2024 09:31:11 +0100
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>> Scott Lurndal wrote:
>>> mitchalsup@aol.com (MitchAlsup1) writes:
>>>> BGB wrote:
>>>>
>>>>> On 3/26/2024 5:27 PM, Michael S wrote:
>>>>>>
>>>>>>
>>>>>> For slightly less then 20 years ARM managed OK without integer
>>>>>> divide. Then in 2004 they added integer divide instruction in
>>>>>> ARMv7 (including ARMv7-M variant intended for small
>>>>>> microcontroller cores like Cortex-M3) and for the following 20
>>>>>> years instead of merely OK they are doing great :-)
>>>>>>
>>>>
>>>>> OK.
>>>>
>>>> The point is they are doing better now after adding IDIV and FDIV.
>>>>
>>>>> I think both modern ARM and AMD Zen went over to "actually fast"
>>>>> integer divide.
>>>>
>>>>> I think for a long time, the de-facto integer divide was ~ 36-40
>>>>> cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
>>>>> on-par with what I can get from a shift-add unit.
>>>>
>>>> While those numbers are acceptable for shift-subtract division
>>>> (including SRT variants).
>>>>
>>>> What I don't get is the reluctance for using the FP multiplier as
>>>> a fast divisor (IBM 360/91). AMD Opteron used this means to
>>>> achieve 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV
>>>> not be under 20-cycles ?? and with special casing of leading 1s
>>>> and 0s average around 10-cycles ???
>>>
>>> Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
>>> cycles (where s is the number of significant digits in the
>>> quotient).
>>>
>>> https://www.quinapalus.com/cm7cycles.html
>>
>> That looks a lot like an SRT divisor with early out?
>>
>> Having variable timing DIV means that any crypto operating (including
>> hashes?) where you use modulo operations, said modulus _must_ be a
>> known constant, otherwise information about will leak from the
>> timings, right?
>
> Are you aware of any professional crypto algorithm, including hashes,
> that uses modulo operations by modulo that is neither power-of-two nor
> at least 192-bit wide?

I was involved with the optimization of DFC, the AES condidate from CERN:

It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
and that resulted in a very simple reciprocal, i.e. no need for a DIV
opcode.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Microarch Club

<20240329173858.0000091d@yahoo.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38205&group=comp.arch#38205

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Fri, 29 Mar 2024 16:38:58 +0200
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <20240329173858.0000091d@yahoo.com>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me>
<20240327012715.0000125c@yahoo.com>
<uu1op0$30i4b$1@dont-email.me>
<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
<Az0NN.724623$xHn7.37631@fx14.iad>
<uu39sg$3fb7n$1@dont-email.me>
<20240328233841.00007f41@yahoo.com>
<uu6cp0$9s5h$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 29 Mar 2024 14:39:03 +0100 (CET)
Injection-Info: dont-email.me; posting-host="1c8415fced5d144b3aa5dafe4e72b2bb";
logging-data="375542"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18CeJr4zPAIa/FbP8P8B0wRx7rDsX9x8yc="
Cancel-Lock: sha1:ehOFuj6cEj3HWNiOnHbSaQDBLTQ=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Fri, 29 Mar 2024 14:38 UTC

On Fri, 29 Mar 2024 13:38:55 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Michael S wrote:
> > On Thu, 28 Mar 2024 09:31:11 +0100
> > Terje Mathisen <terje.mathisen@tmsw.no> wrote:
> >
> >
> > Are you aware of any professional crypto algorithm, including
> > hashes, that uses modulo operations by modulo that is neither
> > power-of-two nor at least 192-bit wide?
>
> I was involved with the optimization of DFC, the AES condidate from
> CERN:
>
> It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
> and that resulted in a very simple reciprocal, i.e. no need for a DIV
> opcode.
>
> Terje
>

Since DFC lost, I suppose that even ignoring reciprocal optimization
the answer to my question is 'No'.

Re: Microarch Club

<uudtkr$2dam1$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38209&group=comp.arch#38209

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Mon, 1 Apr 2024 04:09:35 -0500
Organization: A noiseless patient Spider
Lines: 513
Message-ID: <uudtkr$2dam1$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me>
<3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
<uu1o2p$30cnr$1@dont-email.me>
<727d2b0fa197393309db7fefd76591e0@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 01 Apr 2024 09:09:48 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="be3554a31be4c56ba5fde27664964ae9";
logging-data="2534081"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1901r1wiR2WLoWsjhNFi52lr9YQgjDgv90="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:krUCesIohUreEj/70Spp4skNqDE=
In-Reply-To: <727d2b0fa197393309db7fefd76591e0@www.novabbs.org>
Content-Language: en-US
 by: BGB - Mon, 1 Apr 2024 09:09 UTC

( Was distracted with other stuff, mostly trying to find and fix bugs. )

On 3/27/2024 4:03 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
>>>
>>> Sometimes I see a::
>>>
>>>      CVTSD     R2,#5
>>>
>>> Where a 5-bit immediate (value = 5) is converted into 5.0D0 and
>>> placed in register R2 so it can be accesses as an argument in the
>>> subroutine call
>>> to happen in a few instructions.
>>>
>
>> I had looked into, say:
>>    FADD Rm, Imm5fp, Rn
>> Where, despite Imm5fp being severely limited, it had an OK hit rate.
>
>> Unpacking imm5fp to Binary16 being, essentially:
>>    aee.fff -> 0.aAAee.fff0000000
>

Self correction: aee.ff

> realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
> used fp constants--but does include 0, 1, 2, and 10. Also, realistically
> the missing cases are the 0.5s.
>

This scheme had (rounded):
* 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
* 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
* 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
* 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
* 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
* 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
* 110xx: 4800, 4900, 4A00, 4B00 ( 8.000, 10.000, 12.000, 14.000)
* 111xx: 4C00, 4D00, 4E00, 4F00 (16.000, 20.000, 24.000, 28.000)

Of the options evaluated, this seemed to get the best general-case hit rate.

>> OTOH, can note that a majority of typical floating point constants can
>> be represented exactly in Binary16 (well, excluding "0.1" or similar),
>> so it works OK as an immediate format.
>
>> This allows a single 32-bit op to be used for constant loads
>> (nevermind if one needs a 96 bit encoding for 0.1, or PI, or ...).
>
>
>>> Mostly, a floating point immediate is available from a 32-bit constant
>>> container. When accesses in a float calculation it is used as IEEE32
>>> when accessed by a 6double calculation IEEE32->IEEE64 promotion is
>>> performed in the constant delivery path. So, one can use almost any
>>> floating point constant that is representable in float as a double
>>> without eating cycles and while saving code footprint.
>>>
>
>> Don't currently have the encoding space for this.
>
>> Could in theory pull off truncated Binary32 an Imm29s form, but not
>> likely worth it. Would also require putting a converted in the ID2
>> stage, so not free.
>
>> In this case, the issue is more one of LUT cost to support these cases.
>
>
>>>> For FPU compare with zero, can almost leverage the integer compare
>>>> ops, apart from the annoying edge cases of -0.0 and NaN leading to
>>>> "not strictly equivalent" behavior (though, an ASM programmer could
>>>> more easily get away with this). But, not common enough to justify
>>>> adding FPU specific ops for this.
>>>
>>> Actually, the edge/noise cases are not that many gates.
>>> a) once you are separating out NaNs, infinities are free !!
>>> b) once you are checking denorms for zero, infinites become free !!
>>>
>>> Having structured a Compare-to-zero circuit based on the fields in
>>> double;
>>> You can compose the terns to do all signed and unsigned integers and get
>>> a gate count, then the number of gates you add to cover all 10 cases
>>> of floating point is 12% gate count over the simple integer version.
>>> Also
>>> note:: this circuit is about 10% of the gate count of an integer adder.
>>>
>
>> I could add them, but, is it worth it?...
>
> Whether to add them or not is on you.
> I found things like this to be more straw on the camel's back {where the
> camel collapses to a unified register file model.}
>

I already have a unified register file in my ISA (only the RISC-V mode
treats it as two sub-spaces).

But, in this case:
BEQ/BNE/BGT/... almost work with floating-point values, apart from the
annoying edge cases like -0 and similar.

If substituted for floating-point comparison, most things work as
expected, except that Quake ends up with some minor physics glitches
(*1) and similar (mostly due to the occasional appearance of -0, and it
not comparing equal to +0 in this case).

*1: Most commonly in the form that sliding a bounding box along the
ground will occasionally encounter seams where it gets stuck and can't
slide across (so the player would need to jump over these seams).

>> In this case, it is more a question of encoding space than logic cost.
>
>> It is semi-common in FP terms, but likely not common enough to justify
>> dedicated compare-and-branch ops and similar (vs the minor annoyance
>> at the integer ops not quite working correctly due to edge cases).
>
> My model requires about ½ the instruction count when processing FP
> comparisons compared to RISC-V (big, no; around 5% in FP code and 0
> elsewhere.} Where it wins big is compare against a non-zero FP constant.
> My 66000 uses 1 instructions {FCMP, BB} whereas RISC=V uses 4 {AUPIC,
> LD, FCMP, BC}
>

At present, options are:
FLDCH + FCMPxx (2R) + ( BT/BF | MOVT/MOVNT )
FCMPxx (2RI) + ( BT/BF | MOVT/MOVNT )

The latter currently requires the FpImm extension.

The integer ops currently have some additional denser cases:
Compare and Branch
Compare-Zero and Branch
Compare and Set (3R and small 3RI).

However, adding these cases for FP would eat encoding space, and
floating-point compare is less common than integer compare.

Granted, it does look like FCMPGT is around 0.7% of the clock-cycle
budget in GLQuake, so isn't too far down on the list. Would need to
evaluate whether is is more commonly being used for branching or as a
predicate or similar, and the relative amount of compare-with-zero vs
comparing two values.

Can note that FCMPEQ is much further down the list, at 0.03%.

>
>>> -----------------------
>>>
>>>> Seems that generally 0 still isn't quite common enough to justify
>>>> having one register fewer for variables though (or to have a
>>>> designated zero register), but otherwise it seems there is not much
>>>> to justify trying to exclude the "implicit zero" ops from the ISA
>>>> listing.
>>>
>>>
>>> It is common enough,
>>> But there are lots of ways to get a zero where you want it for a return.
>>>
>
>> I think the main use case for a zero register is mostly that it allows
>> using it as a special case for pseudo-ops. I guess, not quite the same
>> if it is a normal GPR that just so happens to be 0.
>
>> Recently ended up fixing a bug where:
>>    y=-x;
>> Was misbehaving with "unsigned int":
>>    "NEG" produces a value which falls outside of UInt range;
>>    But, "NEG; EXTU.L" is a 2-op sequence.
>>    It had the EXT for SB/UB/SW/UW, but not for UL.
>>      For SL, bare NEG almost works, apart from ((-1)<<31).
>> Could encode it as:
>>    SUBU.L  Zero, Rs, Rn
>
>     ADD  Rn,#0,-Rs
>
> But notice::
>     y = -x;
>     a = b + y;
> can be performed as if it had been written::
>     y = -x;
>     a = b + (-x);
> Which is encoded as::
>
>     ADD  Ry,#0,-Rx
>     ADD  Ra,Rb,-Rx
>
>> But, without a zero register,
>
> #0 is not a register, but its value is 0x0000000000000000 anyway.
>
> You missed the point entirely, if you can get easy access to #0
> then you no longer need a register to hold this simple bit pattern.
> In fact a large portion of My 66000 ISA over RISC-V comes from this
> mechanism.
>

OK.

I did end up switching "-x" for Int and UInt over to ADDS.L and ADDU.L
with the zero in a register. Noted that on-average, this is still faster
than "NEG + EXTx.L" even in the absence of a zero register.

But, then ended up fighting with a bunch of compiler bugs, as some
recent behavioral tweaks in the compiler had started stepping on a bunch
of buggy cases:
Some instructions which were invalid as prefixes were being set as
prefixes in the WEXifier;
The edge case for expanding Op32 to Op64 was mangling large immediate
values (mostly related to cases dealing with encoding R32..R63 in
Baseline mode);
The handling for "tiny leaf functions" was incorrectly allowing
functions to be compiled as tiny-leaf functions in cases where they had
a by-value structure type on the local frame (in this case, the local
struct address would end up as "whatever value was already in the
register", potentially stomping stuff elsewhere in memory, *1).


Click here to read the complete article
Re: Microarch Club

<uuhsci$3ef1d$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38210&group=comp.arch#38210

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Tue, 2 Apr 2024 16:12:49 -0500
Organization: A noiseless patient Spider
Lines: 141
Message-ID: <uuhsci$3ef1d$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com>
<uu1op0$30i4b$1@dont-email.me>
<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
<20240328020605.00002786@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 02 Apr 2024 21:12:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="4f2931bce06da15b5a77aaf780ad944a";
logging-data="3619885"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Ats0Hu8N0juDUKt0YLEtvXINKuQrf0CE="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:3U5z8jLxyRcxNiDdQ0hGQvVBZ+c=
Content-Language: en-US
In-Reply-To: <20240328020605.00002786@yahoo.com>
 by: BGB-Alt - Tue, 2 Apr 2024 21:12 UTC

On 3/27/2024 6:06 PM, Michael S wrote:
> On Wed, 27 Mar 2024 21:14:01 +0000
> mitchalsup@aol.com (MitchAlsup1) wrote:
>>
>> What I don't get is the reluctance for using the FP multiplier as a
>> fast divisor (IBM 360/91). AMD Opteron used this means to achieve
>> 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under
>> 20-cycles ?? and with special casing of leading 1s and 0s average
>> around 10-cycles ???
>>
>> I submit that at 10-cycles for average latency, the need to invent
>> screwy forms of even faster division fall by the wayside {accurate or
>> not}.
>>
>
> All modern performance-oriented cores from Intel, AMD, ARM and Apple
> have fast integer dividers and typically even faster FP dividers.
> The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake
> (launched in 2015) and AMD Zen2 (launched in 2019), but the later is
> slow only in the worst case, the best case is o.k.
> I'd guess, when Skylake was designed nobody at Intel could imagine that
> it and its variations would be manufactured in huge volumes up to 2021.
>

FWIW: Newest CPU that I have is Zen+ ...
A lot of the other stuff around (excluding a few RasPi's I have, is
stuff from the mid-to-late 2000's).

But, yeah, I guess fast dividers are possible, question I guess is more
whether they can be done cheaply enough for cores where one would
actually care about things like the costs of instruction decoding?...

Otherwise:

But, yeah, at the moment, apart from debugging, had gotten around to
starting to work on some more of the GUI related stuff:
Added support for drawing window bitmaps using 256 and 16 color graphics;
Starting working on a "toolkit" for GUI widget drawing;
Also starting to add stuff for scalable / SDF font rendering.

For UI widgets:
Idea is that the widgets will be primarily drawn into a bitmap buffer
that will be mapped to the window (lower level window management will
not know or care about UI widgets);
Generally, I am breaking the UI into "forms" which will be applied to
the windows, with the information about the "form" and its contained
widgets being handled separately from any state associated with those
widgets (unlike GTK, which tends to keep the widget definition and state
in the same object).

Likewise, also generally approaching things in a way that "makes sense"
for C (IOW: not the struct-casting mess that GTK had used). Also in the
current design, the idea is that UI elements will be sized and
positioned explicitly using window pixel coordinates (scaling UI based
on window resizing may possibly be treated as more of an application
concern, but could maybe have "layout helper" logic for this, TBD).

I have yet to decide on the default bit-depth for UI widgets, mostly
torn between 16-color and 256-color. Probably don't need high bit-depths
for "text and widgets" windows, but may need more than 16-color. Also
TBD which color palette to use in the case if 256 color is used (the
palette used for the lower-level parts of the GUI works OK, but requires
a lookup table to convert RGB555 to indexed color, don't want to
duplicate this table for every program; and don't necessarily need to
force programs to all use the same palette, even though this would be
better in a color-fidelity sense). Though, for UI widgets it almost
makes most sense to instead use a palette based around RGB222 (64 color)
or similar.

Say:
00..0F: RGBI colors
10..1F: Grayscale
20..3F: Dunno
40..7F: RGB222 colors
80..FE: Dunno
FF: Transparent
Well, or maybe reuse a variant of the 216 color "web safe" palette:
00..0F: RGBI colors
10..1F: Grayscale
20..FD: 216 colors
FE: Dunno
FF: Transparent

Well, unless RGBI is sufficient (where, in this mode, the
high-intensity-magenta color was interpreted as transparent).

May also make sense to either overlay UI elements with different
bit-depths, and/or treat the UI widgets as an overlay (though, this
could have its own issues, say, if the background image is redrawn and
the UI overlaps part of, the UI will also need to be redrawn). And/or
render the UI bitmap at a higher bit-depth (such as hi-color) if higher
bit-depth elements are present (say, if there is a widget representing a
"Bitmap object" or similar). In the "overlaid UI widgets" scenario,
would likely make most sense to draw the parts showing the underlying
image as a dedicated transparent color.

Though, non-overlap would be preferable in the case of things like video
playback or similar.

For scalable fonts, current idea is mostly to use SDF fonts encoded as
256-color BMP images, where the X/Y distances are encoded as 4 bits
each. The color palette is kinda redundant in this case, but doesn't add
too much additional storage overhead to a group of 16x16 character cells
(say, representing CP-1252 with each glyph encoded as 16x16 pixels).

Partly this is because SDF font rendering seems technically simpler than
dealing with TrueType fonts (and was also used in some of my prior 3D
engine projects; though with the alteration of representing the SDF's as
256-color BMP images rather than TGA or similar).

Currently no plans to address the rest of Unicode with this though (and
it would take around 17MB to represent the entire Unicode BMP with this
scheme).

In the past, had used Unifont, which was encoded in a form that stores
each glyph as 16x16 at 1bpp; basically works, but no good way to scale
the glyphs in this case without them looking terrible (well, absent
upscaling them, possibly smoothing the edges, and transcoding into SDF
form).

But, no really a good algorithmic way to figure out which corners need
to be sharp and which should be rounded. Best option is likely to look
at blocks of, say, 3x3 or 5x5 pixels, and then make a guess based on the
pattern how the pixel in the center should expand from 1 pixel to 2x2
(which is applied iteratively to double the resolution of each glyph).

Though, 3x3 is small enough to make it viable to figure out a lookup
table by hand. Well, or generate a bunch of images of large-font text
and try to "mine" the patterns from the images (and then use these
patterns to hopefully upsample Unifont glyphs to 64x64 or similar, as
needed for the 16x16 SDF generation...).

Any thoughts on any of this?...

....

Re: Microarch Club

<e5cced24b1badb2f3c7662bd9849216f@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38239&group=comp.arch#38239

  copy link   Newsgroups: comp.arch
Date: Fri, 5 Apr 2024 21:43:08 +0000
Subject: Re: Microarch Club
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$gjLKKf04MeUB4oGS/Kr97OP5NeW6qbdtJOtUPNVwcHCoix/Glzjr6
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com> <uu1op0$30i4b$1@dont-email.me> <c7065593299c0defd89eaac999e79bbb@www.novabbs.org> <20240328020605.00002786@yahoo.com> <uuhsci$3ef1d$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <e5cced24b1badb2f3c7662bd9849216f@www.novabbs.org>
 by: MitchAlsup1 - Fri, 5 Apr 2024 21:43 UTC

BGB-Alt wrote:

>

> I have yet to decide on the default bit-depth for UI widgets, mostly
> torn between 16-color and 256-color. Probably don't need high bit-depths
> for "text and widgets" windows, but may need more than 16-color. Also

When I write documents I have a favorite set of pastel colors I use to
shade boxes, arrows, and text. I draw the figures first and then fill in
the text later. To give the eye an easy means to go from reading the text
to looking at a figure and finding the discussed item, I place a box around
the text and shade the box with the same R-G-B color of the pre.jpg box in
the figure. All figures are *.jpg. So, even a word processor needs 24-bit
color.

{{I have also found that Word R-G-B color values are not exactly the same
as the R-G-V values in the *.jpg figure, too; but they are close enough
to avoid "figuring it out and trying to fix it to perfection.}}

> TBD which color palette to use in the case if 256 color is used (the

256 is OK enough for DOOM games, but not for professional documentation.

>

Re: Microarch Club

<uuqqtk$1uf2v$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38240&group=comp.arch#38240

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Sat, 6 Apr 2024 01:42:58 -0500
Organization: A noiseless patient Spider
Lines: 214
Message-ID: <uuqqtk$1uf2v$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
<utsrft$1b76a$1@dont-email.me>
<80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
<uttfk3$1j3o3$1@dont-email.me>
<c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
<utvggu$2cgkl$1@dont-email.me> <20240327012715.0000125c@yahoo.com>
<uu1op0$30i4b$1@dont-email.me>
<c7065593299c0defd89eaac999e79bbb@www.novabbs.org>
<20240328020605.00002786@yahoo.com> <uuhsci$3ef1d$1@dont-email.me>
<e5cced24b1badb2f3c7662bd9849216f@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 06 Apr 2024 06:43:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="90cf0e11fd90a70ba9adabba2be00f66";
logging-data="2047071"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19BCU3w2Yk05r5Ujny7tZaqnibtbOa+874="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:vfpbWX9kZJ96ob7IkNfrVABg2tI=
In-Reply-To: <e5cced24b1badb2f3c7662bd9849216f@www.novabbs.org>
Content-Language: en-US
 by: BGB - Sat, 6 Apr 2024 06:42 UTC

On 4/5/2024 4:43 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>>
>
>> I have yet to decide on the default bit-depth for UI widgets, mostly
>> torn between 16-color and 256-color. Probably don't need high
>> bit-depths for "text and widgets" windows, but may need more than
>> 16-color. Also
>
> When I write documents I have a favorite set of pastel colors I use to
> shade boxes, arrows, and text. I draw the figures first and then fill in
> the text later. To give the eye an easy means to go from reading the
> text to looking at a figure and finding the discussed item, I place a
> box around
> the text and shade the box with the same R-G-B color of the pre.jpg box
> in the figure. All figures are *.jpg. So, even a word processor needs
> 24-bit color.
>
> {{I have also found that Word R-G-B color values are not exactly the same
> as the R-G-V values in the *.jpg figure, too; but they are close enough
> to avoid "figuring it out and trying to fix it to perfection.}}
>
>> TBD which color palette to use in the case if 256 color is used (the
>
> 256 is OK enough for DOOM games, but not for professional documentation.
>

If the palette is set up well, one can approximate most RGB colors
semi-passably via the palette. Granted, still not great for photos or
video (or even games); and a tradeoff needs to be made between color
fidelity and levels of brightness (with the system palette I was using
favoring brightness accuracy over color fidelity).

Where games only really look good if the color palette used by the
system roughly matches that used by the game. The palette I had used, as
can be noted, was 14 levels of 18 colors:
6 high-saturation (3-bit RGB);
6 low saturation;
3 off-white;
grayscale (16 levels);
orange;
azure.

Though, not really a fast/accurate procedural transform, as the palette
is complicated enough to require a lookup table (unlike 216 color or
RGB332).

Has at one point used a palette encoding which had used a 4-bit Y and
the remaining 4 bits effectively encoding a normalized vector in YUV
space, but the current scheme is more efficient. In this older scheme,
palette entries that would represent an out-of-gamut color were
effectively wasted. This strategy worked semi-OK for photo-like
graphics, but worse for other things; encoding logic was intermediate
complexity (effectively built around a highly modified YCbCr transform).

Though, one possible (semi-OK) way to algorithmically encode colors to
the current palette would be:
Calculate the Luma;
Normalize the RGB color into a color vector;
Feed this vector through a fixed RGB222 lookup table;
Encode the color based on which axis was selected.

Though, for most UI widgets, one arguably only needs 4 colors, say:
Black, White, Dark-Gray, Light-Gray.

Add maybe a few other colors for things like title-bars and similar, ...

But, imposing a 16-color RGBI limit may be too restricting (even in the
name of saving memory by representing windows at a lower bit-depth when
a higher bit depth is unnecessary).

If one allows for customizable UI color schemes (like in older versions
of Windows), then RGBI would be insufficient even for simple pointy
clicky apps (leaving 256-color as the minimum option).

Here, 256 colors could make sense. Though, RGB555 would likely be
overkill for most simple pointy clicky GUI apps.

One possibility here could be for the widget toolkit window to
automatically upgrade to a higher bit-depth if it detects the use of
bitmap-object widgets or similar (possibly making the choice of
bit-depth mostly invisible to the program in this case).

Well, or leave it per program, where a program that may use more complex
graphics requests the use of a higher bit-depth, and then say, one that
only does a text-editor or calculator or similar, sticks with a more
limited colorspace to save memory.

Say, looking at Windows Calculator:
Looks like it uses:
~ 5 or 6 shades of gray;
A shade of blue.

Programs of this sort don't really need a higher color depth.
Would a Notepad style program need lots of colors?
...

Granted, something like an image editor would need full-color, so in
this case supporting higher color depths will make sense as well. Just,
one doesn't want to use higher bit-depths when they are not needed, as
this wastes more RAM on the backing framebuffers for the various
windows, etc...

In other news, had been messing with doing SDF versions of Unifont.

Curiously, newer versions (15.1.05) of the Unifont font (post
processing) seem to actually be a little bit smaller than an older
versions (5.1). Not entirely sure what is going on with this (but, can
note that there are a number of large gaps in the codepoint space).

Well, and seemingly multiple versions of the font:
The normal Unicode BMP;
Another 'JP' version which follows JIS rules;
Apparently maps in various characters from Planes 1 and 2, ...
Appears mostly similar apart from the contents of the CJK ranges.
...

Say:
After conversion to a 1bpp glyph set:
5.1 : 1.99 MB
15.1.05: 1.78 MB
After conversion to SDF and storing the bitmaps in a WAD container:
5.1 : ~ 7.0MB
15.1.05: ~ 6.1MB

Where, in this case, WAD based packaging seemed "mostly sufficient".
May or may not end up putting additional metadata in the WAD files.
For now, it is mostly just a collection of bitmap images...

With each image with its number encoded in the lump name, and understood
to be a grid of 16x16 glyphs.

Though, glyph shape reconstruction is less exact than could be hoped.
Arguably, still better than if one tries to scale the bitmaps directly,
but results may come out a little lumpy/weird here (though, trying to
store them at 4-bits per sub-channel probably doesn't exactly help this).

At the moment, storage is basically:
Represent the SDF as a 256-color image, with each pixel split into 2
sub-channels. The palette color is interpreted as the input vector for
the SDF drawing (interpolate between the pixels, take the median, and
compare this value against a threshold).

Initially, was not using the palette, but the palette allows more
flexibility in this case.

The current algorithm generates the SDF images by finding the horizontal
and vertical distances. Some papers imply that it may make sense to
allow more distance-calculation functions, possibly trying to regenerate
each glyph using each combination of functions and picking whichever
combination of functions yields the lowest reconstruction error.

Granted, ended up using a fairly naive up-sampling algorithm (from the
16x16 pixel input glyphs), which possibly doesn't really help (I fiddled
with it until I got something semi-OK looking, but it has a limitation
of not having any idea for what the upsampled glyphs are "supposed" to
look like, what corners should be rounded and which kept sharp and
pointy; and patterns that work for ASCII/8859-1 range don't extend to
the rest of the BMP).

Though, at least for 1252, this range of characters can be handled
manually. And in this case, the upsampled glyphs were based on 8x8 pixel
variants, where following the shapes as the low-res glyphs makes the
font much more tolerant of scaling (comparably the the Unifont glyphs
are "less robust" in these areas).

However, as-is, processing the Unicode BMP still takes an annoyingly
long time (and a more involved axis-search for each glyph wouldn't
exactly help matters).

Likely options, if I did so:
Simple 2D euclidean distance (naive algorithm);
Horizontal distance (current);
Vertical distance (current);
Down-Left (diagonal);
Up-Left (diagonal).

Where, the generation would pick the best 2, with an implicit "3rd axis"
which would merely be the average of the first 2.

In theory, could use hi-color images (with 3x 5-bit channels), but this
would effectively double the memory requirements of each SDF image
(though, the current implementation demand loads each "page", rather
than bulk-loading the entire Unicode BMP).

Well, or I use hi-color for CP-1252, but then the 8-bit scheme for Unifont.

Not currently still wanting to face the complexity of dealing with
TrueType fonts.

Can also note that TTF fonts that cover the entire Unicode BMP are "not
exactly small" either...

1
server_pubkey.txt

rocksolid light 0.9.8
clearnet tor