devel / comp.arch / Re: Tonight's tradeoff

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35630&group=comp.arch#35630

 by: Robert Finch - Sun, 10 Dec 2023 13:17 UTC

Thinking again about using a block header approach to locating
instructions. It is only necessary to locate the position of a group of
instructions being fetched at the same time, since the CPU processes
instructions in groups. An instruction length decoder can be relied
upon to determine the location of instructions within a group. Assuming
a fetch group is four instructions with postfixes, and instructions
average about four bytes in length, only about four groups of
instructions would fit in a 512-bit, 64-byte block. To reference a
position within a 64-byte block requires a six-bit code. Five six-bit
codes would fit into 32 bits of header, allowing the positions of up to
six groups of instructions to be identified (the first group is assumed
to start at offset zero). This is about 6% memory overhead for locating
instruction groups. There may end up being wasted space at the end of a
block if instructions are short, or the last group may end up being
truncated, making it necessary to pad with NOPs. I suspect on average
only a couple of bytes would be wasted in a block. I am guessing that
using blocks of variable-length instructions will increase code density
over having larger fixed-length instructions. A five-byte instruction
length decreases code density by 20% relative to a four-byte length,
but using blocks should only cost about 10% in density, making it more
economical. Seems like some experimentation is in order.
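
A minimal C sketch of one way such a header could be packed and
unpacked, assuming five six-bit group-start byte offsets in the low 30
bits of a 32-bit header, with group 0 implicit at offset zero (the
field layout and function names are illustrative, not an actual
format):

#include <stdint.h>

/* Pack five 6-bit group-start byte offsets (0..63) into a 32-bit header. */
static uint32_t pack_block_header(const uint8_t start[5])
{
    uint32_t h = 0;
    for (int i = 0; i < 5; i++)
        h |= (uint32_t)(start[i] & 0x3F) << (6 * i);   /* bits 0..29 used */
    return h;
}

/* Byte offset of fetch group g (0..5); group 0 starts at offset 0. */
static uint8_t group_start(uint32_t header, int g)
{
    return (g == 0) ? 0 : (uint8_t)((header >> (6 * (g - 1))) & 0x3F);
}

Four header bytes per 64-byte block is 4/64 = 6.25%, which matches the
"about 6%" figure above.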

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35632&group=comp.arch#35632

 by: MitchAlsup - Sun, 10 Dec 2023 15:11 UTC

Robert Finch wrote:

> On 2023-12-09 11:06 p.m., MitchAlsup wrote:
>> Robert Finch wrote:
>>
>> I have a LD IP,[address] instruction which is used to access GOT[k] for
>> calling dynamically linked subroutines. This bypasses the LD-aligner
>> to deliver IP to fetch faster.
>>
>> But you side-stepped answering my question. My question is what do you
>> do when the Jump address will not arrive for another 20 cycles.
>>
> While waiting for the register value, other instructions would continue
> to queue and execute. Then that processing would be dumped because of
> the branch miss. I suppose hardware could be added to suppress
> processing until the register value is known. An option for a larger build.

>>>>> Branches can now use a postfix immediate to extend the branch range.
>>>>> This allows 32 and 64-bit displacements in addition to the existing
>>>>> 17-bit one. However, the assembler cannot know which to use in
>>>>> advance, so choosing a larger branch displacement size should be an
>>>>> option.
>>
>> I use GOT[k] to branch farther than the 28-bit unconditional branch
>> displacement can reach. We have not yet run into a subroutine that
>> needs branches of more than 18 bits conditionally or 28 bits uncon-
>> ditionally.

> I have yet to use GOT addressing.

> There are issues to resolve in the Q+ frontend. The next PC value for
> the BTB is not available for about three clocks. To go backwards in
> time, the next PC needs to be cached, or rather the displacement to the
> next PC to reduce cache size.

What you need is an index and a set to directly access the cache--all
the other stuff can be done in arrears {AGEN and cache tag check}.
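
A rough C model of that arrangement, assuming a set-associative I-cache
and a predictor entry that caches the {set, way} of the target's line
so the data array can be indexed immediately, with the tag check (and
AGEN) done a cycle later in arrears; the sizes and names are
illustrative only:

#include <stdbool.h>
#include <stdint.h>

#define ICACHE_SETS 64
#define ICACHE_WAYS  4
#define LINE_BYTES  64

typedef struct {
    uint64_t tag[ICACHE_SETS][ICACHE_WAYS];
    uint8_t  data[ICACHE_SETS][ICACHE_WAYS][LINE_BYTES];
} ICache;

typedef struct {          /* predictor entry remembers where the     */
    uint64_t target_pc;   /* target line sits in the I-cache         */
    uint16_t set;
    uint8_t  way;
} BtbEntry;

/* Cycle N: read instruction bytes directly via the cached {set, way}. */
static const uint8_t *fetch_fast(const ICache *ic, const BtbEntry *e)
{
    return ic->data[e->set][e->way];
}

/* Cycle N+1, in arrears: confirm the tag really matches the target. */
static bool fetch_verify(const ICache *ic, const BtbEntry *e)
{
    uint64_t tag = e->target_pc / ((uint64_t)LINE_BYTES * ICACHE_SETS);
    return ic->tag[e->set][e->way] == tag;
}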

> The first time a next PC is needed it will
> not be available for three clocks. Once cached it would be available
> within a clock. The next PC displacement is the sum of the lengths of
> next four instructions. There is not enough room in the FPGA to add
> another cache and associated logic, however. Next PC = PC + 20 seems a
> whole lot simpler to me.

> Thus, I may go back to using a fixed size instruction or rather
> instructions with fixed alignment. The position of instructions could be
> as if they were fixed length while remaining variable length.

If the first part of an instruction decodes to the length of the instruction
easily (EASILY) and cheaply, you can avoid the header and build a tree of
unary pointers, each such pointer pointing at twice as many instruction
starting points as the previous. Even without headers, My 66000 can find
the instruction boundaries of up to 16 instructions per cycle without adding
"stuff" to the block of instructions.

> Instructions would just be aligned at fixed intervals. If I set the
>> length to five bytes for instance, most of the instruction set could be
> accommodated. Operation by “packed” instructions would be an option for
> a larger build. There could be a bit in a control register to allow
> execution by packed or unpacked instructions so there is some backwards
> compatibility to a smaller build.

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35655&group=comp.arch#35655

 by: Robert Finch - Sun, 10 Dec 2023 21:24 UTC

On 2023-12-10 10:11 a.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> On 2023-12-09 11:06 p.m., MitchAlsup wrote:
>>> Robert Finch wrote:
>>>
>>> I have a LD IP,[address] instruction which is used to access GOT[k] for
>>> calling dynamically linked subroutines. This bypasses the LD-aligner
>>> to deliver IP to fetch faster.
>>>
>>> But you side-stepped answering my question. My question is what do you
>>> do when the Jump address will not arrive for another 20 cycles.
>>>
>> While waiting for the register value, other instructions would
>> continue to queue and execute. Then that processing would be dumped
>> because of the branch miss. I suppose hardware could be added to
>> suppress processing until the register value is known. An option for a
>> larger build.
>
>>>>>> Branches can now use a postfix immediate to extend the branch
>>>>>> range. This allows 32 and 64-bit displacements in addition to the
>>>>>> existing 17-bit one. However, the assembler cannot know which to
>>>>>> use in advance, so choosing a larger branch displacement size
>>>>>> should be an option.
>>>
>>> I use GOT[k] to branch farther than the 28-bit unconditional branch
>>> displacement can reach. We have not yet run into a subroutine that
>>> needs branches of more than 18 bits conditionally or 28 bits uncon-
>>> ditionally.
>
>> I have yet to use GOT addressing.
>
>> There are issues to resolve in the Q+ frontend. The next PC value for
>> the BTB is not available for about three clocks. To go backwards in
>> time, the next PC needs to be cached, or rather the displacement to
>> the next PC to reduce cache size.
>
> What you need is an index and a set to directly access the cache--all
> the other stuff can be done in arrears {AGEN and cache tag check}
>
>>                               The first time a next PC is needed it
>> will not be available for three clocks. Once cached it would be
>> available within a clock. The next PC displacement is the sum of the
>> lengths of next four instructions. There is not enough room in the
>> FPGA to add another cache and associated logic, however. Next PC = PC
>> + 20 seems a whole lot simpler to me.
>
>> Thus, I may go back to using a fixed size instruction or rather
>> instructions with fixed alignment. The position of instructions could
>> be as if they were fixed length while remaining variable length.
>
> If the first part of an instruction decodes to the length of the
> instruction
> easily (EASILY) and cheaply, you can avoid the header and build a tree of
> unary pointers each such pointer pointing at twice as many instruction
> starting points as the previous. Even without headers, My 66000 can find
> the instruction boundaries of up to 16 instructions per cycle without
> adding
> "stuff" the the block of instructions.
>
>> Instructions would just be aligned at fixed intervals. If I set the
>> length to five bytes for instance, most of the instruction set could be
>> accommodated. Operation by “packed” instructions would be an option
>> for a larger build. There could be a bit in a control register to
>> allow execution by packed or unpacked instructions so there is some
>> backwards compatibility to a smaller build.

I cannot get it to work at a decent speed for only six instructions.
With byte-aligned instructions, 64 length decoders are in use (they’re
really small), and the outputs from the appropriate ones are then
selected. It is partially the fullness of the FPGA and routing
congestion because of the design: routing is taking 90% of the time,
logic only about 10%.

I did some experimenting with block headers and ended up with a block
trailer instead of a header, for the assembler’s benefit, since the
assembler needs to know all the instruction lengths before the trailer
can be output. Only the index of each instruction group is needed, so
usually only a couple of indexes are used per instruction block. It can
likely get by with a 24-bit trailer containing four indexes plus the
assumed one. Usually only one or two bytes are wasted at the end of a
block. I assembled the boot ROM and got an average of 4.9 bytes per
instruction, including the overhead of block trailers and wasted bytes.
Branches and postfixes are five bytes, and there are a lot of them.

Code density is a little misleading because branches occupy five bytes
but do both a compare and a branch operation, so they should maybe
count as two instructions.
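
Under the same caveat that the layout is invented for illustration, the
trailer variant could look like the following: four six-bit group-start
indexes in the low 24 bits, group 0 assumed to start at byte 0, plus a
helper that maps a byte offset within the block back to its group,
which is the step that makes branch targets a bit awkward:

#include <stdint.h>

/* Start offset of group g (0..4); group 0 is the assumed one at byte 0. */
static uint8_t trailer_group_start(uint32_t trailer, int g)
{
    return (g == 0) ? 0 : (uint8_t)((trailer >> (6 * (g - 1))) & 0x3F);
}

/* Which group holds the instruction at byte offset 'off' (0..63)?
   Assumes unused index slots in the trailer are left as zero. */
static int trailer_group_of(uint32_t trailer, uint8_t off)
{
    int g = 0;
    for (int i = 1; i <= 4; i++) {
        uint8_t s = trailer_group_start(trailer, i);
        if (s != 0 && s <= off)
            g = i;
    }
    return g;
}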

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35660&group=comp.arch#35660

 by: MitchAlsup - Sun, 10 Dec 2023 22:52 UTC

Robert Finch wrote:

> On 2023-12-10 10:11 a.m., MitchAlsup wrote:
>> Robert Finch wrote:
>>
>>> On 2023-12-09 11:06 p.m., MitchAlsup wrote:
>>>> Robert Finch wrote:
>>>>
>>>> I have a LD IP,[address] instruction which is used to access GOT[k] for
>>>> calling dynamically linked subroutines. This bypasses the LD-aligner
>>>> to deliver IP to fetch faster.
>>>>
>>>> But you side-stepped answering my question. My question is what do you
>>>> do when the Jump address will not arrive for another 20 cycles.
>>>>
>>> While waiting for the register value, other instructions would
>>> continue to queue and execute. Then that processing would be dumped
>>> because of the branch miss. I suppose hardware could be added to
>>> suppress processing until the register value is known. An option for a
>>> larger build.
>>
>>>>>>> Branches can now use a postfix immediate to extend the branch
>>>>>>> range. This allows 32 and 64-bit displacements in addition to the
>>>>>>> existing 17-bit one. However, the assembler cannot know which to
>>>>>>> use in advance, so choosing a larger branch displacement size
>>>>>>> should be an option.
>>>>
>>>> I use GOT[k] to branch farther than the 28-bit unconditional branch
>>>> displacement can reach. We have not yet run into a subroutine that
>>>> needs branches of more than 18 bits conditionally or 28 bits uncon-
>>>> ditionally.
>>
>>> I have yet to use GOT addressing.
>>
>>> There are issues to resolve in the Q+ frontend. The next PC value for
>>> the BTB is not available for about three clocks. To go backwards in
>>> time, the next PC needs to be cached, or rather the displacement to
>>> the next PC to reduce cache size.
>>
>> What you need is an index and a set to directly access the cache--all
>> the other stuff can be done in arrears {AGEN and cache tag check}
>>
>>>                               The first time a next PC is needed it
>>> will not be available for three clocks. Once cached it would be
>>> available within a clock. The next PC displacement is the sum of the
>>> lengths of next four instructions. There is not enough room in the
>>> FPGA to add another cache and associated logic, however. Next PC = PC
>>> + 20 seems a whole lot simpler to me.
>>
>>> Thus, I may go back to using a fixed size instruction or rather
>>> instructions with fixed alignment. The position of instructions could
>>> be as if they were fixed length while remaining variable length.
>>
>> If the first part of an instruction decodes to the length of the
>> instruction
>> easily (EASILY) and cheaply, you can avoid the header and build a tree of
>> unary pointers each such pointer pointing at twice as many instruction
>> starting points as the previous. Even without headers, My 66000 can find
>> the instruction boundaries of up to 16 instructions per cycle without
>> adding
>> "stuff" the the block of instructions.
>>
>>> Instructions would just be aligned at fixed intervals. If I set the
>>> length to five bytes for instance, most of the instruction set could be
>>> accommodated. Operation by “packed” instructions would be an option
>>> for a larger build. There could be a bit in a control register to
>>> allow execution by packed or unpacked instructions so there is some
>>> backwards compatibility to a smaller build.

> I cannot get it to work at a decent speed for only six instructions.
> With byte-aligned instructions 64-decoders are in use. (They’re really
> small). Then output from the appropriate ones are selected. It is
> partially the fullness of the FPGA and routing congestion because of the
> design. Routing is taking 90% of the time. Logic is only about 10%.

That wire:logic ratio is "not that much out of line" for long-distance
bussing of data.

My word-oriented design would cut the decoders down to 16 decoders, and
they have to look at 7 bits to produce 3×5-bit vectors. A tree of
AND gates takes it from here, essentially performing FF1.
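
For the FF1 step, a log-depth software stand-in for that AND-gate tree
over a 16-bit vector of candidate start bits might look like the
following (the 7-bit-to-3×5-bit length decode itself is ISA-specific
and not modeled here):

#include <stdint.h>

/* Find-first-one as a log-depth reduction; each step halves the window,
   like one level of the tree. Returns -1 if no bit is set. */
static int ff1_16(uint16_t v)
{
    if (v == 0)
        return -1;
    int idx = 0;
    if ((v & 0x00FF) == 0) { idx += 8; v >>= 8; }
    if ((v & 0x000F) == 0) { idx += 4; v >>= 4; }
    if ((v & 0x0003) == 0) { idx += 2; v >>= 2; }
    if ((v & 0x0001) == 0) { idx += 1; }
    return idx;
}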

> I did some experimenting with block headers and ended up with a block
> trailer instead of a header, for the assembler’s benefit which needs to
> know all the instruction lengths before the trailer can be output. Only
> the index of the instruction group is needed, so usually there are only
> a couple of indexes used per instruction block. It can likely get by
> with a 24-bit trailer containing four indexes plus the assumed one.
> Usually only one or two bytes are wasted at the end of a block.
> I assembled the boot rom and there are 4.9 bytes per instruction
> average, including the overhead of block trailers and wasted bytes.
> Branches and postfixes are five bytes, and there are a lot of them.

> Code density is a little misleading because branches occupy five bytes
> but do both a compare and branch operation. So they should maybe count
> as two instructions.

Sooner or later you have to mash everything down to {bits, bytes, words}.
For instructions having VLE and performing non-identical amounts of work,
bytes are probably the best representation. My Excel spreadsheet stuff
uses bits.

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35673&group=comp.arch#35673

 by: Robert Finch - Mon, 11 Dec 2023 06:35 UTC

Decided to take another try at it. Got tired of jumping through hoops to
support a block header approach; the assembler is just not set up to
support that style of processor. It was also a bit tricky to convert
branch target addresses to instruction groups and numbers. I still like
the VLE. Simplified the length decoders so they generate a 3-bit number
instead of a 5-bit one. I went a bit insane with the instruction lengths
and allowed for up to 18 bytes, but that is wasteful. There were a
couple of postfixes that were that long to support 128-bit immediates.
The ISA now has multiple repeated postfixes, in 32-bit immediate chunks,
instead of really long ones.
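
A sketch of how repeated 32-bit postfix chunks could be gathered into
one wide immediate; the chunk order (least-significant first) and count
are assumptions, only the "several short postfixes instead of one very
long one" shape follows the description:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } imm128;

/* Gather 1..4 32-bit postfix chunks into a single wide immediate. */
static imm128 gather_imm(const uint32_t *chunk, int nchunks)
{
    imm128 v = { 0, 0 };
    for (int i = 0; i < nchunks; i++) {
        if (i < 2)
            v.lo |= (uint64_t)chunk[i] << (32 * i);
        else
            v.hi |= (uint64_t)chunk[i] << (32 * (i - 2));
    }
    return v;
}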

Found a signal slowing things down. I had bypassed the output of the
i-cache RAM in case a write and a read to the same address happened at
the same time. The read address was coming from the PC and ended up
feeding through 44 logic levels all the way to the instruction length
decoder. The output of the i-cache was going through the bypass mux
before feeding other logic; it is now just output directly from the RAM.
Not sure why I bypassed it in the first place. If there is an i-cache
update because of a cache miss, fetch should be stalled anyway.

Even with 44 logic levels and lots of routing, timing was okay at 20
MHz. Shooting for 40 MHz now.

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35677&group=comp.arch#35677

 by: BGB - Mon, 11 Dec 2023 09:57 UTC

On 12/11/2023 12:35 AM, Robert Finch wrote:
> Decided to take another try at it. Got tired of jumping through hoops to
> support a block header approach. The assembler is just not setup to
> support that style of processor. It was also a bit tricky to convert
> branch target address to instruction groups and numbers. I still like
> the VLE. Simplified the length decoders so they generate a 3-bit number
> instead of a 5-bit one. I went a bit insane with the instruction lengths
> and allowed for up to 18 bytes. But that is wasteful. There were a
> couple of postfixes that were that long to support 128-bit immediates.
> The Isa has multiple postfixes repeated now instead of really long ones.
> 32-bit immediate chunks.
>

No 128-bit constant load in my case; one does two 64-bit constant loads
for this. In terms of the pipeline, it actually moves four 33-bit
chunks, with each 64-bit constant effectively being glued together
inside the pipeline.

....

> Found a signal slowing things down. I had bypassed the  output of the
> i-cache ram in case a write and a read to the same address happened at
> the same time. The read address was coming from the PC and ended up
> feeding through 44 logic levels all the way to the instruction length
> decoder. The output of the i-cache was going through the bypass mux
> before feeding other logic. It is now just output directly from the RAM.
> Not sure why I bypassed it in the first place. If there was a i-cache
> update because of a cache miss, then it should be stalled.
>
> Even with 44 logic levels and lots of routing timing was okay at 20 MHz.
> Shooting for 40 MHz now.
>

Ironically, as a side effect of some of the logic tweaks made while
trying to get my core running at 75 MHz (now reverted to 50 MHz), it
seems I generally have around 2 ns of slack on the 50 MHz clock (with
around 16-20 logic levels).

This is after re-enabling a lot of the features I had disabled in the
attempted clock boost to 75 MHz.

Most recent ISA tweak:
Had modified the decoding rules slightly so that both the W.m and W.i
bits were given to the immediate for 2RI-Imm10 ops in XG2 Mode
(previously these bits were unused, and I debated between extending the
immediate values or leaving them as possible opcode bits; but if used
for opcode they would lead to new instructions that are effectively N/E
in Baseline Mode, which wouldn't be ideal either).

Relatedly tweaked the JCMPZ-Disp11s and "MOV.x (GBR, Disp10u*Sx), Rn"
ops to effectively be Disp13 and Disp12.

This increases branch-displacement to +/-8K for compare-with-zero, and
MOV.L/Q to 16K/32K for loading global variables (from 4K/8K).
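
For the scaling those numbers imply (branch displacements in 16-bit
units, load/store displacements scaled by the element size), the
arithmetic is: a 13-bit signed branch displacement covers +/-2^12 × 2 B
= +/-8 KB; a 12-bit unsigned displacement covers 2^12 × 4 B = 16 KB for
MOV.L and 2^12 × 8 B = 32 KB for MOV.Q, versus 2^10 × 4 B = 4 KB and
2^10 × 8 B = 8 KB for the previous 10-bit forms. The scaling itself is
an assumption, but it is consistent with the figures quoted above.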

It seems the gains were fairly modest though:
Relatively few functions are large enough to benefit from the larger
branch limit;
It seems it runs out of scalar-type global variables (in Doom) well
before hitting the 16K/32K limit (and this is mostly N/A for structs and
arrays).

Like, it seems Doom only has around 12K worth of scalar global
variables, and nearly the entire rest of the ".data" and ".bss" sections
is structs and arrays (which need LEAs to access). So, it only gains a
few percent on the total hit rate (55% to 57%).

Roughly, the remaining 43% is mostly LEAs to arrays, which end up
needing a 64-bit (Disp33) encoding.

TBD if I will keep this, or revert to only adding W.i, in XG2 Mode (to
keep W.m for possible XG2-Only 2RI opcodes; or possibly make the
Imm12/Disp13s case specific to these particular instructions, though
this is "less good" for the instruction decoder).

If I keep this, will probably end up stuck with it in any case.

Also, in my compiler, I am finding less obvious "low hanging fruit" than
I had hoped (and a lot of the remaining inefficiencies are things that
would "actually take" effort, *).

*: Say, for example, computing an expression that is being passed to a
function and directing the output directly to the function argument
register, rather than putting it in a temporary and then MOV'ing it to
the needed register.

Alas...

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35718&group=comp.arch#35718

 by: Robert Finch - Wed, 13 Dec 2023 12:44 UTC

On 2023-12-11 4:57 a.m., BGB wrote:
> On 12/11/2023 12:35 AM, Robert Finch wrote:
>> Decided to take another try at it. Got tired of jumping through hoops
>> to support a block header approach. The assembler is just not setup to
>> support that style of processor. It was also a bit tricky to convert
>> branch target address to instruction groups and numbers. I still like
>> the VLE. Simplified the length decoders so they generate a 3-bit
>> number instead of a 5-bit one. I went a bit insane with the
>> instruction lengths and allowed for up to 18 bytes. But that is
>> wasteful. There were a couple of postfixes that were that long to
>> support 128-bit immediates. The Isa has multiple postfixes repeated
>> now instead of really long ones. 32-bit immediate chunks.
>>
>
> No 128-bit constant load in my case, one does two 64 bit constant loads
> for this. In terms of the pipeline, it actually moves four 33-bit
> chunks, with each 64-bit constant effectively being glued together
> inside the pipeline.
>
> ...
>
>
>> Found a signal slowing things down. I had bypassed the  output of the
>> i-cache ram in case a write and a read to the same address happened at
>> the same time. The read address was coming from the PC and ended up
>> feeding through 44 logic levels all the way to the instruction length
>> decoder. The output of the i-cache was going through the bypass mux
>> before feeding other logic. It is now just output directly from the
>> RAM. Not sure why I bypassed it in the first place. If there was a
>> i-cache update because of a cache miss, then it should be stalled.
>>
>> Even with 44 logic levels and lots of routing timing was okay at 20
>> MHz. Shooting for 40 MHz now.
>>
>
>
> Ironically, as a side-effect of some of the logic tweaks made in trying
> to get my core running at 75MHz, but now reverting to 50MHz; on the
> 50MHz clock it seems I generally have around 2ns of slack (with around
> 16-20 logic levels).

I am sitting just on the other side of 50 MHz operation, a couple of
ns short.
>
> This is after re-enabling a lot of the features I had disabled in the
> attempted clock-boost to 75MHz.
>
>
>
>
> Most recent ISA tweak:
> Had modified the decoding rules slightly so that both the W.m and W.i
> bits were given to the immediate for 2RI-Imm10 ops in XG2 Mode
> (previously these bits were unused, and I debated between extending the
> immediate values or leaving them as possible opcode bits; but if used
> for opcode they would lead to new instructions that are effectively N/E
> in Baseline Mode, which wouldn't be ideal either).
>
>
> Relatedly tweaked the JCMPZ-Disp11s and "MOV.x (GBR, Disp10u*Sx), Rn"
> ops to effectively be Disp13 and Disp12.
>
> This increases branch-displacement to +/-8K for compare-with-zero, and
> MOV.L/Q to 16K/32K for loading global variables (from 4K/8K).
>
> It seems the gains were fairly modest though:
> Relatively few functions are large enough to benefit from the larger
> branch limit;
> It seems it runs out of scalar-type global variables (in Doom) well
> before hitting the 16K/32K limit (and this is mostly N/A for structs and
> arrays).
>
>
> Like, it seems Doom only has around 12K worth of scalar global
> variables, and nearly the entire rest of the ".data" and ".bss" section
> is structs and arrays (which need LEA's to access). So, it only gains a
> few percent on the total hit rate (55 to 57).
>
> Roughly the remaining 43% being mostly LEA's to arrays, which end up
> needing a 64-bit (Disp33) encoding.
>
>
> TBD if I will keep this, or revert to only adding W.i, in XG2 Mode (to
> keep W.m for possible XG2-Only 2RI opcodes; or possibly make the
> Imm12/Disp13s case specific to these particular instructions, though
> this is "less good" for the instruction decoder).
>
> If I keep this, will probably end up stuck with it in any case.
>
>
> Also, in my compiler, I am finding less obvious "low hanging fruit" than
> I had hoped (and a lot of the remaining inefficiencies are things that
> would "actually take" effort, *).
>
> *: Say, for example, computing an expression that is being passed to a
> function, actually directing the output directly to the function
> argument register, rather than putting it a temporary and then MOV'ing
> it to the needed register.
>
> Alas...
>
>
>
I got timing to work at 40+ MHz by using 32-bit instruction parcels
rather than byte-oriented ones.

An issue with 32-bit parcels is that float constants do not fit well
into them because of the opcode present in a postfix. A 32-bit postfix
has only 25 available bits for a constant; the next size up has 57 bits
available. One thought I had was to reduce the floating-point precision
to correspond: single precision floats would be 25 bits, double
precision 57 bits, and quad precision 121 bits, all seven bits short of
the usual widths.
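
The arithmetic behind those widths: each postfix apparently spends 7
bits on its opcode, so a 32-bit postfix leaves 32 - 7 = 25 constant
bits, a 64-bit one 64 - 7 = 57, and a 128-bit one 128 - 7 = 121. A
57-bit double would presumably keep the sign and 11-bit exponent and
trim the mantissa from 52 to 45 bits, though exactly where the seven
bits come out of is a design choice not spelled out here.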

I could try to use 40-bit parcels, but they would need to be at fixed
locations on the cache line for performance, and it would waste bytes.

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35725&group=comp.arch#35725

 by: MitchAlsup - Wed, 13 Dec 2023 19:13 UTC

Robert Finch wrote:

> On 2023-12-11 4:57 a.m., BGB wrote:
>>
>>
> I got timing to work at 40+ MHz by using 32-bit instruction parcels
> rather than byte-oriented ones.

> An issue with 32-bit parcels is that float constants do not fit well
> into them because of the opcode present in a postfix. A 32-bit postfix
> has only 25 available bits for a constant. The next size up has 57 bits
> available. One thought I had was to reduce the floating-point precision
> to correspond. Single precision floats would be 25 bits, double
> precision 57 bits and quad precision 121 bits. All seven bits short of
> the usual.

It is because of issues such as you mention that my approach was
different. The instruction-specifier contains everything the decoder
needs to know about where the operands are, how to route them into
calculation, what to calculate, and where to deliver the result. Should
the instruction want constants for an operand*, they are concatenated
sequentially after the I-S and come in 32-bit and 64-bit quantities.
Should a 32-bit constant be consumed in a 64-bit calculation, it is
widened during routing.

(*) except for the 16-bit immediates and displacements from the
Major OpCode table.
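
A C sketch of that decode shape; the constant-presence field below is
invented for illustration, and only the "specifier word followed by
in-line 32-/64-bit constants, widened during routing" structure follows
the description:

#include <stdint.h>

typedef struct {
    uint32_t spec;    /* the instruction-specifier word          */
    uint64_t imm;     /* trailing constant, widened to 64 bits   */
    int      words;   /* total 32-bit parcels consumed           */
} DecodedInst;

static DecodedInst decode_is(const uint32_t *p)
{
    DecodedInst d = { p[0], 0, 1 };
    unsigned imm_kind = (p[0] >> 30) & 3u;   /* hypothetical field      */
    if (imm_kind == 1) {                     /* one 32-bit constant     */
        d.imm = (uint64_t)(int64_t)(int32_t)p[1];  /* widened on route  */
        d.words = 2;
    } else if (imm_kind == 2) {              /* one 64-bit constant     */
        d.imm = (uint64_t)p[1] | ((uint64_t)p[2] << 32);
        d.words = 3;
    }
    return d;
}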

> I could try and use 40-bit parcels but they would need to be at fixed
> locations on the cache line for performance, and it would waste bytes.

In effect I only have 32-bit parcels.

Re: Tonight's tradeoff

https://news.novabbs.org/devel/article-flat.php?id=35729&group=comp.arch#35729

 by: Robert Finch - Wed, 13 Dec 2023 22:43 UTC

On 2023-12-13 2:13 p.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> On 2023-12-11 4:57 a.m., BGB wrote:
>>>
>>>
>> I got timing to work at 40+ MHz by using 32-bit instruction parcels
>> rather than byte-oriented ones.
>
>> An issue with 32-bit parcels is that float constants do not fit well
>> into them because of the opcode present in a postfix. A 32-bit postfix
>> has only 25 available bits for a constant. The next size up has 57
>> bits available. One thought I had was to reduce the floating-point
>> precision to correspond. Single precision floats would be 25 bits,
>> double precision 57 bits and quad precision 121 bits. All seven bits
>> short of the usual.
>
> It is because of issues such as the ones you mention that my approach was
> different. The instruction-specifier contains everything the decoder needs
> to know about where the operands are, how to route them into calculation, what
> to calculate and where to deliver the result. Should the instruction
> want constants for an operand* they are concatenated sequentially after
> the I-S and come in 32-bit and 64-bit quantities. Should a
> 32-bit constant be consumed in a 64-bit calculation it is widened
> during route.
>
> (*) except for the 16-bit immediates and displacements from the
> Major OpCode table.
>
>> I could try and use 40-bit parcels but they would need to be at fixed
>> locations on the cache line for performance, and it would waste bytes.
>
> In effect I only have 32-bit parcels.

Got timing to work at 40 MHz using 40-bit instruction parcels, with
the parcels at fixed positions within a cache line. It requires only 12
length decoders. There is some wasted space at the end of a cache line,
which leaves room for a header. Not ultra-efficient, but it should work. Assembled
the boot ROM and got an average of 5.86 bytes per instruction.

The larger parcels are needed for this design to support 64-regs. Still
some work to do on the PC increment at the end of a cache line.
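
A small model of the fixed-slot layout and the end-of-line PC increment,
assuming a 64-byte cache line (12 slots of 5 bytes leaves 4 spare bytes,
the header room mentioned above); the line size is my assumption, not
something stated here:

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES        64   /* assumed cache-line size              */
#define PARCEL_BYTES       5   /* 40-bit parcel                        */
#define PARCELS_PER_LINE  12   /* matches the 12 length decoders above */

/* Advance the PC by one 40-bit parcel, skipping the spare bytes at the
 * end of each line. */
static uint64_t next_pc(uint64_t pc)
{
    uint64_t line = pc & ~(uint64_t)(LINE_BYTES - 1);
    uint64_t slot = (pc - line) / PARCEL_BYTES;
    return (slot + 1 == PARCELS_PER_LINE) ? line + LINE_BYTES
                                          : pc + PARCEL_BYTES;
}

int main(void)
{
    uint64_t pc = 0;
    for (int i = 0; i < 14; i++) {       /* walks 0,5,...,55,64,69 */
        printf("%llu ", (unsigned long long)pc);
        pc = next_pc(pc);
    }
    printf("\n");
    return 0;
}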

Re: Tonight's tradeoff

<ule5gc$190j3$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35734&group=comp.arch#35734
 by: BGB - Thu, 14 Dec 2023 05:57 UTC

On 12/13/2023 6:44 AM, Robert Finch wrote:
> On 2023-12-11 4:57 a.m., BGB wrote:
>> On 12/11/2023 12:35 AM, Robert Finch wrote:
>>> Decided to take another try at it. Got tired of jumping through hoops
>>> to support a block header approach. The assembler is just not setup
>>> to support that style of processor. It was also a bit tricky to
>>> convert branch target address to instruction groups and numbers. I
>>> still like the VLE. Simplified the length decoders so they generate a
>>> 3-bit number instead of a 5-bit one. I went a bit insane with the
>>> instruction lengths and allowed for up to 18 bytes. But that is
>>> wasteful. There were a couple of postfixes that were that long to
>>> support 128-bit immediates. The Isa has multiple postfixes repeated
>>> now instead of really long ones. 32-bit immediate chunks.
>>>
>>
>> No 128-bit constant load in my case, one does two 64 bit constant
>> loads for this. In terms of the pipeline, it actually moves four
>> 33-bit chunks, with each 64-bit constant effectively being glued
>> together inside the pipeline.
>>
>> ...
>>
>>
>>> Found a signal slowing things down. I had bypassed the  output of the
>>> i-cache ram in case a write and a read to the same address happened
>>> at the same time. The read address was coming from the PC and ended
>>> up feeding through 44 logic levels all the way to the instruction
>>> length decoder. The output of the i-cache was going through the
>>> bypass mux before feeding other logic. It is now just output directly
>>> from the RAM. Not sure why I bypassed it in the first place. If there
>>> was a i-cache update because of a cache miss, then it should be stalled.
>>>
>>> Even with 44 logic levels and lots of routing timing was okay at 20
>>> MHz. Shooting for 40 MHz now.
>>>
>>
>>
>> Ironically, as a side-effect of some of the logic tweaks made in
>> trying to get my core running at 75MHz, but now reverting to 50MHz; on
>> the 50MHz clock it seems I generally have around 2ns of slack (with
>> around 16-20 logic levels).
>
> I am sitting just on the other side of 50 MHz operation a couple of ns
> short.

Trying to get to 75, the last ns was the hardest...

And, as before, the compromises made along the way ended up hurting more
than what was gained.

>>
>> This is after re-enabling a lot of the features I had disabled in the
>> attempted clock-boost to 75MHz.
>>
>>
>>
>>
>> Most recent ISA tweak:
>> Had modified the decoding rules slightly so that both the W.m and W.i
>> bits were given to the immediate for 2RI-Imm10 ops in XG2 Mode
>> (previously these bits were unused, and I debated between extending
>> the immediate values or leaving them as possible opcode bits; but if
>> used for opcode they would lead to new instructions that are
>> effectively N/E in Baseline Mode, which wouldn't be ideal either).
>>
>>
>> Relatedly tweaked the JCMPZ-Disp11s and "MOV.x (GBR, Disp10u*Sx), Rn"
>> ops to effectively be Disp13 and Disp12.
>>
>> This increases branch-displacement to +/-8K for compare-with-zero, and
>> MOV.L/Q to 16K/32K for loading global variables (from 4K/8K).
>>
>> It seems the gains were fairly modest though:
>> Relatively few functions are large enough to benefit from the larger
>> branch limit;
>> It seems it runs out of scalar-type global variables (in Doom) well
>> before hitting the 16K/32K limit (and this is mostly N/A for structs
>> and arrays).
>>
>>
>> Like, it seems Doom only has around 12K worth of scalar global
>> variables, and nearly the entire rest of the ".data" and ".bss"
>> section is structs and arrays (which need LEA's to access). So, it
>> only gains a few percent on the total hit rate (55 to 57).
>>
>> Roughly the remaining 43% being mostly LEA's to arrays, which end up
>> needing a 64-bit (Disp33) encoding.
>>
>>
>> TBD if I will keep this, or revert to only adding W.i, in XG2 Mode (to
>> keep W.m for possible XG2-Only 2RI opcodes; or possibly make the
>> Imm12/Disp13s case specific to these particular instructions, though
>> this is "less good" for the instruction decoder).
>>
>> If I keep this, will probably end up stuck with it in any case.
>>
>>
>> Also, in my compiler, I am finding less obvious "low hanging fruit"
>> than I had hoped (and a lot of the remaining inefficiencies are things
>> that would "actually take" effort, *).
>>
>> *: Say, for example, computing an expression that is being passed to a
>> function, actually directing the output directly to the function
>> argument register, rather than putting it a temporary and then MOV'ing
>> it to the needed register.
>>
>> Alas...
>>
>>
>>
> I got timing to work at 40+ MHz by using 32-bit instruction parcels
> rather than byte-oriented ones.
>

16/32 here.

Though, XG2 Mode is 32-bit only (and also requires 32-bit alignment for
the instruction stream).

Baseline mode is almost entirely free-form, except that there are a few
"quirk" cases (say, 96-bit op not allowed if ((PC&0xE)==0x0E), 32-bit
alignment is needed for branch-tables in "switch()", ...).

Also the 16-bit ops are scalar only.

A lot of the quirk cases go away in XG2 mode because the instruction
stream is always 32-bit aligned.

> An issue with 32-bit parcels is that float constants do not fit well
> into them because of the opcode present in a postfix. A 32-bit postfix
> has only 25 available bits for a constant. The next size up has 57 bits
> available. One thought I had was to reduce the floating-point precision
> to correspond. Single precision floats would be 25 bits, double
> precision 57 bits and quad precision 121 bits. All seven bits short of
> the usual.
>

This part is a combination game...

My case:
24+9: 33
24+24+16: 64

Though, a significant majority of typical floating point constants can
be represented exactly in Binary16, so there ended up being a feature where
floating point constants are represented as Binary16 whenever it is
possible to do so exactly (with the instruction then converting them to
Binary64).

FLDCH Imm16, Rn //Load value as Binary16 to Binary64
PLDCH Imm32, Rn //Load value as 2x Binary16 to 2x Binary32
PLDCXH Imm64, Xn //Load value as 4x Binary16 to 4x Binary32
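
For reference, a sketch of the "is this constant exactly representable as
Binary16" test that such a scheme depends on. This is my own check, assuming
plain IEEE 754 binary16 (11 significand bits, largest finite value 65504,
smallest subnormal 2^-24), not the compiler's actual logic:

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Rough test of exact representability in IEEE 754 binary16. */
static bool fits_binary16(double x)
{
    if (x == 0.0 || isinf(x) || isnan(x))
        return true;                       /* zero/inf/NaN map directly   */
    if (fabs(x) > 65504.0)
        return false;                      /* above binary16 max finite   */
    /* Every finite binary16 value is an integer multiple of 2^-24,
       so x*2^24 must be an integer ...                                   */
    double scaled = ldexp(fabs(x), 24);
    if (scaled != floor(scaled))
        return false;
    /* ... and a normal value (>= 2^-14) must fit in 11 significand bits. */
    int e;
    double m = frexp(fabs(x), &e);         /* x = m * 2^e, 0.5 <= m < 1   */
    double mant = ldexp(m, 11);
    return e >= -13 ? mant == floor(mant) : true;
}

int main(void)
{
    printf("%d %d %d\n", fits_binary16(1.5), fits_binary16(0.1),
           fits_binary16(3.0));            /* expect 1 0 1 */
    return 0;
}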

> I could try and use 40-bit parcels but they would need to be at fixed
> locations on the cache line for performance, and it would waste bytes.
>

Yeah...
Could be worse I guess.

Re: Tonight's tradeoff

<ulihhf$22p1k$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35770&group=comp.arch#35770
 by: Robert Finch - Fri, 15 Dec 2023 21:47 UTC

Just realizing that two more bits of branch displacement can be squeezed
out of the design, if the branch target were to use an instruction
number in the block for the low four bits of the target instead of using
the low six bits for a relative displacement. The low order six bits of
the instruction pointer can be recovered from the instruction number,
which need be only four bits.

Currently the branch displacement is seventeen bits, just one short of
the highly desirable eighteen-bit displacement. Adding two extra bits
of displacement is of limited value though, since most branches can be
accommodated with only twelve bits.
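
A sketch of how the instruction-number form of the target could be decoded,
assuming the 40-bit/12-slot layout from the earlier post and a 64-byte line
(both assumptions on my part):

#include <stdint.h>
#include <stdio.h>

#define PARCEL_BYTES 5
#define LINE_BYTES   64

/* The low six bits of the target address are fully determined by a 4-bit
 * slot number, so the encoding only needs the slot number plus the
 * line-granular part of the displacement. */
static uint64_t branch_target(uint64_t pc, int64_t line_disp, unsigned slot)
{
    uint64_t line = (pc & ~(uint64_t)(LINE_BYTES - 1)) + line_disp * LINE_BYTES;
    return line + (uint64_t)slot * PARCEL_BYTES;   /* recover low 6 bits */
}

int main(void)
{
    /* branch from somewhere in line 0x1000 to slot 7 of the next line */
    printf("0x%llx\n", (unsigned long long)branch_target(0x100f, 1, 7));
    return 0;
}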

Re: Tonight's tradeoff

<um4633$1jhqf$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35979&group=comp.arch#35979
 by: Robert Finch - Fri, 22 Dec 2023 14:22 UTC

Stuck on checkpoint RAM now. Everything was going good until…. I
realized that while instructions are executing they need to be able to
update previous checkpoints, not just the current one. Which checkpoint
gets updated depends on which checkpoint the instruction falls under. It
is the register valid bit that needs to be updated. I used a “brute
force” approach to implement this and it is 40k LUTs. This is about five
times too large a solution. If I reduce the number of checkpoints
supported to four from sixteen, then the component is 20k LUTs. Still
too large.

The issue is there are 256 valid bits times 16 checkpoints which means
4096 registers. Muxing the register inputs and outputs uses a lot of LUTs.

One thought is to stall until all the instructions with targets in a
given checkpoint are finished executing before starting a new checkpoint
region. It would seriously impact the CPU performance.

Re: Tonight's tradeoff

<G1jhN.74715$PuZ9.26873@fx11.iad>

https://news.novabbs.org/devel/article-flat.php?id=35980&group=comp.arch#35980
 by: EricP - Fri, 22 Dec 2023 16:42 UTC

Robert Finch wrote:
> Stuck on checkpoint RAM now. Everything was going good until…. I
> realized that while instructions are executing they need to be able to
> update previous checkpoints, not just the current one. Which checkpoint
> gets updated depends on which checkpoint the instruction falls under. It
> is the register valid bit that needs to be updated. I used a “brute
> force” approach to implement this and it is 40k LUTs. This is about five
> times too large a solution. If I reduce the number of checkpoints
> supported to four from sixteen, then the component is 20k LUTs. Still
> too large.
>
> The issue is there are 256 valid bits times 16 checkpoints which means
> 4096 registers. Muxing the register inputs and outputs uses a lot of LUTs.
>
> One thought is to stall until all the instructions with targets in a
> given checkpoint are finished executing before starting a new checkpoint
> region. It would seriously impact the CPU performance.
>

(I don't have a solution, just passing on some info on this particular
checkpointing issue.)

Sounds like you might be using the same free register checkpoint algorithm
I came up with for my simulator, which I assumed was a custom sram design.

There is 1 bit for each physical register that is free.
The checkpoint for a Bcc conditional branch copies the free bit vector,
in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires and frees up its old dest physical register,
it must mark the register free in *all* checkpoint contexts.

That requires the ability to set all the free flags for a single register,
which means an sram design that can write a whole row, and also set all the
bits in one column, in your case set the 16 bits in each checkpoint for one
of the 256 registers.

I was assuming an ASIC design so a small custom sram seemed reasonable.
But for an FPGA it requires 256*16 flip-flops plus decoders, etc.
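
In software terms, the row-write plus column-set behaviour could be modelled
roughly like this (a sketch of the structure being described, not anyone's
actual RTL):

#include <stdint.h>
#include <string.h>

#define NPREGS   256
#define NCHKPTS   16
#define NWORDS   (NPREGS / 64)

/* 16 x 256 matrix of free bits, one row per checkpoint. */
typedef struct { uint64_t row[NCHKPTS][NWORDS]; } chkpt_free;

/* Row write: a Bcc copies the current free vector into checkpoint c. */
static void save_checkpoint(chkpt_free *m, int c, const uint64_t cur[NWORDS])
{
    memcpy(m->row[c], cur, sizeof m->row[c]);
}

/* Column set: when physical register r is freed at retire, mark it free
 * in the current vector and in every saved checkpoint. */
static void free_in_all(chkpt_free *m, uint64_t cur[NWORDS], int r)
{
    cur[r / 64] |= 1ull << (r % 64);
    for (int c = 0; c < NCHKPTS; c++)
        m->row[c][r / 64] |= 1ull << (r % 64);
}

/* Rollback: restore the current free vector from checkpoint c. */
static void restore_checkpoint(const chkpt_free *m, int c, uint64_t cur[NWORDS])
{
    memcpy(cur, m->row[c], sizeof m->row[c]);
}

int main(void)
{
    static chkpt_free m;
    uint64_t cur[NWORDS] = { ~0ull, ~0ull, ~0ull, ~0ull };
    cur[0] &= ~(1ull << 7);         /* PR7 busy when the checkpoint is taken */
    save_checkpoint(&m, 0, cur);
    free_in_all(&m, cur, 7);        /* PR7 freed at retire, in every row     */
    restore_checkpoint(&m, 0, cur); /* mispredict: PR7 is still marked free  */
    return 0;
}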

I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have
independently come up with the same approach on their BOOM-3 SonicBoom.
Their note [5] describes the same problem as my column setting solves.

https://docs.boom-core.org/en/latest/sections/rename-stage.html

While their target was 22nm ASIC, they say below that they
implemented a version of BOOM-3 on an FPGA but don't give details.
But their project might be open source so maybe the details
are available online.

Sonicboom: The 3rd generation berkeley out-of-order machine
http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf

Re: Tonight's tradeoff

<b17e9898b9b3ba0f907faa7b1d8bfbf6@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35981&group=comp.arch#35981
 by: MitchAlsup - Fri, 22 Dec 2023 17:49 UTC

EricP wrote:

> Robert Finch wrote:
>> Stuck on checkpoint RAM now. Everything was going good until…. I
>> realized that while instructions are executing they need to be able to
>> update previous checkpoints, not just the current one. Which checkpoint
>> gets updated depends on which checkpoint the instruction falls under. It
>> is the register valid bit that needs to be updated. I used a “brute
>> force” approach to implement this and it is 40k LUTs. This is about five
>> times too large a solution. If I reduce the number of checkpoints
>> supported to four from sixteen, then the component is 20k LUTs. Still
>> too large.
>>
>> The issue is there are 256 valid bits times 16 checkpoints which means
>> 4096 registers. Muxing the register inputs and outputs uses a lot of LUTs.
>>
>> One thought is to stall until all the instructions with targets in a
>> given checkpoint are finished executing before starting a new checkpoint
>> region. It would seriously impact the CPU performance.
>>

> (I don't have a solution, just passing on some info on this particular
> checkpointing issue.)

> Sounds like you might be using the same free register checkpoint algorithm
> I came up with for my simulator, which I assumed was a custom sram design.

> There is 1 bit for each physical register that is free.
> The checkpoint for a Bcc conditional branch copies the free bit vector,
> in your case 256 bits, to a row in the checkpoint sram.
> As each instruction retires and frees up its old dest physical register
> and it must mark the register free in *all* checkpoint contexts.

> That requires the ability to set all the free flags for a single register,
> which means an sram design that can write a whole row, and also set all the
> bits in one column, in your case set the 16 bits in each checkpoint for one
> of the 256 registers.

Two points::
1) the register that gets freed up when you know this newly allocated register
will retire can be determined with a small amount of logic (2 gates) per
cell in your 256×16 matrix--no need for the column write/clear/set. You can
use this overwrite across columns to perform register write elision.

2) There are going to be allocations where you do not allocate any register
to a particular instruction because the register is overwritten IN the same
issue bundle. Here you can use a different "forwarding" notation so the
result is captured by the stations and used without ever seeing the file.

I called this matrix the "History Table" in Mc 88120; it provided valid
bits back to the aRN->pRN CAMs <backup> and valid bits back to the register
pool <successful retire>.

Back then, we recognized that the architectural registers were a strict
subset of the physical registers, so that as long as there were exactly
31 (then: 32 now) valid registers in the pRF, one could always read
values to be written into reservation station entries. In effect, the
whole thing was a RoB--Once the RoB gets big enough, there is no reason
to have both a RoB and an aRF; just let the RoB do everything and change
its name to Physical Register File. This eliminates the copy to aRF
at retirement.

> I was assuming an ASIC design so a small custom sram seemed reasonable.
> But for an FPGA it requires 256*16 flip-flops plus decoders, etc.

> I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have
> independently come up with the same approach on their BOOM-3 SonicBoom.
> Their note [5] describes the same problem as my column setting solves.

> https://docs.boom-core.org/en/latest/sections/rename-stage.html

I was doing something very similar in 1991.

> While their target was 22nm ASIC, they say below that they
> implemented a version of BOOM-3 on an FPGA but don't give details.
> But their project might be open source so maybe the details
> are available online.

> Sonicboom: The 3rd generation berkeley out-of-order machine
> http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf

Re: Tonight's tradeoff

<um4ivg$1llic$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35982&group=comp.arch#35982
 by: BGB - Fri, 22 Dec 2023 18:02 UTC

On 12/22/2023 10:42 AM, EricP wrote:
> Robert Finch wrote:
>> Stuck on checkpoint RAM now. Everything was going good until…. I
>> realized that while instructions are executing they need to be able to
>> update previous checkpoints, not just the current one. Which
>> checkpoint gets updated depends on which checkpoint the instruction
>> falls under. It is the register valid bit that needs to be updated. I
>> used a “brute force” approach to implement this and it is 40k LUTs.
>> This is about five times too large a solution. If I reduce the number
>> of checkpoints supported to four from sixteen, then the component is
>> 20k LUTs. Still too large.
>>
>> The issue is there are 256 valid bits times 16 checkpoints which means
>> 4096 registers. Muxing the register inputs and outputs uses a lot of
>> LUTs.
>>
>> One thought is to stall until all the instructions with targets in a
>> given checkpoint are finished executing before starting a new
>> checkpoint region. It would seriously impact the CPU performance.
>>
>
> (I don't have a solution, just passing on some info on this particular
> checkpointing issue.)
>
> Sounds like you might be using the same free register checkpoint algorithm
> I came up with for my simulator, which I assumed was a custom sram design.
>
> There is 1 bit for each physical register that is free.
> The checkpoint for a Bcc conditional branch copies the free bit vector,
> in your case 256 bits, to a row in the checkpoint sram.
> As each instruction retires and frees up its old dest physical register
> and it must mark the register free in *all* checkpoint contexts.
>
> That requires the ability to set all the free flags for a single register,
> which means an sram design that can write a whole row, and also set all the
> bits in one column, in your case set the 16 bits in each checkpoint for one
> of the 256 registers.
>
> I was assuming an ASIC design so a small custom sram seemed reasonable.
> But for an FPGA it requires 256*16 flip-flops plus decoders, etc.
>
> I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have
> independently come up with the same approach on their BOOM-3 SonicBoom.
> Their note [5] describes the same problem as my column setting solves.
>
> https://docs.boom-core.org/en/latest/sections/rename-stage.html
>
> While their target was 22nm ASIC, they say below that they
> implemented a version of BOOM-3 on an FPGA but don't give details.
> But their project might be open source so maybe the details
> are available online.
>
> Sonicboom: The 3rd generation berkeley out-of-order machine
> http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf
>

Yeah... Generally in-order designs are a lot more viable on FPGAs.
Also if one wants a core that has a smaller die area and doesn't need the
fastest possible single-thread performance (though, in theory, good
static scheduling and good L1 caches could at least reduce the issues).

Comparably, it seems that SRAM is a lot cheaper on FPGAs relative to
logic (whereas ASICs apparently have cheap logic but expensive SRAMs),
so a bigger L1 cache is more of an option (though limited some by
timing, and that much past 16K or 32K one is solidly in diminishing
returns territory).

While, say, 2-way can do better than 1-way, the relative cost of 2-way
associativity tends to be higher than that of doubling the size of the L1
cache.

Where, say, a 16K 2-way cache would still have a lower average hit rate
than a 32K 1-way cache (but the 32K cache will need fewer LUTs and have
better timing).
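
As a toy illustration of where the extra cost comes from (line size, field
split and array sizes are my own assumptions): the 1-way lookup is one
indexed read and one tag compare, while 2-way needs two compares plus a
way select.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64                 /* assumed line size */

typedef struct { uint64_t tag; bool valid; } line_t;

/* Direct-mapped: one indexed read, one tag compare. */
static bool hit_1way(const line_t *c, int sets, uint64_t addr)
{
    uint64_t idx = (addr / LINE_BYTES) % (uint64_t)sets;
    uint64_t tag = (addr / LINE_BYTES) / (uint64_t)sets;
    return c[idx].valid && c[idx].tag == tag;
}

/* 2-way: two tag compares plus a way select, which is where the extra
 * LUTs (and the longer path) come from. */
static bool hit_2way(const line_t *w0, const line_t *w1, int sets, uint64_t addr)
{
    uint64_t idx = (addr / LINE_BYTES) % (uint64_t)sets;
    uint64_t tag = (addr / LINE_BYTES) / (uint64_t)sets;
    return (w0[idx].valid && w0[idx].tag == tag) ||
           (w1[idx].valid && w1[idx].tag == tag);
}

int main(void)
{
    static line_t dm[512];            /* 512 x 64B = 32K direct-mapped */
    static line_t a[128], b[128];     /* 2 x 128 x 64B = 16K 2-way     */
    printf("%d %d\n", hit_1way(dm, 512, 0x1234), hit_2way(a, b, 128, 0x1234));
    return 0;
}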

Scheduling is theoretically "not that hard", but my compiler isn't super
good at it. It partly uses a strategy of "check if instructions can be
swapped, and, for a window of N instructions, whether this swap will result
in better-case numbers than the prior ordering".

It then repeats this process up to a certain number of times.

Results are far from ideal, as it tends to miss cases where a swap would
make things worse for prior instructions. Also it can't determine the
"globally best" ordering, as evaluating every possible reordering of
every instruction trace in a program is well outside the range of
"computationally feasible" as-is (nor even necessarily enumerating every
possible ordering within a given sequence of instructions).

Partial issue being that (excluding impossible swaps), this problem has
a roughly "O(N!)" complexity curve.
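
A minimal sketch of the swap-in-a-window idea described above; the
instruction representation, window size and cost model are all invented
here for illustration and are much cruder than a real compiler pass:

#include <stdio.h>
#include <stdbool.h>

typedef struct {
    int dst;            /* destination register             */
    int src1, src2;     /* source registers, -1 if unused   */
    int latency;        /* cycles until the result is ready */
} insn;

/* Any register dependence (RAW, WAW or WAR): if present, no reordering. */
static bool conflicts(const insn *a, const insn *b)
{
    return a->dst == b->dst ||
           a->dst == b->src1 || a->dst == b->src2 ||
           b->dst == a->src1 || b->dst == a->src2;
}

/* Crude cost: each back-to-back producer/consumer pair is charged the
 * producer's latency as a stall. */
static int window_cost(const insn *w, int n)
{
    int cost = 0;
    for (int i = 1; i < n; i++)
        if (w[i].src1 == w[i - 1].dst || w[i].src2 == w[i - 1].dst)
            cost += w[i - 1].latency;
    return cost;
}

static void swap(insn *a, insn *b) { insn t = *a; *a = *b; *b = t; }

static void schedule(insn *code, int n, int passes)
{
    for (int p = 0; p < passes; p++)
        for (int i = 0; i + 1 < n; i++) {
            if (conflicts(&code[i], &code[i + 1]))
                continue;                         /* illegal swap       */
            int lo = (i >= 2) ? i - 2 : 0;        /* small local window */
            int hi = (i + 4 <= n) ? i + 4 : n;
            int before = window_cost(&code[lo], hi - lo);
            swap(&code[i], &code[i + 1]);
            if (window_cost(&code[lo], hi - lo) >= before)
                swap(&code[i], &code[i + 1]);     /* keep only wins     */
        }
}

int main(void)
{
    insn code[4] = {
        { 1, 0, -1, 3 },    /* r1 = f(r0), 3-cycle result      */
        { 2, 1, -1, 1 },    /* r2 = g(r1): back-to-back stall  */
        { 3, 0, -1, 1 },    /* r3 = h(r0): independent filler  */
        { 4, 2,  3, 1 },
    };
    schedule(code, 4, 4);
    for (int i = 0; i < 4; i++)
        printf("r%d\n", code[i].dst);   /* expect r1, r3, r2, r4 */
    return 0;
}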

Well, and "Can I swap two instructions out of a window of 6
instructions?" does not give optimal results (doesn't help that the
WEXifier looks like a horrid mess due to working with instructions as
16-bit words despite only supporting 32-bit instructions; early on I
think I didn't realize that I would be entirely excluding 16-bit
instructions from this process).

Better might be to evaluate a symmetric window of 12..16 instructions
(say, consider interlocks from the prior 5..7 instructions, the two in
the middle, and the following 5..7). For sake of ranking (with bundling),
one pretends that each instruction has roughly twice its actual
interlock latency (this way, hopefully, bundling instructions is less
likely to result in interlock penalties).

Though, partial tradeoff as I still want compile times that aren't glacial.

....

Though, from elsewhere, will note that I have 64 GPRs, basic
predication, and bundling (via daisy-chaining instructions) all within a
32-bit instruction word (though, the variant of the ISA with "full" 64
GPR support excludes 16-bit ops; as there isn't enough encoding space to
support both fully orthogonal 64 GPRs and 16-bit ops at the same time).

....

>
>

Re: Tonight's tradeoff

<um5b5c$1p88e$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35991&group=comp.arch#35991
 by: Robert Finch - Sat, 23 Dec 2023 00:55 UTC

On 2023-12-22 12:49 p.m., MitchAlsup wrote:
> EricP wrote:
>
>> Robert Finch wrote:
>>> Stuck on checkpoint RAM now. Everything was going good until…. I
>>> realized that while instructions are executing they need to be able
>>> to update previous checkpoints, not just the current one. Which
>>> checkpoint gets updated depends on which checkpoint the instruction
>>> falls under. It is the register valid bit that needs to be updated. I
>>> used a “brute force” approach to implement this and it is 40k LUTs.
>>> This is about five times too large a solution. If I reduce the number
>>> of checkpoints supported to four from sixteen, then the component is
>>> 20k LUTs. Still too large.
>>>
>>> The issue is there are 256 valid bits times 16 checkpoints which
>>> means 4096 registers. Muxing the register inputs and outputs uses a
>>> lot of LUTs.
>>>
>>> One thought is to stall until all the instructions with targets in a
>>> given checkpoint are finished executing before starting a new
>>> checkpoint region. It would seriously impact the CPU performance.
>>>
I think I maybe found a solution using a block RAM and about 8k LUTs.
>
>> (I don't have a solution, just passing on some info on this particular
>> checkpointing issue.)
>
>> Sounds like you might be using the same free register checkpoint
>> algorithm
>> I came up with for my simulator, which I assumed was a custom sram
>> design.
>
>> There is 1 bit for each physical register that is free.
>> The checkpoint for a Bcc conditional branch copies the free bit vector,
>> in your case 256 bits, to a row in the checkpoint sram.
>> As each instruction retires and frees up its old dest physical register
>> and it must mark the register free in *all* checkpoint contexts.
>
>> That requires the ability to set all the free flags for a single
>> register,
>> which means an sram design that can write a whole row, and also set
>> all the
>> bits in one column, in your case set the 16 bits in each checkpoint
>> for one
>> of the 256 registers.
>
Not sure about setting bits in all checkpoints. I probably just have not
understood the issue yet. Partially terminology. There are two different
things happening: the register free/available state, which is being managed
with FIFOs, and the register-contents valid bit. At the far end of the
pipeline, registers that were used are made free again by adding them to the
free FIFO. This is somewhat inefficient because they could be freed
sooner, but that would require more logic; instead more registers are
used, since they are available from the RAM anyway.
The register contents valid bit is cleared when a target register is
assigned, and set once a value is loaded into the target register. The
valid bit is also set for instructions that are stomped on as the old
value is valid. When a checkpoint is restored, it restores the state of
the valid bit along with the physical register tag. I am not
understanding why the valid bit would need to be modified in all
checkpoints. I would think it should reflect the pre-branch state of things.
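
A rough software model of that arrangement as described (free physical
registers handed out from a FIFO and returned at the far end of the pipe, a
rename map whose entries carry a tag plus a contents-valid bit, and
checkpoints that snapshot both); all sizes and structure are my guesses for
illustration:

#include <stdint.h>
#include <stdbool.h>

#define NPREG 256      /* physical registers      */
#define NAREG  64      /* architectural registers */
#define NCHK   16      /* checkpoints             */

typedef struct { uint16_t tag; bool valid; } map_ent;

static uint16_t free_fifo[NPREG];
static unsigned fifo_rd, fifo_wr;          /* indices wrap mod NPREG   */
static map_ent  cur[NAREG];                /* current rename map       */
static map_ent  chk[NCHK][NAREG];          /* checkpointed rename maps */

static uint16_t alloc_preg(void)         { return free_fifo[fifo_rd++ % NPREG]; }
static void     release_preg(uint16_t t) { free_fifo[fifo_wr++ % NPREG] = t; }

/* Assigning a new target clears the contents-valid bit ... */
static uint16_t rename_dest(int areg)
{
    uint16_t old = cur[areg].tag;          /* released later, at retire */
    cur[areg].tag   = alloc_preg();
    cur[areg].valid = false;
    return old;
}
/* ... and writeback sets it again. */
static void writeback(int areg) { cur[areg].valid = true; }

/* A checkpoint snapshots tag + valid together; restore brings back the
 * pre-branch state. */
static void save_chk(int c)    { for (int i = 0; i < NAREG; i++) chk[c][i] = cur[i]; }
static void restore_chk(int c) { for (int i = 0; i < NAREG; i++) cur[i] = chk[c][i]; }

int main(void)
{
    for (int i = 0; i < NPREG; i++)        /* p0..p63 start mapped to a0..a63, */
        free_fifo[i] = (uint16_t)i;        /* so allocation begins at p64      */
    fifo_wr = NPREG;
    fifo_rd = NAREG;
    for (int i = 0; i < NAREG; i++)
        cur[i] = (map_ent){ (uint16_t)i, true };

    save_chk(0);                           /* branch takes checkpoint 0   */
    uint16_t old = rename_dest(5);         /* a5 gets a new tag, invalid  */
    writeback(5);                          /* value arrives: valid again  */
    release_preg(old);                     /* old tag freed at far end    */
    restore_chk(0);                        /* mispredict: pre-branch map  */
    return 0;
}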

> Two points::
> 1) the register that gets freed up when you know this newly allocated
> register
> will retire,  can be determined with a small amount of logic (2 gates) per
> cell in your 256×16 matrix--no need for the column write/clear/set. You can
> use this overwrite across columns to perform register write elision.

I just record the old register number in the ROB when a new one is
allocated.
>
> 2) There are going to be allocations where you do not allocate any register
> to a particular instruction because the register is overwritten IN the same
> issue bundle. Here you can use a different "forwarding" notation so the
> result is captured by the stations and used without ever seeing the file.
>
> I called this matrix the "History Table" in Mc 88120, it provided valid
> bits back to the aRN->pRN CAMs <backup> and valid bits back to the
> register pool <successful retire>.
>
> Back then, we recognized that the architectural registers were a strict
> subset of the physical registers, so that as long as there were exactly
> 31 (then: 32 now) valid registers in the pRF, one could always read
> values to be written into reservation station entries. In effect, the
> whole thing was a RoB--Once the RoB gets big enough, there is no reason
> to have both a RoB and an aRF; just let the RoB do everything and change
> its name to Physical Register File. This eliminates the copy to aRF
> at retirement.
>
>> I was assuming an ASIC design so a small custom sram seemed reasonable.
>> But for an FPGA it requires 256*16 flip-flops plus decoders, etc.
>
>> I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have
>> independently come up with the same approach on their BOOM-3 SonicBoom.
>> Their note [5] describes the same problem as my column setting solves.
>
>> https://docs.boom-core.org/en/latest/sections/rename-stage.html
>
> I was doing something very similar in 1991.
>
>> While their target was 22nm ASIC, they say below that they
>> implemented a version of BOOM-3 on an FPGA but don't give details.
>> But their project might be open source so maybe the details
>> are available online.
>
>> Sonicboom: The 3rd generation berkeley out-of-order machine
>> http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf

Re: Tonight's tradeoff

<um5snr$1v5mq$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35992&group=comp.arch#35992
 by: Robert Finch - Sat, 23 Dec 2023 05:55 UTC

Whoo hoo! Broke the 1 instruction per clock barrier.

----- Stats -----
Clock ticks: 265 Instructions: 279:
113 IPC: 1.052830
I-Cache hit clocks: 109

Re: Tonight's tradeoff

<pDFhN.60926$yEgf.42972@fx09.iad>

https://news.novabbs.org/devel/article-flat.php?id=35996&group=comp.arch#35996
 by: EricP - Sat, 23 Dec 2023 18:26 UTC

Robert Finch wrote:
> On 2023-12-22 12:49 p.m., MitchAlsup wrote:
>> EricP wrote:
>>
>>> Robert Finch wrote:
>>>> Stuck on checkpoint RAM now. Everything was going good until…. I
>>>> realized that while instructions are executing they need to be able
>>>> to update previous checkpoints, not just the current one. Which
>>>> checkpoint gets updated depends on which checkpoint the instruction
>>>> falls under. It is the register valid bit that needs to be updated.
>>>> I used a “brute force” approach to implement this and it is 40k
>>>> LUTs. This is about five times too large a solution. If I reduce the
>>>> number of checkpoints supported to four from sixteen, then the
>>>> component is 20k LUTs. Still too large.
>>>>
>>>> The issue is there are 256 valid bits times 16 checkpoints which
>>>> means 4096 registers. Muxing the register inputs and outputs uses a
>>>> lot of LUTs.
>>>>
>>>> One thought is to stall until all the instructions with targets in a
>>>> given checkpoint are finished executing before starting a new
>>>> checkpoint region. It would seriously impact the CPU performance.
>>>>
> I think I maybe found a solution using a block RAM and about 8k LUTs.
>>
>>> (I don't have a solution, just passing on some info on this particular
>>> checkpointing issue.)
>>
>>> Sounds like you might be using the same free register checkpoint
>>> algorithm
>>> I came up with for my simulator, which I assumed was a custom sram
>>> design.
>>
>>> There is 1 bit for each physical register that is free.
>>> The checkpoint for a Bcc conditional branch copies the free bit vector,
>>> in your case 256 bits, to a row in the checkpoint sram.
>>> As each instruction retires and frees up its old dest physical register
>>> and it must mark the register free in *all* checkpoint contexts.
>>
>>> That requires the ability to set all the free flags for a single
>>> register,
>>> which means an sram design that can write a whole row, and also set
>>> all the
>>> bits in one column, in your case set the 16 bits in each checkpoint
>>> for one
>>> of the 256 registers.
>>
> Not sure about setting bits in all checkpoints. I probably have not just
> understood the issue yet. Partially terminology. There are two different
> things happening. The register free/available which is being managed
> with fifos and the register contents valid bit. At the far end of the
> pipeline, registers that were used are made free again by adding to the
> free fifo. This is somewhat inefficient because they could be freed
> sooner, but it would require more logic, instead more registers are
> used, they are available from the RAM anyway.
> The register contents valid bit is cleared when a target register is
> assigned, and set once a value is loaded into the target register. The
> valid bit is also set for instructions that are stomped on as the old
> value is valid. When a checkpoint is restored, it restores the state of
> the valid bit along with the physical register tag. I am not
> understanding why the valid bit would need to be modified in all
> checkpoints. I would think it should reflect the pre-branch state of
> things.

This has to do with free physical register list checkpointing and
a particular gotcha that occurs if one tries to use a vanilla sram
to save the free map bit vector for each checkpoint.
It sounds like the BOOM people stepped in this gotcha at some point.

Say a design has a bit vector indicating which physical registers are free.
Rename allocates a register by using a priority selector to scan that
vector and select a free PR to assign as a new dest PR.
When this instruction retires, the old dest PR is freed and
the new dest PR becomes the architectural register.

When Decode sees a conditional branch Bcc it allocates a
checkpoint in a circular buffer by incrementing the head counter,
copies the *current* free bit vector into the new checkpoint row,
and saves the new checkpoint index # in the Bcc uOp.
If a branch mispredict occurs then we can restore the state at the
Bcc by copying various state info from the Bcc checkpoint index #.
This includes copying back the saved free vector to the current free vector.
When the Bcc uOp retires we increment the circular tail counter
to recover the checkpoint buffer row.

The problem occurs when an old dest PR is in use so its free bit is clear
when the checkpoint is saved. Then the instruction retires and marks the
old dest PR as free in the bit vector. Then Bcc triggers a mispredict
and restores the free vector that was copied when the checkpoint was saved,
including the then not-free state of the PR freed after the checkpoint.
Result: the PR is lost from the free list. After enough mispredicts you
run out of free physical registers and hang at Rename waiting to allocate.

It needs some way to edit the checkpointed free bit vector so that,
no matter in what order PR-allocate, retire-PR-free, checkpoint save #X,
and rollback to checkpoint #Y occur, the correct free vector gets restored.
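
A tiny self-contained demonstration of that ordering gotcha, shrunk to a
64-entry free vector so it fits in one machine word:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t free_vec = ~0ull;      /* toy machine: 64 physical registers   */

    free_vec &= ~(1ull << 7);       /* PR7 is in use (an old dest not yet
                                       freed) when the Bcc is decoded       */
    uint64_t saved = free_vec;      /* checkpoint: plain copy of the vector */

    free_vec |= 1ull << 7;          /* the instruction retires and returns
                                       PR7 to the free pool                 */

    free_vec = saved;               /* mispredict: restore the stale copy   */

    printf("PR7 free after rollback? %s\n",
           (free_vec >> 7) & 1 ? "yes" : "no -- leaked");
    return 0;
}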

Re: Tonight's tradeoff

<53a87400b23b25dc38cbae0568aab468@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=36342&group=comp.arch#36342
 by: MitchAlsup - Sat, 23 Dec 2023 23:19 UTC

EricP wrote:

> Robert Finch wrote:
>> On 2023-12-22 12:49 p.m., MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Robert Finch wrote:
>>>>> Stuck on checkpoint RAM now. Everything was going good until…. I
>>>>> realized that while instructions are executing they need to be able
>>>>> to update previous checkpoints, not just the current one. Which
>>>>> checkpoint gets updated depends on which checkpoint the instruction
>>>>> falls under. It is the register valid bit that needs to be updated.
>>>>> I used a “brute force” approach to implement this and it is 40k
>>>>> LUTs. This is about five times too large a solution. If I reduce the
>>>>> number of checkpoints supported to four from sixteen, then the
>>>>> component is 20k LUTs. Still too large.
>>>>>
>>>>> The issue is there are 256 valid bits times 16 checkpoints which
>>>>> means 4096 registers. Muxing the register inputs and outputs uses a
>>>>> lot of LUTs.
>>>>>
>>>>> One thought is to stall until all the instructions with targets in a
>>>>> given checkpoint are finished executing before starting a new
>>>>> checkpoint region. It would seriously impact the CPU performance.
>>>>>
>> I think I maybe found a solution using a block RAM and about 8k LUTs.
>>>
>>>> (I don't have a solution, just passing on some info on this particular
>>>> checkpointing issue.)
>>>
>>>> Sounds like you might be using the same free register checkpoint
>>>> algorithm
>>>> I came up with for my simulator, which I assumed was a custom sram
>>>> design.
>>>
>>>> There is 1 bit for each physical register that is free.
>>>> The checkpoint for a Bcc conditional branch copies the free bit vector,
>>>> in your case 256 bits, to a row in the checkpoint sram.
>>>> As each instruction retires and frees up its old dest physical register
>>>> and it must mark the register free in *all* checkpoint contexts.
>>>
>>>> That requires the ability to set all the free flags for a single
>>>> register,
>>>> which means an sram design that can write a whole row, and also set
>>>> all the
>>>> bits in one column, in your case set the 16 bits in each checkpoint
>>>> for one
>>>> of the 256 registers.
>>>
>> Not sure about setting bits in all checkpoints. I probably have not just
>> understood the issue yet. Partially terminology. There are two different
>> things happening. The register free/available which is being managed
>> with fifos and the register contents valid bit. At the far end of the
>> pipeline, registers that were used are made free again by adding to the
>> free fifo. This is somewhat inefficient because they could be freed
>> sooner, but it would require more logic, instead more registers are
>> used, they are available from the RAM anyway.
>> The register contents valid bit is cleared when a target register is
>> assigned, and set once a value is loaded into the target register. The
>> valid bit is also set for instructions that are stomped on as the old
>> value is valid. When a checkpoint is restored, it restores the state of
>> the valid bit along with the physical register tag. I am not
>> understanding why the valid bit would need to be modified in all
>> checkpoints. I would think it should reflect the pre-branch state of
>> things.

> This has to do with free physical register list checkpointing and
> a particular gotcha that occurs if one tries to use a vanilla sram
> to save the free map bit vector for each checkpoint.
> It sounds like the BOOM people stepped in this gotcha at some point.

> Say a design has a bit vector indicating which physical registers are free.
> Rename allocates a register by using a priority selector to scan that
> vector and select a free PR to assign as a new dest PR.
> When this instruction retires, the old dest PR is freed and
> the new dest PR becomes the architectural register.

It is often the case that a logical register is the destination of more
than one result in a single checkpoint. When this is the case, no
physical register need be allocated to the now-dead result, so we
invented a way to convey that this result is only captured from the
operand bus and was never even contemplated to be written into the
pRF. This makes the pool of free registers go further--up to 30%
further...
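
A sketch of how that elision might look at rename time; the group size,
structures and names are invented here, and the real mechanism is a
forwarding notation on the result bus rather than anything this literal:

#include <stdbool.h>

#define GROUP 4

typedef struct {
    int  adest;         /* architectural destination, -1 if none          */
    int  pdest;         /* physical register, -1 if allocation was elided */
    bool forward_only;  /* result only ever lives on the result bus       */
} uop;

static int next_free = 100;   /* stand-in for a real free-list pop        */

static void rename_group(uop g[GROUP])
{
    for (int i = 0; i < GROUP; i++) {
        if (g[i].adest < 0)
            continue;
        bool overwritten = false;
        for (int j = i + 1; j < GROUP; j++)
            if (g[j].adest == g[i].adest)
                overwritten = true;      /* same dest written again later  */
        g[i].forward_only = overwritten; /* consumers catch it off the bus */
        g[i].pdest = overwritten ? -1 : next_free++;
    }
}

int main(void)
{
    uop g[GROUP] = { { 3, 0, 0 }, { 5, 0, 0 }, { 3, 0, 0 }, { -1, 0, 0 } };
    rename_group(g);                     /* g[0] gets no physical register */
    return 0;
}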

> When Decode sees a conditional branch Bcc it allocates a
> checkpoint in a circular buffer by incrementing the head counter,
> copies the *current* free bit vector into the new checkpoint row,
> and saves the new checkpoint index # in the Bcc uOp.
> If a branch mispredict occurs then we can restore the state at the
> Bcc by copying various state info from the Bcc checkpoint index #.
> This includes copying back the saved free vector to the current free vector.
> When the Bcc uOp retires we increment the circular tail counter
> to recover the checkpoint buffer row.

> The problem occurs when an old dest PR is in use so its free bit is clear
> when the checkpoint is saved. Then the instruction retires and marks the
> old dest PR as free in the bit vector. Then Bcc triggers a mispredict
> and restores the free vector that was copied when the checkpoint was saved,
> including the then not-free state of the PR freed after the checkpoint.
> Result: the PR is lost from the free list. After enough mispredicts you
> run out of free physical registers and hang at Rename waiting to allocate.

Michael Shebanow and I have a patent on that dated around 1992 (filing).
Our design could be retiring one or more checkpoints, backing up a mis-
predicted branch, and issuing instructions on the alternate path; all in
the same clock.

> It needs some way to edit the checkpointed free bit vector so that
> no matter what order of PR-allocate, retire-PR-free, checkpoint save #X,
> and rollback to checkpoint #Y, that the correct free vector gets restored.
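
A minimal C sketch of that requirement, with sizes and every name
(freevec, ckpt, retire_free) invented purely for illustration: the
retire-time free is applied to the current vector and to every saved
checkpoint row, so a later rollback cannot lose the register.

#include <stdint.h>

#define NPREG  256                      /* physical registers             */
#define NCKPT  16                       /* checkpoints                    */
#define WORDS  (NPREG / 64)

static uint64_t freevec[WORDS];         /* current free bits, 1 = free    */
static uint64_t ckpt[NCKPT][WORDS];     /* saved free bits per checkpoint */

/* Decode takes a checkpoint for a Bcc: copy the whole row. */
void ckpt_save(int c) {
    for (int w = 0; w < WORDS; w++)
        ckpt[c][w] = freevec[w];
}

/* Retire frees the old dest PR: set its bit in the current vector and
 * in every checkpoint (the "column write" described above). */
void retire_free(int preg) {
    uint64_t bit = 1ull << (preg % 64);
    int w = preg / 64;
    freevec[w] |= bit;
    for (int c = 0; c < NCKPT; c++)
        ckpt[c][w] |= bit;
}

/* Mispredict rollback: the saved row is now safe to copy back verbatim,
 * because retire_free() already patched every live checkpoint. */
void ckpt_restore(int c) {
    for (int w = 0; w < WORDS; w++)
        freevec[w] = ckpt[c][w];
}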

Re: Tonight's tradeoff

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: Sat, 23 Dec 2023 21:27:04 -0500
Message-ID: <um84tq$2a541$1@dont-email.me>

On 2023-12-23 6:19 p.m., MitchAlsup wrote:
> EricP wrote:
>
>> Robert Finch wrote:
>>> On 2023-12-22 12:49 p.m., MitchAlsup wrote:
>>>> EricP wrote:
>>>>
>>>>> Robert Finch wrote:
>>>>>> Stuck on checkpoint RAM now. Everything was going well until…. I
>>>>>> realized that while instructions are executing they need to be
>>>>>> able to update previous checkpoints, not just the current one.
>>>>>> Which checkpoint gets updated depends on which checkpoint the
>>>>>> instruction falls under. It is the register valid bit that needs
>>>>>> to be updated. I used a “brute force” approach to implement this
>>>>>> and it is 40k LUTs. This is about five times too large a solution.
>>>>>> If I reduce the number of checkpoints supported to four from
>>>>>> sixteen, then the component is 20k LUTs. Still too large.
>>>>>>
>>>>>> The issue is there are 256 valid bits times 16 checkpoints which
>>>>>> means 4096 registers. Muxing the register inputs and outputs uses
>>>>>> a lot of LUTs.
>>>>>>
>>>>>> One thought is to stall until all the instructions with targets in
>>>>>> a given checkpoint are finished executing before starting a new
>>>>>> checkpoint region. It would seriously impact the CPU performance.
>>>>>>
>>> I think I may have found a solution using a block RAM and about 8k LUTs.
>>>>
>>>>> (I don't have a solution, just passing on some info on this particular
>>>>> checkpointing issue.)
>>>>
>>>>> Sounds like you might be using the same free register checkpoint
>>>>> algorithm
>>>>> I came up with for my simulator, which I assumed was a custom sram
>>>>> design.
>>>>
>>>>> There is 1 bit for each physical register that is free.
>>>>> The checkpoint for a Bcc conditional branch copies the free bit
>>>>> vector,
>>>>> in your case 256 bits, to a row in the checkpoint sram.
>>>>> As each instruction retires and frees up its old dest physical
>>>>> register, it must mark the register free in *all* checkpoint contexts.
>>>>
>>>>> That requires the ability to set all the free flags for a single
>>>>> register,
>>>>> which means an sram design that can write a whole row, and also set
>>>>> all the
>>>>> bits in one column, in your case set the 16 bits in each checkpoint
>>>>> for one
>>>>> of the 256 registers.
>>>>
>>> Not sure about setting bits in all checkpoints. I probably just have
>>> not understood the issue yet; partially it is terminology. There are
>>> two different things happening: the register free/available state,
>>> which is being managed with fifos, and the register contents valid
>>> bit. At the far end of the pipeline, registers that were used are made
>>> free again by adding them to the free fifo. This is somewhat
>>> inefficient because they could be freed sooner, but that would require
>>> more logic; instead more registers are used, since they are available
>>> from the RAM anyway.
>>> The register contents valid bit is cleared when a target register is
>>> assigned, and set once a value is loaded into the target register.
>>> The valid bit is also set for instructions that are stomped on, as the
>>> old value is valid. When a checkpoint is restored, it restores the
>>> state of the valid bit along with the physical register tag. I am not
>>> understanding why the valid bit would need to be modified in all
>>> checkpoints. I would think it should reflect the pre-branch state of
>>> things.
>
>> This has to do with free physical register list checkpointing and
>> a particular gotcha that occurs if one tries to use a vanilla sram
>> to save the free map bit vector for each checkpoint.
>> It sounds like the BOOM people stepped in this gotcha at some point.
>
>> Say a design has a bit vector indicating which physical registers are
>> free.
>> Rename allocates a register by using a priority selector to scan that
>> vector and select a free PR to assign as a new dest PR.
>> When this instruction retires, the old dest PR is freed and
>> the new dest PR becomes the architectural register.
>
> It is often the case that a logical register is the destination of more
> than one result in a single checkpoint. When this is the case, no
> physical register need be allocated to the now-dead result, so we
> invented a way to convey that this result is only captured from the
> operand bus and was never even contemplated to be written into the pRF.
> This makes the pool of free registers go further--up to 30% further.

Sounds good, but Q+ is not using forwarding busses; everything goes
through the register file ATM. Saved in my mind for a later version.
>
>> When Decode sees a conditional branch Bcc it allocates a
>> checkpoint in a circular buffer by incrementing the head counter,
>> copies the *current* free bit vector into the new checkpoint row,
>> and saves the new checkpoint index # in the Bcc uOp.
>> If a branch mispredict occurs then we can restore the state at the
>> Bcc by copying various state info from the Bcc checkpoint index #.
>> This includes copying back the saved free vector to the current free
>> vector.
>> When the Bcc uOp retires we increment the circular tail counter
>> to recover the checkpoint buffer row.

I did things slightly differently: there is only one checkpoint index,
managed with a branch counter. If the number of branches outstanding is
greater than the size of the checkpoint array, the machine is stalled;
otherwise it is assumed the checkpoints can be reused.

I am also using fifos to allocate and free registers in the FPGA because
I think that uses fewer resources than manipulating bit vectors.
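
For what it is worth, a rough C model of that fifo-plus-branch-counter
arrangement might look like the sketch below; NPREG, NCKPT and all of
the names are made up for illustration and are not the actual Q+ RTL.

#include <stdbool.h>
#include <stdint.h>

#define NPREG 256                       /* physical registers             */
#define NCKPT 16                        /* checkpoint array size          */

static uint16_t free_fifo[NPREG];       /* queue of free register numbers */
static int head, tail, count;           /* free-fifo state                */
static int branches_outstanding;        /* ++ at Bcc decode, -- at retire */

/* Rename pops a physical register; returns -1 (stall) when empty. */
int alloc_preg(void) {
    if (count == 0) return -1;
    int p = free_fifo[head];
    head = (head + 1) % NPREG;
    count--;
    return p;
}

/* Retire pushes the old dest register back onto the fifo. */
void free_preg(int p) {
    free_fifo[tail] = p;
    tail = (tail + 1) % NPREG;
    count++;
}

/* Decode stalls a branch when every checkpoint is in use. */
bool can_decode_branch(void) {
    return branches_outstanding < NCKPT;
}

Presumably a mispredict would restore the fifo head index saved with the
checkpoint (leaving the tail alone), so that registers handed out after
the checkpoint become free again without losing any that were freed in
the meantime.
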
>
>> The problem occurs when an old dest PR is in use so its free bit is clear
>> when the checkpoint is saved. Then the instruction retires and marks the
>> old dest PR as free in the bit vector. Then Bcc triggers a mispredict
>> and restores the free vector that was copied when the checkpoint was
>> saved,
>> including the then not-free state of the PR freed after the checkpoint.
>> Result: the PR is lost from the free list. After enough mispredicts you
>> run out of free physical registers and hang at Rename waiting to
>> allocate.

I think there may be an issue pending here in Q+, but I have not yet run
into it. Thank you for the description of the issue; I'll know what it is
when it occurs.
>
> Michael Shebanow and I have a patent on that dated around 1992 (filing).
> Our design could be retiring one or more checkpoints, backing up a mis-
> predicted branch, and issuing instructions on the alternate path; all in
> the same clock.
>
>> It needs some way to edit the checkpointed free bit vector so that
>> no matter what order of PR-allocate, retire-PR-free, checkpoint save #X,
>> and rollback to checkpoint #Y, that the correct free vector gets
>> restored.

I should look at patents more often. I have been going by descriptions
of things found on the web, and reading some textbooks.

Updating the register valid bits is done across multiple clock cycles in
Q+. There are only eight write ports to the valid bits, so when
instructions are stomped on during a branch miss it takes several clock
cycles (for example, stomping 20 instructions through eight ports needs
at least three cycles just for the valid bits). Branches are horrendously
slow ATM. There can be 20 instructions fetched ahead by the time the
branch is decoded, and they all have to be stomped on if the branch is
taken. Branches take many clocks, but they execute, or skip over lots of
NOPs, so the IPC is up.

I configured the number of ALUs to two instead of one and it did not make
any difference. The difference is washed out by the time taken for
branches and memory ops. It hit IPC=2.4 executing mainly NOPs.

Re: Tonight's tradeoff

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: Mon, 8 Jan 2024 23:44:03 -0500
Message-ID: <uniiuk$1t2bq$1@dont-email.me>

Predicated logic and the PRED modifier on my mind tonight.

I think I have discovered an interesting way to handle predicated logic.
If a predicate is true the instruction is scheduled and executes
normally. If the predicate is false the instruction is modified to a
special copy operation and scheduled to execute on an ALU regardless of
what the original execution unit would be. What makes this efficient is
that only a single target register read port is required for the ALU
unit versus having a target register read port for every functional
unit. The copy mux is present in the ALU only and not in the other
functional units. For most instructions there is no predication.

Supporting the PRED modifier pushed my core over the size limit, to 141k
LUTs for the system; there are only 136k LUTs available. PRED doubled
the size of the scheduler, as the scheduler must check for a previous
PRED modifier to know how to schedule instructions. The PRED coverage
window is eight instructions, so the scheduler searches up to eight
instructions backwards from the current position for a PRED. The window
was set to eight instructions to accommodate vector instructions, which
expand into eight separate instructions in the micro-code in most cases.
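
As described above, the false-predicate case amounts to rewriting the
uOp at schedule time into a copy of the old destination value, steered
to an ALU. A small C sketch of that rewrite, with every type and field
name invented for illustration:

typedef enum { UNIT_ALU, UNIT_FPU, UNIT_MEM } unit_t;

typedef struct {
    int    op;          /* original opcode                        */
    unit_t unit;        /* functional unit it would have used     */
    int    new_dest;    /* physical register assigned by rename   */
    int    old_dest;    /* previous mapping of the same arch reg  */
    int    src1, src2;  /* renamed sources                        */
} uop_t;

#define OP_COPY 0xFF    /* made-up "copy old dest to new dest" op */

void apply_predicate(uop_t *u, int pred_true) {
    if (pred_true)
        return;             /* schedule and execute normally       */
    u->op   = OP_COPY;      /* becomes a register-to-register copy */
    u->unit = UNIT_ALU;     /* only the ALU has the copy mux and   */
    u->src1 = u->old_dest;  /* the extra target-register read port */
    u->src2 = -1;
}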

Re: Tonight's tradeoff

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Date: Mon, 8 Jan 2024 23:43:19 -0600
Message-ID: <unimdt$1tet2$1@dont-email.me>

On 1/8/2024 10:44 PM, Robert Finch wrote:
> Predicated logic and the PRED modifier on my mind tonight.
>
> I think I have discovered an interesting way to handle predicated logic.
> If a predicate is true the instruction is scheduled and executes
> normally. If the predicate is false the instruction is modified to a
> special copy operation and scheduled to execute on an ALU regardless of
> what the original execution unit would be. What makes this efficient is
> that only a single target register read port is required for the ALU
> unit versus having a target register read port for every functional
> unit. The copy mux is present in the ALU only and not in the other
> functional units. For most instructions there is no predication.
>
> Supporting the PRED modifier pushed my core over the size limit, to 141k
> LUTs for the system; there are only 136k LUTs available. PRED doubled
> the size of the scheduler, as the scheduler must check for a previous
> PRED modifier to know how to schedule instructions. The PRED coverage
> window is eight instructions, so the scheduler searches up to eight
> instructions backwards from the current position for a PRED. The window
> was set to eight instructions to accommodate vector instructions, which
> expand into eight separate instructions in the micro-code in most cases.
>
>

I handled predication slightly differently:
The pipeline has a tag pattern, and there is an SR.T bit (PPT):
00z: (Y) Always
01z: (N) Never
100: (N) If-True, T=0
101: (Y) If-True, T=1
110: (Y) If-False, T=1
111: (N) If-False, T=0

Most operations:
Y: Forward the operation;
N: Replace operation with NOP.
Branch operations:
Y: Forward the operation;
N: Replace operation with BRA_NB.

The BRA_NB operator:
If branch-predictor predicted a branch:
Initiate a branch to the following instruction;
Else: NOP.
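
Transcribing that tag table directly into a lookup, purely to spell out
the decision (the OP_* encodings are invented; the real decode is
presumably just a few gates):

#define OP_NOP    0x00      /* invented encodings, illustration only */
#define OP_BRA_NB 0x01

static const int tag_forwards[8] = {
    1, 1,   /* 00z: Always                  */
    0, 0,   /* 01z: Never                   */
    0, 1,   /* 100/101: If-True,  T=0 / T=1 */
    1, 0    /* 110/111: If-False, T=1 / T=0 */
};

int apply_tag(int op, int tag, int is_branch) {
    if (tag_forwards[tag & 7])
        return op;                           /* (Y) forward the op */
    return is_branch ? OP_BRA_NB : OP_NOP;   /* (N) NOP, or BRA_NB */
}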

Originally, this logic was handled in EX1, but has now been partly moved
to ID2 (so is performed along with register fetch). This change
effectively increased the latency of CMPxx handling (to 2 cycles), but
did improve FPGA timing (total negative slack).

In my case, predication is encoded in the base instruction.
00: Execute if True (E0..E3, E8..EB)
01: Execute if False (E4..E7, EC..EF)
10: Scalar (F0..F3, F8..FB)
11: Wide-Execute (F4..F7, FC..FF)

Though, one sub-block was special:
00:
EAnm-ZeoZ: Mirror F0, Wide-Execute, If-True
EBnm-ZeoZ: Mirror F2, Wide-Execute, If-True
01:
EEnm-ZeoZ: Mirror F0, Wide-Execute, If-False
EFnm-ZeoZ: Mirror F2, Wide-Execute, If-False
10:
FAii-iiii: Load Imm24u into R0
FBii-iiii: Load Imm24n into R0
11:
FEii-iiii: Jumbo Prefix, Extends Immediate
FFwZ-Zjii: Jumbo Prefix, Extends Instruction

Well, and with a few more special cases:
FEii-iiii-FAii-iiii: Load Imm48u into R0
FEii-iiii-FBii-iiii: Load Imm48n into R0
FFii-iiii-FAii-iiii: BRA Abs48 (but, only within the same mode, *)
FFii-iiii-FBii-iiii: BSR Abs48 (but, only within the same mode)

*: Sadly, inter-mode jumps require encoding something like:
MOV Imm64, R1; JMP R1

At the moment, the BJX2 CPU core weighs in at around 37k LUTs.
Or:
FF : 11.5k
LUT : 37.0k
MUX : 0.5k
CARRY: 0.9k
BMEM : 34 (AKA: BRAM)
MULT : 38 (AKA: DSP)
DMEM : 1.8k (AKA: LUTRAM)

Current timing slack for 50MHz domain (WNS): 1.783ns

A lingering effect of my effort to get the CPU running at 75MHz is that
the core is now slightly smaller and has more timing slack (though I had
also stumbled into, and fixed, a few logic bugs along the way).

....

Re: Tonight's tradeoff

From: ThatWouldBeTelling@thevillage.com (EricP)
Newsgroups: comp.arch
Date: Tue, 09 Jan 2024 08:23:24 -0500
Message-ID: <KObnN.95717$q3F7.73256@fx45.iad>

Robert Finch wrote:
> Predicated logic and the PRED modifier on my mind tonight.
>
> I think I have discovered an interesting way to handle predicated logic.
> If a predicate is true the instruction is scheduled and executes
> normally. If the predicate is false the instruction is modified to a
> special copy operation and scheduled to execute on an ALU regardless of
> what the original execution unit would be. What makes this efficient is
> that only a single target register read port is required for the ALU
> unit versus having a target register read port for every functional
> unit. The copy mux is present in the ALU only and not in the other
> functional units. For most instructions there is no predication.

Yes, the general case is that each uOp has a predicate source and a bool to
test. If the value matches the predicate you execute the ON_MATCH part of
the uOp; if it does not match, you execute the ON_NO_MATCH part.

condition = True | False

(pred == condition) ? ON_MATCH : ON_NO_MATCH;

The ON_NO_MATCH uOp function is usually some housekeeping.
On an in-order it might diddle the scoreboard to indicate the register
write is done. On OoO it might copy the old dest register to new.

Note that the source register dependencies change between match and no_match.

if (pred == True) ADD r3 = r2 + r1

If pred == True then it matches and the uOp is dependent on r2 and r1.
If pred != True then it is a no_match and the uOp is dependent on the old
dest r3 as a source to copy to the new dest r3.

Dynamically pruning the unnecessary uOp source register dependencies
for the alternate part can allow it to launch earlier.
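
A little C sketch of that dependency selection, just to spell it out
(the types and field names are invented):

typedef struct {
    int pred_reg;    /* register/flag holding the predicate value  */
    int pred_sense;  /* run ON_MATCH when pred == pred_sense       */
    int src1, src2;  /* sources of the real operation              */
    int old_dest;    /* previous mapping of the dest arch register */
    int new_dest;    /* newly renamed destination                  */
} pred_uop_t;

/* Once the predicate value is known, prune the dependencies:
 *   match    -> wait on src1/src2 and do the real op into new_dest
 *   no match -> wait only on old_dest and copy it into new_dest    */
void resolve_pred(const pred_uop_t *u, int pred_value,
                  int *dep1, int *dep2, int *do_real_op) {
    if (pred_value == u->pred_sense) {
        *dep1 = u->src1;        /* ON_MATCH: e.g. ADD r3 = r2 + r1 */
        *dep2 = u->src2;
        *do_real_op = 1;
    } else {
        *dep1 = u->old_dest;    /* ON_NO_MATCH: copy old r3        */
        *dep2 = -1;
        *do_real_op = 0;
    }
}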

Also predicated LD and ST have some particular issues to think about.
For example, under TSO a younger LD cannot bypass an older LD.
If an older LD has an unresolved predicate then we don't know if it exists
so we have to block the younger LD until the older predicate resolves.
The LD's ON_NO_MATCH housekeeping might include diddling the LSQ dependency
matrix to wake up any younger LD's in the LSQ that had been blocked.

(Yes, I'm sure one could get fancier with replay traps.)

Re: Tonight's tradeoff

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Date: Tue, 9 Jan 2024 20:38:41 +0000
Message-ID: <ab91a26bce530ed2cdb5c4bd2a2b84cc@www.novabbs.com>

EricP wrote:

> Robert Finch wrote:
>> Predicated logic and the PRED modifier on my mind tonight.
>>
>> I think I have discovered an interesting way to handle predicated logic.
>> If a predicate is true the instruction is scheduled and executes
>> normally. If the predicate is false the instruction is modified to a
>> special copy operation and scheduled to execute on an ALU regardless of
>> what the original execution unit would be. What makes this efficient is
>> that only a single target register read port is required for the ALU
>> unit versus having a target register read port for every functional
>> unit. The copy mux is present in the ALU only and not in the other
>> functional units. For most instructions there is no predication.

> Yes, the general case is that each uOp has a predicate source and a bool
> to test. If the value matches the predicate you execute the ON_MATCH part
> of the uOp; if it does not match, you execute the ON_NO_MATCH part.

> condition = True | False

> (pred == condition) ? ON_MATCH : ON_NO_MATCH;

> The ON_NO_MATCH uOp function is usually some housekeeping.
> On an in-order it might diddle the scoreboard to indicate the register
> write is done. On OoO it might copy the old dest register to new.

An SB handles this situation with greater elegance than a reservation station.
The SB can merely clear the dependency without writing to the RF, so the
now-released reader reads the older value. {Thornton SB}

The value-capturing reservation station entry has to first capture and then
ignore the delivered result (and so does the RF/RoB). {Tomasulo RS}

The value-free RS entry is more like the SB than the Tomasulo RS.

A typical SB can be used to hold result delivery on instructions in the
shadow of a PRED, to keep the data-flow mechanism from getting unkempt.
Both the then-clause and else-clause can be held while the condition is
evaluating,...

> Note that the source register dependencies change between match and no_match.

> if (pred == True) ADD r3 = r2 + r1

> If pred == True then it matches and the uOp is dependent on r2 and r1.
> If pred != True then it is a no_match and the uOp is dependent on the old
> dest r3 as a source to copy to the new dest r3.

Yes, and there can be multiple instructions in the shadow of a PRED.

> Dynamically pruning the unnecessary uOp source register dependencies
> for the alternate part can allow it to launch earlier.

As illustrated above, no need to stall launch if you can stall result
delivery. {A key component of the Thornton SB}

> Also predicated LD and ST have some particular issues to think about.
> For example, under TSO a younger LD cannot bypass an older LD.

Easy:: don't do TSO <most of the time> or SC <outside of ATOMIC stuff>.

> If an older LD has an unresolved predicate then we don't know if it exists
> so we have to block the younger LD until the older predicate resolves.

This is why TSO and SC are slower than causal or weaker. Consider a memory
reorder buffer which allows generated addresses to probe the cache and
determine hit as operand data-flow permits--BUT holds onto the data and
writes (LD) or reads (ST) the RF in program order. This violates TSO
and SC, but mono-threaded codes are immune to this memory ordering problem
{and multi-threaded programs are immune except while performing ATOMIC
things.}

TSO and SC are simply slower when trying to perform memory reference
instructions in both the then-clause and the else-clause while waiting for
the resolution of the condition--even if no results are written into the RF
until after resolution.

> The LD's ON_NO_MATCH housekeeping might include diddling the LSQ dependency
> matrix to wake up any younger LD's in the LSQ that had been blocked.

> (Yes, I'm sure one could get fancier with replay traps.)

