Rocksolid Light

comp.arch: Re: Tonight's tradeoff

Subject  (Author)
* Tonight's tradeoff  (Robert Finch)
+* Re: Tonight's tradeoff  (EricP)
|`* Re: Tonight's tradeoff  (MitchAlsup)
| `* Re: Tonight's tradeoff  (Robert Finch)
|  `* Re: Tonight's tradeoff  (MitchAlsup)
|   `* Re: Tonight's tradeoff  (Robert Finch)
|    +- Re: Tonight's tradeoff  (Robert Finch)
|    `* Re: Tonight's tradeoff  (MitchAlsup)
|     `* Re: Tonight's tradeoff  (Robert Finch)
|      `* Re: Tonight's tradeoff  (Robert Finch)
|       `* Re: Tonight's tradeoff  (MitchAlsup)
|        +* Re: Tonight's tradeoff  (Robert Finch)
|        |+* Re: Tonight's tradeoff  (BGB)
|        ||`* Re: Tonight's tradeoff  (Robert Finch)
|        || +* Re: Tonight's tradeoff  (Scott Lurndal)
|        || |`- Re: Tonight's tradeoff  (MitchAlsup)
|        || `- Re: Tonight's tradeoff  (BGB)
|        |+- Re: Tonight's tradeoff  (Scott Lurndal)
|        |`* Re: Tonight's tradeoff  (MitchAlsup)
|        | `* Re: Tonight's tradeoff  (Scott Lurndal)
|        |  +* Re: Tonight's tradeoff  (MitchAlsup)
|        |  |`* Re: Tonight's tradeoff  (Scott Lurndal)
|        |  | `* Re: Tonight's tradeoff  (Robert Finch)
|        |  |  +- Re: Tonight's tradeoff  (MitchAlsup)
|        |  |  `- Re: Tonight's tradeoff  (Scott Lurndal)
|        |  `* Re: Tonight's tradeoff  (Anton Ertl)
|        |   +* Re: Tonight's tradeoff  (EricP)
|        |   |+- Re: Tonight's tradeoff  (MitchAlsup)
|        |   |`- Re: Tonight's tradeoff  (Anton Ertl)
|        |   +* Re: Tonight's tradeoff  (BGB)
|        |   |+* Re: Tonight's tradeoff  (Scott Lurndal)
|        |   ||+- Re: Tonight's tradeoff  (BGB)
|        |   ||`* Re: Tonight's tradeoff  (MitchAlsup)
|        |   || `- Re: Tonight's tradeoff  (BGB)
|        |   |+- Re: Tonight's tradeoff  (Robert Finch)
|        |   |`* Re: Tonight's tradeoff  (Anton Ertl)
|        |   | `- Re: Tonight's tradeoff  (BGB)
|        |   `* Re: Tonight's tradeoff  (Scott Lurndal)
|        |    `* Re: Tonight's tradeoff  (Anton Ertl)
|        |     `* Re: Tonight's tradeoff  (Scott Lurndal)
|        |      `* Re: Tonight's tradeoff  (Anton Ertl)
|        |       `* Re: Tonight's tradeoff  (Robert Finch)
|        |        +- Re: Tonight's tradeoff  (Scott Lurndal)
|        |        +* Re: Tonight's tradeoff  (EricP)
|        |        |`* Re: Tonight's tradeoff  (MitchAlsup)
|        |        | `* Re: Tonight's tradeoff  (Robert Finch)
|        |        |  `* Re: Tonight's tradeoff  (MitchAlsup)
|        |        |   `* Re: Tonight's tradeoff  (Robert Finch)
|        |        |    `* Re: Tonight's tradeoff  (MitchAlsup)
|        |        |     `* Re: Tonight's tradeoff  (Robert Finch)
|        |        |      `- Re: Tonight's tradeoff  (MitchAlsup)
|        |        `* Re: Tonight's tradeoff  (Robert Finch)
|        |         `* Re: Tonight's tradeoff  (EricP)
|        |          +* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |+- Re: Tonight's tradeoff  (Robert Finch)
|        |          |`* Re: Tonight's tradeoff  (BGB)
|        |          | `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |  `* Re: Tonight's tradeoff  (BGB)
|        |          |   `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |    +- Re: Tonight's tradeoff  (MitchAlsup)
|        |          |    `* Re: Tonight's tradeoff  (BGB)
|        |          |     `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |      `* Re: Tonight's tradeoff  (BGB)
|        |          |       `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |        `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |         `* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |          `* Re: Tonight's tradeoff  (BGB)
|        |          |           `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |            `* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |             `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |              `* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |               `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                +- Re: Tonight's tradeoff  (Robert Finch)
|        |          |                `* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |                 `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  +* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |                  |`* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  | `* Re: Tonight's tradeoff  (BGB)
|        |          |                  |  `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |   +* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |                  |   |`- Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |   `* Re: Tonight's tradeoff  (BGB)
|        |          |                  |    `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |     `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |      `* Re: Tonight's tradeoff  (EricP)
|        |          |                  |       +* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |                  |       |`* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |       | +- Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |       | `* Re: Tonight's tradeoff  (EricP)
|        |          |                  |       |  `* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |                  |       |   `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |       |    `* Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |       |     +- Re: Tonight's tradeoff  (BGB)
|        |          |                  |       |     `* Re: Tonight's tradeoff  (EricP)
|        |          |                  |       |      `* Re: Tonight's tradeoff  (MitchAlsup)
|        |          |                  |       |       +- Re: Tonight's tradeoff  (Robert Finch)
|        |          |                  |       |       `* Re: Tonight's tradeoff  (EricP)
|        |          |                  |       |        +* Re: Tonight's tradeoff  (Chris M. Thomasson)
|        |          |                  |       |        |`* Re: Tonight's tradeoff  (EricP)
|        |          |                  |       |        | +- Re: Tonight's tradeoff  (Anton Ertl)
|        |          |                  |       |        | `* Re: Tonight's tradeoff  (Chris M. Thomasson)
|        |          |                  |       |        `* Re: Tonight's tradeoff  (Chris M. Thomasson)
|        |          |                  |       `- Re: Tonight's tradeoff  (BGB)
|        |          |                  `- Re: Tonight's tradeoff  (MitchAlsup)
|        |          `- Re: Tonight's tradeoff  (Robert Finch)
|        `- Re: Tonight's tradeoff  (Scott Lurndal)
+- Re: Tonight's tradeoff  (MitchAlsup)
`* Re: Tonight's tradeoff  (Robert Finch)

Re: Tonight's tradeoff

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: Thu, 30 Nov 2023 20:19:24 -0500
Message-ID: <ukbcav$1j5qe$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=35335&group=comp.arch#35335

On 2023-11-30 6:06 p.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> On 2023-11-30 3:30 p.m., MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Robert Finch wrote:
>>>>> The Q+ register file is implemented with one block-RAM per read
>>>>> port. With a 64-bit width this gives 512 registers in a block RAM.
>>>>> 192 registers are needed for renaming a 64-entry architectural
>>>>> register file. That leaves 320 registers unused. My thought was to
>>>>> support two banks of registers, one for the highest operating mode,
>>>>> and the other for remaining operating modes. On exceptions the
>>>>> register bank could be switched. But to do this there are now
>>>>> 128-register effectively being renamed which leads to 384 physical
>>>>> registers to manage. This doubles the size of the register
>>>>> management code. Unless, a pipeline flush occurs for exception
>>>>> processing which I think would allow the renamer to reuse the same
>>>>> hardware to manage a new bank of registers. But that hinges on all
>>>>> references to registers in the current bank being unused.
>>>>>
>>>>> My other thought was that with approximately three times the number
>>>>> of architectural registers required, using 256 physical registers
>>>>> would allow 85 architectural registers. Perhaps some of the
>>>>> registers could be banked for different operating modes. Banking
>>>>> four registers per mode would use up 16.
>>>>>
>>>>> If the 512-register file were divided by three, 170 physical
>>>>> registers could be available for renaming. This is less than the
>>>>> ideal 192 registers but maybe close enough to not impact
>>>>> performance adversely.
>>>>>
>>>
>>>> I don't understand the problem.
>>>> You want 64 architecture registers, each of which needs a physical
>>>> register, plus 128 registers for in-flight instructions, so 192
>>>> physical registers.
>>>
>>>> If you add a second bank of 64 architecture registers for interrupts
>>>> then each needs a physical register. But that doesn't change the number
>>>> of in-flight registers so that's 256 physical total.
>>>> Plus two sets of rename banks, one for each mode.
>>>
>>>> If you drain the pipeline before switching register banks then all
>>>> of the 128 in-flight registers will be free at the time of switch.
>>>
>>> A couple of bits of state and you don't need to drain the pipeline,
>>> you just have to find the youngest instruction with the property that
>>> all older instructions cannot raise an exception; these can be
>>> allowed to finish execution while you are fetching instructions for
>>> the new context.
>
>> Not quite comprehending. Will not the registers for the new context be
>> improperly mapped if there are registers in use for the old map?
>
> All the in-flight destination registers will get written by the in-flight
> instructions. All the instruction of the new context will allocate
> registers
> from the pool which is not currently in-flight. So, while there is mental
> confusion on how this gets pulled off in HW, it does get pulled off just
> fine. When the new context STs the registers of the old context, it obtains
> the correct register from the old context {{Should HW be doing this the
> same orchestration applies--and it still works.}}
>
>>                                                                 I
>> think a state bit could be used to pause a fetch of a register still
>> in use in the old map, but that is draining the pipeline anyway.
>
> You are assuming a RAT, I am not using a RAT but a CAM where I can restore
> to any checkpoint by simply rewriting the valid bit vector.

I think the RAT can be restored to a specific checkpoint as well using
just an index value. Q+ has a checkpoint RAM of which one of the
checkpoints is the active RAT. The RAT is really 16 tables. I stored a
bit vector of the valid registers in the ROB so that the valid
register set may be reset when a checkpoint is restored.
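
A minimal C sketch of an index-restored checkpoint RAT along these lines;
the sizes and names are illustrative, not the actual Q+ internals:

  #include <stdint.h>

  #define ARCH_REGS   64
  #define CHECKPOINTS 16

  typedef struct {
      uint16_t map[ARCH_REGS]; /* arch reg -> physical reg */
      uint64_t valid;          /* bit per arch reg with a live mapping */
  } Rat;

  static Rat ckpt[CHECKPOINTS]; /* checkpoint RAM; one entry is live */
  static int active;            /* index of the live RAT */

  /* at a branch: copy the live table into a free slot */
  int rat_checkpoint(int free_slot) {
      ckpt[free_slot] = ckpt[active];
      active = free_slot;
      return free_slot;
  }

  /* misprediction/exception: restore is just re-selecting an index
     and reinstating the valid bit vector saved in the ROB */
  void rat_restore(int idx, uint64_t rob_valid) {
      active = idx;
      ckpt[active].valid = rob_valid;
  }
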
>
>> When the context swaps, a new set of target registers is always
>> established before the registers are used.
>
> You still have to deal with the transient state and the CAM version works
> with either SW or HW save/restore.
>
>>                                            So incoming references in
>> the new context should always map to the new registers?
>
> Which they will--as illustrated above.
>
>>>
>>>> If you can switch to interrupt mode without draining the pipeline then
>>>> some of those 128 will be in-use for the old mode, some for the new
>>>> mode
>>>> (and the uOps carry a privilege mode flag so you can do things like
>>>> check LD or ST ops against the appropriate PTE mode access control).
>>>
>>> And 1 bit of state keeps track of which is which.
>
>> Did some experimenting and the RAT turns out to be too large if more
>> registers are incorporated. Even as few as 256 regs caused the RAT to
>> increase in size substantially. So, I may go the alternate route of
>> making register wider rather than deeper, having 128-bit wide
>> registers instead.
>
> Register ports (or equivalently RAT ports) are one of the things that most
> limit issue width. K9 was to have 22 RAT ports, and was similar in size
> to the {standard decoded Register File.}

The Q+ RAT has 16 read and 8 write ports. I am trying for a 4-wide
machine. It is using about as many LUTs as the register file. The RAT is
implemented with LUT ram instead of block RAMs. I do not like the size,
but it adds a lot to the operation of the machine.

>
>> There is an eight-bit sequence number associated with each
>> instruction, so the age of an instruction can easily be detected. I
>
> I assign a 4-bit number (16 checkpoints) to all instructions issued in
> the same clock cycle. This gives a 6-wide machine up to 96 instructions
> in-flight; and makes backing up (misprediction) simple and fast.
>
The same thing is done with Q+. It supports 16 checkpoints with a
four-bit number too. I have read that 16 is almost the same as infinity.

>> found a really slick way of detecting instruction age using a matrix
>> approach on the web. But I did not fully understand it. So I just use
>> eight bit counters for now.
>
>> There is a two bit privilege mode flag for instructions in the ROB. I
>> suppose the ROB entries could be called uOps.
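
The matrix approach to instruction age quoted above can be stated
compactly. A hedged C sketch (slot count is illustrative): each bit
older[i][j] says slot i was allocated before slot j. In hardware the
loops collapse into wide AND/OR gates, which is what makes this cheaper
than comparing eight-bit counters.

  #include <stdint.h>
  #include <stdbool.h>

  #define SLOTS 16

  static uint16_t older[SLOTS]; /* row i: bit j set => i older than j */
  static uint16_t valid;        /* bit per live slot */

  void alloc_slot(int k) {
      for (int j = 0; j < SLOTS; j++)
          if (valid & (1u << j))
              older[j] |= 1u << k; /* every live slot is older than k */
      older[k] = 0;                /* k is the youngest */
      valid |= 1u << k;
  }

  void free_slot(int d) {
      valid &= ~(1u << d);
      for (int j = 0; j < SLOTS; j++)
          older[j] &= ~(1u << d);  /* clear column d */
  }

  /* i is the oldest live slot iff no live slot is older than it */
  bool is_oldest(int i) {
      for (int j = 0; j < SLOTS; j++)
          if ((valid & (1u << j)) && (older[j] & (1u << i)))
              return false;
      return (valid >> i) & 1u;
  }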

Re: Tonight's tradeoff

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Date: Fri, 1 Dec 2023 02:43:20 +0000
Message-ID: <d6ab2e326ad266d384a93712fccb6fe7@news.novabbs.com>
https://news.novabbs.org/devel/article-flat.php?id=35337&group=comp.arch#35337

Robert Finch wrote:

> On 2023-11-30 6:06 p.m., MitchAlsup wrote:
>> Robert Finch wrote:
>>
>>> On 2023-11-30 3:30 p.m., MitchAlsup wrote:
>>>> EricP wrote:
>>>>
>>>>> Robert Finch wrote:
>>>>>> The Q+ register file is implemented with one block-RAM per read
>>>>>> port. With a 64-bit width this gives 512 registers in a block RAM.
>>>>>> 192 registers are needed for renaming a 64-entry architectural
>>>>>> register file. That leaves 320 registers unused. My thought was to
>>>>>> support two banks of registers, one for the highest operating mode,
>>>>>> and the other for remaining operating modes. On exceptions the
>>>>>> register bank could be switched. But to do this there are now
>>>>>> 128-register effectively being renamed which leads to 384 physical
>>>>>> registers to manage. This doubles the size of the register
>>>>>> management code. Unless, a pipeline flush occurs for exception
>>>>>> processing which I think would allow the renamer to reuse the same
>>>>>> hardware to manage a new bank of registers. But that hinges on all
>>>>>> references to registers in the current bank being unused.
>>>>>>
>>>>>> My other thought was that with approximately three times the number
>>>>>> of architectural registers required, using 256 physical registers
>>>>>> would allow 85 architectural registers. Perhaps some of the
>>>>>> registers could be banked for different operating modes. Banking
>>>>>> four registers per mode would use up 16.
>>>>>>
>>>>>> If the 512-register file were divided by three, 170 physical
>>>>>> registers could be available for renaming. This is less than the
>>>>>> ideal 192 registers but maybe close enough to not impact
>>>>>> performance adversely.
>>>>>>
>>>>
>>>>> I don't understand the problem.
>>>>> You want 64 architecture registers, each of which needs a physical
>>>>> register, plus 128 registers for in-flight instructions, so 192
>>>>> physical registers.
>>>>
>>>>> If you add a second bank of 64 architecture registers for interrupts
>>>>> then each needs a physical register. But that doesn't change the number
>>>>> of in-flight registers so that's 256 physical total.
>>>>> Plus two sets of rename banks, one for each mode.
>>>>
>>>>> If you drain the pipeline before switching register banks then all
>>>>> of the 128 in-flight registers will be free at the time of switch.
>>>>
>>>> A couple of bits of state and you don't need to drain the pipeline,
>>>> you just have to find the youngest instruction with the property that
>>>> all older instructions cannot raise an exception; these can be
>>>> allowed to finish execution while you are fetching instructions for
>>>> the new context.
>>
>>> Not quite comprehending. Will not the registers for the new context be
>>> improperly mapped if there are registers in use for the old map?
>>
>> All the in-flight destination registers will get written by the in-flight
>> instructions. All the instruction of the new context will allocate
>> registers
>> from the pool which is not currently in-flight. So, while there is mental
>> confusion on how this gets pulled off in HW, it does get pulled off just
>> fine. When the new context STs the registers of the old context, it obtains
>> the correct register from the old context {{Should HW be doing this the
>> same orchestration applies--and it still works.}}
>>
>>>                                                                 I
>>> think a state bit could be used to pause a fetch of a register still
>>> in use in the old map, but that is draining the pipeline anyway.
>>
>> You are assuming a RAT, I am not using a RAT but a CAM where I can restore
>> to any checkpoint by simply rewriting the valid bit vector.

> I think the RAT can be restored to a specific checkpoint as well using
> just an index value. Q+ has a checkpoint RAM of which one of the
> checkpoints is the active RAT. The RAT is really 16 tables. I stored a
> bit vector of the valid registers in the ROB so that the valid
> register set may be reset when a checkpoint is restored.
>>
>>> When the context swaps, a new set of target registers is always
>>> established before the registers are used.
>>
>> You still have to deal with the transient state and the CAM version works
>> with either SW or HW save/restore.
>>
>>>                                            So incoming references in
>>> the new context should always map to the new registers?
>>
>> Which they will--as illustrated above.
>>
>>>>
>>>>> If you can switch to interrupt mode without draining the pipeline then
>>>>> some of those 128 will be in-use for the old mode, some for the new
>>>>> mode
>>>>> (and the uOps carry a privilege mode flag so you can do things like
>>>>> check LD or ST ops against the appropriate PTE mode access control).
>>>>
>>>> And 1 bit of state keeps track of which is which.
>>
>>> Did some experimenting and the RAT turns out to be too large if more
>>> registers are incorporated. Even as few as 256 regs caused the RAT to
>>> increase in size substantially. So, I may go the alternate route of
>>> making register wider rather than deeper, having 128-bit wide
>>> registers instead.
>>
>> Register ports (or equivalently RAT ports) are one of the things that most
>> limit issue width. K9 was to have 22 RAT ports, and was similar in size
>> to the {standard decoded Register File.}

> The Q+ RAT has 16 read and 8 write ports. I am trying for a 4-wide
> machine. It is using about as many LUTs as the register file. The RAT is
> implemented with LUT ram instead of block RAMs. I do not like the size,
> but it adds a lot to the operation of the machine.

>>
>>> There is an eight-bit sequence number associated with each
>>> instruction, so the age of an instruction can easily be detected. I
>>
>> I assign a 4-bit number (16 checkpoints) to all instructions issued in
>> the same clock cycle. This gives a 6-wide machine up to 96 instructions
>> in-flight; and makes backing up (misprediction) simple and fast.
>>
> The same thing is done with Q+. It supports 16 checkpoints with a
> four-bit number too. I have read that 16 is almost the same as infinity.

Branch repair (from misprediction) has to be fast--especially if you are
going for 0-cycle repair.

Checkpoint recovery (from interrupt) just has to pick 1 of the checkpoints
that has achieved the consistent state (no older instructions can raise an
exception).

Exception recovery can back up to the checkpoint containing the instruction
which raised the exception, and then single-step forward until the exception
is identified. Thus, you do not need "order" at a granularity smaller than
a checkpoint.

One can use pseudo-exceptions to solve difficult timing or sequencing
problems, saving certain kinds of state transitions in the instruction
queuing mechanism. For example, one could use a pseudo-exception to regain
memory order in an ATOMIC event when you detect the order was less than
sequentially consistent.
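
A hedged C sketch of that interrupt-time checkpoint pick; the query
function is a hypothetical stand-in for whatever the scoreboard actually
exposes:

  #include <stdbool.h>

  #define CHECKPOINTS 16

  /* hypothetical: true when every instruction older than this
     checkpoint is past the point of raising an exception */
  extern bool older_cannot_fault(int ckpt);

  /* walk from youngest toward oldest; the first consistent
     checkpoint found is where the interrupt can be taken */
  int pick_interrupt_checkpoint(int youngest) {
      int c = youngest;
      for (int i = 0; i < CHECKPOINTS; i++) {
          if (older_cannot_fault(c))
              return c;
          c = (c + CHECKPOINTS - 1) % CHECKPOINTS; /* next older */
      }
      return -1; /* none consistent yet: wait a cycle */
  }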

>>> found a really slick way of detecting instruction age using a matrix
>>> approach on the web. But I did not fully understand it. So I just use
>>> eight bit counters for now.
>>
>>> There is a two bit privilege mode flag for instructions in the ROB. I
>>> suppose the ROB entries could be called uOps.

Re: Tonight's tradeoff

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: Sun, 3 Dec 2023 01:45:09 -0500
Message-ID: <ukh85m$2op0g$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=35371&group=comp.arch#35371

On 2023-11-30 9:43 p.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> On 2023-11-30 6:06 p.m., MitchAlsup wrote:
>>> Robert Finch wrote:
>>>
>>>> On 2023-11-30 3:30 p.m., MitchAlsup wrote:
>>>>> EricP wrote:
>>>>>
>>>>>> Robert Finch wrote:
>>>>>>> The Q+ register file is implemented with one block-RAM per read
>>>>>>> port. With a 64-bit width this gives 512 registers in a block
>>>>>>> RAM. 192 registers are needed for renaming a 64-entry
>>>>>>> architectural register file. That leaves 320 registers unused. My
>>>>>>> thought was to support two banks of registers, one for the
>>>>>>> highest operating mode, and the other for remaining operating
>>>>>>> modes. On exceptions the register bank could be switched. But to
>>>>>>> do this there are now 128-register effectively being renamed
>>>>>>> which leads to 384 physical registers to manage. This doubles the
>>>>>>> size of the register management code. Unless, a pipeline flush
>>>>>>> occurs for exception processing which I think would allow the
>>>>>>> renamer to reuse the same hardware to manage a new bank of
>>>>>>> registers. But that hinges on all references to registers in the
>>>>>>> current bank being unused.
>>>>>>>
>>>>>>> My other thought was that with approximately three times the
>>>>>>> number of architectural registers required, using 256 physical
>>>>>>> registers would allow 85 architectural registers. Perhaps some of
>>>>>>> the registers could be banked for different operating modes.
>>>>>>> Banking four registers per mode would use up 16.
>>>>>>>
>>>>>>> If the 512-register file were divided by three, 170 physical
>>>>>>> registers could be available for renaming. This is less than the
>>>>>>> ideal 192 registers but maybe close enough to not impact
>>>>>>> performance adversely.
>>>>>>>
>>>>>
>>>>>> I don't understand the problem.
>>>>>> You want 64 architecture registers, each of which needs a physical
>>>>>> register, plus 128 registers for in-flight instructions, so 192
>>>>>> physical registers.
>>>>>
>>>>>> If you add a second bank of 64 architecture registers for interrupts
>>>>>> then each needs a physical register. But that doesn't change the
>>>>>> number
>>>>>> of in-flight registers so that's 256 physical total.
>>>>>> Plus two sets of rename banks, one for each mode.
>>>>>
>>>>>> If you drain the pipeline before switching register banks then all
>>>>>> of the 128 in-flight registers will be free at the time of switch.
>>>>>
>>>>> A couple of bits of state and you don't need to drain the pipeline,
>>>>> you just have to find the youngest instruction with the property
>>>>> that all older instructions cannot raise an exception; these can be
>>>>> allowed to finish execution while you are fetching instructions for
>>>>> the new context.
>>>
>>>> Not quite comprehending. Will not the registers for the new context
>>>> be improperly mapped if there are registers in use for the old map?
>>>
>>> All the in-flight destination registers will get written by the
>>> in-flight
>>> instructions. All the instruction of the new context will allocate
>>> registers
>>> from the pool which is not currently in-flight. So, while there is
>>> mental
>>> confusion on how this gets pulled off in HW, it does get pulled off just
>>> fine. When the new context STs the registers of the old context, it
>>> obtains
>>> the correct register from the old context {{Should HW be doing this the
>>> same orchestration applies--and it still works.}}
>>>
>>>>                                                                 I
>>>> think a state bit could be used to pause a fetch of a register still
>>>> in use in the old map, but that is draining the pipeline anyway.
>>>
>>> You are assuming a RAT, I am not using a RAT but a CAM where I can
>>> restore
>>> to any checkpoint by simply rewriting the valid bit vector.
>
>> I think the RAT can be restored to a specific checkpoint as well using
>> just an index value. Q+ has a checkpoint RAM of which one of the
>> checkpoints is the active RAT. The RAT is really 16 tables. I stored a
>> bit vector of the valid registers in the ROB so that the valid
>> register set may be reset when a checkpoint is restored.
>>>
>>>> When the context swaps, a new set of target registers is always
>>>> established before the registers are used.
>>>
>>> You still have to deal with the transient state and the CAM version
>>> works
>>> with either SW or HW save/restore.
>>>
>>>>                                            So incoming references in
>>>> the new context should always map to the new registers?
>>>
>>> Which they will--as illustrated above.
>>>
>>>>>
>>>>>> If you can switch to interrupt mode without draining the pipeline
>>>>>> then
>>>>>> some of those 128 will be in-use for the old mode, some for the
>>>>>> new mode
>>>>>> (and the uOps carry a privilege mode flag so you can do things like
>>>>>> check LD or ST ops against the appropriate PTE mode access control).
>>>>>
>>>>> And 1 bit of state keeps track of which is which.
>>>
>>>> Did some experimenting and the RAT turns out to be too large if more
>>>> registers are incorporated. Even as few as 256 regs caused the RAT
>>>> to increase in size substantially. So, I may go the alternate route
>>>> of making register wider rather than deeper, having 128-bit wide
>>>> registers instead.
>>>
>>> Register ports (or equivalently RAT ports) are one of the things that
>>> most
>>> limit issue width. K9 was to have 22 RAT ports, and was similar in
>>> size to the {standard decoded Register File.}
>
>> The Q+ RAT has 16 read and 8 write ports. I am trying for a 4-wide
>> machine. It is using about as many LUTs as the register file. The RAT
>> is implemented with LUT ram instead of block RAMs. I do not like the
>> size, but it adds a lot to the operation of the machine.
>
>>>
>>>> There is an eight-bit sequence number associated with each
>>>> instruction, so the age of an instruction can easily be detected. I
>>>
>>> I assign a 4-bit number (16 checkpoints) to all instructions issued in
>>> the same clock cycle. This gives a 6-wide machine up to 96 instructions
>>> in-flight; and makes backing up (misprediction) simple and fast.
>>>
>> The same thing is done with Q+. It supports 16 checkpoints with a
>> four-bit number too. I have read that 16 is almost the same as infinity.
>
> Branch repair (from misprediction) has to be fast--especially if you are
> going for 0-cycle repair.

I think I am far away from zero-cycle repair. Does getting zero-cycle
repair mean fetching from both branch directions and then selecting the
correct one? I will be happy if I can get branching to work at all. It
is my first implementation using checkpoints. Not all the details of
handling branches are worked out in code for Q+ yet. I think enough of
the code is in place to get rough timing estimates. Not sure how well
the BTB will work. A gselect predictor is also being used; I am
expecting a lot of branch mispredictions.
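
For reference, gselect indexes the pattern history table with branch PC
bits concatenated with global history bits (unlike gshare, which XORs
them). A small C model; the table size and bit splits here are chosen
only for illustration:

  #include <stdint.h>
  #include <stdbool.h>

  #define HIST_BITS 6
  #define PC_BITS   6
  #define PHT_SIZE  (1u << (HIST_BITS + PC_BITS))

  static uint8_t  pht[PHT_SIZE]; /* 2-bit saturating counters */
  static uint32_t ghist;         /* global taken/not-taken history */

  static uint32_t pht_index(uint32_t pc) {
      uint32_t p = (pc >> 2) & ((1u << PC_BITS) - 1);
      uint32_t h = ghist & ((1u << HIST_BITS) - 1);
      return (p << HIST_BITS) | h; /* concatenation: this is gselect */
  }

  bool predict(uint32_t pc) { return pht[pht_index(pc)] >= 2; }

  void update(uint32_t pc, bool taken) {
      uint8_t *c = &pht[pht_index(pc)];
      if (taken  && *c < 3) (*c)++;
      if (!taken && *c > 0) (*c)--;
      ghist = (ghist << 1) | (taken ? 1u : 0u);
  }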

> Checkpoint recovery (from interrupt) just has to pick 1 of the checkpoints
> that has achieved the consistent state (no older instructions can raise an
> exception).

Sounds straightforward enough.
>
> Exception recovery can back up to the checkpoint containing the
> instruction which raised the exception, and then single-step forward
> until the exception is identified. Thus, you do not need "order" at a
> granularity smaller than a checkpoint.


[Remainder of article truncated in the archive.]
Re: Tonight's tradeoff

From: ThatWouldBeTelling@thevillage.com (EricP)
Newsgroups: comp.arch
Date: Sun, 03 Dec 2023 11:07:48 -0500
Message-ID: <9K1bN.257199$wvv7.25292@fx14.iad>
https://news.novabbs.org/devel/article-flat.php?id=35377&group=comp.arch#35377

Robert Finch wrote:
> Figured it out. Each architectural register in the RAT must refer to N
> physical registers, where N is the number of banks. Setting N to 4
> results in a RAT that is only about 50% larger than one supporting only
> a single bank. The operating mode is used to select the physical
> register. The first eight registers are shared between all operating
> modes so arguments can be passed to syscalls. It is tempting to have
> eight banks of registers, one for each hardware interrupt level.

A consequence of multiple architecture register banks is that each extra
bank keeps a set of mostly unused physical registers attached to it.
For example, if there are 2 modes User and Super and a bank for each,
since User and Super are mutually exclusive,
64 of your 256 physical registers will be sitting unused tied
to the other mode bank, so max of 75% utilization efficiency.

If you have 8 register banks then only 3/10 of the physical registers
are available to use, the other 7/10 are sitting idle attached to
arch registers in other modes consuming power.

Also you don't have to play overlapped-register-bank games to pass
args to/from syscalls. You can have specific instructions that reach
into other banks: Move To User Reg, Move From User Reg.
Since only syscall passes args into the OS you only need to access
the user mode bank from the OS kernel bank.
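
The fractions above fall straight out of the counts. A quick check in C,
using the thread's figures of 64 architectural registers and 128
in-flight:

  #include <stdio.h>

  int main(void) {
      int arch = 64, inflight = 128;
      for (int banks = 1; banks <= 8; banks *= 2) {
          int physical = banks * arch + inflight; /* total needed */
          int usable   = arch + inflight;         /* one bank live */
          printf("%d bank(s): %3d physical, %3d usable (%.0f%%)\n",
                 banks, physical, usable, 100.0 * usable / physical);
      }
      return 0;
  }

This prints 75% for two banks and 30% (the 3/10 above) for eight.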

Re: Tonight's tradeoff

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Date: Sun, 3 Dec 2023 16:49:33 +0000
Message-ID: <c5c7c98d9bd820f9ded6a099317ea827@news.novabbs.com>
https://news.novabbs.org/devel/article-flat.php?id=35380&group=comp.arch#35380

Robert Finch wrote:

> On 2023-11-30 9:43 p.m., MitchAlsup wrote:
>> four-bit number too. I have read that 16 is almost the same as infinity.
>>
>> Branch repair (from misprediction) has to be fast--especially if you are
>> going for 0-cycle repair.

> I think I am far away from zero-cycle repair. Does getting zero-cycle
> repair mean fetching from both branch directions and then selecting the
> correct one?

No, zero cycle means you access the ICache twice per cycle, once on the
predicted path and once on the alternate path. The alternate-path
instructions are put in a buffer indexed by branch number. {{This happens
10-12 cycles before the branch prediction is resolved}}

When the branch instruction is launched out of its inst queue, the buffer
is read, and if the branch prediction failed, you have the instructions
from the mispredicted path ready to decode in the subsequent cycle.
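
A hedged C sketch of that alternate-path buffer; the entry count and
fetch width are illustrative (16 matches the checkpoint count upthread):

  #include <stdint.h>
  #include <stdbool.h>

  #define BRANCHES 16 /* one entry per outstanding branch number */
  #define FETCH_W   4 /* instructions per fetch */

  typedef struct {
      uint32_t inst[FETCH_W]; /* instructions from the alternate path */
      uint64_t pc;
      bool     filled;
  } AltPath;

  static AltPath altbuf[BRANCHES];

  /* at fetch: park the non-predicted path under its branch number */
  void on_fetch(int branch_no, const uint32_t alt[FETCH_W], uint64_t alt_pc) {
      for (int i = 0; i < FETCH_W; i++)
          altbuf[branch_no].inst[i] = alt[i];
      altbuf[branch_no].pc = alt_pc;
      altbuf[branch_no].filled = true;
  }

  /* at resolve: if mispredicted, the repair path is already on hand */
  const AltPath *on_resolve(int branch_no, bool mispredicted) {
      return (mispredicted && altbuf[branch_no].filled)
           ? &altbuf[branch_no] : 0;
  }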

> I will be happy if I can get branching to work at all. It
> is my first implementation using checkpoints. All the details of
> handling branches are not yet worked out in code for Q+. I think enough
> of the code is in place to get rough timing estimates. Not sure how well
> the BTB will work. A gselect predictor is also being used. Expecting a
> lot of branch mispredictions.

>> Checkpoint recovery (from interrupt) just has to pick 1 of the checkpoints
>> that has achieved the consistent state (no older instructions can raise an
>> exception).

> Sounds straightforward enough.
>>
>> Exception recovery can back up to the checkpoint containing the
>> instruction which raised the exception, and then single-step forward
>> until the exception is identified. Thus, you do not need "order" at a
>> granularity smaller than a checkpoint.

> This sounds a little trickier to do. Q+ currently takes an exception
> when things commit. It looks in the exception field of the queue entry
> for a fault code. If there is one it performs almost the same operation
> as a branch except it is occurring at the commit stage.
>>
>> One can use pseudo-exceptions to solve difficult timing or sequencing
>> problems, saving certain kinds of state transitions in the instruction
>> queuing mechanism. For example, one could use a pseudo-exception to regain
>> memory order in an ATOMIC event when you detect the order was less than
>> sequentially consistent.
>>
> Noted.

> Gone back to using variable length instructions. Had to pipeline the
> instruction length decode across three clock cycles to get it to meet
> timing.

Curious:: I got VLE to decode in 4 gates of delay, and I can PARSE up to
16 instruction boundaries in a single cycle (using a tree of multiplexers.)

DECODE, then, only has to process the 32-bit instructions and route the
constants in at Forwarding.

Now:: I also use 3 cycles after ICache access, but 1 of the cycles includes
tag comparison and set select, so I consider this a 2½ cycle decode; the ½
cycle part performs the VLE and instruction-specifier route to decoder[k].
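
A software model of the parse step, under assumed encodings: a fetch
window of 16-bit parcels and a hypothetical length rule read from the top
bits (not the actual My 66000 rule). Hardware evaluates all candidate
boundary chains in parallel and picks with the multiplexer tree; the
serial loop below is only the reference model:

  #include <string.h>
  #include <stdint.h>

  #define PARCELS 16

  /* hypothetical length rule: 1..3 parcels from the top two bits */
  static int inst_len(uint16_t parcel) {
      int n = parcel >> 14;
      return n ? n : 1;
  }

  /* mark which parcel offsets start an instruction */
  int find_boundaries(const uint16_t w[PARCELS], uint8_t start[PARCELS]) {
      int n = 0;
      memset(start, 0, PARCELS);
      for (int i = 0; i < PARCELS; ) {
          start[i] = 1;
          n++;
          i += inst_len(w[i]);
      }
      return n; /* instructions found in the window */
  }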

Re: Tonight's tradeoff

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Date: Sun, 3 Dec 2023 16:58:38 +0000
Message-ID: <3a714dde36640fc4c47255d0a170aaee@news.novabbs.com>
https://news.novabbs.org/devel/article-flat.php?id=35381&group=comp.arch#35381

EricP wrote:

> Robert Finch wrote:
>> Figured it out. Each architectural register in the RAT must refer to N
>> physical registers, where N is the number of banks. Setting N to 4
>> results in a RAT that is only about 50% larger than one supporting only
>> a single bank. The operating mode is used to select the physical
>> register. The first eight registers are shared between all operating
>> modes so arguments can be passed to syscalls. It is tempting to have
>> eight banks of registers, one for each hardware interrupt level.

> A consequence of multiple architecture register banks is that each extra
> bank keeps a set of mostly unused physical registers attached to it.

A waste.....

> For example, if there are 2 modes User and Super and a bank for each,
> since User and Super are mutually exclusive,
> 64 of your 256 physical registers will be sitting unused tied
> to the other mode bank, so max of 75% utilization efficiency.

> If you have 8 register banks then only 3/10 of the physical registers
> are available to use, the other 7/10 are sitting idle attached to
> arch registers in other modes consuming power.

> Also you don't have to play overlapped-register-bank games to pass
> args to/from syscalls. You can have specific instructions that reach
> into other banks: Move To User Reg, Move From User Reg.
> Since only syscall passes args into the OS you only need to access
> the user mode bank from the OS kernel bank.

Whereas exceptions and interrupts save and restore 32 registers::
A SysCall in My 66000 only saves and restores 24 of the 32 registers.
So when control arrives, there are 8 argument registers from the
Caller and 24 registers from Guest OS already loaded. So, SysCall
handler already has its stack, and a variety of pointers to data
structures it is interested in.

On the way back, RET only restores 24 registers so Guest OS can pass
back as many as 8 result registers.

Re: Tonight's tradeoff

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: Sun, 3 Dec 2023 14:08:12 -0500
Message-ID: <ukijmt$2vmae$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=35383&group=comp.arch#35383

On 2023-12-03 11:58 a.m., MitchAlsup wrote:
> EricP wrote:
>
>> Robert Finch wrote:
>>> Figured it out. Each architectural register in the RAT must refer to
>>> N physical registers, where N is the number of banks. Setting N to 4
>>> results in a RAT that is only about 50% larger than one supporting
>>> only a single bank. The operating mode is used to select the physical
>>> register. The first eight registers are shared between all operating
>>> modes so arguments can be passed to syscalls. It is tempting to have
>>> eight banks of registers, one for each hardware interrupt level.
>
>> A consequence of multiple architecture register banks is that each extra
>> bank keeps a set of mostly unused physical registers attached to it.
>
> A waste.....
>
Part of the reason to support multiple banks is that the block RAM is
present and consuming power whether or not it is being used.

>> For example, if there are 2 modes User and Super and a bank for each,
>> since User and Super are mutually exclusive,
>> 64 of your 256 physical registers will be sitting unused tied
>> to the other mode bank, so max of 75% utilization efficiency.
>
>> If you have 8 register banks then only 3/10 of the physical registers
>> are available to use, the other 7/10 are sitting idle attached to
>> arch registers in other modes consuming power.
>
>> Also you don't have to play overlapped-register-bank games to pass
>> args to/from syscalls. You can have specific instructions that reach
>> into other banks: Move To User Reg, Move From User Reg.
>> Since only syscall passes args into the OS you only need to access
>> the user mode bank from the OS kernel bank.
>
> Whereas exceptions and interrupts save and restore 32 registers::
> A SysCall in My 66000 only saves and restores 24 of the 32 registers.
> So when control arrives, there are 8 argument registers from the Caller
> and 24 registers from Guest OS already loaded. So, SysCall
> handler already has its stack, and a variety of pointers to data
> structures it is interested in.
>
> On the way back, RET only restores 24 registers so Guest OS can pass
> back as many as 8 result registers.

Q+ has 64 registers. They may end up being 128-bit. It may take 4x as
long (or more) to store and load them as it does on My 66000, where
changing the register bank is just modifying a single bit.

Q+ Status:
Added complexity to the done state. It is now two bits as some
instructions can issue to two functional units and the instruction is
not done until it is done on both units. These instructions include
jump, branch to subroutine, and return instructions. They need to
execute on both the ALU and FCU. The scheduler can now also schedule the
same instruction on more than one unit, if decode indicates to execute
on multiple units.
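
That two-bit done state can be modeled as one need/done bit pair per
unit; a hedged C sketch (the unit set is an assumption):

  #include <stdbool.h>
  #include <stdint.h>

  enum { UNIT_ALU = 1 << 0, UNIT_FCU = 1 << 1 };

  typedef struct {
      uint8_t need; /* units this op must execute on */
      uint8_t done; /* units that have finished it */
  } RobEntry;

  /* e.g. jump/bsr/ret would set units = UNIT_ALU | UNIT_FCU */
  void on_issue(RobEntry *e, uint8_t units) {
      e->need = units;
      e->done = 0;
  }

  void on_complete(RobEntry *e, uint8_t unit) {
      e->done |= unit;
  }

  bool entry_done(const RobEntry *e) {
      return (e->done & e->need) == e->need;
  }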

Still too early to tell, but it is looking like the core will run at
close to 60 MHz; at least that is the goal. Executing a maximum of 4
instructions per cycle, it should be close to 240 MIPS peak. A much
more realistic estimate would be 50 MIPS, given costly branch
mispredictions and a lack of forwarding between units. All this
assuming I have not made too many boo-boos.

Re: Tonight's tradeoff

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: Sun, 3 Dec 2023 14:19:33 -0500
Message-ID: <ukikc5$2vmae$2@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=35384&group=comp.arch#35384

On 2023-12-03 11:07 a.m., EricP wrote:
> Robert Finch wrote:
>> Figured it out. Each architectural register in the RAT must refer to N
>> physical registers, where N is the number of banks. Setting N to 4
>> results in a RAT that is only about 50% larger than one supporting
>> only a single bank. The operating mode is used to select the physical
>> register. The first eight registers are shared between all operating
>> modes so arguments can be passed to syscalls. It is tempting to have
>> eight banks of registers, one for each hardware interrupt level.
>
> A consequence of multiple architecture register banks is that each extra
> bank keeps a set of mostly unused physical registers attached to it.
> For example, if there are 2 modes User and Super and a bank for each,
> since User and Super are mutually exclusive,
> 64 of your 256 physical registers will be sitting unused tied
> to the other mode bank, so max of 75% utilization efficiency.
>
Yes.

> If you have 8 register banks then only 3/10 of the physical registers
> are available to use, the other 7/10 are sitting idle attached to
> arch registers in other modes consuming power.

Yes too.
>
> Also you don't have to play overlapped-register-bank games to pass
> args to/from syscalls. You can have specific instructions that reach
> into other banks: Move To User Reg, Move From User Reg.
> Since only syscall passes args into the OS you only need to access
> the user mode bank from the OS kernel bank.
>
>
The Q+ move instruction is set up this way. It has a couple of extra bits
in the register specifiers. The instruction could also look at the CPU's
previous operating mode and current operating mode to determine the
register specs.
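
A hedged sketch of such a widened register specifier; the field split and
bank count are illustrative, not the Q+ encoding:

  #include <stdint.h>

  #define ARCH_REGS 64
  #define BANKS      4

  extern uint64_t regfile[BANKS][ARCH_REGS]; /* one bank per mode */

  /* 8-bit specifier: [7:6] bank select, [5:0] register number */
  uint64_t read_spec(uint8_t spec) {
      int bank = (spec >> 6) & (BANKS - 1);
      int reg  = spec & (ARCH_REGS - 1);
      /* privilege check against current/previous mode omitted */
      return regfile[bank][reg];
  }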

Re: Tonight's tradeoff

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Date: Sun, 3 Dec 2023 18:58:10 -0600
Message-ID: <ukj875$33k1l$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=35400&group=comp.arch#35400

On 12/3/2023 10:58 AM, MitchAlsup wrote:
> EricP wrote:
>
>> Robert Finch wrote:
>>> Figured it out. Each architectural register in the RAT must refer to
>>> N physical registers, where N is the number of banks. Setting N to 4
>>> results in a RAT that is only about 50% larger than one supporting
>>> only a single bank. The operating mode is used to select the physical
>>> register. The first eight registers are shared between all operating
>>> modes so arguments can be passed to syscalls. It is tempting to have
>>> eight banks of registers, one for each hardware interrupt level.
>
>> A consequence of multiple architecture register banks is that each extra
>> bank keeps a set of mostly unused physical registers attached to it.
>
> A waste.....
>
>> For example, if there are 2 modes User and Super and a bank for each,
>> since User and Super are mutually exclusive,
>> 64 of your 256 physical registers will be sitting unused tied
>> to the other mode bank, so max of 75% utilization efficiency.
>
>> If you have 8 register banks then only 3/10 of the physical registers
>> are available to use, the other 7/10 are sitting idle attached to
>> arch registers in other modes consuming power.
>
>> Also you don't have to play overlapped-register-bank games to pass
>> args to/from syscalls. You can have specific instructions that reach
>> into other banks: Move To User Reg, Move From User Reg.
>> Since only syscall passes args into the OS you only need to access
>> the user mode bank from the OS kernel bank.
>
> Whereas exceptions and interrupts save and restore 32 registers::
> A SysCall in My 66000 only saves and restores 24 of the 32 registers.
> So when control arrives, there are 8 argument registers from the Caller
> and 24 registers from Guest OS already loaded. So, SysCall
> handler already has its stack, and a variety of pointers to data
> structures it is interested in.
>
> On the way back, RET only restores 24 registers so Guest OS can pass
> back as many as 8 result registers.

I had handled it by saving/restoring 64 of the 64 registers...
For syscalls, it basically messes with the registers in the captured
register state for the calling task.

A newer change involves saving/restoring registers more directly to/from
the task context for syscalls, which reduces the task-switch overhead by
around 50% (but is mostly N/A for other kinds of interrupts).

....
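
A minimal C sketch of the idea (invented names; not the actual handler
code, just the shape of it):

#include <stdint.h>

#define NUM_GPR 64

typedef struct task_context {
    uint64_t gpr[NUM_GPR];    /* R0..R63 */
    uint64_t pc, sr;          /* program counter, status register */
} task_context;

extern task_context *cur_task;    /* context of the interrupted task */
extern uint64_t read_gpr(int r);  /* stand-in for the asm save sequence */

/* On a syscall, spill the GPRs directly into the calling task's context
   block, rather than into a scratch save area that a later task switch
   would have to copy from; avoiding that second copy is the saving. */
void syscall_entry_save(void)
{
    for (int r = 0; r < NUM_GPR; r++)
        cur_task->gpr[r] = read_gpr(r);
}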

Re: Tonight's tradeoff

<ukmb6n$3q23h$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35431&group=comp.arch#35431

Path: i2pn2.org!i2pn.org!news.chmurka.net!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Tue, 5 Dec 2023 00:07:34 -0500
Organization: A noiseless patient Spider
Lines: 75
Message-ID: <ukmb6n$3q23h$1@dont-email.me>
References: <uis67u$fkj4$1@dont-email.me>
<71cb5ad7604b3d909df865a19ee3d52e@news.novabbs.com>
<ujb40q$3eepe$1@dont-email.me> <ujrfaa$2h1v9$1@dont-email.me>
<987455c358f93a9a7896c9af3d5f2b75@news.novabbs.com>
<ujrm4a$2llie$1@dont-email.me>
<d1f73b9de9ff6f86dac089ebd4bca037@news.novabbs.com>
<bps8N.150652$wvv7.7314@fx14.iad>
<2023Nov26.164506@mips.complang.tuwien.ac.at>
<tDP8N.30031$ayBd.8559@fx07.iad>
<2023Nov27.085708@mips.complang.tuwien.ac.at>
<s929N.28687$rx%7.18632@fx47.iad>
<2023Nov27.171049@mips.complang.tuwien.ac.at> <ukaeef$1ecg7$1@dont-email.me>
<ukb8rd$1im31$1@dont-email.me> <9K1bN.257199$wvv7.25292@fx14.iad>
<3a714dde36640fc4c47255d0a170aaee@news.novabbs.com>
<ukj875$33k1l$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 5 Dec 2023 05:07:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="000bad77c4a45bbaf24bc632ba734d2b";
logging-data="3999857"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX197ja7yIDZJoFlfaEu2mxszhpleCeNhNVs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:apfczu0fIiUEHs1W9C4alYGYdVU=
In-Reply-To: <ukj875$33k1l$1@dont-email.me>
Content-Language: en-US
 by: Robert Finch - Tue, 5 Dec 2023 05:07 UTC

On 2023-12-03 7:58 p.m., BGB wrote:
> On 12/3/2023 10:58 AM, MitchAlsup wrote:
>> EricP wrote:
>>
>>> Robert Finch wrote:
>>>> Figured it out. Each architectural register in the RAT must refer to
>>>> N physical registers, where N is the number of banks. Setting N to 4
>>>> results in a RAT that is only about 50% larger than one supporting
>>>> only a single bank. The operating mode is used to select the
>>>> physical register. The first eight registers are shared between all
>>>> operating modes so arguments can be passed to syscalls. It is
>>>> tempting to have eight banks of registers, one for each hardware
>>>> interrupt level.
>>
>>> A consequence of multiple architecture register banks is each extra
>>> bank keeps a set of mostly unused physical register attached to them.
>>
>> A waste.....
>>
>>> For example, if there are 2 modes User and Super and a bank for each,
>>> since User and Super are mutually exclusive,
>>> 64 of your 256 physical registers will be sitting unused tied
>>> to the other mode bank, so max of 75% utilization efficiency.
>>
>>> If you have 8 register banks then only 3/10 of the physical registers
>>> are available to use, the other 7/10 are sitting idle attached to
>>> arch registers in other modes consuming power.
>>
>>> Also you don't have to play overlapped-register-bank games to pass
>>> args to/from syscalls. You can have specific instructions that reach
>>> into other banks: Move To User Reg, Move From User Reg.
>>> Since only syscall passes args into the OS you only need to access
>>> the user mode bank from the OS kernel bank.
>>
>> Whereas: Exceptions, interrupts save and restore 32-registers::
>> A SysCall in My 66000 only saves and restores 24 of the 32 registers.
>> So when control arrives, there are 8 argument registers from the
>> Caller and 24 registers from Guest OS already loaded. So, SysCall
>> handler already has its stack, and a variety of pointers to data
>> structures it is interested in.
>>
>> On the way back, RET only restores 24 registers so Guest OS can pass
>> back as many as 8 result registers.
>
> I had handled it by saving/restoring 64 of the 64 registers...
> For syscalls, it basically messes with the registers in the captured
> register state for the calling task.
>
> A newer change involves saving/restoring registers more directly to/from
> the task context for syscalls, which reduces the task-switch overhead by
> around 50% (but is mostly N/A for other kinds of interrupts).
>
> ...
>
>
I am toying with the idea of adding context save and restore
instructions. I would try to get them to work on a cache-line worth of
data, four registers accessed for read or write at the same time.
Context save / restore would be a macro instruction made up of sixteen
individual instructions, each of which saves or restores four registers.
It is a bit of a hoop to jump through for an infrequently used
operation. However, it is good to have clean context switch code.

Added the REGS instruction modifier. The modifier causes the following
load or store instruction to repeat, using the registers specified in
the register-list bitmask as the source or target register. In theory it
can also be applied to other instructions, but that was not the intent.
It is pretty much useless for other instructions, though a register list
could be supplied to the MOV instruction to zero out multiple registers
with a single instruction, or possibly the ADDI instruction could be
used to load a constant into multiple registers. I could add logic to
disable REGS use with anything other than load and store ops, but why
add extra hardware?
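
In C terms, the effect of a REGS-modified store could be modeled roughly
as below (a sketch of the described semantics only; the function name
and the 64-bit mask width are assumptions, not the actual Q+ encoding):

#include <stdint.h>

/* Repeat the store once per bit set in the register-list bitmask,
   walking the register file from low to high numbers. */
void regs_store(uint64_t reglist, uint64_t *mem, const uint64_t gpr[64])
{
    for (int r = 0; r < 64; r++)
        if (reglist & (1ull << r))
            *mem++ = gpr[r];    /* one store per listed register */
}

A full context save is then just this with an all-ones mask, which the
hardware can burst as the sixteen four-register groups described above.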

Re: Tonight's tradeoff

<ukmho1$3qusm$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35432&group=comp.arch#35432

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Tue, 5 Dec 2023 00:59:11 -0600
Organization: A noiseless patient Spider
Lines: 212
Message-ID: <ukmho1$3qusm$1@dont-email.me>
References: <uis67u$fkj4$1@dont-email.me>
<71cb5ad7604b3d909df865a19ee3d52e@news.novabbs.com>
<ujb40q$3eepe$1@dont-email.me> <ujrfaa$2h1v9$1@dont-email.me>
<987455c358f93a9a7896c9af3d5f2b75@news.novabbs.com>
<ujrm4a$2llie$1@dont-email.me>
<d1f73b9de9ff6f86dac089ebd4bca037@news.novabbs.com>
<bps8N.150652$wvv7.7314@fx14.iad>
<2023Nov26.164506@mips.complang.tuwien.ac.at>
<tDP8N.30031$ayBd.8559@fx07.iad>
<2023Nov27.085708@mips.complang.tuwien.ac.at>
<s929N.28687$rx%7.18632@fx47.iad>
<2023Nov27.171049@mips.complang.tuwien.ac.at> <ukaeef$1ecg7$1@dont-email.me>
<ukb8rd$1im31$1@dont-email.me> <9K1bN.257199$wvv7.25292@fx14.iad>
<3a714dde36640fc4c47255d0a170aaee@news.novabbs.com>
<ukj875$33k1l$1@dont-email.me> <ukmb6n$3q23h$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 5 Dec 2023 06:59:13 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="dcd61a4d137569b14300071fc5194bdd";
logging-data="4029334"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX197Br2C4GOcLn9WIt424Fwk"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:UtyBz+JMM6RLDM60RCxWLUzbx6k=
Content-Language: en-US
In-Reply-To: <ukmb6n$3q23h$1@dont-email.me>
 by: BGB - Tue, 5 Dec 2023 06:59 UTC

On 12/4/2023 11:07 PM, Robert Finch wrote:
> On 2023-12-03 7:58 p.m., BGB wrote:
>> On 12/3/2023 10:58 AM, MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Robert Finch wrote:
>>>>> Figured it out. Each architectural register in the RAT must refer
>>>>> to N physical registers, where N is the number of banks. Setting N
>>>>> to 4 results in a RAT that is only about 50% larger than one
>>>>> supporting only a single bank. The operating mode is used to select
>>>>> the physical register. The first eight registers are shared between
>>>>> all operating modes so arguments can be passed to syscalls. It is
>>>>> tempting to have eight banks of registers, one for each hardware
>>>>> interrupt level.
>>>
>>>> A consequence of multiple architecture register banks is each extra
>>>> bank keeps a set of mostly unused physical register attached to them.
>>>
>>> A waste.....
>>>
>>>> For example, if there are 2 modes User and Super and a bank for each,
>>>> since User and Super are mutually exclusive,
>>>> 64 of your 256 physical registers will be sitting unused tied
>>>> to the other mode bank, so max of 75% utilization efficiency.
>>>
>>>> If you have 8 register banks then only 3/10 of the physical registers
>>>> are available to use, the other 7/10 are sitting idle attached to
>>>> arch registers in other modes consuming power.
>>>
>>>> Also you don't have to play overlapped-register-bank games to pass
>>>> args to/from syscalls. You can have specific instructions that reach
>>>> into other banks: Move To User Reg, Move From User Reg.
>>>> Since only syscall passes args into the OS you only need to access
>>>> the user mode bank from the OS kernel bank.
>>>
>>> Whereas: Exceptions, interrupts save and restore 32-registers::
>>> A SysCall in My 66000 only saves and restores 24 of the 32 registers.
>>> So when control arrives, there are 8 argument registers from the
>>> Caller and 24 registers from Guest OS already loaded. So, SysCall
>>> handler already has its stack, and a variety of pointers to data
>>> structures it is interested in.
>>>
>>> On the way back, RET only restores 24 registers so Guest OS can pass
>>> back as many as 8 result registers.
>>
>> I had handled it by saving/restoring 64 of the 64 registers...
>> For syscalls, it basically messes with the registers in the captured
>> register state for the calling task.
>>
>> A newer change involves saving/restoring registers more directly
>> to/from the task context for syscalls, which reduces the task-switch
>> overhead by around 50% (but is mostly N/A for other kinds of interrupts).
>>
>> ...
>>
>>
> I am toying with the idea of adding context save and restore
> instructions. I would try to get them to work on a cache-line worth of
> data, four registers accessed for read or write at the same time.
> Context save / restore would be a macro instruction made up of sixteen
> individual instructions each of which saves or restores four registers.
> It is a bit of a hoop to jump through for an infrequently used
> operation. However, it is good to have to clean context switch code.
>
> Added the REGS instruction modifier. The modifier causes the following
> load or store instruction to repeat using the registers specified in the
> register list bitmask for the source or target register. In theory it
> can also be applied to other instructions but that was not the intent.
> It is pretty much useless for other instructions, but a register list
> could be supplied to the MOV instruction to zero out multiple registers
> with a single instruction. Or possibly the ADDI instruction could be
> used to load a constant into multiple registers. I could put code in to
> disable REGS use with anything other than load and store ops, but why
> add extra hardware?
>

In my case, it is partly a limitation of not really being able to make
it any wider than it already is without adding a 4th register write port
and likely imposing a 256-bit alignment requirement; and this for a task
that is mostly limited by L1 cache misses...

Like, saving registers would be ~ 40 cycles or so (with another ~ 40 to
restore them) at 2 registers per cycle with GPRs, if not for all the L1
misses.

The reason it is not similar for normal function calls (besides these
saving/restoring only the normal registers) is that the stack is often
still "warm" in the L1 cache.

For interrupts, in the time from one interrupt to another, most of the
L1 cache contents from the previous interrupt are already gone.

So, these instruction sequences are around 80% L1 miss penalty, vs
around 5% for normal prologs/epilogs.

This is similar for the inner loops for "memcpy()", which average
roughly 90% L1 miss penalty.

And, say, "memcpy()" averages around 300MB/sec if just copying the same
small buffer over and over again, but then quickly drops to 70MB/sec if
copying memory that falls outside the L1 cache.
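
(For reference, the sort of measurement meant here can be sketched in C;
this is a guess at the method, not the actual test harness: copy a
buffer repeatedly and divide bytes moved by elapsed time, where a small
buffer stays L1-resident and a large one does not.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static uint8_t src[1 << 20], dst[1 << 20];   /* 1MB test buffers */

static double copy_rate_mb(size_t size, int iters)
{
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return ((double)size * iters) / (secs * 1e6);
}

int main(void)
{
    printf("4KB : %.1f MB/s\n", copy_rate_mb(4096, 100000));  /* fits in L1 */
    printf("1MB : %.1f MB/s\n", copy_rate_mb(1 << 20, 1000)); /* misses L1 */
    return 0;
}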

Though, comparably, it seems that the drop-off from L2 cache to DRAM is
currently a little smaller.

So, the external DRAM interface can push ~ 100MB/sec with the current
interface (supports SWAP operations, moving 512 bits at a time, and
using a sequence number to transition from one request to another).

But, it is around 70MB/s for requests to make it around the ringbus.

Though, I have noted that if things stay within the limits of what fits
in the L2 cache, multiple parties can access the L2 cache at the same
time without too much impact on each other.

So, say, at modest resolutions, the screen refresh does not impact the
CPU, and the rasterizer module is also mostly independent.

Still, about the highest screen resolution it can sustain effectively is
~ 640x480 256-color, or ~ 18MB/sec.

This may be more timing related though, since for screen refresh there
is a relatively tight deadline between when the requests start being
sent, and when the L2 cache needs to hit for that request, and failing
this will result in graphical glitches.

Though, generally what it means is, if the framebuffer image isn't in
the L2 cache, it is gonna look like crap; and effectively the limit is
more "how big of a framebuffer can I fit in the L2 cache".

On the XC7A200T, I can afford a 512K L2 cache, which is just big enough
to fit 640x400 or 640x480 at 256 colors (250K and 300K of framebuffer,
respectively; 800x600 is kinda pushing it, and fights a bit more with
the main CPU).

OTOH, it is likely the case that on the XC7A100T (which can only afford
a 256K L2 cache), 640x400 256-color is pushing it (but color cell mode
still works fine).

Had noted though that, at one point, trying to set the screen resolution
to 800x600 RGB555 (72 Hz), which pulls around 70MB/sec, basically broke
almost entirely and seemingly bogged down the CPU (which could no longer
access memory in a timely manner).

Also, seemingly, stuff running on the CPU can cause screen artifacts in
these modes, presumably by knocking stuff out of the L2 cache.

Also, it seems that despite my ringbus being a lot faster than my
original bus, it has still managed to become an issue due to latency.
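
(The forwarding rule, as a toy C model with invented names: each ring
stop consumes messages addressed to it, injects pending messages into
free slots, and otherwise passes traffic along, so every hop adds a
cycle of latency and unclaimed messages keep circling.)

#include <stdbool.h>
#include <stdint.h>

typedef struct ring_msg {
    bool     valid;
    uint16_t dest;
    uint64_t payload;
} ring_msg;

/* One node, one cycle: 'in' arrives from the upstream neighbor and the
   return value is sent downstream. */
ring_msg ring_step(uint16_t self, ring_msg in, ring_msg *inject,
                   bool *took_in)
{
    *took_in = in.valid && in.dest == self;  /* consume our own traffic */
    if (*took_in)
        in.valid = false;                    /* the slot becomes free */
    if (!in.valid && inject->valid) {        /* drop a request in the gap */
        ring_msg out = *inject;
        inject->valid = false;
        return out;
    }
    return in;                               /* otherwise just forward */
}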

But, despite this, on average, things like interlocks and branch-miss
penalties and similar are now still weighing in a fair bit as well (with
interlock penalties closely following cache misses as the main source of
pipeline stalls).

Well, these two combined burn around 30% of the total clock-cycles, with
another ~ 2-3% or so being spent on branches, ...

Well, my recent effort to improve FPGA timing enough to get it up to
75MHz did have the drawback of "in general" increasing the number of
cycles spent on interlocks (but returning a lot of the instructions to
their original latency values would make the FPGA timing-constraints
issues a bit worse).

But, even if I could entirely eliminate these sources of latency, this
would only gain ~30%; beyond that I would either need to somehow
increase the average bundle width, or find ways to reduce the total
number of instructions that need to be executed (both of these being
more compiler-related territory).

Though, OTOH, I have noted that in many cases I am beating RISC-V
(RV64IM) in terms of total ".text" size in XG2 mode (only 32/64/96 bit
encodings) when both are using the same C library, which implies that I
am probably "not doing too badly" on this front either (though, ideally,
I would be "more consistently" beating RISC-V at this metric, *1).

*1: RV64 "-Os -ffunction-sections -Wl,-gc-sections" is still able to
beat XG2 in terms of having smaller ".text" (though, "-O2" and "-O3" are
bigger; BJX2 Baseline does beat RV64IM, but this is not a fair test as
BJX2 Baseline has 16-bit ops).


Re: Tonight's tradeoff

<uko6ok$dk88$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35452&group=comp.arch#35452

Path: i2pn2.org!i2pn.org!news.swapon.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Tue, 5 Dec 2023 17:04:04 -0500
Organization: A noiseless patient Spider
Lines: 250
Message-ID: <uko6ok$dk88$1@dont-email.me>
References: <uis67u$fkj4$1@dont-email.me>
<71cb5ad7604b3d909df865a19ee3d52e@news.novabbs.com>
<ujb40q$3eepe$1@dont-email.me> <ujrfaa$2h1v9$1@dont-email.me>
<987455c358f93a9a7896c9af3d5f2b75@news.novabbs.com>
<ujrm4a$2llie$1@dont-email.me>
<d1f73b9de9ff6f86dac089ebd4bca037@news.novabbs.com>
<bps8N.150652$wvv7.7314@fx14.iad>
<2023Nov26.164506@mips.complang.tuwien.ac.at>
<tDP8N.30031$ayBd.8559@fx07.iad>
<2023Nov27.085708@mips.complang.tuwien.ac.at>
<s929N.28687$rx%7.18632@fx47.iad>
<2023Nov27.171049@mips.complang.tuwien.ac.at> <ukaeef$1ecg7$1@dont-email.me>
<ukb8rd$1im31$1@dont-email.me> <9K1bN.257199$wvv7.25292@fx14.iad>
<3a714dde36640fc4c47255d0a170aaee@news.novabbs.com>
<ukj875$33k1l$1@dont-email.me> <ukmb6n$3q23h$1@dont-email.me>
<ukmho1$3qusm$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 5 Dec 2023 22:04:05 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="000bad77c4a45bbaf24bc632ba734d2b";
logging-data="446728"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/PlJRYCraahdnK6ZOIv2SkSHowfjMzJm4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:4odzblfi3fPkOJCN6kRR39JBsnU=
In-Reply-To: <ukmho1$3qusm$1@dont-email.me>
Content-Language: en-US
 by: Robert Finch - Tue, 5 Dec 2023 22:04 UTC

On 2023-12-05 1:59 a.m., BGB wrote:
> On 12/4/2023 11:07 PM, Robert Finch wrote:
>> On 2023-12-03 7:58 p.m., BGB wrote:
>>> On 12/3/2023 10:58 AM, MitchAlsup wrote:
>>>> EricP wrote:
>>>>
>>>>> Robert Finch wrote:
>>>>>> Figured it out. Each architectural register in the RAT must refer
>>>>>> to N physical registers, where N is the number of banks. Setting N
>>>>>> to 4 results in a RAT that is only about 50% larger than one
>>>>>> supporting only a single bank. The operating mode is used to
>>>>>> select the physical register. The first eight registers are shared
>>>>>> between all operating modes so arguments can be passed to
>>>>>> syscalls. It is tempting to have eight banks of registers, one for
>>>>>> each hardware interrupt level.
>>>>
>>>>> A consequence of multiple architecture register banks is each extra
>>>>> bank keeps a set of mostly unused physical register attached to them.
>>>>
>>>> A waste.....
>>>>
>>>>> For example, if there are 2 modes User and Super and a bank for each,
>>>>> since User and Super are mutually exclusive,
>>>>> 64 of your 256 physical registers will be sitting unused tied
>>>>> to the other mode bank, so max of 75% utilization efficiency.
>>>>
>>>>> If you have 8 register banks then only 3/10 of the physical registers
>>>>> are available to use, the other 7/10 are sitting idle attached to
>>>>> arch registers in other modes consuming power.
>>>>
>>>>> Also you don't have to play overlapped-register-bank games to pass
>>>>> args to/from syscalls. You can have specific instructions that reach
>>>>> into other banks: Move To User Reg, Move From User Reg.
>>>>> Since only syscall passes args into the OS you only need to access
>>>>> the user mode bank from the OS kernel bank.
>>>>
>>>> Whereas: Exceptions, interrupts save and restore 32-registers::
>>>> A SysCall in My 66000 only saves and restores 24 of the 32 registers.
>>>> So when control arrives, there are 8 argument registers from the
>>>> Caller and 24 registers from Guest OS already loaded. So, SysCall
>>>> handler already has its stack, and a variety of pointers to data
>>>> structures it is interested in.
>>>>
>>>> On the way back, RET only restores 24 registers so Guest OS can pass
>>>> back as many as 8 result registers.
>>>
>>> I had handled it by saving/restoring 64 of the 64 registers...
>>> For syscalls, it basically messes with the registers in the captured
>>> register state for the calling task.
>>>
>>> A newer change involves saving/restoring registers more directly
>>> to/from the task context for syscalls, which reduces the task-switch
>>> overhead by around 50% (but is mostly N/A for other kinds of
>>> interrupts).
>>>
>>> ...
>>>
>>>
>> I am toying with the idea of adding context save and restore
>> instructions. I would try to get them to work on a cache-line worth of
>> data, four registers accessed for read or write at the same time.
>> Context save / restore would be a macro instruction made up of sixteen
>> individual instructions each of which saves or restores four
>> registers. It is a bit of a hoop to jump through for an infrequently
>> used operation. However, it is good to have to clean context switch code.
>>
>> Added the REGS instruction modifier. The modifier causes the following
>> load or store instruction to repeat using the registers specified in
>> the register list bitmask for the source or target register. In theory
>> it can also be applied to other instructions but that was not the
>> intent. It is pretty much useless for other instructions, but a
>> register list could be supplied to the MOV instruction to zero out
>> multiple registers with a single instruction. Or possibly the ADDI
>> instruction could be used to load a constant into multiple registers.
>> I could put code in to disable REGS use with anything other than load
>> and store ops, but why add extra hardware?
>>
>
> In my case, it is partly a limitation of not really being able to make
> it wider than it is already absent adding a 4th register write port and
> likely imposing a 256-bit alignment requirement; for a task that is
> mostly limited by L1 cache misses...
>
>
> Like, saving registers would be ~ 40 cycles or so (with another ~ 40 to
> restore them), saving/restoring 2 registers per cycle with GPRs, if not
> for all the L1 misses.
>
> Reason it is not similar for normal function calls (besides these
> saving/restoring normal registers), is because often the stack is still
> "warm" in the L1 cache.
>
> For interrupts, in the time from one interrupt to another, most of the
> L1 cache contents from the previous interrupt are already gone.
>
>
> So, these instruction sequences are around 80% L1 miss penalty, vs
> around 5% for normal prologs/epilogs.
>
> This is similar for the inner loops for "memcpy()", which average
> roughly 90% L1 miss penalty.
>
>
>
> And, say, "memcpy()" averages around 300MB/sec if just copying the same
> small buffer over and over again, but then quickly drops to 70MB/sec if
> copying memory that falls outside the L1 cache.
>
> Though, comparably, it seems that the drop-off from L2 cache to DRAM is
> currently a little smaller.
>
> So, the external DRAM interface can push ~ 100MB/sec with the current
> interface (supports SWAP operations, moving 512 bits at a time, and
> using a sequence number to transition from one request to another).
>
> But, it is around 70MB/s for requests to make it around the ringbus.
>
>
> Though, I have noted that if things stay within the limits of what fits
> in the L2 cache, multiple parties can access the L2 cache at the same
> time without too much impact on each other.
>
> So, say, a modest resolutions, the screen refresh does not impact the
> CPU, and the rasterizer module is also mostly independent.
>
>
>
> Still, about the highest screen resolution it can really sustain
> effectively is ~ 640x480 256-color, or ~ 18MB/sec.
>
> This may be more timing related though, since for screen refresh there
> is a relatively tight deadline between when the requests start being
> sent, and when the L2 cache needs to hit for that request, and failing
> this will result in graphical glitches.
>
> Though, generally what it means is, if the framebuffer image isn't in
> the L2 cache, it is gonna look like crap; and effectively the limit is
> more "how big of a framebuffer can I fit in the L2 cache".
>
> On the XC7A200T, I can afford a 512K L2 cache, which is just so big
> enough to fit 640x400 or 640x480 (but 800x600 is kinda pushing it, and
> fights a bit more with the main CPU).
>
> OTOH, it is likely the case than on the XC7A100T (which can only afford
> a 256K L2 cache), that 640x400 256-color is pushing it (but color cell
> mode still works fine).
>
> Had noted though that trying to set the screen resolution at one point
> to 800x600 RGB555 (72 Hz), pulls around 70MB/sec, and basically was
> almost entirely broken and seemingly bogged down the CPU (which could no
> longer access memory in a timely manner).
>
> Also seemingly stuff running on the CPU can effect screen artifacts in
> these modes, presumably by knocking stuff out of the L2 cache.
>
>
>
> Also, it seems like despite my ringbus being a lot faster than my
> original bus, it has still managed to become an issue due to latency.
>
> But, despite this, on average, things like interlocks and branch-miss
> penalties and similar are now still weighing in a fair bit as well (with
> interlock penalties closely following cache misses as the main source of
> pipeline stalls).
>
> Well, and these two combined burning around 30% of the total
> clock-cycles, with another ~ 2-3% or so being spent on branches, ...
>
>
> Well, and my recent effort to try to improve FPGA timing enough try to
> get it up to 75MHz, did have the drawback of "in general" increasing the
> number of cycles spent on interlocks (but, returning a lot of the
> instructions to their original latency values, would make the FPGA
> timing-constraints issues a bit worse).
>
> But, if I could entirely eliminate these sources of latency, this would
> only gain ~30%, and at this point would either need to somehow increase
> the average bundle with, or find ways to reduce the total number of
> instructions that need to be executed (both of these being more
> compiler-related territory).
>
>
> Though, OTOH, I have noted that in many cases I am beating RISC-V
> (RV64IM) in terms of total ".text" size in XG2 mode (only 32/64/96 bit
> encodings) when both are using the same C library, which implies that I
> am probably "not doing too badly" on this front either (though, ideally,
> I would be "more consistently" beating RISC-V at this metric, *1).
>
>
> *1: RV64 "-Os -ffunction-sections -Wl,-gc-sections" is still able to
> beat XG2 in terms of having smaller ".text" (though, "-O2" and "-O3" are
> bigger; BJX2 Baseline does beat RV64IM, but this is not a fair test as
> BJX2 Baseline has 16-bit ops).
>
> Though, BGBCC also has an "/Os" option, it seems to have very little
> effect on XG2 Mode (it mostly does things to try to increase the number
> of 16-bit ops used, which is N/A in XG2).
>
> Where, here, one can use ".text" size as a stand-in for total
> instruction count (and by extension, the number of instructions that
> need to be executed).
>
> Though, in some past tests, it seemed like RISC-V needed to execute a
> larger number of instructions to render each frame in Doom, which
> doesn't really make follow if both have a roughly similar number of
> instructions in the emitted binaries (and if both are essentially
> running the same code).
>
> So, something seems curious here...
>
>
> ...
>
>
For the Q+ MPU and SOC the bus system is organized like a tree, with the
root being at the CPU. The system bus operates with asynchronous
transactions. The bus fans out through bus bridges to various system
components. Responses coming back from devices are buffered and merged
into the more common upstream bus when there are open slots on it. I
think it is fairly fast (well, at least for homebrew FPGA). Bus accesses
are single cycle, but they may have a varying amount of latency. Writes
are “posted” so they are essentially single cycle. Reads percolate back
up the tree to the CPU. The bus operates at the CPU clock rate
(currently 40MHz) and transfers 128 bits at a time, so the maximum peak
transfer rate would be 640 MB/s (16 bytes x 40 MHz). Copying memory is
bound to be much slower due to the read latency. Devices on the bus have
a configuration block which looks something like a PCI config block, so
device addressing may be controlled by the OS.
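
A loose C sketch of such a configuration block (field names and widths
here are illustrative guesses, not the actual Q+ layout):

#include <stdint.h>

typedef struct dev_cfg {         /* hypothetical name */
    uint16_t vendor_id;
    uint16_t device_id;
    uint32_t class_code;
    uint64_t bar[4];             /* base address registers: where the
                                    device sits in the address space */
    uint32_t irq_info;
    uint32_t status;
} dev_cfg;

/* The OS can relocate a device by rewriting a BAR, which is the sense
   in which device addressing "may be controlled by the OS". */
void assign_device_address(volatile dev_cfg *cfg, uint64_t base)
{
    cfg->bar[0] = base;
}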


Re: Tonight's tradeoff

<a5fb0db1da6f48f46eb636a3bd9b267f@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35460&group=comp.arch#35460

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Tue, 5 Dec 2023 23:47:10 +0000
Organization: novaBBS
Message-ID: <a5fb0db1da6f48f46eb636a3bd9b267f@news.novabbs.com>
References: <uis67u$fkj4$1@dont-email.me> <71cb5ad7604b3d909df865a19ee3d52e@news.novabbs.com> <ujb40q$3eepe$1@dont-email.me> <ujrfaa$2h1v9$1@dont-email.me> <987455c358f93a9a7896c9af3d5f2b75@news.novabbs.com> <ujrm4a$2llie$1@dont-email.me> <d1f73b9de9ff6f86dac089ebd4bca037@news.novabbs.com> <bps8N.150652$wvv7.7314@fx14.iad> <2023Nov26.164506@mips.complang.tuwien.ac.at> <tDP8N.30031$ayBd.8559@fx07.iad> <2023Nov27.085708@mips.complang.tuwien.ac.at> <s929N.28687$rx%7.18632@fx47.iad> <2023Nov27.171049@mips.complang.tuwien.ac.at> <ukaeef$1ecg7$1@dont-email.me> <ukb8rd$1im31$1@dont-email.me> <9K1bN.257199$wvv7.25292@fx14.iad> <3a714dde36640fc4c47255d0a170aaee@news.novabbs.com> <ukj875$33k1l$1@dont-email.me> <ukmb6n$3q23h$1@dont-email.me> <ukmho1$3qusm$1@dont-email.me> <uko6ok$dk88$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="3184401"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Site: $2y$10$d/1krssrRpIbpREKthr9GOFLcpXkF.E5X03YPp6k8hTTJMPeq/mHu
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
 by: MitchAlsup - Tue, 5 Dec 2023 23:47 UTC

Robert Finch wrote:

>
> For the Q+ MPU and SOC the bus system is organized like a tree with the
> root being at the CPU. The system bus operates with asynchronous
> transactions. The bus then fans out through bus bridges to various
> system components. Responses coming back from devices are buffered and
> merge results together into a more common bus when there are open spaces
> in the bus. I think it is fairly fast (well at least for homebrew FPGA).
> Bus accesses are single cycle, but they may have a varying amount of
> latency.

My "bus" is similar, but is, in effect, a 4-wire protocol done with
transactions on the buss. Read goes to Mem CTL, when "ordered" Snoops
go out, Snoop responses go to requesting core, Mem response goes to
core. When core has SNOOP responses and mem data it sends DONE to
mem Ctl. The arriving DONE allows the next access to that same cache
line to begin (that is DONE "orders" successive accesses to the same
line addresses, while allowing independent accesses to proceed inde-
pendently.

The data width of my "bus" is 1 cache line, or ½ cache line at DDR.
Control is ~90-bits including a 66-bit address.
SNOOP responses are packed.
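
A minimal C model of that ordering rule (invented names; a sketch of the
description, not the actual logic):

#include <stdbool.h>
#include <stdint.h>

typedef struct line_track {      /* one in-flight access per cache line */
    bool     busy;
    uint64_t line_addr;
} line_track;

/* Mem Ctl: a new request to the same line must wait for DONE; requests
   to other lines proceed independently. */
bool may_order(const line_track *t, uint64_t line_addr)
{
    return !(t->busy && t->line_addr == line_addr);
}

/* Requesting core: DONE is sent only once both the snoop responses and
   the memory data have arrived. */
bool ready_to_send_done(bool have_snoop_resps, bool have_mem_data)
{
    return have_snoop_resps && have_mem_data;
}

/* Mem Ctl, on receiving DONE: the next access to the line may begin. */
void on_done(line_track *t) { t->busy = false; }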

> Writes are “posted” so they are essentially single cycle.

Writes to DRAM are "posted"
Writes to config space are strongly ordered
Writes to MMI/O are sequentially Consistent

> Reads
> percolate back up the tree to the CPU. It operates at the CPU clock rate
> (currently 40MHz) and transfers 128-bits at a time. Maximum peak
> transfer rate would then be 640 MB/s. Copying memory is bound to be much
> slower due to the read latency. Devices on the bus have a configuration
> block which looks something like a PCI config block, so devices
> addressing may be controlled by the OS.

> Multiple devices access the main DRAM memory via a memory controller.

I interpose the LLC (L3) between the "bus" and the Mem Ctl. This
interposition is what eliminates RowHammer. The L3 is not really a
cache; it is a preview of the state DRAM will eventually achieve or has
already achieved. It is, in essence, an infinite write buffer between
the MC and DRC and a near-infinite read buffer between DRC and MC.

> Several devices that are bus masters have their own ports to the memory
> controller and do not use up time on the main system bus tree. The

Yes, the PCIe HostBridge has master access to the "bus"; all "devices"
are down under the HostBridge. With CXL enabled, one can even place DRAM
out on the PCIe tree,...

Re: Tonight's tradeoff

<ukon2b$jilv$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35470&group=comp.arch#35470

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Tue, 5 Dec 2023 20:42:16 -0600
Organization: A noiseless patient Spider
Lines: 556
Message-ID: <ukon2b$jilv$1@dont-email.me>
References: <uis67u$fkj4$1@dont-email.me>
<71cb5ad7604b3d909df865a19ee3d52e@news.novabbs.com>
<ujb40q$3eepe$1@dont-email.me> <ujrfaa$2h1v9$1@dont-email.me>
<987455c358f93a9a7896c9af3d5f2b75@news.novabbs.com>
<ujrm4a$2llie$1@dont-email.me>
<d1f73b9de9ff6f86dac089ebd4bca037@news.novabbs.com>
<bps8N.150652$wvv7.7314@fx14.iad>
<2023Nov26.164506@mips.complang.tuwien.ac.at>
<tDP8N.30031$ayBd.8559@fx07.iad>
<2023Nov27.085708@mips.complang.tuwien.ac.at>
<s929N.28687$rx%7.18632@fx47.iad>
<2023Nov27.171049@mips.complang.tuwien.ac.at> <ukaeef$1ecg7$1@dont-email.me>
<ukb8rd$1im31$1@dont-email.me> <9K1bN.257199$wvv7.25292@fx14.iad>
<3a714dde36640fc4c47255d0a170aaee@news.novabbs.com>
<ukj875$33k1l$1@dont-email.me> <ukmb6n$3q23h$1@dont-email.me>
<ukmho1$3qusm$1@dont-email.me> <uko6ok$dk88$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 6 Dec 2023 02:42:19 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="325390cd5f4a9d68788313ed1525711c";
logging-data="641727"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19VszeYgNu4lfFA8A/S78LZ"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:uUx/+dmvPeV0NNvLA95bY5g0I/w=
In-Reply-To: <uko6ok$dk88$1@dont-email.me>
Content-Language: en-US
 by: BGB - Wed, 6 Dec 2023 02:42 UTC

On 12/5/2023 4:04 PM, Robert Finch wrote:
> On 2023-12-05 1:59 a.m., BGB wrote:
>> On 12/4/2023 11:07 PM, Robert Finch wrote:
>>> On 2023-12-03 7:58 p.m., BGB wrote:
>>>> On 12/3/2023 10:58 AM, MitchAlsup wrote:
>>>>> EricP wrote:
>>>>>
>>>>>> Robert Finch wrote:
>>>>>>> Figured it out. Each architectural register in the RAT must refer
>>>>>>> to N physical registers, where N is the number of banks. Setting
>>>>>>> N to 4 results in a RAT that is only about 50% larger than one
>>>>>>> supporting only a single bank. The operating mode is used to
>>>>>>> select the physical register. The first eight registers are
>>>>>>> shared between all operating modes so arguments can be passed to
>>>>>>> syscalls. It is tempting to have eight banks of registers, one
>>>>>>> for each hardware interrupt level.
>>>>>
>>>>>> A consequence of multiple architecture register banks is each extra
>>>>>> bank keeps a set of mostly unused physical register attached to them.
>>>>>
>>>>> A waste.....
>>>>>
>>>>>> For example, if there are 2 modes User and Super and a bank for each,
>>>>>> since User and Super are mutually exclusive,
>>>>>> 64 of your 256 physical registers will be sitting unused tied
>>>>>> to the other mode bank, so max of 75% utilization efficiency.
>>>>>
>>>>>> If you have 8 register banks then only 3/10 of the physical registers
>>>>>> are available to use, the other 7/10 are sitting idle attached to
>>>>>> arch registers in other modes consuming power.
>>>>>
>>>>>> Also you don't have to play overlapped-register-bank games to pass
>>>>>> args to/from syscalls. You can have specific instructions that reach
>>>>>> into other banks: Move To User Reg, Move From User Reg.
>>>>>> Since only syscall passes args into the OS you only need to access
>>>>>> the user mode bank from the OS kernel bank.
>>>>>
>>>>> Whereas: Exceptions, interrupts save and restore 32-registers::
>>>>> A SysCall in My 66000 only saves and restores 24 of the 32 registers.
>>>>> So when control arrives, there are 8 argument registers from the
>>>>> Caller and 24 registers from Guest OS already loaded. So, SysCall
>>>>> handler already has its stack, and a variety of pointers to data
>>>>> structures it is interested in.
>>>>>
>>>>> On the way back, RET only restores 24 registers so Guest OS can pass
>>>>> back as many as 8 result registers.
>>>>
>>>> I had handled it by saving/restoring 64 of the 64 registers...
>>>> For syscalls, it basically messes with the registers in the captured
>>>> register state for the calling task.
>>>>
>>>> A newer change involves saving/restoring registers more directly
>>>> to/from the task context for syscalls, which reduces the task-switch
>>>> overhead by around 50% (but is mostly N/A for other kinds of
>>>> interrupts).
>>>>
>>>> ...
>>>>
>>>>
>>> I am toying with the idea of adding context save and restore
>>> instructions. I would try to get them to work on a cache-line worth
>>> of data, four registers accessed for read or write at the same time.
>>> Context save / restore would be a macro instruction made up of
>>> sixteen individual instructions each of which saves or restores four
>>> registers. It is a bit of a hoop to jump through for an infrequently
>>> used operation. However, it is good to have to clean context switch
>>> code.
>>>
>>> Added the REGS instruction modifier. The modifier causes the
>>> following load or store instruction to repeat using the registers
>>> specified in the register list bitmask for the source or target
>>> register. In theory it can also be applied to other instructions but
>>> that was not the intent. It is pretty much useless for other
>>> instructions, but a register list could be supplied to the MOV
>>> instruction to zero out multiple registers with a single instruction.
>>> Or possibly the ADDI instruction could be used to load a constant
>>> into multiple registers. I could put code in to disable REGS use with
>>> anything other than load and store ops, but why add extra hardware?
>>>
>>
>> In my case, it is partly a limitation of not really being able to make
>> it wider than it is already absent adding a 4th register write port
>> and likely imposing a 256-bit alignment requirement; for a task that
>> is mostly limited by L1 cache misses...
>>
>>
>> Like, saving registers would be ~ 40 cycles or so (with another ~ 40
>> to restore them), saving/restoring 2 registers per cycle with GPRs, if
>> not for all the L1 misses.
>>
>> Reason it is not similar for normal function calls (besides these
>> saving/restoring normal registers), is because often the stack is
>> still "warm" in the L1 cache.
>>
>> For interrupts, in the time from one interrupt to another, most of the
>> L1 cache contents from the previous interrupt are already gone.
>>
>>
>> So, these instruction sequences are around 80% L1 miss penalty, vs
>> around 5% for normal prologs/epilogs.
>>
>> This is similar for the inner loops for "memcpy()", which average
>> roughly 90% L1 miss penalty.
>>
>>
>>
>> And, say, "memcpy()" averages around 300MB/sec if just copying the
>> same small buffer over and over again, but then quickly drops to
>> 70MB/sec if copying memory that falls outside the L1 cache.
>>
>> Though, comparably, it seems that the drop-off from L2 cache to DRAM
>> is currently a little smaller.
>>
>> So, the external DRAM interface can push ~ 100MB/sec with the current
>> interface (supports SWAP operations, moving 512 bits at a time, and
>> using a sequence number to transition from one request to another).
>>
>> But, it is around 70MB/s for requests to make it around the ringbus.
>>
>>
>> Though, I have noted that if things stay within the limits of what
>> fits in the L2 cache, multiple parties can access the L2 cache at the
>> same time without too much impact on each other.
>>
>> So, say, a modest resolutions, the screen refresh does not impact the
>> CPU, and the rasterizer module is also mostly independent.
>>
>>
>>
>> Still, about the highest screen resolution it can really sustain
>> effectively is ~ 640x480 256-color, or ~ 18MB/sec.
>>
>> This may be more timing related though, since for screen refresh there
>> is a relatively tight deadline between when the requests start being
>> sent, and when the L2 cache needs to hit for that request, and failing
>> this will result in graphical glitches.
>>
>> Though, generally what it means is, if the framebuffer image isn't in
>> the L2 cache, it is gonna look like crap; and effectively the limit is
>> more "how big of a framebuffer can I fit in the L2 cache".
>>
>> On the XC7A200T, I can afford a 512K L2 cache, which is just so big
>> enough to fit 640x400 or 640x480 (but 800x600 is kinda pushing it, and
>> fights a bit more with the main CPU).
>>
>> OTOH, it is likely the case than on the XC7A100T (which can only
>> afford a 256K L2 cache), that 640x400 256-color is pushing it (but
>> color cell mode still works fine).
>>
>> Had noted though that trying to set the screen resolution at one point
>> to 800x600 RGB555 (72 Hz), pulls around 70MB/sec, and basically was
>> almost entirely broken and seemingly bogged down the CPU (which could
>> no longer access memory in a timely manner).
>>
>> Also seemingly stuff running on the CPU can effect screen artifacts in
>> these modes, presumably by knocking stuff out of the L2 cache.
>>
>>
>>
>> Also, it seems like despite my ringbus being a lot faster than my
>> original bus, it has still managed to become an issue due to latency.
>>
>> But, despite this, on average, things like interlocks and branch-miss
>> penalties and similar are now still weighing in a fair bit as well
>> (with interlock penalties closely following cache misses as the main
>> source of pipeline stalls).
>>
>> Well, and these two combined burning around 30% of the total
>> clock-cycles, with another ~ 2-3% or so being spent on branches, ...
>>
>>
>> Well, and my recent effort to try to improve FPGA timing enough try to
>> get it up to 75MHz, did have the drawback of "in general" increasing
>> the number of cycles spent on interlocks (but, returning a lot of the
>> instructions to their original latency values, would make the FPGA
>> timing-constraints issues a bit worse).
>>
>> But, if I could entirely eliminate these sources of latency, this
>> would only gain ~30%, and at this point would either need to somehow
>> increase the average bundle with, or find ways to reduce the total
>> number of instructions that need to be executed (both of these being
>> more compiler-related territory).
>>
>>
>> Though, OTOH, I have noted that in many cases I am beating RISC-V
>> (RV64IM) in terms of total ".text" size in XG2 mode (only 32/64/96 bit
>> encodings) when both are using the same C library, which implies that
>> I am probably "not doing too badly" on this front either (though,
>> ideally, I would be "more consistently" beating RISC-V at this metric,
>> *1).
>>
>>
>> *1: RV64 "-Os -ffunction-sections -Wl,-gc-sections" is still able to
>> beat XG2 in terms of having smaller ".text" (though, "-O2" and "-O3"
>> are bigger; BJX2 Baseline does beat RV64IM, but this is not a fair
>> test as BJX2 Baseline has 16-bit ops).
>>
>> Though, BGBCC also has an "/Os" option, it seems to have very little
>> effect on XG2 Mode (it mostly does things to try to increase the
>> number of 16-bit ops used, which is N/A in XG2).
>>
>> Where, here, one can use ".text" size as a stand-in for total
>> instruction count (and by extension, the number of instructions that
>> need to be executed).
>>
>> Though, in some past tests, it seemed like RISC-V needed to execute a
>> larger number of instructions to render each frame in Doom, which
>> doesn't really make follow if both have a roughly similar number of
>> instructions in the emitted binaries (and if both are essentially
>> running the same code).
>>
>> So, something seems curious here...
>>
>>
>> ...
>>
>>
> For the Q+ MPU and SOC the bus system is organized like a tree with the
> root being at the CPU. The system bus operates with asynchronous
> transactions. The bus then fans out through bus bridges to various
> system components. Responses coming back from devices are buffered and
> merge results together into a more common bus when there are open spaces
> in the bus. I think it is fairly fast (well at least for homebrew FPGA).
> Bus accesses are single cycle, but they may have a varying amount of
> latency. Writes are “posted” so they are essentially single cycle. Reads
> percolate back up the tree to the CPU. It operates at the CPU clock rate
> (currently 40MHz) and transfers 128-bits at a time. Maximum peak
> transfer rate would then be 640 MB/s. Copying memory is bound to be much
> slower due to the read latency. Devices on the bus have a configuration
> block which looks something like a PCI config block, so devices
> addressing may be controlled by the OS.
>


Re: Tonight's tradeoff

<uksqek$1aotb$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35531&group=comp.arch#35531

Path: i2pn2.org!rocksolid2!news.neodome.net!feeder1.feed.usenet.farm!feed.usenet.farm!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Thu, 7 Dec 2023 11:04:36 -0500
Organization: A noiseless patient Spider
Lines: 568
Message-ID: <uksqek$1aotb$1@dont-email.me>
References: <uis67u$fkj4$1@dont-email.me>
<71cb5ad7604b3d909df865a19ee3d52e@news.novabbs.com>
<ujb40q$3eepe$1@dont-email.me> <ujrfaa$2h1v9$1@dont-email.me>
<987455c358f93a9a7896c9af3d5f2b75@news.novabbs.com>
<ujrm4a$2llie$1@dont-email.me>
<d1f73b9de9ff6f86dac089ebd4bca037@news.novabbs.com>
<bps8N.150652$wvv7.7314@fx14.iad>
<2023Nov26.164506@mips.complang.tuwien.ac.at>
<tDP8N.30031$ayBd.8559@fx07.iad>
<2023Nov27.085708@mips.complang.tuwien.ac.at>
<s929N.28687$rx%7.18632@fx47.iad>
<2023Nov27.171049@mips.complang.tuwien.ac.at> <ukaeef$1ecg7$1@dont-email.me>
<ukb8rd$1im31$1@dont-email.me> <9K1bN.257199$wvv7.25292@fx14.iad>
<3a714dde36640fc4c47255d0a170aaee@news.novabbs.com>
<ukj875$33k1l$1@dont-email.me> <ukmb6n$3q23h$1@dont-email.me>
<ukmho1$3qusm$1@dont-email.me> <uko6ok$dk88$1@dont-email.me>
<ukon2b$jilv$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 7 Dec 2023 16:04:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f08aebda0808298d41bc4fa38271ff10";
logging-data="1401771"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18xfm2hOZeyQ6p7STN4qoYH9JGz5EUvb28="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Cq97UL0OXG8G44o2qstMZZQaov8=
In-Reply-To: <ukon2b$jilv$1@dont-email.me>
Content-Language: en-US
 by: Robert Finch - Thu, 7 Dec 2023 16:04 UTC

On 2023-12-05 9:42 p.m., BGB wrote:
> On 12/5/2023 4:04 PM, Robert Finch wrote:
>> On 2023-12-05 1:59 a.m., BGB wrote:
>>> On 12/4/2023 11:07 PM, Robert Finch wrote:
>>>> On 2023-12-03 7:58 p.m., BGB wrote:
>>>>> On 12/3/2023 10:58 AM, MitchAlsup wrote:
>>>>>> EricP wrote:
>>>>>>
>>>>>>> Robert Finch wrote:
>>>>>>>> Figured it out. Each architectural register in the RAT must
>>>>>>>> refer to N physical registers, where N is the number of banks.
>>>>>>>> Setting N to 4 results in a RAT that is only about 50% larger
>>>>>>>> than one supporting only a single bank. The operating mode is
>>>>>>>> used to select the physical register. The first eight registers
>>>>>>>> are shared between all operating modes so arguments can be
>>>>>>>> passed to syscalls. It is tempting to have eight banks of
>>>>>>>> registers, one for each hardware interrupt level.
>>>>>>
>>>>>>> A consequence of multiple architecture register banks is each extra
>>>>>>> bank keeps a set of mostly unused physical register attached to
>>>>>>> them.
>>>>>>
>>>>>> A waste.....
>>>>>>
>>>>>>> For example, if there are 2 modes User and Super and a bank for
>>>>>>> each,
>>>>>>> since User and Super are mutually exclusive,
>>>>>>> 64 of your 256 physical registers will be sitting unused tied
>>>>>>> to the other mode bank, so max of 75% utilization efficiency.
>>>>>>
>>>>>>> If you have 8 register banks then only 3/10 of the physical
>>>>>>> registers
>>>>>>> are available to use, the other 7/10 are sitting idle attached to
>>>>>>> arch registers in other modes consuming power.
>>>>>>
>>>>>>> Also you don't have to play overlapped-register-bank games to pass
>>>>>>> args to/from syscalls. You can have specific instructions that reach
>>>>>>> into other banks: Move To User Reg, Move From User Reg.
>>>>>>> Since only syscall passes args into the OS you only need to access
>>>>>>> the user mode bank from the OS kernel bank.
>>>>>>
>>>>>> Whereas: Exceptions, interrupts save and restore 32-registers::
>>>>>> A SysCall in My 66000 only saves and restores 24 of the 32 registers.
>>>>>> So when control arrives, there are 8 argument registers from the
>>>>>> Caller and 24 registers from Guest OS already loaded. So, SysCall
>>>>>> handler already has its stack, and a variety of pointers to data
>>>>>> structures it is interested in.
>>>>>>
>>>>>> On the way back, RET only restores 24 registers so Guest OS can pass
>>>>>> back as many as 8 result registers.
>>>>>
>>>>> I had handled it by saving/restoring 64 of the 64 registers...
>>>>> For syscalls, it basically messes with the registers in the
>>>>> captured register state for the calling task.
>>>>>
>>>>> A newer change involves saving/restoring registers more directly
>>>>> to/from the task context for syscalls, which reduces the
>>>>> task-switch overhead by around 50% (but is mostly N/A for other
>>>>> kinds of interrupts).
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>> I am toying with the idea of adding context save and restore
>>>> instructions. I would try to get them to work on a cache-line worth
>>>> of data, four registers accessed for read or write at the same time.
>>>> Context save / restore would be a macro instruction made up of
>>>> sixteen individual instructions each of which saves or restores four
>>>> registers. It is a bit of a hoop to jump through for an infrequently
>>>> used operation. However, it is good to have to clean context switch
>>>> code.
>>>>
>>>> Added the REGS instruction modifier. The modifier causes the
>>>> following load or store instruction to repeat using the registers
>>>> specified in the register list bitmask for the source or target
>>>> register. In theory it can also be applied to other instructions but
>>>> that was not the intent. It is pretty much useless for other
>>>> instructions, but a register list could be supplied to the MOV
>>>> instruction to zero out multiple registers with a single
>>>> instruction. Or possibly the ADDI instruction could be used to load
>>>> a constant into multiple registers. I could put code in to disable
>>>> REGS use with anything other than load and store ops, but why add
>>>> extra hardware?
>>>>
>>>
>>> In my case, it is partly a limitation of not really being able to
>>> make it wider than it is already absent adding a 4th register write
>>> port and likely imposing a 256-bit alignment requirement; for a task
>>> that is mostly limited by L1 cache misses...
>>>
>>>
>>> Like, saving registers would be ~ 40 cycles or so (with another ~ 40
>>> to restore them), saving/restoring 2 registers per cycle with GPRs,
>>> if not for all the L1 misses.
>>>
>>> Reason it is not similar for normal function calls (besides these
>>> saving/restoring normal registers), is because often the stack is
>>> still "warm" in the L1 cache.
>>>
>>> For interrupts, in the time from one interrupt to another, most of
>>> the L1 cache contents from the previous interrupt are already gone.
>>>
>>>
>>> So, these instruction sequences are around 80% L1 miss penalty, vs
>>> around 5% for normal prologs/epilogs.
>>>
>>> This is similar for the inner loops for "memcpy()", which average
>>> roughly 90% L1 miss penalty.
>>>
>>>
>>>
>>> And, say, "memcpy()" averages around 300MB/sec if just copying the
>>> same small buffer over and over again, but then quickly drops to
>>> 70MB/sec if copying memory that falls outside the L1 cache.
>>>
>>> Though, comparably, it seems that the drop-off from L2 cache to DRAM
>>> is currently a little smaller.
>>>
>>> So, the external DRAM interface can push ~ 100MB/sec with the current
>>> interface (supports SWAP operations, moving 512 bits at a time, and
>>> using a sequence number to transition from one request to another).
>>>
>>> But, it is around 70MB/s for requests to make it around the ringbus.
>>>
>>>
>>> Though, I have noted that if things stay within the limits of what
>>> fits in the L2 cache, multiple parties can access the L2 cache at the
>>> same time without too much impact on each other.
>>>
>>> So, say, a modest resolutions, the screen refresh does not impact the
>>> CPU, and the rasterizer module is also mostly independent.
>>>
>>>
>>>
>>> Still, about the highest screen resolution it can really sustain
>>> effectively is ~ 640x480 256-color, or ~ 18MB/sec.
>>>
>>> This may be more timing related though, since for screen refresh
>>> there is a relatively tight deadline between when the requests start
>>> being sent, and when the L2 cache needs to hit for that request, and
>>> failing this will result in graphical glitches.
>>>
>>> Though, generally what it means is, if the framebuffer image isn't in
>>> the L2 cache, it is gonna look like crap; and effectively the limit
>>> is more "how big of a framebuffer can I fit in the L2 cache".
>>>
>>> On the XC7A200T, I can afford a 512K L2 cache, which is just so big
>>> enough to fit 640x400 or 640x480 (but 800x600 is kinda pushing it,
>>> and fights a bit more with the main CPU).
>>>
>>> OTOH, it is likely the case than on the XC7A100T (which can only
>>> afford a 256K L2 cache), that 640x400 256-color is pushing it (but
>>> color cell mode still works fine).
>>>
>>> Had noted though that trying to set the screen resolution at one
>>> point to 800x600 RGB555 (72 Hz), pulls around 70MB/sec, and basically
>>> was almost entirely broken and seemingly bogged down the CPU (which
>>> could no longer access memory in a timely manner).
>>>
>>> Also seemingly stuff running on the CPU can effect screen artifacts
>>> in these modes, presumably by knocking stuff out of the L2 cache.
>>>
>>>
>>>
>>> Also, it seems like despite my ringbus being a lot faster than my
>>> original bus, it has still managed to become an issue due to latency.
>>>
>>> But, despite this, on average, things like interlocks and branch-miss
>>> penalties and similar are now still weighing in a fair bit as well
>>> (with interlock penalties closely following cache misses as the main
>>> source of pipeline stalls).
>>>
>>> Well, and these two combined burning around 30% of the total
>>> clock-cycles, with another ~ 2-3% or so being spent on branches, ...
>>>
>>>
>>> Well, and my recent effort to try to improve FPGA timing enough try
>>> to get it up to 75MHz, did have the drawback of "in general"
>>> increasing the number of cycles spent on interlocks (but, returning a
>>> lot of the instructions to their original latency values, would make
>>> the FPGA timing-constraints issues a bit worse).
>>>
>>> But, if I could entirely eliminate these sources of latency, this
>>> would only gain ~30%, and at this point would either need to somehow
>>> increase the average bundle with, or find ways to reduce the total
>>> number of instructions that need to be executed (both of these being
>>> more compiler-related territory).
>>>
>>>
>>> Though, OTOH, I have noted that in many cases I am beating RISC-V
>>> (RV64IM) in terms of total ".text" size in XG2 mode (only 32/64/96
>>> bit encodings) when both are using the same C library, which implies
>>> that I am probably "not doing too badly" on this front either
>>> (though, ideally, I would be "more consistently" beating RISC-V at
>>> this metric, *1).
>>>
>>>
>>> *1: RV64 "-Os -ffunction-sections -Wl,-gc-sections" is still able to
>>> beat XG2 in terms of having smaller ".text" (though, "-O2" and "-O3"
>>> are bigger; BJX2 Baseline does beat RV64IM, but this is not a fair
>>> test as BJX2 Baseline has 16-bit ops).
>>>
>>> BGBCC also has an "/Os" option, but it seems to have very little
>>> effect in XG2 Mode (it mostly does things to try to increase the
>>> number of 16-bit ops used, which is N/A in XG2).
>>>
>>> Where, here, one can use ".text" size as a stand-in for total
>>> instruction count (and by extension, the number of instructions that
>>> need to be executed).
>>>
>>> Though, in some past tests, it seemed like RISC-V needed to execute a
>>> larger number of instructions to render each frame in Doom, which
>>> doesn't really follow if both have a roughly similar number of
>>> instructions in the emitted binaries (and if both are essentially
>>> running the same code).
>>>
>>> So, something seems curious here...
>>>
>>>
>>> ...
>>>
>>>
>> For the Q+ MPU and SOC the bus system is organized like a tree with
>> the root being at the CPU. The system bus operates with asynchronous
>> transactions. The bus then fans out through bus bridges to various
>> system components. Responses coming back from devices are buffered and
>> merged together onto a more common bus when there are open spaces on
>> the bus. I think it is fairly fast (well, at least for homebrew FPGA).
>> Bus accesses are single cycle, but they may have a varying amount of
>> latency. Writes are “posted” so they are essentially single cycle.
>> Reads percolate back up the tree to the CPU. It operates at the CPU
>> clock rate (currently 40MHz) and transfers 128 bits at a time. Maximum
>> peak transfer rate would then be 640 MB/s. Copying memory is bound to
>> be much slower due to the read latency. Devices on the bus have a
>> configuration block which looks something like a PCI config block, so
>> device addressing may be controlled by the OS.
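>>
>> (The 640 MB/s being 40,000,000 cycles/sec * 16 B = 640,000,000 B/s.)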
>>
>
> My original bus was fairly slow:
> A request is put on the bus and, as it propagates, each layer of the
> bus holds the request until it reaches the destination, which sends
> back an OK signal that returns back up the bus to the sender. The
> sender then switches to sending an IDLE signal, the whole process
> repeats as the bus "tears down", and when that is done, the OK signal
> switches to READY, and the bus may then accept another request.
>
> This bus could only handle a single active request at a time, and no
> further requests could initiate (anywhere) until the prior request had
> finished.
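>
> A minimal C model of that handshake (toy code; "device_step" stands in
> for everything on the far side of the bus):
>
>   enum ok_t { READY, OK };
>   static enum ok_t ok_sig = READY;
>   static int bus_opm = 0;              /* 0 = IDLE */
>
>   static void device_step(void) {      /* far end watches OPM */
>       ok_sig = (bus_opm != 0) ? OK : READY;
>   }
>
>   static void send_request(int opm) {
>       bus_opm = opm;                   /* put request on the bus */
>       do { device_step(); } while (ok_sig != OK);
>       bus_opm = 0;                     /* switch back to IDLE    */
>       do { device_step(); } while (ok_sig != READY);
>       /* only now can the next request start, anywhere */
>   }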
>
>
> Experimentally, I was hard-pressed to get much over about 6MB/sec over
> this bus with 128-bit transfers... (but could get it up to around
> 16MB/sec with 256-bit SWAP messages). As noted, this kinda sucked...
>
>
>
> I then replaced this with a ring-bus:
> Every node on the ring passes messages from input to output, and is
> able to drop messages onto the bus, or remove/replace messages as
> appropriate. If a message is not handled immediately, it circles the
> ring until it can be handled.
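>
> Roughly, in C terms (a sketch, not the actual logic; "slot" and
> "handle" are invented here):
>
>   typedef struct { int valid, dest, payload; } slot;
>
>   static void handle(slot m) { (void)m; }  /* consume the message */
>
>   /* one ring hop: consume what is addressed to us, else forward;
>      inject a new message only if the slot is (now) empty */
>   static slot node_step(int self, slot in, slot *inject) {
>       if (in.valid && in.dest == self) { handle(in); in.valid = 0; }
>       if (!in.valid && inject->valid)  { in = *inject; inject->valid = 0; }
>       return in;                       /* forward to the next node */
>   }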
>
> This bus was considerably faster, but still seems to suffer from latency
> issues.
>
> In this case, the latency of the ring bus was higher than that of the
> original bus, but it had the advantage that the L1 cache could
> effectively drop 4 consecutive requests onto the bus and then (in
> theory) have them all handled within a single trip around the ring.
>
>
>
> Theoretically, the bus could move 800MB/sec at 50MHz, but practically
> seems to achieve around 70MB/s (which is in turn affected by things
> that affect ring latency, like enabling/disabling various "shortcut
> paths" or enabling/disabling the second CPU core).
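>
> (The 800MB/sec figure being one 128-bit message per cycle:
> 50,000,000 * 16 B = 800,000,000 B/s.)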
>
> A point-to-point message-passing bus could be possible, and could have
> lower latency, but was not done mostly because it seemed more
> complicated and expensive than the ring design.
>
>
> If one has two endpoints, both can achieve around 70MB/s if L2 hits, but
> this drops off if the external RAM accesses become the limiting factor.
>
>
> The RAM interface is using a modified version of the original bus,
> where both the OPM and OK signals were augmented with sequence
> numbers: when the sequence number sent via OPM comes back on the OK
> signal, one can immediately move to the next request (incrementing the
> sequence number).
>
> While this interface still only allows a single request at a time, this
> change effectively doubles the throughput. The main reason for using
> this interface to talk to external RAM, is that the interface works
> across clock-domain crossings (as-is, the ring-bus requests can't
> survive a clock-domain crossing).
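>
> In rough C terms, the sender side becomes something like (the names
> and the 4-bit wrap are invented; the stub stands in for the RAM side):
>
>   static unsigned seq = 0;
>   static unsigned ok_seq_echo = 0;     /* seq echoed back on OK */
>
>   static void drive_bus(int opm, unsigned s) {
>       (void)opm; ok_seq_echo = s;      /* stub: RAM accepts at once */
>   }
>
>   static void send_request_seq(int opm) {
>       drive_bus(opm, seq);
>       while (ok_seq_echo != seq) ;     /* wait for the echo...      */
>       seq = (seq + 1) & 15;            /* ...then go again, with no
>                                           IDLE/READY teardown phase */
>   }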
>
>
> Most of the MMIO devices are still operating on a narrower version of
> the original bus, say:
>   5b: OPM
>   28b: Addr
>   64b: DataIn
>   64b: DataOut
>   2b: OK
>
> Where, OPM:
>   00-000: IDLE
>   00-zzz: Special Command (if zzz!=000)
>
>   01-010: Load DWORD (MMIO)
>   01-011: Load QWORD (MMIO)
>   01-111: Load TILE (RAM, Old)
>
>   10-010: Store DWORD (MMIO)
>   10-011: Store QWORD (MMIO)
>   10-111: Store TILE (RAM, Old)
>
>   11-010: Swap DWORD (MMIO, Unused)
>   11-011: Swap QWORD (MMIO, Unused)
>   11-111: Swap TILE (RAM, Old)
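>
> In C terms, the 5-bit encodings above are (just a transcription of
> the table, with made-up names):
>
>   enum opm5 {
>       OPM_IDLE       = 0x00,  /* 00-000 */
>       OPM_LD_DWORD   = 0x0A,  /* 01-010 Load DWORD (MMIO)  */
>       OPM_LD_QWORD   = 0x0B,  /* 01-011 Load QWORD (MMIO)  */
>       OPM_LD_TILE    = 0x0F,  /* 01-111 Load TILE (RAM)    */
>       OPM_ST_DWORD   = 0x12,  /* 10-010 Store DWORD (MMIO) */
>       OPM_ST_QWORD   = 0x13,  /* 10-011 Store QWORD (MMIO) */
>       OPM_ST_TILE    = 0x17,  /* 10-111 Store TILE (RAM)   */
>       OPM_SWAP_DWORD = 0x1A,  /* 11-010 Swap DWORD (MMIO)  */
>       OPM_SWAP_QWORD = 0x1B,  /* 11-011 Swap QWORD (MMIO)  */
>       OPM_SWAP_TILE  = 0x1F   /* 11-111 Swap TILE (RAM)    */
>   };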
>
> The ring-bus went over to an 8-bit OPM format, which increases the range
> of messages that can be sent.
>
>
> One advantage of the old bus is that the device-side logic is fairly
> simple. Typically, the OPM/Addr/Data signals would be mirrored to all of
> the devices, with each device having its own OK and DataOut signal.
>
> A sort of crossbar existed, where whichever device sets its OK value to
> something other than READY has its OK and Data signals passed back up
> the bus.
>
> Also, this works because MMIO only allows a single active request at a
> time (and the MMIO bus interface on the ringbus will effectively
> serialize all accesses into the MMIO space on a "first come, first
> served" basis).
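>
> The return path then amounts to a wide mux, roughly (NDEV and the
> dev_* arrays are stand-ins for the per-device signals):
>
>   #define NDEV 4
>   enum { MMIO_READY = 0 };
>   static int dev_ok[NDEV];
>   static unsigned long long dev_data[NDEV];
>
>   /* whichever device is not READY gets its OK/Data passed back up */
>   static unsigned long long mux_back(int *ok_out) {
>       int ok = MMIO_READY; unsigned long long data = 0;
>       for (int i = 0; i < NDEV; i++)
>           if (dev_ok[i] != MMIO_READY) { ok = dev_ok[i]; data = dev_data[i]; }
>       *ok_out = ok;
>       return data;
>   }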
>
>
> Note that accessing MMIO is comparably slow.
> Some devices, like the display / VRAM module, have been partly moved
> over to the ringbus (with the screen's frame-buffer mapped into RAM),
> but still use the MMIO interface for access to display control
> registers and similar.
>
>
> The SDcard interface still goes over MMIO, but ended up being modified
> to allow sending/receiving 8 bytes at a time over SPI (with 8-bit
> transfers, accessing the MMIO bus was a bigger source of latency than
> actually sending bytes over SPI at 5MHz).
>
> As-is, I am running the SDcard at 12.5 MHz:
>   16.7MHz and 25MHz did not work reliably;
>   Going over 25MHz was out-of-spec;
>   Even with 8-byte transfers, MMIO access can still become a bottleneck.
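>
> (Raw SPI at 12.5 MHz is only ~ 1.56 MB/sec in any case: an 8-byte
> burst is 64 clocks, ~ 5.1us. The win from the wider transfers is
> paying the MMIO round-trip once per 8 bytes rather than once per
> byte.)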
>
>
> A UHS-II interface could in theory run at similar speeds to RAM, but
> would likely need a different interface to make use of this.
>
>
> One possibility would be to map the SDcard into the physical address
> space as a huge non-volatile RAM-like space (on the ring-bus). Had
> on/off considered this a few times, but didn't get to it.
>
> Effectively, it would require redesigning the whole SDcard and
> filesystem interface (essentially moving nearly all of the SDcard logic
> into hardware).
>
>
>> Multiple devices access the main DRAM memory via a memory controller.
>> Several devices that are bus masters have their own ports to the
>> memory controller and do not use up time on the main system bus tree.
>> The frame buffer has a streaming data port. The frame buffer streaming
>> cache is 8kB and loaded in 1kB strips at 800MB/s from the DRAM IIRC.
>> Other devices share a system cache which is only 16kB due to the
>> limited number of block RAMs. There are about a half dozen read ports,
>> so the block RAMs are replicated. With all the ports accessing
>> simultaneously there could be 8*40*16 MB/s being transferred, or about
>> 5.1 GB/s for reads.
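>>
>> (That is, 8 ports * 40MHz * 16 B = 5,120 MB/s, or ~ 5.1 GB/s.)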
>>
>
> I had put everything on the ring-bus, with the L2 also serving as the
> bridge to access external DRAM (via a direct connection to the DDR
> interface module).
>
>
>
>> The CPU itself has only L1 caches of 8kB I$ and 16kB D$. The D$ can be
>> dual ported, but is not configured that way ATM due to resource
>> limitations. The caches will request data in blocks the size of a
>> cache line. A cache line is broken into four consecutive 128-bit
>> accesses. So, data comes back from the boot ROM in a burst at 640 MB/s.
>>
>
> In my case:
>   L1 I$: 16K or 32K
>     32K helps notably with GLQuake and similar.
>     Doom works well with 16K.
>   L1 D$: 16K or 32K
>     Mostly 32K works well.
>     Had tried 64K, but bad for timing, and little effect on performance.
>
> IIRC, had evaluated running the CPU at 25MHz with 128K L1 caches and a
> small L2 cache, but modeling this had shown that performance would suck
> (even if nearly all of the instructions had a 1-cycle latency).
>
>
>
>> IIRC there were no display issues with an 800x600x16 bpp display, but
>> I could not get Thor to do much more than clear the screen. So, it was
>> a display of random dots that was stable. There is a separate text
>> display controller with its own dedicated block RAM for displays.
>>
>
> My display module is a little weird, as it was based around a
> cell-oriented design:
>   Cells are typically 128 or 256 bits, representing 8x8 pixels.
>
> Text and 2bpp color-cell modes use 128-bit cells, say:
>   ( 29: 0): Pair of 15-bit colors;
>   ( 31:30): 10
>   ( 61:32): Misc
>   ( 63:62): 00
>   (127:64): Pixel bits, 8x8x1 bit, raster order
>
> The 4bpp color-cell mode is more like:
>   ( 29:  0): Colors A/B
>   ( 31: 30): 11
>   ( 61: 32): Colors C/D
>   ( 63: 62): 11
>   ( 93: 64): Colors E/F
>   ( 95: 94): 00
>   (125: 96): Colors G/H
>   (127:126): 00
>   (159:128): Pixels A/B (4x4x2)
>   (191:160): Pixels C/D (4x4x2)
>   (223:192): Pixels E/F (4x4x2)
>   (255:224): Pixels G/H (4x4x2)
>
> In the bitmapped modes:
>   128-bit cell selects 256-color modes (4x4 pixels)
>   256-bit cell selects hi-color modes (4x4 pixels)
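>
> Decoding the 128-bit text/color-cell layout above in C would look
> roughly like (names invented; assumes the color pair packs the first
> color in the low bits):
>
>   /* cell held as two 64-bit halves: lo = bits 63:0, hi = 127:64 */
>   typedef struct { unsigned long long lo, hi; } cell128;
>
>   static unsigned colA(cell128 c) { return  c.lo        & 0x7FFF; } /* 14: 0 */
>   static unsigned colB(cell128 c) { return (c.lo >> 15) & 0x7FFF; } /* 29:15 */
>   static unsigned tag (cell128 c) { return (c.lo >> 30) & 3; }      /* 31:30 */
>
>   /* (127:64): 8x8x1 pixel bits, raster order */
>   static int pix(cell128 c, int x, int y) {
>       return (int)((c.hi >> (y*8 + x)) & 1);
>   }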
>
>
> So:
>   640x400 would be configured as 160x100 cells.
>   800x600 would be configured as 200x150 cells.
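>
> (The sizes line up: 160 * 100 = 16,000 cells at 16 B each = 256,000 B,
> exactly 640x400 at 1 byte/pixel.)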
>
> The 800x600 256-color mode held up OK when I had the display module
> outputting at a non-standard 36Hz refresh, but increasing this to a more
> standard 72Hz blows out the memory bandwidth.
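>
> (At 1 byte/pixel: 800 * 600 * 36 = ~ 17.3 MB/sec, about what has
> proven sustainable; doubling to 72Hz gives ~ 34.6 MB/sec.)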
>
>
> Theoretically, the DDR RAM interface could support these resolutions if
> all the timing and latency were good. But, not so good when it is
> implemented by the display module hammering out a series of prefetch
> requests over the ring-bus just ahead of the current raster position.
>
> Though, the cell-oriented display modes still work better than my
> attempt at a linear framebuffer mode (due to cache/timing issues, not
> even a 320x200 linear framebuffer mode worked without looking like a
> broken mess).
>
>
> I suspect this is because, with the cell-oriented modes, each cell has 4
> or 8 chances for the prefetch to succeed before it actually gets drawn,
> whereas in the linear raster mode, there is only 1 chance.
>
> It is likely that a linear framebuffer would require two stages:
> Prefetch 1: Somewhat ahead of the current raster position, hopefully
> gets data into L2;
> Prefetch 2: Closer to the raster position, intended to actually fetch
> the pixel data.
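>
> Sketched in C, per refresh step (the distances and helper names here
> are made up):
>
>   extern void prefetch_l2(unsigned addr);    /* hypothetical helpers */
>   extern void fetch_pixels(unsigned addr);
>
>   #define DIST1 1024  /* stage 1 distance: well ahead, warms the L2 */
>   #define DIST2   64  /* stage 2 distance: close in, fetches pixels */
>
>   static void refresh_step(unsigned addr) {
>       prefetch_l2(addr + DIST1);   /* hopefully hits before stage 2 */
>       fetch_pixels(addr + DIST2);
>   }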
>
> Prefetches are used here rather than actual loads, mostly because these
> will get cleaned up quickly, whereas with actual fetches, a back-log
> scenario would result in the whole bus getting clogged up with
> unresolved requests.
>
>
> However, the CPU can use normal loads, since the CPU will patiently wait
> for the previous request(s) to finish before doing anything else (and
> thus avoids flooding the ring-bus with requests).
>
> However, a downside of prefetches is that one has to keep asking the
> L2 cache each time whether or not it has the data in question yet.
>
>
>
>
> As for the "BJX2 doesn't always generate smaller .text than RISC-V"
> issue: went looking at the ASM, and noted there is a big difference:
> GCC "-Os" generates very tight and efficient code, but needs to work
> within the limits of what the ISA provides;
> BGBCC has a bit more to work with, but the relative quality of the
> generated code is fairly poor in comparison.
>
>
> Like, say:
>   MOV.Q R8, (SP, 40)
>   .lbl:
>   MOV.Q (SP, 40), R8
> //BGBCC: "Sure why not?..."
>   ...
>   MOV R2, R9
>   MOV R9, R2
>   BRA .lbl
> //BGBCC: "Seems fine to me..."
>
> So, I look at the ASM, and once again groan at how crappy a lot of it
> is.
>
>
> Or:
>   if(!ptr)
>     ...
> BGBCC was failing to go down the logic path that would have allowed it
> to use the BREQ/BRNE instructions (so it was always producing a two-op
> sequence).
>
> Have noticed that code that writes, say:
>   if(ptr==NULL)
>     ...
> Ends up using a 3-instruction sequence, because it doesn't recognize
> this pattern as being the same as the "!ptr" case, ...
>
> Did at least find a few more "low hanging fruit" cases that shaved a few
> more kB off the binary.
>
> Well, and also added a case to partially optimize:
>   return(bar());
> To merge the 3AC "RET" into the "CSRV" operation, and thus save the use
> of a temporary (and roughly two otherwise unnecessary MOV instructions
> whenever this happens).
>
>
>
> But, ironically, it was still "mostly" generating code with fewer
> instructions, despite the still relatively weak code generation at times.
>
>
> Also it seems:
>   void foo()
>   {
>      //does nothing
>   }
>   void bar()
>   {
>     ...
>     foo();
>     ...
>   }
>
> GCC seems to be clever enough to realize that "foo()" does nothing, and
> will eliminate the function and function call entirely.
>
> BGBCC has no such optimization.
>
> ...
>
>
Finally got a synthesis for a complete Q+ system done. Turns out to be
about 10% too large for the XC7A200 :) It should easily fit in the next
larger part. Scratching my head wondering how to reduce the size while
not losing too much functionality. I could go with just the CPU and a
serial port, remove the frame buffer, sprites, etc.


Re: Tonight's tradeoff

<ukt9hb$1d4fp$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35542&group=comp.arch#35542

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Thu, 7 Dec 2023 14:22:01 -0600
Message-ID: <ukt9hb$1d4fp$1@dont-email.me>
In-Reply-To: <uksqek$1aotb$1@dont-email.me>
 by: BGB - Thu, 7 Dec 2023 20:22 UTC
On 12/7/2023 10:04 AM, Robert Finch wrote:
> [... full quote of the parent article snipped ...]
Re: Tonight's tradeoff

<ukth60$1e6ab$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35547&group=comp.arch#35547

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Tonight's tradeoff
Date: Thu, 7 Dec 2023 17:32:32 -0500
Message-ID: <ukth60$1e6ab$1@dont-email.me>
In-Reply-To: <ukt9hb$1d4fp$1@dont-email.me>
 by: Robert Finch - Thu, 7 Dec 2023 22:32 UTC

On 2023-12-07 3:22 p.m., BGB wrote:
> On 12/7/2023 10:04 AM, Robert Finch wrote:
>> On 2023-12-05 9:42 p.m., BGB wrote:
>>> On 12/5/2023 4:04 PM, Robert Finch wrote:
>>>> On 2023-12-05 1:59 a.m., BGB wrote:
>>>>> On 12/4/2023 11:07 PM, Robert Finch wrote:
>>>>>> On 2023-12-03 7:58 p.m., BGB wrote:
>>>>>>> On 12/3/2023 10:58 AM, MitchAlsup wrote:
>>>>>>>> EricP wrote:
>>>>>>>>
>>>>>>>>> Robert Finch wrote:
>>>>>>>>>> Figured it out. Each architectural register in the RAT must
>>>>>>>>>> refer to N physical registers, where N is the number of banks.
>>>>>>>>>> Setting N to 4 results in a RAT that is only about 50% larger
>>>>>>>>>> than one supporting only a single bank. The operating mode is
>>>>>>>>>> used to select the physical register. The first eight
>>>>>>>>>> registers are shared between all operating modes so arguments
>>>>>>>>>> can be passed to syscalls. It is tempting to have eight banks
>>>>>>>>>> of registers, one for each hardware interrupt level.
>>>>>>>>
>>>>>>>>> A consequence of multiple architecture register banks is each
>>>>>>>>> extra
>>>>>>>>> bank keeps a set of mostly unused physical register attached to
>>>>>>>>> them.
>>>>>>>>
>>>>>>>> A waste.....
>>>>>>>>
>>>>>>>>> For example, if there are 2 modes User and Super and a bank for
>>>>>>>>> each,
>>>>>>>>> since User and Super are mutually exclusive,
>>>>>>>>> 64 of your 256 physical registers will be sitting unused tied
>>>>>>>>> to the other mode bank, so max of 75% utilization efficiency.
>>>>>>>>
>>>>>>>>> If you have 8 register banks then only 3/10 of the physical
>>>>>>>>> registers
>>>>>>>>> are available to use, the other 7/10 are sitting idle attached to
>>>>>>>>> arch registers in other modes consuming power.
>>>>>>>>
>>>>>>>>> Also you don't have to play overlapped-register-bank games to pass
>>>>>>>>> args to/from syscalls. You can have specific instructions that
>>>>>>>>> reach
>>>>>>>>> into other banks: Move To User Reg, Move From User Reg.
>>>>>>>>> Since only syscall passes args into the OS you only need to access
>>>>>>>>> the user mode bank from the OS kernel bank.
>>>>>>>>
>>>>>>>> Whereas: Exceptions, interrupts save and restore 32-registers::
>>>>>>>> A SysCall in My 66000 only saves and restores 24 of the 32
>>>>>>>> registers.
>>>>>>>> So when control arrives, there are 8 argument registers from the
>>>>>>>> Caller and 24 registers from Guest OS already loaded. So, SysCall
>>>>>>>> handler already has its stack, and a variety of pointers to data
>>>>>>>> structures it is interested in.
>>>>>>>>
>>>>>>>> On the way back, RET only restores 24 registers so Guest OS can
>>>>>>>> pass
>>>>>>>> back as many as 8 result registers.
>>>>>>>
>>>>>>> I had handled it by saving/restoring 64 of the 64 registers...
>>>>>>> For syscalls, it basically messes with the registers in the
>>>>>>> captured register state for the calling task.
>>>>>>>
>>>>>>> A newer change involves saving/restoring registers more directly
>>>>>>> to/from the task context for syscalls, which reduces the
>>>>>>> task-switch overhead by around 50% (but is mostly N/A for other
>>>>>>> kinds of interrupts).
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>> I am toying with the idea of adding context save and restore
>>>>>> instructions. I would try to get them to work on a cache-line
>>>>>> worth of data, four registers accessed for read or write at the
>>>>>> same time. Context save / restore would be a macro instruction
>>>>>> made up of sixteen individual instructions each of which saves or
>>>>>> restores four registers. It is a bit of a hoop to jump through for
>>>>>> an infrequently used operation. However, it is good to have to
>>>>>> clean context switch code.
>>>>>>
>>>>>> Added the REGS instruction modifier. The modifier causes the
>>>>>> following load or store instruction to repeat using the registers
>>>>>> specified in the register list bitmask for the source or target
>>>>>> register. In theory it can also be applied to other instructions
>>>>>> but that was not the intent. It is pretty much useless for other
>>>>>> instructions, but a register list could be supplied to the MOV
>>>>>> instruction to zero out multiple registers with a single
>>>>>> instruction. Or possibly the ADDI instruction could be used to
>>>>>> load a constant into multiple registers. I could put code in to
>>>>>> disable REGS use with anything other than load and store ops, but
>>>>>> why add extra hardware?
>>>>>>
>>>>>
>>>>> In my case, it is partly a limitation of not really being able to
>>>>> make it wider than it is already absent adding a 4th register write
>>>>> port and likely imposing a 256-bit alignment requirement; for a
>>>>> task that is mostly limited by L1 cache misses...
>>>>>
>>>>>
>>>>> Like, saving registers would be ~ 40 cycles or so (with another ~
>>>>> 40 to restore them), saving/restoring 2 registers per cycle with
>>>>> GPRs, if not for all the L1 misses.
>>>>>
>>>>> Reason it is not similar for normal function calls (besides these
>>>>> saving/restoring normal registers), is because often the stack is
>>>>> still "warm" in the L1 cache.
>>>>>
>>>>> For interrupts, in the time from one interrupt to another, most of
>>>>> the L1 cache contents from the previous interrupt are already gone.
>>>>>
>>>>>
>>>>> So, these instruction sequences are around 80% L1 miss penalty, vs
>>>>> around 5% for normal prologs/epilogs.
>>>>>
>>>>> This is similar for the inner loops for "memcpy()", which average
>>>>> roughly 90% L1 miss penalty.
>>>>>
>>>>>
>>>>>
>>>>> And, say, "memcpy()" averages around 300MB/sec if just copying the
>>>>> same small buffer over and over again, but then quickly drops to
>>>>> 70MB/sec if copying memory that falls outside the L1 cache.
>>>>>
>>>>> Though, comparably, it seems that the drop-off from L2 cache to
>>>>> DRAM is currently a little smaller.
>>>>>
>>>>> So, the external DRAM interface can push ~ 100MB/sec with the
>>>>> current interface (supports SWAP operations, moving 512 bits at a
>>>>> time, and using a sequence number to transition from one request to
>>>>> another).
>>>>>
>>>>> But, it is around 70MB/s for requests to make it around the ringbus.
>>>>>
>>>>>
>>>>> Though, I have noted that if things stay within the limits of what
>>>>> fits in the L2 cache, multiple parties can access the L2 cache at
>>>>> the same time without too much impact on each other.
>>>>>
>>>>> So, say, a modest resolutions, the screen refresh does not impact
>>>>> the CPU, and the rasterizer module is also mostly independent.
>>>>>
>>>>>
>>>>>
>>>>> Still, about the highest screen resolution it can really sustain
>>>>> effectively is ~ 640x480 256-color, or ~ 18MB/sec.
>>>>>
>>>>> This may be more timing related though, since for screen refresh
>>>>> there is a relatively tight deadline between when the requests
>>>>> start being sent, and when the L2 cache needs to hit for that
>>>>> request, and failing this will result in graphical glitches.
>>>>>
>>>>> Though, generally what it means is, if the framebuffer image isn't
>>>>> in the L2 cache, it is gonna look like crap; and effectively the
>>>>> limit is more "how big of a framebuffer can I fit in the L2 cache".
>>>>>
>>>>> On the XC7A200T, I can afford a 512K L2 cache, which is just so big
>>>>> enough to fit 640x400 or 640x480 (but 800x600 is kinda pushing it,
>>>>> and fights a bit more with the main CPU).
>>>>>
>>>>> OTOH, it is likely the case than on the XC7A100T (which can only
>>>>> afford a 256K L2 cache), that 640x400 256-color is pushing it (but
>>>>> color cell mode still works fine).
>>>>>
>>>>> Had noted though that trying to set the screen resolution at one
>>>>> point to 800x600 RGB555 (72 Hz), pulls around 70MB/sec, and
>>>>> basically was almost entirely broken and seemingly bogged down the
>>>>> CPU (which could no longer access memory in a timely manner).
>>>>>
>>>>> Also seemingly stuff running on the CPU can effect screen artifacts
>>>>> in these modes, presumably by knocking stuff out of the L2 cache.
>>>>>
>>>>>
>>>>>
>>>>> Also, it seems like despite my ringbus being a lot faster than my
>>>>> original bus, it has still managed to become an issue due to latency.
>>>>>
>>>>> But, despite this, on average, things like interlocks and
>>>>> branch-miss penalties and similar are now still weighing in a fair
>>>>> bit as well (with interlock penalties closely following cache
>>>>> misses as the main source of pipeline stalls).
>>>>>
>>>>> Well, and these two combined burning around 30% of the total
>>>>> clock-cycles, with another ~ 2-3% or so being spent on branches, ...
>>>>>
>>>>>
>>>>> Well, and my recent effort to try to improve FPGA timing enough try
>>>>> to get it up to 75MHz, did have the drawback of "in general"
>>>>> increasing the number of cycles spent on interlocks (but, returning
>>>>> a lot of the instructions to their original latency values, would
>>>>> make the FPGA timing-constraints issues a bit worse).
>>>>>
>>>>> But, if I could entirely eliminate these sources of latency, this
>>>>> would only gain ~30%, and at this point would either need to
>>>>> somehow increase the average bundle with, or find ways to reduce
>>>>> the total number of instructions that need to be executed (both of
>>>>> these being more compiler-related territory).
>>>>>
>>>>>
>>>>> Though, OTOH, I have noted that in many cases I am beating RISC-V
>>>>> (RV64IM) in terms of total ".text" size in XG2 mode (only 32/64/96
>>>>> bit encodings) when both are using the same C library, which
>>>>> implies that I am probably "not doing too badly" on this front
>>>>> either (though, ideally, I would be "more consistently" beating
>>>>> RISC-V at this metric, *1).
>>>>>
>>>>>
>>>>> *1: RV64 "-Os -ffunction-sections -Wl,-gc-sections" is still able
>>>>> to beat XG2 in terms of having smaller ".text" (though, "-O2" and
>>>>> "-O3" are bigger; BJX2 Baseline does beat RV64IM, but this is not a
>>>>> fair test as BJX2 Baseline has 16-bit ops).
>>>>>
>>>>> Though, BGBCC also has an "/Os" option, it seems to have very
>>>>> little effect on XG2 Mode (it mostly does things to try to increase
>>>>> the number of 16-bit ops used, which is N/A in XG2).
>>>>>
>>>>> Where, here, one can use ".text" size as a stand-in for total
>>>>> instruction count (and by extension, the number of instructions
>>>>> that need to be executed).
>>>>>
>>>>> Though, in some past tests, it seemed like RISC-V needed to execute
>>>>> a larger number of instructions to render each frame in Doom, which
>>>>> doesn't really make follow if both have a roughly similar number of
>>>>> instructions in the emitted binaries (and if both are essentially
>>>>> running the same code).
>>>>>
>>>>> So, something seems curious here...
>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>> For the Q+ MPU and SOC the bus system is organized like a tree with
>>>> the root being at the CPU. The system bus operates with asynchronous
>>>> transactions. The bus then fans out through bus bridges to various
>>>> system components. Responses coming back from devices are buffered
>>>> and merge results together into a more common bus when there are
>>>> open spaces in the bus. I think it is fairly fast (well at least for
>>>> homebrew FPGA). Bus accesses are single cycle, but they may have a
>>>> varying amount of latency. Writes are “posted” so they are
>>>> essentially single cycle. Reads percolate back up the tree to the
>>>> CPU. It operates at the CPU clock rate (currently 40MHz) and
>>>> transfers 128-bits at a time. Maximum peak transfer rate would then
>>>> be 640 MB/s. Copying memory is bound to be much slower due to the
>>>> read latency. Devices on the bus have a configuration block which
>>>> looks something like a PCI config block, so devices addressing may
>>>> be controlled by the OS.
>>>>
>>>
>>> My original bus was fairly slow:
>>> Put a request on the bus, as it propagates, each layer of the bus
>>> holds the request, until it reaches the destination, and sends back
>>> an OK signal, which returns back up the bus to the sender, and then
>>> the sender switches to sending an IDLE signal, the whole process
>>> repeats as the bus "tears down", and when it is done, the OK signal
>>> switches to READY, and the bus may then accept another request.
>>>
>>> This bus could only handle a single active request at a time, and no
>>> further requests could initiate (anywhere) until the prior request
>>> had finished.
>>>
>>>
>>> Experimentally, I was hard-pressed getting much over about 6MB/sec
>>> over this bus with 128-bit transfers... (but could get it up to
>>> around 16MB/sec with 256-bit SWAP messages). As noted, this kinda
>>> sucked...
>>>
>>>
>>>
>>> I then replaced this with a ring-bus:
>>> Every object on the node passes messages from input to output, and is
>>> able to drop messages onto the bus, or remove/replace messages as
>>> appropriate. If not handled immediately, they circle the ring until
>>> they can be handled.
>>>
>>> This bus was considerably faster, but still seems to suffer from
>>> latency issues.
>>>
>>> In this case, the latency of the ring bus was higher than the
>>> original bus, but had the advantage that the L1 cache could
>>> effectively drop 4 consecutive requests onto the bus and then (in
>>> theory) they could all be handled within a single trip around the ring.
>>>
>>>
>>>
>>> Theoretically, the bus could move 800MB/sec at 50MHz, but practically
>>> seems to achieve around 70MB/s (which is in-turn effected by things
>>> that effect ring latency, like enabling/disabling various "shortcut
>>> paths" or enabling/disabling the second CPU core).
>>>
>>> A point-to-point message-passing bus could be possible, and could
>>> have lower latency, but was not done mostly because it seemed more
>>> complicated and expensive than the ring design.
>>>
>>>
>>> If one has two endpoints, both can achieve around 70MB/s if L2 hits,
>>> but this drops off if the external RAM accesses become the limiting
>>> factor.
>>>
>>>
>>> The RAM interface is using a modified version of the original bus,
>>> where both the OPM and OK signals were augmented with sequence
>>> numbers, where when the sent sequence number on OPM comes back via
>>> the OK signal, one can immediately move to the next request
>>> (incrementing the sequence number).
>>>
>>> While this interface still only allows a single request at a time,
>>> this change effectively doubles the throughput. The main reason for
>>> using this interface to talk to external RAM, is that the interface
>>> works across clock-domain crossings (as-is, the ring-bus requests
>>> can't survive a clock-domain crossing).
>>>
>>>
>>> Most of the MMIO devices are still operating on a narrower version of
>>> the original bus, say:
>>>    5b: OPM
>>>    28b: Addr
>>>    64b: DataIn
>>>    64b: DataOut
>>>    2b: OK
>>>
>>> Where, OPM:
>>>    00-000: IDLE
>>>    00-zzz: Special Command (if zzz!=000)
>>>
>>>    01-010: Load DWORD (MMIO)
>>>    01-011: Load QWORD (MMIO)
>>>    01-111: Load TILE (RAM, Old)
>>>
>>>    10-010: Store DWORD (MMIO)
>>>    10-011: Store QWORD (MMIO)
>>>    10-111: Store TILE (RAM, Old)
>>>
>>>    11-010: Swap DWORD (MMIO, Unused)
>>>    11-011: Swap QWORD (MMIO, Unused)
>>>    11-111: Swap TILE (RAM, Old)
>>>
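>>> As a sketch, the OPM field splits into a 2-bit class and a 3-bit
>>> kind, which decodes in C roughly as (helper names are made up; the
>>> encodings follow the table above):
>>>
>>>    enum { OPM_IDLE = 0, OPM_LOAD = 1, OPM_STORE = 2, OPM_SWAP = 3 };
>>>
>>>    int opm_class(int opm) { return (opm >> 3) & 3; }  /* top 2 bits */
>>>    int opm_kind(int opm)  { return opm & 7; }         /* low 3 bits */
>>>
>>>    /* DWORD (010) and QWORD (011) kinds go to MMIO; TILE (111) is RAM. */
>>>    int opm_is_mmio(int opm) {
>>>        int k = opm_kind(opm);
>>>        return opm_class(opm) != OPM_IDLE && (k == 2 || k == 3);
>>>    }
>>>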
>>> The ring-bus went over to an 8-bit OPM format, which increases the
>>> range of messages that can be sent.
>>>
>>>
>>> One advantage of the old bus is that the device-side logic is fairly
>>> simple. Typically, the OPM/Addr/Data signals would be mirrored to all
>>> of the devices, with each device having its own OK and DataOut signal.
>>>
>>> A sort of crossbar existed, where whichever device sets its OK value
>>> to something other than READY has its OK and Data signals passed back
>>> up the bus.
>>>
>>> Also it works because MMIO only allows a single active request at a
>>> time (and the MMIO bus interface on the ringbus will effectively
>>> serialize all accesses into the MMIO space on a "first come, first
>>> serve" basis).
>>>
>>>
>>> Note that accessing MMIO is comparatively slow.
>>> Some devices, like the display / VRAM module, have been partly moved
>>> over to the ringbus (with the screen's frame-buffer mapped into RAM),
>>> but still uses the MMIO interface for access to display control
>>> registers and similar.
>>>
>>>
>>> The SDcard interface still goes over MMIO, but ended up being
>>> modified to allow sending/receiving 8 bytes at a time over SPI (with
>>> 8-bit transfers, accessing the MMIO bus was a bigger source of
>>> latency than actually sending bytes over SPI at 5MHz).
>>>
>>> As-is, I am running the SDcard at 12.5 MHz:
>>>    16.7MHz and 25MHz did not work reliably;
>>>    Going over 25MHz was out-of-spec;
>>>    Even with 8-byte transfers, MMIO access can still become a
>>> bottleneck.
>>>
>>>
>>> A UHS-II interface could in theory run at similar speeds to RAM, but
>>> would likely need a different interface to make use of this.
>>>
>>>
>>> One possibility would be to map the SDcard into the physical address
>>> space as a huge non-volatile RAM-like space (on the ring-bus). Had
>>> considered this on and off a few times, but didn't get to it.
>>>
>>> Effectively, it would require redesigning the whole SDcard and
>>> filesystem interface (essentially moving nearly all of the SDcard
>>> logic into hardware).
>>>
>>>
>>>> Multiple devices access the main DRAM memory via a memory
>>>> controller. Several devices that are bus masters have their own
>>>> ports to the memory controller and do not use up time on the main
>>>> system bus tree. The frame buffer has a streaming data port. The
>>>> frame buffer streaming cache is 8kB and loaded in 1kB strips at
>>>> 800MB/s from the DRAM IIRC. Other devices share a system cache which
>>>> is only 16kB due to limited number block RAMs. There are about a
>>>> half dozen read ports, so the block RAMs are replicated. With all
>>>> the ports accessing simultaneously there could be 8*40*16 MB/s being
>>>> transferred, or about 5.1 GB/s for reads.
>>>>
>>>
>>> I had put everything on the ring-bus, with the L2 also serving as the
>>> bridge to access external DRAM (via a direct connection to the DDR
>>> interface module).
>>>
>>>
>>>
>>>> The CPU itself has only L1 caches of 8kB I$ and 16kB D$. The D$ can
>>>> be dual ported, but is not configured that way ATM due to resource
>>>> limitations. The caches will request data in blocks the size of a
>>>> cache line. A cache line is broken into four consecutive 128-bit
>>>> accesses. So, data comes back from the boot ROM in a burst at 640 MB/s.
>>>>
>>>
>>> In my case:
>>>    L1 I$: 16K or 32K
>>>      32K helps notably with GLQuake and similar.
>>>      Doom works well with 16K.
>>>    L1 D$: 16K or 32K
>>>      Mostly 32K works well.
>>>      Had tried 64K, but bad for timing, and little effect on
>>> performance.
>>>
>>> IIRC, had evaluated running the CPU at 25MHz with 128K L1 caches and
>>> a small L2 cache, but modeling this had showed that performance would
>>> suck (even if nearly all of the instructions had a 1-cycle latency).
>>>
>>>
>>>
>>>> IIRC there were no display issues with an 800x600x16 bpp display,
>>>> but I could not get Thor to do much more than clear the screen. So,
>>>> it was a display of random dots that was stable. There is a separate
>>>> text display controller with its own dedicated block RAM for displays.
>>>>
>>>
>>> My display module is a little weird, as it was based around a
>>> cell-oriented design:
>>>    Cells are typically 128 or 256 bits, representing 8x8 pixels.
>>>
>>> Text and 2bpp color-cell modes use 128-bit cells, say:
>>>    ( 29: 0): Pair of 15-bit colors;
>>>    ( 31:30): 10
>>>    ( 61:32): Misc
>>>    ( 63:62): 00
>>>    (127:64): Pixel bits, 8x8x1 bit, raster order
>>>
>>> The 4bpp color-cell mode is more like:
>>>    ( 29:  0): Colors A/B
>>>    ( 31: 30): 11
>>>    ( 61: 32): Colors C/D
>>>    ( 63: 62): 11
>>>    ( 93: 64): Colors E/F
>>>    ( 95: 94): 00
>>>    (125: 96): Colors G/H
>>>    (127:126): 00
>>>    (159:128): Pixels A/B (4x4x2)
>>>    (191:160): Pixels C/D (4x4x2)
>>>    (223:192): Pixels E/F (4x4x2)
>>>    (255:224): Pixels G/H (4x4x2)
>>>
>>> In the bitmapped modes:
>>>    128-bit cell selects 256-color modes (4x4 pixels)
>>>    256-bit cell selects hi-color modes (4x4 pixels)
>>>
>>>
>>> So:
>>>    640x400 would be configured as 160x100 cells.
>>>    800x600 would be configured as 200x150 cells.
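>>>
>>> A rough C sketch of expanding the 128-bit text/2bpp-style cell above
>>> into pixels (a software model, not the actual hardware; cell_lo is
>>> bits 63:0, cell_hi is bits 127:64):
>>>
>>>    #include <stdint.h>
>>>
>>>    void expand_cell(uint64_t cell_lo, uint64_t cell_hi,
>>>                     uint16_t dst[8][8]) {
>>>        uint16_t colA = cell_lo & 0x7FFF;          /* bits 14: 0 */
>>>        uint16_t colB = (cell_lo >> 15) & 0x7FFF;  /* bits 29:15 */
>>>        for (int y = 0; y < 8; y++)
>>>            for (int x = 0; x < 8; x++) {
>>>                int bit = (cell_hi >> (y * 8 + x)) & 1; /* raster order */
>>>                dst[y][x] = bit ? colB : colA;
>>>            }
>>>    }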
>>>
>>> The 800x600 256-color mode held up OK when I had the display module
>>> outputting at a non-standard 36Hz refresh, but increasing this to a
>>> more standard 72Hz blows out the memory bandwidth.
>>>
>>>
>>> Theoretically, the DDR RAM interface could support these resolutions
>>> if all the timing and latency was good. But, not so good when it is
>>> implemented by the display module hammering out a series of prefetch
>>> requests over the ring-bus just ahead of the current raster position.
>>>
>>> Though, the cell-oriented display modes still work better than my
>>> attempt at a linear framebuffer mode (due to cache/timing issues, not
>>> even a 320x200 linear framebuffer mode worked without looking like a
>>> broken mess).
>>>
>>>
>>> I suspect this is because, with the cell-oriented modes, each cell
>>> has 4 or 8 chances for the prefetch to succeed before it actually
>>> gets drawn, whereas in the linear raster mode, there is only 1 chance.
>>>
>>> It is likely that a linear framebuffer would require two stages:
>>> Prefetch 1: Somewhat ahead of current raster position, hopefully gets
>>> data into L2;
>>> Prefetch 2: Closer to the raster position, intended to actually fetch
>>> the pixel data.
>>>
>>> Prefetches are used here rather than actual loads, mostly because
>>> these will get cleaned up quickly, whereas with actual fetches, a
>>> back-log scenario would result in the whole bus getting clogged up
>>> with unresolved requests.
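>>>
>>> A minimal sketch of the two-stage idea (the distances are made-up
>>> tuning parameters):
>>>
>>>    /* Two prefetch distances ahead of the raster position: stage 1
>>>       pulls data toward L2 early; stage 2, much closer in, fetches
>>>       it the rest of the way so it is present when pixels draw. */
>>>    #define PF1_AHEAD 64  /* cells ahead: DRAM -> L2 (made up) */
>>>    #define PF2_AHEAD  8  /* cells ahead: L2 -> display (made up) */
>>>
>>>    void raster_prefetch(int raster_cell,
>>>                         void (*prefetch)(int cell, int into_level)) {
>>>        prefetch(raster_cell + PF1_AHEAD, 2);
>>>        prefetch(raster_cell + PF2_AHEAD, 1);
>>>    }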
>>>
>>>
>>> However, the CPU can use normal loads, since the CPU will patiently
>>> wait for the previous request(s) to finish before doing anything else
>>> (and thus avoids flooding the ring-bus with requests).
>>>
>>> However, a downside of prefetches, is that one has to keep asking the
>>> L2 cache each time whether or not it has the data in question yet.
>>>
>>>
>>>
>>>
>>> As for the "BJX2 doesn't always generate smaller .text than RISC-V
>>> issue", went looking at the ASM, and noted there is a big difference:
>>> GCC "-Os" generates very tight and efficient code, but needs to work
>>> within the limits of what the ISA provides;
>>> BGBCC has a bit more to work with, but the relative quality of the
>>> generated code is fairly poor in comparison.
>>>
>>>
>>> Like, say:
>>>    MOV.Q R8, (SP, 40)
>>>    .lbl:
>>>    MOV.Q (SP, 40), R8
>>> //BGBCC: "Sure why not?..."
>>>    ...
>>>    MOV R2, R9
>>>    MOV R9, R2
>>>    BRA .lbl
>>> //BGBCC: "Seems fine to me..."
>>>
>>> So, I look at the ASM, and once again feel a groan at how crappy a
>>> lot of it is.
>>>
>>>
>>> Or:
>>>    if(!ptr)
>>>      ...
>>> Was failing to go down the logic path that would have allowed it to
>>> use the BREQ/BRNE instructions (so was always producing a two-op
>>> sequence).
>>>
>>> Have noticed that code that writes, say:
>>>    if(ptr==NULL)
>>>      ...
>>> Ends up using a 3-instruction sequence, because it doesn't recognize
>>> this pattern as being the same as the "!ptr" case, ...
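>>>
>>> The missing piece is basically a matcher that treats both spellings
>>> the same; a minimal C sketch of the idea (the IR node names are
>>> invented for illustration):
>>>
>>>    typedef struct Expr Expr;
>>>    struct Expr { int op; Expr *lhs, *rhs; };
>>>    enum { OP_EQ, OP_NE, OP_NOT, OP_CONST0 /* ... */ };
>>>
>>>    /* Recognize (x==NULL), (x!=NULL), and (!x) as the same
>>>       compare-with-zero shape, so every form reaches the
>>>       BREQ/BRNE selection path. */
>>>    int match_zero_cmp(Expr *e, Expr **val, int *is_eq) {
>>>        if ((e->op == OP_EQ || e->op == OP_NE) &&
>>>            e->rhs && e->rhs->op == OP_CONST0) {
>>>            *val = e->lhs; *is_eq = (e->op == OP_EQ);
>>>            return 1;
>>>        }
>>>        if (e->op == OP_NOT) { *val = e->lhs; *is_eq = 1; return 1; }
>>>        return 0;
>>>    }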
>>>
>>> Did at least find a few more "low hanging fruit" cases that shaved a
>>> few more kB off the binary.
>>>
>>> Well, and also added a case to partially optimize:
>>>    return(bar());
>>> To merge the 3AC "RET" into the "CSRV" operation, and thus save the
>>> use of a temporary (and roughly two otherwise unnecessary MOV
>>> instructions whenever this happens).
>>>
>>>
>>>
>>> But, ironically, it was still "mostly" generating code with fewer
>>> instructions, despite the still relatively weak code generation at
>>> times.
>>>
>>>
>>> Also it seems:
>>>    void foo()
>>>    {
>>>       //does nothing
>>>    }
>>>    void bar()
>>>    {
>>>      ...
>>>      foo();
>>>      ...
>>>    }
>>>
>>> GCC seems to be clever enough to realize that "foo()" does nothing,
>>> and will eliminate the function and function call entirely.
>>>
>>> BGBCC has no such optimization.
>>>
>>> ...
>>>
>>>
>> Finally got a synthesis for a complete Q+ system done. Turns out to be
>> about 10% too large for the XC7A200 :) It should easily fit in the
>> next larger part. Scratching my head wondering how to reduce sizes
>> while not losing too much functionality. I could go with just the CPU
>> and a serial port, remove the frame buffer, sprites, etc.
>
> I can fit:
>   XC7A200T: dual core and a rasterizer module.
>   XC7A100T: single core and a rasterizer module.
>   XC7S50: Single core with reduced features.
>     Say, 2-wide 4R2W register file, 32 GPRs, etc.
>   XC7S25: A 1-wide core with no FPU or MMU.
>     But, an XC7S25 would probably be better served with an RV32I core.
>     Well, and/or an SH-2 variant (*).
>
>
> Early on, I could fit dual core onto an XC7A100T, but the feature-set
> has expanded enough to make this a problem. Would need to trim stuff
> down a little to make this happen (though, part of this is that the L2
> cache and DDR RAM module burn a lot of LUTs on having a 512-bit RAM
> interface; but this is needed to get decent RAM bandwidth, as DDR
> bandwidth suffers considerably if I use 128-bit burst transfers).
>
> Well, also early on, the display module also had 32K of VRAM and I was
> displaying Doom and Quake using color-cells (color-cell encoding the
> screen image each time the screen was redrawn).
>
> Also ironically, when I first added the 320x200 hi-color mode, it was
> slower than using color-cell, mostly due to copying over the MMIO bus
> being slower than the color-cell encoder. But, this is no longer true.
> The high-color mode did have the advantage of better image quality
> though (it is sorta hit-or-miss vs the 256-color mode with a fixed system
> palette; color-cell has better color but obvious block artifacts,
> whereas the 256-color mode lacks block artifacts but has worse color
> fidelity).
>
> ...
>
>
> *: This is closer to what I had intended my 32-bit BSR1 design for, but
> annoyingly it came out a little bigger than a modified SH-2 based design
> (B32V).
>
>
> Where B32V was:
>   Similar feature-set to SH-2;
>     No FPU or MMU;
>     Shifts were only in fixed amounts;
>       Bigger shifts built-up from smaller shifts;
>       Variable shift was via a runtime call and "shift slide".
>     Registers R0..R15
>       R0 was special
>       R15 was a stack pointer.
>     No integer multiply;
>     ...
>   Little endian IIRC, but aligned-only memory access;
>   Omitted the auto-increment addressing modes;
>     Addressing modes: (Rm), (Rm, R0)
>     It left out most other addressing modes.
>   Cheaper interrupt mechanism.
>     Closer to the mechanism used on BJX2.
>   Instruction encoding was otherwise kept from SuperH.
>     Effectively, fixed-length 16-bit instructions (ZnmZ, Znii, ...).
>
> Thus far, the B32V experiment was able to achieve the smallest LUT count
> (around 4000 LUTs IIRC), but didn't end up using it for much.
>
> Core would have been borderline too minimalist to even run something
> like Doom (if it had a display interface).
>
> Where, Doom seems to need a few things:
>   Full general-purpose shift operations;
>   A (not dead slow) integer multiplier;
>   ...
>
> Comparably, attempts at both my BSR1 design, and RV32I, failed to be
> quite as small. But, RV32I would have been competitive in this space.
>
>
> Though, one additional limiting factor was that both BSR1 and B32V were
> designed around a 16-bit address space:
>   0000..7FFF: ROM
>   C000..DFFF: RAM
>   E000..FFFF: MMIO
>
> In this case, they would have used 16-bit pointers, albeit with a 32-bit
> register size (though, in these, 'int' was reduced to 16-bits, with
> 'long' as the native 32-bit type).
>
> Though, a vestige of this still exists in BJX2 (in the Boot ROM).
> But, as noted, a 16-bit address space would not be sufficient to run Doom.
>
> Then again, part of the initial design and also the initial Verilog code
> for the BJX2 core was derived from the BSR1 core, which was in turn
> partly derived from the B32V core (IIRC).
>
> But, ironically, the initial design for BJX2 was more or less bolting a
> bunch of stuff from the BJX1-64C variant on top of the BSR1 design
> (where BJX1 basically ended up being like "What if I did the x86-64
> thing just using SH4 as a base?"; but could have in-theory been
> backwards compatible with SH-4, and used "hardware" interfaces partly
> derived from the Sega Dreamcast, but never got onto an actual FPGA and
> was likely unworkable).
>
> Did try briefly (without much success) to get Dreamcast ports
> of Linux to boot on the emulator for it. Did get some simpler SuperH
> Linux ports to boot though (mostly ones that were No-MMU and did all of
> their IO via a "debug UART"; rather than, say, trying to use the
> PowerVR2 graphics chip and similar).
>
>
>
> Early versions of BJX2 used 32-bit pointers, until I went over to the
> 64-bit layout (with 16 tag bits).
>
> Did experiment briefly with an ABI using 128-bit pointers with a 96-bit
> addresses, but have shelved this for now (at best, this will kinda suck
> for the extra performance and memory-usage overheads, while otherwise
> being almost entirely overkill at this point).
>
> The more practical-seeming option was to keep programs as still using
> the 64-bit pointers, but then being able to use 128-bit "__huge"
> pointers in the off chance they actually need the 128-bit pointers for
> something.
>
>
> ...
>
Got the core to fit, about 95% full.


Re: Tonight's tradeoff
 by: Robert Finch - Fri, 8 Dec 2023 11:48 UTC

What happens when there is a sequence of numerous branches in a row, such
that the machine would run out of checkpoints for the branches?

Suppose you go
Bra tgt1
Bra tgt1
… 30 times
Bra tgt1

Will the machine still work? Or will it crash?
I have Q+ stalling until checkpoints are available, but it seems like a
loss of performance. It is extra hardware to check for the case that
might be preventable with software. I mean how often would a sequence
like the above occur?
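
As a sketch, the stall policy amounts to allocating from a small pool
(the pool size and names here are illustrative, not Q+ internals):

   /* One checkpoint per in-flight conditional branch; when the pool
      is exhausted, the front end stalls rather than crashing. */
   #define NUM_CHECKPOINTS 16  /* illustrative */

   static unsigned ckpt_busy;  /* bitmask of live checkpoints */

   int try_alloc_checkpoint(void) {
       for (int i = 0; i < NUM_CHECKPOINTS; i++)
           if (!(ckpt_busy & (1u << i))) {
               ckpt_busy |= 1u << i;
               return i;
           }
       return -1;  /* no free checkpoint: stall until one is freed */
   }

   void free_checkpoint(int i) { ckpt_busy &= ~(1u << i); }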

Re: Tonight's tradeoff
 by: MitchAlsup - Fri, 8 Dec 2023 17:53 UTC

Robert Finch wrote:

> What happens when there is a sequence of numerous branches in a row, such
> that the machine would run out of checkpoints for the branches?

Stall Insert.

> Suppose you go
> Bra tgt1
> Bra tgt1
> … 30 times
> Bra tgt1

Unconditional Branches do not need a checkpoint (all by themselves).

> Will the machine still work? Or will it crash?
> I have Q+ stalling until checkpoints are available, but it seems like a
> loss of performance. It is extra hardware to check for the case that
> might be preventable with software. I mean how often would a sequence
> like the above occur?

Unconditional branches can be dealt with completely in the front end
{they do not need to be executed--except as they alter IP.}

On the other hand:: compilers are pretty good at cleaning up branches
to unconditional branches.

How will you tell for sure:: Read the ASM your compiler produces (a lot
of it).

Re: Tonight's tradeoff
 by: BGB - Fri, 8 Dec 2023 18:41 UTC

On 12/8/2023 11:53 AM, MitchAlsup wrote:
> Robert Finch wrote:
>
>> What happens when there is a sequence of numerous branches in a row,
>> such that the machine would run out of checkpoints for the branches?
>
> Stall Insert.
>
>> Suppose you go
>> Bra tgt1
>> Bra tgt1
>> … 30 times
>> Bra tgt1
>
> Unconditional Branches do not need a checkpoint (all by themselves).
>

In my case, I don't use "checkpoints".
Granted, it is a fairly naive in-order / stalling-pipeline design as well.

>> Will the machine still work? Or will it crash?
>> I have Q+ stalling until checkpoints are available, but it seems like
>> a loss of performance. It is extra hardware to check for the case that
>> might be preventable with software. I mean how often would a sequence
>> like the above occur?
>
> Unconditional branches can be dealt with completely in the front end
> {they do not need to be executed--except as they alter IP.}
>
> On the other hand:: compilers are pretty good at cleaning up branches
> to unconditional branches.
>

Hmm... Computed branch to unconditional branch was how I ended up
implementing things like "switch()".

Will not claim this is ideal for performance though (and does currently
have the limitation that in direct branch-to-branch cases, the branch
predictor doesn't work), so a "table of offsets" may be a better option,
but would be harder to set up in terms of relocs.

Ironically, the prolog compression would also have this issue, except
that typically these functions start with a "MOV LR, R1", which then
"protects" the following BSR instruction and allows the branch predictor
to work.

Granted, this is one of those "needs to be this way otherwise the CPU
craps itself" cases.

Also the branch predictor doesn't work if one crosses certain boundaries:
4K for Disp8
64K for Disp11
16MB for Disp20
The Disp23 cases will also be rejected by the branch predictor.

Though, these are more a result of "carry propagation isn't free", and
this logic is somewhat latency sensitive.

These cases fall back to the slow branch case, as do Disp33 and Abs48
branches.

> How will you tell for sure:: Read the ASM your compiler produces (a lot
> of it).

Re: Tonight's tradeoff
 by: Robert Finch - Sun, 10 Dec 2023 00:15 UTC

Getting a bit lazy on the Q+ instruction commit in the interest of
increasing the fmax. The results are already in the register file, so
all the commit has to do is:

1) Update the branch predictor.
2) Free up physical registers
3) Free load/store queue entries associated with the ROB entry.
4) Commit oddball instructions.
5) Process any outstanding exceptions.
6) Free the ROB entry
7) Gather performance statistics.

What needs to be committed is computed in the clock cycle before the
commit. This pipelined signal adds a cycle of latency to the commit, but
it only really affects oddball instructions rarely executed, and
exceptions. Commit also will not commit if the commit pointer is near
the queue pointer. Commit will also only commit up to the first oddball
instruction or exception.
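
As a sketch, that commit step looks something like the following in C
(the field names and the two-entry proximity margin are invented; the
oddball/exception and pointer rules follow the description above):

   typedef struct { int done, oddball, exception; } RobEntry;

   int commit_step(RobEntry *rob, int head, int tail,
                   int size, int width) {
       for (int n = 0; n < width; n++) {
           if (head == tail) break;
           /* don't commit when commit pointer nears queue pointer */
           if (((tail - head + size) % size) < 2) break;
           RobEntry *e = &rob[head];
           if (!e->done) break;
           if ((e->oddball || e->exception) && n > 0)
               break;  /* only up to the first oddball/exception */
           head = (head + 1) % size;  /* free the ROB entry */
       }
       return head;
   }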

Decided to axe the branch-to-register feature of conditional branch
instructions because the branch target would not be known at enqueue
time. It would require updating the ROB in two places.

Branches can now use a postfix immediate to extend the branch range.
This allows 32 and 64-bit displacements in addition to the existing
17-bit one. However, the assembler cannot know which to use in advance,
so choosing a larger branch displacement size should be an option.
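
A minimal sketch of the assembler-side size choice (the ".branch32"
style default-size option is an invented illustration):

   /* Pick the smallest encoding that reaches 'disp'; fall back to a
      user-selected size for unresolved forward references. */
   int branch_disp_bits(long long disp, int resolved, int default_bits) {
       if (!resolved) return default_bits;   /* assumed user option */
       if (disp >= -(1LL << 16) && disp < (1LL << 16)) return 17;
       if (disp >= -(1LL << 31) && disp < (1LL << 31)) return 32;
       return 64;
   }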

Re: Tonight's tradeoff
 by: MitchAlsup - Sun, 10 Dec 2023 01:02 UTC

Robert Finch wrote:

> Getting a bit lazy on the Q+ instruction commit in the interest of
> increasing the fmax. The results are already in the register file, so
> all the commit has to do is:

> 1) Update the branch predictor.
> 2) Free up physical registers

By the time you write the physical register into the file, you are in
a position to free up the now permanently invisible physical register
it replaced.
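
In rename terms, a minimal sketch of that rule (table names invented;
it assumes a design where the redefining instruction can no longer be
squashed once it writes the register file):

   /* At writeback: the physical register previously mapped to the
      same architectural dest is now permanently invisible, so it can
      go straight back onto the free list. */
   void on_writeback(int arch_dest, int new_preg,
                     int prev_preg[], void (*free_preg)(int)) {
       int old = prev_preg[arch_dest];  /* recorded at rename time */
       if (old >= 0)
           free_preg(old);
       prev_preg[arch_dest] = new_preg;
   }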

> 3) Free load/store queue entries associated with the ROB entry.

Spectré:: write miss buffer data into Cache and TLB.
This is also where I write ST.data into cache.

> 4) Commit oddball instructions.
> 5) Process any outstanding exceptions.
> 6) Free the ROB entry
> 7) Gather performance statistics.

> What needs to be committed is computed in the clock cycle before the
> commit. This pipelined signal adds a cycle of latency to the commit, but
> it only really affects oddball instructions rarely executed, and
> exceptions. Commit also will not commit if the commit pointer is near
> the queue pointer. Commit will also only commit up to the first oddball
> instruction or exception.

> Decided to axe the branch-to-register feature of conditional branch
> instructions because the branch target would not be known at enqueue
> time. It would require updating the ROB in two places.

Question:: How would you handle::

IDIV R6,R7,R8
JMP R6

??

> Branches can now use a postfix immediate to extend the branch range.
> This allows 32 and 64-bit displacements in addition to the existing
> 17-bit one. However, the assembler cannot know which to use in advance,
> so choosing a larger branch displacement size should be an option.

Re: Tonight's tradeoff
 by: Robert Finch - Sun, 10 Dec 2023 02:39 UTC

On 2023-12-09 8:02 p.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> Getting a bit lazy on the Q+ instruction commit in the interest of
>> increasing the fmax. The results are already in the register file, so
>> all the commit has to do is:
>
>> 1)    Update the branch predictor.
>> 2)    Free up physical registers
>
> By the time you write the physical register into the file, you are in
> a position to free up the now permanently invisible physical register
> it replaced.
>
Hey thanks, I should have thought of that. While there are more physical
registers available than needed (256, and only about 204 are needed), it
would probably run okay as-is; still, I think I see a way to reduce
multiplexor usage by freeing the register when it is written.

>> 3)    Free load/store queue entries associated with the ROB entry.
>
> Spectré:: write miss buffer data into Cache and TLB.
> This is also where I write ST.data into cache.
>
Is miss data for a TLB page fault? I have this stored in a register in
the TLB which must be read by the CPU during exception handling.
Otherwise the TLB has a hidden page walker that updates the TLB.
Scratching my head now over writing the store data at commit time.

>> 4)    Commit oddball instructions.
>> 5)    Process any outstanding exceptions.
>> 6)    Free the ROB entry
>> 7)    Gather performance statistics.
>
>> What needs to be committed is computed in the clock cycle before the
>> commit. This pipelined signal adds a cycle of latency to the commit,
>> but it only really affects oddball instructions rarely executed, and
>> exceptions. Commit also will not commit if the commit pointer is near
>> the queue pointer. Commit will also only commit up to the first
>> oddball instruction or exception.
>
>> Decided to axe the branch-to-register feature of conditional branch
>> instructions because the branch target would not be known at enqueue
>> time. It would require updating the ROB in two places.
>
> Question:: How would you handle::
>
>     IDIV    R6,R7,R8
>     JMP     R6
>
> ??
>
There is a JMP address[Rn] instruction (really JSR Rt,address[Rn]) in
the instruction set which is always treated as a branch miss when it
executes. The RTS instruction could also be used; it allows the return
address register to be specified and it is a couple of bytes shorter. It
was just that conditional branches had the feature removed. It required
a third register be read for the flow control unit too.

>> Branches can now use a postfix immediate to extend the branch range.
>> This allows 32 and 64-bit displacements in addition to the existing
>> 17-bit one. However, the assembler cannot know which to use in
>> advance, so choosing a larger branch displacement size should be an
>> option.

Re: Tonight's tradeoff
 by: MitchAlsup - Sun, 10 Dec 2023 04:06 UTC

Robert Finch wrote:

> On 2023-12-09 8:02 p.m., MitchAlsup wrote:
>> Robert Finch wrote:
>>
>>> Getting a bit lazy on the Q+ instruction commit in the interest of
>>> increasing the fmax. The results are already in the register file, so
>>> all the commit has to do is:
>>
>>> 1)    Update the branch predictor.
>>> 2)    Free up physical registers
>>
>> By the time you write the physical register into the file, you are in
>> a position to free up the now permanently invisible physical register
>> it replaced.
>>
> Hey thanks, I should have thought of that. While there are more physical
> registers available than needed (256, and only about 204 are needed), it
> would probably run okay as-is; still, I think I see a way to reduce
> multiplexor usage by freeing the register when it is written.

You are welcome.

>>> 3)    Free load/store queue entries associated with the ROB entry.
>>
>> Spectré:: write miss buffer data into Cache and TLB.
>> This is also where I write ST.data into cache.
>>
> Is miss data for a TLB page fault?

I leave TLB replacements in the miss buffer simply because they are so
seldom that I don't feel it necessary to build yet another buffer.
TLB update plus any tablewalk acceleration is deferred until the causing
instruction retires.

> I have this stored in a register in
> the TLB which must be read by the CPU during exception handling.

Technically, the TLB is the storage and comparators, while the rest
of the table walking mechanics {including the TLB} are the MMU.

> Otherwise the TLB has a hidden page walker that updates the TLB.

If you don't defer TLB update until after the causing instruction retires
Spectré-like attacks have a covert channel at their disposal.

> Scratching my head now over writing the store data at commit time.

My 6-wide machine has a conditional-cache (memory reorder buffer)
after execution; by that point, calculation instructions can raise no
exception. This is the commit point. Between commit and retire, the
conditional cache updates the Data Cache. So there is a period of time
the pipeline builds up state, and once it has been determined that
nothing can prevent the manifestations of those instructions from
taking place, there is a period of time state gets updated. Once
all state is updated, the instruction has retired.

>>> 4)    Commit oddball instructions.
>>> 5)    Process any outstanding exceptions.
>>> 6)    Free the ROB entry
>>> 7)    Gather performance statistics.
>>
>>> What needs to be committed is computed in the clock cycle before the
>>> commit. This pipelined signal adds a cycle of latency to the commit,
>>> but it only really affects oddball instructions rarely executed, and
>>> exceptions. Commit also will not commit if the commit pointer is near
>>> the queue pointer. Commit will also only commit up to the first
>>> oddball instruction or exception.
>>
>>> Decided to axe the branch-to-register feature of conditional branch
>>> instructions because the branch target would not be known at enqueue
>>> time. It would require updating the ROB in two places.
>>
>> Question:: How would you handle::
>>
>>     IDIV    R6,R7,R8
>>     JMP     R6
>>
>> ??
>>
> There is a JMP address[Rn] instruction (really JSR Rt,address[Rn]) in
> the instruction set which is always treated as a branch miss when it
> executes. The RTS instruction could also be used; it allows the return
> address register to be specified and it is a couple of bytes shorter. It
> was just that conditional branches had the feature removed. It required
> a third register be read for the flow control unit too.

I have a LD IP,[address] instruction which is used to access GOT[k] for
calling dynamically linked subroutines. This bypasses the LD-aligner
to deliver IP to fetch faster.

But you side-stepped answering my question. My question is what do you
do when the Jump address will not arrive for another 20 cycles.

>>> Branches can now use a postfix immediate to extend the branch range.
>>> This allows 32 and 64-bit displacements in addition to the existing
>>> 17-bit one. However, the assembler cannot know which to use in
>>> advance, so choosing a larger branch displacement size should be an
>>> option.

I use GOT[k] to branch farther than the 28-bit unconditional branch
displacement can reach. We have not yet run into a subroutine that
needs branches of more than 18-bits conditionally or 28-bits
unconditionally.

Re: Tonight's tradeoff
 by: Robert Finch - Sun, 10 Dec 2023 06:57 UTC

On 2023-12-09 11:06 p.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> On 2023-12-09 8:02 p.m., MitchAlsup wrote:
>>> Robert Finch wrote:
>>>
>>>> Getting a bit lazy on the Q+ instruction commit in the interest of
>>>> increasing the fmax. The results are already in the register file,
>>>> so all the commit has to do is:
>>>
>>>> 1)    Update the branch predictor.
>>>> 2)    Free up physical registers
>>>
>>> By the time you write the physical register into the file, you are in
>>> a position to free up the now permanently invisible physical register
>>> it replaced.
>>>
>> Hey thanks, I should have thought of that. While there are more
>> physical registers available than needed (256, and only about 204 are
>> needed), it would probably run okay as-is; still, I think I see a way
>> to reduce multiplexor usage by freeing the register when it is written.
>
> You are welcome.
>
>>>> 3)    Free load/store queue entries associated with the ROB entry.
>>>
>>> Spectré:: write miss buffer data into Cache and TLB.
>>> This is also where I write ST.data into cache.
>>>
>> Is miss data for a TLB page fault?
>
> I leave TLB replacements in the miss buffer simply because they are so
> seldom that I don't feel it necessary to build yet another buffer.
> TLB update plus any tablewalk acceleration is deferred until the causing
> instruction retires.
>
>> I have this stored in a register in the TLB which must be read by the
>> CPU during exception handling.
>
> Technically, the TLB is the storage and comparators, while the rest
> of the table walking mechanics {including the TLB} are the MMU.
>
>> Otherwise the TLB has a hidden page walker that updates the TLB.
>
> If you don't defer TLB update until after the causing instruction retires
> Spectré-like attacks have a covert channel at their disposal.

I am tempted to try that approach; Q+ buffers TLB misses already. All
the details added to mitigate Spectré-like attacks would seem to add
hardware though.
>
>> Scratching my head now over writing the store data at commit time.
>
> My 6-wide machine has a conditional-cache (memory reorder buffer)
> after execution; by that point, calculation instructions can raise no
> exception. This is the commit point. Between commit and retire, the
> conditional cache updates the Data Cache. So there is a period of time
> the pipeline builds up state, and once it has been determined that
> nothing can prevent the manifestations of those instructions from
> taking place, there is a period of time state gets updated. Once
> all state is updated, the instruction has retired.
>
>>>> 4)    Commit oddball instructions.
>>>> 5)    Process any outstanding exceptions.
>>>> 6)    Free the ROB entry
>>>> 7)    Gather performance statistics.
>>>
>>>> What needs to be committed is computed in the clock cycle before the
>>>> commit. This pipelined signal adds a cycle of latency to the commit,
>>>> but it only really affects oddball instructions rarely executed, and
>>>> exceptions. Commit also will not commit if the commit pointer is
>>>> near the queue pointer. Commit will also only commit up to the first
>>>> oddball instruction or exception.
>>>
>>>> Decided to axe the branch-to-register feature of conditional branch
>>>> instructions because the branch target would not be known at enqueue
>>>> time. It would require updating the ROB in two places.
>>>
>>> Question:: How would you handle::
>>>
>>>      IDIV    R6,R7,R8
>>>      JMP     R6
>>>
>>> ??
>>>
>> There is a JMP address[Rn] instruction (really JSR Rt,address[Rn]) in
>> the instruction set which is always treated as a branch miss when it
>> executes. The RTS instruction could also be used; it allows the return
>> address register to be specified and it is a couple of bytes shorter.
>> It was just that conditional branches had the feature removed. It
>> required a third register be read for the flow control unit too.
>
> I have a LD IP,[address] instruction which is used to access GOT[k] for
> calling dynamically linked subroutines. This bypasses the LD-aligner
> to deliver IP to fetch faster.
>
> But you side-stepped answering my question. My question is what do you
> do when the Jump address will not arrive for another 20 cycles.
>
While waiting for the register value, other instructions would continue
to queue and execute. Then that processing would be dumped because of
the branch miss. I suppose hardware could be added to suppress
processing until the register value is known. An option for a larger build.

>>>> Branches can now use a postfix immediate to extend the branch range.
>>>> This allows 32 and 64-bit displacements in addition to the existing
>>>> 17-bit one. However, the assembler cannot know which to use in
>>>> advance, so choosing a larger branch displacement size should be an
>>>> option.
>
> I use GOT[k] to branch farther than the 28-bit unconditional branch
> displacement can reach. We have not yet run into a subroutine that
> needs branches of more than 18-bits conditionally or 28-bits
> unconditionally.

I have yet to use GOT addressing.

There are issues to resolve in the Q+ frontend. The next PC value for
the BTB is not available for about three clocks. To go backwards in
time, the next PC needs to be cached, or rather the displacement to the
next PC to reduce cache size. The first time a next PC is needed it will
not be available for three clocks. Once cached it would be available
within a clock. The next PC displacement is the sum of the lengths of
the next four instructions. There is not enough room in the FPGA to add
another cache and associated logic, however. Next PC = PC + 20 seems a
whole lot simpler to me.
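
As a sketch, the cached value is just the sum of four instruction
lengths, which with fixed 5-byte spacing collapses to a constant
(names illustrative):

   /* Displacement from this PC to the PC of the next fetch group. */
   unsigned next_pc_disp(const unsigned char len[4]) {
       return len[0] + len[1] + len[2] + len[3];
   }

   enum { FIXED_NEXT_PC_DISP = 4 * 5 };  /* "Next PC = PC + 20" */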

Thus, I may go back to using fixed-size instructions, or rather
instructions with fixed alignment. The positions of instructions could be
as if they were fixed length while remaining variable length:
instructions would just be aligned at fixed intervals. If I set the
length to five bytes, for instance, most of the instruction set could be
accommodated. Operation by “packed” instructions would be an option for
a larger build. There could be a bit in a control register to allow
execution by packed or unpacked instructions, so there is some backwards
compatibility to a smaller build.

