Rocksolid Light




devel / comp.arch / Re: Misc: Another (possible) way to more MHz...

Subject / Author
* Misc: Another (possible) way to more MHz...BGB
+* Re: Misc: Another (possible) way to more MHz...robf...@gmail.com
|`- Re: Misc: Another (possible) way to more MHz...BGB
+* Re: Misc: Another (possible) way to more MHz...EricP
|`* Re: Misc: Another (possible) way to more MHz...BGB
| +* Re: Misc: Another (possible) way to more MHz...Timothy McCaffrey
| |`* Re: Misc: Another (possible) way to more MHz...MitchAlsup
| | `- Re: Misc: Another (possible) way to more MHz...BGB
| +* Re: Misc: Another (possible) way to more MHz...EricP
| |`* Re: Misc: Another (possible) way to more MHz...BGB
| | `* Re: Misc: Another (possible) way to more MHz...EricP
| |  `* Re: Misc: Another (possible) way to more MHz...BGB
| |   `* Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |    `* Re: Misc: Another (possible) way to more MHz...BGB
| |     `* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |      `* Re: Misc: Another (possible) way to more MHz...robf...@gmail.com
| |       `* Re: Misc: Another (possible) way to more MHz...BGB
| |        `* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |         +* Re: Misc: Another (possible) way to more MHz...BGB
| |         |+* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |         ||`- Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |         |`- Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |         +- Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |         `* Re: Misc: Another (possible) way to more MHz...EricP
| |          +* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |          |`- Re: Misc: Another (possible) way to more MHz...BGB
| |          `* Re: Misc: Another (possible) way to more MHz...BGB
| |           `* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |            `* Re: Misc: Another (possible) way to more MHz...BGB
| |             `- Re: Misc: Another (possible) way to more MHz...robf...@gmail.com
| `* Re: Misc: Another (possible) way to more MHz...MitchAlsup
|  `- Re: Misc: Another (possible) way to more MHz...BGB
+* Re: Misc: Another (possible) way to more MHz...Michael S
|`- Re: Misc: Another (possible) way to more MHz...Michael S
`- Re: Misc: Another (possible) way to more MHz...BGB

Re: Misc: Another (possible) way to more MHz...

<joXSM.67418$EIy4.18153@fx48.iad>


https://news.novabbs.org/devel/article-flat.php?id=34413&group=comp.arch#34413

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx48.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
References: <uf4btl$3pe5m$1@dont-email.me> <ufd79o$2nafk$1@dont-email.me> <0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com> <ufdvs9$2s0qj$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me>
In-Reply-To: <ufei6t$2vj62$1@dont-email.me>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 45
Message-ID: <joXSM.67418$EIy4.18153@fx48.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 03 Oct 2023 16:34:23 UTC
Date: Tue, 03 Oct 2023 12:34:01 -0400
X-Received-Bytes: 2775
 by: EricP - Tue, 3 Oct 2023 16:34 UTC

Kent Dickey wrote:
> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>> I am fiddling around a bit with it, and have been getting the core
>> "closer" to being able to boost the speed, but the "Worst Negative
>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>> fight here...
>>
>> Looks like for a lot of the failing paths are sort of like:
>> ~ 12-14 levels;
>> ~ 50-130 high-fanout;
>> ~ 4.5 ns of logic delay;
>> ~ 10.5 ns of net-delay.
>>
>>
>> What makes things harder is that I am trying to pull this off while
>> staying with 32K L1 caches, ...
>
> A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
> settle for 70MHz.
>
> Note that 14 levels of LUTs is equivalent to about 30 levels of gates. This is
> a slow design independent of it being in an FPGA, and independent of
> any FPGA routing issues.
>
> If you want to not optimize your control and other logic, that's your
> choice. But you're mixing things up. You're saying an ALU cannot be
> done within 10ns on an FPGA, and I'm pointing out that's not true.
> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
> performance when they got to 32bits or wider. And on an FPGA, bad
> decisions compound--if you have a huge slow ALU, it makes everything else
> slower as well (since everything gets further apart).

I'm now looking at a PDF book
"Designing with Xilinx FPGAs Using Vivado", 2017, and it says
"Embedded in the CLB is a high-performance look-ahead carry chain which
enables the FPGA to implement very high-performance adders. Current FPGAs
have carry chains which can implement a 64-bit adder at 500 MHz."

Unfortunately I can't find where it says how just yet.
I'd have expected this would be a library macro.
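(For what it's worth, there usually is no explicit macro to call: a plain
behavioral "+" is generally expected to be enough, with the synthesizer
mapping it onto the CLB carry chain. A generic sketch, not taken from the
book or from any core discussed here:

  // Plain 64-bit adder; synthesis tools such as Vivado are generally
  // expected to infer the dedicated CLB carry chain from the "+" operator.
  module add64(
      input  [63:0] a,
      input  [63:0] b,
      output [63:0] sum
  );
      assign sum = a + b;
  endmodule
)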

Re: Misc: Another (possible) way to more MHz...

<ufi0tv$3oguv$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34418&group=comp.arch#34418

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Tue, 3 Oct 2023 21:26:23 -0000 (UTC)
Organization: provalid.com
Lines: 68
Message-ID: <ufi0tv$3oguv$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufdvs9$2s0qj$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me> <joXSM.67418$EIy4.18153@fx48.iad>
Injection-Date: Tue, 3 Oct 2023 21:26:23 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="283b04ef2ab0644adad269c54ae2f803";
logging-data="3949535"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+GJlpTLyLL/zDqUiV0IJLQ"
Cancel-Lock: sha1:yrH3EcWhcC7NsRNJVyQj+/Q7WF0=
Originator: kegs@provalid.com (Kent Dickey)
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: Kent Dickey - Tue, 3 Oct 2023 21:26 UTC

In article <joXSM.67418$EIy4.18153@fx48.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
>Kent Dickey wrote:
>> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>>> I am fiddling around a bit with it, and have been getting the core
>>> "closer" to being able to boost the speed, but the "Worst Negative
>>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>>> fight here...
>>>
>>> Looks like for a lot of the failing paths are sort of like:
>>> ~ 12-14 levels;
>>> ~ 50-130 high-fanout;
>>> ~ 4.5 ns of logic delay;
>>> ~ 10.5 ns of net-delay.
>>>
>>>
>>> What makes things harder is that I am trying to pull this off while
>>> staying with 32K L1 caches, ...
>>
>> A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
>> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
>> settle for 70MHz.
>>
>> Note that 14 levels of LUTs is equivalent to about 30 levels of gates.
>This is
>> a slow design independent of it being in an FPGA, and independent of
>> any FPGA routing issues.
>>
>> If you want to not optimize your control and other logic, that's your
>> choice. But you're mixing things up. You're saying an ALU cannot be
>> done within 10ns on an FPGA, and I'm pointing out that's not true.
>> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
>> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
>> performance when they got to 32bits or wider. And on an FPGA, bad
>> decisions compound--if you have a huge slow ALU, it makes everything else
>> slower as well (since everything gets further apart).
>
>I'm now looking at a PDF book
>"Designing with Xilinx FPGAs Using Vivado", 2017, and it says
>"Embedded in the CLB is a high-performance look-ahead carry chain which
>enables the FPGA to implement very high-performance adders. Current FPGAs
>have carry chains which can implement a 64-bit adder at 500 MHz."
>
>Unfortunately I can't find where it says how just yet.
>I'd have expected this would be a library macro.

I looked up UG574, and what this is referring to is that one slice can now
add 8 bits and generate the carry. The same carry chain exists, but now
it's just 8 vertical slices to do a 64-bit adder, so the physical
constraint I ran into in the past is resolved on these more modern FPGAs.

For my purposes, I need just a few 48-bit adders, and I just let
synthesis take care of it, and this works at 200MHz for me, but I'm not
using Artix, so it's not a fair comparison. I took BGB's comments at
face value, and assumed the adder could be causing problems for their
ALU and suggested workarounds (I assumed it was a SIMD-capable adder
with a lot of complexity on killing the intermediate carries, but now
I'm not so sure).

In any case, in later posts, it's clear BGB's ALU issues are not the
adder itself. I suspect out[63:0] = in1[63:0] + in2[63:0] would be at most
around 3-4ns, and so not a problem. Xilinx documentation usually claims
performance on the highest speed model with -3 speed. Artix is the lowest
model and -1 is the lowest speed, so it's much slower, so I wouldn't
expect 64-bit adds at 500MHz on Artix.

Kent

Re: Misc: Another (possible) way to more MHz...

<ufie52$3qs2k$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34420&group=comp.arch#34420

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Tue, 3 Oct 2023 20:11:59 -0500
Organization: A noiseless patient Spider
Lines: 175
Message-ID: <ufie52$3qs2k$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufdvs9$2s0qj$1@dont-email.me>
<ufei6t$2vj62$1@dont-email.me> <joXSM.67418$EIy4.18153@fx48.iad>
<ufi0tv$3oguv$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 4 Oct 2023 01:12:02 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="dcea170deb8270329d5447a70c72a71e";
logging-data="4026452"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/qZGZKJPPC6E1zIkP+yUVs"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:YgXjIKMHL5CjhLOHwMCWEo7OwEA=
Content-Language: en-US
In-Reply-To: <ufi0tv$3oguv$1@dont-email.me>
 by: BGB - Wed, 4 Oct 2023 01:11 UTC

On 10/3/2023 4:26 PM, Kent Dickey wrote:
> In article <joXSM.67418$EIy4.18153@fx48.iad>,
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> Kent Dickey wrote:
>>> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>>>> I am fiddling around a bit with it, and have been getting the core
>>>> "closer" to being able to boost the speed, but the "Worst Negative
>>>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>>>> fight here...
>>>>
>>>> Looks like for a lot of the failing paths are sort of like:
>>>> ~ 12-14 levels;
>>>> ~ 50-130 high-fanout;
>>>> ~ 4.5 ns of logic delay;
>>>> ~ 10.5 ns of net-delay.
>>>>
>>>>
>>>> What makes things harder is that I am trying to pull this off while
>>>> staying with 32K L1 caches, ...
>>>
>>> A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
>>> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
>>> settle for 70MHz.
>>>
>>> Note that 14 levels of LUTs is equivalent to about 30 levels of gates.
>> This is
>>> a slow design independent of it being in an FPGA, and independent of
>>> any FPGA routing issues.
>>>
>>> If you want to not optimize your control and other logic, that's your
>>> choice. But you're mixing things up. You're saying an ALU cannot be
>>> done within 10ns on an FPGA, and I'm pointing out that's not true.
>>> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
>>> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
>>> performance when they got to 32bits or wider. And on an FPGA, bad
>>> decisions compound--if you have a huge slow ALU, it makes everything else
>>> slower as well (since everything gets further apart).
>>
>> I'm now looking at a PDF book
>> "Designing with Xilinx FPGAs Using Vivado", 2017, and it says
>> "Embedded in the CLB is a high-performance look-ahead carry chain which
>> enables the FPGA to implement very high-performance adders. Current FPGAs
>> have carry chains which can implement a 64-bit adder at 500 MHz."
>>
>> Unfortunately I can't find where it says how just yet.
>> I'd have expected this would be a library macro.
>
> I looked up UG574, and what this is referring to is now one slice
> can add 8 bits, and generate the carry. The same carry chain exists,
> but now it's just 8 vertical slices to do a 64-bit adder, so the physical
> constraint I once ran into in the past is resolved on these more modern
> FPGAs.
>

Yeah. Pretty sure the ones I have only do 4-bit CARRY4 adders.

The CARRY4 units can apparently also be configured to do 4-bit
AND/OR/XOR as well.

LUTs seem to be able to take several sub-forms:
LUT5: 5-in, 3-out
LUT6: 6-in, 2-out (and 2x 6-in,1-out)
LUT7: 7-in, 1-out

So, for example:
C = tInv ? ~A : A;
Takes 22x LUT5 for 64 bits.

And, say:
C = A & B;
Takes 16x CARRY4's for 64-bits, but unlike A+B, does not have carry
propagate latency.
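
As a minimal synthesizable sketch of those one-liners (names are
illustrative only, not from the actual core): the first two stay as plain
LUT logic with no carry propagation, while the add pulls in the carry chain.

  module wide_ops(
      input  [63:0] A,
      input  [63:0] B,
      input         tInv,
      output [63:0] C_inv,   // C = tInv ? ~A : A;  one LUT level per bit
      output [63:0] C_and,   // C = A & B;          no carry propagation
      output [63:0] C_add    // C = A + B;          64-bit carry chain
  );
      assign C_inv = tInv ? ~A : A;
      assign C_and = A & B;
      assign C_add = A + B;
  endmodule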

> For my purposes, I need just a few 48-bit adders, and I just let
> synthesis take care of it, and this works at 200MHz for me, but I'm not
> using Artix, so it's not a fair comparison. I took BGB's comments at
> face value, and assumed the adder could be causing problems for their
> ALU and suggested workarounds (I assumed it was a SIMD-capable adder
> with a lot of complexity on killing the intermediate carries, but now
> I'm not so sure).
>

The ALU does deal with packed-integer SIMD.

In this case, the outputs from the carry-select stages can also be used
to drive packed Int16 and Int32 (along with special case ops for sign
and zero extension).

I have noted that there is a latency impact from enabling support for
128-bit ALU operations (this extends the carry chain across both adders,
which does lead to higher latency).
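
(For illustration only: a common generic way to get a packed add out of a
single carry chain is to widen the operands with one control bit per lane
boundary, so the carry is either propagated or killed there. This is not
the carry-select scheme described above, just a sketch of that gap-bit
trick, with made-up names and a hypothetical mode encoding:

  module packed_add64(
      input  [63:0] a,
      input  [63:0] b,
      input  [1:0]  mode,   // hypothetical: 0 = 4x16-bit, 1 = 2x32-bit, 2 = 1x64-bit
      output [63:0] sum
  );
      // Gap-bit pair (1,0) propagates the incoming carry across a lane
      // boundary; (0,0) absorbs ("kills") it.
      wire p16 = (mode != 2'd0);   // carry crosses 16-bit boundaries in 32/64-bit modes
      wire p32 = (mode == 2'd2);   // carry crosses the 32-bit boundary only in 64-bit mode

      wire [66:0] wa = { a[63:48], p16,  a[47:32], p32,  a[31:16], p16,  a[15:0] };
      wire [66:0] wb = { b[63:48], 1'b0, b[47:32], 1'b0, b[31:16], 1'b0, b[15:0] };
      wire [66:0] ws = wa + wb;

      // Drop the gap bits to recover the packed result.
      assign sum = { ws[66:51], ws[49:34], ws[32:17], ws[15:0] };
  endmodule
)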

Things like 128-bit shift have less impact on latency, as the two 64-bit
shift units can operate independently and merely give the appearance of
a combined 128-bit shift.

But, for the most part, I am OK with 2 cycle latency for ADD, as it has
been this way for basically as long as BJX2 has existed...

It was mostly a few other ops that were recently extended to 2 cycles.
Did end up rerouting some things though, so now "MOV Rm, Rn" is back to
being a 1-cycle operation.

In particular, early on, GPRs and CRs existed as two separate register
spaces, but I had later ended up merging them (along with the ports),
meaning that the original MOV_CR and MOV_RC cases were functionally
equivalent to MOV, still 1-cycle ops. Had ended up routing MOV through
the MOV_CR logic in this case (albeit for now leaving sign and zero
extension as having a 2 cycle latency).

> In any case, in later posts, it's clear BGB's ALU issues are not the
> adder itself. I suspect out[63:0] = in1[63:0] + in2[63:0] would be around
> most 3-4ns, and so not a problem. Xilinx documentation usually claims
> performance on the highest speed model with -3 speed. Artix is the lowest
> model and -1 is the lowest speed, so it's much slower, so I wouldn't
> expect 64-bit adds at 500MHz on Artix.
>

Yeah, I am using -1 speed-grade FPGAs.

My core would seemingly pass timing at 75MHz without too much issue if I
had a -2 speed-grade FPGA.

Well, or if I could get a Vivado license for the Kintex-7 I got off
AliExpress (for, ironically, significantly less than it would cost for
the associated Vivado license).

Seemingly, the biggest/fastest FPGA one can theoretically get (without
throwing "big money" at the problem) is the XC7A200T-2 in the Nexys
Video (but, the Nexys Video was still both rather expensive and highly
prone to being sold-out).

But, I have the QMTECH board with the XC7A200T-1 (which, ironically,
seems to have a harder time passing timing than the XC7A100T-1 in the
Nexys A7; but has significantly more LUTs and Block-RAM, so I can go
multi-core and have a bigger L2 cache).

Ironically, after first moving to targeting the QMTECH board, I had
initially ended up needing to drop the L1 sizes back to 16K and similar
(but then was later able to free up the timing enough to allow going
back to 32K L1's).

With the -1 Artix-7's, there seems to be a negative correlation between
the size of the FPGA and how easily it passes timing.

Still fiddling with it, trying to see if I can free up enough timing to
"unlock" a 75MHz option.

It seems like 66MHz could be more easily within reach, but this would
require a bit more work on the logic (I would need to create a 66MHz
clock signal, and rework a bunch of the internal timing logic to allow
everything to operate at 66MHz).

The difference here being "only" 2ns (but, then it is "only" another 3ns
after this point to 100MHz...).

But, trying to mop up all the "total negative slack" to make this happen
is still a bit of a thing at the moment.

> Kent

Re: Misc: Another (possible) way to more MHz...

<74b6ddc2-5efd-44d8-9097-6c2992d39b16n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34433&group=comp.arch#34433

X-Received: by 2002:a05:622a:1a05:b0:419:588e:f8fb with SMTP id f5-20020a05622a1a0500b00419588ef8fbmr32673qtb.4.1696423909959;
Wed, 04 Oct 2023 05:51:49 -0700 (PDT)
X-Received: by 2002:a05:6808:19a2:b0:3ad:da36:1dd6 with SMTP id
bj34-20020a05680819a200b003adda361dd6mr2306892oib.1.1696423909769; Wed, 04
Oct 2023 05:51:49 -0700 (PDT)
Path: i2pn2.org!rocksolid2!news.neodome.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 4 Oct 2023 05:51:49 -0700 (PDT)
In-Reply-To: <uf4btl$3pe5m$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:58d7:c433:89d1:4124;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:58d7:c433:89d1:4124
References: <uf4btl$3pe5m$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <74b6ddc2-5efd-44d8-9097-6c2992d39b16n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Wed, 04 Oct 2023 12:51:49 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1219
 by: Michael S - Wed, 4 Oct 2023 12:51 UTC

They keep spamming at a constant rate of 0.5 messages per minute, day after day after day.

Re: Misc: Another (possible) way to more MHz...

<1e0eb2d3-d407-447f-9488-2c7511b63c82n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34434&group=comp.arch#34434

X-Received: by 2002:a05:6214:b11:b0:65c:fec5:6e7 with SMTP id u17-20020a0562140b1100b0065cfec506e7mr27166qvj.7.1696424114666;
Wed, 04 Oct 2023 05:55:14 -0700 (PDT)
X-Received: by 2002:a05:6808:198e:b0:3ae:1363:751c with SMTP id
bj14-20020a056808198e00b003ae1363751cmr1250468oib.4.1696424114498; Wed, 04
Oct 2023 05:55:14 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 4 Oct 2023 05:55:14 -0700 (PDT)
In-Reply-To: <74b6ddc2-5efd-44d8-9097-6c2992d39b16n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:58d7:c433:89d1:4124;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:58d7:c433:89d1:4124
References: <uf4btl$3pe5m$1@dont-email.me> <74b6ddc2-5efd-44d8-9097-6c2992d39b16n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1e0eb2d3-d407-447f-9488-2c7511b63c82n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Wed, 04 Oct 2023 12:55:14 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 1573
 by: Michael S - Wed, 4 Oct 2023 12:55 UTC

On Wednesday, October 4, 2023 at 3:51:51 PM UTC+3, Michael S wrote:
> They keep spamming at constant rate of 0.5 msgs per minute day after day after day.

BTW, comp.arch can be considered relatively lucky.
comp.lang.c is spammed at a much higher rate, and by two seemingly unrelated spammers.

Re: Misc: Another (possible) way to more MHz...

<ufk1hg$cc1t$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34442&group=comp.arch#34442

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Wed, 4 Oct 2023 10:49:02 -0500
Organization: A noiseless patient Spider
Lines: 114
Message-ID: <ufk1hg$cc1t$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufd79o$2nafk$1@dont-email.me>
<0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com>
<ufdvs9$2s0qj$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me>
<joXSM.67418$EIy4.18153@fx48.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 4 Oct 2023 15:49:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="dcea170deb8270329d5447a70c72a71e";
logging-data="405565"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18qMJak4z8XhGRlsn7hNki3"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:8WlOlibj6AIIgv4dqfKgsOR5lzA=
In-Reply-To: <joXSM.67418$EIy4.18153@fx48.iad>
Content-Language: en-US
 by: BGB - Wed, 4 Oct 2023 15:49 UTC

On 10/3/2023 11:34 AM, EricP wrote:
> Kent Dickey wrote:
>> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB  <cr88192@gmail.com>
>> wrote:
>>> I am fiddling around a bit with it, and have been getting the core
>>> "closer" to being able to boost the speed, but the "Worst Negative
>>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>>> fight here...
>>>
>>> Looks like for a lot of the failing paths are sort of like:
>>>   ~ 12-14 levels;
>>>   ~ 50-130 high-fanout;
>>>   ~ 4.5 ns of logic delay;
>>>   ~ 10.5 ns of net-delay.
>>>
>>>
>>> What makes things harder is that I am trying to pull this off while
>>> staying with 32K L1 caches, ...
>>
>> A good rule of thumb is 1ns per LUT level.  So if you have 14 levels
>> of LUTs,
>> then you're aiming at 70MHz or so.  Either reduce levels of LUTs, or just
>> settle for 70MHz.
>>
>> Note that 14 levels of LUTs is equivalent to about 30 levels of
>> gates.  This is
>> a slow design independent of it being in an FPGA, and independent of
>> any FPGA routing issues.
>>
>> If you want to not optimize your control and other logic, that's your
>> choice.  But you're mixing things up.  You're saying an ALU cannot be
>> done within 10ns on an FPGA, and I'm pointing out that's not true.
>> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
>> just need to use the DSP48 blocks.  Synthesized adders in FPGA have poor
>> performance when they got to 32bits or wider.  And on an FPGA, bad
>> decisions compound--if you have a huge slow ALU, it makes everything else
>> slower as well (since everything gets further apart).
>
> I'm now looking at a PDF book
> "Designing with Xilinx FPGAs Using Vivado", 2017,  and it says
> "Embedded in the CLB is a high-performance look-ahead carry chain which
> enables the FPGA to implement very high-performance adders. Current FPGAs
> have carry chains which can implement a 64-bit adder at 500 MHz."
>
> Unfortunately I can't find where it says how just yet.
> I'd have expected this would be a library macro.
>

Dunno there...

I have noted:
  Naive add seems to be slow;
    Seems to synthesize a 64-bit adder as a chain of 16 CARRY4's.
    The tail end of this chain seems to have timing issues;
    Mostly works well if:
      Small sized value
      Latency doesn't matter
      One only needs the low-order bits quickly.
  Carry select is typically faster;
    Seems to work best in my testing with 12 or 16 bit chunks;
    If given a full cycle, can usually do 128 bits in 1 cycle at 50MHz.
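A rough sketch of the carry-select idea in 16-bit chunks (generic, not the
actual core's code): each chunk computes both the carry-in-0 and
carry-in-1 results in parallel, and a mux picks one once the previous
chunk's carry is known.

  module add64_csel(
      input  [63:0] a,
      input  [63:0] b,
      output [63:0] sum
  );
      // Chunk 0 adds normally and produces the carry into chunk 1.
      wire [16:0] s0   = {1'b0, a[15:0]}  + {1'b0, b[15:0]};
      // Chunks 1..3 precompute both carry-in cases in parallel.
      wire [16:0] s1_0 = {1'b0, a[31:16]} + {1'b0, b[31:16]};
      wire [16:0] s1_1 = {1'b0, a[31:16]} + {1'b0, b[31:16]} + 17'd1;
      wire [16:0] s2_0 = {1'b0, a[47:32]} + {1'b0, b[47:32]};
      wire [16:0] s2_1 = {1'b0, a[47:32]} + {1'b0, b[47:32]} + 17'd1;
      wire [16:0] s3_0 = {1'b0, a[63:48]} + {1'b0, b[63:48]};
      wire [16:0] s3_1 = {1'b0, a[63:48]} + {1'b0, b[63:48]} + 17'd1;
      // Select each chunk once the previous chunk's carry-out is known.
      wire [16:0] s1 = s0[16] ? s1_1 : s1_0;
      wire [16:0] s2 = s1[16] ? s2_1 : s2_0;
      wire [16:0] s3 = s2[16] ? s3_1 : s3_0;
      assign sum = { s3[15:0], s2[15:0], s1[15:0], s0[15:0] };
  endmodule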

There is the DSP48, which can in theory be used for add, but there is no
obvious way to invoke it for this short of instantiating it directly, which
looks "ugly as hell", would be FPGA-specific (I do most of my simulation
work in Verilator, ...), and still doesn't do a full 64 bits.

More so, while the DSP48 could in theory do a 48-bit add, the same DSP
couldn't also be used for packed-integer 16-bit and 32-bit SIMD ADD.
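
(If memory serves, Vivado also accepts a synthesis attribute, use_dsp
(use_dsp48 in older versions), that asks it to map arithmetic into DSP
blocks without instantiating the primitive by hand, which would keep the
RTL usable in Verilator/Icarus since they simply ignore the attribute.
A rough sketch under that assumption:

  (* use_dsp = "yes" *)   // hint: map this adder into DSP48 blocks
  module add48_dsp(
      input             clk,
      input      [47:0] a,
      input      [47:0] b,
      output reg [47:0] sum
  );
      always @(posedge clk)
          sum <= a + b;
  endmodule
)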

....

I have not fully confirmed, but it seems that the 128-bit paired ALU
mode is not ideal for timing at 75MHz.

Looks like most of the recent slow paths seem to be in the PC
step/update loop, and in the instruction decoder (it is going back and
forth).

PC update loop:
  Figure out how much to adjust PC based on the current instruction;
  Check if branch predictor does its thing;
  Check if a branch has been done;
  Feed final PC back into L1 I$.
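
(Roughly, that loop amounts to a next-PC mux feeding the register that
indexes the I$; a hypothetical sketch, with none of these names taken from
the actual core:

  module pc_update(
      input             clk,
      input      [3:0]  instr_length,    // from decoding the current instruction
      input             predict_taken,   // branch predictor redirect?
      input      [31:0] predict_target,
      input             branch_taken,    // resolved branch overrides everything
      input      [31:0] branch_target,
      output reg [31:0] pc               // fed back into the L1 I$ lookup
  );
      wire [31:0] pc_step = pc + {28'd0, instr_length};
      wire [31:0] pc_pred = predict_taken ? predict_target : pc_step;
      wire [31:0] pc_next = branch_taken  ? branch_target  : pc_pred;
      always @(posedge clk)
          pc <= pc_next;
  endmodule
)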

In the instruction decoder, it is mostly in the paths to get from the
instruction word to the output ports, and also the immediate output
fields, ...

Some latency seems due to the cases dealing with handling paired
registers, and the paired CR's case. But, this is only really relevant
in 96-bit mode (some of this might be easier if I dropped the core back
to 48-bits, but was trying to keep all the major ISA features intact for
this).

Still mostly battling with trying to find ways to free up the last ~ 2ns
or so (WNS ~ 2ns, TNS ~ 4k..8k ns with ~ 1k..2k failed routes). Still a
bit better than where I started.

Have started messing with some of the other synthesis modes:
  Defaults: The one I had been using.
  PerfOptimized_high: Timing came out slightly worse than Defaults.
  AreaOptimized_high: LUT cost dropped a fair bit;
    Timing got wrecked hard (paths were failing by around 7ns).
  AlternateRoutability: Still testing this one.
    Increases LUT cost by a few percent relative to Defaults.
    Still waiting to know the timing result.
....

Re: Misc: Another (possible) way to more MHz...

<ug2d5b$c1lb$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34488&group=comp.arch#34488

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Tue, 10 Oct 2023 02:33:15 -0000 (UTC)
Organization: provalid.com
Lines: 236
Message-ID: <ug2d5b$c1lb$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me> <joXSM.67418$EIy4.18153@fx48.iad> <ufk1hg$cc1t$1@dont-email.me>
Injection-Date: Tue, 10 Oct 2023 02:33:15 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ce943694371a0b347b05d164dcf95047";
logging-data="394923"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX197ctHq77LjnTkbkrGH0Jch"
Cancel-Lock: sha1:Cx25+dQunl7T4hXeUS4pmpxttfQ=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
Originator: kegs@provalid.com (Kent Dickey)
 by: Kent Dickey - Tue, 10 Oct 2023 02:33 UTC

In article <ufk1hg$cc1t$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>On 10/3/2023 11:34 AM, EricP wrote:
>> Kent Dickey wrote:
>>> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB  <cr88192@gmail.com>
>>> wrote:
>>>> I am fiddling around a bit with it, and have been getting the core
>>>> "closer" to being able to boost the speed, but the "Worst Negative
>>>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>>>> fight here...
>>>>
>>>> Looks like for a lot of the failing paths are sort of like:
>>>>   ~ 12-14 levels;
>>>>   ~ 50-130 high-fanout;
>>>>   ~ 4.5 ns of logic delay;
>>>>   ~ 10.5 ns of net-delay.
>>>>
>>>>
>>>> What makes things harder is that I am trying to pull this off while
>>>> staying with 32K L1 caches, ...
>>>
>>> A good rule of thumb is 1ns per LUT level.  So if you have 14 levels
>>> of LUTs,
>>> then you're aiming at 70MHz or so.  Either reduce levels of LUTs, or just
>>> settle for 70MHz.
>>>
>>> Note that 14 levels of LUTs is equivalent to about 30 levels of
>>> gates.  This is
>>> a slow design independent of it being in an FPGA, and independent of
>>> any FPGA routing issues.
>>>
>>> If you want to not optimize your control and other logic, that's your
>>> choice.  But you're mixing things up.  You're saying an ALU cannot be
>>> done within 10ns on an FPGA, and I'm pointing out that's not true.
>>> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
>>> just need to use the DSP48 blocks.  Synthesized adders in FPGA have poor
>>> performance when they got to 32bits or wider.  And on an FPGA, bad
>>> decisions compound--if you have a huge slow ALU, it makes everything else
>>> slower as well (since everything gets further apart).
>>
>> I'm now looking at a PDF book
>> "Designing with Xilinx FPGAs Using Vivado", 2017,  and it says
>> "Embedded in the CLB is a high-performance look-ahead carry chain which
>> enables the FPGA to implement very high-performance adders. Current FPGAs
>> have carry chains which can implement a 64-bit adder at 500 MHz."
>>
>> Unfortunately I can't find where it says how just yet.
>> I'd have expected this would be a library macro.
>>
>
>Dunno there...
>
>I have noted:
> Naive add seems to be slow;
> Seems to synthesize a 64-bit adder as a chain of 16 CARRY4's.
> The tail end of this chain seems to have timing issues;
> Mostly works well if:
> Small sized value
> Latency doesn't matter
> One only needs the low-order bits quickly.
> Carry select is typically faster;
> Seems to work best in my testing with 12 or 16 bit chunks;
> If given a full cycle, can usually do 128 bits in 1 cycle at 50MHz.

First, I think you should decide on your priorities, and what's important.

There are 2 camps:

1) If you just want to play with your architecture, and getting deep into
FPGA specifics is not interesting to you, then set your clock to 50MHz and
have fun.

or

2) If you want to learn how to optimize logic, and you want your
logic to run fast, then you need to learn timing tricks and work to fix
your timing bugs. You have severe timing issues. If you go down this
path, you should easily get a design that will make timing robustly at
100MHz. This means small changes will continue to make timing. Right
now you're probably stable at about 60MHz, so you have a long ways to go
(it means you have a large number of paths that need work).

I feel like you're really in camp #1, but you have some interest in camp #2.
But: you continually claim things about FPGA performance due to your
timing bugs that aren't true.

Sadly, it's not easy to do a little of both--timing is a tricky beast,
and if you stop working on it for a few weeks, you accumulate a debt.
Many designs lead to chaotic results: a minor logic change can cause
huge negative slack to appear, and more very minor changes can seem to
fix them. This just shows that you're not really making timing, but
that the tools can sometimes get lucky and make it work. You need to
get to a stable design, where small changes don't wreck timing. Tools
can sometimes get fairly bad paths working for you (12-14 LUTs in 10ns
is doable, sometimes). A rule to achieve this: fix every single timing
path that EVER occurs, based on LUT depths. If you make a minor change,
and a path shows up (being more than 10 LUTs, for example), you MUST
make a logic change that should fix it, and then continue. This sounds
like you'll just spin your wheels, but eventually you'll either: a) come
across a path you cannot fix; or b) fix all paths. When you hit b),
your timing is no longer chaotically unstable, and now minor changes
lead to no timing issues. The number of rounds to achieve this is often
around the number of days of writing code: so if it took you 40 8-hour days
to write the code (ignore revisions that were discarded, etc.), it's
about 40 cycles through this process to clean it up. Note: the easiest
approach is to synthesize every single day, and make a timing fix every
day, and then it's just a few more rounds at the end to get timing
stable.

So, you've now built up a "timing debt" by writing a lot of logic, and not
optimizing it. If you want to try to pay it down, expect to spend a lot
of time on it. I suggest you'll be happier in camp #1.

>There is the DSP48, which can in theory be used for add. No obvious way
>to invoke the DSP48 for this absent invoking it directly, which looks
>"ugly as hell", would be FPGA specific (I do most of my simulation work
>in Verilator, ...), and still doesn't do a full 64 bits.

Verilator is an issue. There are simulation libraries for components like
BRAMs and DSP48s, but when I last looked into this, Xilinx wrote them
specifically in a behavioral style, likely to prevent you from creating an ASIC
too easily from your FPGA design. Verilator specifically doesn't allow
behavioral style Verilog. In fact, I couldn't get started with Verilator
since I needed a little bit of behavioral testbench to init my system, so
I just gave up. I used Icarus Verilog, it was fast enough for my purposes.

>More so, while the DSP48 could in theory do 48 bits add, the same DSP
>couldn't also be used for packed-integer 16-bit and 32-bit SIMD ADD.

Although sort of true, it's not necessarily true. You could just do
16-bit in each DSP. However, we've now decisively determined synthesized
64-bit adders take 3ns or so and are definitely not related to your timing
issues.

>I have not fully confirmed, but it seems that the 128-bit paired ALU
>mode is not ideal for timing at 75MHz.

A 128-bit synthesized adder should take less than 5ns.

>Looks like most of the recent slow paths seem to be in the PC
>step/update loop, and in the instruction decoder (it is going back and
>forth).
>
>PC update Loop:
> Figure out how much to adjust PC based on the current instruction;
> Check if branch predictor does its thing;
> Check if a branch has been done;
> Feed final PC back into L1 I$.
>
>
>In the instruction decoder, it is mostly in the paths to get from the
>instruction word to the output ports, and also the immediate output
>fields, ...
>
>Some latency seems due to the cases dealing with handling paired
>registers, and the paired CR's case. But, this is only really relevant
>in 96-bit mode (some of this might be easier if I dropped the core back
>to 48-bits, but was trying to keep all the major ISA features intact for
>this).
>
>
>Still mostly battling with trying to find ways to free up the last ~ 2ns
>or so (WNS ~ 2ns, TNS ~ 4k..8k ns with ~ 1k..2k failed routes). Still a
>bit better than where I started.

In reading between the lines, I think you may not be approaching timing
issues in the right way. First, it's important to note the tools work
VERY hard to reduce total negative slack (TNS). If there's a path with
a signal that's late, and it makes 1000 destinations 2ns late, that's
-2000ns. So: if there's another path that just makes timing with 20
destinations, and it moves all of that logic to the bottom right corner
of the chip, making each of those 20 paths 4ns longer, then this can be
a win. If moving those 20 paths is enough to speed that one -2ns path
(with fanout 1000) to -1.9ns, then the TNS becomes -1900 - 80 = -1980ns.
This is better TNS, and so this change will be made. The tools keep
working, and so to get these 1000 paths slightly better, it will destroy
lots of other logic that was easily making timing. What this means is
when you have a stupidly huge TNS, then the single WORST path could be
something completely random and possibly unrelated (but which was "in
the way"). To fix this, you have to search through say the top 20
(maybe top 50) worst negative slack (WNS) individual paths to find among
the worst paths with a high-fanout path--and fix THAT path (before the
high fanout). Note: the tools are working pretty hard to make it hard
to find these paths. But remember the rule: if you're aiming for 10 LUT
levels, then fix any path with more than 10 LUT levels that shows up on
the WNS list.


[Article truncated]
Re: Misc: Another (possible) way to more MHz...

<ug2pqu$vk2g$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34489&group=comp.arch#34489

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Tue, 10 Oct 2023 01:09:31 -0500
Organization: A noiseless patient Spider
Lines: 433
Message-ID: <ug2pqu$vk2g$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me>
<joXSM.67418$EIy4.18153@fx48.iad> <ufk1hg$cc1t$1@dont-email.me>
<ug2d5b$c1lb$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 10 Oct 2023 06:09:34 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6950b43a9ad29bd293d5db57581fd3e0";
logging-data="1036368"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+4RrsAoB2UCEo9bTi1Rb/z"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:mv2L21xKh0abvXJ5eBtKssqPWqo=
Content-Language: en-US
In-Reply-To: <ug2d5b$c1lb$1@dont-email.me>
 by: BGB - Tue, 10 Oct 2023 06:09 UTC

On 10/9/2023 9:33 PM, Kent Dickey wrote:
> In article <ufk1hg$cc1t$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>> On 10/3/2023 11:34 AM, EricP wrote:
>>> Kent Dickey wrote:
>>>> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB  <cr88192@gmail.com>
>>>> wrote:
>>>>> I am fiddling around a bit with it, and have been getting the core
>>>>> "closer" to being able to boost the speed, but the "Worst Negative
>>>>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>>>>> fight here...
>>>>>
>>>>> Looks like for a lot of the failing paths are sort of like:
>>>>>   ~ 12-14 levels;
>>>>>   ~ 50-130 high-fanout;
>>>>>   ~ 4.5 ns of logic delay;
>>>>>   ~ 10.5 ns of net-delay.
>>>>>
>>>>>
>>>>> What makes things harder is that I am trying to pull this off while
>>>>> staying with 32K L1 caches, ...
>>>>
>>>> A good rule of thumb is 1ns per LUT level.  So if you have 14 levels
>>>> of LUTs,
>>>> then you're aiming at 70MHz or so.  Either reduce levels of LUTs, or just
>>>> settle for 70MHz.
>>>>
>>>> Note that 14 levels of LUTs is equivalent to about 30 levels of
>>>> gates.  This is
>>>> a slow design independent of it being in an FPGA, and independent of
>>>> any FPGA routing issues.
>>>>
>>>> If you want to not optimize your control and other logic, that's your
>>>> choice.  But you're mixing things up.  You're saying an ALU cannot be
>>>> done within 10ns on an FPGA, and I'm pointing out that's not true.
>>>> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
>>>> just need to use the DSP48 blocks.  Synthesized adders in FPGA have poor
>>>> performance when they got to 32bits or wider.  And on an FPGA, bad
>>>> decisions compound--if you have a huge slow ALU, it makes everything else
>>>> slower as well (since everything gets further apart).
>>>
>>> I'm now looking at a PDF book
>>> "Designing with Xilinx FPGAs Using Vivado", 2017,  and it says
>>> "Embedded in the CLB is a high-performance look-ahead carry chain which
>>> enables the FPGA to implement very high-performance adders. Current FPGAs
>>> have carry chains which can implement a 64-bit adder at 500 MHz."
>>>
>>> Unfortunately I can't find where it says how just yet.
>>> I'd have expected this would be a library macro.
>>>
>>
>> Dunno there...
>>
>> I have noted:
>> Naive add seems to be slow;
>> Seems to synthesize a 64-bit adder as a chain of 16 CARRY4's.
>> The tail end of this chain seems to have timing issues;
>> Mostly works well if:
>> Small sized value
>> Latency doesn't matter
>> One only needs the low-order bits quickly.
>> Carry select is typically faster;
>> Seems to work best in my testing with 12 or 16 bit chunks;
>> If given a full cycle, can usually do 128 bits in 1 cycle at 50MHz.
>
> First, I think you should decide on your priorities, and what's important.
>
> There are 2 camps:
>
> 1) If you just want to play with your architecture, and getting deep into
> FPGA specifics is not interesting to you, then set your clock to 50MHz and
> have fun.
>
> or
>
> 2) If you want to to learn how to optimize logic, and you want your
> logic to run fast, then you need to learn timing tricks and work to fix
> your timing bugs. You have severe timing issues. If you go down this
> path, you should easily get a design that will make timing robustly at
> 100MHz. This means small changes will continue to make timing. Right
> now you're probably stable at about 60MHz, so you have a long ways to go
> (it means you have a large number of paths that need work).
>
> I feel like you're really in camp #1, but you have some interest in camp #2.
> But: you continually claim things about FPGA performance due to your
> timing bugs that aren't true.
>

Initially, I was mostly just trying to make everything work at all...

When I started out, I was first trying to target 100 MHz, but then got
frustrated as it was nearly impossible to write much of anything
non-trivial and have it pass timing, so I had ended up dropping down to
50MHz.

Most of the development process was then "beat on it whenever it failed
timing at 50MHz, ignore it otherwise".

For most of its existence, it was "mostly passes timing at 50MHz and
only occasionally fails" (with timing failures
seeming to be pretty much at random, and usually going away by poking at
something).

Comparably though, 50MHz was a lot easier as it was closer to "just
write whatever and it still works" in terms of timing. Logic with faster
clock-speeds is generally a bit more of a pain.

What prompted me to try to go down this path was when a random change
caused slack to jump from its typical ~ 0.5 or 0.6ns, to around 4.5 ns.
This being one of the biggest jumps in timing slack I had seen.

But, it seems, the difference between 50MHz and 75MHz is 6.7ns, and the
last few ns are still putting up a pretty big fight.

Recently, had gotten TNS down to around 400ns (with a WNS of around 0.67ns).

Though, disabling a random ISA feature caused it to jump back up to
around TNS=1600ns, WNS=2.1ns; but, with LUT cost dropping a bit.

More minor fiddling, then it changes to TNS=1888, WNS=1.462, LUT cost
back up to where it was before.

....

In any case, I seem to be getting gradually closer (even if not by huge
amounts).

> Sadly, it's not easy to do a little of both--timing is a tricky beast,
> and if you stop working on it for a few weeks, you accumulate a debt.
> Many designs lead to chaotic results: a minor logic change can cause
> huge negative slack to appear, and more very minor changes can seem to
> fix them. This just shows that you're not really making timing, but
> that the tools can sometimes get lucky and make it work. You need to
> get to a stable design, where small changes don't wreck timing. Tools
> can sometimes get fairly bad paths working for you (12-14 LUTs in 10ns
> is doable, sometimes). A rule to achieve this: fix every single timing
> path that EVER occurs, based on LUT depths. If you make a minor change,
> and a path shows up (being more than 10 LUTs, for example), you MUST
> make a logic change that should fix it, and then continue. This sounds
> like you'll just spin your wheels, but eventually you'll either: a) come
> across a path you cannot fix; or b) fix all paths. When you hit b),
> your timing is no longer chaotically unstable, and now minor changes
> lead to no timing issues. The number of rounds to achieve this is often
> around the number of days of writing code: so if took you 40 8-hour days
> to write the code (ignore revisions that were discarded, etc.), it's
> about 40 cycles through this process to clean it up. Note: the easiest
> approach is to synthesize every single day, and make a timing fix every
> day, and then it's just a few more rounds at the end to get timing
> stable.
>
> So, you've now built up a "timing debt" by writing a lot of logic, and not
> optimizing it. If you want to try to pay it down, expect to spend a lot
> of time on it. I suggest you'll be happier in camp #1.
>

I have made the observation in all this that it seems more effective to
focus on paths that have more "logic delay", and mostly ignore paths
that are almost entirely "net delay".

How exactly the Verilog code gets from its "C-like" notation to a
particular representation in the FPGA is still a little bit of a mystery
at times.

It doesn't help in all this that Vivado only displays the top 10 paths
in terms of WNS.

I had mostly been spotting "slow paths" by noting some of the long and
more complex dependencies in the Verilog (and seeing if poking at them
affects timing).

>> There is the DSP48, which can in theory be used for add. No obvious way
>> to invoke the DSP48 for this absent invoking it directly, which looks
>> "ugly as hell", would be FPGA specific (I do most of my simulation work
>> in Verilator, ...), and still doesn't do a full 64 bits.
>
> Verilator is an issue. There are simulation libraries for components like
> BRAMs and DSP48s, but when I last looked into this, Xilinx wrote them
> specfically in a behavioral style to likely prevent you creating an ASIC
> too easily from your FPGA design. Verilator specifically doesn't allow
> behavioral style Verilog. In fact, I couldn't get started with Verilator
> since I needed a little bit of behavioral testbench to init my system, so
> I just gave up. I used Icarus Verilog, it was fast enough for my purposes.
>


[Article truncated]
Re: Misc: Another (possible) way to more MHz...

<3190484f-bec6-4f63-a6a9-13ba66f5c299n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34491&group=comp.arch#34491

X-Received: by 2002:a05:622a:4f:b0:412:1bc3:10f3 with SMTP id y15-20020a05622a004f00b004121bc310f3mr285592qtw.13.1696940136400;
Tue, 10 Oct 2023 05:15:36 -0700 (PDT)
X-Received: by 2002:a05:6808:1997:b0:3ad:ba05:a3be with SMTP id
bj23-20020a056808199700b003adba05a3bemr5942906oib.4.1696940136085; Tue, 10
Oct 2023 05:15:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 10 Oct 2023 05:15:35 -0700 (PDT)
In-Reply-To: <ug2pqu$vk2g$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=99.251.79.92; posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 99.251.79.92
References: <uf4btl$3pe5m$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me>
<joXSM.67418$EIy4.18153@fx48.iad> <ufk1hg$cc1t$1@dont-email.me>
<ug2d5b$c1lb$1@dont-email.me> <ug2pqu$vk2g$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3190484f-bec6-4f63-a6a9-13ba66f5c299n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Tue, 10 Oct 2023 12:15:36 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: robf...@gmail.com - Tue, 10 Oct 2023 12:15 UTC

On Tuesday, October 10, 2023 at 2:11:15 AM UTC-4, BGB wrote:
> On 10/9/2023 9:33 PM, Kent Dickey wrote:
> > In article <ufk1hg$cc1t$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
> >> On 10/3/2023 11:34 AM, EricP wrote:
> >>> Kent Dickey wrote:
> >>>> In article <ufdvs9$2s0qj$1...@dont-email.me>, BGB <cr8...@gmail.com>
> >>>> wrote:
> >>>>> I am fiddling around a bit with it, and have been getting the core
> >>>>> "closer" to being able to boost the speed, but the "Worst Negative
> >>>>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
> >>>>> fight here...
> >>>>>
> >>>>> Looks like for a lot of the failing paths are sort of like:
> >>>>> ~ 12-14 levels;
> >>>>> ~ 50-130 high-fanout;
> >>>>> ~ 4.5 ns of logic delay;
> >>>>> ~ 10.5 ns of net-delay.
> >>>>>
> >>>>>
> >>>>> What makes things harder is that I am trying to pull this off while
> >>>>> staying with 32K L1 caches, ...
> >>>>
> >>>> A good rule of thumb is 1ns per LUT level. So if you have 14 levels
> >>>> of LUTs,
> >>>> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
> >>>> settle for 70MHz.
> >>>>
> >>>> Note that 14 levels of LUTs is equivalent to about 30 levels of
> >>>> gates. This is
> >>>> a slow design independent of it being in an FPGA, and independent of
> >>>> any FPGA routing issues.
> >>>>
> >>>> If you want to not optimize your control and other logic, that's your
> >>>> choice. But you're mixing things up. You're saying an ALU cannot be
> >>>> done within 10ns on an FPGA, and I'm pointing out that's not true.
> >>>> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
> >>>> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
> >>>> performance when they got to 32bits or wider. And on an FPGA, bad
> >>>> decisions compound--if you have a huge slow ALU, it makes everything else
> >>>> slower as well (since everything gets further apart).
> >>>
> >>> I'm now looking at a PDF book
> >>> "Designing with Xilinx FPGAs Using Vivado", 2017, and it says
> >>> "Embedded in the CLB is a high-performance look-ahead carry chain which
> >>> enables the FPGA to implement very high-performance adders. Current FPGAs
> >>> have carry chains which can implement a 64-bit adder at 500 MHz."
> >>>
> >>> Unfortunately I can't find where it says how just yet.
> >>> I'd have expected this would be a library macro.
> >>>
> >>
> >> Dunno there...
> >>
> >> I have noted:
> >> Naive add seems to be slow;
> >> Seems to synthesize a 64-bit adder as a chain of 16 CARRY4's.
> >> The tail end of this chain seems to have timing issues;
> >> Mostly works well if:
> >> Small sized value
> >> Latency doesn't matter
> >> One only needs the low-order bits quickly.
> >> Carry select is typically faster;
> >> Seems to work best in my testing with 12 or 16 bit chunks;
> >> If given a full cycle, can usually do 128 bits in 1 cycle at 50MHz.
> >
> > First, I think you should decide on your priorities, and what's important.
> >
> > There are 2 camps:
> >
> > 1) If you just want to play with your architecture, and getting deep into
> > FPGA specifics is not interesting to you, then set your clock to 50MHz and
> > have fun.
> >
> > or
> >
> > 2) If you want to to learn how to optimize logic, and you want your
> > logic to run fast, then you need to learn timing tricks and work to fix
> > your timing bugs. You have severe timing issues. If you go down this
> > path, you should easily get a design that will make timing robustly at
> > 100MHz. This means small changes will continue to make timing. Right
> > now you're probably stable at about 60MHz, so you have a long ways to go
> > (it means you have a large number of paths that need work).
> >
> > I feel like you're really in camp #1, but you have some interest in camp #2.
> > But: you continually claim things about FPGA performance due to your
> > timing bugs that aren't true.
> >
> Initially, I was mostly just trying to make everything work at all...
>
> When I started out, I was first trying to target 100 MHz, but then got
> frustrated as it was nearly impossible to write much of anything
> non-trivial and have it pass timing, so I had ended up dropping down to
> 50MHz.
>
> Most of the development process was then "beat on it whenever it failed
> timing at 50MHz, ignore it otherwise".
>
> For the most part, for most of its existence, it was "mostly passes
> timing at 50MHz and only occasionally fails" (with timing failures
> seeming to be pretty much at random, and usually going away by poking at
> something).
>
>
> Comparably though, 50MHz was a lot easier as it was closer to "just
> write whatever and it still works" in terms of timing. Logic with faster
> clock-speeds is generally a bit more of a pain.
>
>
>
> What prompted me to try to go down this path was when a random change
> caused slack to jump from its typical ~ 0.5 or 0.6ns, to around 4.5 ns.
> This being one of the biggest jumps in timing slack I had seen.
>
>
>
> But, it seems, the difference between 50MHz and 75MHz is 6.7ns, and the
> last few ns are still putting up a pretty big point.
>
>
>
> Recently, had gotten TNS down to around 400ns (with a WNS of around 0.67ns).
>
> Though, disabling a random ISA feature caused it to jump back up to
> around TNS=1600ns, WNS=2.1ns; but, with LUT cost dropping a bit.
>
>
> More minor fiddling, then it changes to TNS=1888, WNS=1.462, LUT cost
> back up to where it was before.
>
> ...
>
>
> In any case, I seem to be getting gradually closer (even if not by huge
> amounts).
> > Sadly, it's not easy to do a little of both--timing is a tricky beast,
> > and if you stop working on it for a few weeks, you accumulate a debt.
> > Many designs lead to chaotic results: a minor logic change can cause
> > huge negative slack to appear, and more very minor changes can seem to
> > fix them. This just shows that you're not really making timing, but
> > that the tools can sometimes get lucky and make it work. You need to
> > get to a stable design, where small changes don't wreck timing. Tools
> > can sometimes get fairly bad paths working for you (12-14 LUTs in 10ns
> > is doable, sometimes). A rule to achieve this: fix every single timing
> > path that EVER occurs, based on LUT depths. If you make a minor change,
> > and a path shows up (being more than 10 LUTs, for example), you MUST
> > make a logic change that should fix it, and then continue. This sounds
> > like you'll just spin your wheels, but eventually you'll either: a) come
> > across a path you cannot fix; or b) fix all paths. When you hit b),
> > your timing is no longer chaotically unstable, and now minor changes
> > lead to no timing issues. The number of rounds to achieve this is often
> > around the number of days of writing code: so if took you 40 8-hour days
> > to write the code (ignore revisions that were discarded, etc.), it's
> > about 40 cycles through this process to clean it up. Note: the easiest
> > approach is to synthesize every single day, and make a timing fix every
> > day, and then it's just a few more rounds at the end to get timing
> > stable.
> >
> > So, you've now built up a "timing debt" by writing a lot of logic, and not
> > optimizing it. If you want to try to pay it down, expect to spend a lot
> > of time on it. I suggest you'll be happier in camp #1.
> >
> I have made the observation in all this that is seems more effective to
> focus on paths that have more "logic delay", and mostly ignore paths
> that are almost entirely "net delay".
>
>
> How exactly the Verilog code gets from its "C-like" notation to a
> particular representation in the FPGA is still a little bit of a mystery
> at times.
>
>
> It doesn't help in all this that Vivado only displays the top 10 paths
> in terms of WNS.
>
I think there is a command line option for reports “max_paths=10” that
may be altered.


Re: Misc: Another (possible) way to more MHz...

<uh6l18$3c3hv$2@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34633&group=comp.arch#34633

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Mon, 23 Oct 2023 15:27:15 -0500
Organization: A noiseless patient Spider
Lines: 180
Message-ID: <uh6l18$3c3hv$2@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 23 Oct 2023 20:28:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f569a351efcf83344b06952c50535dc8";
logging-data="3542591"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX198EGhvAhUVpaL+y1sheAPQ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:jUYdlTyVrs5aoJ52OGMBPvBHc80=
In-Reply-To: <uf4btl$3pe5m$1@dont-email.me>
Content-Language: en-US
 by: BGB - Mon, 23 Oct 2023 20:27 UTC

On 9/28/2023 12:08 PM, BGB wrote:
> I recently had an idea (that small scale testing doesn't require
> redesigning my whole pipeline):
> If one delays nearly all of the operations to at least a 2-cycle
> latency, then seemingly the timing gets a fair bit better.
>
> In particular, a few 1-cycle units:
>   SHAD (bitwise shift and similar)
>   CONV (various bit-repacking instructions)
> Were delayed to 2 cycle:
>   SHAD:
>     2 cycle latency doesn't have much obvious impact;
>   CONV: Minor impact
>     I suspect due to delaying MOV-2R and EXTx.x and similar.
>     I could special-case these in Lane 1.
>
>
> There was already a slower CONV2 path which had mostly dealt with things
> like FPU format conversion and other "more complicated" format
> converters, so the CONV path had mostly been left for operations that
> mostly involved shuffling the bits around (and the simple case
> 2-register MOV instruction and similar, etc).
>
> Note that most ALU ops were already generally 2-cycle as well.
>

Was eventually able to get the core to mostly pass timing at 75MHz (it
is still pretty tight though, and prone to fail timing seemingly at random).

Major summary:
Most instructions apart from Register MOV and Constant-Load now have a
minimum of a 2 cycle latency (2L/1T).

Currently the low-precision FP-SIMD unit is disabled, as I couldn't get
it to pass timing effectively (asking for 3-cycle Binary32 ops at 75MHz
may be asking too much; though if it worked, it would be a max of ~ 300
MFLOP at this speed).

This means it has reverted to the slower option of pipelining the
FP-SIMD operations through the main FPU. These FP-SIMD ops have also
increased from 10 to 11 cycles.

Eg:
X A x x x x A - - - X
X - B x x x x B - - X
X - - C x x x x C - X
X - - - D x x x x D X

As-is, this effectively reduces FP-SIMD maximum throughput to 18 MFLOP
(so, as-is, 50MHz would still be superior if one wants FP-SIMD performance).

Ended up disabling LDTEX and various other "edge case" features.
Currently, the 128-bit ALU operations are also disabled.

The RISC-V decoder mode is also disabled (partly for sake of
instruction-decoder latency; also because RISC-V requires full
compare-and-branch to be enabled, which is "rather expensive" in terms
of timing).

Was able to get timing to work with 16K L1 caches, but 32K still seems
to be asking too much here.

Also, it turns out the branch predictor doesn't work (it seems it had
actually been broken for a while). It looks like some debugging checks
(that I had forgotten about) had effectively kept the branch predictor
from "actually doing anything". After getting it fixed enough to
"actually do something", it does make things notably faster, but it is
also prone to causing things to crash.

This is more of a logic issue than a timing issue though; so debugging
the branch predictor may be a reasonable next step.

Various other structural changes.

L1 I$:
The handling of instruction-length decoding was somewhat reworked;
It now partly figures out lengths when the cache line is fetched;
This is now cached in the L1 arrays (along with the other metadata).
This was needed to speed-up the PC-step loop.

Effectively, determining instruction length for BJX2 requires looking at
8 bits (7 from the instruction word, plus 1 to select between Baseline
and XG2 mode). This would increase to 11 bits if RISC-V mode were enabled.

An "XG2 only" core would only require 4 bits here, but going "XG2 Only"
would break binary compatibility with my existing code (and I would also
need to deal with the Boot-ROM suddenly no longer fitting into the
existing 32K footprint).
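
As a rough sketch of the predecode-at-fill idea (module and signal names
are hypothetical, and the classify function is a dummy stand-in rather than
the real 8-bit BJX2 length rule):

module icache_len_predecode #(parameter PARCELS = 8) (
    input  wire [16*PARCELS-1:0] fill_line, // raw line arriving from memory
    input  wire                  mode_xg2,  // Baseline vs XG2 mode bit
    output wire [PARCELS-1:0]    len32      // 1 = parcel starts a 32-bit op
);
    // Stand-in for the actual lookup over 7 opcode bits + the mode bit;
    // the real rule is not reproduced here.
    function is_len32(input [6:0] opbits, input xg2);
        is_len32 = opbits[6] | xg2;  // dummy rule
    endfunction

    genvar i;
    generate
        for (i = 0; i < PARCELS; i = i + 1) begin : g_len
            assign len32[i] = is_len32(fill_line[16*i+9 +: 7], mode_xg2);
        end
    endgenerate
endmodule

The len32 bits would then be written into the L1 arrays along with the
rest of the line's metadata, so the PC-step logic only reads a cached bit
rather than re-deriving the length every cycle.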

Did partially restructure the 32-bit instruction decoder:
The various instruction blocks are now decoded in parallel;
F0, F1, F2, F8, and combined FA/FB/FE/FF.
The top-level bits then select which block's outputs to use via a final MUX.

It is as if evaluating all the blocks via if/else chains had created a
sort of linear dependency between them, resulting in higher latency in
the decoding.
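
A minimal sketch of the mux-at-the-top structure (block names as above;
the select bits and the decode-word width are placeholders, not the
actual BJX2 layout):

module decode_block_mux(
    input  wire [31:0] istr_word,               // fetched 32-bit instruction
    input  wire [63:0] dec_f0, dec_f1, dec_f2,  // outputs of the per-block
    input  wire [63:0] dec_f8, dec_fx,          //  decoders, computed in parallel
    output reg  [63:0] dec_out                  // selected decode word
);
    // Only this final mux sits after the sub-decoders, instead of an
    // if/else chain where each block's check gates the next one.
    always @* begin
        case (istr_word[31:28])         // placeholder select bits
            4'h0:    dec_out = dec_f0;
            4'h1:    dec_out = dec_f1;
            4'h2:    dec_out = dec_f2;
            4'h8:    dec_out = dec_f8;
            default: dec_out = dec_fx;  // combined FA/FB/FE/FF
        endcase
    end
endmodule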

The handling of Imm16 and Imm10 ops was split into two different "Form
ID" groups, as putting them together led to more conflict in the
register field outputs.

Generally, the handling of immediate values was changed over to using
selector bits to indicate which sort of immediate is in use, with this
handled as a "last step".

It is possible that a lower-latency decoder could eliminate the FormID
blocks, in favor of using independent selector bits for each field, say:
Rs/Rt/Rn/Rp selectors:
000: ZR
001: Fixed Reg Selector (ZR/PC/GBR/DLR/DHR/...)
010: Mem (Rs: PC/GBR/TBR, Rt: Rt/DLR)
011: Rn_F8
100: Rm_Dfl
101: Ro_Dfl
110: Rn_Dfl
111: Rp_Dfl

Imm Selector (as-is, *):
0000: Zero
0001: Disp9 (Jumbo-FE: Imm33s)
0010: Disp5 / Imm5u (Imm via Ro field)
0011: Imm8au (0, Jumbo-FF Imm8u)
0100: Imm6u (2R, Imm6u via Rm field, Jumbo: N/A)
0101: Imm9u (Jumbo-FE: Imm33s)
0110: Imm10u (Jumbo-FE: Imm33s)
0111: Imm16u (Jumbo-FE-FE: 64 bit)
1000: Disp20s (Branch ops)
1001: Imm9n (Jumbo-FE: Imm33s)
1010: Imm10n (Jumbo-FE: Imm33s)
1011: Imm16n (Jumbo-FE-FE: 64 bit)
1100: Imm24u (Jumbo-FE: 48-bit)
1101: Imm24n (Jumbo-FE: 48-bit)
1110: Imm16s (Jumbo-FE-FE: 64 bit)
1111: Disp8s (JCMP ops, Jumbo-FE: Imm32s)

*: This currently covers all of the existent "distinct" types of
immediate field in BJX2 for 32-bit instructions.

There were some other possibilities previously (like Imm9s or Imm5s and
Imm5n) but these were pruned as they were not used anywhere in the ISA
as it exists. Where the u/n/s suffix gives the sign extension (u: Zero
Extended, n: One Extended, s: Sign Extended).
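
The "last step" part might look something like this (the field positions
pulled from the instruction word are hypothetical, and only a few selector
cases from the list above are shown; the point is just that the extension
happens once, at the end):

module imm_last_step(
    input  wire [31:0] istr,     // 32-bit instruction word
    input  wire [3:0]  imm_sel,  // selector, values as in the list above
    output reg  [32:0] imm_out   // extended immediate (33 bits, per Imm33s)
);
    // Hypothetical field positions; the real BJX2 layout differs.
    wire [9:0]  imm10 = istr[9:0];
    wire [15:0] imm16 = istr[15:0];

    always @* begin
        case (imm_sel)
            4'b0000: imm_out = 33'd0;                     // Zero
            4'b0110: imm_out = {23'd0, imm10};            // Imm10u, zero extend
            4'b1010: imm_out = {{23{1'b1}}, imm10};       // Imm10n, one extend
            4'b0111: imm_out = {17'd0, imm16};            // Imm16u, zero extend
            4'b1110: imm_out = {{17{imm16[15]}}, imm16};  // Imm16s, sign extend
            default: imm_out = 33'd0;                     // other cases omitted
        endcase
    end
endmodule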

Did end up adding Branch-and-Compare-Zero ops with an 11-bit
displacement. These ended up using the existing Imm10u/Imm10n scheme,
just treating the Imm10u/Imm10n field as a combined 11-bit field.

This is mostly because this was "almost as good" as a 12-bit
displacement, but avoided needing to add any new instruction forms (or
deal with some of the non-orthogonality that would have resulted if I
put it into the F8 block; where in Baseline mode, it would have only
been encodable for R0..R31).

Granted, one could argue that maybe it is a "waste of encoding space" to
put these in the F2 block, but alas.

These ops will still mostly be relevant for loops following a structure
like:
while(n--)
{ ... stuff ... }
This will generally be a little faster than, say:
for(i=0; i<n; i++)
{ ... stuff ... }

Also, unlike full branch-compare, the compare-with-zero case has much
less difficulty passing timing at this clock speed.
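
In hardware terms the difference is roughly a reduction-OR versus a full
subtract sitting in the branch-decide path; a crude sketch (names
hypothetical):

module branch_decide_sketch(
    input  wire [63:0] rs,        // register being tested
    input  wire [63:0] rt,        // second operand for the full-compare case
    output wire        take_beqz, // compare-with-zero decision
    output wire        take_blt   // full ordered compare-and-branch decision
);
    // Zero test: a wide OR reduction, only a few LUT levels deep.
    assign take_beqz = ~(|rs);
    // Ordered compare: effectively a 64-bit subtract, so a carry chain ends
    // up in the branch-decide path, which is the part that tends to
    // struggle for timing at the higher clock.
    assign take_blt  = ($signed(rs) < $signed(rt));
endmodule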

....
