
Subject (Author)
* Misc: 8-way vs 4-way cache, a cost mystery... (BGB)
+* Re: Misc: 8-way vs 4-way cache, a cost mystery... (MitchAlsup1)
|`* Re: Misc: 8-way vs 4-way cache, a cost mystery... (BGB)
| `* Re: Misc: 8-way vs 4-way cache, a cost mystery... (MitchAlsup1)
|  `* Re: Misc: 8-way vs 4-way cache, a cost mystery... (BGB)
|   `* Re: Misc: 8-way vs 4-way cache, a cost mystery... (MitchAlsup1)
|    `* Re: Misc: 8-way vs 4-way cache, a cost mystery... (BGB)
|     `- Re: Misc: 8-way vs 4-way cache, a cost mystery... (MitchAlsup1)
`* Re: Misc: 8-way vs 4-way cache, a cost mystery... (Robert Finch)
 `* Re: Misc: 8-way vs 4-way cache, a cost mystery... (BGB)
  `- Re: Misc: 8-way vs 4-way cache, a cost mystery... (Robert Finch)

Misc: 8-way vs 4-way cache, a cost mystery...

<urlbcp$3bvto$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37742&group=comp.arch#37742

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Misc: 8-way vs 4-way cache, a cost mystery...
Date: Tue, 27 Feb 2024 12:58:25 -0600
Organization: A noiseless patient Spider
Lines: 880
Message-ID: <urlbcp$3bvto$1@dont-email.me>
 by: BGB - Tue, 27 Feb 2024 18:58 UTC

So, I have added a small associative cache to my ringbus, and it can be
set to 4-way or 8-way, now with a "move to front" special-case which
seems to significantly improve hit-rate (in a behavioral model).

One mystery though is why the 8-way case costs significantly more than
twice the LUTs of 4-way (and more than the "n*log2(n)" cost ratio I
might expect when accounting for MUX'ing).
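
For reference, the "n*log2(n)" expectation can be evaluated directly. This is only a sketch of that rough model, in arbitrary units; the function name is made up for illustration and nothing here comes from a synthesis report.

```python
import math

def mux_cost_units(ways: int) -> float:
    """Relative cost under the rough n*log2(n) model (arbitrary units)."""
    return ways * math.log2(ways)

# Ratio the model predicts when going from 4-way to 8-way.
model_ratio = mux_cost_units(8) / mux_cost_units(4)
print(model_ratio)
```

Under this model, 8-way would be expected to cost about three times as much as 4-way, not twice.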

This version did make a few design changes vs a version I had mentioned
previously (in another thread), mostly modifying the logic to select the
hit value in a way that hopefully saves a few LUTs.

But, then added the MTF logic (move-to-front), which increases LUT count
again, mostly by adding variability in terms of how values may move
around when updating the cache arrays.
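
To illustrate why move-to-front can help hit rate, here is a minimal behavioral sketch of a single cache set. It is an assumption-laden simplification, not the model used for the numbers in this post: new lines always enter at the front way, the last way falls off on a miss, and the access trace (one hot tag interleaved with a stream of new tags) is invented for the example.

```python
def simulate(trace, ways, mtf):
    """Count hits for one set of `ways` lines; lines[0] is the front way."""
    lines = []
    hits = 0
    for tag in trace:
        if tag in lines:
            hits += 1
            if mtf:
                # Move the hit line to the front; ways behind it stay put.
                lines.remove(tag)
                lines.insert(0, tag)
        else:
            # Miss: the new line enters at the front, the others shift back,
            # and whatever was in the last way is dropped.
            lines.insert(0, tag)
            del lines[ways:]
    return hits

# Hypothetical trace: tag 0 is reused, tags 1..6 are touched once each.
trace = []
for t in range(1, 7):
    trace += [0, t]

hits_plain = simulate(trace, ways=4, mtf=False)
hits_mtf = simulate(trace, ways=4, mtf=True)
print(hits_plain, hits_mtf)
```

Without MTF, the reused tag keeps sliding toward the back and eventually gets pushed out by the stream. With MTF it is pulled back to the front on every hit and survives.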

In this case, the cache operates on the Ringbus, and sits at the "bridge
point" between the L1 and L2 rings, where the L1 ring contains the L1
caches and TLB, and the L2 ring contains the L2 cache, ROM areas, a
bridge over to the MMIO bus, etc. The L2 cache, in turn, contains the
interface to the external DRAM.

The ringbus passes data in 128-bit chunks, with messages moving along
the ring at 1 position per clock-cycle. There are no stalls, and
generally no buffering apart from the flip-flops composing the ring itself.
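
The "1 position per clock-cycle, no stalls" behavior can be pictured with a toy model. This is just a sketch: the ring size of 4 and the list-of-slots representation are arbitrary choices for illustration, and the real ring carries address, data, opm and seq fields rather than a single value.

```python
def step(ring):
    """Advance every slot by one position; the last slot wraps to the first."""
    return [ring[-1]] + ring[:-1]

# One message in slot 0, all other slots idle.
ring = ["msg", None, None, None]
for _ in range(3):          # three clock cycles
    ring = step(ring)

print(ring.index("msg"))
```

Since nothing ever stalls, a message that is not consumed simply keeps circulating and comes back around after one full trip.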

This cache operates exclusively on normal physical addresses:
  (47:44)==0xC
It flushes lines on No-Cache addresses ((47:44)==0xD), but ignores most
other address ranges. Currently it flushes lines by setting their high 4
bits to 0xF, which was typically the address range used for MMIO.
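
As a sketch of the address handling described above (helper names are hypothetical, and treating every other range as "ignore" is a simplification, since only "most" other ranges are ignored):

```python
CACHEABLE = 0xC   # normal physical addresses
NO_CACHE = 0xD    # no-cache range; an access here triggers a flush
FLUSHED = 0xF     # marker written into the high bits of a flushed line

def region(addr: int) -> int:
    """Return bits 47:44 of a 48-bit address."""
    return (addr >> 44) & 0xF

def classify(addr: int) -> str:
    """Decide what the bridge cache does with a request address."""
    r = region(addr)
    if r == CACHEABLE:
        return "cache"
    if r == NO_CACHE:
        return "flush"
    return "ignore"

def flush_tag(addr: int) -> int:
    """Mark a stored line address as flushed by forcing bits 47:44 to 0xF."""
    return (addr & ((1 << 44) - 1)) | (FLUSHED << 44)

print(classify(0xC00000012340), hex(flush_tag(0xC00000012340)))
```

A flushed line can then never match a lookup, because lookups are only performed for addresses in the 0xC range.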

Current cost appears to be ~ 1.3k LUT with 4-way (MTF enabled), and
~ 3.5k LUT for 8-way.

It is also eating around 2x to 3x as many LUTRAM cells as expected.

It is partly a mystery what exactly is going on here.

Note that this logic is being run at 50 MHz...

The MTF logic seems to add ~ 700 LUT for 4-way, and ~ 1300 LUT
for 8-way.
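
Putting the rough figures from this post side by side (a sketch only; it assumes the ~3.5k figure for 8-way also has MTF enabled, and all inputs are the approximate values quoted above):

```python
total_4way, total_8way = 1300, 3500   # approximate LUTs, MTF assumed enabled
mtf_4way, mtf_8way = 700, 1300        # approximate LUTs attributed to MTF

# 8-way vs 4-way ratio for the whole cache, for the non-MTF remainder,
# and for the MTF logic by itself.
total_ratio = total_8way / total_4way
base_ratio = (total_8way - mtf_8way) / (total_4way - mtf_4way)
mtf_ratio = mtf_8way / mtf_4way

print(f"{total_ratio:.2f} {base_ratio:.2f} {mtf_ratio:.2f}")
```

With these inputs the totals differ by about 2.7x, the MTF part by itself by a bit under 2x, and the remainder after subtracting MTF by about 3.7x.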

Costs seem to be larger than my theoretical estimates...

Granted, it is possible I could put an extra buffering stage on the
input signals and see if this causes LUT cost to go down.

I am not sure if anyone has some thoughts here...

But, yeah, if anything, this can be seen as evidence for my general
avoidance of associative caching. At least in this case, it is in an
area where the relative cost impact is lower...

Verilog Code:
---

/*
Bridge between the L1 and L2 Rings.

Add a small associative cache to the ring, intended to absorb conflict
misses.

*/

`include "ringbus/RbiDefs.v"

module RbiMemL1BridgeVcA(
    /* verilator lint_off UNUSED */
    clock,          reset,
    regInMmcr,      regInKrr,       regInSr,

    l1mAddrIn,      l1mAddrOut,
    l1mDataIn,      l1mDataOut,
    l1mOpmIn,       l1mOpmOut,
    l1mSeqIn,       l1mSeqOut,

    l2mAddrIn,      l2mAddrOut,
    l2mDataIn,      l2mDataOut,
    l2mOpmIn,       l2mOpmOut,
    l2mSeqIn,       l2mSeqOut,

    unitNodeId,     regRngBridge
    );

input clock;
input reset;
input[63:0] regInMmcr;
input[63:0] regInKrr;
input[63:0] regInSr;

input [ 15:0] l1mSeqIn; //operation sequence
output[ 15:0] l1mSeqOut; //operation sequence
input [ 15:0] l1mOpmIn; //memory operation mode
output[ 15:0] l1mOpmOut; //memory operation mode
`input_l1addr l1mAddrIn; //memory input address
`output_l1addr l1mAddrOut; //memory output address
`input_tile l1mDataIn; //memory input data
`output_tile l1mDataOut; //memory output data

input [ 15:0] l2mSeqIn; //operation sequence
output[ 15:0] l2mSeqOut; //operation sequence
input [ 15:0] l2mOpmIn; //memory operation mode
output[ 15:0] l2mOpmOut; //memory operation mode
`input_l2addr l2mAddrIn; //memory input address
`output_l2addr l2mAddrOut; //memory output address
`input_tile l2mDataIn; //memory input data
`output_tile l2mDataOut; //memory output data

input [ 7:0] unitNodeId; //Who Are We?
input [ 7:0] regRngBridge; //Random Sequence (Updates on L1 Flush)

reg[ 15:0] tL1mSeqOut; //operation sequence
reg[ 15:0] tL1mOpmOut; //memory operation mode
`reg_l1addr tL1mAddrOut; //memory output address
`reg_tile tL1mDataOut; //memory output data

reg[ 15:0] tL2mSeqOut; //operation sequence
reg[ 15:0] tL2mOpmOut; //memory operation mode
`reg_l2addr tL2mAddrOut; //memory output address
`reg_tile tL2mDataOut; //memory output data

reg[ 15:0] tL1mSeqOut2; //operation sequence
reg[ 15:0] tL1mOpmOut2; //memory operation mode
`reg_l1addr tL1mAddrOut2; //memory output address
`reg_tile tL1mDataOut2; //memory output data

assign l1mSeqOut = tL1mSeqOut2;
assign l1mOpmOut = tL1mOpmOut2;
assign l1mAddrOut = tL1mAddrOut2;
assign l1mDataOut = tL1mDataOut2;

reg[ 15:0] tL2mSeqOut2; //operation sequence
reg[ 15:0] tL2mOpmOut2; //memory operation mode
`reg_l2addr tL2mAddrOut2; //memory output address
`reg_tile tL2mDataOut2; //memory output data

assign l2mSeqOut = tL2mSeqOut2;
assign l2mOpmOut = tL2mOpmOut2;
assign l2mAddrOut = tL2mAddrOut2;
assign l2mDataOut = tL2mDataOut2;

reg tL1mReqSent;
reg tL2mReqSent;

wire l1mRingIsIdle;
wire l2mRingIsIdle;

assign l1mRingIsIdle = (l1mOpmIn[7:0] == JX2_RBI_OPM_IDLE);
assign l2mRingIsIdle = (l2mOpmIn[7:0] == JX2_RBI_OPM_IDLE);

wire l1mRingIsReq;
wire l2mRingIsResp;
wire l2mRingIsRespOther;
wire l2mRingIsMemLdResp;

wire l1mRingIsIrq;
wire l2mRingIsIrq;
wire l2mRingIsIrqBc;

assign l1mRingIsReq = l1mOpmIn[ 7:6] == 2'b10;

assign l2mRingIsResp =
(l2mOpmIn[ 7:6] == 2'b01) &&
(l2mSeqIn[15:10] == unitNodeId[7:2]);

assign l2mRingIsMemLdResp =
l2mRingIsResp && (l2mOpmIn[ 5:4] == 2'b11);

assign l2mRingIsRespOther =
(l2mOpmIn[ 7:6] == 2'b01) &&
(l2mSeqIn[15:10] != unitNodeId[7:2]);

assign l1mRingIsIrq =
(l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
((l2mDataIn[11:8] != unitNodeId[5:2]) ||
// (l2mDataIn[11:8] == 4'h0) ||
(l2mDataIn[11:8] == 4'hF));

assign l2mRingIsIrq =
(l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
((l2mDataIn[11:8] == unitNodeId[5:2]) ||
(l2mDataIn[11:8] == 4'h0) ||
(l2mDataIn[11:8] == 4'hF));
assign l2mRingIsIrqBc = l2mRingIsIrq && (l2mDataIn[11:8] == 4'hF);

reg[ 15:0] tL1mSeqReq; //operation sequence
reg[ 15:0] tL1mOpmReq; //memory operation mode
`reg_l1addr tL1mAddrReq; //memory output address
reg[127:0] tL1mDataReq; //memory output data

reg[ 15:0] tL2mSeqReq; //operation sequence
reg[ 15:0] tL2mOpmReq; //memory operation mode
`reg_l2addr tL2mAddrReq; //memory output address
reg[127:0] tL2mDataReq; //memory output data

wire l1mOpmIn_IsMemStReq =
(l1mOpmIn[7:0]==JX2_RBI_OPM_STX);

wire l1mOpmIn_IsMemLdReq =
(l1mOpmIn[7:0]==JX2_RBI_OPM_PFX) ||
(l1mOpmIn[7:0]==JX2_RBI_OPM_SPX) ||
(l1mOpmIn[7:0]==JX2_RBI_OPM_LDX) ;

wire l2mOpmIn_IsReq =
(l2mOpmIn[7:0]==JX2_RBI_OPM_PFX) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_SPX) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_LDX) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_LDSQ) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_LDSL) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_LDUL) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_STX) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_STSQ) ||
(l2mOpmIn[7:0]==JX2_RBI_OPM_STSL) ;

reg[127:0] tArrDataA[63:0];
reg[127:0] tArrDataB[63:0];
reg[127:0] tArrDataC[63:0];
reg[127:0] tArrDataD[63:0];

reg[47:0] tArrAddrA[63:0];
reg[47:0] tArrAddrB[63:0];
reg[47:0] tArrAddrC[63:0];
reg[47:0] tArrAddrD[63:0];

`ifdef jx2_rbi_bridge_vca_8x
reg[127:0] tArrDataS[63:0];
reg[127:0] tArrDataT[63:0];
reg[127:0] tArrDataU[63:0];
reg[127:0] tArrDataV[63:0];

reg[47:0] tArrAddrS[63:0];
reg[47:0] tArrAddrT[63:0];
reg[47:0] tArrAddrU[63:0];
reg[47:0] tArrAddrV[63:0];
`endif

// reg[4:0] tArrIx;
// reg[4:0] tBlkStIx;

reg[5:0] tArrIx;
reg[5:0] tBlkStIx;

reg[3:0] tNxtArrChk;
reg[3:0] tArrChk;

reg tDoBlkSt;

reg[127:0] tBlkDataA;
reg[127:0] tBlkDataB;
reg[127:0] tBlkDataC;
reg[127:0] tBlkDataD;

reg[47:0] tBlkAddrA;
reg[47:0] tBlkAddrB;
reg[47:0] tBlkAddrC;
reg[47:0] tBlkAddrD;
reg tBlkHitA;
reg tBlkHitB;
reg tBlkHitC;
reg tBlkHitD;

reg[127:0] tBlkDataH;
reg[47:0] tBlkAddrH;
reg tBlkHitH;

reg[2:0] tBlkHitId;

reg[127:0] tStBlkDataA;
reg[127:0] tStBlkDataB;
reg[127:0] tStBlkDataC;
reg[127:0] tStBlkDataD;

reg[47:0] tStBlkAddrA;
reg[47:0] tStBlkAddrB;
reg[47:0] tStBlkAddrC;
reg[47:0] tStBlkAddrD;

`ifdef jx2_rbi_bridge_vca_8x

reg[127:0] tBlkDataS;
reg[127:0] tBlkDataT;
reg[127:0] tBlkDataU;
reg[127:0] tBlkDataV;

reg[47:0] tBlkAddrS;
reg[47:0] tBlkAddrT;
reg[47:0] tBlkAddrU;
reg[47:0] tBlkAddrV;
reg tBlkHitS;
reg tBlkHitT;
reg tBlkHitU;
reg tBlkHitV;

reg[127:0] tStBlkDataS;
reg[127:0] tStBlkDataT;
reg[127:0] tStBlkDataU;
reg[127:0] tStBlkDataV;

reg[47:0] tStBlkAddrS;
reg[47:0] tStBlkAddrT;
reg[47:0] tStBlkAddrU;
reg[47:0] tStBlkAddrV;

`endif

reg tStBlkAddrB_Adv;
reg tStBlkAddrC_Adv;
reg tStBlkAddrD_Adv;
reg tStBlkAddrS_Adv;
reg tStBlkAddrT_Adv;
reg tStBlkAddrU_Adv;
reg tStBlkAddrV_Adv;


Re: Misc: 8-way vs 4-way cache, a cost mystery...

<02479652582caa588ac2e4758b19126b@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37745&group=comp.arch#37745

Date: Wed, 28 Feb 2024 01:33:14 +0000
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
References: <urlbcp$3bvto$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <02479652582caa588ac2e4758b19126b@www.novabbs.org>
 by: MitchAlsup1 - Wed, 28 Feb 2024 01:33 UTC

A thought:

Construct the 8-way cache from a pair of 4-way cache instances
and connect both into one 8-way with a single layer of logic
{multiplexing.}
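
A behavioral sketch of that structure, under stated assumptions: each 4-way instance is modeled as a list of (tag, data) pairs, the function names are invented for the example, and replacement or move-to-front across the two halves is not addressed at all.

```python
def lookup_4way(bank, tag):
    """Return (hit, data) from one 4-way instance."""
    for line_tag, data in bank[:4]:
        if line_tag == tag:
            return True, data
    return False, None

def lookup_8way(bank_lo, bank_hi, tag):
    """Combine two 4-way lookups with a single 2:1 select."""
    hit_lo, data_lo = lookup_4way(bank_lo, tag)
    hit_hi, data_hi = lookup_4way(bank_hi, tag)
    # One layer of multiplexing: prefer the low bank's result on a hit there.
    return (hit_lo or hit_hi), (data_lo if hit_lo else data_hi)

bank_lo = [(1, "a"), (2, "b")]
bank_hi = [(9, "z")]
print(lookup_8way(bank_lo, bank_hi, 9))
```

The point of the split is that each half only needs its own 4-way select, and the top level adds just one more 2:1 stage on the hit data.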

Re: Misc: 8-way vs 4-way cache, a cost mystery...

<urm9p2$3lhgq$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37746&group=comp.arch#37746

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
Date: Tue, 27 Feb 2024 22:37:05 -0500
Organization: A noiseless patient Spider
Lines: 891
Message-ID: <urm9p2$3lhgq$1@dont-email.me>
References: <urlbcp$3bvto$1@dont-email.me>
In-Reply-To: <urlbcp$3bvto$1@dont-email.me>
 by: Robert Finch - Wed, 28 Feb 2024 03:37 UTC

On 2024-02-27 1:58 p.m., BGB wrote:
> So, I have added a small associative cache to my ringbus, and it can be
> set to 4-way or 8-way, now with a "move to front" special-case which
> seems to significantly improve hit-rate (in a behavioral model).
>
> One mystery though is why the 8-way case costs significantly more than
> twice the LUTs of 4-way (and more so than an expected "n*log2(n)" cost
> ratio I might expect when accounting for MUX'ing).
>
>
> This version did make a few design changes vs a version I had mentioned
> previously (in another thread), mostly modifying the logic to select the
> hit value in a way that hopefully saves a few LUTs.
>
> But, then added the MTF logic (move-to-front), which increases LUT count
> again, mostly by adding variability in terms of how values may move
> around when updating the cache arrays.
>
>
> In this case, the cache operates on the Ringbus, and sits at the "bridge
> point" between the L1 and L2 rings, where the L1 ring contains the L1
> caches and TLB, and the L2 ring contains the L2 cache, ROM areas, a
> bridge over to the MMIO bus, etc. Where, the L2 cache then contains the
> interface to the external DRAM.
>
> The ringbus passes data in 128-bit chunks, with messages moving along
> the ring at 1 position per clock-cycle. There are no stalls, and
> generally no buffering apart from the flip-flops composing the ring itself.
>
> This cache operates exclusively on normal physical addresses:
>   (47:44)==0xC
> And flushes stuff on No-Cache addresses 0xD, but ignores most other
> address ranges. Currently it flushes lines by setting their high 4 bits
> to 0xF, which was typically the address range used for MMIO.
>
>
>
> Current cost appears to be ~ 1.3k LUT with 4-way (MTF enabled), and
> around ~ 3.5k LUT for 8-way.
>
> Also eating around 2x to 3x as many LUTRAM cells as expected.
>
> Partial mystery of what exactly is going on here.
>
> Note that this logic is being run at 50 MHz...
>
I think synthesis will replicate LUTs and FFs as needed to meet timing
requirements, unless otherwise specified. I find the result is almost
always more than the minimum number possible. You could try synthesizing
for size and see if it reduces the counts. Looking at the schematics also
helps to see what is going on.

I find it sometimes helps the synthesizer out to break things up into
smaller modules. It seems to optimize better.

>
> Where, the MTF logic seems to add ~ 700 LUT for 4-way, and ~ 1300 LUT
> for 8-way.
>
> Costs seem to be larger than my theoretical estimates...
>
>
> Granted, it is possible I could put an extra buffering stage on the
> input signals and see if this causes LUT cost to go down.
>
>
> I am not sure if anyone has some thoughts here...
>
>
> But, yeah, if anything, this can be seen as evidence for my general
> avoidance of associative caching. At least in this case, it is in an
> area where the relative cost impact is lower...
>
>
> Verilog Code:
> ---
>
> /*
> Bridge between the L1 and L2 Rings.
>
> Add a small associative cache to the ring, intended to absorb conflict
> misses.
>
>  */
>
> `include "ringbus/RbiDefs.v"
>
> module RbiMemL1BridgeVcA(
>     /* verilator lint_off UNUSED */
>     clock,            reset,
>     regInMmcr,        regInKrr,        regInSr,
>
>     l1mAddrIn,        l1mAddrOut,
>     l1mDataIn,        l1mDataOut,
>     l1mOpmIn,        l1mOpmOut,
>     l1mSeqIn,        l1mSeqOut,
>
>     l2mAddrIn,        l2mAddrOut,
>     l2mDataIn,        l2mDataOut,
>     l2mOpmIn,        l2mOpmOut,
>     l2mSeqIn,        l2mSeqOut,
>
>     unitNodeId,        regRngBridge
>     );
>
> input            clock;
> input            reset;
> input[63:0]        regInMmcr;
> input[63:0]        regInKrr;
> input[63:0]        regInSr;
>
> input [ 15:0]    l1mSeqIn;        //operation sequence
> output[ 15:0]    l1mSeqOut;        //operation sequence
> input [ 15:0]    l1mOpmIn;        //memory operation mode
> output[ 15:0]    l1mOpmOut;        //memory operation mode
> `input_l1addr    l1mAddrIn;        //memory input address
> `output_l1addr    l1mAddrOut;        //memory output address
> `input_tile        l1mDataIn;        //memory input data
> `output_tile    l1mDataOut;        //memory output data
>
> input [ 15:0]    l2mSeqIn;        //operation sequence
> output[ 15:0]    l2mSeqOut;        //operation sequence
> input [ 15:0]    l2mOpmIn;        //memory operation mode
> output[ 15:0]    l2mOpmOut;        //memory operation mode
> `input_l2addr    l2mAddrIn;        //memory input address
> `output_l2addr    l2mAddrOut;        //memory output address
> `input_tile        l2mDataIn;        //memory input data
> `output_tile    l2mDataOut;        //memory output data
>
> input [  7:0]    unitNodeId;        //Who Are We?
> input [  7:0]    regRngBridge;    //Random Sequence (Updates on L1 Flush)
>
>
>
> reg[ 15:0]        tL1mSeqOut;            //operation sequence
> reg[ 15:0]        tL1mOpmOut;            //memory operation mode
> `reg_l1addr        tL1mAddrOut;        //memory output address
> `reg_tile        tL1mDataOut;        //memory output data
>
> reg[ 15:0]        tL2mSeqOut;            //operation sequence
> reg[ 15:0]        tL2mOpmOut;            //memory operation mode
> `reg_l2addr        tL2mAddrOut;        //memory output address
> `reg_tile        tL2mDataOut;        //memory output data
>
>
> reg[ 15:0]        tL1mSeqOut2;            //operation sequence
> reg[ 15:0]        tL1mOpmOut2;            //memory operation mode
> `reg_l1addr        tL1mAddrOut2;        //memory output address
> `reg_tile        tL1mDataOut2;        //memory output data
>
> assign        l1mSeqOut    = tL1mSeqOut2;
> assign        l1mOpmOut    = tL1mOpmOut2;
> assign        l1mAddrOut    = tL1mAddrOut2;
> assign        l1mDataOut    = tL1mDataOut2;
>
>
> reg[ 15:0]        tL2mSeqOut2;            //operation sequence
> reg[ 15:0]        tL2mOpmOut2;            //memory operation mode
> `reg_l2addr        tL2mAddrOut2;        //memory output address
> `reg_tile        tL2mDataOut2;        //memory output data
>
> assign        l2mSeqOut    = tL2mSeqOut2;
> assign        l2mOpmOut    = tL2mOpmOut2;
> assign        l2mAddrOut    = tL2mAddrOut2;
> assign        l2mDataOut    = tL2mDataOut2;
>
>
>
> reg        tL1mReqSent;
> reg        tL2mReqSent;
>
> wire            l1mRingIsIdle;
> wire            l2mRingIsIdle;
>
> assign        l1mRingIsIdle = (l1mOpmIn[7:0] == JX2_RBI_OPM_IDLE);
> assign        l2mRingIsIdle = (l2mOpmIn[7:0] == JX2_RBI_OPM_IDLE);
>
> wire            l1mRingIsReq;
> wire            l2mRingIsResp;
> wire            l2mRingIsRespOther;
> wire            l2mRingIsMemLdResp;
>
> wire            l1mRingIsIrq;
> wire            l2mRingIsIrq;
> wire            l2mRingIsIrqBc;
>
> assign        l1mRingIsReq = l1mOpmIn[ 7:6] == 2'b10;
>
> assign        l2mRingIsResp =
>     (l2mOpmIn[ 7:6] == 2'b01) &&
>     (l2mSeqIn[15:10] == unitNodeId[7:2]);
>
> assign        l2mRingIsMemLdResp =
>     l2mRingIsResp && (l2mOpmIn[ 5:4] == 2'b11);
>
> assign        l2mRingIsRespOther =
>     (l2mOpmIn[ 7:6] == 2'b01) &&
>     (l2mSeqIn[15:10] != unitNodeId[7:2]);
>
> assign        l1mRingIsIrq =
>     (l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
>     ((l2mDataIn[11:8] != unitNodeId[5:2]) ||
> //     (l2mDataIn[11:8] == 4'h0) ||
>      (l2mDataIn[11:8] == 4'hF));
>
> assign        l2mRingIsIrq =
>     (l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
>     ((l2mDataIn[11:8] == unitNodeId[5:2]) ||
>      (l2mDataIn[11:8] == 4'h0) ||
>      (l2mDataIn[11:8] == 4'hF));
> assign        l2mRingIsIrqBc = l2mRingIsIrq && (l2mDataIn[11:8] == 4'hF);
>
> reg[ 15:0]        tL1mSeqReq;            //operation sequence
> reg[ 15:0]        tL1mOpmReq;            //memory operation mode
> `reg_l1addr        tL1mAddrReq;        //memory output address
> reg[127:0]        tL1mDataReq;        //memory output data
>
> reg[ 15:0]        tL2mSeqReq;            //operation sequence
> reg[ 15:0]        tL2mOpmReq;            //memory operation mode
> `reg_l2addr        tL2mAddrReq;        //memory output address
> reg[127:0]        tL2mDataReq;        //memory output data
>
> wire            l1mOpmIn_IsMemStReq =
>     (l1mOpmIn[7:0]==JX2_RBI_OPM_STX);
>
> wire            l1mOpmIn_IsMemLdReq =
>     (l1mOpmIn[7:0]==JX2_RBI_OPM_PFX)    ||
>     (l1mOpmIn[7:0]==JX2_RBI_OPM_SPX)    ||
>     (l1mOpmIn[7:0]==JX2_RBI_OPM_LDX)    ;
>
> wire            l2mOpmIn_IsReq =
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_PFX)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_SPX)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_LDX)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_LDSQ)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_LDSL)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_LDUL)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_STX)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_STSQ)    ||
>     (l2mOpmIn[7:0]==JX2_RBI_OPM_STSL)    ;
>
>
> reg[127:0]        tArrDataA[63:0];
> reg[127:0]        tArrDataB[63:0];
> reg[127:0]        tArrDataC[63:0];
> reg[127:0]        tArrDataD[63:0];
>
> reg[47:0]        tArrAddrA[63:0];
> reg[47:0]        tArrAddrB[63:0];
> reg[47:0]        tArrAddrC[63:0];
> reg[47:0]        tArrAddrD[63:0];
>
> `ifdef jx2_rbi_bridge_vca_8x
> reg[127:0]        tArrDataS[63:0];
> reg[127:0]        tArrDataT[63:0];
> reg[127:0]        tArrDataU[63:0];
> reg[127:0]        tArrDataV[63:0];
>
> reg[47:0]        tArrAddrS[63:0];
> reg[47:0]        tArrAddrT[63:0];
> reg[47:0]        tArrAddrU[63:0];
> reg[47:0]        tArrAddrV[63:0];
> `endif
>
>
> // reg[4:0]        tArrIx;
> // reg[4:0]        tBlkStIx;
>
> reg[5:0]        tArrIx;
> reg[5:0]        tBlkStIx;
>
> reg[3:0]        tNxtArrChk;
> reg[3:0]        tArrChk;
>
> reg                tDoBlkSt;
>
> reg[127:0]        tBlkDataA;
> reg[127:0]        tBlkDataB;
> reg[127:0]        tBlkDataC;
> reg[127:0]        tBlkDataD;
>
> reg[47:0]        tBlkAddrA;
> reg[47:0]        tBlkAddrB;
> reg[47:0]        tBlkAddrC;
> reg[47:0]        tBlkAddrD;
> reg                tBlkHitA;
> reg                tBlkHitB;
> reg                tBlkHitC;
> reg                tBlkHitD;
>
> reg[127:0]        tBlkDataH;
> reg[47:0]        tBlkAddrH;
> reg                tBlkHitH;
>
> reg[2:0]        tBlkHitId;
>
> reg[127:0]        tStBlkDataA;
> reg[127:0]        tStBlkDataB;
> reg[127:0]        tStBlkDataC;
> reg[127:0]        tStBlkDataD;
>
> reg[47:0]        tStBlkAddrA;
> reg[47:0]        tStBlkAddrB;
> reg[47:0]        tStBlkAddrC;
> reg[47:0]        tStBlkAddrD;
>
> `ifdef jx2_rbi_bridge_vca_8x
>
> reg[127:0]        tBlkDataS;
> reg[127:0]        tBlkDataT;
> reg[127:0]        tBlkDataU;
> reg[127:0]        tBlkDataV;
>
> reg[47:0]        tBlkAddrS;
> reg[47:0]        tBlkAddrT;
> reg[47:0]        tBlkAddrU;
> reg[47:0]        tBlkAddrV;
> reg                tBlkHitS;
> reg                tBlkHitT;
> reg                tBlkHitU;
> reg                tBlkHitV;
>
> reg[127:0]        tStBlkDataS;
> reg[127:0]        tStBlkDataT;
> reg[127:0]        tStBlkDataU;
> reg[127:0]        tStBlkDataV;
>
> reg[47:0]        tStBlkAddrS;
> reg[47:0]        tStBlkAddrT;
> reg[47:0]        tStBlkAddrU;
> reg[47:0]        tStBlkAddrV;
>
> `endif
>
> reg            tStBlkAddrB_Adv;
> reg            tStBlkAddrC_Adv;
> reg            tStBlkAddrD_Adv;
> reg            tStBlkAddrS_Adv;
> reg            tStBlkAddrT_Adv;
> reg            tStBlkAddrU_Adv;
> reg            tStBlkAddrV_Adv;
>
> reg            tStBlkAddrA_Flu;
> reg            tStBlkAddrB_Flu;
> reg            tStBlkAddrC_Flu;
> reg            tStBlkAddrD_Flu;
> reg            tStBlkAddrS_Flu;
> reg            tStBlkAddrT_Flu;
> reg            tStBlkAddrU_Flu;
> reg            tStBlkAddrV_Flu;
>
> always @*
> begin
> //        tArrIx        = l1mAddrIn[8:4];
>     tArrIx        = l1mAddrIn[9:4];
>     tBlkStIx    = tArrIx;
>     tDoBlkSt    = 0;
>
>     tNxtArrChk        = regRngBridge[7:4] ^ regRngBridge[3:0] ^ 4'h5;
>
>     tBlkDataA    = tArrDataA[tArrIx];
>     tBlkDataB    = tArrDataB[tArrIx];
>     tBlkDataC    = tArrDataC[tArrIx];
>     tBlkDataD    = tArrDataD[tArrIx];
>
>     tBlkAddrA    = tArrAddrA[tArrIx];
>     tBlkAddrB    = tArrAddrB[tArrIx];
>     tBlkAddrC    = tArrAddrC[tArrIx];
>     tBlkAddrD    = tArrAddrD[tArrIx];
>
> `ifdef jx2_rbi_bridge_vca_8x
>     tBlkDataS    = tArrDataS[tArrIx];
>     tBlkDataT    = tArrDataT[tArrIx];
>     tBlkDataU    = tArrDataU[tArrIx];
>     tBlkDataV    = tArrDataV[tArrIx];
>
>     tBlkAddrS    = tArrAddrS[tArrIx];
>     tBlkAddrT    = tArrAddrT[tArrIx];
>     tBlkAddrU    = tArrAddrU[tArrIx];
>     tBlkAddrV    = tArrAddrV[tArrIx];
> `endif
>
> //        tBlkHitA    = tBlkAddrA[43:4] == l1mAddrIn[43:4];
> //        tBlkHitB    = tBlkAddrB[43:4] == l1mAddrIn[43:4];
> //        tBlkHitC    = tBlkAddrC[43:4] == l1mAddrIn[43:4];
> //        tBlkHitD    = tBlkAddrD[43:4] == l1mAddrIn[43:4];
>
>     tBlkHitA    = tBlkAddrA[31:4] == l1mAddrIn[31:4];
>     tBlkHitB    = tBlkAddrB[31:4] == l1mAddrIn[31:4];
>     tBlkHitC    = tBlkAddrC[31:4] == l1mAddrIn[31:4];
>     tBlkHitD    = tBlkAddrD[31:4] == l1mAddrIn[31:4];
>
> `ifdef jx2_rbi_bridge_vca_8x
> //        tBlkHitS    = tBlkAddrS[43:4] == l1mAddrIn[43:4];
> //        tBlkHitT    = tBlkAddrT[43:4] == l1mAddrIn[43:4];
> //        tBlkHitU    = tBlkAddrU[43:4] == l1mAddrIn[43:4];
> //        tBlkHitV    = tBlkAddrV[43:4] == l1mAddrIn[43:4];
>
>     tBlkHitS    = tBlkAddrS[31:4] == l1mAddrIn[31:4];
>     tBlkHitT    = tBlkAddrT[31:4] == l1mAddrIn[31:4];
>     tBlkHitU    = tBlkAddrU[31:4] == l1mAddrIn[31:4];
>     tBlkHitV    = tBlkAddrV[31:4] == l1mAddrIn[31:4];
> `endif
>
> `ifdef jx2_rbi_bridge_vca_8x
>     tStBlkDataV    = tBlkDataU;
>     tStBlkDataU    = tBlkDataT;
>     tStBlkDataT    = tBlkDataS;
>     tStBlkDataS    = tBlkDataD;
>
>     tStBlkAddrV    = tBlkAddrU;
>     tStBlkAddrU    = tBlkAddrT;
>     tStBlkAddrT    = tBlkAddrS;
>     tStBlkAddrS    = tBlkAddrD;
> `endif
>
>     tStBlkDataD    = tBlkDataC;
>     tStBlkDataC    = tBlkDataB;
>     tStBlkDataB    = tBlkDataA;
>     tStBlkDataA    = l1mDataIn;
>
>     tStBlkAddrD    = tBlkAddrC;
>     tStBlkAddrC    = tBlkAddrB;
>     tStBlkAddrB    = tBlkAddrA;
>     tStBlkAddrA    = l1mAddrIn;
>
>     tStBlkAddrB_Adv = 1;
>     tStBlkAddrC_Adv = 1;
>     tStBlkAddrD_Adv = 1;
>     tStBlkAddrS_Adv = 1;
>     tStBlkAddrT_Adv = 1;
>     tStBlkAddrU_Adv = 1;
>     tStBlkAddrV_Adv = 1;
>
>     tStBlkAddrA_Flu = 0;
>     tStBlkAddrB_Flu = 0;
>     tStBlkAddrC_Flu = 0;
>     tStBlkAddrD_Flu = 0;
>     tStBlkAddrS_Flu = 0;
>     tStBlkAddrT_Flu = 0;
>     tStBlkAddrU_Flu = 0;
>     tStBlkAddrV_Flu = 0;
>
>     if(l1mOpmIn_IsMemStReq &&
>         (l1mAddrIn[47:44]==4'hC) &&
>         !l1mOpmIn[11])
>     begin
>         tDoBlkSt    = 1;
>     end
>
> // `ifndef def_true
> `ifdef def_true
>     if(
>         l2mRingIsMemLdResp &&
>         (l2mAddrIn[47:44]==4'hC) &&
>         !l2mOpmIn[11] &&
>         !l2mOpmIn[3])
>     begin
> //            tBlkStIx    = l2mAddrIn[8:4];
>         tBlkStIx    = l2mAddrIn[9:4];
>         tStBlkDataA    = l2mDataIn;
>         tStBlkAddrA    = l2mAddrIn;
>         tDoBlkSt    = 1;
>     end
> `endif
>
>     tStBlkAddrA[3:0] = tArrChk;
>
>     if(tStBlkAddrA[31:24]==0)
>         tDoBlkSt    = 0;
> //        if(tStBlkAddrA[47:44]!=4'hC)
> //            tDoBlkSt    = 0;
>
>     if(tStBlkAddrA[43:32]!=0)
>         tDoBlkSt    = 0;
>     if(tStBlkAddrA[31:27]!=0)
>         tDoBlkSt    = 0;
>
>     if((l1mAddrIn[47:44]==4'hD) || l1mOpmIn[11])
>     begin
>         /* If flushing a line, flush all the ways. */
>
>         tStBlkAddrB_Adv = 0;
>         tStBlkAddrC_Adv = 0;
>         tStBlkAddrD_Adv = 0;
>         tStBlkAddrS_Adv = 0;
>         tStBlkAddrT_Adv = 0;
>         tStBlkAddrU_Adv = 0;
>         tStBlkAddrV_Adv = 0;
>
>         tStBlkAddrA_Flu = 1;
>         tStBlkAddrB_Flu = 1;
>         tStBlkAddrC_Flu = 1;
>         tStBlkAddrD_Flu = 1;
>         tStBlkAddrS_Flu = 1;
>         tStBlkAddrT_Flu = 1;
>         tStBlkAddrU_Flu = 1;
>         tStBlkAddrV_Flu = 1;
>
>         tDoBlkSt    = 1;
>     end
>
> `ifdef jx2_rbi_bridge_vca_8x
>     casez({tBlkHitA, tBlkHitB, tBlkHitC, tBlkHitD,
>         tBlkHitS, tBlkHitT, tBlkHitU, tBlkHitV})
>         8'b1zzzzzzz: begin tBlkHitId = 0; tBlkHitH    = 1; end
>         8'b01zzzzzz: begin tBlkHitId = 1; tBlkHitH    = 1; end
>         8'b001zzzzz: begin tBlkHitId = 2; tBlkHitH    = 1; end
>         8'b0001zzzz: begin tBlkHitId = 3; tBlkHitH    = 1; end
>         8'b00001zzz: begin tBlkHitId = 4; tBlkHitH    = 1; end
>         8'b000001zz: begin tBlkHitId = 5; tBlkHitH    = 1; end
>         8'b0000001z: begin tBlkHitId = 6; tBlkHitH    = 1; end
>         8'b00000001: begin tBlkHitId = 7; tBlkHitH    = 1; end
>         8'b00000000: begin tBlkHitId = 0; tBlkHitH    = 0; end
>     endcase
>
>     case(tBlkHitId[2:0])
>         3'h0: begin
>             tBlkDataH    = tBlkDataA;
>             tBlkAddrH    = tBlkAddrA;
>         end
>         3'h1: begin
>             tBlkDataH    = tBlkDataB;
>             tBlkAddrH    = tBlkAddrB;
>         end
>         3'h2: begin
>             tBlkDataH    = tBlkDataC;
>             tBlkAddrH    = tBlkAddrC;
>         end
>         3'h3: begin
>             tBlkDataH    = tBlkDataD;
>             tBlkAddrH    = tBlkAddrD;
>         end
>         3'h4: begin
>             tBlkDataH    = tBlkDataS;
>             tBlkAddrH    = tBlkAddrS;
>         end
>         3'h5: begin
>             tBlkDataH    = tBlkDataT;
>             tBlkAddrH    = tBlkAddrT;
>         end
>         3'h6: begin
>             tBlkDataH    = tBlkDataU;
>             tBlkAddrH    = tBlkAddrU;
>         end
>         3'h7: begin
>             tBlkDataH    = tBlkDataV;
>             tBlkAddrH    = tBlkAddrV;
>         end
>     endcase
> `else
>     casez({tBlkHitA, tBlkHitB, tBlkHitC, tBlkHitD})
>         4'b1zzz: begin tBlkHitId = 0; tBlkHitH    = 1; end
>         4'b01zz: begin tBlkHitId = 1; tBlkHitH    = 1; end
>         4'b001z: begin tBlkHitId = 2; tBlkHitH    = 1; end
>         4'b0001: begin tBlkHitId = 3; tBlkHitH    = 1; end
>         4'b0000: begin tBlkHitId = 0; tBlkHitH    = 0; end
>     endcase
>
>     case(tBlkHitId[1:0])
>         2'h0: begin
>             tBlkDataH    = tBlkDataA;
>             tBlkAddrH    = tBlkAddrA;
>         end
>         2'h1: begin
>             tBlkDataH    = tBlkDataB;
>             tBlkAddrH    = tBlkAddrB;
>         end
>         2'h2: begin
>             tBlkDataH    = tBlkDataC;
>             tBlkAddrH    = tBlkAddrC;
>         end
>         2'h3: begin
>             tBlkDataH    = tBlkDataD;
>             tBlkAddrH    = tBlkAddrD;
>         end
>     endcase
> `endif
>
> `ifdef jx2_rbi_bridge_vca_mtf
>     /* If a line had hit in the cache, move it to the front. */
>
>     case(tBlkHitId[2:0])
>         3'h0: begin
> //            if(!tDoBlkSt)
> //            begin
> //                tStBlkDataA    = tBlkDataA;
> //                tStBlkAddrA    = tBlkAddrA;
> //            end
>         end
>
>         3'h1: begin
>             if(!tDoBlkSt)
>             begin
>                 tStBlkDataA    = tBlkDataB;
>                 tStBlkAddrA    = tBlkAddrB;
>             end
>
>             tStBlkAddrB_Adv = 1;
>             tStBlkAddrC_Adv = 0;
>             tStBlkAddrD_Adv = 0;
>             tStBlkAddrS_Adv = 0;
>             tStBlkAddrT_Adv = 0;
>             tStBlkAddrU_Adv = 0;
>             tStBlkAddrV_Adv = 0;
>
>             tDoBlkSt    = 1;
>         end
>         3'h2: begin
>             if(!tDoBlkSt)
>             begin
>                 tStBlkDataA    = tBlkDataC;
>                 tStBlkAddrA    = tBlkAddrC;
>             end
>
>             tStBlkAddrB_Adv = 1;
>             tStBlkAddrC_Adv = 1;
>             tStBlkAddrD_Adv = 0;
>             tStBlkAddrS_Adv = 0;
>             tStBlkAddrT_Adv = 0;
>             tStBlkAddrU_Adv = 0;
>             tStBlkAddrV_Adv = 0;
>
>             tDoBlkSt    = 1;
>         end
>         3'h3: begin
>             if(!tDoBlkSt)
>             begin
>                 tStBlkDataA    = tBlkDataD;
>                 tStBlkAddrA    = tBlkAddrD;
>             end
>
>             tStBlkAddrB_Adv = 1;
>             tStBlkAddrC_Adv = 1;
>             tStBlkAddrD_Adv = 1;
>             tStBlkAddrS_Adv = 0;
>             tStBlkAddrT_Adv = 0;
>             tStBlkAddrU_Adv = 0;
>             tStBlkAddrV_Adv = 0;
>
>             tDoBlkSt    = 1;
>         end
> `ifdef jx2_rbi_bridge_vca_8x
>         3'h4: begin
>             if(!tDoBlkSt)
>             begin
>                 tStBlkDataA    = tBlkDataS;
>                 tStBlkAddrA    = tBlkAddrS;
>             end
>
>             tStBlkAddrB_Adv = 1;
>             tStBlkAddrC_Adv = 1;
>             tStBlkAddrD_Adv = 1;
>             tStBlkAddrS_Adv = 1;
>             tStBlkAddrT_Adv = 0;
>             tStBlkAddrU_Adv = 0;
>             tStBlkAddrV_Adv = 0;
>
>             tDoBlkSt    = 1;
>         end
>         3'h5: begin
>             if(!tDoBlkSt)
>             begin
>                 tStBlkDataA    = tBlkDataT;
>                 tStBlkAddrA    = tBlkAddrT;
>             end
>
>             tStBlkAddrB_Adv = 1;
>             tStBlkAddrC_Adv = 1;
>             tStBlkAddrD_Adv = 1;
>             tStBlkAddrS_Adv = 1;
>             tStBlkAddrT_Adv = 1;
>             tStBlkAddrU_Adv = 0;
>             tStBlkAddrV_Adv = 0;
>
>             tDoBlkSt    = 1;
>         end
>         3'h6: begin
>             if(!tDoBlkSt)
>             begin
>                 tStBlkDataA    = tBlkDataU;
>                 tStBlkAddrA    = tBlkAddrU;
>             end
>
>             tStBlkAddrB_Adv = 1;
>             tStBlkAddrC_Adv = 1;
>             tStBlkAddrD_Adv = 1;
>             tStBlkAddrS_Adv = 1;
>             tStBlkAddrT_Adv = 1;
>             tStBlkAddrU_Adv = 1;
>             tStBlkAddrV_Adv = 0;
>
>             tDoBlkSt    = 1;
>         end
>         3'h7: begin
>             if(!tDoBlkSt)
>             begin
>                 tStBlkDataA    = tBlkDataV;
>                 tStBlkAddrA    = tBlkAddrV;
>             end
>
>             tStBlkAddrB_Adv = 1;
>             tStBlkAddrC_Adv = 1;
>             tStBlkAddrD_Adv = 1;
>             tStBlkAddrS_Adv = 1;
>             tStBlkAddrT_Adv = 1;
>             tStBlkAddrU_Adv = 1;
>             tStBlkAddrV_Adv = 1;
>
>             tDoBlkSt    = 1;
>         end
> `else
>         default: begin
>         end
> `endif
>     endcase
> `endif
>
>     tStBlkDataD    = tStBlkAddrD_Adv ? tBlkDataC : tBlkDataD;
>     tStBlkDataC    = tStBlkAddrC_Adv ? tBlkDataB : tBlkDataC;
>     tStBlkDataB    = tStBlkAddrB_Adv ? tBlkDataA : tBlkDataB;
>
>     tStBlkAddrD    = tStBlkAddrD_Adv ? tBlkAddrC : tBlkAddrD;
>     tStBlkAddrC    = tStBlkAddrC_Adv ? tBlkAddrB : tBlkAddrC;
>     tStBlkAddrB    = tStBlkAddrB_Adv ? tBlkAddrA : tBlkAddrB;
>
>     if(tStBlkAddrD_Adv ? tStBlkAddrC_Flu : tStBlkAddrD_Flu)
>         tStBlkAddrD[47:44] = 4'hF;
>     if(tStBlkAddrC_Adv ? tStBlkAddrB_Flu : tStBlkAddrC_Flu)
>         tStBlkAddrC[47:44] = 4'hF;
>     if(tStBlkAddrB_Adv ? tStBlkAddrA_Flu : tStBlkAddrB_Flu)
>         tStBlkAddrB[47:44] = 4'hF;
>     if(tStBlkAddrA_Flu)
>         tStBlkAddrA[47:44] = 4'hF;
>
> `ifdef jx2_rbi_bridge_vca_8x
>     tStBlkDataV    = tStBlkAddrV_Adv ? tBlkDataU : tBlkDataV;
>     tStBlkDataU    = tStBlkAddrU_Adv ? tBlkDataT : tBlkDataU;
>     tStBlkDataT    = tStBlkAddrT_Adv ? tBlkDataS : tBlkDataT;
>     tStBlkDataS    = tStBlkAddrS_Adv ? tBlkDataD : tBlkDataS;
>
>     tStBlkAddrV    = tStBlkAddrV_Adv ? tBlkAddrU : tBlkAddrV;
>     tStBlkAddrU    = tStBlkAddrU_Adv ? tBlkAddrT : tBlkAddrU;
>     tStBlkAddrT    = tStBlkAddrT_Adv ? tBlkAddrS : tBlkAddrT;
>     tStBlkAddrS    = tStBlkAddrS_Adv ? tBlkAddrD : tBlkAddrS;
>
>     if(tStBlkAddrV_Adv ? tStBlkAddrU_Flu : tStBlkAddrV_Flu)
>         tStBlkAddrV[47:44] = 4'hF;
>     if(tStBlkAddrU_Adv ? tStBlkAddrT_Flu : tStBlkAddrU_Flu)
>         tStBlkAddrU[47:44] = 4'hF;
>     if(tStBlkAddrT_Adv ? tStBlkAddrS_Flu : tStBlkAddrT_Flu)
>         tStBlkAddrT[47:44] = 4'hF;
>     if(tStBlkAddrS_Adv ? tStBlkAddrD_Flu : tStBlkAddrS_Flu)
>         tStBlkAddrS[47:44] = 4'hF;
> `endif
>
>     if(tBlkAddrH[3:0]!=tArrChk)
>         tBlkHitH    = 0;
>
>     if(tBlkAddrH[47:44]!=4'hC)
>         tBlkHitH    = 0;
>
>     /* Reject addresses outside normal physical space. */
>     if(l1mAddrIn[47:44]!=4'hC)
>         tBlkHitH    = 0;
>
>     if(l1mAddrIn[43:32]!=0)
>         tBlkHitH    = 0;
>
> //        tBlkHitH    = 0;
>
>
>     tL1mSeqOut  = l2mSeqIn;
>     tL1mOpmOut  = l2mOpmIn;
> //    tL1mAddrOut = l2mAddrIn;
>     tL1mDataOut = l2mDataIn;
>
>     tL2mSeqOut  = l1mSeqIn;
>     tL2mOpmOut  = l1mOpmIn;
> //    tL2mAddrOut = l1mAddrIn;
>     tL2mDataOut = l1mDataIn;
>
> `ifdef jx2_bus_mixaddr96
>     tL1mAddrOut = { UV48_00, l2mAddrIn };
>     tL2mAddrOut = l1mAddrIn[47:0];
> `else
>     tL1mAddrOut = l2mAddrIn;
>     tL2mAddrOut = l1mAddrIn;
> `endif
>
> // `ifndef def_true
> `ifdef def_true
>     if(l2mOpmIn_IsReq || l2mRingIsRespOther)
>     begin
>         if(l1mRingIsIdle)
>         begin
>             /* Avoid letting requests back into L1 ring. */
>
>             tL1mSeqOut  = l1mSeqIn;
>             tL1mOpmOut  = l1mOpmIn;
>             tL1mAddrOut = l1mAddrIn;
>             tL1mDataOut = l1mDataIn;
>
>
>             tL2mSeqOut  = l2mSeqIn;
>             tL2mOpmOut  = l2mOpmIn;
>             tL2mAddrOut = l2mAddrIn;
>             tL2mDataOut = l2mDataIn;
>         end
>     end
> `endif
>
> // `ifndef def_true
> `ifdef def_true
>     if(l1mOpmIn_IsMemLdReq && tBlkHitH)
>     begin
>         tL1mSeqOut  = l1mSeqIn;
>         tL1mOpmOut  = {
>             l1mOpmIn[15:8],
>             JX2_RBI_OPM_OKLD[7:4],
>             l1mOpmIn[11:8]};
>         tL1mAddrOut = l1mAddrIn;
>         tL1mDataOut = tBlkDataH;
>
>         tL2mSeqOut  = l2mSeqIn;
>         tL2mOpmOut  = l2mOpmIn;
>         tL2mAddrOut = l2mAddrIn;
>         tL2mDataOut = l2mDataIn;
>     end
> `endif
>
> `ifdef def_true
> // `ifndef def_true
>     if(reset)
>     begin
>         /* Clear ring during reset */
>
>         tL1mSeqOut  = 0;
>         tL1mOpmOut  = 0;
>
>         tL2mSeqOut  = 0;
>         tL2mOpmOut  = 0;
>     end
> `endif
>
> end
>
>
> always @(posedge clock)
> begin
>     tL1mSeqOut2        <= tL1mSeqOut;
>     tL1mOpmOut2        <= tL1mOpmOut;
>     tL1mAddrOut2    <= tL1mAddrOut;
>     tL1mDataOut2    <= tL1mDataOut;
>
>     tL2mSeqOut2        <= tL2mSeqOut;
>     tL2mOpmOut2        <= tL2mOpmOut;
>     tL2mAddrOut2    <= tL2mAddrOut;
>     tL2mDataOut2    <= tL2mDataOut;
>
>     tArrChk            <= tNxtArrChk;
>
>     if(tDoBlkSt)
>     begin
>         tArrDataA[tBlkStIx]        <= tStBlkDataA;
>         tArrDataB[tBlkStIx]        <= tStBlkDataB;
>         tArrDataC[tBlkStIx]        <= tStBlkDataC;
>         tArrDataD[tBlkStIx]        <= tStBlkDataD;
>
>         tArrAddrA[tBlkStIx]        <= tStBlkAddrA;
>         tArrAddrB[tBlkStIx]        <= tStBlkAddrB;
>         tArrAddrC[tBlkStIx]        <= tStBlkAddrC;
>         tArrAddrD[tBlkStIx]        <= tStBlkAddrD;
>
> `ifdef jx2_rbi_bridge_vca_8x
>         tArrDataS[tBlkStIx]        <= tStBlkDataS;
>         tArrDataT[tBlkStIx]        <= tStBlkDataT;
>         tArrDataU[tBlkStIx]        <= tStBlkDataU;
>         tArrDataV[tBlkStIx]        <= tStBlkDataV;
>
>         tArrAddrS[tBlkStIx]        <= tStBlkAddrS;
>         tArrAddrT[tBlkStIx]        <= tStBlkAddrT;
>         tArrAddrU[tBlkStIx]        <= tStBlkAddrU;
>         tArrAddrV[tBlkStIx]        <= tStBlkAddrV;
> `endif
>     end
> end
>
>
> endmodule


Re: Misc: 8-way vs 4-way cache, a cost mystery...

<urmkis$3nfgi$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37747&group=comp.arch#37747

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
Date: Wed, 28 Feb 2024 00:41:24 -0600
Organization: A noiseless patient Spider
Lines: 976
Message-ID: <urmkis$3nfgi$1@dont-email.me>
References: <urlbcp$3bvto$1@dont-email.me> <urm9p2$3lhgq$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 28 Feb 2024 06:41:32 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="4f525b7299752462d71b58c63cccd518";
logging-data="3915282"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+g1gQQK+ESlMc+ZmRZljPIPR0YYZpt7S4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:PXSIaejLEGLN9f4tod4/kG1rpv8=
Content-Language: en-US
In-Reply-To: <urm9p2$3lhgq$1@dont-email.me>
 by: BGB - Wed, 28 Feb 2024 06:41 UTC

On 2/27/2024 9:37 PM, Robert Finch wrote:
> On 2024-02-27 1:58 p.m., BGB wrote:
>> So, I have added a small associative cache to my ringbus, and it can
>> be set to 4-way or 8-way, now with a "move to front" special-case
>> which seems to significantly improve hit-rate (in a behavioral model).
>>
>> One mystery though is why the 8-way case costs significantly more than
>> twice the LUTs of 4-way (and more so than an expected "n*log2(n)" cost
>> ratio I might expect when accounting for MUX'ing).
>>
>>
>> This version did make a few design changes vs a version I had
>> mentioned previously (in another thread), mostly modifying the logic
>> to select the hit value in a way that hopefully saves a few LUTs.
>>
>> But, then added the MTF logic (move-to-front), which increases LUT
>> count again, mostly by adding variability in terms of how values may
>> move around when updating the cache arrays.
>>
>>
>> In this case, the cache operates on the Ringbus, and sits at the
>> "bridge point" between the L1 and L2 rings, where the L1 ring contains
>> the L1 caches and TLB, and the L2 ring contains the L2 cache, ROM
>> areas, a bridge over to the MMIO bus, etc. Where, the L2 cache then
>> contains the interface to the external DRAM.
>>
>> The ringbus passes data in 128-bit chunks, with messages moving along
>> the ring at 1 position per clock-cycle. There are no stalls, and
>> generally no buffering apart from the flip-flops composing the ring
>> itself.
>>
>> This cache operates exclusively on normal physical addresses:
>>    (47:44)==0xC
>> And flushes stuff on No-Cache addresses 0xD, but ignores most other
>> address ranges. Currently it flushes lines by setting their high 4
>> bits to 0xF, which was typically the address range used for MMIO.
>>
>>
>>
>> Current cost appears to be ~ 1.3k LUT with 4-way (MTF enabled), and
>> around ~ 3.5k LUT for 8-way.
>>
>> Also eating around 2x to 3x as many LUTRAM cells as expected.
>>
>> Partial mystery of what exactly is going on here.
>>
>> Note that this logic is being run at 50 MHz...
>>
> I think synthesis will replicate LUTs and FF's as needed to meet timing
> requirements, unless otherwise specified. I find it is almost always
> more than the minimum number possible. You could try synthesizing for
> size and see if it reduces the counts. Looking at schematics helps see
> what is going on too.
>

I am mostly using "Vivado Synthesis Defaults" and "Vivado Implementation
Defaults".

Past attempts at fiddling much with this have either been mostly
ineffective or have broken things like the ability to see resource
utilization or view the netlist, which kinda defeats the point (*).

*: Say, selecting anything other than "Vivado Implementation Defaults"
seemingly breaks the netlist and resource utilization features, for
whatever reason...

Can mess with the Synthesis settings at least, but often haven't gotten
enough out of this to make it worthwhile (and mostly makes sense if one
is up against resource limits of the FPGA; and/or flailing at trying to
get stuff to pass timing or similar...).

> I find it sometimes helps the synthesizer out to break things up into
> smaller modules. It seems to optimize better.
>

Looking in the schematics/...
Looks like the data and address arrays are partly segregated.
Data proceeds fairly straightforwardly from the arrays to the outputs.
Address signals are a bit more of a tangled mess.

Looks like there is a crapton of LUT3's here, with pretty much all of
the data selection being done via LUT3's (3 in, 1 out). Nearly all the
logic internally appears to be on 1-bit signals, ...

Did end up adding another cycle of delays on the input side, as it
appeared that this module may have been assimilating parts of the TLB
(the preceding module on the ringbus).

On the schematic of the outer module, some of the signals flowing in and
out of the Bridge module were ones that should have been internal to the
TLB.

Hmm, adding a delay cycle has caused synthesis to (somehow) turn it into
Block-RAM. Seems like kind of a waste though, as this would only use
around 1/4 of the BRAM... This while also burning even more LUTs...

Did slightly improve the "negative slack" value at least...

I guess, could consider an option to increase the array size to 256 so
that at least it makes full use of the BRAM's (never mind if it is using
the access pattern for LUTRAM's).

It uses 12 BRAM's, which seems to be the expected amount given the total
combined width of the arrays.

Probably going to stick with 4-way, as the MTF logic did increase
hit-rate a fair bit while also reducing the relative benefit of 8-way
over 4-way.

Even if still not so great for LUT cost...
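
For reference, the move-to-front idea can be sketched on a single 4-way
set roughly as below. This is only a minimal illustration, not the
bridge code from the original post: the module name, port names, and the
narrow default width are all made up for the example, and it leaves out
the index arrays, flush handling, and ring logic.

```verilog
// Hypothetical sketch: move-to-front over ONE 4-way set (no index arrays).
// Module and signal names are illustrative, not taken from the posted code.
module MtfSet4 #(parameter W = 8) (
    input  wire         clock,
    input  wire         doHit,    // lookup hit in way 'hitId'
    input  wire [1:0]   hitId,    // 0 = front (most recent) ... 3 = back
    input  wire         doFill,   // miss: insert 'fillVal' at the front
    input  wire [W-1:0] fillVal,
    output reg  [W-1:0] way0,
    output reg  [W-1:0] way1,
    output reg  [W-1:0] way2,
    output reg  [W-1:0] way3
);
    // advN = 1 means way N takes the old value of way N-1.
    // On a fill every way shifts back by one; on a hit only ways
    // 1..hitId shift back, and the ways behind the hit stay put.
    wire adv1 = doFill || (doHit && (hitId >= 2'd1));
    wire adv2 = doFill || (doHit && (hitId >= 2'd2));
    wire adv3 = doFill || (doHit && (hitId == 2'd3));

    // Value that ends up in the front way.
    reg [W-1:0] front;
    always @* begin
        case (hitId)
            2'd0: front = way0;
            2'd1: front = way1;
            2'd2: front = way2;
            2'd3: front = way3;
        endcase
        if (doFill)
            front = fillVal;    // in this sketch a fill overrides a hit
    end

    always @(posedge clock) begin
        if (doHit || doFill) begin
            way0 <= front;
            if (adv1) way1 <= way0;
            if (adv2) way2 <= way1;
            if (adv3) way3 <= way2;
        end
    end
endmodule
```

The per-way "advance or hold" select here plays the same role as the
_Adv flags in the posted code: every stored bit in every way gets a
select on its write path, plus the wider select feeding the front way.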

At the moment, 4-way + MTF is weighing in at around 1900 LUTs.

That is, adding an extra cycle of delay, if anything, caused the LUT
cost to increase...

>>
>> Where, the MTF logic seems to add ~ 700 LUT for 4-way, and ~ 1300 LUT
>> for 8-way.
>>
>> Costs seem to be larger than my theoretical estimates...
>>
>>
>> Granted, it is possible I could put an extra buffering stage on the
>> input signals and see if this causes LUT cost to go down.
>>
>>
>> I am not sure if anyone has some thoughts here...
>>
>>
>> But, yeah, if anything, this can be seen as evidence for my general
>> avoidance of associative caching. At least in this case, it is in an
>> area where the relative cost impact is lower...
>>
>>
>> Verilog Code:
>> ---
>>
>> /*
>> Bridge between the L1 and L2 Rings.
>>
>> Add a small associative cache to the ring, intended to absorb conflict
>> misses.
>>
>>   */
>>
>> `include "ringbus/RbiDefs.v"
>>
>> module RbiMemL1BridgeVcA(
>>      /* verilator lint_off UNUSED */
>>      clock,            reset,
>>      regInMmcr,        regInKrr,        regInSr,
>>
>>      l1mAddrIn,        l1mAddrOut,
>>      l1mDataIn,        l1mDataOut,
>>      l1mOpmIn,        l1mOpmOut,
>>      l1mSeqIn,        l1mSeqOut,
>>
>>      l2mAddrIn,        l2mAddrOut,
>>      l2mDataIn,        l2mDataOut,
>>      l2mOpmIn,        l2mOpmOut,
>>      l2mSeqIn,        l2mSeqOut,
>>
>>      unitNodeId,        regRngBridge
>>      );
>>
>> input            clock;
>> input            reset;
>> input[63:0]        regInMmcr;
>> input[63:0]        regInKrr;
>> input[63:0]        regInSr;
>>
>> input [ 15:0]    l1mSeqIn;        //operation sequence
>> output[ 15:0]    l1mSeqOut;        //operation sequence
>> input [ 15:0]    l1mOpmIn;        //memory operation mode
>> output[ 15:0]    l1mOpmOut;        //memory operation mode
>> `input_l1addr    l1mAddrIn;        //memory input address
>> `output_l1addr    l1mAddrOut;        //memory output address
>> `input_tile        l1mDataIn;        //memory input data
>> `output_tile    l1mDataOut;        //memory output data
>>
>> input [ 15:0]    l2mSeqIn;        //operation sequence
>> output[ 15:0]    l2mSeqOut;        //operation sequence
>> input [ 15:0]    l2mOpmIn;        //memory operation mode
>> output[ 15:0]    l2mOpmOut;        //memory operation mode
>> `input_l2addr    l2mAddrIn;        //memory input address
>> `output_l2addr    l2mAddrOut;        //memory output address
>> `input_tile        l2mDataIn;        //memory input data
>> `output_tile    l2mDataOut;        //memory output data
>>
>> input [  7:0]    unitNodeId;        //Who Are We?
>> input [  7:0]    regRngBridge;    //Random Sequence (Updates on L1 Flush)
>>
>>
>>
>> reg[ 15:0]        tL1mSeqOut;            //operation sequence
>> reg[ 15:0]        tL1mOpmOut;            //memory operation mode
>> `reg_l1addr        tL1mAddrOut;        //memory output address
>> `reg_tile        tL1mDataOut;        //memory output data
>>
>> reg[ 15:0]        tL2mSeqOut;            //operation sequence
>> reg[ 15:0]        tL2mOpmOut;            //memory operation mode
>> `reg_l2addr        tL2mAddrOut;        //memory output address
>> `reg_tile        tL2mDataOut;        //memory output data
>>
>>
>> reg[ 15:0]        tL1mSeqOut2;            //operation sequence
>> reg[ 15:0]        tL1mOpmOut2;            //memory operation mode
>> `reg_l1addr        tL1mAddrOut2;        //memory output address
>> `reg_tile        tL1mDataOut2;        //memory output data
>>
>> assign        l1mSeqOut    = tL1mSeqOut2;
>> assign        l1mOpmOut    = tL1mOpmOut2;
>> assign        l1mAddrOut    = tL1mAddrOut2;
>> assign        l1mDataOut    = tL1mDataOut2;
>>
>>
>> reg[ 15:0]        tL2mSeqOut2;            //operation sequence
>> reg[ 15:0]        tL2mOpmOut2;            //memory operation mode
>> `reg_l2addr        tL2mAddrOut2;        //memory output address
>> `reg_tile        tL2mDataOut2;        //memory output data
>>
>> assign        l2mSeqOut    = tL2mSeqOut2;
>> assign        l2mOpmOut    = tL2mOpmOut2;
>> assign        l2mAddrOut    = tL2mAddrOut2;
>> assign        l2mDataOut    = tL2mDataOut2;
>>
>>
>>
>> reg        tL1mReqSent;
>> reg        tL2mReqSent;
>>
>> wire            l1mRingIsIdle;
>> wire            l2mRingIsIdle;
>>
>> assign        l1mRingIsIdle = (l1mOpmIn[7:0] == JX2_RBI_OPM_IDLE);
>> assign        l2mRingIsIdle = (l2mOpmIn[7:0] == JX2_RBI_OPM_IDLE);
>>
>> wire            l1mRingIsReq;
>> wire            l2mRingIsResp;
>> wire            l2mRingIsRespOther;
>> wire            l2mRingIsMemLdResp;
>>
>> wire            l1mRingIsIrq;
>> wire            l2mRingIsIrq;
>> wire            l2mRingIsIrqBc;
>>
>> assign        l1mRingIsReq = l1mOpmIn[ 7:6] == 2'b10;
>>
>> assign        l2mRingIsResp =
>>      (l2mOpmIn[ 7:6] == 2'b01) &&
>>      (l2mSeqIn[15:10] == unitNodeId[7:2]);
>>
>> assign        l2mRingIsMemLdResp =
>>      l2mRingIsResp && (l2mOpmIn[ 5:4] == 2'b11);
>>
>> assign        l2mRingIsRespOther =
>>      (l2mOpmIn[ 7:6] == 2'b01) &&
>>      (l2mSeqIn[15:10] != unitNodeId[7:2]);
>>
>> assign        l1mRingIsIrq =
>>      (l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
>>      ((l2mDataIn[11:8] != unitNodeId[5:2]) ||
>> //     (l2mDataIn[11:8] == 4'h0) ||
>>       (l2mDataIn[11:8] == 4'hF));
>>
>> assign        l2mRingIsIrq =
>>      (l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
>>      ((l2mDataIn[11:8] == unitNodeId[5:2]) ||
>>       (l2mDataIn[11:8] == 4'h0) ||
>>       (l2mDataIn[11:8] == 4'hF));
>> assign        l2mRingIsIrqBc = l2mRingIsIrq && (l2mDataIn[11:8] == 4'hF);
>>
>> reg[ 15:0]        tL1mSeqReq;            //operation sequence
>> reg[ 15:0]        tL1mOpmReq;            //memory operation mode
>> `reg_l1addr        tL1mAddrReq;        //memory output address
>> reg[127:0]        tL1mDataReq;        //memory output data
>>
>> reg[ 15:0]        tL2mSeqReq;            //operation sequence
>> reg[ 15:0]        tL2mOpmReq;            //memory operation mode
>> `reg_l2addr        tL2mAddrReq;        //memory output address
>> reg[127:0]        tL2mDataReq;        //memory output data
>>
>> wire            l1mOpmIn_IsMemStReq =
>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_STX);
>>
>> wire            l1mOpmIn_IsMemLdReq =
>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_PFX)    ||
>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_SPX)    ||
>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_LDX)    ;
>>
>> wire            l2mOpmIn_IsReq =
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_PFX)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_SPX)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDX)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDSQ)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDSL)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDUL)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_STX)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_STSQ)    ||
>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_STSL)    ;
>>
>>
>> reg[127:0]        tArrDataA[63:0];
>> reg[127:0]        tArrDataB[63:0];
>> reg[127:0]        tArrDataC[63:0];
>> reg[127:0]        tArrDataD[63:0];
>>
>> reg[47:0]        tArrAddrA[63:0];
>> reg[47:0]        tArrAddrB[63:0];
>> reg[47:0]        tArrAddrC[63:0];
>> reg[47:0]        tArrAddrD[63:0];
>>
>> `ifdef jx2_rbi_bridge_vca_8x
>> reg[127:0]        tArrDataS[63:0];
>> reg[127:0]        tArrDataT[63:0];
>> reg[127:0]        tArrDataU[63:0];
>> reg[127:0]        tArrDataV[63:0];
>>
>> reg[47:0]        tArrAddrS[63:0];
>> reg[47:0]        tArrAddrT[63:0];
>> reg[47:0]        tArrAddrU[63:0];
>> reg[47:0]        tArrAddrV[63:0];
>> `endif
>>
>>
>> // reg[4:0]        tArrIx;
>> // reg[4:0]        tBlkStIx;
>>
>> reg[5:0]        tArrIx;
>> reg[5:0]        tBlkStIx;
>>
>> reg[3:0]        tNxtArrChk;
>> reg[3:0]        tArrChk;
>>
>> reg                tDoBlkSt;
>>
>> reg[127:0]        tBlkDataA;
>> reg[127:0]        tBlkDataB;
>> reg[127:0]        tBlkDataC;
>> reg[127:0]        tBlkDataD;
>>
>> reg[47:0]        tBlkAddrA;
>> reg[47:0]        tBlkAddrB;
>> reg[47:0]        tBlkAddrC;
>> reg[47:0]        tBlkAddrD;
>> reg                tBlkHitA;
>> reg                tBlkHitB;
>> reg                tBlkHitC;
>> reg                tBlkHitD;
>>
>> reg[127:0]        tBlkDataH;
>> reg[47:0]        tBlkAddrH;
>> reg                tBlkHitH;
>>
>> reg[2:0]        tBlkHitId;
>>
>> reg[127:0]        tStBlkDataA;
>> reg[127:0]        tStBlkDataB;
>> reg[127:0]        tStBlkDataC;
>> reg[127:0]        tStBlkDataD;
>>
>> reg[47:0]        tStBlkAddrA;
>> reg[47:0]        tStBlkAddrB;
>> reg[47:0]        tStBlkAddrC;
>> reg[47:0]        tStBlkAddrD;
>>
>> `ifdef jx2_rbi_bridge_vca_8x
>>
>> reg[127:0]        tBlkDataS;
>> reg[127:0]        tBlkDataT;
>> reg[127:0]        tBlkDataU;
>> reg[127:0]        tBlkDataV;
>>
>> reg[47:0]        tBlkAddrS;
>> reg[47:0]        tBlkAddrT;
>> reg[47:0]        tBlkAddrU;
>> reg[47:0]        tBlkAddrV;
>> reg                tBlkHitS;
>> reg                tBlkHitT;
>> reg                tBlkHitU;
>> reg                tBlkHitV;
>>
>> reg[127:0]        tStBlkDataS;
>> reg[127:0]        tStBlkDataT;
>> reg[127:0]        tStBlkDataU;
>> reg[127:0]        tStBlkDataV;
>>
>> reg[47:0]        tStBlkAddrS;
>> reg[47:0]        tStBlkAddrT;
>> reg[47:0]        tStBlkAddrU;
>> reg[47:0]        tStBlkAddrV;
>>
>> `endif
>>
>> reg            tStBlkAddrB_Adv;
>> reg            tStBlkAddrC_Adv;
>> reg            tStBlkAddrD_Adv;
>> reg            tStBlkAddrS_Adv;
>> reg            tStBlkAddrT_Adv;
>> reg            tStBlkAddrU_Adv;
>> reg            tStBlkAddrV_Adv;
>>
>> reg            tStBlkAddrA_Flu;
>> reg            tStBlkAddrB_Flu;
>> reg            tStBlkAddrC_Flu;
>> reg            tStBlkAddrD_Flu;
>> reg            tStBlkAddrS_Flu;
>> reg            tStBlkAddrT_Flu;
>> reg            tStBlkAddrU_Flu;
>> reg            tStBlkAddrV_Flu;
>>
>> always @*
>> begin
>> //        tArrIx        = l1mAddrIn[8:4];
>>      tArrIx        = l1mAddrIn[9:4];
>>      tBlkStIx    = tArrIx;
>>      tDoBlkSt    = 0;
>>
>>      tNxtArrChk        = regRngBridge[7:4] ^ regRngBridge[3:0] ^ 4'h5;
>>
>>      tBlkDataA    = tArrDataA[tArrIx];
>>      tBlkDataB    = tArrDataB[tArrIx];
>>      tBlkDataC    = tArrDataC[tArrIx];
>>      tBlkDataD    = tArrDataD[tArrIx];
>>
>>      tBlkAddrA    = tArrAddrA[tArrIx];
>>      tBlkAddrB    = tArrAddrB[tArrIx];
>>      tBlkAddrC    = tArrAddrC[tArrIx];
>>      tBlkAddrD    = tArrAddrD[tArrIx];
>>
>> `ifdef jx2_rbi_bridge_vca_8x
>>      tBlkDataS    = tArrDataS[tArrIx];
>>      tBlkDataT    = tArrDataT[tArrIx];
>>      tBlkDataU    = tArrDataU[tArrIx];
>>      tBlkDataV    = tArrDataV[tArrIx];
>>
>>      tBlkAddrS    = tArrAddrS[tArrIx];
>>      tBlkAddrT    = tArrAddrT[tArrIx];
>>      tBlkAddrU    = tArrAddrU[tArrIx];
>>      tBlkAddrV    = tArrAddrV[tArrIx];
>> `endif
>>
>> //        tBlkHitA    = tBlkAddrA[43:4] == l1mAddrIn[43:4];
>> //        tBlkHitB    = tBlkAddrB[43:4] == l1mAddrIn[43:4];
>> //        tBlkHitC    = tBlkAddrC[43:4] == l1mAddrIn[43:4];
>> //        tBlkHitD    = tBlkAddrD[43:4] == l1mAddrIn[43:4];
>>
>>      tBlkHitA    = tBlkAddrA[31:4] == l1mAddrIn[31:4];
>>      tBlkHitB    = tBlkAddrB[31:4] == l1mAddrIn[31:4];
>>      tBlkHitC    = tBlkAddrC[31:4] == l1mAddrIn[31:4];
>>      tBlkHitD    = tBlkAddrD[31:4] == l1mAddrIn[31:4];
>>
>> `ifdef jx2_rbi_bridge_vca_8x
>> //        tBlkHitS    = tBlkAddrS[43:4] == l1mAddrIn[43:4];
>> //        tBlkHitT    = tBlkAddrT[43:4] == l1mAddrIn[43:4];
>> //        tBlkHitU    = tBlkAddrU[43:4] == l1mAddrIn[43:4];
>> //        tBlkHitV    = tBlkAddrV[43:4] == l1mAddrIn[43:4];
>>
>>      tBlkHitS    = tBlkAddrS[31:4] == l1mAddrIn[31:4];
>>      tBlkHitT    = tBlkAddrT[31:4] == l1mAddrIn[31:4];
>>      tBlkHitU    = tBlkAddrU[31:4] == l1mAddrIn[31:4];
>>      tBlkHitV    = tBlkAddrV[31:4] == l1mAddrIn[31:4];
>> `endif
>>
>> `ifdef jx2_rbi_bridge_vca_8x
>>      tStBlkDataV    = tBlkDataU;
>>      tStBlkDataU    = tBlkDataT;
>>      tStBlkDataT    = tBlkDataS;
>>      tStBlkDataS    = tBlkDataD;
>>
>>      tStBlkAddrV    = tBlkAddrU;
>>      tStBlkAddrU    = tBlkAddrT;
>>      tStBlkAddrT    = tBlkAddrS;
>>      tStBlkAddrS    = tBlkAddrD;
>> `endif
>>
>>      tStBlkDataD    = tBlkDataC;
>>      tStBlkDataC    = tBlkDataB;
>>      tStBlkDataB    = tBlkDataA;
>>      tStBlkDataA    = l1mDataIn;
>>
>>      tStBlkAddrD    = tBlkAddrC;
>>      tStBlkAddrC    = tBlkAddrB;
>>      tStBlkAddrB    = tBlkAddrA;
>>      tStBlkAddrA    = l1mAddrIn;
>>
>>      tStBlkAddrB_Adv = 1;
>>      tStBlkAddrC_Adv = 1;
>>      tStBlkAddrD_Adv = 1;
>>      tStBlkAddrS_Adv = 1;
>>      tStBlkAddrT_Adv = 1;
>>      tStBlkAddrU_Adv = 1;
>>      tStBlkAddrV_Adv = 1;
>>
>>      tStBlkAddrA_Flu = 0;
>>      tStBlkAddrB_Flu = 0;
>>      tStBlkAddrC_Flu = 0;
>>      tStBlkAddrD_Flu = 0;
>>      tStBlkAddrS_Flu = 0;
>>      tStBlkAddrT_Flu = 0;
>>      tStBlkAddrU_Flu = 0;
>>      tStBlkAddrV_Flu = 0;
>>
>>      if(l1mOpmIn_IsMemStReq &&
>>          (l1mAddrIn[47:44]==4'hC) &&
>>          !l1mOpmIn[11])
>>      begin
>>          tDoBlkSt    = 1;
>>      end
>>
>> // `ifndef def_true
>> `ifdef def_true
>>      if(
>>          l2mRingIsMemLdResp &&
>>          (l2mAddrIn[47:44]==4'hC) &&
>>          !l2mOpmIn[11] &&
>>          !l2mOpmIn[3])
>>      begin
>> //            tBlkStIx    = l2mAddrIn[8:4];
>>          tBlkStIx    = l2mAddrIn[9:4];
>>          tStBlkDataA    = l2mDataIn;
>>          tStBlkAddrA    = l2mAddrIn;
>>          tDoBlkSt    = 1;
>>      end
>> `endif
>>
>>      tStBlkAddrA[3:0] = tArrChk;
>>
>>      if(tStBlkAddrA[31:24]==0)
>>          tDoBlkSt    = 0;
>> //        if(tStBlkAddrA[47:44]!=4'hC)
>> //            tDoBlkSt    = 0;
>>
>>      if(tStBlkAddrA[43:32]!=0)
>>          tDoBlkSt    = 0;
>>      if(tStBlkAddrA[31:27]!=0)
>>          tDoBlkSt    = 0;
>>
>>      if((l1mAddrIn[47:44]==4'hD) || l1mOpmIn[11])
>>      begin
>>          /* If flushing a line, flush all the ways. */
>>
>>          tStBlkAddrB_Adv = 0;
>>          tStBlkAddrC_Adv = 0;
>>          tStBlkAddrD_Adv = 0;
>>          tStBlkAddrS_Adv = 0;
>>          tStBlkAddrT_Adv = 0;
>>          tStBlkAddrU_Adv = 0;
>>          tStBlkAddrV_Adv = 0;
>>
>>          tStBlkAddrA_Flu = 1;
>>          tStBlkAddrB_Flu = 1;
>>          tStBlkAddrC_Flu = 1;
>>          tStBlkAddrD_Flu = 1;
>>          tStBlkAddrS_Flu = 1;
>>          tStBlkAddrT_Flu = 1;
>>          tStBlkAddrU_Flu = 1;
>>          tStBlkAddrV_Flu = 1;
>>
>>          tDoBlkSt    = 1;
>>      end
>>
>> `ifdef jx2_rbi_bridge_vca_8x
>>      casez({tBlkHitA, tBlkHitB, tBlkHitC, tBlkHitD,
>>          tBlkHitS, tBlkHitT, tBlkHitU, tBlkHitV})
>>          8'b1zzzzzzz: begin tBlkHitId = 0; tBlkHitH    = 1; end
>>          8'b01zzzzzz: begin tBlkHitId = 1; tBlkHitH    = 1; end
>>          8'b001zzzzz: begin tBlkHitId = 2; tBlkHitH    = 1; end
>>          8'b0001zzzz: begin tBlkHitId = 3; tBlkHitH    = 1; end
>>          8'b00001zzz: begin tBlkHitId = 4; tBlkHitH    = 1; end
>>          8'b000001zz: begin tBlkHitId = 5; tBlkHitH    = 1; end
>>          8'b0000001z: begin tBlkHitId = 6; tBlkHitH    = 1; end
>>          8'b00000001: begin tBlkHitId = 7; tBlkHitH    = 1; end
>>          8'b00000000: begin tBlkHitId = 0; tBlkHitH    = 0; end
>>      endcase
>>
>>      case(tBlkHitId[2:0])
>>          3'h0: begin
>>              tBlkDataH    = tBlkDataA;
>>              tBlkAddrH    = tBlkAddrA;
>>          end
>>          3'h1: begin
>>              tBlkDataH    = tBlkDataB;
>>              tBlkAddrH    = tBlkAddrB;
>>          end
>>          3'h2: begin
>>              tBlkDataH    = tBlkDataC;
>>              tBlkAddrH    = tBlkAddrC;
>>          end
>>          3'h3: begin
>>              tBlkDataH    = tBlkDataD;
>>              tBlkAddrH    = tBlkAddrD;
>>          end
>>          3'h4: begin
>>              tBlkDataH    = tBlkDataS;
>>              tBlkAddrH    = tBlkAddrS;
>>          end
>>          3'h5: begin
>>              tBlkDataH    = tBlkDataT;
>>              tBlkAddrH    = tBlkAddrT;
>>          end
>>          3'h6: begin
>>              tBlkDataH    = tBlkDataU;
>>              tBlkAddrH    = tBlkAddrU;
>>          end
>>          3'h7: begin
>>              tBlkDataH    = tBlkDataV;
>>              tBlkAddrH    = tBlkAddrV;
>>          end
>>      endcase
>> `else
>>      casez({tBlkHitA, tBlkHitB, tBlkHitC, tBlkHitD})
>>          4'b1zzz: begin tBlkHitId = 0; tBlkHitH    = 1; end
>>          4'b01zz: begin tBlkHitId = 1; tBlkHitH    = 1; end
>>          4'b001z: begin tBlkHitId = 2; tBlkHitH    = 1; end
>>          4'b0001: begin tBlkHitId = 3; tBlkHitH    = 1; end
>>          4'b0000: begin tBlkHitId = 0; tBlkHitH    = 0; end
>>      endcase
>>
>>      case(tBlkHitId[1:0])
>>          2'h0: begin
>>              tBlkDataH    = tBlkDataA;
>>              tBlkAddrH    = tBlkAddrA;
>>          end
>>          2'h1: begin
>>              tBlkDataH    = tBlkDataB;
>>              tBlkAddrH    = tBlkAddrB;
>>          end
>>          2'h2: begin
>>              tBlkDataH    = tBlkDataC;
>>              tBlkAddrH    = tBlkAddrC;
>>          end
>>          2'h3: begin
>>              tBlkDataH    = tBlkDataD;
>>              tBlkAddrH    = tBlkAddrD;
>>          end
>>      endcase
>> `endif
>>
>> `ifdef jx2_rbi_bridge_vca_mtf
>>      /* If a line had hit in the cache, move it to the front. */
>>


Re: Misc: 8-way vs 4-way cache, a cost mystery...

<urmmai$3npq9$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37748&group=comp.arch#37748

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
Date: Wed, 28 Feb 2024 01:11:06 -0600
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <urmmai$3npq9$1@dont-email.me>
References: <urlbcp$3bvto$1@dont-email.me>
<02479652582caa588ac2e4758b19126b@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 28 Feb 2024 07:11:14 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="4f525b7299752462d71b58c63cccd518";
logging-data="3925833"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+eFexdbJOG0c4jXbTtFVcmI+ZlPsfvYLU="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ECO5OMl3qB5pJ20zEqBg8D0QNkI=
In-Reply-To: <02479652582caa588ac2e4758b19126b@www.novabbs.org>
Content-Language: en-US
 by: BGB - Wed, 28 Feb 2024 07:11 UTC

On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
> A thought::
>
> Construct the 8-way cache from a pair of 4-way cache instances
> and connect both into one 8-way with a single layer of logic
> {multiplexing.}

Possible, I have decided for now to stick with 4-way...

But, even then, efforts at trying to optimize this seem to be causing
the LUT cost to increase rather than decrease...

Seemingly, Vivado's response to all this is to turn it almost
entirely into LUT3 instances (with a small number of LUT6's here and there).

Looking at the LUT3's, there seem to be various truth-tables in use.

But, off-hand, the patterns aren't super obvious.

A few common ones seem to be:
( I0 & I1) | (!I1 & I2)
( I0 & I1) | I2
(!I0 & I1) | I2
...

The first one seems to be a strong majority though. I think it is a 1-bit
MUX using I1 to select the other bit (I0 or I2).
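
A quick way to confirm the MUX reading is to compare the first truth table against a plain 2:1 select for all eight input combinations. The following is a minimal, simulation-only sketch; the module and signal names are made up for illustration and are not part of the bridge module.

```verilog
// Simulation-only check (hypothetical names): is
//   ( I0 & I1) | (!I1 & I2)
// the same function as a 2:1 mux with I1 as the select input?
module lut3_mux_check;
    reg  [3:0] k;              // loop counter, one bit wider than needed
    reg  [2:0] v;              // {I2, I1, I0}

    wire i0 = v[0];
    wire i1 = v[1];
    wire i2 = v[2];

    wire lut = ( i0 & i1) | (!i1 & i2);   // expression as seen in the netlist
    wire mux = i1 ? i0 : i2;              // I1 selects between I0 and I2

    initial begin
        for (k = 0; k < 8; k = k + 1) begin
            v = k[2:0];
            #1;
            if (lut !== mux)
                $display("MISMATCH at {I2,I1,I0} = %b", v);
        end
        $display("check finished");
        $finish;
    end
endmodule
```

With I1=1 both expressions reduce to I0, and with I1=0 both reduce to I2, so no mismatch line should be printed.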

....

Re: Misc: 8-way vs 4-way cache, a cost mystery...

<04d0c9094fa116ab88e3f9615042b16b@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37750&group=comp.arch#37750

Date: Wed, 28 Feb 2024 23:26:29 +0000
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$RrGh0OhpoBZlLvYj841SjOY2INpuTitFHtvTSLAkQNOS/bj.gbz8m
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <urlbcp$3bvto$1@dont-email.me> <02479652582caa588ac2e4758b19126b@www.novabbs.org> <urmmai$3npq9$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <04d0c9094fa116ab88e3f9615042b16b@www.novabbs.org>
 by: MitchAlsup1 - Wed, 28 Feb 2024 23:26 UTC

BGB wrote:

> On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
>> A thought::
>>
>> Construct the 8-way cache from a pair of 4-way cache instances
>> and connect both into one 8-way with a single layer of logic
>> {multiplexing.}

> Possible, I have decided for now to stick with 4-way...

> But, even then, efforts at trying to optimize this seem to be causing
> the LUT cost to increase rather than decrease...

Then you have tickled one of Verilog's insidious demons.

How many elements in a way ?? and how many bits in an element ??
Is there a way to make a "way" into a single SRAM ?? (or part of a single
SRAM) ??

What I am getting at is that "conceptually" an n-way set associative
cache is not recognizably different from n copies of a 1/n direct
mapped cache coupled to a set/way selection multiplexer based on
an address-bit compare. {{And of course write set selection.}}

My 1-wide My 66000 implementation carefully chose 3-way (or 6-way)
L1 caches because that exactly fit the number of bits in my SRAM
macro (-2 spare bits). So cramming 3 tags, 3 line-states, and 3-bit
LRU into one 128-bit SRAM word. The 24KB cache is 3-way while the
48KB cache is the 6-way. The read speed path is 1 gate longer in
6-way configuration.

> Seemingly, Vivado's response to all this being to turn it almost
> entirely into LUT3 instances (with a small number of LUT6's here and there).

It seems to me it is failing to see the SRAM and just made it out of
flip-flops.

> Looking at the LUT3's, there seem to be various truth-tables in use.

> But, off-hand, the patterns aren't super obvious.

> A few common ones seem to be:
> ( I0 & I1) | (!I1 & I2)
> ( I0 & I1) | I2
> (!I0 & I1) | I2
> ...

These appear to be standard decoding pattern recognizers, to me (although
the top one looks like a binary multiplexer).

> The first one seems to be a strong majority though. I think it is a bit
> MUX using I1 to select the other bit (I0 or I2).

> ....

Re: Misc: 8-way vs 4-way cache, a cost mystery...

<urotrg$b0cg$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37751&group=comp.arch#37751

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
Date: Wed, 28 Feb 2024 21:31:53 -0600
Organization: A noiseless patient Spider
Lines: 132
Message-ID: <urotrg$b0cg$1@dont-email.me>
References: <urlbcp$3bvto$1@dont-email.me>
<02479652582caa588ac2e4758b19126b@www.novabbs.org>
<urmmai$3npq9$1@dont-email.me>
<04d0c9094fa116ab88e3f9615042b16b@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 29 Feb 2024 03:32:01 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a4009e05929523d5d19afe0436b8f762";
logging-data="360848"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18KBki3Bb3tfblROXTtvemuen/591YLLd4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:VNt8o3YOGAJ+pcTTrPUIr4PtnZ0=
In-Reply-To: <04d0c9094fa116ab88e3f9615042b16b@www.novabbs.org>
Content-Language: en-US
 by: BGB - Thu, 29 Feb 2024 03:31 UTC

On 2/28/2024 5:26 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
>>> A thought::
>>>
>>> Construct the 8-way cache from a pair of 4-way cache instances
>>> and connect both into one 8-way with a single layer of logic
>>> {multiplexing.}
>
>> Possible, I have decided for now to stick with 4-way...
>
>
>> But, even then, efforts at trying to optimize this seem to be causing
>> the LUT cost to increase rather than decrease...
>
> Then you have tickled one of Verilog's insidious demons.
>
> How many elements in a way ?? and how many bits in an element ??
> Is there a way to make a "way" into a single SRAM ?? (or part of a single
> SRAM) ??
>
> What I am getting at is that "conceptually" an n-way set associative
> cache is not recognizably different from n copies of a 1/n direct mapped
> cache coupled to a set/way selection multiplexer based on
> an address-bit compare. {{And of course write set selection.}}
>

I am not sure.

In this case, I interpreted it as, say, 4 or 8 parallel sets of arrays,
with the corresponding match and multiplex logic.

In the first instance, adding an item always shifted each item over one
position and added a new item to the front.

The MTF logic tries to move an accessed item to the front, or shifts each
item back as before if it is a new address. If the address hits while
adding an item, it behaves as if it were moving it to the front, but
effectively replaces the item being moved to the front with the data
being written.

The MTF logic seems to increase hit-rate, but eats a lot of additional LUTs.
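
To make the shifting rule concrete, here is a simplified sketch of the move-to-front update for one 4-way set. It is an illustration under assumptions, not the actual bridge code: the module name `mtf4`, its ports, and the width parameter `W` are all hypothetical, and it handles only the reordering, not the hit detection or array writes.

```verilog
// Simplified move-to-front reordering for one 4-way set (hypothetical sketch).
// A is the front. inNew is the item being written (or the hit item on a read).
module mtf4 #(parameter W = 8) (
    input      [W-1:0] inA, inB, inC, inD,   // current contents, A = front
    input      [W-1:0] inNew,                // item to place at the front
    input              hit,                  // 1 if the address matched a way
    input      [1:0]   hitId,                // which way matched (0=A .. 3=D)
    output reg [W-1:0] outA, outB, outC, outD
    );

always @*
begin
    /* Default: plain shift, new item in front, old D falls off the end. */
    outA = inNew;
    outB = inA;
    outC = inB;
    outD = inC;

    if (hit)
    begin
        /* Ways behind the hit position keep their contents;
           only the ways in front of it shift back by one. */
        if (hitId < 2'd1) outB = inB;
        if (hitId < 2'd2) outC = inC;
        if (hitId < 2'd3) outD = inD;
    end
end

endmodule
```

In this form each of outB..outD is a per-bit choice between "keep" and "take the predecessor", which may be one source of the many 2:1-mux-shaped LUT3s, since the choice is made for every data and address bit of every way behind the front.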

> My 1-wide My 66000 implementation carefully chose 3-way (or 6-way)
> L1 caches because that exactly fit the number of bits in my SRAM macro
> (-2 spare bits). So cramming 3 tags, 3 line-states, and 3-bit
> LRU into one 128-bit SRAM word. The 24KB cache is 3-way while the
> 48KB cache is the 6-way. The read speed path is 1 gate longer in
> 6-way configuration.
>

I was mostly trying to go for LUTRAM here, which is theoretically (for a
64-element array):
6 address bits, 2 data bits in, 2 data bits out.

So, for 4 ways, each 128+48 bits, I would expect ~ 352 LUTRAMs.

But, removing 12 bits, because they are only ever 0 and are unused as
inputs, leaving 128+36, this would give 328.

Reported usage is 656, exactly twice this latter estimate. This implies
either a 2x duplication, or each LUTRAM is only holding 1 bit.

The schematic view doesn't seem to show either obvious duplication or
1-bit LUTRAMs, so it appears as if maybe the DMEM/LUTRAM stat is
counting in bits rather than instances?...

Well, or double-counting each because it seemingly appears in the
netlist both as a source and a destination?...
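
The arithmetic behind these estimates can be written out as elaboration-time constants. This is only a sketch that reproduces the numbers quoted above; the parameter names are made up, and the 2-bits-per-LUTRAM figure is the assumption stated in the post, not something checked against the device documentation.

```verilog
// Reproduces the LUTRAM estimates from the post (hypothetical names).
module lutram_estimate;
    localparam WAYS            = 4;
    localparam DATA_BITS       = 128;
    localparam ADDR_BITS       = 48;
    localparam UNUSED_BITS     = 12;   // address bits that are always 0
    localparam BITS_PER_LUTRAM = 2;    // assumption taken from the post

    // 4 * (128+48) / 2
    localparam EST_FULL   = WAYS * (DATA_BITS + ADDR_BITS) / BITS_PER_LUTRAM;
    // 4 * (128+36) / 2
    localparam EST_PRUNED = WAYS * (DATA_BITS + ADDR_BITS - UNUSED_BITS)
                            / BITS_PER_LUTRAM;
    // 4 * (128+36), i.e. one unit counted per stored bit
    localparam EST_PERBIT = WAYS * (DATA_BITS + ADDR_BITS - UNUSED_BITS);

    initial begin
        $display("full=%0d pruned=%0d per_bit=%0d",
            EST_FULL, EST_PRUNED, EST_PERBIT);
        $finish;
    end
endmodule
```

This should print full=352 pruned=328 per_bit=656, so the reported 656 matches the per-bit count exactly, which is at least consistent with the "counting in bits" guess.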

>> Seemingly, Vivado's response to all this being to turn it almost
>> entirely into LUT3 instances (with a small number of LUT6's here and
>> there).
>
> It seems to me it is failing to see the SRAM and just made it out of
> flip-flops.
>

It does appear to be using LUTRAM's;
If it were FF's, the situation would be *much* worse...

Trying to math it out, if it were FF's, it would have eaten nearly the
entire resource budget of the FPGA.

Also, the FF number would have been very large...

It did at one point seem to be using Block-RAM, but I added:
(* ram_style = "distributed" *)
to the 64-element arrays, as using Block-RAM here is kind of a waste.
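
For reference, a minimal sketch of how this attribute is typically attached to a small array so that Vivado maps it to distributed RAM. The module `way_lutram` and its ports are hypothetical and much simpler than the bridge module; the asynchronous read with synchronous write is the usual access pattern for LUTRAM.

```verilog
// One 64-entry array forced into distributed RAM (hypothetical sketch).
module way_lutram #(parameter W = 176) (
    input          clock,
    input  [5:0]   rdIx,      // read index
    output [W-1:0] rdData,    // read data, available combinationally
    input          wrEn,      // write enable
    input  [5:0]   wrIx,      // write index
    input  [W-1:0] wrData     // write data
    );

(* ram_style = "distributed" *)
reg [W-1:0] mem [63:0];

assign rdData = mem[rdIx];          // asynchronous read

always @(posedge clock)
begin
    if (wrEn)
        mem[wrIx] <= wrData;        // synchronous write
end

endmodule
```

Keeping each array in a small module of its own also makes the per-array LUTRAM count easy to read off the utilization report.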

>> Looking at the LUT3's, there seem to be various truth-tables in use.
>
>> But, off-hand, the patterns aren't super obvious.
>
>> A few common ones seem to be:
>>    ( I0 & I1) | (!I1 & I2)
>>    ( I0 & I1) | I2
>>    (!I0 & I1) | I2
>>    ...
>
> These appear to be standard decoding pattern recognizers, to me (although
> the top one looks like a binary multiplexer).
>

Checking, the numbers being seen are in-line with what one would expect,
in terms of theoretical estimates, from nearly the entire cache being
implemented using LUT3's for all the multiplexing...

Then again, I guess for the "implementation" stage, LUT3 may be
preferable to LUT6, since a LUT6 will use the entire LUT, but LUT3 will
allow combining two LUT3s into a single LUT6.

But, I guess the main issue is that this module has ended up as one of
the most expensive modules (about the only modules that use more LUTs at
the moment being things like the L1D$ and FP-SIMD unit and similar)

>> The first one seems to be a strong majority though. I think it is a
>> bit MUX using I1 to select the other bit (I0 or I2).
>
>> ....

Re: Misc: 8-way vs 4-way cache, a cost mystery...

<943506edf2d6396f973365e603a956c3@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37757&group=comp.arch#37757

Date: Thu, 29 Feb 2024 19:39:35 +0000
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$2QY.6UdMrKy4yte.yaOtvexQIHwz0zFDnWcfO/igVFHAsiT1tpkvy
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <urlbcp$3bvto$1@dont-email.me> <02479652582caa588ac2e4758b19126b@www.novabbs.org> <urmmai$3npq9$1@dont-email.me> <04d0c9094fa116ab88e3f9615042b16b@www.novabbs.org> <urotrg$b0cg$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <943506edf2d6396f973365e603a956c3@www.novabbs.org>
 by: MitchAlsup1 - Thu, 29 Feb 2024 19:39 UTC

BGB wrote:

> On 2/28/2024 5:26 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
>>>> A thought::
>>>>
>>>> Construct the 8-way cache from a pair of 4-way cache instances
>>>> and connect both into one 8-way with a single layer of logic
>>>> {multiplexing.}
>>
>>> Possible, I have decided for now to stick with 4-way...
>>
>>
>>> But, even then, efforts at trying to optimize this seem to be causing
>>> the LUT cost to increase rather than decrease...
>>
>> Then you have tickled one of Verilog's insidious demons.
>>
>> How many elements in a way ?? and how many bits in an element ??
>> Is there a way to make a "way" into a single SRAM ?? (or part of a single
>> SRAM) ??
>>
>> What I am getting at is that "conceptually" an n-way set associative
>> cache is not recognizably different from n copies of a 1/n direct mapped
>> cache coupled to a set/way selection multiplexer based on
>> an address-bit compare. {{And of course write set selection.}}
>>

> I am not sure.

> In this case, I interpreted it as, say, 4 or 8 parallel sets of arrays,
> with the corresponding match and multiplex logic.

They should be 4 or 8 parallel instances of a 1-way (DM) cache with a
comparator and an output multiplexer signal and an input write signal.
The Move to Front is easier done with Not-recently-Used bits as a guise
for least recently used.

Each way has a NRU bit--at reset and when all NRU bits are set, they are
cleared asynchronously. Then as each set is hit, the NRU bit is set. You
do not reallocate the sets with the NRU bit set. I see no reason to move
one set to the front if you can alter the reallocation selection to avoid
picking it. {{3 gates per line}}
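
A rough sketch of the NRU idea for a single 4-way set follows. It is illustrative only: the module name `nru4` and its ports are hypothetical, a real cache would need one such 4-bit state per index, and the all-set clear is done synchronously here (on the next clock edge) rather than asynchronously as described above.

```verilog
// Not-recently-used tracking for one 4-way set (hypothetical sketch).
module nru4(
    input            clock,
    input            reset,
    input      [3:0] hit,      // one-hot: way that hit this cycle, 0 = none
    output reg [1:0] victim    // way to reallocate on a miss
    );

reg [3:0] nru;       // 1 = recently used, do not reallocate
reg [3:0] nxtNru;

always @*
begin
    /* Mark the way that hit; once every way is marked, start over. */
    nxtNru = nru | hit;
    if (nxtNru == 4'hF)
        nxtNru = 4'h0;

    /* Pick the lowest-numbered way whose NRU bit is clear.
       nru is never 4'hF after reset, so a victim always exists. */
    casez (nru)
        4'bzzz0: victim = 2'd0;
        4'bzz01: victim = 2'd1;
        4'bz011: victim = 2'd2;
        default: victim = 2'd3;
    endcase
end

always @(posedge clock)
begin
    if (reset)
        nru <= 4'h0;
    else
        nru <= nxtNru;
end

endmodule
```

On a miss, the caller would write the new line into `victim` and could drive `hit` with that way's one-hot code so the fresh line is marked as used. No line contents ever move between ways; only the replacement choice changes.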

> In the first instance, adding an item always shifted each item over one
> position and added a new item to the front.

> The MTF logic tries to move an accessed item to the front, or shifts each
> item back as before if it is a new address. If the address hits while
> adding an item, it behaves as if it were moving it to the front, but
> effectively replaces the item being moved to the front with the data
> being written.

> The MTF logic seems to increase hit-rate, but eats a lot of additional LUTs.

Re: Misc: 8-way vs 4-way cache, a cost mystery...

<urue9s$1nt17$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37758&group=comp.arch#37758

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
Date: Sat, 2 Mar 2024 00:43:20 -0500
Organization: A noiseless patient Spider
Lines: 996
Message-ID: <urue9s$1nt17$1@dont-email.me>
References: <urlbcp$3bvto$1@dont-email.me> <urm9p2$3lhgq$1@dont-email.me>
<urmkis$3nfgi$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 2 Mar 2024 05:43:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="dcd201756e0dc01e1f9b592e38a6aa9a";
logging-data="1831975"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/yIdG2VNmL/JJR1yXBRJTg74KUUTh3/Kk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:j7/DuReVQ9/Xchz41USYXoUDSck=
In-Reply-To: <urmkis$3nfgi$1@dont-email.me>
Content-Language: en-US
 by: Robert Finch - Sat, 2 Mar 2024 05:43 UTC

On 2024-02-28 1:41 a.m., BGB wrote:
> On 2/27/2024 9:37 PM, Robert Finch wrote:
>> On 2024-02-27 1:58 p.m., BGB wrote:
>>> So, I have added a small associative cache to my ringbus, and it can
>>> be set to 4-way or 8-way, now with a "move to front" special-case
>>> which seems to significantly improve hit-rate (in a behavioral model).
>>>
>>> One mystery though is why the 8-way case costs significantly more
>>> than twice the LUTs of 4-way (and more so than an expected
>>> "n*log2(n)" cost ratio I might expect when accounting for MUX'ing).
>>>
>>>
>>> This version did make a few design changes vs a version I had
>>> mentioned previously (in another thread), mostly modifying the logic
>>> to select the hit value in a way that hopefully saves a few LUTs.
>>>
>>> But, then added the MTF logic (move-to-front), which increases LUT
>>> count again, mostly by adding variability in terms of how values may
>>> move around when updating the cache arrays.
>>>
>>>
>>> In this case, the cache operates on the Ringbus, and sits at the
>>> "bridge point" between the L1 and L2 rings, where the L1 ring
>>> contains the L1 caches and TLB, and the L2 ring contains the L2
>>> cache, ROM areas, a bridge over to the MMIO bus, etc. Where, the L2
>>> cache then contains the interface to the external DRAM.
>>>
>>> The ringbus passes data in 128-bit chunks, with messages moving along
>>> the ring at 1 position per clock-cycle. There are no stalls, and
>>> generally no buffering apart from the flip-flops composing the ring
>>> itself.
>>>
>>> This cache operates exclusively on normal physical addresses:
>>>    (47:44)==0xC
>>> And flushes stuff on No-Cache addresses 0xD, but ignores most other
>>> address ranges. Currently it flushes lines by setting their high 4
>>> bits to 0xF, which was typically the address range used for MMIO.
>>>
>>>
>>>
>>> Current cost appears to be ~ 1.3k LUT with 4-way (MTF enabled), and
>>> around ~ 3.5k LUT for 8-way.
>>>
>>> Also eating around 2x to 3x as many LUTRAM cells as expected.
>>>
>>> Partial mystery of what exactly is going on here.
>>>
>>> Note that this logic is being run at 50 MHz...
>>>
>> I think synthesis will replicate LUTs and FF's as needed to meet
>> timing requirements, unless otherwise specified. I find it is almost
>> always more than the minimum number possible. You could try
>> synthesizing for size and see if it reduces the counts. Looking at
>> schematics helps see what is going on too.
>>
>
> I am mostly using "Vivado Synthesis Defaults" and "Vivado Implementation
> Defaults".
>
> Past attempts at fiddling all that much with this have either been mostly
> ineffective or have broken things like the ability to see resource
> utilization or view the netlist, which kinda defeats the point (*).
>
>
> *: Say, selecting anything other than "Vivado Implementation Defaults"
> seemingly breaks the netlist and resource utilization features, for
> whatever reason this happens...
>
> Can mess with the Synthesis settings at least, but often haven't gotten
> enough out of this to make it worthwhile (and mostly makes sense if one
> is up against resource limits of the FPGA; and/or flailing at trying to
> get stuff to pass timing or similar...).
>
>
>> I find it sometimes helps the synthesizer out to break things up into
>> smaller modules. It seems to optimize better.
>>
>
> Looking in the schematics/...
>   Looks like the data and address arrays are partly segregated.
>   Data proceeds fairly straightforwardly from the arrays to the outputs.
>   Address signals are a bit more of a tangled mess.
>
> Looks like there is a crapton of LUT3's here, with pretty much all of
> the data selection done via LUT3's (3 in, 1 out). Nearly all the logic
> internally appears to be on 1-bit signals, ...
>
>
> Did end up adding another cycle of delays on the input side, as it
> appeared that this module may have been assimilating parts of the TLB
> (the preceding module on the ringbus).
>
> On the schematic of the outer module, some of the signals flowing in and
> out of the Bridge module were ones that should have been internal to the
> TLB.
>
>
> Hmm, adding a delay cycle has caused synthesis to (somehow) turn it into
> Block-RAM. Seems like kind of a waste though, as this would only use
> around 1/4 of the BRAM... This while also burning even more LUTs...
>
> Did slightly improve the "negative slack" value at least...
>
>
> I guess, could consider an option to increase the array size to 256 so
> that at least it makes full use of the BRAM's (nevermind if it is using
> the access pattern for LUTRAM's).
>
> It uses 12 BRAM's, which seems to be the expected amount given the total
> combined width of the arrays.
>
>
>
> Probably going to stick with 4-way, as the MTF logic did increase
> hit-rate a fair bit while also reducing the relative benefit of 8-way
> over 4-way.
>
> Even if still not so great for LUT cost...
>
>
>
> At the moment, 4-way + MTF, is now weighing in at around 1900 LUTs.
>
> Or, adding an extra cycle of delay, if anything, caused the LUT cost to
> increase...
>
>
>>>
>>> Where, the MTF logic seems to add ~ 700 LUT for 4-way, and ~ 1300 LUT
>>> for 8-way.
>>>
>>> Costs seem to be larger than my theoretical estimates...
>>>
>>>
>>> Granted, it is possible I could put an extra buffering stage on the
>>> input signals and see if this causes LUT cost to go down.
>>>
>>>
>>> I am not sure if anyone has some thoughts here...
>>>
>>>
>>> But, yeah, if anything, this can be seen as evidence for my general
>>> avoidance of associative caching. At least in this case, it is in an
>>> area where the relative cost impact is lower...
>>>
>>>
>>> Verilog Code:
>>> ---
>>>
>>> /*
>>> Bridge between the L1 and L2 Rings.
>>>
>>> Add a small associative cache to the ring, intended to absorb
>>> conflict misses.
>>>
>>>   */
>>>
>>> `include "ringbus/RbiDefs.v"
>>>
>>> module RbiMemL1BridgeVcA(
>>>      /* verilator lint_off UNUSED */
>>>      clock,            reset,
>>>      regInMmcr,        regInKrr,        regInSr,
>>>
>>>      l1mAddrIn,        l1mAddrOut,
>>>      l1mDataIn,        l1mDataOut,
>>>      l1mOpmIn,        l1mOpmOut,
>>>      l1mSeqIn,        l1mSeqOut,
>>>
>>>      l2mAddrIn,        l2mAddrOut,
>>>      l2mDataIn,        l2mDataOut,
>>>      l2mOpmIn,        l2mOpmOut,
>>>      l2mSeqIn,        l2mSeqOut,
>>>
>>>      unitNodeId,        regRngBridge
>>>      );
>>>
>>> input            clock;
>>> input            reset;
>>> input[63:0]        regInMmcr;
>>> input[63:0]        regInKrr;
>>> input[63:0]        regInSr;
>>>
>>> input [ 15:0]    l1mSeqIn;        //operation sequence
>>> output[ 15:0]    l1mSeqOut;        //operation sequence
>>> input [ 15:0]    l1mOpmIn;        //memory operation mode
>>> output[ 15:0]    l1mOpmOut;        //memory operation mode
>>> `input_l1addr    l1mAddrIn;        //memory input address
>>> `output_l1addr    l1mAddrOut;        //memory output address
>>> `input_tile        l1mDataIn;        //memory input data
>>> `output_tile    l1mDataOut;        //memory output data
>>>
>>> input [ 15:0]    l2mSeqIn;        //operation sequence
>>> output[ 15:0]    l2mSeqOut;        //operation sequence
>>> input [ 15:0]    l2mOpmIn;        //memory operation mode
>>> output[ 15:0]    l2mOpmOut;        //memory operation mode
>>> `input_l2addr    l2mAddrIn;        //memory input address
>>> `output_l2addr    l2mAddrOut;        //memory output address
>>> `input_tile        l2mDataIn;        //memory input data
>>> `output_tile    l2mDataOut;        //memory output data
>>>
>>> input [  7:0]    unitNodeId;        //Who Are We?
>>> input [  7:0]    regRngBridge;    //Random Sequence (Updates on L1 Flush)
>>>
>>>
>>>
>>> reg[ 15:0]        tL1mSeqOut;            //operation sequence
>>> reg[ 15:0]        tL1mOpmOut;            //memory operation mode
>>> `reg_l1addr        tL1mAddrOut;        //memory output address
>>> `reg_tile        tL1mDataOut;        //memory output data
>>>
>>> reg[ 15:0]        tL2mSeqOut;            //operation sequence
>>> reg[ 15:0]        tL2mOpmOut;            //memory operation mode
>>> `reg_l2addr        tL2mAddrOut;        //memory output address
>>> `reg_tile        tL2mDataOut;        //memory output data
>>>
>>>
>>> reg[ 15:0]        tL1mSeqOut2;            //operation sequence
>>> reg[ 15:0]        tL1mOpmOut2;            //memory operation mode
>>> `reg_l1addr        tL1mAddrOut2;        //memory output address
>>> `reg_tile        tL1mDataOut2;        //memory output data
>>>
>>> assign        l1mSeqOut    = tL1mSeqOut2;
>>> assign        l1mOpmOut    = tL1mOpmOut2;
>>> assign        l1mAddrOut    = tL1mAddrOut2;
>>> assign        l1mDataOut    = tL1mDataOut2;
>>>
>>>
>>> reg[ 15:0]        tL2mSeqOut2;            //operation sequence
>>> reg[ 15:0]        tL2mOpmOut2;            //memory operation mode
>>> `reg_l2addr        tL2mAddrOut2;        //memory output address
>>> `reg_tile        tL2mDataOut2;        //memory output data
>>>
>>> assign        l2mSeqOut    = tL2mSeqOut2;
>>> assign        l2mOpmOut    = tL2mOpmOut2;
>>> assign        l2mAddrOut    = tL2mAddrOut2;
>>> assign        l2mDataOut    = tL2mDataOut2;
>>>
>>>
>>>
>>> reg        tL1mReqSent;
>>> reg        tL2mReqSent;
>>>
>>> wire            l1mRingIsIdle;
>>> wire            l2mRingIsIdle;
>>>
>>> assign        l1mRingIsIdle = (l1mOpmIn[7:0] == JX2_RBI_OPM_IDLE);
>>> assign        l2mRingIsIdle = (l2mOpmIn[7:0] == JX2_RBI_OPM_IDLE);
>>>
>>> wire            l1mRingIsReq;
>>> wire            l2mRingIsResp;
>>> wire            l2mRingIsRespOther;
>>> wire            l2mRingIsMemLdResp;
>>>
>>> wire            l1mRingIsIrq;
>>> wire            l2mRingIsIrq;
>>> wire            l2mRingIsIrqBc;
>>>
>>> assign        l1mRingIsReq = l1mOpmIn[ 7:6] == 2'b10;
>>>
>>> assign        l2mRingIsResp =
>>>      (l2mOpmIn[ 7:6] == 2'b01) &&
>>>      (l2mSeqIn[15:10] == unitNodeId[7:2]);
>>>
>>> assign        l2mRingIsMemLdResp =
>>>      l2mRingIsResp && (l2mOpmIn[ 5:4] == 2'b11);
>>>
>>> assign        l2mRingIsRespOther =
>>>      (l2mOpmIn[ 7:6] == 2'b01) &&
>>>      (l2mSeqIn[15:10] != unitNodeId[7:2]);
>>>
>>> assign        l1mRingIsIrq =
>>>      (l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
>>>      ((l2mDataIn[11:8] != unitNodeId[5:2]) ||
>>> //     (l2mDataIn[11:8] == 4'h0) ||
>>>       (l2mDataIn[11:8] == 4'hF));
>>>
>>> assign        l2mRingIsIrq =
>>>      (l2mOpmIn[ 7:0] == JX2_RBI_OPM_IRQ) &&
>>>      ((l2mDataIn[11:8] == unitNodeId[5:2]) ||
>>>       (l2mDataIn[11:8] == 4'h0) ||
>>>       (l2mDataIn[11:8] == 4'hF));
>>> assign        l2mRingIsIrqBc =
>>>      l2mRingIsIrq && (l2mDataIn[11:8] == 4'hF);
>>>
>>> reg[ 15:0]        tL1mSeqReq;            //operation sequence
>>> reg[ 15:0]        tL1mOpmReq;            //memory operation mode
>>> `reg_l1addr        tL1mAddrReq;        //memory output address
>>> reg[127:0]        tL1mDataReq;        //memory output data
>>>
>>> reg[ 15:0]        tL2mSeqReq;            //operation sequence
>>> reg[ 15:0]        tL2mOpmReq;            //memory operation mode
>>> `reg_l2addr        tL2mAddrReq;        //memory output address
>>> reg[127:0]        tL2mDataReq;        //memory output data
>>>
>>> wire            l1mOpmIn_IsMemStReq =
>>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_STX);
>>>
>>> wire            l1mOpmIn_IsMemLdReq =
>>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_PFX)    ||
>>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_SPX)    ||
>>>      (l1mOpmIn[7:0]==JX2_RBI_OPM_LDX)    ;
>>>
>>> wire            l2mOpmIn_IsReq =
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_PFX)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_SPX)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDX)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDSQ)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDSL)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_LDUL)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_STX)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_STSQ)    ||
>>>      (l2mOpmIn[7:0]==JX2_RBI_OPM_STSL)    ;
>>>
>>>
>>> reg[127:0]        tArrDataA[63:0];
>>> reg[127:0]        tArrDataB[63:0];
>>> reg[127:0]        tArrDataC[63:0];
>>> reg[127:0]        tArrDataD[63:0];
>>>
>>> reg[47:0]        tArrAddrA[63:0];
>>> reg[47:0]        tArrAddrB[63:0];
>>> reg[47:0]        tArrAddrC[63:0];
>>> reg[47:0]        tArrAddrD[63:0];
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>> reg[127:0]        tArrDataS[63:0];
>>> reg[127:0]        tArrDataT[63:0];
>>> reg[127:0]        tArrDataU[63:0];
>>> reg[127:0]        tArrDataV[63:0];
>>>
>>> reg[47:0]        tArrAddrS[63:0];
>>> reg[47:0]        tArrAddrT[63:0];
>>> reg[47:0]        tArrAddrU[63:0];
>>> reg[47:0]        tArrAddrV[63:0];
>>> `endif
>>>
>>>
>>> // reg[4:0]        tArrIx;
>>> // reg[4:0]        tBlkStIx;
>>>
>>> reg[5:0]        tArrIx;
>>> reg[5:0]        tBlkStIx;
>>>
>>> reg[3:0]        tNxtArrChk;
>>> reg[3:0]        tArrChk;
>>>
>>> reg                tDoBlkSt;
>>>
>>> reg[127:0]        tBlkDataA;
>>> reg[127:0]        tBlkDataB;
>>> reg[127:0]        tBlkDataC;
>>> reg[127:0]        tBlkDataD;
>>>
>>> reg[47:0]        tBlkAddrA;
>>> reg[47:0]        tBlkAddrB;
>>> reg[47:0]        tBlkAddrC;
>>> reg[47:0]        tBlkAddrD;
>>> reg                tBlkHitA;
>>> reg                tBlkHitB;
>>> reg                tBlkHitC;
>>> reg                tBlkHitD;
>>>
>>> reg[127:0]        tBlkDataH;
>>> reg[47:0]        tBlkAddrH;
>>> reg                tBlkHitH;
>>>
>>> reg[2:0]        tBlkHitId;
>>>
>>> reg[127:0]        tStBlkDataA;
>>> reg[127:0]        tStBlkDataB;
>>> reg[127:0]        tStBlkDataC;
>>> reg[127:0]        tStBlkDataD;
>>>
>>> reg[47:0]        tStBlkAddrA;
>>> reg[47:0]        tStBlkAddrB;
>>> reg[47:0]        tStBlkAddrC;
>>> reg[47:0]        tStBlkAddrD;
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>>
>>> reg[127:0]        tBlkDataS;
>>> reg[127:0]        tBlkDataT;
>>> reg[127:0]        tBlkDataU;
>>> reg[127:0]        tBlkDataV;
>>>
>>> reg[47:0]        tBlkAddrS;
>>> reg[47:0]        tBlkAddrT;
>>> reg[47:0]        tBlkAddrU;
>>> reg[47:0]        tBlkAddrV;
>>> reg                tBlkHitS;
>>> reg                tBlkHitT;
>>> reg                tBlkHitU;
>>> reg                tBlkHitV;
>>>
>>> reg[127:0]        tStBlkDataS;
>>> reg[127:0]        tStBlkDataT;
>>> reg[127:0]        tStBlkDataU;
>>> reg[127:0]        tStBlkDataV;
>>>
>>> reg[47:0]        tStBlkAddrS;
>>> reg[47:0]        tStBlkAddrT;
>>> reg[47:0]        tStBlkAddrU;
>>> reg[47:0]        tStBlkAddrV;
>>>
>>> `endif
>>>
>>> reg            tStBlkAddrB_Adv;
>>> reg            tStBlkAddrC_Adv;
>>> reg            tStBlkAddrD_Adv;
>>> reg            tStBlkAddrS_Adv;
>>> reg            tStBlkAddrT_Adv;
>>> reg            tStBlkAddrU_Adv;
>>> reg            tStBlkAddrV_Adv;
>>>
>>> reg            tStBlkAddrA_Flu;
>>> reg            tStBlkAddrB_Flu;
>>> reg            tStBlkAddrC_Flu;
>>> reg            tStBlkAddrD_Flu;
>>> reg            tStBlkAddrS_Flu;
>>> reg            tStBlkAddrT_Flu;
>>> reg            tStBlkAddrU_Flu;
>>> reg            tStBlkAddrV_Flu;
>>>
>>> always @*
>>> begin
>>> //        tArrIx        = l1mAddrIn[8:4];
>>>      tArrIx        = l1mAddrIn[9:4];
>>>      tBlkStIx    = tArrIx;
>>>      tDoBlkSt    = 0;
>>>
>>>      tNxtArrChk        = regRngBridge[7:4] ^ regRngBridge[3:0] ^ 4'h5;
>>>
>>>      tBlkDataA    = tArrDataA[tArrIx];
>>>      tBlkDataB    = tArrDataB[tArrIx];
>>>      tBlkDataC    = tArrDataC[tArrIx];
>>>      tBlkDataD    = tArrDataD[tArrIx];
>>>
>>>      tBlkAddrA    = tArrAddrA[tArrIx];
>>>      tBlkAddrB    = tArrAddrB[tArrIx];
>>>      tBlkAddrC    = tArrAddrC[tArrIx];
>>>      tBlkAddrD    = tArrAddrD[tArrIx];
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>>      tBlkDataS    = tArrDataS[tArrIx];
>>>      tBlkDataT    = tArrDataT[tArrIx];
>>>      tBlkDataU    = tArrDataU[tArrIx];
>>>      tBlkDataV    = tArrDataV[tArrIx];
>>>
>>>      tBlkAddrS    = tArrAddrS[tArrIx];
>>>      tBlkAddrT    = tArrAddrT[tArrIx];
>>>      tBlkAddrU    = tArrAddrU[tArrIx];
>>>      tBlkAddrV    = tArrAddrV[tArrIx];
>>> `endif
>>>
>>> //        tBlkHitA    = tBlkAddrA[43:4] == l1mAddrIn[43:4];
>>> //        tBlkHitB    = tBlkAddrB[43:4] == l1mAddrIn[43:4];
>>> //        tBlkHitC    = tBlkAddrC[43:4] == l1mAddrIn[43:4];
>>> //        tBlkHitD    = tBlkAddrD[43:4] == l1mAddrIn[43:4];
>>>
>>>      tBlkHitA    = tBlkAddrA[31:4] == l1mAddrIn[31:4];
>>>      tBlkHitB    = tBlkAddrB[31:4] == l1mAddrIn[31:4];
>>>      tBlkHitC    = tBlkAddrC[31:4] == l1mAddrIn[31:4];
>>>      tBlkHitD    = tBlkAddrD[31:4] == l1mAddrIn[31:4];
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>> //        tBlkHitS    = tBlkAddrS[43:4] == l1mAddrIn[43:4];
>>> //        tBlkHitT    = tBlkAddrT[43:4] == l1mAddrIn[43:4];
>>> //        tBlkHitU    = tBlkAddrU[43:4] == l1mAddrIn[43:4];
>>> //        tBlkHitV    = tBlkAddrV[43:4] == l1mAddrIn[43:4];
>>>
>>>      tBlkHitS    = tBlkAddrS[31:4] == l1mAddrIn[31:4];
>>>      tBlkHitT    = tBlkAddrT[31:4] == l1mAddrIn[31:4];
>>>      tBlkHitU    = tBlkAddrU[31:4] == l1mAddrIn[31:4];
>>>      tBlkHitV    = tBlkAddrV[31:4] == l1mAddrIn[31:4];
>>> `endif
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>>      tStBlkDataV    = tBlkDataU;
>>>      tStBlkDataU    = tBlkDataT;
>>>      tStBlkDataT    = tBlkDataS;
>>>      tStBlkDataS    = tBlkDataD;
>>>
>>>      tStBlkAddrV    = tBlkAddrU;
>>>      tStBlkAddrU    = tBlkAddrT;
>>>      tStBlkAddrT    = tBlkAddrS;
>>>      tStBlkAddrS    = tBlkAddrD;
>>> `endif
>>>
>>>      tStBlkDataD    = tBlkDataC;
>>>      tStBlkDataC    = tBlkDataB;
>>>      tStBlkDataB    = tBlkDataA;
>>>      tStBlkDataA    = l1mDataIn;
>>>
>>>      tStBlkAddrD    = tBlkAddrC;
>>>      tStBlkAddrC    = tBlkAddrB;
>>>      tStBlkAddrB    = tBlkAddrA;
>>>      tStBlkAddrA    = l1mAddrIn;
>>>
>>>      tStBlkAddrB_Adv = 1;
>>>      tStBlkAddrC_Adv = 1;
>>>      tStBlkAddrD_Adv = 1;
>>>      tStBlkAddrS_Adv = 1;
>>>      tStBlkAddrT_Adv = 1;
>>>      tStBlkAddrU_Adv = 1;
>>>      tStBlkAddrV_Adv = 1;
>>>
>>>      tStBlkAddrA_Flu = 0;
>>>      tStBlkAddrB_Flu = 0;
>>>      tStBlkAddrC_Flu = 0;
>>>      tStBlkAddrD_Flu = 0;
>>>      tStBlkAddrS_Flu = 0;
>>>      tStBlkAddrT_Flu = 0;
>>>      tStBlkAddrU_Flu = 0;
>>>      tStBlkAddrV_Flu = 0;
>>>
>>>      if(l1mOpmIn_IsMemStReq &&
>>>          (l1mAddrIn[47:44]==4'hC) &&
>>>          !l1mOpmIn[11])
>>>      begin
>>>          tDoBlkSt    = 1;
>>>      end
>>>
>>> // `ifndef def_true
>>> `ifdef def_true
>>>      if(
>>>          l2mRingIsMemLdResp &&
>>>          (l2mAddrIn[47:44]==4'hC) &&
>>>          !l2mOpmIn[11] &&
>>>          !l2mOpmIn[3])
>>>      begin
>>> //            tBlkStIx    = l2mAddrIn[8:4];
>>>          tBlkStIx    = l2mAddrIn[9:4];
>>>          tStBlkDataA    = l2mDataIn;
>>>          tStBlkAddrA    = l2mAddrIn;
>>>          tDoBlkSt    = 1;
>>>      end
>>> `endif
>>>
>>>      tStBlkAddrA[3:0] = tArrChk;
>>>
>>>      if(tStBlkAddrA[31:24]==0)
>>>          tDoBlkSt    = 0;
>>> //        if(tStBlkAddrA[47:44]!=4'hC)
>>> //            tDoBlkSt    = 0;
>>>
>>>      if(tStBlkAddrA[43:32]!=0)
>>>          tDoBlkSt    = 0;
>>>      if(tStBlkAddrA[31:27]!=0)
>>>          tDoBlkSt    = 0;
>>>
>>>      if((l1mAddrIn[47:44]==4'hD) || l1mOpmIn[11])
>>>      begin
>>>          /* If flushing a line, flush all the ways. */
>>>
>>>          tStBlkAddrB_Adv = 0;
>>>          tStBlkAddrC_Adv = 0;
>>>          tStBlkAddrD_Adv = 0;
>>>          tStBlkAddrS_Adv = 0;
>>>          tStBlkAddrT_Adv = 0;
>>>          tStBlkAddrU_Adv = 0;
>>>          tStBlkAddrV_Adv = 0;
>>>
>>>          tStBlkAddrA_Flu = 1;
>>>          tStBlkAddrB_Flu = 1;
>>>          tStBlkAddrC_Flu = 1;
>>>          tStBlkAddrD_Flu = 1;
>>>          tStBlkAddrS_Flu = 1;
>>>          tStBlkAddrT_Flu = 1;
>>>          tStBlkAddrU_Flu = 1;
>>>          tStBlkAddrV_Flu = 1;
>>>
>>>          tDoBlkSt    = 1;
>>>      end
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>>      casez({tBlkHitA, tBlkHitB, tBlkHitC, tBlkHitD,
>>>          tBlkHitS, tBlkHitT, tBlkHitU, tBlkHitV})
>>>          8'b1zzzzzzz: begin tBlkHitId = 0; tBlkHitH    = 1; end
>>>          8'b01zzzzzz: begin tBlkHitId = 1; tBlkHitH    = 1; end
>>>          8'b001zzzzz: begin tBlkHitId = 2; tBlkHitH    = 1; end
>>>          8'b0001zzzz: begin tBlkHitId = 3; tBlkHitH    = 1; end
>>>          8'b00001zzz: begin tBlkHitId = 4; tBlkHitH    = 1; end
>>>          8'b000001zz: begin tBlkHitId = 5; tBlkHitH    = 1; end
>>>          8'b0000001z: begin tBlkHitId = 6; tBlkHitH    = 1; end
>>>          8'b00000001: begin tBlkHitId = 7; tBlkHitH    = 1; end
>>>          8'b00000000: begin tBlkHitId = 0; tBlkHitH    = 0; end
>>>      endcase
>>>
>>>      case(tBlkHitId[2:0])
>>>          3'h0: begin
>>>              tBlkDataH    = tBlkDataA;
>>>              tBlkAddrH    = tBlkAddrA;
>>>          end
>>>          3'h1: begin
>>>              tBlkDataH    = tBlkDataB;
>>>              tBlkAddrH    = tBlkAddrB;
>>>          end
>>>          3'h2: begin
>>>              tBlkDataH    = tBlkDataC;
>>>              tBlkAddrH    = tBlkAddrC;
>>>          end
>>>          3'h3: begin
>>>              tBlkDataH    = tBlkDataD;
>>>              tBlkAddrH    = tBlkAddrD;
>>>          end
>>>          3'h4: begin
>>>              tBlkDataH    = tBlkDataS;
>>>              tBlkAddrH    = tBlkAddrS;
>>>          end
>>>          3'h5: begin
>>>              tBlkDataH    = tBlkDataT;
>>>              tBlkAddrH    = tBlkAddrT;
>>>          end
>>>          3'h6: begin
>>>              tBlkDataH    = tBlkDataU;
>>>              tBlkAddrH    = tBlkAddrU;
>>>          end
>>>          3'h7: begin
>>>              tBlkDataH    = tBlkDataV;
>>>              tBlkAddrH    = tBlkAddrV;
>>>          end
>>>      endcase
>>> `else
>>>      casez({tBlkHitA, tBlkHitB, tBlkHitC, tBlkHitD})
>>>          4'b1zzz: begin tBlkHitId = 0; tBlkHitH    = 1; end
>>>          4'b01zz: begin tBlkHitId = 1; tBlkHitH    = 1; end
>>>          4'b001z: begin tBlkHitId = 2; tBlkHitH    = 1; end
>>>          4'b0001: begin tBlkHitId = 3; tBlkHitH    = 1; end
>>>          4'b0000: begin tBlkHitId = 0; tBlkHitH    = 0; end
>>>      endcase
>>>
>>>      case(tBlkHitId[1:0])
>>>          2'h0: begin
>>>              tBlkDataH    = tBlkDataA;
>>>              tBlkAddrH    = tBlkAddrA;
>>>          end
>>>          2'h1: begin
>>>              tBlkDataH    = tBlkDataB;
>>>              tBlkAddrH    = tBlkAddrB;
>>>          end
>>>          2'h2: begin
>>>              tBlkDataH    = tBlkDataC;
>>>              tBlkAddrH    = tBlkAddrC;
>>>          end
>>>          2'h3: begin
>>>              tBlkDataH    = tBlkDataD;
>>>              tBlkAddrH    = tBlkAddrD;
>>>          end
>>>      endcase
>>> `endif
>>>
>>> `ifdef jx2_rbi_bridge_vca_mtf
>>>      /* If a line had hit in the cache, move it to the front. */
>>>
>
> Added here:
>     if(!tDoBlkSt)
>     begin
>         tStBlkDataA    = tBlkDataH;
>         tStBlkAddrA    = tBlkAddrH;
>     end
>
> And, removed the similar blocks from within the following 'case', noting
> that the value being fed back into the arrays was the same as the
> value being selected from the arrays.
>
>
>>>      case(tBlkHitId[2:0])
>>>          3'h0: begin
>>> //            if(!tDoBlkSt)
>>> //            begin
>>> //                tStBlkDataA    = tBlkDataA;
>>> //                tStBlkAddrA    = tBlkAddrA;
>>> //            end
>>>          end
>>>
>>>          3'h1: begin
>>>              if(!tDoBlkSt)
>>>              begin
>>>                  tStBlkDataA    = tBlkDataB;
>>>                  tStBlkAddrA    = tBlkAddrB;
>>>              end
>>>
>>>              tStBlkAddrB_Adv = 1;
>>>              tStBlkAddrC_Adv = 0;
>>>              tStBlkAddrD_Adv = 0;
>>>              tStBlkAddrS_Adv = 0;
>>>              tStBlkAddrT_Adv = 0;
>>>              tStBlkAddrU_Adv = 0;
>>>              tStBlkAddrV_Adv = 0;
>>>
>>>              tDoBlkSt    = 1;
>>>          end
>>>          3'h2: begin
>>>              if(!tDoBlkSt)
>>>              begin
>>>                  tStBlkDataA    = tBlkDataC;
>>>                  tStBlkAddrA    = tBlkAddrC;
>>>              end
>>>
>>>              tStBlkAddrB_Adv = 1;
>>>              tStBlkAddrC_Adv = 1;
>>>              tStBlkAddrD_Adv = 0;
>>>              tStBlkAddrS_Adv = 0;
>>>              tStBlkAddrT_Adv = 0;
>>>              tStBlkAddrU_Adv = 0;
>>>              tStBlkAddrV_Adv = 0;
>>>
>>>              tDoBlkSt    = 1;
>>>          end
>>>          3'h3: begin
>>>              if(!tDoBlkSt)
>>>              begin
>>>                  tStBlkDataA    = tBlkDataD;
>>>                  tStBlkAddrA    = tBlkAddrD;
>>>              end
>>>
>>>              tStBlkAddrB_Adv = 1;
>>>              tStBlkAddrC_Adv = 1;
>>>              tStBlkAddrD_Adv = 1;
>>>              tStBlkAddrS_Adv = 0;
>>>              tStBlkAddrT_Adv = 0;
>>>              tStBlkAddrU_Adv = 0;
>>>              tStBlkAddrV_Adv = 0;
>>>
>>>              tDoBlkSt    = 1;
>>>          end
>>> `ifdef jx2_rbi_bridge_vca_8x
>>>          3'h4: begin
>>>              if(!tDoBlkSt)
>>>              begin
>>>                  tStBlkDataA    = tBlkDataS;
>>>                  tStBlkAddrA    = tBlkAddrS;
>>>              end
>>>
>>>              tStBlkAddrB_Adv = 1;
>>>              tStBlkAddrC_Adv = 1;
>>>              tStBlkAddrD_Adv = 1;
>>>              tStBlkAddrS_Adv = 1;
>>>              tStBlkAddrT_Adv = 0;
>>>              tStBlkAddrU_Adv = 0;
>>>              tStBlkAddrV_Adv = 0;
>>>
>>>              tDoBlkSt    = 1;
>>>          end
>>>          3'h5: begin
>>>              if(!tDoBlkSt)
>>>              begin
>>>                  tStBlkDataA    = tBlkDataT;
>>>                  tStBlkAddrA    = tBlkAddrT;
>>>              end
>>>
>>>              tStBlkAddrB_Adv = 1;
>>>              tStBlkAddrC_Adv = 1;
>>>              tStBlkAddrD_Adv = 1;
>>>              tStBlkAddrS_Adv = 1;
>>>              tStBlkAddrT_Adv = 1;
>>>              tStBlkAddrU_Adv = 0;
>>>              tStBlkAddrV_Adv = 0;
>>>
>>>              tDoBlkSt    = 1;
>>>          end
>>>          3'h6: begin
>>>              if(!tDoBlkSt)
>>>              begin
>>>                  tStBlkDataA    = tBlkDataU;
>>>                  tStBlkAddrA    = tBlkAddrU;
>>>              end
>>>
>>>              tStBlkAddrB_Adv = 1;
>>>              tStBlkAddrC_Adv = 1;
>>>              tStBlkAddrD_Adv = 1;
>>>              tStBlkAddrS_Adv = 1;
>>>              tStBlkAddrT_Adv = 1;
>>>              tStBlkAddrU_Adv = 1;
>>>              tStBlkAddrV_Adv = 0;
>>>
>>>              tDoBlkSt    = 1;
>>>          end
>>>          3'h7: begin
>>>              if(!tDoBlkSt)
>>>              begin
>>>                  tStBlkDataA    = tBlkDataV;
>>>                  tStBlkAddrA    = tBlkAddrV;
>>>              end
>>>
>>>              tStBlkAddrB_Adv = 1;
>>>              tStBlkAddrC_Adv = 1;
>>>              tStBlkAddrD_Adv = 1;
>>>              tStBlkAddrS_Adv = 1;
>>>              tStBlkAddrT_Adv = 1;
>>>              tStBlkAddrU_Adv = 1;
>>>              tStBlkAddrV_Adv = 1;
>>>
>>>              tDoBlkSt    = 1;
>>>          end
>>> `else
>>>          default: begin
>>>          end
>>> `endif
>>>      endcase
>>> `endif
>>>
>>>      tStBlkDataD    = tStBlkAddrD_Adv ? tBlkDataC : tBlkDataD;
>>>      tStBlkDataC    = tStBlkAddrC_Adv ? tBlkDataB : tBlkDataC;
>>>      tStBlkDataB    = tStBlkAddrB_Adv ? tBlkDataA : tBlkDataB;
>>>
>>>      tStBlkAddrD    = tStBlkAddrD_Adv ? tBlkAddrC : tBlkAddrD;
>>>      tStBlkAddrC    = tStBlkAddrC_Adv ? tBlkAddrB : tBlkAddrC;
>>>      tStBlkAddrB    = tStBlkAddrB_Adv ? tBlkAddrA : tBlkAddrB;
>>>
>>>      if(tStBlkAddrD_Adv ? tStBlkAddrC_Flu : tStBlkAddrD_Flu)
>>>          tStBlkAddrD[47:44] = 4'hF;
>>>      if(tStBlkAddrC_Adv ? tStBlkAddrB_Flu : tStBlkAddrC_Flu)
>>>          tStBlkAddrC[47:44] = 4'hF;
>>>      if(tStBlkAddrB_Adv ? tStBlkAddrA_Flu : tStBlkAddrB_Flu)
>>>          tStBlkAddrB[47:44] = 4'hF;
>>>      if(tStBlkAddrA_Flu)
>>>          tStBlkAddrA[47:44] = 4'hF;
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>>      tStBlkDataV    = tStBlkAddrV_Adv ? tBlkDataU : tBlkDataV;
>>>      tStBlkDataU    = tStBlkAddrU_Adv ? tBlkDataT : tBlkDataU;
>>>      tStBlkDataT    = tStBlkAddrT_Adv ? tBlkDataS : tBlkDataT;
>>>      tStBlkDataS    = tStBlkAddrS_Adv ? tBlkDataD : tBlkDataS;
>>>
>>>      tStBlkAddrV    = tStBlkAddrV_Adv ? tBlkAddrU : tBlkAddrV;
>>>      tStBlkAddrU    = tStBlkAddrU_Adv ? tBlkAddrT : tBlkAddrU;
>>>      tStBlkAddrT    = tStBlkAddrT_Adv ? tBlkAddrS : tBlkAddrT;
>>>      tStBlkAddrS    = tStBlkAddrS_Adv ? tBlkAddrD : tBlkAddrS;
>>>
>>>      if(tStBlkAddrV_Adv ? tStBlkAddrU_Flu : tStBlkAddrV_Flu)
>>>          tStBlkAddrV[47:44] = 4'hF;
>>>      if(tStBlkAddrU_Adv ? tStBlkAddrT_Flu : tStBlkAddrU_Flu)
>>>          tStBlkAddrU[47:44] = 4'hF;
>>>      if(tStBlkAddrT_Adv ? tStBlkAddrS_Flu : tStBlkAddrT_Flu)
>>>          tStBlkAddrT[47:44] = 4'hF;
>>>      if(tStBlkAddrS_Adv ? tStBlkAddrD_Flu : tStBlkAddrS_Flu)
>>>          tStBlkAddrS[47:44] = 4'hF;
>>> `endif
>>>
>>>      if(tBlkAddrH[3:0]!=tArrChk)
>>>          tBlkHitH    = 0;
>>>
>>>      if(tBlkAddrH[47:44]!=4'hC)
>>>          tBlkHitH    = 0;
>>>
>>>      /* Reject addresses outside normal physical space. */
>>>      if(l1mAddrIn[47:44]!=4'hC)
>>>          tBlkHitH    = 0;
>>>
>>>      if(l1mAddrIn[43:32]!=0)
>>>          tBlkHitH    = 0;
>>>
>>> //        tBlkHitH    = 0;
>>>
>>>
>>>      tL1mSeqOut  = l2mSeqIn;
>>>      tL1mOpmOut  = l2mOpmIn;
>>> //    tL1mAddrOut = l2mAddrIn;
>>>      tL1mDataOut = l2mDataIn;
>>>
>>>      tL2mSeqOut  = l1mSeqIn;
>>>      tL2mOpmOut  = l1mOpmIn;
>>> //    tL2mAddrOut = l1mAddrIn;
>>>      tL2mDataOut = l1mDataIn;
>>>
>>> `ifdef jx2_bus_mixaddr96
>>>      tL1mAddrOut = { UV48_00, l2mAddrIn };
>>>      tL2mAddrOut = l1mAddrIn[47:0];
>>> `else
>>>      tL1mAddrOut = l2mAddrIn;
>>>      tL2mAddrOut = l1mAddrIn;
>>> `endif
>>>
>>> // `ifndef def_true
>>> `ifdef def_true
>>>      if(l2mOpmIn_IsReq || l2mRingIsRespOther)
>>>      begin
>>>          if(l1mRingIsIdle)
>>>          begin
>>>              /* Avoid letting requests back into L1 ring. */
>>>
>>>              tL1mSeqOut  = l1mSeqIn;
>>>              tL1mOpmOut  = l1mOpmIn;
>>>              tL1mAddrOut = l1mAddrIn;
>>>              tL1mDataOut = l1mDataIn;
>>>
>>>
>>>              tL2mSeqOut  = l2mSeqIn;
>>>              tL2mOpmOut  = l2mOpmIn;
>>>              tL2mAddrOut = l2mAddrIn;
>>>              tL2mDataOut = l2mDataIn;
>>>          end
>>>      end
>>> `endif
>>>
>>> // `ifndef def_true
>>> `ifdef def_true
>>>      if(l1mOpmIn_IsMemLdReq && tBlkHitH)
>>>      begin
>>>          tL1mSeqOut  = l1mSeqIn;
>>>          tL1mOpmOut  = {
>>>              l1mOpmIn[15:8],
>>>              JX2_RBI_OPM_OKLD[7:4],
>>>              l1mOpmIn[11:8]};
>>>          tL1mAddrOut = l1mAddrIn;
>>>          tL1mDataOut = tBlkDataH;
>>>
>>>          tL2mSeqOut  = l2mSeqIn;
>>>          tL2mOpmOut  = l2mOpmIn;
>>>          tL2mAddrOut = l2mAddrIn;
>>>          tL2mDataOut = l2mDataIn;
>>>      end
>>> `endif
>>>
>>> `ifdef def_true
>>> // `ifndef def_true
>>>      if(reset)
>>>      begin
>>>          /* Clear ring during reset */
>>>
>>>          tL1mSeqOut  = 0;
>>>          tL1mOpmOut  = 0;
>>>
>>>          tL2mSeqOut  = 0;
>>>          tL2mOpmOut  = 0;
>>>      end
>>> `endif
>>>
>>> end
>>>
>>>
>>> always @(posedge clock)
>>> begin
>>>      tL1mSeqOut2        <= tL1mSeqOut;
>>>      tL1mOpmOut2        <= tL1mOpmOut;
>>>      tL1mAddrOut2    <= tL1mAddrOut;
>>>      tL1mDataOut2    <= tL1mDataOut;
>>>
>>>      tL2mSeqOut2        <= tL2mSeqOut;
>>>      tL2mOpmOut2        <= tL2mOpmOut;
>>>      tL2mAddrOut2    <= tL2mAddrOut;
>>>      tL2mDataOut2    <= tL2mDataOut;
>>>
>>>      tArrChk            <= tNxtArrChk;
>>>
>>>      if(tDoBlkSt)
>>>      begin
>>>          tArrDataA[tBlkStIx]        <= tStBlkDataA;
>>>          tArrDataB[tBlkStIx]        <= tStBlkDataB;
>>>          tArrDataC[tBlkStIx]        <= tStBlkDataC;
>>>          tArrDataD[tBlkStIx]        <= tStBlkDataD;
>>>
>>>          tArrAddrA[tBlkStIx]        <= tStBlkAddrA;
>>>          tArrAddrB[tBlkStIx]        <= tStBlkAddrB;
>>>          tArrAddrC[tBlkStIx]        <= tStBlkAddrC;
>>>          tArrAddrD[tBlkStIx]        <= tStBlkAddrD;
>>>
>>> `ifdef jx2_rbi_bridge_vca_8x
>>>          tArrDataS[tBlkStIx]        <= tStBlkDataS;
>>>          tArrDataT[tBlkStIx]        <= tStBlkDataT;
>>>          tArrDataU[tBlkStIx]        <= tStBlkDataU;
>>>          tArrDataV[tBlkStIx]        <= tStBlkDataV;
>>>
>>>          tArrAddrS[tBlkStIx]        <= tStBlkAddrS;
>>>          tArrAddrT[tBlkStIx]        <= tStBlkAddrT;
>>>          tArrAddrU[tBlkStIx]        <= tStBlkAddrU;
>>>          tArrAddrV[tBlkStIx]        <= tStBlkAddrV;
>>> `endif
>>>      end
>>> end
>>>
>>>
>>> endmodule
>>
>
OT: How does one start a new topic in Thunderbird? Follow-ups and
messages can be created, but new topics?

The rf8088 core, first created in 2009, was dusted off and updated
slightly. It should run close to 100MHz; it is 360 picoseconds short of
meeting timing. It executes instructions in fewer clock cycles than the
8088, so it is probably like having a 150-200MHz 8088. It is fairly
large for an 8088 core, at about 5,000 LUTs. It needs to be upgraded to
a 386 for another project called Bigfoot. Bigfoot is a combination of
several backwards-compatible cores in a single core. Low probability of
success on this project. A means of switching to a different ISA from
Bigfoot's native ISA has been devised, but the challenge is how to get
back to the native instruction set while retaining backwards
compatibility with existing ISAs. One thought is to have a jump or a
call to a specific address, which causes the ISA to change. It could be
some sort of gate for the 386.


Re: Misc: 8-way vs 4-way cache, a cost mystery...

<urukn1$1ovo0$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37759&group=comp.arch#37759

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
Date: Sat, 2 Mar 2024 01:32:38 -0600
Organization: A noiseless patient Spider
Lines: 172
Message-ID: <urukn1$1ovo0$1@dont-email.me>
References: <urlbcp$3bvto$1@dont-email.me>
<02479652582caa588ac2e4758b19126b@www.novabbs.org>
<urmmai$3npq9$1@dont-email.me>
<04d0c9094fa116ab88e3f9615042b16b@www.novabbs.org>
<urotrg$b0cg$1@dont-email.me>
<943506edf2d6396f973365e603a956c3@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 2 Mar 2024 07:32:49 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b29a3cb2cfa69574017b1ed3ffe36618";
logging-data="1867520"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/cqvdoHl1xpJn8TlQRIG6SKQ6cM6OEcR8="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:FDA/o6yGsnBoCQ5wlsgIzzFuJJg=
Content-Language: en-US
In-Reply-To: <943506edf2d6396f973365e603a956c3@www.novabbs.org>
 by: BGB - Sat, 2 Mar 2024 07:32 UTC

On 2/29/2024 1:39 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 2/28/2024 5:26 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
>>>>> A thought::
>>>>>
>>>>> Construct the 8-way cache from a pair of 4-way cache instances
>>>>> and connect both into one 8-way with a single layer of logic
>>>>> {multiplexing.}
>>>
>>>> Possible, I have decided for now to stick with 4-way...
>>>
>>>
>>>> But, even then, efforts at trying to optimize this seem to be
>>>> causing the LUT cost to increase rather than decrease...
>>>
>>> Then you have tickled one of Verilog's insidious demons.
>>>
>>> How many elements in a way ?? and how many bits in an element ??
>>> Is there a way to make a "way" into a single SRAM ?? (or part of a
>>> single SRAM) ??
>>>
>>> What I am getting at is that "conceptually" an n-way set associative
>>> cache is not recognizably different from n copies of a 1/n-size direct
>>> mapped cache coupled to a set/way selection multiplexer based on
>>> address bits compare. {{And of course write set selection.}}
>>>
>
>> I am not sure.
>
>> In this case, I interpreted it as, say, 4 or 8 parallel sets of
>> arrays, with the corresponding match and multiplex logic.
>
> They should be 4 or 8 parallel instances of a 1-way (DM) cache with a
> comparator and an output multiplexer signal and an input write signal.
> The Move to Front is easier done with Not-recently-Used bits as a guise
> for least recently used.
> Each way has a NRU bit--at reset and when all NRU bits are set, they are
> cleared asynchronously. Then as each set is hit, the NRU bit is set. You
> do not reallocate the sets with the NRU bit set. I see no reason to move
> one set to the front if you can alter the reallocation selection to avoid
> picking it. {{3 gates per line}}
>

I guess it could be possible, say, by having 8 bits (4x2 bits) or 24
bits (8x3 bits) to encode the permutation, then using the permutation to
drive which index to update rather than having the items swap places.

This could possibly reduce LUT cost vs the existing MTF scheme...

I may need to explore this idea.
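
A rough behavioral sketch of the permutation idea, in Python rather than Verilog, just to pin down the bookkeeping. It assumes a 4-way set with a packed 8-bit order word (4x2 bits, most recent way in the low field). The names `pack`, `unpack`, `touch`, and `victim` are made up for illustration and do not come from the bridge module. The point is that a hit only rewrites the order word, while tags and data stay in their physical ways:

```python
WAYS = 4   # assumed 4-way; an 8-way set would use WAYS = 8, BITS = 3 (24 bits)
BITS = 2   # bits per field in the packed order word


def unpack(order_word):
    """Split the packed order word into way numbers, most recent first."""
    return [(order_word >> (BITS * i)) & (WAYS - 1) for i in range(WAYS)]


def pack(order):
    """Pack a list of way numbers (most recent first) into one integer."""
    word = 0
    for i, way in enumerate(order):
        word |= way << (BITS * i)
    return word


def touch(order_word, way):
    """Mark `way` as most recent; only the order word changes."""
    order = unpack(order_word)
    order.remove(way)
    return pack([way] + order)


def victim(order_word):
    """Way sitting in the least-recent slot, i.e. the one to overwrite on a miss."""
    return unpack(order_word)[-1]


IDENTITY = pack([0, 1, 2, 3])   # reset value: way 0 most recent, way 3 least
```

Starting from `IDENTITY`, `victim` picks way 3; after `touch(IDENTITY, 3)` the order becomes 3, 0, 1, 2 and the next victim is way 2.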

Had otherwise gotten back to trying to get more things in RV64G working;
had recently noted that another bug (one that had broken Quake's
rendering) was due to the FSGNxx instructions not working correctly.

In the emulator, there is still a bug in Quake where firing weapons
seems to be inverted along the Z axis, so looking up causes the weapon
to fire down.

Otherwise, may need to look more into trying to figure out how to get
RV64 ELF binaries in a form where I can load them at arbitrary addresses.

Well, along with:
Trying to beat on my compiler to try to get more performance from BJX2;
Debating more whether I should continue to look into the possibility of
a "cleaned up" ISA design.

Priorities would be to limit how much of the compiler or other things
need to be modified, and it should (once again) hopefully be mostly ASM
compatible.

Though, there are issues here:
There is also incentive to add a Zero register and move LR and GBR into
GPR space, but this would mean R0..R3 will be displaced.

Could hack over it, say, by remapping these elsewhere (say, R48..R51);
but this would just be a case of recreating the same sort of hackery
that an ISA redesign would be meant to eliminate. But, otherwise, it
would be unavoidable that any existing ASM code would be broken (well,
at least excluding any ASM code that tries to use R48..R51 or similar
and assumes that R0..R3 were not aliased to these).

But, the main incentive for doing this is to avoid some of the overhead
that exists in my existing ISA due to these registers not being GPRs,
and also to reduce the number of instruction-forms needed (For example,
one doesn't need things like a "NEG Rm, Rn" instruction if they can
encode it as "SUB ZR, Rm, Rn").

But, in practice it may not matter that much, since the decoder just
decodes "NEG Rm, Rn" as if it had been "SUB ZR, Rm, Rn" (and the cost in
the decoder for stuff like this isn't *that* large).
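
To illustrate the aliasing, here is a toy Python model (not the actual decoder). `ZR = 0` is an assumed register number, the source-source-destination operand order is assumed from the example above, and `decode` / `execute` are hypothetical helpers. Both spellings decode to the same internal operation, so everything after the decoder only ever sees a SUB:

```python
ZR = 0  # assumed register number for the zero register (hypothetical)


def decode(mnemonic, *regs):
    """Map an instruction to the internal form (op, src_a, src_b, dst)."""
    if mnemonic == "NEG":            # NEG Rm, Rn  is treated as  SUB ZR, Rm, Rn
        rm, rn = regs
        return ("SUB", ZR, rm, rn)
    if mnemonic == "SUB":            # SUB Rs, Rt, Rn  computes  Rn = Rs - Rt
        rs, rt, rn = regs
        return ("SUB", rs, rt, rn)
    raise ValueError("unhandled mnemonic: " + mnemonic)


def execute(uop, regfile):
    """Apply a decoded op to a register file held as a list of ints."""
    op, src_a, src_b, dst = uop

    def read(reg):
        return 0 if reg == ZR else regfile[reg]   # ZR always reads as zero

    if op == "SUB":
        regfile[dst] = read(src_a) - read(src_b)
    return regfile
```

The extra cost of keeping NEG as a separate mnemonic is then one more decode case, which matches the point that it is not *that* large.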

It could also be possible to rework the physical address space for the
sake of RISC-V compatibility, say, going from the current layout:
00000000..00007FFF: Boot ROM
00008000..0000BFFF: Boot ROM (Ext)
0000C000..0000DFFF: Boot/ISR SRAM
0000E000..0000EFFF: Boot/ISR SRAM (Ext)
00010000..0001FFFF: ROM (All Zeroes)
00020000..0002FFFF: ROM (All Break Instructions)
...
01000000..3FFFFFFF: DRAM Space
...

To, say:
00000000..0000BFFF: Boot ROM
0000C000..0000FFFF: Boot/ISR SRAM
00010000..3FFFFFFF: DRAM Space
...

Possibly with things like audio RAM and VRAM needing to be relocated. If
one tries to load a large ELF image at the 64K mark, there is a non-zero
risk it could collide with the VRAM, which is currently located at
00A00000, ...
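
A quick way to put a number on that, as a Python sketch. The DRAM base is taken from the proposed map and the VRAM base is the current one; the VRAM size is not given here, so `overlaps` takes region sizes as caller-supplied parameters. The helper names are hypothetical:

```python
DRAM_BASE = 0x00010000   # proposed start of DRAM space (64K mark)
VRAM_BASE = 0x00A00000   # current VRAM location


def overlaps(base_a, size_a, base_b, size_b):
    """True if [base_a, base_a+size_a) and [base_b, base_b+size_b) intersect."""
    return base_a < base_b + size_b and base_b < base_a + size_a


def max_image_size(load_base=DRAM_BASE, hole_base=VRAM_BASE):
    """Largest image that fits at load_base without reaching hole_base."""
    return hole_base - load_base
```

With these two bases, an image loaded at the 64K mark has 0x9F0000 bytes (just under 10 MiB) before it reaches the VRAM base.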

Granted, one could argue that another option is just to "hold one's
nose" and accept RISC-V as "the standard", given it is more popular,
albeit slower in my implementation than my current ISA.

Apparently they have organized a "Combined Instructions SIG" effort,
which does seem to be evaluating addressing some of my complaint areas
with RISC-V.

Granted, even if they do end up adding indexed load/store and some
bigger constant-load instructions; this will still not change that
RISC-V's immediate fields are horribly dog-chewed. But, OTOH, someone
could also complain about register-fields being dog-chewed in my ISA
design. But, then again, it is possibly debatable how much of a
difference dog-chewed fields make in actual use.

But, then again, "non-dog-chewed" reg/imm fields would be one of the
main incentives for trying to redo the encoding.

Though, this is within limits (if the new design would still use jumbo
prefixes, then mangled constants are to some extent unavoidable).

Otherwise, may as well just continue to stick with what I have already...

After all, what I have already does seem to (technically) be working OK,
even if the design has gotten more hairy than I would prefer (IOW: not
particularly clean or elegant).

>> In the first instance, adding an item always shifted each item over
>> one position and added a new item to the front.
>
>> The MTF logic tries to move an accessed item to the front, or shift
>> each item back as before if it is a new address. If the address hits
>> while adding an item, it behaves as if it were moving it to the
>> front, but effectively replaces the item being moved to the front with
>> the data being written.
>
>> The MTF logic seems to increase hit-rate, but eats a lot of additional
>> LUTs.
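
For reference, the behavior described in the quoted text can be modelled in a few lines of Python. This is a sketch only, not a translation of the Verilog: `mtf_access` is a hypothetical helper, one set is a list of (addr, data) pairs with the most recent line first, and a load miss is assumed to leave the set untouched until the fill arrives as a store:

```python
def mtf_access(ways, addr, data=None):
    """Model one access to a move-to-front set.

    ways: list of (addr, data) pairs, most recent first.
    data: None for a load, or the new data for a store/fill.
    Returns (hit_data, new_ways); hit_data is the line's previous
    data on a hit and None on a miss.
    """
    for k, (line_addr, line_data) in enumerate(ways):
        if line_addr == addr:
            # Hit: the line moves to the front; a store replaces its data.
            front = (addr, line_data if data is None else data)
            return line_data, [front] + ways[:k] + ways[k + 1:]
    if data is None:
        # Load miss: nothing to reorder.
        return None, list(ways)
    # Store/fill miss: shift every line back one slot and drop the last.
    return None, [(addr, data)] + ways[:-1]
```

Only the lines ahead of the hit position move back, which corresponds to the per-way advance flags in the hardware version.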

Re: Misc: 8-way vs 4-way cache, a cost mystery...

<a0e30daf4b8b62f1aad7ca0382b012a8@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37760&group=comp.arch#37760

Date: Sat, 2 Mar 2024 18:00:18 +0000
Subject: Re: Misc: 8-way vs 4-way cache, a cost mystery...
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$YfPdKpxVvcfGCeT55tvePOtw.D.uIdrEjjQfiKzU7JpyO6LF1cLs.
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <urlbcp$3bvto$1@dont-email.me> <02479652582caa588ac2e4758b19126b@www.novabbs.org> <urmmai$3npq9$1@dont-email.me> <04d0c9094fa116ab88e3f9615042b16b@www.novabbs.org> <urotrg$b0cg$1@dont-email.me> <943506edf2d6396f973365e603a956c3@www.novabbs.org> <urukn1$1ovo0$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <a0e30daf4b8b62f1aad7ca0382b012a8@www.novabbs.org>
 by: MitchAlsup1 - Sat, 2 Mar 2024 18:00 UTC

BGB wrote:

> On 2/29/2024 1:39 PM, MitchAlsup1 wrote:
>>
>> They should be 4 or 8 parallel instances of a 1-way (DM) cache with a
>> comparator and an output multiplexer signal and an input write signal.
>> The Move to Front is easier done with Not-recently-Used bits as a guise
>> for least recently used.

>> Each way has a NRU bit--at reset and when all NRU bits are set, they are
>> cleared asynchronously. Then as each set is hit, the NRU bit is set. You
>> do not reallocate the sets with the NRU bit set. I see no reason to move
>> one set to the front if you can alter the reallocation selection to avoid
>> picking it. {{3 gates per line}}
>>

> I guess it could be possible, say, by having 8 bits (4x2 bits) or 24
> bits (8x3 bits) to encode the permutation, then using the permutation to
> drive which index to update rather than having the items swap places.

> This could possibly reduce LUT cost vs the existing MTF scheme...

> I may need to explore this idea.

Each line in a way has a Not-Recently-Used bit: 1 SR-flip-flop per line.
Every time the way is accessed the bit is set (the set input is asserted).
When all bits have been set they are all cleared asynchronously (the
reset input is asserted).

The line to be replaced is determined by a find first zero. There is
always at least 1 zero in the list.

I have been using this since the days of the MC68851.
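
A small Python model of the scheme as described above. It is an illustrative sketch, not RTL, and the class name `NruSet` and its methods are invented here. It keeps one bit per way within a set: the bit is set on access, all bits are cleared once every one of them is set, and the victim is chosen by find-first-zero:

```python
class NruSet:
    """Not-Recently-Used bookkeeping for one cache set."""

    def __init__(self, ways=4):
        # One bit per way; all clear at reset.
        self.used = [False] * ways

    def touch(self, way):
        """Record an access to `way` (the SR flip-flop's set input)."""
        self.used[way] = True
        if all(self.used):
            # Every bit set: clear them all (the shared reset input).
            self.used = [False] * len(self.used)

    def victim(self):
        """Find-first-zero: the way to replace on a miss."""
        # touch() never leaves all bits set, so a zero always exists.
        return self.used.index(False)
```

Because `touch` clears the bits as soon as the last one is set, `victim` always finds a zero, which is the "always at least 1 zero in the list" property.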
