devel / comp.arch / Re: L3 cache allocation on Intel

Subject -- Author
* L3 cache allocation on Intel -- Anton Ertl
+* Re: L3 cache allocation on Intel -- EricP
|`* Re: L3 cache allocation on Intel -- EricP
| `* Re: L3 cache allocation on Intel -- Anton Ertl
|  +* Re: L3 cache allocation on Intel -- EricP
|  |`* Re: L3 cache allocation on Intel -- Anton Ertl
|  | `* Re: L3 cache allocation on Intel -- Scott Lurndal
|  |  +- Re: L3 cache allocation on Intel -- MitchAlsup
|  |  `* Re: L3 cache allocation on Intel -- Anton Ertl
|  |   `- Re: L3 cache allocation on Intel -- Scott Lurndal
|  `* Re: L3 cache allocation on Intel -- EricP
|   `* Re: L3 cache allocation on Intel -- MitchAlsup
|    +* Re: L3 cache allocation on Intel -- EricP
|    |`- Re: L3 cache allocation on Intel -- MitchAlsup
|    +* Re: L3 cache allocation on Intel -- Terje Mathisen
|    |`* Re: L3 cache allocation on Intel -- MitchAlsup
|    | `* Re: L3 cache allocation on Intel -- Terje Mathisen
|    |  `- Re: L3 cache allocation on Intel -- Anton Ertl
|    +- Re: L3 cache allocation on Intel -- Paul A. Clayton
|    `- Re: L3 cache allocation on Intel -- EricP
+* Re: L3 cache allocation on Intel -- Scott Lurndal
|+* Re: L3 cache allocation on Intel -- MitchAlsup
||`* Re: L3 cache allocation on Intel -- Terje Mathisen
|| `- Re: L3 cache allocation on Intel -- MitchAlsup
|`* Re: L3 cache allocation on Intel -- Anton Ertl
| +* Re: L3 cache allocation on Intel -- Terje Mathisen
| |`* Re: L3 cache allocation on Intel -- Anton Ertl
| | `* Re: L3 cache allocation on Intel -- Michael S
| |  `* Re: L3 cache allocation on Intel -- Anton Ertl
| |   +* Re: L3 cache allocation on Intel -- Scott Lurndal
| |   |`* Re: L3 cache allocation on Intel -- Anton Ertl
| |   | `- Re: L3 cache allocation on Intel -- Scott Lurndal
| |   `* Re: L3 cache allocation on Intel -- Michael S
| |    `* Re: L3 cache allocation on Intel -- Anton Ertl
| |     `- Re: L3 cache allocation on Intel -- Michael S
| `- Re: L3 cache allocation on Intel -- Scott Lurndal
`- Re: L3 cache allocation on Intel -- EricP

Re: L3 cache allocation on Intel

<FFSYM.143096$2fS.42419@fx16.iad>

https://news.novabbs.org/devel/article-flat.php?id=34612&group=comp.arch#34612
Path: i2pn2.org!i2pn.org!news.1d4.us!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: L3 cache allocation on Intel
Newsgroups: comp.arch
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <SCVXM.40044$MJ59.16148@fx10.iad> <bTVXM.87760$8fO.38095@fx15.iad> <2023Oct19.211807@mips.complang.tuwien.ac.at> <uHyYM.84784$tnmf.71840@fx09.iad> <2023Oct21.165243@mips.complang.tuwien.ac.at>
Lines: 32
Message-ID: <FFSYM.143096$2fS.42419@fx16.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sat, 21 Oct 2023 16:05:57 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sat, 21 Oct 2023 16:05:57 GMT
X-Received-Bytes: 2255
 by: Scott Lurndal - Sat, 21 Oct 2023 16:05 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>EricP <ThatWouldBeTelling@thevillage.com> writes:
>>Systematic reverse engineering of cache slice selection
>>in Intel processors, 2015
>>https://eprint.iacr.org/2015/690.pdf
>
>This paper shows in Figure 2 that the set is determined without going
>through the hash function. It also shows that the set is determined
>from the lowest bits beyond the 6 bits used for addressing the byte
>within the cache line. And, as mentioned above, the hash value is
>determined by higher bits. So if there is an aligned block of 128KB
>consecutive physical addresses (say, in a huge page), it will be
>allocated in the same slice. Is this good, bad, or don't care? My
>feeling is that I would prefer to distribute consecutive cache lines
>across slices.

It may be useful to consider the access pattern to the huge
page - will accesses originate from a single thread or from multiple threads?

In the former case, keeping them in the same slice seems reasonable.

In the latter case, distributing them amongst slices may provide
some additional concurrency.

Note that huge pages are usually used to reduce TLB pressure, not
for caching purposes. The access patterns for a huge page may
not differ significantly from those to a similarly sized region
of 4k/16k/64k pages.

I wonder if the replacement algorithms anticipate access
pattern(s) over time, i.e. prefetch ahead when the accesses
have been sequential?
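A minimal sketch of the set/slice split described in the quoted passage: the set index is taken directly from the low physical-address bits above the 6-bit line offset, while the slice comes from a hash of the bits above them, so every line in an aligned 128KB block lands in the same slice. The bit positions and the stand-in hash below are illustrative assumptions, not Intel's documented mapping.

#include <stdint.h>
#include <stdio.h>

/* Set index straight from PA[16:6]; slice from a hash of the bits above PA[16]. */
static unsigned set_index(uint64_t pa)  { return (unsigned)((pa >> 6) & 0x7FF); }

static unsigned slice_hash(uint64_t pa)
{
    uint64_t hi = pa >> 17;                                 /* bits above the 128KB block */
    return (unsigned)((hi ^ (hi >> 3) ^ (hi >> 7)) & 0x7);  /* stand-in hash, 8 slices */
}

int main(void)
{
    uint64_t base = 0x40000000;                             /* 128KB-aligned address */
    for (uint64_t off = 0; off < 128 * 1024; off += 32 * 1024)
        printf("pa=%#llx  set=%4u  slice=%u\n",
               (unsigned long long)(base + off),
               set_index(base + off), slice_hash(base + off));
    /* all four sampled addresses report the same slice but different sets */
    return 0;
}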

Re: L3 cache allocation on Intel

<c7029734-e3b7-4134-b2b3-d12ace41a2fan@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34613&group=comp.arch#34613
X-Received: by 2002:a05:620a:8d0a:b0:778:914e:fb60 with SMTP id rb10-20020a05620a8d0a00b00778914efb60mr83456qkn.11.1697905543224;
Sat, 21 Oct 2023 09:25:43 -0700 (PDT)
X-Received: by 2002:a05:6870:14ce:b0:1e9:baa0:63f6 with SMTP id
l14-20020a05687014ce00b001e9baa063f6mr2138864oab.2.1697905542917; Sat, 21 Oct
2023 09:25:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!border-2.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 21 Oct 2023 09:25:42 -0700 (PDT)
In-Reply-To: <2023Oct20.171424@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.183.40; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.183.40
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <vqWXM.40046$MJ59.2768@fx10.iad>
<2023Oct19.212155@mips.complang.tuwien.ac.at> <ugt4bf$t978$1@dont-email.me>
<2023Oct20.103348@mips.complang.tuwien.ac.at> <914e69a3-d66c-4ddd-a136-d54a6742bc3en@googlegroups.com>
<2023Oct20.171424@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c7029734-e3b7-4134-b2b3-d12ace41a2fan@googlegroups.com>
Subject: Re: L3 cache allocation on Intel
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Sat, 21 Oct 2023 16:25:43 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 66
 by: Michael S - Sat, 21 Oct 2023 16:25 UTC

On Friday, October 20, 2023 at 6:35:26 PM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >On Friday, October 20, 2023 at 12:09:36 PM UTC+3, Anton Ertl wrote:
> >> Preliminaries: We have n sets in an
> >> L3 slice and s slices, with t=ceil(ld(s)).
> >>
> >> To compute the slice and set, n+t bits from the physical address.
> >> Multiply this number with s, giving x. The target slice then is
> >> x/2^(n+t) (which is in the range [0,n)). The target set is the bits
> >> n+t-1..t of x.
> >>
> >> This requires an (n+t)*t-bit multiplier with an n+2t-bit result, and
> >> for, e.g., Sapphire Rapids, t=6 and n=11 (assuming that each way has
> >> 128KB; with smaller ways and more associativity, n would be smaller; n
> >> cannot be larger, because 128KB is the largest power-of-2 that divides
> >> the size of an L3 slice). This multiplier size (17*6->23 bits) seems
> >> downright cheap compared to the integer (64*64->128) or FP
> >> multipliers, and, as you point out, also cheap compared to the rest of
> >> the L3 access (both in latency and in energy).
> >> - anton
> >> --
> >> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> >> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
> >
> >When you say "physical address" do you mean PA[16:6] a.k.a. Line Index in
> >classic Nehalem-to-SkylakeClient Intel LLC? Or something else?
> I wrote "n+t bits from the physical address", which would be 17 bits
> in the example above. So yes, maybe PA[22:6].
>

I don't understand how PA[22:17] could be of help.
More so, I don't understand how they could be non-harmful.
I mean, before Skylake Server/Cascade Lake.

In Skylake Server they use Sub-NUMA Clustering (SNC), which according
to the manual "can improve the average LLC/memory latency by splitting
the LLC into disjoint clusters based on address range, with each cluster
bound to a subset of memory controllers in the system."
So with SNC, assuming that the selection of memory controller takes into
account address bits higher than PA[16] (does it? I don't know), these
higher address bits necessarily influence the selection of LLC slice.

> One disadvantage of this scheme AFAICS is that we need to store
> essentially the whole physical address (apart from the bottom 6 bits)
> as tag, whereas in a normal un"hashed" cache access scheme you do not
> need to store the n bits that identify the set.
>
> I can think of ways of varying the scheme to use the physical address
> rather than the "hashed" address for accessing the set, and thus
> reduce the tags by n bits (per cache line) and also reduces the
> multiplier to a t*t->2t bit multiplier. But I am not sure if this
> results in a significantly skewed distribution, which would be worse
> than expending a few more hardware resources.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
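A minimal sketch of the multiplicative slice/set computation quoted above, using the Sapphire Rapids example numbers (n=11, t=6); the slice count s is a parameter. This only illustrates the arithmetic being proposed, not actual Intel hardware.

#include <stdint.h>

#define N 11   /* n: set-index bits per slice (2048 sets of 64-byte lines) */
#define T 6    /* t = ceil(log2(s)) for up to 64 slices */

/* Take n+t physical-address bits above the 6-bit line offset, multiply by the
   slice count s, and split the product: the top bits give the slice (a value
   in [0,s)), and bits n+t-1..t give the set. */
static void slice_and_set(uint64_t pa, unsigned s, unsigned *slice, unsigned *set)
{
    uint64_t bits = (pa >> 6) & ((1ull << (N + T)) - 1);   /* n+t input bits */
    uint64_t x = bits * s;                                  /* (n+t) x t -> n+2t bit product */
    *slice = (unsigned)(x >> (N + T));
    *set   = (unsigned)((x >> T) & ((1u << N) - 1));
}

For a power-of-two s this degenerates to plain bit selection; the multiply only earns its keep when the slice count is not a power of two.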

Re: L3 cache allocation on Intel

<48a5f4f1-56b0-4e37-a857-116028450878n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34614&group=comp.arch#34614
X-Received: by 2002:a05:6214:5c06:b0:66d:552:ec4e with SMTP id ly6-20020a0562145c0600b0066d0552ec4emr116805qvb.13.1697921166288;
Sat, 21 Oct 2023 13:46:06 -0700 (PDT)
X-Received: by 2002:a05:6870:9128:b0:1e9:9cdf:e8be with SMTP id
o40-20020a056870912800b001e99cdfe8bemr2315637oae.4.1697921166017; Sat, 21 Oct
2023 13:46:06 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 21 Oct 2023 13:46:05 -0700 (PDT)
In-Reply-To: <uh0efg$1o4ut$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a996:d846:f964:b416;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a996:d846:f964:b416
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <SCVXM.40044$MJ59.16148@fx10.iad>
<bTVXM.87760$8fO.38095@fx15.iad> <2023Oct19.211807@mips.complang.tuwien.ac.at>
<3hAYM.29016$Ssze.11022@fx48.iad> <ff7eae0a-7955-4b11-b857-0393ac340c24n@googlegroups.com>
<uh0efg$1o4ut$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <48a5f4f1-56b0-4e37-a857-116028450878n@googlegroups.com>
Subject: Re: L3 cache allocation on Intel
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sat, 21 Oct 2023 20:46:06 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 23
 by: MitchAlsup - Sat, 21 Oct 2023 20:46 UTC

On Saturday, October 21, 2023 at 6:59:49 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > It seems to me that all the L3 chunks are on a ring bus.
> > And that getting to RingBus[mod-t] requires a trip around the ring.
> > <
> > So, why not just snoop that L3 of the ring node you are on while
> > passing from where you are to where the data is.
> > <
> > That is, no special hashing needed, get data if you run into it,
> > use snoop statistics to decide where to put it if no snoops succeed.
> > <
> > So, why are we using a hash again ??
> >
> Each node/L3 slice needs a way to determine if it should cache a
> particular line?
<
L3 closest to the originator should end up with the line.
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: L3 cache allocation on Intel

<f4207a09-11a0-46cd-aeeb-896b95a3d34fn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34615&group=comp.arch#34615
X-Received: by 2002:a05:6214:14e9:b0:66d:12cd:31b3 with SMTP id k9-20020a05621414e900b0066d12cd31b3mr101780qvw.6.1697921531041;
Sat, 21 Oct 2023 13:52:11 -0700 (PDT)
X-Received: by 2002:a05:6870:ac2b:b0:1dc:dbb0:60aa with SMTP id
kw43-20020a056870ac2b00b001dcdbb060aamr2545322oab.6.1697921530847; Sat, 21
Oct 2023 13:52:10 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 21 Oct 2023 13:52:10 -0700 (PDT)
In-Reply-To: <FFSYM.143096$2fS.42419@fx16.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a996:d846:f964:b416;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a996:d846:f964:b416
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <SCVXM.40044$MJ59.16148@fx10.iad>
<bTVXM.87760$8fO.38095@fx15.iad> <2023Oct19.211807@mips.complang.tuwien.ac.at>
<uHyYM.84784$tnmf.71840@fx09.iad> <2023Oct21.165243@mips.complang.tuwien.ac.at>
<FFSYM.143096$2fS.42419@fx16.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f4207a09-11a0-46cd-aeeb-896b95a3d34fn@googlegroups.com>
Subject: Re: L3 cache allocation on Intel
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sat, 21 Oct 2023 20:52:11 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sat, 21 Oct 2023 20:52 UTC

On Saturday, October 21, 2023 at 11:06:02 AM UTC-5, Scott Lurndal wrote:
> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> >EricP <ThatWould...@thevillage.com> writes:
> >>Systematic reverse engineering of cache slice selection
> >>in Intel processors, 2015
> >>https://eprint.iacr.org/2015/690.pdf
> >
> >This paper shows in Figure 2 that the set is determined without going
> >through the hash function. It also shows that the set is determined
> >from the lowest bits beyond the 6 bits used for addressing the byte
> >within the cache line. And, as mentioned above, the has value is
> >determined by higher bits. So if there is an aligned block of 128KB
> >consecutive physical addresses (say, in a huge page), it will be
> >allocated in the same slice. Is this good, bad, or don't care? My
> >feeling is that I would prefer to distribute consecutive cache lines
> >across slices.
> It may be useful to consider the access pattern to the huge
> page - will accesses originate from a single thread or from multiple threads?
>
> In the former case, keeping them in the same slice seems reasonable.
>
> In the latter case, distributing them amongst slices may provide
> some additional concurrency.
<
Most huge page users (>1GB) are databases, where they use the huge pages
to minimize TLB misses. But these applications have hundreds to thousands
of threads all using the same (index set) data. So, one would expect many
TLBs to contain the huge page translations, and those pages would
be accessed fairly often but randomly across any LLC structuring.
>
> Note that huge pages are usually used to reduce TLB pressure, not
> for caching purposes. The access patterns for a huge page may
> not differ significantly from those to a similarly sized region
> of 4k/16k/64k pages.
>
> I wonder if the replacement algorithms anticipate access
> pattern(s) over time, i.e. prefetch ahead when the accesses
> have been sequential?

Re: L3 cache allocation on Intel

<uh2sse$2cb3b$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34617&group=comp.arch#34617
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: L3 cache allocation on Intel
Date: Sun, 22 Oct 2023 12:17:49 +0200
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <uh2sse$2cb3b$1@dont-email.me>
References: <2023Oct18.190923@mips.complang.tuwien.ac.at>
<SCVXM.40044$MJ59.16148@fx10.iad> <bTVXM.87760$8fO.38095@fx15.iad>
<2023Oct19.211807@mips.complang.tuwien.ac.at>
<3hAYM.29016$Ssze.11022@fx48.iad>
<ff7eae0a-7955-4b11-b857-0393ac340c24n@googlegroups.com>
<uh0efg$1o4ut$1@dont-email.me>
<48a5f4f1-56b0-4e37-a857-116028450878n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 22 Oct 2023 10:17:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6a4c4ca07dcd82a3c960861364338399";
logging-data="2501739"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19i5dKFPUpQ8789mPlMHTQyd2GyRtQWMqEijnVxYFE7tg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17.1
Cancel-Lock: sha1:ait87IKF2RK28VA7OM4MV0tNzDU=
In-Reply-To: <48a5f4f1-56b0-4e37-a857-116028450878n@googlegroups.com>
 by: Terje Mathisen - Sun, 22 Oct 2023 10:17 UTC

MitchAlsup wrote:
> On Saturday, October 21, 2023 at 6:59:49 AM UTC-5, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> It seems to me that all the L3 chunks are on a ring bus.
>>> And that getting to RingBus[mod-t] requires a trip around the ring.
>>> <
>>> So, why not just snoop that L3 of the ring node you are on while
>>> passing from where you are to where the data is.
>>> <
>>> That is, no special hashing needed, get data if you run into it,
>>> use snoop statistics to decide where to put it if no snoops succeed.
>>> <
>>> So, why are we using a hash again ??
>>>
>> Each node/L3 slice needs a way to determine if it should cache a
>> particular line?
> <
> L3 closest to the originator should end up with the line.

I agree that this seems like the logical way to do it, but it does
require some kind of dictionary/broadcast process to allow other
cores/slices to discover that this line is already cached somewhere
else, right?

OTOH, this is already the case for all other cache levels on a
multi-core CPU, so nothing new needs to be invented, except that the
latency might be too high, so we would want a faster (more
energy-efficient?) way for each core/slice to determine when to cache
something?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: L3 cache allocation on Intel

<2023Oct22.164140@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=34620&group=comp.arch#34620
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: L3 cache allocation on Intel
Date: Sun, 22 Oct 2023 14:41:40 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 25
Message-ID: <2023Oct22.164140@mips.complang.tuwien.ac.at>
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <SCVXM.40044$MJ59.16148@fx10.iad> <bTVXM.87760$8fO.38095@fx15.iad> <2023Oct19.211807@mips.complang.tuwien.ac.at> <3hAYM.29016$Ssze.11022@fx48.iad> <ff7eae0a-7955-4b11-b857-0393ac340c24n@googlegroups.com> <uh0efg$1o4ut$1@dont-email.me> <48a5f4f1-56b0-4e37-a857-116028450878n@googlegroups.com> <uh2sse$2cb3b$1@dont-email.me>
Injection-Info: dont-email.me; posting-host="6852e89802ad9f5c3e113157c5512885";
logging-data="2623969"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18WHo0qaK6Y0lXlKgX917D9"
Cancel-Lock: sha1:XvvAwzwnXZhwX6PUQyzhLnwErGg=
X-newsreader: xrn 10.11
 by: Anton Ertl - Sun, 22 Oct 2023 14:41 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>MitchAlsup wrote:
>> L3 closest to the originator should end up with the line.
>
>I agree that this seems like the logical way to do it, but it does
>require some kind of dictionary/broadcast process to allow other
>cores/slices to discover that this line is already cached somewhere
>else, right?

Not just that. It means that a process running on one core can cache
only in the slice of that core (i.e., in 3MB on Raptor Lake), and not
make use of the other slices.

AMD does it that way with its core complexes (CCXs): e.g., a Ryzen
1800X has two CCXs (on one die), each with 4 cores and 8MB of L3. So
despite it having 16MB of L3, measurements of pointer-chasing latency
show a step increase in latency after 8MB. See Column 18 of
<http://www.complang.tuwien.ac.at/anton/bplat/Results>; Column 19
(Ryzen 3900X with 64MB L3, but split into 4 CCXs with 16MB L3 each)
also shows this behaviour.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
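A rough sketch of the kind of pointer-chasing latency measurement referred to above: chase a randomly permuted circular list of a given working-set size and report nanoseconds per dependent load. Anton's actual benchmark (bplat) differs in its details; the sizes, iteration count, and timing method here are arbitrary choices. On a CCX-partitioned part, the figures would be expected to step up once the working set exceeds one CCX's share of the L3.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void * volatile sink;     /* keeps the chase loop from being optimized away */

static double chase_ns(size_t bytes, size_t iters)
{
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {                 /* random permutation: defeat prefetchers */
        size_t j = (size_t)rand() % (i + 1), t = order[i];
        order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)                       /* one circular pointer chain */
        buf[order[i]] = &buf[order[(i + 1) % n]];

    void **p = &buf[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) p = (void **)*p;  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;
    free(order); free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

int main(void)
{
    for (size_t kb = 1024; kb <= 64 * 1024; kb *= 2)     /* 1MB .. 64MB working sets */
        printf("%6zu KB: %.1f ns/load\n", kb, chase_ns(kb * 1024, 20u * 1000 * 1000));
    return 0;
}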

Re: L3 cache allocation on Intel

<2023Oct22.170835@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=34621&group=comp.arch#34621
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: L3 cache allocation on Intel
Date: Sun, 22 Oct 2023 15:08:35 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 52
Message-ID: <2023Oct22.170835@mips.complang.tuwien.ac.at>
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <vqWXM.40046$MJ59.2768@fx10.iad> <2023Oct19.212155@mips.complang.tuwien.ac.at> <ugt4bf$t978$1@dont-email.me> <2023Oct20.103348@mips.complang.tuwien.ac.at> <914e69a3-d66c-4ddd-a136-d54a6742bc3en@googlegroups.com> <2023Oct20.171424@mips.complang.tuwien.ac.at> <c7029734-e3b7-4134-b2b3-d12ace41a2fan@googlegroups.com>
Injection-Info: dont-email.me; posting-host="6852e89802ad9f5c3e113157c5512885";
logging-data="2623969"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Sk2/MCosqAKb99OwWlSBt"
Cancel-Lock: sha1:2elvJPNJT7j2mXzKgDfua21IOqo=
X-newsreader: xrn 10.11
 by: Anton Ertl - Sun, 22 Oct 2023 15:08 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Friday, October 20, 2023 at 6:35:26 PM UTC+3, Anton Ertl wrote:
>> Michael S <already...@yahoo.com> writes:
>> >On Friday, October 20, 2023 at 12:09:36 PM UTC+3, Anton Ertl wrote:
>> >> Preliminaries: We have n sets in an
>> >> L3 slice and s slices, with t=ceil(ld(s)).
>> >>
>> >> To compute the slice and set, n+t bits from the physical address.
>> >> Multiply this number with s, giving x. The target slice then is
>> >> x/2^(n+t) (which is in the range [0,n)). The target set is the bits
>> >> n+t-1..t of x.
>> >>
>> >> This requires an (n+t)*t-bit multiplier with an n+2t-bit result, and
>> >> for, e.g., Sapphire Rapids, t=6 and n=11 (assuming that each way has
>> >> 128KB; with smaller ways and more associativity, n would be smaller; n
>> >> cannot be larger, because 128KB is the largest power-of-2 that divides
>> >> the size of an L3 slice). This multiplier size (17*6->23 bits) seems
>> >> downright cheap compared to the integer (64*64->128) or FP
>> >> multipliers, and, as you point out, also cheap compared to the rest of
>> >> the L3 access (both in latency and in energy).
....
>> >When you say "physical address" do you mean PA[16:6] a.k.a. Line Index in
>> >classic Nehalem-to-SkylakeClient Intel LLC? Or something else?
>> I wrote "n+t bits from the physical address", which would be 17 bits
>> in the example above. So yes, maybe PA[22:6].
>>
>
>I don't understand how PA[22:17] could be of help.

The idea here is that the output of the "hash function" contains n
bits for the set and (almost) t bits for the slice number. Then you
need at least n+t bits as input in order to produce all possible
output patterns. If you don't produce all possible output patterns,
only a part of the L3 cache will be utilized.

>More so, I don't understand how they could be non-harmful.

Why should they be harmful? As we found out, Intel uses PA[37:20] on
a number of processors.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
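A small numeric check of the counting argument above, using the n=11, t=6 example (so s=64 slices and 2048 sets per slice); purely illustrative.

#include <stdio.h>

int main(void)
{
    unsigned n = 11, t = 6;
    unsigned long long pairs_needed = 1ull << (n + t);   /* s * 2^n with s = 2^t slices */
    for (unsigned k = n; k <= n + t; k++)                 /* k = number of address bits fed in */
        printf("%2u input bits -> at most %6llu of %llu (slice,set) pairs reachable\n",
               k, 1ull << k, pairs_needed);
    return 0;
}

With only the 11 bits PA[16:6] as input, at most 2048 of the 131072 (slice,set) pairs can be produced, i.e. only 1/64 of the L3 would ever be used; feeding in PA[22:17] as well closes that gap.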

Re: L3 cache allocation on Intel

<2023Oct22.171830@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=34622&group=comp.arch#34622
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: L3 cache allocation on Intel
Date: Sun, 22 Oct 2023 15:18:30 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 51
Message-ID: <2023Oct22.171830@mips.complang.tuwien.ac.at>
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <SCVXM.40044$MJ59.16148@fx10.iad> <bTVXM.87760$8fO.38095@fx15.iad> <2023Oct19.211807@mips.complang.tuwien.ac.at> <uHyYM.84784$tnmf.71840@fx09.iad> <2023Oct21.165243@mips.complang.tuwien.ac.at> <FFSYM.143096$2fS.42419@fx16.iad>
Injection-Info: dont-email.me; posting-host="6852e89802ad9f5c3e113157c5512885";
logging-data="2623969"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/wYSOEn5q2/zUE66CMQN1V"
Cancel-Lock: sha1:AHeGm782dpaSN5fVJ+WSAIihjVs=
X-newsreader: xrn 10.11
 by: Anton Ertl - Sun, 22 Oct 2023 15:18 UTC

scott@slp53.sl.home (Scott Lurndal) writes:
>anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>Systematic reverse engineering of cache slice selection
>>>in Intel processors, 2015
>>>https://eprint.iacr.org/2015/690.pdf
>>
>>This paper shows in Figure 2 that the set is determined without going
>>through the hash function. It also shows that the set is determined
>>from the lowest bits beyond the 6 bits used for addressing the byte
>>within the cache line. And, as mentioned above, the hash value is
>>determined by higher bits. So if there is an aligned block of 128KB
>>consecutive physical addresses (say, in a huge page), it will be
>>allocated in the same slice. Is this good, bad, or don't care? My
>>feeling is that I would prefer to distribute consecutive cache lines
>>across slices.
>
>It may be useful to consider the access pattern to the huge
>page - will accesses originate from a single thread or from multiple threads?

Often from a single thread.

>In the former case, keeping them in the same slice seems reasonable.

Why? My feeling is that by distributing the accesses of a thread
walking through memory across different slices, there would be better
load distribution. But given that Intel is keeping them clustered,
maybe there is some batching mechanism for the L3 slice accesses that
benefits from the clustering.

>Note that huge pages are usually used to reduce TLB pressure, not
>for caching purposes. The access patterns for a huge page may
>not differ significantly from those to a similarly sized region
>of 4k/16k/64k pages.

I expect that in the physical addresses, they differ significantly,
because, unless the OS does something special, there is no reason for
virtually adjacent pages to be physically adjacent.

>I wonder if the replacement algorithms anticipate access
>pattern(s) over time, i.e. prefetch ahead when the accesses
>have been sequential?

Intel and AMD (and, I expect, others) have had hardware prefetchers
for many years. They have not been discussed much in recent years, so I
guess that they are just working ok.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: L3 cache allocation on Intel

<%RcZM.5334$ntoa.1935@fx03.iad>

https://news.novabbs.org/devel/article-flat.php?id=34623&group=comp.arch#34623
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx03.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: L3 cache allocation on Intel
Newsgroups: comp.arch
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <SCVXM.40044$MJ59.16148@fx10.iad> <bTVXM.87760$8fO.38095@fx15.iad> <2023Oct19.211807@mips.complang.tuwien.ac.at> <uHyYM.84784$tnmf.71840@fx09.iad> <2023Oct21.165243@mips.complang.tuwien.ac.at> <FFSYM.143096$2fS.42419@fx16.iad> <2023Oct22.171830@mips.complang.tuwien.ac.at>
Lines: 12
Message-ID: <%RcZM.5334$ntoa.1935@fx03.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sun, 22 Oct 2023 17:20:59 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sun, 22 Oct 2023 17:20:59 GMT
X-Received-Bytes: 1438
 by: Scott Lurndal - Sun, 22 Oct 2023 17:20 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>scott@slp53.sl.home (Scott Lurndal) writes:

>>I wonder if the replacement algorithms anticipate access
>>pattern(s) over time, i.e. prefetch ahead when the accesses
>>have been sequential?
>
>Intel and AMD (and, I expect, others) have had hardware prefetchers
>for many years. It has not been discussed much in recent years, so I
>guess that they are just working ok.

Yes, but those prefetchers work out of the per-core L2, not the LLC.

Re: L3 cache allocation on Intel

<uh3mr2$2ifrn$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34624&group=comp.arch#34624
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: L3 cache allocation on Intel
Date: Sun, 22 Oct 2023 13:40:49 -0400
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <uh3mr2$2ifrn$1@dont-email.me>
References: <2023Oct18.190923@mips.complang.tuwien.ac.at>
<SCVXM.40044$MJ59.16148@fx10.iad> <bTVXM.87760$8fO.38095@fx15.iad>
<2023Oct19.211807@mips.complang.tuwien.ac.at>
<3hAYM.29016$Ssze.11022@fx48.iad>
<ff7eae0a-7955-4b11-b857-0393ac340c24n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 22 Oct 2023 17:40:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e74fbc9b6d0a351c6c76d62289a3f169";
logging-data="2703223"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18YWxbKtZ77msXqrpnx+QaHOaTM0P5/tQw="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:iRYXvvGOSIoNuX71bZZuSpeBQ+U=
In-Reply-To: <ff7eae0a-7955-4b11-b857-0393ac340c24n@googlegroups.com>
 by: Paul A. Clayton - Sun, 22 Oct 2023 17:40 UTC

On 10/20/23 3:45 PM, MitchAlsup wrote:
> It seems to me that all the L3 chunks are on a ring bus.
> And that getting to RingBus[mod-t] requires a trip around the ring.
> <
> So, why not just snoop that L3 of the ring node you are on while
> passing from where you are to where the data is.
> <
> That is, no special hashing needed, get data if you run into it,
> use snoop statistics to decide where to put it if no snoops succeed.
> <
> So, why are we using a hash again ??

It was my understanding that the hash-based distribution
was intended to avoid snooping and support home-based
coherence (perhaps in part because (later) Itanium
implementations were directory/home-based).

This obviously provides more flexibility in network
topology (Intel moved from a single ring quite some time
ago, IIRC, and I think now uses a mesh/grid topology).

While bandwidth is presumably the main constraint for
snooping (and a ring topology wastes less bandwidth
for all-to-all communication than a grid), I would
guess that the cache probing is not entirely free, so
energy costs might also bias the design choices.

Biasing placement to be closer to a "memory-side"
cache (where data sourced from adjacent memory
controllers is cached) might also have some energy
benefits, with a miss having fewer network hops to
reach the memory controller and possibly having
higher bandwidth.

Biasing placement to reduce travel time and energy
to the most likely next user would make sense and
biasing might also be able to help snoop filters
work more effectively.

The tradeoffs seem complex, interacting with workload,
capacity, network topology, memory latency, memory
bandwidth, and other factors (e.g., consistency model?).

One could also bias associativity such that some
snoops would have to check fewer ways. With overlaid
skewed associativity, one could have lower associativity
yet still allow utilization of the entire capacity. One
might also be able to bias physical placement of tags
to reduce average energy use and/or average latency for
expected/common use patterns. (With cuckoo/elbow caching,
limiting some addresses to fewer ways may not have that
much effect on miss rate.)
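A minimal sketch of the skewed-associative indexing idea mentioned above: each way indexes the storage with a different hash of the line address, so an address that is restricted to a subset of the ways can still land in many different sets. Sizes and hash functions here are arbitrary illustrations, not a real design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 2048
#define WAYS 4

struct line { bool valid; uint64_t tag; };
static struct line cache[WAYS][SETS];

/* Per-way index: multiplicative hash of the line address, different bit window per way. */
static unsigned way_index(unsigned way, uint64_t pa)
{
    uint64_t x = (pa >> 6) * 0x9E3779B97F4A7C15ull;   /* arbitrary mixing constant */
    return (unsigned)((x >> (13 + 7 * way)) % SETS);
}

/* Because the set differs per way, the stored tag must be the full line address. */
static bool lookup(uint64_t pa)
{
    uint64_t tag = pa >> 6;
    for (unsigned w = 0; w < WAYS; w++) {
        struct line *l = &cache[w][way_index(w, pa)];
        if (l->valid && l->tag == tag)
            return true;
    }
    return false;
}

int main(void)
{
    uint64_t pa = 0x12345680;
    printf("way0 set=%u way1 set=%u hit=%d\n", way_index(0, pa), way_index(1, pa), lookup(pa));
    return 0;
}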

Re: L3 cache allocation on Intel

<2aa88737-15cb-4a16-9297-5e94020976a6n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34625&group=comp.arch#34625
X-Received: by 2002:a05:620a:2205:b0:774:22d7:7690 with SMTP id m5-20020a05620a220500b0077422d77690mr125540qkh.1.1698005512590;
Sun, 22 Oct 2023 13:11:52 -0700 (PDT)
X-Received: by 2002:a05:6871:54b:b0:1e9:f600:53d with SMTP id
t11-20020a056871054b00b001e9f600053dmr3554952oal.10.1698005512312; Sun, 22
Oct 2023 13:11:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!border-2.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 22 Oct 2023 13:11:52 -0700 (PDT)
In-Reply-To: <2023Oct22.170835@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:f1e2:235f:260d:e2d7;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:f1e2:235f:260d:e2d7
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <vqWXM.40046$MJ59.2768@fx10.iad>
<2023Oct19.212155@mips.complang.tuwien.ac.at> <ugt4bf$t978$1@dont-email.me>
<2023Oct20.103348@mips.complang.tuwien.ac.at> <914e69a3-d66c-4ddd-a136-d54a6742bc3en@googlegroups.com>
<2023Oct20.171424@mips.complang.tuwien.ac.at> <c7029734-e3b7-4134-b2b3-d12ace41a2fan@googlegroups.com>
<2023Oct22.170835@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2aa88737-15cb-4a16-9297-5e94020976a6n@googlegroups.com>
Subject: Re: L3 cache allocation on Intel
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Sun, 22 Oct 2023 20:11:52 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 71
 by: Michael S - Sun, 22 Oct 2023 20:11 UTC

On Sunday, October 22, 2023 at 6:17:32 PM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >On Friday, October 20, 2023 at 6:35:26 PM UTC+3, Anton Ertl wrote:
> >> Michael S <already...@yahoo.com> writes:
> >> >On Friday, October 20, 2023 at 12:09:36 PM UTC+3, Anton Ertl wrote:
> >> >> Preliminaries: We have n sets in an
> >> >> L3 slice and s slices, with t=ceil(ld(s)).
> >> >>
> >> >> To compute the slice and set, n+t bits from the physical address.
> >> >> Multiply this number with s, giving x. The target slice then is
> >> >> x/2^(n+t) (which is in the range [0,n)). The target set is the bits
> >> >> n+t-1..t of x.
> >> >>
> >> >> This requires an (n+t)*t-bit multiplier with an n+2t-bit result, and
> >> >> for, e.g., Sapphire Rapids, t=6 and n=11 (assuming that each way has
> >> >> 128KB; with smaller ways and more associativity, n would be smaller; n
> >> >> cannot be larger, because 128KB is the largest power-of-2 that divides
> >> >> the size of an L3 slice). This multiplier size (17*6->23 bits) seems
> >> >> downright cheap compared to the integer (64*64->128) or FP
> >> >> multipliers, and, as you point out, also cheap compared to the rest of
> >> >> the L3 access (both in latency and in energy).
> ...
> >> >When you say "physical address" do you mean PA[16:6] a.k.a. Line Index in
> >> >classic Nehalem-to-SkylakeClient Intel LLC? Or something else?
> >> I wrote "n+t bits from the physical address", which would be 17 bits
> >> in the example above. So yes, maybe PA[22:6].
> >>
> >
> >I don't understand how PA[22:17] could be of help.
> The idea here is that the output of the "hash function" contains n
> bits for the set and (almost) t bits for the slice number. Then you
> need at least n+t bits as input in order to produce all possible
> output patterns. If you don't produce all possible output patterns,
> only a part of the L3 cache will be utilized.
> >More so, I don't understand how they could be non-harmful.
> Why should they be harmful? As we found out, Intel uses PA[37:20] on
> a number of processors.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

After re-reading David Kanter's Sandy Bridge article
(https://www.realworldtech.com/sandy-bridge/8/) I realized that
the "classic" Intel cache works very differently from what I imagined.
Somehow I was thinking that there were 2048 sets with 64 ways
in each set. Now I am starting to understand that there are only
16 ways. I still don't get it completely.
So please ignore everything that I said in a couple of previous posts.
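For concreteness, the geometry behind that realization, assuming a Sandy Bridge-style 2MB, 16-way LLC slice with 64-byte lines (slice sizes vary by model):

#include <stdio.h>

int main(void)
{
    unsigned slice_bytes = 2u << 20;   /* 2 MB per LLC slice (assumed) */
    unsigned line_bytes  = 64;
    unsigned ways        = 16;
    unsigned sets = slice_bytes / (line_bytes * ways);
    printf("sets per slice = %u\n", sets);   /* 2048 sets of 16 ways, not 64 ways */
    return 0;
}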

Re: L3 cache allocation on Intel

<ZWvZM.164528$w4ec.29268@fx14.iad>

https://news.novabbs.org/devel/article-flat.php?id=34627&group=comp.arch#34627
Path: i2pn2.org!i2pn.org!paganini.bofh.team!2.eu.feeder.erje.net!feeder.erje.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx14.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: L3 cache allocation on Intel
References: <2023Oct18.190923@mips.complang.tuwien.ac.at> <SCVXM.40044$MJ59.16148@fx10.iad> <bTVXM.87760$8fO.38095@fx15.iad> <2023Oct19.211807@mips.complang.tuwien.ac.at> <3hAYM.29016$Ssze.11022@fx48.iad> <ff7eae0a-7955-4b11-b857-0393ac340c24n@googlegroups.com>
In-Reply-To: <ff7eae0a-7955-4b11-b857-0393ac340c24n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 45
Message-ID: <ZWvZM.164528$w4ec.29268@fx14.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 23 Oct 2023 15:03:21 UTC
Date: Mon, 23 Oct 2023 11:02:54 -0400
X-Received-Bytes: 2871
 by: EricP - Mon, 23 Oct 2023 15:02 UTC

MitchAlsup wrote:
> It seems to me that all the L3 chunks are on a ring bus.
> And that getting to RingBus[mod-t] requires a trip around the ring.
> <
> So, why not just snoop that L3 of the ring node you are on while
> passing from where you are to where the data is.
> <
> That is, no special hashing needed, get data if you run into it,
> use snoop statistics to decide where to put it if no snoops succeed.
> <
> So, why are we using a hash again ??

Just guessing... because it is the easiest way to evenly distribute a
variable number and size of shared L3 caches to a variable amount of DRAM.
It allows the logical topology to adapt to different physical topologies.

Each tile has a core, private L1 and L2 cache, and shared L3.
The size of L3 is model dependent and not necessarily a round binary power.
(Also, prior L3 was inclusive of L1,L2.
Newer L3 is non-inclusive, an L2 victim cache only.)

The number of tiles per chip is model dependent
and not necessarily a round binary power.

The chip tiles are in a 2D grid connected by
vertical and horizontal ring buses.

There are two Memory Controllers each with 3 channels.
There is a variable amount of DRAM connected to each channel.
One MC is physically located on the left chip side, one on the right side.

So it needs to take that #tiles * L3 slice size of cache and distribute it across
the amount of DRAM discovered at boot, in such a manner that contention
on the ring bus grid is minimized. And do so for a range of applications
with various vectors, matrices, strides and/or sparse accesses.

A strict physical range assignment of DRAM to tile could bottleneck on
particular tiles for certain applications. And there would still be the
problem of how to evenly allocate the L3 lines to cover the assigned
physical DRAM range.

A hash fits the bill. The difficulty is coming up with a hash function
that does this cheaply.
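A sketch of the XOR-fold style of slice hash that the reverse-engineering paper cited earlier in the thread reports for power-of-two slice counts: each slice-number bit is the parity of the physical address ANDed with a per-bit mask. The masks below are placeholders, not the published coefficients, and a non-power-of-two slice count needs a more involved function (which is where something like the multiplicative scheme discussed earlier comes in).

#include <stdint.h>
#include <stdio.h>

static unsigned parity64(uint64_t v)
{
    v ^= v >> 32; v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return (unsigned)(v & 1);
}

/* Three output bits -> 8 slices; each bit is the parity of pa under one mask. */
static unsigned slice_of(uint64_t pa)
{
    static const uint64_t mask[3] = {
        0x0f0f0f0f000ull, 0x0ff00ff0000ull, 0x3c3c3c3c000ull   /* placeholder masks */
    };
    unsigned s = 0;
    for (int b = 0; b < 3; b++)
        s |= parity64(pa & mask[b]) << b;
    return s;
}

int main(void)
{
    for (int i = 1; i <= 4; i++) {
        uint64_t pa = 0x100000ull * (uint64_t)i;            /* a few sample addresses */
        printf("pa=%#llx -> slice %u\n", (unsigned long long)pa, slice_of(pa));
    }
    return 0;
}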
