Rocksolid Light - comp.arch - Re: Load/Store with auto-increment

Re: Load/Store with auto-increment

<u3jeba$167ok$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32173&group=comp.arch#32173

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Load/Store with auto-increment
Date: Thu, 11 May 2023 14:06:16 -0500
Organization: A noiseless patient Spider
Lines: 374
Message-ID: <u3jeba$167ok$1@dont-email.me>
References: <u35prk$2ssbq$1@dont-email.me>
<u36fd2$121nc$1@newsreader4.netcologne.de>
<2023May9.111344@mips.complang.tuwien.ac.at>
<UQt6M.233407$qpNc.65909@fx03.iad> <_Qu6M.539024$Olad.404121@fx35.iad>
<Y7y6M.233411$qpNc.12100@fx03.iad> <rxy6M.2612724$iS99.592632@fx16.iad>
<u3f19t$i8qu$1@dont-email.me> <zEM6M.411846$ZhSc.226930@fx38.iad>
<u3ghsa$o1p0$1@dont-email.me> <z%Q6M.2840684$9sn9.469502@fx17.iad>
<u3hmu3$101jg$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 11 May 2023 19:06:19 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7adb6d8692f54f2bba25bfdce2224f90";
logging-data="1253140"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+1xv5LTC4J4wOJ/MvjC0gC"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.10.1
Cancel-Lock: sha1:22CjcO4gvKpGvVurSlpuE2Qkueo=
Content-Language: en-US
In-Reply-To: <u3hmu3$101jg$1@dont-email.me>

by: BGB - Thu, 11 May 2023 19:06 UTC

On 5/10/2023 10:20 PM, BGB wrote:
> On 5/10/2023 1:03 PM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 5/10/2023 8:05 AM, Scott Lurndal wrote:
>>>> BGB <cr88192@gmail.com> writes:
>>
>>>>> Probably a reason to have it in A64 is because it already existed and
>>>>> the cores would still need to run 32-bit ARM, so they would already
>>>>> need to have the required hardware.
>>>>
>>>> Many A64 chips don't support A32 or T32 at all.
>>>>
>>>
>>> OK.
>>>
>>> I had thought they had done it x86 style, with most of the 64-bit chips
>>> still including the older 32-bit version even if it wasn't being used.
>>
>> There are some which have support for A32/T32 at EL0. I've not seen
>> any that have A32 support for any higher exception level. (EL0 is
>> usermode, EL1 is kernel, EL2 is Hypervisor and EL3 is machine (i.e.
>> SMM on Intel)).
>>
>> There's really no reason to support AArch32 (A32/T32 encodings) even
>> at EL0 anymore.
>>
>
> OK.
>
>>>
>>>
>>>>>
>>>>> As I understand it, ARM wasn't really designed for speed, rather to be
>>>>> cheap and low-power (while still being powerful enough to be "actually
>>>>> useful"). Initially they wanted it to stay under 1W so that they could
>>>>> use a plastic chip package, etc.
>>>>
>>>> I've seen AArch64 chips running at over 3GHz, that's pretty speedy.
>>>>
>>>
>>> From what I have seen, performance on ARM is a bit fragile,
>>
>> Where have you seen that? Specifically for ARMv8 and up?
>>
>
> Trying to build and run code on cellphones, typically in Termux.
> Both my old and current phone have Cortex-A53 based CPUs.
>
> Had also experimentally run my BJX2 emulator on a cellphone, via an XFCE
> (in Termux) + X11 Server app.
>
>
> The Raspberry Pi 3 also had similar behavior (though this is much less
> true on the Raspberry Pi 4).
>

The RasPi 4 typically being a little more "consistent" in terms of its
performance.

>
>
>>> Though, this was generally with code running on in-order chips.
>>
>> There are three architectural profiles for ARMv8 (and ARMv7):
>>
>>    - A (Application - this is the most general and most powerful)
>>    - M (Microcontroller - reduced area, reduced power, reduced performance)
>>    - R (Hard Realtime optimized - most power efficient)
>>
>> The A profile is used by the Graviton, ThunderX/OcteonTx2, and Apple
>> chips designed for high performance general purpose computing.
>>
>
> They are 'A' chips, but the A53 and A55 are in-order.
>
> The A73 and A75 were OoO, but are seemingly limited mostly to higher-end
> phones...
>

No Apple phones here; not exactly made of money...

>
>>
>>>
>>>
>>> At least in GCC, best performance usually seemed to be to try to follow
>>> "conventional" idioms (more like how code was usually written on older
>>> machines), with the "somewhat unrolled modulo-scheduled loops with a
>>> crapton of variables style" (which works well on BJX2 and OK on x86-64)
>>> seemingly being much less performant on ARM devices (despite A64 having
>>> more registers than x86-64).
>>
>> Anecdotal evidence isn't very compelling. Examples.
>>
>
> Stuff like the Doom or Quake engines is crazy fast.
>
> Stuff like my software OpenGL rasterizer, not so much.
>
>
> The situation is reversed on BJX2, where my software OpenGL rasterizer
> holds up better, but the Doom and Quake engines are comparably much slower.
>
> Similarly, relative to clock-speed, the Ryzen in my PC runs circles
> around both a RasPi and a "Motorola Moto E7".
>
>
>
> Not many great examples, but a general coding style...
>
> u64 TKGDI_EncodeCellUTX2(u16 *ics, int sbxs)
> {
>     byte clry[16];
>     byte *pclry;
>     u16 clrm, clrn, pix;
>     u64 pxb;
>     int x, y, cy, mcy, ncy, acy, acy0, acy1, cybi;
>     int cy0, cy1, cy2, cy3, rcp, pxc;
>     int i, j, k;
>
>     mcy=256; ncy=-1;
>     for(y=0; y<4; y++)
>     {
>         pxb=*(u64 *)ics;
>         pclry=clry+(y<<2);
>
>         acy0=pxb>>3;    acy1=pxb>>8;
>         acy0&=127;    acy1&=127;
>         pix=pxb>>16;    cy0=acy0+acy1;
>         acy0=pix>>3;    acy1=pix>>8;
>         acy0&=127;    acy1&=127;
>         pix=pxb>>32;    cy1=acy0+acy1;
>         acy0=pix>>3;    acy1=pix>>8;
>         acy0&=127;    acy1&=127;
>         pix=pxb>>48;    cy2=acy0+acy1;
>         acy0=pix>>3;    acy1=pix>>8;
>         acy0&=127;    acy1&=127;
>         cy3=acy0+acy1;
>         pclry[0]=cy0;    pclry[1]=cy1;
>         pclry[2]=cy2;    pclry[3]=cy3;
>         if(cy0<mcy)
>             { clrm=pxb; mcy=cy0; }
>         if(cy0>ncy)
>             { clrn=pxb; ncy=cy0; }
>         if(cy1<mcy)
>             { clrm=pxb>>16; mcy=cy1; }
>         else if(cy1>ncy)
>             { clrn=pxb>>16; ncy=cy1; }
>         if(cy2<mcy)
>             { clrm=pxb>>32; mcy=cy2; }
>         else if(cy2>ncy)
>             { clrn=pxb>>32; ncy=cy2; }
>         if(cy3<mcy)
>             { clrm=pxb>>48; mcy=cy3; }
>         else if(cy3>ncy)
>             { clrn=pxb>>48; ncy=cy3; }
>         ics+=sbxs;
>     }
>
>     acy=(mcy+ncy)>>1;
>
>     pxc=0;
>     rcp=tkgdi_enc2b_rcptab[ncy-mcy];
>     pclry=clry;
>     for(i=0; i<4; i++)
>     {
>         cy0=pclry[0];    cy1=pclry[1];
>         cy2=pclry[2];    cy3=pclry[3];
>         pclry+=4;
>         cy0-=acy;    cy1-=acy;
>         cy2-=acy;    cy3-=acy;
>         cy0*=rcp;    cy1*=rcp;
>         cy2*=rcp;    cy3*=rcp;
>         cy0=(cy0>>8)+2;        cy1=(cy1>>8)+2;
>         cy2=(cy2>>8)+2;        cy3=(cy3>>8)+2;
>         pxc=(pxc<<2)|cy0;    pxc=(pxc<<2)|cy1;
>         pxc=(pxc<<2)|cy2;    pxc=(pxc<<2)|cy3;
>     }
>
>     clrm&=0x7FFF;
>     clrn&=0x7FFF;
>
>     pxb=(((u64)pxc)<<32)|(clrm<<16)|clrn;
>     return(pxb);
> }
>
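
For contrast, here is a minimal sketch of the two styles being compared, reduced to a toy reduction loop. The function names and the luma-like sum are hypothetical and only reuse the shift/mask pattern from the function quoted above; this is not code from the rasterizer. The first version is the "conventional" one-element-per-iteration idiom, the second is the unrolled form with several independent variables in flight.

```c
#include <stdint.h>
#include <stddef.h>

/* Conventional idiom: one accumulator, one pixel per iteration. */
uint32_t sum_luma_simple(const uint16_t *px, size_t n)
{
    uint32_t acc;
    size_t i;

    acc=0;
    for(i=0; i<n; i++)
        acc+=((px[i]>>3)&127)+((px[i]>>8)&127);
    return(acc);
}

/* Unrolled style: four independent dependency chains per iteration,
 * so the work of one element can overlap with the others. */
uint32_t sum_luma_unrolled(const uint16_t *px, size_t n)
{
    uint32_t a0, a1, a2, a3;
    uint32_t p0, p1, p2, p3;
    size_t i;

    a0=0; a1=0; a2=0; a3=0;
    for(i=0; (i+4)<=n; i+=4)
    {
        p0=px[i+0];    p1=px[i+1];
        p2=px[i+2];    p3=px[i+3];
        a0+=((p0>>3)&127)+((p0>>8)&127);
        a1+=((p1>>3)&127)+((p1>>8)&127);
        a2+=((p2>>3)&127)+((p2>>8)&127);
        a3+=((p3>>3)&127)+((p3>>8)&127);
    }

    /* Tail: handle the last 0..3 pixels the conventional way. */
    for(; i<n; i++)
        a0+=((px[i]>>3)&127)+((px[i]>>8)&127);

    return(a0+a1+a2+a3);
}
```

Both return the same result; which one is faster depends on the compiler and the core, so it would need to be measured per target.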

Another example of a type of code in the style that seems to have
"relative under-performance":

void TKRA_DrawSpan_LmapModBlUtx2MortZt(u64 *parm,
    tkra_rastpixel *dstc, tkra_zbufpixel *dstz, int cnt)
{
    tkra_rastpixel *ct, *cte, *src;
    tkra_zbufpixel *ctz;
    u64 tpos, tstep;
    u64 cpos, cstep;
    u64 zpos, zstep;
    u64 cval, dval, amod, anmod;
    u32 xmask, ymask;
    s32 z;
    int ix0, ix1, ix2, ix3;
    u64 pix0, pix1, pix2, pix3;
    int pix, dpix, clr, idx;

    tpos=parm[TKRA_DS_TPOS];
    tstep=parm[TKRA_DS_TSTEP];

    cpos=parm[TKRA_DS_CPOS];
    cstep=parm[TKRA_DS_CSTEP];

    zpos=parm[TKRA_DS_ZPOS];
    zstep=parm[TKRA_DS_ZSTEP];

    src=(tkra_rastpixel *)(parm[TKRA_DS_TEXBCN]);
    xmask=parm[TKRA_DS_XMASK];
    ymask=parm[TKRA_DS_YMASK];

    ct=dstc; cte=ct+cnt;
    ctz=dstz;
    while(ct<cte)
    {
        z=zpos>>16;
        if(z>(*ctz))
        {
            ctz++;
            ct++;
            tpos+=tstep;
            cpos+=cstep;
            zpos+=zstep;
            continue;
        }

        ix0=tkra_morton16((tpos>>16)+0, (tpos>>48)+0)&ymask;
        ix1=tkra_morton16((tpos>>16)+1, (tpos>>48)+0)&ymask;
        ix2=tkra_morton16((tpos>>16)+0, (tpos>>48)+1)&ymask;
        ix3=tkra_morton16((tpos>>16)+1, (tpos>>48)+1)&ymask;
        pix0=TKRA_CachedBlkUtx2(src, ix0);
        pix1=TKRA_CachedBlkUtx2(src, ix1);
        pix2=TKRA_CachedBlkUtx2(src, ix2);
        pix3=TKRA_CachedBlkUtx2(src, ix3);
        cval=TKRA_InterpBilinear64(pix0, pix1, pix2, pix3,
            (u16)tpos, (u16)(tpos>>32));

[...]
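
The tkra_morton16 helper called in the span loop is not shown in this post. As a rough sketch only, assuming it is an ordinary Morton (Z-order) interleave of the low 16 bits of two coordinates, something like the following would produce that kind of index. The names spread_bits16 and morton16_sketch are hypothetical, and the real helper may well be implemented differently (for example with lookup tables).

```c
#include <stdint.h>

/* Spread the low 16 bits of v so that bit i ends up at bit 2*i. */
static uint32_t spread_bits16(uint32_t v)
{
    v&=0x0000FFFF;
    v=(v|(v<<8))&0x00FF00FF;
    v=(v|(v<<4))&0x0F0F0F0F;
    v=(v|(v<<2))&0x33333333;
    v=(v|(v<<1))&0x55555555;
    return(v);
}

/* Interleave x into the even bits and y into the odd bits.
 * Only the low 16 bits of each argument are used. */
uint32_t morton16_sketch(uint32_t x, uint32_t y)
{
    return(spread_bits16(x)|(spread_bits16(y)<<1));
}
```

With this layout, texels that are close together in both x and y also tend to be close together in memory, which is the usual reason for Morton-ordering a texture.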
