Misc: Dealing with variability

<u7uvlo$3oi68$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33000&group=comp.arch#33000

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Misc: Dealing with variability
Date: Mon, 3 Jul 2023 12:14:30 -0500
Organization: A noiseless patient Spider
Lines: 149
Message-ID: <u7uvlo$3oi68$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 3 Jul 2023 17:14:33 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="28ee29a3bff11ac6afe8dd5f994c541a";
logging-data="3950792"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+TO24hnhi2vTzeW0L7YRav"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:3PDNSe7Cv1exTFXgN8JNPjcuRVI=
Content-Language: en-US
 by: BGB - Mon, 3 Jul 2023 17:14 UTC

One annoyance with my project is that I can't run a core that can run
exactly the same code across the range of FPGAs I want to use:
  XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
  XC7S50: Currently running a 2-wide configuration with 32 GPRs
    Reduced features, e.g. no 128-bit ALU ops in this case, ...
  XC7A100T: Can run 3-wide with 64 GPRs, no issue.
    Can also run 128-bit ALU ops, ...

Generally, one needs to rebuild the code for each configuration.
Code built for small configurations will perform worse on bigger
configurations. Code built for wide configurations will not run on small
configurations.

There is a lot of "minor stuff" that may break when going between 32 and
64 GPRs.
Running programs built for 64 GPRs on a kernel built in 32 GPR mode will
tend to explode, ...

I also don't necessarily want to have separate builds of all the
static libraries for each configuration, ...

Though, this is partly inescapable as I have not come up with ideal ways
to gloss over "sizeof(void *)" and "intptr" and some other issues (well,
though there is an "__intptr" type whose main property in this case is
to be the same width as "__sizeof(void *)" and similar).

This part is in an "almost works" category, but code may act buggy or
break in weird ways if linked against a library compiled for a different
width (mostly because type-consistency in a lot of the existing codebase
is a bit hit or miss, and the "gold standard" of using #ifdef for
everything doesn't entirely work in this case).

Though, my wonky compiler architecture has allowed a partial workaround
for a limited set of cases (BGBCC specific):
  .ifarch feature_expr                        //ASM
  __declspec(ifarch(feature_expr))            //C declarations
  __ifarch(feature_expr) { ... }              //C code

Where, conventional #ifdef only works in the preprocessor stage, which
can't deal with variability in the final configuration without a full
recompile.
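
For contrast, a minimal sketch in standard C (no BGBCC extensions) of the difference between a choice fixed by the preprocessor and one deferred until a feature record is known. The names ArchFeatures, HYPOTHETICAL_HAS_XGPR and both functions are made up for illustration. A plain "if" on a struct is only an approximation of __ifarch, which per the description above is resolved by the compiler when the final binary is built rather than at run time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature record; stands in for the target configuration. */
typedef struct {
    bool has_xgpr;   /* 64 GPRs available */
    bool has_xmov;   /* XMOV extension available */
} ArchFeatures;

/* Preprocessor version: the answer is baked in when this file is compiled,
   so a library built this way only fits one configuration. */
int gpr_count_ifdef(void)
{
#ifdef HYPOTHETICAL_HAS_XGPR
    return 64;
#else
    return 32;
#endif
}

/* Deferred version: the same object code serves both configurations,
   because the decision waits for the feature record. */
int gpr_count_deferred(const ArchFeatures *f)
{
    assert(f);
    return f->has_xgpr ? 64 : 32;
}
```

The deferred form costs a test at run time in plain C; the point of doing it in the compiler's intermediate format is that the test can still be folded away once the target is known.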

For example, since my existing task scheduler code would explode
with 32 GPRs or without the XMOV (96-bit VAs) extension, I ended up
needing to add some hackery:
  ...
  taskern=(TKPE_TaskInfoKern *)task->krnlptr;
  memcpy(
    taskern->ctx_regsave,
    __arch_isrsave,
    __ARCH_SIZEOF_REGSAVE__);
  __ifarch(!has_xgpr) //deal with 32 GPRs
  {
    taskern->ctx_regsave[TKPE_REGSAVE_GBR] =
      taskern->ctx_regsave[TKPE_REGSAVE_GBR_LO];
    taskern->ctx_regsave[TKPE_REGSAVE_LR] =
      taskern->ctx_regsave[TKPE_REGSAVE_LR_LO];
    taskern->ctx_regsave[TKPE_REGSAVE_SPC] =
      taskern->ctx_regsave[TKPE_REGSAVE_SPC_LO];
    taskern->ctx_regsave[TKPE_REGSAVE_EXSR] =
      taskern->ctx_regsave[TKPE_REGSAVE_EXSR_LO];
  }
  ...
  __ifarch(has_xmov) //only do this if we have XMOV
  {
    taskern->ctx_regsave[TKPE_REGSAVE_PCH]=__arch_pch;
    taskern->ctx_regsave[TKPE_REGSAVE_GBH]=__arch_gbh;
  }
  ...

Where, it mostly deals with the 32 GPR config by copying some of the
registers around so that they are in the correct places for the other
code (with most of the kernel assuming 64 GPRs, but needing to wrangle
it around when saving and restoring task contexts).

While __ARCH_SIZEOF_REGSAVE__ looks like a normal preprocessor constant,
it doesn't actually expand to a constant, but instead an "architecture
defined variable" (which turns into a constant later on).

Similarly, the "__arch_reg" variables allow exposing the underlying CPU
control registers as-if they were C variables (though, the exact
mechanism differs slightly in this case).

Where, as noted, BGBCC doesn't handle static libraries by compiling
directly to machine code, but instead first compiles to a stack-oriented
bytecode format (RIL, along vaguely similar lines to JVM or .NET
bytecode), which is then translated to 3-address-code and then machine
code while compiling the final binary.

So, in this case, if the "ifarch" remains flexible in the RIL code, then
it can be resolved once the RIL is translated to 3AC for the final
compilation (this is typically also when things like "struct layout" and
"sizeof(whatever)" are fully sorted out).

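As a rough illustration of resolving things late, here is a toy translator from a stack-oriented code to three-address code in which an "architecture variable" only becomes a constant at translation time. This is not RIL: the opcodes, the StackOp struct and to_3ac() are all invented for the sketch, and the real format is not described here in enough detail to reproduce.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { OP_CONST, OP_ARCHVAR, OP_ADD, OP_MUL };   /* toy stack opcodes */

typedef struct { int op; int arg; } StackOp;

/* Translate toy stack code into three-address code text in out[].
   archvars[] holds the target's values; it is consulted only here,
   not when the stack code was produced. Returns the number of
   temporaries (one per emitted instruction). */
int to_3ac(const StackOp *code, int n, const int *archvars,
           char *out, size_t outsz)
{
    char stk[16][16];          /* operand names: literals or temporaries */
    int sp = 0, ntemp = 0;
    size_t used = 0;

    assert(outsz > 0);
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        switch (code[i].op) {
        case OP_CONST:
            assert(sp < 16);
            snprintf(stk[sp++], sizeof stk[0], "%d", code[i].arg);
            break;
        case OP_ARCHVAR:       /* becomes a plain constant right here */
            assert(sp < 16);
            snprintf(stk[sp++], sizeof stk[0], "%d", archvars[code[i].arg]);
            break;
        case OP_ADD:
        case OP_MUL: {
            char line[64];
            int len;
            assert(sp >= 2);
            len = snprintf(line, sizeof line, "t%d = %s %c %s\n", ntemp,
                           stk[sp - 2],
                           code[i].op == OP_ADD ? '+' : '*',
                           stk[sp - 1]);
            assert(len > 0 && used + (size_t)len < outsz);
            memcpy(out + used, line, (size_t)len + 1);
            used += (size_t)len;
            sp -= 2;           /* pop both operands, push the result */
            snprintf(stk[sp++], sizeof stk[0], "t%d", ntemp++);
            break;
        }
        }
    }
    return ntemp;
}
```

Feeding the same three ops (ARCHVAR 0, CONST 8, MUL) through this with archvars = {32} and then {64} gives different 3AC from one unchanged input, which is the property the library format needs.
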
RIL is kinda messy in terms of its design, and the whole image basically
needs to be decoded in a linear fashion (it is not a conventional
structured format, rather the entire image, including all the metadata,
is expressed as a long linear stream of bytecode ops), but "basically
works"... Most efforts to work on replacements have tended to stall though.

For global declarations, it may cause the "global object" to be
suppressed, where the compiler pretends as-if it did not exist.

For code-blocks, it is more mundane:
It is compiled mostly like a normal "if" block, except that the "if()"
expression is resolved to a constant based on architectural feature
flags, which may allow the resulting basic-block to be either included
in or omitted from the binary (though this may not be sufficient to keep
the compiler from complaining about missing dependencies if the variables
or functions used only exist along certain branches of the "ifarch" tree).

These are mostly using normal C expression syntax, with some
restrictions (one can't access normal runtime variables or call
functions or similar; only use things that will be able to evaluate to a
constant at compile time).

But, the final binaries still end up being specific to a particular
configuration, which is still kind of annoying.

And, getting everything in place so that I can use an Arty S7-50 to
drive a small robot around is taking a lot more effort than ideal (mostly
because this has led to a *lot* of these types of issues, as a lot of my
code had implicitly drifted to assumptions of a "bigger" target, and I
need to scale things back some and get things working again).

Well, and in addition I still need to write the robot control code and
other stuff, still with some uncertainty about the whole "camera module"
thing (maybe I should just get some IR or ultrasonic distance sensors?...).

Granted, "easy path" in this case would be "just use a RasPi", but alas.
My recent tinkering with neural nets was also partly related to this
sub-case, as I had intended to try to use NNs for processing camera
data, but this still seems like it is "pushing it".

...

Any thoughts?...

Re: Misc: Dealing with variability

<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=33001&group=comp.arch#33001

X-Received: by 2002:a05:620a:398a:b0:767:1deb:a2c5 with SMTP id ro10-20020a05620a398a00b007671deba2c5mr69291qkn.5.1688421329639;
Mon, 03 Jul 2023 14:55:29 -0700 (PDT)
X-Received: by 2002:a05:6a00:c94:b0:682:140c:2459 with SMTP id
a20-20020a056a000c9400b00682140c2459mr15479232pfv.0.1688421329420; Mon, 03
Jul 2023 14:55:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 3 Jul 2023 14:55:28 -0700 (PDT)
In-Reply-To: <u7uvlo$3oi68$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:d84d:4434:2143:2a9d;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:d84d:4434:2143:2a9d
References: <u7uvlo$3oi68$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
Subject: Re: Misc: Dealing with variability
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Mon, 03 Jul 2023 21:55:29 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 10647
 by: MitchAlsup - Mon, 3 Jul 2023 21:55 UTC

On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
> One annoyance with my project is that I can't run a core that can run
> exactly the same code along the range of FPGA's I want to use:
> XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
> XC7S50: Currently running a 2-wide configuration with 32 GPRs
> Reduced features, eg: no 128-bit ALU ops in this case, ...
> XC7A100T: Can run 3-wide with 64 GPRs, no issue.
> Can also run 128-bit ALU ops, ...
>
>
>
> Generally, one needs to rebuild the code for each configuration.
> Code built for small configurations will perform worse on bigger
> configurations. Code built for wide configurations will not run on small
> configurations.
<
This is exactly why my architecture allows for the HW to perform the narrow
to wide transformations. There is 1 ISA model for everything from 1-wide
in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
is within spitting distance of running optimally on the GBOoO.
<
Then the compiler can be targeted at just the ISA and be model ignorant.
>
> There is a lot of "minor stuff" that may break when going between 32 and
> 64 GPRs.
<
So, don't do that......choose one or the other
<
> Running programs built for 64 GPRs on a kernel built in 32 GPR mode will
> tend to explode, ...
<
So, don't do that......choose one or the other
>
> I also don't necessarily want to have separate builds of of all the
> static libraries for each configuration, ...
>
<
So, don't do that......choose one
>
> Though, this is partly inescapable as I have not come up with ideal ways
> to gloss over "sizeof(void *)" and "intptr" and some other issues (well,
> though there is an "__intptr" type whose main property in this case is
> to be the same width as "__sizeof(void *)" and similar).
>
> This part is in an "almost works" category, but code may act buggy or
> break in weird ways if linked against a library compiled for a different
> width (mostly because type-consistency in a lot of the existing codebase
> is a bit hit or miss, and the "gold standard" of using #ifdef for
> everything doesn't entirely work in this case).
>
>
> Though, my wonky compiler architecture has allowed a partial workaround
> for a limited set of cases (BGBCC specific):
> .ifarch feature_expr //ASM
> __declspec(ifarch(feature_expr)) //C declarations
> __ifarch(feature_expr) { ... } //C code
>
> Where, conventional #ifdef only works in the preprocessor stage, which
> can't deal with variability in the final configuration without a full
> recompile.
>
>
> For example, noting as my existing task scheduler code would explode
> with a 32 GPRs or without the XMOV (96-bit VA's) extension, I ended up
> needing to add some hackery:
> ...
> taskern=(TKPE_TaskInfoKern *)task->krnlptr;
> memcpy(
> taskern->ctx_regsave,
> __arch_isrsave,
> __ARCH_SIZEOF_REGSAVE__);
> __ifarch(!has_xgpr) //deal with 32 GPRs
> {
> taskern->ctx_regsave[TKPE_REGSAVE_GBR] =
> taskern->ctx_regsave[TKPE_REGSAVE_GBR_LO];
> taskern->ctx_regsave[TKPE_REGSAVE_LR] =
> taskern->ctx_regsave[TKPE_REGSAVE_LR_LO];
> taskern->ctx_regsave[TKPE_REGSAVE_SPC] =
> taskern->ctx_regsave[TKPE_REGSAVE_SPC_LO];
> taskern->ctx_regsave[TKPE_REGSAVE_EXSR] =
> taskern->ctx_regsave[TKPE_REGSAVE_EXSR_LO];
> }
> ...
> __ifarch(has_xmov) //only do this if we have XMOV
> {
> taskern->ctx_regsave[TKPE_REGSAVE_PCH]=__arch_pch;
> taskern->ctx_regsave[TKPE_REGSAVE_GBH]=__arch_gbh;
> }
> ...
<
It is stuff like this that caused my architecture to take a detour.
<
I don't have instructions that save/restore registers or details
of the state of the thread or core. These pieces of information
are associated with a memory address, and when it comes time
to save them, the HW sequencer knows where they go and puts them
there; at the same time, the HW sequencer knows where the next set
is coming from and can be loading the new set while pushing
out the old set.
<
When control arrives after a "switch", the code is already using
the register file of the new context--including its stack pointer.
Thus the code does not have to do anything other than the job
that needs doing, based on the reason for the switch to this new
set of state.
<
The register file itself operates like a write-back cache of 4-lines.
The Core state operates like a write-back cache of 1-cache line.
HW sequencing is smart enough to begin the fetch of core state
and register file BEFORE the switch is initiated, thus the core
continues processing the old state until the new state is ready to
take over.
<
All of the state inside a core can be accessed via memory mapped
control registers--including the registers. So, if a core stops talking
to the rest of the system, one can go in and look at what the core
was doing at the instant it stopped--remotely.
<
All other control registers {L2 cache, L3 cache, PCIe hostBridge,
devices, timers,...} are all memory mapped.
<
There is no CPUID-like instruction, all the information of each
core is accessible via configuration space on the capabilities
list of what smells like a PCIe header.
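
For readers who have not walked one, a sketch of how a PCI-style capabilities list is traversed, run here over a plain byte array standing in for configuration space. The layout used (capabilities pointer at offset 0x34, each entry holding an ID byte followed by a next-pointer byte, zero terminating the chain) is the standard PCI one; whether the architecture described above uses exactly this layout is an assumption, since the post only says it "smells like" a PCIe header. find_capability() is a name made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define CAP_PTR_OFFSET 0x34   /* capabilities pointer in a PCI type-0 header */

/* Return the offset of capability want_id in a 256-byte configuration
   image, or 0 if it is not on the list. */
int find_capability(const uint8_t cfg[256], uint8_t want_id)
{
    int off;

    assert(cfg);
    off = cfg[CAP_PTR_OFFSET] & 0xFC;        /* low two bits are reserved */
    /* The guard bounds the walk in case a corrupt list loops back. */
    for (int guard = 0; off != 0 && guard < 48; guard++) {
        if (cfg[off] == want_id)
            return off;                      /* byte 0 of an entry: ID */
        off = cfg[off + 1] & 0xFC;           /* byte 1: next pointer */
    }
    return 0;
}
```

Software that wants to know what a core supports would look up the relevant capability ID this way and then read the fields that follow the two-byte entry header, instead of executing a CPUID-style instruction.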
>
> Where, it mostly deals with the 32 GPR config by copying some of the
> registers around so that they are in the correct places for the other
> code (with most of the kernel assuming 64 GPRs, but needing to wrangle
> it around when saving and restoring task contexts).
>
> While __ARCH_SIZEOF_REGSAVE__ looks like a normal preprocessor constant,
> it doesn't actually expand to a constant, but instead an "architecture
> defined variable" (which turns into a constant later on).
>
> Similarly, the "__arch_reg" variables allow exposing the underlying CPU
> control registers as-if they were C variables (though, the exact
> mechanism differs slightly in this case).
>
>
>
> Where, as noted, BGBCC doesn't handle static libraries by compiling
> directly to machine code, but instead first compiles to a stack-oriented
> bytecode format (RIL, along vaguely similar lines to JVM or .NET
> bytecode), which is then translated to 3-address-code and then machine
> code while compiling the final binary.
>
> So, in this case, if the "ifarch" remains flexible in the RIL code, then
> it can resolved once the RIL is translated to 3AC for the final
> compilation (this is typically also when things like "struct layout" and
> "sizeof(whatever)" is fully sorted out).
>
>
> RIL is kinda messy in terms of its design, and the whole image basically
> needs to be decoded in a linear fashion (it is not a conventional
> structured format, rather the entire image, including all the metadata,
> is expressed as a long linear stream of bytecode ops), but "basically
> works"... Most efforts to work on replacements have tended to stall though.
>
>
> For global declarations, it may cause the "global object" to be
> suppressed, where the compiler pretends as-if it did not exist.
>
> For code-blocks, it is more mundane:
> It is compiled mostly like a normal "if" block, except that the "if()"
> expression is resolved to a constant based on architectural feature
> flags and (may) allow the resulting basic-block to be either included or
> omitted from the binary (though, may not be sufficient to avoid the
> compiler from complaining about missing dependencies if the variables or
> functions used only exist along certain branches of the "ifarch" tree).
>
> These are mostly using normal C expression syntax, with some
> restrictions (one can't access normal runtime variables or call
> functions or similar; only use things that will be able to evaluate to a
> constant at compile time).
>
>
>
> But, the final binaries still end up being specific to a particular
> configuration, which is still kind of annoying.
<
So, fix it. Nobody is standing over you preventing you from fixing it.
>
> And, getting everything in place so that I can use an Arty S7-50 to
> drive a small robot around is being a lot more effort than ideal (mostly
> because this has led to a *lot* of these types of issues, as a lot of my
> code had implicitly drifted to assumptions of a "bigger" target, and I
> need to scale things back some and get things working again).
>
> Well, and in addition still need to write the robot control code and
> other stuff, still with some uncertainty about the whole "camera module"
> thing (maybe I should just get some IR or ultrasonic distance sensors?...).
>
> Granted, "easy path" in this case would be "just use a RasPi", but alas.
> My recent tinkering with neural nets was also partly related to this
> sub-case, as I had intended to try to use NNs for processing camera
> data, but this still seems like it is "pushing it".
>
> ...
>
>
> Any thoughts?...
<
Get an FPGA that has 10M gates and 500 useable pins.
But this has cost drawbacks.........


Re: Misc: Dealing with variability

<91f41f25-5f43-449a-8106-94492a63947bn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=33002&group=comp.arch#33002

X-Received: by 2002:a05:6214:b84:b0:635:e19a:6cc4 with SMTP id fe4-20020a0562140b8400b00635e19a6cc4mr68463qvb.2.1688427052442;
Mon, 03 Jul 2023 16:30:52 -0700 (PDT)
X-Received: by 2002:a17:902:8f96:b0:1a2:185a:cd6 with SMTP id
z22-20020a1709028f9600b001a2185a0cd6mr8856327plo.4.1688427051864; Mon, 03 Jul
2023 16:30:51 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 3 Jul 2023 16:30:51 -0700 (PDT)
In-Reply-To: <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1dde:6a00:f8d9:47db:d474:9c47;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1dde:6a00:f8d9:47db:d474:9c47
References: <u7uvlo$3oi68$1@dont-email.me> <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <91f41f25-5f43-449a-8106-94492a63947bn@googlegroups.com>
Subject: Re: Misc: Dealing with variability
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Mon, 03 Jul 2023 23:30:52 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: robf...@gmail.com - Mon, 3 Jul 2023 23:30 UTC

On Monday, July 3, 2023 at 5:55:31 PM UTC-4, MitchAlsup wrote:
> On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
> > One annoyance with my project is that I can't run a core that can run
> > exactly the same code along the range of FPGA's I want to use:
> > XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
> > XC7S50: Currently running a 2-wide configuration with 32 GPRs
> > Reduced features, eg: no 128-bit ALU ops in this case, ...
> > XC7A100T: Can run 3-wide with 64 GPRs, no issue.
> > Can also run 128-bit ALU ops, ...
> >
> >
> >
> > Generally, one needs to rebuild the code for each configuration.
> > Code built for small configurations will perform worse on bigger
> > configurations. Code built for wide configurations will not run on small
> > configurations.
> <
> This is exactly why my architecture allows for the HW to perform the narrow
> to wide transformations. There is 1 ISA model for everything from 1-wide
> in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
> is within spitting distance of running optimally on the GBOoO.
> <
> Then the compiler can be targeted at just the ISA and be model ignorant.
> >
> > There is a lot of "minor stuff" that may break when going between 32 and
> > 64 GPRs.
> <
> So, don't do that......choose one or the other
> <
> > Running programs built for 64 GPRs on a kernel built in 32 GPR mode will
> > tend to explode, ...
> <
> So, don't do that......choose one or the other
> >
> > I also don't necessarily want to have separate builds of of all the
> > static libraries for each configuration, ...
> >
> <
> So, don't do that......choose one
> >
> > Though, this is partly inescapable as I have not come up with ideal ways
> > to gloss over "sizeof(void *)" and "intptr" and some other issues (well,
> > though there is an "__intptr" type whose main property in this case is
> > to be the same width as "__sizeof(void *)" and similar).
> >
> > This part is in an "almost works" category, but code may act buggy or
> > break in weird ways if linked against a library compiled for a different
> > width (mostly because type-consistency in a lot of the existing codebase
> > is a bit hit or miss, and the "gold standard" of using #ifdef for
> > everything doesn't entirely work in this case).
> >
> >
> > Though, my wonky compiler architecture has allowed a partial workaround
> > for a limited set of cases (BGBCC specific):
> > .ifarch feature_expr //ASM
> > __declspec(ifarch(feature_expr)) //C declarations
> > __ifarch(feature_expr) { ... } //C code
> >
> > Where, conventional #ifdef only works in the preprocessor stage, which
> > can't deal with variability in the final configuration without a full
> > recompile.
> >
> >
> > For example, noting as my existing task scheduler code would explode
> > with a 32 GPRs or without the XMOV (96-bit VA's) extension, I ended up
> > needing to add some hackery:
> > ...
> > taskern=(TKPE_TaskInfoKern *)task->krnlptr;
> > memcpy(
> > taskern->ctx_regsave,
> > __arch_isrsave,
> > __ARCH_SIZEOF_REGSAVE__);
> > __ifarch(!has_xgpr) //deal with 32 GPRs
> > {
> > taskern->ctx_regsave[TKPE_REGSAVE_GBR] =
> > taskern->ctx_regsave[TKPE_REGSAVE_GBR_LO];
> > taskern->ctx_regsave[TKPE_REGSAVE_LR] =
> > taskern->ctx_regsave[TKPE_REGSAVE_LR_LO];
> > taskern->ctx_regsave[TKPE_REGSAVE_SPC] =
> > taskern->ctx_regsave[TKPE_REGSAVE_SPC_LO];
> > taskern->ctx_regsave[TKPE_REGSAVE_EXSR] =
> > taskern->ctx_regsave[TKPE_REGSAVE_EXSR_LO];
> > }
> > ...
> > __ifarch(has_xmov) //only do this if we have XMOV
> > {
> > taskern->ctx_regsave[TKPE_REGSAVE_PCH]=__arch_pch;
> > taskern->ctx_regsave[TKPE_REGSAVE_GBH]=__arch_gbh;
> > }
> > ...
> <
> It is stuff like this that caused my architecture to take a detour.
> <
> I don't have instructions save/restore registers or details
> of the state of the thread or core. These pieces of information
> are associated with a memory address, and when it comes time
> to save them, HW sequencer knows where they go and puts them
> there, at the same time HW sequencer knows where the next set
> are coming from and can be loading the new set while pushing
> out the old set.
> <
> When control arrives after a "switch" the code is already using
> the register file of the new context--including its stack pointer.
> Thus the code does not have to do anything other than the job
> that needed done based on the reason of the switch to this new
> set of state.
> <
> The register file itself operates like a write-back cache of 4-lines.
> The Core state operates like a write-back cache of 1-cache line.
> HW sequencing is smart enough to begin the fetch of core state
> and register file BEFORE the switch is initiated, thus the core
> continues processing the old state until the new state is ready to
> take over.
> <
> All of the state inside a core can be accessed via memory mapped
> control registers--including the registers. So, if a core stops talking
> to the rest of the system, one can go in and look at what the core
> was doing and the instant of where it stopped--remotely.
> <
> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
> devices, timers,...} are all memory mapped.
> <
> There is no CPUID-like instruction, all the information of each
> core is accessible via configuration space on the capabilities
> list of what smells like a PCIe header.
> >
> > Where, it mostly deals with the 32 GPR config by copying some of the
> > registers around so that they are in the correct places for the other
> > code (with most of the kernel assuming 64 GPRs, but needing to wrangle
> > it around when saving and restoring task contexts).
> >
> > While __ARCH_SIZEOF_REGSAVE__ looks like a normal preprocessor constant,
> > it doesn't actually expand to a constant, but instead an "architecture
> > defined variable" (which turns into a constant later on).
> >
> > Similarly, the "__arch_reg" variables allow exposing the underlying CPU
> > control registers as-if they were C variables (though, the exact
> > mechanism differs slightly in this case).
> >
> >
> >
> > Where, as noted, BGBCC doesn't handle static libraries by compiling
> > directly to machine code, but instead first compiles to a stack-oriented
> > bytecode format (RIL, along vaguely similar lines to JVM or .NET
> > bytecode), which is then translated to 3-address-code and then machine
> > code while compiling the final binary.
> >
> > So, in this case, if the "ifarch" remains flexible in the RIL code, then
> > it can resolved once the RIL is translated to 3AC for the final
> > compilation (this is typically also when things like "struct layout" and
> > "sizeof(whatever)" is fully sorted out).
> >
> >
> > RIL is kinda messy in terms of its design, and the whole image basically
> > needs to be decoded in a linear fashion (it is not a conventional
> > structured format, rather the entire image, including all the metadata,
> > is expressed as a long linear stream of bytecode ops), but "basically
> > works"... Most efforts to work on replacements have tended to stall though.
> >
> >
> > For global declarations, it may cause the "global object" to be
> > suppressed, where the compiler pretends as-if it did not exist.
> >
> > For code-blocks, it is more mundane:
> > It is compiled mostly like a normal "if" block, except that the "if()"
> > expression is resolved to a constant based on architectural feature
> > flags and (may) allow the resulting basic-block to be either included or
> > omitted from the binary (though, may not be sufficient to avoid the
> > compiler from complaining about missing dependencies if the variables or
> > functions used only exist along certain branches of the "ifarch" tree).
> >
> > These are mostly using normal C expression syntax, with some
> > restrictions (one can't access normal runtime variables or call
> > functions or similar; only use things that will be able to evaluate to a
> > constant at compile time).
> >
> >
> >
> > But, the final binaries still end up being specific to a particular
> > configuration, which is still kind of annoying.
> <
> So, fix it. Nobody is standing over you preventing you from fixing it.
> >
> > And, getting everything in place so that I can use an Arty S7-50 to
> > drive a small robot around is being a lot more effort than ideal (mostly
> > because this has led to a *lot* of these types of issues, as a lot of my
> > code had implicitly drifted to assumptions of a "bigger" target, and I
> > need to scale things back some and get things working again).
> >
> > Well, and in addition still need to write the robot control code and
> > other stuff, still with some uncertainty about the whole "camera module"
> > thing (maybe I should just get some IR or ultrasonic distance sensors?....).
> >
> > Granted, "easy path" in this case would be "just use a RasPi", but alas..
> > My recent tinkering with neural nets was also partly related to this
> > sub-case, as I had intended to try to use NNs for processing camera
> > data, but this still seems like it is "pushing it".
> >
> > ...
> >
> >
> > Any thoughts?...
> <
> Get an FPGA that has 10M gates and 500 useable pins.
> But this has cost drawbacks.........


Re: Misc: Dealing with variability

<u800bm$3vgok$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33003&group=comp.arch#33003

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Mon, 3 Jul 2023 21:32:20 -0500
Organization: A noiseless patient Spider
Lines: 509
Message-ID: <u800bm$3vgok$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 4 Jul 2023 02:32:23 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="160f43a359ceee858cdc0ca906f6f698";
logging-data="4178708"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+3nzwoKQ2gcy6cmyrBCAqY"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:6DK4rjUwklPRyRjZFztnsXezMu4=
In-Reply-To: <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 4 Jul 2023 02:32 UTC

On 7/3/2023 4:55 PM, MitchAlsup wrote:
> On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
>> One annoyance with my project is that I can't run a core that can run
>> exactly the same code along the range of FPGA's I want to use:
>> XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
>> XC7S50: Currently running a 2-wide configuration with 32 GPRs
>> Reduced features, eg: no 128-bit ALU ops in this case, ...
>> XC7A100T: Can run 3-wide with 64 GPRs, no issue.
>> Can also run 128-bit ALU ops, ...
>>
>>
>>
>> Generally, one needs to rebuild the code for each configuration.
>> Code built for small configurations will perform worse on bigger
>> configurations. Code built for wide configurations will not run on small
>> configurations.
> <
> This is exactly why my architecture allows for the HW to perform the narrow
> to wide transformations. There is 1 ISA model for everything from 1-wide
> in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
> is within spitting distance of running optimally on the GBOoO.
> <
> Then the compiler can be targeted at just the ISA and be model ignorant.

The issue is doing this effectively without causing LUT cost or timing
issues.

I have a possible way to execute 1-wide code as a 2-way superscalar, but
it would only really be practical on the larger profiles.

Basic option being to use a pattern matching table to flag ops as "valid
prefix" and "valid suffix", and whenever a match is found (and there are
no register conflicts) the pair executes as a 2-wide bundle.
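
As a rough illustration of that check (a sketch only, written in C rather than Verilog; the opcode classes, the contents of the flag table, and the DecodedOp layout are all made up for the example and are not the actual decoder):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcode classes; not the real encoding. */
enum { OPC_ALU, OPC_LOAD, OPC_STORE, OPC_BRANCH, OPC_COUNT };

#define PAIR_PREFIX 1u  /* may issue as the first op of a pair  */
#define PAIR_SUFFIX 2u  /* may issue as the second op of a pair */

/* Pattern table: which classes are allowed in which slot (assumed). */
static const uint8_t pair_flags[OPC_COUNT] = {
    [OPC_ALU]    = PAIR_PREFIX | PAIR_SUFFIX,
    [OPC_LOAD]   = PAIR_SUFFIX,
    [OPC_STORE]  = PAIR_SUFFIX,
    [OPC_BRANCH] = 0,
};

/* Simplified: every op is treated as having one destination. */
typedef struct {
    uint8_t opclass;   /* one of OPC_* */
    uint8_t rd;        /* destination register */
    uint8_t rs, rt;    /* source registers */
} DecodedOp;

/* True if ops a,b (in program order) can issue as one 2-wide bundle. */
static bool can_pair(const DecodedOp *a, const DecodedOp *b)
{
    if (!(pair_flags[a->opclass] & PAIR_PREFIX)) return false;
    if (!(pair_flags[b->opclass] & PAIR_SUFFIX)) return false;
    if (b->rs == a->rd || b->rt == a->rd) return false; /* b reads a's result */
    if (b->rd == a->rd) return false;                   /* both write same reg */
    return true;
}
```

The table lookup is cheap; the register comparisons are presumably where most of the LUT cost would go.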

However, this would still leave an incentive for "jumbo prefixes", which
are also more of an issue on 1-wide cores, since bundling can't fully
"fix" the issue of having a constant-load as a series of
serially-dependent instructions.

There is a way to make jumbo prefixes work:
  Jumbo prefixes operate on an internal state variable in the decoder;
  They are executed as multiple sequential instructions;
  Exception handling may not be directed to the middle of a multi-part
  instruction.
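
A minimal sketch of that scheme in C, assuming a made-up encoding (a top byte of 0xF4 marks the prefix, its low 24 bits are payload, and ordinary words carry a 9-bit immediate in their low bits); the struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: top byte 0xF4 marks a jumbo prefix. */
#define IS_JUMBO(w)  (((w) >> 24) == 0xF4u)

typedef struct {
    bool     have_prefix;  /* a prefix is pending */
    uint32_t prefix_bits;  /* its 24-bit payload */
} DecoderState;

typedef struct {
    bool     is_nop;   /* prefix word: nothing for the EX stages to do */
    uint64_t imm;      /* immediate as seen by the EX stages */
    bool     irq_ok;   /* false if this word is the tail of a multi-part op */
} DecodedWord;

/* Decode one 32-bit word, carrying prefix state from the previous word. */
static DecodedWord decode_word(DecoderState *st, uint32_t w)
{
    DecodedWord out = { false, 0, !st->have_prefix };
    if (IS_JUMBO(w)) {
        st->have_prefix = true;
        st->prefix_bits = w & 0x00FFFFFFu;
        out.is_nop = true;
        return out;
    }
    out.imm = w & 0x1FFu;                            /* base 9-bit immediate */
    if (st->have_prefix) {
        out.imm |= (uint64_t)st->prefix_bits << 9;   /* 24 + 9 = 33 bits */
        st->have_prefix = false;
    }
    return out;
}
```

The irq_ok flag models the last rule: an exception is only taken at a word where no prefix is pending, so the handler never returns into the middle of a prefixed instruction.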

But, then they still have a higher LUT cost.

Though, in terms of "one ISA that does everything" (without breaking
binary compatibility between the profiles), a more minimalist RISC-like
ISA might almost make more sense.

One could almost argue for RISC-V here, but I still have some of my own
annoyances with RISC-V, and I suspect it may still not be "optimal" in
terms of cost-balancing all of its features (or its use of encoding space).

But, as-is, I have ended up defining my currently used "mainline"
configuration as an "H profile", and the profile I am currently using on
the XC7S50 as an "I profile", and they are not binary compatible with
each other, which is kinda "rather annoying"...

So, say:
  H profile: 64 GPRs, WEX-3W
    48-bit address space
    Pass 16 arguments in registers, ...
  I profile: 32 GPRs, WEX-2W
    48-bit address space
    Pass 8 arguments in registers.

For a while, I had presumed that WEX-2W was dead so I could ignore it,
but sort of ended up needing to revive it because 3W eats too many LUTs,
and the 1-wide profiles suck too much (they lack MOV.X or SIMD ops;
which were important for the use case).

In theory, it could be possible to compose bundles which work on WEX-2W
but break on an implementation running WEX-3W; however my emulator's
lint tool will not allow these (and my compiler will not emit them), so
at least 2W code will run on a 3W core (but not the reverse).

But, like, presence or absence of SIMD would also break binary
compatibility, and I don't know of any "cheap" way to get SIMD like
performance gains while also minimizing LUT cost.

Though, the main "expensive" feature here being the 4-wide low-precision
SIMD unit (~ 1.5k or 2k LUT). There is a "slightly cheaper" option of
implementing SIMD by internal pipelining within the main FPU, but this
requires a 10-cycle latency for each SIMD op (vs 3L/1T with a
dedicated low-precision unit). Similarly, 10-cycle SIMD ops are slow
enough to preclude running neural nets on this.

But, like, I am not going to buy an Arty A7-100T to use on the robot,
this is like a $280 board...

Vs, say, $150 for the Arty S7-50 (well, and I already have one of these
boards, so wouldn't need to buy it).

The CMod-A7 and CMod-S7 are insufficient for what I want to do with it.

>>
>> There is a lot of "minor stuff" that may break when going between 32 and
>> 64 GPRs.
> <
> So, don't do that......choose one or the other
> <

It is a cost tradeoff...

32 is cheaper.

64: More expensive but faster.

On the XC7S25 and XC7S50, the number of registers does seem to affect
the resource cost.

I am not sure how much is actually due to the register arrays:
  LUT5 can do 3 bits per LUT
  LUT6 only does 2 bits per LUT
Or to the size of the register field and associated logic in the
instruction decoder.

I sort of suspect the latter, as using the larger (7-bit) ID space for
32 GPRs pays nearly the full cost of having 64 GPRs (vs using 6-bit IDs
for 32 GPRs).

Say:
  32 GPRs uses a 6-bit register ID
    00..1F: GPR Array
    20..2F: SPRs
    30..3F: CRs
  64 GPRs uses a 7-bit register ID
    00..3F: GPR Array
    40..5F: SPRs
    60..7F: CRs
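
The two ID maps above can be written out as a pair of small C functions (a sketch for illustration; the RegClass enum and function names are invented here, only the ranges come from the table):

```c
#include <stdint.h>

typedef enum { REG_GPR, REG_SPR, REG_CR } RegClass;

/* 6-bit ID space (32 GPR config): 00..1F GPR, 20..2F SPR, 30..3F CR */
static RegClass reg_class_6bit(uint8_t id)
{
    id &= 0x3F;
    if (id < 0x20) return REG_GPR;
    if (id < 0x30) return REG_SPR;
    return REG_CR;
}

/* 7-bit ID space (64 GPR config): 00..3F GPR, 40..5F SPR, 60..7F CR */
static RegClass reg_class_7bit(uint8_t id)
{
    id &= 0x7F;
    if (id < 0x40) return REG_GPR;
    if (id < 0x60) return REG_SPR;
    return REG_CR;
}
```

Note that the same internal ID lands in a different class depending on the configuration (0x20 is an SPR in the 6-bit map but a GPR in the 7-bit map), so the two maps are not interchangeable.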

Only C0..C15 exist in the 32 GPR config, whereas the 64 GPR case also
has C16..C31. The C32..C63 space is technically aliased to the SPR space.

The SPR space has a few registers, but is mostly filled up with
internal-use "pseudo registers".

As-is, technically the 64 GPR configuration is required to enable some
ISA features (for example, XMOV couldn't be supported without XGPR
because there is no ID space for XMOV's registers).

Granted, the cost difference isn't that large either way (mostly just an
issue when resource cost is cramped).

>> Running programs built for 64 GPRs on a kernel built in 32 GPR mode will
>> tend to explode, ...
> <
> So, don't do that......choose one or the other
>>
>> I also don't necessarily want to have separate builds of of all the
>> static libraries for each configuration, ...
>>
> <
> So, don't do that......choose one

I have sort of avoided this issue via "__ifarch" and similar...
But, within standard C, this would have been unavoidable.

Still does mean I need to do multiple builds of the final binaries.

Well, and also separate builds of the Boot ROM, as the ROM is not itself
entirely immune to the inter-profile binary compatibility issues
(though, more primarily related to features being enabled or disabled
for the "sanity tests", rather than the Boot ROM actually needing all
that many optional features).

You can't sanity test features that are absent.
Not sanity testing something isn't ideal either.

Could check feature flags in CPUID to enable/disable various sanity
tests, but this adds its own complexities (and then there is a non-zero
overlap between which features the ASM code will sanity test, and which
ones my C compiler will attempt to use if they are flagged as enabled).
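
For what it is worth, the CPUID-gated variant might look roughly like the following in C. This is only a sketch: the feature bit names and positions are hypothetical (the real CPUID layout is not given here), and the test bodies are placeholders.

```c
#include <stdint.h>

/* Hypothetical feature bits; not the real CPUID layout. */
#define FEAT_FPU   (1u << 0)
#define FEAT_XGPR  (1u << 1)
#define FEAT_ALUX  (1u << 2)

typedef int (*SanityFn)(void);   /* returns 0 on pass */

typedef struct {
    uint32_t needs;   /* features that must all be present */
    SanityFn run;
} SanityTest;

/* Placeholder test bodies; real ones would exercise the instructions. */
static int sanity_base(void) { return 0; }
static int sanity_alux(void) { return 0; }

/* Runs each test whose required features are all present.
   Returns the number of failures; *skipped counts tests not run. */
static int run_sanity(uint32_t cpu_feats, const SanityTest *t, int n,
                      int *skipped)
{
    int fails = 0;
    *skipped = 0;
    for (int i = 0; i < n; i++) {
        if ((t[i].needs & ~cpu_feats) != 0) { (*skipped)++; continue; }
        if (t[i].run() != 0) fails++;
    }
    return fails;
}
```

This only handles the runtime side; the test bodies themselves would still need to assemble, which is the part that runs into the issue of the assembler rejecting instructions for features that are not enabled in the build.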

Would otherwise need some feature to allow ASM code to temporarily
enable ISA features (for the assembler) as relevant and then have them
revert once the ASM code is done with them.

>>
>> Though, this is partly inescapable as I have not come up with ideal ways
>> to gloss over "sizeof(void *)" and "intptr" and some other issues (well,
>> though there is an "__intptr" type whose main property in this case is
>> to be the same width as "__sizeof(void *)" and similar).
>>
>> This part is in an "almost works" category, but code may act buggy or
>> break in weird ways if linked against a library compiled for a different
>> width (mostly because type-consistency in a lot of the existing codebase
>> is a bit hit or miss, and the "gold standard" of using #ifdef for
>> everything doesn't entirely work in this case).
>>
>>
>> Though, my wonky compiler architecture has allowed a partial workaround
>> for a limited set of cases (BGBCC specific):
>> .ifarch feature_expr //ASM
>> __declspec(ifarch(feature_expr)) //C declarations
>> __ifarch(feature_expr) { ... } //C code
>>
>> Where, conventional #ifdef only works in the preprocessor stage, which
>> can't deal with variability in the final configuration without a full
>> recompile.
>>
>>
>> For example, noting as my existing task scheduler code would explode
>> with a 32 GPRs or without the XMOV (96-bit VA's) extension, I ended up
>> needing to add some hackery:
>> ...
>> taskern=(TKPE_TaskInfoKern *)task->krnlptr;
>> memcpy(
>> taskern->ctx_regsave,
>> __arch_isrsave,
>> __ARCH_SIZEOF_REGSAVE__);
>> __ifarch(!has_xgpr) //deal with 32 GPRs
>> {
>> taskern->ctx_regsave[TKPE_REGSAVE_GBR] =
>> taskern->ctx_regsave[TKPE_REGSAVE_GBR_LO];
>> taskern->ctx_regsave[TKPE_REGSAVE_LR] =
>> taskern->ctx_regsave[TKPE_REGSAVE_LR_LO];
>> taskern->ctx_regsave[TKPE_REGSAVE_SPC] =
>> taskern->ctx_regsave[TKPE_REGSAVE_SPC_LO];
>> taskern->ctx_regsave[TKPE_REGSAVE_EXSR] =
>> taskern->ctx_regsave[TKPE_REGSAVE_EXSR_LO];
>> }
>> ...
>> __ifarch(has_xmov) //only do this if we have XMOV
>> {
>> taskern->ctx_regsave[TKPE_REGSAVE_PCH]=__arch_pch;
>> taskern->ctx_regsave[TKPE_REGSAVE_GBH]=__arch_gbh;
>> }
>> ...
> <
> It is stuff like this that caused my architecture to take a detour.
> <
> I don't have instructions save/restore registers or details
> of the state of the thread or core. These pieces of information
> are associated with a memory address, and when it comes time
> to save them, HW sequencer knows where they go and puts them
> there, at the same time HW sequencer knows where the next set
> are coming from and can be loading the new set while pushing
> out the old set.
> <
> When control arrives after a "switch" the code is already using
> the register file of the new context--including its stack pointer.
> Thus the code does not have to do anything other than the job
> that needed done based on the reason of the switch to this new
> set of state.
> <
> The register file itself operates like a write-back cache of 4-lines.
> The Core state operates like a write-back cache of 1-cache line.
> HW sequencing is smart enough to begin the fetch of core state
> and register file BEFORE the switch is initiated, thus the core
> continues processing the old state until the new state is ready to
> take over.
> <
> All of the state inside a core can be accessed via memory mapped
> control registers--including the registers. So, if a core stops talking
> to the rest of the system, one can go in and look at what the core
> was doing and the instant of where it stopped--remotely.
> <


Re: Misc: Dealing with variability

<u80537$3vvuo$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33004&group=comp.arch#33004

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Mon, 3 Jul 2023 22:53:08 -0500
Organization: A noiseless patient Spider
Lines: 351
Message-ID: <u80537$3vvuo$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<91f41f25-5f43-449a-8106-94492a63947bn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 4 Jul 2023 03:53:11 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="160f43a359ceee858cdc0ca906f6f698";
logging-data="4194264"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19wQ/NJo4CaK6Ip72TlZDTy"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:3G0VBFQD7rHj7X9zOE7mAQqVmb8=
Content-Language: en-US
In-Reply-To: <91f41f25-5f43-449a-8106-94492a63947bn@googlegroups.com>
 by: BGB - Tue, 4 Jul 2023 03:53 UTC

On 7/3/2023 6:30 PM, robf...@gmail.com wrote:
> On Monday, July 3, 2023 at 5:55:31 PM UTC-4, MitchAlsup wrote:
>> On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
>>> One annoyance with my project is that I can't run a core that can run
>>> exactly the same code along the range of FPGA's I want to use:
>>> XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
>>> XC7S50: Currently running a 2-wide configuration with 32 GPRs
>>> Reduced features, eg: no 128-bit ALU ops in this case, ...
>>> XC7A100T: Can run 3-wide with 64 GPRs, no issue.
>>> Can also run 128-bit ALU ops, ...
>>>
>>>
>>>
>>> Generally, one needs to rebuild the code for each configuration.
>>> Code built for small configurations will perform worse on bigger
>>> configurations. Code built for wide configurations will not run on small
>>> configurations.
>> <
>> This is exactly why my architecture allows for the HW to perform the narrow
>> to wide transformations. There is 1 ISA model for everything from 1-wide
>> in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
>> is within spitting distance of running optimally on the GBOoO.
>> <
>> Then the compiler can be targeted at just the ISA and be model ignorant.
>>>
>>> There is a lot of "minor stuff" that may break when going between 32 and
>>> 64 GPRs.
>> <
>> So, don't do that......choose one or the other
>> <
>>> Running programs built for 64 GPRs on a kernel built in 32 GPR mode will
>>> tend to explode, ...
>> <
>> So, don't do that......choose one or the other
>>>
>>> I also don't necessarily want to have separate builds of of all the
>>> static libraries for each configuration, ...
>>>
>> <
>> So, don't do that......choose one
>>>
>>> Though, this is partly inescapable as I have not come up with ideal ways
>>> to gloss over "sizeof(void *)" and "intptr" and some other issues (well,
>>> though there is an "__intptr" type whose main property in this case is
>>> to be the same width as "__sizeof(void *)" and similar).
>>>
>>> This part is in an "almost works" category, but code may act buggy or
>>> break in weird ways if linked against a library compiled for a different
>>> width (mostly because type-consistency in a lot of the existing codebase
>>> is a bit hit or miss, and the "gold standard" of using #ifdef for
>>> everything doesn't entirely work in this case).
>>>
>>>
>>> Though, my wonky compiler architecture has allowed a partial workaround
>>> for a limited set of cases (BGBCC specific):
>>> .ifarch feature_expr //ASM
>>> __declspec(ifarch(feature_expr)) //C declarations
>>> __ifarch(feature_expr) { ... } //C code
>>>
>>> Where, conventional #ifdef only works in the preprocessor stage, which
>>> can't deal with variability in the final configuration without a full
>>> recompile.
>>>
>>>
>>> For example, noting as my existing task scheduler code would explode
>>> with a 32 GPRs or without the XMOV (96-bit VA's) extension, I ended up
>>> needing to add some hackery:
>>> ...
>>> taskern=(TKPE_TaskInfoKern *)task->krnlptr;
>>> memcpy(
>>> taskern->ctx_regsave,
>>> __arch_isrsave,
>>> __ARCH_SIZEOF_REGSAVE__);
>>> __ifarch(!has_xgpr) //deal with 32 GPRs
>>> {
>>> taskern->ctx_regsave[TKPE_REGSAVE_GBR] =
>>> taskern->ctx_regsave[TKPE_REGSAVE_GBR_LO];
>>> taskern->ctx_regsave[TKPE_REGSAVE_LR] =
>>> taskern->ctx_regsave[TKPE_REGSAVE_LR_LO];
>>> taskern->ctx_regsave[TKPE_REGSAVE_SPC] =
>>> taskern->ctx_regsave[TKPE_REGSAVE_SPC_LO];
>>> taskern->ctx_regsave[TKPE_REGSAVE_EXSR] =
>>> taskern->ctx_regsave[TKPE_REGSAVE_EXSR_LO];
>>> }
>>> ...
>>> __ifarch(has_xmov) //only do this if we have XMOV
>>> {
>>> taskern->ctx_regsave[TKPE_REGSAVE_PCH]=__arch_pch;
>>> taskern->ctx_regsave[TKPE_REGSAVE_GBH]=__arch_gbh;
>>> }
>>> ...
>> <
>> It is stuff like this that caused my architecture to take a detour.
>> <
>> I don't have instructions save/restore registers or details
>> of the state of the thread or core. These pieces of information
>> are associated with a memory address, and when it comes time
>> to save them, HW sequencer knows where they go and puts them
>> there, at the same time HW sequencer knows where the next set
>> are coming from and can be loading the new set while pushing
>> out the old set.
>> <
>> When control arrives after a "switch" the code is already using
>> the register file of the new context--including its stack pointer.
>> Thus the code does not have to do anything other than the job
>> that needed done based on the reason of the switch to this new
>> set of state.
>> <
>> The register file itself operates like a write-back cache of 4-lines.
>> The Core state operates like a write-back cache of 1-cache line.
>> HW sequencing is smart enough to begin the fetch of core state
>> and register file BEFORE the switch is initiated, thus the core
>> continues processing the old state until the new state is ready to
>> take over.
>> <
>> All of the state inside a core can be accessed via memory mapped
>> control registers--including the registers. So, if a core stops talking
>> to the rest of the system, one can go in and look at what the core
>> was doing and the instant of where it stopped--remotely.
>> <
>> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
>> devices, timers,...} are all memory mapped.
>> <
>> There is no CPUID-like instruction, all the information of each
>> core is accessible via configuration space on the capabilities
>> list of what smells like a PCIe header.
>>>
>>> Where, it mostly deals with the 32 GPR config by copying some of the
>>> registers around so that they are in the correct places for the other
>>> code (with most of the kernel assuming 64 GPRs, but needing to wrangle
>>> it around when saving and restoring task contexts).
>>>
>>> While __ARCH_SIZEOF_REGSAVE__ looks like a normal preprocessor constant,
>>> it doesn't actually expand to a constant, but instead an "architecture
>>> defined variable" (which turns into a constant later on).
>>>
>>> Similarly, the "__arch_reg" variables allow exposing the underlying CPU
>>> control registers as-if they were C variables (though, the exact
>>> mechanism differs slightly in this case).
>>>
>>>
>>>
>>> Where, as noted, BGBCC doesn't handle static libraries by compiling
>>> directly to machine code, but instead first compiles to a stack-oriented
>>> bytecode format (RIL, along vaguely similar lines to JVM or .NET
>>> bytecode), which is then translated to 3-address-code and then machine
>>> code while compiling the final binary.
>>>
>>> So, in this case, if the "ifarch" remains flexible in the RIL code, then
>>> it can resolved once the RIL is translated to 3AC for the final
>>> compilation (this is typically also when things like "struct layout" and
>>> "sizeof(whatever)" is fully sorted out).
>>>
>>>
>>> RIL is kinda messy in terms of its design, and the whole image basically
>>> needs to be decoded in a linear fashion (it is not a conventional
>>> structured format, rather the entire image, including all the metadata,
>>> is expressed as a long linear stream of bytecode ops), but "basically
>>> works"... Most efforts to work on replacements have tended to stall though.
>>>
>>>
>>> For global declarations, it may cause the "global object" to be
>>> suppressed, where the compiler pretends as-if it did not exist.
>>>
>>> For code-blocks, it is more mundane:
>>> It is compiled mostly like a normal "if" block, except that the "if()"
>>> expression is resolved to a constant based on architectural feature
>>> flags and (may) allow the resulting basic-block to be either included or
>>> omitted from the binary (though, may not be sufficient to avoid the
>>> compiler from complaining about missing dependencies if the variables or
>>> functions used only exist along certain branches of the "ifarch" tree).
>>>
>>> These are mostly using normal C expression syntax, with some
>>> restrictions (one can't access normal runtime variables or call
>>> functions or similar; only use things that will be able to evaluate to a
>>> constant at compile time).
>>>
>>>
>>>
>>> But, the final binaries still end up being specific to a particular
>>> configuration, which is still kind of annoying.
>> <
>> So, fix it. Nobody is standing over you preventing you from fixing it.
>>>
>>> And, getting everything in place so that I can use an Arty S7-50 to
>>> drive a small robot around is being a lot more effort than ideal (mostly
>>> because this has led to a *lot* of these types of issues, as a lot of my
>>> code had implicitly drifted to assumptions of a "bigger" target, and I
>>> need to scale things back some and get things working again).
>>>
>>> Well, and in addition still need to write the robot control code and
>>> other stuff, still with some uncertainty about the whole "camera module"
>>> thing (maybe I should just get some IR or ultrasonic distance sensors?...).
>>>
>>> Granted, "easy path" in this case would be "just use a RasPi", but alas.
>>> My recent tinkering with neural nets was also partly related to this
>>> sub-case, as I had intended to try to use NNs for processing camera
>>> data, but this still seems like it is "pushing it".
>>>
>>> ...
>>>
>>>
>>> Any thoughts?...
>> <
>> Get an FPGA that has 10M gates and 500 useable pins.
>> But this has cost drawbacks.........
>
> This topic may be evidence that the ‘one size fits all’ idea is very challenging
> to achieve.
>


Re: Misc: Dealing with variability

<u80g92$191e$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33005&group=comp.arch#33005

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Tue, 4 Jul 2023 02:04:01 -0500
Organization: A noiseless patient Spider
Lines: 151
Message-ID: <u80g92$191e$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<91f41f25-5f43-449a-8106-94492a63947bn@googlegroups.com>
<u80537$3vvuo$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 4 Jul 2023 07:04:03 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="160f43a359ceee858cdc0ca906f6f698";
logging-data="42030"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+8MNA8bWd8+puvTkov5yJq"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:DbWSFniu4i7wL0i8dLk1xqT8gwA=
In-Reply-To: <u80537$3vvuo$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 4 Jul 2023 07:04 UTC

On 7/3/2023 10:53 PM, BGB wrote:

....

>
> Yeah.
>
>
> If I could have a "simple and consistent" solution across the range of
> FPGAs I am trying to use, that would be good.
>
> One could argue for, say, using a simpler RISC-like ISA, and then using
> superscalar or OoO on the bigger cores.
>
> But, the threshold where superscalar or OoO looks like an attractive
> option seems to be higher than what I am dealing with here...
>
>
>
> Well, and then for example, the SweRV core has an interesting design,
> and is an "actual superscalar", but (despite being a 32-bit core) its
> inability to run much faster than around 25 MHz or so is concerning...
>
> Also kinda eats the FPGA...
>
>
> MicroBlaze can run at higher clock-speeds, but AFAIK it is a
> single-issue RISC (can give decent performance, but also tends to run
> the FPGA pretty hot).
>
> So, MicroBlaze is, it looks like:
>   32x 32-bit GPRs
>   Single Precision FPU (optional)
>     DAZ/FTZ semantics.
>   Software managed TLB (Optional)
>   1 wide
>   Supports both fixed displacement and register-indexed addressing.
>   Most immediate ops have 16-bit immediate values.
>     But, also only 16-bit branch displacements.
>   Seems to use "Compare-with-zero and branch" semantics.
>
>
> Contrast, BJX2 Core is:
>   32x|64x 64-bit GPRs
>   Double Precision FPU (most configs)
>     Also DAZ/FTZ
>   Software managed TLB (most configs)
>   1/2/3 wide (3-wide in bigger configs)
>
>
> But, the issue being that code built to assume a feature is enabled will
> not run on a core which lacks that feature, but the more feature-rich
> cores will not fit on smaller FPGAs.
>
> Code built for the more limited cores will not run as efficiently on
> bigger cores.
>
> Like, in the minimal configurations, software-emulated floating point is
> used, but if one is calling runtime functions to perform floating point
> math operations, this is a lot slower than using the floating point
> instructions directly, ... (but, if you want to trap and emulate the
> instructions on the smaller cores; this is even slower...).
>
>
>
> For related reasons, probably no one would likely consider building a
> full-scale computer running MicroBlaze or similar.
>
>
> Though, ironically, in some ways MicroBlaze would likely still be less
> limited than RV32I... Well, except that nearly every function call in
> MicroBlaze is going to need to compose the branch displacement as a
> multi-op sequence or similar because, reasons (you really need like 20
> bits or so for this...).
>

Hmm, looking more, it seems MicroBlaze has a functional equivalent of
a Jumbo prefix in the form of the IMM instruction.

It also appears they are using a similar exception dispatch mechanism
(branch to an offset relative to a vector register which then encodes a
branch to the ISR entry point).

The spec describes them as using a somewhat smaller TLB though:
  64-entry, 1-way associative
  Seemingly with an ITLB/DTLB split in the mix as well.

However, their memory access is aligned-only, which potentially lessens
the need for set-associativity in the TLB (with an I/D split and
aligned-only access, it is possible that a 1-way TLB is sufficient).

This is different from the 256x 4-way shared TLB used by BJX2.

So, say:
  Direct Displacement Addressing:
    BJX2      : 9-bit, scaled by element size
    RISC-V    : 12-bit unscaled
    MicroBlaze: 16-bit unscaled
      Aligned only
  Indexed Addressing:
    BJX2      : Scaled by element size
    RISC-V    : N/A
    MicroBlaze: Unscaled
  Integer with Immediate:
    BJX2      : 9-bit (3RI), 16-bit (2RI)
    RISC-V    : 12-bit (3RI)
    MicroBlaze: 16-bit (3RI)
  Branch Displacement:
    BJX2      : 20-bit (scale=2)
    RISC-V    : 20-bit
    MicroBlaze: 16-bit (scale=1)
      With Branch Delay Slots...
  Conditional Branches:
    BJX2      :
      BT/BF, 20-bit disp;
      Compare(Zero)+Branch, 8-bit (Optional)
    RISC-V    : Compare+Branch, 12-bit
    MicroBlaze: CompareZero+Branch, 16-bit
  Jumbo-like Prefix:
    BJX2      : Yes, adds 24 bits
    RISC-V    : No
    MicroBlaze: Yes, adds 16 bits.
  FPU (Scalar):
    BJX2      : GPRs, Binary64
    RISC-V    : FPRs, Binary32 and Binary64
    MicroBlaze: GPRs, Binary32
  Integer Divide:
    BJX2      : Yes, MULQ Extension
    RISC-V    : Yes, M extension
    MicroBlaze: Yes, Optional
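
To put the displacement widths listed above on a common footing, here is a small worked calculation in C. It computes the size of the address window a displacement field covers, which is the same whether the field is signed or unsigned (only where the window sits relative to the base differs). The function name is made up, and the 8-byte element size in the usage note is just an example.

```c
#include <stdint.h>

/* Total span in bytes covered by a displacement field of `bits` bits
   whose value is multiplied by `scale` (1 = unscaled). */
static uint64_t disp_span(unsigned bits, unsigned scale)
{
    return ((uint64_t)1 << bits) * scale;
}
```

With this, disp_span(9, 8) and disp_span(12, 1) both give 4096, so a 9-bit field scaled by an 8-byte element covers the same window as a 12-bit unscaled field, while for byte accesses the 9-bit field only covers 512 bytes. disp_span(16, 1) gives 65536, and for the branch cases disp_span(20, 2) gives 2097152 (2 MiB) versus 64 KiB for a 16-bit displacement at scale 1.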

Exclusive to BJX2 (for this set):
  Predicated instructions (via SR.T flag)
  Instruction Bundles
  128-bit ALU ops (ALUX)
  SIMD
  ...

Exclusive to RISC-V:
  The use of dedicated floating-point registers.

Exclusive to MicroBlaze:
  Aligned-only memory access;
  Branch Delay Slots;
  ...

....

Re: Misc: Dealing with variability

<u8199j$49lr$1@newsreader4.netcologne.de>


https://news.novabbs.org/devel/article-flat.php?id=33006&group=comp.arch#33006

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd4-c42d-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Tue, 4 Jul 2023 14:10:59 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <u8199j$49lr$1@newsreader4.netcologne.de>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u800bm$3vgok$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 4 Jul 2023 14:10:59 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd4-c42d-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd4:c42d:0:7285:c2ff:fe6c:992d";
logging-data="140987"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 4 Jul 2023 14:10 UTC

BGB <cr88192@gmail.com> schrieb:
> On 7/3/2023 4:55 PM, MitchAlsup wrote:
>> On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
>>> One annoyance with my project is that I can't run a core that can run
>>> exactly the same code along the range of FPGA's I want to use:
>>> XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
>>> XC7S50: Currently running a 2-wide configuration with 32 GPRs
>>> Reduced features, eg: no 128-bit ALU ops in this case, ...
>>> XC7A100T: Can run 3-wide with 64 GPRs, no issue.
>>> Can also run 128-bit ALU ops, ...
>>>
>>>
>>>
>>> Generally, one needs to rebuild the code for each configuration.
>>> Code built for small configurations will perform worse on bigger
>>> configurations. Code built for wide configurations will not run on small
>>> configurations.
>> <
>> This is exactly why my architecture allows for the HW to perform the narrow
>> to wide transformations. There is 1 ISA model for everything from 1-wide
>> in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
>> is within spitting distance of running optimally on the GBOoO.
>> <
>> Then the compiler can be targeted at just the ISA and be model ignorant.
>
> The issue is doing this effectively without causing LUT cost or timing
> issues.
>
>
> I have a possible way to execute 1-wide code as a 2-way superscalar, but
> it would only really be practical on the larger profiles.

If you want bundles, you could just execute the individual
instructions sequentially on your smallest core. Unless you throw
out the NOPs during decoding and give the CPU something else to
do, you would then suffer the performance penalty for the NOPs
you generate.

Or, if you have a bit to spare, you can put in a hint that an
instruction can be executed consecutively with the next (or the
preceding) one. This would save you the logic needed to analyze
register dependencies. Low-end implementations could just ignore
the bit, middle-end implementations would use it and high-end
implementations would not need it, wasting a bit.

Or, you could structure your ISA so that the dependency analysis
becomes simple - always put the destination register and source
register(s) in the same place, and have a simple opcode bit pattern
to show which instructions have which registers. Then, having
a two-wide in-order implementation would become simpler.

Think of what a Boolean function "Can the instruction sequence a,b
be executed consecutively" would look like. The simpler this is,
the better for your analysis and your LUT count.
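
A sketch of what such a function could look like in C when the register fields sit at fixed positions. The bit layout below is entirely hypothetical (it is not BJX2, RISC-V, or any real ISA), and mk() is only a helper for building example words:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed layout:
   bits  4..0  rd     bits  9..5  rs1    bits 14..10 rs2
   bit  15     op writes rd
   bit  16     op reads rs2 (rs1 is always read)            */
#define RD(w)      ((w) & 31u)
#define RS1(w)     (((w) >> 5) & 31u)
#define RS2(w)     (((w) >> 10) & 31u)
#define WRITES(w)  (((w) >> 15) & 1u)
#define READS2(w)  (((w) >> 16) & 1u)

/* Whether b may issue alongside a, as a pure function of the two words. */
static bool can_dual_issue(uint32_t a, uint32_t b)
{
    bool raw = WRITES(a) &&
               (RS1(b) == RD(a) || (READS2(b) && RS2(b) == RD(a)));
    bool waw = WRITES(a) && WRITES(b) && RD(a) == RD(b);
    return !raw && !waw;
}

/* Helper for building example words in this made-up layout. */
static uint32_t mk(unsigned rd, unsigned rs1, unsigned rs2,
                   bool writes, bool reads2)
{
    return (rd & 31u) | ((rs1 & 31u) << 5) | ((rs2 & 31u) << 10) |
           ((uint32_t)writes << 15) | ((uint32_t)reads2 << 16);
}
```

With the fields fixed, the whole check is three 5-bit comparators and a handful of gates; no opcode decode is needed before the comparison can start.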

A highly regular structure, like the RISC designs had, would certainly
make this easier.

Re: Misc: Dealing with variability

<u81l1a$5inl$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33007&group=comp.arch#33007

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Tue, 4 Jul 2023 12:31:22 -0500
Organization: A noiseless patient Spider
Lines: 328
Message-ID: <u81l1a$5inl$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u800bm$3vgok$1@dont-email.me> <u8199j$49lr$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 4 Jul 2023 17:31:23 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="160f43a359ceee858cdc0ca906f6f698";
logging-data="183029"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/RhTXJWGg/7rHb4z3mblAh"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:A7t89MrJn5Q3pJVm2Gk1gaPssdA=
Content-Language: en-US
In-Reply-To: <u8199j$49lr$1@newsreader4.netcologne.de>
 by: BGB - Tue, 4 Jul 2023 17:31 UTC

On 7/4/2023 9:10 AM, Thomas Koenig wrote:
> BGB <cr88192@gmail.com> schrieb:
>> On 7/3/2023 4:55 PM, MitchAlsup wrote:
>>> On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
>>>> One annoyance with my project is that I can't run a core that can run
>>>> exactly the same code along the range of FPGA's I want to use:
>>>> XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
>>>> XC7S50: Currently running a 2-wide configuration with 32 GPRs
>>>> Reduced features, eg: no 128-bit ALU ops in this case, ...
>>>> XC7A100T: Can run 3-wide with 64 GPRs, no issue.
>>>> Can also run 128-bit ALU ops, ...
>>>>
>>>>
>>>>
>>>> Generally, one needs to rebuild the code for each configuration.
>>>> Code built for small configurations will perform worse on bigger
>>>> configurations. Code built for wide configurations will not run on small
>>>> configurations.
>>> <
>>> This is exactly why my architecture allows for the HW to perform the narrow
>>> to wide transformations. There is 1 ISA model for everything from 1-wide
>>> in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
>>> is within spitting distance of running optimally on the GBOoO.
>>> <
>>> Then the compiler can be targeted at just the ISA and be model ignorant.
>>
>> The issue is doing this effectively without causing LUT cost or timing
>> issues.
>>
>>
>> I have a possible way to execute 1-wide code as a 2-way superscalar, but
>> it would only really be practical on the larger profiles.
>
> If you want bundles, you could just execute the individual
> instructions sequentially on your smallest core. Unless you throw
> out the NOPs during decoding and give the CPU something else to
> do, you would then suffer the performance penalty for the NOPs
> you generate.
>

This part is not the main issue...

Also, luckily, my encoding scheme does not waste space with a bunch of
NOPs. Rather the bundles are a variable number of 32-bit instruction
words with a wide-execute flag.

The pipeline also has interlock handling, so there is no real need for
"timing NOPs" either.

A much bigger issue is, for example, that the 2-wide and 3-wide profiles
assume the existence of a MOV.X instruction (Load/Store 128-bit pair),
which does not exist on 1-wide as it requires more register ports than
are available.

Otherwise, the 1-wide core would need some mechanism to decompose it in
the pipeline into a pair of MOV.Q instructions.

Though, MOV.X isn't an issue for WEX-3W code on WEX-2W.

Rather, it is that code built for WEX-3W tends to assume other
extensions (such as MULQ or ALUX). So, stuff breaks as soon as it hits
an attempt to use an Integer Divide instruction or 128-bit ALU op or
similar.

Trying to run 3W code on 2W will cause the core to fall back to scalar
operation, as 3W bundling rules aren't valid for 2W.

One could assume that the compiler does not use these.

Jumbo operations were originally a problem for 1-wide, but I have ended
up using a hack apparently similar to what MicroBlaze uses, namely the
Jumbo-prefix itself behaves like a NOP as far as the EX stages are
concerned, but captures its state in a special internal register, which
then combines with the following instruction. One has to basically flag
the prefixes as "No, Interrupt can't land here.", similar to those
generated by interlock stalls, so it is a "special" type of NOP.
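
As a rough illustration of the prefix hack described above, the following
Python sketch models a 1-wide pipeline in which a Jumbo prefix occupies a
slot as a non-interruptible NOP and parks its bits in a latch for the next
instruction to pick up. The function name, the latch list, and the
convention that a 32-bit word is (first halfword << 16) | second halfword
are assumptions made for the sketch, not details of the actual core. Only
the prefix slot itself is flagged as non-interruptible here; how the real
core treats the slot that consumes the latch is not stated.

```python
JUMBO_TOP_BYTES = (0xFE, 0xFF)  # FE/FF blocks are the Jumbo prefixes


def step_1wide(words):
    """Model a 1-wide core consuming one 32-bit word per slot.

    Yields (executed, can_interrupt) for each word. `executed` is None
    when the slot behaved as a NOP (a prefix), otherwise a tuple of
    (latched_prefix_words, op_word).
    """
    latch = []  # hypothetical internal register holding pending prefixes
    for w in words:
        if (w >> 24) in JUMBO_TOP_BYTES:
            # Prefix: nothing reaches EX, state is captured, and an
            # interrupt must not land on this slot.
            latch.append(w)
            yield None, False
        else:
            # Ordinary op: combine with whatever the latch captured.
            yield (tuple(latch), w), True
            latch = []
```

Feeding it the three words of a 96-bit encoding yields two NOP slots
followed by one combined operation.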

For the 2-wide, the Jumbo prefixes are decoded the same as for 3-wide.

Just as-is, the 3rd decoder is limited to F8 block instructions, eg:
FEab-cdef-FE01-2345-F811-6789 MOV 0xABCDEF0123456789, R17
So, it can handle 96-bit 2RI MOV and ADD and similar, but not much else.
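
The hex digits in the example suggest how the 64-bit immediate is spread
over the three words: 24 bits in each FE prefix and 16 bits in the F8 op.
The Python sketch below rebuilds the constant on that reading. The bit
split is inferred from the example rather than taken from an encoding
spec, and the function name and the (first halfword << 16) | second
halfword word convention are made up for illustration.

```python
def jumbo96_imm64(w0, w1, w2):
    """Rebuild the 64-bit immediate of a 96-bit FE-FE-F8 encoding."""
    if (w0 >> 24) != 0xFE or (w1 >> 24) != 0xFE or (w2 >> 24) != 0xF8:
        raise ValueError("expected FE, FE, F8 as the top bytes")
    hi = w0 & 0xFFFFFF   # 24 payload bits from the first prefix
    mid = w1 & 0xFFFFFF  # 24 payload bits from the second prefix
    lo = w2 & 0xFFFF     # Imm16 field of the F8-block op
    return (hi << 40) | (mid << 16) | lo
```

With the words 0xFEABCDEF, 0xFE012345, 0xF8116789 this gives back
0xABCDEF0123456789, matching the MOV in the example.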

Limiting the 3rd decoder to the F8 block serves to shave off some LUTs
in this case.

For the most part, Imm57s ALU ops or Disp57s Load/Store, while
theoretically possible to encode, are "not a thing" at present.

Mostly because the number of times these cases tend to come up (where an
Imm33s or Disp33s is insufficient) is "vanishingly small", so it ends up
debatable whether they are worth the cost of having them enabled in
the decoder (with the closest thing to a "killer app" being to use them
as immediate values for SIMD instructions).

Also, currently Disp33s is the largest format supported natively by the
AGU (going any bigger requiring the use of ALU instructions).

> Or, if you have a bit to spare, you can put in a hint that an
> instruction can be executed consecutively with the next (or the
> preceding) one. This would save you the logic needed to analyze
> register dependencies. Low-end implementations could just ignore
> the bit, middle-end implementations would use it and high-end
> implementations would not need it, wasting a bit.
>

This is pretty much how the encoding scheme works already, and was
designed to operate.

Achieving this scalability in practice isn't as easy.

One would need to also ensure that all of the cores have the same basic
feature set in terms of scalar instructions as well, which is where the
main problem lies with the profiles.

> Or, you could structure your ISA so that the dependency analysis
> becomes simple - always put the destination register and source
> register(s) in the same place, and have a opcode simple bit pattern
> to show which instructions have which registers. Then, having
> a two-wide in-order implementation would become simpler.
>

Luckily, register fields don't move around that much.

Going superscalar or OoO should be very well possible with BJX2.

Just, the bigger question is if one can do a "good enough" superscalar
or OoO core on a Spartan or Artix FPGA to make it worthwhile.

The fastest (in terms of MHz) options thus far tend to be 1-wide RISC
cores (which one can run at ~ 100 MHz).

Or, I can run a VLIW core at 50 MHz, with enough "extra" to mostly
compensate for the lower clock-speed.

Also, at 50 and 100 MHz, performance is more limited by memory bandwidth
in most of my test cases than by clock-speed.

Also, despite the higher clock-speed, relative performance will tank if
one achieves the higher clock-speed by reducing the L1 cache size (so,
maintaining 16K or 32K of L1 cache seems to be "fairly important" for
performance).

An intermediate option is 75 MHz, but it is "very painful" trying to get
or keep the core passing timing at 75MHz (with both 3-wide VLIW and
"decent" L1 caches).

If one drops to 25 MHz, the loss of clock-speed ends up hurting more
than anything one could likely gain from doing so.

Say, for example, even with minimal memory and instruction latency, Doom
(and similar) at 25 MHz would still run slower than at 50 MHz, shifting from
being memory-bound to being limited by how quickly it can execute
instructions (and it would also rarely go much over around 12 fps or so).

One can also do a 16-bit core and run it at around 200 MHz, but... this
also isn't terribly useful...

> Think of what a Boolean function "Can the instruction sequence a,b
> be executed consecutively" would look like. The simpler this is,
> the better for your analysis and your LUT count.
>
> A highly regular structure, like the RISC designs had, would certainly
> make this easier.

Most of my instructions follow a few forms (for 32-bit ops):
FZnm-ZeoZ //3R
FZnm-ZeZZ //2R
FZnm-Zeii //3RI (Imm9)
FZnZ-Zeii //2RI (Imm10)
FZZZ-ZeoZ //1R ("OP Ro", but Ro understood as Rn)
FZZn-iiii //2RI (Imm16)
FZii-Ziii //IMM (BRA and BSR)

Where, Rn, Rm, Ro are the 3 registers; the 'e' field adding bit 4 for
each register: Znmo.
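
To tie this back to the "can a,b be executed consecutively" question, here
is a small Python sketch that pulls Rn, Rm, Ro out of the 3R form
(FZnm-ZeoZ, with the 'e' nibble supplying bit 4 as Znmo) and applies a
simple dependency test. Treating Rn as the destination and Rm/Ro as
sources is an assumption made for the sketch, as are the function names;
a real check would also have to look at the opcode bits.

```python
def decode_3r(word):
    """Return (Rn, Rm, Ro) for a 32-bit 3R op laid out as FZnm-ZeoZ."""
    n = (word >> 20) & 0xF
    m = (word >> 16) & 0xF
    e = (word >> 8) & 0xF   # bits are Z, n, m, o from high to low
    o = (word >> 4) & 0xF
    rn = n | (((e >> 2) & 1) << 4)
    rm = m | (((e >> 1) & 1) << 4)
    ro = o | ((e & 1) << 4)
    return rn, rm, ro


def can_pair(a, b):
    """True if 3R op b has no RAW or WAW dependency on 3R op a.

    Assumes Rn is written and Rm, Ro are read (an assumption).
    """
    a_dst, _, _ = decode_3r(a)
    b_dst, b_src1, b_src2 = decode_3r(b)
    return a_dst not in (b_src1, b_src2) and a_dst != b_dst
```

Because the fields sit in fixed positions, the whole test reduces to a few
5-bit comparisons, which is the kind of function that stays cheap in LUTs.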

This leaves the Imm16 block as an outlier with a different register
encoding from all the other ops. But, there was no good way around this
short of dog-chewing the immediate field, and in this case I prioritized
having a contiguous immediate over a consistent register field.

Unlike RISC-V, I tried to minimize dog-chewing the immediate fields.

I also prioritized a layout where I could express encodings sensibly in
a hexadecimal-based notation (rather than binary), and where there was
sufficient detail in the notation to be able to unambiguously encode or
decode the instructions (without needing a big blob of text or diagrams
to try to express how a given instruction in question laid out its fields).

There are also 16-bit ops, mostly following a pattern:
ZZnm //2R
ZZni //1RI (Imm4)
ZZnZ //1R
ZZii //Imm8 / Disp8
Znii //1RI (Imm8) (MOV and ADD)
Ziii //Imm12 (MOV Imm12, R0)

The 16-bit ops being mostly limited to R0..R15 (some Load/Store ops also
have variants to encode R16..R31). There are no 3R encodings in 16-bit land.

On the first 16 bit word, the ranges are:
F0: Mostly 3R, 2R, and 1R ops.
F1: Load/Store Disp9 ops
F2: ALU Imm9 ops.
F3: Reserved / Implementation-Extension
F4..F7: Repeat F0..F3, flagged as bundled.
F8: 2RI Imm16 ops
F9: Reserved
FA: MOV Imm24u, R0
FB: MOV Imm24n, R0
FC/FD: Repeat F8/F9, bundled
FE: Jumbo Prefix (Immed Extension)
FF: Jumbo Prefix (Opcode Extension)
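
The range list above can be expressed as a small lookup. The Python sketch
below classifies the first 16-bit word of an F-block instruction and
reports whether it is the bundled variant, folding F4..F7 onto F0..F3 and
FC/FD onto F8/F9. The table strings and the function name are illustrative
only, and anything outside F0..FF is simply rejected rather than decoded.

```python
BLOCKS = {
    0x0: "3R/2R/1R ops",
    0x1: "Load/Store Disp9",
    0x2: "ALU Imm9",
    0x3: "reserved / implementation extension",
    0x8: "2RI Imm16",
    0x9: "reserved",
    0xA: "MOV Imm24u, R0",
    0xB: "MOV Imm24n, R0",
    0xE: "Jumbo prefix (immediate extension)",
    0xF: "Jumbo prefix (opcode extension)",
}


def classify(first_halfword):
    """Return (block description, bundled flag) for an F0..FF halfword."""
    if (first_halfword >> 12) != 0xF:
        raise ValueError("not in the F0..FF range listed above")
    blk = (first_halfword >> 8) & 0xF
    bundled = blk in (0x4, 0x5, 0x6, 0x7, 0xC, 0xD)
    if bundled:
        blk -= 4  # F4..F7 repeat F0..F3; FC/FD repeat F8/F9
    return BLOCKS[blk], bundled
```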


Re: Misc: Dealing with variability

<u842el$i90j$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33010&group=comp.arch#33010

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Wed, 5 Jul 2023 08:32:35 -0700
Organization: A noiseless patient Spider
Lines: 24
Message-ID: <u842el$i90j$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 5 Jul 2023 15:32:37 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="bb962625dc0c1659edee4d43cffd382c";
logging-data="599059"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/+QwR8RYccTj0u7ufji9geYNF7Xa8c8dA="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:RlOF0IRO0gsFRCJyv001l8aaKR4=
In-Reply-To: <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Wed, 5 Jul 2023 15:32 UTC

On 7/3/2023 2:55 PM, MitchAlsup wrote:

snip

> All of the state inside a core can be accessed via memory mapped
> control registers--including the registers. So, if a core stops talking
> to the rest of the system, one can go in and look at what the core
> was doing and the instant of where it stopped--remotely.
> <
> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
> devices, timers,...} are all memory mapped.
> <
> There is no CPUID-like instruction, all the information of each
> core is accessible via configuration space on the capabilities
> list of what smells like a PCIe header.

How does a user program, which presumably has access to 64 bits of user
address space, and can't modify its page table entries, get access to
perform loads from the configuration space?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Misc: Dealing with variability

<u845mf$ikdm$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33012&group=comp.arch#33012

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Wed, 5 Jul 2023 11:27:57 -0500
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <u845mf$ikdm$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u842el$i90j$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 5 Jul 2023 16:27:59 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="21b534abe2560721ed2f17e327249673";
logging-data="610742"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/m+YYww2sPBKNgBd48x4mt"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:C/jxMD35rTJyJAPXkx0D4Dx4afM=
In-Reply-To: <u842el$i90j$1@dont-email.me>
Content-Language: en-US
 by: BGB - Wed, 5 Jul 2023 16:27 UTC

On 7/5/2023 10:32 AM, Stephen Fuld wrote:
> On 7/3/2023 2:55 PM, MitchAlsup wrote:
>
> snip
>
>> All of the state inside a core can be accessed via memory mapped
>> control registers--including the registers. So, if a core stops talking
>> to the rest of the system, one can go in and look at what the core
>> was doing and the instant of where it stopped--remotely.
>> <
>> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
>> devices, timers,...} are all memory mapped.
>> <
>> There is no CPUID-like instruction, all the information of each
>> core is accessible via configuration space on the capabilities
>> list of what smells like a PCIe header.
>
> How does a user program, which presumably has access to 64 bits of user
> address space, and can't modify its page table entries, get access to
> perform loads from the configuration space?
>

Hmm... Yeah, that does seem like a mystery.

CPUID is simpler, and doesn't need to be all that fancy.

Making it essentially something like a special case MOV that fetches
magic constants and similar works OK.

Well, and a few more potentially controversial features, like a hardware
RNG and (possible) clock-cycle counter (though, one could argue that a
clock-cycle counter is a potential security risk).

There is also a microsecond clock, but I had put this in MMIO space.

But, using CPUID for this could potentially avoid an ~ 4k cycle overhead
from calling functions like "clock()" or "clock_gettime()" and similar
in userland (userland being unable to directly access the MMIO based
timers).

Also, timing via delay loops sucks; it is probably better to be able to
quickly fetch an accurate time and drive logic as appropriate.

Though, whether all this is "in the spirit" of a CPUID instruction is
debatable.

Otherwise, recently noted that my XC7A200T can survive me increasing the
size of the L2 cache to 1MB, effectively throwing nearly all the
Block-RAM in the FPGA at the L2 cache...

But, on the upside, with a 1MB L2 cache, Doom seems to no longer be
memory-IO bound, and can (finally) give a consistent 20-34 fps...

(In the emulator model, with 1MB of L2, it is getting over a 90% L2 hit
rate).

Things like Heretic and Hexen also get a little faster.
Not nearly so much effect on Quake and similar though.

Re: Misc: Dealing with variability

<d1727465-9162-467e-a030-445b28d0f114n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33013&group=comp.arch#33013

X-Received: by 2002:a05:620a:170f:b0:767:55c4:572b with SMTP id az15-20020a05620a170f00b0076755c4572bmr36216qkb.3.1688577197679;
Wed, 05 Jul 2023 10:13:17 -0700 (PDT)
X-Received: by 2002:a63:e001:0:b0:53f:7715:9580 with SMTP id
e1-20020a63e001000000b0053f77159580mr1602385pgh.0.1688577197207; Wed, 05 Jul
2023 10:13:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 5 Jul 2023 10:13:16 -0700 (PDT)
In-Reply-To: <u842el$i90j$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:458e:d34c:d867:52d3;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:458e:d34c:d867:52d3
References: <u7uvlo$3oi68$1@dont-email.me> <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u842el$i90j$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d1727465-9162-467e-a030-445b28d0f114n@googlegroups.com>
Subject: Re: Misc: Dealing with variability
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 05 Jul 2023 17:13:17 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2376
 by: MitchAlsup - Wed, 5 Jul 2023 17:13 UTC

On Wednesday, July 5, 2023 at 10:32:41 AM UTC-5, Stephen Fuld wrote:
> On 7/3/2023 2:55 PM, MitchAlsup wrote:
>
> snip
> > All of the state inside a core can be accessed via memory mapped
> > control registers--including the registers. So, if a core stops talking
> > to the rest of the system, one can go in and look at what the core
> > was doing and the instant of where it stopped--remotely.
> > <
> > All other control registers {L2 cache, L3 cache, PCIe hostBridge,
> > devices, timers,...} are all memory mapped.
> > <
> > There is no CPUID-like instruction, all the information of each
> > core is accessible via configuration space on the capabilities
> > list of what smells like a PCIe header.
> How does a user program, which presumably has access to 64 bits of user
> address space, and can't modify its page table entries, get access to
> perform loads from the configuration space?
<
Why would a host not grant read access to the configuration stuff ??
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Misc: Dealing with variability

<O9ipM.2184$elA.593@fx36.iad>

https://news.novabbs.org/devel/article-flat.php?id=33015&group=comp.arch#33015

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx36.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Misc: Dealing with variability
Newsgroups: comp.arch
References: <u7uvlo$3oi68$1@dont-email.me> <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com> <u842el$i90j$1@dont-email.me>
Lines: 28
Message-ID: <O9ipM.2184$elA.593@fx36.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Wed, 05 Jul 2023 17:57:02 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 05 Jul 2023 17:57:02 GMT
X-Received-Bytes: 1960
 by: Scott Lurndal - Wed, 5 Jul 2023 17:57 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 7/3/2023 2:55 PM, MitchAlsup wrote:
>
>snip
>
>> All of the state inside a core can be accessed via memory mapped
>> control registers--including the registers. So, if a core stops talking
>> to the rest of the system, one can go in and look at what the core
>> was doing and the instant of where it stopped--remotely.
>> <
>> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
>> devices, timers,...} are all memory mapped.
>> <
>> There is no CPUID-like instruction, all the information of each
>> core is accessible via configuration space on the capabilities
>> list of what smells like a PCIe header.
>
>How does a user program, which presumably has access to 64 bits of user
>address space, and can't modify its page table entries, get access to
>perform loads from the configuration space?

User mode code generally doesn't require access to system control
registers. ARMv8 has privileged controls that can open certain
registers up to non-privileged accesses (e.g. the timer counter register).

The registers can be grouped and aligned on page boundaries which would
allow a kernel to grant access to certain functional subsets of the
privileged control registers to user mode code via a va to pa mapping.

Re: Misc: Dealing with variability

<u84av5$j6fc$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33016&group=comp.arch#33016

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Wed, 5 Jul 2023 10:57:57 -0700
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <u84av5$j6fc$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u842el$i90j$1@dont-email.me>
<d1727465-9162-467e-a030-445b28d0f114n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 5 Jul 2023 17:57:57 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="bb962625dc0c1659edee4d43cffd382c";
logging-data="629228"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19emojNZvCP1f67A9DYHkPVSXPnwUgNlKI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:7ad/zI145juc766r/3cCD84AEFg=
Content-Language: en-US
In-Reply-To: <d1727465-9162-467e-a030-445b28d0f114n@googlegroups.com>
 by: Stephen Fuld - Wed, 5 Jul 2023 17:57 UTC

On 7/5/2023 10:13 AM, MitchAlsup wrote:
> On Wednesday, July 5, 2023 at 10:32:41 AM UTC-5, Stephen Fuld wrote:
>> On 7/3/2023 2:55 PM, MitchAlsup wrote:
>>
>> snip
>>> All of the state inside a core can be accessed via memory mapped
>>> control registers--including the registers. So, if a core stops talking
>>> to the rest of the system, one can go in and look at what the core
>>> was doing and the instant of where it stopped--remotely.
>>> <
>>> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
>>> devices, timers,...} are all memory mapped.
>>> <
>>> There is no CPUID-like instruction, all the information of each
>>> core is accessible via configuration space on the capabilities
>>> list of what smells like a PCIe header.
>> How does a user program, which presumably has access to 64 bits of user
>> address space, and can't modify its page table entries, get access to
>> perform loads from the configuration space?
> <
> Why would a host not grant read access to the configuration stuff ??

I presume it would, but if a particular address maps to a page in the
configuration space, then the user loses access to the same addressed
page in the user space. I don't think losing access to a few pages is a
big deal, but it does mean the user doesn't really have a full 64 bit
user address space. Again, not a big deal, but it should be documented.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Misc: Dealing with variability

<u84b8o$j7kg$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33017&group=comp.arch#33017

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Wed, 5 Jul 2023 11:03:02 -0700
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <u84b8o$j7kg$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u842el$i90j$1@dont-email.me> <O9ipM.2184$elA.593@fx36.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 5 Jul 2023 18:03:05 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="bb962625dc0c1659edee4d43cffd382c";
logging-data="630416"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19CqnMd99ksgg78IAy4CMsuC6u465GeYDc="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:vC+2+EXRErwG4J9sQK8HB5tFDfQ=
Content-Language: en-US
In-Reply-To: <O9ipM.2184$elA.593@fx36.iad>
 by: Stephen Fuld - Wed, 5 Jul 2023 18:03 UTC

On 7/5/2023 10:57 AM, Scott Lurndal wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> On 7/3/2023 2:55 PM, MitchAlsup wrote:
>>
>> snip
>>
>>> All of the state inside a core can be accessed via memory mapped
>>> control registers--including the registers. So, if a core stops talking
>>> to the rest of the system, one can go in and look at what the core
>>> was doing and the instant of where it stopped--remotely.
>>> <
>>> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
>>> devices, timers,...} are all memory mapped.
>>> <
>>> There is no CPUID-like instruction, all the information of each
>>> core is accessible via configuration space on the capabilities
>>> list of what smells like a PCIe header.
>>
>> How does a user program, which presumably has access to 64 bits of user
>> address space, and can't modify its page table entries, get access to
>> perform loads from the configuration space?
>
> User mode code generally doesn't require access to system control
> registers. ARMv8 has privileged controls that can open certain
> registers up to non-privileged accesses (e.g. the timer counter register).
>
> The registers can be grouped and aligned on page boundaries which would
> allow a kernel to grant access to certain functional subsets of the
> privileged control registers to user mode code via a va to pa mapping.

Yes, I get that (see my response to Mitch above). My comment is that
this should be documented so the user can see how to read, for example,
the CPUID register or equivalent.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Misc: Dealing with variability

<50495c92-4d2a-4d3e-96bf-fd7aae393dfen@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33019&group=comp.arch#33019

X-Received: by 2002:a05:6214:192e:b0:635:e383:53c0 with SMTP id es14-20020a056214192e00b00635e38353c0mr48447qvb.12.1688583112583;
Wed, 05 Jul 2023 11:51:52 -0700 (PDT)
X-Received: by 2002:a17:902:d102:b0:1b3:a8f6:1231 with SMTP id
w2-20020a170902d10200b001b3a8f61231mr1890707plw.4.1688583112014; Wed, 05 Jul
2023 11:51:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 5 Jul 2023 11:51:51 -0700 (PDT)
In-Reply-To: <u84av5$j6fc$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:28ce:7684:1034:367a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:28ce:7684:1034:367a
References: <u7uvlo$3oi68$1@dont-email.me> <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u842el$i90j$1@dont-email.me> <d1727465-9162-467e-a030-445b28d0f114n@googlegroups.com>
<u84av5$j6fc$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <50495c92-4d2a-4d3e-96bf-fd7aae393dfen@googlegroups.com>
Subject: Re: Misc: Dealing with variability
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 05 Jul 2023 18:51:52 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3504
 by: MitchAlsup - Wed, 5 Jul 2023 18:51 UTC

On Wednesday, July 5, 2023 at 12:58:00 PM UTC-5, Stephen Fuld wrote:
> On 7/5/2023 10:13 AM, MitchAlsup wrote:
> > On Wednesday, July 5, 2023 at 10:32:41 AM UTC-5, Stephen Fuld wrote:
> >> On 7/3/2023 2:55 PM, MitchAlsup wrote:
> >>
> >> snip
> >>> All of the state inside a core can be accessed via memory mapped
> >>> control registers--including the registers. So, if a core stops talking
> >>> to the rest of the system, one can go in and look at what the core
> >>> was doing and the instant of where it stopped--remotely.
> >>> <
> >>> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
> >>> devices, timers,...} are all memory mapped.
> >>> <
> >>> There is no CPUID-like instruction, all the information of each
> >>> core is accessible via configuration space on the capabilities
> >>> list of what smells like a PCIe header.
> >> How does a user program, which presumably has access to 64 bits of user
> >> address space, and can't modify its page table entries, get access to
> >> perform loads from the configuration space?
> > <
> > Why would a host not grant read access to the configuration stuff ??
<
> I presume it would, but if a particular address maps to a page in the
> configuration space, then the user loses access to the same addressed
> page in the user space.
<
My 66000 has 64-bit DRAM space, 64-bit MMI/O space, 64-bit configuration
space, and 64-bit ROM space, accessible from a 63-bit virtual address
space (user; super has 64-bit VAS).
<
Allocating a page where user can access core configuration RO data
takes away from user VA space, but from the 66-bit universal address
space, not enough to worry about.
<
> I don't think losing access to a few pages is a
> big deal, but it does mean the user doesn't really have a full 64 bit
> user address space. Again, not a big deal, but it should be documented.
<
User has a 63-bit VAS, supers have 64-bit VASs.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Misc: Dealing with variability

<85kpM.1024$cFK.454@fx34.iad>

https://news.novabbs.org/devel/article-flat.php?id=33022&group=comp.arch#33022

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx34.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Misc: Dealing with variability
Newsgroups: comp.arch
References: <u7uvlo$3oi68$1@dont-email.me> <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com> <u842el$i90j$1@dont-email.me> <O9ipM.2184$elA.593@fx36.iad> <u84b8o$j7kg$1@dont-email.me>
Lines: 49
Message-ID: <85kpM.1024$cFK.454@fx34.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Wed, 05 Jul 2023 20:08:36 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 05 Jul 2023 20:08:36 GMT
X-Received-Bytes: 3074
 by: Scott Lurndal - Wed, 5 Jul 2023 20:08 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 7/5/2023 10:57 AM, Scott Lurndal wrote:
>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>> On 7/3/2023 2:55 PM, MitchAlsup wrote:
>>>
>>> snip
>>>
>>>> All of the state inside a core can be accessed via memory mapped
>>>> control registers--including the registers. So, if a core stops talking
>>>> to the rest of the system, one can go in and look at what the core
>>>> was doing and the instant of where it stopped--remotely.
>>>> <
>>>> All other control registers {L2 cache, L3 cache, PCIe hostBridge,
>>>> devices, timers,...} are all memory mapped.
>>>> <
>>>> There is no CPUID-like instruction, all the information of each
>>>> core is accessible via configuration space on the capabilities
>>>> list of what smells like a PCIe header.
>>>
>>> How does a user program, which presumably has access to 64 bits of user
>>> address space, and can't modify its page table entries, get access to
>>> perform loads from the configuration space?
>>
>> User mode code generally doesn't require access to system control
>> registers. ARMv8 has privileged controls that can open certain
>> registers up to non-privileged accesses (e.g. the timer counter register).
>>
>> The registers can be grouped and aligned on page boundaries which would
>> allow a kernel to grant access to certain functional subsets of the
>> privileged control registers to user mode code via a va to pa mapping.
>
>Yes, I get that (see my response to Mitch above). My comment is that
>this should be documented so the user can see how to read, for example,
>the CPUID register or equivalent)

Oddly enough, that's one of the registers that ARMv8 user mode code
_cannot_ read, and there's no control for privileged code to open it up
to user-mode code (although there are controls to trap OS accesses to
the hypervisor similar to VT-X and SVM).

Linux on ARMv8 (arm64) provides a list of CPU capabilities to
applications through the AUX vector on exec(2) - AT_HWCAP - that can be used as
run-time conditionals for specific sets of instructions such
as the scalable vector extension or scalable matrix extensions.

https://docs.kernel.org/arm64/elf_hwcaps.html

Since x86 (Intel/AMD) allows user applications to invoke the CPUID
instruction directly, AT_HWCAP is not needed for this on those systems.
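
For readers who want to see the AT_HWCAP route in practice, the Python
sketch below reads the value through getauxval(3) via ctypes and tests the
SVE bit. AT_HWCAP = 16 and HWCAP_SVE = 1 << 22 are the values from the
Linux headers; the helper names are made up here, and the code simply
returns None where libc does not provide getauxval (for example on
non-Linux hosts).

```python
import ctypes

AT_HWCAP = 16        # auxiliary-vector key, as in <elf.h>
HWCAP_SVE = 1 << 22  # arm64-only bit from the kernel's uapi hwcap.h


def read_hwcap():
    """Return AT_HWCAP via getauxval(3), or None if it is unavailable."""
    try:
        libc = ctypes.CDLL(None)
        getauxval = libc.getauxval
    except (OSError, AttributeError, TypeError):
        return None
    getauxval.restype = ctypes.c_ulong
    getauxval.argtypes = [ctypes.c_ulong]
    return int(getauxval(AT_HWCAP))


def has_sve(hwcap, machine):
    """The SVE bit is only meaningful on arm64; elsewhere answer False."""
    return (machine in ("aarch64", "arm64")
            and hwcap is not None
            and bool(hwcap & HWCAP_SVE))
```

A caller would typically pass platform.machine() as the second argument
and pick the SVE code path only when has_sve() returns True.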

Re: Misc: Dealing with variability

<u9lu30$ltar$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33439&group=comp.arch#33439

Path: i2pn2.org!rocksolid2!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Mon, 24 Jul 2023 09:24:47 -0400
Organization: A noiseless patient Spider
Lines: 73
Message-ID: <u9lu30$ltar$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 24 Jul 2023 13:24:49 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="338ccb5e2c737d93a495d674e5d51d2b";
logging-data="718171"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18EkIF8ISdkNlmhumMAfoGr3wzFj/s1dpg="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:QjzYbdnNxwusngXs/prSSFAiAYE=
In-Reply-To: <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
 by: Paul A. Clayton - Mon, 24 Jul 2023 13:24 UTC

On 7/3/23 5:55 PM, MitchAlsup wrote:
> On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
[snip]
>> Generally, one needs to rebuild the code for each configuration.
>> Code built for small configurations will perform worse on bigger
>> configurations. Code built for wide configurations will not run on small
>> configurations.
> <
> This is exactly why my architecture allows for the HW to perform the narrow
> to wide transformations. There is 1 ISA model for everything from 1-wide
> in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
> is within spitting distance of running optimally on the GBOoO.

I would *guess* that if one cannot spit very far (<1% — at which
point _measuring_ the difference actually caused by using the same
binary might be challenging) this implementation constraint might
hinder exploiting some good cases.

Excluding the creativity effect of constraint, such a constraint
seems to require a non-negative (likely positive) cost.

The creativity effect (which is one advantage of formal poetry —
another is audience expectation) realistically applies as design
effort is not infinite.

If I adequately understand the general implementation proposals,
loop fission/fusion, loop interchange, and loop blocking choices
could differ based on the targeted microarchitecture (even just L1
cache size?). If a small loop body runs much faster on a small
implementation, loop fission might be better even with extra store
and load overhead to communicate with the second loop — presumably
with blocking to keep the communication within L1 — while a larger
implementation might run more than 1% faster with the loop not
split.
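
To make the fission trade-off concrete, here is a small sketch (not taken from any of the implementations discussed) of the same computation written fused and written as two blocked loops that communicate through a temporary small enough to stay in L1. The function names, the `BLOCK` size of 256 doubles, and the arithmetic itself are assumptions chosen only for illustration. Which form is faster would have to be measured per implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Fused form: one pass, no temporary, but a larger loop body. */
static void scale_add_fused(const double *a, const double *b,
                            double *out, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k + b[i] * b[i];
}

enum { BLOCK = 256 };  /* assumed: 256 doubles = 2 KiB, far below a typical L1 */

/* Fissioned and blocked form: two small loops per block, linked by tmp. */
static void scale_add_split(const double *a, const double *b,
                            double *out, size_t n, double k)
{
    double tmp[BLOCK];
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t len = (n - base < BLOCK) ? (n - base) : BLOCK;
        for (size_t i = 0; i < len; i++)   /* first loop: store intermediate */
            tmp[i] = a[base + i] * k;
        for (size_t i = 0; i < len; i++)   /* second loop: reload and finish */
            out[base + i] = tmp[i] + b[base + i] * b[base + i];
    }
}

/* Self-check on exactly representable inputs, so rounding cannot differ. */
static int split_matches_fused(size_t n)
{
    static double a[1024], b[1024], x[1024], y[1024];
    assert(n <= 1024);
    for (size_t i = 0; i < n; i++) {
        a[i] = (double)i;
        b[i] = (double)(i % 7);
    }
    scale_add_fused(a, b, x, n, 3.0);
    scale_add_split(a, b, y, n, 3.0);
    for (size_t i = 0; i < n; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

The split version pays for the extra store and load of `tmp`, which is the overhead mentioned above. The blocking keeps that traffic inside a 2 KiB buffer.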

Doing more work to achieve higher performance seems a known
microarchitecture-aware optimization. In addition to loop
optimizations, higher-level algorithmic options might change which
binary code expression is "nearly best".

While the dynamic scheduling of a GBOoO implementation will tend
to mute differences, I suspect cases can be contrived with more
than a 5% performance difference.

[snip]
> All of the state inside a core can be accessed via memory mapped
> control registers--including the registers. So, if a core stops talking
> to the rest of the system, one can go in and look at what the core
> was doing and the instant of where it stopped--remotely.

What is the consistency model for memory-mapped registers both
within a thread and between cores?

I would guess that within a thread one would want strong
consistency which seems to require alias detection for every load
and store. For between threads, strong consistency also seems
desirable as the cost would typically not be significant (rare
use?) and the software benefit significant (already complex
reasoning required so any simplification might be appreciated).

This also seems to reduce the cost of polling for a receiver
thread (a register test being cheaper than a cache read and the
write is probably more "pushed" than "pulled").

(Address-based store "pushing" has been proposed. With last level
cache home nodes — for data placement and not just coherence
management — lower level evictions get pushed home, but this is
slow. Obviously replacement policy and early writeback could
adjust the timing.)

This might also present an opportunity similar to MONITOR/MWAIT
while avoiding the issues of using general memory. (MONITOR/MWAIT
can be one/any-to-many, which has some uses.)

Re: Misc: Dealing with variability

<u9m1le$16poa$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33440&group=comp.arch#33440

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-29be-0-46be-824f-88ee-6c4e.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Mon, 24 Jul 2023 14:25:50 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <u9m1le$16poa$1@newsreader4.netcologne.de>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u9lu30$ltar$1@dont-email.me>
Injection-Date: Mon, 24 Jul 2023 14:25:50 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-29be-0-46be-824f-88ee-6c4e.ipv6dyn.netcologne.de:2001:4dd6:29be:0:46be:824f:88ee:6c4e";
logging-data="1271562"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 24 Jul 2023 14:25 UTC

Paul A. Clayton <paaronclayton@gmail.com> schrieb:

> If I adequately understand the general implementation proposals,
> loop fission/fusion, loop interchange, and loop blocking choices
> could differ based on the targeted microarchitecture (even just L1
> cache size?).

Very much so.

Very highly optimized code such as OpenBLAS sets the blocking
factors according to a table for the CPU it is running on,
to fit into the cache.

It would make sense to be able to query the cache sizes for
this kind of code.
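
A rough sketch of what such a query could look like on Linux with glibc, where `sysconf(_SC_LEVEL1_DCACHE_SIZE)` is available as a non-POSIX extension and may return 0 or -1 when the size is unknown. This is not how OpenBLAS does it (as noted above, it uses per-CPU tables). The helper names, the 32 KiB fallback, and the "three tiles must fit" rule are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>   /* sysconf() */

/* Query the L1 data cache size; returns <= 0 if the platform does not say. */
static long l1d_bytes(void)
{
#ifdef _SC_LEVEL1_DCACHE_SIZE      /* glibc extension, not POSIX */
    return sysconf(_SC_LEVEL1_DCACHE_SIZE);
#else
    return -1;
#endif
}

/* Largest power-of-two tile edge such that three tiles of doubles fit. */
static size_t pick_block(long cache_bytes)
{
    if (cache_bytes <= 0)
        cache_bytes = 32 * 1024;   /* assumed fallback when the size is unknown */
    size_t b = 1;
    while ((b * 2) * (b * 2) * 3 * sizeof(double) <= (size_t)cache_bytes)
        b *= 2;
    return b;
}

/* C += A * B for n x n row-major matrices, tiled with edge bs. */
static void matmul_blocked(const double *A, const double *B, double *C,
                           size_t n, size_t bs)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                for (size_t i = ii; i < n && i < ii + bs; i++)
                    for (size_t k = kk; k < n && k < kk + bs; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < n && j < jj + bs; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Typical use would be `matmul_blocked(A, B, C, n, pick_block(l1d_bytes()))`, with `C` zeroed beforehand.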

Re: Misc: Dealing with variability

<5UvvM.116086$TCKc.64010@fx13.iad>

https://news.novabbs.org/devel/article-flat.php?id=33441&group=comp.arch#33441

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx13.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Misc: Dealing with variability
Newsgroups: comp.arch
References: <u7uvlo$3oi68$1@dont-email.me> <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com> <u9lu30$ltar$1@dont-email.me>
Lines: 22
Message-ID: <5UvvM.116086$TCKc.64010@fx13.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Mon, 24 Jul 2023 14:28:17 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Mon, 24 Jul 2023 14:28:17 GMT
X-Received-Bytes: 1507
 by: Scott Lurndal - Mon, 24 Jul 2023 14:28 UTC

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>On 7/3/23 5:55 PM, MitchAlsup wrote:

>
>[snip]
>> All of the state inside a core can be accessed via memory mapped
>> control registers--including the registers. So, if a core stops talking
>> to the rest of the system, one can go in and look at what the core
>> was doing and the instant of where it stopped--remotely.

The AArch64 Octeon chips have had this feature (architecturally defined)
since day one. It can be accessed by other cores, management processors
or via JTAG.

>
>What is the consistency model for memory-mapped registers both
>within a thread and between cores?

The memory mapped registers are generally used for debugging exceptional
conditions (like a hung core), and are not expected to be used in parallel
with other mechanisms (e.g. MSR/MRS in AArch64).

Re: Misc: Dealing with variability

<0ad7ebe6-1582-41ca-926a-c8bc19ac6f33n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33443&group=comp.arch#33443

X-Received: by 2002:ad4:4e42:0:b0:63c:f853:c8d with SMTP id eb2-20020ad44e42000000b0063cf8530c8dmr437qvb.4.1690214208724; Mon, 24 Jul 2023 08:56:48 -0700 (PDT)
X-Received: by 2002:a05:6830:4595:b0:6b7:528c:d8bf with SMTP id az21-20020a056830459500b006b7528cd8bfmr7060562otb.0.1690214208446; Mon, 24 Jul 2023 08:56:48 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!69.80.99.15.MISMATCH!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 24 Jul 2023 08:56:48 -0700 (PDT)
In-Reply-To: <u9lu30$ltar$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:645e:641:4ffe:7894; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:645e:641:4ffe:7894
References: <u7uvlo$3oi68$1@dont-email.me> <3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com> <u9lu30$ltar$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0ad7ebe6-1582-41ca-926a-c8bc19ac6f33n@googlegroups.com>
Subject: Re: Misc: Dealing with variability
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Mon, 24 Jul 2023 15:56:48 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 105
 by: MitchAlsup - Mon, 24 Jul 2023 15:56 UTC

On Monday, July 24, 2023 at 8:24:53 AM UTC-5, Paul A. Clayton wrote:
> On 7/3/23 5:55 PM, MitchAlsup wrote:
> > On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
> [snip]
> >> Generally, one needs to rebuild the code for each configuration.
> >> Code built for small configurations will perform worse on bigger
> >> configurations. Code built for wide configurations will not run on small
> >> configurations.
> > <
> > This is exactly why my architecture allows for the HW to perform the narrow
> > to wide transformations. There is 1 ISA model for everything from 1-wide
> > in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
> > is within spitting distance of running optimally on the GBOoO.
> I would *guess* that if one cannot spit very far (<1% — at which
> point _measuring_ the difference actually caused by using the same
> binary might be challenging) this implementation constraint might
> hinder exploiting some good cases.
>
> Excluding the creativity effect of constraint, such a constraint
> seems to require a non-negative (likely positive) cost.
<
Very little more than the GBOoO already requires.
>
> The creativity effect (which is one advantage of formal poetry —
> another is audience expectation) realistically applies as design
> effort is not infinite.
>
> If I adequately understand the general implementation proposals,
> loop fission/fusion, loop interchange, and loop blocking choices
> could differ based on the targeted microarchitecture (even just L1
> cache size?).
<
You do remember I have vectorization where each implementation
is allowed to decide how many lanes are present and perform
accordingly.
<
> If a small loop body runs much faster on a small
> implementation, loop fission might be better even with extra store
> and load overhead to communicate with the second loop — presumably
> with blocking to keep the communication within L1 — while a larger
> implementation might run more than 1% faster with the loop not
> split.
>
> Doing more work to achieve higher performance seems a known
> microarchitecture-aware optimization. In addition to loop
> optimizations, higher-level algorithmic options might change which
> binary code expression is "nearly best".
<
Algorithmic optimization is almost always better than compiler
optimization.
>
> While the dynamic scheduling of a GBOoO implementation will tend
> to mute differences, I suspect cases can be contrived with more
> than a 5% performance difference.
>
> [snip]
> > All of the state inside a core can be accessed via memory mapped
> > control registers--including the registers. So, if a core stops talking
> > to the rest of the system, one can go in and look at what the core
> > was doing and the instant of where it stopped--remotely.
<
> What is the consistency model for memory-mapped registers both
> within a thread and between cores?
<
MMI/O space is sequentially consistent.
Configuration space is strongly ordered.
>
> I would guess that within a thread one would want strong
> consistency which seems to require alias detection for every load
> and store. For between threads, strong consistency also seems
> desirable as the cost would typically not be significant (rare
> use?) and the software benefit significant (already complex
> reasoning required so any simplification might be appreciated).
<
The core has to be able to deal with an access at any clock edge,
but the system expects relatively few of these accesses. When
the system is running "normally" only core[k] accesses its
control registers[k]. There are interrupt<ing> tables used to do
IPIs and alert for scheduled tasks--these are in unCacheable
DRAM (which is cacheable in L3).
>
> This also seems to reduce the cost of polling for a receiver
> thread (a register test being cheaper than a cache read and the
> write is probably more "pushed" than "pulled").
>
> (Address-based store "pushing" has been proposed. With last level
> cache home nodes — for data placement and not just coherence
> management — lower level evictions get pushed home, but this is
> slow. Obviously replacement policy and early writeback could
> adjust the timing.)
>
> This might also present an opportunity similar to MONITOR/MWAIT
> while avoiding the issues of using general memory. (MONITOR/MWAIT
> can be one/any-to-many, which has some uses.)

Re: Misc: Dealing with variability

<a01ef743-67ca-4ec5-aecf-4481f3c15649n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33561&group=comp.arch#33561

X-Received: by 2002:a05:6214:b23:b0:635:f3f8:43cc with SMTP id w3-20020a0562140b2300b00635f3f843ccmr11727qvj.10.1691227894217;
Sat, 05 Aug 2023 02:31:34 -0700 (PDT)
X-Received: by 2002:a9d:72c8:0:b0:6b7:528c:d8bf with SMTP id
d8-20020a9d72c8000000b006b7528cd8bfmr2281203otk.0.1691227893888; Sat, 05 Aug
2023 02:31:33 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 5 Aug 2023 02:31:33 -0700 (PDT)
In-Reply-To: <u7uvlo$3oi68$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=93.159.184.237; posting-account=zjh_fgoAAABo0Nzgf6peaFtS6c-3xdgr
NNTP-Posting-Host: 93.159.184.237
References: <u7uvlo$3oi68$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a01ef743-67ca-4ec5-aecf-4481f3c15649n@googlegroups.com>
Subject: Re: Misc: Dealing with variability
From: peceed@gmail.com (pec...@gmail.com)
Injection-Date: Sat, 05 Aug 2023 09:31:34 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 8534
 by: pec...@gmail.com - Sat, 5 Aug 2023 09:31 UTC

On Monday, 3 July 2023 at 19:14:37 UTC+2, BGB wrote:
> One annoyance with my project is that I can't run a core that can run
> exactly the same code along the range of FPGA's I want to use:
> XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
> XC7S50: Currently running a 2-wide configuration with 32 GPRs
>   Reduced features, e.g. no 128-bit ALU ops in this case, ...
> XC7A100T: Can run 3-wide with 64 GPRs, no issue.
>   Can also run 128-bit ALU ops, ...
>
>
>
> Generally, one needs to rebuild the code for each configuration.
> Code built for small configurations will perform worse on bigger
> configurations. Code built for wide configurations will not run on small
> configurations.
>
> There is a lot of "minor stuff" that may break when going between 32 and
> 64 GPRs.
> Running programs built for 64 GPRs on a kernel built in 32 GPR mode will
> tend to explode, ...
>
> I also don't necessarily want to have separate builds of all the
> static libraries for each configuration, ...
>
>
> Though, this is partly inescapable as I have not come up with ideal ways
> to gloss over "sizeof(void *)" and "intptr" and some other issues (well,
> though there is an "__intptr" type whose main property in this case is
> to be the same width as "__sizeof(void *)" and similar).
>
> This part is in an "almost works" category, but code may act buggy or
> break in weird ways if linked against a library compiled for a different
> width (mostly because type-consistency in a lot of the existing codebase
> is a bit hit or miss, and the "gold standard" of using #ifdef for
> everything doesn't entirely work in this case).
>
>
> Though, my wonky compiler architecture has allowed a partial workaround
> for a limited set of cases (BGBCC specific):
> .ifarch feature_expr //ASM
> __declspec(ifarch(feature_expr)) //C declarations
> __ifarch(feature_expr) { ... } //C code
>
> Where, conventional #ifdef only works in the preprocessor stage, which
> can't deal with variability in the final configuration without a full
> recompile.
>
>
> For example, noting that my existing task scheduler code would explode
> with 32 GPRs or without the XMOV (96-bit VAs) extension, I ended up
> needing to add some hackery:
> ...
> taskern=(TKPE_TaskInfoKern *)task->krnlptr;
> memcpy(
>     taskern->ctx_regsave,
>     __arch_isrsave,
>     __ARCH_SIZEOF_REGSAVE__);
> __ifarch(!has_xgpr) //deal with 32 GPRs
> {
>     taskern->ctx_regsave[TKPE_REGSAVE_GBR] =
>         taskern->ctx_regsave[TKPE_REGSAVE_GBR_LO];
>     taskern->ctx_regsave[TKPE_REGSAVE_LR] =
>         taskern->ctx_regsave[TKPE_REGSAVE_LR_LO];
>     taskern->ctx_regsave[TKPE_REGSAVE_SPC] =
>         taskern->ctx_regsave[TKPE_REGSAVE_SPC_LO];
>     taskern->ctx_regsave[TKPE_REGSAVE_EXSR] =
>         taskern->ctx_regsave[TKPE_REGSAVE_EXSR_LO];
> }
> ...
> __ifarch(has_xmov) //only do this if we have XMOV
> {
>     taskern->ctx_regsave[TKPE_REGSAVE_PCH]=__arch_pch;
>     taskern->ctx_regsave[TKPE_REGSAVE_GBH]=__arch_gbh;
> }
> ...
>
> Where, it mostly deals with the 32 GPR config by copying some of the
> registers around so that they are in the correct places for the other
> code (with most of the kernel assuming 64 GPRs, but needing to wrangle
> it around when saving and restoring task contexts).
>
> While __ARCH_SIZEOF_REGSAVE__ looks like a normal preprocessor constant,
> it doesn't actually expand to a constant, but instead an "architecture
> defined variable" (which turns into a constant later on).
>
> Similarly, the "__arch_reg" variables allow exposing the underlying CPU
> control registers as-if they were C variables (though, the exact
> mechanism differs slightly in this case).
>
>
>
> Where, as noted, BGBCC doesn't handle static libraries by compiling
> directly to machine code, but instead first compiles to a stack-oriented
> bytecode format (RIL, along vaguely similar lines to JVM or .NET
> bytecode), which is then translated to 3-address-code and then machine
> code while compiling the final binary.
>
> So, in this case, if the "ifarch" remains flexible in the RIL code, then
> it can be resolved once the RIL is translated to 3AC for the final
> compilation (this is typically also when things like "struct layout" and
> "sizeof(whatever)" are fully sorted out).
>
>
> RIL is kinda messy in terms of its design, and the whole image basically
> needs to be decoded in a linear fashion (it is not a conventional
> structured format, rather the entire image, including all the metadata,
> is expressed as a long linear stream of bytecode ops), but "basically
> works"... Most efforts to work on replacements have tended to stall though.
>
>
> For global declarations, it may cause the "global object" to be
> suppressed, where the compiler pretends as-if it did not exist.
>
> For code-blocks, it is more mundane:
> It is compiled mostly like a normal "if" block, except that the "if()"
> expression is resolved to a constant based on architectural feature
> flags and (may) allow the resulting basic-block to be either included or
> omitted from the binary (though, may not be sufficient to keep the
> compiler from complaining about missing dependencies if the variables or
> functions used only exist along certain branches of the "ifarch" tree).
>
> These are mostly using normal C expression syntax, with some
> restrictions (one can't access normal runtime variables or call
> functions or similar; only use things that will be able to evaluate to a
> constant at compile time).
>
>
>
> But, the final binaries still end up being specific to a particular
> configuration, which is still kind of annoying.
>
> And, getting everything in place so that I can use an Arty S7-50 to
> drive a small robot around is being a lot more effort than ideal (mostly
> because this has led to a *lot* of these types of issues, as a lot of my
> code had implicitly drifted to assumptions of a "bigger" target, and I
> need to scale things back some and get things working again).
>
> Well, and in addition still need to write the robot control code and
> other stuff, still with some uncertainty about the whole "camera module"
> thing (maybe I should just get some IR or ultrasonic distance sensors?...).
>
> Granted, "easy path" in this case would be "just use a RasPi", but alas.
> My recent tinkering with neural nets was also partly related to this
> sub-case, as I had intended to try to use NNs for processing camera
> data, but this still seems like it is "pushing it".
>
> ...
>
>
> Any thoughts?...
It is a fun project, and you are complaining about an excessive amount of fun...
In the real world there are goals and constraints.
You need to specify yours, set priorities, and take a realistic approach.
It is hard to have your cake and eat it too!
Time is your most precious resource.
