Rocksolid Light




Subject (Author)

* Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
`* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (robf...@gmail.com)
 `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
  `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (Peter Lund)
   `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
    `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (robf...@gmail.com)
     `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
      +* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (Peter Lund)
      |+- Re: Misc Tradeoffs: Core I can fit on the XC7S50 (Peter Lund)
      |`* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
      | `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (Peter Lund)
      |  +- Re: Misc Tradeoffs: Core I can fit on the XC7S50 (robf...@gmail.com)
      |  `- Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
      `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (MitchAlsup)
       +* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (Stephen Fuld)
       |`- Re: Misc Tradeoffs: Core I can fit on the XC7S50 (MitchAlsup)
       +* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (Terje Mathisen)
       |`- Re: Misc Tradeoffs: Core I can fit on the XC7S50 (MitchAlsup)
       `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
        +* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (robf...@gmail.com)
        |`- Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
        `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (Peter Lund)
         `* Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)
          `- Re: Misc Tradeoffs: Core I can fit on the XC7S50 (BGB)

Misc Tradeoffs: Core I can fit on the XC7S50

https://news.novabbs.org/devel/article-flat.php?id=32745&group=comp.arch#32745
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Sat, 10 Jun 2023 04:09:42 -0500
Message-ID: <u61ekn$28oq1$1@dont-email.me>

For a project of mine (wanting to use my CPU core to drive a small robot
around), I have decided to try to get the BJX2 core onto an Arty S7-50
(mostly because I have it already), which is using the XC7S50 FPGA.

On the XC7S50, I have hit the limit for the LUT budget with a single
core, so have ended up dropping a bunch of stuff in this
mode/configuration to make it fit.

What does still fit:
3-wide pipeline with 64 GPRs;
Jumbo ops;
48-bit virtual address space;
TLB (mostly);
FPU (sorta);
Low-precision FP-SIMD.

What was dropped:
RISC-V alt-mode;
LDTEX and some other mostly TKRA-GL specific features;
(For now) XG2 Mode (needs more evaluation);
64-bit multiply, and DIV/REM (Shift-Add unit dropped);
Bounds checking ops;
2x40b bundles;
Support for FPU immediate values;
...

May be dropped:
Ability to use shift ops from Lane 3;
FMOV.S / FMOV.H

Debated dropping to 32 GPRs, but this doesn't make a huge difference WRT
LUT cost (the cost difference between LUT5 and LUT6 being "not
particularly large" in this case).

For the TLB, had dropped VUGID and ACLID support, so only traditional
User/Supervisor protection is available.

I am debating whether to drop FMOV.S / FMOV.H, as these cost LUTs and
can fall back to the older 2-op-sequence approach.

For the main FPU, dropped its ability to handle SIMD ops itself, meaning
that only the LP-FPU can do SIMD in this case. Implicitly, this means
that 2x Binary64 SIMD can't be done on this core. The main FPU is then
only usable from Lane 1, and only does scalar Binary64 ops.

Had temporarily disabled the FPU's ability to round results, but may
re-enable this as it didn't affect cost that much.

Though, I guess I can note that the XC7S50 isn't as limited as the
XC7S25 (which would basically require falling back to a scalar design
with no FPU).

Granted, maybe there is someone around who can fit a 3-wide 64-bit VLIW
with FP-SIMD and similar on an XC7S25.

Timing: Passes fairly easily in this case.

Any thoughts?...

Re: Misc Tradeoffs: Core I can fit on the XC7S50

https://news.novabbs.org/devel/article-flat.php?id=32751&group=comp.arch#32751
From: robfi680@gmail.com (robf...@gmail.com)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Sat, 10 Jun 2023 09:14:50 -0700 (PDT)
Message-ID: <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
In-Reply-To: <u61ekn$28oq1$1@dont-email.me>

On Saturday, June 10, 2023 at 5:09:47 AM UTC-4, BGB wrote:
> For a project of mine (wanting to use my CPU core to drive a small robot
> around), I have decided to try to get the BJX2 core onto an Arty S7-50
> (mostly because I have it already), which is using the XC7S50 FPGA.
> [rest of original post snipped]

I am a bit surprised that the 64-bit BJX2 would fit in an XC7S50. 64 bits
might be overkill unless it is the main brains and you want to do some
high-powered AI. I made a 32-bit RISCV system that fit easily into an
XC7S35 using the CModA7. The CModA7 is smaller and a bit less
expensive. If you end up using multiple boards for the robot, a less
expensive board would reduce costs. I heard of an engineer who made
a robot using about 30 6502 SBCs. Have you seen the ZUBoard? It has
an ARM processor on it and is priced similarly to the Arty.

Years ago, I tried to make a small robot with a 68008 on a small
wire-wrapped board with all-CMOS logic. I never got much past spinning
the wheels a few times. I used modified model-airplane servos for the
drive motors; hacked apart, they rotate 360 degrees and have feedback
logic built into them.

Re: Misc Tradeoffs: Core I can fit on the XC7S50

https://news.novabbs.org/devel/article-flat.php?id=32753&group=comp.arch#32753
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Sat, 10 Jun 2023 14:04:16 -0500
Message-ID: <u62hfj$2d0bh$1@dont-email.me>
In-Reply-To: <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>

On 6/10/2023 11:14 AM, robf...@gmail.com wrote:
> On Saturday, June 10, 2023 at 5:09:47 AM UTC-4, BGB wrote:
>> [original post snipped]
>
> I am a bit surprised that that BJX2 64-bits would fit in an XC7S50. 64-bits
> might be overkill unless it is the main brains and you want to do some
> high powered AI.

It has needed a little bit of fiddling to get the LUT cost down...

A lot of the stuff I disabled in this mode shaved around 12k LUTs off
the CPU core.

I am currently using 93% of the LUTs, but it "still technically fits",
and timing is looking pretty OK at the moment (~2.0 ns of WNS).

I was trying to tune the configuration such that I can still run neural
nets at OK speeds (mostly because I still want to be able to run some
computer-vision tasks and similar).

These cases are basically running code that is almost a solid wall of
Binary16 SIMD ops and similar.

Support for FP-IMM and combined SIMD+Shuffle could have been nice, but
these are a bit of a strain for LUT budget.

Or, basically:
PMULSH.H Rm, Imm48sh8, Rn

Combined packed Binary16 MUL + Shuffle + 4x S.E5.F6 immediates eats too
many LUTs for the XC7S50.
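For scale, an S.E5.F6 value (1 sign, 5 exponent, 6 fraction bits; four of them fill the 48-bit immediate) could plausibly be expanded to Binary16 (S.E5.F10) by zero-padding the fraction. A sketch; the exact bit layout is an assumption, not the actual BJX2 encoding:

```python
def e5f6_to_binary16(imm12: int) -> int:
    """Expand a hypothetical 12-bit S.E5.F6 immediate to a 16-bit
    IEEE Binary16 bit pattern (S.E5.F10)."""
    sign = (imm12 >> 11) & 0x1
    exp  = (imm12 >> 6) & 0x1F   # 5-bit exponent, same width as Binary16
    frac = imm12 & 0x3F          # 6-bit fraction -> top bits of the 10-bit field
    return (sign << 15) | (exp << 10) | (frac << 4)

# 1.0 in S.E5.F6 is sign=0, exp=15 (bias 15), frac=0 -> 0x3C0
assert e5f6_to_binary16(0x3C0) == 0x3C00   # 0x3C00 is 1.0 in Binary16
```

Since the exponent field width and bias match Binary16, the expansion loses nothing; only values needing more than 6 fraction bits are unrepresentable as immediates.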

As-is, the configuration can still run 128-bit 4x Binary32 SIMD ops.
But, for what I want to do with this, I only really technically need
Binary16 SIMD.

Had otherwise recently written a module that can output either H-Bridge
PWM pulses (which are probably what will be used in this case), or 1-2 ms
servo pulses (could be useful if I want to make use of RC servos).

So, the external configuration currently looks like:
  16x GPIO pins, currently routed to the PMOD connectors;
  8x PWM pins, currently mapped to the "shield connector" pins.
    All of these will currently be needed for the wheels though;
    each H-Bridge needs 2 pins to control.
    Could maybe add a second H-Bridge module,
    which could be available for 8x servo PWMs.
  SDcard SPI is currently routed to the SPI pins on the Arty.
    The Arty S7 lacks a built-in SDcard slot (annoying).
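As a rough software model of the two pulse modes (the clock frequency and counter mapping here are assumptions for illustration, not the actual module's parameters):

```python
def pwm_compare(duty: float, period_cycles: int) -> int:
    """H-bridge mode: map a 0..1 duty cycle onto a free-running
    counter's compare value; output is high while counter < compare."""
    return int(duty * period_cycles)

def servo_compare(pulse_ms: float, clk_hz: int = 50_000_000) -> int:
    """RC-servo mode: a 1..2 ms high pulse, repeated in a ~20 ms frame;
    1.5 ms is center, 1.0/2.0 ms are the endpoints."""
    assert 1.0 <= pulse_ms <= 2.0
    return int(pulse_ms * 1e-3 * clk_hz)

# At an assumed 50 MHz clock, a 1.5 ms center pulse is 75000 cycles
assert servo_compare(1.5) == 75000
```

Driving an H-bridge from two such PWM pins (one per half-bridge side) gives direction control as well as speed, which matches the 2-pins-per-wheel accounting above.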

Still fiddling with stuff though...

> I made a 32-bit RISCV system that fit easily into a
> XC7S35 using the CModA7. The CModA7 is smaller and a bit less
> expensive. If you end up using multiple boards for the robot a less
> expensive board would reduce costs. I heard of an engineer who made
> a robot using about 30 6502 SBCs. Have you seen the ZUBoard? It has
> an ARM processor on it, and is priced similar to the Arty.
>

I have the CMod-S7 with an XC7S25, but this is only really sufficient
for microcontroller-like cores (also, its lack of a RAM chip severely
hurts its utility here for anything much beyond "small microcontroller"
tasks).

The CMod-A7 would be a little better, but the XC7A35T is likely not
sufficient to run a 3-wide VLIW core. Even if it did, the 512K RAM
module would be a bit more limited (vs a 128MB or 256MB RAM chip on most
of the larger boards).

For as much as the CMod-A7 costs, could almost get some more of the
QMTECH boards.

But, OTOH, throwing an XC7A100T or XC7A200T board at the problem would
be kinda overkill. Nevermind some "other annoyances" that may be an
issue for trying to use a QMTECH board for this.

For as much as the CMod boards would limit things, one would almost be
better off just throwing a RasPi Pico or similar at the problem...

There was previously an Arty A7-35T board (with an XC7A35T), but Digilent
seems to have dropped it (so now only the more expensive XC7A100T
variant remains).

For an XC7A35T, could maybe do a 1-wide configuration with FPU though.
128-bit SIMD would be no-go in this case, but could potentially still
support 4x Binary16 SIMD.

As noted, one annoyance with BJX2 is that code built for a 3-wide core
will not run on a 1-wide core (though code built for 1-wide will still
generally run on 3-wide).

At present, this mostly leaves the Arty S7-50, which I also happen to
have already (and this board wasn't seeing a whole lot of use otherwise).

I don't really need a 48-bit address space for this use case, but this
is more of a "meh, whatever" thing. Did sort of want FP-SIMD though.

Theoretically, could have used a Cortex-M based microcontroller, but
these tend to "suck real hard" at floating-point use-cases (like, even
running at 250MHz or similar isn't enough to offset the typically dismal
floating-point performance).

I would generally like to be able to have some form of wireless
communication, but the specifics of this are still unclear.

Don't really want to use a RasPi, more so as RasPis tend to be
kinda heavy-weight when trying to run them off AA or AAA batteries (the
robot isn't all that large, so the current plan is to go with NiMH AA's
for the main battery pack).

But, may still need to have a RasPi running on a desk to communicate
with the robot (since my BJX2 project still falls a bit short of what is
needed to be able to communicate over WiFi).

Something like wireless RS232 would have been fairly convenient, but alas.

> Years ago, I tried to make a small robot with a 68008 on a small
> wire wrapped board with all CMOS logic. Never got much past spinning
> the wheels a few times. Used modified model airplane servos for the
> drive motors. Hacked apart they rotate 360 degrees and have feedback
> logic built into them.
>

I had bought a frame that came with some small PMDC gear-motors and
mecanum wheels (I suspect the wheels were the main expensive part).

The frame cost not much more than the wheels by themselves, but the
manufacturing had some obvious signs of "cheapness".

Otherwise, I would have probably used some NEMA-17 steppers or similar
as the drive motors (but would have needed to machine the frame myself).

Some years ago, I did build a previous small robot using power-drill
motors and a lot of 3D-printed plastic and cardboard, using a RasPi as
the controller. In that case, I was using hand-wired H-Bridge drivers.

Had used a lead-acid UPS battery for running this one (with a smaller AA
based backup-pack for the RasPi, which could keep it running if the
bigger lead-acid battery was disconnected).

In this case, the current plan is to run everything off an AA pack, and
use some "cheapish" off-the-shelf H-Bridge drivers (L298N based,
theoretically sufficient for the motors).

While running the motors off AA's is pushing it a little, the NiMH AA's
can push a fair bit of current (it does require beefing up the wires for
the battery holders though, as the de-facto standard 24ga
copper-clad-steel can't handle this, and starts "cooking" if one pulls
much over 100-200 mA).


Re: Misc Tradeoffs: Core I can fit on the XC7S50

https://news.novabbs.org/devel/article-flat.php?id=32788&group=comp.arch#32788
From: peterfirefly@gmail.com (Peter Lund)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Mon, 12 Jun 2023 04:57:31 -0700 (PDT)
Message-ID: <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
In-Reply-To: <u62hfj$2d0bh$1@dont-email.me>

On Saturday, June 10, 2023 at 9:04:23 PM UTC+2, BGB wrote:
> Theoretically, could have used a Cortex-M based microcontroller, but
> these tend to "suck real hard" at floating-point use-cases (like, even
> running at 250MHz or similar isn't enough to offset the typically dismal
> floating-point performance).

Cortex-M0/M0+/M1 do indeed suck at fp. M4F has hardware fp so it should be ok. Even M3 will almost certainly be ok, given that it has single-cycle shifts and single-cycle counting of leading zeros.
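Why the single-cycle CLZ matters for soft-float: renormalizing a result mantissa needs the leading-zero count, which otherwise costs a bit-by-bit loop. A minimal sketch of the idea (not actual ARM runtime code; the bit-23 target assumes a Binary32-style 24-bit mantissa):

```python
def clz32(x: int) -> int:
    """Count leading zeros of a 32-bit value
    (a single instruction on Cortex-M3, a loop on M0/M0+)."""
    return 32 - x.bit_length()

def normalize(mant: int, exp: int):
    """Shift a nonzero result mantissa so its leading 1 lands at bit 23
    (the Binary32 hidden-bit position), adjusting the exponent to match."""
    shift = clz32(mant) - 8  # 8 leading zeros means the bit is already at 23
    if shift >= 0:
        return mant << shift, exp - shift
    return mant >> -shift, exp - shift
```

With CLZ the whole renormalization collapses to a count, a subtract, and one variable shift, which is the difference Peter is pointing at.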

-Peter

Re: Misc Tradeoffs: Core I can fit on the XC7S50

https://news.novabbs.org/devel/article-flat.php?id=32798&group=comp.arch#32798
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Mon, 12 Jun 2023 13:09:02 -0500
Message-ID: <u67n00$36dql$1@dont-email.me>
In-Reply-To: <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>

On 6/12/2023 6:57 AM, Peter Lund wrote:
> On Saturday, June 10, 2023 at 9:04:23 PM UTC+2, BGB wrote:
>> Theoretically, could have used a Cortex-M based microcontroller, but
>> these tend to "suck real hard" at floating-point use-cases (like, even
>> running at 250MHz or similar isn't enough to offset the typically dismal
>> floating-point performance).
>
> Cortex-M0/M0+/M1 do indeed suck at fp. M4F has hardware fp so it should be ok. Even M3 will almost certainly be ok, given that it has single-cycle shifts and single-cycle counting of leading zeros.
>

I was wanting to be able to run neural nets, which kinda depend "a lot"
on floating-point performance.

Despite the relatively low clock-speed of the BJX2 core, having 3L/1T
FP-SIMD ops allows for "relatively decent" performance (when the NNs are
turned into a big blob of ASM).
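The 3L/1T point (3-cycle latency, 1-cycle throughput) can be made concrete: once the pipeline is full, peak rate is set by issue throughput, not latency. A back-of-envelope sketch; the clock frequency is an assumed placeholder, not a stated BJX2 figure:

```python
def peak_simd_rate(lanes: int, clk_hz: float, throughput_cycles: int = 1) -> float:
    """Peak FP ops/sec of a pipelined SIMD unit: the 3-cycle latency
    drops out; only how often an op can be issued matters."""
    return lanes * clk_hz / throughput_cycles

# e.g. 4x Binary16 lanes issuing every cycle at an assumed 50 MHz
assert peak_simd_rate(4, 50e6) == 200e6  # 200M FP ops/sec peak
```

This is why a low-clocked core with 1-cycle-throughput FP-SIMD can stay "relatively decent" on NN inner loops against a much higher-clocked scalar FPU.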

But, yeah, for Cortex-M, was mostly thinking of the M0/M0+ and similar...

A previous micro-benchmark for an NN showed it as being "relatively
comparable" to an early 2000s laptop (using x87).

Granted, both are also beat out at this by an original 700MHz/ARM11
RasPi (with pretty much all the newer RasPi's being clearly superior on
this front, *).

Could technically have just run the NNs on a RasPi, but, ...
This was more a case of me wanting to use my own stuff, vs just throwing
a RasPi at the problem.

Looks like a Cortex-M4F or Cortex-M7 could also work (might need to
compare against a RasPi or similar; not sure how much RAM is typical for
them, or if they can handle dual camera inputs, ...).

*: Despite the CPU in the laptop having better looking on-paper stats
(both higher clock-speed and OoO vs in-order), general performance was
fairly comparable to a RasPi2.

In terms of both floating-point and memcpy benchmarks, the RasPi2 is
significantly ahead (the RasPi2 also has more RAM, and typical "cheap"
modern SDcards are bigger than the laptop's HDD, ...).
Can't really put in a newer HDD because the laptop uses a parallel ATA
variant (and no one made significantly larger PATA drives).

In other news:
I recently added a hack that allows the 32-bit MUL unit in the BJX2 core
to handle a certain range of DIVS.L and DIVU.L instructions (it signals
to the Shift-Add divider that it has handled the op, and the divider
skips doing anything; when the signal arrives at the EX3 stage, EX3
forwards the multiplier's result instead of the divider's).

It works if the source value is positive and the divisor is a small
positive integer that it can handle via a lookup table.

Currently it can only deal with dividing positive and unsigned values,
as using it on negative values gives "round towards negative infinity"
behavior rather than "round towards zero" (the required sign-flipping
would wreck the timing constraints), consequently breaking some of the
"sanity checks".

While the compiler does this transformation itself for
divide-by-constant, this hack mostly covers division at runtime.

But, it is still a bit of an ugly hack...
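The underlying trick is a multiply by a fixed-point reciprocal pulled from a per-divisor table; a sketch of why it works for non-negative dividends but rounds the wrong way for negative ones (the table width and constants here are illustrative, not BJX2's actual parameters; Python's arithmetic right shift happens to reproduce the floor behavior described):

```python
def div_by_recip(x: int, d: int, bits: int = 32) -> int:
    """Divide by a small positive d via multiply-by-reciprocal:
    recip = floor(2^bits / d) + 1 is the sort of entry a small
    lookup table indexed by d could hold."""
    recip = (1 << bits) // d + 1
    return (x * recip) >> bits  # arithmetic shift: floors toward -infinity

# Matches truncating division for non-negative x:
assert div_by_recip(100, 7) == 14
# Negative x floors instead of truncating toward zero:
assert div_by_recip(-100, 7) == -15  # C-style truncation would give -14
```

Fixing the negative case needs a sign-flip before and after the multiply (or a +1 correction), which is exactly the extra logic said to wreck the timing constraints.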

Modeling this behavior in my emulator seems able to push Dhrystone past
the 80k mark (for a long while, it was stuck at ~75k). Granted, a
recent special case for "long long" multiply pushed it from 75k to 77k,
and the divider tweak pushed it the rest of the way.

Granted, it is a bit of a hack, as it changes DIVS.L timing from a fixed
36 cycles to 3 cycles, with a 33-cycle penalty case.

....

Re: Misc Tradeoffs: Core I can fit on the XC7S50

https://news.novabbs.org/devel/article-flat.php?id=32808&group=comp.arch#32808
From: robfi680@gmail.com (robf...@gmail.com)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Mon, 12 Jun 2023 22:44:17 -0700 (PDT)
Message-ID: <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
In-Reply-To: <u67n00$36dql$1@dont-email.me>

On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
> On 6/12/2023 6:57 AM, Peter Lund wrote:
> > On Saturday, June 10, 2023 at 9:04:23 PM UTC+2, BGB wrote:
> >> Theoretically, could have used a Cortex-M based microcontroller, but
> >> these tend to "suck real hard" at floating-point use-cases (like, even
> >> running at 250MHz or similar isn't enough to offset the typically dismal
> >> floating-point performance).
> >
> > Cortex-M0/M0+/M1 do indeed suck at fp. M4F has hardware fp so it should be ok. Even M3 will almost certainly be ok, given that it has single-cycle shifts and single-cycle counting of leading zeros.
> >
> I was wanting to be able to run neural nets, which kinda depend "a lot"
> on floating-point performance.

Could you use fixed-point? I based my neural-net accelerator on 16.16
fixed point. I have not checked the size, but I think modelling a small
number of neurons is possible. I used a word-serial approach, with
tables occupying two BRAMs per neuron. They can then be built up into
larger networks in software, relying on the neuron accelerator rather
than a pure software implementation.
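The table-driven accelerator itself isn't shown here, but the core 16.16 arithmetic for a single neuron could be sketched as follows (the ReLU activation is an assumption for illustration):

```python
FRAC = 16  # 16.16 fixed point: 16 integer bits, 16 fraction bits

def to_fix(x: float) -> int:
    """Convert a float to 16.16 fixed point."""
    return int(round(x * (1 << FRAC)))

def fix_mul(a: int, b: int) -> int:
    """16.16 multiply: full-width product, then drop 16 fraction bits."""
    return (a * b) >> FRAC

def neuron(weights, inputs, bias):
    """One neuron: fixed-point dot product plus bias, ReLU activation
    (the activation choice is illustrative, not the accelerator's)."""
    acc = bias
    for w, x in zip(weights, inputs):
        acc += fix_mul(w, x)
    return max(acc, 0)
```

With ~16 fractional bits of precision, this is typically plenty for small inference-only networks, and it maps onto integer multipliers/BRAMs instead of an FPU.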

>
> Despite the relatively low clock-speed of the BJX2 core, having 3L/1T
> FP-SIMD ops allows for "relatively decent" performance (when the NNs are
> turned into a big blob of ASM).
>
> But, yeah, for Cortex-M, was mostly thinking of the M0/M0+ and similar...
>
>
>
> A previous micro-benchmark for an NN showed it as being "relatively
> comparable" to an early 2000s laptop (using x87).
>
> Granted, both are also beat out at this by an original 700MHz/ARM11
> RasPi (with pretty much all the newer RasPi's being clearly superior on
> this front, *).
>
> Could technically have just run the NNs on a RasPi, but, ...
> This was more a case of me wanting to use my own stuff, vs just throwing
> a RasPi at the problem.
>
> Looks like an Cortex-M4F or Cortex-M7 could also work (might need to
> compare against a RasPi or similar; not sure how much RAM is typical for
> them, or if they can handle dual camera inputs, ...).
>
>
> *: Despite the CPU in the laptop having better looking on-paper stats
> (both higher clock-speed and OoO vs in-order), general performance was
> fairly comparable to a RasPi2.
>
> In terms of both floating-point and memcpy benchmarks, the RasPi2 is
> significantly ahead (also the RasPi2 has more RAM as well, typical
> "cheap" modern SDcards are also bigger than the laptop's HDD, ...).
> Can't really put in a newer HDD because the laptop uses a parallel ATA
> variant (and no one made significantly larger PATA drives).
>
>
>
> In other news:
> I recently added a hack that allows the 32-bit MUL unit in the BJX2 core
> to handle a certain range of DIVS.L and DIVU.L instructions (it will
> signal to the Shift-Add divider that it has handled it, and the divider
> will skip doing anything; when the signal arrives to the EX3 stage, EX3
> will forward the multiplier's result instead of the divider's).
>
> It works if the source value is positive, and the divisor is a small
> positive integer which it can turn it into a lookup table.
>
> Currently only can deal with dividing positive and unsigned values, as
> trying to use it on negative values gives "round towards negative
> infinity" behavior rather than "round towards zero" (and the required
> sign-flipping would wreck the timing constraints); and consequently
> breaking some of the "sanity checks".
>
> While the compiler does this transformation itself for
> divide-by-constant, this hack mostly covers division at runtime.
>
> But, it is still a bit of an ugly hack...
>
>
> Modeling this behavior in my emulator seems able to push Dhrystone past
> the 80k mark (for a long while, it was stuck at ~ 75k). Granted, a
> recent special case to "long long" multiply pushed it from 75k to 77k,
> and the divider tweak pushed it the rest of the way.
>
> Granted, it is a bit of a hack, as it changes DIVS.L timing from a fixed
> 36-cycle timing, to 3 cycle with a 33-cycle penalty case.
>
> ...
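
For reference, the lookup-table reciprocal trick described above can be modeled in C roughly as follows. This is a sketch only: the table size, bounds, and function names are illustrative assumptions, not the actual BJX2 hardware.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SMALL_DIV 16

/* Fixed-point reciprocals: recip_tab[d] = ceil(2^32 / d), for d >= 2. */
static uint32_t recip_tab[MAX_SMALL_DIV + 1];

static void init_recip_tab(void)
{
    for (uint32_t d = 2; d <= MAX_SMALL_DIV; d++)
        recip_tab[d] = (uint32_t)((((uint64_t)1 << 32) + d - 1) / d);
}

/* x / d via one multiply and a shift: exact for 0 <= x < 2^24 and
   1 <= d <= 16 (the error term x*e/(d*2^32), with e < d, stays well
   below the gap to the next integer in that range). */
static uint32_t udiv_small(uint32_t x, uint32_t d)
{
    if (d == 1)
        return x;   /* ceil(2^32/1) does not fit in a 32-bit table entry */
    return (uint32_t)(((uint64_t)x * recip_tab[d]) >> 32);
}
```

For a negative dividend this multiply yields floor(x/d) rather than truncation toward zero, which matches the round-toward-negative-infinity limitation mentioned in the post.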

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u69ani$3ho4r$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32815&group=comp.arch#32815

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Tue, 13 Jun 2023 03:52:00 -0500
Organization: A noiseless patient Spider
Lines: 208
Message-ID: <u69ani$3ho4r$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 13 Jun 2023 08:52:03 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="d35fdf76e1550127dbdbe14c1d6b409e";
logging-data="3727515"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19k8Dgf3A2GtbVj8vmPH+IZ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
Cancel-Lock: sha1:rXZbAXRh/IM3DXYIB8/+E3AZb/A=
In-Reply-To: <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 13 Jun 2023 08:52 UTC

On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>> On 6/12/2023 6:57 AM, Peter Lund wrote:
>>> On Saturday, June 10, 2023 at 9:04:23 PM UTC+2, BGB wrote:
>>>> Theoretically, could have used a Cortex-M based microcontroller, but
>>>> these tend to "suck real hard" at floating-point use-cases (like, even
>>>> running at 250MHz or similar isn't enough to offset the typically dismal
>>>> floating-point performance).
>>>
>>> Cortex-M0/M0+/M1 do indeed suck at fp. M4F has hardware fp so it should be ok. Even M3 will almost certainly be ok, given that it has single-cycle shifts and single-cycle counting of leading zeros.
>>>
>> I was wanting to be able to run neural nets, which kinda depend "a lot"
>> on floating-point performance.
>
> Could you use fixed-point? I based my neural net accelerator on 16.16 fixed
> point. I have not checked the size, but I think modelling a small number of
> neurons is possible. I used a word-serial approach and tables occupying
> two BRAMs per neuron. Then they can be built up into larger networks using
> software. Relying on a neuron accelerator rather than a pure software
> implementation.
>

Early on, I had considered 8.8 fixed-point, but Binary16 has a larger
dynamic range and can give better results. Similarly, there is no
(particular) performance advantage of fixed-point over floating-point in
this case (while the FP-SIMD ops have a higher latency, there is enough
going on that it is possible to schedule the instructions such that full
throughput is achieved).

The "PMULSH.H" instruction would have been useful:
* FE-FE F2nm_9qjj PMULSH.H Rm, Imm48fv8sh, Rn

But, sadly, this is asking a little too much of the LUT budget on the
XC7S50... Though, this instruction was designed specifically for this
use-case.

It does make a difference in that without this instruction, it makes
sense to compute roughly 4 neurons in parallel (with a lot of cycles
being spent on 64-bit constant loads), whereas to achieve full
throughput with this instruction, it makes more sense to calculate 16
neurons in parallel. But, then this also basically needs the XGPR
extension to avoid running out of registers.

This scenario pushes roughly 150 MFLOP/s at 50 MHz (vs only around 110
MFLOP/s with separate constant-loads and shuffles).

For Binary32 SIMD, the situation would be a little worse, not because of
a limitation of the SIMD unit, but rather because this precludes being
able to run shuffle ops in parallel with SIMD ops.

If I could pull off the "PMACSH.H" instruction which does FMAC instead,
this would allow a further speedup (saving all the clock-cycles that
would have been spent on PADD.H instructions). But, this would add other
issues (it being unclear if I can shove a Binary16 FMAC into a 3-cycle
latency...).

Though, unlike many other cases, rather than being D$ limited, this sort
of code puts a lot more strain on the I$ instead (with a comparably high
I$ miss rate).

Theoretically, the SIMD unit could go up to 200 MFLOP/s at 50 MHz, but
achieving this is unrealistic.

Some of the nets I had tested with were:
64 inputs;
3 layers of 32 neurons;
4 outputs.

So, say, roughly 3.2k weights + 100 biases.

For something like depth inference, one would pull, say, 32 pixels from
each image (in a roughly overlapping section), then tile the net across
the scanline (to generate a low-res Z map).
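
A plain-C model of one layer of such a net would look something like the following (the weight layout and ReLU activation are assumptions for illustration; the real code is generated ASM blobs):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of one fully-connected layer: n_out
   neurons, each a bias plus a dot product over n_in inputs, using
   the same MAC pattern as the unrolled ASM. */
static void layer_forward(const float *in, size_t n_in,
                          const float *w, const float *b,
                          float *out, size_t n_out)
{
    for (size_t j = 0; j < n_out; j++) {
        float a = b[j];
        for (size_t i = 0; i < n_in; i++)
            a += in[i] * w[j * n_in + i];
        out[j] = a > 0.0f ? a : 0.0f;   /* ReLU, assumed */
    }
}
```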

>>
>> Despite the relatively low clock-speed of the BJX2 core, having 3L/1T
>> FP-SIMD ops allows for "relatively decent" performance (when the NNs are
>> turned into a big blob of ASM).
>>
>> But, yeah, for Cortex-M, was mostly thinking of the M0/M0+ and similar...
>>
>>
>>
>> A previous micro-benchmark for an NN showed it as being "relatively
>> comparable" to an early 2000s laptop (using x87).
>>

For context, despite the laptop's clock-speed advantage, it was
hard-pressed to win with the x87.

Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).

Also, generated ASM blobs vs "more generic" C.

Whole lot of:
a0=a0+va[0]*wv[0];
a1=a1+va[1]*wv[1];
a2=a2+va[2]*wv[2];
a3=a3+va[3]*wv[3];
...

Like, doesn't seem like it is asking "that" much?...

But, like, the clock-speed disparity is big enough that one would not
expect a 50 MHz core to have any chance against 1.47 GHz on any metric
(regardless of the ISA).

Granted, maybe it would make sense to compare against an SSE
implementation, since this is closer to "apples to apples".
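
For such a comparison, the 4-wide lane-parallel accumulation that an SSE (or FP-SIMD) version would perform can be sketched portably as follows (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Portable sketch of a 4-wide accumulate: four running sums, one per
   lane, combined at the end, with a scalar tail for leftovers.  This
   illustrates the structure an SSE version would have; it is not
   actual SSE intrinsics code. */
static float dot4(const float *va, const float *wv, size_t n)
{
    float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {   /* one "SIMD" step per 4 elements */
        acc[0] += va[i + 0] * wv[i + 0];
        acc[1] += va[i + 1] * wv[i + 1];
        acc[2] += va[i + 2] * wv[i + 2];
        acc[3] += va[i + 3] * wv[i + 3];
    }
    float s = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++)                  /* scalar tail */
        s += va[i] * wv[i];
    return s;
}
```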

>> Granted, both are also beat out at this by an original 700MHz/ARM11
>> RasPi (with pretty much all the newer RasPi's being clearly superior on
>> this front, *).
>>

With a RasPi pushing a bit more with similar code...

>> Could technically have just run the NNs on a RasPi, but, ...
>> This was more a case of me wanting to use my own stuff, vs just throwing
>> a RasPi at the problem.
>>
>> Looks like an Cortex-M4F or Cortex-M7 could also work (might need to
>> compare against a RasPi or similar; not sure how much RAM is typical for
>> them, or if they can handle dual camera inputs, ...).
>>
>>
>> *: Despite the CPU in the laptop having better looking on-paper stats
>> (both higher clock-speed and OoO vs in-order), general performance was
>> fairly comparable to a RasPi2.
>>

Eg: "Mobile Athlon XP 1700+"

On paper, it seems like it should be pretty solid; solid enough to
stomp a RasPi or RasPi2. But, my attempts at testing seem to disagree.

>> In terms of both floating-point and memcpy benchmarks, the RasPi2 is
>> significantly ahead (also the RasPi2 has more RAM as well, typical
>> "cheap" modern SDcards are also bigger than the laptop's HDD, ...).
>> Can't really put in a newer HDD because the laptop uses a parallel ATA
>> variant (and no one made significantly larger PATA drives).
>>

I am less sure about the RasPi2's memcpy speed advantage.

A lot apparently comes down to the (unknown) detail of the RAM speed and
bus width and similar on the RasPi.

Having gone and looked at the specs, PC-2100 shouldn't be at that much
of a disadvantage... (Lower MT/s, but should have a wider bus width.)
The laptop is equipped with 2x 256MB modules.

Will note though that my past benchmark attempts seem to fall well short
of the theoretical bandwidth (then again, it seems like in my attempts
at memcpy benchmarks, I never see numbers anywhere near the theoretical
bandwidth values).
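
A minimal sketch of the kind of memcpy benchmark in question (buffer size and repetition count are arbitrary; a buffer smaller than the caches measures the cache rather than RAM bandwidth):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy a buffer repeatedly and report MB/s.  Uses clock(), so this
   measures CPU time; resolution limits apply for very short runs. */
static double memcpy_mbps(size_t bytes, int reps)
{
    char *src = malloc(bytes), *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return 0.0; }
    memset(src, 0x55, bytes);   /* touch the pages before timing */

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, bytes);
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double mbps = secs > 0.0 ? ((double)bytes * reps) / (secs * 1e6) : 0.0;
    free(src);
    free(dst);
    return mbps;
}
```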

>>
>>
>> In other news:
>> I recently added a hack that allows the 32-bit MUL unit in the BJX2 core
>> to handle a certain range of DIVS.L and DIVU.L instructions (it will
>> signal to the Shift-Add divider that it has handled it, and the divider
>> will skip doing anything; when the signal arrives to the EX3 stage, EX3
>> will forward the multiplier's result instead of the divider's).
>>
>> It works if the source value is positive, and the divisor is a small
>> positive integer which it can turn it into a lookup table.
>>
>> Currently only can deal with dividing positive and unsigned values, as
>> trying to use it on negative values gives "round towards negative
>> infinity" behavior rather than "round towards zero" (and the required
>> sign-flipping would wreck the timing constraints); and consequently
>> breaking some of the "sanity checks".
>>
>> While the compiler does this transformation itself for
>> divide-by-constant, this hack mostly covers division at runtime.
>>
>> But, it is still a bit of an ugly hack...
>>
>>
>> Modeling this behavior in my emulator seems able to push Dhrystone past
>> the 80k mark (for a long while, it was stuck at ~ 75k). Granted, a
>> recent special case to "long long" multiply pushed it from 75k to 77k,
>> and the divider tweak pushed it the rest of the way.
>>
>> Granted, it is a bit of a hack, as it changes DIVS.L timing from a fixed
>> 36-cycle timing, to 3 cycle with a 33-cycle penalty case.
>>
>> ...

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32826&group=comp.arch#32826

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border-2.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jun 2023 11:00:36 -0700 (PDT)
In-Reply-To: <u69ani$3ho4r$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=80.62.116.10; posting-account=iwcJjQoAAAAIecwT8pOXxaSOyiUTZMJr
NNTP-Posting-Host: 80.62.116.10
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: peterfirefly@gmail.com (Peter Lund)
Injection-Date: Tue, 13 Jun 2023 18:00:38 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 261
 by: Peter Lund - Tue, 13 Jun 2023 18:00 UTC

On Tuesday, June 13, 2023 at 10:52:07 AM UTC+2, BGB wrote:
> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
> > On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
> >> On 6/12/2023 6:57 AM, Peter Lund wrote:
> >>> On Saturday, June 10, 2023 at 9:04:23 PM UTC+2, BGB wrote:
> >>>> Theoretically, could have used a Cortex-M based microcontroller, but
> >>>> these tend to "suck real hard" at floating-point use-cases (like, even
> >>>> running at 250MHz or similar isn't enough to offset the typically dismal
> >>>> floating-point performance).
> >>>
> >>> Cortex-M0/M0+/M1 do indeed suck at fp. M4F has hardware fp so it should be ok. Even M3 will almost certainly be ok, given that it has single-cycle shifts and single-cycle counting of leading zeros.
> >>>
> >> I was wanting to be able to run neural nets, which kinda depend "a lot"
> >> on floating-point performance.
> >
> > Could you use fixed-point? I based my neural net accelerator on 16.16 fixed
> > point. I have not checked the size, but I think modelling a small number of
> > neurons is possible. I used a word-serial approach and tables occupying
> > two BRAMs per neuron. Then they can be built up into larger networks using
> > software. Relying on a neuron accelerator rather than a pure software
> > implementation.
> >
> Early on, I had considered 8.8 fixed-point, but Binary16 has a larger
> dynamic range and can give better results. Similar, there is no
> (particular) performance advantage of fixed-point over floating-point in
> this case (while the FP-SIMD ops have a higher latency, there is enough
> going on that it is possible to schedule the instructions such that full
> throughput is achieved).
>
>
> The "PMULSH.H" instruction would have been useful:
> * FE-FE F2nm_9qjj PMULSH.H Rm, Imm48fv8sh, Rn
>
> But, sadly, this is asking a little too much of the LUT budget on the
> XC7S50... Though, this instruction was designed specifically for this
> use-case.
>
> It does make a difference in that without this instruction, it makes
> sense to compute roughly 4 neurons in parallel (with a lot of cycles
> being spent on 64-bit constant loads), whereas to achieve full
> throughput with this instruction, it makes more sense to calculate 16
> neurons in parallel. But, then this also basically needs the XGPR
> extension to be able to not run out of registers.
>
> This scenario pushing roughly 150 MFLOP/s at 50 MHz (vs only around 110
> MFLOP/s with separate constant-loads and shuffles).
>
> For Binary32 SIMD, the situation would be a little worse, not because of
> a limitation of the SIMD unit, but rather because this precludes being
> able to run shuffle ops in parallel with SIMD ops.
>
> If I could pull off the "PMACSH.H" instruction which does FMAC instead,
> this would allow a further speedup (saving all the clock-cycles that
> would have been spent on PADD.H instructions). But, this would add other
> issues (it being unclear if I can shove a Binary16 FMAC into a 3-cycle
> latency...).
>
>
> Though, unlike many other cases, rather than being D$ limited, this sort
> of code puts a lot more strain on the I$ instead (with a comparably high
> I$ miss rate).
>
> Theoretically, the SIMD unit could go up to 200 MFLOP/s at 50 MHz, but
> achieving this is unrealistic.
>
>
>
>
> Some of the nets I had tested with were:
> 64-inputs;
> 3 layers of 32 neurons;
> 4 outputs.
>
> So, say, roughly 3.2k weights + 100 biases.
>
> For something like depth inference, one would pull, say, 32 pixels from
> each image (in a roughly overlapping section), then tile the net across
> the scanline (to generate a low-res map Z map).
> >>
> >> Despite the relatively low clock-speed of the BJX2 core, having 3L/1T
> >> FP-SIMD ops allows for "relatively decent" performance (when the NNs are
> >> turned into a big blob of ASM).
> >>
> >> But, yeah, for Cortex-M, was mostly thinking of the M0/M0+ and similar....
> >>
> >>
> >>
> >> A previous micro-benchmark for an NN showed it as being "relatively
> >> comparable" to an early 2000s laptop (using x87).
> >>
> For context, despite the laptop's clock-speed advantage, it was
> hard-pressed to win with the x87.
>
> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>
> Also, generated ASM blobs vs "more generic" C.
>
> Whole lot of:
> a0=a0+va[0]*wv[0];
> a1=a1+va[1]*wv[1];
> a2=a2+va[2]*wv[2];
> a3=a3+va[3]*wv[3];
> ...
>
> Like, doesn't seem like it is asking "that" much?...
>
>
>
> But, like, the clock-speed disparity is big enough that one would expect
> that a 50 MHz core would have any chance against 1.47 GHz on any metric
> (regardless of the ISA).
>
> Granted, maybe it would make sense to compare against an SSE
> implementation, since this is closer to "apples to apples".
> >> Granted, both are also beat out at this by an original 700MHz/ARM11
> >> RasPi (with pretty much all the newer RasPi's being clearly superior on
> >> this front, *).
> >>
> With a RasPi pushing a bit more with similar code...
> >> Could technically have just run the NNs on a RasPi, but, ...
> >> This was more a case of me wanting to use my own stuff, vs just throwing
> >> a RasPi at the problem.
> >>
> >> Looks like an Cortex-M4F or Cortex-M7 could also work (might need to
> >> compare against a RasPi or similar; not sure how much RAM is typical for
> >> them, or if they can handle dual camera inputs, ...).
> >>
> >>
> >> *: Despite the CPU in the laptop having better looking on-paper stats
> >> (both higher clock-speed and OoO vs in-order), general performance was
> >> fairly comparable to a RasPi2.
> >>
> Eg: "Mobile Athlon XP 1700+"
>
> On paper, seems like it should be pretty solid...
>
> On paper, seems like it should stomp a RasPi or RasPi2.
> But, my attempts at testing seem to disagree.
> >> In terms of both floating-point and memcpy benchmarks, the RasPi2 is
> >> significantly ahead (also the RasPi2 has more RAM as well, typical
> >> "cheap" modern SDcards are also bigger than the laptop's HDD, ...).
> >> Can't really put in a newer HDD because the laptop uses a parallel ATA
> >> variant (and no one made significantly larger PATA drives).
> >>
> I am less sure about the RasPi2's memcpy speed advantage.
>
> A lot apparently comes down to the (unknown) detail of the RAM speed and
> bus width and similar on the RasPi.
>
>
> Goes and looks at stuff, PC-2100 shouldn't be at that much of a
> disadvantage... (Lower MT/s but should have a wider bus width). The
> laptop equipped with 2x 256MB modules.
>
>
> Will note though that my past benchmark attempts seem to fall well short
> of the theoretical bandwidth (then again, it seems like in my attempts
> at memcpy benchmarks, I never see numbers anywhere near the theoretical
> bandwidth values).
> >>
> >>
> >> In other news:
> >> I recently added a hack that allows the 32-bit MUL unit in the BJX2 core
> >> to handle a certain range of DIVS.L and DIVU.L instructions (it will
> >> signal to the Shift-Add divider that it has handled it, and the divider
> >> will skip doing anything; when the signal arrives to the EX3 stage, EX3
> >> will forward the multiplier's result instead of the divider's).
> >>
> >> It works if the source value is positive, and the divisor is a small
> >> positive integer which it can turn it into a lookup table.
> >>
> >> Currently only can deal with dividing positive and unsigned values, as
> >> trying to use it on negative values gives "round towards negative
> >> infinity" behavior rather than "round towards zero" (and the required
> >> sign-flipping would wreck the timing constraints); and consequently
> >> breaking some of the "sanity checks".
> >>
> >> While the compiler does this transformation itself for
> >> divide-by-constant, this hack mostly covers division at runtime.
> >>
> >> But, it is still a bit of an ugly hack...
> >>
> >>
> >> Modeling this behavior in my emulator seems able to push Dhrystone past
> >> the 80k mark (for a long while, it was stuck at ~ 75k). Granted, a
> >> recent special case to "long long" multiply pushed it from 75k to 77k,
> >> and the divider tweak pushed it the rest of the way.
> >>
> >> Granted, it is a bit of a hack, as it changes DIVS.L timing from a fixed
> >> 36-cycle timing, to 3 cycle with a 33-cycle penalty case.
> >>
> >> ...


Re: Misc Tradeoffs: Core I can fit on the XC7S50

<5f693d8e-4968-4362-95e5-e290280637f6n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32827&group=comp.arch#32827

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jun 2023 11:03:09 -0700 (PDT)
In-Reply-To: <1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=80.62.116.10; posting-account=iwcJjQoAAAAIecwT8pOXxaSOyiUTZMJr
NNTP-Posting-Host: 80.62.116.10
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me> <1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5f693d8e-4968-4362-95e5-e290280637f6n@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: peterfirefly@gmail.com (Peter Lund)
Injection-Date: Tue, 13 Jun 2023 18:03:10 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 1822
 by: Peter Lund - Tue, 13 Jun 2023 18:03 UTC

On Tuesday, June 13, 2023 at 8:00:40 PM UTC+2, Peter Lund wrote:
> Take a look at distillation and quantization. You almost certainly don't need floating-point for useful inference.

You can probably get away with 4-8 bits of (scaled) integer weights -- possibly augmented with some wider (scaled) integers or fp.

-Peter
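
A sketch of the scaled-integer weights suggested here, using symmetric per-tensor int8 quantization (the scheme and names are illustrative assumptions, not a specific library's API):

```c
#include <math.h>
#include <stdint.h>
#include <stddef.h>

/* Quantize float weights to int8 with one shared float scale, chosen
   so the largest-magnitude weight maps to +/-127.  Dequantization is
   w[i] ~= q[i] * scale. */
static float quantize_weights(const float *w, size_t n, int8_t *q)
{
    float maxabs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > maxabs) maxabs = a;
    }
    float scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = w[i] / scale;
        q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f);  /* round to nearest */
    }
    return scale;
}
```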

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32830&group=comp.arch#32830

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jun 2023 11:43:42 -0700 (PDT)
In-Reply-To: <u69ani$3ho4r$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:991b:d366:7e92:932f;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:991b:d366:7e92:932f
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jun 2023 18:43:43 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2414
 by: MitchAlsup - Tue, 13 Jun 2023 18:43 UTC

On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
> > On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>
> For context, despite the laptop's clock-speed advantage, it was
> hard-pressed to win with the x87.
>
> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>
> Also, generated ASM blobs vs "more generic" C.
>
> Whole lot of:
> a0=a0+va[0]*wv[0];
> a1=a1+va[1]*wv[1];
> a2=a2+va[2]*wv[2];
> a3=a3+va[3]*wv[3];
> ...
<
The google NN chip does 1024 multiplies and sum-reduces these to a
single result: 1024 new multiplies start every cycle and 1 new result
pops out every cycle.
<
result = 0;
for( i = 0; i < 1024; i++ )
result += a[i]*w[i];
>
> Like, doesn't seem like it is asking "that" much?...
>
In fact, too little......
>
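
A software model of the arithmetic (not the pipelining) of the 1024-wide multiply-and-reduce described above:

```c
#include <stdint.h>

/* In hardware, 1024 products enter the reduction tree every cycle and
   one completed sum leaves every cycle; this models only the math.
   Products are widened to 64 bits so the accumulation cannot overflow
   for 32-bit inputs. */
static int64_t mac_reduce_1024(const int32_t *a, const int32_t *w)
{
    int64_t result = 0;
    for (int i = 0; i < 1024; i++)
        result += (int64_t)a[i] * w[i];
    return result;
}
```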

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6aftf$3mjqi$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32833&group=comp.arch#32833

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Tue, 13 Jun 2023 12:26:37 -0700
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <u6aftf$3mjqi$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 13 Jun 2023 19:26:39 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="51a5558cac382b3fd4e1a502e925b71c";
logging-data="3886930"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+3gyRqo57UO2qyX3rkyDGWcSJooWgTUsw="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:d2r68Dkvk7aaQgyQ6tTFtlkLfAc=
In-Reply-To: <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 13 Jun 2023 19:26 UTC

On 6/13/2023 11:43 AM, MitchAlsup wrote:
> On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
>> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
>>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>>
>> For context, despite the laptop's clock-speed advantage, it was
>> hard-pressed to win with the x87.
>>
>> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>>
>> Also, generated ASM blobs vs "more generic" C.
>>
>> Whole lot of:
>> a0=a0+va[0]*wv[0];
>> a1=a1+va[1]*wv[1];
>> a2=a2+va[2]*wv[2];
>> a3=a3+va[3]*wv[3];
>> ...
> <
> The google NN chip does 1024 multiplies and sum reduces these to a single
> result at 1024 new multiplies start every cycle and 1 new result pops out
> every cycle.
> <
> result = 0;
> for( i = 0; i < 1024; i++ )
> result += a[i]×w[i];

Interesting. What is the latency, or to put it another way, while you
can just add hardware to do all the multiplies in parallel, how long and
how much hardware does the addition take?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6agtt$3mnqq$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32835&group=comp.arch#32835

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Tue, 13 Jun 2023 21:43:57 +0200
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <u6agtt$3mnqq$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 13 Jun 2023 19:43:57 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3b364792274804d5bbe8f882351abd6b";
logging-data="3891034"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX190LFf0TqCzJXU/DrsgPLdxgzxRSIFGoW6747xFrzOYKA=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.16
Cancel-Lock: sha1:ecRXMBfgfjEW6jDvz0lrgU7h0qU=
In-Reply-To: <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
 by: Terje Mathisen - Tue, 13 Jun 2023 19:43 UTC

MitchAlsup wrote:
> On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
>> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
>>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>>
>> For context, despite the laptop's clock-speed advantage, it was
>> hard-pressed to win with the x87.
>>
>> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>>
>> Also, generated ASM blobs vs "more generic" C.
>>
>> Whole lot of:
>> a0=a0+va[0]*wv[0];
>> a1=a1+va[1]*wv[1];
>> a2=a2+va[2]*wv[2];
>> a3=a3+va[3]*wv[3];
>> ...
> <
> The google NN chip does 1024 multiplies and sum reduces these to a single
> result at 1024 new multiplies start every cycle and 1 new result pops out
> every cycle.
> <
> result = 0;
> for( i = 0; i < 1024; i++ )
> result += a[i]×w[i];
>>
>> Like, doesn't seem like it is asking "that" much?...
>>
> In fact, too little......

When I read about the google chip several years ago, it was stated that
it could do 64K mul-acc operations/cycle. This compares to something
like 10K MACs in the Tesla chip, which needs to work on battery power.

Did I misremember, or are we talking about a different chip architecture?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<d6f7fd9b-195c-41b9-9d04-4598fbc9a860n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32841&group=comp.arch#32841

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jun 2023 14:31:17 -0700 (PDT)
In-Reply-To: <u6aftf$3mjqi$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:991b:d366:7e92:932f;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:991b:d366:7e92:932f
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me> <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
<u6aftf$3mjqi$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d6f7fd9b-195c-41b9-9d04-4598fbc9a860n@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jun 2023 21:31:18 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3003
 by: MitchAlsup - Tue, 13 Jun 2023 21:31 UTC

On Tuesday, June 13, 2023 at 2:26:43 PM UTC-5, Stephen Fuld wrote:
> On 6/13/2023 11:43 AM, MitchAlsup wrote:
> > On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
> >> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
> >>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
> >>
> >> For context, despite the laptop's clock-speed advantage, it was
> >> hard-pressed to win with the x87.
> >>
> >> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
> >>
> >> Also, generated ASM blobs vs "more generic" C.
> >>
> >> Whole lot of:
> >> a0=a0+va[0]*wv[0];
> >> a1=a1+va[1]*wv[1];
> >> a2=a2+va[2]*wv[2];
> >> a3=a3+va[3]*wv[3];
> >> ...
> > <
> > The google NN chip does 1024 multiplies and sum-reduces these to a single
> > result; 1024 new multiplies start every cycle, and 1 new result pops out
> > every cycle.
> > <
> > result = 0;
> > for( i = 0; i < 1024; i++ )
> > result += a[i]×w[i];
> Interesting. What is the latency, or to put it another way, while you
> can just add hardware to do all the multiplies in parallel, how long and
> how much hardware does the addition take?
<
3 or 4 cycles--it is all carry-save addition until you get to the final summation.
>
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<55acffa1-4909-407f-b0bf-c438a920bdc9n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32842&group=comp.arch#32842

Newsgroups: comp.arch
X-Received: by 2002:a05:622a:587:b0:3f5:30fd:a2d1 with SMTP id c7-20020a05622a058700b003f530fda2d1mr11620qtb.10.1686692007782;
Tue, 13 Jun 2023 14:33:27 -0700 (PDT)
X-Received: by 2002:aca:c1c2:0:b0:39c:f00f:4ae with SMTP id
r185-20020acac1c2000000b0039cf00f04aemr993862oif.1.1686692007486; Tue, 13 Jun
2023 14:33:27 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jun 2023 14:33:27 -0700 (PDT)
In-Reply-To: <u6agtt$3mnqq$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:991b:d366:7e92:932f;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:991b:d366:7e92:932f
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me> <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
<u6agtt$3mnqq$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <55acffa1-4909-407f-b0bf-c438a920bdc9n@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jun 2023 21:33:27 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3332
 by: MitchAlsup - Tue, 13 Jun 2023 21:33 UTC

On Tuesday, June 13, 2023 at 2:44:01 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
> >> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
> >>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
> >>
> >> For context, despite the laptop's clock-speed advantage, it was
> >> hard-pressed to win with the x87.
> >>
> >> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
> >>
> >> Also, generated ASM blobs vs "more generic" C.
> >>
> >> Whole lot of:
> >> a0=a0+va[0]*wv[0];
> >> a1=a1+va[1]*wv[1];
> >> a2=a2+va[2]*wv[2];
> >> a3=a3+va[3]*wv[3];
> >> ...
> > <
> > The google NN chip does 1024 multiplies and sum-reduces these to a single
> > result; 1024 new multiplies start every cycle, and 1 new result pops out
> > every cycle.
> > <
> > result = 0;
> > for( i = 0; i < 1024; i++ )
> > result += a[i]×w[i];
> >>
> >> Like, doesn't seem like it is asking "that" much?...
> >>
> > In fact, too little......
<
> When I read about the google chip several years ago, it was stated that
> it could do 64K mul-acc operations/cycle. This compared to something
> like 10K MACs in the Tesla chip which needs to work on battery power.
<
I could have easily misremembered 65536 as 1024.
>
> Did I misremember or are we talking about a different chip architecture?
<
NNs are not like CPUs, nor GPUs, nor DSPs, nor ... they are a different beast
altogether.
>
> Terje
>
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6bcih$3tjsi$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32843&group=comp.arch#32843

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Tue, 13 Jun 2023 22:35:43 -0500
Organization: A noiseless patient Spider
Lines: 333
Message-ID: <u6bcih$3tjsi$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Jun 2023 03:35:46 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6413868313de167fbe4b969ded75b82c";
logging-data="4116370"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Gj5tLfLv88Js1Y1i7F7m0"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
Cancel-Lock: sha1:KoHh6WL3BYmTFQCXv4/DOX6G/bk=
Content-Language: en-US
In-Reply-To: <1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
 by: BGB - Wed, 14 Jun 2023 03:35 UTC

On 6/13/2023 1:00 PM, Peter Lund wrote:
> On Tuesday, June 13, 2023 at 10:52:07 AM UTC+2, BGB wrote:
>> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
>>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>>>> On 6/12/2023 6:57 AM, Peter Lund wrote:
>>>>> On Saturday, June 10, 2023 at 9:04:23 PM UTC+2, BGB wrote:
>>>>>> Theoretically, could have used a Cortex-M based microcontroller, but
>>>>>> these tend to "suck real hard" at floating-point use-cases (like, even
>>>>>> running at 250MHz or similar isn't enough to offset the typically dismal
>>>>>> floating-point performance).
>>>>>
>>>>> Cortex-M0/M0+/M1 do indeed suck at fp. M4F has hardware fp so it should be ok. Even M3 will almost certainly be ok, given that it has single-cycle shifts and single-cycle counting of leading zeros.
>>>>>
>>>> I was wanting to be able to run neural nets, which kinda depend "a lot"
>>>> on floating-point performance.
>>>
>>> Could you use fixed-point? I based my neural net accelerator on 16.16 fixed
>>> point. I have not checked the size, but I think modelling a small number of
>>> neurons is possible. I used a word-serial approach and tables occupying
>>> two BRAMs per neuron. Then they can be built up into larger networks using
>>> software. Relying on a neuron accelerator rather than a pure software
>>> implementation.
>>>
>> Early on, I had considered 8.8 fixed-point, but Binary16 has a larger
>> dynamic range and can give better results. Similar, there is no
>> (particular) performance advantage of fixed-point over floating-point in
>> this case (while the FP-SIMD ops have a higher latency, there is enough
>> going on that it is possible to schedule the instructions such that full
>> throughput is achieved).
>>
>>
>> The "PMULSH.H" instruction would have been useful:
>> * FE-FE F2nm_9qjj PMULSH.H Rm, Imm48fv8sh, Rn
>>
>> But, sadly, this is asking a little too much of the LUT budget on the
>> XC7S50... Though, this instruction was designed specifically for this
>> use-case.
>>
>> It does make a difference in that without this instruction, it makes
>> sense to compute roughly 4 neurons in parallel (with a lot of cycles
>> being spent on 64-bit constant loads), whereas to achieve full
>> throughput with this instruction, it makes more sense to calculate 16
>> neurons in parallel. But, then this also basically needs the XGPR
>> extension to be able to not run out of registers.
>>
>> This scenario pushing roughly 150 MFLOP/s at 50 MHz (vs only around 110
>> MFLOP/s with separate constant-loads and shuffles).
>>
>> For Binary32 SIMD, the situation would be a little worse, not because of
>> a limitation of the SIMD unit, but rather because this precludes being
>> able to run shuffle ops in parallel with SIMD ops.
>>
>> If I could pull off the "PMACSH.H" instruction which does FMAC instead,
>> this would allow a further speedup (saving all the clock-cycles that
>> would have been spent on PADD.H instructions). But, this would add other
>> issues (it being unclear if I can shove a Binary16 FMAC into a 3-cycle
>> latency...).
>>
>>
>> Though, unlike many other cases, rather than being D$ limited, this sort
>> of code puts a lot more strain on the I$ instead (with a comparably high
>> I$ miss rate).
>>
>> Theoretically, the SIMD unit could go up to 200 MFLOP/s at 50 MHz, but
>> achieving this is unrealistic.
>>
>>
>>
>>
>> Some of the nets I had tested with were:
>> 64-inputs;
>> 3 layers of 32 neurons;
>> 4 outputs.
>>
>> So, say, roughly 3.2k weights + 100 biases.
>>
>> For something like depth inference, one would pull, say, 32 pixels from
>> each image (in a roughly overlapping section), then tile the net across
>> the scanline (to generate a low-res map Z map).
>>>>
>>>> Despite the relatively low clock-speed of the BJX2 core, having 3L/1T
>>>> FP-SIMD ops allows for "relatively decent" performance (when the NNs are
>>>> turned into a big blob of ASM).
>>>>
>>>> But, yeah, for Cortex-M, was mostly thinking of the M0/M0+ and similar...
>>>>
>>>>
>>>>
>>>> A previous micro-benchmark for an NN showed it as being "relatively
>>>> comparable" to an early 2000s laptop (using x87).
>>>>
>> For context, despite the laptop's clock-speed advantage, it was
>> hard-pressed to win with the x87.
>>
>> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>>
>> Also, generated ASM blobs vs "more generic" C.
>>
>> Whole lot of:
>> a0=a0+va[0]*wv[0];
>> a1=a1+va[1]*wv[1];
>> a2=a2+va[2]*wv[2];
>> a3=a3+va[3]*wv[3];
>> ...
>>
>> Like, doesn't seem like it is asking "that" much?...
>>
>>
>>
>> But, like, the clock-speed disparity is big enough that one would not
>> expect a 50 MHz core to have any chance against 1.47 GHz on any metric
>> (regardless of the ISA).
>>
>> Granted, maybe it would make sense to compare against an SSE
>> implementation, since this is closer to "apples to apples".
>>>> Granted, both are also beat out at this by an original 700MHz/ARM11
>>>> RasPi (with pretty much all the newer RasPi's being clearly superior on
>>>> this front, *).
>>>>
>> With a RasPi pushing a bit more with similar code...
>>>> Could technically have just run the NNs on a RasPi, but, ...
>>>> This was more a case of me wanting to use my own stuff, vs just throwing
>>>> a RasPi at the problem.
>>>>
>>>> Looks like an Cortex-M4F or Cortex-M7 could also work (might need to
>>>> compare against a RasPi or similar; not sure how much RAM is typical for
>>>> them, or if they can handle dual camera inputs, ...).
>>>>
>>>>
>>>> *: Despite the CPU in the laptop having better looking on-paper stats
>>>> (both higher clock-speed and OoO vs in-order), general performance was
>>>> fairly comparable to a RasPi2.
>>>>
>> Eg: "Mobile Athlon XP 1700+"
>>
>> On paper, seems like it should be pretty solid...
>>
>> On paper, seems like it should stomp a RasPi or RasPi2.
>> But, my attempts at testing seem to disagree.
>>>> In terms of both floating-point and memcpy benchmarks, the RasPi2 is
>>>> significantly ahead (also the RasPi2 has more RAM as well, typical
>>>> "cheap" modern SDcards are also bigger than the laptop's HDD, ...).
>>>> Can't really put in a newer HDD because the laptop uses a parallel ATA
>>>> variant (and no one made significantly larger PATA drives).
>>>>
>> I am less sure about the RasPi2's memcpy speed advantage.
>>
>> A lot apparently comes down to the (unknown) detail of the RAM speed and
>> bus width and similar on the RasPi.
>>
>>
>> Goes and looks at stuff, PC-2100 shouldn't be at that much of a
>> disadvantage... (Lower MT/s but should have a wider bus width). The
>> laptop equipped with 2x 256MB modules.
>>
>>
>> Will note though that my past benchmark attempts seem to fall well short
>> of the theoretical bandwidth (then again, it seems like in my attempts
>> at memcpy benchmarks, I never see numbers anywhere near the theoretical
>> bandwidth values).
>>>>
>>>>
>>>> In other news:
>>>> I recently added a hack that allows the 32-bit MUL unit in the BJX2 core
>>>> to handle a certain range of DIVS.L and DIVU.L instructions (it will
>>>> signal to the Shift-Add divider that it has handled it, and the divider
>>>> will skip doing anything; when the signal arrives to the EX3 stage, EX3
>>>> will forward the multiplier's result instead of the divider's).
>>>>
>>>> It works if the source value is positive, and the divisor is a small
>>>> positive integer which it can turn it into a lookup table.
>>>>
>>>> Currently only can deal with dividing positive and unsigned values, as
>>>> trying to use it on negative values gives "round towards negative
>>>> infinity" behavior rather than "round towards zero" (and the required
>>>> sign-flipping would wreck the timing constraints); and consequently
>>>> breaking some of the "sanity checks".
>>>>
>>>> While the compiler does this transformation itself for
>>>> divide-by-constant, this hack mostly covers division at runtime.
>>>>
>>>> But, it is still a bit of an ugly hack...
>>>>
>>>>
>>>> Modeling this behavior in my emulator seems able to push Dhrystone past
>>>> the 80k mark (for a long while, it was stuck at ~ 75k). Granted, a
>>>> recent special case to "long long" multiply pushed it from 75k to 77k,
>>>> and the divider tweak pushed it the rest of the way.
>>>>
>>>> Granted, it is a bit of a hack, as it changes DIVS.L timing from a fixed
>>>> 36-cycle timing, to 3 cycle with a 33-cycle penalty case.
>>>>
>>>> ...
>
> Take a look at distillation and quantization. You almost certainly don't need floating-point for useful inference.
>
> What you do is you train a model on your laptop/TPU/Dojo cluster using floating-point. Then you distill and quantize that so you end up with a smaller and simpler model that doesn't use floating-point at all (or only uses it in a few select places). The process is basically to use the first model to train the second one instead of using the original training data. The second model can start small and grow neurons during training or it can start bigger and have neurons pruned during training.
> Both methods work -- but look up the details in recent papers first, of course.
>


Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6bgl6$3u1lp$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32844&group=comp.arch#32844

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Tue, 13 Jun 2023 23:45:23 -0500
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <u6bgl6$3u1lp$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Jun 2023 04:45:26 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6413868313de167fbe4b969ded75b82c";
logging-data="4130489"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18UXw8f8geMXHgk0pvCMlzI"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
Cancel-Lock: sha1:xE4gnC6m04eXyz1Wn/gGYm0YYaA=
Content-Language: en-US
In-Reply-To: <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
 by: BGB - Wed, 14 Jun 2023 04:45 UTC

On 6/13/2023 1:43 PM, MitchAlsup wrote:
> On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
>> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
>>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>>
>> For context, despite the laptop's clock-speed advantage, it was
>> hard-pressed to win with the x87.
>>
>> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>>
>> Also, generated ASM blobs vs "more generic" C.
>>
>> Whole lot of:
>> a0=a0+va[0]*wv[0];
>> a1=a1+va[1]*wv[1];
>> a2=a2+va[2]*wv[2];
>> a3=a3+va[3]*wv[3];
>> ...
> <
> The google NN chip does 1024 multiplies and sum-reduces these to a single
> result; 1024 new multiplies start every cycle, and 1 new result pops out
> every cycle.
> <
> result = 0;
> for( i = 0; i < 1024; i++ )
> result += a[i]×w[i];

Seems like a problem with NNs is that one either has:
A CPU with some NN features;
Dedicated NN hardware.

The latter still requires some way to get data into and out of said
hardware, and enough compute resources to make it worthwhile vs. using
SIMD or similar.

Likely, dedicated NN hardware would require at least something like an
XC7A100T, or maybe a Zynq (so all the "administrative" work can be done
on an ARM core).

Also, AFAIK, the Google NN chips aren't available in hobbyist grade
dev-boards.

>>
>> Like, doesn't seem like it is asking "that" much?...
>>
> In fact, too little......

There are limits to what I can do on a Spartan-7.
Similarly, there are limits to what can be done on a 20-year-old laptop.

Could use a newer RasPi variant, but, where is the fun in that?...

Granted, OTOH, RasPi's can have WiFi...

....

Comparably, the FPGA uses the least power, say:
FPGA : ~ 1 W
RasPi : ~ 5 W
Old Laptop: ~ 50 W

Granted, it looks like for the robot base, each motor (when running)
will pull 2 W, so running all 4 motors at the same time is likely to pull
~8 W (still well within the limits of the NiMH battery pack).

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<f62250ff-5172-481c-ba35-9006f60318c1n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32845&group=comp.arch#32845

Newsgroups: comp.arch
X-Received: by 2002:ad4:550a:0:b0:62d:e8f6:27b with SMTP id pz10-20020ad4550a000000b0062de8f6027bmr1275673qvb.5.1686725598190;
Tue, 13 Jun 2023 23:53:18 -0700 (PDT)
X-Received: by 2002:a05:6808:202a:b0:39b:4710:7158 with SMTP id
q42-20020a056808202a00b0039b47107158mr320375oiw.2.1686725597857; Tue, 13 Jun
2023 23:53:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jun 2023 23:53:17 -0700 (PDT)
In-Reply-To: <u6bgl6$3u1lp$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=99.251.79.92; posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 99.251.79.92
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me> <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
<u6bgl6$3u1lp$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f62250ff-5172-481c-ba35-9006f60318c1n@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Wed, 14 Jun 2023 06:53:18 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4662
 by: robf...@gmail.com - Wed, 14 Jun 2023 06:53 UTC

On Wednesday, June 14, 2023 at 12:45:30 AM UTC-4, BGB wrote:
> On 6/13/2023 1:43 PM, MitchAlsup wrote:
> > On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
> >> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
> >>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
> >>
> >> For context, despite the laptop's clock-speed advantage, it was
> >> hard-pressed to win with the x87.
> >>
> >> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
> >>
> >> Also, generated ASM blobs vs "more generic" C.
> >>
> >> Whole lot of:
> >> a0=a0+va[0]*wv[0];
> >> a1=a1+va[1]*wv[1];
> >> a2=a2+va[2]*wv[2];
> >> a3=a3+va[3]*wv[3];
> >> ...
> > <
> > The google NN chip does 1024 multiplies and sum-reduces these to a single
> > result; 1024 new multiplies start every cycle, and 1 new result pops out
> > every cycle.
> > <
> > result = 0;
> > for( i = 0; i < 1024; i++ )
> > result += a[i]×w[i];
> Seems like a problem with NNs is that one either has:
> A CPU with some NN features;
> Dedicated NN hardware.
>
> The latter still requiring some way to get data into and out of said
> hardware. And enough compute resources to make it worthwhile vs using
> SIMD or similar.
>
> Likely, having dedicated hardware NN would likely require at least
> something like an XC7A100T, or maybe a Zynq (so all the "administrative"
> work can be done on an ARM core).

The neuron model Thor uses takes only 2.1k LUTs per neuron. Each neuron
has up to 1024 inputs and weights, but it can be partitioned into smaller
chunks so one neuron can be reused multiple times. The plan is to implement
eight hardware neurons taking about 17k LUTs. All eight process in parallel.
To sum all 1024 inputs and weights would take 1024 clock cycles.
Partitioning the tables would allow a result for every partition in
1024/partitions clocks. E.g., 16 partitions (virtual neurons) would allow
a result for each partition every 64 clocks.
1024 inputs is enough for a 2D 32x32 input. The thought was to do OCR
processing as a start.

>
> Also, AFAIK, the Google NN chips aren't available in hobbyist grade
> dev-boards.
> >>
> >> Like, doesn't seem like it is asking "that" much?...
> >>
> > In fact, too little......
> There is limits to what I can do on a Spartan-7.
> Similarly, limits to what can be done on a 20 year old laptop.
>
> Could use a newer RasPi variant, but, where is the fun in that?...
>
> Granted, OTOH, RasPi's can have WiFi...
>
> ...
>
>
> Comparably, the FPGA uses the least power, say:
> FPGA : ~ 1 W
> RasPi : ~ 5 W
> Old Laptop: ~ 50 W
>
>
> Granted, it looks like for the robot base, each motor (when running)
> will pull 2W, so running all 4 motors at the same time is likely to pull
> ~ 8 W (granted, still well within the limits of the NIMH battery pack).

I wonder if a motorcycle battery or a telecom battery could be used too.

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<fcd1a3bd-5028-4ada-a839-f5ee134b19c7n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32846&group=comp.arch#32846

Newsgroups: comp.arch
X-Received: by 2002:a05:6214:5655:b0:62f:e0d3:4e9f with SMTP id mh21-20020a056214565500b0062fe0d34e9fmr278593qvb.6.1686743379751;
Wed, 14 Jun 2023 04:49:39 -0700 (PDT)
X-Received: by 2002:a9d:4d0e:0:b0:6b2:962b:cf9b with SMTP id
n14-20020a9d4d0e000000b006b2962bcf9bmr4143631otf.0.1686743379505; Wed, 14 Jun
2023 04:49:39 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 14 Jun 2023 04:49:39 -0700 (PDT)
In-Reply-To: <u6bgl6$3u1lp$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=80.62.116.10; posting-account=iwcJjQoAAAAIecwT8pOXxaSOyiUTZMJr
NNTP-Posting-Host: 80.62.116.10
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me> <f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
<u6bgl6$3u1lp$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fcd1a3bd-5028-4ada-a839-f5ee134b19c7n@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: peterfirefly@gmail.com (Peter Lund)
Injection-Date: Wed, 14 Jun 2023 11:49:39 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2222
 by: Peter Lund - Wed, 14 Jun 2023 11:49 UTC

On Wednesday, June 14, 2023 at 6:45:30 AM UTC+2, BGB wrote:
> Also, AFAIK, the Google NN chips aren't available in hobbyist grade
> dev-boards.

No, but once you have a decent model + training pipeline + training data prepared, you can sign up for a free (or cheap) deep learning cloud account.

Training requires much, much more compute than inference.

(Also remember to initialize the weights properly before training -- google "Kaiming He". Different types of models require different types of initialization so if you try something fancy, you might need a different initialization than He's. Just trace the citations of He's paper to find them. You might also need more than 3 layers.)

-Peter

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<912ee52b-7f9b-419a-9c72-bc5ce9c4b63dn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32847&group=comp.arch#32847

Newsgroups: comp.arch
X-Received: by 2002:a37:4646:0:b0:760:8aad:52ff with SMTP id t67-20020a374646000000b007608aad52ffmr671986qka.10.1686744310696;
Wed, 14 Jun 2023 05:05:10 -0700 (PDT)
X-Received: by 2002:a05:6830:1ca:b0:6b2:b563:82b1 with SMTP id
r10-20020a05683001ca00b006b2b56382b1mr4275889ota.7.1686744310340; Wed, 14 Jun
2023 05:05:10 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 14 Jun 2023 05:05:10 -0700 (PDT)
In-Reply-To: <u6bcih$3tjsi$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=80.62.116.10; posting-account=iwcJjQoAAAAIecwT8pOXxaSOyiUTZMJr
NNTP-Posting-Host: 80.62.116.10
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me> <1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
<u6bcih$3tjsi$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <912ee52b-7f9b-419a-9c72-bc5ce9c4b63dn@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: peterfirefly@gmail.com (Peter Lund)
Injection-Date: Wed, 14 Jun 2023 12:05:10 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5122
 by: Peter Lund - Wed, 14 Jun 2023 12:05 UTC

On Wednesday, June 14, 2023 at 5:35:50 AM UTC+2, BGB wrote:
> On 6/13/2023 1:00 PM, Peter Lund wrote:
> I am not building an object recognition model for now (this would likely
> be harder than depth-inference).

Depends on the quality of depth inference you want :)

Tesla does it but their net is a lot bigger than the one you proposed.

https://www.youtube.com/watch?v=LR0bDLCElKg

You can also use traditional techniques: try to identify lots (hundreds to thousands) of "interesting points" in each frame.
Then try to see if/how they fit a "warping" between frames. Some of the "interesting points" won't be shared between the frames, and some of them will be random artifacts (not actually interesting at all), but that's OK, because you can use a regression technique that tolerates outliers: RANSAC.

You can use this with one camera or with multiple cameras. The only really hard part is identifying the "interesting points".

There are plenty of papers from the 00's and 10's...

> Idea is mostly to try to figure out how far away things are

Won't ultrasound be easier/cheaper?

> , and to have
> some way to figure out how to estimate or measure how the robot has
> moved, and then build a 3D voxel map.

Accelerometers? Rotation counters on your motors?

>
> At least in theory, most of the needed information should be in the
> video stream. But, processing it out is the harder part.

> One other idea (would be more involved), would be to make an unusual
> "lobed" front lens such that there end up being multiple focal points on
> the image sensor. I don't currently have the resources to make something
> like this.
>
>
> Though, one other possibility being to make a sort of lens with a V or W
> shaped profile (rather than a normal uniformly circular profile), which
> could in turn encode some 3D spatial information in the form of optical
> effects.
>
> Where, with some rounding, a V shaped profile would almost resemble a
> "heart" shape, and the W shape sort of like two heart-shaped merged
> together, likely with a thicker middle section and thinner edges (but
> also sorta following the profile of the letter W).
>
> It seems like an "F-hole" shaped lens could maybe also work.
>
> Not really found anything talking about lenses like this, so all this is
> mostly based on "my ability to imagine stuff".

J. Tanner & C. Mead, "An integrated analog optical motion sensor." In R. W. Brodersen & H. S. Moscovitz, editors, VLSI Signal Processing II, pp. 59-87. IEEE, New York, 1988.

C. Mead. Analog VLSI and neural systems. Addison-Wesley, Reading, Massachusetts, 1989.

There is an old article in Scientific American about Mead's ideas on this: using a Fresnel lens to make optical flow much easier to measure, even with very few computational resources. The Fresnel lens in this case doesn't emulate a standard lens; it is meant as a cheap and compact way of doing tricks along the lines you suggest above.

> In theory (given a small ball-nose endmill and a machine that can hold
> fairly tight tolerances), it should be possible to machine such a lens
> out of a piece of polycarbonate or similar (and then just sorta stick it
> onto the front of a camera module).

Or maybe etch it? Or use a small controllable laser to melt/evaporate the parts you don't want?
Maybe use flame polishing afterwards to smooth out the edges around the cuts?

-Peter

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<b80d1d3b-2722-4a08-937a-5aaf6f0bf61fn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32848&group=comp.arch#32848

Newsgroups: comp.arch
Newsgroups: comp.arch
Date: Wed, 14 Jun 2023 05:44:42 -0700 (PDT)
In-Reply-To: <912ee52b-7f9b-419a-9c72-bc5ce9c4b63dn@googlegroups.com>
References: <u61ekn$28oq1$1@dont-email.me> <34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me> <443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me> <bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me> <1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
<u6bcih$3tjsi$1@dont-email.me> <912ee52b-7f9b-419a-9c72-bc5ce9c4b63dn@googlegroups.com>
Message-ID: <b80d1d3b-2722-4a08-937a-5aaf6f0bf61fn@googlegroups.com>
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
From: robfi680@gmail.com (robf...@gmail.com)
 by: robf...@gmail.com - Wed, 14 Jun 2023 12:44 UTC

On Wednesday, June 14, 2023 at 8:05:12 AM UTC-4, Peter Lund wrote:
> On Wednesday, June 14, 2023 at 5:35:50 AM UTC+2, BGB wrote:
> > On 6/13/2023 1:00 PM, Peter Lund wrote:
> > I am not building an object recognition model for now (this would likely
> > be harder than depth-inference).
> Depends on the quality of depth inference you want :)
>
> Tesla does it but their net is a lot bigger than the one you proposed.
>
> https://www.youtube.com/watch?v=LR0bDLCElKg
>
> You can also use traditional techniques: try to identify lots (hundreds to thousands) of "interesting points" in each frame.
> Then try to see if/how they fit a "warping" between frames. Some of the "interesting points" won't be shared between the frames, some of them will be random artifacts (and not actually interesting at all), but that's ok, because you can use a regression technique that can tolerate outliers: RANSAC.
>
> You can use this with one camera or multiple cameras. The only really hard part is the identification of the "interesting points".
>
> There are plenty of papers from the 00's and 10's...
> > Idea is mostly to try to figure out how far away things are
> Won't ultrasound be easier/cheaper?
> > , and to have
> > some way to figure out how to estimate or measure how the robot has
> > moved, and then build a 3D voxel map.
> Accelerometers? Rotation counters on your motors?

GPS comes to mind too.

> >
> > At least in theory, most of the needed information should be in the
> > video stream. But, processing it out is the harder part.
> > One other idea (would be more involved), would be to make an unusual
> > "lobed" front lens such that there end up being multiple focal points on
> > the image sensor. I don't currently have the resources to make something
> > like this.
> >
> >
> > Though, one other possibility being to make a sort of lens with a V or W
> > shaped profile (rather than a normal uniformly circular profile), which
> > could in turn encode some 3D spatial information in the form of optical
> > effects.
> >
> > Where, with some rounding, a V shaped profile would almost resemble a
> > "heart" shape, and the W shape sort of like two heart shapes merged
> > together, likely with a thicker middle section and thinner edges (but
> > also sorta following the profile of the letter W).
> >
> > It seems like an "F-hole" shaped lens could maybe also work.
> >
> > Not really found anything talking about lenses like this, so all this is
> > mostly based on "my ability to imagine stuff".
> J. Tanner & C. Mead, ``An integrated analog optical motion sensor,'' in R.W. Brodersen & H.S. Moscovitz, editors, VLSI Signal Processing II, pp. 59-87. IEEE, New York, 1988.
>
> C. Mead. Analog VLSI and neural systems. Addison-Wesley, Reading, Massachusetts, 1989.
>
> There is an old article in Scientific American about Mead's ideas on this: using a Fresnel lens to make optical flow much easier to measure, even with very few computational resources. The Fresnel lens in this case doesn't emulate a standard lens; it is meant as a cheap and compact way of doing tricks along the lines you suggest above.
> > In theory (given a small ball-nose endmill and a machine that can hold
> > fairly tight tolerances), it should be possible to machine such a lens
> > out of a piece of polycarbonate or similar (and then just sorta stick it
> > onto the front of a camera module).
> Or maybe etch it? Or use a small controllable laser to melt/evaporate the parts you don't want?
> Maybe use flame polishing afterwards to smooth out the edges around the cuts?
>
> -Peter

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6cpbo$2ir4$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32851&group=comp.arch#32851

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Wed, 14 Jun 2023 11:20:06 -0500
Organization: A noiseless patient Spider
Lines: 153
Message-ID: <u6cpbo$2ir4$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
<u6bcih$3tjsi$1@dont-email.me>
<912ee52b-7f9b-419a-9c72-bc5ce9c4b63dn@googlegroups.com>
In-Reply-To: <912ee52b-7f9b-419a-9c72-bc5ce9c4b63dn@googlegroups.com>
 by: BGB - Wed, 14 Jun 2023 16:20 UTC

On 6/14/2023 7:05 AM, Peter Lund wrote:
> On Wednesday, June 14, 2023 at 5:35:50 AM UTC+2, BGB wrote:
>> On 6/13/2023 1:00 PM, Peter Lund wrote:
>> I am not building an object recognition model for now (this would likely
>> be harder than depth-inference).
>
> Depends on the quality of depth inference you want :)
>
> Tesla does it but their net is a lot bigger than the one you proposed.
>
> https://www.youtube.com/watch?v=LR0bDLCElKg
>
> You can also use traditional techniques: try to identify lots (hundreds to thousands) of "interesting points" in each frame.
> Then try to see if/how they fit a "warping" between frames. Some of the "interesting points" won't be shared between the frames, some of them will be random artifacts (and not actually interesting at all), but that's ok, because you can use a regression technique that can tolerate outliers: RANSAC.
>
> You can use this with one camera or multiple cameras. The only really hard part is the identification of the "interesting points".
>
> There are plenty of papers from the 00's and 10's...
>

My ideas weren't usually based on points, but rather on using filtering to
figure out the parallax.

The NN needs to be wide enough to cover the parts of the frame likely to
overlap and be subject to parallax, but does not need to be large enough
to cover the entire frame. Would be invoked multiple times per scanline.

So, 32 pixels for a 320x200 frame should hopefully be sufficient
(excluding objects that are very close to the camera modules). Possibly,
64 inputs per frame could be "better" and deal with more divergence.

One doesn't want the net to be so big, though, that it is entirely outside
what can fit in the L1 cache.

In this case, ~ 3k weights is pushing it. Possibly limiting to 1k to 2k
weights would be easier on the L1 cache (and/or try to make a 32K L1 I$
workable...).

If one goes outside the limits of the L1 I$, performance drops
significantly...

>> Idea is mostly to try to figure out how far away things are
>
> Won't ultrasound be easier/cheaper?
>

Ultrasound and IR are the traditionally cheap/easy options for this.

Didn't really want to go this route.
One may almost as well just use a microcontroller in this case.

....

Would need to buy these though as I don't already have any available.

>> , and to have
>> some way to figure out how to estimate or measure how the robot has
>> moved, and then build a 3D voxel map.
>
> Accelerometers? Rotation counters on your motors?
>

Possible, but I really didn't want to go this route either.
No easy way to mount rotary encoders on the motors in question (but
could be put on the axles).

However, most options for rotary encoders would cost more than the
motors they are being used on...

Accelerometers could work and typically use an I2C bus interface IIRC
(had messed with one before).

>>
>> At least in theory, most of the needed information should be in the
>> video stream. But, processing it out is the harder part.
>
>> One other idea (would be more involved), would be to make an unusual
>> "lobed" front lens such that there end up being multiple focal points on
>> the image sensor. I don't currently have the resources to make something
>> like this.
>>
>>
>> Though, one other possibility being to make a sort of lens with a V or W
>> shaped profile (rather than a normal uniformly circular profile), which
>> could in turn encode some 3D spatial information in the form of optical
>> effects.
>>
>> Where, with some rounding, a V shaped profile would almost resemble a
>> "heart" shape, and the W shape sort of like two heart shapes merged
>> together, likely with a thicker middle section and thinner edges (but
>> also sorta following the profile of the letter W).
>>
>> It seems like an "F-hole" shaped lens could maybe also work.
>>
>> Not really found anything talking about lenses like this, so all this is
>> mostly based on "my ability to imagine stuff".
>
> J. Tanner & C. Mead, ``An integrated analog optical motion sensor,'' in R.W. Brodersen & H.S. Moscovitz, editors, VLSI Signal Processing II, pp. 59-87. IEEE, New York, 1988.
>
> C. Mead. Analog VLSI and neural systems. Addison-Wesley, Reading, Massachusetts, 1989.
>
> There is an old article in Scientific American about Mead's ideas on this: using a Fresnel lens to make optical flow much easier to measure, even with very few computational resources. The Fresnel lens in this case doesn't emulate a standard lens; it is meant as a cheap and compact way of doing tricks along the lines you suggest above.
>

OK.

>> In theory (given a small ball-nose endmill and a machine that can hold
>> fairly tight tolerances), it should be possible to machine such a lens
>> out of a piece of polycarbonate or similar (and then just sorta stick it
>> onto the front of a camera module).
>
> Or maybe etch it? Or use a small controllable laser to melt/evaporate the parts you don't want?
> Maybe use flame polishing afterwards to smooth out the edges around the cuts?
>

Possibly.
The machines I have access to at the moment can get "mostly in the area"
of around 0.005 inch (these being mostly CNC retrofits on manual machines).

Say, for a roughly 0.250 - 0.375 inch lens (to stick onto the front of a
camera module), would ideally want something that could hold closer to
0.001 inch for this part...

Would probably mill it using a 1/16 (0.0625 inch) ball-nose mill. Post
polishing would still be needed in any case.

Could likely mill the lens out of a piece of 1/4" (0.250") or 3/16"
(0.187") polycarbonate or acrylic (usually sold in big sheets; commonly
known as "Plexiglas").

Not worth bothering with trying to replace the main lens, these are
typically tiny (say, around 0.050 inch). Can't really machine (or easily
handle) anything this small (and they are also permanently embedded
inside a much bigger piece of plastic; usually a sort of plastic screw
mechanism; with a matching threaded section surrounding the main image
sensor; IIRC somewhere in the area of 0.2 inch).

Usually all of this in turn being mounted onto a PCB.

> -Peter

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6fi5v$fpi9$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32853&group=comp.arch#32853

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Thu, 15 Jun 2023 12:35:56 -0500
Organization: A noiseless patient Spider
Lines: 178
Message-ID: <u6fi5v$fpi9$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
<u6bgl6$3u1lp$1@dont-email.me>
<f62250ff-5172-481c-ba35-9006f60318c1n@googlegroups.com>
In-Reply-To: <f62250ff-5172-481c-ba35-9006f60318c1n@googlegroups.com>
 by: BGB - Thu, 15 Jun 2023 17:35 UTC

On 6/14/2023 1:53 AM, robf...@gmail.com wrote:
> On Wednesday, June 14, 2023 at 12:45:30 AM UTC-4, BGB wrote:
>> On 6/13/2023 1:43 PM, MitchAlsup wrote:
>>> On Tuesday, June 13, 2023 at 3:52:07 AM UTC-5, BGB wrote:
>>>> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
>>>>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>>>>
>>>> For context, despite the laptop's clock-speed advantage, it was
>>>> hard-pressed to win with the x87.
>>>>
>>>> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>>>>
>>>> Also, generated ASM blobs vs "more generic" C.
>>>>
>>>> Whole lot of:
>>>> a0=a0+va[0]*wv[0];
>>>> a1=a1+va[1]*wv[1];
>>>> a2=a2+va[2]*wv[2];
>>>> a3=a3+va[3]*wv[3];
>>>> ...
>>> <
>>> The google NN chip does 1024 multiplies and sum reduces these to a single
>>> result at 1024 new multiplies start every cycle and 1 new result pops out
>>> every cycle.
>>> <
>>> result = 0;
>>> for( i = 0; i < 1024; i++ )
>>> result += a[i]×w[i];
>> Seems like a problem with NNs is that one either has:
>> A CPU with some NN features;
>> Dedicated NN hardware.
>>
>> The latter still requiring some way to get data into and out of said
>> hardware. And enough compute resources to make it worthwhile vs using
>> SIMD or similar.
>>
>> Likely, having dedicated hardware NN would likely require at least
>> something like an XC7A100T, or maybe a Zynq (so all the "administrative"
>> work can be done on an ARM core).
>
> The neuron model Thor uses takes only 2.1kLUTs per neuron. Each neuron
> has up to 1024 inputs and weights, but it can be partitioned into smaller
> chunks so one neuron can be reused multiple times. The plan is to implement
> eight hardware neurons taking about 17k LUTs. All eight process in parallel.
> To sum all 1024 inputs and weights would take 1024 clock cycles.
> Partitioning the tables would allow a result for every partition in 1024/partitions
> clocks. E.g. 16 partitions (virtual neurons) would allow 64 clocks for a result for
> each partition.
> 1024 inputs is enough for a 2D 32x32 input. The thought was to do OCR
> processing as a start.
>

OK.

I guess this is a different strategy than throwing SIMD at the problem.

As can be noted, with SIMD the idea is to compute multiple neurons in
parallel. Some cycles are spent on shuffles and constant loads, but
these can be used to "buffer" some of the latency.

One can do 8 or 16 neurons in parallel (to spend fewer cycles on
interlocks), but this increases register pressure somewhat.

So, per group of 4-inputs to 4 neurons:
One memory load;
4 constant loads (each weight vector);
4 vector shuffles;
4 SIMD MUL;
4 SIMD ADD.

One can sort of stagger the operations such that two of these processes
overlap, but the dependency chain of the ADD means that one needs (at
least) 3 cycles between each vector ADD (not too hard if one needs to
burn a clock cycle for each constant load, and can also put the next
stage's MUL's between each ADD).

If doing 8 neurons, then there is less need for overlap.

But, with the PMULSH.H instruction, would need ~ 16 neurons in parallel
to minimize interlock stalls.

One also needs enough registers such that all the operations can be
staggered and overlapped.

Well, and also the net needs to be kept small enough to hopefully fit
into the L1 cache.

The PMULSH would have helped slightly with the L1 issue, in that it
basically combines the work of 5x32b words into 3x32b, but as noted, the
3-cycle latency becomes more of an issue in this case.

But, as noted, pushes the LUT requirements over-budget for the XC7S50.

Could, in theory, get an Arty A7-100T, but this isn't a particularly
cheap board (even if it is probably a better fit for what I want to do
with it than the S7-50...).

>>
>> Also, AFAIK, the Google NN chips aren't available in hobbyist grade
>> dev-boards.
>>>>
>>>> Like, doesn't seem like it is asking "that" much?...
>>>>
>>> In fact, too little......
>> There is limits to what I can do on a Spartan-7.
>> Similarly, limits to what can be done on a 20 year old laptop.
>>
>> Could use a newer RasPi variant, but, where is the fun in that?...
>>
>> Granted, OTOH, RasPi's can have WiFi...
>>
>> ...
>>
>>
>> Comparably, the FPGA uses the least power, say:
>> FPGA : ~ 1 W
>> RasPi : ~ 5 W
>> Old Laptop: ~ 50 W
>>
>>
>> Granted, it looks like for the robot base, each motor (when running)
>> will pull 2W, so running all 4 motors at the same time is likely to pull
>> ~ 8 W (granted, still well within the limits of the NIMH battery pack).
>
> I wonder if a motor-cycle battery or a telecom battery could be used too.

Not on this size of robot...

With a motorcycle battery or one of the larger 18 Ah Lead-Acid UPS
batteries, it is very likely the motors wouldn't be strong enough to
move the battery.

Given the frame came with a 3-cell 18650 holder, this is probably what
they intended people to use, but NIMH AA's are cheaper in a Wh/$ sense
(and I have a lot more of these than I have 18650 cells).

On some past things I had worked on, lead-acid batteries were used.
I had at one point started working on (but never finished) an idea for a
"balanced" biwheel robot (with IIRC, 28" wheels, mostly plastic
yard-cart wheels, but modified to replace the ball-bearings with a rigid
drive shaft) that would have used NEMA 23 steppers as wheel motors.
Though, this would have "cheated" some in that it was not actively
balanced, but would have primarily used the battery (located between the
wheels) as ballast to keep the thing balanced (ultimately, its ability
to accelerate/decelerate or climb slopes would be determined mostly by
the relative weight of the battery to the torque on the wheels).

Or, basically, like a robot that would have maintained balance similar
to those round-bottom dolls/toys with lead (or, now bismuth) in the
base. Just, in this case, the weight being in the form of a big
lead-acid battery...

Had bought some parts for it, but then the idea sorta fizzled before all
that much got built (sort of a pattern with these sorts of projects).
This is a project though that would likely require a metal frame, as the
physical constraints pushed the limits of what is practical with
wood+cardboard+hot-glue.

Where, say, one might find that wood lathing strips aren't quite as
strong as one might expect when tasked with holding the weight of an
18Ah lead-acid battery... Actually, the hot glue tends to be stronger
than the wood in this case (hot glue is actually surprisingly durable at
times in cases where it actually sticks to things).

But, realistically, a "better" option would be to make the frame out of
aluminum angle or similar (but, steel and aluminum angle costs a lot
more, ...).

....

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6fkaf$g2qk$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32854&group=comp.arch#32854

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Thu, 15 Jun 2023 13:12:28 -0500
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <u6fkaf$g2qk$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
<u6bgl6$3u1lp$1@dont-email.me>
<fcd1a3bd-5028-4ada-a839-f5ee134b19c7n@googlegroups.com>
In-Reply-To: <fcd1a3bd-5028-4ada-a839-f5ee134b19c7n@googlegroups.com>
 by: BGB - Thu, 15 Jun 2023 18:12 UTC

On 6/14/2023 6:49 AM, Peter Lund wrote:
> On Wednesday, June 14, 2023 at 6:45:30 AM UTC+2, BGB wrote:
>> Also, AFAIK, the Google NN chips aren't available in hobbyist grade
>> dev-boards.
>
> No, but once you have a decent model + training pipeline + training data prepared, you can sign up for a free (or cheap) deep learning cloud account.
>
> Training requires much, much more compute than inference.
>
> (Also remember to initialize the weights properly before training -- google "Kaiming He". Different types of models require different types of initialization so if you try something fancy, you might need a different initialization than He's. Just trace the citations of He's paper to find them. You might also need more than 3 layers.)
>
> -Peter

Possible.

For this, I had mostly run past experiments with doing the training on
my PC.

The nets, not being "that" big, are not a huge burden for a desktop PC
(apart from the cost of emulating the Binary16 math on a
system that does not natively support Binary16...).

Thinking about it, it may make sense to leave the NN weights in memory
and handle them using memory loads, rather than embed them directly into
the ASM code. This would allow using a more conventional loop structure,
and wouldn't risk being bottle-necked by the L1 I$ (but would still face
issues if the input data and weights have too much of a miss rate with
the L1 D$).

Main drawback is that memory loads have a higher latency.

Likely the weights would remain as Binary16 (FP8 could save some space,
potentially allowing for a bigger net to fit in the L1 cache, but would
add further latency and is not likely sufficient for this case).

....

Re: Misc Tradeoffs: Core I can fit on the XC7S50

<u6i4ug$sp38$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32855&group=comp.arch#32855

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Fri, 16 Jun 2023 12:08:31 -0500
Organization: A noiseless patient Spider
Lines: 172
Message-ID: <u6i4ug$sp38$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<f11a7d06-274d-4475-bb25-b7fd396ca35cn@googlegroups.com>
<u6bgl6$3u1lp$1@dont-email.me>
<fcd1a3bd-5028-4ada-a839-f5ee134b19c7n@googlegroups.com>
<u6fkaf$g2qk$1@dont-email.me>
In-Reply-To: <u6fkaf$g2qk$1@dont-email.me>
 by: BGB - Fri, 16 Jun 2023 17:08 UTC

On 6/15/2023 1:12 PM, BGB wrote:
> On 6/14/2023 6:49 AM, Peter Lund wrote:
>> On Wednesday, June 14, 2023 at 6:45:30 AM UTC+2, BGB wrote:
>>> Also, AFAIK, the Google NN chips aren't available in hobbyist grade
>>> dev-boards.
>>
>> No, but once you have a decent model + training pipeline + training
>> data prepared, you can sign up for a free (or cheap) deep learning
>> cloud account.
>>
>> Training requires much, much more compute than inference.
>>
>> (Also remember to initialize the weights properly before training --
>> google "Kaiming He".  Different types of models require different
>> types of initialization so if you try something fancy, you might need
>> a different initialization than He's.  Just trace the citations of
>> He's paper to find them.  You might also need more than 3 layers.)
>>
>> -Peter
>
> Possible.
>
> For this, I had mostly run past experiments with doing the training on
> my PC.
>
> The nets, not being "that" big, are not a huge burden for a desktop PC
> (apart from the cost of emulating the Binary16 math on a
> system that does not natively support Binary16...).
>
>
> Thinking about it, it may make sense to leave the NN weights in memory
> and handle them using memory loads, rather than embed them directly into
> the ASM code. This would allow using a more conventional loop structure,
> and wouldn't risk being bottle-necked by the L1 I$ (but would still face
> issues if the input data and weights have too much of a miss rate with
> the L1 D$).
>
> Main drawback is that memory loads have a higher latency.
>
>
> Likely the weights would remain as Binary16 (FP8 could save some space,
> potentially allowing for a bigger net to fit in the L1 cache, but would
> add further latency and is not likely sufficient for this case).
>
> ...
>

Well, anyways, here is an example of a "generic" loop written to load
the weights from memory (8-wide case):

TK_NN_ProcessVector8w:
ADD -128, SP
MOV.X R30, (SP, 112)
MOV.X R28, (SP, 96)
MOV.X R26, (SP, 80)
MOV.X R24, (SP, 64)

MOV.Q R14, (SP, 48)
MOV.X R12, (SP, 32)
MOV.X R10, (SP, 16)
MOV.X R8 , (SP, 0)

MOV 0, R2
MOV 0, R3
MOV R7, R14

//Loop Start
.L0:
MOV.Q (R4, 0), R7

MOV.X (R5, 0), R16
MOV.X (R5, 16), R18
PSHUF.W R7, 0x00, R20
PSHUF.W R7, 0x55, R21
PMUL.H R16, R20, R24
PMUL.H R17, R21, R25
PSHUF.W R7, 0xAA, R22
PADD.H R24, R2
PMUL.H R18, R22, R26
PSHUF.W R7, 0xFF, R23
PADD.H R25, R2
PMUL.H R19, R23, R27

MOV.X (R5, 32), R16
PADD.H R26, R2

MOV.X (R5, 48), R18
PSHUF.W R7, 0x00, R20
PSHUF.W R7, 0x55, R21
PADD.H R27, R2

PMUL.H R16, R20, R24
PMUL.H R17, R21, R25
PSHUF.W R7, 0xAA, R22

PADD.H R24, R3
PMUL.H R18, R22, R26
PSHUF.W R7, 0xFF, R23
PADD.H R25, R3
PMUL.H R19, R23, R27
ADD -1, R14
PADD.H R26, R3
ADD 64, R5
ADD 8, R4
PADD.H R27, R3

CMPGT 0, R14
BT .L0
//Loop End

MOV.X (R5, 0), R16

PADD.H R2, R16, R2
PADD.H R3, R17, R3

AND R16, 3, R18
AND R17, 3, R19
CMPEQ 1, R18
PRELU.H?T R2, R2 //ReLU
CMPEQ 2, R18
PSQRTA.H?T R2, R2 //Signed Square Root Approx
CMPEQ 3, R18
PSQRTUA.H?T R2, R2 //Unsigned Square Root Approx

CMPEQ 1, R19
PRELU.H?T R3, R3
CMPEQ 2, R19
PSQRTA.H?T R3, R3
CMPEQ 3, R19
PSQRTUA.H?T R3, R3
MOV.X R2, (R6)

MOV.X (SP, 112), R30
MOV.X (SP, 96), R28
MOV.X (SP, 80), R26
MOV.X (SP, 64), R24

MOV.Q (SP, 48), R14
MOV.X (SP, 32), R12
MOV.X (SP, 16), R10
MOV.X (SP, 0), R8
ADD 128, SP

RTS

Not tested yet, but written to try to minimize interlock stalls. Not
many cases where bundling can be used, as this code is mostly limited by
latency (most of the operations in the inner loop being 3-cycle ops).

Also stays within the limits of the low 32 registers.
Doesn't look like the extra memory loads likely wrecked things too horribly.

Where:
void TK_NN_ProcessVector8w(
u64 *vvec, u64 *wvec,
u64 *dvec,
int n);

Expressing the vectors as 64-bit integers mostly because this is what
"normal C" would allow.

Leaving out a generic C version, as (besides it being likely a few
orders of magnitude slower), it is also a lot more bulky...
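Purely as an illustration, a hypothetical scalar sketch of such a generic C version follows. It uses int16_t fixed-point lanes standing in for the packed Binary16 math (so it is not bit-exact with the ASM above), the weight layout is inferred from the loop's load pattern, and the tail's selectable activation is reduced to a plain ReLU:

```c
#include <stdint.h>

/* Scalar analogue of TK_NN_ProcessVector8w: 8 neuron accumulators,
 * fed 4 inputs per group.  Weight layout mirrors the ASM loads: per
 * group, 16 weights for lanes 0-3 (4 per input), then 16 weights for
 * lanes 4-7; an 8-entry bias vector follows the weights. */
void TK_NN_ProcessVector8w_C(const int16_t *vvec, const int16_t *wvec,
                             int16_t *dvec, int n)
{
    int32_t acc[8] = {0};
    for (int g = 0; g < n; g++) {
        const int16_t *v = &vvec[g * 4];
        const int16_t *w = &wvec[g * 32];
        for (int k = 0; k < 4; k++) {
            for (int lane = 0; lane < 4; lane++) {
                acc[lane]     += v[k] * w[k * 4 + lane];      /* "R2" half */
                acc[lane + 4] += v[k] * w[16 + k * 4 + lane]; /* "R3" half */
            }
        }
    }
    /* Tail: add bias, then activation (the ASM selects among ReLU and
     * square-root approximations; only ReLU is shown here). */
    const int16_t *bias = &wvec[n * 32];
    for (int lane = 0; lane < 8; lane++) {
        int32_t r = acc[lane] + bias[lane];
        dvec[lane] = (int16_t)(r < 0 ? 0 : r);
    }
}
```

As the post notes, a loop like this leaves the weights as ordinary loads through the D$ rather than immediates in the I$, at the price of scalar multiplies and load latency on every lane.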

....
