Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc Tradeoffs: Core I can fit on the XC7S50
Date: Tue, 13 Jun 2023 22:35:43 -0500
Organization: A noiseless patient Spider
Lines: 333
Message-ID: <u6bcih$3tjsi$1@dont-email.me>
References: <u61ekn$28oq1$1@dont-email.me>
<34ab4899-596e-4d51-9877-78d188078080n@googlegroups.com>
<u62hfj$2d0bh$1@dont-email.me>
<443ec881-6c9a-4251-aa59-69b6b952260bn@googlegroups.com>
<u67n00$36dql$1@dont-email.me>
<bce9bfca-a385-4384-9696-0cdeed20fc6cn@googlegroups.com>
<u69ani$3ho4r$1@dont-email.me>
<1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Jun 2023 03:35:46 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6413868313de167fbe4b969ded75b82c";
logging-data="4116370"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Gj5tLfLv88Js1Y1i7F7m0"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
Cancel-Lock: sha1:KoHh6WL3BYmTFQCXv4/DOX6G/bk=
Content-Language: en-US
In-Reply-To: <1005dc81-2a60-4bef-871d-eee8a3f0179dn@googlegroups.com>
 by: BGB - Wed, 14 Jun 2023 03:35 UTC

On 6/13/2023 1:00 PM, Peter Lund wrote:
> On Tuesday, June 13, 2023 at 10:52:07 AM UTC+2, BGB wrote:
>> On 6/13/2023 12:44 AM, robf...@gmail.com wrote:
>>> On Monday, June 12, 2023 at 2:09:08 PM UTC-4, BGB wrote:
>>>> On 6/12/2023 6:57 AM, Peter Lund wrote:
>>>>> On Saturday, June 10, 2023 at 9:04:23 PM UTC+2, BGB wrote:
>>>>>> Theoretically, could have used a Cortex-M based microcontroller, but
>>>>>> these tend to "suck real hard" at floating-point use-cases (like, even
>>>>>> running at 250MHz or similar isn't enough to offset the typically dismal
>>>>>> floating-point performance).
>>>>>
>>>>> Cortex-M0/M0+/M1 do indeed suck at fp. M4F has hardware fp so it should be ok. Even M3 will almost certainly be ok, given that it has single-cycle shifts and single-cycle counting of leading zeros.
>>>>>
>>>> I was wanting to be able to run neural nets, which kinda depend "a lot"
>>>> on floating-point performance.
>>>
>>> Could you use fixed-point? I based my neural net accelerator on 16.16 fixed
>>> point. I have not checked the size, but I think modelling a small number of
>>> neurons is possible. I used a word-serial approach and tables occupying
>>> two BRAMs per neuron. Then they can be built up into larger networks using
>>> software. Relying on a neuron accelerator rather than a pure software
>>> implementation.
>>>
>> Early on, I had considered 8.8 fixed-point, but Binary16 has a larger
>> dynamic range and can give better results. Similarly, there is no
>> (particular) performance advantage of fixed-point over floating-point in
>> this case (while the FP-SIMD ops have a higher latency, there is enough
>> going on that it is possible to schedule the instructions such that full
>> throughput is achieved).
>>
>>
>> The "PMULSH.H" instruction would have been useful:
>> * FE-FE F2nm_9qjj PMULSH.H Rm, Imm48fv8sh, Rn
>>
>> But, sadly, this is asking a little too much of the LUT budget on the
>> XC7S50... Though, this instruction was designed specifically for this
>> use-case.
>>
>> It does make a difference in that without this instruction, it makes
>> sense to compute roughly 4 neurons in parallel (with a lot of cycles
>> being spent on 64-bit constant loads), whereas to achieve full
>> throughput with this instruction, it makes more sense to calculate 16
>> neurons in parallel. But, then this also basically needs the XGPR
>> extension to be able to not run out of registers.
>>
>> This scenario pushes roughly 150 MFLOP/s at 50 MHz (vs only around 110
>> MFLOP/s with separate constant-loads and shuffles).
>>
>> For Binary32 SIMD, the situation would be a little worse, not because of
>> a limitation of the SIMD unit, but rather because this precludes being
>> able to run shuffle ops in parallel with SIMD ops.
>>
>> If I could pull off the "PMACSH.H" instruction which does FMAC instead,
>> this would allow a further speedup (saving all the clock-cycles that
>> would have been spent on PADD.H instructions). But, this would add other
>> issues (it being unclear if I can shove a Binary16 FMAC into a 3-cycle
>> latency...).
>>
>>
>> Though, unlike many other cases, rather than being D$ limited, this sort
>> of code puts a lot more strain on the I$ instead (with a comparably high
>> I$ miss rate).
>>
>> Theoretically, the SIMD unit could go up to 200 MFLOP/s at 50 MHz, but
>> achieving this is unrealistic.
>>
>>
>>
>>
>> Some of the nets I had tested with were:
>> 64-inputs;
>> 3 layers of 32 neurons;
>> 4 outputs.
>>
>> So, say, roughly 3.2k weights + 100 biases.
>>
>> For something like depth inference, one would pull, say, 32 pixels from
>> each image (in a roughly overlapping section), then tile the net across
>> the scanline (to generate a low-res Z map).
>>>>
>>>> Despite the relatively low clock-speed of the BJX2 core, having 3L/1T
>>>> FP-SIMD ops allows for "relatively decent" performance (when the NNs are
>>>> turned into a big blob of ASM).
>>>>
>>>> But, yeah, for Cortex-M, was mostly thinking of the M0/M0+ and similar...
>>>>
>>>>
>>>>
>>>> A previous micro-benchmark for an NN showed it as being "relatively
>>>> comparable" to an early 2000s laptop (using x87).
>>>>
>> For context, despite the laptop's clock-speed advantage, it was
>> hard-pressed to win with the x87.
>>
>> Granted, this is a fairly uneven comparison (eg: dense SIMD vs generic x87).
>>
>> Also, generated ASM blobs vs "more generic" C.
>>
>> Whole lot of:
>> a0=a0+va[0]*wv[0];
>> a1=a1+va[1]*wv[1];
>> a2=a2+va[2]*wv[2];
>> a3=a3+va[3]*wv[3];
>> ...
>>
>> Like, doesn't seem like it is asking "that" much?...
>>
>>
>>
>> But, like, the clock-speed disparity is big enough that one would not
>> expect a 50 MHz core to have any chance against 1.47 GHz on any metric
>> (regardless of the ISA).
>>
>> Granted, maybe it would make sense to compare against an SSE
>> implementation, since this is closer to "apples to apples".
>>>> Granted, both are also beat out at this by an original 700MHz/ARM11
>>>> RasPi (with pretty much all the newer RasPi's being clearly superior on
>>>> this front, *).
>>>>
>> With a RasPi pushing a bit more with similar code...
>>>> Could technically have just run the NNs on a RasPi, but, ...
>>>> This was more a case of me wanting to use my own stuff, vs just throwing
>>>> a RasPi at the problem.
>>>>
>>>> Looks like a Cortex-M4F or Cortex-M7 could also work (might need to
>>>> compare against a RasPi or similar; not sure how much RAM is typical for
>>>> them, or if they can handle dual camera inputs, ...).
>>>>
>>>>
>>>> *: Despite the CPU in the laptop having better looking on-paper stats
>>>> (both higher clock-speed and OoO vs in-order), general performance was
>>>> fairly comparable to a RasPi2.
>>>>
>> Eg: "Mobile Athlon XP 1700+"
>>
>> On paper, seems like it should be pretty solid...
>>
>> On paper, seems like it should stomp a RasPi or RasPi2.
>> But, my attempts at testing seem to disagree.
>>>> In terms of both floating-point and memcpy benchmarks, the RasPi2 is
>>>> significantly ahead (also the RasPi2 has more RAM as well, typical
>>>> "cheap" modern SDcards are also bigger than the laptop's HDD, ...).
>>>> Can't really put in a newer HDD because the laptop uses a parallel ATA
>>>> variant (and no one made significantly larger PATA drives).
>>>>
>> I am less sure about the RasPi2's memcpy speed advantage.
>>
>> A lot apparently comes down to the (unknown) detail of the RAM speed and
>> bus width and similar on the RasPi.
>>
>>
>> Goes and looks at stuff: PC-2100 shouldn't be at that much of a
>> disadvantage... (lower MT/s, but should have a wider bus width). The
>> laptop was equipped with 2x 256MB modules.
>>
>>
>> Will note though that my past benchmark attempts seem to fall well short
>> of the theoretical bandwidth (then again, it seems like in my attempts
>> at memcpy benchmarks, I never see numbers anywhere near the theoretical
>> bandwidth values).
>>>>
>>>>
>>>> In other news:
>>>> I recently added a hack that allows the 32-bit MUL unit in the BJX2 core
>>>> to handle a certain range of DIVS.L and DIVU.L instructions (it will
>>>> signal to the Shift-Add divider that it has handled it, and the divider
>>>> will skip doing anything; when the signal arrives to the EX3 stage, EX3
>>>> will forward the multiplier's result instead of the divider's).
>>>>
>>>> It works if the source value is positive, and the divisor is a small
>>>> positive integer which it can turn into a lookup table.
>>>>
>>>> Currently it can only deal with dividing positive and unsigned values,
>>>> as trying to use it on negative values gives "round towards negative
>>>> infinity" behavior rather than "round towards zero" (and the required
>>>> sign-flipping would wreck the timing constraints), consequently
>>>> breaking some of the "sanity checks".
>>>>
>>>> While the compiler does this transformation itself for
>>>> divide-by-constant, this hack mostly covers division at runtime.
>>>>
>>>> But, it is still a bit of an ugly hack...
>>>>
>>>>
>>>> Modeling this behavior in my emulator seems able to push Dhrystone past
>>>> the 80k mark (for a long while, it was stuck at ~ 75k). Granted, a
>>>> recent special case to "long long" multiply pushed it from 75k to 77k,
>>>> and the divider tweak pushed it the rest of the way.
>>>>
>>>> Granted, it is a bit of a hack, as it changes DIVS.L timing from a
>>>> fixed 36-cycle timing to 3 cycles, with a 33-cycle penalty case.
>>>>
>>>> ...
>
> Take a look at distillation and quantization. You almost certainly don't need floating-point for useful inference.
>
> What you do is you train a model on your laptop/TPU/Dojo cluster using floating-point. Then you distill and quantize that so you end up with a smaller and simpler model that doesn't use floating-point at all (or only uses it in a few select places). The process is basically to use the first model to train the second one instead of using the original training data. The second model can start small and grow neurons during training or it can start bigger and have neurons pruned during training.
> Both methods work -- but look up the details in recent papers first, of course.
>

Possible.

> If you are building a vision model, remember to do data augmentation on your training set.
> You can make big models generalize well with the right training method, in which case big models are nicer than small ones. You get a small one later via distillation.
>

I am not building an object-recognition model for now (this would likely
be harder than depth inference).

The idea is mostly to figure out how far away things are, to estimate or
measure how the robot has moved, and then to build a 3D voxel map.

At least in theory, most of the needed information should be in the
video stream. But, extracting it is the harder part.

There are "offline" algorithms for stereo depth inference, namely
sliding parts of the images across each other and finding the pairs with
the lowest localized RMSE. But, computationally, this is also fairly
expensive.
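
As a rough sketch of the idea (not my actual code; block size, search
range, and image layout are made up for illustration, and minimizing SSD
is equivalent to minimizing RMSE for a fixed block size):

#include <stdint.h>
#include <limits.h>

#define BLK   8    /* 8x8 comparison blocks */
#define MAXD  32   /* maximum disparity searched, in pixels */

/* SSD between the 8x8 block at (x,y) in the left image and the block
   at (x-d,y) in the right image; 8-bit grayscale, row-major.
   Caller ensures x+BLK<=w, y+BLK<=h, and d<=x. */
static uint32_t block_ssd(const uint8_t *lf, const uint8_t *rt,
                          int w, int x, int y, int d)
{
    uint32_t ssd = 0;
    int i, j, dv;
    for (j = 0; j < BLK; j++)
        for (i = 0; i < BLK; i++)
        {
            dv = lf[(y + j) * w + (x + i)] - rt[(y + j) * w + (x + i - d)];
            ssd += (uint32_t)(dv * dv);
        }
    return ssd;
}

/* Slide over the disparity range and keep the offset with the lowest
   error; depth then follows from Z ~ f*B/d for camera baseline B. */
static int best_disparity(const uint8_t *lf, const uint8_t *rt,
                          int w, int x, int y)
{
    uint32_t e, best = UINT32_MAX;
    int d, bestd = 0;
    for (d = 0; d <= MAXD && d <= x; d++)
    {
        e = block_ssd(lf, rt, w, x, y, d);
        if (e < best) { best = e; bestd = d; }
    }
    return bestd;
}

Run over every block of the image, this is the sort of
O(W*H*MAXD*BLK^2) cost that makes the brute-force version expensive.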

I had previously made a variant which used a color-cell encoding scheme
(similar to the alpha channel from DXT5) as a pre-optimization step
(estimating the maximal error and then only checking the blocks with the
lowest maximum error).

This version would have been fast enough to run real-time on a RasPi.
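
The gist of the pre-filter, as a hypothetical sketch (the actual
encoding used DXT5-alpha style interpolated indices; here just the
block min/max is used): the value ranges of two blocks give a cheap
bound on the match error, so the full per-pixel comparison only needs
to run on candidates whose bound beats the best match so far.

typedef struct { uint8_t vmin, vmax; } cell_t;

/* Cheap lower bound on block SSD from per-block value ranges: if the
   ranges don't overlap, every pixel pair differs by at least the gap
   between them (BLK as in the earlier sketch). */
static uint32_t cell_ssd_lobound(cell_t a, cell_t b)
{
    int gap = 0;
    if (a.vmin > b.vmax)      gap = a.vmin - b.vmax;
    else if (b.vmin > a.vmax) gap = b.vmin - a.vmax;
    return (uint32_t)(gap * gap) * (BLK * BLK);
}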

For some past experiments, I had used some of the "stereo 3D video" off
of YouTube for testing/training, which has the video as two separate
left/right views. Results were mixed, as the eye separation/etc. was
often inconsistent and wildly exaggerated.

They also often seemed to use a "two widely spaced cameras, slightly
cross-eyed to converge directly behind a target" strategy, rather than
"two parallel cameras pointing directly forward with a smaller fixed
displacement". I suspect this is because the former leads to a more
obvious "3D!" effect than the latter (they were treating it more like a
sensory gimmick than trying to accurately recreate a "true to life"
experience).

Well, and then there is a single-camera algorithm where one can use
depth of field, which sorta worked with cheap camera modules as there
was a significant difference between in-focus and out-of-focus areas
over a relatively short range of distances.

However, this approach is ineffective with 3D renderer images (which
tend to lack any sort of depth of field).

Cameras with auto-focus would also break it. But, pros/cons, most cheap
modules have the lens just sort of mounted to a screw-mechanism that can
be manually turned to adjust focus.

For this, one could use a "much simpler" NN (or "perceptron") trained on
the output of the DCT transform (similar to what would be used in JPEG
encoding).
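
Roughly what I have in mind, as a sketch (dct8x8() stands in for a
standard JPEG-style 8x8 DCT, and the weights/bias would come from
training; none of these names are from real code):

/* Per-block focus measure: feed the absolute AC coefficients of an
   8x8 DCT into a single-layer perceptron.  In-focus blocks have more
   high-frequency energy, so a trained weighting of the AC terms can
   turn sharpness into a rough depth-from-focus score. */
extern void dct8x8(const uint8_t *blk, int stride, int16_t coef[64]);

static float fw[64];  /* trained weights (AC terms) */
static float fb;      /* trained bias */

static float block_focus_score(const uint8_t *img, int w, int x, int y)
{
    int16_t c[64];
    float acc = fb;
    int i;
    dct8x8(img + y * w + x, w, c);
    for (i = 1; i < 64; i++)              /* skip the DC term */
        acc += fw[i] * (float)(c[i] < 0 ? -c[i] : c[i]);
    return acc;
}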

Though, for whatever reason, pretty much no one else seems to do this...

OTOH, I guess the drawback is that one basically ends up with a camera
with "spectacularly bad" image quality, limiting its usefulness for much
else.

I had imagined before that if one had a camera where one could control
the focus distance (such as with a focus servo), it could be possible to
use a "Z sweep" strategy to "map out" a 3D space by using camera focus.

Though, this would lead to the drawback of needing to nearly constantly
sweep the lens back and forth, taking a series of images at each depth
(and/or having a movable lens in front of the camera).
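
In rough outline (sketch only; set_focus() and grab_frame() are
hypothetical camera-interface functions, and block_focus_score() is the
DCT measure sketched above):

#define NSTEP 16   /* number of focus positions in the sweep */

extern void set_focus(int step);       /* hypothetical servo control */
extern void grab_frame(uint8_t *img);  /* hypothetical frame capture */

/* Step the focus through NSTEP positions, grab a frame at each, and
   record for each 8x8 block the step at which it was sharpest; the
   winning step index serves as a coarse Z value for that block. */
static void focus_sweep_zmap(uint8_t *img, int w, int h,
                             float *best, uint8_t *zmap)
{
    int step, x, y, bi;
    for (bi = 0; bi < (w / 8) * (h / 8); bi++)
        best[bi] = -1e30f;
    for (step = 0; step < NSTEP; step++)
    {
        set_focus(step);
        grab_frame(img);
        for (y = 0; y + 8 <= h; y += 8)
            for (x = 0; x + 8 <= w; x += 8)
            {
                float s = block_focus_score(img, w, x, y);
                bi = (y / 8) * (w / 8) + (x / 8);
                if (s > best[bi]) { best[bi] = s; zmap[bi] = (uint8_t)step; }
            }
    }
}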

Well, and then there is "structure from motion", but this only works if
the camera or scene is moving.

One other idea (which would be more involved) would be to make an
unusual "lobed" front lens such that there end up being multiple focal
points on the image sensor. I don't currently have the resources to make
something like this.

Though, one other possibility would be to make a sort of lens with a V
or W shaped profile (rather than a normal uniformly circular profile),
which could in turn encode some 3D spatial information in the form of
optical effects.

Where, with some rounding, a V shaped profile would almost resemble a
"heart" shape, and the W shape would be sort of like two heart shapes
merged together, likely with a thicker middle section and thinner edges
(but also sorta following the profile of the letter W).

It seems like an "F-hole" shaped lens could maybe also work.

I have not really found anything talking about lenses like this, so all
of this is mostly based on "my ability to imagine stuff".

Granted, it seems like people are mostly designing lenses to make
"conventionally good looking camera images" and similar rather than
trying to figure out ways to capture spatial information via an image
sensor using optical effects.

The W and F-hole lens could potentially have a reasonably normal looking
middle section, with some of the spatial information mostly located in
the peripheral parts of the image. The V (or heart-shaped) lens would
likely instead mostly distort the middle section.

Other shapes seem possible as well.

These strategies could potentially also allow gaining basic 3D
information with a single camera.

In theory (given a small ball-nose endmill and a machine that can hold
fairly tight tolerances), it should be possible to machine such a lens
out of a piece of polycarbonate or similar (and then just sorta stick it
onto the front of a camera module).

....
