Re: FPGA use

Subject	Author
FPGA use	Brian G. Lucas
Re: FPGA use	BGB
Re: FPGA use	Robert Finch
Re: FPGA use	BGB
Re: FPGA use	Robert Finch
Re: FPGA use	BGB
Re: FPGA use	Anton Ertl
Re: FPGA use	Thomas Koenig
Re: FPGA use	Anton Ertl
Re: FPGA use	Scott Lurndal
Re: FPGA use	BGB
Re: FPGA use	Scott Lurndal
Re: FPGA use	BGB
Re: FPGA use	Scott Lurndal
Re: FPGA use	David Brown
Re: FPGA use	BGB
Re: FPGA use	Anton Ertl
Re: FPGA use	BGB-Alt
Re: FPGA use	Anton Ertl
Re: FPGA use	BGB
Re: FPGA use	David Brown
Re: FPGA use	BGB
Re: FPGA use	Torbjorn Lindgren
Re: FPGA use	David Brown
Re: FPGA use	Anssi Saari
Re: FPGA use	Michael S
Re: FPGA use	Robert Finch
Re: FPGA use	David Brown
Re: FPGA use	Marcus
Re: FPGA use	BGB
Re: FPGA use	BGB
Re: FPGA use	BGB
Re: FPGA use	Robert Finch
Re: FPGA use	BGB
Re: FPGA use	Terje Mathisen
Re: FPGA use	Terje Mathisen
Re: FPGA use	BGB

Re: FPGA use

<unri4l$3g3ve$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36832&group=comp.arch#36832

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Fri, 12 Jan 2024 09:25:24 -0500
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <unri4l$3g3ve$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <ung6vq$1f2j0$1@dont-email.me>
<ungai4$1fgmp$1@dont-email.me> <ungm0n$1h241$1@dont-email.me>
<unhbua$1k7dn$1@dont-email.me> <unlg0a$2d9d2$2@dont-email.me>
<sm01qaoyrzk.fsf@lakka.kapsi.fi> <20240112153000.00005db4@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 12 Jan 2024 14:25:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="15a9b9a94ceb7fc90ccfcfbd0778c599";
logging-data="3674094"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19QDneUeiTUbksr+2a3sMnjDka8u3ief5c="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:nbUXVEL15fwvTOP9Y8q+5D0dPPU=
In-Reply-To: <20240112153000.00005db4@yahoo.com>
Content-Language: en-US

by: Robert Finch - Fri, 12 Jan 2024 14:25 UTC

On 2024-01-12 8:30 a.m., Michael S wrote:
> On Thu, 11 Jan 2024 14:27:59 +0200
> Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
>
>> David Brown <david.brown@hesbynett.no> writes:
>>
>>> I remember working with a fairly large CPLD in the late 90's, with a
>>> 16-bit Windows system. It could take 2 or 3 hours for the build, if
>>> it fitted - and 6 to 10 hours if it failed. And being 16-bit
>>> Windows, the system was useless for anything else while the build
>>> was running. That was tedious development work!
>>
>> I have a similar memory but I'm not sure what kind of Windows it
>> was. FPGA was something fairly large for the time, the FPGA tool was
>> probably early Quartus and there was no progress indication at all.
>> When you hit start, the mouse cursor changed to an hourglass for hours and
>> then it was done, pass or fail. I think someone rigged the then-newfangled
>> invention called a web camera to watch the display so we
>> didn't have to walk to the machine to see if it was finished or not.
>
> What David describes does not sound like Altera.
> By the late 90s, everybody in the Altera world was using MAX+Plus II. It was
> 32-bit.
> Altera's biggest CPLD family back then was Max 9000, which was not
> particularly big. Maybe the biggest design could take 30 minutes to
> compile, but I never encountered that. 5-7 minutes was more typical on
> a decent Pentium-II under Win NT4. And it didn't even need a lot of RAM.
> 64 MB was fully sufficient.
> Now, the biggest contemporary Altera FPGAs (Flex 10K family) are a
> different story. I never used the biggest members of the family myself,
> but heard that compilation could take several hours.
>
> I still have a project with an ACEX FPGA to maintain. ACEX has exactly the
> same architecture as Flex 10K, the only difference is fewer SKUs and
> a much lower price tag. But ACEX is supported by Quartus-II v.6 (circa
> 2006) so I have no need to somehow convince old MAX+Plus II software to
> work on newer OSes. So on this front I have no war stories to share.
>
> As for Quartus-II, I don't remember ever using very early versions.
> Likely v.4 is the first I used. This one, of course, had a progress bar
> (unreliable, but then every progress bar I had ever seen in this sort
> of software was unreliable) and it could beep when it finished compilation.
>
>
>
A few years ago I bought a Trex-C1 FPGA board, which has an Altera FPGA on
it. Used it with Quartus (I think) for about a year. Worked very well. I
seem to recall running into a licensing issue, but then I got a newer,
larger FPGA board. I still have the board around somewhere, not wanting
to part with it.

Re: FPGA use

<unrqrb$3heeq$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36837&group=comp.arch#36837

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Fri, 12 Jan 2024 17:54:03 +0100
Organization: A noiseless patient Spider
Lines: 31
Message-ID: <unrqrb$3heeq$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <ung6vq$1f2j0$1@dont-email.me>
<ungai4$1fgmp$1@dont-email.me> <ungm0n$1h241$1@dont-email.me>
<unhbua$1k7dn$1@dont-email.me> <unlg0a$2d9d2$2@dont-email.me>
<sm01qaoyrzk.fsf@lakka.kapsi.fi> <20240112153000.00005db4@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 12 Jan 2024 16:54:03 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8cdd22475589055e3052b9eb67ea743e";
logging-data="3717594"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18FO8SU5sy/cEY6/8b/UqXiZL6H+uQWGzo="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.11.0
Cancel-Lock: sha1:3ktouBveoOFP/r36/rzOc0aRV9U=
In-Reply-To: <20240112153000.00005db4@yahoo.com>
Content-Language: en-GB

by: David Brown - Fri, 12 Jan 2024 16:54 UTC

On 12/01/2024 14:30, Michael S wrote:
> On Thu, 11 Jan 2024 14:27:59 +0200
> Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
>
>> David Brown <david.brown@hesbynett.no> writes:
>>
>>> I remember working with a fairly large CPLD in the late 90's, with a
>>> 16-bit Windows system. It could take 2 or 3 hours for the build, if
>>> it fitted - and 6 to 10 hours if it failed. And being 16-bit
>>> Windows, the system was useless for anything else while the build
>>> was running. That was tedious development work!
>>
>> I have a similar memory but I'm not sure what kind of Windows it
>> was. FPGA was something fairly large for the time, the FPGA tool was
>> probably early Quartus and there was no progress indication at all.
>> When you hit start, the mouse cursor changed to an hourglass for hours and
>> then it was done, pass or fail. I think someone rigged the then-newfangled
>> invention called a web camera to watch the display so we
>> didn't have to walk to the machine to see if it was finished or not.
>
> What David describes does not sound like Altera.
> By the late 90s, everybody in the Altera world was using MAX+Plus II. It was
> 32-bit.

It was before then. Probably a Mach4 or Mach5, from Vantis (originally
AMD, then later Lattice, if I remember correctly). I don't recall which
parts we used when we had the worst build times, but we were absolutely
pushing the limits of the devices - at least 95% macrocell usage. And
the PC used for the job was not top of the range, by any means.

Re: FPGA use

<unrtr2$3htab$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36840&group=comp.arch#36840

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Fri, 12 Jan 2024 11:44:56 -0600
Organization: A noiseless patient Spider
Lines: 126
Message-ID: <unrtr2$3htab$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <ung6vq$1f2j0$1@dont-email.me>
<ungai4$1fgmp$1@dont-email.me> <ungm0n$1h241$1@dont-email.me>
<unhbua$1k7dn$1@dont-email.me> <2024Jan9.085310@mips.complang.tuwien.ac.at>
<unk6l4$285p7$2@newsreader4.netcologne.de>
<2024Jan9.233826@mips.complang.tuwien.ac.at> <unkn9j$26p7d$1@dont-email.me>
<xslnN.145430$PuZ9.119892@fx11.iad> <unkue2$27jh9$1@dont-email.me>
<2024Jan10.180420@mips.complang.tuwien.ac.at> <unpiau$34c5u$1@dont-email.me>
<2024Jan12.081610@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 12 Jan 2024 17:45:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="267e646840a57255f4927348df7da9b5";
logging-data="3732811"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+2orXNE1qO/1pVW2IGAl1L"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ILXykBDyv1xytiX2IZO9bRG0VbY=
Content-Language: en-US
In-Reply-To: <2024Jan12.081610@mips.complang.tuwien.ac.at>

by: BGB - Fri, 12 Jan 2024 17:44 UTC

On 1/12/2024 1:16 AM, Anton Ertl wrote:
> BGB-Alt <bohannonindustriesllc@gmail.com> writes:
>> On 1/10/2024 11:04 AM, Anton Ertl wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> Looking elsewhere, it says that the AMD B450 chipset has a 128GB limit.
>>>
>>> AM4 generally does not support more than 128GB RAM, because DDR4 can
>>> only have 16GB/chip-select and channel, and AM4 only supports 4
>>> chip-selects per channel and 2 channels. But B450 certainly does
>>> support 128GB. We have 128GB in at least one machine with a B450
>>> board:
>>>
>>> # dmidecode|grep -B1 B450
>>> Manufacturer: ASUSTeK COMPUTER INC.
>>> Product Name: TUF B450M-PLUS GAMING
>>>
>>> # dmesg
>>> ...
>>> [ 3.859281] EDAC MC: UMC0 chip selects:
>>> [ 3.859282] EDAC amd64: MC: 0: 16384MB 1: 16384MB
>>> [ 3.860295] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>>> [ 3.861280] EDAC MC: UMC1 chip selects:
>>> [ 3.861280] EDAC amd64: MC: 0: 16384MB 1: 16384MB
>>> [ 3.862238] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>>> ...
>>> # free
>>> total used free shared buff/cache available
>>> Mem: 131832604 104094356 17986860 24772 11018444 27738248
>>> Swap: 0 0 0
>>>
>>> Apparently the kernel uses a little over 2GB for its own purposes (the
>>> number of total KB shown is only 125.7GB).
>>>
>>
>> Not sure, is this with an integrated or discrete GPU?...
>
> Discrete. The CPU is a 3900X.
>

OK.

Was using a GTX 980 for a fairly long time, then got an RTX 3050...

>> Currently 2733 MHz IIRC, as 2933 MHz seemed to have stability issues
>> (the new RAM modules, from Corsair, claimed to be good for 3200 MHz, but
>> I didn't see stable results much over 2733; was running 2933 with the
>> old modules, which IIRC claimed 3000 MHz...).
>
> The machine reported above uses DIMMs rated at 2666MT/s, and runs them
> at 2666MT/s. dmidecode outputs:
>
> ...
> Speed: 2666 MT/s
> ...
> Configured Memory Speed: 2666 MT/s
> ...
>

Looked it up; yeah, it is MT/s, not MHz, in this case.

The RAM is often sold as 3000 or 3200, the idea being that one runs it at
that speed.

Usually they list JEDEC and XMP1/XMP2 speeds, e.g.:
  JEDEC: 2133 in this case.
  XMP: 3000 or 3200, but freely modifiable in the BIOS.

IIRC, timings are something like 16-20-20.
XMP1 and XMP2 generally seem to be equivalent.

In the past, much of the RAM came with huge elaborate heatsinks and
often RGB LEDs, the newer RAM has a less massive heatsink, and no RGB.

At one point, there was some RAM online that seemed to have some sort of
"RGB angel wings" thing going on. Seems to have gone away when I was
last looking online for RAM.

Not entirely sure how a person was supposed to get it seated (like,
normally, one needs to be able to apply pressure at various
points across the module, which can't really happen in this case if the
top is covered with big spiky angel wings...).

Most of the other (remaining) RAM seems content to have a top ridge
of RGB or similar.

Looking at it, a lot of the RGB'ed RAM also has brand names like, say:
Ballistix, T-Force, Dominator, Fury, WarHawk, Vengeance, ...
Seems like they are trying to go for a gamer machismo thing...
With RGB, they could have also gone for the whole "kawaii uwu" thing.
Maybe also made the RAM heatsinks pink or lavender, ...

Don't remember who made the absurd looking angel-wing RAM, have a vague
memory it may have been T-Force, but don't really remember. All the
stuff they seem to be selling now has a much more sensible heatsink design.

Well, nevermind if the MOBO itself has a bunch of RGB LEDs on it, and
turns into a sort of lightshow whenever the computer is turned on (I
guess theoretically, there is software to control this, but meh,
whatever...).

>> IIRC, I have Windows currently set up with around 384GB of swap space,
>> spread across several HDDs.
>
> Have you computed how long it would take to page 384GB out to the HDDs
> and to page them back in? IME paging to HDDs does not make sense
> anymore (and paging to SSDs is questionable, too).
>

Performance of the swapfile isn't the issue; the point is rather that the
computer doesn't go "oh crap" and die once it uses up all the RAM.

Like, just went and looked and saw that the PC was sitting at around 240 GB
of commit-charge, much of this likely "Firefox doing the Firefox thing"
(if one leaves it running for long enough, it eventually expands and
consumes all available RAM and swap space, until killed off and
restarted...).

> - anton

Re: FPGA use

<untlmv$3ssbv$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36859&group=comp.arch#36859

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: m.delete@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Sat, 13 Jan 2024 10:38:37 +0100
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <untlmv$3ssbv$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 13 Jan 2024 09:38:39 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="84021c670c24a6254185d79aeb59c4bf";
logging-data="4092287"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18JBa7fvT3mlrv2KRhHf3fk/RtKgC3UHmY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Uxu/c/ArX2P/kc6XF4mCRY5WHBc=
In-Reply-To: <unf3r6$17714$1@dont-email.me>
Content-Language: en-US

by: Marcus - Sat, 13 Jan 2024 09:38 UTC

On 2024-01-07 22:07, Brian G. Lucas wrote:
> Several posters on comp.arch are running their cpu designs on FPGAs.

Yes. I use it for my MRISC32 project. I implement a CPU (MRISC32-A1) as
a soft processor for FPGA use. Furthermore I implement a kind of
computer around the CPU, to provide ROM, RAM, I/O, graphics, etc.

Here's a recent video of the computer running Quake:

https://vimeo.com/901506667

I use/target two different FPGA boards, but I mainly use one of them for
development.

> I have several questions:
> 1. Which particular FPGA chip? (not just family but the particular SKU)

a) Intel Cyclone-V 5CEBA4F23C7N (my main development FPGA)

b) Intel MAX 10 10M50DAF484 (this is the smaller one of the two)

> 2. On what development board?

a) Terasic DE0-CV

b) Terasic DE10-Lite

> 3. Using what tools?

Development: Sublime Text + VS Code + GHDL + gtkwave (all free).

Programming: Intel Quartus Prime Lite Edition, v19.1.0 (it's free).

>
> Thanks,
> brian

Re: FPGA use

<unusod$3290$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36863&group=comp.arch#36863

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Sat, 13 Jan 2024 14:45:01 -0600
Organization: A noiseless patient Spider
Lines: 268
Message-ID: <unusod$3290$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 13 Jan 2024 20:45:09 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="d931ae28f47a555e2ab859d8d5c2be37";
logging-data="100640"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19P6Hd/2VzwLEburXjJJW3T"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:24vI14zTwas2IQJ5jMOa1f8y6jU=
Content-Language: en-US
In-Reply-To: <untlmv$3ssbv$1@dont-email.me>

by: BGB - Sat, 13 Jan 2024 20:45 UTC

On 1/13/2024 3:38 AM, Marcus wrote:
> On 2024-01-07 22:07, Brian G. Lucas wrote:
>> Several posters on comp.arch are running their cpu designs on FPGAs.
>
> Yes. I use it for my MRISC32 project. I implement a CPU (MRISC32-A1) as
> a soft processor for FPGA use. Furthermore I implement a kind of
> computer around the CPU, to provide ROM, RAM, I/O, graphics, etc.
>
> Here's a recent video of the computer running Quake:
>
> https://vimeo.com/901506667
>
> I use/target two different FPGA boards, but I mainly use one of them for
> development.
>

At least it is going fast...

If I run the emulator at 110 MHz:
Software Quake gets ~ 10.4 fps.
GLQuake gets 12.7 fps.

Though, the GLQuake performance partly took a hit recently as I had been
moving away from running the GL backend directly in the program, to
instead run it via system calls.

I had partly integrated some features from the version that was stuck
onto the Quake engine into the other branch which was modified to work
inside the TKGDI process, such as support for the rasterizer module.

Looking in the profile output, it appears it is still doing a bit of the
GL rendering via the software span-drawing though.

Though, in the process, it has gone from the use of "hybrid poor man's
perspective correct" back to plain "affine texturing with dynamic
subdivision", with a comparatively finer subdivision.

So, proper perspective correct would involve:
  Divide ST coords by Z before rasterization;
  Interpolate as 1/Z;
    Dynamically calculate "Z" via "1/(interpolated 1/Z)";
  Scale ST coords by Z during rasterization.

Poor man's version:
  Divide ST coords by Z before rasterization;
  Interpolate as Z;
  Scale ST coords by Z during rasterization.
This version isn't as good as the proper version, and adds some of its
own issues vs affine.

Affine:
  Interpolate ST coords directly (no Z scaling).

However, larger primitives (*) with affine texturing need to be split
apart into smaller primitives during rendering, which adds cost in terms
of transform/projection, which it seems is a more significant part of
the cost when using the hardware rasterizer module.

Actual perspective-correct could be better here, but the "quickly and
semi-accurately calculate 1/(1/Z)" part is a challenge.

*: At the moment, basically any triangle with a perimeter larger
than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going too
much bigger makes the affine warping a lot more obvious.
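
As a rough illustration of the three texturing schemes listed above, here is a
minimal sketch in C for a single coordinate S interpolated between two span
endpoints. The function names are hypothetical and floating point is used for
clarity; this is not the code from the actual rasterizer, which is not shown
in the post.

```c
#include <assert.h>

/* Linear interpolation between a and b, with t in [0,1]. */
static double lerp(double a, double b, double t)
{
    return a + t * (b - a);
}

/* Affine: interpolate S directly, ignoring depth. */
double st_affine(double s0, double s1, double t)
{
    return lerp(s0, s1, t);
}

/* Proper perspective-correct: interpolate S/Z and 1/Z, recover
 * Z = 1/(interpolated 1/Z) per pixel, then scale S/Z by that Z. */
double st_perspective(double s0, double z0, double s1, double z1, double t)
{
    assert(z0 > 0.0 && z1 > 0.0);
    double s_over_z = lerp(s0 / z0, s1 / z1, t);
    double inv_z    = lerp(1.0 / z0, 1.0 / z1, t);
    double z        = 1.0 / inv_z;   /* the costly per-pixel reciprocal */
    return s_over_z * z;
}

/* "Poor man's" variant: interpolate S/Z and Z itself (no reciprocal),
 * then scale S/Z by the linearly interpolated Z. */
double st_poor_mans(double s0, double z0, double s1, double z1, double t)
{
    assert(z0 > 0.0 && z1 > 0.0);
    double s_over_z = lerp(s0 / z0, s1 / z1, t);
    double z        = lerp(z0, z1, t);
    return s_over_z * z;
}
```

With endpoints (S=0, Z=1) and (S=1, Z=3), the midpoint comes out as 0.5 for
affine, 0.25 for perspective-correct, and about 0.333 for the poor man's
version. All three agree at the endpoints and whenever both endpoints share
the same Z.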

The Software Quake in this case, is a modified version of the Quake C
software renderer:
  Was modified early on to use 16-bit pixels rather than 8-bit pixels;
    Initially, this was YUV655, but then went to RGB555.
  A few functions were rewritten in ASM.
    Though, still basically all scalar code;
    The bulk of the renderer is still C though.

There is still some weirdness in a few places where the math still
assumes YUV, which leads to things like the menu background blending
being the wrong color (never got around to fixing this), ...
(The video there seemed to show a dithered effect, which is a little
different from a color-blend).

Did gain some alpha blended effects (such as a translucent console),
because these seemed cool at the time, and aren't too hard to pull off
with RGB pixels.

Note that my GLQuake port is still faster than the Quake
software-renderer, even with software-rasterized OpenGL.

Does sort of imply a faster software renderer could still be possible...

Though, in my Doom port, I did eventually go from the use of
color-blending (for things like screen flashes) to the use of
integrating the color-flash into the active "colormap" table (*), which
is used every time a span or column is drawn in Doom (not so much in SW
Quake; where texturing+lighting is precalculated via a "surface cache").

*: It being computationally faster to RGB blend the current version of
the colormap table, than to RGB blend the final screen image (with
menus/status-bar/etc being drawn via the unblended colormap).
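
A minimal sketch of the idea, assuming the colormap is a flat table of RGB555
entries (the table layout, function names, and the 0..256 alpha scale are
assumptions for illustration, not taken from the actual port):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one RGB555 pixel toward 'flash'.
 * alpha ranges 0..256: 0 keeps px unchanged, 256 yields flash. */
uint16_t blend_rgb555(uint16_t px, uint16_t flash, unsigned alpha)
{
    unsigned r0 = (px >> 10) & 31, g0 = (px >> 5) & 31, b0 = px & 31;
    unsigned r1 = (flash >> 10) & 31, g1 = (flash >> 5) & 31, b1 = flash & 31;
    unsigned r = (r0 * (256 - alpha) + r1 * alpha) >> 8;
    unsigned g = (g0 * (256 - alpha) + g1 * alpha) >> 8;
    unsigned b = (b0 * (256 - alpha) + b1 * alpha) >> 8;
    return (uint16_t)((r << 10) | (g << 5) | b);
}

/* Build the "active" colormap from the unblended one. This runs once per
 * change of the flash state, instead of once per screen pixel per frame. */
void blend_colormap(uint16_t *dst, const uint16_t *src, size_t n,
                    uint16_t flash, unsigned alpha)
{
    assert(alpha <= 256);
    for (size_t i = 0; i < n; i++)
        dst[i] = blend_rgb555(src[i], flash, alpha);
}
```

The saving comes from the table being smaller than the frame. For scale,
stock Doom's COLORMAP lump holds 34 x 256 = 8704 entries, versus 64000 pixels
in a 320x200 frame; the sizes used in the port discussed here are not stated.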

Though, I did once experiment with eliminating the colormap table
entirely in Doom, and using purely RGB modulation (like one might do in
a GL-style rasterizer), but this was slower than using the colormap table.

At the moment, Doom at least mostly holds over 20 fps (at 50MHz), having
gained a few fps on average with a recent experimental optimization:
Temporary variables which are used exclusively as function-call inputs
may have the expression output directly to the register corresponding to
the function argument, rather than first going to a callee-save register
and then being MOV'ed to the final argument register.

Effect seems to be:
  Makes binary 3% smaller;
  Makes Doom roughly 9% faster;
  Drops "MOV Reg,Reg" from being ~ 16% of total ops, to ~ 13%;
  Causes the number of bundled instructions to drop by 1% though;
  ...

Note that this only applies to temporaries, not to expressions performed
via local variables or similar, which still use callee-save registers.

Had sort of hoped it would save more, but it seems like many of the
"MOV"s for function arguments are coming from local variables rather
than temporaries (but, unlike a temporary, the contents of a local
variable still need to be intact after the function call).

After a fair bit of debugging (to get the built program to not be
entirely broken), this change has a more obvious effect on the
performance of ROTT (which gets around 70% faster and ~ 6% smaller).
(Though, there is some other unresolved, less-recent bug, that has
seemed to cause MIDI playback in ROTT to sound like broken garbage).

Though, for ROTT this wasn't isolated from a few other recent optimizations:
  Eliminating initial condition-check with "for()" loops of the form:
      for(i=M; i<N; i++)
    When M<N, and both are constant.
  Reworking "*ptr++" in the RIL-IR stage to eliminate an extra "MOV";
    Also eliminates using an extra temporary (manifest as the "MOV").
    Involved detecting and handling this as a single operation.
      And generating the RIL3 stack-operations in a different order.
      Didn't bother detecting/handling preincrement cases yet though.
  Making expressions like "x=*ptr;" not use an extra temporary;
  ...
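
To illustrate the first item, here is a sketch of the two loop shapes written
out in C, with a counter for how many times the condition is evaluated. The
constants LOOP_M and LOOP_N and the function names are hypothetical, and the
exact code BGBCC emits is not shown in the post; this only demonstrates why
skipping the entry check is safe when M<N is known at compile time.

```c
#include <assert.h>

enum { LOOP_M = 2, LOOP_N = 6 };   /* hypothetical constants, LOOP_M < LOOP_N */

/* Generic lowering of for(i=M; i<N; i++): the condition is tested
 * before the first iteration as well as after each one. */
int sum_generic(int *checks)
{
    int i = LOOP_M, sum = 0;
    *checks = 0;
    for (;;) {
        (*checks)++;
        if (!(i < LOOP_N))
            break;
        sum += i;
        i++;
    }
    return sum;
}

/* Lowering with the initial check removed. Only valid because
 * LOOP_M < LOOP_N is a compile-time fact, so the body runs at least once. */
int sum_no_entry_check(int *checks)
{
    int i = LOOP_M, sum = 0;
    *checks = 0;
    do {
        sum += i;
        i++;
        (*checks)++;
    } while (i < LOOP_N);
    return sum;
}
```

Both versions compute the same sum; the second evaluates the condition one
time fewer per execution of the loop, which matters most for short loops
that are entered often.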

Well, and other changes:
  Making the size-limit for inline "memcpy()" smaller,
    added a copy-slide and generated memcpy's for intermediate cases.
  Was:
    < 128 byte: generate inline.
    < 512 byte: maybe special-case inline (if speed-optimized).
  Now:
    < 64 byte: generate inline
    < 512: call a generated unrolled copy-loop/slide.
  This mostly being because handling larger cases inline is bulky.
    It takes around 512 bytes of ".text" to copy 512 bytes inline...
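
A rough sketch of what a "copy slide" can look like when written in C: a fully
unrolled copy that is entered partway through, depending on the size. The
names, the 64-bit word granularity, and the 8-word limit are assumptions made
to keep the example short; what happens at 512 bytes and above is not stated
in the post, so the fallback to a general call below is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum copy_strategy { COPY_INLINE, COPY_SLIDE, COPY_CALL };

/* Size thresholds as listed under "Now" above. */
enum copy_strategy choose_copy_strategy(size_t nbytes)
{
    if (nbytes < 64)
        return COPY_INLINE;
    if (nbytes < 512)
        return COPY_SLIDE;
    return COPY_CALL;   /* assumption: fall back to a general memcpy() call */
}

/* Copy nwords 64-bit words (0..8). The switch jumps into the unrolled
 * sequence so that exactly nwords stores execute, with no loop overhead. */
void copy_slide(uint64_t *dst, const uint64_t *src, unsigned nwords)
{
    assert(nwords <= 8);
    switch (nwords) {
    case 8: dst[7] = src[7]; /* fall through */
    case 7: dst[6] = src[6]; /* fall through */
    case 6: dst[5] = src[5]; /* fall through */
    case 5: dst[4] = src[4]; /* fall through */
    case 4: dst[3] = src[3]; /* fall through */
    case 3: dst[2] = src[2]; /* fall through */
    case 2: dst[1] = src[1]; /* fall through */
    case 1: dst[0] = src[0]; /* fall through */
    case 0: break;
    }
}
```

The appeal of the slide is that one shared copy of the unrolled sequence
serves every call site, rather than each call site carrying its own inline
expansion in ".text".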

Some of this is basically a case of going through some debug ASM and
looking for "stupid instruction sequences", and trying to figure out
what causes them and how to fix it.

However, "obvious cases that save lots of instructions" are becoming
much less common.

And, some other optimizations, such as "constant propagation" would be a
lot more difficult to pull off... Where, say, the value of a constant
would be seen via a variable rather than a "#define" or similar; my
compiler already has the optimization of replacing expressions like
"2+3" with "5".

The big problem with constant propagation is that whether or not a
constant can be propagated depends on local visibility and control flow
(and would likely be of very limited effectiveness if it could not cross
boundaries between basic-blocks).

For example, if it could not cross a basic-block boundary, it would have
still been N/A for the previous "for() loop" optimization (which in this
case was handled via AST level pattern matching).
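
The distinction between folding and propagation can be shown with three small
C functions (hypothetical examples, not taken from any of the programs
mentioned):

```c
#include <assert.h>

/* Constant folding: "2+3" is visible in one expression and can be
 * replaced with "5" without looking at anything else. */
int folded(void)
{
    return 2 + 3;
}

/* Constant propagation within a basic block: the compiler must track
 * that n still holds 2 at the point of use before it can fold n+3. */
int needs_propagation(void)
{
    int n = 2;
    return n + 3;
}

/* Control flow gets in the way: after the if, n is 2 on one path and 7
 * on the other, so n+3 has no single constant value to propagate. */
int blocked_by_control_flow(int flag)
{
    int n = 2;
    if (flag)
        n = 7;
    return n + 3;
}
```

The first case needs only local pattern matching. The second needs
per-variable tracking of known values, and the third shows why that tracking
has to reason about what happens where basic blocks join.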

Some of the remaining inefficiencies cross multiple levels in the
compiler, which is annoying...

Then there are a lot of things that GCC does, that I have little idea
how to pull off at the moment.

For example, it assigns local variables to registers which seem to be
localized and flow across basic-block boundaries; currently BGBCC does
nothing of the sort (closest it can do is rank the most-used variables,
and static-assign them to registers for the scope of the whole function;
anything else using spill-and-fill via the stack frame).

Sadly, despite having 64 GPRs, still have not entirely eliminated the
use of spill and fill. The mechanism that can eliminate spill-and-fill
on a function scale (by assigning everything to registers), is basically
defeated as soon as anything takes the address of a local variable or
similar (whole function falls back to the generic strategy; the local
variable in question going over to not caching the value in a register
at all, and instead using spill/fill every time that variable is
accessed, anywhere in the function...).

Re: FPGA use

<unuuuq$3drs$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36864&group=comp.arch#36864

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Sat, 13 Jan 2024 15:22:34 -0600
Organization: A noiseless patient Spider
Lines: 279
Message-ID: <unuuuq$3drs$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
<unusod$3290$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 13 Jan 2024 21:22:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="d931ae28f47a555e2ab859d8d5c2be37";
logging-data="112508"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19RveG1BpPqsv86ktazQuFz"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ZyEEk13cyMoMb2lhyop7YaUrF6o=
Content-Language: en-US
In-Reply-To: <unusod$3290$1@dont-email.me>

by: BGB - Sat, 13 Jan 2024 21:22 UTC

Clarification, all of the rest was for my BJX2 project...
I realized this part may have been ambiguous...

Not trying to downplay your effort; getting 30+ fps from Quake on an
FPGA is still pretty good...

> If I run the emulator at 110 MHz:
> Software Quake gets ~ 10.4 fps.
> GLQuake gets 12.7 fps.
>
>
> Though, the GLQuake performance partly took a hit recently as I had been
> moving away from running the GL backend directly in the program, to
> instead run it via system calls.
>
>
> I had partly integrated some features from the version that was stuck
> onto the Quake engine into the other branch which was modified to work
> inside the TKGDI process, such as support for the rasterizer module.
>
> Looking in the profile output, it appears it is still doing a bit of the
> GL rendering via the software span-drawing though.
>
>
> Though, in the process, it has gone from the use of "hybrid poor-mans
> perspective correct" back to plain "affine texturing with dynamic
> subdivision", with a comparably finer subdivision.
>
> So, proper perspective correct would involve:
> Divide ST coords by Z before rasterization;
> Interpolate as 1/Z;
>     Dynamically calculate "Z" via "1/(interpolated 1/Z)";
> Scale ST coords by Z during rasterization.
>
> Poor man's version:
> Divide ST coords by Z before rasterization;
> Interpolate as Z;
> Scale ST coords by Z during rasterization.
> This version isn't as good as the proper version, and adds some of its
> own issues vs affine.
>
>
> Affine:
> Interpolate ST coords directly (no Z scaling).
>
> However, larger primitives (*) with affine texturing need to be split
> apart into smaller primitives during rendering, which adds cost in terms
> of transform/projection, which it seems is a more significant part of
> the cost when using the hardware rasterizer module.
>
> Actual perspective-correct could be better here, but the "quickly and
> semi-accurately calculate 1/(1/Z) part" is a challenge.
>
>
> *: At the moment, basically any triangle with a circumference larger
> than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going too
> much bigger makes the affine warping a lot more obvious.
>
>
>
> The Software Quake in this case, is a modified version of the Quake C
> software renderer:
> Was modified early on to use 16-bit pixels rather than 8-bit pixels;
>     Initially, this was YUV655, but then went to RGB555.
> A few functions were rewritten in ASM.
>     Though, still basically all scalar code;
>     The bulk of the renderer is still C though.
>
> There is still some weirdness in a few places where the math still
> assumes YUV, which leads to things like the menu background blending
> being the wrong color (never got around to fixing this), ...
> (The video there seemed to show a dithered effect, which is a little
> different from a color-blend).
>
> Did gain some alpha blended effects (such as a translucent console),
> because these seemed cool at the time, and isn't too hard to pull off
> with RGB pixels.
>
>
> Note that my GLQuake port is still faster than the Quake
> software-renderer, even with software-rasterized OpenGL.
>
> Does sort of imply a faster software renderer could still be possible...
>
>
>
> Though, in my Doom port, I did eventually go from the use of
> color-blending (for things like screen flashes) to the use of
> integrating the color-flash into the active "colormap" table (*), which
> is used every time a span or column is drawn in Doom (not so much in SW
> Quake; where texturing+lighting is precalculated via a "surface cache").
>
> *: It being computationally faster to RGB blend the current version of
> the colormap table, than to RGB blend the final screen image (with
> menus/status-bar/etc being drawn via the unblended colormap).
>
> Though, I did once experiment with eliminating the colormap table
> entirely in Doom, and using purely RGB modulation (like one might do in
> a GL-style rasterizer), but this was slower than using the colormap table.
>
>
> At the moment, Doom at least mostly holds over 20 fps (at 50MHz), having
> gained a few fps on average with a recent experimental optimization:
> Temporary variables which are used exclusively as function-call inputs
> may have the expression output directly to the register corresponding to
> the function argument, rather than first going to a callee-save register
> and then being MOV'ed to the final argument register.
>
> Effect seems to be:
> Makes binary 3% smaller;
> Makes Doom roughly 9% faster;
> Drops "MOV Reg,Reg" from being ~ 16% of total ops, to ~ 13%;
> Causes the number of bundled instructions to drop by 1% though;
> ...
>
> Note that this only applies to temporaries, not to expressions performed
> via local variables or similar, which still use callee-save registers.
>
> Had sort of hoped it would save more, but it seems like many of the
> "MOV's" for function arguments are coming from local variables rather
> than temporaries (but, unlike a temporary, the contents of a local
> variable still need to be intact after the function call).
>
>
>
> After a fair bit of debugging (to get the built program to not be
> entirely broken), this change has a more obvious effect on the
> performance of ROTT (which gets around 70% faster and ~ 6% smaller).
> (Though, there is some other unresolved, less-recent bug, that has
> seemed to cause MIDI playback in ROTT to sound like broken garbage).
>
> Though, for ROTT this wasn't isolated from another few recent
> optimizations:
> Eliminating initial condition-check with "for()" loops of the form:
>     for(i=M; i<N; i++)
> When M<N, and both are constant.
> Reworking "*ptr++" in the RIL-IR stage to eliminate an extra "MOV";
>     Also eliminates using an extra temporary (manifest as the "MOV").
>     Involved detecting and handling this as a single operation.
>       And generating the RIL3 stack-operations in a different order.
>       Didn't bother detecting/handling preincrement cases yet though.
> Making expressions like "x=*ptr;" not use an extra temporary;
> ...
>
> Well, and other changes:
> Making the size-limit for inline "memcpy()" smaller,
>     added a copy-slide and generated memcpy's for intermediate cases.
> Was:
>     < 128 byte: generate inline.
>     < 512 byte: maybe special-case inline (if speed-optimized).
> Now:
>     < 64 byte: generate inline
>     < 512: call a generated unrolled copy-loop/slide.
> This mostly being because handling larger cases inline is bulky.
>     It takes around 512 bytes of ".text" to copy 512 bytes inline...
>
>
> Some of this is basically a case of going through some debug ASM and
> looking for "stupid instruction sequences", and trying to figure out
> what causes them and how to fix it.
>
> However, "obvious cases that save lots of instructions" are becoming
> much less common.
>
> And, some other optimizations, such as "constant propagation" would be a
> lot more difficult to pull off... Where, say, the value of a constant
> would be seen via a variable rather than a "#define" or similar; my
> compiler already has the optimization of replacing expressions like
> "2+3" with "5".
>
>
> The big problem with constant propagation is that whether or not a
> constant can be propagated depends on local visibility and control flow
> (and would likely be of very limited effectiveness if it could not cross
> boundaries between basic-blocks).
>
> For example, if it could not cross a basic-block boundary, it would have
> still been N/A for the previous "for() loop" optimization (which in this
> case was handled via AST level pattern matching).
>
>
> Some of the remaining inefficiencies cross multiple levels in the
> compiler, which is annoying...
>
> Then there are a lot of things that GCC does, that I have little idea
> how to pull off at the moment.
>
>
> For example, it assigns local variables to registers which seem to be
> localized and flow across basic-block boundaries; currently BGBCC does
> nothing of the sort (closest it can do is rank the most-used variables,
> and static-assign them to registers for the scope of the whole function;
> anything else using spill-and-fill via the stack frame).
>
> Sadly, despite having 64 GPRs, still have not entirely eliminated the
> use of spill and fill. The mechanism that can eliminate spill-and-fill
> on a function scale (by assigning everything to registers), is basically
> defeated as soon as anything takes the address of a local variable or
> similar (whole function falls back to the generic strategy; the local
> variable in question going over to not caching the value in a register
> at all, and instead using spill/fill every time that variable is
> accessed, anywhere in the function...).
>
> ...
>
>
>>> I have several questions:
>>> 1. Which particular FPGA chip? (not just family but the particular SKU)
>>
>> a) Intel Cyclone-V 5CEBA4F23C7N (my main development FPGA)
>>
>> b) Intel MAX 10 10M50DAF484 (this is the smaller one of the two)
>>
>
> Mostly still XC7A100T and XC7A200T.
> Advantage of the latter in this case that I can fit multiple cores.
> Where the single-core config uses around 70% of an XC7A100T.
>
> With some limitations, can sorta shoe-horn it into an XC7S50, though not
> with the entire feature-set.
>
>
>>> 2. On what development board?
>>
>> a) Terasic DE0-CV
>>
>> b) Terasic DE10-Lite
>>
>
> Had once looked into these, but didn't get them as they weren't super
> cheap, and were different enough as to require some porting effort.
>
> Did at one point synthesize the BJX2 core in Quartus though...
>
>
>>> 3. Using what tools?
>>
>> Development: Sublime Text + VS Code + GHDL + gtkwave (all free).
>>
>> Programming: Intel Quartus Prime Lite Edition, v19.1.0 (it's free).
>>
>
> All Verilog here...
>
> Seems the version I am using is some sort of intermediate between
> Verilog and SystemVerilog. Vivado accepts it as Verilog, but for Quartus
> I needed to tell it that it was SystemVerilog.
>
> Otherwise, seems to work fine in Verilator and similar as well.
>
>
>>>
>>> Thanks,
>>> brian
>>
>>
>

Re: FPGA use

<uo45pf$12oh6$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36879&group=comp.arch#36879

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Mon, 15 Jan 2024 14:49:49 -0600
Organization: A noiseless patient Spider
Lines: 434
Message-ID: <uo45pf$12oh6$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
<unusod$3290$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 15 Jan 2024 20:49:54 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e67f737aa52224cc61f382d75c1d7cc6";
logging-data="1139238"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Hf+1gDuE2WxyWVxXQu6yN"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:bw4uKQyVMrgJcEPBFlSkEd3+BCc=
In-Reply-To: <unusod$3290$1@dont-email.me>
Content-Language: en-US

by: BGB - Mon, 15 Jan 2024 20:49 UTC

On 1/13/2024 2:45 PM, BGB wrote:
> On 1/13/2024 3:38 AM, Marcus wrote:
>> On 2024-01-07 22:07, Brian G. Lucas wrote:
>>> Several posters on comp.arch are running their cpu designs on FPGAs.
>>
>> Yes. I use it for my MRISC32 project. I implement a CPU (MRISC32-A1) as
>> a soft processor for FPGA use. Furthermore I implement a kind of
>> computer around the CPU, to provide ROM, RAM, I/O, graphics, etc.
>>
>> Here's a recent video of the computer running Quake:
>>
>> https://vimeo.com/901506667
>>
>> I use/target two different FPGA boards, but I mainly use one of them for
>> development.
>>
>
> At least it is going fast...
>
> If I run the emulator at 110 MHz:
> Software Quake gets ~ 10.4 fps.
> GLQuake gets 12.7 fps.
>
>
> Though, the GLQuake performance partly took a hit recently as I had been
> moving away from running the GL backend directly in the program, to
> instead run it via system calls.
>

Generally, it is around 1 system call per draw operation, and system
calls for things like binding a texture, uploading an image, etc.

One possibility would be to eliminate the texture binding and instead
pass one or more texture handles with each draw operation or similar.

>
> I had partly integrated some features from the version that was stuck
> onto the Quake engine into the other branch which was modified to work
> inside the TKGDI process, such as support for the rasterizer module.
>
> Looking in the profile output, it appears it is still doing a bit of the
> GL rendering via the software span-drawing though.
>
>
> Though, in the process, it has gone from the use of "hybrid poor-mans
> perspective correct" back to plain "affine texturing with dynamic
> subdivision", with a comparably finer subdivision.
>
> So, proper perspective correct would involve:
>    Divide ST coords by Z before rasterization;
>    Interpolate as 1/Z;
>      Dynamically calculate "Z" via "1/(interpolated 1/Z)";
>    Scale ST coords by Z during rasterization.
>
> Poor man's version:
>    Divide ST coords by Z before rasterization;
>    Interpolate as Z;
>    Scale ST coords by Z during rasterization.
> This version isn't as good as the proper version, and adds some of its
> own issues vs affine.
>
>
> Affine:
>    Interpolate ST coords directly (no Z scaling).
>
> However, larger primitives (*) with affine texturing need to be split
> apart into smaller primitives during rendering, which adds cost in terms
> of transform/projection, which it seems is a more significant part of
> the cost when using the hardware rasterizer module.
>
> Actual perspective-correct could be better here, but the "quickly and
> semi-accurately calculate 1/(1/Z) part" is a challenge.
>
>
> *: At the moment, basically any triangle with a circumference larger
> than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going too
> much bigger makes the affine warping a lot more obvious.
>
>

Self-correction:
The limit is 144 pixels for a triangle and 192 for a quad, i.e. 48 pixels
on each side...

I had mistakenly taken a square root here: I had noted that these values
are compared as distance^2, but then forgot that the values themselves
were already squared, so the limit is 48 pixels linear, rather than the
square root of 48 pixels.

Using 7 pixels would probably have led to a lot less affine distortion,
but would have had a much higher overhead. The goal is mostly to break
apart large primitives where the affine distortion is more obvious.

Though, one possibility that was not explored at the time would be to
do the transform as if it were doing perspective-correct rendering, then
break up the primitive in screen space (then, for affine rendering, one
would calculate 1/(1/Z) and scale S/T by Z for each sub-primitive).
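
As a rough sketch of that unexplored screen-space split (this is not code
from the actual renderer; proj_vert_t and the function names are made up
for illustration, and it assumes vertices are already in front of the near
plane): keep S/Z, T/Z and 1/Z per vertex, split edges at the screen-space
midpoint, then recover plain S/T for the affine rasterizer.

```c
#include <assert.h>

/* Hypothetical projected vertex: texture coords are stored pre-divided
   by Z, alongside 1/Z, since those are linear in screen space. */
typedef struct {
    float x, y;       /* screen-space position */
    float s_z, t_z;   /* S/Z and T/Z */
    float inv_z;      /* 1/Z */
} proj_vert_t;

/* Build a projected vertex from screen position, view-space Z, and S/T. */
static proj_vert_t proj_vert(float x, float y, float z, float s, float t)
{
    proj_vert_t v;
    assert(z > 0.0f);   /* assumes clipping to the near plane already happened */
    v.x = x;
    v.y = y;
    v.inv_z = 1.0f / z;
    v.s_z = s * v.inv_z;
    v.t_z = t * v.inv_z;
    return v;
}

/* Split an edge at its screen-space midpoint. */
static proj_vert_t proj_midpoint(proj_vert_t a, proj_vert_t b)
{
    proj_vert_t m;
    m.x     = 0.5f * (a.x + b.x);
    m.y     = 0.5f * (a.y + b.y);
    m.s_z   = 0.5f * (a.s_z + b.s_z);
    m.t_z   = 0.5f * (a.t_z + b.t_z);
    m.inv_z = 0.5f * (a.inv_z + b.inv_z);
    return m;
}

/* Recover plain S and T for an affine rasterizer: Z = 1/(1/Z). */
static void vert_affine_st(proj_vert_t v, float *s, float *t)
{
    float z = 1.0f / v.inv_z;
    *s = v.s_z * z;
    *t = v.t_z * z;
}
```

For an edge running from Z=1 (S=0) to Z=4 (S=1), the screen-space midpoint
ends up with S of about 0.2, rather than the 0.5 a plain affine midpoint
would give; the divide is only paid once per new vertex, not per pixel.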

The current implementation is more like:
Project primitive into screen space;
Calculate edge-lengths;
If too big, break apart at the midpoint along each edge;
    This part happens in world space;
Try again with each sub-piece.
Mostly using a stack to manage the primitive transforms (the primitive
is fully drawn once this stack is empty).

Which, as noted, is not the most efficient way possible to do it, as
each fragmented primitive also needs to send all of the sub-vertices
through the vertex transform.
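
The loop described above could be sketched roughly as follows. This is a
simplified stand-in, not the real code: it works on 2D points only (so the
world-space split and the re-projection of each new vertex are collapsed
into one step), the per-edge check against 48 pixels is an assumption, and
it just counts the triangles that would be drawn. All names are
hypothetical.

```c
#include <assert.h>

#define MAX_EDGE   48    /* linear per-edge limit in pixels */
#define STACK_MAX  64    /* arbitrary stack depth for this sketch */

typedef struct { float x, y; } pt_t;
typedef struct { pt_t v[3]; } tri_t;

static float dist_sq(pt_t a, pt_t b)
{
    float dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

static pt_t midpoint(pt_t a, pt_t b)
{
    pt_t m;
    m.x = 0.5f * (a.x + b.x);
    m.y = 0.5f * (a.y + b.y);
    return m;
}

/* Compare squared lengths against the squared limit (no sqrt needed). */
static int tri_too_big(const tri_t *t)
{
    const float lim = (float)(MAX_EDGE * MAX_EDGE);
    return dist_sq(t->v[0], t->v[1]) > lim ||
           dist_sq(t->v[1], t->v[2]) > lim ||
           dist_sq(t->v[2], t->v[0]) > lim;
}

static void push_tri(tri_t *stack, int *sp, pt_t a, pt_t b, pt_t c)
{
    assert(*sp < STACK_MAX);
    stack[*sp].v[0] = a;
    stack[*sp].v[1] = b;
    stack[*sp].v[2] = c;
    (*sp)++;
}

/* Returns how many triangles would be handed to the rasterizer. */
static int subdivide_count(tri_t root)
{
    tri_t stack[STACK_MAX];
    int sp = 0, drawn = 0;

    push_tri(stack, &sp, root.v[0], root.v[1], root.v[2]);
    while (sp > 0) {
        tri_t t = stack[--sp];
        pt_t m01, m12, m20;

        if (!tri_too_big(&t)) {
            drawn++;    /* a real renderer would rasterize here */
            continue;
        }
        /* Break apart at the midpoint of each edge, then try again. */
        m01 = midpoint(t.v[0], t.v[1]);
        m12 = midpoint(t.v[1], t.v[2]);
        m20 = midpoint(t.v[2], t.v[0]);
        push_tri(stack, &sp, t.v[0], m01, m20);
        push_tri(stack, &sp, m01, t.v[1], m12);
        push_tri(stack, &sp, m20, m12, t.v[2]);
        push_tri(stack, &sp, m01, m12, m20);
    }
    return drawn;
}
```

With these limits, a right triangle with 96-pixel legs gets split twice
and comes out as 16 pieces, each of which costs three more trips through
the vertex transform in the real thing, which is where the overhead goes.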

It may also make sense to clip the primitive to the frustum, as primitives
crossing outside the frustum or across the near clip plane behave
particularly badly with the perspective-correct math.

But, this part could possibly use a bit of restructuring, as I
originally wrote it with the assumption of a (comparably slow) software
rasterized backend. Hadn't expected the front-end stages to become the
bottleneck...

Note that the framebuffer and depth buffer in my case were generally:
RGB555A
    0rrrrrgggggbbbbb (A=255)
    1rrrraggggabbbba (A=0/32/64/96/128/160/192/224).
    Z16    (No Stencil)
    Z12.S4 (Stencil)

Though, 12-bit depth does lead to fairly obvious Z fighting.

Note that 4 stencil bits is basically the minimum needed to pull off
effects like stencil shadows.

The reason for not using 32-bit Color and Depth buffers mostly has to do
with memory bandwidth and performance.
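
To make the RGB555A layout above a bit more concrete, here is a minimal
decoding sketch. Some of it is guesswork rather than the actual
implementation: it assumes the alpha bit in the red field is the most
significant of the three, and that channels are widened to 8 bits by bit
replication. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba8_t;

/* Widen a 5-bit channel to 8 bits by replicating the high bits. */
static uint8_t expand5(unsigned v)
{
    assert(v < 32);
    return (uint8_t)((v << 3) | (v >> 2));
}

/* Widen a 4-bit channel to 8 bits (0..15 -> 0..255). */
static uint8_t expand4(unsigned v)
{
    assert(v < 16);
    return (uint8_t)(v * 17);
}

static rgba8_t rgb555a_decode(uint16_t px)
{
    rgba8_t c;
    if (!(px & 0x8000)) {
        /* 0rrrrrgggggbbbbb: fully opaque, 5 bits per channel */
        c.r = expand5((px >> 10) & 31);
        c.g = expand5((px >> 5) & 31);
        c.b = expand5(px & 31);
        c.a = 255;
    } else {
        /* 1rrrraggggabbbba: 4 bits per channel, low bit of each field
           holds one alpha bit (bit order is an assumption) */
        unsigned a3 = (((px >> 10) & 1u) << 2) |
                      (((px >> 5) & 1u) << 1) |
                      (px & 1u);
        c.r = expand4((px >> 11) & 15);
        c.g = expand4((px >> 6) & 15);
        c.b = expand4((px >> 1) & 15);
        c.a = (uint8_t)(a3 * 32);   /* 0, 32, ..., 224 */
    }
    return c;
}
```

The nice property of this sort of layout is that an opaque pixel is just
plain RGB555, so code that ignores alpha keeps working on it unchanged.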

>
> The Software Quake in this case, is a modified version of the Quake C
> software renderer:
> Was modified early on to use 16-bit pixels rather than 8-bit pixels;
>     Initially, this was YUV655, but then went to RGB555.
> A few functions were rewritten in ASM.
>     Though, still basically all scalar code;
>     The bulk of the renderer is still C though.
>
> There is still some weirdness in a few places where the math still
> assumes YUV, which leads to things like the menu background blending
> being the wrong color (never got around to fixing this), ...
> (The video there seemed to show a dithered effect, which is a little
> different from a color-blend).
>
> Did gain some alpha blended effects (such as a translucent console),
> because these seemed cool at the time, and isn't too hard to pull off
> with RGB pixels.
>
>
> Note that my GLQuake port is still faster than the Quake
> software-renderer, even with software-rasterized OpenGL.
>
> Does sort of imply a faster software renderer could still be possible...
>

The OpenGL API and front-end aren't exactly low-overhead, so a
lower-overhead renderer could be possible.

Just my implementation "cheats" a bit by using a lot of SIMD ops,
whereas Quake itself tends to be pretty much exclusively scalar code,
generally working with vectors via function calls and "float *" pointers
and similar.

Well, or:
typedef float vec3_t[3];
Which basically achieves the same effect in practice.

In my case, no vector extensions, just sort of a limited subset of MMX
and SSE style SIMD operations (implemented on top of GPRs or GPR pairs).

Well, and unlike MMX, there are generally no packed byte operations
(smallest packed element being a 16-bit word).

>
>
> Though, in my Doom port, I did eventually go from the use of
> color-blending (for things like screen flashes) to the use of
> integrating the color-flash into the active "colormap" table (*), which
> is used every time a span or column is drawn in Doom (not so much in SW
> Quake; where texturing+lighting is precalculated via a "surface cache").
>
> *: It being computationally faster to RGB blend the current version of
> the colormap table, than to RGB blend the final screen image (with
> menus/status-bar/etc being drawn via the unblended colormap).
>
> Though, I did once experiment with eliminating the colormap table
> entirely in Doom, and using purely RGB modulation (like one might do in
> a GL style rasterizer), but this was slower than using the colormap table.
>
>
> At the moment, Doom at least mostly holds over 20 fps (at 50MHz), having
> gained a few fps on average with a recent experimental optimization:
> Temporary variables which are used exclusively as function-call inputs
> may have the expression output directly to the register corresponding to
> the function argument, rather than first going to a callee-save register
> and then being MOV'ed to the final argument register.
>
> Effect seems to be:
> Makes binary 3% smaller;
> Makes Doom roughly 9% faster;
> Drops "MOV Reg,Reg" from being ~ 16% of total ops, to ~ 13%;
> Cause the number of bundled instructions to drop by 1% though;
> ...
>
> Note that this only applies to temporaries, not to expressions performed
> via local variables or similar, which still use callee-save registers.
>
> Had sort of hoped it would save more, but it seems like many of the
> "MOV's" for function arguments are coming from local variables rather
> than temporaries (but, unlike a temporary, the contents of a local
> variable still need to be intact after the function call).
>

Re: FPGA use

<uo4tlc$19nr6$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36880&group=comp.arch#36880

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Mon, 15 Jan 2024 22:37:13 -0500
Organization: A noiseless patient Spider
Lines: 449
Message-ID: <uo4tlc$19nr6$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
<unusod$3290$1@dont-email.me> <uo45pf$12oh6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 16 Jan 2024 03:37:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="040ff6624b0434efe74e6af84441ef89";
logging-data="1367910"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+/JCBLASscZNiuroEB4udzzmp32DXJ6qo="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:OYgkC3LpL9yzgJ8XN/iCyOzi2l0=
In-Reply-To: <uo45pf$12oh6$1@dont-email.me>
Content-Language: en-US

by: Robert Finch - Tue, 16 Jan 2024 03:37 UTC

On 2024-01-15 3:49 p.m., BGB wrote:
> On 1/13/2024 2:45 PM, BGB wrote:
>> On 1/13/2024 3:38 AM, Marcus wrote:
>>> On 2024-01-07 22:07, Brian G. Lucas wrote:
>>>> Several posters on comp.arch are running their cpu designs on FPGAs.
>>>
>>> Yes. I use it for my MRISC32 project. I implement a CPU (MRISC32-A1) as
>>> a soft processor for FPGA use. Furthermore I implement a kind of
>>> computer around the CPU, to provide ROM, RAM, I/O, graphics, etc.
>>>
>>> Here's a recent video of the computer running Quake:
>>>
>>> https://vimeo.com/901506667
>>>
>>> I use/target two different FPGA boards, but I mainly use one of them for
>>> development.
>>>
>>
>> At least it is going fast...
>>
>> If I run the emulator at 110 MHz:
>>    Software Quake gets ~ 10.4 fps.
>>    GLQuake gets 12.7 fps.
>>
>>
>> Though, the GLQuake performance partly took a hit recently as I had
>> been moving away from running the GL backend directly in the program,
>> to instead run it via system calls.
>>
>
> Generally, it is around 1 system call per draw operation, and system
> calls for things like binding a texture, uploading an image, etc.
>
> One possibility would be to eliminate the texture binding and instead
> pass one or more texture handles for each draw operation or similar.
>
>>
>> I had partly integrated some features from the version that was stuck
>> onto the Quake engine into the other branch which was modified to work
>> inside the TKGDI process, such as support for the rasterizer module.
>>
>> Looking in the profile output, it appears it is still doing a bit of
>> the GL rendering via the software span-drawing though.
>>
>>
>> Though, in the process, it has gone from the use of "hybrid poor-mans
>> perspective correct" back to plain "affine texturing with dynamic
>> subdivision", with a comparably finer subdivision.
>>
>> So, proper perspective correct would involve:
>>    Divide ST coords by Z before rasterization;
>>    Interpolate as 1/Z;
>>      Dynamically calculate "Z" via "1/(interpolated 1/Z)";
>>    Scale ST coords by Z during rasterization.
>>
>> Poor man's version:
>>    Divide ST coords by Z before rasterization;
>>    Interpolate as Z;
>>    Scale ST coords by Z during rasterization.
>> This version isn't as good as the proper version, and adds some of its
>> own issues vs affine.
>>
>>
>> Affine:
>>    Interpolate ST coords directly (no Z scaling).
>>
>> However, larger primitives (*) with affine texturing need to be split
>> apart into smaller primitives during rendering, which adds cost in
>> terms of transform/projection, which it seems is a more significant
>> part of the cost when using the hardware rasterizer module.
>>
>> Actual perspective-correct could be better here, but the "quickly and
>> semi-accurately calculate 1/(1/Z) part" is a challenge.
>>
>>
>> *: At the moment, basically any triangle with a circumference larger
>> than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going
>> too much bigger makes the affine warping a lot more obvious.
>>
>>
>
> Self-correction:
> It is 144 for a triangle and 192 for a quad, 48 pixels on each side...
>
> I had mistakenly done a square-root here, noting that I compared these
> values as distance^2, but then forgot to take into account that the
> values themselves were squared, so it was 48 pixels linear, rather than
> square-root of 48 pixels.
>
> Using 7 pixels would have probably led to a lot less affine distortion,
> but would have had a much higher overhead. Goal being mostly to break
> apart large primitives where the affine distortion is more obvious.
>
>
>
> Though, one possibility that was not explored at the time, would be to
> do the transform as-if it were doing perspective correct rendering, then
> breaking up the primitive in screen space (then, for affine rendering,
> one would calculate 1/(1/Z) and scale S/T by Z for each sub-primitive).
>
>
> The current implementation is more like:
> Project primitive into screen space;
> Calculate edge-lengths;
> If too big, break apart at the midpoint along each edge;
>     This part happens in world space;
> Try again with each sub-piece.
> Mostly using a stack to manage the primitive transforms (the primitive
> is fully drawn once this stack is empty).
>
> Which, as noted, is not the most efficient way possible to do it, as
> each fragmented primitive also needs to send all of the sub-vertices
> through the vertex transform.
>
>
> May also make sense to clip the primitive to the frustum, as primitives
> crossing outside the frustum or across the near clip plane, behave
> particularly badly with the perspective-correct math.
>
>
> But, this part could possibly use a bit of restructuring, as I
> originally wrote it with the assumption of a (comparably slow) software
> rasterized backend. Hadn't expected the front-end stages to become the
> bottleneck...
>
>
> Note that the framebuffer and depth buffer in my case were generally:
> RGB555A
>     0rrrrrgggggbbbbb (A=255)
>     1rrrraggggabbbba (A=0/32/64/96/128/160/192/224).
>     Z16    (No Stencil)
>     Z12.S4 (Stencil)
>
> Though, 12-bit depth does lead to fairly obvious Z fighting.
>
> Note that 4 stencil bits is basically the minimum needed to pull off
> effects like stencil shadows.
>
> The reason for not using 32-bit Color and Depth buffers mostly has to do
> with memory bandwidth and performance.
>
>
>>
>> The Software Quake in this case, is a modified version of the Quake C
>> software renderer:
>>    Was modified early on to use 16-bit pixels rather than 8-bit pixels;
>>      Initially, this was YUV655, but then went to RGB555.
>>    A few functions were rewritten in ASM.
>>      Though, still basically all scalar code;
>>      The bulk of the renderer is still C though.
>>
>> There is still some weirdness in a few places where the math still
>> assumes YUV, which leads to things like the menu background blending
>> being the wrong color (never got around to fixing this), ...
>> (The video there seemed to show a dithered effect, which is a little
>> different from a color-blend).
>>
>> Did gain some alpha blended effects (such as a translucent console),
>> because these seemed cool at the time, and isn't too hard to pull off
>> with RGB pixels.
>>
>>
>> Note that my GLQuake port is still faster than the Quake
>> software-renderer, even with software-rasterized OpenGL.
>>
>> Does sort of imply a faster software renderer could still be possible...
>>
>
> The OpenGL API and front-end isn't exactly low-overhead, so
> lower-overhead could be possible.
>
> Just my implementation "cheats" a bit by using a lot of SIMD ops,
> whereas Quake itself tends to be pretty much exclusively scalar code,
> generally working with vectors via function calls and "float *" pointers
> and similar.
>
> Well, or:
> typedef float vec3_t[3];
> Which basically achieves the same effect in practice.
>
>
>
> In my case, no vector extensions, just sort of a limited subset of MMX
> and SSE style SIMD operations (implemented on top of GPRs or GPR pairs).
>
> Well, and unlike MMX, there are generally no packed byte operations
> (smallest packed element being a 16-bit word).
>
>
>>
>>
>> Though, in my Doom port, I did eventually go from the use of
>> color-blending (for things like screen flashes) to the use of
>> integrating the color-flash into the active "colormap" table (*),
>> which is used every time a span or column is drawn in Doom (not so
>> much in SW Quake; where texturing+lighting is precalculated via a
>> "surface cache").
>>
>> *: It being computationally faster to RGB blend the current version of
>> the colormap table, than to RGB blend the final screen image (with
>> menus/status-bar/etc being drawn via the unblended colormap).
>>
>> Though, I did once experiment with eliminating the colormap table
>> entirely in Doom, and using purely RGB modulation (like one might do
>> in a GL style rasterizer), but this was slower than using the
>> colormap table.
>>
>>
>> At the moment, Doom at least mostly holds over 20 fps (at 50MHz),
>> having gained a few fps on average with a recent experimental
>> optimization:
>> Temporary variables which are used exclusively as function-call inputs
>> may have the expression output directly to the register corresponding
>> to the function argument, rather than first going to a callee-save
>> register and then being MOV'ed to the final argument register.
>>
>> Effect seems to be:
>>    Makes binary 3% smaller;
>>    Makes Doom roughly 9% faster;
>>    Drops "MOV Reg,Reg" from being ~ 16% of total ops, to ~ 13%;
>>    Cause the number of bundled instructions to drop by 1% though;
>>    ...
>>
>> Note that this only applies to temporaries, not to expressions
>> performed via local variables or similar, which still use callee-save
>> registers.
>>
>> Had sort of hoped it would save more, but it seems like many of the
>> "MOV's" for function arguments are coming from local variables rather
>> than temporaries (but, unlike a temporary, the contents of a local
>> variable still need to be intact after the function call).
>>
>
> A bunch more fiddling, it is now down a little more:
> The "MOV Reg,Reg" case is down to around 10% of the total binary.
> A lot of this was by fiddling with stuff mostly at the IR stages to
> reduce the number of temporaries used in cases where using an extra
> temporary was not necessary.
>
> Still some remain, but attempts to eliminate these cases had broken the
> program being compiled.
>
>>
>>
>> After a fair bit of debugging (to get the built program to not be
>> entirely broken), this change has a more obvious effect on the
>> performance of ROTT (which gets around 70% faster and ~ 6% smaller).
>> (Though, there is some other unresolved, less-recent bug, that has
>> seemed to cause MIDI playback in ROTT to sound like broken garbage).
>>
>
> Not sure why ROTT seemed to have seen such a disproportionately larger
> result...
>
> But, ROTT went from "fairly slow" up to "relatively fast" (now running
> at closer to Doom-like speeds).
>
>
> Hexen is still fairly slow though, as Hexen manages to be almost as slow
> as Quake, despite being Doom-engine based, and my Doom engine port is
> comparably faster.
>
> Previously, ROTT was around a similar speed to Hexen.
>
>
> Though, of the ports, ROTT is ironically the only one still doing its
> rendering in 8-bit color, but this was partly because the engine had
> tried using the VGA in very weird ways (*), and I ended up implementing
> the ROTT port partly by faking the VGA behavior by implementing most of
> the graphics-hardware interface via function calls.
>
> *: Rather than linear 320x200, it used the VGA in a sort of planar mode:
> 4 planes, each 96x200, for ~ 384x200 mode, but only 320x200 is visible.
> For some effects, it dynamically modifies the color palette, ...
>
>
> The other games had merely used linear 320x200, and had seemingly
> already made some provisions for rendering via 16-bit pixels.
>
> Though, this did still leave the matter of how to implement things like
> screen color-flashes (which in 256 color versions were pulled off by
> dynamically modifying the color palette).
>
>
> Early on, had used framebuffer level color-blending. For Doom, later
> went to dynamically updating the "colormap" table based on the screen
> flash (but always drawing HUD/UI using the original colormap, as Doom
> redraws things like the HUD incrementally, and so drawing it with the
> colormap can lead to ugly artifacts if screen-flashing gets involved).
>
> For things like the "invisibility" effect, had to rework this to working
> with RGB555 (original version fed the pixels back through the colormap
> table a second time to darken each pixel by an amount given in a lookup
> table; wrote something different that did a similar effect but with
> packed RGB555 pixels instead).
>
>
> Meanwhile, Hexen had done a different effect of doing a 50% blend
> between the sprite color and the background color (in indexed color,
> this was done using a lookup table; in RGB555, the blending is done in
> RGB space).
>
> Though, a simple 50% blend can be "cheesed" as, say:
> newclr=((clr1&0x7BDE)+(clr2&0x7BDE))>>1;
> Rather than needing to unpack and blend each component, then repack the
> result.
>
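
For reference, the packed 50% blend quoted above works because 0x7BDE
clears the low bit of each 5-bit field, leaving room for the carry when
the two pixels are added. A small sketch that checks the one-liner against
an unpack/blend/repack version (the function names here are made up for
illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Packed 50% blend of two RGB555 pixels. 0x7BDE clears the low bit of
   each 5-bit field (and bit 15), so the per-field carry from the add
   lands in a bit that is known to be zero. */
static uint16_t blend50_rgb555(uint16_t c1, uint16_t c2)
{
    return (uint16_t)(((c1 & 0x7BDE) + (c2 & 0x7BDE)) >> 1);
}

/* Reference version: unpack, halve and add each channel, repack. */
static uint16_t blend50_rgb555_ref(uint16_t c1, uint16_t c2)
{
    unsigned r = (((c1 >> 10) & 31u) >> 1) + (((c2 >> 10) & 31u) >> 1);
    unsigned g = (((c1 >> 5) & 31u) >> 1) + (((c2 >> 5) & 31u) >> 1);
    unsigned b = ((c1 & 31u) >> 1) + ((c2 & 31u) >> 1);
    return (uint16_t)((r << 10) | (g << 5) | b);
}

/* Compare both versions over a strided sample of pixel pairs. */
static void blend50_selfcheck(void)
{
    unsigned a, b;
    for (a = 0; a < 0x8000u; a += 0x111u)
        for (b = 0; b < 0x8000u; b += 0x0F7u)
            assert(blend50_rgb555((uint16_t)a, (uint16_t)b) ==
                   blend50_rgb555_ref((uint16_t)a, (uint16_t)b));
}
```

The cost is that the low bit of each channel is dropped before the add,
so blending white with white gives 0x7BDE rather than 0x7FFF.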
I added a color blend operation to the Thor/Q+ instruction set a while
ago as part of the graphics operations, thinking that the CPU could take
over some of the graphics ops performed by an accelerator. The color
blend operator blends two 30-bit RGB10-10-10 colors using a 1.9
fixed-point alpha, and computes the blend in two clock cycles. There
is also a ‘transform’ instruction that translates and rotates a point in
3D space. It performs all the matrix math (nine multiplies and adds) in
about four clock cycles.
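
In software terms, a blend of that kind would look something like the
sketch below. This is only a C model under assumptions, not the Thor/Q+
definition: the field order (red in the top ten bits), the rounding, and
which operand the alpha weights are all guesses, and the names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ALPHA_ONE 512u   /* 1.0 in 1.9 fixed point */

/* Blend one 10-bit channel: a*alpha + b*(1-alpha), rounded to nearest. */
static uint32_t blend_chan10(uint32_t a, uint32_t b, uint32_t alpha)
{
    return (a * alpha + b * (ALPHA_ONE - alpha) + (ALPHA_ONE / 2)) >> 9;
}

/* Blend two RGB10-10-10 colors; alpha weights c1 (an assumption). */
static uint32_t blend_rgb10(uint32_t c1, uint32_t c2, uint32_t alpha)
{
    uint32_t r, g, b;
    assert(alpha <= ALPHA_ONE);
    r = blend_chan10((c1 >> 20) & 1023u, (c2 >> 20) & 1023u, alpha);
    g = blend_chan10((c1 >> 10) & 1023u, (c2 >> 10) & 1023u, alpha);
    b = blend_chan10(c1 & 1023u, c2 & 1023u, alpha);
    return (r << 20) | (g << 10) | b;
}
```

The extra integer bit in the 1.9 format is what lets alpha reach exactly
1.0 (512), so a fully opaque blend returns the first color unchanged.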

Re: FPGA use

<uo53fm$1afa5$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36881&group=comp.arch#36881

Path: i2pn2.org!i2pn.org!news.furie.org.uk!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Mon, 15 Jan 2024 23:16:35 -0600
Organization: A noiseless patient Spider
Lines: 577
Message-ID: <uo53fm$1afa5$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
<unusod$3290$1@dont-email.me> <uo45pf$12oh6$1@dont-email.me>
<uo4tlc$19nr6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 16 Jan 2024 05:16:41 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="124bb731d779d9f7acf382eb79b2d27d";
logging-data="1391941"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18XdEKG47sht/J4NXY/dggF"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:zf8HHWxL1TEucGJc68XluUFSrhc=
In-Reply-To: <uo4tlc$19nr6$1@dont-email.me>
Content-Language: en-US

by: BGB - Tue, 16 Jan 2024 05:16 UTC

On 1/15/2024 9:37 PM, Robert Finch wrote:
> On 2024-01-15 3:49 p.m., BGB wrote:
>> On 1/13/2024 2:45 PM, BGB wrote:
>>> On 1/13/2024 3:38 AM, Marcus wrote:
>>>> On 2024-01-07 22:07, Brian G. Lucas wrote:
>>>>> Several posters on comp.arch are running their cpu designs on FPGAs.
>>>>
>>>> Yes. I use it for my MRISC32 project. I implement a CPU (MRISC32-A1) as
>>>> a soft processor for FPGA use. Furthermore I implement a kind of
>>>> computer around the CPU, to provide ROM, RAM, I/O, graphics, etc.
>>>>
>>>> Here's a recent video of the computer running Quake:
>>>>
>>>> https://vimeo.com/901506667
>>>>
>>>> I use/target two different FPGA boards, but I mainly use one of them
>>>> for
>>>> development.
>>>>
>>>
>>> At least it is going fast...
>>>
>>> If I run the emulator at 110 MHz:
>>>    Software Quake gets ~ 10.4 fps.
>>>    GLQuake gets 12.7 fps.
>>>
>>>
>>> Though, the GLQuake performance partly took a hit recently as I had
>>> been moving away from running the GL backend directly in the program,
>>> to instead run it via system calls.
>>>
>>
>> Generally, it is around 1 system call per draw operation, and system
>> calls for things like binding a texture, uploading an image, etc.
>>
>> One possibility would be to eliminate the texture binding and instead
>> pass one or more texture handles for each draw operation or similar.
>>
>>>
>>> I had partly integrated some features from the version that was stuck
>>> onto the Quake engine into the other branch which was modified to
>>> work inside the TKGDI process, such as support for the rasterizer
>>> module.
>>>
>>> Looking in the profile output, it appears it is still doing a bit of
>>> the GL rendering via the software span-drawing though.
>>>
>>>
>>> Though, in the process, it has gone from the use of "hybrid poor man's
>>> perspective correct" back to plain "affine texturing with dynamic
>>> subdivision", with a comparably finer subdivision.
>>>
>>> So, proper perspective correct would involve:
>>>    Divide ST coords by Z before rasterization;
>>>    Interpolate as 1/Z;
>>>      Dynamically calculate "Z" via "1/(interpolated 1/Z)";
>>>    Scale ST coords by Z during rasterization.
>>>
>>> Poor man's version:
>>>    Divide ST coords by Z before rasterization;
>>>    Interpolate as Z;
>>>    Scale ST coords by Z during rasterization.
>>> This version isn't as good as the proper version, and adds some of
>>> its own issues vs affine.
>>>
>>>
>>> Affine:
>>>    Interpolate ST coords directly (no Z scaling).
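For reference, the three variants listed above can be sketched in C roughly as follows. This only illustrates the math along a single edge or span and is not code from TKRA-GL; the `tex_vertex` type and the `mix` / `interp_*` names are made up for the example, and Z is assumed to be positive.

```c
#include <assert.h>

/* Hypothetical span endpoint: texture coords (s, t) and depth z > 0. */
typedef struct { float s, t, z; } tex_vertex;

/* Plain linear interpolation between a and b, with f in 0..1. */
static float mix(float a, float b, float f) { return a + (b - a) * f; }

/* Affine: interpolate S/T directly, no Z scaling. */
static void interp_affine(tex_vertex a, tex_vertex b, float f,
                          float *s, float *t)
{
    *s = mix(a.s, b.s, f);
    *t = mix(a.t, b.t, f);
}

/* "Poor man's" version: interpolate S/Z, T/Z and Z itself,
 * then scale S/Z and T/Z by the interpolated Z. */
static void interp_poor_mans(tex_vertex a, tex_vertex b, float f,
                             float *s, float *t)
{
    assert(a.z > 0.0f && b.z > 0.0f);
    float z = mix(a.z, b.z, f);
    *s = mix(a.s / a.z, b.s / b.z, f) * z;
    *t = mix(a.t / a.z, b.t / b.z, f) * z;
}

/* Perspective correct: interpolate S/Z, T/Z and 1/Z, recover
 * Z = 1/(interpolated 1/Z), then scale S/Z and T/Z by that Z. */
static void interp_perspective(tex_vertex a, tex_vertex b, float f,
                               float *s, float *t)
{
    assert(a.z > 0.0f && b.z > 0.0f);
    float z = 1.0f / mix(1.0f / a.z, 1.0f / b.z, f);
    *s = mix(a.s / a.z, b.s / b.z, f) * z;
    *t = mix(a.t / a.z, b.t / b.z, f) * z;
}
```

Halfway along an edge running from (S=0, Z=1) to (S=1, Z=3), affine gives S=0.5, the poor man's version gives about 0.33, and the perspective-correct version gives 0.25. The per-pixel reciprocal in the last function is the expensive part.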
>>>
>>> However, larger primitives (*) with affine texturing need to be split
>>> apart into smaller primitives during rendering, which adds cost in
>>> terms of transform/projection, which it seems is a more significant
>>> part of the cost when using the hardware rasterizer module.
>>>
>>> Actual perspective-correct could be better here, but the "quickly and
>>> semi-accurately calculate 1/(1/Z)" part is a challenge.
>>>
>>>
>>> *: At the moment, basically any triangle with a circumference larger
>>> than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going
>>> too much bigger makes the affine warping a lot more obvious.
>>>
>>>
>>
>> Self-correction:
>>    It is 144 for a triangle and 192 for a quad, 48 pixels on each side...
>>
>> I had mistakenly done a square-root here, noting that I compared these
>> values as distance^2, but then forgot to take into account that the
>> values themselves were squared, so it was 48 pixels linear, rather
>> than square-root of 48 pixels.
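To spell out the mix-up: if edge lengths are compared as squared distances, a 48-pixel limit means comparing against 48*48 = 2304, while sqrt(48) is about 6.93, which is where the bogus "7 pixels" came from. A minimal sketch, assuming the limit is kept in linear pixels and squared at the point of comparison (the function name is hypothetical, and the real code may well store the squared value directly):

```c
#include <assert.h>

/* Returns nonzero if the screen-space edge (dx, dy) is longer than
 * limit_px pixels. Both sides are compared as squared distances,
 * so no square root is needed. */
static int edge_exceeds_limit(float dx, float dy, float limit_px)
{
    assert(limit_px >= 0.0f);
    float dist2 = dx * dx + dy * dy;        /* distance^2 */
    return dist2 > limit_px * limit_px;     /* limit^2, not limit */
}
```

With a 48-pixel limit, a 30x40 edge (50 pixels long) gets flagged and a 10-pixel edge does not; with a 7-pixel limit the 10-pixel edge would be split as well.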
>>
>> Using 7 pixels would have probably led to a lot less affine
>> distortion, but would have had a much higher overhead. Goal being
>> mostly to break apart large primitives where the affine distortion is
>> more obvious.
>>
>>
>>
>> Though, one possibility that was not explored at the time, would be to
>> do the transform as-if it were doing perspective correct rendering,
>> then breaking up the primitive in screen space (then, for affine
>> rendering, one would calculate 1/(1/Z) and scale S/T by Z for each
>> sub-primitive).
>>
>>
>> The current implementation is more like:
>>    Project primitive into screen space;
>>    Calculate edge-lengths;
>>    If too big, break apart at the midpoint along each edge;
>>      This part happens in world space;
>>    Try again with each sub-piece.
>> Mostly using a stack to manage the primitive transforms (the primitive
>> is fully drawn once this stack is empty).
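The loop described above might look something like this in C. It is a rough sketch under several assumptions and not the actual TKRA-GL code: the projection is a bare x/z, y/z divide with no clipping, every Z is taken to be positive, the stack is a small fixed array, and a counter stands in for handing the triangle to the rasterizer. All names are made up.

```c
#include <assert.h>

typedef struct { float x, y, z; } vec3;
typedef struct { vec3 v[3]; } tri;

#define TRI_STACK_MAX 64

static vec3 midpoint(vec3 a, vec3 b)   /* midpoint in world space */
{
    vec3 m = { (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f };
    return m;
}

static tri make_tri(vec3 a, vec3 b, vec3 c)
{
    tri t;
    t.v[0] = a; t.v[1] = b; t.v[2] = c;
    return t;
}

/* Squared screen-space length of edge a-b (toy projection, z > 0). */
static float screen_dist2(vec3 a, vec3 b)
{
    float dx = b.x / b.z - a.x / a.z;
    float dy = b.y / b.z - a.y / a.z;
    return dx * dx + dy * dy;
}

/* Returns how many sub-triangles ended up being "drawn". */
static int draw_subdivided(tri root, float limit_px)
{
    tri stack[TRI_STACK_MAX];
    int sp = 0, drawn = 0;
    float limit2 = limit_px * limit_px;

    assert(limit_px > 0.0f);
    stack[sp++] = root;
    while (sp > 0) {
        tri t = stack[--sp];
        int too_big = screen_dist2(t.v[0], t.v[1]) > limit2 ||
                      screen_dist2(t.v[1], t.v[2]) > limit2 ||
                      screen_dist2(t.v[2], t.v[0]) > limit2;
        if (too_big && sp + 4 <= TRI_STACK_MAX) {
            /* Split at the midpoint of each edge: 4 sub-triangles. */
            vec3 m01 = midpoint(t.v[0], t.v[1]);
            vec3 m12 = midpoint(t.v[1], t.v[2]);
            vec3 m20 = midpoint(t.v[2], t.v[0]);
            stack[sp++] = make_tri(t.v[0], m01, m20);
            stack[sp++] = make_tri(m01, t.v[1], m12);
            stack[sp++] = make_tri(m20, m12, t.v[2]);
            stack[sp++] = make_tri(m01, m12, m20);
        } else {
            drawn++;   /* a real renderer would rasterize t here */
        }
    }
    return drawn;
}
```

Note that each split re-projects every vertex of every sub-piece, which is the front-end cost being complained about.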
>>
>> Which, as noted, is not the most efficient way possible to do it, as
>> each fragmented primitive also needs to send all of the sub-vertices
>> through the vertex transform.
>>
>>
>> May also make sense to clip the primitive to the frustum, as
>> primitives crossing outside the frustum or across the near clip plane,
>> behave particularly badly with the perspective-correct math.
>>
>>
>> But, this part could possibly use a bit of restructuring, as I
>> originally wrote it with the assumption of a (comparably slow)
>> software rasterized backend. Hadn't expected the front-end stages to
>> become the bottleneck...
>>
>>
>> Note that the framebuffer and depth buffer in my case were generally:
>>    RGB555A
>>      0rrrrrgggggbbbbb (A=255)
>>      1rrrraggggabbbba (A=0/32/64/96/128/160/192/224).
>>      Z16    (No Stencil)
>>      Z12.S4 (Stencil)
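As an illustration of the RGB555A layout above, here is a sketch of unpacking one pixel to 8 bits per channel. The bit positions follow the two patterns shown, but two things are guesses on my part and labeled as such: that the three scattered alpha bits are ordered most significant first (bit 10, then bit 5, then bit 0), and that components are widened by bit replication. The type and function names are made up.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba8;

static rgba8 decode_rgb555a(uint16_t px)
{
    rgba8 c;
    if (!(px & 0x8000u)) {
        /* 0rrrrrgggggbbbbb: 5 bits per component, fully opaque. */
        unsigned r = (px >> 10) & 31u, g = (px >> 5) & 31u, b = px & 31u;
        c.r = (uint8_t)((r << 3) | (r >> 2));   /* assumed 5->8 expansion */
        c.g = (uint8_t)((g << 3) | (g >> 2));
        c.b = (uint8_t)((b << 3) | (b >> 2));
        c.a = 255;
    } else {
        /* 1rrrraggggabbbba: 4 bits per component, and the low bit of
         * each 5-bit field is one of three alpha bits. */
        unsigned r = (px >> 11) & 15u, g = (px >> 6) & 15u, b = (px >> 1) & 15u;
        /* Assumed ordering: bit 10 is the alpha MSB, bit 0 the LSB. */
        unsigned a3 = (((px >> 10) & 1u) << 2) |
                      (((px >> 5) & 1u) << 1) |
                      (px & 1u);
        c.r = (uint8_t)((r << 4) | r);          /* assumed 4->8 expansion */
        c.g = (uint8_t)((g << 4) | g);
        c.b = (uint8_t)((b << 4) | b);
        c.a = (uint8_t)(a3 * 32u);              /* 0, 32, ..., 224 */
    }
    return c;
}
```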
>>
>> Though, 12-bit depth does lead to fairly obvious Z fighting.
>>
>> Note that 4 stencil bits is basically the minimum needed to pull off
>> effects like stencil shadows.
>>
>> The reason for not using 32-bit Color and Depth buffers mostly has to
>> do with memory bandwidth and performance.
>>
>>
>>>
>>> The Software Quake in this case, is a modified version of the Quake C
>>> software renderer:
>>>    Was modified early on to use 16-bit pixels rather than 8-bit pixels;
>>>      Initially, this was YUV655, but then went to RGB555.
>>>    A few functions were rewritten in ASM.
>>>      Though, still basically all scalar code;
>>>      The bulk of the renderer is still C though.
>>>
>>> There is still some weirdness in a few places where the math still
>>> assumes YUV, which leads to things like the menu background blending
>>> being the wrong color (never got around to fixing this), ...
>>> (The video there seemed to show a dithered effect, which is a little
>>> different from a color-blend).
>>>
>>> Did gain some alpha blended effects (such as a translucent console),
>>> because these seemed cool at the time, and aren't too hard to pull off
>>> with RGB pixels.
>>>
>>>
>>> Note that my GLQuake port is still faster than the Quake
>>> software-renderer, even with software-rasterized OpenGL.
>>>
>>> Does sort of imply a faster software renderer could still be possible...
>>>
>>
>> The OpenGL API and front-end isn't exactly low-overhead, so
>> lower-overhead could be possible.
>>
>> Just my implementation "cheats" a bit by using a lot of SIMD ops,
>> whereas Quake itself tends to be pretty much exclusively scalar code,
>> generally working with vectors via function calls and "float *"
>> pointers and similar.
>>
>> Well, or:
>>    typedef float vec3_t[3];
>> Which basically achieves the same effect in practice.
>>
>>
>>
>> In my case, no vector extensions, just sort of a limited subset of MMX
>> and SSE style SIMD operations (implemented on top of GPRs or GPR pairs).
>>
>> Well, and unlike MMX, there are generally no packed byte operations
>> (smallest packed element being a 16-bit word).
>>
>>
>>>
>>>
>>> Though, in my Doom port, I did eventually go from the use of
>>> color-blending (for things like screen flashes) to the use of
>>> integrating the color-flash into the active "colormap" table (*),
>>> which is used every time a span or column is drawn in Doom (not so
>>> much in SW Quake; where texturing+lighting is precalculated via a
>>> "surface cache").
>>>
>>> *: It being computationally faster to RGB blend the current version
>>> of the colormap table, than to RGB blend the final screen image (with
>>> menus/status-bar/etc being drawn via the unblended colormap).
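A sketch of the colormap trick, assuming the table holds RGB555 pixels and the flash strength is a weight from 0 to 32. The names, the weight range and the table layout are illustrative assumptions, not taken from the actual Doom port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one RGB555 pixel toward `flash`; weight 0 = none, 32 = full. */
static uint16_t blend_rgb555(uint16_t c, uint16_t flash, unsigned weight)
{
    assert(weight <= 32u);
    unsigned keep = 32u - weight;
    unsigned r = (((c >> 10) & 31u) * keep + ((flash >> 10) & 31u) * weight) >> 5;
    unsigned g = (((c >> 5) & 31u) * keep + ((flash >> 5) & 31u) * weight) >> 5;
    unsigned b = ((c & 31u) * keep + (flash & 31u) * weight) >> 5;
    return (uint16_t)((r << 10) | (g << 5) | b);
}

/* Rebuild the active colormap from the unblended base table. Spans and
 * columns then draw through `active`, while HUD and menus keep using
 * `base`, so they are not tinted. */
static void flash_colormap(const uint16_t *base, uint16_t *active,
                           size_t count, uint16_t flash, unsigned weight)
{
    for (size_t i = 0; i < count; i++)
        active[i] = blend_rgb555(base[i], flash, weight);
}
```

The table only needs rebuilding when the flash color or strength changes, whereas blending the framebuffer costs one blend per pixel every frame.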
>>>
>>> Though, I did once experiment with eliminating the colormap table
>>> entirely in Doom, and using purely RGB modulation (like one might do
>>> in a GL-style rasterizer), but this was slower than using the
>>> colormap table.
>>>
>>>
>>> At the moment, Doom at least mostly holds over 20 fps (at 50MHz),
>>> having gained a few fps on average with a recent experimental
>>> optimization:
>>> Temporary variables which are used exclusively as function-call
>>> inputs may have the expression output directly to the register
>>> corresponding to the function argument, rather than first going to a
>>> callee-save register and then being MOV'ed to the final argument
>>> register.
>>>
>>> Effect seems to be:
>>>    Makes binary 3% smaller;
>>>    Makes Doom roughly 9% faster;
>>>    Drops "MOV Reg,Reg" from being ~ 16% of total ops, to ~ 13%;
>>>    Causes the number of bundled instructions to drop by 1% though;
>>>    ...
>>>
>>> Note that this only applies to temporaries, not to expressions
>>> performed via local variables or similar, which still use callee-save
>>> registers.
>>>
>>> Had sort of hoped it would save more, but it seems like many of the
>>> "MOV's" for function arguments are coming from local variables rather
>>> than temporaries (but, unlike a temporary, the contents of a local
>>> variable still need to be intact after the function call).
>>>
>>
>> A bunch more fiddling, it is now down a little more:
>>    The "MOV Reg,Reg" case is down to around 10% of the total binary.
>> A lot of this was by fiddling with stuff mostly at the IR stages to
>> reduce the number of temporaries used in cases where using an extra
>> temporary was not necessary.
>>
>> Still some remain, but attempts to eliminate these cases had broken
>> the program being compiled.
>>
>>>
>>>
>>> After a fair bit of debugging (to get the built program to not be
>>> entirely broken), this change has a more obvious effect on the
>>> performance of ROTT (which gets around 70% faster and ~ 6% smaller).
>>> (Though, there is some other unresolved, less-recent bug, that has
>>> seemed to cause MIDI playback in ROTT to sound like broken garbage).
>>>
>>
>> Not sure why ROTT seemed to have seen such a disproportionately larger
>> result...
>>
>> But, ROTT went from "fairly slow" up to "relatively fast" (now running
>> at closer to Doom-like speeds).
>>
>>
>> Hexen is still fairly slow though, as Hexen manages to be almost as
>> slow as Quake, despite being Doom-engine based, and my Doom engine
>> port is comparably faster.
>>
>> Previously, ROTT was around a similar speed to Hexen.
>>
>>
>> Though, of the ports, ROTT is ironically the only one still doing its
>> rendering in 8-bit color, but this was partly because the engine had
>> tried using the VGA in very weird ways (*), and I ended up
>> implementing the ROTT port partly by faking the VGA behavior by
>> implementing most of the graphics-hardware interface via function calls.
>>
>> *: Rather than linear 320x200, it used the VGA in a sort of planar mode:
>> 4 planes, each 96x200, for ~ 384x200 mode, but only 320x200 is
>> visible. For some effects, it dynamically modifies the color palette, ...
>>
>>
>> The other games had merely used linear 320x200, and had seemingly
>> already made some provisions for rendering via 16-bit pixels.
>>
>> Though, this did still leave the matter of how to implement things
>> like screen color-flashes (which in 256 color versions were pulled off
>> by dynamically modifying the color palette).
>>
>>
>> Early on, had used framebuffer level color-blending. For Doom, later
>> went to dynamically updating the "colormap" table based on the screen
>> flash (but always drawing HUD/UI using the original colormap, as Doom
>> redraws things like the HUD incrementally, and so drawing it with the
>> colormap can lead to ugly artifacts if screen-flashing gets involved).
>>
>> For things like the "invisibility" effect, had to rework this to
>> work with RGB555 (original version fed the pixels back through the
>> colormap table a second time to darken each pixel by an amount given
>> in a lookup table; wrote something different that did a similar effect
>> but with packed RGB555 pixels instead).
>>
>>
>> Meanwhile, Hexen had done a different effect of doing a 50% blend
>> between the sprite color and the background color (in indexed color,
>> this was done using a lookup table; in RGB555, the blending is done in
>> RGB space).
>>
>> Though, a simple 50% blend can be "cheesed" as, say:
>>    newclr=((clr1&0x7BDE)+(clr2&0x7BDE))>>1;
>> Rather than needing to unpack and blend each component, then repack
>> the result.
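Wrapped up as a function, the masked-add trick above looks like this (the function name is made up). The 0x7BDE mask clears the low bit of each 5-bit field, so each per-channel sum fits in 6 bits and its carry lands in the cleared low bit of the next field up. The shift then halves all three channels at once. The price is that the lowest bit of each input channel is ignored, so the result can be off by one step per channel compared with a full unpack-and-blend.

```c
#include <stdint.h>

/* 50% blend of two RGB555 pixels without unpacking the components. */
static uint16_t blend50_rgb555(uint16_t clr1, uint16_t clr2)
{
    return (uint16_t)(((clr1 & 0x7BDEu) + (clr2 & 0x7BDEu)) >> 1);
}
```

For example, blending white (0x7FFF) with black (0x0000) gives 0x3DEF, which is 15 in each channel.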
>>
> I added a color blend operation to the Thor/Q+ instruction set a while
> ago as part of the graphics operations, thinking that the CPU could take
> over some of the graphics ops performed by an accelerator. The color
> blend operator blends two 30-bit RGB10-10-10 colors with a fixed-point
> alpha 1.9 bits. It computes the color blend in two clock cycles. There
> is also a ‘transform’ instruction that translates and rotates a point in
> a 3D space. Performs all the matrix math (nine multiplies and adds) in
> about four clock cycles.
>
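For comparison, a software model of a blend like the one described might look as follows. This is only a guess at the arithmetic and not the Thor/Q+ definition: it assumes R/G/B sit in bits 29..20, 19..10 and 9..0, that the 1.9 fixed-point alpha uses 512 for 1.0, and that alpha weights the first color. The function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* One 10-bit channel; alpha is 1.9 fixed point, so 512 == 1.0. */
static uint32_t blend_channel10(uint32_t c1, uint32_t c2, uint32_t alpha)
{
    return (c1 * alpha + c2 * (512u - alpha)) >> 9;
}

/* Blend two packed RGB10-10-10 colors: result = c1*alpha + c2*(1-alpha). */
static uint32_t blend_rgb10(uint32_t c1, uint32_t c2, uint32_t alpha)
{
    assert(alpha <= 512u);
    uint32_t r = blend_channel10((c1 >> 20) & 1023u, (c2 >> 20) & 1023u, alpha);
    uint32_t g = blend_channel10((c1 >> 10) & 1023u, (c2 >> 10) & 1023u, alpha);
    uint32_t b = blend_channel10(c1 & 1023u, c2 & 1023u, alpha);
    return (r << 20) | (g << 10) | b;
}
```

In software this takes three multiply-accumulate steps plus the unpacking and repacking, which makes a dedicated instruction look attractive.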

Re: FPGA use

<uo58qf$1b30c$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36882&group=comp.arch#36882

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Tue, 16 Jan 2024 07:47:43 +0100
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <uo58qf$1b30c$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
<unusod$3290$1@dont-email.me> <uo45pf$12oh6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Tue, 16 Jan 2024 06:47:43 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0c460b582cd9ee3f37adc472f607dbb4";
logging-data="1412108"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/posBID7dwnmOqmqXOGC3mjVSgZBk2dFGRqzMWB+RNuw=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:s7kqp7z/weMh0XdZEHPex7hvsRk=
In-Reply-To: <uo45pf$12oh6$1@dont-email.me>

by: Terje Mathisen - Tue, 16 Jan 2024 06:47 UTC

BGB wrote:
> On 1/13/2024 2:45 PM, BGB wrote:
>> *: At the moment, basically any triangle with a circumference larger
>> than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going
>> too much bigger makes the affine warping a lot more obvious.
>>
>>
>
> Self-correction:
> It is 144 for a triangle and 192 for a quad, 48 pixels on each side...
>
> I had mistakenly done a square-root here, noting that I compared these
> values as distance^2, but then forgot to take into account that the
> values themselves were squared, so it was 48 pixels linear, rather than
> square-root of 48 pixels.
>
> Using 7 pixels would have probably led to a lot less affine distortion,
> but would have had a much higher overhead. Goal being mostly to break
> apart large primitives where the affine distortion is more obvious.

I am pretty sure that the original SW Quake used 16-pixel spans, with a
single 1/Z division for each span, so interpolated between affine and
perspective correct?
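Roughly like this, as a sketch of the idea rather than Quake's actual code: step S/Z and 1/Z linearly across the span, do one true divide at the end of every sub-span, and interpolate S linearly in between. Only the S coordinate is shown, and the function and parameter names are made up.

```c
#include <assert.h>

/* Fill s_out[0..count-1] with texture S coordinates for a span.
 * sz0 and rz0 are S/Z and 1/Z at the first pixel; dsz and drz are their
 * per-pixel steps. The perspective divide is done once per `step`
 * pixels; pixels in between are linearly (affinely) interpolated. */
static void span_s_coords(float sz0, float rz0, float dsz, float drz,
                          int count, int step, float *s_out)
{
    assert(step > 0);
    float s_left = sz0 / rz0;               /* exact S at span start */
    int x = 0;
    while (x < count) {
        int n = (count - x < step) ? (count - x) : step;
        float sz1 = sz0 + dsz * n;
        float rz1 = rz0 + drz * n;
        float s_right = sz1 / rz1;          /* the one perspective divide */
        /* With a fixed power-of-two step this second divide would be a
         * shift or a multiply by a constant in fixed-point code. */
        float ds = (s_right - s_left) / n;
        for (int i = 0; i < n; i++)
            s_out[x + i] = s_left + ds * i;
        x += n;
        sz0 = sz1;
        rz0 = rz1;
        s_left = s_right;
    }
}
```

With step set to 16 the result is exact at every 16th pixel and affine in between, so the error stays bounded regardless of how long the span is.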

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: FPGA use

<uo59ec$1b5tc$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36883&group=comp.arch#36883

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Tue, 16 Jan 2024 07:58:20 +0100
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <uo59ec$1b5tc$1@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
<unusod$3290$1@dont-email.me> <uo45pf$12oh6$1@dont-email.me>
<uo58qf$1b30c$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Tue, 16 Jan 2024 06:58:20 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0c460b582cd9ee3f37adc472f607dbb4";
logging-data="1415084"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+EjwQ5AsFVtyaysgX9cY2GuS3EWGGkfWLxvJsQYDKVNg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:zzY0l/eHtqxGuS9mPcbKvVJmA9k=
In-Reply-To: <uo58qf$1b30c$1@dont-email.me>

by: Terje Mathisen - Tue, 16 Jan 2024 06:58 UTC

Terje Mathisen wrote:
> BGB wrote:
>> On 1/13/2024 2:45 PM, BGB wrote:
>>> *: At the moment, basically any triangle with a circumference larger
>>> than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going
>>> too much bigger makes the affine warping a lot more obvious.
>>>
>>>
>>
>> Self-correction:
>>   It is 144 for a triangle and 192 for a quad, 48 pixels on each
>> side...
>>
>> I had mistakenly done a square-root here, noting that I compared these
>> values as distance^2, but then forgot to take into account that the
>> values themselves were squared, so it was 48 pixels linear, rather
>> than square-root of 48 pixels.
>>
>> Using 7 pixels would have probably led to a lot less affine
>> distortion, but would have had a much higher overhead. Goal being
>> mostly to break apart large primitives where the affine distortion is
>> more obvious.
>
> I am pretty sure that the original SW Quake used 16-pixel spans, with a
> single 1/Z division for each span, so interpolated between affine and
> perspective correct?

PS. A key idea (from Mike Abrash) was that he managed to overlap
(nearly?) all of the FDIV latency with integer ops drawing that 16-pixel
span, so the division became close to free!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: FPGA use

<uo5j1l$1cgao$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36884&group=comp.arch#36884

Path: i2pn2.org!i2pn.org!news.chmurka.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: FPGA use
Date: Tue, 16 Jan 2024 03:42:12 -0600
Organization: A noiseless patient Spider
Lines: 59
Message-ID: <uo5j1l$1cgao$2@dont-email.me>
References: <unf3r6$17714$1@dont-email.me> <untlmv$3ssbv$1@dont-email.me>
<unusod$3290$1@dont-email.me> <uo45pf$12oh6$1@dont-email.me>
<uo58qf$1b30c$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 16 Jan 2024 09:42:14 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="124bb731d779d9f7acf382eb79b2d27d";
logging-data="1458520"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+rzDIbV9mql83tsg2g+3hb"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Nz63wo1IjqrC72rcITaeYx6Rjco=
In-Reply-To: <uo58qf$1b30c$1@dont-email.me>
Content-Language: en-US

by: BGB - Tue, 16 Jan 2024 09:42 UTC

On 1/16/2024 12:47 AM, Terje Mathisen wrote:
> BGB wrote:
>> On 1/13/2024 2:45 PM, BGB wrote:
>>> *: At the moment, basically any triangle with a circumference larger
>>> than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going
>>> too much bigger makes the affine warping a lot more obvious.
>>>
>>>
>>
>> Self-correction:
>> It is 144 for a triangle and 192 for a quad, 48 pixels on each side...
>>
>> I had mistakenly done a square-root here, noting that I compared these
>> values as distance^2, but then forgot to take into account that the
>> values themselves were squared, so it was 48 pixels linear, rather
>> than square-root of 48 pixels.
>>
>> Using 7 pixels would have probably led to a lot less affine
>> distortion, but would have had a much higher overhead. Goal being
>> mostly to break apart large primitives where the affine distortion is
>> more obvious.
>
> I am pretty sure that the original SW Quake used 16-pixel spans, with a
> single 1/Z division for each span, so interpolated between affine and
> perspective correct?
>

Yeah, though the "7 pixel" figure was a bit of a mental screw-up on my
part; in TKRA-GL, it is more like 48 pixels... Took the square root of
48, should not have square-rooted it, as the 48 was being squared...

There are cases where this limit is reduced though:
   Steep angles;
   Crossing frustum edge;
   Crossing near clip plane;
   ...

But, yeah, in any case, I may need to rework how I do this:
Possibly switching to perspective-correct rendering, or doing the
subdivision in screen-space rather than world space (likely with the
primitives being initially clipped against the view frustum).

As for SW Quake, yeah.
IIRC, the ASM version redid the 1/Z every 16 pixels or so.
Though, this was 8 pixels originally for the C version.
16 works well;
32 pixels leads to more obvious distortion.

Granted, 48 is worse than 32, but going much smaller greatly
increases the time spent fragmenting and projecting primitives.

> Terje
>
