Rocksolid Light - comp.arch - Re: The Impending Return of Concertina III

Robert Finch wrote:

> On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:
>>
>>
>> Whereas:
>>
>> funct2:
>>      ENTER   R25,R0,stackArea2
>>      ...
>>
>> funct1:
>>      ...
>>      EXIT    R21,R0,stackArea1
>>
>> will have registers R0,R25..R30 in the same positions on the stack
>> guaranteed by ISA definition!!
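
A rough sketch of the overlap argument quoted above, for readers who want to see which registers are "already in place" when an EXIT runs straight into an ENTER. The helper names are hypothetical, and the range semantics are an assumption: "Rstart,Rstop" is taken to mean Rstart through Rstop inclusive, wrapping modulo 32, with R31 (assumed to be SP) skipped because it is handled separately.

```python
# Hypothetical model of ENTER/EXIT register ranges; not taken from any ISA manual.
# Assumption: the range wraps modulo 32 and skips R31 (treated here as SP).

def reg_range(start, stop):
    """Return the register numbers named by 'Rstart,Rstop'."""
    regs = []
    r = start
    while True:
        if r != 31:          # assumed: SP is not part of the saved range
            regs.append(r)
        if r == stop:
            break
        r = (r + 1) % 32     # wrap from R31 back to R0
    return regs

def already_in_place(exit_range, enter_range):
    """Registers restored by EXIT that the following ENTER would save again."""
    restored = set(reg_range(*exit_range))
    saved = set(reg_range(*enter_range))
    return sorted(restored & saved)

# EXIT R21,R0 in funct1 followed by ENTER R25,R0 in funct2
overlap = already_in_place((21, 0), (25, 0))
print(overlap)
```

Under these assumptions the overlap is R0 and R25..R30, matching the set named in the quoted text. With separate LD/ST instructions the hardware would first have to prove the offsets agree; here the agreement follows from the two register ranges alone.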

> I like the ENTER / EXIT instructions and safe stack idea, and have
> incorporated them into Q+ called ENTER and LEAVE. EXIT makes me think of
> program exit(). They can improve code density. I gather that the stack
> used for ENTER and EXIT is not the same stack as is available for the
> rest of the app. This means managing two stack pointers, the regular
> stack and the safe stack. Q+ could have the safe stack pointer as a
> register that is not even accessible by the app and not part of the GPR
> file.

LEAVE has older x86 connotations, so I used a different word.

Registers R16..R31 go on the safe stack (when enabled): SSP
Registers R01..R15 go on the regular stack: SP

When safe stack is enabled, Return Address goes directly on safe stack
without passing through R0; and comes off of safe-stack without passing
through R0.

SSP requires privilege to access.
The safe stack pages are required to have RWE = 3'B000 rights; so SW
cannot read or write these containers directly or indirectly.
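
A minimal sketch of the register-to-stack rule just described. The function name and the "safe_stack_enabled" flag are hypothetical; treating the R0 slot as the return address, and putting everything on SP when the safe stack is disabled, are my reading of the text rather than a statement of the actual hardware.

```python
# Hypothetical model: which stack pointer a saved register uses.
# The rule "R16..R31 -> SSP" is equivalent to testing bit 4 (the high-order
# bit) of the 5-bit register index.

def stack_for_register(reg, safe_stack_enabled=True):
    if not 0 <= reg <= 31:
        raise ValueError("register index must be 0..31")
    if not safe_stack_enabled:
        return "SP"                      # assumed: one stack when disabled
    if reg == 0:
        return "SSP"                     # return address goes to the safe stack
    high_bit = (reg >> 4) & 1            # 1 for R16..R31, 0 for R1..R15
    return "SSP" if high_bit else "SP"

for r in (0, 7, 15, 16, 25):
    print("R%d -> %s" % (r, stack_for_register(r)))
```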

> For ENTER/LEAVE Q+ has the number of registers to save specified as a
> four-bit number and saves only the saved registers, link register and
> frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2,
> the frame-pointer, link register and allocate 64 bytes plus the return
> block on the stack. The return block contains the frame-pointer, link
> register and two slots that are zeroed out intended for exception
> handlers. The saved registers are limited to s0 to s9.
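
As a worked example of the Q+ description quoted above, here is a small sketch that adds up the stack space "ENTER 3,64" would claim. The 8-byte slot size is an assumption (64-bit registers), the function name is hypothetical, and no particular ordering of the slots is implied.

```python
# Hypothetical frame-size calculation for the quoted Q+ ENTER description.
SLOT_BYTES = 8          # assumption: one 64-bit register per slot
RETURN_BLOCK_SLOTS = 4  # frame pointer, link register, two zeroed slots

def qplus_enter_frame(nregs, local_bytes):
    if not 0 <= nregs <= 10:
        raise ValueError("saved registers are limited to s0..s9")
    saved = ["s%d" % i for i in range(nregs)]
    return_block_bytes = RETURN_BLOCK_SLOTS * SLOT_BYTES
    saved_bytes = nregs * SLOT_BYTES
    return {
        "saved_registers": saved,
        "return_block_bytes": return_block_bytes,
        "saved_bytes": saved_bytes,
        "local_bytes": local_bytes,
        "total_bytes": return_block_bytes + saved_bytes + local_bytes,
    }

frame = qplus_enter_frame(3, 64)
print(frame["saved_registers"], frame["total_bytes"])
```

With these assumptions the frame is 32 bytes of return block, 24 bytes of saved registers and 64 bytes of locals.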

I specify start and stop registers in ENTER and EXIT. In addition the
16-bit immediate field is used to allocate/deallocate space other than
the save/restored registers. Since the stack is always doubleword
aligned, the low order 3 bits are used "for special things"::
bit<0> decides if SP is saved on the stack (it is not, 99% of the time)
bit<1> decides if FP is saved and updated (or restored)
bit<2> decides if a return is performed (used when SW walks the stack
back while doing try-throw-catch).
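
The bit assignments above can be sketched as a small decoder. This is only an illustration: the field names are hypothetical, the polarity of each bit is not stated in the post, and I am assuming the allocation size is simply the immediate with its low 3 bits cleared (which is what doubleword alignment suggests).

```python
# Hypothetical decoder for the 16-bit ENTER/EXIT immediate described above.

def decode_enter_exit_imm(imm16):
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    return {
        "sp_bit": bool(imm16 & 0x1),      # bit<0>: SP saved on the stack or not
        "fp_bit": bool(imm16 & 0x2),      # bit<1>: FP saved/updated or restored
        "return_bit": bool(imm16 & 0x4),  # bit<2>: whether a return is performed
        "alloc_bytes": imm16 & 0xFFF8,    # assumed: doubleword-aligned byte count
    }

fields = decode_enter_exit_imm(0x0042)   # 64 bytes of locals, bit<1> set
print(fields)
```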

I use the HoB (high-order bit) of the register index to select the stack pointer.

> Q+ also has PUSHA / POPA instructions to push or pop all the
> registers, meant for interrupt handlers. PUSH and POP instructions by
> themselves can push or pop up to five registers.

By the time control arrives at the interrupt dispatcher, the old registers
have been saved and the registers of the ISR have been loaded; so have
ASID and ROOT, ... Thus an ISR can keep pointers in its register file
to quicken access when invoked.

> Some thought has been given towards modifying ENTER and LEAVE to support
> interrupt handlers, rather than have separate PUSHA / POPA instructions.
> ENTER 15,0 would save all the registers, and LEAVE 15,0 would restore
> them all and return using an interrupt return.

On 1/26/2024 10:58 AM, Robert Finch wrote:
> On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
>>>> BGB wrote:
>>>>
>>>>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>>>>> BGB wrote:
>>>>>>
>>>>>>> Granted, one can argue the same of prolog/epilog compression in
>>>>>>> my case:
>>>>>>> Save some space on prolog/epilog by calling or branching to prior
>>>>>>> versions (since the code to save and restore GPRs is fairly
>>>>>>> repetitive).
>>>>>>
>>>>>> ENTER and EXIT eliminate the additional control transfers and can
>>>>>> allow
>>>>>> FETCH of the return address to start before the restores are
>>>>>> finished.
>>>>
>>>>> Possible, but branches are cheaper to implement in hardware, and
>>>>> would have been implemented already...
>>>>
>>>> Are you intentionally misreading what I wrote ??
>>>>
>>
>>> ?? I don't understand.
>>
>>
>>
>>>> Epilogue is a sequence of loads leading to a jump to the return
>>>> address.
>>>>
>>>> Your ISA cannot jump to the return address while performing the loads
>>>> so FETCH does not get the return address and can't start fetching
>>>> instructions until the jump is performed.
>>>>
>>
>>> You can put the load for the return address before the other loads.
>>> Then, if the epilog is long enough (so that this load is no-longer in
>>> flight once it hits the final jump), the branch-predictor will lead
>>> it to start loading the post-return instructions before the jump is
>>> reached.
>>
>> Yes, you can read RA early.
>> What you cannot do is JMP early so the FETCH stage fetches instructions
>> at return address early.
>> {{If you JMP early, then the rest of the LDs won't happen}}
>>
>>> This is likely a non-issue as I see it.
>>
>>> It is only really an issue if one demands that reloading the return
>>> address be done as one of the final instructions in the epilog, and
>>> not one of the first instructions.
>>
>> I make no such demand--I merely demand the JMP RA is the last
>> instruction.
>>
>>> Granted, one would have to do it as one of the final ops, if it were
>>> implemented as a slide, but it is not. There are "practical reasons"
>>> why a slide would not be a workable strategy in this case.
>>
>>> So, generally, these parts of the prolog/epilog sequences are emitted
>>> for every combination of saved/restored registers that had been
>>> encountered.
>>
>>> Though, granted, when used, it does mean that any such function needs
>>> effectively two sets of stack-pointer adjustments:
>>> One set for the save/restore area (in the reused part);
>>> One part for the function (for its data and local/temporary variables
>>> and similar).
>>
>>
>>>> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
>>>> the return address from the stack and fetch the instructions at the
>>>> return address while still loading the preserved registers (that were
>>>> saved) so that the instructions are ready for execution by the time
>>>> the last LD is performed.
>>>>
>>>> In addition, If one is performing an EXIT and fetch runs into a CALL;
>>>> it can fetch the Called address and if there is an ENTER instruction
>>>> there, it can cancel the remainder of EXIT and cancel some of ENTER
>>>> because the preserved registers are already on the stack where they
>>>> are supposed to be.
>>>>
>>>> Doing these with STs and LDs cannot save those cycles.
>>>>
>>
>>> I don't see why not, the branch-predictor can still do its thing
>>> regardless of whether or not LD/ST ops were used.
>>
>> Consider::
>>
>> main:
>>       ...
>>       CALL   funct1
>>       CALL   funct2
>>
>> funct2:
>>       SUB    SP,SP,stackArea2
>>       ST     R0,[SP,offset20]
>>       ST     R30,[SP,offset230]
>>       ST     R29,[SP,offset229]
>>       ST     R28,[SP,offset228]
>>       ST     R27,[SP,offset227]
>>       ST     R26,[SP,offset226]
>>       ST     R25,[SP,offset225]
>>       ...
>>
>> funct1:
>>       ...
>>       LD     R0,[SP,offset10]
>>       LD     R30,[SP,offset130]
>>       LD     R29,[SP,offset129]
>>       LD     R28,[SP,offset128]
>>       LD     R27,[SP,offset127]
>>       LD     R26,[SP,offset126]
>>       LD     R25,[SP,offset125]
>>       LD     R24,[SP,offset124]
>>       LD     R23,[SP,offset123]
>>       LD     R22,[SP,offset122]
>>       LD     R21,[SP,offset121]
>>       ADD    SP,SP,stackArea1
>>       JMP    R0
>>
>> The above would have to observe that all offset1's are equal to all
>> offset2's in order to short circuit the data movements. A single::
>>
>>       LD     R26,[SP,someotheroffset]
>>
>> ruins the short circuit.
>>
>> Whereas:
>>
>> funct2:
>>       ENTER   R25,R0,stackArea2
>>       ...
>>
>> funct1:
>>       ...
>>       EXIT    R21,R0,stackArea1
>>
>> will have registers R0,R25..R30 in the same positions on the stack
>> guaranteed by ISA definition!!
>
> I like the ENTER / EXIT instructions and safe stack idea, and have
> incorporated them into Q+ called ENTER and LEAVE. EXIT makes me think of
> program exit(). They can improve code density. I gather that the stack
> used for ENTER and EXIT is not the same stack as is available for the
> rest of the app. This means managing two stack pointers, the regular
> stack and the safe stack. Q+ could have the safe stack pointer as a
> register that is not even accessible by the app and not part of the GPR
> file.
>
> For ENTER/LEAVE Q+ has the number of registers to save specified as a
> four-bit number and saves only the saved registers, link register and
> frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2,
> the frame-pointer, link register and allocate 64 bytes plus the return
> block on the stack. The return block contains the frame-pointer, link
> register and two slots that are zeroed out intended for exception
> handlers. The saved registers are limited to s0 to s9.
>
> Q+ also has PUSHA / POPA instructions to push or pop all the
> registers, meant for interrupt handlers. PUSH and POP instructions by
> themselves can push or pop up to five registers.
>
> Some thought has been given towards modifying ENTER and LEAVE to support
> interrupt handlers, rather than have separate PUSHA / POPA instructions.
> ENTER 15,0 would save all the registers, and LEAVE 15,0 would restore
> them all and return using an interrupt return.
>

Admittedly, it can make sense for an ISA intended for higher-end
hardware, but not necessarily something intended to aim for similar
hardware costs to something like an in-order RISC-V core.

In my case, the core seems to be within a similar LUT cost range to some
of the RISC-V soft-cores. Generally smaller than some of the superscalar
cores, but bigger than a lot of the in-order scalar cores.

Looks like, if one wants to optimize for ASIC though (vs FPGA), it makes
sense to minimize the use of SRAM.

So, say:
Multiple copies of the regfile (like RISC-V does) is still not ideal;
Might also make sense to try to optimize things for smaller caches,
possibly with more expensive logic (so, say, small set-associative
caches rather than bigger direct-mapped caches).

Seems, though, like a ringbus might still be cheapish, since its
storage is mostly in the flip-flops used to implement the ring itself,
rather than needing SRAM FIFOs like in some other bus designs. I would
suspect that something like AXI or Wishbone would likely involve a
number of internal FIFO buffers.

On 2024-01-26 11:10 p.m., BGB wrote:
> On 1/26/2024 10:58 AM, Robert Finch wrote:
>> On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
>>>>> BGB wrote:
>>>>>
>>>>>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>>>>>> BGB wrote:
>>>>>>>
>>>>>>>> Granted, one can argue the same of prolog/epilog compression in
>>>>>>>> my case:
>>>>>>>> Save some space on prolog/epilog by calling or branching to
>>>>>>>> prior versions (since the code to save and restore GPRs is
>>>>>>>> fairly repetitive).
>>>>>>>
>>>>>>> ENTER and EXIT eliminate the additional control transfers and can
>>>>>>> allow
>>>>>>> FETCH of the return address to start before the restores are
>>>>>>> finished.
>>>>>
>>>>>> Possible, but branches are cheaper to implement in hardware, and
>>>>>> would have been implemented already...
>>>>>
>>>>> Are you intentionally misreading what I wrote ??
>>>>>
>>>
>>>> ?? I don't understand.
>>>
>>>
>>>
>>>>> Epilogue is a sequence of loads leading to a jump to the return
>>>>> address.
>>>>>
>>>>> Your ISA cannot jump to the return address while performing the loads
>>>>> so FETCH does not get the return address and can't start fetching
>>>>> instructions until the jump is performed.
>>>>>
>>>
>>>> You can put the load for the return address before the other loads.
>>>> Then, if the epilog is long enough (so that this load is no-longer
>>>> in flight once it hits the final jump), the branch-predictor will
>>>> lead it to start loading the post-return instructions before the
>>>> jump is reached.
>>>
>>> Yes, you can read RA early.
>>> What you cannot do is JMP early so the FETCH stage fetches instructions
>>> at return address early.
>>> {{If you JMP early, then the rest of the LDs won't happen}}
>>>
>>>> This is likely a non-issue as I see it.
>>>
>>>> It is only really an issue if one demands that reloading the return
>>>> address be done as one of the final instructions in the epilog, and
>>>> not one of the first instructions.
>>>
>>> I make no such demand--I merely demand the JMP RA is the last
>>> instruction.
>>>
>>>> Granted, one would have to do it as one of the final ops, if it were
>>>> implemented as a slide, but it is not. There are "practical reasons"
>>>> why a slide would not be a workable strategy in this case.
>>>
>>>> So, generally, these parts of the prolog/epilog sequences are
>>>> emitted for every combination of saved/restored registers that had
>>>> been encountered.
>>>
>>>> Though, granted, when used, it does mean that any such function needs
>>>> effectively two sets of stack-pointer adjustments:
>>>> One set for the save/restore area (in the reused part);
>>>> One part for the function (for its data and local/temporary
>>>> variables and similar).
>>>
>>>
>>>>> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
>>>>> the return address from the stack and fetch the instructions at the
>>>>> return address while still loading the preserved registers (that were
>>>>> saved) so that the instructions are ready for execution by the time
>>>>> the last LD is performed.
>>>>>
>>>>> In addition, If one is performing an EXIT and fetch runs into a CALL;
>>>>> it can fetch the Called address and if there is an ENTER instruction
>>>>> there, it can cancel the remainder of EXIT and cancel some of ENTER
>>>>> because the preserved registers are already on the stack where they
>>>>> are supposed to be.
>>>>>
>>>>> Doing these with STs and LDs cannot save those cycles.
>>>>>
>>>
>>>> I don't see why not, the branch-predictor can still do its thing
>>>> regardless of whether or not LD/ST ops were used.
>>>
>>> Consider::
>>>
>>> main:
>>>       ...
>>>       CALL   funct1
>>>       CALL   funct2
>>>
>>> funct2:
>>>       SUB    SP,SP,stackArea2
>>>       ST     R0,[SP,offset20]
>>>       ST     R30,[SP,offset230]
>>>       ST     R29,[SP,offset229]
>>>       ST     R28,[SP,offset228]
>>>       ST     R27,[SP,offset227]
>>>       ST     R26,[SP,offset226]
>>>       ST     R25,[SP,offset225]
>>>       ...
>>>
>>> funct1:
>>>       ...
>>>       LD     R0,[SP,offset10]
>>>       LD     R30,[SP,offset130]
>>>       LD     R29,[SP,offset129]
>>>       LD     R28,[SP,offset128]
>>>       LD     R27,[SP,offset127]
>>>       LD     R26,[SP,offset126]
>>>       LD     R25,[SP,offset125]
>>>       LD     R24,[SP,offset124]
>>>       LD     R23,[SP,offset123]
>>>       LD     R22,[SP,offset122]
>>>       LD     R21,[SP,offset121]
>>>       ADD    SP,SP,stackArea1
>>>       JMP    R0
>>>
>>> The above would have to observe that all offset1's are equal to all
>>> offset2's in order to short circuit the data movements. A single::
>>>
>>>       LD     R26,[SP,someotheroffset]
>>>
>>> ruins the short circuit.
>>>
>>> Whereas:
>>>
>>> funct2:
>>>       ENTER   R25,R0,stackArea2
>>>       ...
>>>
>>> funct1:
>>>       ...
>>>       EXIT    R21,R0,stackArea1
>>>
>>> will have registers R0,R25..R30 in the same positions on the stack
>>> guaranteed by ISA definition!!
>>
>> I like the ENTER / EXIT instructions and safe stack idea, and have
>> incorporated them into Q+ called ENTER and LEAVE. EXIT makes me think
>> of program exit(). They can improve code density. I gather that the
>> stack used for ENTER and EXIT is not the same stack as is available
>> for the rest of the app. This means managing two stack pointers, the
>> regular stack and the safe stack. Q+ could have the safe stack pointer
>> as a register that is not even accessible by the app and not part of
>> the GPR file.
>>
>> For ENTER/LEAVE Q+ has the number of registers to save specified as a
>> four-bit number and saves only the saved registers, link register and
>> frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to
>> s2, the frame-pointer, link register and allocate 64 bytes plus the
>> return block on the stack. The return block contains the
>> frame-pointer, link register and two slots that are zeroed out
>> intended for exception handlers. The saved registers are limited to s0
>> to s9.
>>
>> Q+ also has PUSHA / POPA instructions to push or pop all the
>> registers, meant for interrupt handlers. PUSH and POP instructions by
>> themselves can push or pop up to five registers.
>>
>> Some thought has been given towards modifying ENTER and LEAVE to
>> support interrupt handlers, rather than have separate PUSHA / POPA
>> instructions. ENTER 15,0 would save all the registers, and LEAVE 15,0
>> would restore them all and return using an interrupt return.
>>
>
> Admittedly, it can make sense for an ISA intended for higher-end
> hardware, but not necessarily something intended to aim for similar
> hardware costs to something like an in-order RISC-V core.

Once there is micro-code or a state machine to handle an instruction
with multiple micro-ops, it is not that costly to add other operations.
The Q+ micro-code costs something like < 1k LUTs. Many early micros used
micro-code.
>
> In my case, the core seems to be within a similar LUT cost range to some
> of the RISC-V soft-cores. Generally smaller than some of the superscalar
> cores, but bigger than a lot of the in-order scalar cores.
>
>
> Looks like, if one wants to optimize for ASIC though (vs FPGA), it makes
> sense to minimize the use of SRAM.
>
> So, say:
> Multiple copies of the regfile (like RISC-V does) is still not ideal;
> Might also make sense to try to optimize things for smaller caches,
> possibly with more expensive logic (so, say, small set-associative
> caches rather than bigger direct-mapped caches).
>
>
> Seems, though, like a ringbus might still be cheapish, since its
> storage is mostly in the flip-flops used to implement the ring itself,
> rather than needing SRAM FIFOs like in some other bus designs. I would
> suspect that something like AXI or Wishbone would likely involve a
> number of internal FIFO buffers.
>
> Also, unlike my original bus design, it is not dead slow...
>
>
> Looking at it, it seems "Wishbone Classic" is functionally similar, but
> has different signaling. Whereas other versions of Wishbone would likely
> need FIFOs to hold requests to be pushed around the bus.
>
> Though, for whatever reason, they were going with 32 or 64-bit
> transfers, whereas my bus was designed around sending data in 128-bit
> chunks. Granted, potentially, passing 128-bits would cost more than
> 64-bits. However, I would expect the logic costs to deal with 64-bit
> transfers might be higher than for 128-bit transfers (say, since now the
> L1 caches would need to deal with multi-part transfers for each L1 cache
> line; and the L2 would need to deal with its cache lines being accessed
> in terms of a larger number of comparable smaller pieces).
>
> You also wouldn't want 64-bit cache lines as then the tagging will cost
> more than the payload data (whereas 128 and 256 bit have a better ratio
> of tagging vs payload).
>
>
> I would guess though, possibly for an ASIC, using 32B or 64B cache lines
> might be preferable, as here a smaller amount of the total SRAM is spent
> on tagging bits, and in relation the logic would be cheaper relative to
> the cost of the SRAM.
>
>
> ...
>
Q+ uses a 128-bit system bus; the bus tag is not the same tag as used for
the cache. Q+ burst loads the cache with 4 128-bit accesses for 512 bits
and the 64B cache line is tagged with a single tag. The instruction /
data cache controller takes care of adjusting the bus size between the
cache and system.
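
To make the burst arithmetic concrete, here is a small sketch of how one 64B line fill breaks into four 128-bit bus beats. The function name is hypothetical, and sequential beat order starting at the line base is an assumption; the post does not say whether Q+ uses a wrapping or critical-word-first order.

```python
# Hypothetical model: one 64-byte line fill as four 16-byte (128-bit) beats.
LINE_BYTES = 64   # cache line size from the post
BEAT_BYTES = 16   # 128-bit system bus

def burst_beats(addr):
    """Return the bus addresses of the beats that fill the line holding addr."""
    base = addr & ~(LINE_BYTES - 1)              # align down to the line
    count = LINE_BYTES // BEAT_BYTES             # 4 beats
    return [base + i * BEAT_BYTES for i in range(count)]

beats = burst_beats(0x1234)
print([hex(a) for a in beats])
```

All four beats land in one line, so a single cache tag covers the 512 bits.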

Robert Finch wrote:

> On 2024-01-26 11:10 p.m., BGB wrote:
>> On 1/26/2024 10:58 AM, Robert Finch wrote:
>>><snip>
>>
>> Admittedly, it can make sense for an ISA intended for higher-end
>> hardware, but not necessarily something intended to aim for similar
>> hardware costs to something like an in-order RISC-V core.

> Once there is micro-code or a state machine to handle an instruction
> with multiple micro-ops, it is not that costly to add other operations.
> The Q+ micro-code costs something like < 1k LUTs. Many early micros used
> micro-code.

The FMAC unit has a sequencer that performs FDIV, SQRT, and transcendental
polynomials. The memory unit has a sequencer to perform LDM, STM, MM, and
ENTER and EXIT.

>> <snip>
>>
> Q+ uses a 128-bit system bus; the bus tag is not the same tag as used for
> the cache. Q+ burst loads the cache with 4 128-bit accesses for 512 bits
> and the 64B cache line is tagged with a single tag. The instruction /
> data cache controller takes care of adjusting the bus size between the
> cache and system.

A four (4) Beat burst is de rigueur for FPGA implementations.

> I think I suggested this before, and the idea got shot down, but I
> cannot find the post. It is mystery operations where the opcode comes
> from a register value. I was thinking of adding an instruction modifier
> to do this. The instruction modifier would supply the opcode bits for
> the next instruction from a register value. This would only be applied
> to specific classes of instructions. In particular register-register
> operate instructions. Many of the register-register functions are not
> decoded until execute time. The function code is simply copied to the
> execution unit. It does not have to run through the decode and rename
> stage. I think this field could easily come from a register. Seems like
> it would be easy to update the opcode while the instruction is sitting
> in the reorder buffer.

Classic 360 EXECUTE instruction ??
Basically, it sounds dangerous. {Side channels in plenty}

On 1/27/2024 11:25 AM, MitchAlsup1 wrote:
> Robert Finch wrote:
>
>> On 2024-01-26 11:10 p.m., BGB wrote:
>>> On 1/26/2024 10:58 AM, Robert Finch wrote:
>>>> <snip>
>>>
>>> Admittedly, it can make sense for an ISA intended for higher-end
>>> hardware, but not necessarily something intended to aim for similar
>>> hardware costs to something like an in-order RISC-V core.
>
>> Once there is micro-code or a state machine to handle an instruction
>> with multiple micro-ops, it is not that costly to add other
>> operations. The Q+ micro-code costs something like < 1k LUTs. Many
>> early micros used micro-code.
>
> The FMAC unit has a sequencer that performs FDIV, SQRT, and transcendental
> polynomials. The memory unit has a sequencer to perform LDM, STM, MM, and
> ENTER and EXIT.
>

I had a mechanism that basically plugged the outputs of the MUL and ADD
units together in a certain way to perform FDIV and FSQRT via running in
a feedback loop (and would wait a certain number of clock-cycles for the
result to converge). Not particularly fast though, and the results were
debatable...
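
For anyone unfamiliar with the multiply/add feedback approach, here is a software sketch of one common form of it, Newton-Raphson reciprocal refinement. To be clear, this is an assumption for illustration: the post does not say which iteration the hardware used, and the function names, the initial-guess constants and the iteration count are my own choices.

```python
import math

# Sketch of division using only multiplies and adds in a feedback loop
# (Newton-Raphson on the reciprocal). Not a model of any specific FPU.

def nr_reciprocal(d, iterations=4):
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(d))            # abs(d) = m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m  # linear first guess for 1/m
    for _ in range(iterations):
        x = x * (2.0 - m * x)            # one MUL, one SUB, one MUL per pass
    r = math.ldexp(x, -e)                # undo the scaling: 1/d = (1/m) * 2**-e
    return -r if d < 0 else r

def nr_divide(n, d, iterations=4):
    return n * nr_reciprocal(d, iterations)

print(nr_divide(1.0, 3.0))
print(nr_divide(-10.0, 4.0))
```

Each pass roughly doubles the number of correct bits, which is why a fixed number of passes (or a fixed cycle count in hardware) can be enough. The last bit is generally not correctly rounded without extra work.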

For FDIV, it was faster and more accurate to route it through the Shift-Add
unit.

But, yeah, no microcode or sequencers or similar thus far in my case.
All instructions have needed to map directly to some behavior in the EX
stage.

The closest thing is a mechanism within the main FPU to support SIMD
operations, which is basically logic to MUX the inputs and outputs to
the FPU based on the current clock-cycle (then, say, rather than
stalling for 6 cycles for a scalar operation, one stalls for 10 cycles
for a SIMD operation). In this case, if the faster SIMD unit exists, the
SIMD instructions are mapped to that unit instead (which does 4 Binary16
or Binary32 ops in parallel).

And, anything that can't be done directly in the EX stages, hasn't been
done at all (or, if it depends on ).

>>> <snip>
>>>
>> Q+ uses a 128-bit system bus; the bus tag is not the same tag as used
>> for the cache. Q+ burst loads the cache with 4 128-bit accesses for
>> 512 bits and the 64B cache line is tagged with a single tag. The
>> instruction / data cache controller takes care of adjusting the bus
>> size between the cache and system.
>
> A four (4) Beat burst is de rigueur for FPGA implementations.
>

In my case, it is 1 beat per L1 line (128-bits), effectively sending the
whole line at once.

If I were to do 32B cache lines, this would require two. Would also
complicate the logic in the L1 cache.

For an ASIC, it would likely be preferable to use 32B or 64B lines,
since logic is comparably cheaper, and it might be harder to justify
half the SRAM use in this case mostly going just to the tag bits.
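
A quick way to see the tag-versus-payload tradeoff is to compute the fraction of SRAM bits spent on tags for a few line sizes. The 64-bit tag width below is purely a placeholder assumption (the real width depends on address size, cache geometry and state bits), and the function name is hypothetical.

```python
# Hypothetical tag-overhead calculation; tag width is an assumed parameter.

def tag_fraction(line_bytes, tag_bits):
    """Fraction of total SRAM bits (tag + data) that holds tag bits."""
    line_bits = line_bytes * 8
    return tag_bits / (tag_bits + line_bits)

ASSUMED_TAG_BITS = 64  # placeholder, not a figure from the post

for line_bytes in (8, 16, 32, 64):
    share = tag_fraction(line_bytes, ASSUMED_TAG_BITS)
    print("%2dB line: %.1f%% of SRAM is tag" % (line_bytes, 100.0 * share))
```

Whatever tag width is plugged in, the share falls as the line grows, which is the argument for 32B or 64B lines when SRAM is the expensive resource.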

Could in theory use 1 row of 32B lines rather than two rows of 16B
lines, but there would be a problem here in terms of memory ports
(BRAM|SRAM with 3 ports, 2R1W, vs 1RW or 1R1W, isn't a thing IIRC).

Generally need a way to do two accesses in parallel to be able to
support unaligned memory access (would only need 1 row, and a single
access, if the cache only supported aligned access).
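
A small sketch of why unaligned access wants two lookups at once. The function name is hypothetical, and reading "two rows" as an even-line row and an odd-line row is my interpretation of the description.

```python
# Hypothetical model: which 16-byte lines does an access touch?
LINE_BYTES = 16

def lines_touched(addr, size):
    """Return the line numbers covered by a 'size'-byte access at 'addr'."""
    first = addr // LINE_BYTES
    last = (addr + size - 1) // LINE_BYTES
    return list(range(first, last + 1))

aligned = lines_touched(0x10, 8)     # stays inside one line
straddle = lines_touched(0x1C, 8)    # crosses a line boundary
print(aligned, straddle)
```

An access no larger than one line touches at most two consecutive lines, and consecutive lines always have opposite parity, so an even row and an odd row can each serve one of them in the same cycle without needing a 2R1W memory.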

I guess one possibility could be to consider something like Wishbone B3 or B4
as a possible option. While it would likely involve FIFO's, it could
have lower latency than is possible with my ringbus design.

And, in this case, performance is more being limited by latency than by
bus capacity.

Or, an intermediate option could be to keep the existing bus signaling,
but merely replace much of the "hot path" parts of the ring with a sort
of "crossbar". This option wouldn't necessarily need any FIFOs to be
added. Though, if any of the endpoints become congested, it could
potentially deadlock the bus (say, for example, the L1 D$ gets
backlogged with requests and the L2 cache with responses, but because
there are no free spots in either micro-ring, then forward progress is
not possible).

Comparably, the existing strategy of reducing ring latency via
special-case paths is moderately effective (IOW: overall topology is
still a ring, but with the equivalent of on and off ramps for messages
to take different/shorter paths to their intended destination).

>> I think I suggested this before, and the idea got shot down, but I
>> cannot find the post. It is mystery operations where the opcode comes
>> from a register value. I was thinking of adding an instruction
>> modifier to do this. The instruction modifier would supply the opcode
>> bits for the next instruction from a register value. This would only
>> be applied to specific classes of instructions. In particular
>> register-register operate instructions. Many of the register-register
>> functions are not decoded until execute time. The function code is
>> simply copied to the execution unit. It does not have to run through
>> the decode and rename stage. I think this field could easily come from
>> a register. Seems like it would be easy to update the opcode while the
>> instruction is sitting in the reorder buffer.
>
> Classic 360 EXECUTE instruction ??
> Basically, it sounds dangerous. {Side channels in plenty}

Yeah.

Better to keep side-channels to a minimum.

In my case, only certain registers could have side channels:
DLR/R0, DHR/R1, and SP/R15;
Various CR's (LR, SPC, SSP, etc).

Though, many of these have ended up being read-only side-channels (the
usage of side channels to update registers has mostly been eliminated,
in favor of using normal register updates whenever possible).

The SP side-channel was mostly a consequence of:
Early on, my ISA had PUSH/POP which operated via a side-channel (Long
since eliminated);
Previously, the interrupt mechanism worked by swapping the values of SP
and SSP, rather than the current mechanism of swapping them in the decoder.

Note that the decoder also renumbers the registers in RV64 Mode.

All the normal GPRs are entirely inaccessible via side-channels.

In my current design, all register ports are resolved in the ID2 / RF stage.

Originally, predication was handled in EX1, but has been effectively
partly relocated to ID2 as well (with updates to SR.T being handled via
interlock stalls, if the following instruction depends on SR.T).

This did have the consequence of effectively also increasing CMPxx to
2-cycles, but did improve FPGA timing (though, this leaves the combined
compare-with-zero-and-branch as often preferable). Luckily, extending
these ops from 8s to 11s (or 13s in XG2 mode) did make them more useful
(the 13s case can branch +/- 8K).

Though, the split compare and branch cases still do have the advantage
of being able to reach a further distance (1MB in Baseline, 8MB in XG2).
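
The reach figures above follow from the displacement width if displacements are counted in 2-byte units. That unit size is an assumption on my part (it is what makes a 13-bit signed field come out to +/- 8K), and the 20-bit and 23-bit widths below are inferred from the stated 1MB and 8MB reach rather than quoted from the post.

```python
# Hypothetical reach calculation for signed branch displacements.

def branch_reach_bytes(disp_bits, unit_bytes=2):
    """Approximate +/- reach in bytes of a signed displacement field."""
    return (1 << (disp_bits - 1)) * unit_bytes

for bits in (8, 11, 13, 20, 23):
    print("%2d-bit signed: +/- %d bytes" % (bits, branch_reach_bytes(bits)))
```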

Technically, there are still the two-register compare-and-branch ops,
but these are not enabled by default in any profile and still limited to
Disp8s. Main reason they are around is mostly because RISC-V mode needs
this feature to be enabled.

....

