U.S. patent application number 10/294044 was filed with the patent office on 2004-02-12 for reshuffled communications processes in pipelined asynchronous circuits.
Invention is credited to Cummings, Uri, Lines, Andrew M., Martin, Alain J..
Application Number | 20040030858 10/294044 |
Document ID | / |
Family ID | 22241108 |
Filed Date | 2004-02-12 |
United States Patent
Application |
20040030858 |
Kind Code |
A1 |
Lines, Andrew M. ; et
al. |
February 12, 2004 |
Reshuffled communications processes in pipelined asynchronous
circuits
Abstract
An asynchronous processor that has reshuffled processes to
implement precharge logic.
Inventors: |
Lines, Andrew M.;
(Calabasas, CA) ; Martin, Alain J.; (Pasadena,
CA) ; Cummings, Uri; (Oak Park, CA) |
Correspondence
Address: |
FISH & RICHARDSON, PC
12390 EL CAMINO REAL
SAN DIEGO
CA
92130-2081
US
|
Family ID: |
22241108 |
Appl. No.: |
10/294044 |
Filed: |
July 18, 2001 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10294044 | Jul 18, 2001 | |
09501638 | Feb 10, 2000 | |
09501638 | Feb 10, 2000 | |
09360468 | Jul 22, 1999 | |
60093840 | Jul 22, 1998 | |
Current U.S.
Class: |
712/11 ; 710/100;
712/E9.063 |
Current CPC
Class: |
G06F 9/3869 20130101;
G06F 7/00 20130101 |
Class at
Publication: |
712/11 ;
710/100 |
International
Class: |
G06F 015/00 |
Government Interests
[0002] This application may have received funding under U.S.
Government Grant No. DAAH-04-94-G-0274 awarded by the Department of
Army.
Claims
What is claimed is:
1. An asynchronous circuit, comprising: a first process; a second
process, communicating with said first process; wherein said first
and second processes communicate using precharge logic that receives
inputs in a first gate, and tests neutrality of said inputs in a
second gate separate from said first gate.
2. A circuit as in claim 1 where each of said first and second
processes communicate via request and acknowledges.
3. A circuit as in claim 1 further comprising determining a
specified request, setting a state variable to represent said
specified request, and resetting said specified request before
acknowledging or acting on it.
4. A circuit as in claim 1 wherein said first and second processes
communicate according to
$$\mathrm{PCFB} \equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ \mathrm{en}\downarrow;\ ([R^a];\ R\downarrow),\ ([\neg L];\ L^a\downarrow);\ \mathrm{en}\uparrow].$$
5. A circuit as in claim 1 wherein said first and second processes
communicate according to
$$\mathrm{PCHB} \equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a];\ R\downarrow;\ [\neg L];\ L^a\downarrow].$$
6. A circuit according to claim 1 wherein said precharge logic
includes a first portion which computes validity of inputs and a
second portion which computes validity of outputs.
7. A method of communicating between first and second processes
without a synchronizing global clock, comprising: receiving
requests in a first gate; and acknowledging said requests prior to
completion of action thereon, and determining neutrality of said
inputs in a second gate that is separate from said first gate that
receives the input.
8. A method as in claim 7 wherein said device set test validity of
the input using at least one transistor which is unconnected to the
transistor of the device that receives the inputs.
9. A method of operating an asynchronous process, comprising:
receiving a request for some action to occur; reshuffling responses
that usually occur relative to said request, said reshuffling
responses including using a precharge logic.
10. A cell, comprising: a buffering element; a logic element,
connected to said buffering element, said logic element having a
dual rail precharge domino logic block which computes an output
based on an input; a completion tree for an input channel and a
completion tree for an output channel; and a control circuit which
combines the completion trees to generate an input acknowledge and
to precharge the logic element.
11. A cell as in claim 10, wherein said input acknowledge does not
wait for neutrality of output data, and also producing an enable
which does wait for neutrality of output data.
12. A cell as in claim 10, wherein said outputs are conditionally
produced by indicating a condition with an extra wire.
13. A cell as in claim 10, wherein said inputs are conditional
inputs.
14. A cell as in claim 10, wherein said input is only conditionally
acknowledged, and said logic determines which inputs to
acknowledge.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. application Ser.
No. 09/501,638, filed Feb. 10, 2000, which is a continuation of
U.S. application Ser. No. 09/360,468, filed Jul. 22, 1999, which
claims the benefit of U.S. Provisional Application No. 60/093,840,
filed on Jul. 22, 1998, all of which are incorporated herein by
reference.
[0003] This specification describes communicating sequential
processes (CSP) which are implemented as quasi delay insensitive
asynchronous circuits. More specifically the present specification
teaches reshuffling communication sequences and combining
computation with buffering to produce pipelined circuits.
BACKGROUND
[0004] Asynchronous processors are known as described in U.S. Pat.
No. 5,752,050. These processors process an information stream
without a global clock synchronizing the operation.
[0005] An asynchronous processor pipeline scheme uses the basic
layout shown in FIG. 1. A first process 100 communicates with a
second process 110 that in turn sends a message to the next
process. The messages use a four phase handshake. In the first
phase, the sender raises the request line. In the second phase, the
receiver raises the acknowledge line. In the third phase, the sender
lowers the request line. In the fourth phase, the receiver lowers
the acknowledge line. In the handshaking expansion language (HSE),
the handshake on channel X is described as X+; Xa+; X-; Xa-. In
FIG. 1, the request between 100 and 110 is the L wire (102). The
acknowledge for that communication is La (108). The request between
110 and 120 is the R wire (104), and the acknowledge is Ra
(106).
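The four phases described above can be sketched in Python. This is only an illustrative model, not part of the disclosed circuit: the class name FourPhaseChannel and its trace format are assumptions introduced here, and sender and receiver are collapsed into one sequential routine.

```python
class FourPhaseChannel:
    """Hypothetical model of one request/acknowledge wire pair (X, Xa)."""

    def __init__(self, name):
        self.name = name
        self.req = False   # request wire, driven by the sender
        self.ack = False   # acknowledge wire, driven by the receiver
        self.trace = []    # transitions recorded in HSE style, e.g. "X+"

    def _set(self, wire, value):
        setattr(self, wire, value)
        suffix = "a" if wire == "ack" else ""
        self.trace.append(f"{self.name}{suffix}{'+' if value else '-'}")

    def handshake(self):
        # The channel must be idle before a new communication starts.
        assert not self.req and not self.ack
        self._set("req", True)    # phase 1: sender raises request
        self._set("ack", True)    # phase 2: receiver raises acknowledge
        self._set("req", False)   # phase 3: sender lowers request
        self._set("ack", False)   # phase 4: receiver lowers acknowledge
        return self.trace


if __name__ == "__main__":
    print("; ".join(FourPhaseChannel("X").handshake()))
```

Running the script prints the same sequence the text gives for channel X.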
[0006] This is a basic request, acknowledge system. The request $[L]$
is acknowledged ($L^a$), then acted on ($R\uparrow$), then
acknowledged again ($R^a$).
[0007] Pipelined asynchronous circuits are known as "Bundled-Data"
or "Micropipelines" and have a synchronous style data path which is
"clocked" by asynchronous self-timed control elements. These
control elements handshake between pipeline stages with a
request/acknowledge pair. The delay of the datapath logic is
estimated with a delay-element in the control, so that the request
to the next pipeline stage is not made until the data is assumed to
be valid.
[0008] The alternative style involves (quasi) delay-insensitive
circuits, for which no delay assumptions are made. In this style,
the prior art is embodied in the Caltech Asynchronous
Microprocessor patent. Datapaths are still separated from control,
as in the bundled-data case, but completion detection circuitry is
added instead of delay lines to detect when the data is valid.
Communication between processes occurs via delay-insensitive
channels with a 4 phase handshake. In between latches or buffers,
logic can be performed by unpipelined weak-condition logic
blocks.
SUMMARY
[0009] The present system teaches a way of pipelining this
handshake to allow certain processes to occur closer to
simultaneously. The disclosed system is a delay insensitive system
that uses a combination of logic and buffering to resequence
certain operations.
[0010] A new way of pipelining quasi-delay-insensitive circuits is
disclosed in which control is not explicitly separated from the
datapath. No extra buffers or latches are added between logic
blocks. Instead, the state-holding property of a buffer is combined
directly with a dual-rail domino logic computation. The tokens
travel through the pipeline as in the case of simple buffers. The
tokens also carry values which are computed upon. By not separating
control from data, and by carefully designing the circuit parts
which handle the handshakes, higher throughput is expected. The
extra handshake circuitry typically adds no more than 50% area.
[0011] The supporting circuitry which handles the handshake takes
place in precharge domino logic of a type that is common in
synchronous design. Additional circuits detect the validity of the
input and output channels (common in asynchronous design). An
acknowledge circuit acknowledges the inputs and precharges the
logic.
[0012] The circuit implementations disclosed in this patent include
components for logic computation, plus components to detect the
validity of the input and output data, and another component to
generate the acknowledges and precharge the logic. The details and
composition of these pieces generate fast quasi delay insensitive
circuits superior to the prior art.
[0013] Further enhancements of this combined buffer/logic cell
are also disclosed. These include the ability
to conditionally communicate on either inputs or outputs, so as to
implement routing functionality. Also, mechanisms for efficiently
implementing internal state variables are described.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] These and other aspects will now be described in detail with
reference to the accompanying drawings, wherein:
[0015] FIG. 1 shows a basic pipelining system and some of the
signals used in that system;
[0016] FIG. 1A shows a basic precharge type buffer in block diagram
form;
[0017] FIG. 2 shows a basic weak condition half buffer circuit;
[0018] FIG. 3 shows the transistor diagrams for the weak condition
half buffer;
[0019] FIG. 4 shows a precharge buffer, with the transistor
arrangement at the top; and the gate arrangement at the bottom;
[0020] FIG. 5 shows a split precharge circuit;
[0021] FIG. 6 shows a merge precharge circuit;
[0022] FIG. 7 shows a Reg precharge circuit.
DESCRIPTION OF THE EMBODIMENT
[0023] The present system is based on a way of pipelining the
information in the FIG. 1 drawing using precharge logic that allows
the operations to occur in parallel. Pipelining allows a system to
carry out more than one operation at the same time. Put another
way, a pipelined system does not need to wait for one action to be
completed before the other action is carried out. However, if one
attempts to reset data before using it, then the data is lost.
[0024] The present system teaches a way of dealing with this issue
by reshuffling the communication sequence, storing certain
information within the sequence, and thereby enabling more efficient
pipelining of information.
[0025] A "pipeline" is a linear sequence of buffers where the
output of one buffer connects to the input of the next buffer as
shown in FIG. 1. "Tokens" 99 are sent into the input end of the
pipeline, and flow through each buffer to the output end. The
tokens remain in first-in-first-out (FIFO) order.
[0026] For synchronous pipelines, the tokens usually advance
through one stage on each clock cycle. For asynchronous pipelines
there is no global clock to synchronize the movement. Instead, each
token moves forward down the pipeline when there is an empty cell
in front of it; otherwise, the token stalls. Effectively, the
tokens have similar behavior to cars on a freeway.
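The "cars on a freeway" behavior can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function name step, the list-of-cells representation, and the use of discrete steps are all introduced here for illustration, whereas a real asynchronous pipeline has no global step.

```python
def step(cells):
    """Advance every token whose next cell was empty at the start of the step.

    `cells` is a list ordered from input end to output end; None marks an
    empty cell. A token facing an occupied cell stalls, so FIFO order holds.
    """
    nxt = list(cells)
    for i in range(len(cells) - 1):
        if cells[i] is not None and cells[i + 1] is None:
            nxt[i + 1] = cells[i]
            nxt[i] = None
    return nxt


if __name__ == "__main__":
    pipeline = ["a", "b", None, None]
    for _ in range(3):
        print(pipeline)
        pipeline = step(pipeline)
```

In the first step token "a" stalls behind "b"; it moves only once the cell in front of it has emptied.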
[0027] The buffer capacity or "slack" of an asynchronous pipeline
is proportional to the maximum number of tokens that can be packed
into the pipeline without stalling the input end of the pipeline.
The "throughput" is the number of tokens per second which pass a
given stage in the pipeline. The "forward latency" is the time it
takes a given token to travel the length of the pipeline.
Buffer Reshuffling
[0028] A single rail buffer has the Communicating Sequential
Processes (CSP) specification *[L; R]. Using a passive protocol for L
and a lazy active protocol for R, the buffer will have the
handshaking expansion (HSE):
$$*[[L];\ L^a\uparrow;\ [\neg L];\ L^a\downarrow;\ [\neg R^a];\ R\uparrow;\ [R^a];\ R\downarrow] \qquad (1)$$
[0029] In English, the handshaking expansion for this buffer is as
follows: Wait for L to become true. Set La true. Wait for L to
become false. Set La false. Wait for Ra to become false. Set R
true. Wait for Ra to become true. Set R false. Repeat
infinitely.
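The English reading of the handshaking expansion can be stepped through in Python. This is a hedged sketch: the step list BUFFER, the function run_buffer, and the idealized environment (a sender that drives L to the opposite of La and a receiver that copies R onto Ra) are assumptions made for illustration only.

```python
# Handshaking expansion (1) as steps: ("wait", wire, value) or ("set", wire, value).
BUFFER = [
    ("wait", "L", True), ("set", "La", True),
    ("wait", "L", False), ("set", "La", False),
    ("wait", "Ra", False), ("set", "R", True),
    ("wait", "Ra", True), ("set", "R", False),
]


def run_buffer(cycles=1):
    """Run the buffer for `cycles` loop iterations; return the wires it drove."""
    w = {"L": False, "La": False, "R": False, "Ra": False}
    trace, pc = [], 0
    while len(trace) < 4 * cycles:
        # Assumed environment, reacting instantly before every buffer step:
        # the sender sets L to (not La), the receiver sets Ra to R.
        w["L"], w["Ra"] = (not w["La"]), w["R"]
        kind, wire, value = BUFFER[pc % len(BUFFER)]
        if kind == "wait":
            # With this environment every wait is already satisfied.
            assert w[wire] == value
        else:
            w[wire] = value
            trace.append(f"{wire}{'+' if value else '-'}")
        pc += 1
    return trace


if __name__ == "__main__":
    print("; ".join(run_buffer()))
```

The trace shows the full L handshake completing before R is raised, which is the sequencing that the reshufflings discussed next try to reduce.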
[0030] The present system recognizes that certain sequences are the
most interesting among these sequences. The present application
reshuffles the sequence in order to do these first.
[0031] In effect, equation 1 represents a four phase protocol. The
first two actions, $[L];\ L^a\uparrow$, represent waiting
for L to become active, and acknowledging that. The second two
actions represent L becoming inactive. The third two actions
represent waiting for R to become active. The fourth two actions
represent R inactive.
[0032] The environment will perform
$*[[\neg L^a];\ L\uparrow;\ [L^a];\ L\downarrow]$ and
$*[[R];\ R^a\uparrow;\ [\neg R];\ R^a\downarrow]$. The wait for L, or
$[L]$, is interpreted to be the arrival of an input token, and the
transition $R\uparrow$ is
the beginning of the output token. Buffers are used herein to
preserve the desired FIFO order and properties of a pipeline.
[0033] Direct implementation of this handshaking expansion can use
a state variable to distinguish the first half from the second
half. This represents a large amount of sequencing in each
cycle.
[0034] Another option is to reshuffle the waits and events to
reduce the amount of sequencing and the number of state variables,
in order to maximize the throughput and minimize the latency of the
pipeline.
[0035] The first requirement for a valid reshuffling is that the
handshaking expansion maintains the handshaking protocols on L and
R. That is, the projection on the L channel is
$*[[L];\ L^a\uparrow;\ [\neg L];\ L^a\downarrow]$ and the projection on
the R channel is $*[[\neg R^a];\ R\uparrow;\ [R^a];\ R\downarrow]$.
In addition, the number of completed $L\uparrow$
minus the number of completed $R\uparrow$ (the slack of the
buffer) should be at least zero to conserve the number of tokens in
the pipeline. Also, since this is a "buffer", it should introduce
some nonzero slack. Hence, the $L^a\uparrow$ should not
wait for the corresponding $[R^a]$, or the reshuffling will have
zero slack. This is the "constant response time" requirement.
[0036] Although these three requirements are sufficient to
guarantee a correct implementation, one more is useful. The L and R
channels may be expanded to encode data. If the reshuffling moves
the $R\uparrow$ past the corresponding $L^a\uparrow$,
then the L data could disappear before $R\uparrow$ is done, since
the L data may be withdrawn as soon as $L^a\uparrow$ occurs. The data
would then have to be saved in internal state bits, in a number
proportional to the number of bits on R or L. These additional internal
state bits are undesirable, so $L^a\uparrow$ will follow
$R\uparrow$.
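The requirements above can be checked mechanically for a fully sequential reshuffling. The following Python sketch is an assumption-laden illustration: the atom spelling ("[~Ra]", "La+", and so on), the flat-list encoding (a combined wait is listed as two adjacent atoms), and the function names are introduced here, and reshufflings with parallel branches are outside what this simple check handles.

```python
L_ATOMS = ["[L]", "La+", "[~L]", "La-"]    # required projection on channel L
R_ATOMS = ["[~Ra]", "R+", "[Ra]", "R-"]    # required projection on channel R


def project(seq, atoms):
    """Keep only the waits and events that belong to one channel."""
    return [s for s in seq if s in atoms]


def check(seq):
    """Evaluate the requirements for one loop body written from its start."""
    return {
        "L protocol": project(seq, L_ATOMS) == L_ATOMS,
        "R protocol": project(seq, R_ATOMS) == R_ATOMS,
        "slack >= 0": seq.index("[L]") < seq.index("R+"),
        "nonzero slack": seq.index("La+") < seq.index("[Ra]"),
        "no data latch": seq.index("R+") < seq.index("La+"),
    }


EQ1 = ["[L]", "La+", "[~L]", "La-", "[~Ra]", "R+", "[Ra]", "R-"]
PCHB = ["[~Ra]", "[L]", "R+", "La+", "[Ra]", "R-", "[~L]", "La-"]
WCHB = ["[~Ra]", "[L]", "R+", "La+", "[Ra]", "[~L]", "R-", "La-"]

if __name__ == "__main__":
    for name, seq in [("EQ1", EQ1), ("PCHB", PCHB), ("WCHB", WCHB)]:
        print(name, check(seq))
```

Under this encoding, the unreshuffled sequence of equation 1 satisfies the first three requirements but fails the last one, because the output is raised only after the input has been acknowledged.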
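[0037] There are nine valid reshufflings, each labeled below:
$$\begin{aligned}
\mathrm{MSFB} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ ([R^a];\ R\downarrow),\ (L^a\uparrow;\ [\neg L];\ L^a\downarrow)]\\
\mathrm{PCFB} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ ([R^a];\ R\downarrow),\ ([\neg L];\ L^a\downarrow)]\\
\mathrm{PCHB} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a];\ R\downarrow;\ [\neg L];\ L^a\downarrow]\\
\mathrm{WCHB} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ R\downarrow;\ L^a\downarrow]\\
\mathrm{B1} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ L^a\downarrow;\ R\downarrow]\\
\mathrm{B2} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [\neg L];\ L^a\downarrow;\ [R^a];\ R\downarrow]\\
\mathrm{B3} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [\neg L];\ ([R^a];\ R\downarrow),\ L^a\downarrow]\\
\mathrm{B4} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a];\ R\downarrow,\ ([\neg L];\ L^a\downarrow)]\\
\mathrm{B5} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ R\downarrow,\ L^a\downarrow]
\end{aligned}$$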
[0038] It takes two state variables to implement the MSFB
reshuffling. The PCFB, B1, B2, B3, B4, and B5 reshufflings all
require one state variable en (short for enable), with
$\mathrm{en}\downarrow$ inserted after $L^a\uparrow$ and
$\mathrm{en}\uparrow$ inserted before the end.
[0039] Selection of the best of these reshufflings can
assume that the goal is fewer transistors and faster operation. By
that metric, the present inventors believe that B3, B4, and B5 are
always inferior to PCFB. They all require the same state variable.
They produce only a subset of the execution traces of PCFB, with
additional waits that may be unnecessary. These waits add extra
transistors and slow the circuit down, compared to PCFB.
[0040] B1 and B2 are also very similar to PCFB, except they have
more sequencing. However, that extra sequencing simplifies the
production rule for en to $\neg R \rightarrow \mathrm{en}\uparrow$ instead of
$\neg R \wedge \neg L^a \rightarrow \mathrm{en}\uparrow$, as in the case
of PCFB. The inventors therefore do not believe that these will
always be inferior to PCFB. However, due to the extra sequencing
and additional transistors elsewhere, these reshufflings will
likely seldom, if ever, be better than PCFB.
[0041] The MSFB has the least possible sequencing of any of these
reshufflings. However, MSFB requires two state variables and has
more complicated production rules than PCFB. It has a possible
advantage in speed since it allows $R\downarrow$ to happen a little
earlier. If one counts transitions, however, it turns out that the next
buffer in the pipeline (if it is reshuffled similarly) will not
even raise $R^a$ until after $L^a\downarrow$ occurs, so this might
not really be an advantage at all.
[0042] That leaves three most interesting reshufflings, WCHB, PCHB,
and PCFB. The names are derived from characteristics of the circuit
implementations. WC indicates weak-condition logic. PC indicates
precharge logic. HB indicates a halfbuffer (slack 1/2), and FB
indicates a fullbuffer (slack 1).
[0043] In the halfbuffer reshufflings, only every other stage can
have a token on its output channel, since a token on that channel
blocks the previous stage from producing an output token. In
practice, each of these reshufflings has advantages for certain
applications, so they are all useful. With state variables
inserted, the three best reshufflings are:
$$\begin{aligned}
\mathrm{PCFB} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ \mathrm{en}\downarrow;\ ([R^a];\ R\downarrow),\ ([\neg L];\ L^a\downarrow);\ \mathrm{en}\uparrow]\\
\mathrm{PCHB} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a];\ R\downarrow;\ [\neg L];\ L^a\downarrow]\\
\mathrm{WCHB} &\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ R\downarrow;\ L^a\downarrow]
\end{aligned}$$
[0044] Note that the first three parts of the reshuffling are the
same.
[0045] FIG. 1A shows a box and arrow diagram of the standard
components of a PCHB or PCFB cell. The various parts of the circuit
may be thought of as logic, input completion, output completion,
and enable generation. The logic is shown as precharge dual rail
domino logic with two enabling gates, the internal enable and the
output enable coming back from the next cell in the pipeline. The
inverted logic is followed by inverters to restore it to the normal
sense. The completion circuits are standard NOR or NAND gates and
C-element trees which compute the validity of the inputs and the
validity of the outputs. Finally, the "enable" circuit generates
the input acknowledge(s) and the internal enable (en) of the cell.
The PCHB and PCFB differ only in the exact implementation of this
enable circuit.
[0046] Logic with Buffering
[0047] Suppose it is desired to implement a unit with CSP of the
form:
$$P \equiv *[A?a,\ B?b,\ \ldots;\ X!f(a,b,\ldots),\ Y!g(a,b,\ldots),\ \ldots]$$
[0048] where A?a means receive data a on channel A, and Y!g means
send data g on channel Y.
[0049] On each cycle, P receives some inputs, then sends out
functions computed from these inputs. The channels A,B,X, and Y
must encode some data. The usual way to do this is using sets of
1-of-N rails for each channel. For instance, to send two bits, one
could use two 1-of-2 rails with one acknowledge, or one 1-of-4
rails with one acknowledge.
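The two encodings mentioned for a two-bit value can be compared with a short Python sketch. The function names and the list-of-rails representation are illustrative assumptions; the acknowledge wire is not modeled.

```python
def encode_1_of_n(value, n):
    """Return n rails with exactly the rail numbered `value` raised."""
    rails = [0] * n
    rails[value] = 1
    return rails


def two_bits_as_dual_rail(v):
    """Two 1-of-2 groups: group 0 carries bit 0, group 1 carries bit 1."""
    return [encode_1_of_n(v & 1, 2), encode_1_of_n((v >> 1) & 1, 2)]


def two_bits_as_1_of_4(v):
    """One 1-of-4 group carrying both bits at once."""
    return encode_1_of_n(v, 4)


def decode(rails):
    """Return the value of a valid group, or None while the group is neutral."""
    if sum(rails) == 0:
        return None          # neutral: no rail raised
    if sum(rails) != 1:
        raise ValueError("more than one rail raised")
    return rails.index(1)


if __name__ == "__main__":
    print(two_bits_as_dual_rail(2), two_bits_as_1_of_4(2))
```

Both encodings use four data wires for two bits; the 1-of-4 form raises one wire per token, while the dual rail form raises two.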
[0050] As a notational convention, a rail is identified by the
channel name with a superscript for the 1-of-N wire which is
active, and a subscript for what group of 1-of-N wires it belongs
to (if there is more than one group in the channel). The
corresponding acknowledge will be the channel name with an "a"
superscript, or an "e" superscript if it is used in the inverted
sense.
[0051] As in the single rail buffer case, P could be implemented by
expanding each channel communication into a handshaking expansion.
Direct implementation of this handshaking expansion requires state
variables for the a, b variables and more. It could produce an
enormously big and slow circuit, so some reshuffling is desired. The
PCFB, PCHB, and WCHB reshufflings will be the most useful ones. The
correspondence between the single rail "templates" for PCFB, PCHB,
and WCHB and a process like P is as follows. The L and $L^a$
represent all the input data and acknowledges. The R and $R^a$
represent all the output data and acknowledges. $[L]$ indicates a
wait for the validity of all inputs, and $[\neg L]$ indicates a wait for
the neutrality of all inputs. $[\neg R^a]$ indicates a wait for all the
output acknowledges to be false, and $[R^a]$ indicates a wait for
all the output acknowledges to be true. $L^a\uparrow$
indicates making true all the input acknowledges in parallel, and
$L^a\downarrow$ indicates making them false. $R\uparrow$ means
that all the outputs are set to their valid states in parallel.
$R\downarrow$ means that all the outputs are set to their neutral
states. When $R\uparrow$ occurs, it means that particular
rails of the outputs are made true, depending on which rails of L
are true. This expands $R\uparrow$ into a set of exclusive
selection statements executing in parallel.
[0052] Unfortunately, the inventors have recognized that this
simple translation may introduce more sequencing than necessary. Of
the various actions which occur in parallel, like setting all the
outputs valid ($R\uparrow$), each action might need to wait
for only a portion of the preceding guard ($[\neg R^a \wedge L]$).
For instance, raising $X^0\uparrow$ or
$X^1\uparrow$ needs to check $[\neg X^a]$ but not $[\neg Y^a]$.
Similarly, the semicolons between actions ($R\uparrow;\ L^a\uparrow$)
might also over-sequence. However, this
cannot be easily fixed while still using the handshaking expansion
language. For instance, in the sequence
$X\uparrow,\ Y\uparrow;\ A^a\uparrow,\ B^a\uparrow$,
it might be necessary for $A^a\uparrow$ to wait for $[X]$
only (if $Y\uparrow$ did not use the value of A) while
$B^a\uparrow$ might need to wait for $[X \wedge Y]$.
This case could be written as
$X\uparrow,\ Y\uparrow,\ ([X];\ A^a\uparrow),\ ([X \wedge Y];\ B^a\uparrow)$.
However, this may make the written
expansion more difficult to understand. If the next actions are not
fully sequenced, it could get even worse. In the limit, the
handshaking expansion just mirrors the actual production rule set
(PRS). To skirt the issue, a handshaking expansion that is a bit
over-sequenced can be used, with the understanding that the
unnecessary sequencing will be optimized out in the compilation to
production rules.
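[0053] The PCFB version of a P with dual rail channels would
therefore be:
$$\begin{aligned}
*[\ & [\neg X^a \wedge f^0(A,B,\ldots) \rightarrow X^0\uparrow\ []\ \neg X^a \wedge f^1(A,B,\ldots) \rightarrow X^1\uparrow],\\
& [\neg Y^a \wedge g^0(A,B,\ldots) \rightarrow Y^0\uparrow\ []\ \neg Y^a \wedge g^1(A,B,\ldots) \rightarrow Y^1\uparrow],\ \ldots;\\
& A^a\uparrow,\ B^a\uparrow,\ \ldots;\ \mathrm{en}\downarrow;\\
& [X^a \rightarrow X^0\downarrow, X^1\downarrow],\ [Y^a \rightarrow Y^0\downarrow, Y^1\downarrow],\ \ldots,\\
& [\neg A^0 \wedge \neg A^1 \rightarrow A^a\downarrow],\ [\neg B^0 \wedge \neg B^1 \rightarrow B^a\downarrow],\ \ldots;\\
& \mathrm{en}\uparrow\ ]
\end{aligned}$$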
[0054] In this handshaking expansion, the $f^0$, $f^1$,
$g^0$, and $g^1$ are boolean expressions in the data rails of
the input channels. They are derived from the f and g of the CSP
and indicate the conditions for raising the various data rails of
the output channels. Note that each output channel waits only for
its own acknowledge, which is less sequenced than a direct
translation of the PCFB template would be.
[0055] In P it is seen that $A^a$ and $B^a$ tend to switch at
about the same time. They could actually be combined into a single
$AB^a$ which would wait for the conjunction of the guards on
$A^a$ and $B^a$. Combining the acknowledges tends to reduce the
area of the circuit, but might slow it down. The best decision
depends on the circumstances.
Examples of Logic with Buffering
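[0056] To put the previous section into practice, several CSP
processes with the same form as P are compiled into pipelined
circuits. The simplest CSP buffer that encodes data has a dual rail
input L, and a dual rail output R. The CSP is *[L?x; R!x]. Three
handshaking expansion reshufflings for this process are:
$$\begin{aligned}
\mathrm{WCHB\_BUF} &\equiv *[[\neg R^a \wedge L^0 \rightarrow R^0\uparrow\ []\ \neg R^a \wedge L^1 \rightarrow R^1\uparrow];\ L^a\uparrow;\ [R^a \wedge \neg L^0 \wedge \neg L^1 \rightarrow R^0\downarrow, R^1\downarrow];\ L^a\downarrow]\\
\mathrm{PCHB\_BUF} &\equiv *[[\neg R^a \wedge L^0 \rightarrow R^0\uparrow\ []\ \neg R^a \wedge L^1 \rightarrow R^1\uparrow];\ L^a\uparrow;\ [R^a \rightarrow R^0\downarrow, R^1\downarrow];\ [\neg L^0 \wedge \neg L^1 \rightarrow L^a\downarrow]]\\
\mathrm{PCFB\_BUF} &\equiv *[[\neg R^a \wedge L^0 \rightarrow R^0\uparrow\ []\ \neg R^a \wedge L^1 \rightarrow R^1\uparrow];\ L^a\uparrow;\ \mathrm{en}\downarrow;\ [R^a \rightarrow R^0\downarrow, R^1\downarrow],\ [\neg L^0 \wedge \neg L^1 \rightarrow L^a\downarrow];\ \mathrm{en}\uparrow]
\end{aligned}$$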
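[0057] After bubble-reshuffling (which suggests using the inverted
acknowledges, $L^e$ and $R^e$), the production rules for the
WCHB_BUF follow, with an overbar marking an inverted-sense node.
The circuit diagram for a WCHB is shown in FIG. 2.
$$\begin{array}{ll}
R^e \wedge L^0 \rightarrow \overline{R^0}\downarrow & \neg R^e \wedge \neg L^0 \rightarrow \overline{R^0}\uparrow\\
R^e \wedge L^1 \rightarrow \overline{R^1}\downarrow & \neg R^e \wedge \neg L^1 \rightarrow \overline{R^1}\uparrow\\
\neg\overline{R^0} \rightarrow R^0\uparrow & \overline{R^0} \rightarrow R^0\downarrow\\
\neg\overline{R^1} \rightarrow R^1\uparrow & \overline{R^1} \rightarrow R^1\downarrow\\
\neg\overline{R^0} \vee \neg\overline{R^1} \rightarrow \overline{L^e}\uparrow & \overline{R^0} \wedge \overline{R^1} \rightarrow \overline{L^e}\downarrow\\
\overline{L^e} \rightarrow L^e\downarrow & \neg\overline{L^e} \rightarrow L^e\uparrow
\end{array}$$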
[0058] The other Handshaking expansions can be implemented
similarly, but they are both somewhat bigger. For this reshuffling,
the validity and neutrality of the output data R implies the
neutrality of the input data L. Logic which has this property is
called "weak-condition". It means that the L does not need to be
checked anywhere else, besides in R. The WCHB also gets some of its
semicolons implemented for free. The semicolon between
$L^a\uparrow$ and $[R^a \wedge \neg L]$ is
implemented by the environment, as is the implicit semicolon at the
end of the loop. The WCHB has some inherent benefits. However, it
turns out that although WCHB works well for buffers, the
"weak-condition" requirement can cause problems with other
circuits.
[0059] This WCHB_BUF bubble-reshuffling has 2 transitions forward
latency and 3 transitions "backward" latency (for the path from the
right acknowledge to the left acknowledge). Combining these times
for the whole handshake yields 2+3+2+3=10 transitions per
cycle.
[0060] Extra inverters can be added to WCHB_BUF to get 10
transitions per cycle. These inverters can actually speed up the
throughput, despite the increased transition count, because
inverters have high gain. Also, a buffer without them, at 6
transitions per cycle, would invert the senses of the data and
acknowledges after every stage, which is highly inconvenient when
composing different pipelined cells. As a standard practice, most
pipelined logic cells will be done with 2 transitions of forward
latency, but more complicated circuits will have 5, 7 or even 9
transitions of backward latency, yielding transitions per cycle from
10 to 22 (even numbers only, of course).
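The cycle counts quoted above follow from adding the forward and backward latencies for both the set and the reset half of the handshake. A minimal Python sketch of that arithmetic, with a function name chosen here for illustration:

```python
def transitions_per_cycle(forward, backward):
    """Set phase plus reset phase, each costing forward + backward transitions."""
    return 2 * (forward + backward)


if __name__ == "__main__":
    # Forward latency fixed at 2; backward latencies named in the text.
    print([transitions_per_cycle(2, b) for b in (3, 5, 7, 9)])
```

Because the sum is doubled, the result is always even, which is why only even transition counts appear.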
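[0061] Next consider a fulladder, with the CSP *[A?a, B?b, C?c;
S!XOR(a,b,c), D!MAJ(a,b,c)]. The A, B, C, S and D channels are dual
rail. The acknowledges for A, B, and C are combined into a single
$F^e$. Inverted acknowledges are used from the start. The three
handshaking expansion reshufflings are:
$$\begin{aligned}
\mathrm{WCHB\_FA} \equiv *[\ & [S^e \wedge \mathrm{XOR}^0(A,B,C) \rightarrow S^0\uparrow\ []\ S^e \wedge \mathrm{XOR}^1(A,B,C) \rightarrow S^1\uparrow],\\
& [D^e \wedge \mathrm{MAJ}^0(A,B,C) \rightarrow D^0\uparrow\ []\ D^e \wedge \mathrm{MAJ}^1(A,B,C) \rightarrow D^1\uparrow];\ F^e\downarrow;\\
& [\neg S^e \wedge \neg A^0 \wedge \neg A^1 \wedge \neg C^0 \rightarrow S^0\downarrow, S^1\downarrow],\\
& [\neg D^e \wedge \neg B^0 \wedge \neg B^1 \wedge \neg C^1 \rightarrow D^0\downarrow, D^1\downarrow];\ F^e\uparrow\ ]\\
\mathrm{PCHB\_FA} \equiv *[\ & [S^e \wedge \mathrm{XOR}^0(A,B,C) \rightarrow S^0\uparrow\ []\ S^e \wedge \mathrm{XOR}^1(A,B,C) \rightarrow S^1\uparrow],\\
& [D^e \wedge \mathrm{MAJ}^0(A,B,C) \rightarrow D^0\uparrow\ []\ D^e \wedge \mathrm{MAJ}^1(A,B,C) \rightarrow D^1\uparrow];\ F^e\downarrow;\\
& [\neg S^e \rightarrow S^0\downarrow, S^1\downarrow],\ [\neg D^e \rightarrow D^0\downarrow, D^1\downarrow];\\
& [\neg A^0 \wedge \neg A^1 \wedge \neg B^0 \wedge \neg B^1 \wedge \neg C^0 \wedge \neg C^1 \rightarrow F^e\uparrow]\ ]\\
\mathrm{PCFB\_FA} \equiv *[\ & [S^e \wedge \mathrm{XOR}^0(A,B,C) \rightarrow S^0\uparrow\ []\ S^e \wedge \mathrm{XOR}^1(A,B,C) \rightarrow S^1\uparrow],\\
& [D^e \wedge \mathrm{MAJ}^0(A,B,C) \rightarrow D^0\uparrow\ []\ D^e \wedge \mathrm{MAJ}^1(A,B,C) \rightarrow D^1\uparrow];\ F^e\downarrow;\ \mathrm{en}\downarrow;\\
& [\neg S^e \rightarrow S^0\downarrow, S^1\downarrow],\ [\neg D^e \rightarrow D^0\downarrow, D^1\downarrow],\\
& [\neg A^0 \wedge \neg A^1 \wedge \neg B^0 \wedge \neg B^1 \wedge \neg C^0 \wedge \neg C^1 \rightarrow F^e\uparrow];\ \mathrm{en}\uparrow\ ]
\end{aligned}$$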
[0062] In the WCHB_FA, the validity of the outputs S and D implies
the validity of the inputs, because the S must check all of A,B,
and C. The test for the neutrality of the inputs is split between
$S\downarrow$ and $D\downarrow$. This works as long as both $S\downarrow$ and
$D\downarrow$ check at least one input's neutrality completely, and
both rails of S and D wait for the same expression. In both PCHB_FA
and PCFB_FA, the expression for the neutrality of the inputs is
obviously too large to implement as a single production rule.
Instead, the neutrality test must be decomposed into several
operators. The usual decomposition is "nor" gates for each dual
rail input, followed by a 3-input c-element. $F^e\downarrow$ must
now wait for the validity of the inputs just to acknowledge the
internal transitions. However, this means the logic for S and D no
longer needs to fully check validity of the inputs; it is not
required to be weak-condition.
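The rail functions XOR^0, XOR^1, MAJ^0, and MAJ^1 used in these expansions can be written out in Python. This is a sketch under stated assumptions: each dual rail input is modeled as a (rail 0, rail 1) pair, every rail function is built as a full sum of minterms (one possible choice, and the weak-condition style one), and all names are introduced here for illustration.

```python
from itertools import product


def rail(func, value, a, b, c):
    """True when the raised input rails select a minterm where func == value."""
    return any(
        a[i] and b[j] and c[k]
        for i, j, k in product((0, 1), repeat=3)
        if func(i, j, k) == value
    )


def xor3(i, j, k):
    return i ^ j ^ k


def maj3(i, j, k):
    return int(i + j + k >= 2)


def full_adder(a, b, c):
    """Return (S, D) as (rail 0, rail 1) pairs; neutral inputs give neutral outputs."""
    s = (rail(xor3, 0, a, b, c), rail(xor3, 1, a, b, c))
    d = (rail(maj3, 0, a, b, c), rail(maj3, 1, a, b, c))
    return s, d


if __name__ == "__main__":
    ONE, ZERO = (0, 1), (1, 0)
    print(full_adder(ONE, ONE, ZERO))
```

With full minterms, no output rail rises until all three inputs are valid. The precharge reshufflings relax this, since input validity is checked separately.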
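[0063] The bubble-reshuffled and decomposed production rules for
WCHB_FA are:
$$\begin{array}{ll}
S^e \wedge \mathrm{XOR}^0(A,B,C) \rightarrow \overline{S^0}\downarrow & \neg S^e \wedge \neg A^0 \wedge \neg A^1 \wedge \neg C^0 \rightarrow \overline{S^0}\uparrow\\
S^e \wedge \mathrm{XOR}^1(A,B,C) \rightarrow \overline{S^1}\downarrow & \neg S^e \wedge \neg A^0 \wedge \neg A^1 \wedge \neg C^0 \rightarrow \overline{S^1}\uparrow\\
D^e \wedge \mathrm{MAJ}^0(A,B,C) \rightarrow \overline{D^0}\downarrow & \neg D^e \wedge \neg B^0 \wedge \neg B^1 \wedge \neg C^1 \rightarrow \overline{D^0}\uparrow\\
D^e \wedge \mathrm{MAJ}^1(A,B,C) \rightarrow \overline{D^1}\downarrow & \neg D^e \wedge \neg B^0 \wedge \neg B^1 \wedge \neg C^1 \rightarrow \overline{D^1}\uparrow\\
\neg\overline{S^0} \rightarrow S^0\uparrow & \overline{S^0} \rightarrow S^0\downarrow\\
\neg\overline{S^1} \rightarrow S^1\uparrow & \overline{S^1} \rightarrow S^1\downarrow\\
\neg\overline{D^0} \rightarrow D^0\uparrow & \overline{D^0} \rightarrow D^0\downarrow\\
\neg\overline{D^1} \rightarrow D^1\uparrow & \overline{D^1} \rightarrow D^1\downarrow\\
(\neg\overline{S^0} \vee \neg\overline{S^1}) \wedge (\neg\overline{D^0} \vee \neg\overline{D^1}) \rightarrow \overline{F^e}\uparrow & \overline{S^0} \wedge \overline{S^1} \wedge \overline{D^0} \wedge \overline{D^1} \rightarrow \overline{F^e}\downarrow\\
\overline{F^e} \rightarrow F^e\downarrow & \neg\overline{F^e} \rightarrow F^e\uparrow
\end{array}$$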
[0064] The circuit diagram is shown in FIG. 3. The pull-up logic
for $S^0$, $S^1$, $D^0$, and $D^1$ has 4 P-type transistors in
series. This can be quite weak, due to the lower mobility of holes.
Other WCHB circuits can be even worse. Since all the inputs are
checked for neutrality before the outputs reset, a process with
three inputs and only one output would end up with 7 p-transistors
in series to reset that output.
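[0065] The present system uses the "precharge-logic" reshufflings,
PCHB_FA or PCFB_FA. These test the neutrality of the inputs in a
different place, which is more easily decomposed into manageable
gates, and does not slow the forward latency. The PCHB_FA
reshuffling has the production rules:
$$\begin{array}{ll}
A^0 \vee A^1 \rightarrow \overline{A^v}\downarrow & \neg A^0 \wedge \neg A^1 \rightarrow \overline{A^v}\uparrow\\
B^0 \vee B^1 \rightarrow \overline{B^v}\downarrow & \neg B^0 \wedge \neg B^1 \rightarrow \overline{B^v}\uparrow\\
C^0 \vee C^1 \rightarrow \overline{C^v}\downarrow & \neg C^0 \wedge \neg C^1 \rightarrow \overline{C^v}\uparrow\\
F^e \wedge S^e \wedge \mathrm{XOR}^0(A,B,C) \rightarrow \overline{S^0}\downarrow & \neg S^e \wedge \neg F^e \rightarrow \overline{S^0}\uparrow\\
F^e \wedge S^e \wedge \mathrm{XOR}^1(A,B,C) \rightarrow \overline{S^1}\downarrow & \neg S^e \wedge \neg F^e \rightarrow \overline{S^1}\uparrow\\
F^e \wedge D^e \wedge \mathrm{MAJ}^0(A,B,C) \rightarrow \overline{D^0}\downarrow & \neg D^e \wedge \neg F^e \rightarrow \overline{D^0}\uparrow\\
F^e \wedge D^e \wedge \mathrm{MAJ}^1(A,B,C) \rightarrow \overline{D^1}\downarrow & \neg D^e \wedge \neg F^e \rightarrow \overline{D^1}\uparrow\\
\neg\overline{S^0} \rightarrow S^0\uparrow & \overline{S^0} \rightarrow S^0\downarrow\\
\neg\overline{S^1} \rightarrow S^1\uparrow & \overline{S^1} \rightarrow S^1\downarrow\\
\neg\overline{D^0} \rightarrow D^0\uparrow & \overline{D^0} \rightarrow D^0\downarrow\\
\neg\overline{D^1} \rightarrow D^1\uparrow & \overline{D^1} \rightarrow D^1\downarrow\\
\neg\overline{A^v} \wedge \neg\overline{B^v} \wedge \neg\overline{C^v} \rightarrow ABC^v\uparrow & \overline{A^v} \wedge \overline{B^v} \wedge \overline{C^v} \rightarrow ABC^v\downarrow\\
\neg\overline{S^0} \vee \neg\overline{S^1} \rightarrow S^v\uparrow & \overline{S^0} \wedge \overline{S^1} \rightarrow S^v\downarrow\\
\neg\overline{D^0} \vee \neg\overline{D^1} \rightarrow D^v\uparrow & \overline{D^0} \wedge \overline{D^1} \rightarrow D^v\downarrow\\
S^v \wedge D^v \wedge ABC^v \rightarrow F^e\downarrow & \neg S^v \wedge \neg D^v \wedge \neg ABC^v \rightarrow F^e\uparrow
\end{array}$$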
[0066] This circuit can be made faster by adding two inverters to
$F^e$ and then two more to produce the $F^e$ used internally (which
is now called en). This circuit is shown in FIG. 4.
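[0067] A PCFB_FA reshuffling would have only slightly different
production rules:
$$\begin{array}{ll}
A^0 \vee A^1 \rightarrow \overline{A^v}\downarrow & \neg A^0 \wedge \neg A^1 \rightarrow \overline{A^v}\uparrow\\
B^0 \vee B^1 \rightarrow \overline{B^v}\downarrow & \neg B^0 \wedge \neg B^1 \rightarrow \overline{B^v}\uparrow\\
C^0 \vee C^1 \rightarrow \overline{C^v}\downarrow & \neg C^0 \wedge \neg C^1 \rightarrow \overline{C^v}\uparrow\\
\mathrm{en} \wedge S^e \wedge \mathrm{XOR}^0(A,B,C) \rightarrow \overline{S^0}\downarrow & \neg S^e \wedge \neg\mathrm{en} \rightarrow \overline{S^0}\uparrow\\
\mathrm{en} \wedge S^e \wedge \mathrm{XOR}^1(A,B,C) \rightarrow \overline{S^1}\downarrow & \neg S^e \wedge \neg\mathrm{en} \rightarrow \overline{S^1}\uparrow\\
\mathrm{en} \wedge D^e \wedge \mathrm{MAJ}^0(A,B,C) \rightarrow \overline{D^0}\downarrow & \neg D^e \wedge \neg\mathrm{en} \rightarrow \overline{D^0}\uparrow\\
\mathrm{en} \wedge D^e \wedge \mathrm{MAJ}^1(A,B,C) \rightarrow \overline{D^1}\downarrow & \neg D^e \wedge \neg\mathrm{en} \rightarrow \overline{D^1}\uparrow\\
\neg\overline{S^0} \rightarrow S^0\uparrow & \overline{S^0} \rightarrow S^0\downarrow\\
\neg\overline{S^1} \rightarrow S^1\uparrow & \overline{S^1} \rightarrow S^1\downarrow\\
\neg\overline{D^0} \rightarrow D^0\uparrow & \overline{D^0} \rightarrow D^0\downarrow\\
\neg\overline{D^1} \rightarrow D^1\uparrow & \overline{D^1} \rightarrow D^1\downarrow\\
\neg\overline{A^v} \wedge \neg\overline{B^v} \wedge \neg\overline{C^v} \rightarrow ABC^v\uparrow & \overline{A^v} \wedge \overline{B^v} \wedge \overline{C^v} \rightarrow ABC^v\downarrow\\
\neg\overline{S^0} \vee \neg\overline{S^1} \rightarrow S^v\uparrow & \overline{S^0} \wedge \overline{S^1} \rightarrow S^v\downarrow\\
\neg\overline{D^0} \vee \neg\overline{D^1} \rightarrow D^v\uparrow & \overline{D^0} \wedge \overline{D^1} \rightarrow D^v\downarrow\\
\mathrm{en} \wedge S^v \wedge D^v \wedge ABC^v \rightarrow F^e\downarrow & \neg\mathrm{en} \wedge \neg ABC^v \rightarrow F^e\uparrow\\
S^v \wedge D^v \rightarrow \overline{SD^v}\downarrow & \neg S^v \wedge \neg D^v \rightarrow \overline{SD^v}\uparrow\\
\neg F^e \wedge \neg\overline{SD^v} \rightarrow \overline{\mathrm{en}}\uparrow & F^e \wedge \overline{SD^v} \rightarrow \overline{\mathrm{en}}\downarrow\\
\overline{\mathrm{en}} \rightarrow \mathrm{en}\downarrow & \neg\overline{\mathrm{en}} \rightarrow \mathrm{en}\uparrow
\end{array}$$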
[0068] Comparing the three fulladder reshufflings, the WCHB_FA has
only 10 transitions per cycle, while the PCHB_FA has 14 and the
PCFB_FA has 12 (7 on the setting phase, but 5 on the resetting
phase, since the L and R handshakes reset in parallel). Although
the WCHB_FA has fewer transistors, to make it reasonably fast, the 4
P-transistors in series must be made very large. Despite the lower
transition count of the WCHB_FA, both PCHB_FA and PCFB_FA are
substantially faster in throughput and latency. PCFB_FA is the
fastest of all, since it relies heavily on n-transistors and saves
2 transitions on the reset phase. However, PCFB_FA can be larger
than PCHB_FA, due to the extra state variable en and the extra
completion $\overline{SD^v}$. If the speed of the fulladder is not
critical, the PCHB_FA seems to be the best choice.
[0069] In general, the WCHB reshuffling tends to be best only for
buffers and copies ([L?x;R!x,S!x]). The PCHB is the workhorse for
most applications; it is both small and fast. When exceptional
speed is called for, the PCFB dominates. It is also especially good
at completing 1-of-N codes where N is very large, since the
completion can be done by a circuit which looks like a tied-or
pulldown as opposed to many stages of combinational logic. The
reshufflings can actually be mixed together, with each channel in
the cell using a different one. This is most commonly useful when a
cell computes on some inputs using PCHB, but also copies some
inputs directly to outputs using WCHB. In this case, the neutrality
detection for the WCHB outputs is only one p-gate, which is no
worse than an extra en gate.
[0070] Another common class of logic circuits uses shared control
inputs to process multi-bit words. This is similar to a fulladder.
The control is just another input, which happens to have a large
fanout to many output channels. Since the outputs only sparsely
depend on the inputs (usually with a bit to bit correspondence),
the number of gates in series in the logic often does not become
prohibitive. However, if the number of bits is large (e.g., 32), the
completion of all the inputs and outputs will take many stages in a
c-element tree, which adds to the cycle time, as does the load on
the broadcast of the control data. To make high throughput datapath
logic, it can be better to break the datapath up into manageable
chunks (perhaps 4 or 8 bits), and send buffered copies of the
control tokens to each chunk. This cuts down the cycle time, but
does not change the high-level meaning, except to introduce extra
slack.
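The effect of chunking on the completion tree can be estimated with a small calculation. The sketch below counts the stages of a tree that combines one validity signal per bit; the function name `tree_depth` and the default fan-in of 2 per c-element are assumptions for illustration, not figures from this disclosure.

```python
def tree_depth(n_inputs, fan_in=2):
    """Number of c-element stages needed to combine n_inputs completion signals.

    fan_in is the assumed number of inputs per c-element.
    """
    depth = 0
    while n_inputs > 1:
        n_inputs = -(-n_inputs // fan_in)  # ceiling division
        depth += 1
    return depth
```

Under these assumptions a 32-bit word needs 5 stages of 2-input c-elements, while an 8-bit chunk needs 3 and a 4-bit chunk needs 2, which is the kind of saving the chunked datapath is after.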
[0071] Conditionally Producing Outputs
[0072] Although the cells discussed in the previous section can be
shown to be Turing complete (they can be turned into a von Neumann
state machine, with some outputs fed back through buffers to store
state), they are clearly inefficient for many applications. A very
useful extension is the ability to skip a communication on a
channel on a given cycle. This turns out to require only a few
minor modifications to the scheme as presented so far.
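[0073] Suppose the process completes at most one communication per
cycle on the outputs, but always receives all its inputs. The CSP
would be:
$$
\begin{aligned}
P1 \equiv *[\;&A?a,\ B?b,\ \ldots; \\
&[\,\mathit{do\_x}(a,b,\ldots) \rightarrow X!f(a,b,\ldots) \;[]\; \neg\mathit{do\_x}(a,b,\ldots) \rightarrow \mathrm{skip}\,], \\
&[\,\mathit{do\_y}(a,b,\ldots) \rightarrow Y!g(a,b,\ldots) \;[]\; \neg\mathit{do\_y}(a,b,\ldots) \rightarrow \mathrm{skip}\,],\ \ldots\;]
\end{aligned}
$$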
[0074] As above, this can reshuffle like WCHB, PCHB, or PCFB. The
selection statements for the outputs expand into exclusive
selections for setting the output rails, plus a new case for
producing no output at all on the channel. A dual-rail version of
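P1 with a PCFB reshuffling is:
$$
\begin{aligned}
*[\;&[\,\mathit{do\_x}(A,B,\ldots) \land \neg X^a \land f^0(A,B,\ldots) \rightarrow X^0\uparrow \\
&\;[]\; \mathit{do\_x}(A,B,\ldots) \land \neg X^a \land f^1(A,B,\ldots) \rightarrow X^1\uparrow \\
&\;[]\; \neg\mathit{do\_x}(A,B,\ldots) \rightarrow \mathrm{skip}\,], \\
&[\,\mathit{do\_y}(A,B,\ldots) \land \neg Y^a \land g^0(A,B,\ldots) \rightarrow Y^0\uparrow \\
&\;[]\; \mathit{do\_y}(A,B,\ldots) \land \neg Y^a \land g^1(A,B,\ldots) \rightarrow Y^1\uparrow \\
&\;[]\; \neg\mathit{do\_y}(A,B,\ldots) \rightarrow \mathrm{skip}\,],\ \ldots; \\
&A^a\uparrow,\ B^a\uparrow,\ \ldots;\ \mathit{en}\downarrow; \\
&[\,X^a \lor \neg X^0 \land \neg X^1 \rightarrow X^0\downarrow, X^1\downarrow\,],\ [\,Y^a \lor \neg Y^0 \land \neg Y^1 \rightarrow Y^0\downarrow, Y^1\downarrow\,],\ \ldots, \\
&[\,\neg A^0 \land \neg A^1 \rightarrow A^a\downarrow\,],\ [\,\neg B^0 \land \neg B^1 \rightarrow B^a\downarrow\,],\ \ldots;\ \mathit{en}\uparrow\;]
\end{aligned}
$$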
[0075] Note that the resetting of the output channels X and Y must
accommodate the cases when those channels were not used. Since they
produce no outputs, they must not wait for the acknowledges. Adding
in the $\neg X^0 \land \neg X^1$ terms (and likewise for Y) will
allow the wait to be completed vacuously. This does not actually
generate any production rules. This Handshaking expansion can be
compiled into production rules, but there are some tricky details.
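The weakened reset wait for a conditional output can be written as a one-line predicate. This is a minimal sketch assuming an active-high acknowledge on X and dual-rail data; the name `reset_guard` is hypothetical and used only for illustration.

```python
def reset_guard(x_ack, x0, x1):
    """Guard for the reset of output channel X.

    True once X has been acknowledged, or vacuously when neither rail
    was raised this cycle (the skip case), so no acknowledge is awaited.
    """
    return bool(x_ack) or (not x0 and not x1)
```

If the cell skipped X on this cycle, `reset_guard(0, 0, 0)` is already true. If it sent a value, the guard stays false until the acknowledge arrives.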
[0076] An interesting choice arises from the use of the skip. A
skip causes no visible change in state, so the next statements in
sequence ($A^a\uparrow, B^a\uparrow, \ldots$) must actually
look directly at the boolean expansion for do_x(A,B, . . . ) and
do_y(A,B, . . . ) in addition to the output rails $X^0$, $X^1$,
$Y^0$, $Y^1$.
[0077] The completion condition for setting the outputs would be
$\mathit{en} \land (X^0 \lor X^1 \lor \neg\mathit{do\_x}(A,B,\ldots)) \land (Y^0 \lor Y^1 \lor \neg\mathit{do\_y}(A,B,\ldots))$.
However, this expansion cannot
be used directly in the guards for $A^a\uparrow$ and
$B^a\uparrow$, since if one fired first, it could
destabilize the other. (This would work if $A^a$ and $B^a$ were
combined into one acknowledge.)
[0078] Another approach is to introduce a new variable to represent
the do_x and do_y cases. Suppose the skip's are replaced with
$\mathit{no\_x}\uparrow$ and $\mathit{no\_y}\uparrow$, respectively, and
$\mathit{no\_x}\downarrow$ is added to $X^0\downarrow, X^1\downarrow$ and
$\mathit{no\_y}\downarrow$ to $Y^0\downarrow, Y^1\downarrow$. Now the
production rules are simply produced as if X and Y were 1-of-3
channels instead of 1-of-2, except the extra rail doesn't check the
right acknowledge, or, in fact, leave the cell.
[0079] Finally, there are many cases where some expansion of the
outputs is sufficient to produce the output completion expansion
without reference to the inputs. For instance, if one input is used
to decide if a certain output is used, but is also copied to
another output, the copied output could be used to check the
completion of the optional output. Similarly, if two output
channels are used exclusively, such that one or the other will be
used each cycle, the completion for both is just the or of each
one's completion.
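[0080] To put this discussion into practice, a split is
implemented, a fundamental routing process which uses one control
input to route a data input to one of two output channels. The
simple one-bit CSP is $*[S?s, L?x;\ [\neg s \rightarrow A!x \;[]\; s \rightarrow B!x]]$.
The PCHB reshuffling is:
$$
\begin{aligned}
\mathrm{PCHB\_SPLIT} \equiv *[\;&[\,A^e \land S^0 \land L^0 \rightarrow A^0\uparrow \;[]\; A^e \land S^0 \land L^1 \rightarrow A^1\uparrow \;[]\; S^1 \rightarrow \mathrm{skip}\,], \\
&[\,B^e \land S^1 \land L^0 \rightarrow B^0\uparrow \;[]\; B^e \land S^1 \land L^1 \rightarrow B^1\uparrow \;[]\; S^0 \rightarrow \mathrm{skip}\,]; \\
&SL^e\downarrow; \\
&[\,\neg A^e \lor \neg A^0 \land \neg A^1 \rightarrow A^0\downarrow, A^1\downarrow\,],\ [\,\neg B^e \lor \neg B^0 \land \neg B^1 \rightarrow B^0\downarrow, B^1\downarrow\,]; \\
&SL^e\uparrow\;]
\end{aligned}
$$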
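[0081] The first two selection statements are known to be finished
when $A^0 \lor A^1 \lor B^0 \lor B^1$. Hence, this will be used as
the guard for $SL^e\downarrow$. The bubble-reshuffled production
rules are:
$$
\begin{aligned}
S^0 \lor S^1 &\rightarrow \overline{S^v}\downarrow & \neg S^0 \land \neg S^1 &\rightarrow \overline{S^v}\uparrow \\
L^0 \lor L^1 &\rightarrow \overline{L^v}\downarrow & \neg L^0 \land \neg L^1 &\rightarrow \overline{L^v}\uparrow \\
SL^e \land A^e \land S^0 \land L^0 &\rightarrow \overline{A^0}\downarrow & \neg SL^e \land \neg A^e &\rightarrow \overline{A^0}\uparrow \\
SL^e \land A^e \land S^0 \land L^1 &\rightarrow \overline{A^1}\downarrow & \neg SL^e \land \neg A^e &\rightarrow \overline{A^1}\uparrow \\
SL^e \land B^e \land S^1 \land L^0 &\rightarrow \overline{B^0}\downarrow & \neg SL^e \land \neg B^e &\rightarrow \overline{B^0}\uparrow \\
SL^e \land B^e \land S^1 \land L^1 &\rightarrow \overline{B^1}\downarrow & \neg SL^e \land \neg B^e &\rightarrow \overline{B^1}\uparrow \\
\neg\overline{A^0} &\rightarrow A^0\uparrow & \overline{A^0} &\rightarrow A^0\downarrow \\
\neg\overline{A^1} &\rightarrow A^1\uparrow & \overline{A^1} &\rightarrow A^1\downarrow \\
\neg\overline{B^0} &\rightarrow B^0\uparrow & \overline{B^0} &\rightarrow B^0\downarrow \\
\neg\overline{B^1} &\rightarrow B^1\uparrow & \overline{B^1} &\rightarrow B^1\downarrow \\
\neg\overline{S^v} \land \neg\overline{L^v} &\rightarrow SL^v\uparrow & \overline{S^v} \lor \overline{L^v} &\rightarrow SL^v\downarrow \\
\neg\overline{A^0} \lor \neg\overline{A^1} \lor \neg\overline{B^0} \lor \neg\overline{B^1} &\rightarrow AB^v\uparrow & \overline{A^0} \land \overline{A^1} \land \overline{B^0} \land \overline{B^1} &\rightarrow AB^v\downarrow \\
AB^v \land SL^v &\rightarrow SL^e\downarrow & \neg AB^v \land \neg SL^v &\rightarrow SL^e\uparrow
\end{aligned}
$$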
[0082] The circuit is shown in FIG. 5.
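At the token level, the split can be modeled in a few lines. The sketch below uses queues in place of channels and assumes that a control value of 0 routes to A and 1 routes to B. The function name `split` and the use of `collections.deque` are illustrative choices, and the model ignores the handshake phases entirely.

```python
from collections import deque


def split(S, L, A, B):
    """Consume one control token from S and one data token from L.

    Assumed polarity: control 0 routes the data token to A, control 1 to B.
    Exactly one of the two outputs receives a token per cycle.
    """
    s = S.popleft()
    x = L.popleft()
    (B if s else A).append(x)


# Usage: route three data tokens according to three control tokens.
S, L, A, B = deque([0, 1, 0]), deque(["p", "q", "r"]), deque(), deque()
while S:
    split(S, L, A, B)
```

After the loop, A holds the tokens sent with control 0 and B the one sent with control 1.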
[0083] Conditionally Reading Inputs
[0084] It is also highly useful to be able to conditionally read
inputs. Normally the condition is read in on a separate
unconditional channel, but in general it could be any expansion of
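the rails of the inputs. A CSP template for this type of cell would
be:
$$
\begin{aligned}
P2 \equiv *[\;&[\,\mathit{do\_a}(\overline{A},\overline{B}) \rightarrow A?a \;[]\; \mathit{no\_a}(\overline{A},\overline{B}) \rightarrow a := \text{``unused''}\,], \\
&[\,\mathit{do\_b}(\overline{A},\overline{B}) \rightarrow B?b \;[]\; \mathit{no\_b}(\overline{A},\overline{B}) \rightarrow b := \text{``unused''}\,],\ \ldots; \\
&X!f(a,b,\ldots),\ Y!g(a,b,\ldots),\ \ldots\;]
\end{aligned}
$$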
[0085] The $\overline{A}$ in this context refers to a probe of the
value of A, not just its availability. This is not standard in CSP,
but is a useful extension which is easily implemented in
Handshaking expansion. Basically, the booleans for do_a, do_b,
no_a, and no_b may inspect the rails of A and B in order to decide
whether to actually receive from the channels. The selection
statements will suspend until either do_a or no_a is true (and
likewise do_b or no_b). These expansions are required to be stable;
that is, as additional inputs show up, they may not become false as
a result.
[0086] For the Handshaking expansion, instead of assigning "unused"
to an internal variable, the f and g expansions examine the inputs
directly. The results of the do_a/no_a and do_b/no_b expansions
must be latched into internal variables u and v, so that A and B
may be acknowledged in parallel without destabilizing the guards of
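do_a and the like. The PCFB version of the Handshaking expansion
is:
$$
\begin{aligned}
&u^0\downarrow, u^1\downarrow, v^0\downarrow, v^1\downarrow, \ldots; \\
*[\;&[\,f^0(A,B,\ldots) \rightarrow X^0\uparrow \;[]\; f^1(A,B,\ldots) \rightarrow X^1\uparrow\,], \\
&[\,g^0(A,B,\ldots) \rightarrow Y^0\uparrow \;[]\; g^1(A,B,\ldots) \rightarrow Y^1\uparrow\,],\ \ldots, \\
&[\,\mathit{do\_a}(A,B) \rightarrow u^1\uparrow \;[]\; \mathit{no\_a}(A,B) \rightarrow u^0\uparrow\,], \\
&[\,\mathit{do\_b}(A,B) \rightarrow v^1\uparrow \;[]\; \mathit{no\_b}(A,B) \rightarrow v^0\uparrow\,],\ \ldots; \\
&[\,u^1 \rightarrow A^a\uparrow \;[]\; u^0 \rightarrow \mathrm{skip}\,],\ [\,v^1 \rightarrow B^a\uparrow \;[]\; v^0 \rightarrow \mathrm{skip}\,],\ \ldots; \\
&\mathit{en}\downarrow; \\
&[\,X^a \rightarrow X^0\downarrow, X^1\downarrow\,],\ [\,Y^a \rightarrow Y^0\downarrow, Y^1\downarrow\,],\ \ldots, \\
&(u^0\downarrow, u^1\downarrow;\ [\,\neg A^0 \land \neg A^1 \lor \neg A^a \rightarrow A^a\downarrow\,]), \\
&(v^0\downarrow, v^1\downarrow;\ [\,\neg B^0 \land \neg B^1 \lor \neg B^a \rightarrow B^a\downarrow\,]),\ \ldots; \\
&\mathit{en}\uparrow\;]
\end{aligned}
$$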
[0087] Similarly to the conditional output Handshaking expansion,
the guards for $A^a\downarrow$ and $B^a\downarrow$ are weakened to
allow the vacuous case. The skip again can pose a problem, since it
makes no change in the state. However, with the $u^0$ and $v^0$
variables it is possible to infer the skip and generate the correct
guard for en. On the reset phase, the u and v must return to the
neutral state. There are several places to put this, but the
symmetric placement which sequences them with the $A^a\downarrow$
and $B^a\downarrow$ simplifies the PRS.
[0088] In many cases, this general template can be greatly
simplified. For instance, if a set of unconditional inputs
completely controls the conditions for reading the others, these
can be thought of as the "control" inputs. If raising the
acknowledges of the various inputs is sequenced so that the
conditional ones precede the control ones, then the variables u and
v may be eliminated without causing stability problems. Also in
some cases the u and v may be substituted with an expansion of the
outputs, instead of stored separately.
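[0089] As a concrete example, the circuit for the merge process
reverses the split of the last section by conditionally reading one
of two data input channels (A and B) to the single output channel X
based on a control input M. The CSP is
$*[M?m;\ [\neg m \rightarrow A?x \;[]\; m \rightarrow B?x];\ X!x]$.
Here the simplification of acknowledging the
data inputs A and B before the control input M is used. The PCHB
reshuffling is:
$$
\begin{aligned}
\mathrm{PCHB\_MERGE} \equiv *[\;&[\,X^e \land (M^0 \land A^0 \lor M^1 \land B^0) \rightarrow X^0\uparrow \\
&\;[]\; X^e \land (M^0 \land A^1 \lor M^1 \land B^1) \rightarrow X^1\uparrow\,], \\
&[\,M^0 \rightarrow A^e\downarrow \;[]\; M^1 \rightarrow B^e\downarrow\,]; \\
&M^e\downarrow; \\
&[\,\neg X^e \rightarrow X^0\downarrow, X^1\downarrow\,], \\
&[\,\neg A^0 \land \neg A^1 \land \neg M^0 \lor A^e \rightarrow A^e\uparrow\,], \\
&[\,\neg B^0 \land \neg B^1 \land \neg M^1 \lor B^e \rightarrow B^e\uparrow\,]; \\
&M^e\uparrow\;]
\end{aligned}
$$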
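[0090] A subtle simplification used here is to make
$A^e\uparrow$ and $B^e\uparrow$ check the
corresponding $\neg M^0$ and $\neg M^1$. This reduces the
guard condition for $M^e\uparrow$ and makes the reset
phase symmetric with the set phase. Some decomposition is done to
add $A^v$, $B^v$ and $X^v$ to do validity and neutrality
checks. After bubble-reshuffling, the PRS is:
$$
\begin{aligned}
A^0 \lor A^1 &\rightarrow \overline{A^v}\downarrow & \neg A^0 \land \neg A^1 &\rightarrow \overline{A^v}\uparrow \\
B^0 \lor B^1 &\rightarrow \overline{B^v}\downarrow & \neg B^0 \land \neg B^1 &\rightarrow \overline{B^v}\uparrow \\
\neg\overline{A^v} &\rightarrow A^v\uparrow & \overline{A^v} &\rightarrow A^v\downarrow \\
\neg\overline{B^v} &\rightarrow B^v\uparrow & \overline{B^v} &\rightarrow B^v\downarrow \\
M^e \land X^e \land (M^0 \land A^0 \lor M^1 \land B^0) &\rightarrow \overline{X^0}\downarrow & \neg M^e \land \neg X^e &\rightarrow \overline{X^0}\uparrow \\
M^e \land X^e \land (M^0 \land A^1 \lor M^1 \land B^1) &\rightarrow \overline{X^1}\downarrow & \neg M^e \land \neg X^e &\rightarrow \overline{X^1}\uparrow \\
\neg\overline{X^0} &\rightarrow X^0\uparrow & \overline{X^0} &\rightarrow X^0\downarrow \\
\neg\overline{X^1} &\rightarrow X^1\uparrow & \overline{X^1} &\rightarrow X^1\downarrow \\
\neg\overline{X^0} \lor \neg\overline{X^1} &\rightarrow X^v\uparrow & \overline{X^0} \land \overline{X^1} &\rightarrow X^v\downarrow \\
A^v \land M^0 \land X^v &\rightarrow A^e\downarrow & \neg A^v \land \neg M^0 \land \neg X^v &\rightarrow A^e\uparrow \\
B^v \land M^1 \land X^v &\rightarrow B^e\downarrow & \neg B^v \land \neg M^1 \land \neg X^v &\rightarrow B^e\uparrow \\
\neg A^e \lor \neg B^e &\rightarrow \overline{M^e}\uparrow & A^e \land B^e &\rightarrow \overline{M^e}\downarrow \\
\overline{M^e} &\rightarrow M^e\downarrow & \neg\overline{M^e} &\rightarrow M^e\uparrow
\end{aligned}
$$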
[0091] As usual for PCHB reshuffling, most of the work is done in a
large network of n-transistors. The circuit is shown in FIG. 6.
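A token-level model of the merge mirrors the split. The sketch below again uses queues for channels and assumes that a control value of 0 selects input A and 1 selects input B; the function name `merge` is illustrative, and the unselected input is deliberately left untouched, which is the conditional-read behavior described above.

```python
from collections import deque


def merge(M, A, B, X):
    """Consume one control token from M and one data token from the selected input.

    Assumed polarity: control 0 reads from A, control 1 reads from B.
    The input that is not selected keeps its pending tokens.
    """
    m = M.popleft()
    x = (B if m else A).popleft()
    X.append(x)


# Usage: interleave two input streams under control of M.
M = deque([0, 1, 1, 0])
A, B, X = deque(["a0", "a1"]), deque(["b0", "b1"]), deque()
while M:
    merge(M, A, B, X)
```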
[0092] Internal State
[0093] Another extension to this design style is the ability to
store internal state from one cycle to the next. A CSP template for
a state holding process with state variable s is:
$$
\begin{aligned}
P3 \equiv\ &s := \mathit{initial\_s}; \\
*[\;&A?a,\ B?b,\ \ldots; \\
&X!f(s,a,b,\ldots),\ Y!g(s,a,b,\ldots),\ \ldots; \\
&s := h(s,a,b,\ldots)\;]
\end{aligned}
$$
[0094] This can be implemented in a variety of ways. The simplest,
which requires no new circuits, is to feed an output of a normal
pipelined cell back around to an input, via several buffer stages.
One of these feedback buffers is initialized containing a token
with the value of the initial state. Enough buffers must be used to
avoid deadlock, and even more are needed to maximize the
throughput. Therefore, this solution can be quite large. For
control circuitry, where area is less of an issue, this is often
adequate. As an added benefit, the feed forward portion of the
state machine can be implemented as several sequential stages of
pipelined logic, which correspondingly reduces the number of
feedback buffers necessary and allows far more complicated
functions.
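The feedback-buffer approach can be illustrated with a token-level model in which the state simply travels around a queue that starts out holding the initial token. The function name `run_state_machine` and the single-input form are simplifications assumed for this sketch; the number of buffers needed for throughput is not modeled.

```python
from collections import deque


def run_state_machine(inputs, f, h, initial_s):
    """Model a pipelined cell whose state output is fed back through buffers.

    The feedback queue is initialized with one token holding initial_s.
    Each cycle consumes the state token, emits f(s, a), and sends the
    next state h(s, a) back around the loop.
    """
    feedback = deque([initial_s])
    outputs = []
    for a in inputs:
        s = feedback.popleft()
        outputs.append(f(s, a))
        feedback.append(h(s, a))
    return outputs


# Usage: a running sum, where the output and the next state are both s + a.
totals = run_state_machine([1, 2, 3, 4], lambda s, a: s + a, lambda s, a: s + a, 0)
```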
[0095] Aside from using feedback buffers, there are three main
approaches to retaining state, of increasing generality and
complexity. First, pipelining channels by themselves store state.
Usually, these values move forward down the pipeline, passing each
stage only once. However, if a stage uses but does not acknowledge
its input, the input value will still be there on the next cycle.
Essentially, the token is stopped and sampled many times. In CSP,
this can be expressed with the probe of the value of the channel. A
conditional input type of circuit is used, which uses an input to
produce outputs without acknowledging that input. This technique
can be used for certain problems. For example, a loop unroller
could take an instruction on the input channel, and produce many
copies of it on an output channel based on a control input. Of
course, this type of state variable can never be set, only read one
or more times from an input.
[0096] If the state variable is exclusively set or used in a cycle,
a simple modification of the standard pipelined reshuffling will
suffice. The state variable s is assigned to a dual-rail value at
the same time the outputs are produced. On the reset phase, it
remains stable. Unlike the usual return-to-zero variables, s will
only briefly transition through neutrality between valid states. If
s doesn't change, it does not go through a neutral state at all.
The CSP for this behavior is expressed just like P3, except the
semicolon before the assignment to s is replaced with a comma. This
is made possible by the assumption that s only changes when the
outputs X and Y do not depend on it; this avoids any stability
problems.
[0097] The only tricky thing about deriving the Handshaking
expansion for this is the assignment statement. Basically, the
assignment is done by lowering the opposite rail first, then
raising the desired rail. This guarantees that the variable passes
through neutral when it changes, and also bubble-reshuffles nicely.
The completion detection of this assignment is basically equivalent
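to checking that the value of s corresponds to the inputs to s. So
s:=x becomes
$[x^0 \rightarrow s^1\downarrow; s^0\uparrow \;[]\; x^1 \rightarrow s^0\downarrow; s^1\uparrow];\ [x^0 \land s^0 \lor x^1 \land s^1]$.
The PCFB version of the Handshaking expansion for this
type of state holding process is:
$$
\begin{aligned}
*[\;&[\,\neg X^a \land f^0(s,A,B,\ldots) \rightarrow X^0\uparrow \;[]\; \neg X^a \land f^1(s,A,B,\ldots) \rightarrow X^1\uparrow\,], \\
&[\,\neg Y^a \land g^0(s,A,B,\ldots) \rightarrow Y^0\uparrow \;[]\; \neg Y^a \land g^1(s,A,B,\ldots) \rightarrow Y^1\uparrow\,], \\
&[\,h^0(A,B,\ldots) \rightarrow s^1\downarrow; s^0\uparrow \;[]\; h^1(A,B,\ldots) \rightarrow s^0\downarrow; s^1\uparrow\,],\ \ldots; \\
&A^a\uparrow,\ B^a\uparrow,\ \ldots;\ \mathit{en}\downarrow; \\
&[\,X^a \rightarrow X^0\downarrow, X^1\downarrow\,],\ [\,Y^a \rightarrow Y^0\downarrow, Y^1\downarrow\,],\ \ldots, \\
&[\,\neg A^0 \land \neg A^1 \rightarrow A^a\downarrow\,],\ [\,\neg B^0 \land \neg B^1 \rightarrow B^a\downarrow\,],\ \ldots;\ \mathit{en}\uparrow\;]
\end{aligned}
$$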
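The assignment sequence for a dual-rail state bit (lower the opposite rail, then raise the desired rail) can be traced with a short model. This sketch counts only the transitions on the rails of s; any separate completion signal would add a further transition on top of what is listed. The function name `assign` and the trace labels are illustrative assumptions.

```python
def assign(s, x):
    """Return the ordered list of rail transitions for the assignment s := x.

    s and x are bits (0 or 1). The opposite rail is lowered first and the
    desired rail raised second, so s passes through neutral when it changes.
    """
    rails = {"s0": int(s == 0), "s1": int(s == 1)}
    want, other = ("s0", "s1") if x == 0 else ("s1", "s0")
    trace = []
    if rails[other]:
        rails[other] = 0
        trace.append(other + "-")  # lower the opposite rail
    if not rails[want]:
        rails[want] = 1
        trace.append(want + "+")   # then raise the desired rail
    return trace
```

When the value changes, the trace has two entries. When it does not change, the trace is empty and s never leaves its valid state.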
[0098] It is often desirable to decompose the completion detection
of the state variable into a 4 phase completion variable s.sup.v
which detects the completion of the assignment on the set phase and
is cleared on the reset phase. This makes it easier to have
multiple state variables. One thing to note is that the assignment
sequence and completion has 3 transitions if it changes state, and
therefore often takes more transitions than a typical output
channel. However, on the reset phase or if the state is unchanged,
this only takes 1 transition. Another caveat is that the state
variable shown here works best for only dual rail 1 bit state
variables.
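[0099] As an example of this type of state variable, consider the
"register" process
$x := 0;\ *[C?c;\ [\neg c \rightarrow R!x \;[]\; c \rightarrow L?x]]$. This
uses a control channel C to decide whether to read or write the
state bit x via the input and output channels L and R. Obviously,
the state bit is exclusively used or set on any given cycle. This
process also conditionally communicates on L and R.
[0100] The PCHB version of the Handshaking expansion is:
$$
\begin{aligned}
\mathrm{PCHB\_REG} \equiv\ &x^0\uparrow, x^1\downarrow; \\
*[\;&[\,C^1 \land R^e \land x^0 \rightarrow R^0\uparrow \;[]\; C^1 \land R^e \land x^1 \rightarrow R^1\uparrow \\
&\;[]\; C^0 \land L^0 \rightarrow x^1\downarrow; x^0\uparrow \;[]\; C^0 \land L^1 \rightarrow x^0\downarrow; x^1\uparrow\,]; \\
&[\,C^0 \rightarrow L^e\downarrow \;[]\; C^1 \rightarrow \mathrm{skip}\,]; \\
&C^e\downarrow; \\
&[\,\neg R^e \lor \neg R^0 \land \neg R^1 \rightarrow R^0\downarrow, R^1\downarrow\,]; \\
&[\,\neg L^0 \land \neg L^1 \rightarrow L^e\uparrow\,]; \\
&[\,\neg C^0 \land \neg C^1 \rightarrow C^e\uparrow\,]\;]
\end{aligned}
$$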
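[0101] The PRS has a few tricky features. Due to the exclusive
pattern of the communications the rules for $C^e$ can be
simplified. The decomposed and bubble reshuffled PRS follows. The
circuit is shown in FIG. 7.
$$
\begin{aligned}
C^e \land C^0 \land R^e \land x^0 &\rightarrow \overline{R^0}\downarrow & \neg C^e \land \neg R^e &\rightarrow \overline{R^0}\uparrow \\
C^e \land C^0 \land R^e \land x^1 &\rightarrow \overline{R^1}\downarrow & \neg C^e \land \neg R^e &\rightarrow \overline{R^1}\uparrow \\
\neg\overline{R^0} &\rightarrow R^0\uparrow & \overline{R^0} &\rightarrow R^0\downarrow \\
\neg\overline{R^1} &\rightarrow R^1\uparrow & \overline{R^1} &\rightarrow R^1\downarrow \\
\neg\overline{R^0} \lor \neg\overline{R^1} &\rightarrow R^v\uparrow & \overline{R^0} \land \overline{R^1} &\rightarrow R^v\downarrow \\
R^v &\rightarrow \overline{R^v}\downarrow & \neg R^v &\rightarrow \overline{R^v}\uparrow \\
C^e \land C^1 \land L^0 &\rightarrow x^1\downarrow & & \\
C^e \land C^1 \land L^1 &\rightarrow x^0\downarrow & & \\
\neg L^0 \land \neg x^0 &\rightarrow x^1\uparrow & & \\
\neg L^1 \land \neg x^1 &\rightarrow x^0\uparrow & & \\
C^e \land (x^0 \land L^0 \lor x^1 \land L^1) &\rightarrow L^e\downarrow & \neg C^e \land \neg L^0 \land \neg L^1 &\rightarrow L^e\uparrow \\
\neg L^e \lor \neg\overline{R^v} &\rightarrow \overline{C^e}\uparrow & L^e \land \overline{R^v} &\rightarrow \overline{C^e}\downarrow \\
\overline{C^e} &\rightarrow C^e\downarrow & \neg\overline{C^e} \land \neg C^0 \land \neg C^1 &\rightarrow C^e\uparrow
\end{aligned}
$$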
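The behavior of the register at the token level is easy to sketch. In the model below the control polarity is an assumption chosen for illustration: a control token of 1 writes the next token from L into x, and a control token of 0 sends the current x on R. The function name `register` and the queue-based channels are likewise illustrative.

```python
from collections import deque


def register(C, L, R, x=0):
    """Process every control token in C and return the final state bit.

    Assumed polarity: control 1 writes x from L, control 0 reads x out on R.
    On any given cycle the state is either used or set, never both.
    """
    while C:
        c = C.popleft()
        if c:
            x = L.popleft()  # write cycle: only L communicates
        else:
            R.append(x)      # read cycle: only R communicates
    return x


# Usage: read the initial 0, write 1, read twice, write 0, read once.
C, L, R = deque([0, 1, 0, 0, 1, 0]), deque([1, 0]), deque()
final_x = register(C, L, R)
```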
[0102] The most general form of state holding cell is one where the
state variable can be used and set in any cycle. In order to do
this, it is necessary to have separate storage locations for the
new state and the old state. This may be done by introducing an
extra state variable t which holds the new state until s is used.
The CSP for this is:
$$
\begin{aligned}
P4 \equiv\ &s := 0; \\
*[\;&A?a,\ B?b,\ \ldots; \\
&X!f(s,a,b,\ldots),\ Y!g(s,a,b,\ldots),\ t := h(s,a,b,\ldots),\ \ldots; \\
&s := t\;]
\end{aligned}
$$
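[0103] When this is converted into a Handshaking expansion, there
are several choices for where to put the assignment s:=t. It works
best to do this assignment on the reset phase of the channel
handshakes. After the assignment s:=t, t returns to neutral just
like a channel. The PCFB version of this type of cell is:
$$
\begin{aligned}
&s := 0; \\
*[\;&[\,\neg X^a \land f^0(s,A,B,\ldots) \rightarrow X^0\uparrow \;[]\; \neg X^a \land f^1(s,A,B,\ldots) \rightarrow X^1\uparrow\,], \\
&[\,\neg Y^a \land g^0(s,A,B,\ldots) \rightarrow Y^0\uparrow \;[]\; \neg Y^a \land g^1(s,A,B,\ldots) \rightarrow Y^1\uparrow\,], \\
&[\,h^0(s,A,B,\ldots) \rightarrow t^0\uparrow \;[]\; h^1(s,A,B,\ldots) \rightarrow t^1\uparrow\,],\ \ldots; \\
&A^a\uparrow,\ B^a\uparrow,\ \ldots;\ \mathit{en}\downarrow; \\
&[\,X^a \rightarrow X^0\downarrow, X^1\downarrow\,],\ [\,Y^a \rightarrow Y^0\downarrow, Y^1\downarrow\,],\ \ldots, \\
&[\,t^0 \rightarrow s^1\downarrow; s^0\uparrow; t^0\downarrow \;[]\; t^1 \rightarrow s^0\downarrow; s^1\uparrow; t^1\downarrow\,],\ \ldots, \\
&[\,\neg A^0 \land \neg A^1 \rightarrow A^a\downarrow\,],\ [\,\neg B^0 \land \neg B^1 \rightarrow B^a\downarrow\,],\ \ldots;\ \mathit{en}\uparrow\;]
\end{aligned}
$$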
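[0104] The assignment statements may be compiled into production
rules as before. Of special interest is the compilation of the
sequence
$[t^0 \rightarrow s^1\downarrow; s^0\uparrow; t^0\downarrow \;[]\; t^1 \rightarrow s^0\downarrow; s^1\uparrow; t^1\downarrow]$.
Due to correlations of the data, this
compiles into the simple (bubble-reshuffled) production rules:
$$
\begin{aligned}
\neg\mathit{en} \land \neg\overline{t^0} &\rightarrow s^0\uparrow \\
\neg\mathit{en} \land \neg\overline{t^1} &\rightarrow s^1\uparrow \\
s^0 \land \overline{t^1} &\rightarrow s^1\downarrow \\
s^1 \land \overline{t^0} &\rightarrow s^0\downarrow \\
\neg\mathit{en} \land \neg s^1 &\rightarrow \overline{t^0}\uparrow \\
\neg\mathit{en} \land \neg s^0 &\rightarrow \overline{t^1}\uparrow
\end{aligned}
$$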
[0105] The $s^0$ and $s^1$ should also be reset to the correct
initial value. The completion of this sequence is just the normal
check for $\overline{t^0} \land \overline{t^1}$. If the state
variable doesn't change, this sequence
takes only 1 transition, since the first 4 rules are vacuous. If
the state changes it takes 3 transitions. This is 2 transitions
longer than the reset of a normal output channel, so this should be
considered when optimizing the low level production rule decomposition.
This type of structure only works well if s and t are dual-rail,
although several dual-rail state variables can be used in parallel
to encode more states.
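The transition counts quoted above can be checked with a small production-rule simulation. The sketch below is built on an assumed reading of the six rules: the t rails are taken to be stored in inverted sense (written `_t0` and `_t1`, low when active), en is held low for the reset phase, and each rule is a guard that drives one node high or low. These polarities and the function name `run_reset` are assumptions for illustration, not a statement of the circuit in the figures.

```python
def run_reset(s, t):
    """Fire the six assumed rules until quiescent.

    s is the old state bit and t the new one. Returns the pair
    (final state bit, number of transitions that actually fired).
    """
    v = {
        "en": 0,                                # reset phase: en is low
        "s0": int(s == 0), "s1": int(s == 1),   # dual-rail old state
        "_t0": int(t != 0), "_t1": int(t != 1), # inverted-sense new state
    }
    rules = [
        (lambda v: not v["en"] and not v["_t0"], "s0", 1),
        (lambda v: not v["en"] and not v["_t1"], "s1", 1),
        (lambda v: v["s0"] and v["_t1"], "s1", 0),
        (lambda v: v["s1"] and v["_t0"], "s0", 0),
        (lambda v: not v["en"] and not v["s1"], "_t0", 1),
        (lambda v: not v["en"] and not v["s0"], "_t1", 1),
    ]
    count = 0
    while True:
        for guard, node, value in rules:
            if guard(v) and v[node] != value:   # rule is enabled and not vacuous
                v[node] = value
                count += 1
                break
        else:
            break                               # no rule can change anything
    return (0 if v["s0"] else 1), count
```

Under these assumptions, an unchanged state costs one transition (only the t rail resets) and a changed state costs three, matching the counts in the text.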
[0106] In addition, extensions to these cells which allow for
conditionally receiving inputs or conditionally sending outputs
were explained. Finally, various approaches to storing internal
state in the cells are disclosed.
[0107] The prior state of the art was to use un-pipelined weak
condition logic. Extra buffers or registers would be added between
blocks of logic to add some pipelining. This approach was smaller,
but much slower. The extra buffers also increased the forward
latency. Essentially, in the limit of using more and more buffers,
they should eventually be merged into the logic and all cells
should be "maximally" pipelined. That is, any discrete stage of
logic gets its own pipelining, so that no more slack could be added
without just adding excess buffers. In practice, the cost of such
fine pipelining amounts to a 50% to 100% increase in area over a
completely un-pipelined circuit. It reduces the latency (since no
separate buffers are added), and, of course, increases the
throughput. At this natural limit of pipelining, all handshakes
between neighboring cells require a small number of transitions
per cycle, typically 14 to 18. The internal cycles usually keep up.
This yields a very high peak throughput (comparable to
14-transition-per-cycle hyper-pipelined synchronous designs like the
DEC Alpha) but is more easily composable. However, composing fast
pipelined cells in various patterns can yield much lower system
throughputs unless special care is taken to match the latencies as
well as the throughputs of the units.
[0108] Several simple modifications to these pipelined circuit
templates are also useful and novel.
[0109] 1. Go Signal
[0110] In the PCHB, it is possible to separate out the "en & re"
expressions for the logic pulldown and "~en & ~re" for the logic
pullup into a 2-input c-element of "en"
and "re" which generates a single "go" signal used to precharge and
enable the logic. This improves the forward latency and analog
safety of the logic, although it adds 4 transitions to the
handshake on the output channel.
[0111] With more care, this "go" signal may be added to a PCFB as
well. In this case, the "go" signal must also be checked before
producing the left enables, or instabilities will result. This has
the side effect of reducing the slack to one half, but this is
irrelevant when the goal is high speed. When a "go" signal is used
with conditional outputs, the "go-" must not wait for the right
enable (re) to go down since it won't (as no data was sent on the
last cycle). Instead of a c-element this gives the PRS: "en & re &
~no_r -> go+" and "~en & (no_r | ~re) -> go-".
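The two rules for the conditional "go" signal can be exercised with a small next-state function. This is a sketch of the boolean behavior only, assuming that "go" holds its previous value when neither rule is enabled; the function name `next_go` is hypothetical.

```python
def next_go(go, en, re, no_r):
    """Return the next value of go given the current inputs.

    go+ fires when en & re & ~no_r, go- fires when ~en & (no_r | ~re),
    and otherwise go is assumed to hold its previous value.
    """
    if en and re and not no_r:
        return 1
    if not en and (no_r or not re):
        return 0
    return go
```

With `no_r` set, `go` can fall as soon as `en` falls, without waiting for `re` to go down, which is the case the modified rule is meant to cover.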
[0112] 2. Shared Input/Output Completion
[0113] In most of these examples, the output completion is taken
before the inverters, since this allows the use of a NAND gate
instead of a NOR gate and gets the completion done a transition
earlier. However, it is possible to complete from after the
inverters as well. This is particularly useful when you can share
the output completion circuit of one cell with the input completion
of the next cell in the pipeline.
[0114] 3. Timing Assumptions
[0115] Although this patent primarily presents asynchronous
circuits in a quasi-delay-insensitive framework, it may prove
desirable to introduce timing assumptions in order to simplify or
speed up the circuit. Several useful non-QDI circuits can be
derived simply by omitting transistors from a QDI WCHB, PCHB, or
PCFB circuit. It is preferred if the introduced timing assumptions
can be met entirely by estimating the delays within the cell,
without making assumptions on the delays of its environment.
Several simple modifications can satisfy this property.
[0116] For example, in a PCFB with a single "go" signal, it can be
assumed that the output will precharge quickly after the "go" goes
low. The fact that "go" is low can be taken to imply that the
output data is precharged, or soon will be. This "implied
neutrality" timing assumption can eliminate many transistors of
completion detection, and allow the next cycle to begin earlier. In
a similar fashion, the input validity can sometimes be ignored if
the output validity implies that all input channels are valid.
[0117] Of the various types of state-holding cells, the more
restricted versions generally have simpler and faster
implementations, and should therefore be used if possible. For the
most general case, either a pair of state variables should be used,
or if area is not an issue, a feedback loop of buffers.
[0118] Three main types of handshaking reshuffling have proved
superior for different circumstances. The weak-condition halfbuffer
variety works well for buffers and copies without logic. The
precharge-logic half-buffering is the simplest good way to
implement most logic cells. The precharge-logic full-buffering has
an advantage in speed and is good at decoupling the handshakes of
neighboring units. It should be used when necessary to improve the
throughput.
[0119] Although only a few embodiments have been described in
detail above, other embodiments are contemplated by the inventor
and are intended to be encompassed within the following claims. In
addition, other modifications are contemplated and are also
intended to be covered.
* * * * *