Fermilab D0 Note 3058

D Zero Central Hardware Trigger Preliminary Implementation Studies of the "Base Line Design".

(or how many road equations will fit in an Altera 10K50 EPLD.)

by

R. Angstadt and F. Borcherding

August 15,1996

Introduction

This is a status report on efforts to program CFT L1 trigger equations into an Altera Embedded Programmable Logic Device or EPLD. Altera provides software tools for programming this family of chips including a language and drawing tools. The language is AHDL for Altera Hardware Description Language. It is a (simplified and reduced) subset of VHDL. The "V" in VHDL is short for the (Department of Defense) "Very High Speed Integrated Circuit" (language) or "VHSIC". Of coarse HDL is Hardware Description Language. VHDL has evolved into IEEE 1076-1987 standard. There is also a newer '93 standard with additions to address shortcomings in the '87 standard. VHDL is somewhat loosely based on ADA and includes constructs for parallel and/or serial logic and also specific time controlling statements.

Differences between AHDL and VHDL.

Syntactically both AHDL and VHDL are similar, however AHDL lacks a great many of the features available in VHDL for controlling whether sections of code are parallel or serial and also specifying timing statements. For this project it is possible to use AHDL without problems. All code is executed in parallel in AHDL (except for a state machine = "case" statement or inside a "for" loop). Serial code dependencies of the form x=y followed by z=x are handled by "synthesizing" the signals through an "or" gate which imposes about 1.5 Ns delay according to simulation experiments I have done. This avoids many of the pitfalls inherent in VHDL which leaves it to the programmer to correctly handle serial dependencies through a variety of means. (Books are written on the subject of how to avoid race conditions ect.)

Platforms and other Software Used

The Altera Software Suite "MAX+PLUS II" of which AHDL is a part also contains timing analysis and simulation software. It runs on a variety of platforms including Win 32s better known as Windows-95 and/or NT. In this case Windows-95 was used in conjunction with FTP (for transferring data files from the VAX) and Excel V5.0 (a spread sheet from Microsoft) which includes Visual Basic for Applications (VB) to help automatically convert tracks in the form of fiber index numbers into AHDL code.

About the Chips

AHDL Version 6.0 was used for this study which only had knowledge of an Altera EPF10K50 with 50,000 typical gates (36,000 to 116,000 usable) gates so all fitting was done to this device. Also the simulator did not know about speed grades so simulation was done at slightly greater than half the desired clock frequency of ~53Mhz. Although Altera is shipping an EPF10K100 with 100,000 typical gates (62,000 to 158,000 usable) we could do no fitting to the 10K100 with Version 6.0. Both these devices contain both Logic Elements and RAM. The trigger equations are Boolean and get implemented by the Altera software in a basic unit called a "Logic Element". Each logic element is a programmable look up table of four inputs. The 10K50 has 2,880 logic elements and the 10K100 has 4,992 logic elements. Any logic element may be connected to another through a matrix of global row and column (static) programmable connections. Also 8 logic elements are connected privately on their own busses to form a block. All of these resources are finite including the number of usable I/O pins which is 310 and 406 for the 10K50 and 10K100 respectively.

Detector Design Assumptions

This implementation is based on Fred Borcherding's and Manuel Martin's design assumptions presented in several previous D0 Notes and in conversations with them. Briefly this design is for 1/80 of the detector = 1 sector. For our purposes we are not interested in the stereo layers. The axial layers are arranged in 8 doublet layers A, B, C, D, E, F, G, and H of 16, 20, 24, 28, 32, 36, 40, and 44 fibers respectively. A doublet layer is two rows of overlapped fibers so the actual number of fibers in one sector is = 480 fibers per sector. To avoid inefficiencies for tracks bending out of a sector, fibers from the neighboring sectors must also be brought in (wings). Enough are included to go down to a Pt threshold of 1.5 GEV. Manuel has studied anchoring (where the road search begins) in different layers to determine that the wings are significantly smaller by anchoring in the outer "H" layer (184 fibers per adjacent sector) rather than anchoring in the inner "A" layer (272 fibers per adjacent sector). This is a significant savings in the number of equations needed and in terms of "logic elements" and cost. The total number of fibers for one sector and the wings for this scheme is 480 + (2 * 184)=848 fibers or bits. These must be moved into a chip before tracks can be found.

Since all known chips have fewer usable pins than 848, all these inputs must be multiplexed. Beam crossings occur every 132 Ns or every seventh tick of a 53Mhz clock. The scheme is to use four of the seven available clock cycles to read in the 848 fiber bits. A "one-hot" state machine was implemented in AHDL as per Manuel's design. A spreadsheet showing the input array indexes to the detector fiber mapping indexes is in "//d0server1/public" in (if using FTP then "cd" to disk "d:") "\angstadt\altera\signals.xls". This state machine appears to take about 4% to 5% of a 10K50's logic elements. According to Altera Field Engineers a "one-hot" design uses the fewest chip resources.

Road Equation Generation, Counting and More Geometry?

Fred used Monte Carlo techniques to generate a file available on the above server and disk in \angstadt\altera\table10.txt" which has 306 equations of positive tracks for 1/4 of the sector = a "cell". A cell is the smallest repeating pattern and happens to be the Lowest Common Denominator of the number of fibers in each doublet layer. These equations are for a threshold Pt=10 with the anchor in layer H. The first few lines of this file are shown.

23 28 34 41 47 53 60 67 29

23 28 35 41 47 53 60 67 114

23 29 35 41 47 53 60 67 151

23 29 35 41 47 54 60 67 79

23 29 35 41 47 54 61 67 44

23 29 35 41 48 54 60 67 21

23 29 35 41 48 54 61 67 53

23 29 35 41 48 54 61 68 19

23 29 35 42 48 54 60 67 17

23 29 35 42 48 54 61 67 70

23 29 35 42 48 54 61 68 254

....

The first 8 columns are fiber index numbers of the A, B, C..H layers. The last column is a counter of the number of times that road occurred in the Monte Carlo generation and is presently ignored (it could be used as a method to reduce equations). The counting scheme includes three full sectors including the previous, home, and next sector. All counting begins at 1 (not 0) from the previous sector. The Table columns below show the sector boundaries:
Previous Sector
Home Sector
Next Sector
1,2,3,.....................................44
45.....................................88
89.....................................132
1,2,3,..................................40
41..................................80
81.................................120
1,2,3,..............................36
37..............................72
73.............................108
1,2,3,.........................32
33.........................64
65.........................96
1,2,3,..................28
29..................56
57..................84
1,2,3,............24
25............48
49............72
1,2,3,.....20
21........40
41........60
1,2,..16
17.....32
33.....48

Cells are 1/4 of the number of fibers in a layer and represent the smallest repeating geometrical pattern. The table below show the cell divisions of the home sector:
1st cell2nd cell3rd cell 4th cell
45.................55
56.................66
67.................77
78.................88
41................50
51................60
61................70
71................80
37...............45
46...............54
55...............63
64...............72
33..............40
41..............48
49..............56
57..............64
29.............35
36.............42
43.............49
51.............56
25...........30
31...........36
37...........42
43...........48
21........25
26........30
31........35
36........40
17.....20
21.....24
25.....28
29.....32

Knowing that the anchor of this file is in the outer layer we can now easily see that that the this file is dealing with the third cell.

AHDL example Fragments.

Once the Fiber indexes have been obtained they may be automatically translated into AHDL code using your favorite Higher Level Language(s). Below are the first 7 roads from "table10.txt" translated into AHDL code with 8 terms making up an equation:
ttemp1=AL[23] AND BL[28] AND CL[34] AND DL[41] AND EL[47] AND FL[53] AND GL[60] AND HL[67];
ttemp2=AL[23] AND BL[28] AND CL[35] AND DL[41] AND EL[47] AND FL[53] AND GL[60] AND HL[67];
ttemp3=AL[23] AND BL[29] AND CL[35] AND DL[41] AND EL[47] AND FL[53] AND GL[60] AND HL[67];
ttemp4=AL[23] AND BL[29] AND CL[35] AND DL[41] AND EL[47] AND FL[54] AND GL[60] AND HL[67];
ttemp5=AL[23] AND BL[29] AND CL[35] AND DL[41] AND EL[47] AND FL[54] AND GL[61] AND HL[67];
ttemp6=AL[23] AND BL[29] AND CL[35] AND DL[41] AND EL[48] AND FL[54] AND GL[60] AND HL[67];
ttemp7=AL[23] AND BL[29] AND CL[35] AND DL[41] AND EL[48] AND FL[54] AND GL[61] AND HL[67];

....

This was done by reading the file into Excel (can cut and paste or write 93 lines of Visual Basic for Applications code.) In this case the (truly trivial) code was written in very short time. Once in Excel another 66 lines of truly trivial VB code converts it to the AHDL form shown above. (This code and spreadsheet is available on the previously mentioned server in "\angstadt\altera\rdsort.xls": the code is in "Module1" and the above equations were copied in from "pt=10" worksheet.) Another 36 lines of VB code generates AHDL code to "OR" all of the ttempxxx to see if any tracks were found:

if ttemp1.Q==VCC or ttemp2.Q==VCC or ttemp3.Q==VCC or ttemp4.Q==VCC

or ttemp5.Q==VCC or ttemp6.Q==VCC or ttemp7.Q==VCC or temp8.Q==VCC .........

then

atrig1=VCC;

end if;

Then the text (AHDL code) in the Excel spreadsheet is copied and pasted in (right now by hand) in the AHDL source file ("*.tdf") and compiled and built. Once a successful build and fit has occurred then an Altera *.rpt (report) file is generated that gives utilization and statistics on the chip resources used. Also timing and Boolean logic simulations may be done to see that the design performs as expected including timing.

The astute reader will note in the above equations that there is a repeat pattern in the first four terms of the index numbers and another repeat pattern in the second four terms. Fred has written FORTRAN code that generated equations that "and" the two four-term-sets, and then later "and" the result of the two four-terms-sets in an attempt to reduce the number of required logic elements. A file similar to the above "table10.txt" had 303 roads in it reduced to about 40 four term equations for the inner four layers and about 100 four term equations in the outer four layers. There are also 303 two term equations to match the inner and outer. This we will call the 2 x 4 term model as opposed to the obvious straight 8 term design.

Altera Software and Compiler Options.

A significant amount of effort was needed to learn a new language and to understand the quirks of the Altera Integrated Development Environment, IDE. The IDE includes their compiler and all of it's switches. The first thing to realize about their IDE is that whenever a compile "check" and/or build button or menu pick is clicked on, the IDE first saves all of the open source files over the previous ones. No rollback is provided so the user must protect original source by periodically making a new directory (external to the IDE) and copying the source files there before making changes.

Another problem may occur over time with the same project as the user sets and removes different IDE menu options. Some of the more esoteric menu options may not be removed from the configuration file that the IDE uses, "*.acf" file. This causes various parts of the compile cycle to be confused with potentially conflicting settings and thus fail. The only easy way out is to delete the associated "*.acf" so that the IDE will then automatically create a new one from the present environment on the next compile. This combined with the myriad of compiler switches made for considerable initial confusion on my part.

Eventually I learned that the trick is to work with the IDE menu terms: "Assign", "Global Project Logic Synthesis", "Global Project Synthesis Style" set to "WYSWIG" = What You See (is) What It Generates (?) . In this mode very little if any optimization is done so that a signal node in the language is mapped to the device. Thus one can trace signals through various stages using the simulator and see that signals go through the intended logic path as desired. After the code works as intended in the simulator this switched is changed to "Normal", this turns on Boolean optimization(= "logic synthesis" in Altera terms). In this mode many of the intermediate signal nodes that are there in the code may not be available to the simulator because they have been optimized out. This setting can make a big difference in logic element utilization: from 61% to 12% for the same source code.

In addition to the above, input signals that are not referenced in a path to the output are discarded with optimization on. Also signals that are not correctly connected (via AHDL) will be discarded. Warning messages that this is happening will be generated at compile time but they can be confusing. When the inputs were not connected to any output, utilization went down to 4 to 5% usage for the state machine logic. The trick is to get a working design according to the simulator with enough equations to utilize the inputs. This was eventually done and is on the same server in "\angstadt\altera3\alt7s8t\robust\*.tdf". Here "*" is either "trig_try" which is the 8 term code of 303 equation, "trigpos" is the 1224 equations of 8 terms and "trigtry2" is the 303 equations of 2 x 4 terms. As one can see variation in chip usage due to the above reasons can make initial utilization estimates relatively meaningless unless working code is first developed.

Another feature (bug or quirk) that took some time to realize is that when implementing a one hot state machine (perhaps any state machine?) in AHDL the initial or reset state is special in that it is apparently not guaranteed to be exclusive only to reset! The state machine may drop into the initial state for a nanosecond or two while switching between other subsequent states. This is just enough to clear the latches so painfully created. So all the initial state is good for is to idle. If one tries to clear latches in the idle state, previous latched input may be cleared again (apparently randomly) while it is changing to subsequent states. If latches must be explicitly cleared then a separate state at the end of the timing cycle (before idle) must be used to explicitly clear the latches! All of this is according to the simulation software. So far we have not built any of this in hardware.

The Bottom Line: logic utilization.

The following table summarizes Logic Element utilization of the total available 2880 elements in a 10K50 of the various schemes for 303 equations:
Scheme for 303 equations "WYSWIG""Normal" (fully optimized)
2 set of 4 terms anded together19% 9%
straight 8 terms17%11%

This study shows that the built in Altera logic reduction optimization is very good and that for preliminary study purposes the 8 term model will suffice since it is much easier to generate this code automatically from the index numbers than to do the 2 x 4 scheme: there is 11%-9%=2% difference between the two schemes. (It is also much more direct and less prone to coding errors.)

To utilize more of the input signals and get a more accurate estimate of the limits of the chip we expanded the 306 equations (1 cell or 1/4 of a sector) by 4. This gave us 1224 equations representing only the positive Pt roads. Another ~1224 equations are needed for the negative Pt roads with a minimum Pt=10. This also requires a change to the Monte Carlo code that was not yet done. However the 1224 equations would use more input signals and would help in seeing if utilization scaled in a linear fashion. The answer is that it is fairly linear:
Scheme for 1224 equations"WYSWIG" "Normal" (fully optimized)
straight 8 termsNA48.7%

With maximum optimization 1402 logic cells of 2880 available logic cells were used. A ~48.7% utilization was obtained for 1224 input equations of 8 terms along with the logic to read in and latch (store) of all 848 fibers necessary for one sector. The extra 4%, 48%- 11% * 4, is a non-linearity probably due to using more of the input signals. Equation expansion was done in Excel using simple arithmetic (the same workbook as before in the "Pt=10pos" sheet and the "autoOrPos" sheets).

Altera Fitting times

While utilization may be linear, computer fitting time is not. On an Intel Atlantis Mother Board with a 166 Mhz Intel Pentium (not a Pentium-Pro) with 256K of L2 cache with the latest Synch Burst Trident chip set with 82 megabytes of physical EDO RAM build times take ~2 minutes for ~300 equations. (Peak working RAM was ~24 Megabytes). For 1224 equations of 8 terms this goes up to ~2-4 hours and ~35 Megabytes. Note that no memory paging was occurring to disk. At the 48% level we are already getting "fail to fit" messages and it is starting over with another fitting iteration. {At an Altera instruction class the recommendation was that 80-85% utilization was a working maximum (~8 hours) for these devices.}

What has not been done:

No vector studies have been performed to check all input combinations. This will require extensive work including tools to help build the vectors. Right now reading in the fibers (correlating an input state and pin to a fiber signal) is not obvious and tools (i.e., custom mapping software) will be required to work backwards through the coded signals to the input pins and states. This will be somewhat tedious and complicated and has not been successfully done yet. I did it for one road, partly by hand and partly with help from several spreadsheets. Even if tables are generated and used it will still take time to think it through. The complicated read in scheme is dictated by external backplane, timing and signal symmetry considerations and probably will not significantly change. (For example the test input I used did not use any signals in the fourth input state so I don't know if there is some sort of incidental delay that would in some way preclude the present design or not. Previous state delays for 1,2 and 3 were fairly consistent and did not look to be a factor if they continue to be consistent for the fourth state.)

The Altera software simulation V6.0 had no speed grade for the 10K50 so all of the timing simulations were done at ~1/2 the necessary clock frequency. (40 Ns instead of 20 Ns). Simulations must be redone to verify that the timing will still work at the intended clock frequency. Internal propagation delays looked like it would work but still this must be checked further.

No fits have been made to the 10K100 series to see if the utilization is as would be expected.

A board needs to be built to see how many EPFGs can be hung on the input signal (fiber bits) without electrical problems including noise and slewing rise times due to input capacitance.

Conclusion

This looks like a technically viable scheme as long as the number of equations can be kept as low as possible. More equations will drive the cost up as the number of EPLDs are increased. Right now these devices are rather expensive as they are unique and state of the art (the largest available).

The Altera Logic Synthesis (Boolean reduction and optimization) was found to be very good with only a 2% difference between it and our own attempt at reducing the equations.

Like any powerful and complicated new tool the Altera Software takes some getting used to. However, the tools are usable and Altera has shown a commitment to them and supporting them. They also have a good reputation (if not the best than certainly one of the best) in the industry for both parts, software tools, and technical support.

Epilogue

Recently the question has been asked, "what if only 7 layers of VLPCs are working?" Fred has come up with a 4 times expansion scheme to take care of that. If the road really is:

ttemp1=AL[23] AND BL[28] AND CL[34] AND DL[41] AND EL[47] AND FL[53] AND GL[60] AND HL[67];

then the following four equations will replace the above and still find a road on any 7 of 8 existing layers:

ttemp1=AL[23] AND BL[28] AND CL[34] AND DL[41] AND EL[47] AND FL[53] AND (GL[60] OR HL[67]);

ttemp2=AL[23] AND BL[28] AND CL[34] AND DL[41] AND (EL[47] OR FL[53]) AND GL[60] AND HL[67];

ttemp3=AL[23] AND BL[28] AND (CL[34] OR DL[41]) AND EL[47] AND FL[53] AND GL[60] AND HL[67];

ttemp4=(AL[23] OR BL[28]) AND CL[34] AND DL[41] AND EL[47] AND FL[53] AND GL[60] AND HL[67];

When 306 original equations are expanded 4 times in this manner , the 1224 equations fit into a 10K50 in less than 10 minutes with 34% of the chip utilized. Instead of two chips to do a minim Pt = 10

probably 4 would be needed. (Two for the positive and two for the negative as opposed to one for positive and one for negative.)