Differences

This shows you the differences between two versions of the page.

Link to this comparison view

programmable_hardware [2007-06-18 10:13] – external edit 127.0.0.1programmable_hardware [2007-12-30 12:57] (current) nik
Line 1: Line 1:
- 
- 
  
 ==== Programmable Logic ==== ==== Programmable Logic ====
  
 aka programmable hardware, (re)configurable logic, etc aka programmable hardware, (re)configurable logic, etc
- 
  
 === Hardware === === Hardware ===
Line 27: Line 24:
   * http://www.al-williams.com/pldhome.htm   * http://www.al-williams.com/pldhome.htm
   * open directory category http://dmoz.org/Computers/Hardware/Programmable_Logic/   * open directory category http://dmoz.org/Computers/Hardware/Programmable_Logic/
-  * suvey/ tutorial http://www.eecg.toronto.edu/~jayar/pubs/brown/survey.html+  * survey/ tutorial http://www.eecg.toronto.edu/~jayar/pubs/brown/survey.html 
 + 
 +---- 
 + 
 +==== other ==== 
 + 
 +<code> 
 + 
 +Subject: Repost: Performance Benchmarks: Emulating FPGAs Using 
 +         General Purpose Processors 
 +Date: 09 Feb 1996 00:00:00 GMT 
 +newsgroups: comp.arch.fpga 
 + 
 +Reposted after Netcom corrupted the first copy. 
 + 
 +In <1996Feb9.082016.20892@super.org> sc-@vcc.com (Steve Casselman) 
 +writes:  
 +>>Another thread that would be good is just for some benchmark 
 +>>data on what people are doing now. For example if someone 
 +>>is doing neural nets you might post: 
 +>> 
 +>>I did a 12 neuron neural net that did 12 Billion connections/sec 
 +>>in 3000 gates and a ALPHA 330MegHz can do them @ 300 Million   
 +>>connections/sec.  (no I did not do this it is just an example  
 +>>of the detail of the proposed post) 
 + 
 +A while back I was wondering about this in general: "how much better 
 +are FPGAs at executing, in hardware, an arbitrary computation, than are 
 +modern general purpose processors at executing, in software, the same 
 +arbitrary computation?" And so I wrote the following loose analysis 
 +which might be interesting to discuss: 
 + 
 + 
 +Emulating FPGAs using general purpose processors 
 +Jan Gray, 1995 
 + 
 +It is well known that FPGAs can implement general purpose computers, 
 +and general purpose computers can emulate FPGAs.  In this message I 
 +consider how efficiently a general purpose processor can emulate an 
 +arbitrary FPGA circuit.  It arises from my curiosity regarding what 
 +performance advantage (if any) the custom reconfigurable computing 
 +crowd has over general purpose processors, to quantify, best case, how 
 +much faster an FPGA is at arbitrary FPGA problems. 
 + 
 + 
 +Setting aside such modern features as embedded SRAM, 3-state buses, 
 +asynchronous flip-flop set/reset, and even flip-flop clock enables, we 
 +model a typical lookup-table-based FPGA logic element as an arbitrary 
 +4-input logic function driving either another stage of logic or a 
 +flip-flop.  Without fudging too much we can then consider that all 
 +multilevel logic functions are pipelined with a register between each 
 +logic function.  This simplifies my equivalence construction below and 
 +allows us to compare against an FPGA clock rate based upon the sum of 
 +flip-flop clock-to-output delay, delay through some interconnect, one 
 +logic function delay, and the flip-flop setup time. 
 + 
 +Thus, we model an FPGA as a vector of n 4-input function generators 
 +F[i] driving n flip-flops R[i] which are in turn fed back into the 
 +function generator inputs. In practice most FPGAs are incapable of 
 +modeling arbitrary interconnect schemes between logic elements, instead 
 +greatly favouring local interconnect between nearby logic elements.  
 +However, we’ll be overly generous and permit any function generator 
 +F[i] to compute any function of any 4 flip-flop outputs: 
 + R[i]’ = F[i](R[a[i]], R[b[i]], R[c[i]], R[d[i]]); 
 +for i, a[i], b[i], c[i], d[i] all in [0..n). 
 + 
 +Note that a, b, c, d, are simple mappings which describe which 
 +flip-flop outputs are inputs to each given function generator.  For 
 +instance, if R[0]’ = R[2] xor R[3] xor R[5] xor R[7], we would have 
 +a[0]=2, b[0]=3, c[0]=5, d[0]=7. 
 + 
 +On a general purpose processor with word width w, we can implement the 
 +flip-flop vector by partitioning it into w bit registers.  To simplify 
 +the presentation, assume that n == w.  Then a simulation of one 
 +clocking of the FPGA is to form A, B, C, D, each a mapping of R such 
 +that 
 + A[i] == R[a[i]] 
 + B[i] == R[b[i]] 
 + C[i] == R[c[i]] 
 + D[i] == R[d[i]] 
 +and finally compute 
 + R’[i] = F[i](A[i], B[i], C[i], D[i]) 
 +word-wise, over all the bit positions i simultaneously. 
 + 
 +The first sub-problem is to compute 4 mappings of bits of R into A, B, 
 +C, D efficiently.  One approach is to perform wordwise bit shifting, 
 +xor, and mask operations upon R.  For instance, R can be bit-reversed 
 +in 5 lg w simple RISC instructions, arbitrarily permuted in 10 lg w 
 +instructions, and arbitrary mapped (any bit input to any bit output, 
 +including arbitrary bit replication) in roughly 20 lg w instructions 
 +employing 4 lg w constants.  (Knuth developed this by analogy from 
 +exchange networks).  Another approach is to build w/8 256-entry lookup 
 +tables, for each byte of bits in R, and "or" the contributions 
 +together: 
 +    word R, LUA[w/8][256], A = 0; 
 +    for (i = 0; i < w/8; i++) 
 + A |= LUA[i][(R>>(i*8)) % 256]; 
 +Written in-line this would require about w/8 shifts, masks, loads and 
 +"or"s.  Let us assume the tables fit into d-cache and charge overall 
 +4*w/8 == w/2 instructions to arbitrarily map R into A. 
 + 
 +For example, for w==64, the wordwise bit mapping approach would require 
 +approximately 120 instructions, whereas the table lookup approach would 
 +require approximately 32 instructions (including 8 loads).  Thus the 
 +four mappings A, B, C, D require about 4*32 == 128 instructions, 
 +including 32 loads and about 8 KB of lookup tables. 
 + 
 +The next sub-problem is to compute F(A,B,C,D).  To perform this in 
 +parallel across all n bits, it suffices to generate the 16 minterms 
 +M[0]=~A&~B&~C&~D, M[1]=~A&~B&~C&D, ..., M[15]=A&B&C&D and then combine 
 +them under 16 masks N[0..15]: 
 + R = M[0]&N[0] | M[1]&N[1] | ... | M[15]&N[15]; 
 +By reusing common subexpressions this can be done in about 60 
 +instructions. 
 + 
 +All totaled, on a w-bit processor, we need about 4*max(w/2, 20 lg w) + 
 +60 instructions to simulate one clocking of an arbitrary w-bit wide 
 +FPGA, as modeled.  Also of interest, the state required to describe the 
 +arbitrary w-bit FPGA of 4-input logic elements, is only about 16 + 4 lg 
 +w words. 
 + 
 + 
 +Let’s try out this model against some real hardware!  A lowly XC4002A-5 
 +has 8x8 CLBs each with 2 function generators and two flip-flops, and 
 +with a clock to flip-flop output delay of 3 ns, an "interconnect delay" 
 +which I'll empirically and arbitrarily state is 17 ns, and a combined 
 +function generator plus flip-flop setup time of 5 ns, all totaled, 
 +let’s say 25 ns.  In summary, we won’t be far wrong (?) to say an ‘02A 
 +is a "128-bit FPGA" which you can clock at 40 MHz.  In comparison a 
 +32x32 CLB XC4025-5 would be a "2048-bit FPGA"
 + 
 +In the general purpose processor corner, let’s employ an Alpha 21164A.  
 +I recall this machine can issue 4 instructions per clock at 300 MHz.  
 +Since w==64, it would take an Alpha about 200 instructions or about 50 
 +issue slots (167 ns) to emulate one clocking of an arbitrary 64-bit 
 +FPGA.  Thus our Alpha might emulate half an ‘02A at a rate of about 6 
 +MHz. 
 + 
 +For another example, the BiCMOS MicroUnity media processor 
 +implementation can apparently perform 128-bit operations at 1000 MHz.  
 +Here with w==128, it could emulate an arbitrary 128-bit FPGA at about 4 
 +MHz, perhaps faster if we can take advantage of some of its special bit 
 +shuffle instructions. 
 + 
 + 
 +Well!  These results surprised me.   Even executing a billion 
 +operations per second on 64- or 128-bit registers, these general 
 +purpose processors couldn’t achieve 10% of the speed*gates figure of a 
 +small FPGA.  Even if you consider my "12 ns" FPGA interconnect delay 
 +assumption far too generous given the arbitrary FPGA’s arbitrary 
 +interconnection topology, and derate it to 50 ns or more, the little 
 +FPGA is still several times faster at FPGA type work than a general 
 +purpose microprocessor. 
 + 
 + 
 +(I also wonder if this presentation doesn’t suggest a slow, yet 
 +compact, physical FPGA implementation.  Attach a vector of flip-flops 
 +to the input and output sides of a "4-context" n-stage 
 +shuffle/copy/exchange interconnect, set up to route its input values 
 +according to 4 different routing configurations.  The interconnect can 
 +be pipelined, and run at high speed, to map the flip-flop outputs into 
 +4 registered function generator inputs and every 8 or 9 clocks the 
 +flip-flop itself could be clocked.  Overall the pipeline could run at 
 +200+ MHz and the overall FPGA at 25 MHz.) 
 + 
 +Jan Gray 
 + 
 +</code> 
  • programmable_hardware.1182161602.txt.gz
  • Last modified: 2007-12-30 12:57
  • (external edit)