SWIFT MODELLING ON ARCHITECTURE OF MATRIX VECTOR
MULTIPLICATION ON ASIC
Shubhi Sharma (15MVD0074)
M.Tech VLSI Design
Vellore Institute of Technology, Vellore
shubhi0619sharma@gmail.com
Priya Pandey (15MVD0098)
M.Tech VLSI Design
Vellore Institute of Technology, Vellore
er.priyapandey@ymail.com
Abstract— Matrix-vector multiplication is an absolutely
fundamental operation, with countless applications in computer
science and scientific computing. Efficient algorithms for matrix-
vector multiplication are of paramount importance. However, the
sheer size of the matrix can be an issue: if the matrix is dense,
then Ω(n²) time is certainly required for an n × n matrix. Suppose
one allows for a slightly super quadratic preprocessing of the
matrix. How quickly can matrix-vector multiplication be done
then? This question has been studied since the beginning of
scientific computing, but with an almost exclusive focus on
special, structured matrices.
Key words: Sparse Matrix Vector Multiplication, Floating Point,
FPGA, Compressed Row Storage
I. INTRODUCTION
In mathematics, a matrix (plural: matrices) is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. A matrix with two rows and three columns has dimensions 2 × 3 (read "two by three"). The individual items in a matrix are called its elements or entries. Provided they are the same size (the same number of rows and the same number of columns), two matrices can be added or subtracted element by element.

The rule for matrix multiplication, however, is that two matrices can be multiplied only when the number of columns in the first equals the number of rows in the second. Any matrix can be multiplied element-wise by a scalar from its associated field. A major application of matrices is to represent linear transformations, that is, generalizations of linear functions such as f(x) = 4x.
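As a concrete instance of this rule (a worked example of our own, not taken from the paper), multiplying a 2 × 3 matrix by a 3 × 1 column vector is defined because the first factor has three columns and the second has three rows:

$$
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}
\begin{pmatrix} 1 \\ 0 \\ 2 \end{pmatrix}
=
\begin{pmatrix} 1\cdot 1 + 2\cdot 0 + 3\cdot 2 \\ 4\cdot 1 + 5\cdot 0 + 6\cdot 2 \end{pmatrix}
=
\begin{pmatrix} 7 \\ 16 \end{pmatrix}
$$

Each entry of the result is the dot product of one row of the matrix with the vector, which is exactly the multiply-accumulate pattern that the architecture in this paper pipelines.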
For instance, the rotation of vectors in three-dimensional space is a linear transformation which can be represented by a rotation matrix R: if v is a column vector (a matrix with only one column) describing the position of a point in space, the product Rv is a column vector describing the position of that point after the rotation. The product of two transformation matrices is a matrix that represents the composition of the two linear transformations. Another application of matrices is in solving systems of linear equations. If the matrix is square, some of its properties can be deduced by computing its determinant; for example, a square matrix has an inverse if and only if its determinant is nonzero. Insight into the geometry of a linear transformation is obtainable (along with other information) from the matrix's eigenvalues and eigenvectors. Applications of matrices are found in most scientific fields. In every branch of physics, including classical mechanics, optics, electromagnetism, quantum mechanics, and quantum electrodynamics, they are used to study physical phenomena such as the motion of rigid bodies. In computer graphics, they are used to project a 3-dimensional image onto a 2-dimensional screen.

In probability theory and statistics, stochastic matrices are used to describe sets of probabilities; for instance, they are used within the PageRank algorithm that ranks the pages in a Google search. Matrix calculus generalizes classical analytical concepts, such as derivatives and exponentials, to higher dimensions.

A major branch of numerical analysis is devoted to the development of efficient algorithms for matrix computations, a subject that is centuries old and is today an expanding area of research. Matrix decomposition methods simplify computations, both theoretically and practically. Algorithms tailored to particular matrix structures, such as sparse matrices and near-diagonal matrices, speed up computations in the finite element method and elsewhere. Infinite matrices occur in planetary theory and in atomic theory. A simple example of an infinite matrix is the matrix representing the derivative operator, which acts on the Taylor series of a function.
II. OBJECTIVE
Our objective is to build a processor, or organize the hardware, so that multiple operations can be performed at a time. With pipelining, multiple operations can proceed concurrently without changing the execution time of any single instruction; this reduces the timing slack and hence increases the efficiency of the processor.
III. LITERATURE REVIEW
In this project we implement the arithmetic operations addition, subtraction, multiplication, and division, along with one top module that combines all of these operations.
IV. PIPELINED MULTIPLICATION
Let us take two operands, operand A and operand B, each with a leading 1 signifying a normalized number, stored in a 53-bit register A and a 53-bit register B. Multiplying the 53-bit A and 53-bit B register values yields a 106-bit product. The synthesis tools from Xilinx and Altera do not directly provide a 53-bit by 53-bit multiplication, so to simplify our work we break the 53-bit operands into smaller 24-bit and 17-bit units; the partial products are finally added together, followed by rounding. A register then stores the 106-bit result, and the output is normalized by shifting if there is not a 1 present in the MSB.
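To make the decomposition concrete, here is a minimal software sketch (our own illustration in Python, not the paper's RTL) of a 53-bit × 53-bit multiply built from narrower multiplies. The 24-bit and 17-bit widths follow the text; the 12-bit remainder chunk is our own choice so the widths sum to 53:

```python
import random

def split(value, widths):
    """Split an integer into chunks of the given bit widths (LSB first)."""
    chunks, shift = [], 0
    for w in widths:
        chunks.append(((value >> shift) & ((1 << w) - 1), shift))
        shift += w
    return chunks

def mul53x53(a, b, widths=(24, 17, 12)):   # 24 + 17 + 12 = 53 bits
    """Multiply two 53-bit mantissas by summing shifted partial products."""
    product = 0
    for ca, sa in split(a, widths):
        for cb, sb in split(b, widths):
            product += (ca * cb) << (sa + sb)  # one narrow multiply per pair
    return product                              # full 106-bit result

# Sanity check against a direct multiply (leading 1 set, as for normalized numbers)
a = random.getrandbits(52) | (1 << 52)
b = random.getrandbits(52) | (1 << 52)
assert mul53x53(a, b) == a * b
```

In hardware, each `ca * cb` term maps onto one of the narrow multiplier primitives the synthesis tools do provide, and the shifted partial products are summed by an adder tree.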
The exponents of operands A and B are added, and then the value (1022) is subtracted from the sum. If the resultant exponent is less than 0, the product register must be right-shifted by that amount, and this value is stored. The final exponent of the output operand becomes '0' in this situation, and the result is a de-normalized number. If the amount by which the exponent falls below 0 is greater than 52, then the fraction will be shifted out of the product register, the output will be '0', and the "underflow" signal will be asserted. The exponent output from the module is held in a 56-bit register. The MSB is a leading "0" to allow for overflow in the rounding module. This leading "0" is followed by a "1" for normalized numbers, or a "0" for de-normalized numbers. Then the 52-bit fraction follows, and 2 extra bits follow the fraction, used for rounding purposes. The first extra bit is taken from the next bit after the fraction in the 106-bit product of the multiply. The second extra bit is an OR of the 52 LSBs of the 106-bit product.
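The two rounding bits can be expressed compactly. The sketch below is our own, with bit positions inferred from the description: the kept significand occupies the top of the 106-bit product, the first extra (guard) bit sits just below the fraction field, and the second (sticky) bit is the OR of the 52 LSBs:

```python
def rounding_bits(product106: int):
    """Extract (guard, sticky) from the 106-bit product of a 53x53 multiply."""
    guard = (product106 >> 52) & 1                         # next bit after the fraction
    sticky = int((product106 & ((1 << 52) - 1)) != 0)      # OR of the 52 LSBs
    return guard, sticky

# Example: a product whose dropped bits are 100...0 -> guard=1, sticky=0
print(rounding_bits(1 << 52))  # (1, 0)
```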
In order to increase the throughput in terms of area, power, and timing, in this paper we applied the pipelining concept. Pipelining here means dividing the data into exponent and fraction subunits and processing them separately: the exponent is added in one sub-pipe, and then normalization is performed in a common normalization sub-pipe to produce the output (a toy software model of this follows Figure 1).
Figure 1: Flow graph for multiplication using two operands A and B
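As a toy illustration of the two-subunit split (a behavioral sketch of our own, not the paper's design), the model below pushes operand pairs through two register stages; once the pipe is full, a result emerges every cycle while the first stage already works on the next operands:

```python
def exponent_pipe(ops, bias=1022):
    """Toy two-stage pipeline: stage 1 adds the exponents, stage 2 applies
    the bias correction (standing in for the normalization sub-pipe)."""
    s1 = s2 = None                                  # pipeline registers
    out = []
    for item in list(ops) + [None, None]:           # extra cycles drain the pipe
        if s2 is not None:
            out.append(s2 - bias)                   # stage 2: bias/normalize
        s2 = (s1[0] + s1[1]) if s1 is not None else None  # stage 2 latches stage 1
        s1 = item                                   # stage 1: latch new operands
    return out

print(exponent_pipe([(1023, 1023), (1030, 1000)]))  # [1024, 1008]
```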
V. SPARSE MATRIX
Sparse Matrix-Vector Multiplication (SMVM) is the critical
computational kernel of many iterative solvers for systems of
sparse linear equations. In this paper we propose an FPGA
design for SMVM which interleaves CRS (Compressed Row
Storage) format so that just a single floating point accumulator
is needed, which simplifies control, avoids any idle clock
cycles and sustains high throughput. The limited memory bandwidth of this architecture heavily constrains the demonstrated performance. However, the use of FIFO buffers
to stream input data makes the design portable to other FPGA-
based platforms with higher memory bandwidth.
Figure 2: Example showing sparse matrix-vector multiplication
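For reference, here is a minimal software model of SMVM over the CRS format (our own Python sketch, mirroring the single-accumulator style described above rather than the actual hardware):

```python
def smvm_crs(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in Compressed Row Storage (CRS)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        acc = 0.0                                   # the single accumulator
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]        # one MAC per non-zero
        y[row] = acc
    return y

# Example: the 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(smvm_crs(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

Note how the indirect access `x[col_idx[k]]` is exactly the irregular memory pattern that penalizes microprocessors, as discussed next.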
As the floating-point performance achievable on FPGAs has risen beyond that of processors, the motivation for the scientific community to move computationally intensive kernels to FPGAs and improve the performance of scientific applications has grown considerably. When computing SMVM,
microprocessors have been shown to achieve very low floating
point efficiency. Compared to dense kernels, sparse kernels
incur more overhead per non-zero matrix entry due to extra
instructions and indirect, irregular memory accesses. Also, the
large number of operands required per result and minimal reuse
stress load/store units while floating point units are often
under-utilized. In FPGA implementation, data structure
interpretation is performed by spatial logic. Thus, by using
streaming of data to/from memory and fully pipelined
functional units, FPGA systems can obtain high levels of
performance compared to their clock speeds. Current FPGAs contain many more configurable logic blocks, which allows multiple copies of the same computation to be implemented.
Figure 3: Example
VI. MATRIX – VECTOR MULTIPLICATION
The matrix–vector multiplication architecture defines a
pipeline as a single multiply accumulate (MAC) unit. The
matrix operands are only used once, but the vector values can
be reused if stored locally. We assume that these vector values
are stored in a single on-chip memory accessible to every
pipeline. To save on memory bandwidth, the vector value is stored until all multiplications using it are finished, as shown in Figure 4. Since each element in the vector will be multiplied by an element in every row of the matrix, the number of iterations I required to complete the calculation of a single value in the resulting vector is M for M × N matrices. This design performs best with one pipeline for every element in the vector, so the results are completed in parallel. This will require N pipelines, or N uses of a single pipeline. Figure 5 shows the organization of multiple pipelines. Each use of a pipeline produces one value in the resulting vector. If there are fewer pipelines than the size of the output vector, some pipelines will compute multiple values in the resulting vector, as modeled in the sketch following Figure 5.
Figure 4: Pipelined architecture
Figure 5: Multiple pipelined architecture
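The following is a small behavioral model of this organization (our own sketch, under the shared-vector-memory assumption stated above; the function and variable names are illustrative, not from the paper):

```python
def mac_pipelines(A, x, num_pipelines):
    """y = A @ x, with output elements assigned to pipelines round-robin."""
    y = [0.0] * len(A)
    for p in range(num_pipelines):
        for i in range(p, len(A), num_pipelines):  # outputs handled by pipeline p
            acc = 0.0
            for a_ij, x_j in zip(A[i], x):         # one MAC per matrix element
                acc += a_ij * x_j                   # x_j reused from shared memory
            y[i] = acc
    return y

# Example: 2 pipelines computing a 3-element result
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mac_pipelines(A, [1.0, 1.0], 2))  # [3.0, 7.0, 11.0]
```

With `num_pipelines` equal to the output size, every element is computed by its own MAC unit in parallel; with fewer, the round-robin mapping makes some pipelines produce multiple outputs, as described above.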
VII. RESULTS
Figure 6: Output waveform
Figure 7: Toggle coverage
Figure 8: Code coverage
VIII. CONCLUSION
In this paper we have implemented the pipelined matrix-vector multiplier using the sparse tree matrix. By analyzing the RTL code and synthesizing it on a Synopsys server, we concluded that the dynamic leakage power is reduced by 20.56 µW and the execution time is reduced by 4.71 ps, making our architecture faster and lower in power.
ACKNOWLEDGMENT
We acknowledge the help of Prof. Jaykrishnan P, who guided us and contributed greatly to the ideas and methodologies of this project. We also thank our program chair, Prof. Harish Kittur. We are indebted to our lab assistant, Prof. Karthikeyan, and to all the other faculty of the VLSI Systems department. Above all, we thank God almighty. We also thank our classmates and families for their support.