Efficient FFT-Based CNN Acceleration with Intra-Patch Parallelization and Flex-Stationary Dataflow
SP Sunny, S Das
2024 IEEE International Symposium on Circuits and Systems (ISCAS), 2024
This paper presents a novel Hadamard Product Generator (HPG) for FFT-based CNN acceleration that effectively addresses computation and energy bottlenecks. The proposed block uses Intra-Patch parallelization to optimize Complex Multiply and Accumulate (CMAC) unit utilization and maintains identical reuse behavior across patch elements. This scheme also offers multiple spatial unrolling schemes to increase resource reuse. Additionally, the proposed HPG leverages the Flex-Stationary dataflow to adaptively store tensors with high reuse opportunities in the on-chip memory. The prototype is implemented on the Zynq MPSoC (XCZU7CG). It showcases throughput gains of 8.16× for VGG-16 and 9.30× for AlexNet compared to the state-of-the-art frequency-domain accelerator, with an area overhead of only 28.51%. With a similar hardware configuration, it achieves an 8.82× average improvement over Eyeriss and a 7.87× improvement over FlexFlow in energy-delay product (EDP).
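For readers unfamiliar with the underlying operation, the sketch below illustrates why a frequency-domain accelerator is built around Hadamard products and complex multiply-accumulates: after a Fourier transform, a convolution summed over input channels becomes an element-wise complex product accumulated per frequency bin. It is a minimal, generic illustration and not the paper's HPG, Intra-Patch scheme, or Flex-Stationary dataflow. The function names (`dft`, `freq_domain_conv`, `direct_circular_conv`) are hypothetical, the signals are 1-D for brevity, a naive DFT stands in for an FFT, and kernels are assumed to be zero-padded to the patch length.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; stands in for a hardware FFT in this sketch."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def freq_domain_conv(inputs, kernels):
    """Circular convolution summed over channels via Hadamard products.

    inputs, kernels: lists of equal-length real sequences, one per channel
    (kernels are assumed to be zero-padded to the patch length).
    """
    n = len(inputs[0])
    acc = [0j] * n  # one complex accumulator per frequency bin
    for x, w in zip(inputs, kernels):
        X, W = dft(x), dft(w)
        for k in range(n):
            acc[k] += X[k] * W[k]  # one complex multiply-accumulate
    # A single inverse transform recovers the spatial-domain output.
    return [v.real for v in dft(acc, inverse=True)]


def direct_circular_conv(inputs, kernels):
    """Reference: the same circular convolution computed directly."""
    n = len(inputs[0])
    return [
        sum(x[(i - j) % n] * w[j] for x, w in zip(inputs, kernels) for j in range(n))
        for i in range(n)
    ]
```

Calling `freq_domain_conv` and `direct_circular_conv` on the same channels should agree up to floating-point rounding. Note that the accumulation over channels happens before the inverse transform, so only one inverse transform is needed per output, which is one reason the element-wise product stage tends to dominate the work in designs of this kind.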