DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators

Xing, Yu; Liang, Shuang; Sui, Lingzhi; Jia, Xijie; Qiu, Jiantao; Liu, Xin; Wang, Yushun; Wang, Yu; Shan, Yi

arXiv:1902.07463 (cs)

[Submitted on 20 Feb 2019 (v1), last revised 25 Jul 2019 (this version, v2)]

Abstract: The convolutional neural network (CNN) has become a state-of-the-art method in several artificial intelligence domains in recent years. Increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by a custom instruction set architecture (ISA) strike a balance between generality and efficiency, but considerable room for optimization remains. We propose DNNVM, a full-stack compiler that integrates optimizers for graphs, loops, and data layouts with an assembler, runtime support, and a validation environment. DNNVM works in the context of deep learning frameworks and transforms CNN models into a directed acyclic graph, XGraph. Based on XGraph, we recast the optimization challenges for both data layout and pipelining as graph-level problems. DNNVM uses a heuristic subgraph isomorphism algorithm to enumerate all potentially profitable fusion opportunities that leverage pipeline and data layout optimizations, and then searches for the best execution strategy for the whole computing graph. On the Xilinx ZU2 @330 MHz and ZU9 @330 MHz, naïve implementations without optimizations already achieve performance on par with the state of the art on our benchmarks, and the heterogeneous optimizations in DNNVM improve throughput further by up to 1.26x. Finally, on the ZU9 @330 MHz, we achieve state-of-the-art performance for VGG and ResNet50: a throughput of 2.82 TOPs/s and an energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38 TOPs/s for ResNet50 and 1.41 TOPs/s for GoogLeNet.
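To make the idea of enumerating fusion opportunities on a computing graph more concrete, the following is a minimal sketch in Python. It is not the heuristic subgraph isomorphism algorithm the abstract refers to: it only matches linear operator chains against fixed templates on a toy DAG. The graph encoding, the `find_chain_matches` helper, the operator names, and the templates are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph: node name -> (operator type, list of successor names).
Graph = Dict[str, Tuple[str, List[str]]]


def find_chain_matches(graph: Graph, template: List[str]) -> List[List[str]]:
    """Return every node chain whose operator types equal `template`.

    A chain is extended only through nodes with exactly one consumer,
    so fusing it would not hide a value that another node still needs.
    """
    matches: List[List[str]] = []

    def extend(chain: List[str]) -> None:
        if len(chain) == len(template):
            matches.append(chain)
            return
        _, successors = graph[chain[-1]]
        if len(successors) != 1:
            return  # no consumer, or fan-out: stop extending this chain
        nxt = successors[0]
        if graph[nxt][0] == template[len(chain)]:
            extend(chain + [nxt])

    for name, (op, _) in graph.items():
        if op == template[0]:
            extend([name])
    return matches


# Assumed example network: conv -> relu -> pool -> conv -> relu.
graph: Graph = {
    "conv1": ("conv", ["relu1"]),
    "relu1": ("relu", ["pool1"]),
    "pool1": ("pool", ["conv2"]),
    "conv2": ("conv", ["relu2"]),
    "relu2": ("relu", []),
}

# Assumed fusion templates; a real accelerator ISA would dictate these.
templates = [["conv", "relu"], ["conv", "relu", "pool"]]
candidates = {tuple(t): find_chain_matches(graph, t) for t in templates}
```

In this sketch `candidates` maps each template to the node chains it matches. A compiler in the spirit described by the abstract would then still have to choose among overlapping candidates, for example by estimating the cost of each resulting execution strategy; that selection step is not shown here.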

Comments:	18 pages, 9 figures, 5 tables
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1902.07463 [cs.DC]
	(or arXiv:1902.07463v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1902.07463

Submission history

From: Yu Xing
[v1] Wed, 20 Feb 2019 09:30:17 UTC (2,926 KB)
[v2] Thu, 25 Jul 2019 09:41:31 UTC (3,279 KB)
