
Peer reviewed version

Link to published version (if available):
10.23919/FPL.2017.8056758

Link to publication record in Explore Bristol Research
PDF-document

This is the author accepted manuscript (AAM). The final published version (version of record) is available online via IEEE at http://ieeexplore.ieee.org/document/8056758/. Please refer to any applicable terms of use of the publisher.

University of Bristol - Explore Bristol Research
General rights

This document is made available in accordance with publisher policies. Please cite only the published version using the reference above. Full terms of use are available: http://www.bristol.ac.uk/pure/about/ebr-terms
A Systematic Approach to Design and Optimise Streaming Applications on FPGA Using High-Level Synthesis

Mohammad Hosseinabady and Jose Luis Nunez-Yanez
Department of Electrical and Electronic Engineering University of Bristol, UK.
Email: {m.hosseinabady, j.l.nunez-yanez}@bristol.ac.uk

Abstract—This paper proposes a systematic approach to help designers to optimise a given streaming application for FPGAs using High-Level Synthesis (HLS). The proposed technique specifically addresses the two main issues in a streaming application that are determining the exact amount of loop unrolling in the HLS code to increase the throughput and finding the optimum buffers’ size to prevent deadlocks. To evaluate the proposed techniques two applications from the machine learning optimisation area are studied in the paper. These applications are Hessian-vector product and Conjugate Gradient (CG). The experimental results show up to 38x speed-up in throughput compared to the original streaming implementations provided by knowledgeable engineers using the dataflow, loop pipelining and FIFO channel related pragmas provided by the HLS tool. In addition, these applications show up to 2.98 GB/sec usage of memory bandwidth which is 93.1% of the total memory bandwidth available on the system. The source codes of the designs are available at https://github.com/Hosseinabady/csdfg-hls.

I. INTRODUCTION

Recently High-Level Synthesis (HLS) has emerged as a promising approach for designing computation and communication tasks. HLS tools translate an application described in a high-level language such as C/C++/SystemC/OpenCL into a high-performance Register Transfer Level (RTL) code which can later be synthesised into a netlist to be used for the ASIC fabrication or FPGA configuration. However, describing algorithms in C/C++ to be synthesised efficiently is a challenge that requires proposing systematic approaches.

This paper proposes a systematic approach to provide a high-performance HLS implementation of an intrinsic streaming application. The main technique of this paper in providing a high-performance design is to saturate the memory bandwidth by transferring data to the FPGA that directly take part in the computation. Fig. 1 shows the overview of the proposed methodology. This approach starts with a formal modelling of the application (i.e., Path 1) in Fig. 1 which can be used to optimise the throughput and memory-bandwidth of the corresponding HLS description. The proposed formal model is based on the Cyclo-Static Data Flow Graph (CSDFG) [1] which will be augmented with parameters (i.e., Path 3 in Fig. 1) provided by the HLS tool synthesising the original implementation of the design (i.e., Path 2 in Fig. 1). The optimization techniques direct the designers (i.e., Path 4 in Fig. 1) to add HLS pragmas to the application code and use a specific coding structure to improve performance and memory throughput. This paper addresses Paths 3 and 4 and assumes that designers are already familiar with Paths 1, 2, and 5. We propose to use the actor initiation interval (II), obtained by an HLS tool, instead of the common notion of time in the CSDFG. This helps us to propose a clock-based definition for throughput and buffers’ length. Using the modified CSDFG two techniques are proposed to reduce the II and to detect artificial deadlocks. Whereas the former, called token grouping, increases the throughput the latter determines the minimum buffer length to prevent deadlocks.

For evaluation, we applied the proposed approach in designing two optimisation algorithms used in the area of machine learning as case studies. The first algorithm is Hessian-vector product which is a compute-intensive operator used in logistic regression. The second algorithm is Conjugate Gradient (CG) which is an iterative optimisation algorithm used in the deep learning area. The results of running these applications on the Xilinx Zynq SoC shows up to 38x speed-up in the throughput compared to the original streaming implementations provided by knowledgeable engineers using the dataflow, loop pipelining and FIFO channel related pragmas provided by the HLS tool. Furthermore, the applications consume up to 93.1% of the total bandwidth available on the four 64-bit High-Performance (HP) ports at 100MHz which is 3.2 GByte/sec.

The rest of this paper is organised as follows. The next section clarifies the motivations behind this paper and explains our contributions. Section III briefly explains the previous work. The details of the proposed methodology are explained in Section IV. Section V shows how to use the proposed approach in designing two algorithm as case studies. Finally, Section VI concludes the paper.

II. MOTIVATIONS AND CONTRIBUTIONS

Using a constructive example, this section explains the motivations and contributions of this paper. Let’s consider \( y = A^T A x \) operator as a running example in which \( A \) is a matrix of size \( n \times m \), and \( x \) and \( y \) are two vectors of size \( m \). This operator and its modified versions appear in many applications such as equation and eigenvalue solvers, iterative Newton methods for logistic regression [2]. An implementation can perform this operator using two matrix vector product operations. The first product calculates the \( z = A x \) expression and the second one produces the output \( y = A^T z \). The dataflow graph in Fig. 2a models this operator. This graph consists of five nodes in which Node n1 reads the \( x \) vector into a memory accessible by Node n3. The n2 node reads matrix \( A \) from the main memory and sends it to Node n3 and n4. Nodes n3 and n4 perform the two aforementioned matrix vector products. Finally, Node n5
Synchronous Dataflow Graph (SDFG) and its generalisations such as CSDFG, Multidimensional SDFG have been studied extensively in modelling and designing streaming applications. Stuijk et al. [3] study the trade-off space between throughput and buffer size in cyclo-static and synchronous data flow graph. For this purpose they propose a systematic technique to formulate the problem and approximate the optimum buffer-size. Similarly, Benazouz et al. [4] minimizes the cyclo-static dataflow graph using new formulation. However, they have not considered the pipelined actors and the overlap between execution of two consecutive firings of an actor. Unlike this technique, our approach tries to explain the capability of CSDFG in modelling streaming applications to be synthesised by an HLS tool.

### IV. Proposed Methodology

This section first defines the CSDFG modelling tool, augmented by the initiation interval and formulates the throughput and the memory bandwidth utilisation of a design. Then, using the parameters of throughput and bandwidth, it proposes the optimisation techniques.

#### A. CSDFG+II

A CSDFG is a directed graph in which nodes represent tasks, known as actors, and links denote communication media or channels between tasks. An actor in a cyclo-static dataflow graph (CSDFG), known as a multi-phase actor, has a periodic sequence of firings (or phases) with the size of \( N \), each of which consumes a fixed number of tokens. Actors may execute different functions, defined by \( f_{N-1}, ..., f_1, f_0 \) during their phases. The consumption and production rates of tokens on each channel are represented by a sequence of numbers in contrast to the only one token rate in SDFG. The rates at each port \( p \) (where a channel is connected to an actor) are denoted by \( t_{N-1}(p), ..., t_1(p), t_0(p) \). For the sake of simplicity, in this paper, we assume that each actor consumes and produces its tokens at the first and last lock cycle in each phase, respectively. Fig. 4 shows the CSDFG corresponding to the SDFG in Fig. 2b. In this paper, we restrict our discussion to acyclic CSDFG as the HLS tool considered in this paper only support this type of streaming dataflow synthesis. Note that for the sake of simplicity, self-loops around each actor which represent their states (e.g., loops’ indices) are omitted. In this case, each actor has \( N = m \) phases.

**Initiation Interval (II):** The initiation interval of two consecutive phases of an actor (denoted by \( II_i(a) \) for phases \( i \) and \( i + 1 \) of the \( a \) actor) is defined as the number of cycles between the starting point of the two phases. In this paper, we assume that the lower bound of the number of cycles between all consecutive phases of a given actor \( a \) are equal and denoted by \( II_{HLS}(a) \) determined by an HLS tool. Therefore, \( \forall i \in \{0,1, ..., N-2\}, II_i(a) \geq II_{HLS}(a) \).
B. Throughput Optimization

Equ. 1 represents the upper-bound of throughput in a CSDFG augmented with II.

\[ Thr(a) = \frac{\sum_{i=0}^{N(a)} t_i(a)}{\sum_{i=0}^{N(a)} I_i(a)} \leq \frac{\sum_{i=0}^{N(a)} t_i(a)}{N(a) II_{HLS}(a)} \]  

There are three main techniques to improve this throughput: utilising the wide-bus width, employing multiple memory ports and reducing the initiation interval. The experimental results show the impact of the first two techniques and the detail of the last one is described in the sequel.

The main factor determining the throughput of a streaming application mapped on FPGA is the achievable II at runtime. The initiation interval depends on the design and HLS tools and there are a few general techniques to improve it in a given HLS tool. In this paper, we propose a structural technique to increase the CSDFG throughput in which the actor with high II processes more than one token in its firing which is explained in the sequel.

Let’s consider a computational actor a with II(a) = k > 1 and N(a) = m. Approximately, it takes km cycles to finish its task. However, if the actor can process k tokens in each phase without changing its II then N(a) = m/k and consequently the number of cycles to finish the task reduces to (m/k)k = m. This technique is called token grouping. For example, the initiation interval of actor a3 in Fig. 4 is 5 then by utilising this technique the throughput increases by a factor of 5. The modified CSDFG is shown in Fig. 5 in which the a3 actor consumes all the tokens generated by actor a2 with II(a2) = 1 and still the FIFO implementing channel c3 has the length of 1.

To increase the throughput of the CSDFG, the bottleneck actor which has high II should be found and unrolled with the factor of its II without changing its original initiation interval. This situation is possible for many compute-intensive applications. Section V shows this situation for two case study benchmarks. A systematic way to implement the token grouping is using the partial loop unrolling which is one of the basic optimization techniques in HLS tools.

C. Bounded Buffer

An insufficient communication buffer between two actors may lead to stall insertions into pipeline stages or even may induce an artificial deadlock mainly because of an unbalanced latency of two divergent paths that converge at an actor. This latency can have two sources: the input data-size and the design structure. For example in our running example of Fig. 4, the two paths a2 → a3 → a4 and a2 → a4 can cause a deadlock if the size of buffer implementing the c3 channel is not sufficient to keep the tokens for actor a4 when it is required. Actors a2 and a4 are the source and sink of the reconvergent paths, respectively. Actor a4 can consume the tokens on the c2 channel on each phase, as based on the graph, the source actor a2 is able to provide them. However, actor a3 cannot immediately consume a required token from the c3 channel, as actor a3 provides that with a delay on its last phase, after II(a3)m + l_{a3} cycles, in which II(a3) is the initiation interval of a3, m is the number of iterations that a3 requires to generate a token on the c3 channel and l_{a3} is the latency of this actor. Therefore, the first path i.e., a1 → a2 → a3 has the latency of mII + l_{a3} cycles to deliver the first token to actor a3 while during this time actor a2 generates tokens that should be buffered in c3. Since II(a2) determines the rate of a2 generated tokens, then the lower bound of buffer size should be (mII(a3) + l_{a3})/II(a2). According to the discussion in previous subsection II(a2) = II(a3) = 5 and the latency of a3 actor is 12 obtains after synthesis. Therefore, the minimum buffer size on the c3 channel is [m + 12/5] = m + 3. Note that, if the actor reads the token on channel c3 with an offset of s clocks from the start of the iteration, it should be added to the length of the buffer.

V. Case Studies

This section explains two tasks used in machine learning considered as case studies. The source codes can be found at out Github site [5].

A. Hessian-vector product

The Hessian-vector product is defined as Equ. 2 in which d is a vector of size N and I, X and D are identity, normal and diagonal matrices of size N × N, respectively. Note that the D can be represented by a vector of size N. Similar to the approach in [2], the right hand side of this equation can be calculated as four terms connecting together serially: Xd, D(Xd), X^T (DXd) and d + X^T (DXd). In the rest of this paper, we assume the size of matrices and vectors in the Hessian-vector product are 1000×10000 and 10000, respectively. In the sequel, we will apply our design flow shown in Fig. 1 to this example.

\[ \nabla^2 f(w) = (I + X^T DX)d \]  

Path (1) in Fig. 1: Fig. 6a shows the corresponding CSDFG of a streaming implementation that a knowledgeable engineer who is familiar with stream computing will achieve using the pragmas available in the HLS tool. This CSDFG has a similar structure as the running example except for the extra actor and channel.

Path (2) in Fig. 1: The designer may use the RTL simulation provided by the HLS tool to determine the channels’ buffer size through a trial and error effort or may be consider a high value for buffers’ size. The second row of the table in Fig. 7b, denoted by or label, shows the resource utilisation of this implementation after synthesis by the Xilinx Vivado-HLS tool. The execution time of this implementation is about 50.2msec as shown in Fig. 7a.

Path (3) in Fig. 1: The HLS synthesis also reveals the actors’ II which are shown in the CSDFG of Fig. 6a. The two path a1 → a2 → a3 → a4 and a1 → a4 in Fig. 6a converge with unbalanced latency. Therefore, the lower bound for the buffer size on path a1 → a4 is \[ |N + (l_{a2} + l_{a3})/II_{a3}| = N + 4. \]

Path (4) in Fig. 1: Note that the token grouping can be applied to Actor a2 as it is the bottleneck in achieving the high throughput thanks to its high II. Applying the token grouping on actor a2 by processing 5 tokens (equal to its initiation interval) in each iteration improves its throughput by a factor of 5 (about 5 times faster as shown in Fig. 7a). In addition its corresponding resource utilisation is reported in the third row of table in Fig. 6a. The wide-bus and multiple memory port utilisations techniques focus on actor a1 to improve the throughput. The fourth and fifth rows of table in Fig. 6a shows the corresponding resource utilisation of considering wide bus and multiple memory port, respectively.

Path (5) in Fig. 1: Fig. 7a depicts that applying all optimisation techniques on the Hessian vector product speeds up the implementation by a factor of 38.0 compared to the original...
bandwidth utilisation is $256\text{MB}/0.86\text{sec} = 2.98\text{GB/sec}$ which shows $2.98/3.2 = 93.1\%$ efficiency.

VI. CONCLUSIONS AND FUTURE WORK

This paper has proposed a systematic design flow for streaming applications to be synthesised by a high-level synthesis tool for mapping on a target FPGA-based embedded system. The design flow utilises CSDFG model to describe the application. This model is annotated with information obtained from the HLS tool. The CSDFG is used to estimate the buffer size used as the communication channels between actors. Applying the proposed design flow to two applications (Hessian-vector product and conjugate gradient) shows its capability to detect bottlenecks and improve their execution time by a factor up to 38x.

REFERENCES