
Peer reviewed version

Link to published version (if available):
10.1109/MMSP.1999.793940

Link to publication record in Explore Bristol Research

PDF-document
FAST 2D-DCT IMPLEMENTATIONS FOR VLIW PROCESSORS

O.P. Sohm, D.R. Bull, C.N. Canagarajah
Centre for Communications Research
University of Bristol
Bristol, UK

Abstract - This paper analyzes various fast 2D-DCT algorithms regarding their suitability for VLIW processors. Operations for truncation or rounding which are usually neglected in proposals for fast algorithms have also been taken into consideration. Loeffler's algorithm with parallel multiplications [1] was found to be most suitable due to its parallel structure.

INTRODUCTION

The Discrete Cosine Transform (DCT) forms the basic building block of many image and video coding standards including the recently finalized MPEG-4 standard. Together with motion estimation the 2D-DCT is one of the most computationally intensive algorithms in a video coder. Therefore, for high bit rate applications, the DCT has usually been realized in dedicated hardware. The high performance of recently introduced VLIW-type digital signal processors [2] permits the DCT to be implemented in software.

This paper discusses properties of DCT algorithms that influence their performance on VLIW processors, such as arithmetic complexity, computational structure and accuracy. Various approaches to fast DCT algorithms are summarized and analyzed regarding their suitability for VLIW processors. Finally, performance results are discussed for implementations of selected algorithms on the Texas Instruments DSP TMS320C6201.

VLIW PROCESSORS

VLIW (Very Long Instruction Word) processors achieve their high performance by exploiting instruction-level parallelism (ILP) [3] through multiple parallel functional units. The main task remains with the compiler or programmer that has to optimize the data and control flow of a program with the aim to schedule as many operations as possible in parallel.

DCT algorithms, as most other numeric algorithms, have a highly parallel data flow and no control flow except the loop structure. Therefore they contain a high amount of instruction-level parallelism. Conventional digital signal processors, however, are not able to exploit much of this parallelism.
PROPERTIES OF DCT ALGORITHMS

A reduction in the number of arithmetic operations is often essential for both hardware and software implementations, but is not the only requirement for a fast implementation. Another very important property of an algorithm is its computational structure. For hardware implementations, the minimization of the number of multiplications and a regular structure are essential. However, for VLIW processors, a structure that is more suitable to their architecture can easily outweigh the advantage of saving one or two multiplications. Because there are typically more functional units available for additions than for multiplications or shift operations, multiplications and shifts are computationally more expensive. Shift operations which are required in practical implementations for truncation/rounding of the results of fixed-point multiplications and normalization are usually neglected in many proposals of fast algorithms. Other resource requirements, such as the number of registers, cannot be reflected in a measure of arithmetic complexity. For this purpose the structure of the algorithm has to be analyzed.

The accuracy of the computation is determined by the number of hits of the multiplier. Cosine coefficients have to be converted into a fixed-point representation for which the number of hits can be chosen according to the desired accuracy. For improved accuracy, results of multiplications can be rounded instead of just truncated. To further improve accuracy it is beneficial to keep the extended precision representation after a multiplication by rounding after summing up the products.

FAST ALGORITHMS

This section summarizes different approaches to fast DCT algorithms and analyses their suitability for VLIW processors.

Direct 2D approaches, such as [4], have a lower arithmetic complexity [5, 6] than row-column approaches but the processing of large data vectors causes register values to be stored temporarily in memory (register spilling) if the CPU does not have a sufficient number of registers. This can degrade execution speed drastically. Also, the complex computational structure prevents an implementation in a simple loop structure which results in increased code size.

Other approaches alter the cosine coefficient matrix in such a way that the number of multiplications and additions are reduced [7]. A further saving in the number of multiplications can be achieved by evaluating $2 \times 2$ sub-matrices as rotations [8].

Another method for computing a 2D-DCT is to first perform real-DFT's on the columns and then on the rows. A complex 2D-DFT is then established and rotations applied to obtain the 2D-DCT [8]. This is a kind of hybrid approach involving methods from the row-column and the direct 2D approach. Therefore, the same problems as with other direct 2D approaches exist.
Polynomial approaches, such as [9], map the DCT into complex polynomials. This representation of the DCT is similar to that of the DFT. A fast algorithm is then obtained through a polynomial transform similar to the FFT. This method can be used for 2D and row-column approaches.

Recursive algorithms [10] compute the DCT through lower order DCT's, e.g. the 8-point DCT is computed using two 4-point DCT's each of which are in turn computed using two 2-point DCT's. However, due to a required post-computation stage the overall number of stages is greater than that of other algorithms.

In multiplication-free structures [11], multiplications have been replaced by shift operations. This might be beneficial for hardware implementations, but in the case of VLIW processors such an implementation would not fully utilize resources since the multiplier units would be left unused. Additionally, the computational complexity would increase because one multiplication is replaced with up to 6 shift operations.

IMPLEMENTATION RESULTS

Among the various algorithms analyzed [4, 9, 6, 7, 1, 10, 11], the algorithm by Loeffler et al [1] with parallel multiplications (LLM PM) was found most suitable for VLIW processors. The standard algorithm (LLM) is shown in Figure 1. To obtain the version with parallel multiplications, the odd part (after the first stage) has to be replaced with the flow graph shown in Figure 2. The algorithm requires 12 multiplications and 32 additions. Although there are algorithms with fewer multiplications and additions, such as the standard LLM algorithm, the suitability of the LLM PM algorithm becomes apparent if practical implementation issues are considered.

Generally, in a row-column DCT algorithm without rounding shift operations are required for two purposes: (1) truncation after fixed-point multiplications and (2), at the end of the column DCT, truncation to normalize the output values. To perform rounding instead of truncation, addition operations for adding rounding constants have to be introduced.

The difference of the two LLM algorithms lies only in the odd part which shall now be compared to highlight the advantages of the PM structure. The LLM PM algorithm requires 4 shift operations and, if rounding is desired, one rounding add. For the row DCT the shifts perform the truncation required after the multiplications, and in the column DCT the shifts perform truncation and normalization. The cascaded multiplications structure of the standard LLM algorithm (Figure 1) requires two more shift operations and two more addition operations for rounding. The shifts are required because the multiplier results from the rotation C1 are fed again into multipliers and need to be truncated before. Both, the LLM standard and PM algorithm were implemented on the TMS320C6201 in assembly language with and without rounding. Table 1 shows a summary of the characteristics and the performance of the implementations.
As can be seen in the table, the PM algorithm requires fewer shift operations than the standard algorithm because no multiplications are in cascade. Without rounding operations, the standard algorithm appears to have a slight advantage with its fewer additions and multiplications and therefore executes faster. However, considering the implementations with rounding, one can see that the advantage of fewer additions is no longer relevant if the additions for rounding are taken into account. Both implementations require 34 and 35 additions for the row and column DCT, respectively. The disadvantage of the PM algorithm having one more multiplication is outweighed by the reduced number of shift operations. This makes the PM algorithm with 266 cycles clearly faster than the standard algorithm (297 cycles). As expected, the PM algorithm is superior in terms of accuracy because cascaded multipliers are avoided. The achieved accuracy of 0.0364 (mean square error) compares well with the accuracy of the floating-point implementation which is 0.0292.

**CONCLUSIONS**

For practical software implementations of fast DCT algorithms, other issues than multiplicative complexity have to be taken into account in order to achieve good performance. The structure of the algorithm determines the number of rounding (additions) and shift operations required. In particular shift operations, which can be implemented with almost no cost at all for VLSI, can be as computationally expensive as multiplications for software...
Figure 2: Odd stage of Loeffler's algorithm with parallel multiplications (LLM PM)

<table>
<thead>
<tr>
<th>Alg.</th>
<th>Rounding</th>
<th>Adds</th>
<th>Mults</th>
<th>Shifts</th>
<th>Total</th>
<th>Accuracy</th>
<th>Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>Std</td>
<td>no</td>
<td>29</td>
<td>11</td>
<td>8/10</td>
<td>98</td>
<td>0.4845</td>
<td>255</td>
</tr>
<tr>
<td>PM</td>
<td>no</td>
<td>32</td>
<td>12</td>
<td>6/8</td>
<td>102</td>
<td>0.2968</td>
<td>272</td>
</tr>
<tr>
<td>Std</td>
<td>yes</td>
<td>34/35</td>
<td>11</td>
<td>8/10</td>
<td>109</td>
<td>0.0552</td>
<td>297</td>
</tr>
<tr>
<td>PM</td>
<td>yes</td>
<td>34/35</td>
<td>12</td>
<td>6/8</td>
<td>108</td>
<td>0.0364</td>
<td>266</td>
</tr>
<tr>
<td>Float</td>
<td>n/a</td>
<td>26</td>
<td>16</td>
<td>n/a</td>
<td>84</td>
<td>0.0292</td>
<td>n/a</td>
</tr>
</tbody>
</table>

Table 1: Number of operations and performance results of implementations of the LLM standard and PM algorithm with and without rounding (the number of operations is the same for both row and column DCT unless stated as row/column)

implementations. The structure of an algorithm also determines how efficient the resources of an VLIW processor can be exploited. As shown, resource requirements, the most important being the number and type of functional units and the number of registers, can be estimated through analysis of the algorithm structure, but only the actual implementation will reveal the true performance.

References


