While working on my current Master’s thesis involving FPGA development, I found that it was hard to find readable examples of intrinsically two-dimensional filters that cannot be simply decomposed into a horizontal and a vertical convolution, in a way that e.g. a Gaussian filter or a Sobel filter can. Non-separable convolutions or filters require you to calculate every output pixel as a function of its surrounding input pixels, i.e. a window that extends in both dimensions at once. After reviewing multiple sources and trying out various methods myself, I believe to have found a fast and resource-efficient way to do it! In order to keep the length of this article at bay, I will assume that you are at least somewhat familiar with Field-Programmable Gate Arrays (FPGAs), High-Level Synthesis (HLS) and the concept of pipelining.

*TL;DR: If you just wish to see the final result, jump to the last paragraph 😉*

### Background and task

HLS is a wonderful technology which vastly reduces the design time of complex hardware logic. However, any user will quickly see that, despite being admitted to use a well-known software programming language (C/C++) for the task of designing hardware logic, many characteristic difficulties of hardware development still remain noticeable and perhaps even unavoidable. Performing filter operations efficiently is one such task that is much less trivial to implement on an FPGA than on a regular processor.

While filters and convolutions are not exactly the same thing, the knowledge in this article applies to both of them. Even non-linear filters such as the bilateral filter can use this technique since there are no constraints on how the window is used during processing. Anyway, let’s consider a random image (shown below) as an example and perform a non-separable edge detection operation:

It is important to know that most FPGAs cannot fit a full image on the order of 1 MB onto their block RAM or other forms of local storage. Therefore, the data will have to be read from and written to a memory location outside of the FPGA in real time during the execution.

### Version 1: a naive implementation

A straightforward implementation that would work just fine on a CPU is shown below. Side note: I added two HLS directives (PIPELINE and ARRAY_PARTITION), please refer to this guide if you’re uncertain about what these mean.

#define WIN_SIZE 3 // must be odd #define HALF_SIZE (((WIN_SIZE) - 1) / 2) inline bool bounds_ok(int y, int x) { return (0 <= y && y < HEIGHT && 0 <= x && x < WIDTH); } // Defines the actual calculation for one output value. inline int single_operation(int window[3][3], int y, int x) { int result = 0; win_i : for (int i = -HALF_SIZE; i <= HALF_SIZE; i++) win_j : for (int j = -HALF_SIZE; j <= HALF_SIZE; j++) if (bounds_ok(y + i, x + j)) result += window[i + HALF_SIZE][j + HALF_SIZE] * (i + j); return result; } // A simple implementation of a 2D filter. void my_filter_v1(int data_out[HEIGHT][WIDTH], const int data_in[HEIGHT][WIDTH]) { int window[WIN_SIZE][WIN_SIZE]; #pragma HLS ARRAY_PARTITION variable=window complete for_y : for (int y = 0; y < HEIGHT; y++) { for_x : for (int x = 0; x < WIDTH; x++) { #pragma HLS PIPELINE // Load window load_i : for (int i = -HALF_SIZE; i <= HALF_SIZE; i++) load_j : for (int j = -HALF_SIZE; j <= HALF_SIZE; j++) if (bounds_ok(y + i, x + j)) window[i + HALF_SIZE][j + HALF_SIZE] = data_in[y + i][x + j]; // Calculate output value int val_out = single_operation(window, y, x); data_out[y][x] = val_out; } } }

So what exactly is wrong with this code? The problem is that communication with the outside world is expensive. The amount of simultaneous I/O accesses per clock cycle is strictly limited, creating a bottleneck which becomes even worse for large window sizes. For a 3×3 window, 9 input pixels have to be read for every single write to an output pixel. To illustrate this bottleneck, let’s synthesize our project on the Zedboard (Zynq 7020 SoC) and analyze the results:

Clearly, the excessive amount of read operations is slowing down the global throughput. The pipelined initiation interval is forced to be at least 5 clock cycles, because it is constrained by the I/O bottleneck. As a result, only one output pixel gets calculated every 5 cycles. However, this is a best case scenario because HLS does not take into account the real-world communication overhead at all. If we implement this on a SoC with an AXI master interface for example, the actual initiation interval will inevitably increase even further! So, how can we circumvent these issues and maximize the throughput in regime?

### Version 2: efficient streaming using line buffers

The solution is to ensure that `data_out`

and `data_in`

should be accessed as few times as possible. The ideal case is of course that every position gets accessed just *once *during the whole execution. Luckily, this can be done by using two cooperating data structures: a **line buffer** and a **window**.

The figure above shows how the input pixels are stored at any point in time for a 3*3 window. The line buffer has dimensions 2*width and serves as sort of a cache that fits *just *enough elements to fill the rightmost column of the window when it shifts at every new iteration. As such, only one new input pixel has to be read from the I/O port into the window on every clock cycle. Before we jump into the details of the code, here is the report:

Indeed, every iteration consists of just one read (C6) and one write (C11) to the external data stream. Thanks to this technique, the throughput can finally reach its optimal value of one output pixel per clock cycle!

## Final code

#include "hls_stream.h" #define WIN_SIZE 3 // must be odd #define HALF_SIZE (((WIN_SIZE) - 1) / 2) inline bool bounds_ok(int y, int x) { return (0 <= y && y < HEIGHT && 0 <= x && x < WIDTH); } // Defines the actual calculation for one output value. inline int single_operation(int window[WIN_SIZE][WIN_SIZE], int y, int x) { int result = 0; win_i : for (int i = -HALF_SIZE; i <= HALF_SIZE; i++) win_j : for (int j = -HALF_SIZE; j <= HALF_SIZE; j++) if (bounds_ok(y + i, x + j)) result += window[i + HALF_SIZE][j + HALF_SIZE] * (i + j); return result; } // A buffered implementation of a 2D filter. void my_filter_buffer(hls::stream& out_stream, hls::stream& in_stream) { int line_buf[WIN_SIZE - 1][WIDTH]; int window[WIN_SIZE][WIN_SIZE]; int right[WIN_SIZE]; #pragma HLS ARRAY_PARTITION variable=line_buf complete dim=1 #pragma HLS ARRAY_PARTITION variable=window complete dim=0 #pragma HLS ARRAY_PARTITION variable=right complete // Load initial values into line buffer int read_count = WIDTH * HALF_SIZE + HALF_SIZE + 1; buf_x1 : for (int x = WIDTH - HALF_SIZE - 1; x < WIDTH; x++) #pragma HLS PIPELINE line_buf[HALF_SIZE - 1][x] = in_stream.read(); buf_y : for (int y = HALF_SIZE; y < WIN_SIZE - 1; y++) buf_x2 : for (int x = 0; x < WIDTH; x++) #pragma HLS PIPELINE line_buf[y][x] = in_stream.read(); // Copy initial values into window win_y : for (int y = HALF_SIZE; y < WIN_SIZE; y++) win_x : for (int x = HALF_SIZE; x < WIN_SIZE; x++) #pragma HLS PIPELINE window[y][x] = line_buf[y - 1][x + WIDTH - WIN_SIZE]; // Start convolution for_y : for (int y = 0; y < HEIGHT; y++) { for_x : for (int x = 0; x < WIDTH; x++) { #pragma HLS PIPELINE // Calculate output value int val_out = single_operation(window, y, x); // Write output value out_stream.write(val_out); // Shift line buffer column up right[0] = line_buf[0][x]; for (int y = 1; y < WIN_SIZE - 1; y++) right[y] = line_buf[y - 1][x] = line_buf[y][x]; // Read input value int val_in = 0; if (read_count < HEIGHT * WIDTH) { val_in = in_stream.read(); read_count++; } right[WIN_SIZE - 1] = line_buf[WIN_SIZE - 2][x] = val_in; // Shift window left for (int y = 0; y < WIN_SIZE; y++) for (int x = 0; x < WIN_SIZE - 1; x++) window[y][x] = window[y][x + 1]; // Update rightmost window values for (int y = 0; y < WIN_SIZE; y++) window[y][WIN_SIZE - 1] = right[y]; } } }

A few observations:

- HLS streams are used to enforce the desired single-read and single-write behaviour.
- Before starting the loop, the buffer is initially filled with a few rows of input pixels. This is to ensure that enough data is available to produce one output value right from the first iteration. There exist various alternative approaches, though.
- The method
`single_operation`

can be modified to your liking and must not necessarily contain a linear operation. It also knows the image position you are currently processing so that boundary effects can be handled well. - The line buffer is partitioned vertically (i.e. across rows only), so that vertical shifting can happen without dependency constraint problems. Every iteration, one column is shifted up so that the ‘net result’ after one row is that the whole line buffer is shifted up as well.
- The window on the other hand is partitioned in both dimensions, e.g. a 7×7 window will produce 49 separate registers so that they can be accessed (and shifted) fully in parallel.
- If your window becomes too large and/or your filter operation too complex, it is likely that resources may skyrocket because we employ pipelining in order to keep the throughput high. Luckily though, the memory usage should not be a concern anymore.

I hope this helped. Feel free to leave a comment with any question or suggestion!