...

To create a new HLS project:

Run HLS by typing vivado_hls into a Linux terminal or running it from the Windows start menu.
Select Create New Project and give it a suitable name and location. Again, remember that the Xilinx tools do not like spaces in project names or paths. Click next.
Set Top Function to toplevel. Click next. Click next.
Set the Clock Period to 10 (this is in nanoseconds, so corresponds to a 100MHz clock frequency, which is the default clock frequency provided to the FPGA fabric).
Set the Part to xc7z010clg400-1. Click OK.
Click Finish.

Warning

title	Important

When you add files to an HLS project the default location will be a hidden folder (usually .settings) - this will cause problems if you do not change it. Either put files in the project root, or create a new folder called something like src and put all your source files here instead.

You are now looking at an HLS project, but it is empty. Create two new source files (Project | New Source) called toplevel.cpp and toplevel.h, and paste in the code below.

Note

title	Code Errors?

When you change source code in HLS (such as pasting in the below), the editor may highlight errors in the code that aren't actually problems. These should disappear once the project is built (it will tell you about any real error at this point). You can also use the "Index C Source" button to make HLS rescan your code, which can sometimes help.

Code Block

language	cpp
title	toplevel.cpp
collapse	true

#include "toplevel.h"

//Input data storage
#define NUMDATA 100

uint32 inputdata[NUMDATA];

//Prototypes
uint32 addall(uint32 *data);
uint32 subfromfirst(uint32 *data);

uint32 toplevel(uint32 *ram, uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4) {
	#pragma HLS INTERFACE m_axi port=ram offset=slave bundle=MAXI
	#pragma HLS INTERFACE s_axilite port=arg1 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg2 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg3 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg4 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS register

	readloop: for(int i = 0; i < NUMDATA; i++) {
		inputdata[i] = ram[i];
	}

	*arg2 = addall(inputdata);
	*arg3 = subfromfirst(inputdata);

	return *arg1 + 1;
}

uint32 addall(uint32 *data) {
    uint32 total = 0;
    addloop: for(int i = 0; i < NUMDATA; i++) {
        total = total + data[i];
    }
    return total;
}

uint32 subfromfirst(uint32 *data) {
    uint32 total = data[0];
    subloop: for(int i = 1; i < NUMDATA; i++) {
        total = total - data[i];
    }
    return total;
}

Code Block

language	cpp
title	toplevel.h
collapse	true

#ifndef __TOPLEVEL_H_
#define __TOPLEVEL_H_

#include <stdio.h>
#include <stdlib.h>
#include <ap_int.h>

//Typedefs
typedef unsigned int uint32;
typedef int int32;
 
uint32 toplevel(uint32 *ram, uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4);

#endif

Examine the Files

toplevel.cpp contains the structure that we will use throughout EMBS. The toplevel function has five arguments, each a pointer to uint32. This is a typedef in toplevel.h that describes an unsigned integer of width 32 bits.

The file also contains pragmas which are called directives by HLS. Directives are used to tell HLS how to make your hardware. The directives here tell HLS to create an AXI Master interface, and an AXI Slave interface. This is all described in the knowledge base. The Master interface allows the component to access main memory, and the slave interface allows the ARM cores to pass in a few variables and to start, reset, and stop the component. Once built and exported, your component will look like this in Vivado:

The file also contains our functionality. Note three important things:

There is no main function.
We have declared a function toplevel. This will be the 'entry point' of the hardware.
The loops in the code have been given labels (readloop, addloop and subloop). Most programmers don't do this, but it is useful in HLS, as you will see later.

Testing Components

The last practicals showed you just how long hardware synthesis takes. You should therefore be pretty confident that your hardware is correct before building it. A testbench is an important part of this. Testbenches test the functional properties of the code, to make sure that it doesn't contain any logical errors and that it does roughly what you want. Testbenches cannot test how fast the final hardware will be, because the design is only simulated in software.

In HLS, right click 'Test Bench' in the Explorer on the left and select New File. Call it testbench.cpp and put it somewhere sensible (in the same folder as your toplevel.h is recommended, otherwise you will need to edit the #include to have the relative path to your header file).

Copy in the following code:

Code Block

title	testbench.cpp
collapse	true

#include "toplevel.h"
#define NUMDATA 100

uint32 mainmemory[NUMDATA];

int main() {

    //Create input data
    for(int i = 0; i < NUMDATA; i++) {
    	mainmemory[i] = i;
    }
    mainmemory[0] = 8000;

    //Set up the slave inputs to the hardware
    uint32 arg1 = 0;
    uint32 arg2 = 0;
    uint32 arg3 = 0;
    uint32 arg4 = 0;

    //Run the hardware
    toplevel(mainmemory, &arg1, &arg2, &arg3, &arg4);

    //Read the slave outputs
    printf("Sum of input: %u\n", arg2);
    printf("Values 1 to %d subtracted from value 0: %u\n", NUMDATA-1, arg3);

    //Check the values are as expected
    if(arg2 == 12950 && arg3 == 3050) {
        return 0;
    } else {
        return 1; //An error!
    }
}

Things to note:

We declared a block of memory as "main memory". In the real system this will be the 1GB of DDR memory on the Zybo Z7 board, but for the testbench we simply allocate an array that is large enough for our purposes.
The testbench should return 0 if everything is OK. It checks the values returned from the hardware against pre-calculated values to ensure all is correct.

To run the testbench select Project | Run C Simulation and click OK in the dialog that appears. You should see HLS do quite a bit of work, but eventually you will see the output of the testbench.

Sum of input: 12950
Values 1 to 99 subtracted from value 0: 3050
@I [SIM-1] CSim done with 0 errors

We have verified our design. (You should probably test a real design a bit more rigorously!)
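One way to make the test a little more rigorous is to compute the expected values with an independent software model instead of hard-coding them, and to try more than one data set. The following is a minimal sketch of that idea, not a reference testbench. The helper names expected_sum, expected_sub, check_dataset and run_checks are hypothetical, and the pragma-free copy of toplevel is there only so that the sketch compiles on its own; in the real project you would include toplevel.h instead.

```cpp
#include <cassert>
#include <cstdio>
#include <numeric>

typedef unsigned int uint32;
#define NUMDATA 100

// Stand-in copy of the design under test (pragmas omitted) so that this
// sketch is self-contained. In the HLS project, #include "toplevel.h" instead.
uint32 toplevel(uint32 *ram, uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4) {
    uint32 sum = 0;
    uint32 diff = ram[0];
    for (int i = 0; i < NUMDATA; i++) sum += ram[i];
    for (int i = 1; i < NUMDATA; i++) diff -= ram[i];
    *arg2 = sum;
    *arg3 = diff;
    (void)arg4; // unused, as in the original design
    return *arg1 + 1;
}

// Software model of addall, written differently from the design on purpose.
uint32 expected_sum(const uint32 *data) {
    return std::accumulate(data, data + NUMDATA, static_cast<uint32>(0));
}

// Software model of subfromfirst: data[0] minus the sum of all other items.
// Unsigned arithmetic wraps modulo 2^32, exactly as in the design.
uint32 expected_sub(const uint32 *data) {
    return data[0] - (expected_sum(data) - data[0]);
}

// Runs the design once on the given memory contents and compares its outputs
// with the software model. Returns true if everything matches.
bool check_dataset(uint32 *memory) {
    assert(memory != nullptr);
    uint32 arg1 = 0, arg2 = 0, arg3 = 0, arg4 = 0;
    uint32 ret = toplevel(memory, &arg1, &arg2, &arg3, &arg4);
    bool ok = (arg2 == expected_sum(memory))
           && (arg3 == expected_sub(memory))
           && (ret == arg1 + 1);
    if (!ok) {
        printf("Mismatch: sum=%u diff=%u ret=%u\n", arg2, arg3, ret);
    }
    return ok;
}

// Tries several data sets, including one that wraps around 32 bits.
// Returns the number of failing data sets, so 0 means success.
int run_checks() {
    static uint32 memory[NUMDATA];
    int failures = 0;

    // The data set from the original testbench.
    for (int i = 0; i < NUMDATA; i++) memory[i] = i;
    memory[0] = 8000;
    if (!check_dataset(memory)) failures++;

    // All zeros.
    for (int i = 0; i < NUMDATA; i++) memory[i] = 0;
    if (!check_dataset(memory)) failures++;

    // Large values, so that the sum overflows and the subtraction underflows.
    for (int i = 0; i < NUMDATA; i++) memory[i] = 0xFFFFFFFFu - i;
    if (!check_dataset(memory)) failures++;

    return failures;
}
```

A testbench main could then simply return run_checks(), which keeps the convention that the testbench returns 0 only when everything is OK.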

High-level synthesis

So now we have a good design, we need to examine how to turn it into hardware. Near the top right of the window you should see a row of three buttons - Debug, Synthesis, and Analysis. These are perspectives and we will switch between them by clicking them. Click the Synthesis button to ensure you are in the synthesis perspective.

First, read the Important Terms You Should Know section of the Vivado HLS Knowledge Base. These are important terms and you should know them.

Open toplevel.cpp. Now click Solution | Run C Synthesis | Active Solution (or click the green arrow in the toolbar). This will begin synthesis. Synthesis is the process of turning the C++ description into hardware.

There are usually many different ways of achieving the same thing in hardware, all with different costs. HLS lets you explore these tradeoffs by applying directives.

Once synthesis completes, a synthesis report window will open. This tells you all about the design that you've just built. Under Performance Estimates look at the Latency summary. This should be similar to:

Latency		Interval
min	max	min	max	Type
611	611	612	612	none

This tells you that your design has an overall latency (time from first data in to last data out) of 611 clock cycles. Its interval is 612, which is the number of cycles from the first data in of one run until the design can accept the first data in of the next run.
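To relate these cycle counts to real time, multiply by the clock period. With the 10 ns clock set earlier, a latency of 611 cycles is about 6.11 microseconds, and an interval of 612 cycles allows roughly 163,000 runs per second. The sketch below shows the arithmetic; the function names are hypothetical, and the results are only as accurate as the estimates in the synthesis report.

```cpp
#include <cassert>

// Converts a cycle count into microseconds for a given clock period.
double cycles_to_us(unsigned cycles, double clock_period_ns) {
    assert(clock_period_ns > 0.0);
    return cycles * clock_period_ns / 1000.0;
}

// Number of complete runs per second if a new run can start every
// `interval_cycles` cycles.
double runs_per_second(unsigned interval_cycles, double clock_period_ns) {
    assert(interval_cycles > 0 && clock_period_ns > 0.0);
    return 1.0e9 / (interval_cycles * clock_period_ns);
}
```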

Expand the 'Loop' item under 'Detail' and you should see something like this:

	Latency			Initiation Interval
Loop Name	min	max	Iteration Latency	achieved	target	Trip Count	Pipelined
- readloop	200	200	2	-	-	100	no
- addloop	200	200	2	-	-	100	no
- subloop	198	198	2	-	-	99	no

This shows the three loops in the code, and why it is useful to give them labels. A loop's latency is the number of cycles it will take to complete. Sometimes HLS will not know this (if the loop bounds are not known at compile time, for example). The iteration latency is the number of cycles one iteration takes. Initiation interval is only valid for pipelined loops (see later), and trip count is the total number of iterations that will be computed.
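For a loop that is not pipelined, the numbers in this table are related in a simple way: the latency is the trip count multiplied by the iteration latency. The sketch below checks that relationship against the three rows above. The function names are hypothetical.

```cpp
#include <cassert>

// Latency of a loop that is not pipelined: every iteration runs to
// completion before the next one starts.
unsigned loop_latency(unsigned trip_count, unsigned iteration_latency) {
    return trip_count * iteration_latency;
}

// Checks the formula against the loop table in the synthesis report.
void check_loop_table() {
    assert(loop_latency(100, 2) == 200); // readloop
    assert(loop_latency(100, 2) == 200); // addloop
    assert(loop_latency(99, 2) == 198);  // subloop
}
```

The three loops add up to 598 cycles. The remaining 13 cycles of the 611-cycle total are spent outside the loops, presumably on interface and control overhead.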

Just as important as performance is utilisation. Look at the utilisation summary further down and you will see a table similar to this:

Name	BRAM_18K	DSP48E	FF	LUT
DSP	-	-	-	-
Expression	-	-	0	128
FIFO	-	-	-	-
Instance	2	-	770	1004
Memory	1	-	0	0
Multiplexer	-	-	-	110
Register	-	-	316	-
Total	3	0	1086	1242
Available	120	80	35200	17600
Utilization (%)	2	0	3	7

This shows an estimate of the resources that your design will use. This only includes the HLS component. The rest of the design uses FPGA resources too! Of primary interest to you in this table are BRAMs and LUTs. The table gives a rough breakdown of where your design is using most of its resources. You can expand each row for more detailed information should you need it.

For example, expand the 'Memory' item and you will see that HLS decided to implement our inputdata using an 18 kilobit Block RAM (BRAM_18K). Block RAMs are small chunks of very high speed memory spread throughout the FPGA fabric. They can be read from or written to in a single clock cycle, but with a maximum of two accesses per Block RAM per clock cycle. This will become important later.

Tuning synthesis with directives

The design we have is OK, but it can be improved with directives. These tell HLS how to synthesise your software into hardware. First, let's have a closer look at how your design is being implemented. Click the Analysis button (top right) to go to the Analysis perspective. This should open a Performance tab.

The rows are operations that come from your compiled code. The columns are states, and so looking vertically shows all the things that are happening in parallel. If two things are in the same column, they happen in parallel. Currently we can see that the addloop and subloop loops do not overlap, so they are not done in parallel. Note also that addloop and subloop have two states. This corresponds to the Performance Estimates we got from the synthesis report that told us the iteration latency of both these loops is two.

You can fully expand each loop to see the individual operations that take place in each state. You can also right click operations and select Goto Source to see the line of C++ code which created it, or the line of generated Verilog or VHDL that will create the actual FPGA hardware.

Pipelining loops

The first optimisation we will do is to tell HLS to pipeline addloop and subloop. They both have two states so they could be working on two elements of data at once. Without pipelining only one iteration will be running at a time, as shown in these diagrams.

Normal loop	Pipelined loop

Close the Performance report and go back to the Synthesis perspective. Open toplevel.cpp and select the Directive tab on the right. This tab shows the items in the source file that you can attach directives to. Find addloop, right click it and select Insert Directive. Sometimes HLS will complain that the file has been modified and you must save it, even if it is saved. If it does, close the file and reopen it.

In the Directive Editor dialog, select PIPELINE. It gives you two choices of where to store the directive. If you select 'Source File' it will edit the source file to add #pragmas. If you select 'Directive File' it will store them in a separate file. Directive File is normally preferable, because you can have multiple directive files for different optimisations of your design, but it is up to you. The default pipeline values are fine so click OK. Note the directive is now visible in the right panel. Repeat these steps for subloop.
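If you choose 'Source File', the directive is written into the code as a pragma inside the loop body. A sketch of what addall would then look like is shown here. The exact text that HLS inserts may differ slightly between versions, and check_addall is a hypothetical helper that only demonstrates that the C++ behaviour is unchanged.

```cpp
#include <cassert>

typedef unsigned int uint32;
#define NUMDATA 100

uint32 addall(uint32 *data) {
    uint32 total = 0;
    addloop: for (int i = 0; i < NUMDATA; i++) {
        // Equivalent of choosing PIPELINE with 'Source File' in the Directive Editor.
        #pragma HLS PIPELINE
        total = total + data[i];
    }
    return total;
}

// Directives only affect the generated hardware, so the software result
// is the same as before.
void check_addall() {
    uint32 data[NUMDATA];
    for (int i = 0; i < NUMDATA; i++) data[i] = i;
    data[0] = 8000;
    assert(addall(data) == 12950);
}
```

An ordinary C++ compiler ignores the HLS pragma (possibly with a warning), which is why the same file can be used for C simulation.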

Note: if a loop has one state there is nothing to pipeline, so this directive wouldn't do anything.

Save the file and re-run synthesis (click the green arrow). Once it is done look at the report. Immediately we see that the latency of the design is now 414 cycles, down from 611 before. This is because the two loops are processing multiple data items at once. Normally this costs a little extra hardware, but we are lucky. In this case it actually takes the same number of LUTs and slightly fewer flip-flops.

In the synthesis report expand Loop under Detail and you will see the two loops are now pipelined, and their Initiation Interval is now 1. This means a data item can be pushed into the loop each clock cycle. Their trip counts (number of times they execute) are 100 and 99, so their latencies are 100 and 99 cycles, down from the 200 and 198 of before.

Click the Analysis perspective. The pipelining has not made the loops take place in parallel. The speed boost comes from the fact that data items are pushed into the loops faster. Close the Performance tab and return to the Synthesis perspective.

Unrolling loops

More drastic than pipelining loops is unrolling them. The UNROLL directive tells HLS to try to execute the individual iterations of the loop in parallel. This is very fast, but can cost a lot more hardware depending on the level of unrolling.

In the Synthesis perspective, for toplevel.cpp right click the PIPELINE directives on the addloop and subloop loops and remove them. Sometimes a bug causes HLS to mess up the source code. If it does, fix it. Right click the addloop loop in the Directive panel and select Insert Directive. Select UNROLL. We could add a factor here to limit the unrolling, but let's leave it blank to tell HLS to unroll as much as possible. Click OK and repeat for subloop. Resynthesise.
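To make the idea of unrolling concrete, the sketch below shows subfromfirst with an UNROLL pragma as it might appear if the directive were stored in the source file, next to a hand-written version unrolled by a factor of 3. The hand-written function is purely illustrative (its name is hypothetical) and is not what HLS generates; it only shows that an unrolled loop does the same work in fewer, larger iterations. A factor of 3 is used because the loop has 99 iterations, which divides evenly.

```cpp
#include <cassert>

typedef unsigned int uint32;
#define NUMDATA 100

uint32 subfromfirst(uint32 *data) {
    uint32 total = data[0];
    subloop: for (int i = 1; i < NUMDATA; i++) {
        // No factor given: ask HLS to unroll the loop completely.
        // A partial unroll would be written as "#pragma HLS UNROLL factor=3".
        #pragma HLS UNROLL
        total = total - data[i];
    }
    return total;
}

// Illustration only: the same loop unrolled by hand with a factor of 3.
// 33 iterations of 3 subtractions cover items 1 to 99.
uint32 subfromfirst_unrolled3(uint32 *data) {
    uint32 total = data[0];
    for (int i = 1; i < NUMDATA; i += 3) {
        total = total - data[i];
        total = total - data[i + 1];
        total = total - data[i + 2];
    }
    return total;
}

// Both versions compute the same value as the original testbench expects.
void check_unrolled() {
    uint32 data[NUMDATA];
    for (int i = 0; i < NUMDATA; i++) data[i] = i;
    data[0] = 8000;
    assert(subfromfirst(data) == 3050);
    assert(subfromfirst_unrolled3(data) == 3050);
}
```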

Now our design latency is down to around 314 cycles, meaning overall we are running at almost twice the speed of our original design with no directives. However we are now using around 6000 LUTs - we are over 3 times the size! HLS has also decided to use 2 Block RAMs for memory instead of 1 so that more data can be accessed in parallel. This is a classic time/space tradeoff. Notice that if you expand 'Loop' in the synthesis report, only readloop is there now. The other two are gone, as they have been completely unrolled.

Click on the Analysis perspective. Our design looks quite different!

The first thing we notice is that functions are now visible. This is because previously the functions were so simple that HLS had automatically inlined them. Now they are huge bits of hardware, so it has not. Double-click one of the green function nodes to see inside:

The loops have gone and been replaced by a lot of operations. In essence, we previously had one loop with 5 states:

1. Load item X
2. Load item X+1
3. Add
4. Increment X
5. If X < 100 jump to 1

Whereas now we have 3 operations repeated 100 times. This is faster overall because we don't need the X variable or the less-than check. Notice that there is no state (column) where more than two data_V_load_X operations begin. These operations load from RAM (which you could verify by right clicking and selecting Goto Source). Why do only two happen at once? Recall that HLS implemented our inputdata array as a Block RAM, and Block RAMs have only two access ports, so they can support at most two accesses per clock cycle.
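The port limit puts a floor under how quickly the data can be fetched, however much the loop is unrolled. As a rough model that ignores everything except the number of ports, reading 100 items through two ports needs at least 50 cycles. The function name below is hypothetical.

```cpp
#include <cassert>

// Minimum number of cycles needed just to start `items` memory accesses when
// at most `ports` accesses can start per cycle (ceiling division).
unsigned min_access_cycles(unsigned items, unsigned ports) {
    assert(ports > 0);
    return (items + ports - 1) / ports;
}

// Checks the model for a dual-port and a single-port memory.
void check_port_limit() {
    assert(min_access_cycles(100, 2) == 50);  // dual-port Block RAM
    assert(min_access_cycles(99, 2) == 50);   // an odd count rounds up
    assert(min_access_cycles(100, 1) == 100); // one access per cycle
}
```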

Recall from the analysis perspective, HLS is not inlining the functions any more. Therefore the hardware is completing the addall function before starting the subfromfirst function. Let's force it to inline these functions so that it can schedule operations from both to happen together. Add the INLINE directive to the addall and the subfromfirst functions and resynthesise.
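If the directives are kept in the source file, INLINE goes at the top of the function body. The sketch below shows how the two functions might look at this point, with both UNROLL and INLINE applied. check_inlined is a hypothetical helper that confirms the software behaviour is unchanged.

```cpp
#include <cassert>

typedef unsigned int uint32;
#define NUMDATA 100

uint32 addall(uint32 *data) {
    // Ask HLS to merge this function into its caller.
    #pragma HLS INLINE
    uint32 total = 0;
    addloop: for (int i = 0; i < NUMDATA; i++) {
        #pragma HLS UNROLL
        total = total + data[i];
    }
    return total;
}

uint32 subfromfirst(uint32 *data) {
    // Ask HLS to merge this function into its caller.
    #pragma HLS INLINE
    uint32 total = data[0];
    subloop: for (int i = 1; i < NUMDATA; i++) {
        #pragma HLS UNROLL
        total = total - data[i];
    }
    return total;
}

// The results match the values expected by the original testbench.
void check_inlined() {
    uint32 data[NUMDATA];
    for (int i = 0; i < NUMDATA; i++) data[i] = i;
    data[0] = 8000;
    assert(addall(data) == 12950);
    assert(subfromfirst(data) == 3050);
}
```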

Now we are down to 261 cycles, and we've saved some hardware because HLS has been able to optimise across the two functions. We can still do better though!

Swapping Block RAMs for LUTs

Let's tell HLS not to use Block RAMs and to instead just use normal registers. In some designs this can be very expensive, but it will allow true parallel access. Close the Performance tab and return to the Synthesis perspective. Right click inputdata in the Directives tab and select Insert Directive. Insert an ARRAY_PARTITION directive of type complete. It will ask which function to apply it to. Select toplevel.

Also, apply UNROLL to readloop as well, so we can take full advantage of the distributed RAM.

You should have the directives below. Resynthesise.
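For reference, if all of the directives were stored in the source file rather than in a directive file, toplevel.cpp would look roughly like the sketch below. This is an assumption about the pragma form of the directives described in the text, not a copy of what the tool writes. check_toplevel is a hypothetical helper, and toplevel.h is replaced by an inline typedef so the sketch compiles on its own.

```cpp
#include <cassert>

typedef unsigned int uint32;
#define NUMDATA 100

uint32 inputdata[NUMDATA];

uint32 addall(uint32 *data) {
    #pragma HLS INLINE
    uint32 total = 0;
    addloop: for (int i = 0; i < NUMDATA; i++) {
        #pragma HLS UNROLL
        total = total + data[i];
    }
    return total;
}

uint32 subfromfirst(uint32 *data) {
    #pragma HLS INLINE
    uint32 total = data[0];
    subloop: for (int i = 1; i < NUMDATA; i++) {
        #pragma HLS UNROLL
        total = total - data[i];
    }
    return total;
}

uint32 toplevel(uint32 *ram, uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4) {
    #pragma HLS INTERFACE m_axi port=ram offset=slave bundle=MAXI
    #pragma HLS INTERFACE s_axilite port=arg1 bundle=AXILiteS register
    #pragma HLS INTERFACE s_axilite port=arg2 bundle=AXILiteS register
    #pragma HLS INTERFACE s_axilite port=arg3 bundle=AXILiteS register
    #pragma HLS INTERFACE s_axilite port=arg4 bundle=AXILiteS register
    #pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS register
    // Split inputdata into individual registers instead of a Block RAM.
    #pragma HLS ARRAY_PARTITION variable=inputdata complete dim=1

    readloop: for (int i = 0; i < NUMDATA; i++) {
        #pragma HLS UNROLL
        inputdata[i] = ram[i];
    }

    *arg2 = addall(inputdata);
    *arg3 = subfromfirst(inputdata);

    return *arg1 + 1;
}

// Same data and expected values as the original testbench.
void check_toplevel() {
    static uint32 mainmemory[NUMDATA];
    for (int i = 0; i < NUMDATA; i++) mainmemory[i] = i;
    mainmemory[0] = 8000;
    uint32 arg1 = 0, arg2 = 0, arg3 = 0, arg4 = 0;
    uint32 ret = toplevel(mainmemory, &arg1, &arg2, &arg3, &arg4);
    assert(arg2 == 12950 && arg3 == 3050 && ret == 1);
}
```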

We now have a tiny design latency of only 111 cycles. Because the design reads in 100 data items we know our design will never be faster than 101 cycles so this is pretty good! Also notice we now are not using any Block RAMs for memory and our LUT usage has not changed much either. Normally ARRAY_PARTITION will significantly increase LUT usage, but in this case our previous design had so many intermediate storage registers that we were basically already doing it. Remember that the first unroll increased our LUT usage by a factor of 3. This shows the importance of experimenting with directives, and using the Analysis perspective to work out what is happening in parallel.

So we now have a very fast design, but we also know how to make it smaller (and slower) if we needed to by pipelining instead of unrolling.
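To put the whole exploration side by side, the sketch below computes the speedup of each step over the original design, using the latency figures quoted in the text. These are synthesis estimates (the 314 in particular is described as approximate), and the type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

struct DesignPoint {
    const char *directives;   // directives applied at this step
    unsigned latency_cycles;  // latency quoted in the text
};

// Latencies quoted in the text (estimates from the synthesis reports).
static const DesignPoint design_points[] = {
    {"no directives", 611},
    {"PIPELINE addloop and subloop", 414},
    {"UNROLL addloop and subloop", 314},
    {"UNROLL and INLINE", 261},
    {"UNROLL, INLINE and ARRAY_PARTITION", 111},
};

// How many times faster a design is than the baseline.
double speedup(unsigned baseline_cycles, unsigned cycles) {
    assert(cycles > 0);
    return static_cast<double>(baseline_cycles) / cycles;
}

// Prints one line per design point with its speedup over the first entry.
void print_speedups() {
    const unsigned baseline = design_points[0].latency_cycles;
    for (const DesignPoint &p : design_points) {
        printf("%-36s %4u cycles  %.2fx\n", p.directives, p.latency_cycles,
               speedup(baseline, p.latency_cycles));
    }
}
```

The final design comes out at roughly five and a half times the speed of the original, at the cost of the extra LUTs discussed above.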

Implementing it for real

We will work purely inside HLS using test benches for now. Later practicals will take HLS designs and connect them up to the ARM processing system.

Version	Old Version 32	New Version 33
Changes made by	Ian Gray	Ian Gray
Saved on	19 Feb, 2021	12 Mar, 2021

