To create a new HLS project:

  • Run HLS by typing vitis_hls into a Linux terminal or running it from the Windows start menu.

  • In the hardware labs, this launcher hasn't always been set up, so you might have to run the tool directly from /opt/york/cs/net/xilinx_vitis-2020.2/Vitis_HLS/2020.2/bin/. If Vitis doesn't run, try changing to the root directory first:
    • cd /
    • vitis_hls
  • Select Create New Project and give it a suitable name and location. Again, remember that the Xilinx tools do not like spaces in project names or paths. Click Next.
  • Set Top Function to toplevel. Click next. Click next.
  • Set the Clock Period to 10 (this is in nanoseconds, so corresponds to a 100MHz clock frequency, which is the default clock frequency provided to the FPGA fabric).
  • Set the Part to xc7z010clg400-1. You can select this part directly, or go to "Boards" and find the Zybo Z7-10. Click OK.
  • Click Finish.
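As a quick check of the clock arithmetic in the steps above, the following sketch converts a clock period in nanoseconds to a frequency in MHz. The function name period_ns_to_mhz is made up for illustration and is not part of any Xilinx tool.

```cpp
// Illustrative helper (hypothetical name, not part of Vitis HLS).
// f [MHz] = 1000 / T [ns], because a 1 ns period corresponds to 1 GHz = 1000 MHz.
double period_ns_to_mhz(double period_ns) {
    return 1000.0 / period_ns;
}
```

Calling period_ns_to_mhz(10.0) gives 100.0, matching the 100MHz figure quoted for a 10 ns period.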
Warning: Important

When you add files to an HLS project the default location could be anything. Double check the location, and make sure that it is inside your HLS project folder.

You are now looking at an HLS project, but it is empty. Create two new source files (Project | New Source) called toplevel.cpp and toplevel.h, and paste in the code below.

Note: Code Errors?

When you change source code in HLS (such as pasting in the code below), the editor may highlight errors in the code that aren't actually problems. These should disappear once the project is built (it will tell you about any real errors at that point). You can also use the "Index C Source" button to make HLS rescan your code, which can sometimes help.


toplevel.cpp:

#include "toplevel.h"

//Input data storage
#define NUMDATA 100

uint32 inputdata[NUMDATA];

//Prototypes
uint32 addall(uint32 *data);
uint32 subfromfirst(uint32 *data);

uint32 toplevel(uint32 *ram, uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4) {
	#pragma HLS INTERFACE m_axi port=ram offset=slave bundle=MAXI
	#pragma HLS INTERFACE s_axilite port=arg1 bundle=AXILiteS 
	#pragma HLS INTERFACE s_axilite port=arg2 bundle=AXILiteS 
	#pragma HLS INTERFACE s_axilite port=arg3 bundle=AXILiteS 
	#pragma HLS INTERFACE s_axilite port=arg4 bundle=AXILiteS 
	#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS 

	readloop: for(int i = 0; i < NUMDATA; i++) {
        #pragma HLS PIPELINE off
        inputdata[i] = ram[i];
	}

	*arg2 = addall(inputdata);
	*arg3 = subfromfirst(inputdata);

	return *arg1 + 1;
}

uint32 addall(uint32 *data) {
    uint32 total = 0;
    addloop: for(int i = 0; i < NUMDATA; i++) {
         #pragma HLS PIPELINE off 
         total = total + data[i];
    }
    return total;
}

uint32 subfromfirst(uint32 *data) {
    uint32 total = data[0];
    subloop: for(int i = 1; i < NUMDATA; i++) {
         #pragma HLS PIPELINE off 
         total = total - data[i];
    }
    return total;
}


toplevel.h:

#ifndef __TOPLEVEL_H_
#define __TOPLEVEL_H_

#include <stdio.h>
#include <stdlib.h>
#include <ap_int.h>

//Typedefs
typedef unsigned int uint32;
typedef int int32;

uint32 toplevel(uint32 *ram, uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4);

#endif

Examine the Files

toplevel.cpp contains the structure that we will use throughout EMBS. The toplevel function has five arguments of type uint32. This is a typedef in toplevel.h that describes an unsigned integer of width 32 bits.
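The typedef relies on unsigned int being 32 bits wide, which holds on common desktop compilers but is not guaranteed by the C++ standard. A minimal sketch of a compile-time check that could sit next to the typedef; the helper bit_width_of_uint32 is hypothetical and only there to make the width visible at run time.

```cpp
#include <climits>  // CHAR_BIT

typedef unsigned int uint32;  // same typedef as in toplevel.h

// Compilation fails here if unsigned int is not exactly 32 bits on this platform.
static_assert(sizeof(uint32) * CHAR_BIT == 32, "uint32 must be 32 bits wide");

// Hypothetical helper: report the width of uint32 in bits.
inline int bit_width_of_uint32() {
    return static_cast<int>(sizeof(uint32) * CHAR_BIT);
}
```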

The file also contains pragmas which are called directives by HLS. Directives are used to tell HLS how to make your hardware. The directives here tell HLS to create an AXI Master interface, and an AXI Slave interface. This is all described in the knowledge base. The Master interface allows the component to access main memory, and the slave interface allows the ARM cores to pass in a few variables and to start, reset, and stop the component. Once built and exported, your component will look like this in Vivado:


The file also contains our functionality. Note three important things:

  • There is no main function.

  • We have declared a function toplevel. This will be the 'entry point' of the hardware.
  • The loops in the code have been given labels (readloop, addloop and subloop). Most programmers don't do this, but it is useful in HLS, as you will see later.

Testing Components

The last practicals showed you just how long hardware synthesis takes. You should therefore be pretty confident that your hardware is correct before building it. A testbench is an important part of this. Testbenches test the functional properties of the code, to make sure that it doesn't contain any logical errors and that it does roughly what you want. Testbenches cannot test how fast the final hardware will be, because the design is simulated in software.

In HLS, right click 'Test Bench' in the Explorer on the left and select New File. Call it testbench.cpp and put it somewhere sensible (in the same folder as your toplevel.h is recommended, otherwise you will need to edit the #include to have the relative path to your header file).

Copy in the following code:

testbench.cpp:

#include "toplevel.h"
#define NUMDATA 100

uint32 mainmemory[NUMDATA];

int main() {

    //Create input data
    for(int i = 0; i < NUMDATA; i++) {
    	mainmemory[i] = i;
    }
    mainmemory[0] = 8000;

    //Set up the slave inputs to the hardware
    uint32 arg1 = 0;
    uint32 arg2 = 0;
    uint32 arg3 = 0;
    uint32 arg4 = 0;

    //Run the hardware
    toplevel(mainmemory, &arg1, &arg2, &arg3, &arg4);

    //Read the slave outputs
    printf("Sum of input: %u\n", arg2); //%u because arg2 is unsigned
    printf("Values 1 to %d subtracted from value 0: %u\n", NUMDATA-1, arg3);

    //Check the values are as expected
    if(arg2 == 12950 && arg3 == 3050) {
        return 0;
    } else {
        return 1; //An error!
    }
}

Things to note:

  • We declared a block of memory as "main memory". In the real system this will be the 1GB of DDR memory on the Zybo Z7 board, but for the testbench we simply allocate an array that is large enough for our purposes.
  • The testbench should return 0 if everything is OK. It checks the values returned from the hardware against pre-calculated values to ensure all is correct.
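The pre-calculated values 12950 and 3050 can also be derived in software rather than worked out by hand. Below is a minimal sketch of a reference model that computes the expected results from the same input data. The names make_input, expected_sum and expected_sub are hypothetical helpers written for this example, not part of the supplied testbench.

```cpp
#include <cstddef>
#include <vector>

typedef unsigned int uint32;

// Build the same input data as the testbench: 8000, 1, 2, ..., n-1.
// Assumes n >= 1.
std::vector<uint32> make_input(std::size_t n) {
    std::vector<uint32> data(n);
    for (std::size_t i = 0; i < n; i++) {
        data[i] = static_cast<uint32>(i);
    }
    data[0] = 8000;
    return data;
}

// Software model of addall: sum of every element.
uint32 expected_sum(const std::vector<uint32> &data) {
    uint32 total = 0;
    for (std::size_t i = 0; i < data.size(); i++) {
        total += data[i];
    }
    return total;
}

// Software model of subfromfirst: element 0 minus all the others.
uint32 expected_sub(const std::vector<uint32> &data) {
    uint32 total = data[0];
    for (std::size_t i = 1; i < data.size(); i++) {
        total -= data[i];
    }
    return total;
}
```

For 100 elements the values 1 to 99 sum to 4950, so the model gives 8000 + 4950 = 12950 and 8000 - 4950 = 3050, the same constants the testbench checks against.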

To run the testbench select Project | Run C Simulation and click OK in the dialog that appears. You should see HLS do quite a bit of work, but eventually you will see the output of the testbench.


Sum of input: 12950
Values 1 to 99 subtracted from value 0: 3050
@I [SIM-1] CSim done with 0 errors


We have verified our design. (You should probably test a real design a bit more rigorously!)
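One case worth adding to a more rigorous test is unsigned wrap-around: if the values being subtracted exceed value 0, a uint32 result wraps modulo 2^32 rather than going negative. The sketch below assumes the hardware follows ordinary C++ unsigned arithmetic. The function sub_from_first is a software stand-in written for this example, not the synthesised function.

```cpp
#include <cstddef>

typedef unsigned int uint32;

// Stand-in for subfromfirst with an explicit length (hypothetical signature).
// Assumes n >= 1.
uint32 sub_from_first(const uint32 *data, std::size_t n) {
    uint32 total = data[0];
    for (std::size_t i = 1; i < n; i++) {
        total -= data[i];  // unsigned subtraction wraps modulo 2^32
    }
    return total;
}
```

With the input {1, 2, 3} the mathematical result is -4, which a uint32 represents as 2^32 - 4 = 4294967292. A testbench that expects a negative number here would report a failure even though the hardware is behaving as written.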

High-level synthesis

So now we have a good design, we need to examine how to turn it into hardware. Near the top right of the window you should see a row of three buttons - Debug, Synthesis, and Analysis. These are perspectives and we will switch between them by clicking them. Click the Synthesis button to ensure you are in the synthesis perspective.

First, read the Important Terms You Should Know section of the Vitis HLS Knowledge Base. These are important terms and you should know them.

Open toplevel.cpp. Now click Solution | Run C Synthesis | C Synthesis (or click the green arrow in the toolbar). This will begin synthesis. Synthesis is the process of turning the C++ description into hardware.

There are usually many different ways of achieving the same thing in hardware, all with different costs. HLS lets you explore these tradeoffs by applying directives.

Once synthesis completes, a synthesis report window will open. This tells you all about the design that you've just built. Under Performance Estimates look at the Latency summary. This should be similar to:

This tells you that your design has an overall latency (time from first data in to last data out) of 709 clock cycles. Its interval is 710, which is the number of cycles from the first data in of one run until the design can accept the first data in of another run.
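To put these figures in wall-clock terms, multiply the cycle count by the clock period. A small sketch using the 10 ns period chosen when the project was created; cycles_to_ns is an illustrative name, and the result is only an estimate because the achieved clock period after synthesis may differ from the target.

```cpp
// Illustrative helper (hypothetical name): execution time = cycles * clock period.
long long cycles_to_ns(long long cycles, long long period_ns) {
    return cycles * period_ns;
}
```

cycles_to_ns(709, 10) gives 7090 ns, so one run of the component takes roughly 7.1 microseconds at 100MHz.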

Below the toplevel line it shows the three loops in the code (which is why it is useful to give them labels). A loop's latency is the number of cycles it will take to complete. Sometimes HLS will not know this (if the loop bounds are not known at compile time, for example). The iteration latency is the number of cycles one iteration takes. Initiation interval is only valid for pipelined loops (see later), and trip count is the total number of iterations that will be computed.
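These quantities are related by simple arithmetic. The sketch below uses the usual rule of thumb: an unpipelined loop takes trip count times iteration latency cycles, while a pipelined loop takes (trip count - 1) times initiation interval, plus one iteration latency. The function names are made up for illustration, and the report may include a cycle or two of loop entry and exit overhead that these formulas ignore.

```cpp
// Rule-of-thumb estimates (hypothetical helper names, not a Vitis HLS API).

// Unpipelined: each iteration must finish before the next one starts.
long long unpipelined_loop_latency(long long trip_count,
                                   long long iteration_latency) {
    return trip_count * iteration_latency;
}

// Pipelined: a new iteration starts every initiation_interval cycles,
// and the final iteration still needs iteration_latency cycles to drain.
long long pipelined_loop_latency(long long trip_count,
                                 long long iteration_latency,
                                 long long initiation_interval) {
    return (trip_count - 1) * initiation_interval + iteration_latency;
}
```

For example, 100 iterations of 2 cycles each come to 200 cycles unpipelined, but only 101 cycles if the loop is pipelined with an initiation interval of 1.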

Just as important as performance is utilisation. Look at the utilisation summary further along and you will see the usage of your design in FFs (flip-flops) and LUTs (look-up tables), which are measures of reconfigurable logic, as well as DSP (digital signal processing) units and BRAMs (block RAMs). Block RAMs are small chunks of very high speed memory throughout the FPGA fabric. They can be read from or written to in a single clock cycle, but with a maximum of two accesses per block RAM, per clock cycle. This will become important later. These numbers might be a bit meaningless out of context, so you can click the % symbol above the table to turn them into percentages of your FPGA's resources.
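Two pieces of arithmetic sit behind this paragraph. The sketch below shows the lower bound that the two-accesses-per-cycle limit places on reading an array held in a single block RAM, and how a percentage figure is derived from a raw count. Both helper names are hypothetical, and the numbers in the usage note are placeholders rather than values from a real report.

```cpp
// Hypothetical helpers, for illustration only.

// Minimum cycles needed to perform `accesses` reads or writes on one
// block RAM that allows `ports` accesses per cycle (ceiling division).
long long min_cycles_for_accesses(long long accesses, long long ports = 2) {
    return (accesses + ports - 1) / ports;
}

// Utilisation as a percentage of the resources available on the device.
double utilisation_percent(long long used, long long available) {
    return 100.0 * static_cast<double>(used) / static_cast<double>(available);
}
```

Reading all 100 elements of an array from one block RAM therefore needs at least 50 cycles, however much other logic is available. A design using 50 of a resource that the device has 200 of would be shown as 25%.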

This shows an estimate of the resources that your design will use. This only includes the HLS component. The rest of the design uses FPGA resources too! Of primary interest to you in this table are BRAMs and FF/LUTs. This table helps to give a rough breakdown summary of where most of your design is using resources. As your design gets more complex, you can further examine the resource for more detailed information.

Tuning synthesis with directives 

The design we have is OK, but it can be improved with directives. These tell HLS how to synthesise your software into hardware. First, let's have a closer look at how your design is being implemented. Click the Analysis button (top right) to go to the Analysis perspective. This should open a Performance tab.

 

The rows are operations that come from your compiled code. The columns are states, and so looking vertically shows all the things that are happening in parallel. If two things are in the same column, they happen in parallel. Currently we can see that the addloop and subloop loops do not overlap, so they are not done in parallel. Note also that addloop and subloop have two states. This corresponds to the Performance Estimates we got from the synthesis report that told us the iteration latency of both these loops is two.
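The cost of not overlapping can be estimated directly: stages that run one after another add their latencies, whereas stages that run side by side cost only as much as the slowest one. A sketch with illustrative function names, using 100 and 99 iterations of two cycles each for addloop and subloop. Whether HLS can actually overlap two loops depends on data dependencies and memory ports, so treat the second figure as a best case.

```cpp
#include <algorithm>  // std::max

// Hypothetical helpers for comparing schedules.

// Stages executed back to back: latencies add up.
long long sequential_latency(long long a, long long b) {
    return a + b;
}

// Stages executed fully in parallel: the slower stage dominates.
long long overlapped_latency(long long a, long long b) {
    return std::max(a, b);
}
```

With addloop at 200 cycles and subloop at 198 cycles, running them in sequence costs 398 cycles, while a fully overlapped schedule would cost 200.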

This is because I cheated. Sorry!

HLS would actually have done a better job, left to its own devices, but I included some directives in the code above to deliberately turn off some optimisations so that we could better see what they do. We will undo that in the next section. You can fully expand each loop to see the individual operations that take place in each state. You can also right click operations and select Goto Source to see the line of C++ code which created it, or the line of generated Verilog or VHDL that will create the actual FPGA hardware.

