Vitis HLS Knowledge Base

This page contains useful information, tips, and tricks on how to get the best out of your Vitis HLS designs.

When adding a directive, the 'Help' button describes exactly what a directive and its parameters do.

A further useful source of help is the Vitis HLS User Guide - consult this (or a demonstrator) if you want more information about something.

Contents

Important Terms You Should Know

  • LUTs or SLICEs
    • These make up the area of the FPGA. The FPGA has a limited number of them, and when it runs out your design is too big!
  • BRAMs or Block RAMs
    • Memory blocks in the FPGA. On the Z-7010s we are using there are 60 of them, each 36 kilobits (about 4 KiB of data) in size.
  • Latency
    • The number of clock cycles a design takes to produce an answer.
    • The latency of a loop is the number of clock cycles one iteration takes.
  • Initiation Interval (or II, or Interval)
    • The number of clock cycles a design must execute before it can accept new data.
    • This is not the same as latency! If the function is pipelined, many data items are flowing through it at once. Latency is the time for one data item to pop out after it is pushed in, whereas interval determines the rate at which data can be pushed in (there is a worked example after this list).
    • The interval of a loop is the maximum rate at which loop iterations can be started, in clock cycles.
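
As a worked example of the difference: a pipelined loop with a latency of 10 cycles and an II of 1 that processes 100 data items finishes in roughly 10 + 99 × 1 = 109 cycles, because a new item can enter the pipeline every cycle. Without pipelining (an II equal to the latency), the same work would take 100 × 10 = 1000 cycles.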



Useful Directives

This is a list of directives that you will find useful. There are others, and you can consult the help in HLS if you need more information.

Each entry below gives the directive name, what it applies to (in brackets), and a description.

  • PIPELINE (functions, loops): Causes inputs to be passed to the function or loop more frequently. A pipelined function or loop can process new inputs every N clock cycles, where N is the initiation interval. 'II' defaults to 1 and is the initiation interval that HLS should aim for (i.e. how quickly it should try to feed new data items into the pipeline).
  • UNROLL (loops): Creates 'factor' copies of the loop body to execute in parallel (if dataflow dependencies are otherwise met). Creates fast but large circuits. Leave 'factor' unset to unroll as much as possible.
  • ALLOCATION (various): Limits the number of instances of something. For example, if you only want three copies of function foo within another function toplevel, use ALLOCATION with 'location' set to toplevel, 'limit' set to 3, 'instances' set to foo, and 'type' set to 'function'. This also works for specific operators (click the 'Help' button for a list).
  • ARRAY_MAP (arrays): Maps multiple smaller arrays into one larger one, to save access logic or BRAMs at the cost of access time. 'instance' can be set to any unused name. Use multiple ARRAY_MAP directives with the same instance to tell HLS to create a new array with the name 'instance' that contains all the smaller arrays. Leave 'offset' unset.
    • Please be aware that some people have experienced a bug with this directive when mapping three or more initialised arrays into a single RAM. If you experience a difference in behaviour between simulation and the implemented design, try removing this directive.
  • ARRAY_PARTITION (arrays): Splits a big array into multiple smaller ones. This is useful to increase the potential for parallel access. If 'type' is 'block' then the source array is split into chunks. If it is 'cyclic' then the elements are interleaved into the destination arrays. In both cases 'factor' is the number of smaller arrays to create. If 'type' is 'complete' then 'factor' is ignored and the array is totally split into component registers, thereby not using any Block RAMs.
  • DATAFLOW (functions): See below.
  • INLINE (functions): Instead of treating the function as a single hardware unit, this directive makes HLS inline the function every time it is called. This increases potential parallelism at the cost of hardware. If 'recursive' is true, then all functions called by the inlined function are also treated as marked with INLINE.
  • INTERFACE (function parameters): Tells HLS how to pass parameters between functions. This is vital in the top-level function because it defines the pinout of your design. In EMBS we have a template (above) that you should stick to.
  • LATENCY (functions, loops): HLS normally tries to achieve minimum latency. If you specify a larger minimum latency with this directive, HLS will 'pad out' the function or loop and slow everything down. This helps with resource sharing, and is useful for creating delays. HLS will warn you if it cannot attain the latency you asked for.
  • LOOP_FLATTEN (loops): Flattens nested loops into a single loop. Apply it to the inner-most loop. Will make faster hardware, if successful.
  • LOOP_TRIPCOUNT (loops): If a loop has variable loop bounds, HLS will not know how many iterations it will take, so it cannot give you a definite value for the latency of your design. This directive allows you to specify minimum, average, and maximum trip counts (numbers of iterations) for a loop. It only affects reporting and does not affect hardware generation.
  • RESOURCE (various): Specifies that a particular hardware resource should be used to implement a source code element, for example whether an array should be implemented using BRAMs or LUTs. In recent Vitis HLS versions this is expressed with the BIND_STORAGE and BIND_OP directives instead. See below.
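
Many of these directives can also be written directly as pragmas in the source code rather than added through the GUI. The sketch below is illustrative only (the function, loop bounds, and factor values are made up); check the 'Help' button or the user guide for the exact parameter spellings your version accepts.

void example(int in[64], int out[64]) {
	int buffer[64];
	// Split 'buffer' into 4 smaller arrays so that more elements can be accessed in parallel
	#pragma HLS ARRAY_PARTITION variable=buffer type=cyclic factor=4 dim=1

	copyloop: for (int i = 0; i < 64; i++) {
		// Try to start a new iteration every clock cycle
		#pragma HLS PIPELINE II=1
		buffer[i] = in[i];
	}

	workloop: for (int i = 0; i < 64; i++) {
		// Create 4 copies of the loop body to run in parallel
		#pragma HLS UNROLL factor=4
		out[i] = buffer[i] * 2;
	}
}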

Arbitrary-Precision Types

You can use the normal C types (int, char, etc.) in HLS. However, frequently registers in your design will not require exactly 8, 16, or 32 bits. Rather than accept this inefficiency, you can use arbitrary-precision types to define exactly how wide your data types need to be.

The following shows how to use C and C++ style arbitrary-precision types. We recommend using C++, unless you have a specific reason not to.

In C:

Include the <ap_cint.h> header. Then you can declare variables with types like the following:

uint5 x;    // unsigned integer, 5 bits wide
int19 x;    // signed integer, 19 bits wide

In C++:

Include the <ap_int.h> header. Then you can declare variables with types like the following:

ap_uint<5> x;    // unsigned integer, 5 bits wide
ap_int<19> x;    // signed integer, 19 bits wide


You should be able to print the arbitrary precision types normally, but if you are getting strange values out of printf then call to_int() first:

ap_uint<23> myAP;
printf("%d\n", myAP.to_int());



Reset Behaviour

In HLS, all static and global variables are initialised to zero (or to something else if an initialiser value is given). This includes RAMs, in which every element is cleared to zero. However, this initialisation only happens when the FPGA is first programmed. Any subsequent processor resets will not trigger initialisation.

You should include some kind of reset protocol if you need to clear your device's internal state.
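
One simple approach (a sketch only, and not the only way to do it) is to reserve one of your slave interface arguments as a 'reset' command: when the software sets it, the hardware clears its internal state and returns without doing any other work. The argument meaning and the state array below are placeholders for whatever your design actually uses.

uint32 toplevel(uint32 *arg1, uint32 *arg2) {
	#pragma HLS INTERFACE s_axilite port=arg1 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg2 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS register

	static uint32 state[64];   // only zeroed when the FPGA is first programmed

	if (*arg1 == 1) {
		// Software requested a reset: clear the internal state explicitly
		clearloop: for (int i = 0; i < 64; i++) {
			state[i] = 0;
		}
		return 0;
	}

	// ... normal operation using state[] ...
	return state[*arg2 & 63];
}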



AXI Slave and AXI Master interfaces

There are two interfaces that you can use in your HLS component: AXI Slave and AXI Master.

  • AXI Slave: The ARM cores use this interface to start and stop the HLS component. They can also use this interface to read and write a relatively small number of user-defined values.
  • AXI Master: If a larger amount of shared data is required, the HLS component can use an AXI Master interface to initiate transactions to read and write from main system memory.

You define the interfaces you need by specifying the arguments to the toplevel function in your HLS component, and by attaching directives to those arguments. This shows a component which only has a slave interface:

HLS component with slave interface
uint32 toplevel(uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4) {
	#pragma HLS INTERFACE s_axilite port=arg1 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg2 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg3 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg4 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS register
}

And this is a component with both a slave and master interface:

HLS component with Slave and Master interfaces
uint32 toplevel(uint32 *ram, uint32 *arg1, uint32 *arg2, uint32 *arg3, uint32 *arg4) {
	#pragma HLS INTERFACE m_axi port=ram offset=slave bundle=MAXI
	#pragma HLS INTERFACE s_axilite port=arg1 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg2 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg3 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=arg4 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS register
}

Note that you can add and remove arguments for the slave interface, and change their datatypes; just remember to also update the associated #pragmas. HLS will update the drivers of your component accordingly.

Master datatypes

Because the AXI Master interface connects to RAM, which is 32 bits wide, you should always use a 32-bit datatype when specifying an AXI Master interface.

Once you have decided on your interface, you should be able to rely on Vivado's Connection Automation to wire everything up for you. Speak to a demonstrator if you want to try something complicated!

Note that the pragma for the return port is important!

#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS register

Even if you do not use the return value of the function, this pragma tells HLS to bundle the start, stop, done, and reset signals into a control register in the AXI Slave interface. This will therefore generate corresponding driver functions to start and stop the generated IP core. If you do not include this pragma, then HLS will generate simple wires for these signals instead and your IP core will not be controllable from the ARM cores directly.
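
On the ARM side, the generated driver exposes functions whose names are derived from your top-level function and its argument names. The sketch below assumes a top-level function called toplevel with a slave argument arg1; check the generated header (e.g. xtoplevel.h) and xparameters.h for the exact names and device ID in your own project.

#include "xtoplevel.h"
#include "xparameters.h"

XToplevel hls;

u32 run_hls(u32 input) {
	// Look up and initialise the HLS component
	XToplevel_Initialize(&hls, XPAR_TOPLEVEL_0_DEVICE_ID);

	XToplevel_Set_arg1(&hls, input);    // write a slave register
	XToplevel_Start(&hls);              // set the start bit in the control register

	while (!XToplevel_IsDone(&hls));    // poll the done bit

	return XToplevel_Get_return(&hls);  // read the function's return value
}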



AXI Master with Multiple Types

Vitis HLS can be quite picky about copying values from the same master AXI port and interpreting them as different types.

For example, the following memcpy will probably result in a "Stored value type does not match pointer operand type!" error from LLVM during synthesis, when trying to treat the RAM as both uint32 and float types:

void toplevel(uint32 *ram) {
#pragma HLS INTERFACE m_axi port=ram offset=slave bundle=MAXI
	uint32 u_values[10];
	float f_values[10];
 
	memcpy(u_values, ram, 40);
	memcpy(f_values, ram+10, 40);
}

In order to properly force the type information of data being copied from RAM, a union can be used as follows:

typedef union {
	uint32 u;
	float f;
} ram_t;
 
void toplevel(ram_t *ram) {
#pragma HLS INTERFACE m_axi port=ram offset=slave bundle=MAXI
	uint32 u_values[10];
	float f_values[10];
 
	for (int i = 0; i < 10; i++) {
		ram_t data = ram[i];
		u_values[i] = data.u;
	}
	for (int i = 0; i < 10; i++) {
		ram_t data = ram[i+10];
		f_values[i] = data.f;
	}
}

Additionally, as long as the loop bounds start at zero (and are fixed), HLS should be clever enough to treat this as a burst transfer similar to memcpy - look for "Inferring bus burst read of length X on port 'MAXI'" during synthesis to confirm this.



Forcing and Preventing the Use of Block RAMs

HLS will automatically turn most of your arrays into BRAMs. Often this is useful, because arrays of registers are very expensive in terms of LUTs (FPGA space). However, the FPGA has a limited number of BRAMs. Also, BRAMs have only two access ports, which means that at most two parallel processes can access the RAM at any one time. This can limit the parallelism potential of your design.

If HLS is using a BRAM where you do not want it to, apply the ARRAY_PARTITION directive to the array with 'type' set to 'complete' and 'dimension' set to 1. This will force HLS to build the array out of registers. This uses a lot of FPGA space (LUTs), so be sparing!

To force HLS to use a BRAM, apply the BIND_STORAGE directive with 'type' set to RAM_2P. (Press the 'Help' button whilst adding the directive to see a description of all the options.)
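
As pragmas, the two cases look roughly like the sketch below (the array names and sizes are placeholders, and the exact BIND_STORAGE options available depend on your Vitis HLS version):

void example(int in[1024], int out[1024]) {
	int coeffs[16];
	// Force registers instead of a Block RAM, so every element can be read in the same cycle
	#pragma HLS ARRAY_PARTITION variable=coeffs type=complete dim=1

	int buffer[1024];
	// Force a dual-port Block RAM implementation
	#pragma HLS BIND_STORAGE variable=buffer type=RAM_2P impl=BRAM

	for (int i = 0; i < 16; i++) coeffs[i] = i;
	for (int i = 0; i < 1024; i++) buffer[i] = in[i];
	for (int i = 0; i < 1024; i++) out[i] = buffer[i] + coeffs[i % 16];
}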

The ARRAY_MAP directive (see above) can help to save Block RAMs by automatically putting multiple smaller arrays into one larger one.



When you change your HLS

When you change your HLS code, perform the following steps to ensure that your bitfile is updated for testing.

  1. Rerun synthesis.
  2. Reexport the IP core.
  3. In Vivado, there should be a message saying "IP Catalog is out-of-date", because it has noticed the change.
    1. If not, click IP Status then Re-Run Report
    2. Click Refresh IP Catalog
    3. In the IP Status panel the toplevel IP should be selected. Click Upgrade Selected.
  4. In the Generate Output Products dialog click Generate.
  5. Click Generate Bitstream.
  6. Export Hardware to Vitis.
  7. In Vitis reprogram the FPGA and run your software.

Now you understand why testbenches and simulation are so important!


Loop Optimisations

In HLS, you can apply directives to loops to instruct the tool to unroll or pipeline them. Consider the following loop:

myloop: for(int i = 0; i < 3; i++) {
    doSomething(X[i]);
}

HLS will by default execute each iteration of the loop sequentially, one after another.

If each iteration of the loop takes 10 clock cycles, then the loop will take 30 cycles in total to complete.

If we give this loop the PIPELINE directive then HLS will try to start computing element 1 before element 0 has completed, creating a pipeline. This means that the overall execution time of the loop will be lower, but at the cost of more complex control logic and a larger number of registers to store intermediate data.
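
In source form this is just the PIPELINE pragma placed inside the loop body; II=1 asks HLS to try to start a new iteration every clock cycle:

myloop: for(int i = 0; i < 3; i++) {
    #pragma HLS PIPELINE II=1
    doSomething(X[i]);
}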


It can only do this if there are no dependencies that prevent this optimisation. Consider the following code:

int lastVal = 0;
 
for(int i = 0; i < 50; i++) {
    lastVal = calculateAValue(lastVal);
}

In this example, the loop is forced to execute sequentially because a value computed near the end of the loop body is required at the start of the next loop iteration. PIPELINE will still speed things up a little, but not drastically.

Finally, if we give the loop the UNROLL directive then HLS will try to execute the iterations of the loop in parallel. This requires a lot more hardware, but is very fast. The entire loop will only take 10 cycles in our example.


This requires that there are no data dependencies between the iterations of the loop at all. If doSomething() kept a global counter of how many times it has executed, for example, then this dependency would block the UNROLL directive from working.

Note that UNROLL by default attempts to unroll all iterations of the loop. This can result in a very large design! To keep things more reasonable, you can set the FACTOR parameter of UNROLL to tell the tools how many copies to create.
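
For example, the following sketch (with an arbitrary bound of 64) unrolls the loop four ways rather than completely:

myloop: for(int i = 0; i < 64; i++) {
    #pragma HLS UNROLL factor=4
    doSomething(X[i]);
}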

After applying UNROLL, it is a good idea to look in the analysis view to see whether it was actually applied or not. A successfully unrolled design will be very "vertical" in the analysis view. Operations in the same column are happening at the same time. If the view is still very "horizontal" with a lot of columns, then it is likely that a data dependency is preventing unrolling. You can try to work out what it is by clicking an operation: the tool will draw arrows to show what feeds into it and what it feeds into. Remember that Block RAMs can only service two accesses at any one time, so if you have a large array that the tools are implementing in a Block RAM, unrolling or pipelining over it will only allow up to two accesses in parallel. You can tell the tools not to use a Block RAM with the ARRAY_PARTITION directive. This can be much faster, but uses much more hardware.


Dataflow Optimisation

If you have not used directives that limit resources (such as the ALLOCATION directive), HLS seeks to minimize latency and improve concurrency. Data dependencies can limit this. For example, functions or loops that access arrays must finish all read/write accesses to the arrays before they complete. This prevents the next function or loop that consumes the data from starting.

It may be possible for the operations in a function or loop to start operation before the previous function or loop completes all its operations.

When dataflow optimization is specified, HLS:

  • Analyzes the dataflow between sequential functions or loops.
  • Seeks to create hardware that allows consumer functions or loops to start operation before the producer functions or loops have completed.

This allows functions or loops to operate in parallel, which in turn decreases latency and improves the throughput of the RTL design, but at the cost of additional hardware. Experiment with DATAFLOW and see if it helps your design.
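
A typical pattern (a sketch with made-up function names and sizes) is a producer/consumer chain communicating through a local array, with DATAFLOW applied to the enclosing function so that the stages can overlap:

void produce(int in[64], int mid[64]) {
	for (int i = 0; i < 64; i++) mid[i] = in[i] * 3;
}

void consume(int mid[64], int out[64]) {
	for (int i = 0; i < 64; i++) out[i] = mid[i] + 1;
}

void toplevel(int in[64], int out[64]) {
	#pragma HLS DATAFLOW
	int mid[64];
	produce(in, mid);    // producer stage
	consume(mid, out);   // consumer stage can start before the producer has fully finished
}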



'Cannot find crt1.o' Error

When attempting to run a test bench on a machine other than those in the hardware labs, you may get an error complaining that it can't find 'crt1.o'. If so, you need to set a custom linker flag for your project.

Click "Project" in the top menu, then Project Settings. Within this box, click "Simulation" on the left, then paste the following into the "Linker Flags" box:

-B"/usr/lib/x86_64-linux-gnu/"



My loops have ??? latency estimates!

Sometimes, the HLS synthesis report will contain ? instead of giving values for the minimum and maximum latency. This is because at least one loop in your design is data-dependent, i.e. the number of times it loops is dependent on data values which HLS cannot know about.

For example, consider code along the lines of the sketch below (the original listing is reconstructed here from the description that follows, so the exact names and pragmas are assumptions based on the interface templates above):
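
// Sketch: sums *arg1 elements from ram; names and pragmas are assumptions
uint32 toplevel(uint32 *ram, uint32 *arg1) {
	#pragma HLS INTERFACE m_axi port=ram offset=slave bundle=MAXI
	#pragma HLS INTERFACE s_axilite port=arg1 bundle=AXILiteS register
	#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS register

	uint32 sum = 0;
	// The loop bound comes from arg1, so HLS cannot know the trip count in advance
	sumloop: for (uint32 i = 0; i < *arg1; i++) {
		sum += ram[i];
	}
	return sum;
}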

When synthesised, the synthesis report gives '?' instead of concrete values for the minimum and maximum latency.

If we inspect the code, it is summing together elements from ram, but the exact number of elements to sum comes from the user as the arg1 input. Therefore HLS can't know ahead of time how long this hardware will take to execute because it is variable on each run. We say that its runtime is data-dependent. The generated hardware will work fine, we just can't predict how long it will take to run. Looking at the loop details, HLS can still tell us that the latency of the loop is 2. In other words, it doesn't know how many times it will iterate, but that each iteration will take 2 clock cycles.

In general you should try to avoid this. If HLS cannot predict the worst case, then it will have to be over-cautious and it might make larger hardware than you wanted. Also you cannot UNROLL loops with variable loop bounds.

Some algorithms are fundamentally data-dependent and if this situation cannot be avoided, then you can tell HLS to assume that a loop will take a given number of iterations for the purpose of reporting only, by adding the LOOP_TRIPCOUNT directive to the loop. The generated hardware will be exactly the same, but HLS will generate latency numbers under the assumption that the loop iterates that number of times. This means that your latency numbers are not "correct", but this can still be useful to help you to see if other optimisations have an overall positive effect.
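
As a pragma, it looks something like this (the min, max, and avg values are placeholders for whatever bounds you expect in practice):

sumloop: for (uint32 i = 0; i < *arg1; i++) {
	#pragma HLS LOOP_TRIPCOUNT min=1 max=1000 avg=100
	sum += ram[i];
}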



Fixed Point Types

Fixed point types are useful for when you need to use fractional arithmetic but don't want to pay the large hardware cost of using floating points. Fixed point types are described in detail in the Vitis HLS User Guide but a short example is below:

Fixed Point example
#include <iostream>
#include <ap_fixed.h>
 
int main() {
	// ap_fixed<W, I>: W bits in total, of which I are integer bits (the rest are fractional)
	ap_fixed<15, 5> a = 3.45;
	ap_fixed<15, 5> b = 9.645;
	ap_fixed<20, 6> c = a / b * 2;
	std::cout << c << std::endl;
	// Prints 0.7148. The accurate answer is 0.7154. More bits can be allocated to the types if more accuracy is required.
	return 0;
}

The C standard maths functions (in math.h) are implemented only for floating point, but Xilinx provide fixed point implementations of certain functions in hls_math.h under the hls:: namespace; e.g. hls::sqrt(), hls::cos() and hls::sin().

In addition, the following Xilinx example code shows an alternative fixed point square root implementation, which may be more efficient in certain situations.

fxp_sqrt.h
#ifndef __FXP_SQRT_H__
#define __FXP_SQRT_H__
#include <cassert>
#include <ap_fixed.h>
using namespace std;

/*
 * Provides a fixed point implementation of sqrt()
 * Must be called with unsigned fixed point numbers, so convert before calling as follows:
 * ap_ufixed<32, 20> in = input_number;
 * ap_ufixed<32, 20> out;
 * fxp_sqrt(out, in);
 */
template <int W2, int IW2, int W1, int IW1>
void fxp_sqrt(ap_ufixed<W2,IW2>& result, ap_ufixed<W1,IW1>& in_val)
{
   enum { QW = (IW1+1)/2 + (W2-IW2) + 1 }; // derive max root width
   enum { SCALE = (W2 - W1) - (IW2 - (IW1+1)/2) }; // scale (shift) to adjust initial remainder value
   enum { ROOT_PREC = QW - (IW1 % 2) };
   assert((IW1+1)/2 <= IW2); // Check that output format can accommodate full result
   ap_uint<QW> q      = 0;   // partial sqrt
   ap_uint<QW> q_star = 0;   // diminished partial sqrt
   ap_int<QW+2> s; // scaled remainder initialized to extracted input bits
   if (SCALE >= 0)
      s = in_val.range(W1-1,0) << (SCALE);
   else
      s = ((in_val.range(W1-1,0) >> (0 - (SCALE + 1))) + 1) >> 1;
   // Non-restoring square-root algorithm
   for (int i = 0; i <= ROOT_PREC; i++) {
      if (s >= 0) {
         s = 2 * s - (((ap_int<QW+2>(q) << 2) | 1) << (ROOT_PREC - i));
         q_star = q << 1;
         q = (q << 1) | 1;
      } else {
         s = 2 * s + (((ap_int<QW+2>(q_star) << 2) | 3) << (ROOT_PREC - i));
         q = (q_star << 1) | 1;
         q_star <<= 1;
      }
   }
   // Round result by "extra iteration" method
   if (s > 0)
      q = q + 1;
   // Truncate excess bit and assign to output format
   result.range(W2-1,0) = ap_uint<W2>(q >> 1);
}
#endif