Practical 4 - Deploying HLS Components

So far we have used HLS on its own. In this practical we will implement an HLS component as part of a full FPGA design. First we will learn the workflow, and then we will create some hardware to solve some problems.

First create a new HLS project with the toplevel.cpp and toplevel.h files from Practical 3a.

  • Synthesise it.

  • Every time we change our hardware, we need to tell HLS to export it to a hardware description language and to generate all the various metadata that Vivado wants.

    • Select Solution → Export RTL → Select "Vivado IP for System Generator" → Click OK

  • Now we need to tell Vivado where our new IP is

    • Back in Vivado, open your Block Design. Click Window → IP Catalog to open the IP Catalog. Click Settings in the Flow Navigator on the left, select IP, and then Repository. Press the Plus icon, select your HLS project directory in the file browser, and click Select. Vivado will scan your HLS project and should pop up a box showing that an IP was added to your project. Click OK, and OK again.

  • Back in your Block Design, click the Add IP button on the left of the diagram. The IP core will be called the Display Name you entered, or Toplevel if you forgot to enter one. Double click the IP to add it. 

  • To allow your IP core to access DDR memory you need to enable an AXI slave interface on the Zynq processing system. Double click the Zynq IP block, select "PS-PL Configuration", expand "HP Slave AXI Interface", and check S AXI HP0 interface. Click OK and you should see a new port appear on the Zynq block. 

  • You can now use Connection Automation to complete the connection. Run Connection Automation and check S_AXI_HP0. It should be suggesting to connect to the m_axi port on your IP core. 

  • Also in Connection Automation, check s_axi_AXILiteS (it may instead be named s_axi_control, depending on your HLS version). It should be suggesting to connect to M_AXI_GP0 on the processing system. Click OK.

You will now have your IP connected through its slave interface to the AXI Interconnect called processing_system7_0_axi_periph and through its master interface to another AXI Interconnect called axi_mem_intercon which goes into the Zynq IP block.

You can now save your block design, Generate Bitstream, and export the hardware again. Overwrite your existing hardware specification (the XSA file).

Connection Automation issues

If you have issues with the AXI buses generated using connection automation (i.e. if they differ from the structure above), try deleting all AXI Interconnect blocks and running it again. If in doubt, ask a demonstrator.

The general principle is that your Zynq block's M_AXI should be traceable to all S_AXIs on your IP cores, and the IP core's M_AXI should be traceable to the S_AXI_HP0 on the Zynq block.

Using the IP in Vitis

When HLS exported our IP it helpfully generated a software driver for us. However, we need to tell Vitis where to find this driver.

  • In Vitis, select Xilinx → Repositories. Under Local Repositories click New and select the folder of your HLS project. Click Rescan Repositories and OK.

  • Right click the design_1_wrapper Platform and click "Update Hardware Specification" so that Vitis picks up the fact that we have changed the hardware.

    • We should now be able to see the new IP and its drivers.

    • In the platform.spr file of design_1_wrapper, under the Board Support Package settings, you should be able to see your toplevel IP listed, along with the fact that it is using a driver, also called toplevel.

  • Create a new Application Project for the design_1_wrapper Platform, and let it create a new System. You have to do this whenever you change the interface of your hardware.

You can now interact with your core as in the following example.

IP Core Example
#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xparameters.h"
#include "xtoplevel.h"
#include "xil_cache.h"

u32 shared[1000];

int main() {
    int i;
    XToplevel hls;

    init_platform();
    Xil_DCacheDisable();
    print("\nHLS test\n");

    for(i = 0; i < 100; i++) {
        shared[i] = i;
    }
    shared[0] = 8000;

    XToplevel_Initialize(&hls, XPAR_TOPLEVEL_0_DEVICE_ID);
    XToplevel_Set_ram(&hls, (u32) shared);
    XToplevel_Start(&hls);
    while(!XToplevel_IsDone(&hls));

    printf("arg2 = %lu\narg3 = %lu\n",
           XToplevel_Get_arg2(&hls), XToplevel_Get_arg3(&hls));

    cleanup_platform();
    return 0;
}

Program the FPGA and launch this code and you should see the following:

Output
HLS test
arg2 = 12950
arg3 = 3050

As you can see, your component can be started with XToplevel_Start and XToplevel_IsDone tells you when it is complete. XToplevel_Set_ram tells the HLS component where our shared memory is located in main memory. This allows HLS to read and write as if RAM starts at 0, but it will actually be pointing at our shared memory. Don't forget to set the RAM offset or your HLS component will be writing over random bits of memory!
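The offset behaviour can be modelled in plain C. The sketch below is purely illustrative (none of these names are Xilinx driver functions): a base pointer plays the role of the register written by XToplevel_Set_ram, and the "core" indexes memory relative to it.

```c
#include <stdint.h>

/* Illustrative model only -- none of these names exist in the real driver.
 * ddr stands in for the ARM-side shared[] array, and hls_ram_base stands
 * in for the core's ram offset register. */
uint32_t ddr[1000];
static uint32_t *hls_ram_base;

/* Like XToplevel_Set_ram(&hls, (u32) shared): tell the core where RAM is. */
void set_ram(uint32_t *base) { hls_ram_base = base; }

/* The core believes RAM starts at address 0, but every access is really
 * relative to the base it was given. */
uint32_t core_read(int addr)          { return hls_ram_base[addr]; }
void core_write(int addr, uint32_t v) { hls_ram_base[addr] = v; }
```

After set_ram(ddr), a core_read(5) really reads ddr[5], which is exactly why forgetting to set the RAM offset leaves the core reading and writing from whatever happens to be at address 0.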

AXI bus problems

If your software hangs when trying to write to registers in your HLS component, double-check that the AXI buses are connected correctly, then take your Vitis workspace back to basics: re-export a new hardware platform, create a new Hello World project on it, and make sure you can see its output. Check that the System has the driver for your IP core, then copy in your code to access the core.

Functions like XToplevel_Get_arg1 and XToplevel_Set_arg1 get and set the parameters of the toplevel function. Sometimes these getters and setters may have slightly different names based on how your core uses the arguments. For example, if you only read an input then HLS will generate functions like:

  • XToplevel_Set_arg1

  • XToplevel_Get_arg1

However, if you read and write a variable then the functions will be named:

  • XToplevel_Set_arg1_V_i

  • XToplevel_Get_arg1_V_o

Yes, frustrating isn't it? Anyway, to check the generated interface, in Vitis click on the line #include "xtoplevel.h" and press F3. Vitis will open the xtoplevel.h file and you can see the functions to use. Call a demonstrator if you're confused.

When you change your HLS

When you change your HLS code, perform the following steps to ensure that your bitfile is updated for testing.

  1. Re-run synthesis.

  2. Re-export the IP core.

  3. In Vivado, it should have noticed the change and there will be a message saying "IP Catalog is out-of-date". 

    1. If not, click IP Status then Re-Run Report

    2. Click Refresh IP Catalog

    3. In the IP Status panel the toplevel IP should be selected. Click Upgrade Selected.

  4. In the Generate Output Products dialog click Generate.

  5. Click Generate Bitstream.

  6. Export Hardware (including bitstream).

  7. In Vitis reprogram the FPGA and run your software.

    1. If you have changed the interface to your hardware you might have to regenerate your System and move your application project into it.

Now you understand why testbenches and simulation are so important!


Measuring Execution Times

You are going to use the Timer in the ARM processing system to measure how long a piece of code takes to execute, and then demonstrate that you can do the same operation faster in hardware. The code we are going to measure implements a test of the Collatz Conjecture. The conjecture states:

The Collatz Conjecture

Take any positive integer n.
If n is even, divide it by 2 to get n / 2. If n is odd, multiply it by 3 and add 1 to obtain 3n + 1.
Repeat the process indefinitely.
The conjecture is that no matter what number you start with, you will always eventually reach 1. For example, starting from 6 the sequence is 6 → 3 → 10 → 5 → 16 → 8 → 4 → 2 → 1, reaching 1 after 8 steps.

You will create an HLS component to test the first 1000 integers, verifying that if you perform the above steps they all eventually converge to 1. You will output the number of steps each number takes to reach 1 into a shared array.

Use the following code as the ARM software:

main.c
#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xparameters.h"
#include "xtoplevel.h"
#include "xil_cache.h"

int shared[1000];
XToplevel hls;

unsigned int collatz(unsigned int n) {
    int count = 0;
    while(n != 1) {
        if(n % 2 == 0) {
            n /= 2;
        } else {
            n = (3 * n) + 1;
        }
        count++;
    }
    return count;
}

void software() {
    int i;
    for(i = 0; i < 1000; i++) {
        shared[i] = collatz(i + 1);
    }
}

void hardware() {
    //Start the hardware IP core
    XToplevel_Start(&hls);
    //Wait until it is done
    while(!XToplevel_IsDone(&hls));
}

void print_shared() {
    int i;
    for(i = 0; i < 1000; i++) {
        xil_printf("%d ", shared[i]);
    }
    xil_printf("\n");
}

void setup_shared() {
    int i;
    for(i = 0; i < 1000; i++) {
        shared[i] = i + 1; //(we use i+1 because collatz of 0 is an infinite loop)
    }
}

int main() {
    init_platform();
    Xil_DCacheDisable();

    //Initialise the HLS driver
    XToplevel_Initialize(&hls, XPAR_TOPLEVEL_0_DEVICE_ID);
    XToplevel_Set_ram(&hls, (int) shared);

    xil_printf("\nStart\n");

    setup_shared();
    software();
    print_shared();

    setup_shared();
    hardware();
    print_shared();

    cleanup_platform();
    return 0;
}

Examine this code. The function software() is a software implementation of the Collatz iteration stage for the first 1000 integers, placing the iteration count in the global array shared. The main function sets shared to the integers 1 to 1000, runs software(), then prints the results out. It then resets shared, runs hardware(), and prints the results out.

Read the section on "Sharing Memory Between HLS and the ARM" on the Software API page for more information on how memory is shared with HLS, including some useful information about caches.

Implement a hardware component in HLS to calculate the Collatz count for the first 1000 integers (just as the ARM software does). Start with the following top level structure:

toplevel.cpp

Because the Collatz loop is unbounded, HLS will show question marks instead of timing estimates. You might find the LOOP_TRIPCOUNT directive useful: it doesn't change the generated hardware at all, it just tells HLS to pretend a loop will execute a certain number of times so that you can get timing estimates and therefore work on optimising the design. It might be a good idea to use a testbench to validate your core.
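As a hedged illustration (not the course's official starter file), a Collatz toplevel with a LOOP_TRIPCOUNT directive might look like the sketch below. The pragma forms follow the usual Vivado HLS conventions, but your argument names and interface setup may differ; the trip-count bounds are guesses purely for reporting.

```c
#include <stdint.h>

#define N 1000

/* Hypothetical toplevel sketch: reads N starting values from shared memory,
 * replaces each with its Collatz step count. LOOP_TRIPCOUNT only affects
 * the synthesis report, never the generated hardware. */
void toplevel(volatile uint32_t *ram) {
#pragma HLS INTERFACE m_axi port=ram offset=slave
#pragma HLS INTERFACE s_axilite port=return

    mainloop: for (int i = 0; i < N; i++) {
        uint32_t n = ram[i];
        uint32_t count = 0;
        collatzloop: while (n != 1) {
#pragma HLS LOOP_TRIPCOUNT min=0 max=200 avg=20
            if (n % 2 == 0) n = n / 2;
            else            n = (3 * n) + 1;
            count++;
        }
        ram[i] = count;
    }
}
```

Off the FPGA the pragmas are ignored, so the same file can be compiled and exercised by a plain C testbench.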

Then place your IP core into a design and run your IP core to test that it outputs the correct answers. The main.c above should work to drive your IP core.

Then modify main(), around the calls to software() and hardware(), to time how long each takes. The Software API shows you how to measure time on the Zynq. Is the hardware or the software faster?
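The measurement itself is just two counter reads around the code under test. Here is a portable sketch of that pattern using standard C's clock() so it runs anywhere; on the bare-metal Zynq you would read the timer described in the Software API instead.

```c
#include <stdio.h>
#include <time.h>

unsigned int collatz(unsigned int n) {
    unsigned int count = 0;
    while (n != 1) {
        if (n % 2 == 0) n /= 2;
        else            n = (3 * n) + 1;
        count++;
    }
    return count;
}

/* Runs the 1000-integer Collatz sweep, returning the elapsed time in
 * seconds. The pattern is: read counter, run the code, read counter,
 * take the difference. */
double time_software(void) {
    clock_t before = clock();                 /* t0: counter before */
    for (unsigned int i = 1; i <= 1000; i++) {
        volatile unsigned int steps = collatz(i);
        (void)steps;                          /* keep the loop from being optimised away */
    }
    clock_t after = clock();                  /* t1: counter after */
    double seconds = (double)(after - before) / CLOCKS_PER_SEC;
    printf("software() took %f s\n", seconds);
    return seconds;
}
```

Wrapping hardware() the same way lets you compare the two numbers directly.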

Once you have completed this task, speak to Ian. You should now have the basics needed to undertake the assessment. However, the next practical will be excellent practice.