High Level Synthesis, Measuring Performance

Introduction to High-level synthesis

High-level synthesis transforms C functions into hardware IPs.

HLS works fairly well for inner blocks with data-oriented (resource-dominated) functionality and no complicated control flow structures. Examples include digital signal processing and arithmetic on matrices, where loops have data-independent exit conditions.
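To illustrate what a data-independent exit condition means, here is a minimal sketch in plain C. The function names and the length N_TAPS are hypothetical, chosen only for this illustration. The first loop always runs a fixed number of times, while the trip count of the second depends on the input values.

```c
#include <stddef.h>

#define N_TAPS 4  /* hypothetical length, fixed at compile time */

/* Data-independent exit condition: the loop always runs N_TAPS times,
 * so the trip count is known before the code ever runs. */
int dot_fixed(const int x[N_TAPS], const int h[N_TAPS])
{
    int acc = 0;
    for (size_t i = 0; i < N_TAPS; i++) {
        acc += x[i] * h[i];
    }
    return acc;
}

/* Data-dependent exit condition: the loop stops at the first zero in x,
 * so the trip count is only known at runtime. */
int sum_until_zero(const int x[N_TAPS])
{
    int acc = 0;
    size_t i = 0;
    while (i < N_TAPS && x[i] != 0) {
        acc += x[i];
        i++;
    }
    return acc;
}
```

Loops of the first kind are generally the easier ones for an HLS tool to schedule, unroll, or pipeline, since their latency can be worked out statically.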

It is not very good for the outer blocks, which typically involve complicated control structures (control-dominated). HLS-based generation of a control-dominated circuit such as a microcontroller remains a holy grail.

The HLS tool is temperamental - sometimes you get very good results, and sometimes you end up wondering what just happened. Even slight code changes that should have little or no functional relevance can sometimes produce substantially different hardware. This is not too surprising: using an inherently sequential high-level language to produce inherently parallel hardware is challenging.

Not all C code can be synthesized. Anything that depends on a runtime environment will not work; dynamic memory allocation is one example (no malloc). Keep in mind that the goal of HLS is not to create something that executes on a processor, but to create a sort of processor itself.
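As a rough illustration of the no-malloc restriction, the sketch below uses a fixed-size local array where ordinary software might allocate a buffer dynamically. The function name scale_fixed and the bound MAX_LEN are assumptions made up for this example.

```c
#define MAX_LEN 8  /* hypothetical compile-time upper bound on the data size */

/* Scales the first n elements of in[] by k and writes them to out[];
 * the remaining elements of out[] are set to zero.
 * The scratch buffer is a fixed-size local array rather than a malloc'd
 * block, so its size is known at synthesis time. */
void scale_fixed(const int in[MAX_LEN], int out[MAX_LEN], int n, int k)
{
    int tmp[MAX_LEN];

    if (n > MAX_LEN) {
        n = MAX_LEN;  /* clamp to the buffer size */
    }
    for (int i = 0; i < MAX_LEN; i++) {
        tmp[i] = (i < n) ? in[i] * k : 0;
    }
    for (int i = 0; i < MAX_LEN; i++) {
        out[i] = tmp[i];
    }
}
```

The price of this style is that the worst-case size has to be decided up front, which is usually acceptable for a coprocessor with a fixed data format.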

Creating an HLS-based design with default options might not yield good IP/hardware. It is a bit like going on an organized tour: you get some feel of the place and the bragging rights, but the experience is generally not so 'authentic'. You need to exert control over the generated hardware using optimization directives appropriate to the context and the design requirements/tradeoffs.

The interface of the IP generated by HLS can be one of the following:

AXIS. The IP generated is a drop-in replacement for the one you created in Lab 3, which is the option you should choose unless you wish to experiment. Note that the HLS-generated AXIS IP does not seem to work well with AXI DMA in the newer versions of Vitis.
The default HLS Protocol. This can then be integrated into your AXIS co-processor in the 'COMPUTING' state through HDL.
AXI4-Lite. This is pretty easy to use if you know what you are doing (adding registers and addressing etc).

HLS Flow

A step-by-step tutorial to create a coprocessor that is functionally identical to that of the template coprocessor given in lab 1 (to add 4 numbers and return the sum, sum+1, sum+2, and sum+3) packaged in lab 3 is in the HLS Flow page. Please try this out first.
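For reference, the behaviour of that template coprocessor can be captured in a few lines of plain C. This is only a behavioural model under assumed names (sum4_model, NUM_WORDS); it does not model the AXI Stream interface or any HLS directives, which are covered in the HLS Flow page.

```c
#define NUM_WORDS 4  /* the coprocessor consumes 4 numbers and returns 4 */

/* Behavioural model: out = { sum, sum+1, sum+2, sum+3 },
 * where sum is the total of the four input words. */
void sum4_model(const int in[NUM_WORDS], int out[NUM_WORDS])
{
    int sum = 0;

    for (int i = 0; i < NUM_WORDS; i++) {
        sum += in[i];
    }
    for (int i = 0; i < NUM_WORDS; i++) {
        out[i] = sum + i;
    }
}
```

A model like this can be handy as a golden reference when checking the output of the hardware version.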

Performance analysis

The process of collecting information about a program dynamically (i.e., during runtime) is called profiling. Many IDEs come with some sort of profiling tool. There are a lot of different approaches to profiling.

A typical way of profiling is to sample the program counter through the debug interface. This is supported by Xilinx tools - the TCF profiler is very easy to use, so please try it out. Based on these samples, the profiler reports what fraction (percentage) of time your program spends in each function. Profiling gives you clues about which function is a good candidate to spend your time and money on - you wouldn't want to bother much about a function that is not a performance bottleneck. The improvements could come from algorithmic optimizations and/or hardware acceleration.

Note: You need to do 'Debug' rather than 'Run' at least once for the profile option to be active. You may also need to loop the contents of each function several thousand times (after receiving the data just once via the serial console) to get a statistically meaningful comparison between functions when profiling using the TCF profiler.

Now, how do we measure the exact time taken to execute a specific segment of software code?

If the application is running on top of an OS (e.g. FreeRTOS), the OS typically provides some sort of system call / software timers. You could use this to log the timestamps at various points in your software code which allows you to measure the execution times for various segments of your software code. You won't need specialized software tools or hardware (apart from the hardware timer which the OS needs) for this. However, the time resolution you get is typically not very high.
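The timestamp-logging idea can be sketched with the standard C clock() function, used here purely as a portable stand-in for whatever tick or timer call your OS provides. The helper names time_segment_ms and work_under_test are hypothetical.

```c
#include <time.h>

static volatile unsigned long sink;  /* stops the compiler removing the loop */

/* Placeholder for the segment of code being measured. */
void work_under_test(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 100000UL; i++) {
        acc += i;
    }
    sink = acc;
}

/* Takes a timestamp before and after the segment and returns the
 * elapsed processor time in milliseconds. */
double time_segment_ms(void (*segment)(void))
{
    clock_t start = clock();
    segment();
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}
```

The resolution is limited by the tick period of the underlying clock, which is why short segments may report zero or a very coarse value.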

We can also use a dedicated hardware timer to get the precise number of cycles required to execute a segment of software code. The difference between the counter readings before and after the segment of software code you want to analyze will give you the number of cycles, and hence the time taken. Make sure that you do not have any unnecessary code between the two points where you read the counter. For example, printing the first reading of the counter immediately after reading it would be a bad idea.
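The arithmetic behind this can be sketched as follows, assuming a free-running 32-bit up-counter and a known timer clock frequency. Both are assumptions for illustration, as are the function names; check the configuration of your own timer.

```c
#include <stdint.h>

/* Cycles elapsed between two readings of a free-running 32-bit up-counter.
 * Unsigned subtraction wraps modulo 2^32, so the result is still correct
 * if the counter rolled over once between the two readings. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Converts a cycle count to microseconds for a timer clocked at clock_hz. */
double cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
    return (double)cycles * 1e6 / (double)clock_hz;
}
```

For example, with a hypothetical 100 MHz timer clock, readings of 100 and 350 give 250 cycles, i.e. 2.5 microseconds. Store both readings in variables first and do the subtraction and any printing only after the second reading.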

Using a hardware timer is the approach we follow in this lab. It is left entirely to you as a self-learning exercise. This will require you to integrate an AXI Timer in your hardware (Vivado) and use the appropriate driver functions in Vitis/SDK. By now, you should have the hang of things, and it should be easy enough as the timer/counter is a pretty simple peripheral. A couple of hints:

You need only one counter, whereas the timer IP block incorporates 2 counters by default. You can double-click the block in Vivado and uncheck the second one to reduce the hardware usage and synthesis time.
You can get started with the xtmrctr_polled_example project for axi_timer_0 (tmrctr). You can get this from hardware_1_wrapper > platform.spr > Peripheral Drivers section of the Board Support Package. Try running this first, and later integrate the relevant parts into your software C code.

Assignment 4

1) The assignment mainly involves creating the same coprocessor (that you created in Lab 1 and scaled up/packaged in Lab 3) using HLS and integrating it into the system. You can just use your Lab 3 block design and use the HLS-generated co-processor as a drop-in replacement for the HDL-based co-processor of Lab 3. The exact same software C code should work, but the project has to be updated with the new .xsa file, as your hardware and hence bitstream is different. If you wish (not a requirement), you can add the HLS-generated coprocessor into the system instead of replacing the Lab 3 coprocessor.

All the required files are here

2) Further, you should integrate an AXI Timer into your block design, and modify your software C code to report the time taken by the

Software implementation of matrix multiplication. If you did not have the software version in Lab 3 and used pre-computed results, it is time to copy over the software version from Lab 2. If your code was such that the software version of multiplication was interspersed with the receiving of data from the serial console (RealTerm), please separate them out, i.e., receive the data first, and then do the matrix multiplication.
HLS-generated hardware implementation of matrix multiplication.

Any overhead that is irrelevant to a comparison between hardware and software should be excluded from profiling/measurement of the time taken.

The time taken for printing messages and receiving the data via serial (RealTerm) should not be included.
Comment out messages such as 'Transmitting Data for test case', 'Receiving data for test case' etc. This can be done conveniently using the C preprocessor (#define - #ifdef - #endif), without having to manually comment/uncomment each line/block.
The time taken for the hardware version should be inclusive of the time taken for sending data to and receiving data from the coprocessor (i.e., writing to / reading from AXI Stream FIFO), as this is an unavoidable overhead* associated with offloading computations to hardware.
* This overhead can possibly be ignored when using DMA in a non-blocking fashion, i.e., the CPU is performing some other useful task while the DMA data transfer and co-processor computations are in progress.
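The preprocessor approach to silencing messages can be sketched as below. The macro names VERBOSE and LOG_MSG and the function send_test_case are hypothetical, chosen only for this illustration.

```c
#include <stdio.h>

/* Uncomment the next line (or pass -DVERBOSE to the compiler) to turn the
 * messages back on. With VERBOSE undefined, LOG_MSG compiles to nothing,
 * so the messages add no time to the measured code. */
/* #define VERBOSE */

#ifdef VERBOSE
#define LOG_MSG(msg) printf("%s\n", (msg))
#else
#define LOG_MSG(msg) ((void)0)
#endif

/* Stand-in for a routine that sends one test case to the coprocessor. */
int send_test_case(int test_case)
{
    LOG_MSG("Transmitting Data for test case");
    /* ... the actual data transfer would go here ... */
    return test_case;
}
```

A single #define then switches every message on or off, instead of commenting each print statement by hand.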

It is suggested that you create two separate functions called from the main program - something like matrix_multiply_soft() and matrix_multiply_hard() for the software and the hardware versions respectively. This also facilitates profiling using the TCF profiler which can do profiling only at a function level. Note that you may need to call these two functions many times in a loop (after receiving the data just one time from the serial console) to get a statistically meaningful comparison between these two when you do profiling using the TCF profiler.
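A minimal sketch of the software side of this structure is given below. The matrix dimensions are hypothetical placeholders (use those of your own lab), and repeat_soft is an assumed helper name. The hardware version, matrix_multiply_hard(), would be wrapped in the same kind of loop.

```c
#define ROWS_A 2  /* hypothetical dimensions, for illustration only */
#define COLS_A 3
#define COLS_B 2

/* Plain software matrix multiplication: res = a * b. */
void matrix_multiply_soft(int a[ROWS_A][COLS_A],
                          int b[COLS_A][COLS_B],
                          int res[ROWS_A][COLS_B])
{
    for (int i = 0; i < ROWS_A; i++) {
        for (int j = 0; j < COLS_B; j++) {
            int sum = 0;
            for (int k = 0; k < COLS_A; k++) {
                sum += a[i][k] * b[k][j];
            }
            res[i][j] = sum;
        }
    }
}

/* Calls the function `repeats` times on the same, already received data,
 * so that a sampling profiler collects enough samples inside it. */
void repeat_soft(int repeats,
                 int a[ROWS_A][COLS_A],
                 int b[COLS_A][COLS_B],
                 int res[ROWS_A][COLS_B])
{
    for (int r = 0; r < repeats; r++) {
        matrix_multiply_soft(a, b, res);
    }
}
```

Keeping the receive step outside both functions means the serial console time never shows up in either measurement.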

3) You are also required to try at least one possible optimization in HLS and compare the performance on hardware (which wouldn't require any modifications to your software C code) with the vanilla (non-optimized) version. The C code for HLS needs to have appropriate pragmas inserted manually or graphically. This is a self-learning / self-exploration exercise.

You should read and get an overview of the following 4 optimizations (a fair idea is good enough, detailed knowledge is not expected) from the document https://www.xilinx.com/content/dam/xilinx/support/documentation/sw_manuals/xilinx2018_1/ug1270-vivado-hls-opt-methodology-guide.pdf. The page numbers below are the pages where the topic starts, not necessarily the only 4 pages you need to read. You can Google if you wish to know more about these.

pragma HLS array_partition: page 83
pragma HLS dataflow: page 91
pragma HLS pipeline: page 116
pragma HLS unroll: page 125

Note that while these optimizations are applied independently, some of them work well only when other optimizations are also used. For example, pipelining or loop unrolling without partitioning the array wouldn't help much, as the bottleneck will be accessing the memory, 1 or 2 elements at a time. The effect of these optimizations is not always predictable though, given the nature and immaturity of HLS tools.
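To show how such pragmas sit in the source and why partitioning and unrolling go together, here is a small sketch on a dot product rather than on the assignment's matrix multiplication. The function name, the length N, and the choice of pragma options are assumptions for illustration. The exact pragma syntax can differ between tool versions, so check it against the guide for your version.

```c
#define N 8  /* hypothetical vector length */

int dot_product(const int a_in[N], const int b_in[N])
{
    int a[N];
    int b[N];
    /* Split the local copies into individual registers so that all
     * elements can be read in the same cycle; without this, the unrolled
     * loop below would be limited by the memory ports. */
#pragma HLS array_partition variable=a complete dim=1
#pragma HLS array_partition variable=b complete dim=1

    /* Copy loop: request one new iteration per clock cycle. */
    for (int i = 0; i < N; i++) {
#pragma HLS pipeline II=1
        a[i] = a_in[i];
        b[i] = b_in[i];
    }

    /* Compute loop: request full unrolling so the multiplications can
     * be done in parallel. */
    int acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS unroll
        acc += a[i] * b[i];
    }
    return acc;
}
```

An ordinary C compiler ignores the HLS pragmas (possibly with a warning), so the same source can still be compiled and checked in software. Whether the tool actually achieves the requested behaviour should be confirmed in the synthesis report.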

Also, note that some of these optimisations are done automatically by the HLS tools in the newer versions of Vitis. You will be able to see these in the reports, and they can be controlled to some extent through the configuration file.

Submission Info

Assignment 4 (7 marks)

Demonstrate in week 9.

Upload

Upload a .zip file containing the following, exactly as used for the demo, to Canvas within 1 hour of your demo:

The .cpp files used for HLS implementation and test/co-simulation. The directives.tcl file should also be included if the 'Directive Destination' is 'Directive File' instead of 'Source File'.
.c/.h file(s) running on ARM Cortex A53 used to send data to the co-processor, including timer (only those you have modified).
A screenshot of your IP integrator canvas, i.e., the block diagram (please do not upload the entire Vivado project folder).
.xsa file(s).
A text file containing the information printed on the serial console, which should have info on the time taken.
Optional - a screenshot of the TCF profiler tab showing a comparison between the time spent for hardware and software versions of the matrix multiplication functions.

It should be a .zip archive, with the filename Wed/Fri_GroupNum_Lab4.zip.

Please DO NOT upload the whole project!

References

Here are some references that can help you get started with the Vivado High-Level Synthesis tool:

Xilinx official presentation slides introducing HLS
A good Xilinx official presentation on optimization
https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Getting-Started-with-Vitis-HLS
Vivado Design Hub - High-Level Synthesis. The documents here are very useful.
Vivado HLS flow on Zynq workshop: just register for a Xilinx account then you can download all the material for the workshop. After finishing all the labs, you should be able to apply HLS to your project.
Parallel Programming for FPGAs : A very good free textbook on HLS (http://kastner.ucsd.edu/hlsbook/).