High Level Synthesis

Final

This assignment description is now complete. There could still be minor updates, which will be highlighted.

Follow the Spirit

Some screenshots may have been taken on other versions of Vivado/Vitis or for other configurations. The instructions could also vary slightly depending on the exact design/configuration you follow - such as whether you have separate or combined designs with multiple coprocessors / interfacing method (DMA/FIFO). The spirit of the instructions remains the same. Understand the significance of each step rather than following it mechanically.

Introduction to High-level synthesis

High-level synthesis transforms C functions into hardware IPs.

HLS works fairly well for inner blocks whose functionality is data-oriented (resource-dominated) and free of complicated control flow structures. Examples include digital signal processing and arithmetic on matrices, where loops have data-independent exit conditions.

It is not very good for outer blocks, which typically involve complicated control structures (control-dominated). HLS-based generation of a control-dominated circuit such as a microcontroller remains a holy grail.

The HLS tool is temperamental - sometimes you get very good results, and sometimes you end up wondering what just happened. Sometimes even slight code changes that should have little/no functional relevance can produce substantially different hardware. Well, using an inherently sequential high-level language to produce inherently parallel hardware is challenging.

Not all C code can be synthesized. Anything that depends on a runtime environment would not work. An example would be dynamic memory allocation (no malloc). Keep in mind that the goal of HLS is not to create something which executes on a processor, it is to create a sort of a processor itself.
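As a minimal sketch of this restriction, the function below keeps its scratch buffer in a fixed-size local array instead of calling malloc, so every size is known at compile time. The function name sum_of_squares and the size N = 8 are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical buffer size for illustration; a real design uses its own dimensions.
constexpr int N = 8;

// Synthesizable style: the size of buf is known at compile time, so a tool
// can map it to registers or on-chip memory. No malloc()/free() is involved.
int32_t sum_of_squares(const int32_t in[N]) {
    int32_t buf[N];  // fixed-size local array instead of a heap allocation
    for (int i = 0; i < N; ++i) {
        buf[i] = in[i] * in[i];
    }
    int32_t acc = 0;
    for (int i = 0; i < N; ++i) {
        acc += buf[i];
    }
    return acc;
}
```

The same computation written with a heap-allocated buffer of run-time size would depend on a runtime environment, which is what the paragraph above rules out.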

Creating an HLS-based design based on default options might not yield a good IP/hardware. It will be kind of like going on an organized tour. You will get some feel of the place, you will have the bragging rights, but the experience is generally not so 'authentic'. You need to exert control over the generated hardware using optimization directives, chosen according to the context and the design requirements/tradeoffs.

The accelerator IP that we create needs to be interfaced with the rest of the system (that you created in Assignment 1) so that the processor can use it as a coprocessor. The interface of the IP generated by HLS can be:

Register-based: The processor can read and write the registers within the coprocessor to write inputs/read outputs - each register has an address within the address space of the processor. The parameters and return values are mapped to these registers/addresses. For example, A is in the offset range 0xx to 0xyy - the actual address range is the base address of the coprocessor peripheral (assigned in Vivado under the address tab) + the offset. The coprocessor in this case has an AXI or AXI Lite interface which can be connected to the AXI bus of the system as a slave (similar to how the timer was connected in Assignment 1).

Stream-based: There are separate input and output streams through which the data is streamed in/out. There is no concept of addresses, and the meaning of the data is derived from the order of the data (and possibly some 'tags'). For example, the first 512 words correspond to A, the next 64 correspond to B, and so on. Please read the Introduction to AXI Stream page.

Memory-based: The co-processor reads inputs from / writes outputs to memory directly.

For now, we will start with the stream-based interface, which is perhaps the easiest to get started with and follows directly from the previous labs.
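The order-based meaning of stream data can be sketched in plain C++ as follows. This is a software illustration only: std::queue is a stand-in for a hardware stream, and the names WordStream, read_word and receive_operands are hypothetical. The word counts (512 for A, then 64 for B) come from the example in the stream-based description above.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Word counts from the example in the text: 512 words for A, then 64 for B.
constexpr int A_WORDS = 512;
constexpr int B_WORDS = 64;

// Software stand-in for a hardware stream (assumption: an actual HLS design
// would use a stream type provided by the tool instead).
using WordStream = std::queue<uint32_t>;

// Pop one word; the caller must guarantee that the stream is not empty.
static uint32_t read_word(WordStream &s) {
    assert(!s.empty());
    uint32_t w = s.front();
    s.pop();
    return w;
}

// There are no addresses: the position of a word in the stream alone
// decides whether it belongs to A or to B.
void receive_operands(WordStream &in, uint32_t A[A_WORDS], uint32_t B[B_WORDS]) {
    for (int i = 0; i < A_WORDS; ++i) {
        A[i] = read_word(in);  // words 0..511 of the stream
    }
    for (int i = 0; i < B_WORDS; ++i) {
        B[i] = read_word(in);  // words 512..575 of the stream
    }
}
```

If the sender and receiver disagree on the order or the counts, the data is silently misinterpreted, which is why the ordering has to be fixed as part of the interface.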

Assignment 4

The template files are here.

The assignment involves

1) Creating a stream-based coprocessor (aka accelerator) to do matrix multiplication (RES=A*B/256) using HLS and integrating it into the system. The matrix multiplication problem is exactly the same as in Lab 2 and 3 (the part of your program that sends the matrices A and B from the PC can be commented out, and the matrices can be hard-coded in your program for convenience), just that the coprocessor is now generated using HLS rather than HDL.

You should do the coprocessor interfacing using AXI DMA. AXI FIFO is a fall-back option in case you have issues with DMA (it will result in some reduction in marks, as you will not be able to use it with Pynq). Nevertheless, it might not be a bad idea to try it with FIFO first before venturing into DMA.
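A software reference for RES=A*B/256 is handy for checking the coprocessor output. The sketch below is one possible version, not the template code: the function name matmul_div256 and the row-major flat-array layout are assumptions, as is the choice to apply a truncating integer division once per fully accumulated dot product.

```cpp
#include <cassert>
#include <cstdint>

// Software reference for RES = A*B/256 on row-major matrices.
// A is rows x inner, B is inner x cols, RES is rows x cols.
// Assumption: the division by 256 is an integer division applied once,
// after each dot product has been fully accumulated.
void matmul_div256(const uint32_t *A, const uint32_t *B, uint32_t *RES,
                   int rows, int inner, int cols) {
    assert(rows > 0 && inner > 0 && cols > 0);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            uint32_t acc = 0;
            for (int k = 0; k < inner; ++k) {
                acc += A[i * inner + k] * B[k * cols + j];
            }
            RES[i * cols + j] = acc / 256;
        }
    }
}
```

Comparing the words received from the coprocessor against this reference, element by element, gives a simple pass/fail check for both the FIFO and DMA designs.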

2) Further, for each case above, you need to compare the hardware and software performance using the AXI Timer. Time only the relevant parts, as in the previous labs; for hardware designs, include the overhead of sending/receiving data to/from the co-processor, as it is unavoidable. Profiling for performance comparison is not mandatory, but you are encouraged to try it for your own experience.
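The arithmetic behind such a comparison can be sketched as below. This is an illustration under stated assumptions, not the timer driver code: it assumes the timer is read as a free-running 32-bit up-counter, and the function names and the clock frequency passed in are hypothetical. How the counter values are actually read from the AXI Timer is not shown.

```cpp
#include <cassert>
#include <cstdint>

// Ticks between two readings of a free-running 32-bit up-counter (assumed).
// Unsigned subtraction stays correct across a single counter wrap-around.
uint32_t elapsed_ticks(uint32_t start, uint32_t end) {
    return end - start;
}

// Convert ticks to microseconds for a given timer clock frequency in Hz.
double ticks_to_us(uint32_t ticks, double timer_clk_hz) {
    assert(timer_clk_hz > 0.0);
    return static_cast<double>(ticks) * 1e6 / timer_clk_hz;
}

// Speedup of the hardware path over the software path. hw_ticks should
// include the time spent sending/receiving data to/from the coprocessor.
double speedup(uint32_t sw_ticks, uint32_t hw_ticks) {
    assert(hw_ticks != 0);
    return static_cast<double>(sw_ticks) / static_cast<double>(hw_ticks);
}
```

For instance, with a hypothetical 100 MHz timer clock, 100 ticks correspond to 1 microsecond, and a software run of 1000 ticks against a hardware run of 250 ticks is a speedup of 4.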

3) You are also required to try at least one possible optimization in HLS and compare the performance on hardware (which wouldn't require any modifications to your software C code) with the vanilla (non-optimized) version. The C code for HLS needs to have appropriate pragmas inserted manually or graphically. This is a self-learning / self-exploration exercise.

You should read and get an overview of the following 4 optimizations from the document https://docs.amd.com/v/u/en-US/ug1270-vivado-hls-opt-methodology-guide. The page numbers below are the pages where the topic starts, not necessarily the only 4 pages you need to read. These were covered in the lecture on a conceptual level.

  • pragma HLS array_partition (p. 83)
  • pragma HLS dataflow (p. 91)
  • pragma HLS pipeline (p. 116)
  • pragma HLS unroll (p. 125)

Note that while these optimizations are applied independently, some of them work well only when others are also used. For example, pipelining or loop unrolling without partitioning the array wouldn't help much, as the bottleneck will be accessing the memory, 1 or 2 elements at a time. The effect of these optimizations is not always predictable though, given the nature and immaturity of HLS tools.
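The interplay described above can be sketched as follows: the local buffers are partitioned so that the unrolled inner loop can read all of its operands in the same cycle, and the outer loop is pipelined. The function name matvec_div256 and the sizes ROWS and INNER are hypothetical, and this is only one possible placement of the pragmas, not a tuned design. A standard C++ compiler ignores the HLS pragmas, so the code can be checked in software first.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes for illustration only.
constexpr int ROWS = 4;
constexpr int INNER = 8;

// Computes RES = A*B/256 for a ROWS x INNER matrix A and an INNER x 1 vector B.
void matvec_div256(const uint32_t A[ROWS][INNER], const uint32_t B[INNER],
                   uint32_t RES[ROWS]) {
    uint32_t a_buf[ROWS][INNER];
    uint32_t b_buf[INNER];
    // Split the buffers so that all INNER elements of a row (and of B)
    // can be read in parallel instead of 1 or 2 per cycle.
#pragma HLS array_partition variable=a_buf complete dim=2
#pragma HLS array_partition variable=b_buf complete dim=1

    // Copy the inputs into the local (partitioned) buffers.
    for (int i = 0; i < ROWS; ++i) {
        for (int k = 0; k < INNER; ++k) {
            a_buf[i][k] = A[i][k];
        }
    }
    for (int k = 0; k < INNER; ++k) {
        b_buf[k] = B[k];
    }

    for (int i = 0; i < ROWS; ++i) {
        // Request that a new row starts every clock cycle.
#pragma HLS pipeline II=1
        uint32_t acc = 0;
        for (int k = 0; k < INNER; ++k) {
            // Fully unroll the dot product; this relies on the partitioning above.
#pragma HLS unroll
            acc += a_buf[i][k] * b_buf[k];
        }
        RES[i] = acc / 256;
    }
}
```

Whether the tool actually achieves the requested initiation interval has to be checked in the synthesis report; removing the array_partition pragmas is a quick way to see the memory bottleneck mentioned above.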

Newer versions of Vitis perform some of these optimizations (e.g., pipelining, array partitioning) automatically provided certain conditions are met. You can disable this from the settings or via pragmas to see the performance of the non-pipelined versions.

#pragma HLS pipeline off

It can also be done via tcl.

set_directive_pipeline -off [get_loops "loop_label"]

and also via hls_component > settings > hls_config.cfg > C Synthesis > Compile > compile.pipeline_loops to 0 (hls component and config file names to be changed as appropriate).
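As a small sketch of where the pragma goes in the source, it is placed inside the body of the loop whose automatic pipelining should be disabled. The function name accumulate_baseline and the length LEN are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical array length for illustration only.
constexpr int LEN = 16;

// Baseline (non-pipelined) accumulation, intended for comparison against
// an optimized version of the same loop.
uint32_t accumulate_baseline(const uint32_t data[LEN]) {
    uint32_t acc = 0;
    for (int i = 0; i < LEN; ++i) {
        // Placed inside the loop body: applies to this loop only.
#pragma HLS pipeline off
        acc += data[i];
    }
    return acc;
}
```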

The dataflow optimisation by itself will likely not yield any improvement in performance without modifying the software (C program) significantly to take advantage of the hardware optimisations. This is not easy, and will ideally need an operating system (e.g., FreeRTOS that is supported out of the box), and hence is a purely optional exercise.
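For orientation, the HLS side of a dataflow design is usually structured as a chain of functions connected by FIFOs, with the pragma placed in the function that calls them. The sketch below shows that shape only. std::queue is a software stand-in for the FIFO channels (an assumption; an actual HLS design would use a stream type provided by the tool), and the names load, scale, store and dataflow_top as well as the size N are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical number of words for illustration only.
constexpr int N = 16;

// Software stand-in for the FIFO channels between the stages (assumption).
using Fifo = std::queue<uint32_t>;

// Stage 1: read the input and push it into the first channel.
static void load(const uint32_t in[N], Fifo &to_scale) {
    for (int i = 0; i < N; ++i) {
        to_scale.push(in[i]);
    }
}

// Stage 2: consume one word at a time, divide by 256, pass the result on.
static void scale(Fifo &from_load, Fifo &to_store) {
    for (int i = 0; i < N; ++i) {
        uint32_t v = from_load.front();
        from_load.pop();
        to_store.push(v / 256);
    }
}

// Stage 3: drain the second channel into the output.
static void store(Fifo &from_scale, uint32_t out[N]) {
    for (int i = 0; i < N; ++i) {
        out[i] = from_scale.front();
        from_scale.pop();
    }
}

void dataflow_top(const uint32_t in[N], uint32_t out[N]) {
    // Asks the tool to let the three stages overlap in time.
    // In this software model they simply run one after another.
#pragma HLS dataflow
    Fifo a;
    Fifo b;
    load(in, a);
    scale(a, b);
    store(b, out);
}
```

The overlap only pays off if the software keeps the coprocessor supplied with back-to-back data, which is the software restructuring the paragraph above refers to.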

When evaluating the effect of HLS optimisations, you can choose either AXI Stream FIFO or AXI DMA-based interfacing, but the latter is strongly recommended. As is the case with FIFO and DMA based designs, it is fine to have separate projects or a combined project for the designs with and without optimisations.

To summarise, we have 2 scenarios. We can have either a single project combining the 2 above, or 2 separate projects. Of course, combining those into a single project will make for a better/faster demo - this page might give you some ideas.

  • Non-optimised coprocessor interfaced using DMA.
  • Optimised coprocessor interfaced using DMA/FIFO (DMA recommended).

4) You should run at least one design using Pynq. This is very straightforward if you have DMA-based interfacing working.

Submission Info

Assignment 4 (7 marks)

Demonstrate in week 9.

Upload a .zip file containing

  • the .cpp files used for HLS implementation and test/co-simulation. The directives.tcl file should also be included if the 'Directive Destination' is 'Directive File' instead of 'Source File'.
  • the .c/.h file(s) running on the ARM Cortex-A53 used to send data to the co-processor, including the timer code (only those you have modified).
  • a screenshot of your IP integrator canvas, i.e., the block diagram (please do not upload the entire Vivado project folder) for each case (or combined).
  • the .xsa file(s).
  • a text file containing the information printed on the serial console in each case (or combined).
  • a text file containing the information printed on the serial/SSH console when running Pynq on Kria.

to Canvas within 1 hour of your demo. The exact same files should be used for evaluation.

Submit it as a .zip archive with the filename GroupNum_Lab4.zip.

Please DO NOT upload the whole project!

You will also need to do an online demonstration to a teaching assistant (based on what you submitted at the point of the assignment deadline, not the version you may have improved after the deadline) - arrangements will be made known in due course.

References

Here are some references that can help you get started with the Vivado High-Level Synthesis tool