Quick Start

Summary

  1. Check the Requirements

  2. Download the Container

  3. Get the Application

  4. Find an FPGA Kernel

  5. Integrate Application and FPGA Kernel

  6. Run on FPGAs

1. Check the Requirements

To get started with the framework, you need:

  • One (or more) servers with
    1. Singularity CE - Singularity Download

    2. Compatible MPICH installation (MPICH v. 4.0.2) - MPICH Download

    3. AMD Vitis Tools 2022.1 (or later) - Vitis Page

    4. AMD XRT version 2.13.466 (or later) - XRT Page

    5. One (or more) AMD Alveo Board

2. Download the Container

For this tutorial, we will be using the OMPC FPGA container. It includes all the tools needed to run applications using the framework. We suggest using Singularity.

Download the container using Singularity:

singularity pull docker://pedroohr/runtime-fpga:latest

3. Get the Application

As an example, let’s start with a basic vector addition. The following figure shows the application:

Element-wise vector addition

A basic CPU kernel for the application can be implemented as follows:

void vadd_cpu(int *A, int *B, int *C, int size) {
   for (int i = 0; i < size; i++)
      C[i] = A[i] + B[i];
}

It is called somewhere in the application as:

vadd_cpu(A, B, C, N);

Note

This application is already available in the container on the path /examples/vadd/vadd_cpu.cpp.

You can also find a fully functional implementation example here: vadd_cpu
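Before moving to the FPGA, it can help to confirm the CPU kernel on the host. The following is a minimal sketch, not part of the framework: `run_vadd_cpu_check` is a hypothetical helper name, and the input pattern (`A[i] = i`, `B[i] = 2 * i`) is chosen only so the expected result is easy to verify.

```cpp
#include <vector>

// CPU kernel from the text above.
void vadd_cpu(int *A, int *B, int *C, int size) {
   for (int i = 0; i < size; i++)
      C[i] = A[i] + B[i];
}

// Hypothetical helper (an assumption, not a framework function): fills two
// input vectors, runs vadd_cpu, and returns true only if every element of C
// equals A[i] + B[i], which is 3 * i for the inputs chosen here.
bool run_vadd_cpu_check(int n) {
   std::vector<int> A(n), B(n), C(n, 0);
   for (int i = 0; i < n; i++) {
      A[i] = i;
      B[i] = 2 * i;
   }
   vadd_cpu(A.data(), B.data(), C.data(), n);
   for (int i = 0; i < n; i++)
      if (C[i] != 3 * i)
         return false;
   return true;
}
```

Calling `run_vadd_cpu_check(1024)` from a `main` function and checking that it returns `true` gives a quick sanity test of the kernel.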

4. Find an FPGA Kernel

The framework lets you use any FPGA kernel that can serve as an alternative to a defined CPU function (i.e., the two share equivalent prototypes).

Tip

Application developers can always change the prototype of a CPU function to match the desired FPGA kernel (even if some arguments are not used in the CPU implementation).
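As an illustration of this tip, suppose an FPGA kernel takes one argument more than the CPU function needs. The sketch below is hypothetical: `vadd_burst_fpga`, `vadd_burst_cpu`, and the `burst_len` parameter are invented names used only to show how a CPU function can accept an argument it ignores so that both prototypes match.

```cpp
// Hypothetical FPGA kernel prototype (an assumption for illustration only):
// it takes an extra tuning argument, burst_len, that a CPU loop has no use for.
void vadd_burst_fpga(int *A, int *B, int *C, int size, int burst_len);

// CPU function reshaped to share that prototype. The last parameter is
// accepted so the signatures are equivalent, but it is left unnamed because
// the CPU implementation does not use it.
void vadd_burst_cpu(int *A, int *B, int *C, int size, int /*burst_len*/) {
   for (int i = 0; i < size; i++)
      C[i] = A[i] + B[i];
}
```

Callers then pass the extra argument in every call, and the CPU path simply discards it.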

So, let’s say we found an FPGA implementation for the vadd kernel (in this case, an HLS version of the kernel):

void vadd_fpga(int *A, int *B, int *C, int size) {
#pragma HLS INTERFACE m_axi port = A
#pragma HLS INTERFACE m_axi port = B
#pragma HLS INTERFACE m_axi port = C
#pragma HLS INTERFACE s_axilite port = return
   for (int i = 0; i < size; i++)
      C[i] = A[i] + B[i];
}

That implementation can be compiled using the AMD Vitis™ compiler. The code below shows how to compile for the AMD Alveo U55C board.

v++

Note

This kernel implementation is already available in the container on the path /examples/vadd/fpga_kernel.cpp.

You can also find the kernel implementation here: vadd_cpu.cpp
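Because the HLS kernel is ordinary C++ with extra pragmas, its logic can be checked in software before spending time on an FPGA build. The sketch below assumes a regular host compiler, which ignores the `#pragma HLS` lines (possibly with an unknown-pragma warning). `count_vadd_mismatches` is a hypothetical helper, not a framework or Vitis function.

```cpp
#include <vector>

// HLS kernel from the text above. A regular C++ compiler ignores the HLS
// pragmas, so the function behaves like a plain loop on the host.
void vadd_fpga(int *A, int *B, int *C, int size) {
#pragma HLS INTERFACE m_axi port = A
#pragma HLS INTERFACE m_axi port = B
#pragma HLS INTERFACE m_axi port = C
#pragma HLS INTERFACE s_axilite port = return
   for (int i = 0; i < size; i++)
      C[i] = A[i] + B[i];
}

// Hypothetical helper (an assumption for illustration): runs the kernel on
// the host and counts the elements that differ from a reference sum.
int count_vadd_mismatches(int n) {
   std::vector<int> A(n), B(n), C(n, 0);
   for (int i = 0; i < n; i++) {
      A[i] = i - n / 2;   // includes negative values
      B[i] = 7 * i + 1;
   }
   vadd_fpga(A.data(), B.data(), C.data(), n);
   int mismatches = 0;
   for (int i = 0; i < n; i++)
      if (C[i] != A[i] + B[i])
         mismatches++;
   return mismatches;
}
```

A return value of 0 means the kernel source matches the reference on the host. It says nothing about timing or resource usage on the board.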

5. Integrate Application and FPGA Kernel

The integration of the FPGA kernel can be done with just a few lines of code.

To tell the program that the FPGA kernel should be used as an alternative to the CPU kernel, we need two lines of code: the forward declaration of vadd_fpga and the declare variant pragma (the first two lines below):

void vadd_fpga(int *A, int *B, int *C, int size);
#pragma omp declare variant(vadd_fpga) match(device={arch(alveo)})
void vadd_cpu(int *A, int *B, int *C, int size) {
   for (int i = 0; i < size; i++)
      C[i] = A[i] + B[i];
}

Finally, where the function is called in the code, we need to create an OpenMP target task (the omp target pragma with nowait) and establish a synchronization point (the omp taskwait pragma), so the program knows when to execute the kernels.

#pragma omp target map(to: A[:N], B[:N]) map(tofrom: C[:N]) nowait
vadd_cpu(A, B, C, N);

#pragma omp taskwait

Important

Observe how the original call to vadd_cpu does not change, even when using FPGAs!

Note

This application is already available in the container on the path /examples/vadd/vadd_fpga.cpp.

You can also find a fully functional implementation example here: vadd_fpga.cpp
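Putting the two pieces of this section together, a single translation unit might look like the sketch below. `run_vadd` is a hypothetical wrapper added only to keep the target task and the synchronization point in one place. It is not part of the framework, and the example shipped in the container may be organized differently. When built with a compiler that has OpenMP disabled, the pragmas are ignored and the call runs as plain CPU code, which is a convenient way to check the logic.

```cpp
#include <vector>

// Forward declaration of the FPGA kernel, as in the text above.
void vadd_fpga(int *A, int *B, int *C, int size);

// Register vadd_fpga as the variant of vadd_cpu for Alveo devices.
#pragma omp declare variant(vadd_fpga) match(device={arch(alveo)})
void vadd_cpu(int *A, int *B, int *C, int size) {
   for (int i = 0; i < size; i++)
      C[i] = A[i] + B[i];
}

// Hypothetical wrapper (an assumption for illustration): creates the target
// task for the call and waits for it to finish before returning.
void run_vadd(int *A, int *B, int *C, int N) {
   // Deferred target task: A and B are copied in, C is copied in and out.
   #pragma omp target map(to: A[:N], B[:N]) map(tofrom: C[:N]) nowait
   vadd_cpu(A, B, C, N);

   // Synchronization point: C is safe to read after this line.
   #pragma omp taskwait
}
```

The caller passes plain pointers, for example `run_vadd(a.data(), b.data(), c.data(), n)` for three `std::vector<int>` objects of length `n`, and reads the result from `c` after the call returns.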

6. Run on FPGAs

To run the application with the FPGA kernel, first compile it and then run it, both using the provided container.

Compile it using Singularity:

singularity exec runtime-fpga_latest.sif clang++ -fopenmp -fopenmp-targets=alveo -fno-openmp-new-driver vadd_fpga.cpp -o vadd_fpga

Run it using Singularity:

# Runs using 1 worker node containing FPGAs
mpirun -np 2 singularity exec runtime-fpga_latest.sif ./vadd_fpga

Important

Currently, applications are run using mpirun. The number of processes passed to -np is always 1 + the number of workers.
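The process-count rule above is simple arithmetic, but it is easy to get wrong when scripting launches for several workers. The following sketch encodes it. Both function names are hypothetical, and the command layout simply mirrors the mpirun line shown earlier; the image and binary names are passed in by the caller.

```cpp
#include <string>

// Number of MPI processes to request: one extra process on top of the
// workers, following the "1 + number of workers" rule stated above.
int mpi_process_count(int workers) {
   return 1 + workers;
}

// Hypothetical helper (an assumption for illustration): assembles a launch
// command in the same shape as the mpirun line shown above.
std::string build_launch_command(int workers, const std::string &image,
                                 const std::string &binary) {
   return "mpirun -np " + std::to_string(mpi_process_count(workers)) +
          " singularity exec " + image + " " + binary;
}
```

For one worker, `mpi_process_count(1)` gives 2, which matches the `-np 2` used in the run command above.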

That is it! Happy coding with FPGAs!