Getting Started with Software Development for the Xilinx Alveo FPGA Accelerator Cards (by 小强不吃菜)

Published: 2023-08-25 16:18:26 · Hardware Development · Tags: FPGA development
As heterogeneous computing grows more popular, FPGA accelerator cards are being used in more and more fields. FPGA accelerator cards and GPU accelerator cards accelerate in completely different ways. A GPU relies on a massive number of parallel compute units to raise overall throughput and saturate memory bandwidth. An FPGA is a general-purpose gate array: by building dedicated compute units tailored to the data structure and the characteristics of the computation, it can deliver high throughput with lower power and latency. This article uses a simple demo to introduce the FPGA project structure and how it works.

Background

As heterogeneous computing grows more popular, FPGA accelerator cards are being used in more and more fields. FPGA accelerator cards and GPU accelerator cards accelerate in completely different ways. A GPU relies on a massive number of parallel compute units to raise overall throughput and saturate memory bandwidth. An FPGA is a general-purpose gate array: by building dedicated compute units tailored to the data structure and the characteristics of the computation, it can deliver high throughput with lower power and latency. The previous article covered the environment setup; this one focuses on the project structure and the working principles.

Overall Architecture

When accelerating with a GPU, the CPU only needs to load data and instructions onto the GPU; there is no need to design the compute unit that executes those instructions. With an FPGA, the compute kernel itself has to be built. As shown in the figure below, the host connects to the FPGA accelerator card over PCIe, so there is corresponding code on both the Host side and the Device side. Traditional FPGA development uses HDL languages and cannot dynamically modify the logic inside the FPGA. The whole point of FPGA acceleration is that the compute kernel closely fits the data, so whenever the data layout or the processing flow changes, the kernel has to be rewritten; if every rewrite meant taking the server offline, that would clearly be unacceptable. The common industry practice is therefore to partition the chip into a static region that cannot be reprogrammed at run time and a dynamic region that can. The static region holds infrastructure IP cores such as DMA, PCIe and DDR controllers, while the user's compute kernels are placed in the dynamic region and connected through AXI interfaces.

Xilinx provides the data exchange between Host and Device; we only need to take care of the Host program and the Device Kernel. As shown in the figure below, the project is organized into three parts: the Host program, the FPGA Kernel, and the Link that connects the Kernel to the rest of the FPGA. Because this link layer exists, differences between FPGA chips are hidden away, which lowers the difficulty of kernel development and decouples it from the hardware to some degree.
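To make the Kernel/Link split more concrete, below is a minimal sketch of a link-stage connectivity file passed to v++. The kernel name vadd comes from the demo later in this article; the compute-unit name and the HBM bank assignments are assumptions for illustration and would be adapted to the real design:

# link.cfg -- handed to the v++ link step via --config
[connectivity]
# instantiate one compute unit (CU) of the vadd kernel
nk=vadd:1:vadd_1
# map each AXI master port of the CU to a memory bank (the U50 exposes HBM)
sp=vadd_1.in1:HBM[0]
sp=vadd_1.in2:HBM[1]
sp=vadd_1.out:HBM[2]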

Creating the First Project

The first DEMO recommended by the official documentation is vector addition, which makes it easy for beginners to quickly get familiar with the project structure and execution flow. Creating the project first requires selecting a platform file; here I chose the Alveo U50 (the Alveo U200 is another common card). Give the project any name you like and import the official example; with beginners in mind, it covers the use of HLS and OpenCL as well as task-level parallelism. Once creation completes you can see that the project contains several sub-projects; the most important ones are the Host and Kernel projects.
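For reference, the command-line equivalent of what the IDE generates looks roughly like the following sketch. The platform string is a placeholder for the actual U50 platform installed on your machine, the source paths and output names are assumptions, and hw_emu can be swapped for hw to build a real bitstream; the host build also needs the xcl2.cpp helper shipped with the official examples:

# compile the kernel to an .xo, then link it into an .xclbin
v++ -c -t hw_emu --platform <xilinx_u50_platform> -k vadd -o vadd.xo src/vadd.cpp
v++ -l -t hw_emu --platform <xilinx_u50_platform> --config link.cfg -o vadd.xclbin vadd.xo

# build the host program against XRT's OpenCL headers and libraries
g++ -std=c++14 -I$XILINX_XRT/include -L$XILINX_XRT/lib \
    -o vadd_host src/host.cpp src/xcl2.cpp -lOpenCL -lpthread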

Code Overview

The Host program mainly performs the following steps: load the xclbin file and obtain the Device; after loading the xclbin, obtain the required Kernel; allocate buffers for the Kernel, launch it asynchronously, and wait for the results to transfer back; finally, recompute on the CPU and check the accelerator card's result against the CPU result.
/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
*     //www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/

#include "xcl2.hpp"
#include <algorithm>
#include <vector>
#define DATA_SIZE 4096

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    std::string binaryFile = argv[1];
    size_t vector_size_bytes = sizeof(int) * DATA_SIZE;
    cl_int err;
    cl::Context context;
    cl::Kernel krnl_vector_add;
    cl::CommandQueue q;
    // Allocate Memory in Host Memory
    // When creating a buffer with user pointer (CL_MEM_USE_HOST_PTR), under the
    // hood user ptr
    // is used if it is properly aligned. when not aligned, runtime had no choice
    // but to create
    // its own host side buffer. So it is recommended to use this allocator if
    // user wish to
    // create buffer using CL_MEM_USE_HOST_PTR to align user buffer to page
    // boundary. It will
    // ensure that user buffer is used when user create Buffer/Mem object with
    // CL_MEM_USE_HOST_PTR
    std::vector<int, aligned_allocator<int> > source_in1(DATA_SIZE);
    std::vector<int, aligned_allocator<int> > source_in2(DATA_SIZE);
    std::vector<int, aligned_allocator<int> > source_hw_results(DATA_SIZE);
    std::vector<int, aligned_allocator<int> > source_sw_results(DATA_SIZE);

    // Create the test data
    std::generate(source_in1.begin(), source_in1.end(), std::rand);
    std::generate(source_in2.begin(), source_in2.end(), std::rand);
    for (int i = 0; i < DATA_SIZE; i++) {
        source_sw_results[i] = source_in1[i] + source_in2[i];
        source_hw_results[i] = 0;
    }

    // OPENCL HOST CODE AREA START
    // get_xil_devices() is a utility API which will find the xilinx
    // platforms and will return list of devices connected to Xilinx platform
    auto devices = xcl::get_xil_devices();
    // read_binary_file() is a utility API which will load the binaryFile
    // and will return the pointer to file buffer.
    auto fileBuf = xcl::read_binary_file(binaryFile);
    cl::Program::Binaries bins{{fileBuf.data(), fileBuf.size()}};
    bool valid_device = false;
    for (unsigned int i = 0; i < devices.size(); i++) {
        auto device = devices[i];
        // Creating Context and Command Queue for selected Device
        OCL_CHECK(err, context = cl::Context(device, nullptr, nullptr, nullptr, &err));
        OCL_CHECK(err, q = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err));
        std::cout << "Trying to program device[" << i << "]: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
        cl::Program program(context, {device}, bins, nullptr, &err);
        if (err != CL_SUCCESS) {
            std::cout << "Failed to program device[" << i << "] with xclbin file!\n";
        } else {
            std::cout << "Device[" << i << "]: program successful!\n";
            OCL_CHECK(err, krnl_vector_add = cl::Kernel(program, "vadd", &err));
            valid_device = true;
            break; // we break because we found a valid device
        }
    }
    if (!valid_device) {
        std::cout << "Failed to program any device found, exit!\n";
        exit(EXIT_FAILURE);
    }

    // Allocate Buffer in Global Memory
    // Buffers are allocated using CL_MEM_USE_HOST_PTR for efficient memory and
    // Device-to-host communication
    OCL_CHECK(err, cl::Buffer buffer_in1(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
                                         source_in1.data(), &err));
    OCL_CHECK(err, cl::Buffer buffer_in2(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
                                         source_in2.data(), &err));
    OCL_CHECK(err, cl::Buffer buffer_output(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, vector_size_bytes,
                                            source_hw_results.data(), &err));

    int size = DATA_SIZE;
    OCL_CHECK(err, err = krnl_vector_add.setArg(0, buffer_in1));
    OCL_CHECK(err, err = krnl_vector_add.setArg(1, buffer_in2));
    OCL_CHECK(err, err = krnl_vector_add.setArg(2, buffer_output));
    OCL_CHECK(err, err = krnl_vector_add.setArg(3, size));

    // Copy input data to device global memory
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_in1, buffer_in2}, 0 /* 0 means from host*/));

    // Launch the Kernel
    // For HLS kernels global and local size is always (1,1,1). So, it is
    // recommended
    // to always use enqueueTask() for invoking HLS kernel
    OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add));

    // Copy Result from Device Global Memory to Host Local Memory
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_output}, CL_MIGRATE_MEM_OBJECT_HOST));
    q.finish();
    // OPENCL HOST CODE AREA END

    // Compare the results of the Device to the simulation
    bool match = true;
    for (int i = 0; i < DATA_SIZE; i++) {
        if (source_hw_results[i] != source_sw_results[i]) {
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << source_sw_results[i]
                      << " Device result = " << source_hw_results[i] << std::endl;
            match = false;
            break;
        }
    }

    std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl;
    return (match ? EXIT_SUCCESS : EXIT_FAILURE);
}
The Kernel code is listed below. The code itself is very simple, just the addition; the key point is getting used to the HLS coding style:
/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
*     //www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/

/*******************************************************************************
Description:

    This example uses the load/compute/store coding style which is generally
    the most efficient for implementing kernels using HLS. The load and store
    functions are responsible for moving data in and out of the kernel as
    efficiently as possible. The core functionality is decomposed across one
    or more compute functions. Whenever possible, the compute function should
    pass data through HLS streams and should contain a single set of nested loops.

    HLS stream objects are used to pass data between producer and consumer
    functions. Stream read and write operations have a blocking behavior which
    allows consumers and producers to synchronize with each other automatically.

    The dataflow pragma instructs the compiler to enable task-level pipelining.
    This is required for the load/compute/store functions to execute in a parallel
    and pipelined manner.

    The kernel operates on vectors of NUM_WORDS integers modeled using the hls::vector
    data type. This datatype provides intuitive support for parallelism and
    fits well the vector-add computation. The vector length is set to NUM_WORDS
    since NUM_WORDS integers amount to a total of 64 bytes, which is the maximum size of
    a kernel port. It is a good practice to match the compute bandwidth to the I/O
    bandwidth. Here the kernel loads, computes and stores NUM_WORDS integer values per
    clock cycle and is implemented as below:
                                       _____________
                                      |             |<----- Input Vector 1 from Global Memory
                                      |  load_input |       __
                                      |_____________|----->|  |
                                       _____________       |  | in1_stream
Input Vector 2 from Global Memory --->|             |      |__|
                               __     |  load_input |        |
                              |  |<---|_____________|        |
                   in2_stream |  |     _____________         |
                              |__|--->|             |<--------
                                      | compute_add |      __
                                      |_____________|---->|  |
                                       ______________     |  | out_stream
                                      |              |<---|__|
                                      | store_result |
                                      |______________|-----> Output result to Global Memory

*******************************************************************************/

// Includes
#include <hls_vector.h>
#include <hls_stream.h>
#include "assert.h"

#define MEMORY_DWIDTH 512
#define SIZEOF_WORD 4
#define NUM_WORDS ((MEMORY_DWIDTH) / (8 * SIZEOF_WORD))

#define DATA_SIZE 4096

// TRIPCOUNT identifier
const int c_size = DATA_SIZE;

static void load_input(hls::vector<uint32_t, NUM_WORDS>* in,
                       hls::stream<hls::vector<uint32_t, NUM_WORDS> >& inStream,
                       int vSize) {
mem_rd:
    for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        inStream << in[i];
    }
}

static void compute_add(hls::stream<hls::vector<uint32_t, NUM_WORDS> >& in1_stream,
                        hls::stream<hls::vector<uint32_t, NUM_WORDS> >& in2_stream,
                        hls::stream<hls::vector<uint32_t, NUM_WORDS> >& out_stream,
                        int vSize) {
// The kernel is operating with vector of NUM_WORDS integers. The + operator performs
// an element-wise add, resulting in NUM_WORDS parallel additions.
execute:
    for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        out_stream << (in1_stream.read() + in2_stream.read());
    }
}

static void store_result(hls::vector<uint32_t, NUM_WORDS>* out,
                         hls::stream<hls::vector<uint32_t, NUM_WORDS> >& out_stream,
                         int vSize) {
mem_wr:
    for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        out[i] = out_stream.read();
    }
}

extern "C" {

/*
    Vector Addition Kernel

    Arguments:
        in1  (input)  --> Input vector 1
        in2  (input)  --> Input vector 2
        out  (output) --> Output vector
        size (input)  --> Number of elements in vector
*/

void vadd(hls::vector<uint32_t, NUM_WORDS>* in1,
          hls::vector<uint32_t, NUM_WORDS>* in2,
          hls::vector<uint32_t, NUM_WORDS>* out,
          int size) {
#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
#pragma HLS INTERFACE m_axi port = out bundle = gmem0

    static hls::stream<hls::vector<uint32_t, NUM_WORDS> > in1_stream("input_stream_1");
    static hls::stream<hls::vector<uint32_t, NUM_WORDS> > in2_stream("input_stream_2");
    static hls::stream<hls::vector<uint32_t, NUM_WORDS> > out_stream("output_stream");

    // Since NUM_WORDS values are processed
    // in parallel per loop iteration, the for loop only needs to iterate 'size / NUM_WORDS' times.
    assert(size % NUM_WORDS == 0);
    int vSize = size / NUM_WORDS;
#pragma HLS dataflow

    load_input(in1, in1_stream, vSize);
    load_input(in2, in2_stream, vSize);
    compute_add(in1_stream, in2_stream, out_stream, vSize);
    store_result(out, out_stream, vSize);
}
}
Run the HW build to check: comparing the results confirms that the computation is correct.
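As a usage sketch (the binary and file names below are assumptions), a hardware-emulation run and a real hardware run differ only in the environment setup:

# hardware emulation: generate an emconfig.json for the platform and set the emulation mode
emconfigutil --platform <xilinx_u50_platform>
export XCL_EMULATION_MODE=hw_emu
./vadd_host vadd.xclbin

# real hardware run on the U50: just drop the emulation variable
unset XCL_EMULATION_MODE
./vadd_host vadd.xclbin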

The packaged project files are available for download: DEMO project download.

Performance Tuning

Unlike ordinary CPU code, the whole reason for using an accelerator card is acceleration, so high performance is what really matters. The Vitis IDE integrates a powerful performance analysis tool, Vitis Analyzer, which can drill from every module of the low-level Kernel all the way up to every API call in the top-level C++ code. From this global view you can analyze the performance bottlenecks and wait operations of the entire program and quickly locate problems.

After the program has run, an .xrt summary file is generated in the corresponding subdirectory and can be opened with Vitis Analyzer.
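A typical flow (treat this as a sketch: the exact summary file name and option keys vary between XRT releases) is to enable profiling through an xrt.ini placed next to the host executable before the run, then open the generated summary:

# xrt.ini
[Debug]
opencl_summary=true
opencl_trace=true

# after the run, open the summary in Vitis Analyzer
vitis_analyzer xrt.run_summary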

For the runtime performance of individual modules inside the Kernel, the analysis files live under the HW project directory; there you can see how each module performs, and you can also open Vitis HLS for further analysis.

A Simple Analysis

Taking this vector-addition project as an example, let's do a simple analysis and try to make clear when an accelerator card is actually worth using. Before reaching for heterogeneous computing or an accelerator card, you must first look at the type of computation involved. This DEMO, for instance, performs a trivial A+B operation. The DEMO's CPU implementation is shown below; looping 4096 times takes only about 21 µs:
// CPU computes A+B
    for (int i = 0; i < DATA_SIZE; i++) {
        source_sw_results[i] = source_in1[i] + source_in2[i];
    }
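The ~21 µs figure can be reproduced with a plain wall-clock measurement around that loop, for example (a minimal sketch, not part of the original demo; it needs #include <chrono> at the top of the host source):

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < DATA_SIZE; i++) {
        source_sw_results[i] = source_in1[i] + source_in2[i];
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    // elapsed time of the CPU reference loop in microseconds
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << "CPU A+B over " << DATA_SIZE << " elements took " << us << " us" << std::endl;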
Checking with Vitis Analyzer, the FPGA computation takes as long as 88 ms, and once you add the Host-Device transfers over PCIe, the accelerator card's performance looks hopeless. So what exactly did it accelerate?

CPU clock frequencies reach several GHz, GPUs run at around 1 GHz, and FPGAs typically run at only a few hundred MHz. For an operation like A+B, even a ten-year-old CPU would easily crush today's most advanced accelerator cards.
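To put rough numbers on it (the 300 MHz kernel clock is an assumption for illustration): the kernel processes NUM_WORDS = 512 / (8 × 4) = 16 integers per cycle, so 4096 additions need only about 4096 / 16 = 256 cycles, on the order of a microsecond of actual compute at 300 MHz. Essentially all of the measured 88 ms therefore goes to xclbin setup, PCIe transfers and kernel launch overhead rather than to the addition itself.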

Heterogeneous computing only makes sense for data structures or computations that are poorly suited to CPU processing; only those are worth offloading to an accelerator card.

A GPU uses a massive number of parallel compute units and exploits high-bandwidth memory to raise overall data throughput; in other words, it buys compute performance with ultra-wide memory bandwidth and a large chip power budget. Workloads such as DL/ML training and database queries can therefore be accelerated and optimized well with GPUs.

An FPGA is a general-purpose gate array: the computation flow is implemented directly in circuitry, so its strength lies in fitting the data structure and the characteristics of the computation. Because the computation is implemented in hardware, the latency of processing a single datum is shorter and more deterministic; at the same time, FPGA accelerator cards offer more data interfaces and can sit closer to the data source, which makes them well suited to computing on in-flight data. Examples include trading systems in the financial industry and storage-compute disaggregation in data centers.