Getting Started with Xilinx Alveo FPGA Accelerator Card Software Development
By 小强不吃菜 · 2023-08-25 · Hardware Development
Background
As heterogeneous computing grows ever more popular, FPGA accelerator cards are being used in more and more fields. FPGA and GPU accelerator cards accelerate in fundamentally different ways. A GPU relies on a massive number of parallel execution units to raise overall throughput and saturate memory bandwidth. An FPGA is a general-purpose gate array: dedicated compute units are built around the structure of the data and the characteristics of the computation, achieving high throughput at lower power and latency. The previous article covered environment setup; this one focuses on the project structure and the overall working architecture.
When accelerating with a GPU, the CPU only needs to send data and instructions to the GPU; there is no need to design the compute cores that execute those instructions. On an FPGA, the compute cores must be built yourself. As the figure below shows, the host connects to the FPGA accelerator card over PCIe, so there is corresponding code on both the Host side and the Device side.
[Figure: the host connects to the FPGA accelerator card over PCIe]
Creating the First Project
The first demo project recommended officially is vector addition, which makes it easy for beginners to quickly get familiar with the project structure and how it runs.




Code Overview
The Host program mainly does the following: load the xclbin file, get the Device, and obtain the required Kernel from the loaded xclbin; set up the Kernel's buffers, launch it asynchronously, and wait for the results to be transferred back; finally, recompute on the CPU and check the accelerator's results against the CPU's.
/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/
#include "xcl2.hpp"
#include <algorithm>
#include <vector>
#define DATA_SIZE 4096
int main(int argc, char** argv) {
if (argc != 2) {
std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
return EXIT_FAILURE;
}
std::string binaryFile = argv[1];
size_t vector_size_bytes = sizeof(int) * DATA_SIZE;
cl_int err;
cl::Context context;
cl::Kernel krnl_vector_add;
cl::CommandQueue q;
// Allocate Memory in Host Memory
// When creating a buffer with user pointer (CL_MEM_USE_HOST_PTR), under the
// hood user ptr
// is used if it is properly aligned. when not aligned, runtime had no choice
// but to create
// its own host side buffer. So it is recommended to use this allocator if
// user wish to
// create buffer using CL_MEM_USE_HOST_PTR to align user buffer to page
// boundary. It will
// ensure that user buffer is used when user create Buffer/Mem object with
// CL_MEM_USE_HOST_PTR
std::vector<int, aligned_allocator<int> > source_in1(DATA_SIZE);
std::vector<int, aligned_allocator<int> > source_in2(DATA_SIZE);
std::vector<int, aligned_allocator<int> > source_hw_results(DATA_SIZE);
std::vector<int, aligned_allocator<int> > source_sw_results(DATA_SIZE);
// Create the test data
std::generate(source_in1.begin(), source_in1.end(), std::rand);
std::generate(source_in2.begin(), source_in2.end(), std::rand);
for (int i = 0; i < DATA_SIZE; i++) {
source_sw_results[i] = source_in1[i] + source_in2[i];
source_hw_results[i] = 0;
}
// OPENCL HOST CODE AREA START
// get_xil_devices() is a utility API which will find the xilinx
// platforms and will return list of devices connected to Xilinx platform
auto devices = xcl::get_xil_devices();
// read_binary_file() is a utility API which will load the binaryFile
// and will return the pointer to file buffer.
auto fileBuf = xcl::read_binary_file(binaryFile);
cl::Program::Binaries bins{{fileBuf.data(), fileBuf.size()}};
bool valid_device = false;
for (unsigned int i = 0; i < devices.size(); i++) {
auto device = devices[i];
// Creating Context and Command Queue for selected Device
OCL_CHECK(err, context = cl::Context(device, nullptr, nullptr, nullptr, &err));
OCL_CHECK(err, q = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err));
std::cout << "Trying to program device[" << i << "]: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
cl::Program program(context, {device}, bins, nullptr, &err);
if (err != CL_SUCCESS) {
std::cout << "Failed to program device[" << i << "] with xclbin file!\n";
} else {
std::cout << "Device[" << i << "]: program successful!\n";
OCL_CHECK(err, krnl_vector_add = cl::Kernel(program, "vadd", &err));
valid_device = true;
break; // we break because we found a valid device
}
}
if (!valid_device) {
std::cout << "Failed to program any device found, exit!\n";
exit(EXIT_FAILURE);
}
// Allocate Buffer in Global Memory
// Buffers are allocated using CL_MEM_USE_HOST_PTR for efficient memory and
// Device-to-host communication
OCL_CHECK(err, cl::Buffer buffer_in1(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
source_in1.data(), &err));
OCL_CHECK(err, cl::Buffer buffer_in2(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
source_in2.data(), &err));
OCL_CHECK(err, cl::Buffer buffer_output(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, vector_size_bytes,
source_hw_results.data(), &err));
int size = DATA_SIZE;
OCL_CHECK(err, err = krnl_vector_add.setArg(0, buffer_in1));
OCL_CHECK(err, err = krnl_vector_add.setArg(1, buffer_in2));
OCL_CHECK(err, err = krnl_vector_add.setArg(2, buffer_output));
OCL_CHECK(err, err = krnl_vector_add.setArg(3, size));
// Copy input data to device global memory
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_in1, buffer_in2}, 0 /* 0 means from host*/));
// Launch the Kernel
// For HLS kernels global and local size is always (1,1,1). So, it is
// recommended
// to always use enqueueTask() for invoking HLS kernel
OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add));
// Copy Result from Device Global Memory to Host Local Memory
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_output}, CL_MIGRATE_MEM_OBJECT_HOST));
q.finish();
// OPENCL HOST CODE AREA END
// Compare the results of the Device to the simulation
bool match = true;
for (int i = 0; i < DATA_SIZE; i++) {
if (source_hw_results[i] != source_sw_results[i]) {
std::cout << "Error: Result mismatch" << std::endl;
std::cout << "i = " << i << " CPU result = " << source_sw_results[i]
<< " Device result = " << source_hw_results[i] << std::endl;
match = false;
break;
}
}
std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl;
return (match ? EXIT_SUCCESS : EXIT_FAILURE);
}
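For reference, here is a hedged sketch of compiling this host program against XRT. The file layout (host.cpp, and xcl2.cpp under common/includes as in the Vitis examples repository) and the install path are assumptions; adjust them for your setup:

source /opt/xilinx/xrt/setup.sh  # assumes a standard XRT install
g++ -std=c++11 -Wall -O2 \
    -I$XILINX_XRT/include -I./common/includes/xcl2 \
    host.cpp ./common/includes/xcl2/xcl2.cpp \
    -L$XILINX_XRT/lib -lOpenCL -pthread \
    -o vadd_host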
The Kernel code is shown below. The computation itself is trivial addition; the key point is adapting to the HLS coding style:
/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/
/*******************************************************************************
Description:
This example uses the load/compute/store coding style which is generally
the most efficient for implementing kernels using HLS. The load and store
functions are responsible for moving data in and out of the kernel as
efficiently as possible. The core functionality is decomposed across one
or more compute functions. Whenever possible, the compute function should
pass data through HLS streams and should contain a single set of nested loops.
HLS stream objects are used to pass data between producer and consumer
functions. Stream read and write operations have a blocking behavior which
allows consumers and producers to synchronize with each other automatically.
The dataflow pragma instructs the compiler to enable task-level pipelining.
This is required for the load/compute/store functions to execute in a parallel
and pipelined manner.
The kernel operates on vectors of NUM_WORDS integers modeled using the hls::vector
data type. This datatype provides intuitive support for parallelism and
fits well the vector-add computation. The vector length is set to NUM_WORDS
since NUM_WORDS integers amount to a total of 64 bytes, which is the maximum size of
a kernel port. It is a good practice to match the compute bandwidth to the I/O
bandwidth. Here the kernel loads, computes and stores NUM_WORDS integer values per
clock cycle and is implemented as below:
_____________
| |<----- Input Vector 1 from Global Memory
| load_input | __
|_____________|----->| |
_____________ | | in1_stream
Input Vector 2 from Global Memory --->| | |__|
__ | load_input | |
| |<---|_____________| |
in2_stream | | _____________ |
|__|--->| |<--------
| compute_add | __
|_____________|---->| |
______________ | | out_stream
| |<---|__|
| store_result |
|______________|-----> Output result to Global Memory
*******************************************************************************/
// Includes
#include <hls_vector.h>
#include <hls_stream.h>
#include "assert.h"
#define MEMORY_DWIDTH 512
#define SIZEOF_WORD 4
#define NUM_WORDS ((MEMORY_DWIDTH) / (8 * SIZEOF_WORD))
#define DATA_SIZE 4096
// TRIPCOUNT identifier
const int c_size = DATA_SIZE;
static void load_input(hls::vector<uint32_t, NUM_WORDS>* in,
hls::stream<hls::vector<uint32_t, NUM_WORDS> >& inStream,
int vSize) {
mem_rd:
for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
inStream << in[i];
}
}
static void compute_add(hls::stream<hls::vector<uint32_t, NUM_WORDS> >& in1_stream,
hls::stream<hls::vector<uint32_t, NUM_WORDS> >& in2_stream,
hls::stream<hls::vector<uint32_t, NUM_WORDS> >& out_stream,
int vSize) {
// The kernel is operating with vector of NUM_WORDS integers. The + operator performs
// an element-wise add, resulting in NUM_WORDS parallel additions.
execute:
for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
out_stream << (in1_stream.read() + in2_stream.read());
}
}
static void store_result(hls::vector<uint32_t, NUM_WORDS>* out,
hls::stream<hls::vector<uint32_t, NUM_WORDS> >& out_stream,
int vSize) {
mem_wr:
for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
out[i] = out_stream.read();
}
}
extern "C" {
/*
Vector Addition Kernel
Arguments:
in1 (input) --> Input vector 1
in2 (input) --> Input vector 2
out (output) --> Output vector
size (input) --> Number of elements in vector
*/
void vadd(hls::vector<uint32_t, NUM_WORDS>* in1,
hls::vector<uint32_t, NUM_WORDS>* in2,
hls::vector<uint32_t, NUM_WORDS>* out,
int size) {
#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
#pragma HLS INTERFACE m_axi port = out bundle = gmem0
static hls::stream<hls::vector<uint32_t, NUM_WORDS> > in1_stream("input_stream_1");
static hls::stream<hls::vector<uint32_t, NUM_WORDS> > in2_stream("input_stream_2");
static hls::stream<hls::vector<uint32_t, NUM_WORDS> > out_stream("output_stream");
// Since NUM_WORDS values are processed
// in parallel per loop iteration, the for loop only needs to iterate 'size / NUM_WORDS' times.
assert(size % NUM_WORDS == 0);
int vSize = size / NUM_WORDS;
#pragma HLS dataflow
load_input(in1, in1_stream, vSize);
load_input(in2, in2_stream, vSize);
compute_add(in1_stream, in2_stream, out_stream, vSize);
store_result(out, out_stream, vSize);
}
}
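The width arithmetic behind NUM_WORDS deserves a quick sanity check: a 512-bit memory port divided by 32-bit words yields 16 integers per transfer, so the 4096-element vectors are consumed in 4096 / 16 = 256 loop iterations. A minimal compile-time sketch mirroring the kernel's macros:

// mirrors MEMORY_DWIDTH, SIZEOF_WORD and NUM_WORDS from the kernel above
constexpr int kMemoryDwidthBits = 512;
constexpr int kWordBytes = 4; // sizeof(uint32_t)
constexpr int kNumWords = kMemoryDwidthBits / (8 * kWordBytes);
static_assert(kNumWords == 16, "a 512-bit port carries 16 x 32-bit words per beat");
static_assert(4096 % kNumWords == 0, "DATA_SIZE must divide evenly, as the kernel asserts");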
Running the hardware (HW) build and comparing against the CPU computation confirms the results are correct.
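For completeness, a hedged sketch of building the kernel with the v++ compiler and running the test; the platform placeholder must be replaced with a platform actually installed on your system:

# compile the kernel to an object file, then link it into an xclbin
v++ -c -t hw --platform <your_alveo_platform> -k vadd -o vadd.xo vadd.cpp
v++ -l -t hw --platform <your_alveo_platform> -o vadd.xclbin vadd.xo

# run against the card (for emulation, build with -t sw_emu or hw_emu
# and set XCL_EMULATION_MODE accordingly before running)
./vadd_host ./vadd.xclbin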

Performance Tuning
Unlike ordinary CPU code, the whole point of using an accelerator card is acceleration, so high performance is what matters most. The Vitis IDE integrates a powerful performance-analysis tool, Vitis Analyzer, which can trace from every module of the underlying Kernel all the way up to every API call in the top-level C++ code. It gives a global view of the program's performance bottlenecks and wait operations, making it fast to localize problems.
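Profiling data for Vitis Analyzer is typically enabled through an xrt.ini file placed next to the host executable. A minimal sketch follows; the exact option names have shifted between XRT releases, so treat them as an assumption to verify against your version's documentation:

[Debug]
opencl_summary=true
opencl_trace=true

The run then produces summary and trace files that can be opened in the vitis_analyzer GUI.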




A Simple Analysis
Taking this vector-add project as an example, let's do a simple analysis to make the accelerator card's applicable scenarios clear. Before adopting heterogeneous computing, or an accelerator card specifically, first identify the type of computation involved. This demo, for instance, performs the trivial operation A+B. Computing it on the CPU as below, 4096 iterations take only about 21 µs:
// CPU computes A+B
for (int i = 0; i < DATA_SIZE; i++) {
source_sw_results[i] = source_in1[i] + source_in2[i];
}
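That 21 µs figure is straightforward to reproduce with a wall-clock measurement around the loop. Here is a minimal sketch using std::chrono, meant to drop into the existing main(); the exact number will vary with CPU and compiler flags:

#include <chrono>  // add to the includes at the top of the file

auto t0 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < DATA_SIZE; i++) {
    source_sw_results[i] = source_in1[i] + source_in2[i];
}
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << "CPU add: "
          << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
          << " us" << std::endl;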
Vitis Analyzer shows the FPGA computation taking as long as 88 ms, and once the Host-Device transfers over PCIe are added in, the accelerator card is hopelessly outclassed on this workload.
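A back-of-envelope estimate shows why the kernel itself is not the bottleneck. Assuming an illustrative 300 MHz kernel clock (the demo does not state one):

// rough scale estimate; the 300 MHz clock is an assumption, not from the demo
constexpr double kIterations = 4096.0 / 16.0;                     // 256 vector adds of 16 words each
constexpr double kKernelClockHz = 300e6;                          // assumed clock, for scale only
constexpr double kComputeSeconds = kIterations / kKernelClockHz;  // roughly 0.85 microseconds
constexpr double kBytesMoved = 3.0 * 4096.0 * 4.0;                // two inputs plus one output, 48 KiB

In other words, essentially all of the measured 88 ms goes to xclbin loading, command-queue management, and PCIe transfers. An accelerator only pays off when the computation is heavy enough to amortize that offload cost, and a 48 KiB element-wise add is not.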
