CUDA Programming: GPU Matrix Operations

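A kernel cannot explicitly return a value; all result data must be written to an array passed to the function (when computing a scalar, you would typically pass a one-element array);
A kernel explicitly declares its thread hierarchy when it is called, i.e. the number of thread blocks and the number of threads per block (note that although a kernel is compiled once, it can be called many times with different block sizes or grid sizes).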
from numba import cuda

@cuda.jit
def my_kernel(io_array):
    """
    Code for kernel.
    """
    # code here
Kernel invocation
A kernel is typically launched in the following way:
import numpy
# Create the data array - usually initialized some other way
data = numpy.ones(256)
# Set the number of threads in a block
threadsperblock = 32
# Calculate the number of thread blocks in the grid
blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock
# Now start the kernel
my_kernel[blockspergrid, threadsperblock](data)
# Print the result
print(data)
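There are two main steps:
- Instantiate the kernel by specifying the number of blocks per grid and the number of threads per block. The product of the two gives the total number of threads launched. Kernel instantiation is done by taking the compiled kernel function and indexing it with a tuple of integers.
- Run the kernel by passing it the input array (and any separate output arrays, if needed). By default, when NumPy arrays are passed, the call behaves synchronously: it returns once the kernel has finished executing and the data has been copied back to the host.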
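The launch-configuration arithmetic can be checked on the host without a GPU. The sketch below is plain Python rather than Numba, and the helper name `blocks_per_grid` is hypothetical, introduced only for illustration. It reproduces the ceiling division from the launch example and shows that the number of launched threads can exceed the array size, which is why a kernel should check array bounds before writing.

```python
def blocks_per_grid(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


threads_per_block = 32
for n_elements in (256, 257, 1000):
    blocks = blocks_per_grid(n_elements, threads_per_block)
    total_threads = blocks * threads_per_block
    # total_threads is always >= n_elements; the surplus threads must do nothing.
    print(n_elements, blocks, total_threads, total_threads - n_elements)
```

With 257 elements, for instance, 9 blocks of 32 threads are needed, so 31 threads have no element to work on.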
@cuda.jit
def my_kernel(io_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < io_array.size:  # Check array boundaries
        io_array[pos] *= 2  # do the computation
numba.cuda.threadIdx
- The thread indices in the current thread block. For 1-dimensional blocks, the index (given by the x attribute) is an integer spanning the range from 0 to numba.cuda.blockDim - 1. A similar rule exists for each dimension when more than one dimension is used.
numba.cuda.blockDim
- The shape of the block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is "full").
numba.cuda.blockIdx
- The block indices in the grid of threads launched by a kernel. For a 1-dimensional grid, the index (given by the x attribute) is an integer spanning the range from 0 to numba.cuda.gridDim - 1. A similar rule exists for each dimension when more than one dimension is used.
numba.cuda.gridDim
- The shape of the grid of blocks, i.e. the total number of blocks launched by this kernel invocation, as declared when instantiating the kernel.
These objects can be 1-, 2- or 3-dimensional, depending on how the kernel was invoked. To access the value at each dimension, use the x, y and z attributes of these objects, respectively.
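To see how these indices combine, the following sketch replays the doubling kernel sequentially on the host. It is plain Python, not Numba, and `simulate_1d_launch` is a hypothetical helper written for this illustration: the two loops stand in for blockIdx.x and threadIdx.x, and the loop bounds stand in for gridDim.x and blockDim.x. A real launch runs the threads in parallel and in no guaranteed order.

```python
def simulate_1d_launch(io_array, blocks_per_grid, threads_per_block):
    """Run the doubling kernel body once per (block, thread) pair, sequentially."""
    for block_idx in range(blocks_per_grid):         # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            # threads_per_block plays the role of blockDim.x
            pos = thread_idx + block_idx * threads_per_block
            if pos < len(io_array):                  # surplus threads do nothing
                io_array[pos] *= 2


data = [1.0] * 10
# 3 blocks of 4 threads give 12 threads for 10 elements.
simulate_1d_launch(data, blocks_per_grid=3, threads_per_block=4)
print(data)
```

Each element is visited by exactly one (block, thread) pair, and the two threads with positions 10 and 11 are filtered out by the bounds check.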
Absolute positions
Simple algorithms will tend to always use thread indices in the same way as shown in the example above. Numba provides additional facilities to automate such calculations:
numba.cuda.grid(ndim)
- Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.
numba.cuda.gridsize(ndim)
- Return the absolute size (or shape) in threads of the entire grid of blocks. ndim has the same meaning as in grid() above.
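For the 2-dimensional case, the values these functions are described as returning can be worked out by hand. The sketch below is plain Python, and `grid_2d` and `gridsize_2d` are hypothetical stand-ins, not Numba functions. They apply the per-axis rule from the 1D example (thread index plus block index times block width) to each dimension, under the assumption that a block has 16 x 8 threads and the grid has 4 x 2 blocks.

```python
def grid_2d(thread_idx, block_idx, block_dim):
    """Absolute (x, y) position of a thread, per axis: threadIdx + blockIdx * blockDim."""
    return tuple(t + b * d for t, b, d in zip(thread_idx, block_idx, block_dim))


def gridsize_2d(block_dim, grid_dim):
    """Total number of threads along each axis: blockDim * gridDim."""
    return tuple(d * g for d, g in zip(block_dim, grid_dim))


block_dim = (16, 8)  # threads per block along x and y (assumed values)
grid_dim = (4, 2)    # blocks per grid along x and y (assumed values)

position = grid_2d(thread_idx=(3, 5), block_idx=(2, 1), block_dim=block_dim)
shape = gridsize_2d(block_dim, grid_dim)
print(position)  # (35, 13)
print(shape)     # (64, 16)
```

Inside a real kernel, the same pair of values would come from `x, y = cuda.grid(2)` and `cuda.gridsize(2)`.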
Using these functions, our example becomes:
from numba import cuda
import numpy
import math

# CUDA kernel
@cuda.jit
def my_kernel(io_array):
    pos = cuda.grid(1)
    if pos < io_array.size:
        io_array[pos] *= 2  # do the computation

# Host code
data = numpy.ones(256)
threadsperblock = 256
blockspergrid = math.ceil(data.shape[0] / threadsperblock)
my_kernel[blockspergrid, threadsperblock](data)
print(data)
Memory management
In the examples so far, Numba has automatically transferred the NumPy arrays to the device when the kernel is invoked. However, it can only do so conservatively, by always copying the device memory back to the host when the kernel finishes. To avoid this unnecessary transfer for read-only arrays, you can control the transfer manually.
device_array = cuda.device_array(shape)
- Allocate an empty device ndarray. Similar to numpy.empty().
device_array = cuda.to_device(array)
- Allocate and transfer a NumPy ndarray to the device.