
CUDA Programming: GPU Matrix Operations

2023-02-11 15:52 Author: 高频交易dragon


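A kernel cannot explicitly return a value; all result data must be written to an array passed to the function (to compute a scalar, you would typically pass a one-element array).

A kernel explicitly declares its thread hierarchy when it is called: the number of thread blocks and the number of threads per block. Note that although a kernel is compiled once, it can be called multiple times with different block sizes or grid sizes.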

from numba import cuda

@cuda.jit
def my_kernel(io_array):
    """Code for kernel."""
    # code here


Kernel invocation

A kernel is typically launched in the following way:


import numpy

# Create the data array - usually initialized some other way

data = numpy.ones(256)


# Set the number of threads in a block

threadsperblock = 32 


# Calculate the number of thread blocks in the grid

blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock


# Now start the kernel

my_kernel[blockspergrid, threadsperblock](data)


# Print the result

print(data)

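There are two main steps:

Instantiate the kernel by specifying the number of blocks per grid and the number of threads per block. The product of the two gives the total number of threads launched. Instantiation is done by taking the compiled kernel function and indexing it with a tuple of integers.

Run the kernel, passing it the input array (and any separate output arrays, if needed). By default, running a kernel is synchronous: the function returns once the kernel has finished executing and the data has been synchronized back to the host.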
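The grid-size arithmetic used in the launch code can be checked without a GPU. The following is a minimal pure-Python sketch; `blocks_per_grid` is a hypothetical helper name introduced here for illustration and is not part of Numba. It shows that the rounded-up division launches at least as many threads as there are elements, and often a few more.

```python
def blocks_per_grid(n_elements, threads_per_block):
    """Round n_elements / threads_per_block up to a whole number of blocks."""
    return (n_elements + threads_per_block - 1) // threads_per_block


threads_per_block = 32
for n in (256, 257, 1000):
    blocks = blocks_per_grid(n, threads_per_block)
    total_threads = blocks * threads_per_block  # product of the two launch parameters
    surplus = total_threads - n                 # threads with no element to process
    print(n, blocks, total_threads, surplus)
```

For 257 elements and 32 threads per block this gives 9 blocks, that is 288 threads, 31 of which have no element to work on. This surplus is why a kernel needs a bounds check before touching the array.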

@cuda.jit

def my_kernel(io_array):

    # Thread id in a 1D block

    tx = cuda.threadIdx.x

    # Block id in a 1D grid

    ty = cuda.blockIdx.x

    # Block width, i.e. number of threads per block

    bw = cuda.blockDim.x

    # Compute flattened index inside the array

    pos = tx + ty * bw

    if pos < io_array.size:  # Check array boundaries

        io_array[pos] *= 2 # do the computation

  • numba.cuda.threadIdx - The thread indices in the current thread block. For 1-dimensional blocks, the index (given by the x attribute) is an integer spanning the range from 0 to numba.cuda.blockDim - 1. A similar rule exists for each dimension when more than one dimension is used.

  • numba.cuda.blockDim - The shape of the block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”).

  • numba.cuda.blockIdx - The block indices in the grid of threads launched by a kernel. For a 1-dimensional grid, the index (given by the x attribute) is an integer spanning the range from 0 to numba.cuda.gridDim - 1. A similar rule exists for each dimension when more than one dimension is used.

  • numba.cuda.gridDim - The shape of the grid of blocks, i.e. the total number of blocks launched by this kernel invocation, as declared when instantiating the kernel.

These objects can be 1-, 2- or 3-dimensional, depending on how the kernel was invoked. To access the value at each dimension, use the x, y and z attributes of these objects, respectively.
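To see how the thread index, block index and block width combine, here is a pure-Python emulation of a 1-dimensional launch. It is a sketch for illustration only: `run_1d_kernel` and `double_kernel` are hypothetical names, the "threads" run one after another on the CPU, and nothing here is Numba's API.

```python
def run_1d_kernel(kernel, grid_dim, block_dim, array):
    """Call kernel once per (block, thread) pair, serially, like a 1D launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(thread_idx, block_idx, block_dim, array)


def double_kernel(tx, bx, bw, io_array):
    pos = tx + bx * bw        # flattened index: thread id + block id * block width
    if pos < len(io_array):   # threads past the end of the array do nothing
        io_array[pos] *= 2


data = [1.0] * 10
threads_per_block = 4
blocks = (len(data) + threads_per_block - 1) // threads_per_block
run_1d_kernel(double_kernel, blocks, threads_per_block, data)
print(data)
```

Ten elements with four threads per block need three blocks, so twelve emulated threads run and the last two fall outside the array. Without the bounds check they would raise an IndexError here; on a real device they would access memory out of range.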

Absolute positions

Simple algorithms will tend to always use thread indices in the same way as shown in the example above. Numba provides additional facilities to automate such calculations:

  • numba.cuda.grid(ndim) - Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.

  • numba.cuda.gridsize(ndim) - Return the absolute size (or shape) in threads of the entire grid of blocks. ndim has the same meaning as in grid() above.

Using these functions, our example becomes:

from numba import cuda
import numpy
import math


# CUDA kernel

@cuda.jit

def my_kernel(io_array):

    pos = cuda.grid(1)

    if pos < io_array.size:

        io_array[pos] *= 2 # do the computation


# Host code   

data = numpy.ones(256)

threadsperblock = 256

blockspergrid = math.ceil(data.shape[0] / threadsperblock)

my_kernel[blockspergrid, threadsperblock](data)

print(data)
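For matrices, the same indexing extends to two dimensions. The sketch below is again a pure-Python emulation, not Numba code: `grid_2d` is a hypothetical helper that mirrors the description of `cuda.grid(2)` as returning a tuple of absolute positions, computed per dimension as thread index plus block index times block width. Using x as the row index and y as the column index is an assumption of this sketch.

```python
def grid_2d(thread_idx, block_idx, block_dim):
    """Absolute (x, y) position from per-dimension thread and block indices."""
    x = thread_idx[0] + block_idx[0] * block_dim[0]
    y = thread_idx[1] + block_idx[1] * block_dim[1]
    return x, y


rows, cols = 3, 5
matrix = [[1.0] * cols for _ in range(rows)]

block_dim = (2, 2)
# Round up in each dimension so the grid covers the whole matrix
grid_dim = ((rows + block_dim[0] - 1) // block_dim[0],
            (cols + block_dim[1] - 1) // block_dim[1])

for bx in range(grid_dim[0]):
    for by in range(grid_dim[1]):
        for tx in range(block_dim[0]):
            for ty in range(block_dim[1]):
                x, y = grid_2d((tx, ty), (bx, by), block_dim)
                if x < rows and y < cols:  # bounds check in both dimensions
                    matrix[x][y] *= 2

print(matrix)
```

Each in-range element is visited by exactly one emulated thread, so every entry is doubled once even though the 2x3 grid of 2x2 blocks covers a 4x6 area.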

Memory management

So far, Numba has automatically transferred the NumPy arrays to the device each time the kernel was invoked. However, it can only do so conservatively, by always copying the device memory back to the host when a kernel finishes. To avoid this unnecessary transfer for read-only arrays, you can control the transfers manually.

device_array = cuda.device_array( shape )

Allocate an empty device ndarray. Similar to numpy.empty().

device_array = cuda.to_device( array )

Allocate and transfer a NumPy ndarray to the device.
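The two calls above might be combined as in the following sketch, which keeps a read-only input and a separate output on the device. Some parts are assumptions not taken from the text: the kernel name `double_into` is made up, and `copy_to_host()` (the device-array method for the reverse transfer) does not appear above. The demo only runs when Numba reports a usable CUDA device and is skipped otherwise.

```python
try:
    import numpy as np
    from numba import cuda
    gpu_available = cuda.is_available()
except Exception:  # Numba missing or CUDA driver unusable: skip the demo
    gpu_available = False

result = None
if gpu_available:
    @cuda.jit
    def double_into(src, dst):
        pos = cuda.grid(1)
        if pos < src.size:
            dst[pos] = src[pos] * 2

    host_in = np.ones(256)
    d_in = cuda.to_device(host_in)  # explicit host-to-device copy of the input
    # Uninitialized output buffer on the device (assumed same shape and dtype)
    d_out = cuda.device_array(host_in.shape, dtype=host_in.dtype)

    threadsperblock = 32
    blockspergrid = (host_in.size + threadsperblock - 1) // threadsperblock
    double_into[blockspergrid, threadsperblock](d_in, d_out)

    result = d_out.copy_to_host()   # explicit device-to-host copy of the output
    print(result[:4])
```

Because the kernel receives device arrays rather than NumPy arrays, the input is not copied back after the launch; only the output is transferred, and only when requested.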

