
Convolutional Neural Networks (CNN) -- A Translation of Chapter 8 of the MIT Open Course "Introduction to Machine Learning"

2023-01-27 01:02 Author: GUCCI-GUJI

Convolutional Neural Networks (CNNs)

A translation of Chapter 8 of the MIT open course "Introduction to Machine Learning"

 

So far, we have studied what are called fully connected neural networks, in which all of the units at one layer are connected to all of the units in the next layer. This is a good arrangement when we don’t know anything about what kind of mapping from inputs to outputs we will be asking the network to learn to approximate. But if we do know something about our problem, it is better to build it into the structure of our neural network. Doing so can save computation time and significantly diminish the amount of training data required to arrive at a solution that generalizes robustly.


 

One very important application domain of neural networks, where the methods have achieved an enormous amount of success in recent years, is signal processing. Signals might be spatial (in two-dimensional camera images or three-dimensional depth or CAT scans) or temporal (speech or music). If we know that we are addressing a signal-processing problem, we can take advantage of invariant properties of that problem. In this chapter, we will focus on two-dimensional spatial problems (images) but use one-dimensional ones as a simple example. Later, we will address temporal problems.


 

Imagine that you are given the problem of designing and training a neural network that takes an image as input, and outputs a classification, which is positive if the image contains a cat and negative if it does not. An image is described as a two-dimensional array of pixels (a pixel is a "picture element"), each of which may be represented by three integer values, encoding intensity levels in red, green, and blue color channels.


 

There are two important pieces of prior structural knowledge we can bring to bear on this problem:


Spatial locality: The set of pixels we will have to take into consideration to find a cat will be near one another in the image. So, for example, we won’t have to consider some combination of pixels in the four corners of the image, in order to see if they encode cat-ness.

(Translator's note: the pixels that make up the cat form one contiguous patch that is "stuck together," so when looking for the set of pixels containing a cat we do not need to test scattered combinations pieced together from different parts of the image.)

Translation invariance: The pattern of pixels that characterizes a cat is the same no matter where in the image the cat occurs. Cats don’t look different if they’re on the left or the right side of the image.


We will design neural network structures that take advantage of these properties.


 

1 Filters

We begin by discussing image filters (Unfortunately in AI/ML/CS/Math, the word “filter” gets used in many ways: in addition to the one we describe here, it can describe a temporal process (in fact, our moving averages are a kind of filter) and even a somewhat esoteric algebraic structure). An image filter is a function that takes in a local spatial neighborhood of pixel values and detects the presence of some pattern in that data.


 

Let’s consider a very simple case to start, in which we have a 1-dimensional binary “image” and a filter F of size two. The filter is a vector of two numbers, which we will move along the image, taking the dot product between the filter values and the image values at each step, and aggregating the outputs to produce a new image.


Let X be the original image, of size d; then pixel i of the output image is specified by
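(The defining formula appears only as an image in the source; for the size-two filter F = (F1, F2) described above, and assuming the window for output pixel i is (X_{i-1}, X_i), it reads:)

$$Y_i = F \cdot (X_{i-1}, X_i) = F_1 X_{i-1} + F_2 X_i$$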


To ensure that the output image is also of dimension d, we will generally “pad” the input image with 0 values if we need to access pixels that are beyond the bounds of the input image. This process of applying the filter to the image to create a new image is called “convolution.” (And filters are also sometimes called convolutional kernels.)


 

If you are already familiar with what a convolution is, you might notice that this definition corresponds to what is often called a correlation and not to a convolution. Indeed, correlation and convolution refer to different operations in signal processing. However, in the neural networks literature, most libraries implement the correlation (as described in this chapter) but call it convolution. The distinction is not significant; in principle, if convolution is required to solve the problem, the network could learn the necessary weights. For a discussion of the difference between convolution and correlation and the conventions used in the literature you can read section 9.1 in this excellent book: https://www.deeplearningbook.org.


 

Here is a concrete example. Let the filter F1 = (−1, +1). Then given the first image below, we can convolve it with filter F1 to obtain the second image. You can think of this filter as a detector for “left edges” in the original image—to see this, look at the places where there is a 1 in the output image, and see what pattern exists at that position in the input image. Another interesting filter is F2 = (−1, +1, −1). The third image below shows the result of convolving the first image with F2.
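Below is a minimal NumPy sketch of this 1-D filtering operation. The binary image X and the alignment of the size-two window (ending at pixel i) are illustrative assumptions, since the original figures are not reproduced here; the filter F1 = (−1, +1) is the one from the text.

```python
import numpy as np

def filter_1d(image, filt):
    """Slide a 1-D filter over a zero-padded image, taking a dot product
    at each position; the window for output pixel i ends at input pixel i,
    so a size-two filter F computes F[0]*X[i-1] + F[1]*X[i]."""
    k = len(filt)
    padded = np.concatenate([np.zeros(k - 1), image.astype(float)])
    return np.array([padded[i:i + k] @ filt for i in range(len(image))])

X  = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0])   # hypothetical binary "image"
F1 = np.array([-1.0, 1.0])                       # "left edge" detector from the text

print(filter_1d(X, F1))
# [ 0.  0.  1.  0.  0. -1.  1. -1.  0.  0.]  -- the 1s mark left edges of runs of 1s
```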


 Two-dimensional versions of filters like these are thought to be found in the visual cortex of all mammalian brains. Similar patterns arise from statistical analysis of natural images. Computer vision people used to spend a lot of time hand-designing filter banks. A filter bank is a set of sets of filters, arranged as shown in the diagram below.


All of the filters in the first group are applied to the original image; if there are k such filters, then the result is k new images, which are called channels. Now imagine stacking all these new images up so that we have a cube of data, indexed by the original row and column indices of the image, as well as by the channel. The next set of filters in the filter bank will generally be three-dimensional: each one will be applied to a sub-range of the row and column indices of the image and to all of the channels.


 

These 3D chunks of data are called tensors (we will use a popular piece of neural-network software called TensorFlow because it makes operations on tensors easy). The algebra of tensors is fun, and a lot like matrix algebra, but we won't go into it in any detail.


 

Here is a more complex example of two-dimensional filtering. We have two 3 × 3 filters in the first layer, f1 and f2. You can think of each one as "looking" for three pixels in a row, f1 vertically and f2 horizontally. Assuming our input image is n × n, then the result of filtering with these two filters is an n × n × 2 tensor. Now we apply a tensor filter (hard to draw!) that "looks for" a combination of two horizontal and two vertical bars (now represented by individual pixels in the two channels), resulting in a single final n × n image. When we have a color image as input, we treat it as having 3 channels, and hence as an n × n × 3 tensor.
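To make the filter-bank picture concrete, here is a hedged NumPy sketch: two 3 × 3 filters produce an n × n × 2 tensor, and a second, three-dimensional 3 × 3 × 2 filter combines both channels into a single n × n image. The specific weight values (and the random second-layer filter) are purely illustrative assumptions, not taken from the text.

```python
import numpy as np

def conv2d(image, filt):
    """'Same'-size 2-D filtering of a single-channel image with a k x k filter
    (k odd), zero-padding the borders."""
    k = filt.shape[0]
    p = k // 2
    padded = np.pad(image, p)
    n, m = image.shape
    out = np.zeros((n, m))
    for r in range(n):
        for c in range(m):
            out[r, c] = np.sum(padded[r:r + k, c:c + k] * filt)
    return out

def conv2d_multichannel(tensor, filt3d):
    """Apply one k x k x channels filter to an n x n x channels tensor,
    summing over all channels to produce a single n x n image."""
    return sum(conv2d(tensor[:, :, ch], filt3d[:, :, ch])
               for ch in range(tensor.shape[2]))

n = 8
image = np.random.randint(0, 2, size=(n, n)).astype(float)   # toy binary image

# First layer: f1 "looks for" three pixels in a vertical row, f2 horizontal.
f1 = np.zeros((3, 3)); f1[:, 1] = 1.0
f2 = np.zeros((3, 3)); f2[1, :] = 1.0
layer1 = np.stack([conv2d(image, f1), conv2d(image, f2)], axis=-1)   # n x n x 2 tensor

# Second layer: a single 3 x 3 x 2 tensor filter combining both channels.
g = np.random.randn(3, 3, 2)                                         # illustrative weights
layer2 = conv2d_multichannel(layer1, g)                               # n x n image
print(layer1.shape, layer2.shape)                                     # (8, 8, 2) (8, 8)
```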


We are going to design neural networks that have this structure. Each “bank” of the filter bank will correspond to a neural-network layer. The numbers in the individual filters will be the “weights” (plus a single additive bias or offset value for each filter) of the network, which we will train using gradient descent. What makes this interesting and powerful (and somewhat confusing at first) is that the same weights are used many many times in the computation of each layer. This weight sharing means that we can express a transformation on a large image with relatively few parameters; it also means we’ll have to take care in figuring out exactly how to train it!


 

We will define a filter layer l formally with: (For simplicity, we are assuming that all images and filters are square (having the same number of rows and columns). That is in no way necessary, but is usually fine and definitely simplifies our notation.)


(The full formal definition is an image with heavy notation in the source; the translator typed only a brief summary, rendered in English below.)

• The number of filters, ... (each of these filters is, presumably, a three-dimensional tensor);

• The size of each filter, ..., plus 1 bias value (for that filter);

• The stride s_l is the spacing at which we apply the filter to the image; in all of our examples so far we have used a stride of 1, but if we "skipped over" and applied the filter only at every other index of the image, it would have a stride of 2 (and produce a result image of half the size);

• The input tensor size, ...;

• The padding p_l is the number of extra pixels (usually with value 0) that we add around the edges of the input. For an input of size n_{l−1} × n_{l−1} × m_{l−1}, each padded spatial dimension effectively becomes n_{l−1} + 2·p_l.

This layer produces an output tensor of size n_l × n_l × m_l, where n_l = ⌈(n_{l−1} + 2·p_l − (k_l − 1)) / s_l⌉. The weights are the values defining the filters: there will be m_l different k_l × k_l × m_{l−1} tensors of weight values, plus, if we include a bias term per filter, one additional weight value per filter. A filter with a bias operates exactly like the filter examples above, except that the bias is added to the output. For example, if we incorporated a bias term of 0.5 into the filter F2 above, the output would be (−0.5, 0.5, −0.5, 1.5, −1.5, 1.5, −0.5, 0.5) instead of (−1, 0, −1, 1, −2, 1, −1, 0). (A small helper computing these quantities is sketched in the code below.)

Note 1: the ceiling function ⌈·⌉ rounds up to the next integer.
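As a sketch only, the helper below computes the output spatial size n_l and the number of trainable weights implied by these definitions; the function name and the example numbers are made up for illustration.

```python
import math

def conv_layer_shape(n_prev, m_prev, num_filters, k, stride, padding, bias=True):
    """Output spatial size and weight count of a filter layer, following
    n_l = ceil((n_{l-1} + 2*p_l - (k_l - 1)) / s_l)."""
    n_out = math.ceil((n_prev + 2 * padding - (k - 1)) / stride)
    weights = num_filters * (k * k * m_prev + (1 if bias else 0))
    return n_out, weights

# Example: a 64 x 64 x 3 input, 10 filters of size 5 x 5 x 3, stride 2, padding 2.
print(conv_layer_shape(n_prev=64, m_prev=3, num_filters=10, k=5, stride=2, padding=2))
# (32, 760)
```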

This may seem complicated, but we get a rich class of mappings that exploit image structure and have many fewer weights than a fully connected layer would.


 

2 Max Pooling

It is typical to structure filter banks into a pyramid (Both in engineering and in nature), in which the image sizes get smaller in successive layers of processing. The idea is that we find local patterns, like bits of edges in the early layers, and then look for patterns in those patterns, etc. This means that, effectively, we are looking for patterns in larger pieces of the image as we apply successive filters. Having a stride greater than one makes the images smaller, but does not necessarily aggregate information over that spatial range.


 

Another common layer type, which accomplishes this aggregation, is max pooling. A max pooling layer operates like a filter, but has no weights. You can think of it as a pure functional layer, like a ReLU layer in a fully connected network. It has a filter size, as in a filter layer, but simply returns the maximum value in its field (we sometimes use the term receptive field or just field to mean the area of an input image that a filter is being applied to). Usually, we apply max pooling with the following traits:

• stride > 1, so that the resulting image is smaller than the input image; and

• k ≥ stride, so that the whole image is covered.


 

As a result of applying a max pooling layer, we don’t keep track of the precise location of a pattern. This helps our filters to learn to recognize patterns independent of their location.

Consider a max pooling layer of stride = k = 2. This would map a 64 × 64 × 3 image to a 32 × 32 × 3 image. Note that max pooling layers do not have additional bias or offset values.
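A minimal NumPy sketch of such a pooling layer (stride = k = 2, so a 64 × 64 × 3 tensor maps to 32 × 32 × 3); note there are no weights to learn. The helper name and the divisibility assumption are illustrative choices.

```python
import numpy as np

def max_pool(tensor, k=2, stride=2):
    """Max pooling applied independently to each channel of an n x n x channels
    tensor; assumes (n - k) is divisible by the stride."""
    n, _, channels = tensor.shape
    n_out = (n - k) // stride + 1
    out = np.zeros((n_out, n_out, channels))
    for r in range(n_out):
        for c in range(n_out):
            window = tensor[r * stride:r * stride + k, c * stride:c * stride + k, :]
            out[r, c, :] = window.max(axis=(0, 1))   # keep only the largest value per channel
    return out

image = np.random.rand(64, 64, 3)
print(max_pool(image).shape)   # (32, 32, 3)
```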


 

3 Typical architecture

Here is the form of a typical convolutional network:


Source: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html

After each filter layer there is generally a ReLU layer; there may be multiple filter/ReLU layers, then a max pooling layer, then some more filter/ReLU layers, then max pooling. Once the output is down to a relatively small size, there is typically a last fully-connected layer, leading into an activation function such as softmax that produces the final output. The exact design of these structures is an art; there is not currently any clear theoretical (or even systematic empirical) understanding of how these various design choices affect overall performance of the network.
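As one concrete illustration of this pattern (a hedged sketch only, using the Keras API of TensorFlow, the library mentioned earlier; the layer sizes and the 10-class softmax output are arbitrary choices, not prescribed by the text):

```python
import tensorflow as tf

# Filter/ReLU blocks interleaved with max pooling, then a final
# fully-connected layer feeding a softmax -- the "typical" shape described above.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```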


 

The critical point for us is that this is all just a big neural network, which takes an input and computes an output. The mapping is a differentiable function of the weights, which means we can adjust the weights to decrease the loss by performing gradient descent, and we can compute the relevant gradients using back-propagation! (Well, the derivative is not continuous, both because of the ReLU and the max pooling operations, but we ignore that fact.)


 

Let’s work through a very simple example of how back-propagation can work on a convolutional network. The architecture is shown below. Assume we have a one-dimensional single-channel image, of size n × 1 × 1 and a single k × 1 × 1 filter (where we omit the filter bias) in the first convolutional layer. Then we pass it through a ReLU layer and a fully-connected layer with no additional activation function on the output.


 

For simplicity assume k is odd, let the input image X = A0, and assume we are using squared loss. Then we can describe the forward pass as follows


$$\begin{aligned} Z_i^1 &= {W^1}^T \cdot A^0_{[i-\lfloor k/2 \rfloor \,:\, i + \lfloor k/2 \rfloor]} \\ A^1 &= \mathrm{ReLU}(Z^1) \\ A^2 &= {W^2}^T A^1 \\ L(A^2, y) &= (A^2 - y)^2 \end{aligned}$$
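A minimal NumPy rendering of this forward pass (a sketch under the stated assumptions: k odd, zero padding so the convolution output keeps length n, no filter bias, and W2 taken to be a length-n weight vector):

```python
import numpy as np

def forward(X, W1, W2, y):
    """Toy forward pass: 1-D convolution (no bias), ReLU, fully-connected
    layer, squared loss, matching the equations above."""
    n, k = len(X), len(W1)
    pad = k // 2
    Xp = np.concatenate([np.zeros(pad), X, np.zeros(pad)])
    Z1 = np.array([W1 @ Xp[i:i + k] for i in range(n)])   # Z^1_i = W1 . window around pixel i
    A1 = np.maximum(Z1, 0.0)                               # ReLU
    A2 = W2 @ A1                                           # scalar prediction
    loss = (A2 - y) ** 2
    return Z1, A1, A2, loss
```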

How do we update the weights in filter W1?
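As a hedged sketch of the answer, applying the chain rule and summing the contributions from every position i at which the shared weights W^1 are used gives

$$\frac{\partial L}{\partial W^1} = \sum_{i} \frac{\partial Z_i^1}{\partial W^1}\,\frac{\partial A_i^1}{\partial Z_i^1}\,\frac{\partial L}{\partial A_i^1}, \qquad \frac{\partial Z_i^1}{\partial W^1} = A^0_{[i-\lfloor k/2 \rfloor : i + \lfloor k/2 \rfloor]}, \qquad \frac{\partial A_i^1}{\partial Z_i^1} = \begin{cases} 1 & \text{if } Z_i^1 > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad \frac{\partial L}{\partial A^1} = 2\,(A^2 - y)\,W^2 .$$

Weight sharing shows up in the sum over i: every window of the input contributes a gradient term to the same k filter weights.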



