卷积神经网络

2023-12-18

1.Lenet和现代经典卷积神经网络

手敲一遍过去三四十年的传统神经网络，加深一下对于网络演变的认识，给自己后续学习新进的网络结构打好基础。

主要是分为：

传统卷积神经网络：LeNet
现代卷积神经网络：AlexNet、VGG、NiN、GoogleLeNet、Resnet、DenseNet

2.传统卷积神经网络

2.1LeNet

下面是简化版的LeNet

import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(
    nn.Conv2d(1,6,kernel_size=5,padding=2),nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2,stride=2),
    nn.Conv2d(6,16,kernel_size=5),nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2,stride=2),
    nn.Flatten(),
    nn.Linear(16*5*5,120),nn.Sigmoid(),
    nn.Linear(120,84),nn.Sigmoid(),
    nn.Linear(84,10)
)

设置输入数据并输出每一层的形状，以此来检查模型

X = torch.rand(size=(1,1,28,28),dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape: \t',X.shape)

Conv2d output shape: 	 torch.Size([1, 6, 28, 28])
Sigmoid output shape: 	 torch.Size([1, 6, 28, 28])
AvgPool2d output shape: 	 torch.Size([1, 6, 14, 14])
Conv2d output shape: 	 torch.Size([1, 16, 10, 10])
Sigmoid output shape: 	 torch.Size([1, 16, 10, 10])
AvgPool2d output shape: 	 torch.Size([1, 16, 5, 5])
Flatten output shape: 	 torch.Size([1, 400])
Linear output shape: 	 torch.Size([1, 120])
Sigmoid output shape: 	 torch.Size([1, 120])
Linear output shape: 	 torch.Size([1, 84])
Sigmoid output shape: 	 torch.Size([1, 84])
Linear output shape: 	 torch.Size([1, 10])

在整个卷积块中，与上一层相比，每一层特征的高度和宽度都减小了。第一个卷积层使用2个像素的填充，来补偿$5 \times 5$卷积核导致的特征减少。相反，第二个卷积层没有填充，因此高度和宽度都减少了4个像素。随着层叠的上升，通道的数量从输入时的1个，增加到第一个卷积层之后的6个，再到第二个卷积层之后的16个。同时，每个汇聚层的高度和宽度都减半。最后，每个全连接层减少维数，最终输出一个维数与结果分类数相匹配的输出。

训练评估Lenet-5模型

batch_size = 256
train_iter,test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

lr,num_epochs = 0.9,10
d2l.train_ch6(net,test_iter,test_iter,num_epochs,lr,d2l.try_gpu())#使用GPU进行训练

总结：

为了构造高性能的卷积神经网络，我们通常对卷积层进行排列，逐渐降低其表示的空间分辨率，同时增加通道数。
在传统的卷积神经网络中，卷积块编码得到的表征在输出之前需由一个或多个全连接层进行处理。

3.现代经典卷积神经网络

3.1AlexNet

模型设计：

因为ImageNet大多数图像的宽高比MINIST多十倍，因此需要更大的卷积窗口来捕获特征，所以第一层卷积窗口形状是11X11。第二层中的卷积窗口形状被缩减为5 × 5，然后是3 × 3。此外，在第一层、第二层和第五层卷积层之后，加入窗口形状为3 × 3、步幅为2的最大汇聚层。而且，AlexNet的卷积通道数目是LeNet的10倍。

和LeNet还有一点不同的是，AlexNet使用RELU激活函数替换掉了原来的Sigmoid激活函数。

import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(
    nn.Conv2d(1,96,kernel_size=11,stride=4,padding=1),nn.ReLU(),
    nn.MaxPool2d(kernel_size=3,stride=2),
    
    nn.Conv2d(96,256,kernel_size=5,padding=2),nn.ReLU(),
    nn.MaxPool2d(kernel_size=3,stride=2),
    
    nn.Conv2d(256,384,kernel_size=3,padding=1),nn.ReLU(),
    nn.Conv2d(384,384,kernel_size=3,padding=1),nn.ReLU(),
    nn.Conv2d(384,256,kernel_size=3,padding=1),nn.ReLU(),
    nn.MaxPool2d(kernel_size=3,stride=2),
    nn.Flatten(),
    
    nn.Linear(6400,4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096,4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    
    nn.Linear(4096,10))

X = torch.randn(1,1,224,224)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape\t',X.shape)

输出网络形状变化：

Conv2d output shape	 torch.Size([1, 96, 54, 54])
ReLU output shape	 torch.Size([1, 96, 54, 54])
MaxPool2d output shape	 torch.Size([1, 96, 26, 26])
Conv2d output shape	 torch.Size([1, 256, 26, 26])
ReLU output shape	 torch.Size([1, 256, 26, 26])
MaxPool2d output shape	 torch.Size([1, 256, 12, 12])
Conv2d output shape	 torch.Size([1, 384, 12, 12])
ReLU output shape	 torch.Size([1, 384, 12, 12])
Conv2d output shape	 torch.Size([1, 384, 12, 12])
ReLU output shape	 torch.Size([1, 384, 12, 12])
Conv2d output shape	 torch.Size([1, 256, 12, 12])
ReLU output shape	 torch.Size([1, 256, 12, 12])
MaxPool2d output shape	 torch.Size([1, 256, 5, 5])
Flatten output shape	 torch.Size([1, 6400])
Linear output shape	 torch.Size([1, 4096])
ReLU output shape	 torch.Size([1, 4096])
Dropout output shape	 torch.Size([1, 4096])
Linear output shape	 torch.Size([1, 4096])
ReLU output shape	 torch.Size([1, 4096])
Dropout output shape	 torch.Size([1, 4096])
Linear output shape	 torch.Size([1, 10])

训练：

batch_size = 128
train_iter,test_iter = d2l.load_data_fashion_mnist(batch_size,resize=224)
lr,num_epochs = 0.01,10
d2l.train_ch6(net,train_iter,test_iter,num_epochs,lr,d2l.try_gpu())

输出：

3.2VGG

本质上是块的累积，更大更深的AlexNet

说明：

白色矩形框：代表卷积和激活函数
红色矩形框：代表最大池化下载量
蓝色矩形框：全连接层和激活函数
橙色矩形框：softmax处理

结构过程：（配置表和结构图一起观察）

1、首先输入一张2242243大小的图像，经过两个33的卷积层之后，所得到的特征图大小为224224*64（尺寸大小不变，因为采用的是64个卷积核，所以深度也为64）。

2、通过一个最大池化下载量层，得到的特征图为11211264（大小缩小一半，不改变深度）。

3、再通过两个33128的卷积层，得到的特征图为112112128（深度变为128）。

4、通过一个最大池化下载量层，得到的特征图为5656128（大小缩小一半，不改变深度）。

5、再通过三个33256的卷积层，得到的特征图为5656256（深度变为256）。

6、通过一个最大池化下载量层，得到的特征图为2828256（大小缩小一半，不改变深度）。

7、再通过三个33512的卷积层，得到的特征图为2828512（深度变为512）。

8、通过一个最大池化下载量层，得到的特征图为1414512（大小缩小一半，不改变深度）。

9、再通过三个33512的卷积层，得到的特征图为1414512（深度变为512）。

10、通过一个最大池化下载量层，得到的特征图为77512（大小缩小一半，不改变深度）。

11、再通过两个为4000个节点的全连接层以及激活函数，得到114096向量

12、再通过一个为1000个节点的全连接层（因为1000个类别），注意不需要激活函数，得到111000向量。

13、最后将通过全连接层得到的一维向量，输入到softmax激活函数，将预测结果转化为概率分布。

模型设计：

import torch
from torch import nn
from d2l import torch as d2l

def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels,out_channels,
                                kernel_size=3,padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
    
    return nn.Sequential(*layers)

#给一个VGG架构
conv_arch = ((1,64),(1,128),(2,256),(2,512),(2,512))

#实现VGG-11
def vgg(conv_arch):
    conv_blks = []
    in_channels = 1
    
    #卷积层部分
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs,in_channels,out_channels))
        in_channels = out_channels
    return nn.Sequential(
        *conv_blks,nn.Flatten(),
        #全连接部分
        nn.Linear(out_channels * 7 * 7, 4096),nn.ReLU(),nn.Dropout(0.5),
        nn.Linear(4096,4096),nn.ReLU(),nn.Dropout(0.5),
        nn.Linear(4096,10)
    )
#实现网络
net = vgg(conv_arch)

#构建输入样本并观察输入网络每层形状
X = torch.randn(size=(1,1,224,224))
for blk in net:
    X = blk(X)
    print(blk.__class__.__name__,'output shape\t',X.shape)

模型形状输出：

Sequential output shape	 torch.Size([1, 64, 112, 112])
Sequential output shape	 torch.Size([1, 128, 56, 56])
Sequential output shape	 torch.Size([1, 256, 28, 28])
Sequential output shape	 torch.Size([1, 512, 14, 14])
Sequential output shape	 torch.Size([1, 512, 7, 7])
Flatten output shape	 torch.Size([1, 25088])
Linear output shape	 torch.Size([1, 4096])
ReLU output shape	 torch.Size([1, 4096])
Dropout output shape	 torch.Size([1, 4096])
Linear output shape	 torch.Size([1, 4096])
ReLU output shape	 torch.Size([1, 4096])
Dropout output shape	 torch.Size([1, 4096])
Linear output shape	 torch.Size([1, 10])

训练模型：

#训练模型
lr, num_epochs,batch_size = 0.05,10,128
train_iter,test_iter = d2l.load_data_fashion_mnist(batch_size,resize=224)
d2l.train_ch6(net,train_iter,test_iter,num_epochs,lr,d2l.try_gpu())

3.3NiN

import torch
from torch import nn
from d2l import torch as d2l 

def nin_block(in_channels,out_channels,kernel_size,strides,padding):
    return nn.Sequential(
        nn.Conv2d(in_channels,out_channels,kernel_size,strides,padding),
        nn.ReLU(),
        nn.Conv2d(out_channels,out_channels,kernel_size=1),nn.ReLU(),
        nn.Conv2d(out_channels,out_channels,kernel_size=1),nn.ReLU()
    )

net = nn.Sequential(
    nin_block(1,96,kernel_size=11, strides=4,padding=0),
    nn.MaxPool2d(3,stride=2),
    nin_block(96,256,kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3,stride=2),
    nin_block(256,384,kernel_size=3,strides=1,padding=1),
    nn.MaxPool2d(3,stride=2),
    nn.Dropout(0.5),
    nin_block(384,10,kernel_size=3,strides=1,padding=1),
    nn.AdaptiveAvgPool2d((1,1)),
    nn.Flatten()
)
#我们创建一个数据样本来查看每个块的输出形状
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)
    
#训练模型
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

总结：

NiN使用由一个卷积层和多个$1\times 1$卷积层组成的块。该块可以在卷积神经网络中使用，以允许更多的每像素非线性。
NiN去除了容易造成过拟合的全连接层，将它们替换为全局平均汇聚层（即在所有位置上进行求和）。该汇聚层通道数量为所需的输出数量（例如，Fashion-MNIST的输出为10）。
移除全连接层可减少过拟合，同时显著减少NiN的参数。
NiN的设计影响了许多后续卷积神经网络的设计。

3.4GoogLeNet

核心：提出了Inception块，它是把多个卷积或池化操作，放在一起组装成一个网络模块，设计神经网络时以模块为单位去组装整个网络结构。

模型实现：

首先实现关键的Inception模块：

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

class Inception(nn.Module):
    def __init__(self,in_channels,c1,c2,c3,c4,**kwargs):
        super(Inception,self).__init__(**kwargs)
        #线路1，单1x1juanjc
        self.p1_1 = nn.Conv2d(in_channels,c1,kernel_size=1)
        #线路2，1x1卷积层后接3x3卷积层
        self.p2_1 = nn.Conv2d(in_channels,c2[0],kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0],c2[1],kernel_size=3,padding=1)
        #线路3，后接5x5卷积层
        self.p3_1 = nn.Conv2d(in_channels,c3[0],kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0],c3[1],kernel_size=5,padding=2)
        #线路4，3x3最大汇聚层后接1x1卷积层
        self.p4_1 = nn.MaxPool2d(kernel_size=3,stride=1,padding=1)
        self.p4_2 = nn.Conv2d(in_channels,c4,kernel_size=1)
    
    def forward(self,x):
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        #在通道维度上连接输出
        return torch.cat((p1,p2,p3,p4),dim=1)

利用Inception实现GoogLeNet模块：

# 实现GoogleNet的各个模块
#模块1：使用64个通道，7x7卷积层
b1 = nn.Sequential(nn.Conv2d(1,64,kernel_size=1,stride=2,padding=3),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3,stride=2,padding=1))
#第二个模块使用两个卷积层：第一个卷积层64个通道、1x1卷积层；第二个卷积层使用通道数量增加3倍的3x3卷积层
b2 = nn.Sequential(nn.Conv2d(64,64,kernel_size=1),
                   nn.ReLU(),
                   nn.Conv2d(64,192,kernel_size=3,padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3,stride=2,padding=1))
#第三个模块
b3 = nn.Sequential(Inception(192,64,(96,128),(16,32),32),
                   Inception(256,128,(128,192),(32,96),64),
                   nn.MaxPool2d(kernel_size=3,stride=2,padding=1))

#第四个模块
b4 = nn.Sequential(Inception(480, 192, (96, 208), (16, 48), 64),
                   Inception(512, 160, (112, 224), (24, 64), 64),
                   Inception(512, 128, (128, 256), (24, 64), 64),
                   Inception(512, 112, (144, 288), (32, 64), 64),
                   Inception(528, 256, (160, 320), (32, 128), 128),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
#第五个模块
b5 = nn.Sequential(Inception(832, 256, (160, 320), (32, 128), 128),
                   Inception(832, 384, (192, 384), (48, 128), 128),
                   nn.AdaptiveAvgPool2d((1,1)),
                   nn.Flatten())
net = nn.Sequential(b1,b2,b3,b4,b5,nn.Linear(1024,10))

创建样例数据并输出模型形状结构：

#创建样例数据
X = torch.rand(size=(1, 1, 96, 96))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

输出：

Sequential output shape:	 torch.Size([1, 64, 26, 26])
Sequential output shape:	 torch.Size([1, 192, 13, 13])
Sequential output shape:	 torch.Size([1, 480, 7, 7])
Sequential output shape:	 torch.Size([1, 832, 4, 4])
Sequential output shape:	 torch.Size([1, 1024])
Linear output shape:	 torch.Size([1, 10])

训练：

lr,num_epochs,batch_size = 0.1,10,128
train_iter,test_iter = d2l.load_data_fashion_mnist(batch_size,resize=96)
d2l.train_ch6(net,train_iter,test_iter,num_epochs,lr,d2l.try_gpu())

训练结果：

3.5BatchNorm（补充知识）

本质上就是将通道层作为卷积层，利用小批量的均值和标准差，不断调整神经网络的中间输出，使整个神经网络各层的中间输出值更加稳定。

从形式上来说，用$\mathbf{x} \in \mathcal{B}$表示一个来自小批量$\mathcal{B}$的输入，批量规范化$\mathrm{BN}$根据以下表达式转换$\mathbf{x}$：

$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}.$

$\hat{\boldsymbol{\mu}}\mathcal{B}$是小批量$\mathcal{B}$的样本均值，$\hat{\boldsymbol{\sigma}}\mathcal{B}$是小批量$\mathcal{B}$的样本标准差。应用标准化后，生成的小批量的平均值为0和单位方差为1。由于单位方差（与其他一些魔法数）是一个主观的选择，因此我们通常包含拉伸参数（scale）$\boldsymbol{\gamma}$和偏移参数（shift）$\boldsymbol{\beta}$，它们的形状与$\mathbf{x}$相同。

对于全连接层处批量归一化计算通常，我们将批量规范化层置于全连接层中的仿射变换和激活函数之间。

设全连接层的输入为x，权重参数和偏置参数分别为$\mathbf{W}$和$\mathbf{b}$，激活函数为$\phi$，批量规范化的运算符为$\mathrm{BN}$。

那么，使用批量规范化的全连接层的输出的计算详情如下：

$\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}) ).$

**均值和方差是在应用变换的"相同"小批量上计算的。**

构建并测试：

构建一个具有张量的批量规划化层：

import torch
from torch import nn
from d2l import torch as d2l


def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # 通过is_grad_enabled来判断当前模式是训练模式还是预测模式
    if not torch.is_grad_enabled():
        # 如果是在预测模式下，直接使用传入的移动平均所得的均值和方差
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # 使用全连接层的情况，计算特征维上的均值和方差
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 使用二维卷积层的情况，计算通道维上（axis=1）的均值和方差。
            # 这里我们需要保持X的形状以便后面可以做广播运算
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # 训练模式下，用当前的均值和方差做标准化
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # 更新移动平均的均值和方差
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # 缩放和移位
    return Y, moving_mean.data, moving_var.data

集成到自定义层中（处理数据移动到GPU上、分配初始化任何必须的变量、跟踪移动平均线“均值、方差”）

class BatchNorm(nn.Module):
    # num_features：完全连接层的输出数量或卷积层的输出通道数。
    # num_dims：2表示完全连接层，4表示卷积层
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # 参与求梯度和迭代的拉伸和偏移参数，分别初始化成1和0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # 非模型参数的变量初始化为0和1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # 如果X不在内存上，将moving_mean和moving_var
        # 复制到X所在显存上
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # 保存更新过的moving_mean和moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y

部署BatchNorm到LeNet中：

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), BatchNorm(6, num_dims=4), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), BatchNorm(16, num_dims=4), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(16*4*4, 120), BatchNorm(120, num_dims=2), nn.Sigmoid(),
    nn.Linear(120, 84), BatchNorm(84, num_dims=2), nn.Sigmoid(),
    nn.Linear(84, 10))
#训练测试
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

输出：

loss 0.271, train acc 0.899, test acc 0.821
24521.9 examples/sec on cuda:0

3.6ResNet

越来越深越复杂的网络并不代表性能越来越好，达到一个阈值之后就会出现一个瓶颈（前面层数提取到的特征很大一部分会丢失）。

而ResNet层利用一个残差块加入快速通道来尽可能保留前面卷积层保留的参数。输入可通过跨层数据线路更快地向前传播。

模型构建：

残差块模块：

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

#残差块
class Residual(nn.Module):  #@save
    def __init__(self, input_channels, num_channels,
                 use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)
    
#实现ResNet残差模块
def resnet_block(input_channels,num_channels,num_residuals,first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(input_channels,num_channels,
                                use_1x1conv=True,strides=2))
        else:
            blk.append(Residual(num_channels,num_channels))
    return blk

ResNet整体模型构建：

#构建ResNet整体模型
b1 = nn.Sequential(nn.Conv2d(1,64,kernel_size=7,stride=2,padding=3),
                   nn.BatchNorm2d(64),nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3,stride=2,padding=1))

b2 = nn.Sequential(*resnet_block(64,64,2,first_block=True))
b3 = nn.Sequential(*resnet_block(64,128,2))
b4 = nn.Sequential(*resnet_block(128,256,2))
b5 = nn.Sequential(*resnet_block(256,512,2))

net = nn.Sequential(b1,b2,b3,b4,b5,
                    nn.AdaptiveAvgPool2d((1,1)),
                    nn.Flatten(),nn.Linear(512,10))

#观察ResNet不同模块输入形状如何变化
X = torch.rand(size=(1,1,224,224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape/t',X.shape)

模型形状变化：

Sequential output shape/t torch.Size([1, 64, 56, 56])
Sequential output shape/t torch.Size([1, 64, 56, 56])
Sequential output shape/t torch.Size([1, 128, 28, 28])
Sequential output shape/t torch.Size([1, 256, 14, 14])
Sequential output shape/t torch.Size([1, 512, 7, 7])
AdaptiveAvgPool2d output shape/t torch.Size([1, 512, 1, 1])
Flatten output shape/t torch.Size([1, 512])
Linear output shape/t torch.Size([1, 10])

训练模型：

#训练模型
lr, num_epochs, batch_size = 0.05,10,256
train_iter,test_iter = d2l.load_data_fashion_mnist(batch_size,resize=96)
d2l.train_ch6(net,train_iter,test_iter,num_epochs,lr,d2l.try_gpu())

输出：

loss 0.012, train acc 0.997, test acc 0.893
5032.7 examples/sec on cuda:0

小结：

学习嵌套函数（nested function）是训练神经网络的理想情况。在深层神经网络中，学习另一层作为恒等映射（identity function）较容易（尽管这是一个极端情况）。
残差映射可以更容易地学习同一函数，例如将权重层中的参数近似为零。
利用残差块（residual blocks）可以训练出一个有效的深层神经网络：输入可以通过层间的残余连接更快地向前传播。
残差网络（ResNet）对随后的深层神经网络设计产生了深远影响。

3.7DenseNet

ResNet和DenseNet的关键区别在于，DenseNet输出是连接（用图中的[, ]表示）而不是如ResNet的简单相加。因此，在应用越来越复杂的函数序列后，我们执行从x到其展开式的映射：

最后，将这些展开式结合到多层感知机中，再次减少特征的数量。

稠密网络主要由2部分构成：稠密块（dense block）和过渡层（transition layer）。前者定义如何连接输入和输出，而后者则控制通道数量，使其不会太复杂。

构建模型：

import torch
from torch import nn
from d2l import torch as d2l

#卷积块
def conv_block(input_channels,num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),nn.ReLU(),
        nn.Conv2d(input_channels,num_channels,kernel_size=3,padding=1)
    )
    
#稠密块使用。nn.Sequential(*layer) 创建一个顺序模型，将列表 layer 中的卷积块按顺序连接起来。*layer 的作用是将列表 layer 中的卷积块解包，使它们成为 nn.Sequential 函数的位置参数。这样，self.net 就包含了按顺序连接的卷积块组成的密集层。
class DenseBlock(nn.Module):
    def __init__(self,num_convs,input_channels,num_channels):
        super(DenseBlock,self).__init__()
        layer =[]
        for i in range(num_convs):
            layer.append(conv_block(
                num_channels*i+input_channels,num_channels))
        self.net = nn.Sequential(*layer)
    
    def forward(self,X):
        for blk in self.net:
            Y = blk(X)
            #连接通道维度上每个块的输入和输出
            X = torch.cat((X,Y),dim=1)
        return X

#过渡层:控制模型复杂度。利用11x1卷积层减少通道层，并利用stride=2的平均池化层来减半高宽，进一步降低模型复杂度。
def transition_block(input_channels,num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),nn.ReLU(),
        nn.Conv2d(input_channels,num_channels,kernel_size=1),
        nn.AvgPool2d(kernel_size=2,stride=2)
    )

接着我们开始构建DenseNet，DenseNet首先使用同ResNet一样的单卷积层和最大汇聚层。

b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))	

接下来，类似于ResNet使用的4个残差块，DenseNet使用的是4个稠密块。与ResNet类似，我们可以设置每个稠密块使用多少个卷积层。这里我们设成4，从而与 sec_resnet的ResNet-18保持一致。稠密块里的卷积层通道数（即增长率）设为32，所以每个稠密块将增加128个通道。在每个模块之间，ResNet通过步幅为2的残差块减小高和宽，DenseNet则使用过渡层来减半高和宽，并减半通道数。

# num_channels为当前的通道数
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
blks = []
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    blks.append(DenseBlock(num_convs, num_channels, growth_rate))
    # 上一个稠密块的输出通道数
    num_channels += num_convs * growth_rate
    # 在稠密块之间添加一个转换层，使通道数量减半
    if i != len(num_convs_in_dense_blocks) - 1:
        blks.append(transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2
        
#最后接上全局汇聚层和全连接层来输出结果。
net = nn.Sequential(
    b1, *blks,
    nn.BatchNorm2d(num_channels), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(num_channels, 10))

利用如下代码来观察DenseNet网络每层的形状：

X = torch.rand(size=(1,1,224,224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape\t',X.shape)

输出：

Sequential output shape	 	 	 torch.Size([1, 64, 56, 56])
DenseBlock output shape	 	 	 torch.Size([1, 192, 56, 56])
Sequential output shape			 torch.Size([1, 96, 28, 28])
DenseBlock output shape	 		 torch.Size([1, 224, 28, 28])
Sequential output shape	 	 	 torch.Size([1, 112, 14, 14])
DenseBlock output shape	 	 	 torch.Size([1, 240, 14, 14])
Sequential output shape	 	 	 torch.Size([1, 120, 7, 7])
DenseBlock output shape	 	 	 torch.Size([1, 248, 7, 7])
BatchNorm2d output shape	 	 torch.Size([1, 248, 7, 7])
ReLU output shape	 		 	 torch.Size([1, 248, 7, 7])
AdaptiveAvgPool2d output shape	 torch.Size([1, 248, 1, 1])
Flatten output shape	 		 torch.Size([1, 248])
Linear output shape	 			 torch.Size([1, 10])

训练：

lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

loss 0.140, train acc 0.948, test acc 0.885
5626.3 examples/sec on cuda:0

小结：

在跨层连接上，不同于ResNet中将输入与输出相加，稠密连接网络（DenseNet）在通道维上连结输入与输出。
DenseNet的主要构建模块是稠密块和过渡层。
在构建DenseNet时，我们需要通过添加过渡层来控制网络的维数，从而再次减少通道的数量。

4.总结

上述复现了过去几十年卷积神经网络的经典网络并进行测试训练，同时也学习了BatchNorm（不能提高准确率，只能够提高收敛速度）。