
Multilayer Perceptron

1. Objectives

Implement a simple multilayer perceptron and study the related techniques: model selection, underfitting and overfitting, weight decay (L1/L2 regularization), dropout, forward and backward propagation, numerical stability and model initialization, and environment/distribution shift.

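Of the topics above, weight decay can be illustrated directly with PyTorch's optimizer API. The following is a minimal sketch, not part of the original note; the small model and the penalty strength 1e-3 are illustrative assumptions.

import torch
from torch import nn

# A small model only for illustrating the optimizer setting below
toy_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# weight_decay adds an L2 penalty to every parameter update; 1e-3 is only an example value
trainer = torch.optim.SGD(toy_net.parameters(), lr=0.1, weight_decay=1e-3)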
2. Multilayer Perceptron

import torch
from torch import nn
from d2l import torch as d2l

# Implement a multilayer perceptron with a single hidden layer of 256 units
num_inputs, num_outputs, num_hiddens = 784, 10, 256

# Parameters for the from-scratch implementation
W1 = nn.Parameter(
    torch.randn(num_inputs, num_hiddens, requires_grad=True) * 0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
W2 = nn.Parameter(
    torch.randn(num_hiddens, num_outputs, requires_grad=True) * 0.01)
b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))

params = [W1, b1, W2, b2]

# Implement the ReLU activation function
def relu(X):
    a = torch.zeros_like(X)
    return torch.max(X, a)

# Weight initialization for the concise (nn.Sequential) implementation
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

# From-scratch model
def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(X @ W1 + b1)
    return H @ W2 + b2

# Concise model (overrides the from-scratch `net` above); note the ReLU
# between the two linear layers, without which the model would be purely linear
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 10))
net.apply(init_weights)

# Training
batch_size, lr, num_epochs = 256, 0.1, 10
loss = nn.CrossEntropyLoss(reduction='none')
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
updater = torch.optim.SGD(net.parameters(), lr=lr)

# train_ch3 is the chapter-3 training loop from the d2l package
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
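Dropout, also listed in the objectives, is not exercised above. Below is a minimal sketch of how it could be inserted into the same concise model; the dropout probability 0.5 is an illustrative choice, not from the original note.

# Variant of the concise model with dropout after the hidden layer.
# nn.Dropout randomly zeroes activations during training (net.train())
# and is a no-op during evaluation (net.eval()).
dropout_net = nn.Sequential(nn.Flatten(),
                            nn.Linear(784, 256),
                            nn.ReLU(),
                            nn.Dropout(0.5),
                            nn.Linear(256, 10))
dropout_net.apply(init_weights)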

3. Numerical Stability and Model Initialization

Two problems can arise during training: exploding gradients and vanishing gradients.

  • Exploding gradients
# Exploding gradients: repeatedly multiplying random matrices blows up the entries
M = torch.normal(0, 1, size=(4, 4))
print("Initial matrix\n", M)
for i in range(100):
    M = torch.mm(M, torch.normal(0, 1, (4, 4)))
print("After multiplying 100 matrices\n", M)
Initial matrix
tensor([[ 0.5473,  0.2958, -0.6352,  0.8350],
        [ 0.6142, -2.1497, -0.9319, -0.7660],
        [-0.1770, -0.5082,  0.6372,  1.4567],
        [-0.7529,  1.1125,  0.6797,  0.0318]])
After multiplying 100 matrices
tensor([[-5.5963e+21, -3.1355e+21,  1.8995e+22,  6.6069e+21],
        [ 9.9777e+22,  5.5904e+22, -3.3867e+23, -1.1780e+23],
        [-2.2861e+20, -1.2809e+20,  7.7594e+20,  2.6989e+20],
        [-5.6377e+22, -3.1587e+22,  1.9136e+23,  6.6558e+22]])
  • Vanishing gradients
# Vanishing gradients: the sigmoid's gradient is close to zero for large |x|
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)
y.backward(torch.ones_like(x))

d2l.plot(x.detach().numpy(), [y.detach().numpy(), x.grad.numpy()],
         legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))
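For contrast (this plot is a supplement, not part of the original note), the same experiment with ReLU shows a gradient that stays at 1 for all positive inputs, which is why ReLU mitigates the vanishing-gradient problem mentioned in the summary.

# ReLU's gradient is 1 for x > 0 and 0 for x < 0, so it does not shrink toward zero
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
y.backward(torch.ones_like(x))

d2l.plot(x.detach().numpy(), [y.detach().numpy(), x.grad.numpy()],
         legend=['relu', 'gradient'], figsize=(4.5, 2.5))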

4. Summary

  • Parameter initialization before training a deep network requires care: gradients and parameters must stay well controlled to avoid vanishing or exploding gradients.
  • The ReLU activation function alleviates the vanishing-gradient problem and can speed up convergence.
  • Random initialization is key to breaking symmetry before optimization begins.
  • Xavier initialization suggests that, for each layer, the variance of the outputs is not affected by the number of inputs, and the variance of the gradients is not affected by the number of outputs (see the sketch below).
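A minimal sketch of applying Xavier initialization with PyTorch's built-in initializer (a supplement to the note, not code from it):

# Xavier (Glorot) uniform initialization scales weights by fan-in and fan-out,
# keeping activation and gradient variances roughly constant across layers
def init_xavier(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

net.apply(init_xavier)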