Notes on BN, SyncBN, IN, LN, and GN (Cassiel_cx)
Study notes on BatchNorm, SyncBatchNorm, InstanceNorm, LayerNorm, and GroupNorm
1 BatchNorm

How BN works
BN is the most widely used normalization method in computer vision. It computes the mean and variance of the feature map along the N, H, W dimensions and then uses them to normalize the feature map. The computation proceeds in four steps: 1) compute the mean of each channel along the N, H, W dimensions; 2) compute the variance of each channel along the same dimensions; 3) normalize the feature map; 4) add the learnable parameters γ and β (updated after each backward pass) and apply a scale-and-shift affine transformation to the normalized feature map.
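In formula form (the standard BN equations, with the statistics for channel c taken over N, H, W; this simply restates the steps above and is not tied to a particular implementation):

\mu_c = \frac{1}{NHW}\sum_{n,h,w} x_{n,c,h,w}, \qquad \sigma_c^2 = \frac{1}{NHW}\sum_{n,h,w} (x_{n,c,h,w} - \mu_c)^2
\hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}, \qquad y_{n,c,h,w} = \gamma_c \hat{x}_{n,c,h,w} + \beta_c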

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
Args:
    num_features: number of input feature channels
    eps: value added to the denominator for numerical stability (so the denominator cannot approach or reach 0), default 1e-5
    momentum: momentum (exponential averaging factor) used when updating running_mean and running_var, default 0.1
    affine: boolean, whether to add the learnable affine parameters γ and β to the BN layer, default True
    track_running_stats: boolean, whether to track the running mean and variance during training. If False, the BN layer estimates the mean and variance only from the current input in both training and evaluation; if the batch size is small, these statistics can deviate considerably from the global statistics and lead to poor results. Default True
Formula for updating running_mean and running_var:
\hat{x}_{\text{new}} = (1 - \text{momentum}) \cdot \hat{x} + \text{momentum} \cdot x_t
where \hat{x} is the running_mean or running_var before the update and x_t is the mean or variance of the current input. At evaluation time (model.eval()), running_mean and running_var are used as the mean and variance to normalize the input tensor.
Advantages of BN
- BN keeps the distribution of each layer's inputs relatively stable (so larger learning rates can be used), which greatly speeds up training and accelerates convergence;
- BN makes the model less sensitive to the parameters in the network, weakens the dependence on initialization, simplifies hyperparameter tuning, and makes training more stable;
- BN allows the network to use saturating activation functions (e.g., sigmoid): after normalization the inputs fall in the region where the gradient is relatively large, which mitigates vanishing or exploding gradients;
- BN has a slight regularization effect (roughly equivalent to injecting noise into the hidden layers, similar to Dropout), which can alleviate overfitting.
Disadvantages of BN
- Sensitive to the batch size. If the batch size is too small, the estimated mean and variance do not represent the overall data distribution well; statistics computed from a small batch are highly random, making it hard for the model to converge (see the sketch after this list);
- Not well suited to RNNs, style transfer, and similar tasks. Taking style transfer as an example, a mini-batch may contain several unrelated images, and computing the mean and variance over all of them dilutes the details that belong to each individual image.
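A minimal sketch of the first point: with a small batch the per-channel statistics fluctuate strongly from batch to batch, while with a larger batch they stay close to the population values. The shapes and the synthetic "population" below are made up purely for illustration:

import torch

torch.manual_seed(0)
# Hypothetical population: 10,000 samples, 3 channels, 8x8 feature maps
population = torch.randn(10000, 3, 8, 8) * 2.0 + 5.0

for batch_size in (2, 64):
    # Draw 100 random mini-batches and record the estimated mean of channel 0
    means = []
    for _ in range(100):
        idx = torch.randint(0, population.size(0), (batch_size,))
        batch = population[idx]
        means.append(batch.mean(dim=[0, 2, 3])[0].item())
    means = torch.tensor(means)
    # The spread of the estimated mean shrinks as the batch size grows
    print(f'batch_size={batch_size:3d}  mean of estimates={means.mean():.3f}  std of estimates={means.std():.3f}')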
Code example
(1) Randomly initialize an input tensor and instantiate BN
import torch
import torch.nn as nn
# Fix the random seed so the randomly generated input is the same on every run
torch.manual_seed(42)
# Randomly generate an input of shape [1, 2, 2, 2]
input = torch.randn((1,2,2,2)).cuda()
print('input:', input)
# Instantiate BN
bn = nn.BatchNorm2d(num_features=2, eps=0.00001, momentum=0.1, affine=True, track_running_stats=True).cuda()
bn.running_mean = (torch.ones([2])*2).cuda()
bn.running_var = (torch.ones([2])*1).cuda()
bn.train()
# Inspect the parameters before the update
print('training:', bn.training)
print('running_mean:', bn.running_mean)
print('running_var:', bn.running_var)
print('weight:', bn.weight) # γ, initialized to 1
print('bias:', bn.bias) # β, initialized to 0
# Printed output
'''
input: tensor([[[[ 0.3367, 0.1288],
[ 0.2345, 0.2303]],
[[-1.1229, -0.1863],
[ 2.2082, -0.6380]]]], device='cuda:0')
training: True
running_mean: tensor([2., 2.], device='cuda:0')
running_var: tensor([1., 1.], device='cuda:0')
weight: Parameter containing:
tensor([1., 1.], device='cuda:0', requires_grad=True)
bias: Parameter containing:
tensor([0., 0.], device='cuda:0', requires_grad=True)
'''
(2) Pass the input through the BN layer and check the updated parameters
# Forward pass
output = bn(input)
print('output:', output)
# Inspect the parameters after the update
print('training:', bn.training)
print('running_mean:', bn.running_mean)
print('running_var:', bn.running_var)
print('weight:', bn.weight)
print('bias:', bn.bias)
# Printed output. Since no backward pass has been performed, γ and β are unchanged
'''
output: tensor([[[[ 1.4150, -1.4102],
[ 0.0257, -0.0305]],
[[-0.9276, -0.1964],
[ 1.6731, -0.5491]]]], device='cuda:0',
grad_fn=<CudnnBatchNormBackward0>)
training: True
running_mean: tensor([1.8233, 1.8065], device='cuda:0')
running_var: tensor([0.9007, 1.1187], device='cuda:0')
weight: Parameter containing:
tensor([1., 1.], device='cuda:0', requires_grad=True)
bias: Parameter containing:
tensor([0., 0.], device='cuda:0', requires_grad=True)
'''
(3) Following BN's computation steps, write the normalization by hand
# Compute the mean and variance of the input. Note that unbiased defaults to True in torch.var(), which gives the unbiased variance estimate; here it must be set to False
cur_mean = torch.mean(input, dim=[0,2,3])
cur_var = torch.var(input, dim=[0,2,3], unbiased=False)
print('cur_mean:', cur_mean)
print('cur_var:', cur_var)
# Compute running_mean and running_var
new_mean = (torch.ones([2])*2).cuda() * (1-bn.momentum) + cur_mean * bn.momentum
new_var = (torch.ones([2])*1).cuda() * (1-bn.momentum) + cur_var * bn.momentum
print('new_mean:', new_mean)
print('new_var:', new_var)
# Printed output. The computed new_mean and new_var match the running_mean and running_var from step (2)
'''
cur_mean: tensor([0.2326, 0.0653])
cur_var: tensor([0.0072, 2.1872])
new_mean: tensor([1.8233, 1.8065])
new_var: tensor([0.9007, 1.1187])
'''
# Compute the output. During training, normalization uses the current batch mean and variance; during evaluation it uses running_mean and running_var
output2 = (input - cur_mean.view(1, -1, 1, 1)) / torch.sqrt(cur_var.view(1, -1, 1, 1) + bn.eps)
print('output2:', output2)
# Printed output. The computed output2 matches the output from step (2)
'''
output2: tensor([[[[ 1.4150, -1.4102],
[ 0.0257, -0.0305]],
[[-0.9276, -0.1964],
[ 1.6731, -0.5491]]]])
'''

2 SyncBatchNorm
BN's effectiveness is strongly affected by the batch size. For tasks such as object detection and semantic segmentation, GPU memory usage is high, so the number of images per GPU is small. In DP (DataParallel) mode each GPU can only compute statistics from its own share of the batch, and since the same running_mean and running_var must be used at test time, DP updates them only with the mean and variance computed on the main GPU, so BN's effectiveness naturally degrades. One remedy is to replace BN with SyncBN, which normalizes the input using BN statistics aggregated across all GPUs; compared with single-GPU statistics, the global statistics are more accurate.
How SyncBatchNorm works
The figures referenced in this subsection come from //cloud.tencent.com/developer/article/2126838 (not reproduced here). (1) Each GPU computes the mean and variance of its own input.
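Since the original figures are missing, the following sketch shows how per-GPU statistics can be combined into global ones. It is a plain single-process simulation of the idea, not the actual distributed code; the two "GPU" chunks are simply slices of one tensor:

import torch

torch.manual_seed(0)
x = torch.randn(8, 4, 16, 16)           # full batch: N=8, C=4
chunks = x.chunk(2, dim=0)               # pretend each chunk lives on one GPU

counts, means, vars_ = [], [], []
for xc in chunks:
    n = xc.numel() // xc.size(1)         # elements per channel on this "GPU"
    counts.append(n)
    means.append(xc.mean(dim=[0, 2, 3]))
    vars_.append(xc.var(dim=[0, 2, 3], unbiased=False))

counts = torch.tensor(counts, dtype=torch.float32).view(-1, 1)
means = torch.stack(means)               # [world_size, C]
vars_ = torch.stack(vars_)               # [world_size, C]

# Global mean is the count-weighted mean; global variance uses E[x^2] - (E[x])^2
total = counts.sum()
global_mean = (counts * means).sum(dim=0) / total
global_var = (counts * (vars_ + means ** 2)).sum(dim=0) / total - global_mean ** 2

# Matches the statistics computed on the whole batch at once
print(torch.allclose(global_mean, x.mean(dim=[0, 2, 3]), atol=1e-6))
print(torch.allclose(global_var, x.var(dim=[0, 2, 3], unbiased=False), atol=1e-5))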

SyncBN source code
import torch
from torch.autograd.function import Function

class SyncBatchNorm(Function):

    @staticmethod
    def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size):
        input = input.contiguous()

        size = input.numel() // input.size(1)
        if size == 1:
            raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
        count = torch.Tensor([size]).to(input.device)

        # calculate mean/invstd for input.
        mean, invstd = torch.batch_norm_stats(input, eps)

        count_all = torch.empty(world_size, 1, dtype=count.dtype, device=count.device)
        mean_all = torch.empty(world_size, mean.size(0), dtype=mean.dtype, device=mean.device)
        invstd_all = torch.empty(world_size, invstd.size(0), dtype=invstd.dtype, device=invstd.device)

        count_l = list(count_all.unbind(0))
        mean_l = list(mean_all.unbind(0))
        invstd_l = list(invstd_all.unbind(0))

        # using all_gather instead of all reduce so we can calculate count/mean/var in one go
        count_all_reduce = torch.distributed.all_gather(count_l, count, process_group, async_op=True)
        mean_all_reduce = torch.distributed.all_gather(mean_l, mean, process_group, async_op=True)
        invstd_all_reduce = torch.distributed.all_gather(invstd_l, invstd, process_group, async_op=True)

        # wait on the async communication to finish
        count_all_reduce.wait()
        mean_all_reduce.wait()
        invstd_all_reduce.wait()

        # calculate global mean & invstd
        mean, invstd = torch.batch_norm_gather_stats_with_counts(
            input,
            mean_all,
            invstd_all,
            running_mean,
            running_var,
            momentum,
            eps,
            count_all.view(-1).long().tolist()
        )

        self.save_for_backward(input, weight, mean, invstd, count_all)
        self.process_group = process_group

        # apply element-wise normalization
        out = torch.batch_norm_elemt(input, weight, bias, mean, invstd, eps)
        return out

    @staticmethod
    def backward(self, grad_output):
        grad_output = grad_output.contiguous()
        saved_input, weight, mean, invstd, count_tensor = self.saved_tensors
        grad_input = grad_weight = grad_bias = None
        process_group = self.process_group

        # calculate local stats as well as grad_weight / grad_bias
        sum_dy, sum_dy_xmu, grad_weight, grad_bias = torch.batch_norm_backward_reduce(
            grad_output,
            saved_input,
            mean,
            invstd,
            weight,
            self.needs_input_grad[0],
            self.needs_input_grad[1],
            self.needs_input_grad[2]
        )

        if self.needs_input_grad[0]:
            # synchronizing stats used to calculate input gradient.
            # TODO: move div_ into batch_norm_backward_elemt kernel
            sum_dy_all_reduce = torch.distributed.all_reduce(
                sum_dy, torch.distributed.ReduceOp.SUM, process_group, async_op=True)
            sum_dy_xmu_all_reduce = torch.distributed.all_reduce(
                sum_dy_xmu, torch.distributed.ReduceOp.SUM, process_group, async_op=True)

            # wait on the async communication to finish
            sum_dy_all_reduce.wait()
            sum_dy_xmu_all_reduce.wait()

            divisor = count_tensor.sum()
            mean_dy = sum_dy / divisor
            mean_dy_xmu = sum_dy_xmu / divisor
            # backward pass for gradient calculation
            grad_input = torch.batch_norm_backward_elemt(
                grad_output,
                saved_input,
                mean,
                invstd,
                weight,
                mean_dy,
                mean_dy_xmu
            )

        # synchronizing of grad_weight / grad_bias is not needed as distributed
        # training would handle all reduce.
        if weight is None or not self.needs_input_grad[1]:
            grad_weight = None

        if weight is None or not self.needs_input_grad[2]:
            grad_bias = None

        return grad_input, grad_weight, grad_bias, None, None, None, None, None, None
Using SyncBN
Note that SyncBN must be created after the DDP process group has been initialized, and the conversion must be done before the model is wrapped with DDP.
import torch
from torch import distributed
distributed.init_process_group(backend='nccl')
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model)
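These lines only work inside a properly launched distributed job; a typical launch with the NCCL backend (the script name and GPU count below are placeholders) would be:

torchrun --nproc_per_node=4 train.py

For reference, the convert_sync_batchnorm classmethod, which recursively replaces every _BatchNorm submodule in the model with SyncBatchNorm, is implemented as follows.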
@classmethod
def convert_sync_batchnorm(cls, module, process_group=None):
    module_output = module
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        module_output = torch.nn.SyncBatchNorm(
            module.num_features,
            module.eps,
            module.momentum,
            module.affine,
            module.track_running_stats,
            process_group,
        )
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
        if hasattr(module, "qconfig"):
            module_output.qconfig = module.qconfig
    for name, child in module.named_children():
        module_output.add_module(
            name, cls.convert_sync_batchnorm(child, process_group)
        )
    del module
    return module_output
3 InstanceNorm

How IN works
BN normalizes over the batch, but in image style-transfer tasks the generated style depends mainly on a single image instance, so normalizing over the whole batch is not appropriate. IN was therefore proposed: it normalizes only over the H and W dimensions, preserving the N and C dimensions. The computation proceeds as follows: 1) compute the mean and variance of the input tensor along the H and W dimensions; 2) normalize the input tensor with the computed mean and variance; 3) add the learnable parameters γ and β and apply an affine transformation to the normalized data.
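A minimal sketch of this process, checking a hand-written computation against nn.InstanceNorm2d (shapes and values are arbitrary; affine is left at its default of False so only the normalization itself is compared):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 4, 4)                    # N=2, C=3, H=W=4

inst_norm = nn.InstanceNorm2d(num_features=3)  # affine=False, track_running_stats=False by default
out = inst_norm(x)

# Manual IN: statistics per (sample, channel), i.e. over H and W only
mean = x.mean(dim=[2, 3], keepdim=True)
var = x.var(dim=[2, 3], unbiased=False, keepdim=True)
out2 = (x - mean) / torch.sqrt(var + inst_norm.eps)

print(torch.allclose(out, out2, atol=1e-6))    # True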
Using IN
torch.nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
class InstanceNorm2d(_InstanceNorm):
    def _get_no_batch_dim(self):
        return 3

    def _check_input_dim(self, input):
        if input.dim() not in (3, 4):
            raise ValueError('expected 3D or 4D input (got {}D input)'
                             .format(input.dim()))

class _InstanceNorm(_NormBase):
    def __init__(
        self,
        num_features: int,
        eps: float = 1e-5,
        momentum: float = 0.1,
        affine: bool = False,
        track_running_stats: bool = False,
        device=None,
        dtype=None
    ) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(_InstanceNorm, self).__init__(
            num_features, eps, momentum, affine, track_running_stats, **factory_kwargs)

    def _check_input_dim(self, input):
        raise NotImplementedError

    def _get_no_batch_dim(self):
        raise NotImplementedError

    def _handle_no_batch_input(self, input):
        return self._apply_instance_norm(input.unsqueeze(0)).squeeze(0)

    def _apply_instance_norm(self, input):
        return F.instance_norm(
            input, self.running_mean, self.running_var, self.weight, self.bias,
            self.training or not self.track_running_stats, self.momentum, self.eps)

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        version = local_metadata.get('version', None)
        # at version 1: removed running_mean and running_var when
        # track_running_stats=False (default)
        if version is None and not self.track_running_stats:
            running_stats_keys = []
            for name in ('running_mean', 'running_var'):
                key = prefix + name
                if key in state_dict:
                    running_stats_keys.append(key)
            if len(running_stats_keys) > 0:
                error_msgs.append(
                    'Unexpected running stats buffer(s) {names} for {klass} '
                    'with track_running_stats=False. If state_dict is a '
                    'checkpoint saved before 0.4.0, this may be expected '
                    'because {klass} does not track running stats by default '
                    'since 0.4.0. Please remove these keys from state_dict. If '
                    'the running stats are actually needed, instead set '
                    'track_running_stats=True in {klass} to enable them. See '
                    'the documentation of {klass} for details.'
                    .format(names=" and ".join('"{}"'.format(k) for k in running_stats_keys),
                            klass=self.__class__.__name__))
                for key in running_stats_keys:
                    state_dict.pop(key)

        super(_InstanceNorm, self)._load_from_state_dict(
            state_dict, prefix, local_metadata, strict,
            missing_keys, unexpected_keys, error_msgs)

    def forward(self, input: Tensor) -> Tensor:
        self._check_input_dim(input)

        if input.dim() == self._get_no_batch_dim():
            return self._handle_no_batch_input(input)

        return self._apply_instance_norm(input)
Advantages of IN
IN is well suited to generative adversarial tasks such as style transfer. The generated image depends mainly on an individual image instance, so applying BN over the whole batch is not appropriate for style transfer; using IN in such tasks not only speeds up convergence but also preserves the independence of each image instance, unaffected by the channel and batch dimensions.
Disadvantages of IN
If the correlations between feature-map channels need to be exploited, using IN for normalization is not recommended.
4 LayerNorm

How LN works
In NLP tasks, such as text tasks, different samples usually have different lengths, so normalizing with BN is not very effective. LN was therefore proposed: it normalizes over the C, H, W dimensions. The computation proceeds as follows: 1) compute the mean and variance of the input tensor along the C, H, W dimensions; 2) normalize the input with the computed mean and variance; 3) add the learnable parameters γ and β and apply an affine transformation to the normalized data.
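A minimal sketch checking a hand-written computation against nn.LayerNorm on a 4-D tensor (arbitrary shapes; normalized_shape here covers the C, H, W dimensions as described above):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 4, 4)                            # N=2, C=3, H=W=4

layer_norm = nn.LayerNorm(normalized_shape=[3, 4, 4])  # normalize over C, H, W
out = layer_norm(x)

# Manual LN: statistics per sample, over C, H and W
mean = x.mean(dim=[1, 2, 3], keepdim=True)
var = x.var(dim=[1, 2, 3], unbiased=False, keepdim=True)
out2 = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(out, out2, atol=1e-6))            # True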
Using LN
torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)
class LayerNorm(Module):
    __constants__ = ['normalized_shape', 'eps', 'elementwise_affine']
    normalized_shape: Tuple[int, ...]
    eps: float
    elementwise_affine: bool

    def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elementwise_affine: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            # mypy error: incompatible types in assignment
            normalized_shape = (normalized_shape,)  # type: ignore[assignment]
        self.normalized_shape = tuple(normalized_shape)  # type: ignore[arg-type]
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
            self.bias = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.elementwise_affine:
            init.ones_(self.weight)
            init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.layer_norm(
            input, self.normalized_shape, self.weight, self.bias, self.eps)

    def extra_repr(self) -> str:
        return '{normalized_shape}, eps={eps}, ' \
            'elementwise_affine={elementwise_affine}'.format(**self.__dict__)
Advantages of LN
LN does not depend on batch statistics: normalization is done within a single sample, so it can be used with batchsize = 1 and in RNN training, where it often works better than BN. Different input samples have their own mean and variance, which can lead to faster and better convergence. LN also does not need to maintain a running mean and variance, saving extra storage.
Disadvantages of LN
LN is independent of the batch size; with small batches it may work better than BN, but with large batches BN still works better.
5 GroupNorm

How GN works
GN was proposed to address BN's poor performance with small batch sizes. It splits the channels into num_groups groups, each containing channels/num_groups channels, so the feature map is reshaped to (N, G, C//G, H, W), and the mean and variance are then computed over each (C//G, H, W) group, which makes the statistics independent of the batch size. GN's extreme cases are LN and IN, corresponding to G = 1 and G = C respectively. The computation proceeds as follows: 1) compute the mean and variance of the input tensor along the C//G, H, W dimensions; 2) normalize the input with the computed mean and variance; 3) add the learnable parameters γ and β and apply an affine transformation to the normalized data.
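A minimal sketch checking a hand-written computation against nn.GroupNorm (arbitrary shapes; 6 channels split into 2 groups of 3):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 6, 4, 4)                    # N=2, C=6, H=W=4

group_norm = nn.GroupNorm(num_groups=2, num_channels=6)
out = group_norm(x)

# Manual GN: reshape to (N, G, C//G, H, W) and take statistics within each group
xg = x.view(2, 2, 3, 4, 4)
mean = xg.mean(dim=[2, 3, 4], keepdim=True)
var = xg.var(dim=[2, 3, 4], unbiased=False, keepdim=True)
out2 = ((xg - mean) / torch.sqrt(var + group_norm.eps)).view(2, 6, 4, 4)

print(torch.allclose(out, out2, atol=1e-6))    # True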
Using GN
torch.nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True, device=None, dtype=None)
class GroupNorm(Module):
    __constants__ = ['num_groups', 'num_channels', 'eps', 'affine']
    num_groups: int
    num_channels: int
    eps: float
    affine: bool

    def __init__(self, num_groups: int, num_channels: int, eps: float = 1e-5, affine: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(GroupNorm, self).__init__()
        if num_channels % num_groups != 0:
            raise ValueError('num_channels must be divisible by num_groups')
        self.num_groups = num_groups
        self.num_channels = num_channels
        self.eps = eps
        self.affine = affine
        if self.affine:
            self.weight = Parameter(torch.empty(num_channels, **factory_kwargs))
            self.bias = Parameter(torch.empty(num_channels, **factory_kwargs))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.affine:
            init.ones_(self.weight)
            init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.group_norm(
            input, self.num_groups, self.weight, self.bias, self.eps)

    def extra_repr(self) -> str:
        return '{num_groups}, {num_channels}, eps={eps}, ' \
            'affine={affine}'.format(**self.__dict__)
Advantages of GN
GN does not depend on the batch size and can also be applied well to RNNs; this is its major advantage. The paper reports that G = 32, or 16 channels per group, works best; when the batch size is smaller than 16, GN outperforms BN.

Disadvantages of GN
With large batch sizes, GN is not as effective as BN.
6 Summary
- BN performs poorly with small batch sizes;
- IN works on individual image instances and is well suited to style transfer;
- LN is mainly effective for RNNs;
- GN splits the channels into groups and normalizes within each group; when batchsize < 16 it outperforms BN (a consolidated sketch of the four reduction patterns follows below).
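To make the comparison concrete, the following sketch (arbitrary shapes: 8 channels, 4 groups) shows which dimensions each method reduces over when computing its statistics:

import torch

x = torch.randn(2, 8, 4, 4)      # N=2, C=8, H=W=4

bn_mean = x.mean(dim=[0, 2, 3])                       # BN: per channel, over N, H, W        -> shape [C]
in_mean = x.mean(dim=[2, 3])                          # IN: per sample & channel, over H, W  -> [N, C]
ln_mean = x.mean(dim=[1, 2, 3])                       # LN: per sample, over C, H, W         -> [N]
gn_mean = x.view(2, 4, 2, 4, 4).mean(dim=[2, 3, 4])   # GN: per sample & group, over C//G, H, W -> [N, G]

print(bn_mean.shape, in_mean.shape, ln_mean.shape, gn_mean.shape)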
References
[cnblogs] //www.cnblogs.com/lxp-never/p/11566064.html
[Zhihu] //zhuanlan.zhihu.com/p/395855181
[Tencent Cloud] //cloud.tencent.com/developer/article/2126838