BN, SyncBN, IN, LN, GN Learning Notes (Cassiel_cx)

Published 2023-08-25 16:11:12 · Tags: computer vision, pytorch
Notes on BatchNorm, SyncBatchNorm, InstanceNorm, LayerNorm, and GroupNorm.

1 BatchNorm

How BN works

BN is the most widely used normalization method in computer vision. It computes the mean and variance of a feature map along the N, H, and W dimensions and then uses them to normalize the feature map. The computation proceeds as follows (the figure from the original post is not reproduced here; the formulas are restated after the argument list below): 1) compute the per-channel mean along the N, H, W dimensions; 2) compute the per-channel variance along the same dimensions; 3) normalize the feature map; 4) apply an affine transform (scale and shift) to the normalized feature map with the learnable parameters γ and β, which are updated after each backward pass. PyTorch provides torch.nn.BatchNorm1d, torch.nn.BatchNorm2d and torch.nn.BatchNorm3d; torch.nn.BatchNorm2d is used as the example here, and its arguments are listed below:
Args:
    num_features: number of input feature channels
    eps: value added to the denominator for numerical stability (so the denominator never approaches or equals 0), default 1e-5
    momentum: momentum (exponential-averaging factor) used to update running_mean and running_var, default 0.1
    affine: boolean; whether the BN layer has the learnable affine parameters γ and β, default True
    track_running_stats: boolean; whether to keep a running mean and variance during training. If False, the BN layer computes the mean and variance only from the current input in both training and evaluation; if the batch size is small, these statistics can deviate noticeably from the global statistics and hurt performance. Default True
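
For reference, the per-channel computation that BN performs (shown as a figure in the original post) can be written as follows, with ε the eps argument above:

μ_c  = mean of x over the N, H, W dimensions of channel c
σ²_c = variance of x over the N, H, W dimensions of channel c
x̂    = (x − μ_c) / sqrt(σ²_c + ε)
y    = γ_c · x̂ + β_c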

Update rule for running_mean and running_var

running_stat_new = (1 − momentum) × running_stat_old + momentum × batch_stat

where running_stat_old is the running_mean or running_var before the update, and batch_stat is the mean or variance of the current input (for running_var, PyTorch uses the unbiased batch variance). During evaluation (model.eval()), running_mean and running_var are used as the mean and variance to normalize the input tensor.

Advantages of BN

  1. BN keeps the distribution of each layer's inputs relatively stable (so larger learning rates can be used), which greatly speeds up training and convergence;
  2. BN makes the model less sensitive to the network's parameters and weakens the dependence on initialization, which simplifies hyper-parameter tuning and makes training more stable;
  3. BN allows the network to use saturating activation functions (such as sigmoid): the normalized values fall into regions with larger gradients, alleviating vanishing or exploding gradients;
  4. BN has a slight regularization effect (roughly equivalent to injecting noise into the hidden layers, similar to Dropout) and can mitigate overfitting.

Disadvantages of BN

  1. BN is sensitive to the batch size. If the batch size is too small, the estimated mean and variance do not represent the whole data distribution; statistics computed from a small batch are highly random and the model converges poorly;
  2. BN is not suitable for tasks such as RNNs and style transfer. Taking style transfer as an example, a mini-batch may contain several unrelated images, and computing the mean and variance over all of them weakens the information that is specific to each individual image.

Code example

(1) Randomly initialize an input tensor and instantiate BN
import torch
import torch.nn as nn


# Fix the random seed so the randomly generated input is identical on every run
torch.manual_seed(42)
# Randomly generate an input of shape [1, 2, 2, 2]
input = torch.randn((1,2,2,2)).cuda()
print('input:', input)

# Instantiate BN
bn = nn.BatchNorm2d(num_features=2, eps=0.00001, momentum=0.1, affine=True, track_running_stats=True).cuda()
bn.running_mean = (torch.ones([2])*2).cuda()
bn.running_var = (torch.ones([2])*1).cuda()
bn.train()
# Inspect the parameters before the update
print('trainning:', bn.training)
print('running_mean:', bn.running_mean)
print('running_var:', bn.running_var)
print('weight:', bn.weight)  # γ, initialized to 1
print('bias:', bn.bias)  # β, initialized to 0

# Printed results
'''
input: tensor([[[[ 0.3367,  0.1288],
          [ 0.2345,  0.2303]],

         [[-1.1229, -0.1863],
          [ 2.2082, -0.6380]]]], device='cuda:0')
trainning: True
running_mean: tensor([2., 2.], device='cuda:0')
running_var: tensor([1., 1.], device='cuda:0')
weight: Parameter containing:
tensor([1., 1.], device='cuda:0', requires_grad=True)
bias: Parameter containing:
tensor([0., 0.], device='cuda:0', requires_grad=True)
'''
(2) Pass the input through the BN layer and inspect the updated results
# Output
output = bn(input)
print('output:', output)

# Inspect the parameters after the update
print('trainning:', bn.training)
print('running_mean:', bn.running_mean)
print('running_var:', bn.running_var)
print('weight:', bn.weight)
print('bias:', bn.bias)

# Printed results; since no backward pass was performed, γ and β are unchanged
'''
output: tensor([[[[ 1.4150, -1.4102],
          [ 0.0257, -0.0305]],

         [[-0.9276, -0.1964],
          [ 1.6731, -0.5491]]]], device='cuda:0',
       grad_fn=<CudnnBatchNormBackward0>)
trainning: True
running_mean: tensor([1.8233, 1.8065], device='cuda:0')
running_var: tensor([0.9007, 1.1187], device='cuda:0')
weight: Parameter containing:
tensor([1., 1.], device='cuda:0', requires_grad=True)
bias: Parameter containing:
tensor([0., 0.], device='cuda:0', requires_grad=True)
'''
(3) Reimplement the normalization by hand, following BN's computation steps
# Compute the mean and variance of the current input.
# Note: BN normalizes with the biased variance (torch.var with unbiased=False), but it
# updates running_var with the unbiased variance (unbiased=True, the torch.var default).
cur_mean = torch.mean(input, dim=[0,2,3])
cur_var = torch.var(input, dim=[0,2,3], unbiased=True)          # for the running_var update
cur_var_biased = torch.var(input, dim=[0,2,3], unbiased=False)  # for normalization
print('cur_mean:', cur_mean)
print('cur_var:', cur_var)

# Compute running_mean and running_var (their initial values were set to 2 and 1 in step 1)
new_mean = (torch.ones([2])*2).cuda() * (1-bn.momentum) + cur_mean * bn.momentum
new_var = (torch.ones([2])*1).cuda() * (1-bn.momentum) + cur_var * bn.momentum
print('new_mean:', new_mean)
print('new_var:', new_var)

# Printed results: the computed new_mean and new_var match the running_mean and running_var from step (2)
'''
cur_mean: tensor([0.2326, 0.0653])
cur_var: tensor([0.0072, 2.1872])
new_mean: tensor([1.8233, 1.8065])
new_var: tensor([0.9007, 1.1187])
'''

# Compute the output. During training, BN normalizes with the (biased) mean and variance of the
# current batch; during evaluation it normalizes with running_mean and running_var instead
# (a short eval-mode check follows this example).
output2 = (input - cur_mean[None,:,None,None]) / torch.sqrt(cur_var_biased[None,:,None,None] + bn.eps)
print('output2:', output2)

# Printed results: the computed output2 matches the output from step (2)
'''
output2: tensor([[[[ 1.4150, -1.4102],
          [ 0.0257, -0.0305]],

         [[-0.9276, -0.1964],
          [ 1.6731, -0.5491]]]])
'''
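
As a small follow-up sketch (not part of the original post), the evaluation-mode behaviour can be checked in the same way; it reuses input, bn, and the updated running statistics from the steps above.

# Switch to eval mode: BN now normalizes with running_mean and running_var
bn.eval()
output_eval = bn(input)

# Manual check: broadcast the running statistics over the channel dimension
mean = bn.running_mean[None, :, None, None]
var = bn.running_var[None, :, None, None]
output_eval2 = (input - mean) / torch.sqrt(var + bn.eps) * bn.weight[None, :, None, None] + bn.bias[None, :, None, None]
print(torch.allclose(output_eval, output_eval2, atol=1e-5))  # expected: True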

2 SyncBatchNorm

The effectiveness of BN is strongly affected by the batch size. Tasks such as object detection and semantic segmentation consume a lot of GPU memory, so the number of images assigned to each GPU is small, and in DP (DataParallel) mode each GPU only sees the statistics computed on its own share of the data. Since validation/testing must use a single set of running_mean and running_var, DP mode updates them using only the mean and variance computed on the main GPU, and BN's effectiveness naturally degrades. One remedy is to replace BN with SyncBN, which normalizes the input with global BN statistics aggregated across all GPUs; compared with the statistics of a single GPU, the global statistics are more accurate.

How SyncBatchNorm works

The figures in this subsection are taken from //cloud.tencent.com/developer/article/2126838. 1) Compute the mean and variance on each GPU. 2) Synchronize the mean and variance across GPUs: torch.distributed.all_gather collects the per-GPU means and variances, from which the global mean and variance are obtained and running_mean and running_var are updated. 3) Normalize the input, in the same way as BN.

SyncBN source code

import torch
from torch.autograd.function import Function


class SyncBatchNorm(Function):

    @staticmethod
    def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size):
        input = input.contiguous()

        size = input.numel() // input.size(1)
        if size == 1:
            raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
        count = torch.Tensor([size]).to(input.device)

        # calculate mean/invstd for input.
        mean, invstd = torch.batch_norm_stats(input, eps)

        count_all = torch.empty(world_size, 1, dtype=count.dtype, device=count.device)
        mean_all = torch.empty(world_size, mean.size(0), dtype=mean.dtype, device=mean.device)
        invstd_all = torch.empty(world_size, invstd.size(0), dtype=invstd.dtype, device=invstd.device)

        count_l = list(count_all.unbind(0))
        mean_l = list(mean_all.unbind(0))
        invstd_l = list(invstd_all.unbind(0))

        # using all_gather instead of all reduce so we can calculate count/mean/var in one go
        count_all_reduce = torch.distributed.all_gather(count_l, count, process_group, async_op=True)
        mean_all_reduce = torch.distributed.all_gather(mean_l, mean, process_group, async_op=True)
        invstd_all_reduce = torch.distributed.all_gather(invstd_l, invstd, process_group, async_op=True)

        # wait on the async communication to finish
        count_all_reduce.wait()
        mean_all_reduce.wait()
        invstd_all_reduce.wait()

        # calculate global mean & invstd
        mean, invstd = torch.batch_norm_gather_stats_with_counts(
            input,
            mean_all,
            invstd_all,
            running_mean,
            running_var,
            momentum,
            eps,
            count_all.view(-1).long().tolist()
        )

        self.save_for_backward(input, weight, mean, invstd, count_all)
        self.process_group = process_group

        # apply element-wise normalization
        out = torch.batch_norm_elemt(input, weight, bias, mean, invstd, eps)
        return out

    @staticmethod
    def backward(self, grad_output):
        grad_output = grad_output.contiguous()
        saved_input, weight, mean, invstd, count_tensor = self.saved_tensors
        grad_input = grad_weight = grad_bias = None
        process_group = self.process_group

        # calculate local stats as well as grad_weight / grad_bias
        sum_dy, sum_dy_xmu, grad_weight, grad_bias = torch.batch_norm_backward_reduce(
            grad_output,
            saved_input,
            mean,
            invstd,
            weight,
            self.needs_input_grad[0],
            self.needs_input_grad[1],
            self.needs_input_grad[2]
        )

        if self.needs_input_grad[0]:
            # synchronizing stats used to calculate input gradient.
            # TODO: move div_ into batch_norm_backward_elemt kernel
            sum_dy_all_reduce = torch.distributed.all_reduce(
                sum_dy, torch.distributed.ReduceOp.SUM, process_group, async_op=True)
            sum_dy_xmu_all_reduce = torch.distributed.all_reduce(
                sum_dy_xmu, torch.distributed.ReduceOp.SUM, process_group, async_op=True)

            # wait on the async communication to finish
            sum_dy_all_reduce.wait()
            sum_dy_xmu_all_reduce.wait()

            divisor = count_tensor.sum()
            mean_dy = sum_dy / divisor
            mean_dy_xmu = sum_dy_xmu / divisor
            # backward pass for gradient calculation
            grad_input = torch.batch_norm_backward_elemt(
                grad_output,
                saved_input,
                mean,
                invstd,
                weight,
                mean_dy,
                mean_dy_xmu
            )

        # synchronizing of grad_weight / grad_bias is not needed as distributed
        # training would handle all reduce.
        if weight is None or not self.needs_input_grad[1]:
            grad_weight = None

        if weight is None or not self.needs_input_grad[2]:
            grad_bias = None

        return grad_input, grad_weight, grad_bias, None, None, None, None, None, None

Using SyncBN

Note that SyncBatchNorm must be created after the distributed process group has been initialized, and the conversion must be done before the model is wrapped with DDP. A fuller launch sketch is given after the convert_sync_batchnorm source below.
import torch
from torch import distributed

distributed.init_process_group(backend='nccl')
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model)
# For reference, the source of torch.nn.SyncBatchNorm.convert_sync_batchnorm:
@classmethod
def convert_sync_batchnorm(cls, module, process_group=None):
    module_output = module
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        module_output = torch.nn.SyncBatchNorm(
            module.num_features,
            module.eps,
            module.momentum,
            module.affine,
            module.track_running_stats,
            process_group,
        )
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
        if hasattr(module, "qconfig"):
            module_output.qconfig = module.qconfig
    for name, child in module.named_children():
        module_output.add_module(
            name, cls.convert_sync_batchnorm(child, process_group)
        )
    del module
    return module_output
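
A slightly fuller usage sketch, assuming a script launched with torchrun (which sets the LOCAL_RANK environment variable); MyModel is only a placeholder for your own network:

import os
import torch
from torch import distributed
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical launch: torchrun --nproc_per_node=<num_gpus> train.py
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
distributed.init_process_group(backend='nccl')

model = MyModel().cuda()  # MyModel: placeholder for your own network
# Convert every BatchNorm*d to SyncBatchNorm before wrapping the model with DDP
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)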

3 InstanceNorm

How IN works

BN normalizes over the batch, but in image stylization tasks the generated style depends mainly on an individual image instance, so normalizing over the whole batch is not appropriate. IN was therefore proposed: it normalizes only over the H and W dimensions and keeps the N and C dimensions separate. The computation proceeds as follows (the figure from the original post is not reproduced here): 1) compute the mean and variance of the input tensor along the H and W dimensions; 2) normalize the input tensor with this mean and variance; 3) apply an affine transform to the normalized data with the learnable parameters γ and β. A hand-rolled check is sketched after the source code below.

Using IN

torch.nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
class InstanceNorm2d(_InstanceNorm):
    def _get_no_batch_dim(self):
        return 3

    def _check_input_dim(self, input):
        if input.dim() not in (3, 4):
            raise ValueError('expected 3D or 4D input (got {}D input)'
                             .format(input.dim()))


class _InstanceNorm(_NormBase):
    def __init__(
        self,
        num_features: int,
        eps: float = 1e-5,
        momentum: float = 0.1,
        affine: bool = False,
        track_running_stats: bool = False,
        device=None,
        dtype=None
    ) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(_InstanceNorm, self).__init__(
            num_features, eps, momentum, affine, track_running_stats, **factory_kwargs)

    def _check_input_dim(self, input):
        raise NotImplementedError

    def _get_no_batch_dim(self):
        raise NotImplementedError

    def _handle_no_batch_input(self, input):
        return self._apply_instance_norm(input.unsqueeze(0)).squeeze(0)

    def _apply_instance_norm(self, input):
        return F.instance_norm(
            input, self.running_mean, self.running_var, self.weight, self.bias,
            self.training or not self.track_running_stats, self.momentum, self.eps)

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        version = local_metadata.get('version', None)
        # at version 1: removed running_mean and running_var when
        # track_running_stats=False (default)
        if version is None and not self.track_running_stats:
            running_stats_keys = []
            for name in ('running_mean', 'running_var'):
                key = prefix + name
                if key in state_dict:
                    running_stats_keys.append(key)
            if len(running_stats_keys) > 0:
                error_msgs.append(
                    'Unexpected running stats buffer(s) {names} for {klass} '
                    'with track_running_stats=False. If state_dict is a '
                    'checkpoint saved before 0.4.0, this may be expected '
                    'because {klass} does not track running stats by default '
                    'since 0.4.0. Please remove these keys from state_dict. If '
                    'the running stats are actually needed, instead set '
                    'track_running_stats=True in {klass} to enable them. See '
                    'the documentation of {klass} for details.'
                    .format(names=" and ".join('"{}"'.format(k) for k in running_stats_keys),
                            klass=self.__class__.__name__))
                for key in running_stats_keys:
                    state_dict.pop(key)

        super(_InstanceNorm, self)._load_from_state_dict(
            state_dict, prefix, local_metadata, strict,
            missing_keys, unexpected_keys, error_msgs)

    def forward(self, input: Tensor) -> Tensor:
        self._check_input_dim(input)

        if input.dim() == self._get_no_batch_dim():
            return self._handle_no_batch_input(input)

        return self._apply_instance_norm(input)
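
A minimal sketch (not from the original post) that checks IN against a hand-rolled per-sample, per-channel normalization; it relies only on the default InstanceNorm2d settings (affine=False, track_running_stats=False):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 4, 4)  # (N, C, H, W)

inorm = nn.InstanceNorm2d(num_features=3)
out = inorm(x)

# Manual check: normalize each (sample, channel) plane over H and W only
mean = x.mean(dim=[2, 3], keepdim=True)
var = x.var(dim=[2, 3], unbiased=False, keepdim=True)
out2 = (x - mean) / torch.sqrt(var + inorm.eps)
print(torch.allclose(out, out2, atol=1e-5))  # expected: True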

Advantages of IN

IN is well suited to generative adversarial tasks such as style transfer. The generated image usually depends on an individual image instance, so applying BN over the whole batch is not appropriate for style transfer; using IN in such tasks not only speeds up convergence but also keeps each image instance independent, unaffected by the channel and batch dimensions.

Disadvantages of IN

If the correlations between feature-map channels need to be exploited, IN is not recommended for normalization.

4 LayerNorm

How LN works

In NLP tasks such as text processing, different samples usually have different lengths, so BN-style normalization does not work well. LN was therefore proposed: it normalizes over the C, H, and W dimensions. The computation proceeds as follows (the figure from the original post is not reproduced here): 1) compute the mean and variance of the input tensor along the C, H, and W dimensions; 2) normalize the input with this mean and variance; 3) apply an affine transform to the normalized data with the learnable parameters γ and β. A short usage check is sketched after the source code below.

Using LN

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)
class LayerNorm(Module):
    __constants__ = ['normalized_shape', 'eps', 'elementwise_affine']
    normalized_shape: Tuple[int, ...]
    eps: float
    elementwise_affine: bool

    def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elementwise_affine: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            # mypy error: incompatible types in assignment
            normalized_shape = (normalized_shape,)  # type: ignore[assignment]
        self.normalized_shape = tuple(normalized_shape)  # type: ignore[arg-type]
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
            self.bias = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.elementwise_affine:
            init.ones_(self.weight)
            init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.layer_norm(
            input, self.normalized_shape, self.weight, self.bias, self.eps)

    def extra_repr(self) -> str:
        return '{normalized_shape}, eps={eps}, ' \
            'elementwise_affine={elementwise_affine}'.format(**self.__dict__)
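
A minimal sketch (not from the original post) of the typical NLP-style use of LayerNorm, together with a hand-rolled check over the feature dimension:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 8)  # (batch, sequence length, embedding dim)

ln = nn.LayerNorm(normalized_shape=8)
out = ln(x)

# Manual check: each token is normalized over its own feature dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
out2 = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
print(torch.allclose(out, out2, atol=1e-5))  # expected: True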

Advantages of LN

LN does not depend on the batch: normalization is performed entirely within a single sample, so it can be used with batchsize = 1 and in RNN training, where it works better than BN. Each sample gets its own mean and variance, which helps the model converge faster and to better results. LN also does not need to maintain a running mean and variance across batches, which saves extra storage.

Disadvantages of LN

LN is independent of the batch size; with small batch sizes it may outperform BN, but with large batch sizes BN still works better.

5 GroupNorm

How GN works

GN was proposed to address BN's poor performance with small batch sizes. It splits the channels into num_groups groups, each containing channel/num_groups channels, so the feature map is viewed as (N, G, C//G, H, W), and the mean and variance are computed over the (C//G, H, W) dimensions of each group, independently of the batch size. GN's extreme cases are LN and IN, corresponding to G = 1 and G = C respectively. The computation proceeds as follows (the figure from the original post is not reproduced here): 1) compute the mean and variance of the input tensor along the C//G, H, and W dimensions; 2) normalize the input with this mean and variance; 3) apply an affine transform with the learnable parameters γ and β to the normalized data. A hand-rolled check is sketched after the source code below.

Using GN

torch.nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True, device=None, dtype=None)
class GroupNorm(Module):
    __constants__ = ['num_groups', 'num_channels', 'eps', 'affine']
    num_groups: int
    num_channels: int
    eps: float
    affine: bool

    def __init__(self, num_groups: int, num_channels: int, eps: float = 1e-5, affine: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(GroupNorm, self).__init__()
        if num_channels % num_groups != 0:
            raise ValueError('num_channels must be divisible by num_groups')

        self.num_groups = num_groups
        self.num_channels = num_channels
        self.eps = eps
        self.affine = affine
        if self.affine:
            self.weight = Parameter(torch.empty(num_channels, **factory_kwargs))
            self.bias = Parameter(torch.empty(num_channels, **factory_kwargs))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.affine:
            init.ones_(self.weight)
            init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.group_norm(
            input, self.num_groups, self.weight, self.bias, self.eps)

    def extra_repr(self) -> str:
        return '{num_groups}, {num_channels}, eps={eps}, ' \
            'affine={affine}'.format(**self.__dict__)
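
A minimal sketch (not from the original post) that checks GN against a hand-rolled grouped normalization for a (N, C, H, W) input with C = 6 split into 3 groups:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 6, 4, 4)  # (N, C, H, W)

gn = nn.GroupNorm(num_groups=3, num_channels=6)
out = gn(x)

# Manual check: view as (N, G, C//G, H, W) and normalize within each group
xg = x.view(2, 3, 2, 4, 4)
mean = xg.mean(dim=[2, 3, 4], keepdim=True)
var = xg.var(dim=[2, 3, 4], unbiased=False, keepdim=True)
out2 = ((xg - mean) / torch.sqrt(var + gn.eps)).view(2, 6, 4, 4)
out2 = out2 * gn.weight[None, :, None, None] + gn.bias[None, :, None, None]
print(torch.allclose(out, out2, atol=1e-5))  # expected: True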

Advantages of GN

GN does not depend on the batch size and can also be applied to RNNs, which is its biggest advantage. The paper reports that the best results are obtained with G = 32 or with 16 channels per group; when the batch size is smaller than 16, GN outperforms BN.

Disadvantages of GN

With large batch sizes, GN performs worse than BN.

6 Summary

  1. BN performs poorly with small batch sizes;
  2. IN works on individual image instances and is suited to style transfer;
  3. LN is mainly effective for RNNs;
  4. GN groups the channels and then normalizes; with batchsize < 16 it outperforms BN, as the short recap sketch below shows.
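
As a quick recap sketch (not from the original post), the following snippet shows which dimensions each layer reduces over for a (N, C, H, W) input; all four calls return a tensor of the same shape as the input:

import torch
import torch.nn as nn

x = torch.randn(4, 6, 8, 8)  # (N, C, H, W)

print(nn.BatchNorm2d(6)(x).shape)        # stats over (N, H, W): one mean/var per channel
print(nn.InstanceNorm2d(6)(x).shape)     # stats over (H, W): one mean/var per sample and channel
print(nn.LayerNorm([6, 8, 8])(x).shape)  # stats over (C, H, W): one mean/var per sample
print(nn.GroupNorm(3, 6)(x).shape)       # stats over (C//G, H, W): one mean/var per sample and group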

References

[cnblogs] //www.cnblogs.com/lxp-never/p/11566064.html
[Zhihu] //zhuanlan.zhihu.com/p/395855181
[Tencent Cloud] //cloud.tencent.com/developer/article/2126838