References:
- https://zhuanlan.zhihu.com/p/165152789
- https://zhuanlan.zhihu.com/p/176998729
- https://pytorch.org/docs/stable/amp.html
- https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples
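My PyTorch version was a bit old; upgrading is enough to fix that. I upgraded straight to 1.7. The following checks the installed version and that torch.cuda.amp is available: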
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.amp)
print(torch.cuda.amp.autocast)
AMP stands for automatic mixed precision.
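Two dtypes are involved: torch.float32 (float) and torch.float16 (half). Linear layers and convolutions run much faster in torch.float16, whereas reductions often need the dynamic range of torch.float32. Mixed precision automatically picks a suitable dtype for each operation. torch.cuda.amp.autocast and torch.cuda.amp.GradScaler are usually used together.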
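To make the claim about reductions concrete, here is a minimal sketch that needs no GPU. It uses the standard library's struct format "e" (IEEE half precision) to emulate a float16 accumulator. The helper names to_half and half_sum are made up for this illustration, and Python's float64 stands in for the higher-precision accumulator.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest float16 value (returned as a float)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def half_sum(values):
    """Sum values, rounding the running total to float16 after every addition."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total

values = [0.01] * 10_000            # the exact sum is 100.0

fp16_total = half_sum(values)       # accumulator kept in (emulated) float16
wide_total = sum(values)            # accumulator kept in float64

# The float16 accumulator stalls well below 100 once each addend becomes
# smaller than half the spacing between neighbouring float16 values.
print("float16 accumulator:", fp16_total)
print("wide accumulator:   ", wide_total)
```

This is why autocast keeps ops such as sums and losses in float32 while letting matrix multiplies run in float16.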
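torch.cuda.amp.autocast enables mixed precision for the region it wraps. Inside an autocast context, do not call .half() on the model(s) or inputs. Wrap only the forward pass and the loss computation; running backward() under autocast is not recommended, because backward ops run in the same dtype that autocast chose for the corresponding forward ops.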
# Creates model and optimizer in default precision
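model = Net().cuda()  # the model
optimizer = optim.SGD(model.parameters(), ...)  # the optimizer
for input, target in data:
    optimizer.zero_grad()  # reset gradients to zero
    # Enables autocasting for the forward pass (model + loss)
    with autocast():  # more complex cases: gradient penalty, multiple models/losses, custom autograd functions
        output = model(input)  # the model's forward pass runs in mixed precision
        loss = loss_fn(output, target)  # the loss computation runs in mixed precision
    # Exits the context manager before backward()
    loss.backward()  # running backward() under autocast is not recommended
    optimizer.step()
autocast can also be applied directly as a decorator on the forward method: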
class AutocastModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")
with autocast():  # tensors produced here may be float16; mixed float32/float16 inputs are cast automatically
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)
# After exiting autocast, calls f_float16.float() to use with d_float32 (casts back to float32)
g_float32 = torch.mm(d_float32, f_float16.float())
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")
with autocast():
    e_float16 = torch.mm(a_float32, b_float32)
    with autocast(enabled=False):  # locally disables autocast inside the autocast context; ops here run in float32
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = torch.mm(c_float32, e_float16.float())
    # No manual casts are required when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)
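torch.cuda.amp.GradScaler performs gradient scaling. If the forward pass of an op runs in float16, its backward pass also runs in float16. Gradient values with small magnitudes may not be representable in float16; they underflow to zero, and the corresponding parameters can no longer be updated. Gradient scaling multiplies the network's loss(es) by a scale factor and calls backward() on the scaled loss(es). The gradients flowing backward through the network are then scaled by the same factor, so they have larger magnitudes and no longer flush to zero.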
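Each parameter's gradient (its .grad attribute) should be unscaled before the optimizer updates the parameters, so that the scale factor does not interfere with the learning rate.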
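The underflow and its fix can be checked with plain arithmetic. The sketch below emulates float16 with the standard library's struct format "e". The gradient value 1e-8 and the helper to_half are assumptions chosen for the illustration, and the scale factor 2**16 is used because, as far as I know, it matches GradScaler's default init_scale.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest float16 value (returned as a float)."""
    return struct.unpack("e", struct.pack("e", x))[0]

true_grad = 1e-8        # hypothetical gradient, smaller than float16 can represent
scale = 2.0 ** 16       # illustrative scale factor

# Without scaling, the gradient is rounded to zero when stored as float16.
unscaled_fp16 = to_half(true_grad)

# With scaling, the stored value is comfortably inside float16's range.
scaled_fp16 = to_half(true_grad * scale)

# Unscaling happens in higher precision before the optimizer uses the value.
recovered = scaled_fp16 / scale

print("stored without scaling:", unscaled_fp16)
print("stored with scaling:   ", scaled_fp16)
print("recovered gradient:    ", recovered)
```

The recovered value differs from the true gradient only by float16 rounding error, whereas the unscaled path loses the gradient entirely.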
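This recipe measures the performance of a simple network in default precision, then adds autocast and GradScaler to run the same network in mixed precision with improved performance. Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). On those architectures this recipe should show a significant (2-3x) speedup.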
import torch, time, gc
# Timing utilities
start_time = None
def start_timer():
    global start_time
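    gc.collect()  # run a full garbage collection
    torch.cuda.empty_cache()  # release cached GPU memory
    torch.cuda.reset_max_memory_allocated()  # reset the starting point for tracking peak allocated memory
    torch.cuda.synchronize()  # wait for all kernels in all streams on the current device to finish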
    start_time = time.time()
def end_timer_and_print(local_msg):
    torch.cuda.synchronize()  # wait for all kernels in all streams on the current device to finish
    end_time = time.time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))
A simple network
def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers)).cuda()
batch_size, in_size, out_size, and num_layers are chosen to be large enough to saturate the GPU with work. Try varying the sizes and see how the mixed precision speedup changes.
batch_size = 512 # Try, for example, 128, 256, 513.
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 3
# Creates data in default precision.
# The same data is used for both default and mixed precision trials below.
# You don't need to manually change inputs' dtype when enabling mixed precision.
data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]
loss_fn = torch.nn.MSELoss().cuda()
Default Precision
Without autocast:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Default precision:")
Using autocast
for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under autocast.
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16
            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32
        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
Adding GradScaler
# Constructs scaler once, at the beginning of the convergence run, using default args.
# If your network fails to converge with default GradScaler args, please file an issue.
# The same GradScaler instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh GradScaler instance.  GradScaler instances are lightweight.
scaler = torch.cuda.amp.GradScaler()
for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()
        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
All together: “Automatic Mixed Precision”
use_amp = True
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")
Inspecting/modifying gradients (e.g., clipping)
for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(opt)
        # Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
        # You may use the same value for max_norm here as you would without gradient scaling.
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)
        scaler.step(opt)
        scaler.update()
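        opt.zero_grad() # set_to_none=True here can modestly improve performance
Saving/Resuming
To save, store the scaler's state dict alongside the usual model and optimizer state dicts. Do this either at the beginning of an iteration before any forward passes, or at the end of an iteration after scaler.update().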
checkpoint = {"model": net.state_dict(),
              "optimizer": opt.state_dict(),
              "scaler": scaler.state_dict()}
# Write checkpoint as desired, e.g.,
# torch.save(checkpoint, "filename")
To resume, load the scaler's state dict alongside the model and optimizer state dicts.
# Read checkpoint as desired, e.g.,
# dev = torch.cuda.current_device()
# checkpoint = torch.load("filename",
#                         map_location = lambda storage, loc: storage.cuda(dev))
net.load_state_dict(checkpoint["model"])
opt.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])
AUTOMATIC MIXED PRECISION EXAMPLES
# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)
# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()
        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)
        # Updates the scale for next iteration.
        scaler.update()
Working with Unscaled Gradients
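All gradients produced by scaler.scale(loss).backward() are scaled. If you wish to modify or inspect the parameters' .grad attributes between backward() and scaler.step(optimizer), you should unscale them first. For example, gradient clipping manipulates a set of gradients such that their global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) is <= some user-imposed threshold. If you attempted to clip without unscaling, the gradients' norm/maximum magnitude would also be scaled, so your requested threshold (which was meant to be the threshold for unscaled gradients) would be invalid.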
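A small numeric sketch shows why the threshold becomes invalid. It works on plain Python lists, not tensors. The function clip_by_global_norm is a hypothetical stand-in that follows the same idea as clip_grad_norm_ (rescale so the global norm does not exceed max_norm), and the gradient values and scale factor are invented for the example.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Return a copy of grads rescaled so that its global L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads]

def l2_norm(grads):
    return math.sqrt(sum(g * g for g in grads))

scale = 1024.0
max_norm = 1.0
true_grads = [0.3, -0.4]                       # norm 0.5: already within the threshold
scaled_grads = [g * scale for g in true_grads] # what backward() on a scaled loss yields

# Wrong order: clip the scaled gradients, then unscale.
wrong = [g / scale for g in clip_by_global_norm(scaled_grads, max_norm)]

# Right order: unscale first, then clip with the intended threshold.
right = clip_by_global_norm([g / scale for g in scaled_grads], max_norm)

print("norm after clip-then-unscale:", l2_norm(wrong))
print("norm after unscale-then-clip:", l2_norm(right))
```

Clipping the scaled gradients shrinks them by roughly the scale factor even though the true gradients never exceeded max_norm; unscaling first leaves them untouched.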
Gradient clipping
Calling scaler.unscale_(optimizer) before clipping lets you clip unscaled gradients as usual:
scaler = GradScaler()
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)
        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)
        # Updates the scale for next iteration.
        scaler.update()
scaler records that scaler.unscale_(optimizer) was already called for this optimizer in this iteration, so scaler.step(optimizer) knows not to redundantly unscale gradients before (internally) calling optimizer.step().
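The bookkeeping described above can be pictured with a toy model. ToyGradScaler below is a hypothetical, heavily simplified class written for this note; it is not PyTorch's implementation. It operates on plain lists of floats and only imitates three behaviours: unscaling at most once per iteration, skipping the step when a gradient is inf/NaN, and shrinking or growing the scale in update(). The constructor defaults mirror what I understand GradScaler's defaults to be.

```python
import math

class ToyGradScaler:
    """Toy model of dynamic loss scaling (illustration only)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0      # consecutive iterations without inf/NaN
        self._unscaled = False    # has unscale_ run in this iteration?
        self._found_inf = False

    def unscale_(self, grads):
        """Divide grads in place by the scale and record whether any is inf/NaN."""
        if self._unscaled:
            raise RuntimeError("unscale_() already called since the last update()")
        for i, g in enumerate(grads):
            grads[i] = g / self.scale
        self._found_inf = any(not math.isfinite(g) for g in grads)
        self._unscaled = True

    def step(self, params, grads, lr):
        """Unscale if not done yet, then apply plain SGD unless grads are bad."""
        if not self._unscaled:
            self.unscale_(grads)
        if self._found_inf:
            return False          # step skipped
        for i, g in enumerate(grads):
            params[i] -= lr * g
        return True

    def update(self):
        """Adjust the scale for the next iteration and reset per-iteration state."""
        if self._found_inf:
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
        self._unscaled = False
        self._found_inf = False

scaler = ToyGradScaler(init_scale=8.0, growth_interval=2)
params = [1.0]

# Iteration 1: finite gradient, explicit unscale_ before step (as when clipping).
grads = [0.5 * scaler.scale]
scaler.unscale_(grads)
stepped_1 = scaler.step(params, grads, lr=0.1)   # does not unscale a second time
scaler.update()

# Iteration 2: an overflowed gradient; the step is skipped and the scale is halved.
grads = [float("inf")]
stepped_2 = scaler.step(params, grads, lr=0.1)
scaler.update()

print(stepped_1, stepped_2, scaler.scale, params)
```

The flag reset in update() is the reason update() has to be called once per iteration: it is what allows unscale_ to run again in the next one.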
Working with Scaled Gradients
Gradient accumulation
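Gradient accumulation adds gradients over an effective batch of size batch_per_iter * iters_to_accumulate (* num_procs if distributed). The scale should be calibrated for the effective batch, which means inf/NaN checking, step skipping if inf/NaN grads are found, and scale updates should occur at effective-batch granularity. Also, grads should remain scaled, and the scale factor should remain constant, while grads for a given effective batch are accumulated. If grads are unscaled (or the scale factor changes) before accumulation is complete, the next backward pass will add scaled grads to unscaled grads (or grads scaled by a different factor), after which it's impossible to recover the accumulated unscaled grads that step must apply.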
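The failure mode is easy to reproduce with scalar arithmetic. In this sketch the four micro-batch gradients and the scale factors are invented values (powers of two, so the arithmetic is exact); the second loop imitates what would happen if the scale changed halfway through an effective batch.

```python
micro_grads = [0.25, 0.5, 0.125, 0.125]   # hypothetical true gradients of 4 micro-batches
expected = sum(micro_grads)               # what the optimizer step should see

# Correct: the scale stays constant during accumulation; unscale once at the end.
scale = 1024.0
accum = 0.0
for g in micro_grads:
    accum += g * scale
correct = accum / scale

# Broken: the scale changes after two micro-batches (as if update() had been
# called mid-accumulation). No single division recovers the true sum.
scales = [1024.0, 1024.0, 512.0, 512.0]
accum = 0.0
for g, s in zip(micro_grads, scales):
    accum += g * s
broken = accum / scales[-1]

print("expected:", expected)
print("constant scale:", correct)
print("scale changed mid-accumulation:", broken)
```

With a constant scale the accumulated value unscales to the true sum; with a changed scale the early micro-batches are over-weighted.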
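Therefore, if you want to unscale_ grads (e.g., to allow clipping unscaled grads), call unscale_ just before step, after all (scaled) grads for the upcoming step have been accumulated. Also, only call update at the end of iterations where you called step for a full effective batch: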
scaler = GradScaler()
for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate
        # Accumulates scaled gradients.
        scaler.scale(loss).backward()
        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
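            optimizer.zero_grad()
Gradient penalty
A gradient penalty implementation commonly creates gradients using torch.autograd.grad(), combines them to create the penalty value, and adds the penalty value to the loss.
Here is an ordinary example of an L2 penalty without gradient scaling or autocasting: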
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        # Creates gradients
        grad_params = torch.autograd.grad(outputs=loss,
                                          inputs=model.parameters(),
                                          create_graph=True)
        # Computes the penalty term and adds it to the loss
        grad_norm = 0
        for grad in grad_params:
            grad_norm += grad.pow(2).sum()
        grad_norm = grad_norm.sqrt()
        loss = loss + grad_norm
        loss.backward()
        # clip gradients here, if desired
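        optimizer.step()
To implement a gradient penalty with gradient scaling, the outputs Tensor(s) passed to torch.autograd.grad() should be scaled. The resulting gradients will therefore be scaled, and should be unscaled before being combined to create the penalty value. Also, the penalty term computation is part of the forward pass, and therefore should be inside an autocast context.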
scaler = GradScaler()
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        # Scales the loss for autograd.grad's backward pass, producing scaled_grad_params
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)
        # Creates unscaled grad_params before computing the penalty. scaled_grad_params are
        # not owned by any optimizer, so ordinary division is used instead of scaler.unscale_:
        inv_scale = 1./scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]
        # Computes the penalty term and adds it to the loss
        with autocast():
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm
        # Applies scaling to the backward call as usual.
        # Accumulates leaf gradients that are correctly scaled.
        scaler.scale(loss).backward()
        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        # step() and update() proceed as usual.
        scaler.step(optimizer)
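        scaler.update()
Working with Multiple Models, Losses, and Optimizers
If your network has multiple losses, you must call scaler.scale on each of them individually. If your network has multiple optimizers, you may call scaler.unscale_ on any of them individually, and you must call scaler.step on each of them individually.
However, scaler.update should only be called once, after all optimizers used this iteration have been stepped: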
scaler = torch.cuda.amp.GradScaler()
for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast():
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)
        # (retain_graph here is unrelated to amp, it's present because in this
        # example, both backward() calls share some sections of graph.)
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()
        # You can choose which optimizers receive explicit unscaling, if you
        # want to inspect or modify the gradients of the params they own.
        scaler.unscale_(optimizer0)
        scaler.step(optimizer0)
        scaler.step(optimizer1)
        scaler.update()
Working with Multiple GPUs
DataParallel in a single process
torch.nn.DataParallel spawns threads to run the forward pass on each device. The autocast state is thread local, so the following will not work:
model = MyModel()
dp_model = nn.DataParallel(model)
# Sets autocast in the main thread
with autocast():
    # dp_model's internal threads won't autocast.  The main thread's autocast state has no effect.
    output = dp_model(input)
    # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)
The fix is simple. Enable autocast as part of MyModel.forward:
class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...
# Alternatively
class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...
The following now autocasts in dp_model's threads (which execute forward) and the main thread (which executes loss_fn):
model = MyModel()
dp_model = nn.DataParallel(model)
with autocast():
    output = dp_model(input)
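    loss = loss_fn(output)
DistributedDataParallel, one GPU per process
The documentation for torch.nn.parallel.DistributedDataParallel recommends one GPU per process for best performance. In this case, DistributedDataParallel does not spawn threads internally, so usage of autocast and GradScaler is not affected.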
DistributedDataParallel, multiple GPUs per process
Here torch.nn.parallel.DistributedDataParallel may spawn a side thread to run the forward pass on each device, like torch.nn.DataParallel. The fix is the same: apply autocast as part of your model's forward method to ensure it's enabled in side threads.
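The thread-local behaviour behind both multi-GPU cases can be reproduced without any GPU. The sketch below uses the standard library's threading.local; toy_autocast and is_enabled are hypothetical stand-ins for autocast and its internal flag, and the two "forward" functions play the role of a model's forward method with and without its own autocast.

```python
import threading

_state = threading.local()   # stand-in for autocast's per-thread enabled flag

def is_enabled():
    return getattr(_state, "enabled", False)

class toy_autocast:
    """Context manager that sets the flag for the current thread only."""
    def __enter__(self):
        self._prev = is_enabled()
        _state.enabled = True
        return self
    def __exit__(self, *exc):
        _state.enabled = self._prev
        return False

def forward_plain(results, idx):
    # Relies on whatever state the calling thread happens to have.
    results[idx] = is_enabled()

def forward_wrapped(results, idx):
    # Enables the flag inside forward, so it is set in the worker thread itself.
    with toy_autocast():
        results[idx] = is_enabled()

def run_in_thread(fn):
    results = [None]
    worker = threading.Thread(target=fn, args=(results, 0))
    worker.start()
    worker.join()
    return results[0]

with toy_autocast():                       # enabled in the main thread only
    main_thread_flag = is_enabled()
    plain_worker_flag = run_in_thread(forward_plain)
    wrapped_worker_flag = run_in_thread(forward_wrapped)

print(main_thread_flag, plain_worker_flag, wrapped_worker_flag)
```

The worker thread never sees the flag set by the main thread, which is the same reason the model's forward has to enable autocast itself when side threads run it.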
Autocast and Custom Autograd Functions
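If your network uses custom autograd functions (subclasses of torch.autograd.Function), changes are required for autocast compatibility if any function
- takes multiple floating-point Tensor inputs,
- wraps any autocastable op (see the Autocast Op Reference), or
- requires a particular dtype (for example, if it wraps CUDA extensions that were only compiled for that dtype).
In all cases, if you're importing the function and can't alter its definition, a safe fallback is to disable autocast and force execution in float32 (or the required dtype) at any points of use where errors occur: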
with autocast():
    ...
    with autocast(enabled=False):
        output = imported_function(input1.float(), input2.float())
If you're the function's author (or can alter its definition), a better solution is to use the torch.cuda.amp.custom_fwd() and torch.cuda.amp.custom_bwd() decorators, as shown in the relevant case below.
Functions with multiple inputs or autocastable ops
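Apply custom_fwd and custom_bwd (with no arguments) to forward and backward respectively. These ensure forward executes with the current autocast state and backward executes with the same autocast state as forward (which can prevent type mismatch errors):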
class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)
Now MyMM can be invoked anywhere, without disabling autocast or manually casting inputs:
mymm = MyMM.apply
with autocast():
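    output = mymm(input1, input2)
Functions that need a particular dtype
Consider a custom function that requires torch.float32 inputs. Apply custom_fwd(cast_inputs=torch.float32) to forward and custom_bwd (with no arguments) to backward. If forward runs in an autocast-enabled region, the decorators cast floating-point CUDA Tensor inputs to float32, and locally disable autocast during forward and backward: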
class MyFloat32Func(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, input):
        ctx.save_for_backward(input)
        ...
        return fwd_output
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        ...
Now MyFloat32Func can be invoked anywhere, without manually disabling autocast or casting inputs:
func = MyFloat32Func.apply
with autocast():
    # func will run in float32, regardless of the surrounding autocast state
    output = func(input)
