Autocasting and GradScaler

What are Autocasting and GradScaler

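torch.autocast is a context manager that automatically chooses the data type for each operation in the region it wraps. Operations that benefit from reduced precision, such as matrix multiplication, run in float16 (or bfloat16), while precision-sensitive operations stay in float32. This can improve performance, because 16-bit values take half the memory of float32 and are processed faster on hardware with native support for them.
torch.cuda.amp.GradScaler is a class that scales the loss, and therefore the gradients, so that mixed-precision training keeps its accuracy. This matters because small gradient values can underflow to zero in float16, which silently drops the corresponding parameter updates and hurts accuracy.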

What is bfloat16

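BFloat16 (Brain Floating Point) is a 16-bit floating-point format. Its dynamic range is essentially the same as float32, but its precision is lower.

Dynamic range refers to the range of magnitudes that a floating-point format can represent.

BFloat16 has lower precision because it stores a value in 16 bits where float32 uses 32. It keeps the 8 exponent bits of float32, which preserves the range, but has only 7 mantissa bits instead of 23.

The name BFloat16 comes from its developer, Google Brain.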
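The range and precision trade-off described above can be checked without any deep learning library. The following is a minimal sketch using only Python's struct module. The helper names (float32_bits, to_bfloat16, to_float16) are made up for this illustration, and the bfloat16 conversion assumes plain truncation of the float32 bit pattern, whereas real hardware and libraries usually round to nearest.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 float32 bit pattern of x as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bfloat16(x: float) -> float:
    """Convert x to bfloat16 by truncating the low 16 bits of its float32 pattern.

    Assumption: truncation (round toward zero) is used for simplicity.
    """
    # Keep the sign bit, the 8 exponent bits and the top 7 mantissa bits.
    bits = float32_bits(x) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_float16(x: float) -> float:
    """Round x to IEEE 754 float16 (raises OverflowError if x is out of range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Precision: float16 has 10 mantissa bits, bfloat16 only 7.
print(to_bfloat16(1.001))  # 1.0 (0.001 is below bfloat16's resolution near 1)
print(to_float16(1.001))   # 1.0009765625

# Range: bfloat16 shares float32's exponent width, float16 does not.
print(to_bfloat16(1e38))   # still finite, close to 1e38
try:
    to_float16(1e38)
except OverflowError:
    print("1e38 is out of range for float16")
```

In short, bfloat16 gives up mantissa bits to keep the float32 exponent, while float16 keeps more mantissa bits but covers a much narrower range.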

Autocasting

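An autocast instance can be used as a context manager or as a decorator, allowing regions of your script to run in mixed precision.

autocast should wrap only the forward pass of your network, including the loss computation. Running the backward pass under autocast is not recommended. Backward ops run in the same dtype that autocast used for the corresponding forward ops.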

import torch
from torch import optim

# Creates model and optimizer in default precision
# (Net, loss_fn and data are placeholders for your own model, loss and data loader)
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()

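Floating-point tensors produced in an autocast-enabled region may be float16. After leaving the region, mixing them with float32 tensors can cause type mismatch errors, so cast them back to float32 where needed.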

import torch

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, calls f_float16.float() to use with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())

Nesting autocast(enabled=False) inside an autocast-enabled region

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    e_float16 = torch.mm(a_float32, b_float32)
    with torch.autocast(device_type="cuda", enabled=False):
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = torch.mm(c_float32, e_float16.float())

    # No manual casts are required when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)

GradScaler

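If the forward pass of a particular op has float16 inputs, the backward pass of that op produces float16 gradients.

Gradient values with small magnitudes may not be representable in float16. These values flush to zero ("underflow"), so the update for the corresponding parameters is lost.

To prevent underflow, gradient scaling multiplies the network's loss by a scale factor and invokes the backward pass on the scaled loss. The gradients flowing backward through the network are then scaled by the same factor. In other words, the gradient values have a larger magnitude, so they do not flush to zero.

Before the optimizer updates the parameters, each parameter's gradient should be unscaled, so that the scale factor does not interfere with the learning rate.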
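To make the underflow concrete, here is a small sketch that uses only Python's struct module to round values to float16. The gradient value 1e-8 and the scale factor 65536 are assumptions picked for illustration, and to_float16 is a helper defined here, not a PyTorch function.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 float16 value, returned as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8        # hypothetical gradient, below float16's smallest subnormal
scale_factor = 65536.0  # 2**16, an assumed loss scale

# Without scaling, the gradient flushes to zero when stored as float16.
unscaled = to_float16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# so the value stored in float16 is now large enough to be representable.
scaled = to_float16(true_grad * scale_factor)

# Unscale in higher precision before the optimizer uses the gradient.
recovered = scaled / scale_factor

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8, up to float16 rounding error
```

The unscaling step is what keeps the effective learning rate unchanged: the optimizer sees a gradient of roughly 1e-8 again, not 1e-8 times the scale factor.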

import torch
from torch import nn, optim

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

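# Creates a GradScaler once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(input)
        loss = loss_fn(output, target)

    # Scales the loss, then calls backward() on the scaled loss to create scaled gradients
    scaler.scale(loss).backward()

    # scaler.step() first unscales the gradients of the optimizer's params.
    # If they contain no infs or NaNs, optimizer.step() is called; otherwise it is skipped.
    scaler.step(optimizer)

    # Updates the scale factor for the next iteration
    scaler.update()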
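
The step() and update() calls in the loop above hide the logic that keeps the scale factor useful over time. The sketch below is a simplified, hypothetical model of that logic in plain Python, not PyTorch's implementation. The class name TinyLossScaler and the apply_update callback are invented for illustration, and the default values are assumed to match the documented GradScaler defaults, so check them against your PyTorch version.

```python
import math


class TinyLossScaler:
    """Hypothetical, simplified model of dynamic loss scaling (not PyTorch's class)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0     # consecutive steps with finite gradients
        self._found_inf = False  # result of the most recent step()

    def step(self, scaled_grads, apply_update):
        """Unscale the gradients and call apply_update only if all are finite."""
        grads = [g / self.scale for g in scaled_grads]
        self._found_inf = any(not math.isfinite(g) for g in grads)
        if not self._found_inf:
            apply_update(grads)

    def update(self):
        """Shrink the scale after an overflow; grow it after enough good steps."""
        if self._found_inf:
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0


scaler = TinyLossScaler(init_scale=1024.0, growth_interval=2)
applied = []  # stands in for optimizer.step(): records the unscaled gradients

scaler.step([1024.0, 2048.0], applied.append)       # finite: update is applied
scaler.update()
scaler.step([float("inf"), 512.0], applied.append)  # overflow: update is skipped
scaler.update()

print(applied)       # [[1.0, 2.0]]
print(scaler.scale)  # 512.0
```

Skipping the step on overflow and halving the scale means an overly large scale factor costs only one iteration, while periodic growth pushes the factor back up so that small gradients stay representable.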

Original article: https://blog.csdn.net/C_C666/article/details/132266180