Loss is NaN???


During training the loss often becomes NaN. Searching online, the usual advice is to reduce the learning rate or to increase the batch_size. I tried reducing the learning rate and it did solve the problem, but I did not understand why, so I put together this summary of the loss-is-NaN issue.

I am still not sure why reducing the learning rate fixes the problem; any explanation is welcome.

If the loss is NaN from the very first iterations, consider whether there is a problem with your input data.

Reference:

https://stackoverflow.com/questions/33962226/common-causes-of-NaNs-during-training


Gradient blow up (the loss keeps growing)


Reason: large gradients throw the learning process off-track.

What you should expect: Looking at the runtime log, examine the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss will be too large to be represented by a floating-point variable and it will become nan.
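As a quick numpy illustration of that overflow path (a toy sketch, not Caffe itself): a float32 loss that keeps being multiplied by a large factor first saturates to inf, and the next operation that mixes infinities turns it into nan.

import numpy as np

# Hypothetical exploding loss: multiply a float32 value by a large factor each iteration.
loss = np.float32(1e30)
for it in range(6):
    loss = loss * np.float32(100.0)   # overflows float32 after a few steps
    print(it, loss)                   # ... eventually prints inf

# Once an inf appears, a difference of infinities (common inside losses) gives nan:
print(np.float32('inf') - np.float32('inf'))   # -> nan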


What can you do: Decrease the base_lr (in solver.prototxt) by an order of magnitude (at least). If you have several loss layers, inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
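If you would rather script this change than edit the file by hand, here is a small sketch (assuming pycaffe and protobuf are available; 'solver.prototxt' is just a placeholder path) that divides base_lr by 10 and writes the solver definition back:

from caffe.proto import caffe_pb2
from google.protobuf import text_format

# Load the solver definition (placeholder path).
solver = caffe_pb2.SolverParameter()
with open('solver.prototxt') as f:
    text_format.Merge(f.read(), solver)

# Decrease the base learning rate by an order of magnitude.
solver.base_lr = solver.base_lr / 10.0
print('new base_lr:', solver.base_lr)

with open('solver.prototxt', 'w') as f:
    f.write(text_format.MessageToString(solver))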


Bad learning rate policy and params


Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates and thus invalidates all parameters.

What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:

... sgd_solver.cpp:106] Iteration 0, lr = -nan


What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file. For instance, if you use lr_policy: "poly" and forget to define the max_iter parameter, you'll end up with lr = nan.
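To see why a missing max_iter yields an invalid rate: as far as I understand, Caffe's poly policy computes lr = base_lr * (1 - iter/max_iter)^power, so an unset (zero) max_iter means a division by zero. A plain numpy sketch of that formula (not Caffe's actual code):

import numpy as np

def poly_lr(base_lr, it, max_iter, power):
    # "poly" schedule: base_lr * (1 - iter/max_iter)^power
    return base_lr * np.power(1.0 - np.float64(it) / np.float64(max_iter), power)

print(poly_lr(0.01, 100, 10000, 0.9))   # a sensible rate, roughly 0.0099
print(poly_lr(0.01, 100, 0, 0.9))       # max_iter left at 0 -> division by zero -> nan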



For more information about learning rate in caffe, see this thread.


Faulty loss function


Reason: Sometimes the computation of the loss in the loss layers causes nans to appear, for example feeding an InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.

What you should expect: Looking at the runtime log you probably won't notice anything unusual: the loss decreases gradually, and all of a sudden a nan appears.


What can you do: See if you can reproduce the error, add a printout to the loss layer, and debug the error.

For example: once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans. In that case, working with batches that were large enough (with respect to the number of labels in the set) was enough to avoid this error.
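A toy numpy reconstruction of that failure mode (a sketch, not the actual loss layer described above): summing the penalty per label and normalizing by each label's count in the batch gives 0/0 = nan for every label that is absent from the batch, and a single nan poisons the total loss.

import numpy as np

num_labels = 5
labels = np.array([0, 1, 1, 3, 3, 3])                 # labels 2 and 4 never appear in this batch
per_sample_penalty = np.array([0.5, 0.2, 0.4, 0.1, 0.3, 0.2])

# Accumulate the penalty per label, then normalize by how often each label occurs.
penalty_per_label = np.bincount(labels, weights=per_sample_penalty, minlength=num_labels)
counts = np.bincount(labels, minlength=num_labels).astype(np.float64)

normalized = penalty_per_label / counts               # absent labels: 0/0 -> nan
print(normalized)                                     # nan at indices 2 and 4
print(normalized.sum())                               # the whole loss becomes nan

# Guarding the division (or using batches large enough that every label appears) avoids it:
safe = penalty_per_label / np.maximum(counts, 1.0)
print(safe.sum())                                     # finite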


Faulty input (the input may contain nan)


Reason: you have an input with nan in it!

What you should expect: once the learning process "hits" this faulty input, the output becomes nan. Looking at the runtime log you probably won't notice anything unusual: the loss decreases gradually, and all of a sudden a nan appears.


What can you do: re-build your input datasets (lmdb/leveldb/hdf5…) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
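One way to pre-screen the data before rebuilding it, assuming HDF5 inputs read with h5py (for lmdb you would iterate over a cursor and decode each datum instead); 'train.h5' is just a placeholder name:

import h5py
import numpy as np

# Placeholder file name; adapt the path and keys to your own dataset.
with h5py.File('train.h5', 'r') as f:
    for name, node in f.items():
        if not isinstance(node, h5py.Dataset):
            continue                      # only check datasets, skip groups
        arr = np.asarray(node)
        n_nan = np.isnan(arr).sum()
        n_inf = np.isinf(arr).sum()
        if n_nan or n_inf:
            print('dataset %s: %d nan, %d inf values' % (name, n_nan, n_inf))
        else:
            print('dataset %s looks clean' % name)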


Stride larger than kernel size in the "Pooling" layer

For some reason, choosing stride > kernel_size for pooling may result in nans. For example:

layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}

results in nans in y.


Instabilities in "BatchNorm"

It was reported that under some settings the "BatchNorm" layer may output nans due to numerical instabilities.



Copyright notice: this article was originally written by qq_32799915 and is licensed under CC 4.0 BY-SA; when reposting, please include a link to the original source and this notice.