Optimization Algorithms
-
Using the notation for mini-batch gradient descent, to which of the following does $a^{[2]\{4\}(3)}$ correspond?
- The activation of the third layer when the input is the fourth example of the second mini-batch.
- The activation of the second layer when the input is the third example of the fourth mini-batch.
- The activation of the fourth layer when the input is the second example of the third mini-batch.
- The activation of the second layer when the input is the fourth example of the third mini-batch.
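To make the indexing concrete, here is a minimal numpy sketch of how such activations might be stored; the layer sizes, number of mini-batches, and batch size are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = {1: 4, 2: 3}            # hypothetical units per layer
num_batches, batch_size = 5, 8    # hypothetical mini-batch layout

# activations[layer][t] holds a^{[layer]{t+1}}, an (n_units, batch_size) array
activations = {l: [rng.standard_normal((n, batch_size))
                   for _ in range(num_batches)]
               for l, n in n_units.items()}

# a^{[2]{4}(3)}: layer 2, 4th mini-batch, 3rd example. The notation is
# 1-indexed, so these become indices 3 and 2 in 0-indexed numpy:
a_2_4_3 = activations[2][3][:, 2]
print(a_2_4_3.shape)              # (3,) -- one activation vector of layer 2
```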
-
Which of these statements about mini-batch gradient descent do you agree with?
- You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches so that the algorithm processes all mini-batches at the same time (vectorization).
- Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
- When the mini-batch size is the same as the training set size, mini-batch gradient descent is equivalent to batch gradient descent.
(Explanation: Batch gradient descent uses all the examples at each iteration; this is equivalent to having only one mini-batch, the size of the complete training set, in mini-batch gradient descent.)
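A minimal sketch of one epoch of mini-batch gradient descent, assuming hypothetical `grad_fn` and `update_fn` helpers for the forward/backward pass and the parameter update:

```python
import numpy as np

def minibatch_epoch(X, Y, params, grad_fn, update_fn, batch_size):
    """One epoch of mini-batch gradient descent (examples are columns)."""
    m = X.shape[1]
    permutation = np.random.permutation(m)
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]

    # An explicit for-loop over mini-batches is unavoidable: each update
    # depends on the parameters produced by the previous one, so the
    # mini-batches cannot all be processed at the same time.
    for t in range(0, m, batch_size):
        grads = grad_fn(X_shuf[:, t:t + batch_size],
                        Y_shuf[:, t:t + batch_size], params)
        params = update_fn(params, grads)

    # With batch_size == m there is exactly one mini-batch, and this
    # loop reduces to batch gradient descent.
    return params
```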
-
Why is the best mini-batch size usually not 1 and not m, but instead something in-between? Check all that are true.
- If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
- If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
- If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
- If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
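The vectorization benefit is easy to measure. A small sketch, using arbitrary matrix sizes, comparing one vectorized forward step over a mini-batch with processing the same examples one at a time:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 1000))
X = rng.standard_normal((1000, 512))   # one mini-batch of 512 examples

t0 = time.perf_counter()
Z_vec = W @ X                          # vectorized across the mini-batch
t1 = time.perf_counter()
Z_loop = np.stack([W @ X[:, i] for i in range(X.shape[1])], axis=1)
t2 = time.perf_counter()

print(np.allclose(Z_vec, Z_loop))      # True: same result
print(f"vectorized: {t1 - t0:.4f}s, one example at a time: {t2 - t1:.4f}s")
# With a mini-batch size of 1, every update pays the slower per-example cost.
```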
-
While using mini-batch gradient descent with a batch size larger than 1 but less than m, the plot of the cost function $J$ looks like this:

[Figure: $J$ plotted against the mini-batch iteration number, trending downward but oscillating from iteration to iteration.]

You notice that the value of $J$ is not always decreasing. Which of the following is the most likely reason for that?
- In mini-batch gradient descent we calculate $J(\hat{y}^{\{t\}}, y^{\{t\}})$, thus with each batch we compute over a new set of data.
- A bad implementation of the backpropagation process; we should use gradient checking to debug our implementation.
- You are not implementing the moving averages correctly. Using moving averages will smooth the graph.
- The algorithm is at a local minimum, thus the noisy behavior.
(Explanation: Yes. Since at each iteration we work with a different set of data (a different batch), the loss function doesn't have to decrease at each iteration.)
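A synthetic sketch of this effect; the decay constant and noise scale below are made up, only the shape of the curve matters:

```python
import numpy as np

rng = np.random.default_rng(0)
iterations = np.arange(1000)

# The underlying optimization makes steady progress...
underlying_cost = 2.0 * np.exp(-iterations / 300)
# ...but each mini-batch evaluates J on a different subset of the data,
# which adds batch-to-batch noise on top of the trend.
per_batch_J = underlying_cost + 0.15 * rng.standard_normal(1000)

# Roughly half of the steps go "up" even though training is progressing.
print((np.diff(per_batch_J) > 0).mean())
```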
-
Suppose the temperature in Casablanca over the first two days of January is the same:

Jan 1st: $\theta_1 = 10^{\circ}C$

Jan 2nd: $\theta_2 = 10^{\circ}C$

(We used Fahrenheit in the lecture, so we will use Celsius here in honor of the metric world.)

Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don't actually need one. Remember what bias correction is doing.)
- $v_2 = 10$, $v_2^{corrected} = 10$
- $v_2 = 7.5$, $v_2^{corrected} = 7.5$
- $v_2 = 7.5$, $v_2^{corrected} = 10$
- $v_2 = 10$, $v_2^{corrected} = 7.5$
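A quick sanity check of the recurrence and the bias correction $v_t^{corrected} = v_t / (1 - \beta^t)$ in plain Python:

```python
beta = 0.5
thetas = [10.0, 10.0]            # theta_1 and theta_2, in degrees Celsius

v = 0.0                          # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)    # bias correction
    print(f"day {t}: v = {v}, corrected = {v_corrected}")
# day 1: v = 5.0, corrected = 10.0
# day 2: v = 7.5, corrected = 10.0
```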
-
Which of the following is true about learning rate decay?
- The intuition behind it is that for later epochs our parameters are closer to a minimum, thus it is more convenient to take smaller steps to prevent large oscillations.
- We use it to increase the size of the steps taken in each mini-batch iteration.
- The intuition behind it is that for later epochs our parameters are closer to a minimum, thus it is more convenient to take larger steps to accelerate convergence.
- It helps to reduce the variance of a model.
(Explanation: Reducing the learning rate over time reduces the oscillation around the minimum.)
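A sketch of one common schedule from the lecture, $\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \cdot \text{epoch\_num}}$; the initial rate and decay rate below are arbitrary assumed values:

```python
alpha_0 = 0.2        # initial learning rate (assumed value)
decay_rate = 1.0     # decay hyperparameter (assumed value)

for epoch_num in range(5):
    alpha = alpha_0 / (1 + decay_rate * epoch_num)
    print(f"epoch {epoch_num}: alpha = {alpha:.4f}")
# The step size shrinks as training progresses, so later updates
# oscillate less around the minimum.
```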
-
You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$.

[Figure: the yellow and red exponentially weighted average lines plotted over the temperature data.]

The yellow and red lines were computed using values $\beta_1$ and $\beta_2$ respectively. Which of the following are true?
- $\beta_1 < \beta_2$
- $\beta_1 = \beta_2$
- $\beta_1 > \beta_2$
- $\beta_1 = 0,\ \beta_2 > 0$
(Explanation: The smoother the line, the larger its $\beta$.)
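A sketch comparing two values of $\beta$ on synthetic temperatures; the data and the specific $\beta$ values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = 10 + rng.standard_normal(365)   # synthetic daily temperatures

def ewa(thetas, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

smooth = ewa(thetas, beta=0.98)   # averages over roughly 1/(1-0.98) = 50 days
noisy = ewa(thetas, beta=0.50)    # averages over roughly 2 days

# The larger beta gives the visibly smoother (but more lagged) curve.
print(np.std(np.diff(smooth)), np.std(np.diff(noisy)))
```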
-
Consider the figure:

[Figure: contour plot of a cost function, with the gradient descent path oscillating up and down as it moves toward the minimum.]

Suppose this plot was generated with gradient descent with momentum $\beta = 0.01$. What happens if we increase the value of $\beta$ to 0.1?
- The gradient descent process starts oscillating in the vertical direction.
- The gradient descent process starts moving more in the horizontal direction and less in the vertical.
- The gradient descent process moves less in the horizontal direction and more in the vertical direction.
- The gradient descent process moves more along both the horizontal and the vertical axis.
(Explanation: As $\beta$ increases, each step covers more ground and the amplitude of the oscillation shrinks. Using a greater value of $\beta$ makes the process more efficient, reducing the oscillation in the vertical direction and moving the steps more in the horizontal direction.)
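A toy sketch on an elongated quadratic bowl, where the gradient is much steeper along the "vertical" coordinate; the cost function, step size, and iteration count are all assumptions for illustration:

```python
import numpy as np

def momentum_step(w, grad, v, beta, alpha):
    """Gradient descent with momentum: v = beta*v + (1-beta)*grad; w -= alpha*v."""
    v = beta * v + (1 - beta) * grad
    return w - alpha * v, v

# J(w) = 0.5*(w0^2 + 25*w1^2): steep along w1 ("vertical"), shallow along w0.
grad_fn = lambda w: np.array([w[0], 25 * w[1]])

for beta in (0.01, 0.1):
    w, v = np.array([10.0, 1.0]), np.zeros(2)
    overshoot = 0.0
    for _ in range(50):
        w, v = momentum_step(w, grad_fn(w), v, beta=beta, alpha=0.05)
        overshoot = max(overshoot, abs(w[1]))
    print(f"beta={beta}: largest vertical excursion = {overshoot:.4f}, "
          f"final w0 = {w[0]:.2f}")
# The larger beta damps the vertical oscillation while horizontal
# progress continues at a similar pace.
```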
-
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $\mathcal{J}(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $\mathcal{J}$? (Check all that apply.)
- Normalize the input data. (See the sketch after this list.)
  (Explanation: Yes. In some cases, if the scale of the features is very different, normalizing the input data will speed up the training process.)
- Try better random initialization for the weights.
  (Explanation: Yes. As seen in previous lectures, this can help the gradient descent process prevent vanishing gradients.)
- Add more data to the training set.
- Try using gradient descent with momentum.
  (Explanation: Yes. The use of momentum can improve the speed of training, although other methods, such as Adam, might give better results.)
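A minimal sketch of input normalization as covered in the course (zero mean, unit variance per feature); the example matrix is made up:

```python
import numpy as np

def normalize_inputs(X):
    """Zero-mean, unit-variance normalization; features are rows,
    examples are columns, as in the course's convention."""
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True) + 1e-8   # avoid division by zero
    return (X - mu) / sigma, mu, sigma

X = np.array([[1.0, 2.0, 3.0],
              [100.0, 300.0, 500.0]])   # features on very different scales
X_norm, mu, sigma = normalize_inputs(X)
print(X_norm.mean(axis=1), X_norm.std(axis=1))   # ~0 and ~1 per feature
# Reuse the same mu and sigma to normalize the test set.
```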
-
Which of the following are true about Adam?
- Adam can only be used with batch gradient descent and not with mini-batch gradient descent.
- The most important hyperparameter of Adam is $\epsilon$ and should be carefully tuned.
- Adam combines the advantages of RMSProp and momentum.
- Adam automatically tunes the hyperparameter $\alpha$.
(Explanation: Precisely. Adam combines the features of RMSProp and momentum, which is why it uses the two parameters $\beta_1$ and $\beta_2$, besides $\epsilon$.)
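A minimal sketch of the Adam update for a single scalar parameter, with the usual default hyperparameters; the toy cost $J(w) = w^2$ is an assumption for illustration:

```python
import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (v, via beta1) plus RMSProp (s, via beta2),
    both with bias correction."""
    v = beta1 * v + (1 - beta1) * grad            # 1st moment (momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2       # 2nd moment (RMSProp)
    v_hat = v / (1 - beta1 ** t)                  # bias correction
    s_hat = s / (1 - beta2 ** t)
    return w - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s

# Toy usage on J(w) = w^2 (gradient 2w); it works per mini-batch just as well:
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.01)
print(w)   # close to the minimum at 0
```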