Deep learning noter – Gradient based learning – 11 & 13/10 2021

With a liminiear approximation can be made from the gradiant. This is done by going the oposite direction.

Usage of Derivates – Source: SDU

With multiple inputs it is used the same way, but returns an output with weights matching the amount of dimensions of the input.


SGD is using mini batches. Instead of using all connections to get the most precesise results, batches are used to approximate the result somewhat wore propbably.

From the mini batches an average gradient is calculated and used.

The positive asspect of SGD is that it is faster.

Deep learning specialities

Deep learning needs the same components as machine learning

  1. Optimization procedure
  2. Cost function
  3. Model family

Differences causes non-convex loss.


  • Can end in local minimum
  • Can end in local maximum
  • Can end at saddle points – The worst point: learning is stopped, and the results are likely not good.
  • Learning rate – Big learning rates might have large jumps in results and small learning rates might be slow.

If a particular bad promlem or cliff is encountered, new initialization might be the best solution.

Cliffs and exploding gradients

Cliffs in Neural Network are sudden drops or rices in learning.

Theese can be midigated by implementing a gradient.

Inexact gradients are caused by biace or variance.

About gradients

Gradients must be large and somewhat predictable to guide the learning algorithm.

Functions that saturate (flatten results) undermine gradients.

Negative log-likelihood helps avoid saturation problem for many models. It undoees gradient accidents as far as i understand.

Different optimizers

  • SGD
    • Standard gradient decent
    • utilizes decay, momentum, and Nesterov
    • for frequent parameter
  • RMSProp
    • Root Mean Square Propagation
    • utilizes exponentially decaying running average of the squared gradients (similar to momentum)
    • decay is usually 0.9.
  • Adagrad
    • utilizes larger updates for infrequent parameters and smaller updates
    • Good for word embeddings with that has infrequent words
    • Often has diminising learning rate
    • Tries to ease on the diminishing learning rate decay from Adagrad
  • Adadelta
    • similar to RMSProp
    • utilizes root squared parameter updates
    • Does not need set learning rate
  • Adam
    • Adaptive Moment Estimation
    • prefers flat minima in the error surface
    • Adam is basically RMSProp with momentum
  • Adamax
    • Same as Adam, but based on the max-norm ^inf.
  • Nadam
    • Nesterov-accelerated Adaptive Moment Estimation
    • Nadam is basically Adam with Nesterov momentum

Choosing a bad optimizer or bad parameters in general, can lead to inefeicent or slow learning.

Batch gradient decent can use fixed learning rate.

SGD jas a source error.

Source: SDU


Momentum hels accelerea gradientdecent. Usefull when

  • Surfaces that curve more steeply in one direction than in another direction
    § Facing high curvature
    § Small but consistent gradients
    § Noisy gradients

Dampening of the oscillation – Source: SDU

Note: The need for dampening of the oscilation can only be found trhough trial and error.

Momentum introduces velocity, v. It is the direction and speed at which parameters move through parameter
space. Also known as mass in physics.

Source: SDU

Nesterov Accelerated Gradient

Calculation of gradient if implementet.

Nesterov Accelerated Gradient – Source: SDU

Adaptive Learning Rate (ada)

Learning rate for each adaptive parameter.