This is complementary to the work of Karpathy [2].

## Forward pass

In supervised learning, for a single input $X$ we have the following two-layer neural net, where $*$ denotes the correct class.

$H = \frac{1}{1+\exp\left(-\left( W^1 X + W^h H^{t-1} + b\right)\right)}$
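A minimal NumPy sketch of this hidden-state update (the sizes, initialization scale, and the names `W1`, `Wh` are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 input features, 4 hidden units
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3)) * 0.01   # input-to-hidden weights W^1
Wh = rng.standard_normal((4, 4)) * 0.01   # hidden-to-hidden (recurrent) weights W^h
b  = np.zeros(4)                          # hidden bias b

X = rng.standard_normal(3)                # a single input X
H_prev = np.zeros(4)                      # previous hidden state H^{t-1}

# Forward pass for the hidden layer
H = sigmoid(W1 @ X + Wh @ H_prev + b)
print(H.shape)  # (4,)
```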

The softmax function $$p_{y_i} = \frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }$$ gives the probability of the correct class and appears inside the cross-entropy loss used here. For the $i$-th training example the loss is:

$L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right)$
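This per-example loss can be sketched as follows, assuming a hypothetical score vector `f` (subtracting the maximum score is a standard numerical-stability trick that leaves the softmax unchanged):

```python
import numpy as np

def cross_entropy_loss(f, y):
    # f: vector of class scores f_j; y: index of the correct class y_i
    f = f - np.max(f)                    # stability shift; softmax is unchanged
    p = np.exp(f) / np.sum(np.exp(f))    # softmax probabilities
    return -np.log(p[y])                 # L_i = -log p_{y_i}

scores = np.array([1.0, 2.0, 0.5])       # hypothetical scores for 3 classes
print(cross_entropy_loss(scores, 1))
```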

The loss for a training batch of $N$ examples is:

$L = \underbrace{\frac{1}{N} \sum_i L_i}_\text{data loss} + \underbrace{\frac{1}{2} \lambda \sum_k\sum_l W_{k,l}^2}_\text{regularization loss}$
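A sketch of this batch loss (the score matrix, labels, regularized weight matrix, and `lam` are assumptions for illustration):

```python
import numpy as np

def batch_loss(F, y, W, lam):
    # F: (N, C) matrix of class scores; y: (N,) correct class indices
    # W: weight matrix being regularized; lam: regularization strength lambda
    F = F - F.max(axis=1, keepdims=True)                  # numerical stability
    P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)  # row-wise softmax
    data_loss = -np.log(P[np.arange(len(y)), y]).mean()   # (1/N) sum_i L_i
    reg_loss = 0.5 * lam * np.sum(W ** 2)                 # (lambda/2) sum_{k,l} W_{k,l}^2
    return data_loss + reg_loss

scores = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 0.7]])    # 2 examples, 3 classes
labels = np.array([1, 2])
W = np.full((4, 3), 0.01)
print(batch_loss(scores, labels, W, lam=0.1))
```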

With three classes, where $*$ marks the correct one (here the second), the loss is:

$Loss = -\log{(p^*)} = -\log\left(\frac{e^{\hat y^*}}{ e^{\hat y^1}+e^{\hat y^*}+e^{\hat y^3} }\right)$
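For concreteness, with hypothetical scores $\hat y = (1,\ 2,\ 0.5)$ and the second class correct:

$Loss = -\log\left(\frac{e^{2}}{ e^{1}+e^{2}+e^{0.5} }\right) \approx 0.464$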

## Backpropagation

For the last weights and biases we have, for example for $W^2_{2,1}$ (the weight connecting hidden unit 1 to output 2):

$\frac{\partial{Loss}}{ \partial{W^2_{2,1}}} =\frac{\partial{Loss}}{ \partial{\hat y^2}}\times \frac{\partial{\hat y^2}}{ \partial{W^2_{2,1}}}$

and for the first weights and biases we have:

$\frac{\partial{Loss}}{ \partial{W^1_{2,1}}} =\frac{\partial{Loss}}{ \partial{H^1}}\times \frac{\partial{H^1}}{ \partial{W^1_{2,1}}}$
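As a sanity check, the chain rule above can be implemented and compared against a numerical gradient. The sketch below assumes a hypothetical 3-input, 4-hidden, 2-class net and drops the recurrent term for simplicity; all sizes and initializations are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, y):
    H = sigmoid(W1 @ X + b1)             # hidden layer
    f = W2 @ H + b2                      # class scores
    f = f - f.max()                      # stability shift; softmax unchanged
    p = np.exp(f) / np.exp(f).sum()      # softmax
    return H, p, -np.log(p[y])           # hidden state, probabilities, loss

rng = np.random.default_rng(1)
X  = rng.standard_normal(3)
W1 = rng.standard_normal((4, 3)); b1 = np.zeros(4)
W2 = rng.standard_normal((2, 4)); b2 = np.zeros(2)
y = 0                                    # correct class index

# Analytic gradient via the chain rule:
#   dLoss/df = p - onehot(y);  dLoss/dH = W2^T (p - onehot(y));
#   dH/d(pre-activation) = H * (1 - H);  d(pre-activation)/dW1[i, j] = X[j]
H, p, loss = forward(X, W1, b1, W2, b2, y)
df = p.copy(); df[y] -= 1.0
dH = W2.T @ df
dW1 = np.outer(dH * H * (1 - H), X)

# Central-difference check of one entry, dLoss/dW1[1, 0]
# (the W^1_{2,1} of the equation above)
eps = 1e-6
W1p = W1.copy(); W1p[1, 0] += eps
W1m = W1.copy(); W1m[1, 0] -= eps
num = (forward(X, W1p, b1, W2, b2, y)[2]
       - forward(X, W1m, b1, W2, b2, y)[2]) / (2 * eps)
print(abs(dW1[1, 0] - num) < 1e-5)
```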

## References:

[2] http://karpathy.github.io/2015/05/21/rnn-effectiveness/