print("Total number of trainable parameters: {}".format(total_param_no))
```
%% Cell type:markdown id:tender-leisure tags:
Mainstream activation functions and their derivatives:
- Sigmoid: Maps inputs (regardless of sign) to the range 0 to 1. Can be used as a switch.
- Hyperbolic Tangent (tanh): Maps inputs to the range -1 to 1. Useful where the output is allowed to be negative.
- Rectified Linear Unit (ReLU): Passes positive inputs through linearly and zeroes out negative inputs. Recommended in hidden layers to mitigate the vanishing gradient problem, since sigmoid and tanh are prone to saturation.
- Linear: Applies no transformation at all. Can cause exploding gradients (meaning that if the optimum solution is somewhere on Earth, we end up leaving the planet).
%% Cell type:code id:changed-gilbert tags:
``` python
def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def tanh(Z):
    return np.tanh(Z)

def linear(Z):
    return Z

def sigmoid_backward(dA, Z):
    sig = sigmoid(Z)
    return dA * sig * (1 - sig)

def relu_backward(dA, Z):
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

def tanh_backward(dA, Z):
    tnh = np.tanh(Z)
    return dA * (1 - tnh ** 2)

def linear_backward(dA, Z):
    return dA
```
%% Cell type:markdown id:noted-details tags:
Let's visualize the activation functions and their derivatives:
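The visualization code is not included above; below is a minimal sketch of how it could look, assuming matplotlib is available in the notebook (the figure layout and styling are illustrative only).

``` python
import matplotlib.pyplot as plt

z = np.linspace(-5, 5, 200)
activations = {'sigmoid': (sigmoid, sigmoid_backward),
               'tanh': (tanh, tanh_backward),
               'relu': (relu, relu_backward),
               'linear': (linear, linear_backward)}

fig, axes = plt.subplots(1, len(activations), figsize=(16, 3))
for ax, (name, (forward, backward)) in zip(axes, activations.items()):
    ax.plot(z, forward(z), label=name)
    ax.plot(z, backward(np.ones_like(z), z), label='derivative')  # backward(dA=1, Z) yields d(activation)/dZ.
    ax.set_title(name)
    ax.legend()
plt.show()
```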
``` python
    return activation_func(Z_curr), Z_curr  # We keep the output of the layer both before and after it is passed through the activation, as it will be used during backpropagation.
```
%% Cell type:markdown id:inclusive-marine tags:
We can visualize the first full forward pass of our neural network (which is also the initial guess of our model):
%% Cell type:code id:tracked-vietnamese tags:
``` python
memory_fp = {}
inp_fp = xgrid  # Make sure our initial input is our grid.
for layer_idx, layer in enumerate(nn_architecture):  # Iterate over layers from first to last.
    print('Layer {}'.format(layer_idx + 1))
    memory_fp['A' + str(layer_idx)] = inp_fp
    A_layer_fp, Z_layer_fp = single_layer_forward_propagation(inp_fp, nn_params['W' + str(layer_idx + 1)], nn_params['b' + str(layer_idx + 1)], layer['activation'])  # Forward pass through the current layer.
    inp_fp = A_layer_fp  # The output of this layer becomes the input of the next layer.
print("Blue is true output, Orange is our initial guess per our initial parameters.")
```
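The plot itself is not included in the cell above; a minimal sketch of it, assuming matplotlib is available and that the true outputs on the grid are stored in an array named `ytrue` (a hypothetical name), could look like this:

``` python
import matplotlib.pyplot as plt

plt.plot(xgrid.flatten(), ytrue.flatten(), label='True output')         # Drawn first, so blue by default.
plt.plot(xgrid.flatten(), A_layer_fp.flatten(), label='Initial guess')  # Drawn second, so orange by default.
plt.xlabel('x')
plt.legend()
plt.show()
```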
%% Cell type:markdown id:shaped-mechanism tags:
Another function to conduct forward propagation across all layers of the network (note that we keep track of the intermediate values, to be used later during backpropagation):
``` python
    return A_curr, memory  # Spit out the final output and the cache of all hidden outputs.
```
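Only the return statement of that function survives above; a minimal reconstruction sketch follows. The function name `full_forward_propagation` and the exact caching scheme (inputs under `'A'` keys, pre-activation values under `'Z'` keys) are assumptions, chosen to be consistent with the single-layer cell earlier.

``` python
def full_forward_propagation(X, nn_params, nn_architecture):
    memory = {}
    A_curr = X
    for layer_idx, layer in enumerate(nn_architecture):
        A_prev = A_curr
        memory['A' + str(layer_idx)] = A_prev  # Input seen by layer layer_idx + 1.
        A_curr, Z_curr = single_layer_forward_propagation(A_prev,
                                                          nn_params['W' + str(layer_idx + 1)],
                                                          nn_params['b' + str(layer_idx + 1)],
                                                          layer['activation'])
        memory['Z' + str(layer_idx + 1)] = Z_curr  # Pre-activation value of layer layer_idx + 1.
    return A_curr, memory  # Final output and the cache of all hidden outputs.
```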
%% Cell type:markdown id:numeric-orbit tags:
Mean squared error (MSE), written out after this list, is a popular loss function for NNs:
- It measures the error between the model prediction and the true output.
- This is what we want to minimize by adjusting the model parameters.
- Squaring the error keeps it positive and penalizes large errors more heavily, which speeds up gradient descent when predictions are far off.
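In symbols, for $m$ samples (the $1/m$ factor of the gradient is applied later in the backward pass, where $dW$ and $db$ are divided by $m$, matching the code below):

$$\mathrm{MSE}(\hat{Y}, Y) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2, \qquad \frac{\partial\,(\hat{y}_i - y_i)^2}{\partial \hat{y}_i} = 2\,(\hat{y}_i - y_i).$$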
%% Cell type:code id:bacterial-complex tags:
``` python
def get_loss_value(Y_hat, Y):
    m = Y.shape[-1]
    return np.sum((Y_hat - Y) ** 2) / m

def get_loss_grad(Y_hat, Y):
    return 2 * (Y_hat - Y)
```
%% Cell type:markdown id:essential-elite tags:
Print out the loss value obtained with the initial guess, and take the square root to get a sense of the order of magnitude of the prediction error on the same scale as the actual output:
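The cell that prints these values is not shown above; a minimal sketch, assuming `A_layer_fp` is the final-layer output from the forward pass above and `ytrue` (a hypothetical name) holds the true outputs:

``` python
initial_loss = get_loss_value(A_layer_fp, ytrue)
print("Initial loss (MSE): {:.4f}".format(initial_loss))
print("Root of the initial loss (same scale as the output): {:.4f}".format(np.sqrt(initial_loss)))
```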
``` python
    dZ_curr = backward_activation_func(dA_curr, Z_curr)  # Gradient through the activation, computed from dA and the cached pre-activation value.
    dW_curr = np.dot(dZ_curr, A_prev.T) / m  # Derivative with respect to the weights: product of the activation gradient and the layer input.
    db_curr = np.sum(dZ_curr, axis=1, keepdims=True) / m  # Derivative with respect to the bias.
    dA_prev = np.dot(W_curr.T, dZ_curr)  # Gradient of the layer as a whole, passed on to the previous layer (the next one in the backward pass).
    return dA_prev, dW_curr, db_curr
```
%% Cell type:markdown id:annual-florence tags:
Since we have already completed the first forward pass and evaluated the initial loss value, we can also evaluate the gradients with a single backward pass.
When we print out the gradients, we can see a gradient value corresponding to every single parameter of our network:
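The code that performs this backward pass is not shown above; below is a minimal sketch. It assumes the hedged `full_forward_propagation` helper sketched earlier, a wrapper `single_layer_backward_propagation(dA_curr, W_curr, Z_curr, A_prev, activation)` around the gradient steps shown in the previous code cell, and a hypothetical array `ytrue` holding the true outputs.

``` python
Y_hat_fp, memory = full_forward_propagation(xgrid, nn_params, nn_architecture)

grads_values = {}
dA_prev = get_loss_grad(Y_hat_fp, ytrue)  # Gradient of the loss with respect to the network output.
for layer_idx_prev, layer in reversed(list(enumerate(nn_architecture))):  # Iterate over layers from last to first.
    layer_idx_curr = layer_idx_prev + 1
    dA_curr = dA_prev
    A_prev = memory['A' + str(layer_idx_prev)]  # Input this layer saw during the forward pass.
    Z_curr = memory['Z' + str(layer_idx_curr)]  # Cached pre-activation value of this layer.
    W_curr = nn_params['W' + str(layer_idx_curr)]
    dA_prev, dW_curr, db_curr = single_layer_backward_propagation(dA_curr, W_curr, Z_curr, A_prev, layer['activation'])
    grads_values['dW' + str(layer_idx_curr)] = dW_curr
    grads_values['db' + str(layer_idx_curr)] = db_curr

for name, grad in grads_values.items():
    print(name, grad.shape)  # One gradient array for every parameter array of the network.
```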
Now we update the parameters using the gradients calculated with backpropagation (a sketch follows the list below). This is where we decide on the step size we take in the direction opposite to the gradient, i.e. the learning rate:
- Large learning rate: Can escape local optima, but can be unstable because it may skip over good solutions.
- Small learning rate: Steady convergence, but can get stuck in local optima.
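A minimal sketch of the gradient-descent update, assuming the `grads_values` dictionary from the backward-pass sketch above; `learning_rate` and its value are illustrative only:

``` python
learning_rate = 0.01  # Assumed step size, for illustration.
for layer_idx in range(1, len(nn_architecture) + 1):
    nn_params['W' + str(layer_idx)] -= learning_rate * grads_values['dW' + str(layer_idx)]
    nn_params['b' + str(layer_idx)] -= learning_rate * grads_values['db' + str(layer_idx)]
```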