Note on the implementation of a convolutional neural network.

This post is a follow-up on the second assignment proposed as part of the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition.

The last part of the assignment deals with the implementation of a convolutional neural network. Among other things, this implies implementing forward and backward passes for convolutional layers, pooling layers, spatial batch normalisation and various non-linearities.

Instead of writing everything on paper and, undoubtedly, losing myself in the jungle of indices, I decided to document my derivations in this post.

Convolutional layer

Convolutional layers are the building blocks of conv-nets. They convolve their inputs with learnable filters to extract what are called activation maps.

Notation

I am going to follow the notation of the assignment.

In particular,

  • Input $x$ ($N$, $C$, $H$, $W$)
  • Weights $w$ ($F$, $C$, $HH$, $WW$)
  • Output $y$ ($N$, $F$, $Hh$, $Hw$)

and the indices, unless differently specified, are

  • $n$ for the different images
  • $f$ for the different filters
  • $c$ for the different channels
  • $k,l$ for the spatial outputs

Forward pass

The first step is the implementation of the forward pass. Instead of jumping into it, let’s first look at an example to see what it does.

We are going to consider the convolution of an input $x$ of size $H=W=5$ with a filter of size $HH=WW=3$ with stride $S=2$ and zero padding $P=0$.

Graphically, $x$ is a $5 \times 5$ grid, the filter $w$ is a $3 \times 3$ grid, and the bias is just a scalar $b$ as we consider only one filter. The output $y$ is a $2 \times 2$ grid, as $Hh$ (and $Hw$ here) is given by

\begin{eqnarray} Hh &=& 1 + \frac{H + 2P - HH}{S} = 2 \end{eqnarray}

During the convolution process, the filter slides over the input in steps defined by the stride and produces its proper activation map $y$. For instance, $y_{1,1}$ reduces to the element-wise multiplication of the lower-right part (of size $HH \times WW$) of $x$ with the filter. Mathematically, it reads

\begin{eqnarray} y_{1,1} &=& \sum_{p=S}^{S+HH-1} \sum_{q=S}^{S+WW-1} x_{p,q} \, w_{p-S,q-S} + b \end{eqnarray}

where the beginning of each sum is given by the index of the element times the stride $S$, and the lengths of the sums are given by the sizes of their respective filter dimensions ($HH$ or $WW$).

Generalising from this example, we see that the output reads

\begin{eqnarray} y_{k,l} &=& \sum_{p=0}^{HH-1} \sum_{q=0}^{WW-1} x^{pad}_{kS+p,\,lS+q} \, w_{p,q} + b \end{eqnarray}

where

\begin{eqnarray} 0 \leq k < Hh = 1 + \frac{H + 2P - HH}{S}, \qquad 0 \leq l < Hw = 1 + \frac{W + 2P - WW}{S} \end{eqnarray}

and $x^{pad}$ is the input padded with the adequate number of zeros (given by $P$).
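These size formulas are easy to get wrong by one, so it can help to encode them in a tiny helper (the name `conv_output_size` is mine, not part of the assignment):

```python
def conv_output_size(H, W, HH, WW, S, P):
    """Spatial size (Hh, Hw) of the activation map produced by convolving
    an (H, W) input with an (HH, WW) filter, stride S and zero padding P."""
    Hh = 1 + (H + 2 * P - HH) // S
    Hw = 1 + (W + 2 * P - WW) // S
    return Hh, Hw
```

With the example above ($H = W = 5$, $HH = WW = 3$, $S = 2$, $P = 0$), it gives the expected $2 \times 2$ output.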

With $N$ images, $F$ filters and $C$ channels, the example above easily generalises and the forward pass for the convolutional layer reads

\begin{eqnarray} y_{n,f,k,l} &=& \sum_{c=0}^{C-1} \sum_{p=0}^{HH-1} \sum_{q=0}^{WW-1} x^{pad}_{n,c,\,kS+p,\,lS+q} \, w_{f,c,p,q} + b_{f} \end{eqnarray}

which nicely translates mathematically the fact that, to obtain the specific value indexed by $k,l$ in the $f$-th activation map for an input $n$, we select the corresponding subpart of the input of size $C \times HH \times WW$, multiply it element-wise by the filter $f$ and sum all the resulting terms, i.e. the convolution!

In python, it looks like

x_pad = np.pad(x, ((0,), (0,), (P,), (P,)), 'constant')
out = np.zeros((N, F, Hh, Hw))
for n in range(N):  # First, iterate over all the images
    for f in range(F):  # Second, iterate over all the kernels
        for k in range(Hh):
            for l in range(Hw):
                out[n, f, k, l] = np.sum(x_pad[n, :, k * S:k * S + HH, l * S:l * S + WW] * w[f, :]) + b[f]

Backward pass

During the backward pass, we have to compute the gradients of the loss $L$ with respect to the weights, the bias and the input, $\frac{\partial L}{\partial w}$, $\frac{\partial L}{\partial b}$ and $\frac{\partial L}{\partial x}$, where each gradient with respect to a quantity has the same size as the quantity itself, and where we know the upstream gradient $dout = \frac{\partial L}{\partial y}$ from the previous pass (see previous post for more details).
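Before deriving anything, it is worth having a way to validate the results numerically. The sketch below uses centred finite differences, similar in spirit to the gradient-check helper shipped with the assignment (the exact name `numerical_gradient_array` is mine); it can be used to check every backward pass in this post:

```python
import numpy as np

def numerical_gradient_array(f, x, df, h=1e-5):
    """Numerically estimate the gradient of sum(f(x) * df) with respect
    to x, using centred finite differences on each entry of x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()   # f evaluated with this entry nudged up
        x[ix] = old - h
        neg = f(x).copy()   # f evaluated with this entry nudged down
        x[ix] = old         # restore the original value
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad
```

For instance, checking the analytical gradient of $f(x) = x^2$ (which is $2x$) against this estimate should agree to a few decimal places.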

Gradient with respect to the weights

The gradient of the loss with respect to the weights has the same size as the weights themselves ($F$, $C$, $HH$, $WW$). Chaining by the gradient of the loss with respect to the outputs $y$, it reads

\begin{eqnarray} \frac{\partial L}{\partial w_{f',c',i,j}} &=& \sum_{n,f,k,l} dout_{n,f,k,l} \, \frac{\partial y_{n,f,k,l}}{\partial w_{f',c',i,j}} \end{eqnarray}

The expression of $y_{n,f,k,l}$ is derived above and, therefore, its derivative with respect to the weight $w_{f',c',i,j}$ reads

\begin{eqnarray} \frac{\partial y_{n,f,k,l}}{\partial w_{f',c',i,j}} &=& \delta_{f,f'} \, x^{pad}_{n,c',\,kS+i,\,lS+j} \end{eqnarray}

Injecting this expression back into the gradient of the loss with respect to the weights, we then have

\begin{eqnarray} \frac{\partial L}{\partial w_{f',c',i,j}} &=& \sum_{n,k,l} dout_{n,f',k,l} \, x^{pad}_{n,c',\,kS+i,\,lS+j} \end{eqnarray}

which in python translates

dw = np.zeros((F, C, HH, WW))
for fprime in range(F):
    for cprime in range(C):
        for i in range(HH):
            for j in range(WW):
                sub_xpad = x_pad[:, cprime, i:i + Hh * S:S, j:j + Hw * S:S]
                dw[fprime, cprime, i, j] = np.sum(dout[:, fprime, :, :] * sub_xpad)

Gradient with respect to the bias

The gradient of the loss with respect to the bias is of size ($F$). Chaining by the gradient of the loss with respect to the outputs $y$ and simplifying, it reads

\begin{eqnarray} \frac{\partial L}{\partial b_{f'}} &=& \sum_{n,k,l} dout_{n,f',k,l} \end{eqnarray}

which, in python, translates

db = np.zeros((F))
for fprime in range(F):
    db[fprime] = np.sum(dout[:, fprime, :, :])

Gradient with respect to the input

As above, we first chain by the gradient of the loss with respect to the output $y$, which gives

\begin{eqnarray} \frac{\partial L}{\partial x_{n',c',i,j}} &=& \sum_{n,f,k,l} dout_{n,f,k,l} \, \frac{\partial y_{n,f,k,l}}{\partial x_{n',c',i,j}} \end{eqnarray}

For the second term, we have to handle carefully the fact that $y$ depends on the padded version $x^{pad}$ and not on $x$ itself. In other words, we better first chain with the gradient of $y$ with respect to $x^{pad}$ to get

\begin{eqnarray} \frac{\partial y_{n,f,k,l}}{\partial x_{n',c',i,j}} &=& \sum_{p',q'} \frac{\partial y_{n,f,k,l}}{\partial x^{pad}_{n',c',p',q'}} \, \frac{\partial x^{pad}_{n',c',p',q'}}{\partial x_{n',c',i,j}} \end{eqnarray}

Let's first look at the second term, the gradient of $x^{pad}$ with respect to $x$. There is a simple relationship between the two, which reads

\begin{eqnarray} x^{pad}_{n,c,i+P,j+P} &=& x_{n,c,i,j} \end{eqnarray}

and then,

\begin{eqnarray} \frac{\partial x^{pad}_{n',c',p',q'}}{\partial x_{n',c',i,j}} &=& \delta_{p',\,i+P} \, \delta_{q',\,j+P} \end{eqnarray}

For the first term, differentiating the forward pass gives

\begin{eqnarray} \frac{\partial y_{n,f,k,l}}{\partial x^{pad}_{n',c',p',q'}} &=& \delta_{n,n'} \sum_{p=0}^{HH-1} \sum_{q=0}^{WW-1} w_{f,c',p,q} \, \delta_{kS+p,\,p'} \, \delta_{lS+q,\,q'} \end{eqnarray}

Then, putting both terms together, we find that

\begin{eqnarray} \frac{\partial y_{n,f,k,l}}{\partial x_{n',c',i,j}} &=& \delta_{n,n'} \sum_{p=0}^{HH-1} \sum_{q=0}^{WW-1} w_{f,c',p,q} \, \delta_{kS+p,\,i+P} \, \delta_{lS+q,\,j+P} \end{eqnarray}

Hence, the gradient of the loss with respect to the inputs finally reads

\begin{eqnarray} \frac{\partial L}{\partial x_{n',c',i,j}} &=& \sum_{f,k,l} \sum_{p=0}^{HH-1} \sum_{q=0}^{WW-1} dout_{n',f,k,l} \, w_{f,c',p,q} \, \delta_{kS+p,\,i+P} \, \delta_{lS+q,\,j+P} \end{eqnarray}

which in python can be written with $9$ beautiful loops ;)

# For dx: size (N, C, H, W)
dx = np.zeros((N, C, H, W))
for nprime in range(N):
    for cprime in range(C):
        for i in range(H):
            for j in range(W):
                for f in range(F):
                    for k in range(Hh):
                        for l in range(Hw):
                            for p in range(HH):
                                for q in range(WW):
                                    if (p + k * S == i + P) & (q + S * l == j + P):
                                        dx[nprime, cprime, i, j] += dout[nprime, f, k, l] * w[f, cprime, p, q]

Though inefficient, this implementation has the advantage of translating the formula point by point. A perhaps more clever implementation could look like

dx = np.zeros((N, C, H, W))
for nprime in range(N):
    for i in range(H):
        for j in range(W):
            for f in range(F):
                for k in range(Hh):
                    for l in range(Hw):
                        mask1 = np.zeros_like(w[f, :, :, :])
                        mask2 = np.zeros_like(w[f, :, :, :])
                        if (i + P - k * S) < HH and (i + P - k * S) >= 0:
                            mask1[:, i + P - k * S, :] = 1.0
                        if (j + P - l * S) < WW and (j + P - l * S) >= 0:
                            mask2[:, :, j + P - l * S] = 1.0
                        w_masked = np.sum(w[f, :, :, :] * mask1 * mask2, axis=(1, 2))
                        dx[nprime, :, i, j] += dout[nprime, f, k, l] * w_masked

which is still rather inefficient ;) If anyone has an idea of how to remove the i and j loops, please tell me!
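For what it is worth, here is one possible way to get rid of the i and j loops, a sketch assuming the same shapes and hyperparameters as above (the function name `conv_backward_dx` is mine): instead of asking, for each input pixel, which outputs it contributed to, we loop over the output positions only and scatter the corresponding filter values back onto a padded gradient.

```python
import numpy as np

def conv_backward_dx(dout, w, x_shape, S, P):
    """Gradient of the loss with respect to the conv input.

    dout: upstream gradient, shape (N, F, Hh, Hw)
    w: filters, shape (F, C, HH, WW)
    x_shape: (N, C, H, W) of the original (unpadded) input
    """
    N, C, H, W = x_shape
    F, _, HH, WW = w.shape
    _, _, Hh, Hw = dout.shape
    dx_pad = np.zeros((N, C, H + 2 * P, W + 2 * P))
    for k in range(Hh):
        for l in range(Hw):
            # Each output position (k, l) saw a (C, HH, WW) window of the
            # padded input; scatter its gradient back onto that window,
            # summing over the F filters for all N images at once.
            dx_pad[:, :, k * S:k * S + HH, l * S:l * S + WW] += np.einsum(
                'nf,fcpq->ncpq', dout[:, :, k, l], w)
    # Drop the padding to recover the gradient with respect to x itself
    return dx_pad[:, :, P:P + H, P:P + W]
```

This keeps only the two loops over the output positions and agrees with the nine-loop formula above.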

Pooling layer

A pooling layer reduces the spatial dimension of its input without affecting its depth.

Basically, if a given input $x$ has a size ($N$, $C$, $H$, $W$), then the output will have a size ($N$, $C$, $H_1$, $W_1$) where $H_1$ and $W_1$ are given by

\begin{eqnarray} H_1 &=& 1 + \frac{H - H_p}{S}, \qquad W_1 = 1 + \frac{W - W_p}{S} \end{eqnarray}

and where $H_p$, $W_p$ and $S$ are three hyperparameters which correspond to

  • $H_p$, the height of the pooling region
  • $W_p$, the width of the pooling region
  • $S$, the stride, i.e. the distance between two adjacent pooling regions.

Forward pass

The forward pass is very similar to the one of the convolutional layer and reads

\begin{eqnarray} y_{n,c,k,l} &=& \max_{0 \leq p < H_p,\; 0 \leq q < W_p} x_{n,c,\,kS+p,\,lS+q} \end{eqnarray}

which in python translates to

out = np.zeros((N, C, H1, W1))
for n in range(N):
    for c in range(C):
        for k in range(H1):
            for l in range(W1):
                out[n, c, k, l] = np.max(x[n, c, k * S:k * S + Hp, l * S:l * S + Wp])
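As an aside, in the common special case where the stride equals the pooling size and the regions tile the input exactly, the four loops can be replaced by a single reshape followed by a max. A sketch (the function name is mine):

```python
import numpy as np

def max_pool_forward_reshape(x, Hp, Wp):
    """Fast max-pooling for the special case S == Hp == Wp, where the
    pooling regions tile the input exactly."""
    N, C, H, W = x.shape
    assert H % Hp == 0 and W % Wp == 0, "pooling regions must tile the input"
    # Split each spatial dimension into (number of regions, region size)
    x_split = x.reshape(N, C, H // Hp, Hp, W // Wp, Wp)
    # Take the max over the two region-size axes
    return x_split.max(axis=(3, 5))
```

For instance, 2x2 pooling of a 4x4 input just picks the largest entry of each 2x2 tile.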

Backward pass

The gradient of the loss with respect to the input $x$ of the pooling layer writes

\begin{eqnarray} \frac{\partial L}{\partial x_{n',c',i,j}} &=& \sum_{n,c,k,l} dout_{n,c,k,l} \, \frac{\partial y_{n,c,k,l}}{\partial x_{n',c',i,j}} \end{eqnarray}

Let's look at the second term. In particular, we are going to assume that the spatial indices of the max in $y_{n,c,k,l}$ are $p_m$ and $q_m$ respectively. Therefore,

\begin{eqnarray} y_{n,c,k,l} &=& x_{n,c,p_m,q_m} \end{eqnarray}

and

\begin{eqnarray} \frac{\partial y_{n,c,k,l}}{\partial x_{n',c',i,j}} &=& \delta_{n,n'} \, \delta_{c,c'} \, \delta_{i,p_m} \, \delta_{j,q_m} \end{eqnarray}
and we are done! Indeed, in python, finding the indices of the max is fairly easy and a lazy computation of the gradient reads

dx = np.zeros((N, C, H, W))
for nprime in range(N):
    for cprime in range(C):
        for i in range(H):
            for j in range(W):
                for k in range(H1):
                    for l in range(W1):
                        x_pooling = x[nprime, cprime, k * S:k * S + Hp, l * S:l * S + Wp]
                        # Indices of the max inside the pooling window...
                        pm, qm = np.unravel_index(np.argmax(x_pooling), x_pooling.shape)
                        # ...shifted back to the coordinates of x itself
                        if (i == pm + k * S) and (j == qm + l * S):
                            dx[nprime, cprime, i, j] += dout[nprime, cprime, k, l]

Note here that we are computing the same x_pooling many times. A more clever solution, computationally speaking, is the following

dx = np.zeros((N, C, H, W))
for nprime in range(N):
    for cprime in range(C):
        for k in range(H1):
            for l in range(W1):
                x_pooling = x[nprime, cprime, k * S:k * S + Hp, l * S:l * S + Wp]
                maxi = np.max(x_pooling)
                x_mask = x_pooling == maxi
                dx[nprime, cprime, k * S:k * S + Hp, l * S:l * S + Wp] += dout[nprime, cprime, k, l] * x_mask

But we are not looking for efficiency here, are we? ;)

Spatial batch-normalization

Finally, we are asked to implement a vanilla version of batch norm for the convolutional layer.

Indeed, since the feature map was produced using convolutions, we expect the statistics of each feature channel to be relatively consistent both between different images and between different locations within the same image. Therefore, spatial batch normalization computes a mean and variance for each of the $C$ feature channels by computing statistics over the minibatch dimension $N$ as well as the spatial dimensions $H$ and $W$.

Forward pass

The forward pass is straightforward here:

\begin{eqnarray} y_{n,c,k,l} &=& \gamma_c \, \hat{x}_{n,c,k,l} + \beta_c \end{eqnarray}

where

\begin{eqnarray} \hat{x}_{n,c,k,l} = \frac{x_{n,c,k,l} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}, \qquad \mu_c = \frac{1}{NHW} \sum_{n,k,l} x_{n,c,k,l}, \qquad \sigma_c^2 = \frac{1}{NHW} \sum_{n,k,l} \left( x_{n,c,k,l} - \mu_c \right)^2 \end{eqnarray}

In four lines of python, it reduces to

mu = (1. / (N * H * W) * np.sum(x, axis=(0, 2, 3))).reshape(1, C, 1, 1)
var = (1. / (N * H * W) * np.sum((x - mu)**2,axis=(0, 2, 3))).reshape(1, C, 1, 1)
xhat = (x - mu) / (np.sqrt(eps + var))
out = gamma.reshape(1, C, 1, 1) * xhat + beta.reshape(1, C, 1, 1)
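Wrapped in a function, a quick sanity check is that with $\gamma = 1$ and $\beta = 0$ every channel of the output has roughly zero mean and unit variance (the function name `spatial_batchnorm_forward` is mine):

```python
import numpy as np

def spatial_batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Vanilla spatial batch norm: one mean and variance per channel,
    computed over the minibatch and both spatial dimensions."""
    N, C, H, W = x.shape
    mu = (1. / (N * H * W) * np.sum(x, axis=(0, 2, 3))).reshape(1, C, 1, 1)
    var = (1. / (N * H * W) * np.sum((x - mu) ** 2, axis=(0, 2, 3))).reshape(1, C, 1, 1)
    xhat = (x - mu) / np.sqrt(eps + var)
    return gamma.reshape(1, C, 1, 1) * xhat + beta.reshape(1, C, 1, 1)
```

Feeding it random data with a large offset and scale, the per-channel statistics of the output should come back normalised.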

Backward pass

In the backward pass, we have to find an expression for $\frac{\partial L}{\partial x}$, $\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$, where each gradient with respect to a quantity has the same size as the quantity itself.

I already spent an entire post explaining how to derive the gradient of the loss with respect to the centred inputs in a previous post, so I just drop the generalised versions for the conv-net application here.

Gradient of the loss with respect to $\beta$

It reads

\begin{eqnarray} \frac{\partial L}{\partial \beta_c} &=& \sum_{n,k,l} dout_{n,c,k,l} \end{eqnarray}

which in python translates to

dbeta = np.sum(dout, axis=(0, 2, 3))

Gradient of the loss with respect to $\gamma$

It reads

\begin{eqnarray} \frac{\partial L}{\partial \gamma_c} &=& \sum_{n,k,l} dout_{n,c,k,l} \, \hat{x}_{n,c,k,l} \end{eqnarray}

which in python translates to

dgamma = np.sum(dout * xhat, axis=(0, 2, 3))

Gradient of the loss with respect to the input $x$

With $N_t = NHW$, it reads

\begin{eqnarray} \frac{\partial L}{\partial x_{n,c,k,l}} &=& \frac{\gamma_c}{N_t \sqrt{\sigma_c^2 + \epsilon}} \left( N_t \, dout_{n,c,k,l} - \sum_{n',k',l'} dout_{n',c,k',l'} - \frac{x_{n,c,k,l} - \mu_c}{\sigma_c^2 + \epsilon} \sum_{n',k',l'} dout_{n',c,k',l'} \left( x_{n',c,k',l'} - \mu_c \right) \right) \end{eqnarray}

In python

gamma = gamma.reshape(1, C, 1, 1)
beta = beta.reshape(1, C, 1, 1)
Nt = N * H * W
dx = (1. / Nt) * gamma * (var + eps)**(-1. / 2.) * (
    Nt * dout
    - np.sum(dout, axis=(0, 2, 3)).reshape(1, C, 1, 1)
    - (x - mu) * (var + eps)**(-1.) * np.sum(dout * (x - mu), axis=(0, 2, 3)).reshape(1, C, 1, 1))

Conclusion

This post focuses on the derivation of the forward and backward passes for different building blocks of convolutional neural networks, namely

  • A convolutional layer
  • A pooling layer
  • Spatial-batch normalization

I also document the corresponding python code for those who want to implement their own convolutional neural networks.

To finish, I'd like to thank the whole team behind the Stanford CS231n class, who do a fantastic job of making the knowledge behind neural networks accessible.

For those who want to take a look at my full implementation of a convolutional neural network, you can find it here.

Written on February 2, 2016