From ResNet to ResNeSt

Over the years, many improvements have been built on top of the original ResNet architecture. Below, I review a few of the papers that introduced them.


  1. ResNeXt adopts group convolution to incorporate a multi-path strategy (similar to Inception) within the bottleneck block of ResNet.
  2. SE-Net introduced a channel-wise attention mechanism that re-calibrates each feature map.
  3. SK-Net introduced an attention mechanism that operates on feature maps produced with different kernel sizes, i.e. different receptive fields. The network can then learn how best to combine the information at different scales.
  4. ResNeSt combines the ideas of the papers above into one block.


The ResNet architecture consists of a stack of bottleneck blocks, each of which boils down to the element-wise summation of two pathways:

  1. The first path encodes the feature map into a smaller embedding and projects it back to the original size.
  2. The second path is just the identity.

This identity shortcut lets gradients flow directly back to earlier layers, which greatly improves training and largely mitigates the vanishing gradient problem.
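The two pathways can be sketched in PyTorch as follows. This is a minimal illustration rather than the reference ResNet code: the class name `Bottleneck`, the `reduction=4` ratio and the exact placement of BatchNorm and ReLU are assumptions made for the example.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """Residual bottleneck sketch: out = relu(F(x) + x)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction  # width of the smaller embedding (assumed ratio)
        # Path 1: encode (1x1), process (3x3), project back to the original size (1x1).
        self.residual = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Path 2 is the identity: x is added back unchanged.
        return self.relu(self.residual(x) + x)


x = torch.randn(2, 256, 8, 8)
out = Bottleneck(256)(x)  # same shape as x
```

Because the identity path adds `x` back unchanged, the gradient of the output with respect to `x` always contains a direct term that does not pass through the convolutions.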


Following ResNet, ResNeXt introduced a multi-path design (borrowed from Inception) into the ResNet block. Unlike Inception, all paths share the same topology. The cardinality G, i.e. the number of parallel paths, then becomes an extra knob for increasing model capacity.

As in ResNet, the ResNeXt block has an identity path, but it also gets G additional pathways. Each of them:

  1. Encodes the feature map into a smaller embedding space (4 channels here).
  2. Uses a 3x3 convolution to process the embedding.
  3. Decodes it back to the original size.

In practice, they use group convolution to implement the same idea.
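A minimal sketch of that grouped form, assuming G = 32 paths of width 4 (so a 128-channel grouped 3x3 convolution). The class name `ResNeXtBlock` and the `width_per_group` argument are names chosen for this illustration, not taken from the paper's code.

```python
import torch
import torch.nn as nn


class ResNeXtBlock(nn.Module):
    """ResNeXt-style bottleneck sketch: the G paths are fused into one grouped conv."""

    def __init__(self, channels, G=32, width_per_group=4):
        super().__init__()
        mid = G * width_per_group  # total embedding width across all paths
        self.residual = nn.Sequential(
            # Encode: one 1x1 conv is equivalent to G separate 1x1 encoders.
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            # Process: groups=G keeps the G embeddings independent of each other.
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=G, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            # Decode: project back to the original size (sums the paths' outputs).
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + x)


x = torch.randn(2, 256, 8, 8)
block = ResNeXtBlock(256)
out = block(x)  # same shape as x
```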

This post contains a very good explanation and illustration of what group convolution is. In a typical convolution, we convolve a feature map of size \(c_1 \times H \times W\) with \(c_2\) filters, each of size \(c_1 \times h_f \times w_f\).

In a group convolution with \(G\) groups, each group convolves a feature map of size \(\frac{c_1}{G} \times H \times W\) with \(\frac{c_2}{G}\) filters, each of size \(\frac{c_1}{G} \times h_f \times w_f\). Each group therefore outputs a \(\frac{c_2}{G} \times H \times W\) group of features. We then concatenate the \(G\) feature groups to get \(c_2\) output channels.

This greatly reduces computation and tends to work just as well in practice.

Denote the number of groups by G; then both the number of parameters and the computational cost are divided by G compared to an ordinary convolution.
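The parameter saving is easy to check with `nn.Conv2d` and its `groups` argument. The sizes below (64 input channels, 128 output channels, 3x3 kernels, 8 groups) are arbitrary values picked for the example, and `n_params` is a small helper defined here.

```python
import torch.nn as nn


def n_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


c1, c2, k, G = 64, 128, 3, 8

dense = nn.Conv2d(c1, c2, kernel_size=k, bias=False)
grouped = nn.Conv2d(c1, c2, kernel_size=k, groups=G, bias=False)

# Ordinary convolution: c2 filters of size c1 x k x k.
print(n_params(dense))    # 128 * 64 * 3 * 3 = 73728
# Group convolution: c2 filters of size (c1 / G) x k x k.
print(n_params(grouped))  # 128 * 8 * 3 * 3 = 9216
```

The ratio between the two counts is exactly G. The same factor applies to the multiply-accumulate operations, since each output position uses every weight once.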

Squeeze and excitation blocks

In 2018, SE-Net introduced the idea of channel-wise attention to recalibrate the feature maps.

A follow-up variant is the scSE block, i.e. concurrent spatial and channel Squeeze and Excitation, which combines a channel branch (cSE, the original SE block) with a spatial branch (sSE).

The cSE block factors out the spatial dependency through global average pooling to learn a channel-wise descriptor, which is then used to rescale the channels, highlighting only the useful ones. It squeezes along the spatial dimensions and excites along the channel dimension.

In contrast, the sSE block squeezes along the channel dimension and excites along the spatial dimensions. The advantage of the cSE block is that its receptive field covers the whole image from the start. The sSE block is akin to spatial attention.

Below is an implementation of this block in PyTorch:

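    import torch.nn as nn


    class scSE(nn.Module):
        def __init__(self, in_channels, reduction):
            super().__init__()
            # cSE: squeeze spatially (global average pooling), excite per channel.
            # The bottleneck width is controlled by `reduction`.
            self.cse_block = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels // reduction, in_channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # sSE: squeeze along channels (1x1 conv to a single map), excite per pixel.
            self.sse_block = nn.Sequential(
                nn.Conv2d(in_channels, 1, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            # Both gates broadcast against x: Bx1xHxW (sSE) and BxCx1x1 (cSE).
            return self.sse_block(x) * x + self.cse_block(x) * x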


SK-Net proposes a way for neurons to adaptively adjust the size of their receptive field.

The main idea of the paper has three steps:

  1. Split: compute feature maps with different receptive fields by varying the kernel size.
  2. Fuse: sum the feature maps, gather global channel-wise statistics with global average pooling across the spatial dimensions, then apply a 1x1 convolution to obtain a vector z of size d.
  3. Select: compute a channel-wise attention from z and use it to form a weighted average of the different feature maps.

An implementation of the block in PyTorch is given below, where UnitBlock is a Conv+BN+ReLU sequence whose convolution uses G groups.
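UnitBlock itself is not defined in this post. A minimal sketch of what it might look like is given here, assuming a default `padding=1` (so that a 3x3 kernel preserves the spatial size) and no bias in the convolution. Both defaults are guesses made for the example, inferred from how the SK block calls the helper.

```python
import torch
import torch.nn as nn


class UnitBlock(nn.Sequential):
    """Conv + BN + ReLU (hypothetical helper; defaults are assumptions)."""

    def __init__(self, in_planes, out_planes, kernel_size=3, padding=1,
                 dilation=1, groups=1):
        super().__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size,
                      padding=padding, dilation=dilation, groups=groups,
                      bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU(inplace=True),
        )


x = torch.randn(2, 64, 16, 16)
# 3x3 branch with grouped convolution.
u0 = UnitBlock(64, 64, kernel_size=3, groups=32)(x)
# 3x3 kernel with dilation=2: covers a 5x5 window, padding=2 keeps HxW.
u1 = UnitBlock(64, 64, kernel_size=3, dilation=2, padding=2, groups=1)(x)
```

With such a helper in place, the SK block reads as follows.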

    import torch
    import torch.nn as nn


    class SKBlock(nn.Module):
        def __init__(self, in_planes, G=32, r=16, L=32):
            """
            Args:
                in_planes (int): number of input channels
                G (int): number of convolution groups
                r (int): reduction ratio used to compute d, the length of z
                L (int): minimum value for d
            """
            super().__init__()
            self.d = max(in_planes // r, L)
            self.in_planes = in_planes

            # Split: two branches with different receptive fields.
            # 3x3 kernel
            self.kernel0 = UnitBlock(self.in_planes, self.in_planes,
                                     kernel_size=3, groups=G)
            # 5x5 receptive field from a 3x3 kernel with dilation=2
            self.kernel1 = UnitBlock(self.in_planes, self.in_planes,
                                     kernel_size=3, dilation=2, padding=2, groups=1)

            # Fuse
            self.fuse0 = nn.AdaptiveAvgPool2d(1)
            self.fuse1 = UnitBlock(self.in_planes, self.d, kernel_size=1, padding=0)

            # Select
            self.A = nn.Conv2d(self.d, self.in_planes, kernel_size=1, bias=False)
            self.B = nn.Conv2d(self.d, self.in_planes, kernel_size=1, bias=False)
            self.softmax = nn.Softmax(dim=1)

        def split(self, x):
            """Produce feature maps with different receptive fields."""
            u0 = self.kernel0(x)  # BxCxHxW
            u1 = self.kernel1(x)  # BxCxHxW
            return u0, u1

        def fuse(self, u0, u1):
            """Fuse the feature maps into a channel-wise descriptor z."""
            u = u0 + u1           # BxCxHxW
            s = self.fuse0(u)     # BxCx1x1
            return self.fuse1(s)  # Bxdx1x1

        def compute_attention_map(self, z):
            """Compute, for each branch, how much to attend to each channel."""
            a = self.A(z)  # BxCx1x1
            b = self.B(z)  # BxCx1x1
            # Stack the two branches: Bx2xCx1x1
            attention_map = torch.cat([a, b], 1).view(-1, 2, self.in_planes, 1, 1)
            # Softmax over the branch dimension: weights sum to 1 per channel.
            return self.softmax(attention_map)  # Bx2xCx1x1

        def select(self, u0, u1, attention_map):
            """Combine the branches with the attention weights.

            As each feature map looks at the input with a specific receptive
            field, the network effectively learns how to combine channels
            from different receptive fields.
            """
            s = u0.shape
            # Stack the two branches: Bx2xCxHxW
            u = torch.cat([u0, u1], 1).view(-1, 2, self.in_planes, s[-2], s[-1])
            return torch.sum(u * attention_map, dim=1)  # BxCxHxW

        def forward(self, x):
            # Assumed wiring of the steps above; not part of the original listing.
            u0, u1 = self.split(x)
            z = self.fuse(u0, u1)
            return self.select(u0, u1, self.compute_attention_map(z))


ResNeSt combines the ideas of grouped convolution and split attention to form a new block, the SplitAttentionBlock.

It is parameterized by the cardinality \(K\) and the radix \(R\). As in ResNeXt, the feature map is divided into \(K\) groups. Each group is further broken down into \(R\) splits.

Following SE-Net and SK-Net, they combine the information from the different splits by first computing a channel-wise attention map (BxRxCx1x1), which is multiplied element-wise with the stacked outputs of all the splits (BxRxCxHxW); the result is then summed over the splits.
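The combination step can be sketched at the tensor level as follows. This is only an illustration of the shapes involved, for a single cardinal group: the attention logits are random stand-ins for the output of the block's attention layers, and normalizing them with a softmax across the R splits is an assumption carried over from the SK-Net select step.

```python
import torch

B, R, C, H, W = 2, 3, 8, 5, 5

# Stacked outputs of the R splits of one group: BxRxCxHxW.
splits = torch.randn(B, R, C, H, W)

# Stand-in for the attention logits computed from the pooled features: BxRxCx1x1.
logits = torch.randn(B, R, C, 1, 1)

# Normalize across the R splits so the weights sum to 1 for every channel.
attention = torch.softmax(logits, dim=1)

# Weight each split channel-wise (broadcast over HxW) and sum over the splits.
out = (splits * attention).sum(dim=1)  # BxCxHxW
```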

The process is very similar to the one in SK-Net and is depicted in the paper.