Strides in CNNs

In a Convolutional Neural Network, the Stride is the number of pixels by which the filter (kernel) shifts over the input matrix. While Padding is used to maintain size, Stride is one of the primary ways we control the spatial dimensions of our feature maps.

1. What is a Stride?

When the stride is set to 1, the filter moves one pixel at a time. This results in highly overlapping receptive fields and a larger output.

When the stride is set to 2 (or more), the filter jumps two pixels at a time. This skips over pixels, resulting in a smaller output and less overlap.
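To make the movement concrete, here is a minimal NumPy sketch of a single-channel "valid" convolution (technically cross-correlation, as in most deep learning libraries) where the stride controls how far the window jumps. The function name `conv2d` and the toy input are illustrative, not from any library:

```python
import numpy as np

def conv2d(x, k, stride=1):
    """Valid 2D cross-correlation of input x with kernel k at a given stride."""
    H, W = x.shape
    K = k.shape[0]
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The window's top-left corner moves in steps of `stride` pixels
            patch = x[i*stride:i*stride+K, j*stride:j*stride+K]
            out[i, j] = np.sum(patch * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input
k = np.ones((2, 2))                           # 2x2 summing kernel

print(conv2d(x, k, stride=1).shape)  # (3, 3) - overlapping windows
print(conv2d(x, k, stride=2).shape)  # (2, 2) - windows never overlap
```

With stride 1 the 2x2 window visits every position and adjacent windows share pixels; with stride 2 the windows tile the input without overlap, halving each output dimension.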

2. The Impact of Striding

A. Dimensionality Reduction

Increasing the stride is an alternative to Pooling. By jumping over pixels, the network effectively "downsamples" the image. For example, a stride of 2 will roughly halve the width and height of the output.

B. Receptive Field

A larger stride allows the network to cover more area with fewer parameters, but it comes at a cost: Information Loss. Because the filter skips pixels, some fine-grained spatial details might be missed.

C. Computational Efficiency

Larger strides mean fewer total operations (multiplications and additions), which can significantly speed up the training and inference time of a model.
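A quick back-of-the-envelope count illustrates the savings. The sketch below tallies multiply-accumulate operations for one convolutional layer; the layer sizes (224x224 input, 3x3 kernel, 64 channels) are hypothetical, chosen only to show the ratio:

```python
def conv_macs(w, k, c_in, c_out, stride, padding=0):
    """Rough multiply-accumulate count for one square conv layer."""
    out = (w - k + 2 * padding) // stride + 1
    # Each output pixel costs k*k*c_in MACs, for each of c_out filters
    return out * out * k * k * c_in * c_out

s1 = conv_macs(w=224, k=3, c_in=64, c_out=64, stride=1, padding=1)
s2 = conv_macs(w=224, k=3, c_in=64, c_out=64, stride=2, padding=1)
print(s1 / s2)  # 4.0 - stride 2 does roughly 4x fewer operations
```

Because stride 2 halves both output dimensions, the operation count drops by roughly a factor of four for this layer.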

3. Mathematical Formula

To determine the output size when using stride $S$, we use the general convolution formula:

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
  • $W$: Input width/height
  • $K$: Kernel size
  • $P$: Padding
  • $S$: Stride
Note: If the result of the division is not a whole number, most frameworks will "floor" the value (round down), meaning the last few pixels of the image might be ignored if the filter can't fit.
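The formula translates directly into code. This small helper (an illustrative function, not a library API) works through two cases, including one where the flooring behavior kicks in:

```python
import math

def output_size(W, K, P, S):
    """Output width/height of a convolution: floor((W - K + 2P) / S) + 1."""
    return math.floor((W - K + 2 * P) / S) + 1

# 7x7 input, 3x3 kernel, no padding, stride 2: (7 - 3) / 2 + 1 = 3
print(output_size(W=7, K=3, P=0, S=2))  # 3

# Same setup with stride 3: (7 - 3) / 3 = 1.33 floors to 1, giving 2.
# The filter cannot fit a third time, so the last column of pixels is skipped.
print(output_size(W=7, K=3, P=0, S=3))  # 2
```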

4. Comparing Stride and Pooling

Both techniques are used to reduce the size of the data, but they differ in how they do it:

| Feature | Large-Stride Convolution | Pooling Layer |
| --- | --- | --- |
| Learning | The filter weights are learned, so the network learns what to extract while downsampling. | Uses a fixed rule (Max or Average). |
| Parameters | Contains weights and biases. | No parameters. |
| Trend | Modern architectures (like ResNet) often prefer strided convolutions. | Classic architectures (like VGG) rely heavily on Pooling. |

5. Implementation

TensorFlow / Keras

from tensorflow.keras.layers import Conv2D

# A standard convolution (Stride 1)
conv_std = Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1))

# A downsampling convolution (Stride 2)
conv_down = Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2))

PyTorch

import torch.nn as nn

# Strides are defined as an integer or a tuple (height, width)
# This will halve the input dimensions
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1)
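We can verify the halving by pushing a dummy batch through the layer. The 32x32 input size here is just an example:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 32, 32)  # batch of one 32x32 RGB image
y = conv(x)
print(y.shape)  # torch.Size([1, 16, 16, 16])
```

This matches the formula: floor((32 - 3 + 2) / 2) + 1 = 16, so the spatial dimensions are halved while the channel count grows from 3 to 16.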

We’ve covered how the filter moves, how it handles edges, and how it extracts features. Now, how do we combine all these pieces into a complete network?