How to Compute Faster 2-D Convolution in Python

In image processing, 2-D convolution is a fundamental operation, used for tasks such as blurring, morphology, edge detection, and sharpening. In Python, however, a naive 2-D convolution carries a huge computational load for a large image. This post introduces np.lib.stride_tricks as a way to speed up the convolution algorithm. The functions in this module access the underlying memory directly, so care must be taken, particularly when writing data through the views they create.

A basic implementation of 2-D convolution can be illustrated with the following code:

import numpy as np
import time

def conv2d(x, kernel):
    """Naive 'same'-size 2-D convolution with four nested loops."""
    H, W = x.shape
    H_k, W_k = kernel.shape  # assume odd kernel dimensions
    pad_h, pad_w = H_k // 2, W_k // 2

    out = np.zeros_like(x)
    x_pad = np.pad(x, ((pad_h, pad_h), (pad_w, pad_w)))

    # slide the kernel over every output position
    for i in range(H):
        for j in range(W):
            for m in range(H_k):
                for n in range(W_k):
                    out[i, j] += x_pad[i + m, j + n] * kernel[m, n]

    return out

This method relies on four nested for-loops, so its computation time grows rapidly with the size of the input matrix. For instance, with a 320x240 image and a 3x3 kernel, this 2-D convolution takes about 500 ms on a MacBook Air with an M1 processor.
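
As a rough reproduction of that measurement, the function can be timed as follows; the random 320x240 image and the 3x3 box-blur kernel are illustrative choices, not the original benchmark inputs:

rng = np.random.default_rng(0)
img = rng.random((240, 320))    # 320x240 image: H=240 rows, W=320 columns
kernel = np.ones((3, 3)) / 9.0  # simple 3x3 box-blur kernel

start = time.perf_counter()
out = conv2d(img, kernel)
print(f"naive conv2d: {(time.perf_counter() - start) * 1e3:.1f} ms")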

To improve the performance of the 2-D convolution algorithm, np.lib.stride_tricks.as_strided() can be utilized. Essentially, this function creates a new view of an array's underlying data with the desired shape and strides, where a stride is the number of bytes to skip in memory when moving one step along a given dimension. For example, consider the NumPy array x = np.array([[1, 2], [3, 4]], dtype=np.int64). The raw data bytes of x can be viewed as follows:

x = np.array([[1, 2], [3, 4]], dtype=np.int64)
print(x.tobytes())  # little-endian
# b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'

Here, x[0, 0] corresponds to b'\x01\x00\x00\x00\x00\x00\x00\x00' and x[0, 1] to b'\x02\x00\x00\x00\x00\x00\x00\x00', because the stride of x along the column axis is 8 bytes. In the same manner, the stride along the row axis is 16 bytes.

print(x.strides)
# (16, 8)

Thus, np.lib.stride_tricks.as_strided(x, (4,), (8,)) results in [1, 2, 3, 4], while np.lib.stride_tricks.as_strided(x, (4,), (4,)) results in [1, 8589934592, 2, 12884901888]: with a 4-byte stride, each read straddles the 8-byte element boundaries, so b'\x00\x00\x00\x00\x02\x00\x00\x00' is interpreted as \(256^4 \times 2 = 8,589,934,592\) and b'\x00\x00\x00\x00\x03\x00\x00\x00' as \(256^4 \times 3 = 12,884,901,888\).
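
A minimal sketch reproducing these two reads (note that as_strided() performs no bounds checking, so an incorrect shape/stride pair can read arbitrary memory; both views here stay within x's 32-byte buffer):

flat = np.lib.stride_tricks.as_strided(x, shape=(4,), strides=(8,))
print(flat)  # [1 2 3 4]

# a 4-byte stride straddles the 8-byte int64 elements
overlap = np.lib.stride_tricks.as_strided(x, shape=(4,), strides=(4,))
print(overlap)  # [1 8589934592 2 12884901888]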

Returning to the 2-D convolution method, consider an \(H \times W\) matrix x and an \(M \times N\) kernel kernel. Padding the matrix with x_pad = np.pad(x, ((M//2, M//2), (N//2, N//2))) and calling np.lib.stride_tricks.as_strided(x_pad, (H, W, M, N), x_pad.strides + x_pad.strides) creates a tensor of size \(H \times W \times M \times N\). Since x_pad.strides + x_pad.strides simply concatenates the strides tuple with itself, the last two dimensions reuse the row and column strides of x_pad, so each 2-D sub-matrix along the third and fourth dimensions is the padded \(M \times N\) window centered on one output pixel. Each such sub-matrix is element-wise multiplied by the kernel and summed, which np.einsum('klij,ij->kl', sub_matrices, kernel) expresses in Einstein summation convention: klij labels the four dimensions of the sub_matrices tensor, ij labels the dimensions of the kernel matrix, and klij,ij->kl therefore denotes:

\[O_{kl} = \sum_{i,j} S_{klij} \cdot K_{ij}\]

where \(S\) is the sub_matrices tensor, \(K\) is the kernel matrix, and \(O\) is the output matrix. The full enhanced 2-D convolution code is as follows:

def fast_conv2d(x, kernel):
    """2-D convolution using an as_strided window view and einsum."""
    H, W = x.shape
    H_k, W_k = kernel.shape  # assume odd kernel dimensions
    pad_h, pad_w = H_k // 2, W_k // 2

    x_pad = np.pad(x, ((pad_h, pad_h), (pad_w, pad_w)))

    # (H, W, H_k, W_k) view: one padded H_k x W_k window per output pixel;
    # writeable=False guards against accidental writes through the view
    sub_matrices = np.lib.stride_tricks.as_strided(
        x_pad, (H, W, H_k, W_k), x_pad.strides * 2, writeable=False)
    return np.einsum('klij,ij->kl', sub_matrices, kernel)

In the above code, the as_strided() call can be replaced with np.lib.stride_tricks.sliding_window_view(x_pad, (H_k, W_k)), a safer wrapper that derives the shape and strides itself and returns a read-only view by default.

def fast_conv2d(x, kernel):
    """2-D convolution using sliding_window_view and einsum."""
    H_k, W_k = kernel.shape  # assume odd kernel dimensions
    pad_h, pad_w = H_k // 2, W_k // 2

    x_pad = np.pad(x, ((pad_h, pad_h), (pad_w, pad_w)))

    # read-only (H, W, H_k, W_k) view of all padded windows
    sub_matrices = np.lib.stride_tricks.sliding_window_view(x_pad, (H_k, W_k))
    return np.einsum('klij,ij->kl', sub_matrices, kernel)
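
For intuition, the einsum contraction is just a broadcast multiply followed by a sum over the last two axes, and both fast variants should match the naive implementation. A quick sanity check, with an illustrative random image and box-blur kernel:

img = np.random.default_rng(0).random((240, 320))
kernel = np.ones((3, 3)) / 9.0

# windows of the padded image, shape (240, 320, 3, 3)
sub = np.lib.stride_tricks.sliding_window_view(np.pad(img, 1), (3, 3))
assert np.allclose(np.einsum('klij,ij->kl', sub, kernel),
                   (sub * kernel).sum(axis=(-2, -1)))
assert np.allclose(fast_conv2d(img, kernel), conv2d(img, kernel))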

Comparing the computation times of the three methods above shows that the latter two are significantly faster, with as_strided() performing slightly better than sliding_window_view(). In the graph below, the error bars are exaggerated to three times the standard deviation for visibility.
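
A minimal timing harness along these lines can reproduce the comparison; the image size, repeat count, and the names fast_conv2d_strided and fast_conv2d_view (the two fast variants above share a single name) are assumptions of this sketch, not the author's benchmark code:

import timeit

img = np.random.default_rng(0).random((240, 320))
kernel = np.ones((3, 3)) / 9.0

# assumes the two fast variants above have been renamed to distinct names
for name, fn in [("naive loops", conv2d),
                 ("as_strided", fast_conv2d_strided),
                 ("sliding_window_view", fast_conv2d_view)]:
    t = timeit.timeit(lambda: fn(img, kernel), number=10) / 10
    print(f"{name:>20}: {t * 1e3:.2f} ms")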