Last updated: 1 January 2023
A concise peek into NN activation functions and normalisation routines.
Layers discussed below: batch normalisation, layer normalisation, instance normalisation, and group normalisation.
Resources: Medium articles, NPTEL lectures.
For the CNN discussions below, consider a training batch given by a 4-D tensor of dimensions \(N \times C \times H \times W\): \(N\) is the batch size (number of images), \(C\) is the number of convolution channels at this depth, and \(H \times W\) are the spatial dimensions of the image.
Batch normalisation: applied to linear layers and CNNs.
Linear layer: consider an \(N \times D\) training batch matrix, with \(N\) data points and \(D\) features. We standardise each feature using its mean and standard deviation computed over the batch, so we keep track of \(D\) \(\mu\)s and \(\sigma^2\)s.
import torch

ε = 1e-5
X = torch.randn(10, 4)                             # 10 data points, 4 features
μ = X.mean(dim=0, keepdim=True)                    # 1 x 4
ν = X.var(dim=0, unbiased=False, keepdim=True)     # 1 x 4; biased variance, as batch norm uses at train time
Xstd = (X - μ) / (ν + ε)**0.5                      # 10 x 4
ɣ = torch.ones(1, 4)                               # learnable scale, initialised to 1
β = torch.zeros(1, 4)                              # learnable shift, initialised to 0
Xbn = ɣ * Xstd + β                                 # ɣ, β -> 1 x 4; scale and shift
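As a quick sanity check, the manual computation above should match PyTorch's nn.BatchNorm1d in training mode, assuming its default ε = 1e-5 and default ɣ = 1, β = 0 initialisation; a minimal sketch:

bn = torch.nn.BatchNorm1d(4, eps=ε)                # defaults: weight (ɣ) = 1, bias (β) = 0
bn.train()                                         # training mode: normalise with batch statistics
with torch.no_grad():
    print(torch.allclose(bn(X), Xbn, atol=1e-5))   # expect True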
CNN: for CNNs we compute one mean and standard deviation per channel, over the batch and spatial dimensions. We will have \(C\) \(\mu\)s and \(\sigma\)s.
ε = 1e-5
X = torch.randn(10, 4, 128, 64)                         # 10 imgs, 4 channels, H - 128, W - 64
μ = X.mean(dim=[0, 2, 3], keepdim=True)                 # shape [1, 4, 1, 1]
ν = X.var(dim=[0, 2, 3], unbiased=False, keepdim=True)  # shape [1, 4, 1, 1]
Xstd = (X - μ) / (ν + ε)**0.5                           # 10 x 4 x 128 x 64
ɣ = torch.ones(1, 4, 1, 1)                              # one scale per channel
β = torch.zeros(1, 4, 1, 1)                             # one shift per channel
Xbn = ɣ * Xstd + β                                      # ɣ, β -> [1, 4, 1, 1]; scale and shift
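The same check against nn.BatchNorm2d (a sketch, again assuming the default ε and affine initialisation):

bn2d = torch.nn.BatchNorm2d(4, eps=ε)
bn2d.train()
with torch.no_grad():
    print(torch.allclose(bn2d(X), Xbn, atol=1e-5))   # expect True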
Salient points: behaviour differs between the train and eval phases while using batch norms. During training we normalise with the current batch's statistics and update running estimates of \(\mu\) and \(\sigma^2\); during eval we normalise with those stored running estimates instead.
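A minimal sketch of that bookkeeping for the linear-layer case (the names Xb, Xtest and the momentum value are illustrative; the momentum and the unbiased variance in the running update mirror PyTorch's defaults, to the best of my understanding):

momentum = 0.1
Xb = torch.randn(10, 4)                          # a training batch: 10 points, 4 features
μb = Xb.mean(dim=0, keepdim=True)
running_μ = torch.zeros(1, 4)
running_ν = torch.ones(1, 4)

# training step: normalise Xb with its own batch stats (as above), then update the running estimates
running_μ = (1 - momentum) * running_μ + momentum * μb
running_ν = (1 - momentum) * running_ν + momentum * Xb.var(dim=0, unbiased=True, keepdim=True)

# eval step: normalise new data with the stored estimates, not with its own statistics
Xtest = torch.randn(3, 4)
Xbn_test = (Xtest - running_μ) / (running_ν + ε)**0.5   # then scale and shift with the learnt ɣ, β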
Layer normalisation: applied to linear, CNN, and RNN layers.
Linear layer: consider an \(N \times D\) training batch matrix, with \(N\) data points and \(D\) features. We standardise all the features of a single training/test instance with a single \(\mu\) and \(\sigma\) computed from the features of that instance. We compute \(N\) \(\mu\)s and \(\sigma^2\)s, but we don't have to keep track of them during testing.
ε = 1e-5
X = torch.randn(10, 4)
μ = X.mean(dim=1, keepdim=True)                    # 10 x 1
ν = X.var(dim=1, unbiased=False, keepdim=True)     # 10 x 1
Xstd = (X - μ) / (ν + ε)**0.5                      # 10 x 4
ɣ = torch.ones(1, 4)                               # elementwise scale over the features
β = torch.zeros(1, 4)                              # elementwise shift over the features
Xln = ɣ * Xstd + β                                 # ɣ, β -> 1 x 4; scale and shift
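This should agree with nn.LayerNorm over the feature dimension (a sketch with the default initialisation):

ln = torch.nn.LayerNorm(4, eps=ε)                  # normalise over the last dim; ɣ = 1, β = 0 at init
with torch.no_grad():
    print(torch.allclose(ln(X), Xln, atol=1e-5))   # expect True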
CNN: for CNNs we compute one mean and standard deviation per image across all channels and spatial locations. We will have \(N\) \(\mu\)s and \(\sigma\)s.
ε = 1e-5
X = torch.randn(10, 4, 128, 64)                         # 10 imgs, 4 channels, H - 128, W - 64
μ = X.mean(dim=[1, 2, 3], keepdim=True)                 # shape [10, 1, 1, 1]
ν = X.var(dim=[1, 2, 3], unbiased=False, keepdim=True)  # shape [10, 1, 1, 1]
Xstd = (X - μ) / (ν + ε)**0.5                           # 10 x 4 x 128 x 64
ɣ = torch.ones(4, 128, 64)                              # elementwise scale
β = torch.zeros(4, 128, 64)                             # elementwise shift
Xln = ɣ * Xstd + β                                      # ɣ, β -> 4 x 128 x 64; scale and shift
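The same check for the CNN case, normalising over \(C \times H \times W\) per image (a sketch):

ln2 = torch.nn.LayerNorm([4, 128, 64], eps=ε)
with torch.no_grad():
    print(torch.allclose(ln2(X), Xln, atol=1e-5))   # expect True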
For RNNs, unlike CNNs, we don't normalise over the time dimension; statistics are computed over the feature dimension independently at each time step, as sketched below.
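A minimal sketch for a sequence batch of shape \(N \times T \times D\) (the shapes here are illustrative):

Xseq = torch.randn(10, 16, 512)                        # N = 10 sequences, T = 16 steps, D = 512 features
μ = Xseq.mean(dim=-1, keepdim=True)                    # shape [10, 16, 1]: one mean per instance per time step
ν = Xseq.var(dim=-1, unbiased=False, keepdim=True)     # shape [10, 16, 1]
Xseq_ln = (Xseq - μ) / (ν + ε)**0.5                    # the time dimension is left untouched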
Instance normalisation: applied to CNN layers. Not applicable to input tensors with fewer than 3 dimensions.
CNN: for CNNs we compute one mean and standard deviation per image per channel. We will have \(N \times C\) \(\mu\)s and \(\sigma\)s.
ε = 1e-5
X = torch.randn(10, 4, 128, 64)                      # 10 imgs, 4 channels, H - 128, W - 64
μ = X.mean(dim=[2, 3], keepdim=True)                 # shape [10, 4, 1, 1]
ν = X.var(dim=[2, 3], unbiased=False, keepdim=True)  # shape [10, 4, 1, 1]
Xstd = (X - μ) / (ν + ε)**0.5                        # 10 x 4 x 128 x 64
ɣ = torch.ones(1, 4, 1, 1)                           # one scale per channel
β = torch.zeros(1, 4, 1, 1)                          # one shift per channel
Xin = ɣ * Xstd + β                                   # ɣ, β -> 1 x 4 x 1 x 1; scale and shift
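Sanity check against nn.InstanceNorm2d, which has no affine parameters by default, so it should match Xstd (and Xin, since ɣ = 1, β = 0 above); a sketch:

inorm = torch.nn.InstanceNorm2d(4, eps=ε)              # affine=False by default
with torch.no_grad():
    print(torch.allclose(inorm(X), Xstd, atol=1e-5))   # expect True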
Group normalisation: applied to CNN layers. Not applicable to input tensors with fewer than 3 dimensions.
CNN: for CNNs we compute one mean and standard deviation per image for each group of channels. Say we have \(G\) groups with \(C'\) channels per group (\(C = GC'\)); we will have \(N \times G\) \(\mu\)s and \(\sigma\)s. A minimal sketch follows.
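Here is a sketch with \(G = 2\) groups of \(C' = 2\) channels each (the group count is illustrative), plus a check against nn.GroupNorm, which keeps one ɣ and β per channel:

ε = 1e-5
X = torch.randn(10, 4, 128, 64)                          # 10 imgs, 4 channels, H - 128, W - 64
G = 2
Xg = X.reshape(10, G, 2, 128, 64)                        # split the 4 channels into 2 groups of 2
μ = Xg.mean(dim=[2, 3, 4], keepdim=True)                 # shape [10, 2, 1, 1, 1]: one mean per image per group
ν = Xg.var(dim=[2, 3, 4], unbiased=False, keepdim=True)  # shape [10, 2, 1, 1, 1]
Xstd = ((Xg - μ) / (ν + ε)**0.5).reshape(10, 4, 128, 64)
ɣ = torch.ones(1, 4, 1, 1)                               # one scale per channel
β = torch.zeros(1, 4, 1, 1)                              # one shift per channel
Xgn = ɣ * Xstd + β

gn = torch.nn.GroupNorm(num_groups=G, num_channels=4, eps=ε)
with torch.no_grad():
    print(torch.allclose(gn(X), Xgn, atol=1e-5))         # expect True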