Gaussian Error Linear Unit (GeLU)#

Definition and Formula#

The GeLU activation function is defined as:

\(\text{GeLU}(x) = x \cdot \Phi(x)\)

where \(\Phi(x)\) is the cumulative distribution function (CDF) of the standard normal distribution:

\(\Phi(x) = \frac{1}{2}[1 + \text{erf}(\frac{x}{\sqrt{2}})]\)
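
As a minimal sketch (the helper name gelu_exact is my own), the exact form can be evaluated directly with torch.erf and checked against PyTorch's built-in F.gelu:

import torch
import torch.nn.functional as F

def gelu_exact(x: torch.Tensor) -> torch.Tensor:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), the standard normal CDF
    phi = 0.5 * (1.0 + torch.erf(x / torch.sqrt(torch.tensor(2.0))))
    return x * phi

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(gelu_exact(x))   # manual x * Phi(x)
print(F.gelu(x))       # PyTorch's built-in (erf-based) GeLU for comparison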

Practical Approximation#

For computational efficiency, GeLU can be approximated as:

\(\text{GeLU}(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])\)
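
A short sketch of the tanh approximation, checked against the exact form; note that the approximate='tanh' argument to F.gelu is available in recent PyTorch releases.

import math
import torch
import torch.nn.functional as F

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

x = torch.linspace(-4, 4, 9)
exact = F.gelu(x)                                                    # exact, erf-based GeLU
print(torch.max(torch.abs(exact - gelu_tanh(x))))                    # tiny deviation of the manual formula
print(torch.max(torch.abs(exact - F.gelu(x, approximate='tanh'))))   # built-in tanh approximation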

Intuition Behind GeLU#

In GeLU, the output is \(x \Phi(x)\): the input \(x\) is weighted by \(\Phi(x)\), the probability that a standard normal random variable takes a value less than \(x\). The weighting therefore depends on both the magnitude and the sign of \(x\).

Key Interactions#

Large positive \(x\)#

  • \(\Phi(x) \approx 1\), so \(x\) is kept almost as-is

  • These values are important and should be retained

  • Behavior similar to ReLU in this region

Small positive \(x\)#

  • \(\Phi(x)\) lies between 0.5 and 1, so \(x\) is scaled down

  • These values are less critical but still contribute

  • Provides smooth transition unlike ReLU

Large negative \(x\)#

  • \(\Phi(x) \approx 0\), so \(x\) is suppressed

  • Reduces large negative contributions

  • Helps prevent large negative pre-activations from destabilizing later layers

Near-zero \(x\)#

  • The function transitions smoothly between suppressing and passing inputs

  • Ensures differentiability everywhere

  • Avoids the sharp kink that ReLU has at \(x = 0\)

Note: GeLU is a deterministic function. There is no random sampling happening. The key idea is that input values are modulated according to their magnitude and sign, using the CDF of the standard normal distribution.
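
To make the modulation concrete, here is a small sketch (with arbitrarily chosen inputs) that prints \(\Phi(x)\) and \(x \cdot \Phi(x)\) for values in each of the regions above:

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
phi = 0.5 * (1.0 + torch.erf(x / torch.sqrt(torch.tensor(2.0))))  # standard normal CDF

for xi, pi_, gi in zip(x.tolist(), phi.tolist(), F.gelu(x).tolist()):
    print(f"x = {xi:+.1f}   Phi(x) = {pi_:.4f}   GeLU(x) = {gi:+.4f}")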

Comparison with Common Activation Functions#

ReLU#

\(\text{ReLU}(x) = \max(0, x)\)

  • Most commonly used activation function

  • Zero output for negative inputs

  • Linear for positive inputs

  • Non-differentiable at \(x = 0\)
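
A quick illustrative sketch: autograd reports a zero gradient for ReLU at exactly \(x = 0\) (a convention, since the true derivative is undefined there).

import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1.]): the gradient at x = 0 is set to 0 by convention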

Leaky ReLU#

\(\text{Leaky ReLU}(x) = \max(\alpha x, x)\), where \(\alpha\) is typically 0.01

  • Modified version of ReLU

  • Small positive slope for negative inputs

  • Still non-differentiable at \(x = 0\)

  • Fixed slope parameter \(\alpha\)
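
A brief sketch using F.leaky_relu with the default slope \(\alpha = 0.01\):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0])
print(F.leaky_relu(x, negative_slope=0.01))  # negative inputs are scaled by alpha = 0.01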

Sigmoid#

\(\sigma(x) = \frac{1}{1 + e^{-x}}\)

  • Classic activation function

  • Output range (0, 1)

  • Smooth and differentiable

  • Suffers from vanishing gradients

  • Outputs are not zero-centered
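
A minimal sketch of the vanishing-gradient issue: the sigmoid derivative \(\sigma(x)(1 - \sigma(x))\) peaks at 0.25 and collapses toward zero for inputs of large magnitude.

import torch

x = torch.tensor([-10.0, -5.0, 0.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # roughly 4.5e-05 at |x| = 10, 0.25 at x = 0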

Tanh#

\(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)

  • Scaled and shifted sigmoid

  • Output range (-1, 1)

  • Zero-centered

  • Still has vanishing gradients

  • Higher computational cost than ReLU
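
A quick sketch contrasting the output centering of tanh and sigmoid over a symmetric input range:

import torch

x = torch.linspace(-4, 4, 101)
print(torch.tanh(x).mean())     # ~0: zero-centered outputs
print(torch.sigmoid(x).mean())  # ~0.5: outputs shifted toward the positive side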

Swish#

\(\text{Swish}(x) = x \cdot \sigma(\beta x)\)

  • Similar form to GeLU

  • Uses learnable parameter \(\beta\)

  • Non-monotonic like GeLU

  • More complex to implement

  • The extra learnable parameter adds a small risk of overfitting
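
A short sketch of Swish with a learnable \(\beta\) (the module below is my own illustration, not a reference implementation); for \(\beta = 1\) it coincides with PyTorch's built-in SiLU.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))  # learnable slope parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

x = torch.linspace(-3, 3, 7)
print(Swish()(x))   # Swish with beta initialised to 1
print(F.silu(x))    # SiLU, i.e. Swish with fixed beta = 1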

ELU#

\(\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\)

  • Exponential Linear Unit

  • Smooth alternative to ReLU

  • Has negative values unlike ReLU

  • Parameter \(\alpha\) sets the saturation value for negative inputs (\(\text{ELU}(x) \to -\alpha\) as \(x \to -\infty\))

  • More expensive computation for negative inputs
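
A brief sketch using F.elu with the default \(\alpha = 1.0\); the negative side saturates at \(-\alpha\) rather than following a fixed slope.

import torch
import torch.nn.functional as F

x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])
print(F.elu(x, alpha=1.0))  # negative inputs approach -alpha = -1; positive inputs pass through unchanged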

Why GeLU is Superior to Common Activation Functions#

1. Advantages over ReLU#

Differentiability#

  • ReLU Problem: Non-differentiable at \(x = 0\), causing potential gradient instability

  • GeLU Solution: Smooth transition around zero, ensuring stable gradient flow

  • Impact: Better training stability and convergence
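
A small sketch comparing gradients near zero: GeLU's derivative varies smoothly around 0.5, while ReLU's jumps from 0 to 1.

import torch
import torch.nn.functional as F

x = torch.tensor([-0.10, -0.01, 0.0, 0.01, 0.10], requires_grad=True)
F.gelu(x).sum().backward()
print(x.grad)  # smooth values close to 0.5 on both sides of zero

x = torch.tensor([-0.10, -0.01, 0.0, 0.01, 0.10], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)  # hard jump: 0 for x <= 0, 1 for x > 0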

Dead Neurons#

  • ReLU Problem: Neurons can permanently "die" when their pre-activations stay negative, because the gradient there is exactly zero

  • GeLU Solution: Smooth, CDF-based scaling of negative inputs keeps a small gradient flowing, so neurons remain partially active

  • Impact: More robust network capacity utilization
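
A sketch of the dead-neuron contrast on purely negative inputs: ReLU passes back no gradient at all, while GeLU still passes a small one.

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -2.0, -1.0, -0.5], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)  # all zeros: no learning signal reaches these units

x = torch.tensor([-3.0, -2.0, -1.0, -0.5], requires_grad=True)
F.gelu(x).sum().backward()
print(x.grad)  # small non-zero gradients keep the units trainable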

2. Advantages over Sigmoid#

Gradient Flow#

  • Sigmoid Problem: Severe vanishing gradients for inputs of large magnitude (both positive and negative)

  • GeLU Solution: Linear behavior for large positive values, gradual suppression for negative values

  • Impact: Faster training and better convergence
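
The code below visualizes these points: the standard normal CDF \(\Phi(x)\), the product \(x \cdot \Phi(x)\), its derivative, and PyTorch's built-in GeLU together with its derivative.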

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

torch.manual_seed(47)
plt.style.use('dark_background')

# Define the range and standard normal CDF
x = torch.linspace(-4, 4, 500, requires_grad=True)
phi_x = 0.5 * (1 + torch.erf(x / torch.sqrt(torch.tensor(2.0))))  # CDF of x

# Compute GeLU using PyTorch's built-in function
x_input = torch.linspace(-4, 4, 500, requires_grad=True)
gelu_x = F.gelu(x_input)

# Compute the gradient (derivative) of GeLU
gelu_x.backward(torch.ones_like(x_input))
gelu_x_derivative = x_input.grad.clone()

# Compute x * Phi(x) and its derivative
x_cdf = x * phi_x
x_cdf.backward(torch.ones_like(x))
x_cdf_derivative = x.grad.clone()

# Plotting
fig, axs = plt.subplots(2, 2, figsize=(8, 6), dpi=300)

# Plot 1: CDF of Standard Normal Variate (Phi(x))
axs[0, 0].plot(x.detach().numpy(), phi_x.detach().numpy(), color='green')
axs[0, 0].set_title(r"CDF of Standard Normal Variate ($\Phi(x)$)")
axs[0, 0].set_xlabel("x")
axs[0, 0].set_ylabel(r"$\Phi(x)$")
axs[0, 0].grid(alpha=0.3)

# Plot 2: x * Phi(x)
axs[0, 1].plot(x.detach().numpy(), x_cdf.detach().numpy(), color='blue')
axs[0, 1].set_title(r"$x \cdot \Phi(x)$")
axs[0, 1].set_xlabel("x")
axs[0, 1].set_ylabel(r"$x \cdot \Phi(x)$")
axs[0, 1].grid(alpha=0.3)

# Plot 3: Derivative of x * Phi(x)
axs[1, 0].plot(x.detach().numpy(), x_cdf_derivative.detach().numpy(), color='orange')
axs[1, 0].set_title(r"Derivative of $x \cdot \Phi(x)$")
axs[1, 0].set_xlabel("x")
axs[1, 0].set_ylabel(r"$\frac{d}{dx} \left[ x \cdot \Phi(x) \right]$")
axs[1, 0].grid(alpha=0.3)

# Plot 4: GeLU and Its Derivative (using x_input)
axs[1, 1].plot(x_input.detach().numpy(), gelu_x.detach().numpy(), label="GeLU (x)", color='blue')
axs[1, 1].plot(x_input.detach().numpy(), gelu_x_derivative.detach().numpy(), label="GeLU Derivative (x)", color='red', linestyle='--')
axs[1, 1].set_title(r"GeLU ($x$) and Its Derivative")
axs[1, 1].set_xlabel("x")
axs[1, 1].set_ylabel("y")
axs[1, 1].legend()
axs[1, 1].grid(alpha=0.3)

# Adjust layout
plt.tight_layout()
plt.show()
[Figure: four panels showing \(\Phi(x)\), \(x \cdot \Phi(x)\), the derivative of \(x \cdot \Phi(x)\), and GeLU with its derivative]