Understanding Cumulative Distribution Functions (CDF) and Their Empirical Estimation

Understanding Cumulative Distribution Functions (CDF) and Their Empirical Estimation#

Introduction to CDFs#

Before diving into empirical distributions, it’s essential to understand what a cumulative distribution function represents and why it’s useful in statistical analysis. A CDF helps us understand the complete probability distribution of a random variable by showing the probability of the variable being less than or equal to any given value.

CDF Definition and Intuition#

The CDF \(F_X(x)\) of a random variable \(X\) gives the probability that \(X\) is less than or equal to \(x\):

\(F_X(x) = P(X \leq x)\)

To build intuition:

If we’re measuring heights in a population, \(F_X(170)\) would tell us the proportion of people who are 170 cm or shorter
For test scores, \(F_X(80)\) would represent the fraction of students scoring 80 or below
In reliability analysis, \(F_X(t)\) might represent the probability of a component failing by time \(t\)

CDF for Discrete Random Variables#

For a discrete random variable \(X\) with probability mass function (PMF) \(p_X(x)\), the CDF is the cumulative sum of the probabilities up to \(x\):

\(F_X(x) = \sum_{x_i \leq x} p_X(x_i)\)

Example: Rolling a Die#

Consider rolling a fair six-sided die:

\(P(X \leq 3) = P(X = 1) + P(X = 2) + P(X = 3) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}\)
The CDF would jump by \(\frac{1}{6}\) at each value from 1 to 6

CDF for Continuous Random Variables#

For a continuous random variable \(X\) with probability density function (PDF) \(f_X(x)\), the CDF is the integral of the PDF from \(-\infty\) to \(x\):

\(F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt\)

Example: Normal Distribution#

For a standard normal distribution:

The CDF at \(x = 0\) is 0.5, meaning 50% of the data lies below zero
The CDF at \(x = 1\) is approximately 0.84, indicating about 84% of the data lies below 1

Properties of CDF#

The CDF has several important properties that make it useful for statistical analysis:

Non-decreasing:
- The CDF is always non-decreasing: \(F_X(x) \leq F_X(y)\) for \(x \leq y\)
- This makes intuitive sense as accumulating probabilities can’t decrease
Limits:
- \(\lim_{x \to -\infty} F_X(x) = 0\): The probability of getting an infinitely small value is zero
- \(\lim_{x \to \infty} F_X(x) = 1\): The probability of getting any value is 1 (certainty)
Right-continuity:
- \(\lim_{\epsilon \to 0^+} F_X(x + \epsilon) = F_X(x)\)
- This ensures smooth behavior when approaching values from the right
- Particularly important for discrete distributions where jumps occur
Range:
- \(0 \leq F_X(x) \leq 1\) for all \(x\)
- This follows from the basic properties of probability

Empirical CDF (ECDF)#

The empirical CDF is a practical tool for estimating the true CDF when we only have sample data. Given a sample \(X_1, X_2, \dots, X_n\), the ECDF \(\hat{F}_n(x)\) is defined as:

\(\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x)\)

Where:

\(\mathbb{I}(X_i \leq x)\) is the indicator function
\(n\) is the sample size
\(\hat{F}_n(x)\) represents the proportion of data points ≤ \(x\)

Example: Computing ECDF#

For the dataset {1, 2, 2, 3, 4, 4, 5}:

\(\hat{F}_n(2) = \frac{3}{7}\) (3 values ≤ 2)
\(\hat{F}_n(4) = \frac{6}{7}\) (6 values ≤ 4)
The function jumps by \(\frac{1}{7}\) at each unique value

Practical Applications of ECDF#

Non-parametric Estimation
- Provides distribution estimates without assuming a specific form
- Useful when the underlying distribution is unknown
- More robust than parametric methods
Goodness-of-fit Testing
- Kolmogorov-Smirnov test compares ECDF to theoretical CDF
- Helps validate distribution assumptions
- Example: Testing if data follows a normal distribution
Quantile Estimation
- Find sample percentiles directly from ECDF
- Median: Value where ECDF = 0.5
- Quartiles: Values where ECDF = 0.25, 0.75
Anomaly Detection
- Identify outliers using ECDF percentiles
- Points in extreme tails (e.g., ECDF > 0.99) might be anomalies
- More robust than parametric methods

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('dark_background')
np.random.seed(47)

# Generate data
mean = 0
std = 2
sample_size = 500
sample = np.random.normal(loc=mean, scale=std, size=sample_size)

# Define support range (empirical)
support = (min(sample) - 0.5, max(sample) + 0.5)

# Number of points to calculate the relative frequency for
num_points = 20
points = np.linspace(support[0], support[1], num_points)

# Calculate relative frequencies
cumulative_reqlative_frequencies = []
for point in points:
    cumulative_reqlative_frequencies.append(
        (sample <= point).sum() / sample_size
    ) 

# Plot the cumulative relative frequencies vs points
plt.figure(figsize=(4, 3), dpi=300)
plt.step(points, cumulative_reqlative_frequencies, 
         where='mid', color='magenta')
plt.xlabel('Values')
plt.ylabel('Cumulative Relative Frequency')
plt.title('ECDF')
plt.grid(True, alpha=0.5, linestyle='--')
plt.show()

../../_images/6b98f5091f42edba9a39853f820f6b7d735bdede08adbf01bab980ef5bbe439b.png