Understand why a naively computed exponentially weighted variance does not accurately estimate the true variance.
Author: Adrian Letchford Translator: stmoonar
You are likely familiar with the method of calculating averages using exponential weighting. The idea is to assign higher weights to more recent observations. For a sequence $x_1, x_2, \dots, x_n$ (ordered from oldest to newest), the weights are:

$$w_i = (1 - \alpha)^{\,n - i}$$

For $0 < \alpha \le 1$, the exponentially weighted average of the sequence is:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
In Pandas, you can easily calculate the exponentially weighted moving average using the following method:
df.ewm(alpha=0.1).mean()
If you calculate the exponentially weighted average yourself, you will find that it matches the results given by Pandas (a quick check is sketched below). However, we will soon see that if we try to calculate the variance in the same way, we get a very poor estimate. This is known as estimation bias.
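Here is a minimal sketch of that check; the small example series and alpha = 0.1 are made-up values, and the weights follow the formula above:

import numpy as np
import pandas as pd

# A small made-up example series
df = pd.DataFrame({'data': [1.0, 2.0, 0.5, 3.0, 2.5]})
alpha = 0.1

# Weights (1 - alpha)^i, with i counted back from the most recent observation
n = len(df)
w = (1 - alpha) ** np.arange(n - 1, -1, -1)

# Exponentially weighted average of the whole series, straight from the formula
manual = np.sum(w * df['data'].values) / np.sum(w)

# Pandas' exponentially weighted mean; the last value uses the full series
pandas_value = df['data'].ewm(alpha=alpha).mean().iloc[-1]

print(manual, pandas_value)  # the two numbers agree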
What is bias?#
The bias of an estimator refers to the difference between the expected value of the estimator and the true value of the parameter being estimated (in this case, the variance). A biased estimate has a non-zero difference from the true value, while an unbiased estimate has a zero difference from the true value.
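In symbols (using $\hat{\theta}$ for the estimator and $\theta$ for the true parameter, notation not used elsewhere in the article):

$$\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

An unbiased estimator is one for which this quantity is zero.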
Let's try to measure the variance and see what happens. The variance of a random variable $X$ is:

$$\operatorname{Var}[X] = E\!\left[(X - E[X])^2\right]$$
If we have a sample of $n$ values $x_1, x_2, \dots, x_n$ of $X$, we can try to estimate the variance by replacing the outer expectation with an average over the sample:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - E[X])^2$$

Then we replace $E[X]$ with the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
We can write a simple simulation in Python to see how good our estimator is.
from scipy.stats import norm
num_simulations = 10_000_000
n = 5 # Sample size
# This creates an array where each row is a simulation and the columns are sample values.
simulations = norm.rvs(loc=0, scale=1, size=(num_simulations, n))
# From this, we get the estimated mean for each simulation.
avg = simulations.mean(axis=1).reshape(-1, 1)
# Use our estimator to estimate the variance for each simulation.
estimates = ((simulations - avg)**2).sum(1) / n
Translator's Note: The code uses the `norm.rvs` method to randomly generate sample values from a normal distribution, where `loc=0` specifies a mean of 0, `scale=1` specifies a standard deviation of 1, and `size=(num_simulations, n)` specifies that the generated array has `num_simulations` rows, each containing `n` sample values.
Now we have 10 million variance estimates, and the true variance is 1. So what is our average estimate? Calculate the mean of the estimates:
estimates.mean()
It yields about 0.8! Our estimate is biased by 0.2. This is bias!
Where does bias come from?#
Let's return to the definition of variance and how we converted it into a sample estimate. To calculate our estimate, we replaced the population mean $E[X]$ with the sample mean:

$$E[X] \;\longrightarrow\; \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This is where the bias arises. The sample mean ($\bar{x}$) will be closer to the samples than the population mean ($E[X]$) is, which can be understood more intuitively through the following example.
Figure 1 shows an example with 100 random points centered at 0, from which 5 points are randomly selected; their mean is marked by a black cross. The mean of these 5 samples is the single point with the smallest total squared distance to them. By definition, the sample mean sits closer to the sample than the population mean does. Therefore:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \;\le\; \frac{1}{n}\sum_{i=1}^{n}(x_i - E[X])^2$$

Our estimate $\hat{\sigma}^2$ will underestimate the variance of $X$!
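To see this on a single concrete sample, here is a tiny sketch (the sample size and random seed are arbitrary) comparing the squared deviations around the sample mean with those around the true mean of 0:

import numpy as np
from scipy.stats import norm

# Draw one sample of 5 points from a standard normal (true mean 0)
x = norm.rvs(loc=0, scale=1, size=5, random_state=42)

# Sum of squared deviations around the sample mean vs. the population mean
around_sample_mean = np.sum((x - x.mean())**2)
around_true_mean = np.sum((x - 0)**2)

print(around_sample_mean <= around_true_mean)  # always True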
In fact, if we repeat the above Python simulation but use 0 (the population mean) instead of the sample mean:
avg = 0
Then the average of the variance estimates will be approximately 1:
estimates.mean()
If we know the population mean, we can obtain an unbiased estimate of the variance from a set of samples. But in reality, we do not know the population mean. Fortunately, we can quantify the bias and correct it.
Quantifying bias#
So far, we have seen that $\hat{\sigma}^2$ is a biased estimate of the population variance $\sigma^2$. We discovered this by simulating many samples, computing $\hat{\sigma}^2$ for each, and taking the average. The simulation results indicate that, for $n = 5$ and $\sigma^2 = 1$:

$$E[\hat{\sigma}^2] \approx 0.8$$
Now we want to get rid of the simulation and calculate the exact value of $E[\hat{\sigma}^2]$. We can achieve this by expanding it out. We start from:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$$
We can let $\mu = E[X]$ and write $x_i - \bar{x} = (x_i - \mu) - (\bar{x} - \mu)$, which means we can expand it to:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}\big((x_i - \mu) - (\bar{x} - \mu)\big)^2\right]$$
Through some algebra, we can expand the square:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \;-\; \frac{2(\bar{x} - \mu)}{n}\sum_{i=1}^{n}(x_i - \mu) \;+\; (\bar{x} - \mu)^2\right]$$
Now, note that:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu) = \bar{x} - \mu$$
This means:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \;-\; (\bar{x} - \mu)^2\right] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\right] - E\!\left[(\bar{x} - \mu)^2\right]$$
The key point here is:

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\right] = \sigma^2 \qquad \text{and} \qquad E\!\left[(\bar{x} - \mu)^2\right] = \operatorname{Var}[\bar{x}]$$
This means:

$$E[\hat{\sigma}^2] = \sigma^2 - \operatorname{Var}[\bar{x}]$$
The term $\operatorname{Var}[\bar{x}]$ is the variance of the sample mean. According to Bienaymé's identity, we know it equals:

$$\operatorname{Var}[\bar{x}] = \operatorname{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}[x_i] = \frac{\sigma^2}{n}$$
This gives us:

$$E[\hat{\sigma}^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$
Recall our Python simulation: with a sample size of $n = 5$ and a true variance of $\sigma^2 = 1$, the average estimate was about $0.8$. If we plug $n = 5$ and $\sigma^2 = 1$ into the formula above, we get exactly that biased answer:

$$\frac{n-1}{n}\,\sigma^2 = \frac{4}{5}\times 1 = 0.8$$
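Both the Bienaymé step and the final formula can be checked numerically. The sketch below reuses the same simulation idea as above (standard normal samples, so $\sigma^2 = 1$) for a few sample sizes:

from scipy.stats import norm

# Check Var of the sample mean and the average biased estimate for a few sample sizes
for m in [2, 5, 10]:
    sims = norm.rvs(loc=0, scale=1, size=(1_000_000, m))
    means = sims.mean(axis=1)

    # Bienaymé: the variance of the sample mean should be close to 1 / m
    print(m, means.var(), 1 / m)

    # The biased estimator should average close to (m - 1) / m
    biased = ((sims - means.reshape(-1, 1))**2).sum(axis=1) / m
    print(m, biased.mean(), (m - 1) / m)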
Unbiased estimator#
Now that we know the exact value of $E[\hat{\sigma}^2]$, we can find a way to correct $\hat{\sigma}^2$ so that it becomes an unbiased estimate of $\sigma^2$.
The correction factor is:

$$\frac{n}{n-1}$$
Through calculation, we can see:

$$E\!\left[\frac{n}{n-1}\,\hat{\sigma}^2\right] = \frac{n}{n-1}\cdot\frac{n-1}{n}\,\sigma^2 = \sigma^2$$
Thus, the unbiased estimate of $\sigma^2$ obtained from a set of $n$ samples is:

$$\hat{\sigma}^2_{\text{unbiased}} = \frac{n}{n-1}\,\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
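Applying this correction to the earlier simulation brings the average estimate back to about 1; NumPy's `ddof=1` option applies the same $\frac{n}{n-1}$ correction internally. A quick sketch reusing the `simulations` array from the earlier code:

# The biased estimator (divide by n) averages to about 0.8 for n = 5
print(simulations.var(axis=1, ddof=0).mean())

# Dividing by n - 1 instead (ddof=1) removes the bias; this averages to about 1
print(simulations.var(axis=1, ddof=1).mean())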
Unbiased weighted estimator#
Now, we will extend the above to the case of weighted samples. The weighted sample mean is:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
The weighted variance is:

$$\hat{\sigma}^2_w = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} w_i}$$
Following the same expansion process as before, we find:

$$E[\hat{\sigma}^2_w] = \sigma^2 - \operatorname{Var}[\bar{x}]$$
The variance of the weighted mean is:

$$\operatorname{Var}[\bar{x}] = \operatorname{Var}\!\left[\frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}\right] = \frac{\sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}\,\sigma^2$$

where Bienaymé's identity is again used to calculate the variance of the sum.
Thus, we obtain:

$$E[\hat{\sigma}^2_w] = \sigma^2 - \frac{\sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}\,\sigma^2 = \frac{\left(\sum_{i=1}^{n} w_i\right)^2 - \sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}\,\sigma^2$$
The bias correction term is:

$$\frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\left(\sum_{i=1}^{n} w_i\right)^2 - \sum_{i=1}^{n} w_i^2}$$
This means the unbiased weighted estimate of the variance is:

$$\hat{\sigma}^2_{w,\text{unbiased}} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\left(\sum_{i=1}^{n} w_i\right)^2 - \sum_{i=1}^{n} w_i^2} \cdot \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} w_i}$$

As a sanity check, setting every $w_i = 1$ makes the correction term $\frac{n^2}{n^2 - n} = \frac{n}{n-1}$, recovering the unweighted result.
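The unbiasedness of this weighted estimator can also be checked with a quick simulation; the weights below are arbitrary made-up values, and the samples are standard normal so that $\sigma^2 = 1$:

import numpy as np
from scipy.stats import norm

samples = norm.rvs(loc=0, scale=1, size=(1_000_000, 5))
w = np.array([0.1, 0.2, 0.4, 0.8, 1.6])  # arbitrary positive weights

# Weighted sample mean of each simulated sample
wmean = (samples * w).sum(axis=1, keepdims=True) / w.sum()

# Biased weighted variance estimates
biased = (w * (samples - wmean)**2).sum(axis=1) / w.sum()

# Bias correction term from the formula above
correction = w.sum()**2 / (w.sum()**2 - (w**2).sum())

print(biased.mean())                 # noticeably less than 1
print((correction * biased).mean())  # approximately 1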
Reproducing Pandas Exponentially Weighted Variance#
Now, we have all the tools needed to reproduce the Pandas exponentially weighted variance.
import numpy as np
import pandas as pd
N = 1000
# Create some fake data
df = pd.DataFrame()
df['data'] = np.random.randn(N)
# Set the halflife for EWM and convert it to alpha for calculation.
halflife = 10
a = 1 - np.exp(-np.log(2)/halflife) # alpha
# This is Pandas' ewm
df['var_pandas'] = df.ewm(alpha=a).var()
# Initialize variable
varcalc = np.zeros(len(df))
# Calculate exponentially moving variance
for i in range(0, N):
x = df['data'].iloc[0:i+1].values
# Weights
n = len(x)
w = (1-a)**np.arange(n-1, -1, -1) # Reverse order to match the sequence order
# Calculate exponentially moving average
ewma = np.sum(w * x) / np.sum(w)
# Calculate bias correction term
bias = np.sum(w)**2 / (np.sum(w)**2 - np.sum(w**2))
# Calculate exponentially moving variance with bias correction
varcalc[i] = bias * np.sum(w * (x - ewma)**2) / np.sum(w)
df['var_calc'] = varcalc
This gives us a DataFrame like the one below:
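As a small added check (not part of the code above), the two columns can also be compared numerically; the first value is NaN in both, so the comparison starts from the second row:

# A single observation gives no variance estimate, so both columns start with NaN;
# comparing from the second row onwards should print True
print(np.allclose(df['var_pandas'].iloc[1:], df['var_calc'].iloc[1:]))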