Understand why a naively computed exponentially weighted variance does not accurately estimate the true variance.
Author: Adrian Letchford Translator: stmoonar
You are likely familiar with the method of calculating averages using exponential weighting. The idea is to assign higher weights to more recent observations. For a sequence $x_1, x_2, \dots, x_n$ (ordered from oldest to newest), the weights are:

$$w_i = (1 - \alpha)^{\,n - i}$$

For $0 < \alpha \le 1$, the exponentially weighted average of the sequence is:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
In Pandas, you can easily calculate the exponentially weighted moving average using the following method:
df.ewm(alpha=0.1).mean()
If you calculate the exponentially weighted average yourself, you will find that it matches the results given by Pandas (a quick check is sketched below). However, we will soon see that if we try to calculate the variance in the same way, we get a very poor estimate. This is known as estimation bias.
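Here is a minimal sketch of that check; the small example series and alpha = 0.1 are made-up values, and the weights follow the formula above:

import numpy as np
import pandas as pd

# A small made-up example series
df = pd.DataFrame({'data': [1.0, 2.0, 0.5, 3.0, 2.5]})
alpha = 0.1

# Weights (1 - alpha)^i, with i counted back from the most recent observation
n = len(df)
w = (1 - alpha) ** np.arange(n - 1, -1, -1)

# Exponentially weighted average of the whole series, straight from the formula
manual = np.sum(w * df['data'].values) / np.sum(w)

# Pandas' exponentially weighted mean; the last value uses the full series
pandas_value = df['data'].ewm(alpha=alpha).mean().iloc[-1]

print(manual, pandas_value)  # the two numbers agree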
What is bias?#
The bias of an estimator refers to the difference between the expected value of the estimator and the true value of the parameter being estimated (in this case, the variance). A biased estimate has a non-zero difference from the true value, while an unbiased estimate has a zero difference from the true value.
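In symbols (using $\hat{\theta}$ for the estimator and $\theta$ for the true parameter, notation not used elsewhere in the article):

$$\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

An unbiased estimator is one for which this quantity is zero.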
Let's try to measure the variance and see what happens. The variance of a random variable $X$ is:

$$\operatorname{Var}[X] = E\!\left[(X - E[X])^2\right]$$
If we have a sample of $n$ values $x_1, x_2, \dots, x_n$ of $X$, we can try to estimate the variance by replacing the outer expectation with an average over the sample:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - E[X])^2$$

Then we replace $E[X]$ with the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
We can write a simple simulation in Python to see how good our estimator is.
from scipy.stats import norm
num_simulations = 10_000_000
n = 5 # Sample size
# This creates an array where each row is a simulation and the columns are sample values.
simulations = norm.rvs(loc=0, scale=1, size=(num_simulations, n))
# From this, we get the estimated mean for each simulation.
avg = simulations.mean(axis=1).reshape(-1, 1)
# Use our estimator to estimate the variance for each simulation.
estimates = ((simulations - avg)**2).sum(1) / n
Translator's Note: The code uses the `norm.rvs` method to randomly generate sample values from a normal distribution, where `loc=0` specifies a mean of 0, `scale=1` specifies a standard deviation of 1, and `size=(num_simulations, n)` specifies that the generated array has `num_simulations` rows, each containing `n` sample values.
Now we have 10 million variance estimates, and the true variance is 1. So what is our average estimate? Calculate the mean of the estimates:
estimates.mean()
It yields about 0.8! Our estimate is biased by 0.2. This is bias!
Where does bias come from?#
Let's return to the definition of variance and how we converted it into a sample estimate. To calculate our estimate, we replaced the population mean $E[X]$ with the sample mean:

$$E[X] \;\longrightarrow\; \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This is where the bias arises. The sample mean ($\bar{x}$) will be closer to the samples than the population mean ($E[X]$) is, which can be understood more intuitively through the following example.
Figure 1 shows an example with 100 random points centered at 0, from which 5 points are randomly selected; their mean is marked by a black cross. The mean of these 5 samples is the single point with the smallest total squared distance to them. By definition, the sample mean sits closer to the sample than the population mean does. Therefore:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \;\le\; \frac{1}{n}\sum_{i=1}^{n}(x_i - E[X])^2$$

Our estimate $\hat{\sigma}^2$ will underestimate the variance of $X$!
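To see this on a single concrete sample, here is a tiny sketch (the sample size and random seed are arbitrary) comparing the squared deviations around the sample mean with those around the true mean of 0:

import numpy as np
from scipy.stats import norm

# Draw one sample of 5 points from a standard normal (true mean 0)
x = norm.rvs(loc=0, scale=1, size=5, random_state=42)

# Sum of squared deviations around the sample mean vs. the population mean
around_sample_mean = np.sum((x - x.mean())**2)
around_true_mean = np.sum((x - 0)**2)

print(around_sample_mean <= around_true_mean)  # always True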
In fact, if we repeat the above Python simulation but use 0 (the population mean) instead of the sample mean:
avg = 0
Then the average of the variance estimates will be approximately 1:
estimates.mean()
If we know the population mean, we can obtain an unbiased estimate of the variance from a set of samples. But in reality, we do not know the population mean. Fortunately, we can quantify the bias and correct it.
Quantifying bias#
So far, we have seen that $\hat{\sigma}^2$ is a biased estimate of the population variance $\sigma^2$. We discovered this by simulating many samples, computing $\hat{\sigma}^2$ for each, and taking the average. The simulation results indicate that, for $n = 5$ and $\sigma^2 = 1$:

$$E[\hat{\sigma}^2] \approx 0.8$$
Now we want to get rid of the simulation and calculate the exact value of $E[\hat{\sigma}^2]$. We can achieve this by expanding it out. We start from:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$$
We can let $\mu = E[X]$ and write $x_i - \bar{x} = (x_i - \mu) - (\bar{x} - \mu)$, which means we can expand it to:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}\big((x_i - \mu) - (\bar{x} - \mu)\big)^2\right]$$
Through some algebra, we can expand the square:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \;-\; \frac{2(\bar{x} - \mu)}{n}\sum_{i=1}^{n}(x_i - \mu) \;+\; (\bar{x} - \mu)^2\right]$$
Now, note that:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu) = \bar{x} - \mu$$
This means:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \;-\; (\bar{x} - \mu)^2\right] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\right] - E\!\left[(\bar{x} - \mu)^2\right]$$
The key point here is:

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\right] = \sigma^2 \qquad \text{and} \qquad E\!\left[(\bar{x} - \mu)^2\right] = \operatorname{Var}[\bar{x}]$$
This means:

$$E[\hat{\sigma}^2] = \sigma^2 - \operatorname{Var}[\bar{x}]$$
The term $\operatorname{Var}[\bar{x}]$ is the variance of the sample mean. According to Bienaymé's identity, we know it equals:

$$\operatorname{Var}[\bar{x}] = \operatorname{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}[x_i] = \frac{\sigma^2}{n}$$
This gives us:

$$E[\hat{\sigma}^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$
Recall our Python simulation: with a sample size of $n = 5$ and a true variance of $\sigma^2 = 1$, the average estimate was about $0.8$. If we plug $n = 5$ and $\sigma^2 = 1$ into the formula above, we get exactly that biased answer:

$$\frac{n-1}{n}\,\sigma^2 = \frac{4}{5}\times 1 = 0.8$$
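Both the Bienaymé step and the final formula can be checked numerically. The sketch below reuses the same simulation idea as above (standard normal samples, so $\sigma^2 = 1$) for a few sample sizes:

from scipy.stats import norm

# Check Var of the sample mean and the average biased estimate for a few sample sizes
for m in [2, 5, 10]:
    sims = norm.rvs(loc=0, scale=1, size=(1_000_000, m))
    means = sims.mean(axis=1)

    # Bienaymé: the variance of the sample mean should be close to 1 / m
    print(m, means.var(), 1 / m)

    # The biased estimator should average close to (m - 1) / m
    biased = ((sims - means.reshape(-1, 1))**2).sum(axis=1) / m
    print(m, biased.mean(), (m - 1) / m)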
Unbiased estimator#
Now that we know the exact value of $E[\hat{\sigma}^2]$, we can find a way to correct $\hat{\sigma}^2$ so that it becomes an unbiased estimate of $\sigma^2$.
The correction factor is:

$$\frac{n}{n-1}$$
Through calculation, we can see:

$$E\!\left[\frac{n}{n-1}\,\hat{\sigma}^2\right] = \frac{n}{n-1}\cdot\frac{n-1}{n}\,\sigma^2 = \sigma^2$$
Thus, the unbiased estimate of $\sigma^2$ obtained from a set of $n$ samples is:

$$\hat{\sigma}^2_{\text{unbiased}} = \frac{n}{n-1}\,\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
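Applying this correction to the earlier simulation brings the average estimate back to about 1; NumPy's `ddof=1` option applies the same $\frac{n}{n-1}$ correction internally. A quick sketch reusing the `simulations` array from the earlier code:

# The biased estimator (divide by n) averages to about 0.8 for n = 5
print(simulations.var(axis=1, ddof=0).mean())

# Dividing by n - 1 instead (ddof=1) removes the bias; this averages to about 1
print(simulations.var(axis=1, ddof=1).mean())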
Unbiased weighted estimator#
Now, we will extend the above to the case of weighted samples. The weighted sample mean is:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
The weighted variance is:

$$\hat{\sigma}^2_w = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} w_i}$$
Following the same expansion process as before, we find:

$$E[\hat{\sigma}^2_w] = \sigma^2 - \operatorname{Var}[\bar{x}]$$
The variance of the weighted mean is:

$$\operatorname{Var}[\bar{x}] = \operatorname{Var}\!\left[\frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}\right] = \frac{\sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}\,\sigma^2$$

where Bienaymé's identity is again used to calculate the variance of the sum.
Thus, we obtain:

$$E[\hat{\sigma}^2_w] = \sigma^2 - \frac{\sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}\,\sigma^2 = \frac{\left(\sum_{i=1}^{n} w_i\right)^2 - \sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}\,\sigma^2$$
The bias correction term is:

$$\frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\left(\sum_{i=1}^{n} w_i\right)^2 - \sum_{i=1}^{n} w_i^2}$$
This means the unbiased weighted estimate of the variance is:

$$\hat{\sigma}^2_{w,\text{unbiased}} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\left(\sum_{i=1}^{n} w_i\right)^2 - \sum_{i=1}^{n} w_i^2} \cdot \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} w_i}$$

As a sanity check, setting every $w_i = 1$ makes the correction term $\frac{n^2}{n^2 - n} = \frac{n}{n-1}$, recovering the unweighted result.
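The unbiasedness of this weighted estimator can also be checked with a quick simulation; the weights below are arbitrary made-up values, and the samples are standard normal so that $\sigma^2 = 1$:

import numpy as np
from scipy.stats import norm

samples = norm.rvs(loc=0, scale=1, size=(1_000_000, 5))
w = np.array([0.1, 0.2, 0.4, 0.8, 1.6])  # arbitrary positive weights

# Weighted sample mean of each simulated sample
wmean = (samples * w).sum(axis=1, keepdims=True) / w.sum()

# Biased weighted variance estimates
biased = (w * (samples - wmean)**2).sum(axis=1) / w.sum()

# Bias correction term from the formula above
correction = w.sum()**2 / (w.sum()**2 - (w**2).sum())

print(biased.mean())                 # noticeably less than 1
print((correction * biased).mean())  # approximately 1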
Reproducing Pandas Exponentially Weighted Variance#
Now, we have all the tools needed to reproduce the Pandas exponentially weighted variance.
import numpy as np
import pandas as pd
N = 1000
# Create some fake data
df = pd.DataFrame()
df['data'] = np.random.randn(N)
# Set the halflife for EWM and convert it to alpha for calculation.
halflife = 10
a = 1 - np.exp(-np.log(2)/halflife) # alpha
# This is Pandas' ewm
df['var_pandas'] = df.ewm(alpha=a).var()
# Initialize variable
varcalc = np.zeros(len(df))
# Calculate exponentially moving variance
for i in range(0, N):
x = df['data'].iloc[0:i+1].values
# Weights
n = len(x)
w = (1-a)**np.arange(n-1, -1, -1) # Reverse order to match the sequence order
# Calculate exponentially moving average
ewma = np.sum(w * x) / np.sum(w)
# Calculate bias correction term
bias = np.sum(w)**2 / (np.sum(w)**2 - np.sum(w**2))
# Calculate exponentially moving variance with bias correction
varcalc[i] = bias * np.sum(w * (x - ewma)**2) / np.sum(w)
df['var_calc'] = varcalc
This gives us a DataFrame like the one below:
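As a small added check (not part of the code above), the two columns can also be compared numerically; the first value is NaN in both, so the comparison starts from the second row:

# A single observation gives no variance estimate, so both columns start with NaN;
# comparing from the second row onwards should print True
print(np.allclose(df['var_pandas'].iloc[1:], df['var_calc'].iloc[1:]))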