The core of statistical inference can be thought of like this: you have some observed data, and you want to understand the actual population that generated the sample you have.
Parameter estimation by MLE
One typically models the observed data as being generated by some probability distribution. For the sake of simplicity, let us assume the probability distribution is parameterized. A parameter is a number that describes the population; the population is easy to describe and understand if we know the value of the parameter. Statistical inference is about inferring/estimating the unknown parameters of the distribution from the observed data. Statistical inference also aims to quantify the uncertainty around these parameter estimates.
There are a number of ways one can estimate the parameters from the observed data, and Maximum Likelihood estimates are among the most common. The Maximum Likelihood Estimation method gets the estimate of a parameter by finding the parameter value that maximizes the probability of observing the data given the parameter. It is typically abbreviated as MLE.
We will see a simple example of the principle behind Maximum Likelihood Estimation using the Poisson distribution.
Maximum Likelihood Estimation for data from Poisson Distribution
The Poisson distribution is commonly used to model the number of times an event happens within a defined time or space interval. For example, we can model the number of emails/tweets received per day as a Poisson distribution.
The Poisson distribution is a simple distribution with a single parameter, which makes it great for illustrating the principles behind Maximum Likelihood Estimation.
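For reference, the Poisson probability mass function for a count x, given the rate parameter lambda, is

P(X=x) = lambda^x * exp(-lambda) / x!

This is the density f(x;lambda) we will work with below.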
We will start by generating some data from a Poisson distribution. Then we will model the data as having been generated from a Poisson distribution with an unknown parameter, estimate the parameter by Maximum Likelihood Estimation using a kind of brute-force search, and compare the estimate with the truth.
Let us generate some data from a Poisson distribution. In R, we can easily generate random numbers from a specific probability distribution. To generate numbers from a Poisson distribution, we can use the rpois function.
We will generate 100 data points from a Poisson distribution with parameter lambda=5 using the rpois function in R.
# generate data from Poisson distribution
# with the parameter lambda=5
data <- rpois(n=100, lambda=5)
Let us save the data into a data frame.
# convert to data frame
df_Poisson <- data.frame(data=data)
Let us quickly visualize the distribution of the data using a histogram.
# load tidyverse for ggplot2 and the %>% pipe
library(tidyverse)

df_Poisson %>%
  ggplot(aes(x=data)) +
  geom_histogram(bins=100) +
  ylab("Count") +
  xlab("data") +
  theme_bw(base_size = 16) +
  ggtitle("Poisson Distribution: rpois(100,5)")
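As an extra sanity check (an aside, not part of the original walkthrough): for a Poisson distribution the mean and the variance are both equal to lambda, so both should come out near 5 for our simulated data.

# for Poisson(lambda=5) data, the sample mean and
# the sample variance should both be near 5
mean(data)
var(data)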
Likelihood
R.A. Fisher introduced the notion of “likelihood” while presenting Maximum Likelihood Estimation. Since then, the use of likelihood has expanded well beyond the realm of Maximum Likelihood Estimation.
When we have data x = {x1, x2, ..., xn} from a probability distribution with parameter lambda, we can write the probability density function of a single observation as f(x;lambda).
If the observations are independent of each other, we can write the joint probability density function as the product of the individual densities:
P(X1=x1, X2=x2, …, Xn=xn) = f(x1;lambda) * f(x2;lambda) * … * f(xn;lambda)
We know x, as it is observed, but we do not know the parameter lambda. Viewed this way, the joint probability density function is called the likelihood function, and the likelihood function is a function of the unknown parameter lambda:
L(lambda|x) = f(x1;lambda) * f(x2;lambda) * … * f(xn;lambda)
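To make the product concrete, here is a quick sketch that evaluates the joint likelihood of the first three simulated data points at lambda=5 straight from the density formula, using a helper variable lam (the exact value will vary with the random draw):

# joint likelihood of the first three observations at lambda=5,
# computed as a product of individual Poisson densities
lam <- 5
prod(lam^data[1:3] * exp(-lam) / factorial(data[1:3]))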
The Maximum Likelihood Estimation method gets the estimate of the parameter by finding the parameter value for which the likelihood is the highest.
In the case of a model with a single parameter, we can actually compute the likelihood for a range of parameter values and manually pick the parameter value that has the highest likelihood.
Computing Likelihood for Poisson Distribution
In our example here, f(x;lambda) is the Poisson density function, and we can compute the Poisson density, and thus in turn the likelihood, in R with the dpois() function.
Let us compute the likelihood of a single data point generated above: the first data point, using the parameter lambda=1.
> dpois(data[1], lambda=1)
[1] 0.1839397
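As a quick sanity check (an aside, not part of the original walkthrough), we can reproduce this value by plugging the data point into the Poisson probability mass function shown earlier:

# dpois should match the pmf: lambda^x * exp(-lambda) / x!
x <- data[1]
lambda <- 1
lambda^x * exp(-lambda) / factorial(x)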
Remember, the parameter lambda is unknown; it is the argument of the likelihood function. Because dpois(), like many R functions, is vectorized, we can easily compute the likelihood for a whole set of parameter values at once, without using a for loop.
Let us compute the likelihood of the single data point, but with multiple parameter values.
# likelihood of a single data point over lambda = 1..20
likelihood <- dpois(data[1], lambda=seq(20))
Let us create a data frame with the likelihood and parameters in it.
# likelihood of single data point as a data frame
lh_single <- data.frame(x=seq(20), likelihood=likelihood)
Let us see how the likelihood changes with the parameter values we used for the single data point by plotting it.
lh_single %>%
  ggplot(aes(x=x, y=likelihood)) +
  geom_point(size=4, color="dodgerblue") +
  xlab("Lambda") +
  ylab("Likelihood") +
  ggtitle("Likelihood of a single data point over multiple lambda") +
  theme_bw(base_size = 16)
Why is computing the Log-Likelihood better?
The above example shows that, for our single data point, the likelihood is the highest when lambda=2. Typically, though, we are interested in computing the likelihood for all of our observed data, i.e. multiple data points. Computing the complete likelihood function for all the data involves multiplying the likelihoods of the individual data points.
If we look at the above likelihood plot for the single data point, we can see that many of the likelihoods are pretty close to zero. This means that, while computing the likelihood of the observed data, we will often end up multiplying really small (close to zero) likelihoods, which can lead to floating point underflow errors.
Don’t panic, there is a brilliant solution. You might have guessed it: we can work in log-space. By working with the log-likelihood, we convert the multiplication into addition, and we no longer have to worry about running out of floating point precision.
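Here is a small illustration of the problem and the fix, using made-up likelihood values of 1e-20: multiplying many tiny likelihoods underflows to zero, while summing their logs stays perfectly stable.

# product of 20 tiny likelihoods underflows to 0
# (the true value, 1e-400, is below double precision's ~1e-308 limit)
prod(rep(1e-20, 20))
# summing the logs instead gives a finite, usable number
sum(log(rep(1e-20, 20)))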
It is pretty straightforward to compute the log-likelihood for a single data point in R. The dpois function we used earlier has an argument log, and we can set log=TRUE to get the log-likelihood.
# compute log-likelihood of single data point
log_likelihood <- dpois(data[1], lambda=seq(20), log=TRUE)

# log-likelihood in a data frame
llh_single <- data.frame(x=seq(20), log_like=log_likelihood)
Let us plot the log-likelihood of the single data point over multiple values of lambda.
llh_single %>%
  ggplot(aes(x=x, y=log_like)) +
  geom_point(size=4, color="dodgerblue") +
  xlab("Lambda") +
  ylab("Log Likelihood") +
  ggtitle("Log-Likelihood of a single data point") +
  theme_bw(base_size = 16)
Now you can see that, on the log scale, we no longer have values vanishingly close to zero.
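A quick way to see the difference (an aside, reusing the variables defined above) is to compare the spread of values on the two scales:

# the raw likelihoods span many orders of magnitude,
# while the log-likelihoods occupy a modest, well-behaved range
range(likelihood)
range(log_likelihood)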
Computing Likelihood for Observed Data
Let us write a likelihood function that deals with multiple data points and computes the log-likelihood. We use the dpois() function to get the probability density, i.e. the likelihood, for each data point. dpois() takes three arguments here: the data points (remember, R is vectorized), the parameter value lambda, and log=TRUE to compute the log-likelihood.
Since we have more than one data point, we sum the log-likelihoods using the sum function. That is it: we now have the log-likelihood for any given data.
llh_poisson <- function(lambda, y){
  # log-likelihood of the data y under Poisson(lambda),
  # obtained by summing the log densities
  llh <- sum(dpois(y, lambda, log=TRUE))
  return(llh)
}
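Before sweeping over a grid of parameter values, we can try the function at a single value, for example the true lambda=5 used to simulate the data (a small usage check, not in the original):

# total log-likelihood of all 100 data points at lambda=5
llh_poisson(5, data)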
Let us define the parameter space over which we would like to compute the likelihood that the data was generated from a Poisson distribution with a specific lambda.
lambdas <- seq(1, 15, by=0.5)
We use the sapply function to compute the log-likelihood for each value of lambda.
# compute log-likelihood for all lambda values
ll <- sapply(lambdas, function(x){llh_poisson(x, data)})

# save the lambdas and log-likelihoods in a data frame
df <- data.frame(ll=ll, lambda=lambdas)
Maximum Likelihood Estimate from Observed Data
Now that we have the log-likelihood of the observed data under a range of parameter values, we are pretty close to finding the Maximum Likelihood Estimate of the parameter. Remember, the Maximum Likelihood Estimate of a parameter is the parameter value for which the likelihood of the data is the highest.
We can simply plot the log-likelihood against lambda and pick the lambda for which the likelihood is the highest.
df %>%
  ggplot(aes(x=lambda, y=ll)) +
  geom_point(size=4, color="dodgerblue") +
  xlab("Lambda") +
  ylab("Log Likelihood") +
  theme_bw(base_size = 16) +
  geom_vline(xintercept = lambdas[which.max(ll)], color="red", size=2)
The plot shows how the likelihood varies over different values of lambda, and the red vertical line marks the MLE of lambda, the value for which the likelihood is the highest.
From our plot, we can see that lambda=5 is our estimate by MLE. If you recall, we generated the 100 data points from a Poisson distribution with parameter lambda=5.
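Rather than reading the estimate off the plot, we can also extract it directly, reusing the which.max() call already embedded in the plotting code:

# the grid value with the highest log-likelihood is the MLE
lambdas[which.max(ll)]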
Voila! Our brute-force MLE with the Poisson model worked like magic, and we recovered the true parameter lambda=5. Note that the accuracy of this brute-force approach depends on our parameter search space. For example, if our parameter search space did not include 5, we would have ended up with an estimate that is close to 5 but restricted to the values in our search space.
We can do better than this brute-force search for the MLE; tune in to a future post to see how.
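As a hedged preview, and just one possible approach rather than necessarily what that post will cover: R's built-in one-dimensional optimizer, optimize(), can maximize our log-likelihood function over a continuous interval, and for the Poisson model the MLE even has a closed form, the sample mean, as we already glimpsed when checking the simulated data.

# maximize the log-likelihood over a continuous interval
optimize(llh_poisson, interval=c(0.1, 15), y=data, maximum=TRUE)

# for the Poisson model, the closed-form MLE is the sample mean
mean(data)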