HaiBiostat: Beta Distribution: an Intuitive Explanation

Hai Nguyen

Motivation

I learned the beta distribution in UIC’s Bayesian methods course and even tutored it, for example setting it up as the prior distribution in a conjugate-prior context. Still, it was easy to forget because the material is dry and abstract. Here I try to combine the rigorous theory (UC coursework content) with intuition. That way, I was able to ‘permanently stamp’ the concept into my brain.

The beta distribution is a probability distribution over probabilities.
The beta distribution describes a family of continuous probability distributions that are nonzero only on the interval (0, 1).
For example, we can use it to model probabilities such as the click-through rate of an advertisement, a batting average, or the 5-year survival chance for women with breast cancer.

Definition

A continuous random variable \(X_B \sim Beta(\alpha, \beta)\) has a beta distribution if its probability density function (PDF) is

\[ f_{X_B} (x; \alpha, \beta) = \frac{1}{B(\alpha,\beta)} x^{\alpha - 1} (1-x)^{\beta - 1}, \ \ \text{for} \ 0 < x < 1, \]

where \(B(\cdot, \cdot)\) is the Beta function and the shape parameters satisfy \(\alpha, \beta > 0\).
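
As a quick sanity check of the definition, the following sketch evaluates the formula by hand and compares it with R's built-in `dbeta()`. The values of `x`, `a`, and `b` are arbitrary choices for illustration, not taken from the text.

```r
# Arbitrary illustrative values (assumptions, not from the text above)
x <- 0.3   # a point in (0, 1)
a <- 2     # shape parameter alpha
b <- 5     # shape parameter beta

# PDF written out from the definition; beta() is base R's Beta function
pdf_by_hand <- x^(a - 1) * (1 - x)^(b - 1) / beta(a, b)

# Built-in density for comparison
pdf_builtin <- dbeta(x, shape1 = a, shape2 = b)

# TRUE when the two values agree up to floating-point tolerance
isTRUE(all.equal(pdf_by_hand, pdf_builtin))
```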

Intuitive interpretation

Distribution	PDF	Probability as a …
Binomial	\(f(x) = {n \choose x} p^x (1-p)^{n-x}\) \(\rightarrow\) a function of \(x\)	`parameter`
Beta	\(f(p) = \frac{1}{B(\alpha,\beta)} p^{\alpha - 1} (1-p)^{\beta - 1}\) \(\rightarrow\) a function of \(p\)	`random variable`

The intuition for the beta distribution comes from looking at the numerator of its PDF through the lens of the binomial distribution: \(p\) raised to some power multiplied by \(1-p\) raised to some power, which is the same form as \(p^x (1-p)^{n-x}\) in the binomial.
The difference between the two is that the binomial models the number of successes (\(x\)), while the beta models the probability (\(p\)) of success. In other words, the probability is a parameter in the binomial; in the beta, the probability is a random variable.
In this context, we can think of \(\alpha-1\) as the number of successes and \(\beta-1\) as the number of failures.
We can explore the beauty of the beta distribution with the Beta distribution calculator built by Dr. Bognar at the University of Iowa.
The beta distribution is very flexible: it can be bell-shaped (the PDF is approximately normal if \(\alpha + \beta\) is large enough and \(\alpha\) and \(\beta\) are approximately equal), U-shaped (when \(\alpha < 1\) and \(\beta < 1\)), or even a straight line. Here is a graph excerpted from Wikipedia.
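
One way to see the link between the binomial and the beta numerically: for fixed data (\(x\) successes in \(n\) trials), the binomial PMF viewed as a function of \(p\) has the shape of a \(Beta(x+1, n-x+1)\) density and differs from it only by the constant factor \(n+1\). The sketch below checks this; the values of `n` and `x` are made up for illustration.

```r
n <- 10  # hypothetical number of trials
x <- 7   # hypothetical number of successes
p <- seq(0.05, 0.95, by = 0.05)  # grid of candidate success probabilities

# Binomial PMF as a function of p (the data x and n are held fixed)
binom_in_p <- dbinom(x, size = n, prob = p)

# Beta density with alpha - 1 = x successes and beta - 1 = n - x failures
beta_in_p <- dbeta(p, shape1 = x + 1, shape2 = n - x + 1)

# The two curves differ only by the constant (n + 1)
isTRUE(all.equal(beta_in_p, (n + 1) * binom_in_p))
```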

Beta function

The beta function is

\[ B(x,y) = \int_0^1 t^{x-1} (1-t)^{y-1} \, dt = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x+y)}, \]

where \(\Gamma(\cdot)\) is the Gamma function.
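
The two forms of the Beta function can be compared numerically. This is a minimal sketch with arbitrary arguments `x` and `y`; it uses base R's `integrate()`, `gamma()`, and `beta()`.

```r
x <- 2.5  # arbitrary positive arguments, chosen only for illustration
y <- 4

# Integral definition, evaluated numerically on (0, 1)
by_integral <- integrate(function(t) t^(x - 1) * (1 - t)^(y - 1),
                         lower = 0, upper = 1)$value

# Gamma-function form
by_gamma <- gamma(x) * gamma(y) / gamma(x + y)

# Base R's built-in Beta function
by_builtin <- beta(x, y)

# The three values should agree up to numerical error
c(integral = by_integral, gamma = by_gamma, builtin = by_builtin)
```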

Gamma function

The Gamma function \(\Gamma\) is an extension of the factorial function, with its argument shifted down by 1, to real and complex numbers.

For positive integer \(n\):

\[ \Gamma (n) = (n-1)! = 1 \times 2 \times 3 \times \dots \times (n-1) \]

For complex numbers \(t\) with positive real part, the gamma function is given by the integral below; analytic continuation extends it to all complex numbers except the non-positive integers:

\[ \Gamma (t) = \int_0^{\infty} x^{t-1} e^{-x} dx \]
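
Both facts about the Gamma function are easy to check in R. The sketch below uses an arbitrary integer `n` and an arbitrary non-integer argument `t`.

```r
n <- 5

# Gamma(n) equals (n - 1)! for positive integers
gamma(n)          # 24
factorial(n - 1)  # 24

# Integral definition at a non-integer argument (arbitrary choice)
t <- 2.5
by_integral <- integrate(function(x) x^(t - 1) * exp(-x),
                         lower = 0, upper = Inf)$value

# Compare the numerical integral with the built-in gamma()
isTRUE(all.equal(by_integral, gamma(t), tolerance = 1e-4))
```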

Substituting the Gamma-function form of the Beta function lets us write the PDF of the beta distribution in terms of the Gamma function. The Beta function is the product of the Gamma functions of each parameter divided by the Gamma function of the sum of the parameters (for the proof, see Further reading).
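
Written out, replacing \(B(\alpha, \beta)\) in the definition with its Gamma-function form gives:

\[ f_{X_B} (x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1-x)^{\beta - 1}, \ \ \text{for} \ 0 < x < 1. \]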

Main facts

\[ E[X_B] = \mu = \frac{\alpha}{\alpha + \beta}; \ \ V[X_B] = \sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \]

The standard uniform distribution \(\text{Unif} \ (0,1)\) is a special case of the beta distribution \(Beta \ (1,1)\), when \(\alpha = \beta = 1\).

The mode is \(\omega = \frac{\alpha - 1}{\alpha + \beta - 2}\) for \(\alpha, \beta > 1\).
The concentration is \(\kappa = \alpha + \beta\).
Definitions of \(\mu, \omega\) and \(\kappa\) can be inverted:

\[ \alpha = \mu\kappa, \ \ \beta = (1 - \mu)\kappa \]

\[ \alpha = \omega(\kappa-2)+1, \ \ \beta = (1 - \omega)(\kappa-2)+1, \ \ \kappa > 2. \]

The parameter \(\kappa\) measures how many observations are needed to change our previous belief about \(\mu\).
If \(\kappa\) is small, only a few new observations are needed.

Example. A concentration of \(\kappa = 8\) around \(\mu = 0.5\) corresponds to \(\alpha = \mu \kappa = 4\) and \(\beta = (1 - \mu) \kappa = 4\).
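
The inversion formulas translate directly into small helper functions. The function names below are hypothetical (they do not come from any package); this is only a sketch of the two conversions.

```r
# Hypothetical helper: shape parameters from mean (mu) and concentration (kappa)
shape_from_mean <- function(mu, kappa) {
  c(alpha = mu * kappa, beta = (1 - mu) * kappa)
}

# Hypothetical helper: shape parameters from mode (omega) and concentration (kappa)
shape_from_mode <- function(omega, kappa) {
  stopifnot(kappa > 2)  # the mode formula requires kappa > 2
  c(alpha = omega * (kappa - 2) + 1,
    beta  = (1 - omega) * (kappa - 2) + 1)
}

shape_from_mean(mu = 0.5, kappa = 8)     # alpha = 4, beta = 4
shape_from_mode(omega = 0.5, kappa = 8)  # alpha = 4, beta = 4
```

For a symmetric distribution the mean and the mode coincide, which is why both calls give the same shape parameters here.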

The parameterization in terms of the mean and standard deviation is:

\[ \alpha = \mu \left[\frac{\mu (1 - \mu)}{\sigma^2} - 1\right]; \ \ \beta = (1 - \mu)\left[\frac{\mu (1 - \mu)}{\sigma^2} - 1\right] \]

The standard deviation is typically smaller than that of the uniform distribution on \([0,1]\), which is \(\sqrt{1/12} \approx 0.28867\).

Examples.

For \(\mu = 0.5\), \(\sigma = 0.28867\) the shape parameters are \(\alpha = 1\), \(\beta = 1\).
Find the shape parameters of the beta distribution with \(\mu = 0.5\), \(\sigma = 0.1\).
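
Both examples can be worked through with a small helper that implements the mean and standard deviation parameterization. The function name is hypothetical; the second call gives one way to answer the exercise, since \(0.5 \times (0.25 / 0.01 - 1) = 12\).

```r
# Hypothetical helper implementing the mean/sd parameterization
shape_from_mean_sd <- function(mu, sigma) {
  common <- mu * (1 - mu) / sigma^2 - 1
  stopifnot(common > 0)  # otherwise no beta distribution has this mean and sd
  c(alpha = mu * common, beta = (1 - mu) * common)
}

shape_from_mean_sd(mu = 0.5, sigma = sqrt(1 / 12))  # alpha = 1, beta = 1 (uniform)
shape_from_mean_sd(mu = 0.5, sigma = 0.1)           # alpha = 12, beta = 12
```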


In actions

Keep the parameter \(\beta\) fixed and move \(\alpha\) up or down. Observe how the mass of the distribution moves.

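library(ggplot2)  # provides ggplot(), stat_function(), and the theme helpers used below

# Grid on [0, 1]; stat_function() evaluates dbeta() across this x-range
p <- seq(0, 1, by = 0.2)
df <- data.frame(p)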
ggplot(data=df, aes(x=p))+
  stat_function(fun=dbeta, args=list(shape1=1, shape2=2), aes(colour = "alpha=1,beta=2")) + 
  stat_function(fun=dbeta, args=list(shape1=2, shape2=2), aes(colour = "alpha=2,beta=2")) +
  stat_function(fun=dbeta, args=list(shape1=4, shape2=2), aes(colour = "alpha=4,beta=2")) +
  stat_function(fun=dbeta, args=list(shape1=6, shape2=2), aes(colour = "alpha=6,beta=2")) +
  stat_function(fun=dbeta, args=list(shape1=8, shape2=2), aes(colour = "alpha=8,beta=2")) +
  scale_y_continuous(limits=c(0,3.6)) +
  scale_colour_manual("", values = c("palegreen", "orange", "olivedrab", "blue", "black")) + 
  ylab("Density") +
  ggtitle("PDF of Beta Distribution") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Do the same as above, but keep \(\alpha\) constant and move \(\beta\) up or down.

ggplot(data=df, aes(x=p))+
  stat_function(fun=dbeta, args=list(shape1=2, shape2=1), aes(colour = "alpha=2,beta=1")) + 
  stat_function(fun=dbeta, args=list(shape1=2, shape2=2), aes(colour = "alpha=2,beta=2")) +
  stat_function(fun=dbeta, args=list(shape1=2, shape2=5), aes(colour = "alpha=2,beta=5")) +
  stat_function(fun=dbeta, args=list(shape1=2, shape2=6), aes(colour = "alpha=2,beta=6")) +
  stat_function(fun=dbeta, args=list(shape1=2, shape2=8), aes(colour = "alpha=2,beta=8")) +
  scale_y_continuous(limits=c(0,3.6)) +
  scale_colour_manual("", values = c("palegreen", "orange", "olivedrab", "blue", "black")) + 
  ylab("Density") +
  ggtitle("PDF of Beta Distribution") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Set \(\alpha = \beta = 1\). What does the shape of the distribution tell you about your knowledge of the random variable \(\theta\)?
\(\Rightarrow\) The density is flat, so every value of \(\theta\) in \((0,1)\) is equally plausible. The standard uniform distribution \(Unif(0,1)\) is a special case of the beta distribution, \(Beta(1,1)\), when \(\alpha = \beta = 1\).

ggplot(data=df, aes(x=p))+
  stat_function(fun=dbeta, args=list(shape1=1, shape2=1), aes(colour = "alpha=1,beta=1")) +
  scale_y_continuous(limits=c(0,3.6)) +
  scale_colour_manual("", values = c("green")) + 
  ylab("Density") +
  ggtitle("PDF of Beta Distribution") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Keep \(\alpha = \beta\), but move both of them up or down. Interpret the shape of the distribution.

ggplot(data=df, aes(x=p))+
  stat_function(fun=dbeta, args=list(shape1=0.5, shape2=0.5), aes(colour = "alpha=0.5,beta=0.5")) + 
  stat_function(fun=dbeta, args=list(shape1=1, shape2=1), aes(colour = "alpha=1,beta=1")) +
  stat_function(fun=dbeta, args=list(shape1=2, shape2=2), aes(colour = "alpha=2,beta=2")) +
  stat_function(fun=dbeta, args=list(shape1=4, shape2=4), aes(colour = "alpha=4,beta=4")) +
  stat_function(fun=dbeta, args=list(shape1=6, shape2=6), aes(colour = "alpha=6,beta=6")) +
  scale_y_continuous(limits=c(0,3.6)) +
  scale_colour_manual("", values = c("palegreen", "orange", "olivedrab", "blue", "black")) + 
  ylab("Density") +
  ggtitle("PDF of Beta Distribution") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

The variance depends on both shape parameters: for the same mean, a larger \(\alpha + \beta\) gives a narrower distribution.

ggplot(data=df, aes(x=p))+
  stat_function(fun=dbeta, args=list(shape1=400, shape2=80), aes(colour = "alpha=400,beta=80")) + 
  stat_function(fun=dbeta, args=list(shape1=40, shape2=8), aes(colour = "alpha=40,beta=8")) +
  stat_function(fun=dbeta, args=list(shape1=30, shape2=70), aes(colour = "alpha=30,beta=70")) +
  stat_function(fun=dbeta, args=list(shape1=3, shape2=7), aes(colour = "alpha=3,beta=7")) +
  scale_y_continuous(limits=c(0,25)) +
  scale_colour_manual("", values = c("blue", "green", "orange", "black")) + 
  ylab("Density") +
  ggtitle("PDF of Beta Distribution") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))
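
To put numbers on the plot above, the closed-form mean and variance from the Main facts section can be evaluated for the same four pairs of shape parameters. This is a small sketch; the data frame name `shapes` is just an illustrative choice.

```r
# Shape parameters of the four curves plotted above
shapes <- data.frame(alpha = c(400, 40, 30, 3),
                     beta  = c(80, 8, 70, 7))

# Closed-form mean and standard deviation of Beta(alpha, beta)
shapes$mean <- with(shapes, alpha / (alpha + beta))
shapes$sd   <- with(shapes,
                    sqrt(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))))

# Rows 1-2 share one mean and rows 3-4 share another; only the spread differs
shapes
```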

When the beta distribution is used as a prior for the parameter of a binomial distribution, the parameters \(\alpha\) and \(\beta\) can be interpreted as the previously observed numbers of successes (\(\alpha\)) and failures (\(\beta\)). For example, if 2 Bernoulli experiments gave 1 success and 1 failure, you can express your opinion about the probability of success as \(Beta(1,1)\). What prior would you assume if 6 previously observed outcomes contained 3 successes and 3 failures? What is the likely value of the parameter? Do we have more or less information than in the case of 1 success and 1 failure? \(\Rightarrow\) Something to think about further.

ggplot(data=df, aes(x=p))+
  stat_function(fun=dbeta, args=list(shape1=1, shape2=1), aes(colour = "alpha=1,beta=1")) +
  stat_function(fun=dbeta, args=list(shape1=3, shape2=3), aes(colour = "alpha=3,beta=3")) +
  stat_function(fun=dbinom, args=list(size=1, prob=0.5), aes(colour = "Bernoulli w/ prob=0.5")) + # Bernoulli PMF: 0.5 at x = 0 and x = 1, and 0 (with a warning) at non-integer x
  scale_y_continuous(limits=c(0,3.6)) +
  scale_colour_manual("", values = c("red","green","black")) + 
  ylab("Density") +
  ggtitle("PDF of Beta Distribution") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))
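
One hedged way to approach the "more or less information" question is to compare how wide the two priors are. The sketch below uses central 95% intervals from `qbeta()`; the choice of 95% is an arbitrary convention, not something the text prescribes.

```r
# Central 95% intervals of the two priors plotted above
interval_1_1 <- qbeta(c(0.025, 0.975), shape1 = 1, shape2 = 1)
interval_3_3 <- qbeta(c(0.025, 0.975), shape1 = 3, shape2 = 3)

# Width of each interval: a narrower interval means the prior is
# more concentrated around 0.5, i.e. it encodes more information
diff(interval_1_1)
diff(interval_3_3)
```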

Define a distribution with mode \(\omega = 0.8\) and concentration \(\kappa = 12\). To do that, find the shape parameters \(\alpha = \omega (\kappa - 2) + 1 = 9\) and \(\beta = (1 - \omega)(\kappa - 2) + 1 = 3\).

ggplot(data=df, aes(x=p))+
  stat_function(fun=dbeta, args=list(shape1=9, shape2=3), aes(colour = "alpha=9,beta=3")) +
  scale_y_continuous(limits=c(0,3.4)) +
  scale_colour_manual("", values = c("blue")) + 
  ylab("Density") +
  ggtitle("PDF of Beta Distribution") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

From these experiments we notice that:

The special case \(\alpha = \beta = 1\) is the uniform distribution.
The distribution is roughly centered on \(\alpha/(\alpha+\beta)\). In fact, the mean is exactly \(\alpha/(\alpha+\beta)\), so the mean of the distribution is determined by the relative values of \(\alpha\) and \(\beta\).
The larger the values of \(\alpha\) and \(\beta\), the smaller the variance of the distribution about the mean.
For moderately large values of \(\alpha\) and \(\beta\) the distribution looks visually “kind of normal”, although unlike the normal distribution the beta distribution is restricted to \([0,1]\).

Plots in `shiny`

I plan to build a Shiny app that plots the beta distribution for user-specified shape parameters (still in progress).
