~/Optimality Criteria

Brandon Rozek

PhD Student @ RPI studying Automated Reasoning in AI and Linux Enthusiast.

Falling under wrapper methods, optimality criteria are often used to aid in model selection. These criteria measure how well a given hypothesis (model) fits the data.

Akaike Information Criterion (AIC)

AIC is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the others.

This way, AIC provides a means for model selection. AIC offers an estimate of the relative information lost when a given model is used.

This metric says nothing about the absolute quality of a model; it only serves for comparison between models. Therefore, if all the candidate models fit the data poorly, AIC will not give any warning of that.

The preferred model is the one with the lowest AIC.

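AIC is formally defined as $$ AIC = 2k - 2\ln{(\hat{L})} $$ where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized value of the likelihood function.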
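As a minimal sketch of how the formula is used for selection, the snippet below scores three candidates and picks the one with the lowest AIC. The model names, log-likelihoods, and parameter counts are made up for illustration; in practice $\ln{(\hat{L})}$ would come from fitting each model.

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2 ln(L_hat), taking the maximized log-likelihood directly."""
    return 2 * k - 2 * log_likelihood


# Hypothetical candidates: maximized log-likelihood and parameter count.
candidates = {
    "model_a": {"log_likelihood": -120.5, "k": 2},
    "model_b": {"log_likelihood": -118.9, "k": 3},
    "model_c": {"log_likelihood": -118.7, "k": 5},
}

scores = {
    name: aic(m["log_likelihood"], m["k"]) for name, m in candidates.items()
}

# The candidate with the lowest AIC is preferred.
best = min(scores, key=scores.get)
print(scores, best)
```

Note that `model_c` has the highest likelihood here but is not selected: its extra parameters cost more than the small gain in fit.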

Bayesian Information Criterion (BIC)

BIC is based on the likelihood function and is closely related to the Akaike information criterion. As with AIC, the model with the lowest BIC is preferred.

BIC is formally defined as $$ BIC = \ln{(n)}k - 2\ln{(\hat{L})} $$

Here $\hat{L}$ is the maximized value of the likelihood function for the model $M$, $$ \hat{L} = p(x | \hat{\theta}, M) $$ where $\hat{\theta}$ are the parameter values that maximize the likelihood, $x$ is the observed data, $n$ is the number of observations, and $k$ is the number of parameters estimated.
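To make $\hat{L}$, $n$, and $k$ concrete, here is a small sketch that compares two Gaussian models of the same data by BIC. The data values and the choice of pinning the mean at 5.0 are assumptions made purely for illustration. It relies on the fact that, once the variance is set to its maximum likelihood estimate, the Gaussian log-likelihood simplifies to $-\frac{n}{2}(\ln{(2\pi\hat{\sigma}^2)} + 1)$.

```python
import math


def gaussian_max_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood for the given residuals.

    The variance is set to its MLE (the mean squared residual), which
    reduces the log-likelihood to -n/2 * (ln(2*pi*var_hat) + 1).
    """
    n = len(residuals)
    var_hat = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * var_hat) + 1)


def bic(log_likelihood, k, n):
    """BIC = ln(n) * k - 2 ln(L_hat)."""
    return math.log(n) * k - 2 * log_likelihood


# Made-up observations.
x = [4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 4.8, 5.2, 5.4, 5.0]
n = len(x)

# Model "fixed": mean pinned at 5.0, only the variance is estimated (k = 1).
log_l_fixed = gaussian_max_log_likelihood([xi - 5.0 for xi in x])

# Model "free": both mean and variance are estimated (k = 2).
mean_hat = sum(x) / n
log_l_free = gaussian_max_log_likelihood([xi - mean_hat for xi in x])

bic_fixed = bic(log_l_fixed, 1, n)
bic_free = bic(log_l_free, 2, n)
print(bic_fixed, bic_free)
```

The free model always fits at least as well, since it contains the fixed model as a special case. With these numbers the improvement is too small to pay for the extra parameter, so the fixed model has the lower BIC.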

Properties of BIC

Limitations of BIC

Differences from AIC

AIC is mostly used to compare candidate models against each other, whereas BIC asks whether or not a model resembles reality. Even though the two criteria look similar, they pursue separate goals.
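One concrete consequence of the two definitions: the likelihood term is the same in both, so for a single fitted model the difference is $BIC - AIC = (\ln{(n)} - 2)k$. The sketch below tabulates that gap for a few sample sizes, with a hypothetical parameter count of 3.

```python
import math


def penalty_gap(n: int, k: int) -> float:
    """BIC - AIC for the same fitted model: (ln(n) - 2) * k."""
    return (math.log(n) - 2) * k


k = 3  # hypothetical number of estimated parameters
gaps = {n: penalty_gap(n, k) for n in (5, 8, 100, 10_000)}

for n, gap in gaps.items():
    print(f"n={n:>6}  BIC - AIC = {gap:+.3f}")
```

The gap turns positive once $\ln{(n)} > 2$, that is from $n = 8$ onward, and keeps growing with $n$. BIC therefore leans toward smaller models than AIC on all but the tiniest data sets.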

Mallows's $C_p$

$C_p$ is used to assess the fit of a regression model that has been estimated using ordinary least squares. A small value of $C_p$ indicates that the model is relatively precise.

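The $C_p$ of a model is defined as $$ C_p = \frac{\sum_{i =1}^N{(Y_i - Y_{pi})^2}}{S^2}- N + 2P $$ where $Y_{pi}$ is the fitted value of the $i$th observation under the candidate model with $P$ parameters, $S^2$ is the residual mean square of the full model (an estimate of the error variance), and $N$ is the sample size.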

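An alternative definition is

$$ C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2) $$

where $RSS$ is the residual sum of squares, $d$ is the number of predictors, $\hat{\sigma}^2$ is an estimate of the error variance, and $n$ is the number of observations. The two definitions give different numbers, but when $\hat{\sigma}^2 = S^2$ and $d = P$ the second equals $\frac{\hat{\sigma}^2}{n}(C_p + n)$ in terms of the first, so both rank models in the same order.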
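The following sketch computes both versions of $C_p$ for a toy problem: an intercept-only model against a "full" model with an intercept and one slope. The data are made up, and the sketch assumes that $P$ and $d$ both count every fitted coefficient (intercept included) and that $S^2$ and $\hat{\sigma}^2$ are both the residual mean square of the full model. Other sources count parameters differently, so treat these conventions as one possible reading.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def rss(y, fitted):
    """Residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def mallows_cp(rss_p, s2, n, p):
    """First definition: RSS_p / S^2 - N + 2P."""
    return rss_p / s2 - n + 2 * p


def mallows_cp_scaled(rss_p, sigma2, n, d):
    """Alternative definition: (RSS + 2 d sigma^2) / n."""
    return (rss_p + 2 * d * sigma2) / n


# Made-up data with a clear linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(y)

# Full model: intercept + slope (2 coefficients).
b0, b1 = fit_line(x, y)
p_full = 2
rss_full = rss(y, [b0 + b1 * xi for xi in x])
s2 = rss_full / (n - p_full)  # residual mean square of the full model

# Candidate model: intercept only (1 coefficient).
mean_y = sum(y) / n
rss_null = rss(y, [mean_y] * n)

cp_full = mallows_cp(rss_full, s2, n, p_full)
cp_null = mallows_cp(rss_null, s2, n, 1)
cp_null_scaled = mallows_cp_scaled(rss_null, s2, n, 1)
print(cp_full, cp_null, cp_null_scaled)
```

Under these conventions the full model's $C_p$ equals its own parameter count by construction, so the interesting comparison is how far the smaller candidates land above their own $P$. Here the intercept-only model scores far worse because it ignores a strong trend.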

Deviance Information Criterion (DIC)

The DIC is a hierarchical modeling generalization of the AIC and BIC. It is useful in Bayesian model selection problems where the posterior distribution of the model has been obtained by Markov chain Monte Carlo (MCMC) simulation.

This method is only valid if the posterior distribution is approximately multivariate normal.

Let us define the deviance as $$ D(\theta) = -2\log{(p(y|\theta))} + C $$ where $y$ is the data, $\theta$ are the unknown parameters of the model, and $C$ is a constant that cancels out whenever different models are compared.

Let us define a helper variable $p_D$, the effective number of parameters, as $$ p_D = \frac{1}{2}\widehat{\mathrm{Var}}(D(\theta)) $$ Finally, the deviance information criterion can be calculated as $$ DIC = D(\bar{\theta}) + 2p_D $$ where $\bar{\theta}$ is the expectation of $\theta$.
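As a rough sketch of the calculation, the code below computes $p_D$ and DIC from a set of posterior draws. To stay self-contained it assumes a toy model, $y_i \sim \mathcal{N}(\theta, 1)$ with a flat prior, whose posterior is $\mathcal{N}(\bar{y}, 1/n)$, and it draws from that posterior directly as a stand-in for MCMC output. The data, seed, and sample sizes are arbitrary, and the constant $C$ is taken to be zero.

```python
import math
import random
import statistics


def deviance(theta, y, sigma=1.0):
    """D(theta) = -2 log p(y | theta) for y_i ~ Normal(theta, sigma^2), C = 0."""
    n = len(y)
    log_lik = -0.5 * n * math.log(2 * math.pi * sigma**2) - sum(
        (yi - theta) ** 2 for yi in y
    ) / (2 * sigma**2)
    return -2 * log_lik


def dic(theta_samples, y):
    """Return (DIC, p_D) using p_D = Var(D(theta)) / 2 over the draws."""
    d_samples = [deviance(t, y) for t in theta_samples]
    p_d = 0.5 * statistics.variance(d_samples)
    theta_bar = statistics.fmean(theta_samples)  # posterior mean of theta
    return deviance(theta_bar, y) + 2 * p_d, p_d


rng = random.Random(0)

# Simulated data from the assumed model.
y = [rng.gauss(2.0, 1.0) for _ in range(50)]
y_bar = statistics.fmean(y)

# Stand-in for MCMC output: exact draws from the Normal(y_bar, 1/n) posterior.
theta_samples = [rng.gauss(y_bar, 1 / math.sqrt(len(y))) for _ in range(5000)]

dic_value, p_d = dic(theta_samples, y)
print(dic_value, p_d)
```

For this one-parameter model $p_D$ should come out close to 1, which is a handy sanity check on the estimator before applying it to real MCMC draws.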