Negative Log-Likelihood And Statistical Hypothesis Testing
As The Basis Of Model Selection In IDEAs
(Full (Technical Report) Version)
Peter A.N. Bosman Dirk Thierens
peterb@cs.uu.nl Dirk.Thierens@cs.uu.nl
Department of Computer Science, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
August 2000
Abstract
In this paper, we analyze the most prominent features in model selection criteria that
have been used so far in iterated density estimation evolutionary algorithms (IDEAs, EDAs,
PMBGAs). These algorithms build probabilistic models and estimate probability densities
based upon a selection of available points. We show that the negative log-likelihood is the
basis of the inference features when the Kullback-Leibler divergence is used. We show how
issues previously found to be problematic in the case of continuous random variables can be
resolved by starting from the derived basics. By doing so, we obtain a probabilistic model
search metric that can be justified through the use of statistical hypothesis tests. This in turn
reduces the need for additional complexity penalties.
1 Introduction
Over the past few years, new probabilistic optimization algorithms inspired by evolutionary algorithms
have appeared [17]. The algorithms that have been introduced in this field apply search criteria
to the space of probability distributions to find a suitable probabilistic model, given a vector of
selected samples. The resulting probability distribution is subsequently used to draw more samples,
after which selection takes place again. Even though efficient algorithms have already been
proposed in this field, the motivation of the search criteria for finding a suitable probabilistic
model is often not thoroughly investigated. Our goal in this paper is to take a closer look at
one such search criterion, which is based on the Kullback-Leibler (KL) divergence. Different
algorithms [2, 3, 4, 6, 7, 14] have been proposed that use this divergence. We want to emphasize
the background of the KL divergence through its correspondence with likelihood maximization in
probability theory. By doing so, we give a unifying background regarding these topics. Using the
basic notion of likelihood, we aim to derive general probabilistic model selection criteria that, in
addition to previous approaches, can be justified through the use of statistical hypothesis testing.
The remainder of this paper is organized as follows. First, we introduce some notation in
section 2 and give a general framework for the algorithms themselves in section 3. Next, in
section 4, we formalize the statistical tools that have been used in these algorithms and show how
they together constitute some of the methods of induction used so far. In this same section, we
unify these statistical tools by looking at their relation with a basic notion of fit in probability
theory, being the likelihood. By taking a closer look at the negative log-likelihood, we come across
a few subtle details. We go into these details and show how some of the previous algorithms can be
simplified by statistical hypothesis testing on the basis of the negative log-likelihood. As a result,
some problems that have been encountered in the expansion of discrete to continuous optimization
algorithms [7] disappear. In section 5, we give some practical examples of the theoretic work
described in this paper. Subsequently, in section 6, we summarize our results and discuss some
topics of interest. Finally, we conclude this paper in section 7.
2 Some notation and statistics
We write $a = (a_0, a_1, \ldots, a_{|a|-1})$, as the ordering of the elements in
a vector is relevant. We assume to have $l$ random variables available, meaning that each sample
point is an $l$-dimensional vector. We introduce the notation $a\langle c\rangle = (a_{c_0}, a_{c_1}, \ldots, a_{c_{|c|-1}})$ for
selecting the elements of $a$ indexed by $c$. Let $\mathcal{L} = (0, 1, \ldots, l-1)$ be a vector of $l$ numbers and let $a \sqsubseteq \mathcal{L}$, meaning that $a$ contains only
elements of $\mathcal{L}$. We assume the alphabet for each discrete random variable to be the same, just as
we do for the range of each continuous random variable. We denote the alphabet for the discrete
random variables by $\mathcal{A}$. We can now define the multivariate joint probability mass function (pmf):

$$p_a(x\langle a\rangle) = P(\forall_{i \in a}\langle X_i = x_i\rangle) \quad \text{such that} \quad \sum_{c \in \mathcal{A}^{|a|}} p_a(c) = 1, \;\text{and}\; p_a(\cdot) \geq 0 \qquad (1)$$
We write $P(X\langle a\rangle)(x\langle a\rangle)$ for $p_a(x\langle a\rangle)$, making $P(X\langle a\rangle)$ a pmf. In the continuous case, we
cannot use a pmf. Let $dy\langle a\rangle = \prod_{i=0}^{|a|-1} dy_{a_i}$. Using the notation shorthand $\int$ for $\int_{-\infty}^{\infty}$,
we define the multivariate joint probability density function (pdf) $f_a$ by:

$$\int\!\!\int \cdots \int_A f_a(y\langle a\rangle)\, dy\langle a\rangle = P(Y\langle a\rangle \in A) \qquad (2)$$

such that

$$\int\!\!\int \cdots \int f_a(y\langle a\rangle)\, dy\langle a\rangle = 1, \quad f_a(\cdot) \geq 0, \quad \text{and} \quad A \subseteq \mathbb{R}^{|a|}$$

We write $P(Y\langle a\rangle)(y\langle a\rangle)$ for $f_a(y\langle a\rangle)$, making $P(Y\langle a\rangle)$ a pdf. We use $Y$ to denote continuous
random variables and $X$ to denote discrete random variables. If we do not distinguish between
these cases, we write $Z$.
We let $b \sqsubseteq \mathcal{L}$ and define $a \sqcup b$ to be the splicing of the two vectors so that the elements of $b$
are placed behind the elements of $a$, giving $|a \sqcup b| = |a| + |b|$. Using the definition of multivariate
probability, we can define conditional probability by:

$$P(Z\langle a\rangle \mid Z\langle b\rangle) = \frac{P(Z\langle a \sqcup b\rangle)}{P(Z\langle b\rangle)} \qquad (3)$$
Shannon [20] has defined the multivariate entropy measure. In the case of discrete random
variables, this measure is defined as:

$$H(P(X\langle a\rangle)) = -\sum_{c \in \mathcal{A}^{|a|}} P(X\langle a\rangle)(c)\ln(P(X\langle a\rangle)(c)) \qquad (4)$$

In the continuous case, the differential entropy is defined as:

$$h(P(Y\langle a\rangle)) = -\int\!\!\int \cdots \int P(Y\langle a\rangle)(y\langle a\rangle)\ln(P(Y\langle a\rangle)(y\langle a\rangle))\, dy\langle a\rangle \qquad (5)$$
Let $\mathbf{X} = X\langle\mathcal{L}\rangle$, $\mathbf{Y} = Y\langle\mathcal{L}\rangle$ and $\mathbf{Z} = Z\langle\mathcal{L}\rangle$. A well known distance metric from one probability
distribution $P^0(\mathbf{Z})$ to another probability distribution $P^1(\mathbf{Z})$ is the Kullback-Leibler (KL)
divergence. The KL divergence is also called relative entropy [13]. In the case of discrete random
variables, this distance metric equals:

$$D(P^0(\mathbf{X}) \| P^1(\mathbf{X})) = \sum_{c \in \mathcal{A}^l} P^0(\mathbf{X})(c)\ln\frac{P^0(\mathbf{X})(c)}{P^1(\mathbf{X})(c)} \qquad (6)$$

In the continuous case, the KL metric is equal to:

$$d(P^0(\mathbf{Y}) \| P^1(\mathbf{Y})) = \int\!\!\int \cdots \int P^0(\mathbf{Y})(y\langle\mathcal{L}\rangle)\ln\frac{P^0(\mathbf{Y})(y\langle\mathcal{L}\rangle)}{P^1(\mathbf{Y})(y\langle\mathcal{L}\rangle)}\, dy\langle\mathcal{L}\rangle \qquad (7)$$
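As a small numerical illustration of equation 6 (a sketch of our own; the two pmfs are arbitrary examples, not taken from the paper), the discrete KL divergence can be computed directly from pmfs represented as dictionaries:

```python
import math

def kl_divergence(p0, p1):
    """Discrete KL divergence D(P0 || P1) = sum_c P0(c) * ln(P0(c) / P1(c))."""
    return sum(p0[c] * math.log(p0[c] / p1[c]) for c in p0 if p0[c] > 0)

# Two pmfs over the joint binary domain A^2 = {00, 01, 10, 11}.
p0 = {"00": 0.4, "01": 0.1, "10": 0.1, "11": 0.4}
p1 = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}

assert kl_divergence(p0, p1) > 0           # positive whenever the pmfs differ
assert abs(kl_divergence(p0, p0)) < 1e-12  # zero from a distribution to itself
```

Note that $D$ is not symmetric: $D(P^0 \| P^1)$ and $D(P^1 \| P^0)$ generally differ.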
As a final statistical tool, we make use of the sample vector $S$. We let $S = (z^0, z^1, \ldots, z^{|S|-1})$ with
$z^i = (z^i_0, z^i_1, \ldots, z^i_{l-1})$ be a vector of independently drawn samples from a distribution $P(\mathbf{Z})$. We denote
this by $S \sim P(\mathbf{Z})$. The sample vector is used to find some probability distribution $\hat{P}(\mathbf{Z})$ as an
approximation to the true distribution $P(\mathbf{Z})$. The likelihood that the samples were drawn from
some given distribution $\hat{P}(\mathbf{Z})$ is defined as:

$$L(S \mid \hat{P}(\mathbf{Z})) = \prod_{i=0}^{|S|-1} \hat{P}(\mathbf{Z})(S_i) \qquad (8)$$

It is common practice to use the negative logarithm of the likelihood measure as it is often
computationally more convenient. The resulting expression is called the negative log-likelihood:

$$-\ln(L(S \mid \hat{P}(\mathbf{Z}))) = -\sum_{i=0}^{|S|-1} \ln(\hat{P}(\mathbf{Z})(S_i)) \qquad (9)$$
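As an illustration (a sketch of our own, with an assumed univariate normal pdf), equation 9 can be evaluated directly; a model close to the distribution that generated the samples obtains a lower negative log-likelihood than a clearly wrong one:

```python
import math
import random

def neg_log_likelihood(samples, pdf):
    """Negative log-likelihood: -sum_i ln(pdf(S_i)), equation 9."""
    return -sum(math.log(pdf(s)) for s in samples)

def normal_pdf(mu, sigma):
    return lambda y: math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(1)
S = [random.gauss(0.0, 1.0) for _ in range(1000)]

# A model close to the generating distribution scores a lower negative
# log-likelihood (i.e. a higher likelihood) than a badly placed one.
good = neg_log_likelihood(S, normal_pdf(0.0, 1.0))
bad = neg_log_likelihood(S, normal_pdf(5.0, 1.0))
assert good < bad
```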
3 Evolutionary optimization by iterated density estimation
Evolutionary optimization algorithms that make use of iterated density estimation attempt to
capture the probability distribution of a selected vector of samples. Using the estimated probability
distribution, more samples are generated and mixed with the existing samples to get a new sample
vector. Assume that we have an $l$-dimensional cost function $C(z\langle\mathcal{L}\rangle)$, which without loss of
generality we seek to minimize. We let $P^\theta(\mathbf{Z})$ be a probability distribution that is uniform over
all vectors $z\langle\mathcal{L}\rangle$ with $C(z\langle\mathcal{L}\rangle) \leq \theta$ and 0 otherwise. Sampling from $P^\theta(\mathbf{Z})$ gives more samples
that evaluate to a value below $\theta$. Moreover, if we know $\theta^* = \min_{z\langle\mathcal{L}\rangle}\{C(z\langle\mathcal{L}\rangle)\}$, sampling from
$P^{\theta^*}(\mathbf{Z})$ gives an optimal solution. This rationale has led to the definition of the IDEA (Iterated
Density Estimation Evolutionary Algorithm) framework [4]. The greatest difference between other
similar approaches and the IDEA is that the IDEA has mostly been used to focus on continuous
optimization problems [4, 6, 7].
We cannot expect to efficiently solve every structured optimization problem with just any
probabilistic model. This has been demonstrated on problems in which multiple variables
interact [5, 18]. Therefore, a higher order probability distribution such as $\hat{P}(Z_0, Z_1)$ instead of
$\hat{P}(Z_0)\hat{P}(Z_1)$ is sometimes required. However, the higher the order of complexity, the larger the
amount of computational resources required to compute the probabilistic
model. Therefore, an attempt is made to infer the interactions between the problem variables
from the given sample vector in order to find a probabilistic model $\mathcal{M}$. The notion of a probabilistic model
is used as a computational implementation of a probability distribution. A probabilistic model $\mathcal{M}$
uniquely corresponds to a probability distribution. It can be seen to consist of some structure $\varsigma$
and a vector of parameters $\theta$ that defines the pdfs to be fit over each joint multivariate structure
implied by $\varsigma$. An example of a structure $\varsigma$ is the notion of a factorization, which we denote by $f$.
A factorization factors the probability distribution over $\mathbf{Y}$ into a product of pdfs. An example
in the case of $l = 3$ is $P(\mathbf{Y}) = P(Y_0, Y_1)P(Y_2)$. Once a structure $\varsigma$ is given, the parameters that
have to be estimated can be derived from the multivariate pdfs that have to be fit. Since the
way in which the parameters are fit is predefined beforehand together with the pdfs to fit,
we denote the parameter vector that is obtained in this manner by $\theta_{fit}$. If a probability
distribution is constructed using only the
structure $\varsigma$, we denote it by $P_\varsigma(\mathbf{Y})$. The definition of the IDEA framework can now be given as
follows:
IDEA(n, τ, m, sel(), rep(), ter(), sea(), est(), sam())
   Initialize an empty vector of samples      P ← ()
   Add and evaluate n random samples          for i ← 0 to n − 1 do
                                                 P ← P ⊔ (new random z⟨L⟩)
                                                 c[P_i] ← C(P_i)
   Initialize the iteration counter           t ← 0
   Iterate until termination                  while ¬ter() do
      Select ⌊τn⌋ samples                        (z⁰⟨L⟩, z¹⟨L⟩, …, z^⌊τn⌋−1⟨L⟩) ← sel()
                                                 such that ∀ i∈N ∀ k∉N ⟨c[zⁱ⟨L⟩] ≤ c[z^k⟨L⟩]⟩
      Search for a structure ς                   ς ← sea()
      Estimate the parameters                    θ_fit ← est()
      Sample m new samples into O                for i ← 0 to m − 1 do O_i ← sam()
      Replace a part of P with O                 rep()
      Evaluate the new samples in P              for each unevaluated P_i do
                                                 c[P_i] ← C(P_i)
      Update the generation counter              t ← t + 1
   Denote the required iterations by t_end    t_end ← t
In the IDEA framework, we have that $N = (0, 1, \ldots, \lfloor\tau n\rfloor - 1)$ and $\tau \in [\frac{1}{n}, 1]$; sel() is the selection
operator, rep() replaces a subset of $P$ with a subset of $O$, ter() is the termination condition, sea()
is a model structure search algorithm, est() estimates the parameters that are implied by both
the structure $\varsigma$ as well as predefined pdfs, and sam() generates a single sample using the estimated
densities. The evolutionary algorithm characteristic of the IDEA lies in the fact that a population
of individuals is used from which individuals are selected to generate new offspring. Using
these offspring along with the parent individuals and the current population, a new population is
constructed.
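To make the framework concrete, the following is a minimal sketch of one possible instantiation (our own illustration, not an implementation from [4]): the structure ς is fixed to the univariate factorization (so there is no sea() step), sel() is truncation selection, est() fits a normal pdf per variable, and rep() replaces the non-selected part of P with the offspring O:

```python
import math
import random

def idea_univariate_normal(C, l, n, tau, iterations, seed=0):
    """A minimal IDEA-style loop with a fixed univariate-normal model:
    truncation selection (sel), per-variable normal fit (est),
    offspring sampling (sam) and replacement of the non-selected
    part of the population (rep)."""
    rng = random.Random(seed)
    P = [[rng.uniform(-5.0, 5.0) for _ in range(l)] for _ in range(n)]
    for _ in range(iterations):
        P.sort(key=C)                      # truncation selection: keep best floor(tau*n)
        sel = P[: int(tau * n)]
        mu = [sum(z[i] for z in sel) / len(sel) for i in range(l)]
        sd = [max(1e-9, math.sqrt(sum((z[i] - mu[i]) ** 2 for z in sel) / len(sel)))
              for i in range(l)]
        offspring = [[rng.gauss(mu[i], sd[i]) for i in range(l)]
                     for _ in range(n - len(sel))]
        P = sel + offspring                # replace the non-selected part of P
    return min(P, key=C)

# Sphere function: minimum 0 at the origin.
best = idea_univariate_normal(lambda z: sum(x * x for x in z),
                              l=3, n=100, tau=0.3, iterations=30)
assert sum(x * x for x in best) < 1.0
```

The choice of a univariate normal model here is deliberate: it is the simplest member of the first category of algorithms discussed in section 4, performing no induction on the structure at all.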
Once $\varsigma$ has been selected, the actual pdfs are computed. During the course of deriving $\varsigma$,
however, the required pdfs will often already have been computed [4]. In any case, using
equation 3, computing the pdfs can functionally be restricted to computing multivariate joint
pdfs. In the discrete case, there has been only one way in which these functions are computed,
which is the most straightforward way. The probabilities equal the frequency with which some
combination of values appears in the sample vector, divided by the total number of samples:

$$\hat{P}(X\langle a\rangle)(c\langle a\rangle) = \frac{1}{|S|}\sum_{i=0}^{|S|-1}\begin{cases}1 & \text{if } x^i\langle a\rangle = c\langle a\rangle\\0 & \text{otherwise}\end{cases} \qquad (10)$$
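The frequency estimate of equation 10 amounts to counting value combinations (a minimal sketch of our own):

```python
from collections import Counter

def empirical_pmf(samples):
    """Equation 10: P_hat(c) = (number of occurrences of c in S) / |S|."""
    n = len(samples)
    counts = Counter(tuple(s) for s in samples)
    return {c: k / n for c, k in counts.items()}

# Eight 2-dimensional discrete sample points.
S = [(0, 0), (0, 1), (0, 0), (1, 1), (0, 0), (1, 1), (0, 1), (0, 0)]
p_hat = empirical_pmf(S)
assert p_hat[(0, 0)] == 4 / 8 and p_hat[(1, 1)] == 2 / 8
assert abs(sum(p_hat.values()) - 1.0) < 1e-12
```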
In the case of continuous random variables, such a most straightforward way to compute the
probabilities does not exist. In general, some model is fit to the data as well as possible by
specifying its parameters. For instance, a multivariate normal pdf can be used, which is fully
specified by its covariance matrix and its mean vector. This discrepancy between discrete and
continuous random variables is important in certain model selection techniques, as we shall see.
4 Negative log-likelihood as the basis of statistical model selection
The algorithms in the IDEA field that have been proposed so far can roughly be divided into three
categories. One category consists of the methods that use a univariate distribution, in which none of
the variables interact with other variables. We make no statement of any kind on these algorithms,
as they perform no induction on the model since $\varsigma$ is fixed. The methods in another category use
¹Note that in the IDEA framework, the samples we are referring to are the result of the selection
operation $S \leftarrow sel()$.
a higher order structure and for instance use the KL divergence with the full joint probability
distribution to guide the search. For factorizations, it has been noted before [6] that using the
KL divergence should be accompanied by strong restrictions on $f$, because otherwise minimizing
the divergence will give just the full joint probability distribution. In the case of a factorization, the
heavy restrictions are usually embodied by a parameter that denotes the maximum number of variables
any one variable is allowed to interact with. To do away with this parameter, the methods in a third category
have been invented. These methods add a penalty term to the search metric in order to penalize
more complex models. The amount of penalization is, however, usually parameterized again.
4.1 Negative log-likelihood and sample entropy
To start out, we first note that there is a correspondence between equations 4 and 9, as well as
between equations 5 and 9. We first focus on the discrete case, which is the simplest.

We split up $S$ into $|\mathcal{A}|^l$ mutually disjoint subvectors so that each subvector contains only one
type of truncated sample point. If we define $a \sqcap b$ to be the order dependent intersection of vectors
$a$ and $b$ such that each element of $a$ is preserved if it occurs in $b$, we can rewrite equation 9 by
summing over all possible subvectors of $S$, as they together are exactly equal to $S$:

$$-\ln(L(S \mid \hat{P}(\mathbf{X}))) = -\sum_{c \in \mathcal{A}^l} |S \sqcap c| \ln(\hat{P}(\mathbf{X})(c)) \qquad (11)$$

Note that $|S \sqcap c|$ is just the number of times that vector $c$ occurs in the sample vector. Using
equation 10, this means that we have $|S \sqcap c| = |S|\hat{P}(\mathbf{X})(c)$, so we may conclude:

$$-\ln(L(S \mid \hat{P}(\mathbf{X}))) = -|S|\sum_{c \in \mathcal{A}^l} \hat{P}(\mathbf{X})(c)\ln(\hat{P}(\mathbf{X})(c)) = |S|H(\hat{P}(\mathbf{X})) \qquad (12)$$

We observe that the negative log-likelihood equals $|S|$ times the multivariate
entropy of the estimated model. The reason for this is that the estimated model in the
discrete case fits over the sample points with maximum likelihood. The entropy measures are
exact in the sense that they go over the probabilities of every domain element. Alternatively, we
can define sample entropies that compute the measure given a vector of samples $S' \sim \hat{P}(\mathbf{Z})$.
Note that it is essential that the samples were sampled from the probability distribution for which
we are computing the entropy. As the sample entropy is the same in the discrete as well as the
continuous case, we have only a single definition for it:

$$H(S', \hat{P}(\mathbf{Z})) = -\frac{1}{|S'|}\sum_{i=0}^{|S'|-1} \ln(\hat{P}(\mathbf{Z})(S'_i)) \quad \text{such that} \quad S' \sim \hat{P}(\mathbf{Z}) \qquad (13)$$
Note that the definition of the sample entropy is closely related to that of the negative log-likelihood.
To be exact, the sample entropy equals the negative log-likelihood divided by the
number of samples, under the additional requirement that $S' \sim \hat{P}(\mathbf{Z})$, which we do not require in the case of the negative
log-likelihood.
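The relation in equation 12 can be verified numerically (a sketch of our own): for the maximum-likelihood frequency estimate of equation 10, the negative log-likelihood of the sample vector equals exactly |S| times the entropy of the estimated pmf:

```python
import math
from collections import Counter

def entropy(pmf):
    """Equation 4: H(P) = -sum_c P(c) ln(P(c))."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

S = ["00", "01", "00", "11", "00", "11", "01", "00"]
n = len(S)
p_hat = {c: k / n for c, k in Counter(S).items()}   # frequency estimate, equation 10

nll = -sum(math.log(p_hat[s]) for s in S)           # negative log-likelihood, equation 9
assert abs(nll - n * entropy(p_hat)) < 1e-9         # equation 12: -ln L = |S| * H(P_hat)
```

The identity holds exactly here because grouping the sum over samples by value combination turns each term into a count times a log-probability, and the counts are precisely |S| times the estimated probabilities.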
In the derivation of equation 12 from 11, we have used that $|S \sqcap c| = |S|\hat{P}(\mathbf{X})(c)$. But
$S \sim P(\mathbf{X})$ and not $S \sim \hat{P}(\mathbf{X})$. Therefore it should be noted that only in the limit of $|S'| \to \infty$
we have that $H(S', \hat{P}(\mathbf{X})) \to H(\hat{P}(\mathbf{X}))$, $S' \sim \hat{P}(\mathbf{X})$. Furthermore, in the limit of $|S| \to \infty$, we
have that $|S \sqcap c| = |S|P(\mathbf{X})(c)$, since then $\hat{P}(\mathbf{X})(c) \to P(\mathbf{X})(c)$, and thus in the additional limit
of $|S'| \to \infty$, $S' \sim \hat{P}(\mathbf{X})$, we have that $H(S', \hat{P}(\mathbf{X})) \to H(P(\mathbf{X}))$. We therefore have:

$$\lim_{|S| \to \infty}\; \lim_{|S'| \to \infty} H(S', \hat{P}(\mathbf{X})) = -\sum_{c \in \mathcal{A}^l} P(\mathbf{X})(c)\ln(\hat{P}(\mathbf{X})(c)) = H(P(\mathbf{X})) \qquad (14)$$
In the continuous case, things are more complex. We discretize each variable so that its range consists
of equidistant intervals of length $m$, and let $M = \langle \ldots, [-2m, -m), [-m, 0), [0, m), [m, 2m), \ldots \rangle$.
We write $(S \sqcap A)_i$ for the $i$-th sample from the sample vector that falls into area $A$. Recalling
that $S \sim P(\mathbf{Y})$, equation 9 becomes:

$$-\ln(L(S \mid \hat{P}(\mathbf{Y}))) = -\sum_{A \in M^l} \sum_{i=0}^{|S \sqcap A|-1} \ln(\hat{P}(\mathbf{Y})((S \sqcap A)_i)) \qquad (15)$$

Now in the limit of $|S| \to \infty$, we have that $|S \sqcap A| \to |S|P(\mathbf{Y} \in A)$. We may thus write:

$$\lim_{|S| \to \infty} -\frac{1}{|S|}\ln(L(S \mid \hat{P}(\mathbf{Y}))) = -\sum_{A \in M^l} P(\mathbf{Y} \in A)\ln(\hat{P}(\mathbf{Y})(y_A)) \qquad (16)$$

where $y_A$ is a point in area $A$. In the limit of $m \to 0$, each area $A$ becomes infinitely small and is therefore shrunk to a single
point. The discrete sum over all areas then becomes an integral over all points:

$$\lim_{|S| \to \infty} -\frac{1}{|S|}\ln(L(S \mid \hat{P}(\mathbf{Y}))) = -\int\!\!\int \cdots \int P(\mathbf{Y})(y\langle\mathcal{L}\rangle)\ln(\hat{P}(\mathbf{Y})(y\langle\mathcal{L}\rangle))\, dy\langle\mathcal{L}\rangle \qquad (17)$$
From equation 17 it follows that in the limit of $|S'| \to \infty$, $H(S', \hat{P}(\mathbf{Y})) \to h(\hat{P}(\mathbf{Y}))$ if and only
if $S' \sim \hat{P}(\mathbf{Y})$. Note that we cannot enforce the same further relation as we derived for the discrete
case. The relation that we have in the discrete case, namely that in the limits of $|S| \to \infty$, $|S'| \to \infty$, the
sample entropy over $\hat{P}(\mathbf{X})$ becomes equal to the entropy over the actual underlying probability
distribution $P(\mathbf{X})$, does not hold in general in the continuous case. Actually, it does not hold
in general in the discrete case either, but we have assumed that in the discrete case we always
approximate probabilities by using definition 10. As the definition in equation 10 becomes equal
to $P(\mathbf{X})$ in the limit of $|S| \to \infty$, this is clearly true. However, in the continuous case, we hardly
ever have that in the limit of $|S| \to \infty$ we get $\hat{P}(\mathbf{Y}) \to P(\mathbf{Y})$. So we cannot conclude either that
the entropy of the true distribution becomes equal to the entropy of the estimated distribution.
In order for that to hold in general, we require a special pdf:

$$\int_{a_0}^{b_0}\int_{a_1}^{b_1} \cdots \int_{a_{l-1}}^{b_{l-1}} \hat{P}(\mathbf{Y})(y\langle\mathcal{L}\rangle)\, dy\langle\mathcal{L}\rangle = \frac{|S \sqcap ([a_0, b_0] \times [a_1, b_1] \times \cdots \times [a_{l-1}, b_{l-1}])|}{|S|} \qquad (18)$$

Such a pdf exactly follows the sample vector and is therefore
said to overfit the data. Avoiding such overfitting is a very important aspect of density estimation and is termed
generalization. As in a practical application we shall use a more generalizing model for $\hat{P}(\mathbf{Y})$, we
cannot enforce that in the limits of $|S| \to \infty$, $|S'| \to \infty$, $H(S', \hat{P}(\mathbf{Y})) \to h(P(\mathbf{Y}))$, $S' \sim \hat{P}(\mathbf{Y})$.
There is, however, one final important issue. In the discrete case, we found that the negative
log-likelihood over the sample vector equals $|S|$ times the entropy of the estimated distribution
(equation 12). We have, however, not established this in the continuous case. In our derivation,
we have directly derived that given an infinite amount of samples, the sample entropy becomes
equal to the actual differential entropy over the estimated model. As a matter of fact, we do
not in general have the same relation between the entropy and the negative log-likelihood in the
continuous case as we do in the discrete case. The reason why it does hold in the discrete case
is that equation 10 is a maximum likelihood estimate of the sample vector. As we have not
assumed that our fit in the continuous case is a maximum likelihood fit unless we use equation 18,
we have not enforced this relation in the continuous case. However, Kullback [13] has shown that
if and only if $\hat{P}(\mathbf{Y})$ is a maximum likelihood estimate over $S$, the equality holds both in the
discrete case as well as the continuous case. Let $\vartheta$ be the vector of parameters for $\hat{P}(\mathbf{Z})$ and
$\Theta$ be the vector space of all possible values for the parameters $\vartheta$; we can then formalize this by:

$$\hat{P}(\mathbf{Z}) = \arg\max\{L(S \mid P) \mid P \in \{\hat{P}(\mathbf{Z} \mid \vartheta) \mid \vartheta \in \Theta\}\} \qquad (19)$$
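For a univariate normal pdf, the maximum-likelihood parameters of equation 19 are available in closed form: the sample mean and the (biased) sample variance. A small check (a sketch of our own) that perturbing either parameter away from these values only lowers the log-likelihood:

```python
import math
import random

def log_likelihood_normal(S, mu, sigma):
    """ln L(S | N(mu, sigma^2)) = sum_i ln pdf(S_i)."""
    return sum(-0.5 * ((s - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi)) for s in S)

random.seed(2)
S = [random.gauss(3.0, 2.0) for _ in range(500)]

mu_ml = sum(S) / len(S)                               # maximum-likelihood mean
var_ml = sum((s - mu_ml) ** 2 for s in S) / len(S)    # maximum-likelihood (biased) variance
sigma_ml = math.sqrt(var_ml)

best = log_likelihood_normal(S, mu_ml, sigma_ml)
for mu, sigma in [(mu_ml + 0.5, sigma_ml), (mu_ml, 1.5 * sigma_ml),
                  (mu_ml - 1.0, 0.5 * sigma_ml)]:
    assert log_likelihood_normal(S, mu, sigma) < best
```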
4.2 KL divergence and negative log-likelihood model selection
In the methods in the second mentioned category, model selection is performed subject to certain
constraints. Let $\mathcal{C}_\varsigma$ be the set of all model structures that satisfy the constraints, $\varsigma \in \mathcal{C}_\varsigma$. The
constraints on the model have varied over the past from none [1, 19] to those of chain dependencies [3],
tree dependencies [2], conditional dependencies with a maximum number of conditional variables per
variable [4, 6, 7, 14, 15, 16], and marginal product models [11]. Prior to the IDEA framework, all
of the higher order models were regarded in the case of discrete (binary) random variables.
The basic idea is then to minimize the distance between the true probability distribution and
the estimated probability distribution. As a distance measure, the KL divergence between two
probability distributions can be used, in which one probability distribution is the most complex
probability distribution $\hat{P}(\mathbf{Z})$ and the other is the estimate $\hat{P}_\varsigma(\mathbf{Z})$.

In the case of factorizations, an algorithm to approximately find the best factorization can
start from the complete univariate factorization, in which each variable is taken independently, and
incrementally build a more complex factorization. Each step in the incremental algorithm is based
upon the change that brings about the largest decrease in the KL divergence to $\hat{P}(\mathbf{Z})$. In the
case of a model structure in general, this means that if we let $\hat{P}_{\varsigma^0}(\mathbf{Z})$ be the current model, we
test it against a (more complex) model $\hat{P}_{\varsigma^1}(\mathbf{Z})$ by observing the change in the KL divergence to
the most complex probability distribution. The model that brings about the greatest decrease is
selected as the new current model. The metric that has been used in the discrete case has only
been applied to factorizations and is computed with respect to the full joint factorization:
$$D(\hat{P}(\mathbf{X}) \| \hat{P}_{f^0}(\mathbf{X})) - D(\hat{P}(\mathbf{X}) \| \hat{P}_{f^1}(\mathbf{X})) = \sum_{c \in \mathcal{A}^l} \hat{P}(\mathbf{X})(c)\ln(\hat{P}_{f^1}(\mathbf{X})(c)) - \sum_{c \in \mathcal{A}^l} \hat{P}(\mathbf{X})(c)\ln(\hat{P}_{f^0}(\mathbf{X})(c)) \qquad (20)$$

A factorized probability distribution is a product of multivariate joint pdfs, which are its
elemental building blocks. With equation 3,
we therefore have that both sums in equation 20 are actually sums over sums over certain vectors
$a$ of $\hat{P}(\mathbf{X})(c)\ln(\hat{P}(X\langle a\rangle)(c\langle a\rangle))$ with $c \in \mathcal{A}^l$. Since, given some vector $k \sqsupseteq a$, the theorem of total
probability gives us that $\sum_{c \in \mathcal{A}^{|k|}} P(X\langle k\rangle)(c)\ln(P(X\langle a\rangle)(c\langle a\rangle)) = \sum_{c \in \mathcal{A}^{|a|}} P(X\langle a\rangle)(c)\ln(P(X\langle a\rangle)(c))$, equation 20 transforms into:

$$D(\hat{P}(\mathbf{X}) \| \hat{P}_{f^0}(\mathbf{X})) - D(\hat{P}(\mathbf{X}) \| \hat{P}_{f^1}(\mathbf{X})) = \sum_{a \in f^0} H(\hat{P}(X\langle a\rangle)) - \sum_{a \in f^1} H(\hat{P}(X\langle a\rangle)) \qquad (21)$$

In the continuous case, we similarly have:

$$d(\hat{P}(\mathbf{Y}) \| \hat{P}_{f^0}(\mathbf{Y})) - d(\hat{P}(\mathbf{Y}) \| \hat{P}_{f^1}(\mathbf{Y})) = \sum_{a \in f^0} h(\hat{P}(Y\langle a\rangle)) - \sum_{a \in f^1} h(\hat{P}(Y\langle a\rangle)) \qquad (22)$$
The continuous analogue has been used [4, 6, 7] as a
logical expansion of the discrete case. This requires, however, the computation of a multivariate
integral over a given pdf. For the normal pdf [4], this expression can be derived analytically.
However, in the case of the normal kernels pdf [7], this is no longer possible. The same is the case
for the use of a normal mixture pdf. The latter two pdfs are, however, very useful in the IDEA
approach. In order to use them anyhow, the integrals may be computed numerically. There are
many ways to do this, and mostly this requires an amount of time exponential in the dimensionality
of the integral. However, we have seen in this paper that we can use an approximation. If
we can efficiently sample points from the estimation $\hat{P}_f(\mathbf{Y})$, equation 17 tells us that we may
use $H(S', \hat{P}_f(\mathbf{Y}))$ with $S' \sim \hat{P}_f(\mathbf{Y})$ as an approximation to the true entropy over the estimated
model. This resolves the problems regarding the computation of the entropy in the search metric
as encountered so far in continuous IDEAs [6, 7]. Summarizing, for finding $f$, the methods using
the KL divergence have used:
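The sampling-based entropy approximation described above can be sketched as follows (our own illustration, with an assumed two-component univariate normal mixture standing in for the normal mixture or normal kernels pdf): the sample entropy H(S', P̂) of equation 13, computed from points drawn from P̂ itself, estimates the differential entropy h(P̂) without any multivariate integration:

```python
import math
import random

def mixture_pdf(y, comps):
    """Density of a univariate normal mixture; comps is a list of (weight, mu, sigma)."""
    return sum(w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in comps)

def sample_mixture(rng, comps):
    w, m, s = rng.choices(comps, weights=[c[0] for c in comps])[0]
    return rng.gauss(m, s)

rng = random.Random(3)
comps = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]

# Sample entropy H(S', P_hat) with S' drawn from P_hat itself (equation 13).
S1 = [sample_mixture(rng, comps) for _ in range(50000)]
h_est = -sum(math.log(mixture_pdf(y, comps)) for y in S1) / len(S1)

# Sanity bounds: a balanced mixture's differential entropy lies between the
# entropy of one component, 0.5*ln(2*pi*e), and that value plus ln 2
# (the entropy of the component choice).
h_comp = 0.5 * math.log(2 * math.pi * math.e)
assert h_comp < h_est < h_comp + math.log(2) + 0.05
```

The same estimator applies unchanged to multivariate kernel or mixture pdfs, which is precisely the setting in which the analytic differential entropy is unavailable.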