Negative Log-Likelihood And Statistical Hypothesis Testing As The Basis Of Model Selection In IDEAs (Full (Technical Report) Version)

Peter A.N. Bosman (peterb@cs.uu.nl)
Dirk Thierens (Dirk.Thierens@cs.uu.nl)
Department of Computer Science, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

August 2000

Abstract

In this paper, we analyze the most prominent features in model selection criteria that have been used so far in iterated density estimation evolutionary algorithms (IDEAs, EDAs, PMBGAs). These algorithms build probabilistic models and estimate probability densities based upon a selection of available points. We show that the negative log-likelihood is the basis of the inference features when the Kullback-Leibler divergence is used. We show how issues previously found to be problematic in the case of continuous random variables can be resolved by starting from the derived basics. By doing so, we obtain a probabilistic model search metric that can be justified through the use of statistical hypothesis tests. This in turn reduces the need for additional complexity penalties.

1 Introduction

Over the past few years, new probabilistic optimization algorithms inspired by evolutionary algorithms have appeared [17]. The algorithms that have been introduced in this field apply search criteria to the space of probability distributions to find a suitable probabilistic model, given a vector of selected samples. The resulting probability distribution is subsequently used to draw more samples, after which selection takes place again. Even though efficient algorithms have already been proposed in this field, the motivation of the search criteria for finding a suitable probabilistic model is often not thoroughly investigated. Our goal in this paper is to take a closer look at one such search criterion, which is based on the Kullback-Leibler (KL) divergence. Different algorithms [2, 3, 4, 6, 7, 14] have been proposed that use this divergence. We want to emphasize the background of the KL divergence through its correspondence with likelihood maximization in probability theory. By doing so, we give a unifying background regarding these topics. Using the basic notion of likelihood, we aim to derive general probabilistic model selection criteria that, in addition to previous approaches, can be justified through the use of statistical hypothesis testing.

The remainder of this paper is organized as follows. First, we introduce some notation in section 2 and give a general framework for the algorithms themselves in section 3. Next, in section 4, we formalize the statistical tools that have been used in these algorithms and show how they together constitute some of the methods of induction used so far. In the same section, we unify these statistical tools by looking at their relation with a basic notion of fit in probability theory, being the likelihood. By taking a closer look at the negative log-likelihood, we come across a few subtle details. We go into these details and show how some of the previous algorithms can be simplified by statistical hypothesis testing on the basis of the negative log-likelihood. As a result, some problems that have been encountered in the expansion of discrete to continuous optimization algorithms [7] disappear. In section 5, we give some practical examples of the theoretic work described in this paper. Subsequently, in section 6, we summarize our results and discuss some topics of interest. Finally, we conclude this paper in section 7.
2 Some notation and statistics

We write $a = (a_0, a_1, \ldots, a_{|a|-1})$ for a vector $a$, where the ordering of the elements in a vector is relevant. We assume to have $l$ random variables available, meaning that each sample point is an $l$-dimensional vector. We introduce the notation $a\langle c\rangle = (a_{c_0}, a_{c_1}, \ldots, a_{c_{|c|-1}})$. We let $\mathcal{L} = (0, 1, \ldots, l-1)$ be a vector of $l$ numbers and let $a \sqsubseteq \mathcal{L}$, meaning that $a$ contains only elements of $\mathcal{L}$. We assume the alphabet for each discrete random variable to be the same, just as we do for the range of each continuous random variable. We denote the alphabet for the discrete random variables by $A$. We can now define the multivariate joint probability mass function (pmf):

    $p_a(x\langle a\rangle) = P(\forall_{i\in a}\langle X_i = x_i\rangle)$ such that $\sum_{c\in A^{|a|}} p_a(c) = 1$ and $p_a(\cdot) \geq 0$    (1)

We write $P(X\langle a\rangle)(x\langle a\rangle)$ for $p_a(x\langle a\rangle)$, making $P(X\langle a\rangle)$ a pmf. In the continuous case, we cannot use a pmf. Let $dy\langle a\rangle = \prod_{i=0}^{|a|-1} dy_{a_i}$. Using the notational shorthand $\int$ for $\int_{-\infty}^{\infty}$, we define the multivariate joint probability density function (pdf) $f_a$ by:

    $\int\!\!\int\cdots\int_{\mathcal{A}} f_a(y\langle a\rangle)\,dy\langle a\rangle = P(Y\langle a\rangle \in \mathcal{A})$    (2)

such that $\int\!\!\int\cdots\int f_a(y\langle a\rangle)\,dy\langle a\rangle = 1$ and $f_a(\cdot) \geq 0$, where $\mathcal{A}$ is a region of $\mathbb{R}^{|a|}$. We write $P(Y\langle a\rangle)(y\langle a\rangle)$ for $f_a(y\langle a\rangle)$, making $P(Y\langle a\rangle)$ a pdf. We use $Y$ to denote continuous random variables and $X$ to denote discrete random variables. If we do not distinguish between these cases, we write $Z$. We let $b \sqsubseteq \mathcal{L}$ and define $a \sqcup b$ to be the splicing of the two vectors, so that the elements of $b$ are placed behind the elements of $a$, giving $|a \sqcup b| = |a| + |b|$. Using the definition of multivariate probability, we can define conditional probability by:

    $P(Z\langle a\rangle \mid Z\langle b\rangle) = \frac{P(Z\langle a \sqcup b\rangle)}{P(Z\langle b\rangle)}$    (3)

Shannon [20] has defined the multivariate entropy measure. In the case of discrete random variables, this measure is defined as:

    $H(P(X\langle a\rangle)) = -\sum_{c\in A^{|a|}} P(X\langle a\rangle)(c)\,\ln(P(X\langle a\rangle)(c))$    (4)

In the continuous case, the corresponding differential entropy is defined as:

    $h(P(Y\langle a\rangle)) = -\int\!\!\int\cdots\int P(Y\langle a\rangle)(y\langle a\rangle)\,\ln(P(Y\langle a\rangle)(y\langle a\rangle))\,dy\langle a\rangle$    (5)

Let $\boldsymbol{X} = X\langle\mathcal{L}\rangle$, $\boldsymbol{Y} = Y\langle\mathcal{L}\rangle$ and $\boldsymbol{Z} = Z\langle\mathcal{L}\rangle$. A well known distance metric from one probability distribution $P^0(\boldsymbol{Z})$ to another probability distribution $P^1(\boldsymbol{Z})$ is the Kullback-Leibler (KL) divergence. The KL divergence is also called relative entropy [13]. In the case of discrete random variables, this distance metric equals:

    $D(P^0(\boldsymbol{X}) \,\|\, P^1(\boldsymbol{X})) = \sum_{c\in A^l} P^0(\boldsymbol{X})(c)\,\ln\!\left(\frac{P^0(\boldsymbol{X})(c)}{P^1(\boldsymbol{X})(c)}\right)$    (6)

In the continuous case, the KL metric is equal to:

    $d(P^0(\boldsymbol{Y}) \,\|\, P^1(\boldsymbol{Y})) = \int\!\!\int\cdots\int P^0(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)\,\ln\!\left(\frac{P^0(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)}{P^1(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)}\right)dy\langle\mathcal{L}\rangle$    (7)

As a final statistical tool, we make use of the sample vector $S$. We let $S = (z^0, z^1, \ldots, z^{|S|-1})$, with $z^i = (z^i_0, z^i_1, \ldots, z^i_{l-1})$, be a vector of independently drawn samples from a distribution $P(\boldsymbol{Z})$. We denote this by $S \sim P(\boldsymbol{Z})$. The sample vector is used to find some probability distribution $\hat P(\boldsymbol{Z})$ as an approximation to the true distribution $P(\boldsymbol{Z})$. The likelihood that the samples were drawn from some given distribution $\hat P(\boldsymbol{Z})$ is defined as:

    $\mathcal{L}(S \mid \hat P(\boldsymbol{Z})) = \prod_{i=0}^{|S|-1} \hat P(\boldsymbol{Z})(z^i)$    (8)

It is common practice to use the negative logarithm of the likelihood measure, as it is often computationally more convenient. The resulting expression is called the negative log-likelihood:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Z}))) = -\sum_{i=0}^{|S|-1} \ln(\hat P(\boldsymbol{Z})(z^i))$    (9)
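As a small illustration of equations 8 and 9, the following Python sketch (ours; the pmf and the sample size are arbitrary) computes both quantities for a sample vector under a candidate estimate, and shows why the logarithmic form is computationally more convenient: the raw product of equation 8 underflows for large sample vectors, while the sum of logarithms in equation 9 remains well behaved.

    import numpy as np

    rng = np.random.default_rng(0)

    # A candidate estimated pmf over a small discrete domain
    p_hat = {0: 0.5, 1: 0.3, 2: 0.2}

    # A sample vector S of independently drawn points
    S = rng.choice(list(p_hat), size=2000, p=list(p_hat.values()))

    # Equation 8: the likelihood underflows to 0.0 for |S| this large
    likelihood = np.prod([p_hat[z] for z in S])

    # Equation 9: the negative log-likelihood remains finite and usable
    nll = -np.sum([np.log(p_hat[z]) for z in S])

    print(likelihood, nll)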
3 Evolutionary optimization by iterated density estimation

Evolutionary optimization algorithms that make use of iterated density estimation attempt to capture the probability distribution of a selected vector of samples. Using the estimated probability distribution, more samples are generated and mixed with the existing samples to get a new sample vector. Assume that we have an $l$-dimensional cost function $C(z\langle\mathcal{L}\rangle)$, which without loss of generality we seek to minimize. We let $P^{\theta}(\boldsymbol{Z})$ be a probability distribution that is uniform over all vectors $z\langle\mathcal{L}\rangle$ with $C(z\langle\mathcal{L}\rangle) \leq \theta$ and $0$ otherwise. Sampling from $P^{\theta}(\boldsymbol{Z})$ gives more samples that evaluate to a value below $\theta$. Moreover, if we know $\theta^* = \min_{z\langle\mathcal{L}\rangle}\{C(z\langle\mathcal{L}\rangle)\}$, sampling from $P^{\theta^*}(\boldsymbol{Z})$ gives an optimal solution. This rationale has led to the definition of the IDEA (Iterated Density Estimation Evolutionary Algorithm) framework [4]. The greatest difference between other similar approaches and the IDEA is that the IDEA has mostly been used to focus on continuous optimization problems [4, 6, 7].

We cannot expect to efficiently solve every structured optimization problem with just any probabilistic model. This has been demonstrated on problems in which multiple variables interact [5, 18]. Therefore, a higher order probability distribution such as $\hat P(Z_0, Z_1)$ instead of $\hat P(Z_0)\hat P(Z_1)$ is sometimes required. However, the higher the order of complexity, the larger the amount of computational resources required to compute the probabilistic model. Therefore, the interactions between the problem variables must be inferred from the given sample vector in order to find a probabilistic model $\mathcal{M}$. The notion of a probabilistic model is used as a computational implementation of a probability distribution. A probabilistic model $\mathcal{M}$ uniquely corresponds to a probability distribution. It can be seen to consist of some structure $\varsigma$ and a vector of parameters $\theta$ that defines the pdfs to be fit over each joint multivariate structure implied by $\varsigma$. An example of a structure $\varsigma$ is the notion of a factorization, which we denote by $f$. A factorization factors the probability distribution over $\boldsymbol{Y}$ to get a product of pdfs. An example in the case of $l = 3$ is $P(\boldsymbol{Y}) = P(Y_0, Y_1)P(Y_2)$. Once a structure $\varsigma$ is given, the parameters that have to be estimated can be derived from the multivariate pdfs that have to be fit. Since the way in which the parameters $\theta$ are fit is predefined beforehand together with the pdfs to fit, we denote the parameter vector that is obtained in this manner by $\theta_{fit}$. When we characterize a probability distribution using only the structure $\varsigma$, we denote it by $\hat P_\varsigma(\boldsymbol{Y})$. The definition of the IDEA framework can now be given as follows:

    IDEA(n, τ, m, sel(), rep(), ter(), sea(), est(), sam())
        Initialize an empty vector of samples: P ← ()
        Add and evaluate n random samples: for i ← 0 to n − 1 do
            P ← P ⊔ (z^i⟨L⟩ ← random sample); c[P_i] ← C(P_i)
        Initialize the iteration counter: t ← 0
        Iterate until termination: while ¬ter() do
            Select ⌊τn⌋ samples: (z^0⟨L⟩, z^1⟨L⟩, …, z^{⌊τn⌋−1}⟨L⟩) ← sel(), such that
                ∀_{i∈N} ∀_{k∉N} ⟨c[z^i⟨L⟩] ≤ c[z^k⟨L⟩]⟩
            Search for a structure: ς ← sea()
            Estimate the parameters: θ_fit ← est()
            Generate m new samples: for i ← 0 to m − 1 do O_i ← sam()
            Replace a subset of P with a subset of O: rep()
            Evaluate the new samples in P: for each unevaluated P_i do c[P_i] ← C(P_i)
            Update the generation counter: t ← t + 1
        Denote the required iterations by t_end: t_end ← t

In the IDEA framework, we have that $N = (0, 1, \ldots, \lfloor\tau n\rfloor - 1)$ and $\tau \in [\frac{1}{n}, 1]$; sel() is the selection operator, rep() replaces a subset of $P$ with a subset of $O$, ter() is the termination condition, sea() is a model structure search algorithm, est() estimates the parameters that are implied by both the structure $\varsigma$ and the predefined pdfs, and sam() generates a single sample using the estimated densities. The evolutionary algorithm characteristic of the IDEA lies in the fact that a population of individuals is used, from which individuals are selected to generate new offspring. Using these offspring along with the parent individuals and the current population, a new population is constructed. Once $\varsigma$ has been selected, the actual pdfs are computed. During the course of deriving $\varsigma$, however, the required pdfs will often already have been computed [4]. In any case, using equation 3, computing the pdfs can functionally be restricted to computing multivariate joint pdfs; a minimal instantiation of the full loop is sketched below.
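The framework deliberately leaves sel(), rep(), ter(), sea(), est() and sam() open. The following minimal Python sketch (ours, not a reference implementation) instantiates the loop with illustrative choices: truncation selection for sel(), a fixed univariate normal structure so that sea() is trivial, maximum likelihood mean and standard deviation for est(), and replacement of all non-selected samples by new offspring for rep().

    import numpy as np

    def idea_sketch(cost, l, n=100, tau=0.3, iterations=50, seed=1):
        rng = np.random.default_rng(seed)
        P = rng.uniform(-5.0, 5.0, size=(n, l))   # n random initial samples
        c = np.apply_along_axis(cost, 1, P)       # evaluate
        for t in range(iterations):               # ter(): fixed iteration budget
            best = np.argsort(c)[: int(tau * n)]  # sel(): floor(tau*n) best samples
            S = P[best]
            # est(): fit the predefined pdfs; here one normal per variable
            mu, sigma = S.mean(axis=0), S.std(axis=0) + 1e-12
            # sam() and rep(): offspring O replace the non-selected samples
            O = rng.normal(mu, sigma, size=(n - len(best), l))
            P = np.vstack([S, O])
            c = np.apply_along_axis(cost, 1, P)
        return P[np.argmin(c)], float(c.min())

    # Minimize the sphere function over 5 continuous variables
    best, best_cost = idea_sketch(lambda z: float(np.sum(z * z)), l=5)
    print(best, best_cost)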
In the discrete case, there has been only one way in which these multivariate joint pdfs are computed, which is the most straightforward way. The probabilities equal the frequency with which some combination of values appears in the sample vector¹, divided by the total number of samples:

    $\hat P(X\langle a\rangle)(c\langle a\rangle) = \frac{1}{|S|}\sum_{i=0}^{|S|-1} \begin{cases} 1 & \text{if } \forall_{j\in a}\langle x^i_j = c_j\rangle \\ 0 & \text{otherwise} \end{cases}$    (10)

In the case of continuous random variables, such a most straightforward way to compute the probabilities does not exist. In general, some model is fit to the data as well as possible by specifying its parameters. For instance, a multivariate normal pdf can be used, which is fully specified by its covariance matrix and its mean vector. This discrepancy between discrete and continuous random variables is important in certain model selection techniques, as we shall see.

¹ Note that in the IDEA framework the samples we are referring to are the result of the selection operation $S \leftarrow$ sel().

4 Negative log-likelihood as the basis of statistical model selection

The algorithms in the IDEA field that have been proposed so far can roughly be divided into three categories. One category consists of the methods that use a univariate distribution in which none of the variables interact with other variables. We make no statement of any kind on these algorithms, as they perform no induction on the model since $\varsigma$ is fixed. The methods in another category use a higher order structure and for instance use the KL divergence with the full joint probability distribution to guide the search. For factorizations, it has been noted before [6] that using the KL divergence should be accompanied by strong restrictions on $f$, because otherwise minimizing the divergence will give just the full joint probability distribution. In the case of a factorization, the heavy restrictions are usually embodied by a parameter $\kappa$ which denotes the number of variables any one variable is allowed to interact with. To do away with $\kappa$, the methods in a third category have been invented. These methods add a penalty term to the search metric in order to penalize more complex models. The amount of penalization is however usually parameterized again.

4.1 Negative log-likelihood and sample entropy

To start out, we first note that there is a correspondence between equations 4 and 9, as well as between equations 5 and 9. We first focus on the discrete case, which is the simplest. We split up $S$ into $|A|^l$ mutually disjoint subvectors so that each subvector contains only one type of truncated sample point. If we define $a \sqcap b$ to be the order dependent intersection of vectors $a$ and $b$, such that each element of $a$ is preserved if it occurs in $b$, we can rewrite equation 9 by summing over all possible subvectors of $S$, as together they are exactly equal to $S$:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{X}))) = -\sum_{c\in A^l} |S \sqcap c|\,\ln(\hat P(\boldsymbol{X})(c))$    (11)

Note that $|S \sqcap c|$ is just the number of times that vector $c$ occurs in the sample vector. Using equation 10, this means that we have $|S \sqcap c| = |S|\,\hat P(\boldsymbol{X})(c)$, so we may conclude:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{X}))) = -|S|\sum_{c\in A^l} \hat P(\boldsymbol{X})(c)\,\ln(\hat P(\boldsymbol{X})(c)) = |S|\,H(\hat P(\boldsymbol{X}))$    (12)

We observe that the negative log-likelihood equals $|S|$ times the multivariate entropy of the estimated model. The reason for this is that the estimated model in the discrete case fits the sample points with maximum likelihood. The entropy measures are exact in the sense that they go over the probabilities of every domain element.
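The identity in equation 12 is easy to verify numerically. The following sketch (ours; the binary domain and sample size are arbitrary) builds the frequency estimate of equation 10 over a sample vector and confirms that the negative log-likelihood of equation 9 coincides with $|S|$ times the entropy of equation 4.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    l, n = 3, 1000
    S = [tuple(rng.integers(0, 2, size=l)) for _ in range(n)]  # binary samples

    # Equation 10: frequency estimate of the full joint pmf
    p_hat = {c: k / n for c, k in Counter(S).items()}

    # Equation 9: negative log-likelihood of S under the estimate
    nll = -sum(np.log(p_hat[z]) for z in S)

    # Equation 4: entropy of the estimated pmf
    H = -sum(p * np.log(p) for p in p_hat.values())

    print(nll, n * H)  # the two values coincide (equation 12)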
Alternatively, we can define sample entropies that compute the measure given a vector of samples $S'$, $S' \sim \hat P(\boldsymbol{Z})$. Note that it is essential that the samples were sampled from the probability distribution for which we are computing the entropy. As the sample entropy is the same in the discrete as well as the continuous case, we have only a single definition for it:

    $H(S', \hat P(\boldsymbol{Z})) = -\frac{1}{|S'|}\sum_{i=0}^{|S'|-1} \ln(\hat P(\boldsymbol{Z})(S'_i))$ such that $S' \sim \hat P(\boldsymbol{Z})$    (13)

Note that the definition of the sample entropy is closely related to that of the negative log-likelihood. To be exact, apart from the normalization by $|S'|$, the only difference is the requirement $S' \sim \hat P(\boldsymbol{Z})$, which we do not require in the case of the negative log-likelihood. In the derivation of equation 12 from equation 11, we have used that $|S \sqcap c| = |S|\,\hat P(\boldsymbol{X})(c)$. But $S \sim P(\boldsymbol{X})$ and not $S \sim \hat P(\boldsymbol{X})$. Therefore it should be noted that only in the limit of $|S'| \to \infty$ do we have that $H(S', \hat P(\boldsymbol{X})) \to H(\hat P(\boldsymbol{X}))$, $S' \sim \hat P(\boldsymbol{X})$. Furthermore, in the limit of $|S| \to \infty$, we have that $|S \sqcap c| = |S|\,P(\boldsymbol{X})(c)$, since then $\hat P(\boldsymbol{X})(c) \to P(\boldsymbol{X})(c)$, and thus in the additional limit of $|S'| \to \infty$, $S' \sim \hat P(\boldsymbol{X})$, we have that $H(S', \hat P(\boldsymbol{X})) \to H(P(\boldsymbol{X}))$. We therefore have:

    $\lim_{|S|\to\infty,\,|S'|\to\infty} H(S', \hat P(\boldsymbol{X})) = -\sum_{c\in A^l} P(\boldsymbol{X})(c)\,\ln(\hat P(\boldsymbol{X})(c)) = H(P(\boldsymbol{X}))$    (14)

In the continuous case, things are more complex. We discretize each variable so that it consists of equidistant intervals of length $m$ and let $M = \langle\ldots, [-m, 0), [0, m), [m, 2m), \ldots\rangle$. We let $(S \sqcap \mathcal{A})_i$ be the $i$-th sample from the sample vector that falls into area $\mathcal{A}$. Recalling that $S \sim P(\boldsymbol{Y})$, equation 9 becomes:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Y}))) = -\sum_{\mathcal{A}\in M^l}\;\sum_{i=0}^{|S\sqcap\mathcal{A}|-1} \ln(\hat P(\boldsymbol{Y})((S \sqcap \mathcal{A})_i))$    (15)

Now in the limit of $|S| \to \infty$, we have that $|S \sqcap \mathcal{A}| \to |S|\,P(\boldsymbol{Y} \in \mathcal{A})$. We may thus write:

    $\lim_{|S|\to\infty} -\frac{1}{|S|}\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Y}))) = -\sum_{\mathcal{A}\in M^l} P(\boldsymbol{Y} \in \mathcal{A})\,\ln(\hat P(\boldsymbol{Y})(y_{\mathcal{A}}))$    (16)

where $y_{\mathcal{A}}$ is a point in area $\mathcal{A}$. In the additional limit of $m \downarrow 0$, each area becomes infinitely small and is therefore shrunk to a single point. The discrete sum over all areas then becomes an integral over all points:

    $\lim_{|S|\to\infty} -\frac{1}{|S|}\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Y}))) = -\int\!\!\int\cdots\int P(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)\,\ln(\hat P(\boldsymbol{Y})(y\langle\mathcal{L}\rangle))\,dy\langle\mathcal{L}\rangle$    (17)

From equation 17 it follows that in the limit of $|S'| \to \infty$, $H(S', \hat P(\boldsymbol{Y})) \to h(\hat P(\boldsymbol{Y}))$ if and only if $S' \sim \hat P(\boldsymbol{Y})$. Note that we cannot enforce the same further relation as we derived for the discrete case. The relation that we have in the discrete case, namely that in the limits of $|S| \to \infty$, $|S'| \to \infty$ the sample entropy over $\hat P(\boldsymbol{X})$ becomes equal to the entropy over the actual underlying probability distribution $P(\boldsymbol{X})$, does not hold in general in the continuous case. Actually, it does not hold in general in the discrete case either, but we have assumed that in the discrete case we always approximate probabilities by using definition 10. As the definition in equation 10 becomes equal to $P(\boldsymbol{X})$ in the limit of $|S| \to \infty$, this is clearly true. However, in the continuous case we hardly ever have that in the limit of $|S| \to \infty$ we get $\hat P(\boldsymbol{Y}) \to P(\boldsymbol{Y})$. So neither can we conclude that the entropy of the true distribution becomes equal to the entropy of the estimated distribution. In order for that to hold in general, we require a special pdf:

    $\int_{a_0}^{b_0}\int_{a_1}^{b_1}\cdots\int_{a_{l-1}}^{b_{l-1}} \hat P(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)\,dy\langle\mathcal{L}\rangle = \frac{1}{|S|}\sum_{i=0}^{|S|-1} \begin{cases} 1 & \text{if } \forall_{j\in\mathcal{L}}\langle a_j \leq y^i_j \leq b_j\rangle \\ 0 & \text{otherwise} \end{cases}$    (18)

Such a pdf follows every detail of the sample vector and is therefore said to overfit the data. This is a very important aspect of density estimation and is termed generalization. As in a practical application we shall use a more generalizing model for $\hat P(\boldsymbol{Y})$, we cannot enforce that in the limits of $|S| \to \infty$, $|S'| \to \infty$, $H(S', \hat P(\boldsymbol{Y})) \to h(P(\boldsymbol{Y}))$, $S' \sim \hat P(\boldsymbol{Y})$.
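Equations 13 and 17 suggest a practical recipe: approximate the differential entropy of an estimated model by the sample entropy over samples drawn from that same model. The following sketch (ours; a univariate normal is chosen only because its differential entropy has a closed form to compare against) shows $H(S', \hat P(\boldsymbol{Y}))$ converging to $h(\hat P(\boldsymbol{Y}))$ as $|S'|$ grows.

    import numpy as np

    rng = np.random.default_rng(0)

    # Fit a normal pdf to data by maximum likelihood (mean and variance)
    S = rng.normal(3.0, 2.0, size=500)
    mu, sigma = S.mean(), S.std()

    def log_pdf(y):
        return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

    # Equation 13: sample entropy, with S' drawn from the *estimated* model
    for n_prime in (100, 10_000, 1_000_000):
        S_prime = rng.normal(mu, sigma, size=n_prime)
        print(n_prime, -np.mean(log_pdf(S_prime)))

    # Closed-form differential entropy of the fitted normal, for comparison
    print("exact:", 0.5 * np.log(2 * np.pi * np.e * sigma**2))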
There is however one final important issue. In the discrete case, we found that the negative log-likelihood over the sample vector equals $|S|$ times the entropy of the estimated distribution (equation 12). We have however not established this in the continuous case. In our derivation, we have directly derived that, given an infinite number of samples, the sample entropy becomes equal to the actual differential entropy over the estimated model. As a matter of fact, we do not in general have the same relation between the entropy and the negative log-likelihood in the continuous case as we do in the discrete case. The reason why it does hold in the discrete case is that equation 10 is a maximum likelihood estimate over the sample vector. As we have not assumed that our fit in the continuous case is a maximum likelihood fit unless we use equation 18, we have not enforced this relation in the continuous case. However, Kullback [13] has shown that the equality holds, in the discrete as well as the continuous case, if and only if $\hat P(\boldsymbol{Z})$ is a maximum likelihood estimate over $S$. Let $\vartheta$ be the vector of parameters for $\hat P(\boldsymbol{Z})$ and let $\Theta$ be the vector space of all possible values for the parameters $\vartheta$; we can then formalize this by:

    $\hat P(\boldsymbol{Z}) = \arg\max\{\mathcal{L}(S \mid P) \mid P \in \{\hat P(\boldsymbol{Z} \mid \vartheta) \mid \vartheta \in \Theta\}\} \iff -\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Z}))) = |S|\,H(\hat P(\boldsymbol{Z}))$    (19)

4.2 KL divergence and negative log-likelihood model selection

In the methods in the second mentioned category, model selection is performed subject to certain constraints. Let $C_\varsigma$ be the set of all model structures that satisfy the constraints, $\varsigma \in C_\varsigma$. The constraints on the model have varied over the past from none [1, 19] to those of chain dependencies [3], tree dependencies [2], conditional dependencies with a maximum of $\kappa$ conditionals per variable [4, 6, 7, 14, 15, 16] and marginal product models [11]. Prior to the IDEA framework, all of the higher order models were regarded in the case of discrete (binary) random variables. The basic idea is then to minimize the distance between the true probability distribution and the estimated probability distribution. As a distance measure, the KL divergence between two probability distributions can be used, in which one probability distribution is the most complex probability distribution $\hat P(\boldsymbol{Z})$ and the other is the estimate $\hat P_\varsigma(\boldsymbol{Z})$. In the case of factorizations, an algorithm to approximately find the best factorization can start from the complete univariate factorization, in which each variable is taken independently, and incrementally build a more complex factorization. Each step in the incremental algorithm is based upon the change that brings about the largest decrease in the KL divergence to $\hat P(\boldsymbol{Z})$. In the case of a model structure in general, this means that if we let $\hat P_{\varsigma^0}(\boldsymbol{Z})$ be the current model, we test it against a (more complex) model $\hat P_{\varsigma^1}(\boldsymbol{Z})$ by observing the change in the KL divergence to the most complex probability distribution. The model that brings about the greatest decrease is selected as the new current model. The metric that has been used in the discrete case has only been applied to factorizations and is computed with respect to the full joint factorization:

    $D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^0}(\boldsymbol{X})) - D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^1}(\boldsymbol{X})) = \sum_{c\in A^l} \hat P(\boldsymbol{X})(c)\,\ln(\hat P_{f^1}(\boldsymbol{X})(c)) - \sum_{c\in A^l} \hat P(\boldsymbol{X})(c)\,\ln(\hat P_{f^0}(\boldsymbol{X})(c))$    (20)

A factorized probability distribution is a product of pmfs over its elemental building blocks. With equation 3, we therefore have that both sums in equation 20 are actually sums over sums over certain vectors $a$ of $\hat P(\boldsymbol{X})(c)\,\ln(\hat P(X\langle a\rangle)(c\langle a\rangle))$ with $c \in A^l$. Since, given some vector $k \sqsupseteq a$, the theorem of total probability gives us that $\sum_{c\in A^{|k|}} P(X\langle k\rangle)(c)\,\ln(P(X\langle a\rangle)(c\langle a\rangle)) = \sum_{c\in A^{|a|}} P(X\langle a\rangle)(c)\,\ln(P(X\langle a\rangle)(c))$, equation 20 transforms into a difference of entropies that can be computed from the marginals implied by the factorizations alone:

    $D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^0}(\boldsymbol{X})) - D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^1}(\boldsymbol{X})) = H(\hat P_{f^0}(\boldsymbol{X})) - H(\hat P_{f^1}(\boldsymbol{X}))$    (21)
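Equation 21 implies that the greedy structure search never needs the full joint distribution explicitly: every candidate step can be scored using marginal entropies of the frequency estimates alone. The following sketch (ours) illustrates this for marginal product models over binary variables, greedily merging the pair of factors whose merge most decreases the entropy of the factorized model; the threshold eps is our stand-in for the restrictions discussed above and is not part of the metric itself.

    import numpy as np
    from collections import Counter
    from itertools import combinations

    def marginal_entropy(S, a):
        # Entropy (equation 4) of the frequency estimate over the variables in a
        counts = Counter(tuple(z[i] for i in a) for z in S)
        p = np.array(list(counts.values())) / len(S)
        return float(-(p * np.log(p)).sum())

    def greedy_factorization(S, l, eps=0.05):
        # Start from the complete univariate factorization; merge while the
        # entropy decrease (equation 21) exceeds eps.
        f = [(i,) for i in range(l)]
        while True:
            best, best_gain = None, eps
            for a, b in combinations(f, 2):
                gain = (marginal_entropy(S, a) + marginal_entropy(S, b)
                        - marginal_entropy(S, a + b))
                if gain > best_gain:
                    best, best_gain = (a, b), gain
            if best is None:
                return f
            a, b = best
            f.remove(a)
            f.remove(b)
            f.append(a + b)

    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, 300)
    x1 = x0 ^ (rng.random(300) < 0.05)   # strongly coupled to x0
    x2 = rng.integers(0, 2, 300)         # independent
    S = [tuple(z) for z in np.column_stack([x0, x1, x2])]
    print(greedy_factorization(S, l=3))  # expect (0, 1) merged, (2,) alone

With eps shrunk towards 0, the procedure keeps merging until it reaches the full joint probability distribution, since the sample-based entropy decrease is almost surely positive. This is precisely the behaviour noted above that motivates the restriction $\kappa$, the penalty terms, or the hypothesis tests pursued in this paper.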
The same metric has been proposed in the continuous case, $d(\hat P(\boldsymbol{Y}) \,\|\, \hat P_{f^0}(\boldsymbol{Y})) - d(\hat P(\boldsymbol{Y}) \,\|\, \hat P_{f^1}(\boldsymbol{Y}))$ [4, 6, 7], as a logical expansion over the discrete case. This however requires the computation of a multivariate integral over a given pdf. For the normal pdf [4], this expression can be derived analytically. However, in the case of the normal kernels pdf [7], this is no longer possible. The same is the case for the use of a normal mixture pdf. The latter two pdfs are however very useful in the IDEA approach. In order to use them anyhow, the integrals may be computed numerically. There are many ways to do this, and mostly this requires an amount of time that is exponential in the dimensionality of the integral. However, we have seen in this paper that we can use an approximation. If we can efficiently sample points from the estimate $\hat P_f(\boldsymbol{Y})$, equation 17 tells us that we may use $H(S', \hat P_f(\boldsymbol{Y}))$ with $S' \sim \hat P_f(\boldsymbol{Y})$ as an approximation to the true entropy over the estimated model (a concrete sketch of this approximation follows below). This resolves the problems regarding the computation of the entropy in the search metric as encountered so far in continuous IDEAs [6, 7]. Summarizing, for finding $f$, the methods using the KL divergence have used the entropy of the factorized model as the quantity to minimize: $H(\hat P_f(\boldsymbol{X}))$ in the discrete case and $h(\hat P_f(\boldsymbol{Y}))$ in the continuous case.
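As referenced above, the sample-entropy approximation can be made concrete. The following sketch (ours; "normal kernels" is taken here to be a Gaussian kernel density with one kernel per sample point and an arbitrarily chosen bandwidth) estimates the differential entropy of a two dimensional kernel model via equation 13, avoiding numerical integration entirely.

    import numpy as np

    rng = np.random.default_rng(0)

    # A 2-d normal kernels model fit over a sample vector S
    S = rng.normal(0.0, 1.0, size=(200, 2))
    h = 0.4                                # illustrative bandwidth choice

    def kde_log_pdf(y):
        # log of the average of 2-d spherical normal kernels centered on S
        d2 = ((S - y) ** 2).sum(axis=1)
        log_k = -d2 / (2 * h * h) - np.log(2 * np.pi * h * h)
        m = log_k.max()
        return m + np.log(np.exp(log_k - m).mean())

    def kde_sample(n):
        # Sampling is cheap: pick a kernel center, perturb with normal noise
        centers = S[rng.integers(0, len(S), size=n)]
        return centers + rng.normal(0.0, h, size=(n, 2))

    # Equation 13: Monte Carlo estimate of the differential entropy
    S_prime = kde_sample(20_000)
    H_mc = -np.mean([kde_log_pdf(y) for y in S_prime])
    print("estimated differential entropy:", H_mc)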