Mixed IDEAs

Peter A.N. Bosman                        Dirk Thierens
peterb@cs.uu.nl                          Dirk.Thierens@cs.uu.nl

Institute of Information and Computing Sciences, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

December 2000

Abstract

Building and using probabilistic models to perform stochastic optimization in the case of continuous random variables has so far been limited to the use of factorizations as the structure of probabilistic models. Furthermore, the only probability density function (pdf) that has been successfully tested on a multitude of problems is the normal pdf. The normal pdf however strongly generalizes the data and cannot cope with non-linear interactions among the samples. In this paper, we show how clustering algorithms can be used to overcome this problem. We also show how the normal mixture pdf can be used in the case of a general factorization instead of the normal pdf. We formalize the notion of a probabilistic model and propose to use two practical instances for the model structure, which are the factorization and the mixture of factorizations. We propose to use metrics to find good factorizations and thereby eliminate a complexity parameter κ that was required in previous continuous approaches in the case of a general factorization. We also show the background of the metrics through general model selection on the basis of likelihood maximization, which demonstrates their connection with previously used factorization selection algorithms. We use the IDEA framework for iterated density estimation evolutionary algorithms to construct new continuous evolutionary optimization algorithms based on the described techniques. Their performance is evaluated on a set of well known epistatic continuous optimization problems.

1 Introduction

The Iterated Density Estimation Evolutionary Algorithm (IDEA) framework has been used to apply probabilistic models to continuous stochastic optimization [8, 10, 11, 12, 13]. Algorithms within the IDEA framework build a probabilistic model from a selection of samples and subsequently draw more samples from the probability distribution imposed by the probabilistic model. These samples are then incorporated among the currently available samples, after which selection takes place again. As this process is iterated, the algorithm is intended to converge to a good solution. Estimating and sampling from a probability distribution each iteration can be seen as a combination of the recombination and mutation steps in evolutionary algorithms. Since a selection of the available samples is used to create more solutions, the IDEA framework can be seen as an evolutionary algorithm with an explicit use of statistics.

In previous work [8, 12], the normal pdf and the normal kernels pdf have been used as elementary probability density functions (pdfs). Using these pdfs, the estimated probability distribution over the complete domain of problem variables is constructed. The normal pdf has proven to be computationally effective and to result in acceptable optimization performance as well [10, 11]. The normal kernels pdf on the other hand has proven to be promising in optimization performance, but also to be very hard to handle [10]. Furthermore, its computational requirements are quite large. It seems that the use of a normal mixture pdf is a good trade-off between the normal pdf and the normal kernels pdf. The computational requirements of the normal mixture pdf are smaller than those of the normal kernels pdf.
Furthermore, its parameters can be estimated using the EM algorithm, which eliminates the need for fine-tuning certain parameters by hand. In this paper, we show how the normal mixture pdf can be used within the IDEA framework.

The normal pdf itself is unable to describe non-linear interactions in a sample set. The normal mixture pdf can be used to overcome this problem. Another approach is to cluster the samples in a preprocessing phase and to use the normal pdf in each cluster. Assuming that the clustering phase effectively breaks up the non-linear interactions between the variables, this should result in an effective estimation of the selected samples. In a pilot study by Pelikan and Goldberg, clustering has been applied to optimization algorithms that use probabilistic models [25]. Here, we formalize the notion of clustering as part of the model selection process and investigate what clustering algorithms are effective in both computational running time as well as clustering performance.

In previous optimization algorithms that use continuous probabilistic models, little attention has been paid to search metrics that guide the search for a good probabilistic model. In this paper, we propose to use metrics that effectively prefer simpler models if they can describe the samples in a similar manner as can more complex models. We also show their correspondence with significance testing using statistical hypotheses and the notion of likelihood as a description of the goodness of an estimation.

We show how to use a higher level of modeling in the case of continuous probabilistic models using mixtures of distributions. Furthermore, we also provide useful search metrics to find continuous probabilistic models within iterated density estimation evolutionary algorithms. The combination of the proposed techniques leads to new algorithms. In this paper, we evaluate their performance using a set of test functions.

The remainder of this paper is organized as follows. In section 2 we formalize the notion of a probabilistic model. We also show how previous approaches have used a special instance for the probabilistic model structure, which is called the factorization. In section 3, we go into model selection. First, we discuss two approaches to model selection in general. Subsequently, we show how these approaches can be used to find a good factorization. Finally, we investigate the use of clustering algorithms to use a new higher order instance of probabilistic models in IDEAs, which is a mixture of factorizations. In section 4, we show how the normal pdf and the normal mixture pdf can be fit to any factorization and how we can draw samples from them. Subsequently, in section 5, we go over the IDEA framework and show the difference between previous approaches and the new approaches described in this paper. We also elaborate on the difference between the IDEA approach and Evolution Strategies. In section 6, we test the new algorithms on a set of well known test functions. In section 7, we discuss some topics of importance and possible future research. Finally, this paper is concluded in section 8.

2 Probabilistic models

Given a vector of samples from a space defined by a cost function that, without loss of generality, we seek to minimize, our aim is to build and use a probabilistic model that describes these samples. The construction of this model should be effective in computation time requirements. Furthermore, the model itself should be effective in descriptive power.
Once a model has been built, more samples are acquired using the probabilistic model. Subsequently, a new vector of samples is selected from all of the available samples, after which again a probabilistic model is built. As this process is iterated, the requirement that the computation time for model building should be reasonable is clear, as this is the main ingredient of our non-deterministic inductive search. On the other hand, the probabilistic model should efficiently describe the vector of selected samples in order for the algorithm to sample from the promising regions of the search space. If the model is not effective in its description, the regions may not be separated properly and the search does not propagate effectively over most of the regions. Instead, only a single region could for instance be explored and the algorithm is more likely to be misled.

Before we can discuss how to effectively find a good probabilistic model, we have to define exactly what such a model consists of. In section 2.1 we therefore formally define the concept of probabilistic models that we shall use in our algorithms. One of the most fundamental concepts is the factorization. We elaborate on the use of this basic concept in section 2.2. Finally, in section 2.3 we give an overview of the use of probabilistic models in previous work. Also, we indicate how we shall expand this use to more complex probabilistic models and present a general overview that classifies previous work and the current work on the basis of the probabilistic model.

2.1 Inside a probabilistic model

We write a = (a_0, a_1, ..., a_{|a|-1}) to denote a vector a of length |a|. Given a vector of indices c, we introduce the notation a⟨c⟩ = (a_{c_0}, a_{c_1}, ..., a_{c_{|c|-1}}), where c_i ∈ (0, 1, ..., |a|-1). We let L = (0, 1, ..., l-1) denote the vector of all l problem variable indices and S = (y^0, y^1, ..., y^{|S|-1}) the vector of selected samples. The probabilistic model that we define in this section describes a probability distribution over the vector of continuous random variables Y = (Y_0, Y_1, ..., Y_{l-1}). It is this probability distribution that we seek to estimate from S. To this end, we note that any probability distribution over continuous random variables consists of probability density functions. The probability density function (pdf) is the most fundamental part of a probability distribution and therefore of a probabilistic model.

Let a ⊑ L, meaning that a contains only elements of L. As a basis, we first state the definition of the multivariate joint pdf. Let dy⟨a⟩ = ∏_{i=0}^{|a|-1} dy_{a_i}. A multivariate joint pdf f_a(y⟨a⟩) over the random variables Y⟨a⟩ is then any non-negative function that integrates to one over the domain of Y⟨a⟩:

∫ f_a(y⟨a⟩) dy⟨a⟩ = 1   (1)

We write P(Y⟨a⟩)(y⟨a⟩) for f_a(y⟨a⟩), making P(Y⟨a⟩) a pdf. In this way, we can use the common probability notation P(Y⟨a⟩) over random variables Y⟨a⟩ instead of writing f_a(y⟨a⟩) for a density function. Furthermore, the use of this notation allows us to catch the properties of probability distributions in both the elementary case of a single pdf as well as higher order cases in which the distribution is factorized. One such definition is that of conditional probability, which will prove to be one of the most important. We let b ⊑ L and define a ⊔ b to be the splicing of the two vectors so that the elements of b are placed behind the elements of a, giving |a ⊔ b| = |a| + |b|. Using the definition of multivariate probability, we can write conditional probability as:

P(Y⟨a⟩ | Y⟨b⟩) = P(Y⟨a ⊔ b⟩) / P(Y⟨b⟩)   (2)

Combining the multivariate joint pdf from equation 1 with the multivariate conditional probability from equation 2, we can construct a probability distribution over Y. This can be done by specifying a single multivariate joint pdf, but this can also be done by constructing a factorization.
Since the product of two pdfs is again a pdf, the probability distribution over Y can be factored such that it becomes a product of multivariate pdfs (either joint or conditional). Actually, we can restrict any factorization to use only multivariate conditional pdfs from equation 2, since we have that:

P(Y⟨a⟩) = ∏_{i=0}^{|a|-1} P(Y_{a_i} | Y⟨(a_{i+1}, a_{i+2}, ..., a_{|a|-1})⟩)   (3)

If we do not replace a joint pdf P(Y⟨a ⊔ b⟩) with a product of two joint pdfs P(Y⟨a⟩)P(Y⟨b⟩), there is a dependency between variables Y⟨a⟩ and Y⟨b⟩. In this case, we speak of an unconditional dependency. For any pair of variables Y_i and Y_j, we have that it is either beneficial to use P(Y_i | Y_j) instead of P(Y_i) or it is not. If it is more beneficial to use the conditional probability, we have a conditional dependency between variables Y_i and Y_j. The most notable difference between the two dependencies is that conditional dependencies are directed, whereas unconditional dependencies are not. Even though we can express any multivariate joint pdf with multivariate conditional pdfs, it can still be beneficial to use only multivariate joint pdfs in a factorization. The reason for this is that conditional pdfs can be analytically and computationally more difficult to handle.

To find a good probabilistic model, we can thus for instance attempt to find a good factorization of the probability distribution and then fit a pdf over each multivariate joint or multivariate conditional pdf in the factorization. We have however refrained so far from giving the actual definition of a factorization. In section 2.2 we go into the details of factorizations and formalize them. Here, we shall restrict ourselves to an example of a factorization and subsequently use it in the definition of a probabilistic model. We denote a factorization by f. An example of a factorization is P_f(Y_0, Y_1, Y_2) = P(Y_0)P(Y_1, Y_2). Again, this is a pdf, but as we use a product of elementary pdfs, we refer to it as a probability distribution. On the other hand, the probability distribution that is factorized into a single factor P_f(Y_0, Y_1, Y_2) = P(Y_0, Y_1, Y_2) is equal to the unfactorized probability distribution and is equal to an elementary pdf. Still, we assume for each element in the factorization that we fit a pdf over it. We call the resulting probability distribution a factorized probability distribution.

With the exception of the work by Pelikan [25], the approaches in the field of building and using probabilistic models have so far used a factorization as the structure of the probabilistic model. The usefulness of a factorization becomes apparent when we increase the dimensionality of the problem. For instance, in the case of discrete variables, we estimate the full joint probabilities by counting the amount of times some combination occurs in the sample vector and dividing it by the total amount of samples. One of these combinations does not have to be computed, as it is one minus the sum over the probabilities of the other combinations. For l dimensions and binary random variables, this means that we have to estimate 2^l - 1 probabilities; for l = 30 this already amounts to over a billion values to fit. This exponential scaling behavior is the most important reason why a factorization can be useful. If the interactions between the problem variables can be effectively caught in a bounded complexity order probabilistic model, the optimization process is efficient. However, if the amount of parameters that has to be computed to fit the pdf is computationally expensive, such as exponential, optimization by using probabilistic models is inherently a non-scalable approach.
Note that whether or not this is possible depends both on the algorithm that searches for a factorization as well as on the optimization problem. A factorization of a probability distribution, along with the parameters for the pdfs to fit over the factors, could thus serve as the definition of a probabilistic model. Even though this definition has been used in most approaches so far, it does not allow for modeling at a higher level, being a mixture of factorizations. Such a mixture allows us to apply a clustering strategy first, after which a factorization can be found in each cluster separately. This way, non-linearity and symmetry in the sample vector can effectively be tackled and processed.

Even though a mixture of factorizations is a very general structure that allows for many probabilistic models, we use a more general notation that allows for any type of structure. In this paper, the instances of the structure in the probabilistic model so defined will be the factorization and the mixture of factorizations. In general, we define a probabilistic model to have some sort of structure ς and a vector of parameters θ for the pdfs as implied by ς. We call ς the probabilistic model structure and θ the probabilistic model parameters. As noted before, ς is likely to be constrained, as using a maximum complexity structure does not scale up. Therefore, if ς is a factorization, it is usually constrained to bounded order interactions between variables. We denote the space of all ς that are allowed within such constraints by C_ς. Usually, such constraints are not placed on the parameters θ, but we denote the constrained space for the parameters by C_θ. Formally, a probabilistic model M can thus be defined as a tuple, just as can be done for its complete constrained space C of allowed models:

M = (ς, θ) ∈ C = (C_ς, C_θ)   (4)

Figure 1: An empty factorization containing only the variables.

In the case that we use a single factorization for ς, we have ς = f. We shall also make use of a mixture of factorizations. We denote the amount of components in the mixture by k and let K = (0, 1, ..., k-1), so that the mixture of factorizations can be written as f⟨K⟩ = (f^0, f^1, ..., f^{k-1}). As our goal is to use the probabilistic model in optimization, we have to approximate the probability distribution of the given sample vector S. The resulting probability distribution that is based on the model that we find is an approximation to the true probability distribution. Therefore, we write the probability distribution implied by model M as P̂_M(Y), which is an approximation to the true underlying probability distribution P(Y) over S.

The amount of parameters |θ| depends on the choice of the pdfs to use as well as on the structure ς. However, the pdf to fit over every factor implied by ς is chosen beforehand. The way in which the parameters θ are fit is also predefined beforehand. We denote the parameter vector that is obtained in this manner by θ^fit. Once we fit the parameters θ^fit, we can sample from the resulting probability distribution to achieve an iterated search procedure. However, we have already remarked that we shall restrict our attention to the use of factorizations and combinations thereof. To be able to build factorizations, we first elaborate on them in the next section.

2.2 Factorizations and their graphs

A factorization implies a product of pdfs. If such a product is a function of variables y⟨a⟩, it is a valid probabilistic model structure over random variables Y⟨a⟩. The product itself is closely related to the dependencies between the variables.
We can start by noting that two variables are either unconditionally dependent or they are not. Furthermore, we may state that variable Y_i is either conditionally dependent on variables Y⟨a⟩ with i ∉ a or it is not. These relations constitute a factorization. This structure can be modeled using graphs. If we identify a vertex for every random variable Y_i and introduce an arc (Y_i, Y_j) if and only if Y_j is conditionally dependent on Y_i, we get the conditional factorization graph. In the case of unconditional dependence, a connected component in the graph implies that all variables in the connected component are jointly dependent. As such, a connected component models a multivariate joint pdf. A conditional dependence arc imposes a conditional probability in the probability distribution over Y.

For example, figure 1 shows an empty factorization over 5 variables. The underlying factorization can be specified as P(Y) = P(Y_0)P(Y_1)P(Y_2)P(Y_3)P(Y_4). Using only unconditional dependencies, figure 2 is an example of an unconditional factorization graph that models P(Y) = P(Y_0, Y_2)P(Y_1, Y_3, Y_4). As a final example, figure 3 shows a factorization graph using conditional dependencies that models P(Y) = P(Y_0|Y_2)P(Y_1|Y_2 Y_3 Y_4)P(Y_2|Y_3 Y_4)P(Y_3|Y_4)P(Y_4) = P(Y_0|Y_2)P(Y_1, Y_2, Y_3, Y_4). The latter equality holds because of equation 3.

Figure 2: A non-empty unconditional factorization representing only joint dependencies.

Figure 3: A non-empty conditional factorization representing conditional dependencies.

It follows from equation 3 that using unconditional factorizations that imply joint probabilities is a special case of using conditional factorizations. Using these conditional factorizations, we can specify any factorization. Note that a factorization is valid if and only if its factorization graph is acyclic. This can be formalized [8] by introducing a function π(·) that returns a vector of parent variable indices π(i) = (π(i)_0, π(i)_1, ..., π(i)_{|π(i)|-1}) for variable Y_i, together with an ordering of the variables ω = ω⟨L⟩. Using (π, ω), constraints can be specified to enforce that the factorization graph is acyclic. Furthermore, by scanning the variables in the order Y_{ω_{l-1}}, Y_{ω_{l-2}}, ..., Y_{ω_0}, the variables that some variable is conditioned on will already have been regarded. A conditional factorization can be uniquely specified by a pair (π, ω) as follows:

P_{(π,ω)}(Y) = ∏_{i=0}^{l-1} P(Y_{ω_i} | Y⟨π(ω_i)⟩)   (5)

such that for every i ∈ L we have ω_i ∈ L, all entries of ω are distinct, and every variable in π(ω_i) appears later in the ordering ω than ω_i does, which ensures that the factorization graph is acyclic. Using this definition, the factorization in figure 3 can be specified as ω_0 = 0, ω_1 = 1, ω_2 = 2, ω_3 = 3, ω_4 = 4 and π(0) = (2), π(1) = (2, 3, 4), π(2) = (3, 4), π(3) = (4) and π(4) = ().

If we use a product of multivariate joint pdfs, the factorization becomes a marginal product model. We define the node vector ν to be a vector of vectors of connected component indices ν_i = ((ν_i)_0, (ν_i)_1, ..., (ν_i)_{|ν_i|-1}). Any unconditional factorization can be uniquely specified using ν:

P_ν(Y) = ∏_{i=0}^{|ν|-1} P(Y⟨ν_i⟩)   (6)

such that the ν_i are non-empty, mutually disjoint, and jointly contain every index in L. The above constraints enforce that the vector of available random variables Y is partitioned into |ν| disjoint subvectors. Over each of these subvectors, a single multivariate joint pdf is fit to obtain the factorized probability distribution. Using this definition, the factorization in figure 2 can be specified as ν_0 = (0, 2), ν_1 = (1, 3, 4), |ν| = 2.

We conclude this section by noting that we have formalized two general types of factorizations. One is defined using multivariate conditional probabilities, whereas the other is defined using multivariate joint probabilities. These are the two factorizations that we use:

f ∈ {(π, ω), ν}   (7)
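For concreteness, the following minimal Python sketch (ours; the naming and representation are illustrative assumptions, not part of the formalization above) encodes the factorizations of figures 2 and 3 in the two structures just defined and verifies the acyclicity constraint of the conditional factorization graph with a topological sort.

# Unconditional factorization (figure 2): the node vector nu partitions
# the variable indices L = (0, ..., l-1) into disjoint groups.
nu = [[0, 2], [1, 3, 4]]

# Conditional factorization (figure 3): pi maps each variable to its
# parent indices; omega is the order in which the variables are regarded.
pi = {0: [2], 1: [2, 3, 4], 2: [3, 4], 3: [4], 4: []}
omega = [0, 1, 2, 3, 4]

def is_acyclic(pi):
    """Check that the parent function pi defines an acyclic graph by
    attempting a topological sort (Kahn's algorithm)."""
    indeg = {i: len(parents) for i, parents in pi.items()}
    children = {i: [] for i in pi}
    for i, parents in pi.items():
        for p in parents:
            children[p].append(i)
    ready = [i for i, d in indeg.items() if d == 0]
    seen = 0
    while ready:
        v = ready.pop()
        seen += 1
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return seen == len(pi)

assert is_acyclic(pi)  # figure 3 contains no cycles, so (pi, omega) is valid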
2.3 An overview of things that were and of things to come

In previous work on continuous models [8, 10, 11, 12, 13, 22, 18, 30, 31], a factorization f was used as the structure ς of the probabilistic model M. In some cases, constraints in addition to the general acyclic graph constraints have been applied. Furthermore, with the exception of the work by Bosman and Thierens [12] and Gallagher, Frean and Downs [18], a single multivariate normal pdf was used. In the work by Bosman and Thierens [12], the normal kernels pdf is proposed. However, the normal kernels pdf is very hard to handle and has therefore not been put to practice as much as has been done for the normal pdf. Gallagher, Frean and Downs [18] propose the use of the flexible multivariate normal mixture pdf, but only with the univariate factorization in which every variable is regarded independent of the others. In this paper, we show how the multivariate normal mixture pdf can be used in the case of a general factorization.

Roughly, we can say that the research on the use of probabilistic models in continuous domains has been limited to models where the structure is a factorization and the parameters are those of a normal pdf (at least in the general factorization case). If we denote the normal pdf by f_N, we can indicate this by writing ς = f and θ ← f_N. In this paper, we show how the normal mixture pdf can be used in the general factorization case, giving θ ← f_NM. By using the normal mixture pdf, we can apply a powerful density estimator in our optimization procedure. However, this does not necessarily allow for the breaking of symmetry or the fitting of non-linearity in the optimization problem. To achieve this, we propose to use a mixture of factorizations. To this end, we cluster the sample vector into subvectors and apply density estimation to each subvector. We thereby change the structure of the model to ς = f⟨K⟩.

Figure 4 graphically depicts the previous work and the work that we describe in this paper. The approaches in the top row make use of a single factorization f. This is what has been done so far for continuous optimization problems. In the case of a general factorization, the approaches have been limited to the use of a single normal pdf, which is the top left entry in the table in figure 4. The approaches in the right column make use of the normal mixture pdf. The approaches in the bottom row use a mixture of factorizations. In this paper, we use clustering algorithms to partition the sample vector into subvectors. A factorization is found for each so constructed subvector, thereby achieving the mixture of factorizations. Concluding, in the general case, the approaches so far for continuous optimization problems have used the top left entry in the table in figure 4. In this paper, we show how the remaining three entries in the table of figure 4 can be established in search of an improvement over the approaches for continuous domains so far.

3 Model selection

Given a vector S of data points, it is very valuable to be able to infer from what probability distribution these data points were sampled. Given such information, we can draw conclusions on the source of these data points and make predictions for future data coming from similar sources. In some applications we are not concerned with finding the full probability distribution, but with certain aspects of it under certain conditions.
The use of linear regression to find out whether variables or events are correlated is an example of such an approach. Analysis of this sort is applied whenever induction is to be performed, given certain observations. Real life examples are for instance observations of hospitalized patients and the question of whether one symptom is caused by another or whether a certain disease is always accompanied by high fever, so that this can be taken into account whenever a new patient with this disease arrives. The least restrictive of these inductive questions, which is the desire to know the underlying probability distribution, is called model selection.

Figure 4: An overview of the algorithms that build and use probabilistic models. The rows correspond to the model structure (a single factorization ς = f or a mixture of factorizations ς = f⟨K⟩) and the columns to the parameters (normal pdf θ ← f_N or normal mixture pdf θ ← f_NM). The top left entry represents the current algorithms in the case of a general factorization. The other entries represent algorithms that are proposed in this paper.

The data points in vector S are the result of sampling from some model. It is this model that we set out to find or approximate. To be more precise, we utilize this inferred model information in order to perform global optimization. We start out in section 3.1 by going over two general model selection techniques. In section 3.2, we formulate an algorithm that systematically and incrementally finds a factorization given a vector of samples. Subsequently, in section 3.3, we go into the selection of a mixture of factorizations using clustering techniques.

3.1 General techniques

In order to decide what model we wish to use to represent the data, many automated approaches can be used. Depending on the choice of the pdf, specialized statistical tests can be derived to find a factorization, for example. In this section however, we wish to refrain from selecting any instances or additional constraints for the model. Instead, we give two general approaches to selecting an appropriate model based upon the data.

In this paper, we see model selection as an iterative process. In any iteration, we have some candidate models that may possibly replace the current model. In section 3.2 we show how such an iterative algorithm can be formalized in order to select a model with ς = f. Here, we concern ourselves with a single iteration of such an algorithm. From a set of candidate models, we wish to select the most promising one to replace the current model. If no candidate model seems to be promising, the iterative algorithm halts and the current model is taken to be the result of model selection.

We assume that a set of candidate models for some given model is available. In section 3.2 we shall show how such a set can be obtained in the case where we want to find a suitable factorization. In the iterative process, we denote the current model by M_0. In order to select the best model from the set of candidates, we go over each of the candidate models and compare them to M_0. In the remainder of this section, we focus on one such comparison and denote the candidate model by M_1.

In order to compare M_1 to M_0, we distinguish two approaches. One approach is to test whether the use of M_1 is a significant improvement over M_0 in representing S. To this end, statistical hypothesis tests can be used. A gentle introduction to statistical hypothesis testing is given in appendix B.
We define the general statistical hypothesis test for the comparison of the models as follows:

Model Selection Test Hypothesis (MSTH) Probabilistic model M_1 cannot better describe the given samples than can probabilistic model M_0.

We can use statistical techniques to test the MSTH. Such a test returns whether or not the hypothesis should be accepted. A certain statistic is shown to follow some distribution. A hypothesis is tested by using so-called critical values for the statistic, given a certain significance threshold. The MSTH is accepted if the statistic is for instance below a critical value associated with the threshold, or rejected if it is above the critical value. In such a case we speak of a right-sided test.

Statistical hypothesis tests can be used to select a replacement model from the candidate set. As the test returns whether M_1 is a better descriptive model than is M_0 for the given samples, we can go over the candidates and take the first model from the candidate set that results in rejecting the MSTH to replace M_0. However, the set of candidates will be generated in an automated fashion. To prevent going over the candidates in the same manner every time, so that possibly a certain type of candidate model is always tested before some other type of candidate model, the candidates should be accessed in a random order. In section 3.1.1 we show how a general instance for the MSTH can be formalized based on the likelihood of a probabilistic model.

The second approach to comparing M_1 with M_0 is to use some metric. If we have a metric that informs us of how well some model represents S, we can search amongst the candidate models to find the model that results in the largest increase or decrease of the metric, depending on whether we have to respectively maximize or minimize it. If the metric can no longer be improved upon, since every candidate leads to a degradation of the metric, the iterative model selection procedure can be halted. This approach has become a very popular one, especially within the field of Bayesian statistics. In section 3.1.2 we discuss a general, transparent and useful metric that has a clear correspondence with the MSTH instance in section 3.1.1.

Before moving to describe both a general MSTH as well as a metric, we note that an MSTH can actually also serve as a metric. Assuming such a statistic s, a critical value s̄ and a right-sided test, a statistical hypothesis test returns true if and only if s > s̄. However, the amount by which s is larger (or smaller, depending on the type of test) than s̄ can be seen as a measure of how strongly the MSTH is rejected. This can be used in some sort of metric for which we may find the maximum over all candidate models. The metric to maximize is then equal to s - s̄.

3.1.1 Likelihood

In this section, we focus on finding a general statistical hypothesis test so as to perform model selection by testing the MSTH. In appendix B more details are presented on statistical hypothesis testing in general. Also an example is given of how unconditional dependencies can be tested using well known statistical measures. Here however, we focus on a hypothesis test that is directly based on the goodness of two fits. Assuming that the samples were drawn independently from the underlying distribution, the likelihood of the samples, given the estimated probability distribution P̂_M(Y), is defined as:

L(S | P̂_M(Y)) = ∏_{i=0}^{|S|-1} P̂_M(Y)(y^i)   (8)

It is common practice to use the negative logarithm of the likelihood measure, as it is often computationally more convenient.
The resulting expression is called the negative log-likelihood:

-ln(L(S | P̂_M(Y))) = -∑_{i=0}^{|S|-1} ln(P̂_M(Y)(y^i))   (9)

We would like to maximize L(S | P̂_M(Y)) or, equivalently, to minimize the negative log-likelihood. However, this minimization is unbounded if the model is unconstrained. Consider for instance a pdf that concentrates all of its density arbitrarily closely around the sample points, so that

∫_{a_0}^{b_0} ∫_{a_1}^{b_1} ... ∫_{a_{l-1}}^{b_{l-1}} P̂_M(Y)(y) dy⟨L⟩ = 1   (10)

with each interval [a_i, b_i] chosen as tightly around the samples as possible. Such a pdf attains an ever larger likelihood, but it describes no more than the sample vector and is therefore said to overfit the data. This is a very important aspect of density estimation and is termed generalization. Overfitting tends to occur not only in such specific cases as in equation 10, but also often in cases where a combination of many pdfs is used, such as a kernel method. If overfitting takes place, the optimization algorithm that samples more points from the pdf will tend to converge rapidly into a fixed form and give poor results [10]. Therefore, a parametric model with few parameters, such as the normal pdf, is usually preferred, as it cannot easily overfit a set of samples.

Given a parametric model and thus a set of parameters for the involved pdfs, a general iterative method can be derived to estimate the set of pdf parameters so as to minimize the negative log-likelihood and get a maximum likelihood fit of the estimated probability distribution. Such a method is given by the EM algorithm, which we discuss in more detail in section 4.3.2. In the EM algorithm however, the model structure is fixed beforehand. In order to find the model structure, we can test whether some candidate model M_1 has a lower negative log-likelihood value than the current model M_0. If this is the case, M_1 is a better description than is M_0. Therefore, M_1 should replace M_0. In order to perform the negative log-likelihood test, the involved pdfs first have to be computed. To test whether the likelihood of the probability distribution implied by model M_1 is larger than that of the probability distribution implied by model M_0, we observe the negative log-likelihood difference:

-ln(L(S | P̂_{M_0}(Y))) - (-ln(L(S | P̂_{M_1}(Y)))) = ∑_{i=0}^{|S|-1} ln(P̂_{M_1}(Y)(y^i)) - ∑_{i=0}^{|S|-1} ln(P̂_{M_0}(Y)(y^i))   (11)

Observe that, for incrementally extended models fit with maximum likelihood, the negative log-likelihood difference as stated in equation 11 is always larger than or equal to 0. This implies that if we incrementally build a model by increasing its complexity, we will always prefer the more complex model if we are able to fit the data with a maximum likelihood. This is however not desirable. If the negative log-likelihood difference is only very small, the more complex model does indeed fit the data slightly better, but this comes at the expense of a larger running time since the model is more complex. By selecting the more complex model, we require a lot more time to perhaps gain a hardly noticeable difference in goodness of fit. This poses a well known trade-off between running time and expressional power.

What we require is to be able to test whether some negative log-likelihood difference is significant. If it is, this implies that we should select the more complex model, as it would significantly improve the resulting fit. We have then justified the selection of the more complex model. To this end, we use a statistical hypothesis test. If we have some random variable R along with N values r^0, r^1, ..., r^{N-1} for it, then as a result of the central limit theorem the sample mean R̄ is approximately normally distributed. If we now assume to have two such random variables R_0 and R_1 along with a hypothesized value h for the difference of their means, we observe:

T = √N (R̄_0 - R̄_1 - h) / σ̂_{R_0 - R_1}   (12)

where σ̂_{R_0 - R_1} denotes the sample standard deviation of the observed differences. Under the hypothesis, T approximately follows a Student t-distribution with N - 1 degrees of freedom:

T ∼ t_{N-1}   (13)

In our case, we let r_j^i be the negative log-likelihood of sample y^i under probability distribution P̂_{M_j}(Y), j ∈ {0, 1}. We then thus have that:

r_j^i = -ln(P̂_{M_j}(Y)(y^i))   (14)

We want to test whether model M_1 is to be preferred over model M_0. To do this, we set h = 0 and perform a right-sided test. This means that we are testing whether the expected value of R_0 - R_1 is significantly larger than 0, with which each step towards a more complex model is justifiable.
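As an illustration of the test just described, the following minimal Python sketch (ours, using normal pdfs fit by maximum likelihood) compares a univariate factorization against the full joint model on correlated data by performing the right-sided test on the per-sample negative log-likelihood differences.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
S = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)

# Model M0: univariate factorization P(Y0)P(Y1); model M1: joint P(Y0, Y1).
mu = S.mean(axis=0)
cov_full = np.cov(S, rowvar=False)
cov_diag = np.diag(np.diag(cov_full))

# r_j^i = -ln(P_Mj(y^i)): per-sample negative log-likelihoods (equation 14)
r0 = -stats.multivariate_normal(mu, cov_diag).logpdf(S)
r1 = -stats.multivariate_normal(mu, cov_full).logpdf(S)

# Right-sided test of the MSTH with h = 0 (equation 12)
d = r0 - r1
T = np.sqrt(len(d)) * d.mean() / d.std(ddof=1)
critical = stats.t.ppf(0.95, df=len(d) - 1)
print("T =", T, "reject MSTH:", T > critical)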
However, in all previous continuous approaches that use factorizations of a higher order than the univariate factorization as the model structure, such justification has not been used directly. Instead, the Kullback-Leibler (KL) divergence from the approximated factorization to the full joint approximated factorization is used. By doing so, it can be shown [13] that the differential entropy over the approximated factorization is minimized. Given maximum likelihood estimates, it can furthermore be shown [13] that the differential entropy difference is the same as the negative log-likelihood difference from equation 11.

The equality of the KL divergence and the negative log-likelihood difference under the assumption of a maximum likelihood estimate implies that the previous approaches can be seen to have used the negative log-likelihood without justifying any model selection decisions. However, unless strong restrictions are imposed on the probabilistic model, using this metric will unnecessarily increase the complexity, since any more complex model will always result in a fit with a lower negative log-likelihood. We can therefore conclude that justification is truly required in order for the resulting optimization algorithm to be scalable on decomposable problems.

If we do not use any justification, we can refrain from using the average negative log-likelihood and use the differential entropy instead. Computing the average negative log-likelihood is a time consuming task, as it requires |S| evaluations. However, for the normal pdf, the differential entropy over variables Y⟨a⟩ can be shown (see for instance [15]) to be ½(|a| + ln((2π)^{|a|} det(S(a)))), where S(a) is the sample covariance matrix. However, in this section we do require justification. In such a case, it does not suffice to only compute the average negative log-likelihood, since we also require the variance over the negative log-likelihood to compute the test statistic T. Still, we shall see in section 3.2.3 that this remark can yet be of practical concern. For more involved pdfs such as the normal mixture model, we cannot efficiently compute the differential entropy. Furthermore, neither can we guarantee a maximum likelihood estimate using the EM algorithm. To compute the negative log-likelihood difference, we thus have to evaluate the implied probability distribution for each sample.

3.1.2 Complexity penalization

In the previous section, we have seen that we can minimize the error of the fit and justify each minimization based upon a large enough increase in the negative log-likelihood difference. However, even though one fit may indeed be better than the other, this does not mean that we are interested in using it from either a generalization or a computational effectiveness point of view. To enforce this, a penalty term that increasingly penalizes more complex models can be introduced. The metric M that should be minimized can then be formalized as follows:

M(P̂_M(Y) | S) = Error(P̂_M(Y) | S) + Complexity(P̂_M(Y) | S)   (15)

The use of a complexity term is often called regularization in data modeling fields. Regularization is used to prevent the model from overfitting the data. Classically, the regularization equation has a regularization parameter λ as a factor times the regularization term. Setting λ = 0 then reduces the metric to an unconstrained one with respect to error minimization. Here, we have used a general complexity term that could be of such a form.
However, we leave the actual contents completely unspecified, which places no restrictions on the complexity term. Furthermore, the metric from equation 15 is thereby completely transparent from the start.

To derive a metric of the form we have in equation 15, we note that we are interested in the probability of a probability distribution implied by a model M, given a sample vector S. The model that implies a probability distribution with the largest probability is the most preferable one, if we disregard complexity and generalization for the moment. The probability that we are interested in can be written as follows using Bayes' rule:

P(P̂_M(Y) | S) = P(P̂_M(Y)) P(S | P̂_M(Y)) / P(S)   (16)

As the sample vector S is fixed, P(S) is a constant and we can discard it from equation 16. The probability of S given some probability distribution P̂_M(Y) implied by a model M can be seen as the likelihood as defined in equation 8, meaning P(S | P̂_M(Y)) = L(S | P̂_M(Y)). In order to add a preference for simpler models, we want to set the probability of a probability distribution P̂_M(Y) implied by a model M in such a way that it gets smaller if its complexity increases. The complexity of P̂_M(Y) is given by the amount of parameters θ that is required to define every pdf in the model M. We can define the probability of some probability distribution P̂_M(Y) implied by M = (ς, θ) by using a function ϑ(θ), ϑ(θ) ≥ 0, of the amount of parameters |θ|:

P(P̂_M(Y)) = ϑ(θ) / ∑_{ς' ∈ C_ς} ϑ(θ'^fit)   (17)

Combining the above, the metric to maximize becomes:

A(P̂_M(Y) | S) = ϑ(θ) ∏_{i=0}^{|S|-1} P̂_M(Y)(y^i)   (18)

In order to compare one model to another, the normalization factor ∑_{ς' ∈ C_ς} ϑ(θ'^fit) is not required, since it is the same for every model; it has therefore been dropped in equation 18. If we set ϑ(θ) = 1, equation 18 becomes just the likelihood from equation 8. Alternatively, we can let the probability of some probability distribution P(P̂_M(Y)) implied by a model M decrease exponentially with its complexity by setting:

ϑ(θ) = e^{-|θ|}   (19)

The objective is to maximize the metric. Alternatively, we can take the negative logarithm of the metric and minimize it. Just as is the case for the likelihood, using the negative logarithm is computationally more convenient. This can directly be seen from the definition of the metric, since the exponentials disappear and the product of a large amount of values in [0, 1] becomes a sum. By doing so, we get an instance of the general metric expression in equation 15:

-ln(A(P̂_M(Y) | S)) = -ln(L(S | P̂_M(Y))) + |θ| = Error(P̂_M(Y) | S) + Complexity(P̂_M(Y) | S)   (20)

In the metric that we have arrived at, the negative log-likelihood is the error to minimize and the complexity term is the amount of parameters in the probabilistic model. The resulting metric is similar to the Akaike Information Criterion (AIC) [1]. The AIC equals 2 times the expression in equation 20. This constant factor does not influence the result when computing the difference and checking whether it is larger than 0. So basically, we have arrived at the AIC metric.

The AIC metric elegantly favors simpler models. If two probability distributions have the same likelihood, the one with the least amount of parameters is favored. At this point, we turn back to our note on regularization theory. Even though the AIC metric favors simpler models by increasingly penalizing more complex ones, the amount of penalization is fixed. By using a regularization parameter λ times the AIC metric, the regularization could be increased. However, this would only be by a constant factor, assuming that λ is a free parameter λ ∈ R. Other than by increasing some regularization parameter, we can also use a different metric that penalizes the complexity differently. The above derivation is a very general one.
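For concreteness, the metric of equation 20 can be computed as in the following minimal Python sketch (ours, assuming maximum likelihood normal fits); the λ-weighted variant anticipates the BIC metric introduced next.

import numpy as np
from scipy import stats

def neg_log_likelihood(S, mu, cov):
    # -ln(L(S | P_M(Y))), evaluated per sample and summed (equation 9)
    return -stats.multivariate_normal(mu, cov).logpdf(S).sum()

def n_params(cov):
    # |theta| for a normal pdf: mean entries plus the distinct entries
    # of the symmetric covariance matrix
    d = cov.shape[0]
    return d + d * (d + 1) // 2

def aic_metric(S, mu, cov):
    # equation 20: error term plus complexity term
    return neg_log_likelihood(S, mu, cov) + n_params(cov)

def bic_metric(S, mu, cov, lam=0.5):
    # lam plays the role of the regularization parameter lambda;
    # lam = 0.5 gives the classical BIC weighting
    return neg_log_likelihood(S, mu, cov) + lam * np.log(len(S)) * n_params(cov)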
To demonstrate its transparent use, we propose to set:

ϑ(θ) = e^{-λ ln(|S|) |θ|}

whereby we have introduced a parameter λ that has the same semantics as the regularization parameter. The resulting metric that we arrive at can be written as follows:

-ln(A(P̂_M(Y) | S)) = -ln(L(S | P̂_M(Y))) + λ ln(|S|) |θ| = Error(P̂_M(Y) | S) + Complexity(P̂_M(Y) | S)   (21)

The metric in equation 21 is commonly known as the Bayesian Information Criterion (BIC) [29]; for λ = ½ one obtains its classical form. This metric has been proposed before in iterated density estimation approaches [24, 19, 27]. The BIC metric has empirically been observed to give good results. It penalizes more complex models more heavily than does the AIC metric, but in such a way that the most important structural building blocks are found. The AIC metric is less efficient because it tends to let such building blocks be merged as well, which gives an unnecessarily complex model structure.

Before we conclude this section, we note that there is a correspondence between the metrics in equations 20 and 21 and the general hypothesis test in the previous section. In both cases, the negative log-likelihood has to be computed. In the case of the hypothesis test, the average negative log-likelihood difference is compared to a critical value. In the case of the penalization metrics, the negative log-likelihood difference is compared to a value as well. In this case, the value is based on the difference in amount of parameters. The two approaches thus result in a similar test for candidacy that can be written as:

ln(L(S | P̂_{M_1}(Y))) - ln(L(S | P̂_{M_0}(Y))) > τ   (22)

In the case of the general statistical hypothesis test, τ is equal to the critical value times the amount of samples |S|. This is an important difference between the two approaches. Beyond a certain size of the sample vector, the hypothesis test no longer depends on |S|, since the negative log-likelihood is divided by it before testing it against the critical value, and the critical value doesn't change significantly anymore above a certain size of |S|. However, the size of the sample vector is not factored out of the equation for the penalization metric. This means that if the average negative log-likelihood difference is some value ε ≥ 0, then the difference itself becomes |S|ε. This expression can grow to any value, given a large enough sample vector. For the BIC metric, τ also grows as the size of the sample vector increases, but logarithmically. Empirically, this has been observed to be effective in practice. The setting of τ in the case of the metrics is done in a desirable way that is not the case for the hypothesis test. Given enough samples, more complex models are allowed. The amount of parameters that a more complex model conveys is thereby thus in some way justified, as there are enough samples to estimate the parameters properly.
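The different growth of the threshold τ with |S| under the two approaches can be made concrete with a small sketch (ours; the critical value and the parameter difference are assumed, illustrative numbers):

import numpy as np

critical = 1.645     # assumed fixed critical value for large |S|
delta_params = 3     # assumed difference in the amount of parameters

for n in (50, 200, 1000, 10000):
    tau_test = critical * n                    # hypothesis test: grows linearly in |S|
    tau_bic = 0.5 * np.log(n) * delta_params   # BIC (lambda = 1/2): grows logarithmically
    print(n, tau_test, tau_bic)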
3.2 Factorization selection

In section 2.2, we observed that in order to obtain a valid factorization, we have to make sure that we do not introduce any cycles into the factorization graph. This is the most general and only necessary condition on the graph, both in the unconditional and the conditional case. In addition, other constraints can be imposed, such as that the graph must be a chain, a tree, or that it may have no interactions of an order above some maximum κ. Such additional constraints are especially useful if the decisions that regard introducing dependencies are not primarily based on the intrinsic data. In such a case, overly complex models are easily introduced if these additional constraints are absent.

To be able to use a general graph search algorithm effectively, we either have to justify the introduction of more complex models or we have to penalize them. In this way, we avoid introducing unnecessarily complex models that take up a lot of computation time. The idea of such a general graph search algorithm is to maintain a set of edges or arcs that may still be added to the factorization graph without introducing cycles. From this set, the next edge or arc is selected according to certain criteria and the set is filtered again, until either the set is empty or no edge or arc meets the criteria for being added anymore. If a scoring metric can be assigned to the factorization graph, we can add the arc or edge that increases the metric the most. When the set of addable edges or arcs is empty, or no edge or arc can be added anymore so that the scoring metric is further increased, the algorithm terminates. This becomes more attractive if the metric is decomposable, so that the metric increase can be computed based upon the involved variables only instead of upon the entire graph. This approach was first used in evolutionary optimization by Pelikan, Goldberg and Cantu-Paz [26] and subsequently by Muhlenbein and Mahnig [24] as well as Bosman and Thierens [8, 10, 11], all using different scoring metrics. In the general case, we leave the presence of such a metric to specific implementation details.

This brings us to define the general incremental factorization search algorithm. As the notions of conditional dependence and unconditional dependence are intrinsically different, as are the corresponding graph types, we will specify the general algorithm for both cases separately. Even though the unconditional case can be written as a special version of the conditional case, the importance of specifying different algorithms can also be seen from the fact that computing the conditional pdf can be very difficult in contrast to the multivariate joint pdf.
3.2.1 An incremental general factorization search algorithm

We start with the unconditional version of the general graph search. In this case, there is at most one edge between every pair of nodes. Initially, this means that a triangle (e.g. lower or upper) of the edge adjacency matrix without the diagonal is set to true. If an edge is set to true, it means that it may still be added to the graph as it does not introduce any cycles. Subsequently, the edge selection process is iterated. Each iteration, a single edge is selected to be added to the graph. This edge is to be chosen from the set of allowed edges {(i, j) | a[i, j] ∧ (i, j) ∈ L²}. If no selection can be made at some point, the iterated edge selection process is halted. If an edge is selected however, it is added to the graph and the set of edges that are still allowed to be added is updated. This is done by traversing the connected components belonging to the nodes that are incident to the edge that is to be added. Each edge between vertices of these opposite connected components is marked as false. If there are no more addable edges, the iterated selection stops. The actual factorization is determined afterwards.

In the case of conditional dependencies, we can directly make use of the definition in equation 5. In the case of unconditional dependencies, we can do this indirectly by using equation 3. However, we will end up with the requirement of specifying a conditional pdf for each element of the so defined factorization. Therefore, we note that in theory we may very well write unconditional dependencies using (π, ω), but in practice we shall use a different structure. The result of the general factorization search algorithm in the unconditional case is the node vector ν. This is also the case in the ECGA by Harik [19]. The resulting factorization in this case corresponds to the vertices in the connected components of the factorization graph. If we now leave the edge selection procedure to be defined exterior to the general factorization search algorithm and write separate functions for adding an edge to the graph and finding the vertices in the connected components, we have a general factorization search algorithm for the case of unconditional dependencies.

However, the algorithm in this case can be simplified. To see this, let ν_0 and ν_1 be two connected components in the graph. If we only observe the variables within these components, the probability distribution is given by the factorized distribution P({Y_v | v ∈ ν_0})P({Y_v | v ∈ ν_1}). Any edge (u, v) with u ∈ ν_0 ∧ v ∈ ν_1 will now result in testing the factorized distribution P({Y_v | v ∈ (ν_0 ⊔ ν_1)}) against the former factorized distribution. Hence, if the decision is made that the edge should be added to the graph, all other edges between ν_0 and ν_1 become unallowed. Moreover, if the decision is made that the edge should not be added to the graph, all other edges between ν_0 and ν_1 do not have to be tested anymore. In other words, the individual edges that connect a connected component are not important, as the vertices inside a single connected component make up a joint pdf in the resulting factorization. A graphical impression of this is depicted in figure 5. This implies that we can simplify the general algorithm in the case of unconditional dependencies by first placing every variable in a singleton vector. Subsequently, we find 2 vectors that are to be spliced. If no such combination of vectors can be found, the algorithm terminates. Otherwise, the splice operation is executed and the process is iterated. As each splice operation reduces the amount of available vectors by 1, the amount of splice operations in this algorithm is clearly bounded by l - 1.

Figure 5: Simplifying the general case of unconditional dependencies.

UnconditionalFactorizationGraphSearch()
1 ν ← new vector of (vector of integer)
2 |ν| ← l
3 for i ← 0 to l - 1 do
3.1 ν_i ← (i)
4 while |ν| > 1 do
4.1 (c_0, c_1) ← FindNodeVectors(ν)
4.2 if c_0 < 0 then
4.2.1 breakwhile
4.3 for i ← 0 to |ν_{c_1}| - 1 do
4.3.1 (ν_{c_0})_{|ν_{c_0}|} ← (ν_{c_1})_i
4.4 ν_{c_1} ← ν_{|ν|-1}
4.5 |ν| ← |ν| - 1
5 return(ν)

In the conditional case, both an arc (i, j) and an arc (j, i) may be present in the graph now. The amount of arcs initially is therefore twice as large as the amount of edges in the unsimplified unconditional case. The algorithm that finds an arc to add from the set of allowed arcs is named FindArc. Furthermore, we now require a cycle detection algorithm, which we specify as AddArc. This algorithm returns the amount of arcs that have become unallowed because of the addition of an arc. In order to define this algorithm, we require the vector of predecessor nodes v_p and the vector of successor nodes v_s. Algorithm AddArc is given in appendix D.
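Since algorithm AddArc itself is only given in appendix D, the following Python sketch (ours, illustrative only) shows the kind of bookkeeping it must perform: after arc (v_0, v_1) is added, every arc from a transitive successor of v_1 to a transitive predecessor of v_0 would close a cycle and is marked as no longer allowed.

def add_arc(v0, v1, allowed, pred, succ):
    """allowed: l x l boolean matrix of arcs that may still be added;
    pred[i] / succ[i]: sets of transitive predecessors / successors of i.
    Returns the number of arcs that become unallowed."""
    allowed[v0][v1] = False
    sources = pred[v0] | {v0}          # everything that now precedes v1
    targets = succ[v1] | {v1}          # everything that now succeeds v0
    for s in sources:
        succ[s] |= targets
    for t in targets:
        pred[t] |= sources
    removed = 1
    for t in targets:
        for s in sources:
            if allowed[t][s]:
                allowed[t][s] = False  # arc (t, s) would close a cycle
                removed += 1
    return removed

# toy usage on l = 3 variables
l = 3
allowed = [[i != j for j in range(l)] for i in range(l)]
pred = [set() for _ in range(l)]
succ = [set() for _ in range(l)]
add_arc(0, 1, allowed, pred, succ)
add_arc(1, 2, allowed, pred, succ)
assert not allowed[2][0]               # arc (2, 0) would create a cycle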
The resulting factorization can now be specified using (π, ω). The parent variable indices in the vector function π can directly be computed from the graph. As the graph contains no cycles, computing a topological sort gives the required ordering ω:

ConditionalFactorizationGraphSearch()
1 a ← new array of boolean in 2 dimensions with size l × l
2 v_p, v_s ← 2 new arrays of (vector of integer) with size l
3 for i ← 0 to l - 1 do
3.1 for j ← 0 to l - 1 do
3.1.1 a[i, j] ← (i ≠ j)
4 (v_0, v_1) ← FindArc(a, v_p, v_s)
5 while v_0 ≥ 0 do
5.1 AddArc(v_0, v_1, a, v_p, v_s)
5.2 (v_0, v_1) ← FindArc(a, v_p, v_s)
6 return(TopologicalSort(v_p))

We now have a detailed outline of an incremental general factorization search algorithm for both the conditional and the unconditional case. In order to complete the search algorithms, we must specify how to find two vectors to splice or how to find an arc to add to the graph. A general algorithm for this is dependent on the type of information we wish to guide model selection by. If we have a statistical hypothesis test through which we can decide whether or not two vectors should be spliced or an arc should be added to the graph, we can randomly order the possible splices or arcs still to be added to the graph and test them in this order until an object is found for which the test evaluates to true. If every test fails, the resulting graph has been found. We can formalize this algorithm for the unconditional case as follows (Test denotes the exterior splice test):

FindNodeVectorsByTesting(ν)
1 splices ← new array of (integer, integer) with size ½|ν|(|ν| - 1)
2 k ← 0
3 for i ← 0 to |ν| - 1 do
3.1 for j ← i + 1 to |ν| - 1 do
3.1.1 splices[k] ← (i, j)
3.1.2 k ← k + 1
4 for i ← 0 to ⌈½k⌉ - 1 do
4.1 r_1 ← RandomNumber(k)
4.2 r_2 ← RandomNumber(k)
4.3 (v_0, v_1) ← splices[r_1]
4.4 splices[r_1] ← splices[r_2]
4.5 splices[r_2] ← (v_0, v_1)
5 for i ← 0 to k - 1 do
5.1 if Test(splices[i], ν) then
5.1.1 return(splices[i])
6 return((-1, -1))

The conditional case is nearly the same, with slight changes. These changes involve using the predecessor nodes vector v_p and the successor nodes vector v_s instead of the node vector ν, as well as making the for loops regard variables 0 up to and including l - 1 and only selecting arcs that are still allowed:

FindArcByTesting(a, v_p, v_s)
1 arcs ← new array of (integer, integer) with size l(l - 1)
2 k ← 0
3 for i ← 0 to l - 1 do
3.1 for j ← 0 to l - 1 do
3.1.1 if a[i, j] then
3.1.1.1 arcs[k] ← (i, j)
3.1.1.2 k ← k + 1
4 for i ← 0 to ⌈½k⌉ - 1 do
4.1 r_1 ← RandomNumber(k)
4.2 r_2 ← RandomNumber(k)
4.3 (v_0, v_1) ← arcs[r_1]
4.4 arcs[r_1] ← arcs[r_2]
4.5 arcs[r_2] ← (v_0, v_1)
5 for i ← 0 to k - 1 do
5.1 if Test(arcs[i], v_p, v_s) then
5.1.1 return(arcs[i])
6 return((-1, -1))

If on the other hand we have a metric for how well a factorization fits as a model, we can try to optimize this metric in a greedy fashion by performing the splice or by adding the arc that increases this metric the most. If there is no such possible splice or arc left, the resulting graph has been found. This approach is derived from the Bayesian approach we mentioned earlier. We can formalize the unconditional variant of the algorithm as follows (MetricIncrease denotes the exterior metric computation):

FindNodeVectorsByMetric(ν)
1 δ_max ← 0
2 i_max ← -1
3 j_max ← -1
4 for i ← 0 to |ν| - 1 do
4.1 for j ← i + 1 to |ν| - 1 do
4.1.1 δ ← MetricIncrease(ν, i, j)
4.1.2 if δ > δ_max then
4.1.2.1 δ_max ← δ
4.1.2.2 i_max ← i
4.1.2.3 j_max ← j
5 return((i_max, j_max))

The conditional variant of the above algorithm is quite the same. Instead of the node vector ν, the predecessor nodes vector v_p and the successor nodes vector v_s are required. The for loops go over every combination of two variables (i, j) and only compute the metric change if the arc (i, j) is still allowed to be added to the graph. This results in the following algorithm:

FindArcByMetric(a, v_p, v_s)
1 δ_max ← 0
2 i_max ← -1
3 j_max ← -1
4 for i ← 0 to l - 1 do
4.1 for j ← 0 to l - 1 do
4.1.1 if a[i, j] then
4.1.1.1 δ ← MetricIncrease(i, j, v_p, v_s)
4.1.1.2 if δ > δ_max then
4.1.1.2.1 δ_max ← δ
4.1.1.2.2 i_max ← i
4.1.1.2.3 j_max ← j
5 return((i_max, j_max))

The only ingredient missing from our general factorization search algorithms is either a metric or a test. In most previous approaches [5, 7, 8, 12, 19, 24, 26], a metric was used in combination with a search algorithm. In a selection of these cases, a general factorization search algorithm was used [8, 12, 24, 26]. The use of a test to find a factorization has been employed in a single case using discrete random variables [28] and in previous work on continuous random variables [13].
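To connect the above pseudocode to a runnable form, the following minimal Python sketch (ours) combines the simplified unconditional search with a greedy, metric-driven selection in the spirit of FindNodeVectorsByMetric; metric_gain is an assumed placeholder for the decomposable metric increase of a splice.

def search_unconditional(l, metric_gain):
    """metric_gain(c0, c1) -> float: improvement of splicing node
    vectors c0 and c1 (positive means the splice is beneficial)."""
    nu = [[i] for i in range(l)]
    while len(nu) > 1:
        best, pair = 0.0, None
        for i in range(len(nu)):
            for j in range(i + 1, len(nu)):
                gain = metric_gain(nu[i], nu[j])
                if gain > best:
                    best, pair = gain, (i, j)
        if pair is None:
            break                      # no splice improves the metric
        i, j = pair
        nu[i] = nu[i] + nu[j]          # splice the two node vectors
        nu.pop(j)
    return nu

# toy metric: reward grouping indices of equal parity (illustrative only);
# the result is a marginal product model over the even and the odd indices
print(search_unconditional(4, lambda a, b: 1.0 if (a[0] - b[0]) % 2 == 0 else -1.0))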
3.2.2 Negative log-likelihood statistical hypothesis testing

Let f_0 be the current factorization and f_1 be the factorization that results when the arc is added to the graph or when the two vectors are spliced. Given these two factorizations, we want to have a test that tells us which of these two factorizations significantly better fits the given sample points. To this end, we specify the following hypothesis that is derived from the MSTH:

Factorization Selection Test Hypothesis (FSTH) A probability distribution based on factorization f_1 cannot better describe the given samples than can a probability distribution based on factorization f_0.

If the FSTH is rejected, the test returns that the arc (v_0, v_1) should be added to the graph or that the two node vectors ν_{c_0} and ν_{c_1} should be spliced. A statistical hypothesis testing metric for selecting a factorization is readily available, since in section 3.1.1 a general statistical hypothesis test on the basis of the negative log-likelihood difference was given for any probabilistic model. Combining this test with the factorization search algorithms from section 3.2.1, we now have a factorization selection method based on statistical hypothesis tests.

Regarding efficiency, we note that computing the negative log-likelihood is decomposable over the factorization. This means that if we test to alter the model over some arc or by merging some sets, we only have to compute the negative log-likelihood difference for the involved variables. In the case of undirected dependencies, the negative log-likelihood can be written as:

-ln(L(S | P̂_ν(Y))) = -∑_{i=0}^{|S|-1} ln(∏_{j=0}^{|ν|-1} P̂(Y⟨ν_j⟩)(y^i⟨ν_j⟩)) = -∑_{j=0}^{|ν|-1} ∑_{i=0}^{|S|-1} ln(P̂(Y⟨ν_j⟩)(y^i⟨ν_j⟩))   (23)

Assume that ν^0 is the node vector for the base model and that ν^1 is the node vector for the new model. Furthermore, we assume that they differ only in vectors c_0 and c_1, which are separate elements of ν^0 and appear spliced as c_0 ⊔ c_1 in ν^1. The negative log-likelihood difference then becomes:

-ln(L(S | P̂_{ν^0}(Y))) - (-ln(L(S | P̂_{ν^1}(Y)))) = ∑_{i=0}^{|S|-1} (ln(P̂(Y⟨c_0 ⊔ c_1⟩)(y^i⟨c_0 ⊔ c_1⟩)) - ln(P̂(Y⟨c_0⟩)(y^i⟨c_0⟩)) - ln(P̂(Y⟨c_1⟩)(y^i⟨c_1⟩)))   (24)

since all factors that are not involved in the splice operation cancel out in the difference.

In the case of conditional probabilities, assume we add arc (v_0, v_1) to the graph, meaning that Y_{v_1} is made conditionally dependent on Y_{v_0}. In an analogous way, we find that given conditional factorizations (π^0, ω^0) and (π^1, ω^1) such that π^1(v_1) = (v_0) ⊔ π^0(v_1), the negative log-likelihood difference becomes:

-ln(L(S | P̂_{(π^0,ω^0)}(Y))) - (-ln(L(S | P̂_{(π^1,ω^1)}(Y)))) = ∑_{i=0}^{|S|-1} (ln(P̂(Y_{v_1} | Y⟨π^1(v_1)⟩)(y^i⟨(v_1) ⊔ π^1(v_1)⟩)) - ln(P̂(Y_{v_1} | Y⟨π^0(v_1)⟩)(y^i⟨(v_1) ⊔ π^0(v_1)⟩)))   (25)

because the parts in the factorization that do not change cancel out in the difference.
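As an illustration of equation 24, the following minimal Python sketch (ours, with maximum likelihood normal pdfs over each factor) computes the negative log-likelihood gain of a candidate splice from the involved factors only:

import numpy as np
from scipy import stats

def nll(S_part):
    mu = S_part.mean(axis=0)
    cov = np.atleast_2d(np.cov(S_part, rowvar=False))
    return -stats.multivariate_normal(mu, cov).logpdf(S_part).sum()

def splice_gain(S, c0, c1):
    """Negative log-likelihood decrease of replacing P(Y<c0>)P(Y<c1>)
    by the joint P(Y<c0 + c1>); only the involved factors are evaluated."""
    return nll(S[:, c0]) + nll(S[:, c1]) - nll(S[:, c0 + c1])

rng = np.random.default_rng(1)
S = rng.multivariate_normal([0, 0, 0], [[1, .9, 0], [.9, 1, 0], [0, 0, 1]], 500)
print(splice_gain(S, [0], [1]))  # large: Y0 and Y1 are strongly dependent
print(splice_gain(S, [0], [2]))  # near zero: Y0 and Y2 are independent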
Since each conditional pdf uses a multivariate joint pdf over all the involved variables, the difference would be 5. However, the parameters in the factor P(Y_0|Y_1 Y_2 Y_3) are not only present in this factor. The factors themselves overlap, which is not the case for the unconditional dependencies. Because of this overlap, for instance the full covariance matrix entry (3, 3) is already required, regardless of the fact that Y_3 is added as a conditional variable to the first factor. Therefore, the number of additional parameters is not equal to 5, but less. However, the additional computational resources that are required do increase in the same way as in the unconditional case with respect to joint pdfs. The reason for this is that the covariance matrix will have to be inverted. Furthermore, if we do not take a normal pdf but a normal mixture pdf, we estimate the parameters anew using the EM algorithm anyway, since we cannot suffice with a single full covariance matrix in which the required entries are stored. Also, since applying the EM algorithm is computationally expensive and its performance degrades with increasing dimension, we do not want to apply it directly to the full joint distribution to find all involved parameters, but apply it anew as the factorization is built during the search algorithm. Concluding, decomposing the number of parameters over the conditional factorization is a rational operation, as it penalizes the additional computational resources. The resulting metric difference can be written as:

\ln(A(\hat{P}_{(\nu_1,\omega_1)}(Y)\,|\,S)) - \ln(A(\hat{P}_{(\nu_0,\omega_0)}(Y)\,|\,S)) = -\sum_{i=0}^{|S|-1}\Big(\ln(\hat{P}(Y_{v_1}|Y\langle\nu_1(v_1)\rangle)(y^{i}\langle(v_1)\sqcup\nu_1(v_1)\rangle)) - \ln(\hat{P}(Y_{v_1}|Y\langle\nu_0(v_1)\rangle)(y^{i}\langle(v_1)\sqcup\nu_0(v_1)\rangle))\Big) + |\theta_{fit}(\nu_1,\omega_1)| - |\theta_{fit}(\nu_0,\omega_0)|

where the difference in the number of parameters is itself computed per factor, giving a decomposable metric. If we can derive the differential entropy analytically to find an efficiently computable expression, we can use the entropy over the estimated model instead of the negative log-likelihood. This is for instance the case for the normal pdf. As a result, the computations for model selection based on minimizing the metric can be done efficiently when using a normal pdf.

3.3 Factorization mixture selection

So far, we have discussed finding relations between variables based on a sample vector and using a factorization to build a useful probabilistic model. However, the structure of the sample vector may be highly non-linear. This non-linearity can force us to use probabilistic models of a high complexity in order to retain some of it. Moreover, when relatively simple pdfs such as the normal pdf are used, the non-linear interactions cannot be captured even with higher order models. For instance, a two dimensional sample vector in which the samples are aligned in a V shape cannot be fit adequately by a product of one dimensional normal pdfs, nor by a single two dimensional normal pdf. However, if we identify each line of the V shape as a single cluster, we can adequately model the sample vector using a sum of two dimensional normal pdfs (see the sketch after this paragraph). The key issue is the notion of clusters, which are possibly overlapping subvectors of the original sample vector, optionally augmented with additional data. When we do not allow the subvectors to overlap, we classify each point in the sample vector to be explicitly part of some cluster or class. This instance of clustering is called partitioning. The use of clusters allows us to efficiently break up non-linear interactions so that we can use simple models to get an adequate representation of the sample vector. Furthermore, computationally efficient clustering algorithms exist that provide useful results.
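The following illustrative Python sketch (with numpy; not part of the algorithms in this paper) generates a V-shaped sample vector and compares the average log-likelihood of a single maximum-likelihood normal fit against that of a two-component cluster-wise fit. For simplicity the split uses the known arm labels, where the algorithms below would have to find the clusters automatically:

import numpy as np

rng = np.random.default_rng(42)
t = rng.uniform(0.0, 1.0, size=500)
left = np.c_[-t, t] + rng.normal(0.0, 0.02, size=(500, 2))
right = np.c_[t, t] + rng.normal(0.0, 0.02, size=(500, 2))
samples = np.vstack([left, right])

def normal_pdf(x, mean, cov):
    # Row-wise density of a maximum-likelihood multivariate normal fit.
    d = x - mean
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    expo = -0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
    return np.exp(expo - 0.5 * (logdet + x.shape[1] * np.log(2 * np.pi)))

single = np.log(normal_pdf(samples, samples.mean(0),
                           np.cov(samples, rowvar=False))).mean()
mixture = np.log(0.5 * normal_pdf(samples, left.mean(0), np.cov(left, rowvar=False))
                 + 0.5 * normal_pdf(samples, right.mean(0), np.cov(right, rowvar=False))).mean()
print(single, mixture)  # the cluster-wise mixture scores distinctly higher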
There are exact algorithms for partitioning [20], but their running times make them of no practical use in our case. What we require is a fast approximate assessment of clusters that we can fit well using some pdf. Within these clusters, we can infer what factorization is best and get a probability distribution over the cluster. As all of this is part of the larger algorithm that attempts to infer the structure of the search space and use it in optimization, the clustering should be effective in representation as well as in computation time requirements.

Each cluster is processed separately in order to have a probability distribution fit over it. As such, we have a mixture of probability distributions. By assigning each of these probability distributions a weight in [0, 1] and letting the sum over all these weights equal 1, the result is again a probability distribution. We write these weights as β_i; they are commonly known as mixing coefficients. We write K = (K^0, K^1, …, K^{|K|−1}) for the vector of clusters, where each cluster is a vector of samples K^i = (K^i_0, K^i_1, …, K^i_{|K^i|−1}) with K^i_j ∈ S, so K^i_j = ((K^i_j)_0, (K^i_j)_1, …, (K^i_j)_{l−1}). We can now write the resulting probability distribution as:

\hat{P}_{f\langle K\rangle}(Y) = \sum_{i=0}^{|K|-1}\beta_i\,\hat{P}^{i}_{f^{i}}(Y)    (29)

The reason why we require an i in the superscript of \hat{P}^{i}_{f^{i}}(Y) is to be able to indicate that the i-th probability distribution is based upon the i-th cluster of samples. To determine the β_i and the \hat{P}^{i}_{f^{i}}(Y) in equation 29, two approaches can be taken. If we cluster the sample vector into possibly overlapping subvectors, we can for instance determine the mixing coefficients as the proportional subvector sizes. If we do not explicitly cluster the sample vector, but allow for each sample a probability of assigning it to a certain cluster, we can mathematically derive a way of finding the maximum expectation of this probability over all samples. By doing so, the clusters can be seen as copies of the original sample vector, augmented with likelihood values for each sample that indicate the probability that the sample belongs to a certain cluster. As such, the resulting clusters can be seen as fuzzy clusters, since there is no clear border between the clusters as is the case in partitioning. In practical applications, the first approach leads to the use of partitioning algorithms and the second approach leads to expectation maximization (EM) algorithms. If we use such an EM algorithm however, we do not get any clusters to which we can apply our factorization selection procedures for constructing the final probability distribution. Furthermore, we have noted that we only use the normal pdf. If we identify a single cluster equal to the entire sample vector and fit a normal mixture model to the cluster, we also have to use the EM algorithm, as we shall see in section 4.3.2. This gives an approach that is similar to identifying fuzzy clusters using the EM algorithm and subsequently using the result as the probability distribution. Therefore, we only focus on partitioning algorithms, so as to be able to apply factorization selection to the identified partitions.

3.3.1 Partitioning algorithms

One of the most fundamental aspects of partitioning is the distance function. The distance of a point to a cluster determines to which cluster the point will actually be assigned. Often, the Euclidean distance is used. However, if we are to cluster two variables that lie on different scales, the Euclidean distance might not be the best choice.
Optically, if we present the sample vector in a rectangular graph where the axes are rescaled to fit within the rectangle, the clusters we observe are not the clusters as they appear in the data based upon the Euclidean distance. The latter clusters are found if both axes are of the same scale. One axis might become a lot longer in that case, and clusters that were obvious in the rectangular graph may not be obvious at all anymore. The Euclidean distance is in this case still useful, but only after the sample data has been rescaled in every dimension. Rescaling on the other hand is sensitive to outliers. For two |a| dimensional points y^i⟨a⟩ and y^j⟨a⟩, the Euclidean distance d_E(y^i⟨a⟩, y^j⟨a⟩) between them is given by:

d_E(y^{i}\langle a\rangle, y^{j}\langle a\rangle) = \sqrt{(y^{i}\langle a\rangle - y^{j}\langle a\rangle)^{T}(y^{i}\langle a\rangle - y^{j}\langle a\rangle)} = \sqrt{\sum_{k=0}^{|a|-1}(y^{i}_{a_k} - y^{j}_{a_k})^{2}}

The scaled Euclidean distance is obtained by computing the sample variance in each dimension separately:

d_{ES}(y^{i}\langle a\rangle, y^{j}\langle a\rangle) = \sqrt{(y^{i}\langle a\rangle - y^{j}\langle a\rangle)^{T} D(a)^{-1}(y^{i}\langle a\rangle - y^{j}\langle a\rangle)},\quad D(a) = \mathrm{diag}(s^{2}_{a_0}, s^{2}_{a_1}, \ldots, s^{2}_{a_{|a|-1}})

To compute the distance between a point and a cluster, the distance can be taken between the cluster centroid and the point. A problem with this approach is that the distance to a cluster has ellipsoid shaped isolines that are aligned along the axes. Therefore, even samples with a linear interaction through correlation will likely be broken up into multiple axis-aligned, ellipsoid shaped clusters. To overcome this problem, the Mahalanobis distance metric [23] can be used. Whereas the Euclidean distance metric can be seen as the result of fitting a normal pdf with zero covariances, the Mahalanobis distance metric can be seen as the result of fitting a normal pdf with a full covariance matrix. The clusters can therefore potentially be shaped better to the form of the data. Let \overline{K}^{i}\langle a\rangle be the centroid of cluster i and let S(K^i, a) be the sample covariance matrix of the points in cluster i over the features selected by a:

\overline{K}^{i}\langle a\rangle = \frac{1}{|K^{i}|}\sum_{j=0}^{|K^{i}|-1} K^{i}_{j}\langle a\rangle

The Mahalanobis distance between a point y^i⟨a⟩ and cluster j can now be written as:

d_M(y^{i}\langle a\rangle, j) = \sqrt{(y^{i}\langle a\rangle - \overline{K}^{j}\langle a\rangle)^{T} S(K^{j}, a)^{-1}(y^{i}\langle a\rangle - \overline{K}^{j}\langle a\rangle)}

The isolines of the Mahalanobis distance are again ellipsoids, but they can be aligned along any direction. In the special case where the samples are uncorrelated, these ellipsoids are aligned along the axes and the Mahalanobis distance becomes equivalent to the scaled Euclidean distance. The Mahalanobis distance overcomes two limitations of the scaled Euclidean distance: on the one hand, it corrects for correlation between the different features; on the other hand, it can provide curved as well as linear decision boundaries. The disadvantage of the Mahalanobis distance is that the covariance matrices have to be determined, which becomes less reliable as the number of samples in a cluster becomes smaller. Furthermore, the memory and computation time requirements are O(l²) instead of O(l).
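A minimal numpy sketch of the three distance measures just discussed (illustrative function names; the per-dimension variances and the cluster covariance matrix are assumed to have been estimated from the samples beforehand):

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def scaled_euclidean(x, y, variances):
    # Equivalent to the Euclidean distance after rescaling each dimension
    # by its sample standard deviation.
    return np.sqrt(np.sum((x - y) ** 2 / variances))

def mahalanobis(x, centroid, cov):
    # Full covariance matrix; reduces to the scaled Euclidean distance
    # when cov is diagonal.
    d = x - centroid
    return np.sqrt(d @ np.linalg.inv(cov) @ d)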
The mixing coefficients β_i can be determined in different ways. One of the two most common ways is to use a probabilistic approach and set β_i to the ratio of the size of the i-th cluster to the total size of all clusters:

\forall i \in \{0, \ldots, |K|-1\}:\; \beta_i = \frac{|K^{i}|}{\sum_{j=0}^{|K|-1}|K^{j}|}    (35)

A problem arises when using equation 35 if the peaks of the cost function are not equally high in the case of maximization, since selection will prefer the larger peak. In the case of equal peaks, the search may still converge to concentrate on a single peak because of finite statistical instability. To this end, the other common way to determine the mixing coefficients can be used. In this case, the mixing coefficients are determined as the average cost of the niche divided by the sum of the average costs over every niche:

\forall i \in \{0, \ldots, |K|-1\}:\; \beta_i = \frac{\frac{1}{|K^{i}|}\sum_{j=0}^{|K^{i}|-1} C(K^{i}_{j})}{\sum_{q=0}^{|K|-1}\frac{1}{|K^{q}|}\sum_{j=0}^{|K^{q}|-1} C(K^{q}_{j})}    (36)

The mixing coefficients of equations 35 and 36 were later used in the first approach [25] that applied clustering to iterated density estimation evolutionary algorithms. However, the formula in equation 36 does not work if the objective is to minimize, or if cost values drop below zero. It is reasonable to assume, however, that the optimum value will be far from the initial sample cost average. If we compute the average fitness for niching with respect to this initial sample cost average C_I, and take the absolute value to account for maximization as well as minimization, we get:

\forall i \in \{0, \ldots, |K|-1\}:\; \beta_i = \frac{\big|\frac{1}{|K^{i}|}\sum_{j=0}^{|K^{i}|-1} C(K^{i}_{j}) - C_I\big|}{\sum_{q=0}^{|K|-1}\big|\frac{1}{|K^{q}|}\sum_{j=0}^{|K^{q}|-1} C(K^{q}_{j}) - C_I\big|}    (37)

Before moving to the actual clustering algorithms, we place a final note on the use of distance measures. In general, clusters should unite sample points that constitute promising regions in the search space. These regions do not necessarily have to correspond to the Euclidean space in which they are coded. This coding space is called the genotypic space. The decoded space in which the parameters have their true meaning and directly contribute to the cost function, is called the phenotypic space. It should be noted that clustering in phenotypic space is in general more desirable, as it is more likely to cluster the samples in the most effective way. For most continuous problems, the two spaces are identical. We now focus on the partitioning algorithms themselves.

The leader algorithm  The leader algorithm is one of the fastest clustering algorithms. Its use can thus be beneficial if the overhead introduced by the factorization mixture selection methods is to remain small. Furthermore, there is no need to specify in advance how many clusters there should be. The leader algorithm goes over the sample vector exactly once. For each sample it encounters, it finds the first cluster that has a leader closer to the sample than a given threshold T_d; the first sample to make a new cluster is appointed to be its leader. If no such cluster can be found, a new cluster is created containing only this single sample. Unless the order in which the clusters are inspected is randomized, the first clusters will always be quite a lot larger than the later ones. Moreover, the asymptotic running time for finding the first cluster with a leader closer than T_d is the same as that of going over all clusters and finding the one with the smallest distance. Therefore, we prefer to find the nearest cluster based on the leaders of the clusters; if that nearest leader is closer than T_d, we assign the sample to its cluster. Another problem that arises is that if we have two clusters to which a sample is equally close, one cluster will receive all such samples if we go over the clusters in a deterministic order. To overcome this problem, the order in which the clusters are scanned should be randomized as well. One of the major drawbacks of the algorithm is that it is not invariant under permutations of the input samples. Most clustering algorithms lack this invariance, but not as strongly as does the leader algorithm. Therefore, to be sure that the ordering of the sample vector is not subject to large repeating sequences of samples, which would result in undesired fixed density estimations, we propose to randomize the ordering of the samples that are input to the leader algorithm. This can be done by allocating an array of indices that is randomly accessed in the algorithm. In order to use the scaled Euclidean distance measure, we first have to compute the sample variance s²_i in each dimension.
The resulting algorithm can be formalized as follows:

EuclideanLeaderClustering(T_d)
1  L ← new array of (vector of real) with size |S|
2  K ← ()
3  O_s, O_c ← 2 new arrays of integer with size |S|
4  for i ← 0 to l − 1 do
4.1   \overline{Y}_i ← (\sum_{j=0}^{|S|-1} y^{j}_{i}) / |S|
4.2   s²_i ← (\sum_{j=0}^{|S|-1} (y^{j}_{i} - \overline{Y}_i)²) / |S|
5  for i ← 0 to |S| − 1 do O_s[i] ← i
6  for i ← 0 to |S| − 1 do
6.1   q_s ← RandomNumber(|S| − i)
6.2   for j ← 0 to |K| − 1 do O_c[j] ← j
6.3   c_min ← −1; d_min ← T_d
6.4   for j ← 0 to |K| − 1 do
6.4.1    q_c ← RandomNumber(|K| − j)
6.4.2    if d_ES(y^{O_s[q_s]}, L[O_c[q_c]]) < d_min then
6.4.2.1     d_min ← d_ES(y^{O_s[q_s]}, L[O_c[q_c]])
6.4.2.2     c_min ← O_c[q_c]
6.4.3    O_c[q_c] ↔ O_c[|K| − j − 1]
6.5   if c_min ≥ 0 then
6.5.1    append y^{O_s[q_s]} to K^{c_min}
6.6   else
6.6.1    L[|K|] ← y^{O_s[q_s]}
6.6.2    create a new cluster K^{|K|} containing only y^{O_s[q_s]}
6.7   O_s[q_s] ↔ O_s[|S| − i − 1]
7  return(K)

The scaled Euclidean distance measure is used in the above algorithm. We cannot directly exchange it for the Mahalanobis distance measure. The reason for this is that in order to evaluate the Mahalanobis distance, we require the sample covariance matrix of the samples that belong to a cluster, as well as their sample mean. For the Euclidean measure, it suffices to have a single point serving as a centroid. To use the Mahalanobis distance, we can determine the required parameters as the clustering algorithm progresses through the samples. As a point is added to a cluster, the sample covariance matrix and the sample mean of that cluster are updated. However, the covariance matrix can only be computed and used in the Mahalanobis distance if we have at least two samples. Furthermore, based on only a few samples, the Mahalanobis distance may give unstable results, leading to unnatural clusters. Therefore, we propose that a cluster must first grow to some minimum size T_s before the Mahalanobis distance is used for it. The leader aspect of the Euclidean variant of the algorithm lies in the fact that the first sample to define a new cluster is taken to be the leader of the cluster; it serves as a centroid for the Euclidean distance. In the variant that results when we use the Mahalanobis distance, the sample mean takes over the role of the leader for a cluster as soon as the size of that cluster has reached T_s. The sample mean of a cluster changes whenever a sample is added to that cluster. Therefore, the term leader is somewhat misleading in the resulting algorithm:

MahalanobisLeaderClustering(T_d, T_s)
1–5  as in EuclideanLeaderClustering(T_d)
6  for i ← 0 to |S| − 1 do
6.1   q_s ← RandomNumber(|S| − i)
6.2   for j ← 0 to |K| − 1 do O_c[j] ← j
6.3   c_min ← −1; d_min ← T_d
6.4   for j ← 0 to |K| − 1 do
6.4.1    q_c ← RandomNumber(|K| − j)
6.4.2    if |K^{O_c[q_c]}| ≥ T_s then d ← d_M(y^{O_s[q_s]}, O_c[q_c])
6.4.3    else d ← d_ES(y^{O_s[q_s]}, L[O_c[q_c]])
6.4.4    if d < d_min then
6.4.4.1     d_min ← d
6.4.4.2     c_min ← O_c[q_c]
6.4.5    O_c[q_c] ↔ O_c[|K| − j − 1]
6.5   if c_min ≥ 0 then
6.5.1    update the sample mean \overline{K}^{c_min} and the symmetric sample covariance matrix S(K^{c_min}, L) incrementally to include y^{O_s[q_s]}
6.5.2    append y^{O_s[q_s]} to K^{c_min}
6.6   else
6.6.1    L[|K|] ← y^{O_s[q_s]}
6.6.2    create a new cluster K^{|K|} containing only y^{O_s[q_s]}
6.7   O_s[q_s] ↔ O_s[|S| − i − 1]
7  return(K)

A compact executable sketch of the Euclidean variant is given below.
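This is a minimal Python sketch of the Euclidean leader algorithm described above (illustrative only; it randomizes both the sample order and the cluster scanning order as proposed, and uses the scaled Euclidean distance through per-dimension variances):

import numpy as np

def euclidean_leader_clustering(samples, t_d, rng=np.random.default_rng()):
    # samples: array of shape (n, l); t_d: distance threshold.
    variances = samples.var(axis=0)
    leaders = []   # leader point of each cluster
    clusters = []  # lists of sample indices
    for i in rng.permutation(len(samples)):      # randomized input order
        x = samples[i]
        best, d_min = -1, t_d
        for c in rng.permutation(len(leaders)):  # randomized cluster order
            d = np.sqrt(np.sum((x - leaders[c]) ** 2 / variances))
            if d < d_min:
                best, d_min = c, d
        if best >= 0:
            clusters[best].append(i)
        else:  # no leader within t_d: start a new cluster, x is its leader
            leaders.append(x)
            clusters.append([i])
    return clusters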
The only clustering algorithm that is invariant under permutations of the sample vector is the joining algorithm. This algorithm initially generates a cluster for each sample and then iteratively joins clusters until no two clusters are close enough to each other anymore. The algorithm is however not really suitable for large sample vectors, since |S|² distances must be computed and examined.

A faster algorithm that is less subject to the input ordering than the leader algorithm, but not independent of it as is the joining algorithm, is the k-means clustering algorithm. This algorithm constructs exactly k clusters. As we mostly have little a priori notion of how many clusters are appropriate to fit to the data, the adaptive k-means algorithm is usually preferred. Based on distance thresholds, the number of clusters k is then determined by the algorithm itself, just as is done by the leader algorithm. The number of clusters influences the running time of the subsequent distribution estimation, so a bound on k might be preferred. We restrict ourselves to the non-adaptive version of the clustering algorithm and elaborate on the value of k in section 3.3.2.

The k-means algorithm computes k clusters. A difference with respect to the leader algorithm is that it uses the cluster centroids. The general procedure is as follows. First, k clusters are picked at random. This can be done by partitioning the sample vector at random into k subvectors. The resulting centroids are however expected to lie close to each other. Therefore, the initial clusters are usually taken to consist of a single sample each, chosen at random from the sample vector. Subsequently, the algorithm iterates until the means no longer change to within a significance of ε. One iteration consists of assigning each point to the nearest cluster, based on the distance to the cluster centroid. If multiple clusters are equally close, one of them is picked at random. To ensure this, the clusters are scanned in a random order for each new sample point. Once all of the points have been assigned, the means of the clusters are recomputed. An advantage over the leader algorithm is that the result of the k-means algorithm is less dependent on the input ordering. A disadvantage is that the algorithm takes N_I times as long as the leader algorithm, where N_I is the number of iterations of the k-means clustering algorithm. To avoid spending large computational efforts, a maximum of T_i iterations is allowed:

EuclideanKMeansClustering(k, ε, T_i)
1  O ← new array of integer with size k
2  K ← ()
3  for i ← 0 to l − 1 do
3.1   \overline{Y}_i ← (\sum_{j=0}^{|S|-1} y^{j}_{i}) / |S|
3.2   s²_i ← (\sum_{j=0}^{|S|-1} (y^{j}_{i} - \overline{Y}_i)²) / |S|
4  for i ← 0 to k − 1 do
4.1   q ← RandomNumber(|S|)
4.2   \overline{K}^{i} ← y^{q}
4.3   K^{i} ← ()
4.4   O[i] ← i
5  first ← true
6  ready ← false
7  t ← 0
8  while ¬ready do
8.1   for i ← 0 to k − 1 do a^{i} ← \overline{K}^{i}; K^{i} ← ()
8.2   for i ← 0 to |S| − 1 do
8.2.1    q ← RandomNumber(k)
8.2.2    d_min ← d_ES(y^{i}, \overline{K}^{O[q]})
8.2.3    c_min ← O[q]
8.2.4    O[q] ↔ O[k − 1]
8.2.5    for j ← 1 to k − 1 do
8.2.5.1     q ← RandomNumber(k − j)
8.2.5.2     if d_ES(y^{i}, \overline{K}^{O[q]}) < d_min then update d_min and c_min accordingly
8.2.5.3     O[q] ↔ O[k − j − 1]
8.2.6    append y^{i} to K^{c_min}
8.3   ready ← true
8.4   for i ← 0 to k − 1 do
8.4.1    \overline{K}^{i} ← (\sum_{j=0}^{|K^{i}|-1} K^{i}_{j}) / |K^{i}|
8.4.2    if |a^{i}_{j} − \overline{K}^{i}_{j}| > ε in any dimension j then ready ← false
8.5   t ← t + 1
8.6   if t = T_i then ready ← true
9  return(K)
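For comparison with the leader sketch, this is a compact Python sketch of the Euclidean k-means loop just formalized (illustrative; it uses single random samples as initial centroids and breaks distance ties deterministically instead of by the randomized scan of the pseudo-code):

import numpy as np

def euclidean_kmeans(samples, k, eps=1e-3, t_i=10, rng=np.random.default_rng()):
    variances = samples.var(axis=0)
    centroids = samples[rng.choice(len(samples), size=k, replace=False)].copy()
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(t_i):
        # Assign every sample to the nearest centroid (scaled Euclidean).
        d = (((samples[:, None, :] - centroids[None, :, :]) ** 2)
             / variances).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = samples[labels == c]
            if len(members) > 0:
                new_centroids[c] = members.mean(axis=0)
        converged = np.all(np.abs(new_centroids - centroids) <= eps)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids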
Just as for the leader algorithm, we can alter the k-means clustering algorithm to make use of the Mahalanobis distance measure. If a cluster has grown to some minimum size, the Mahalanobis distance may be used; otherwise, the Euclidean distance is used. In subsequent generations, the Mahalanobis distance is always used, as the covariance values are then kept fixed within any single loop. Again, the maximum number of iterations that the algorithm is allowed to make is given by T_i. The resulting algorithm can be formalized as follows:

MahalanobisKMeansClustering(k, ε, T_s, T_i)
1  a ← new vector of real with size l
2  B ← new matrix of real with size l × l
3  O ← new array of integer with size k
4  K ← ()
5  for i ← 0 to l − 1 do compute \overline{Y}_i and s²_i as before
6  for i ← 0 to k − 1 do
6.1   q ← RandomNumber(|S|)
6.2   \overline{K}^{i} ← y^{q}
6.3   K^{i} ← ()
6.4   O[i] ← i
7  first ← true
8  ready ← false
9  t ← 0
10  while ¬ready do
10.1   for i ← 0 to k − 1 do remember the old centroid of K^{i} and empty K^{i}
10.2   for i ← 0 to |S| − 1 do
10.2.1    q ← RandomNumber(k)
10.2.2    if ¬first ∧ |K^{O[q]}| ≥ T_s then
10.2.2.1     d_min ← d_M(y^{i}, O[q])
10.2.3    else
10.2.3.1     d_min ← d_ES(y^{i}, \overline{K}^{O[q]})
10.2.4    c_min ← O[q]
10.2.5    O[q] ↔ O[k − 1]
10.2.6    for j ← 1 to k − 1 do
10.2.6.1     q ← RandomNumber(k − j)
10.2.6.2     d ← d_M(y^{i}, O[q]) if ¬first ∧ |K^{O[q]}| ≥ T_s, and d ← d_ES(y^{i}, \overline{K}^{O[q]}) otherwise
10.2.6.3     if d < d_min then d_min ← d; c_min ← O[q]
10.2.6.4     O[q] ↔ O[k − j − 1]
10.2.7    append y^{i} to K^{c_min}
10.2.8    if first ∧ |K^{c_min}| = T_s then initialize the sample mean \overline{K}^{c_min} and the symmetric sample covariance matrix S(K^{c_min}, L) from the members of K^{c_min}
10.3   ready ← true
10.4   for i ← 0 to k − 1 do recompute \overline{K}^{i} and, once |K^{i}| ≥ T_s, the covariance matrix S(K^{i}, L); if any centroid changed by more than ε, then ready ← false
10.5   if first then
10.5.1    first ← false
10.5.2    ready ← false
10.6   t ← t + 1
10.7   if t = T_i then ready ← true
11  return(K)

3.3.2 Setting parameters

On the one hand, we now have clustering algorithms on the basis of which we can define a mixture of factorizations. On the other hand however, we have introduced new parameters that determine the number of mixture components. For the k-means clustering algorithms, this number equals exactly k. For the leader algorithms, it depends on the value of the distance threshold T_d. In order to perform factorization mixture selection, we clearly have to determine what values to use for these parameters. Setting the parameters requires some experience. To automate this, we can apply the general approaches for the detection of dependence between variables from section 3.2. If we for instance take the k parameter of the k-means clustering algorithms, we can increment the value of k and compute the negative log-likelihood value. If, based on this value, the probability distribution is significantly better than for a smaller value of k, the larger value is accepted and the search continues. Alternatively, a scoring metric can be increasingly penalized as the number of clusters increases. A similar scheme can be used for the distance threshold T_d. It should be noted that using such methods requires a vast amount of computation time. Factorization selection for a single cluster can already take up a significant amount of computation time; over k clusters, we thus require k times as much. It therefore seems by far the most efficient choice to fix k or T_d beforehand.

3.3.3 Examples

In this section, we present some examples of the partitioning algorithms described in the previous sections. We apply each of the clustering algorithms from section 3.3 to a set of sample vectors and observe the results. We use sample vectors that can be displayed graphically, to gain better insight into the workings of the algorithms. Sample vectors of a dimensionality higher than 3 cannot be visualized in an intuitive manner. Such sample vectors will appear when we apply iterated density estimation to continuous optimization in section 6. We go over the sample vectors and apply the four clustering algorithms that we discussed in section 3.3. We use 8 sample vectors. The first 7 of these are two dimensional and consist of various structures. The first sample vector is sampled from the uniform distribution.
The second sample vector is sampled from a multiple of local uniform distributions, creating clusters. The third sample vector is a filled circle with a higher concentration of samples in the center; it can be seen as having been sampled from a cone. The fourth sample vector is an outlined circle, which is difficult for most approaches. Even though the space can be described by a single parameter, namely the angle around the center, this structure is hardly ever detected by any probability density estimation procedure. The fifth sample vector is a line that exemplifies full correlation between the two variables. The sixth and seventh sample vectors express dependency, but in such a way that it usually cannot be accounted for by a single factorization. The last sample vector shows different behavior in different regions of the plane. In one region, the density is uniform, whereas in the other it is fully correlated. For any probabilistic model to efficiently describe this sample vector, it should be clustered into at least two sample vectors that are processed individually.

We have applied the four clustering algorithms from section 3.3 to each sample vector. The results are shown in figures 6 through 13. Each figure shows, from left to right, the result of Euclidean leader clustering, Mahalanobis leader clustering, Euclidean k-means clustering and Mahalanobis k-means clustering. The distance threshold for the Euclidean leader clustering algorithm was T_d = 1.0. For the Mahalanobis leader clustering algorithm, the distance threshold was T_d = 2.5 and the minimum cluster size threshold was T_s = 10. For both k-means clustering algorithms, we used k = 10, ε = 0.001 and T_i = 10. All of the settings were chosen so that a visually intuitive number of clusters was obtained.

[Figure 6: Cluster results for sample vector 0.]
[Figure 7: Cluster results for sample vector 1.]
[Figure 8: Cluster results for sample vector 2.]
[Figure 9: Cluster results for sample vector 3.]
[Figure 10: Cluster results for sample vector 4.]
[Figure 11: Cluster results for sample vector 5.]
[Figure 12: Cluster results for sample vector 6.]
[Figure 13: Cluster results for sample vector 7.]

If we compare the leader results to the k-means results, it is clear that the cluster boundaries of the k-means results are strict, in the sense that there is no overlap between the different clusters. Our randomized approach to the leader algorithm proves to be effective in the sense that there are no extremely large clusters in any of its results and the boundaries do not overlap disturbingly. This means that the densities estimated on either of the two clustering results will not differ much. Especially for our application, in which a correct density estimation is not as important as a good approximation, the results of the leader algorithm seem at least as promising as those of the k-means clustering algorithm. Since the running time of the leader algorithm is smaller than that of the k-means algorithm, the leader algorithm seems preferable.

If we compare the results between the use of the Euclidean distance measure and the Mahalanobis distance measure, it is not directly clear which of these approaches leads to more desirable results. One of the main drawbacks of using the Mahalanobis distance measure is that small clusters are more likely to appear. Such clusters can for instance lie completely inside another cluster, since the normal pdf that is based upon such a small cluster is sharply peaked. It seems that the k-means algorithm in this case leads to somewhat better formed clusters with less overlap. The true advantage of the Mahalanobis distance, namely that it can better describe linear relations, is unfortunately not exploited in a useful way by the described clustering algorithms, as becomes clear from figure 12 for instance. In this figure, the horizontal bars of sample points are not separated by the clusters, even though this is possible when using the Mahalanobis distance. This means that the resulting estimated density will not be effective. As the Mahalanobis distance measure takes more computational effort, the Euclidean distance measure seems preferable. Given the computational speed of the leader algorithm and the acceptable two dimensional results we have seen so far, this algorithm seems the most promising to use for factorization mixture selection based on clustering. The experiments in section 6 will bring the use of clustering within the IDEA framework into practice, which will indicate the actual computational usefulness of the proposed clustering algorithms.

4 Model fitting and sampling

Once a factorization has been selected, a pdf has to be fit for each element in the factorization. If we have used statistical hypothesis tests directly, through for instance the correlation coefficient, we have yet to find such a fit. Then again, in the case of the Kullback-Leibler divergence, the required parameters have been observed to have been computed already [8]. In general, we assume that our model structure has been selected, but that the pdf information is not yet available. In this section, we concentrate on computing the pdf information. In general, whereas finding a model structure is called model selection, estimating a pdf for each element in the structure, as we do in this section, is called model fitting.
4.1 Observing pdf characteristics

In this article, we have described algorithms that use conditional densities and unconditional densities. We have also shown that conditional densities are more general than unconditional densities, as the latter can be written as a special case of the former. From equation 3 we have that we only require unconditional densities. However, simplifying the actual conditional density can lead to expressions that can be computed more efficiently. We shall therefore give the conditional density for each pdf that we discuss. First however, we discuss some pdfs in general. From these, we select the pdfs to actually work with. In the subsequent sections, we go into more detail on the selected pdfs.

We restrict ourselves to the use of the normal pdf and variants thereof. One of the main reasons for this is that the normal pdf has relatively simple analytical properties. Furthermore, the central limit theorem tells us that, given enough samples, the approximation by the normal pdf is often quite good for several components that represent some source. Based on the normal pdf, we distinguish three different pdfs: the single multivariate normal pdf, the normal kernels pdf and the normal mixture pdf. Both the multivariate normal pdf and the normal kernels pdf have been proposed previously within the IDEA framework [8, 10, 11, 12]. The single multivariate normal pdf is defined as follows:

f_N(y\langle a\rangle; \mu\langle a\rangle, \Sigma(a)) = (2\pi)^{-\frac{|a|}{2}}(\det\Sigma(a))^{-\frac{1}{2}}\, e^{-\frac{1}{2}(y\langle a\rangle-\mu\langle a\rangle)^{T}\Sigma(a)^{-1}(y\langle a\rangle-\mu\langle a\rangle)}    (38)

[Figure 14: Fitting sample vectors 0 (left) and 1 (right) with a joint normal pdf.]

An example of fitting two sample vectors with the normal pdf is shown in figure 14. The normal kernels pdf is obtained by placing a multivariate normal pdf over each sample point. The covariance matrices of these normal kernels are fixed to have non-zero entries on the diagonal and zero entries off the diagonal. The entries on the diagonal are the individual variances of the normal pdf in each dimension. We denote such a variance in dimension j by s²_j. The so constructed matrix implies that the normal kernels can be scaled in each dimension separately, but that each multivariate normal pdf cannot be aligned along any other axes than the standard ones. We denote the resulting matrix over variables Y⟨a⟩ by S(a):

S(a) = \mathrm{diag}(s^{2}_{a_0}, s^{2}_{a_1}, \ldots, s^{2}_{a_{|a|-1}})

f_{NK}(y\langle a\rangle; S, S(a)) = \frac{1}{|S|}\sum_{i=0}^{|S|-1} f_N(y\langle a\rangle; y^{i}\langle a\rangle, S(a))

In earlier work, setting the variances with respect to the range of the samples in each dimension was proposed. One interesting property of the normal kernels pdf is that by increasing the standard deviations s_i, we allow for less detail in the final density estimation. Furthermore, an advantage over the use of the normal pdf is that we have a better way of modeling clusters. A resulting fit over the two sample vectors can be seen in figure 15. If we take M normal pdfs instead of |S| and fit them to the sample data as well as possible, while allowing for full covariance matrices, we have a normal mixture pdf. Note that we are fitting such a mixture over a subset of the variables, instead of over all variables as we did in section 3.3.
It may be clear that fitting such a normal mixture pdf over a subset of the available variables can be done in the same way as described earlier in section 3.3, namely through the use of the EM algorithm or the adaptive EM algorithm. An advantage of the normal mixture pdf over the normal pdf and the normal kernels pdf is that it is neither as global and cluster insensitive as the normal pdf, nor as over-sensitive to clusters as the normal kernels pdf. Therefore, the normal mixture pdf can be seen as a trade-off.

[Figure 15: Fitting sample vectors 0 (left) and 1 (right) with a joint normal kernels pdf with s_0 = s_1 = 1.0.]

The normal mixture pdf requires a vector of mean vectors as well as a vector of covariance matrices, since we have multiple multivariate normal pdfs. We assume to have w such normal components in the mixture and let W = (0, 1, …, w − 1). The vector of mean vectors is μ⟨a⟩⟨W⟩ and the vector of covariance matrices is Σ(a)⟨W⟩. We define μ^i⟨a⟩ to be the mean vector of the i-th component, μ^i⟨a⟩ = (μ⟨a⟩⟨W⟩)_i, and Σ^i(a) to be the covariance matrix of the i-th component, Σ^i(a) = (Σ(a)⟨W⟩)_i. The normal mixture pdf can be formalized as in equation 41. A resulting fit over the two sample vectors can be seen in figure 16.

f_{NM}(y\langle a\rangle; \beta\langle W\rangle, \mu\langle a\rangle\langle W\rangle, \Sigma(a)\langle W\rangle) = \sum_{i=0}^{w-1}\beta_i\, f_N(y\langle a\rangle; \mu^{i}\langle a\rangle, \Sigma^{i}(a))    (41)

In equation 41, β⟨W⟩ is a vector of mixing coefficients. The same condition holds for these mixing coefficients as for the mixing coefficients of the factorization mixture discussed in section 3.3, namely that they are non-negative and that \sum_{i=0}^{w-1}\beta_i = 1.

The problem with the normal pdf is its strong generalization of the sample vector, which becomes clear from figure 14, where the clustered set is clearly underfit. The normal kernels pdf overcomes this, but at quite a high cost. Firstly, evaluating the function takes |S| times more time than the single normal pdf; this scales with the number of available samples. However, getting more samples should imply that we get a better description of the source from which the samples were drawn, not that we necessarily get a much more costly pdf. Sampling from the distribution takes much more time as well. In order to solve higher dimensional problems, we require increasingly more samples. As the complexity of the normal kernels pdf increases with this number, and since the pdf is used very often throughout the optimization algorithm, the normal kernels pdf is not a very efficient way of reducing the strong generalization of the normal pdf. Secondly, it is not straightforward what values to use for the variances of each kernel. It has been observed [10] that the performance of the resulting algorithm strongly depends on these variances. This renders the normal kernels pdf hard to handle, even though it has been observed [10] that better results can be obtained with it on epistatic problems than with the normal pdf.

[Figure 16: Fitting sample vectors 0 (left) and 1 (right) with a joint normal mixture pdf using the EM algorithm with 10 components and ε = 10^{−5}.]

We point out that in terms of modeling, the normal mixture pdf generalizes both the normal pdf and the normal kernels pdf. If we have w = 1, we have a single normal pdf. On the other hand, if we have w = |S| and Σ^i(a) = S(a), we get the normal kernels pdf.
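A small numpy sketch of equation 41 (illustrative; betas plays the role of the mixing coefficients) makes these two limiting cases explicit: with one component it reduces to the single normal pdf, and with one component per sample and the fixed diagonal matrix S(a) it reduces to the normal kernels pdf.

import numpy as np

def normal_pdf(y, mean, cov):
    q = len(y)
    d = y - mean
    norm = (2 * np.pi) ** (-q / 2) * np.linalg.det(cov) ** (-0.5)
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def normal_mixture_pdf(y, betas, means, covs):
    # Equation 41: a convex combination of w multivariate normal pdfs.
    return sum(b * normal_pdf(y, m, c) for b, m, c in zip(betas, means, covs))

def normal_kernels_pdf(y, samples, variances):
    # Special case w = |S| with one kernel per sample and covariance S(a).
    n = len(samples)
    cov = np.diag(variances)
    return normal_mixture_pdf(y, [1.0 / n] * n, samples, [cov] * n)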
We conclude that for simplicity and efficiency, it is beneficial to continue to use the single normal pdf. We refrain from using the normal kernels pdf any further, since it is very sensitive to parameter settings and a good fit is hard to find. For the normal mixture pdf on the other hand, the EM algorithm gives us an automated procedure to obtain a fit. Therefore, we continue to regard only the normal pdf and the normal mixture pdf. In the next subsections, we present some important derivations for these pdfs in order for them to be used in the IDEA framework. For each pdf, we first present its multivariate conditional version, in which a single variable is conditioned on a multiple of others, and show how we can sample from it, as well as from the multivariate joint pdf. Subsequently, we show how the parameters θ of the pdf can be estimated.

4.2 The normal pdf

The definition of the multivariate normal joint pdf, given a vector of variables y⟨a⟩, a mean vector μ⟨a⟩ and a covariance matrix Σ(a), is given in equation 38. Using this definition, we can directly account for unconditional dependencies, as these result in multivariate joint pdfs. Therefore, we can evaluate a sample in a marginal product model that is fit with normal pdfs. However, in the case of conditional dependencies, we must be able to evaluate the multivariate normal conditional pdf. Furthermore, to complete the resulting algorithms, we need to be able to sample from the multivariate normal joint pdf as well as from the multivariate normal conditional pdf. In this section, we first show how to perform the actual sampling. Once this is known, we know how to sample from both the special case multivariate conditional normal pdf and the multivariate joint normal pdf. However, given a model structure, we still have to estimate the parameters of the normal pdfs, which is the actual model fitting. How to best estimate the parameters is what we show second in this section.

4.2.1 Sampling

We first want to know how to sample from the special case multivariate conditional normal pdf. Usually, only sampling from a one dimensional normal pdf is available. However, it was pointed out in earlier work [8] that the multivariate normal conditional pdf, in which a single variable y_{a_0} is conditioned on the |a| − 1 other variables, is again a normal pdf:

f_N(y_{a_0}\,|\,y\langle a\rangle - (y_{a_0}); \mu\langle a\rangle, \Sigma(a)) = f_N(y_{a_0}; \tilde{\mu}, \tilde{\sigma}^{2})    (42)

where \tilde{\sigma}^{2} = \frac{1}{(\Sigma(a)^{-1})_{00}} and \tilde{\mu} = \mu_{a_0} - \tilde{\sigma}^{2}\sum_{i=1}^{|a|-1}(\Sigma(a)^{-1})_{0i}(y_{a_i} - \mu_{a_i})

In the case of unconditional dependencies however, we have to draw samples from a product of multivariate joint normal pdfs. We have assumed that only sampling from a one dimensional normal pdf is available. Therefore, if we have to sample from a multivariate normal pdf, we use equation 3 to write the multivariate pdf as a product of multivariate conditional pdfs, for which we know that they can be written as one dimensional normal pdfs using equation 42. This procedure is known as sampling by simulation.

The multivariate differential entropy of the multivariate joint normal pdf can be derived analytically. It can be shown (see for instance [15]) that it equals:

h(f_N(y\langle a\rangle; \mu\langle a\rangle, \Sigma(a))) = \frac{1}{2}\big(|a| + \ln((2\pi)^{|a|}\det\Sigma(a))\big)    (43)

4.2.2 Parameter estimation

From equations 38 and 42 it follows that we have to estimate the values of μ⟨a⟩ and Σ(a). It can be shown that a maximum likelihood fit of the normal pdf is obtained by using the sample average \overline{Y}\langle a\rangle and the sample covariance matrix S(a) for them, respectively:

\overline{Y}\langle a\rangle = \frac{1}{|S|}\sum_{i=0}^{|S|-1} y^{i}\langle a\rangle,\qquad S(a) = \frac{1}{|S|}\sum_{i=0}^{|S|-1}(y^{i}\langle a\rangle - \overline{Y}\langle a\rangle)(y^{i}\langle a\rangle - \overline{Y}\langle a\rangle)^{T}

A compact sketch of sampling by simulation with these conditional normal pdfs is given below.
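This is a minimal numpy sketch of sampling by simulation (illustrative): the joint normal is written as a product of one dimensional conditional normal pdfs, and each variable is sampled given the ones already drawn, using the precision matrix of the marginal over the remaining variables, as in equation 42.

import numpy as np

def sample_joint_normal_by_simulation(mu, cov, rng=np.random.default_rng()):
    # mu: mean vector of length q; cov: q x q covariance matrix.
    q = len(mu)
    y = np.empty(q)
    for i in range(q - 1, -1, -1):
        # Precision matrix of the marginal normal over y_i .. y_{q-1}.
        prec = np.linalg.inv(cov[i:, i:])
        var = 1.0 / prec[0, 0]
        mean = mu[i] - var * prec[0, 1:] @ (y[i + 1:] - mu[i + 1:])
        y[i] = rng.normal(mean, np.sqrt(var))
    return y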
4.3 The normal mixture pdf

The definition of the multivariate normal mixture pdf is given in equation 41. Just as in the case of a single normal pdf, we can thereby directly account for unconditional dependencies. In this section, we first derive an expression for the multivariate conditional normal mixture pdf and subsequently note how to sample from it. We also show how to sample from the multivariate joint normal mixture pdf. Secondly, we show how the EM algorithm can be used to estimate the parameters of a multivariate normal mixture pdf.

4.3.1 Sampling

To sample from a model that incorporates conditional dependencies, we must first derive the multivariate conditional normal mixture pdf, in which a single variable y_{a_0} is conditioned on the multiple of others y⟨a − (a_0)⟩:

f_{NM}(y_{a_0}\,|\,y\langle a\rangle - (y_{a_0}); \beta\langle W\rangle, \mu\langle a\rangle\langle W\rangle, \Sigma(a)\langle W\rangle) = \frac{f_{NM}(y\langle a\rangle; \beta\langle W\rangle, \mu\langle a\rangle\langle W\rangle, \Sigma(a)\langle W\rangle)}{f_{NM}(y\langle a-(a_0)\rangle; \beta\langle W\rangle, \mu\langle a-(a_0)\rangle\langle W\rangle, \Sigma(a-(a_0))\langle W\rangle)}    (46)

The expression in the denominator of equation 46 depends only on the variables y⟨a − (a_0)⟩, which are given. We therefore define a constant:

c_0 = f_{NM}(y\langle a-(a_0)\rangle; \ldots) = \sum_{i=0}^{w-1}\beta_i\, f_N(y\langle a-(a_0)\rangle; \mu^{i}\langle a-(a_0)\rangle, \Sigma^{i}(a-(a_0)))    (47)

As in the derivation for the single normal pdf, we would like to move constant factors outside of the summation. This time however, the corresponding factor is additionally dependent on i. Therefore, we cannot move it outside the summation. The same is the case for the β_i factors. We introduce a "constant" that is a function of i:

c_1(i) = \beta_i\, f_N(y\langle a-(a_0)\rangle; \mu^{i}\langle a-(a_0)\rangle, \Sigma^{i}(a-(a_0)))

Writing out the numerator of equation 46, the conditional pdf becomes a mixture of one dimensional conditional normal pdfs:

f_{NM}(y_{a_0}\,|\,y\langle a\rangle - (y_{a_0}); \ldots) = \sum_{i=0}^{w-1}\frac{c_1(i)}{c_0}\, f_N(y_{a_0}; \tilde{\mu}^{i}, (\tilde{\sigma}^{i})^{2})

where (\tilde{\sigma}^{i})^{2} = \frac{1}{(\Sigma^{i}(a)^{-1})_{00}} and \tilde{\mu}^{i} = \mu^{i}_{a_0} - (\tilde{\sigma}^{i})^{2}\sum_{j=1}^{|a|-1}(\Sigma^{i}(a)^{-1})_{0j}(y_{a_j} - \mu^{i}_{a_j})

To sample from the multivariate conditional normal mixture pdf, we first have to compute the weights c_1(i)/c_0. We then select the i-th normal pdf with probability c_1(i)/c_0. Since the complete expression is a pdf and each element in the sum is a weight times a normal pdf, the weights have to sum to 1. As the numerator of each weight is a component of the mixture in the denominator, we have c_0 = \sum_{i=0}^{w-1} c_1(i), so this is indeed the case.

Next to being able to sample from the required multivariate conditional normal mixture pdf, we also have to be able to sample from the multivariate joint normal mixture pdf. This is quite simple compared to the conditional case. The joint pdf in equation 41 is a sum of weighted multivariate joint normal pdfs. Given the constraints that the β_i have to sum to 1 and may not be negative, we can select the i-th multivariate normal pdf to sample from with probability β_i. How to sample from a single multivariate normal pdf has already been discussed in section 4.2.

4.3.2 Parameter estimation

Estimating the parameters of each multivariate joint normal pdf, given w, is not straightforward. To this end, the EM algorithm has been proposed [16]. This algorithm does not guarantee a maximum likelihood fit, but it is at least a general approach to finding an approximation. Even though we get an approximation by using the EM algorithm, the choice of w still has to be made. This is a choice similar to selecting the number of clusters in factorization mixture selection. An adaptive version of the EM algorithm can be constructed, so as to relieve the user from having to specify the number of normal pdfs to use in a fit over the data. Instead, the user must provide a threshold value, in the same fashion as for the leader algorithms. Just as we did in the case of the k-means clustering algorithms however, we restrict ourselves to the non-adaptive algorithm.

The EM algorithm is an iterative parameter estimation procedure. Given a certain instance of the parameters, termed the old parameters, a single iteration consists of finding new estimates for these parameters. This recomputation of the parameters is repeated until no changes occur anymore within a certain precision, or until a maximum number of iterations has been reached.
In appendix C, the update equations for a single iteration of the EM algorithm for a multivariate normal mixture model are derived, given a vector of indices a. The resulting equations concern the mixing coefficients β_j, the mean vectors μ^j⟨a⟩ and the covariance matrices Σ^j(a) for each component j ∈ W of the mixture:

\beta_j^{new} = \frac{1}{|S|}\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}    (49)

\{\mu^{j}\langle a\rangle\}^{new} = \frac{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}\, y^{i}\langle a\rangle}{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}}    (50)

\{\Sigma^{j}(a)\}^{new} = \frac{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}\,(y^{i}\langle a\rangle - \{\mu^{j}\langle a\rangle\}^{new})(y^{i}\langle a\rangle - \{\mu^{j}\langle a\rangle\}^{new})^{T}}{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}}    (51)

Even though the EM algorithm is derived so as to obtain a maximum likelihood estimate, the resulting estimate may not be one of maximum likelihood. The minimization problem contains many local minima and the algorithm may get stuck in one of them. Since the EM algorithm is essentially a gradient algorithm, this can easily happen if the problem has many local optima. To tackle this issue, an evolutionary algorithm could for instance be applied. However, this yields a chicken and egg problem, since within the evolutionary algorithms that learn structure, tools such as the EM algorithm are used again.

We close this section by presenting pseudo-code for the EM algorithm for a normal mixture with full covariance matrices. The algorithm is based upon the update equations given above, as derived in appendix C. The algorithm is defined with respect to an indices vector a. These indices define the variables for which a multivariate joint normal mixture pdf must be estimated. Just as was the case for the k-means algorithm, we require an a priori specified number of components k. We also use a convergence significance level ε, as well as a maximum number of iterations T_i:

NormalMixtureEMAlgorithm(k, ε, T_i, a)
1  a_o, a_n ← 2 new arrays of real with size k
2  m_o, m_n ← 2 new arrays of (vector of real with size |a|) with size k
3  S_o, S_n ← 2 new arrays of (matrix of real with size |a| × |a|) with size k
4  p ← new array of real in 2 dimensions with size k × |S|
5  p_m ← new array of real with size |S|
6  d ← new array of real with size k
7  r, min ← 2 new vectors of real with size |a|
8  for i ← 0 to |a| − 1 do compute the minimum min_i and the range r_i of the samples in dimension a_i
9  for j ← 0 to k − 1 do
9.1   a_o[j] ← 1/k
9.2   initialize m_o[j] at random within the sample range and S_o[j] with the sample variances on the diagonal
10  ready ← false
11  t ← 0
12  while ¬ready do
12.1   for i ← 0 to |S| − 1 do
12.1.1    p_m[i] ← 0
12.1.2    for j ← 0 to k − 1 do
12.1.2.1     p[j, i] ← f_N(y^{i}⟨a⟩; m_o[j], S_o[j])
12.1.2.2     p_m[i] ← p_m[i] + a_o[j]p[j, i]
12.2   for j ← 0 to k − 1 do
12.2.1    d[j] ← \sum_{i=0}^{|S|-1} a_o[j]p[j, i]/p_m[i]
12.2.2    a_n[j] ← d[j]/|S|
12.2.3    m_n[j] ← (\sum_{i=0}^{|S|-1}(a_o[j]p[j, i]/p_m[i])\, y^{i}⟨a⟩)/d[j]
12.2.4    S_n[j] ← (\sum_{i=0}^{|S|-1}(a_o[j]p[j, i]/p_m[i])(y^{i}⟨a⟩ − m_n[j])(y^{i}⟨a⟩ − m_n[j])^{T})/d[j]
12.3   ready ← true
12.4   for j ← 0 to k − 1 do
12.4.1    if any entry of a_n[j], m_n[j] or S_n[j] differs by more than ε from its old value then ready ← false
12.4.2    a_o[j] ← a_n[j]; m_o[j] ← m_n[j]; S_o[j] ← S_n[j]
12.5   t ← t + 1
12.6   if t = T_i then ready ← true
13  for j ← 0 to k − 1 do
13.1   β_j ← a_n[j]
13.2   μ^{j}⟨a⟩ ← m_n[j]
13.3   Σ^{j}(a) ← S_n[j]
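The following compact numpy sketch implements the same EM updates (illustrative; it initializes the means from random samples and omits the regularization and other practical safeguards a robust implementation would need):

import numpy as np

def normal_pdf(x, mean, cov):
    # Row-wise multivariate normal density of x (shape (n, q)).
    q = x.shape[1]
    d = x - mean
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    expo = -0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
    return np.exp(expo - 0.5 * (q * np.log(2 * np.pi) + logdet))

def em_normal_mixture(x, k, eps=1e-5, t_i=100, rng=np.random.default_rng()):
    n, q = x.shape
    beta = np.full(k, 1.0 / k)
    means = x[rng.choice(n, size=k, replace=False)].copy()
    covs = np.array([np.cov(x, rowvar=False)] * k)
    for _ in range(t_i):
        # E step: responsibilities r[i, j] of component j for sample i.
        p = np.column_stack([beta[j] * normal_pdf(x, means[j], covs[j])
                             for j in range(k)])
        r = p / p.sum(axis=1, keepdims=True)
        # M step: equations 49-51.
        d = r.sum(axis=0)
        new_beta = d / n
        new_means = (r.T @ x) / d[:, None]
        new_covs = np.empty_like(covs)
        for j in range(k):
            c = x - new_means[j]
            new_covs[j] = (r[:, j, None] * c).T @ c / d[j]
        moved = np.abs(new_means - means).max()
        beta, means, covs = new_beta, new_means, new_covs
        if moved <= eps:
            break
    return beta, means, covs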
5 IDEAs

Our aim is to apply the use of probabilistic models to continuous optimization. So far, we have formalized useful techniques and algorithms to find such models. However, we have not yet elaborated on how these probabilistic models can be used. In this section, we present the IDEA framework for Iterated Density Estimation Evolutionary Algorithms. In this framework, probabilistic models are built and used in optimization with an evolutionary algorithm. In section 5.1, we first go over the framework itself. Second, in section 5.2, we elaborate on the algorithmic instances of the framework that were used in earlier work. Third and finally, we summarize what techniques we apply for the first time to get new algorithmic instances of the IDEA framework.

5.1 The IDEA framework

We write the cost function of our continuous optimization problem as C(y⟨L⟩) and, without loss of generality, we assume that we want to minimize C(y⟨L⟩). For every problem variable y_i, we introduce a continuous random variable Y_i and get Y = Y⟨L⟩. Without any prior information on C(y⟨L⟩), we might as well assume a uniform distribution over Y. Therefore, we generate an initial (population) vector of n samples at random. Now we let P^θ(Y) be a probability distribution that is uniform over all vectors y⟨L⟩ with C(y⟨L⟩) ≤ θ. Sampling from P^θ(Y) gives more samples that evaluate to a value below θ. Moreover, if we know θ* = min_{y⟨L⟩}{C(y⟨L⟩)}, a single sample gives an optimal solution. To use this in an iterated algorithm, we select ⌊τn⌋ samples in each iteration t and let θ_t be the cost of the worst selected sample. We then estimate the distribution of the selected samples and thereby find \hat{P}^{θ_t}_ς(Y) as an approximation to the true distribution P^{θ_t}(Y). New samples can then be drawn from \hat{P}^{θ_t}_ς(Y) and be used to replace some of the current samples. This rationale has led to the definition of the IDEA framework [8]. The largest difference between other similar approaches and the IDEA is that the IDEA has mostly been used to focus on continuous optimization problems [8, 10, 11, 12, 13]. The definition of the IDEA framework is the following:

IDEA(n, τ, m, sel(), rep(), ter(), sea(), est(), sam())
1  Initialize an empty vector of samples: P ← ()
2  Add and evaluate n random samples: for i ← 0 to n − 1 do add a random sample y^{i}⟨L⟩ to P and compute c[y^{i}⟨L⟩] ← C(y^{i}⟨L⟩)
3  Initialize the iteration counter: t ← 0
4  Iterate until termination: while ¬ter() do
4.1   Select ⌊τn⌋ samples: (y^{0}⟨L⟩, y^{1}⟨L⟩, …, y^{⌊τn⌋−1}⟨L⟩) ← sel(), such that c[y^{i}⟨L⟩] ≤ c[y^{k}⟨L⟩] for every selected sample i and unselected sample k
4.2   Search for a structure: ς ← sea()
4.3   Estimate the parameters: θ_fit ← est()
4.4   Create m new samples with sam() and replace part of P with them using rep()
4.5   Evaluate the new samples in P: for each unevaluated P_i do c[P_i] ← C(P_i)
4.6   Update the iteration counter: t ← t + 1
5  Denote the required number of iterations by t_end

In the IDEA framework, N* = (0, 1, …, ⌊τn⌋ − 1) is the vector of indices of the selected samples, and the sampling procedure sam() draws a single sample using the estimated pdfs. The evolutionary algorithm characteristic of the IDEA lies in the fact that a population of individuals is used, from which individuals are selected to generate new offspring with. Using these offspring along with the parent individuals and the current population, a new population is constructed. If we set m to n − ⌊τn⌋, let sel() select the ⌊τn⌋ best samples and let rep() replace the other m samples with the newly drawn ones, we have that θ_{t+1} ≤ θ_t, so that the search proceeds through a monotonically decreasing series θ_0 ≥ θ_1 ≥ … ≥ θ_{t_end}. We call an IDEA with m, sel() and rep() so chosen, a monotonic IDEA. A minimal sketch of this main loop is given below.
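This is an illustrative Python sketch of a monotonic IDEA main loop; as a stand-in for the factorization search, fitting and sampling procedures described above, the model here is simply a single maximum-likelihood joint normal pdf:

import numpy as np

def monotonic_idea(cost, l, n=1000, tau=0.3, iterations=100,
                   lo=-5.0, hi=5.0, rng=np.random.default_rng()):
    pop = rng.uniform(lo, hi, size=(n, l))
    costs = np.apply_along_axis(cost, 1, pop)
    n_sel = int(tau * n)
    for _ in range(iterations):
        order = np.argsort(costs)              # best samples first
        pop, costs = pop[order], costs[order]
        selected = pop[:n_sel]
        # Model selection and fitting, here one joint normal pdf.
        mean = selected.mean(axis=0)
        cov = np.cov(selected, rowvar=False)
        # Replace the worst n - n_sel samples by draws from the model.
        new = rng.multivariate_normal(mean, cov, size=n - n_sel)
        pop[n_sel:] = new
        costs[n_sel:] = np.apply_along_axis(cost, 1, new)
    return pop[np.argmin(costs)]

# Example: minimize the sphere function in 5 dimensions.
best = monotonic_idea(lambda y: float(np.sum(y ** 2)), l=5)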
5.2 Previous IDEAs

All of the prior algorithms that use continuous probabilistic models have used a single factorization as the probabilistic model structure ς. The first few approaches all used the univariate factorization [18, 30, 31]. Simultaneously, the IDEA framework was introduced [8], alongside the work of Larrañaga, Etxeberria, Lozano and Peña [22]. Both of these works propose to use the Kullback-Leibler divergence between the estimated distribution and the full joint estimated distribution, which leads to minimizing the entropy over all possible factorizations. This approach is inspired by the MIMIC algorithm [7], in which only a certain constrained class of factorizations is allowed. In addition, in the work by Larrañaga et al., edge exclusion tests and a continuous version of the Bayesian Dirichlet metric for Gaussian networks (BGe) are used to learn a factorization that is not constrained. On the other hand, in the work by Bosman and Thierens [8], a general factorization is allowed in which each variable may interact with at most κ other variables. The work by Bosman and Thierens has been tested and has proven to be an effective approach [10, 11]. Apart from the recent work by Larrañaga et al. and by Bosman and Thierens, there have been no attempts to use higher order probabilistic model structures for continuous optimization.

Most of the approaches so far have solely used the normal pdf as the building block of the probability distribution over Y. The only exception to this has been the work by Gallagher, Frean and Downs [18], in which a normal mixture model is used. However, the model structure there is fixed to the univariate factorization. Bosman and Thierens have described the use of the normal kernels pdf [12]. This approach has however proven to be hard to handle [10]. On the other hand, using the normal kernels pdf has given results on certain highly epistatic problems that could not be obtained using only a single normal pdf. Therefore, a normal mixture pdf was suggested for future research. Finally, Bosman and Thierens also investigated the use of a histogram distribution [8]. Using this pdf however does not scale up [10], and it has therefore been abandoned.

5.3 New IDEAs

The novelty of the algorithms that we propose in this paper lies in the complexity of the probabilistic models. First of all, we use, for the first time, a mixture of factorizations instead of a single factorization, by finding a factorization for each of the clusters that we construct first. By effectively removing non-linear interactions between the problem variables in this way, techniques previously shown to be effective regarding the normal pdf and the learning of factorizations can be used to obtain effective probability distribution estimations. The learning of factorizations is different as well. We no longer need additional constraints on the factorization, as we use the metrics to penalize more complex models. Therefore, we no longer have a variety of factorization search algorithms with different constraints. Finally, we have shown for the first time how the normal mixture pdf can be used given any factorization, and thus in an IDEA instance in which any factorization is allowed. As the normal mixture pdf can account for additional non-linearities that may still remain within the clusters, the new algorithms are more flexible and effectively use more complex probabilistic models.

The techniques required to obtain the new optimization algorithms have been described in the previous sections of this paper. Using those techniques, we are now ready to apply our new IDEAs to continuous optimization problems, which is done in section 6.

5.4 Methodological comparison with Evolution Strategies

Approaches such as the IDEA, which build and use probabilistic models in evolutionary optimization, were first proposed as an improvement over GAs [4, 7]. The IDEA framework itself has been used to focus on continuous models for problems with real valued variables. This has resulted in a new line of continuous evolutionary algorithms. However, for continuous optimization, evolutionary algorithms have been around almost as long as the original simple binary GA [2]. Currently, we can identify two variants of this continuous evolutionary algorithm, namely Evolutionary Programming (EP) and Evolution Strategies (ES). Generally, we can state that these algorithms also make use of normal pdfs.
One might therefore say that there already exist algorithms that operate in a similar fashion to the IDEA. For this reason, it is important to question the relevance of a new approach such as the IDEA instances presented so far. In this section, we describe how ES differs from IDEA and point out that there are indeed fundamental differences that markedly distinguish the approaches from one another. To this end, we first give a brief algorithmic introduction to Evolution Strategies in section 5.4.1. Subsequently, we describe the actual differences with our IDEA approach in section 5.4.2.

5.4.1 Brief introduction to Evolution Strategies

In this section, we give a birdseye overview of Evolution Strategies. Our goal is not to give a detailed description of these approaches; for a detailed introduction, the interested reader is referred to the work by Bäck and Schwefel [3], for instance. At the top level, we can distinguish two ES variants based on the selection scheme that is used. These variants are commonly referred to as the (μ, λ)-ES and the (μ + λ)-ES. In both cases, μ denotes the number of parents and λ the number of offspring produced by recombination and mutation. In the case of the (μ, λ)-ES, the μ parents are selected from the λ offspring (λ ≥ μ). In the case of the (μ + λ)-ES, the μ parents are selected from both the parents of the previous generation and the λ offspring (λ ≥ 1). In ES, the parents are recombined and subsequently mutated, as is the case in simple GAs. However, the mutation operator has always been the most important one in ES. This operator samples from normal pdfs, which makes the ES appear similar to the IDEA with normal pdfs.

With each variable, a normal pdf is associated. Each normal pdf can either be required to have identical standard deviations in all dimensions, be aligned with the axes of the search space without necessarily identical standard deviations, or be allowed to be an arbitrary l-dimensional normal pdf. This comes down to a covariance matrix with, respectively, identical entries on the diagonal and zero entries off the diagonal, non-identical entries on the diagonal and zero entries off the diagonal, or an arbitrary symmetric covariance matrix. The variables that code the matrix are incorporated into the genome and are subject to mutation and recombination themselves. The mutation of the entries of the covariance matrix is performed through multiplication by an exponential of a normally distributed value. After the values in the covariance matrix have been updated, they are used to mutate the values of the problem variables. These values are updated by adding a normally distributed l-dimensional variable to the current sample.

An ES individual contains real values for the l dimensions of the problem. These values represent the mean of a normal pdf that is used to sample offspring from after recombination. The covariance matrix is also stored in the genome (in either restricted or unrestricted form). These real values are mutated according to a fixed strategy. For a covariance matrix with zero entries off the diagonal, the ES genome can be specified by (y_0, y_1, …, y_{l−1}, σ_0, σ_1, …, σ_{l−1}), and mutation proceeds as follows (the normally distributed variables are sampled anew for each dimension):

s = e^{N(0,\tau)}
\sigma'_i = \sigma_i\, e^{N(0,\tau')}\, s
y'_i = y_i + N(0, \sigma'_i)    (53)

For τ and τ', the usual rule of thumb is to set τ = 1/\sqrt{2\sqrt{n}} and τ' = 1/\sqrt{2n}.
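A small numpy sketch of this self-adaptive mutation (equation 53; illustrative, with the rule-of-thumb settings for τ and τ' computed from the dimension):

import numpy as np

def es_mutate(y, sigma, rng=np.random.default_rng()):
    # Self-adaptive ES mutation for a diagonal covariance matrix
    # (equation 53): first mutate the step sizes, then the variables.
    n = len(y)
    tau = 1.0 / np.sqrt(2.0 * np.sqrt(n))       # shared, global factor
    tau_prime = 1.0 / np.sqrt(2.0 * n)          # per-dimension factor
    s = np.exp(rng.normal(0.0, tau))
    new_sigma = sigma * np.exp(rng.normal(0.0, tau_prime, size=n)) * s
    new_y = y + rng.normal(0.0, new_sigma)
    return new_y, new_sigma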
Recombination in ES is generally done in one of two ways: discrete and intermediate crossover. With discrete crossover, the values themselves are swapped without being altered. With intermediate crossover, the values of the parents are averaged. Recombination is applied both to the problem values and to the covariance matrix values.

5.4.2 ES and IDEA: similarities and differences

One top level aspect that becomes clear is that a monotonic IDEA can be seen as a (μ + λ) algorithm. In evolutionary computation, such algorithms are said to be elitist. Since many algorithms in the field of EAs share such a selection strategy, this is not the most important thing to note about the two continuous evolutionary approaches.

In figure 17, a landscape is plotted in which we desire to locate the unique minimum. We assume that we have a set of samples available, indicated by crosses in the image. The IDEA approach is to fit a probability distribution over these samples as well as possible, to get a good representation of the samples. In figure 17 this is shown for a normal pdf and a normal mixture pdf. Once such a probability distribution has been estimated, λ offspring are generated by sampling from it. As selection favors better solutions, the probability distribution is meant to model the promising regions of the search space. Note that this is a global approach to optimization, as it attempts to globally capture the structure of the landscape and refines the model as the process is iterated.

We now turn our attention to the ES approach. In figure 18, a sketch is given of the mutation operation. If we set μ = 1, we have only a single solution to recombine and mutate. By doing so, we are modeling a normal pdf, just as is the case for the IDEA approach. However, its use is now different. If we only focus on sampling new solutions, we indeed get the same result as with the IDEA, since the λ offspring will be sampled from the normal pdf that is modeled. However, the way in which the normal pdf is found in the next generation differs. In the IDEA approach, the best ⌊τn⌋ samples are selected. Based on these samples, a maximum likelihood normal pdf is fit. For the ES approach however, the single parent is selected from either the offspring, or the offspring together with the single parent of the previous generation. The result is a single point in the search space, since selection is based on the function value only. This means that we can regard the ES approach with μ = 1 as moving the normal pdf in some direction. The direction as well as the length of the step over which we move the normal pdf is itself subject to evolution. In other words, the ES in this case can be seen as evolving to find local gradient information on how to best traverse the search space.

Figure 18: Operational sketch of mutation-based ES for (1, λ) (left) and (μ, λ) (right).

In the case of μ > 1 for ES, we get the representation depicted on the right-hand side of figure 18. Regarding only mutation for the moment, generating the offspring involves sampling from the normal pdfs that lie centered around the μ parent samples. Therefore, this can be seen as the equivalent of the normal mixture pdf or a clustered normal pdf in the IDEA approach. However, the way in which the μ parents make up the mixture distribution through evolution of the covariance matrices is similar to the case in which μ = 1.
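To make this mixture view concrete: mutation-only offspring generation in a (μ, λ)-ES can be read as sampling from a mixture of μ normal pdfs centered on the parents. A minimal sketch of this reading (our own, hypothetical simplification with axis-aligned pdfs and uniformly chosen parents):

```python
import numpy as np

rng = np.random.default_rng(0)

def es_offspring(parents, lam):
    """parents: list of (y, sigma) pairs; each pair defines a normal pdf
    centered on y with per-dimension standard deviations sigma. Drawing a
    child by picking a parent uniformly and mutating it is exactly a draw
    from the mixture p(y) = (1/mu) * sum_j N(y; y_j, diag(sigma_j^2))."""
    children = []
    for _ in range(lam):
        y, sigma = parents[rng.integers(len(parents))]  # uniform mixing coefficients
        children.append(y + rng.normal(0.0, sigma))
    return children
```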
As in the μ = 1 case, we can therefore see the movement of the μ parent samples as moving the normal pdfs in a certain direction with a certain step size. Finding this direction and step size is subject to evolution. ES mutation can therefore also be taken as a more locally based approach to finding and using the structure of the landscape.

Recombination in ES quite drastically changes this view of its operational semantics. If we for instance regard intermediate crossover, we are drawing lines between the μ parent samples in the l-dimensional space and placing the new samples in the middle of these lines. It becomes clear that by doing so, the search is directed more towards using the global structure of the search space to find promising regions. This strongly aids the ES in efficiently tackling a great amount of continuous optimization problems. Recombination allows for global exploration of the search space, while mutation attempts to use more local landscape information. Moreover, it should be noted that the initial span of the normal pdfs in ES is usually taken to be quite large. This further improves the initial global exploration of the landscape.

The IDEA approach is a global procedure that attempts to use the structure of the problem landscape to explore the most promising regions. On the other hand, mutation based ES approaches are local procedures that use evolution to explore the inside of promising regions. By adding recombination, additional means of globally searching the landscape are introduced, but it is not as evident as for the IDEA to what extent this helps to use global structure.

Concluding, the two approaches are fundamentally different. This becomes clear for instance on problems in which local information is important, such as Rosenbrock's function, which has a narrow valley. The bottom of this valley has a unique minimum. However, the gradient along the bottom of the valley is very small. This means that an IDEA based approach will quite easily find the valley itself, but will by no means be able to traverse the valley to find the global minimum, unless points were sampled near the global minimum. The reason for this is that density estimation converges on a certain part of the valley, since samples are only available in that part of the search space. On the other hand, once the ES is inside the valley, it can adapt its mutation direction and step size to follow the valley to its minimum in a gradient descent fashion. Even though this is a time consuming process, the ES is not as likely to prematurely converge on such a problem as is the IDEA approach.

6 Experiments

The continuous function optimization problems we used for testing are the following:

Name         Definition                                                                      Domain
Griewank     (1/4000) Σ_{i=0}^{l-1} (y_i - 100)² - Π_{i=0}^{l-1} cos((y_i - 100)/√(i+1)) + 1  [-600, 600]^l
Michalewicz  -Σ_{i=0}^{l-1} sin(y_i) (sin((i+1) y_i² / π))^20                                 [0, π]^l
Rosenbrock   Σ_{i=0}^{l-2} (100 (y_{i+1} - y_i²)² + (1 - y_i)²)                               [-5.12, 5.12]^l

Each of these functions is to be minimized. We tested a large variety of IDEA variants for l = 5. In sections 6.1, 6.2 and 6.3 we go over the results for the individual functions. In all our testing, we used a monotonic IDEA. We used the rule of thumb by Mühlenbein and Mahnig [24] for FDA and set τ to 0.3. Furthermore, for variants that use clustering, we let n increase from 250 to 5000 in steps of 250. For variants that do not use clustering, we increased the population size in steps of 25. The reason for this is that by introducing clusters, we also need to increase the population size by a larger amount to allow a significant change in cluster size. We allowed each run a maximum of 1 · 10^7 evaluations for all non-clustered approaches.
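For concreteness, the following sketch shows the shape of one generation of the kind of clustered, monotonic IDEA tested here: truncation selection of the best ⌊τn⌋ samples, clustering of the selection, a maximum-likelihood normal pdf per cluster, and sampling from the resulting mixture. This is our own minimal, hypothetical illustration; the simplified leader algorithm, the full-covariance normal fit (in place of per-cluster factorization learning) and the cluster-size mixing coefficients are our simplifications, not the code used for the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def leader_cluster(points, t_d):
    """Simplified leader algorithm: the first point starts a cluster; each
    next point joins the nearest leader within distance t_d, otherwise it
    becomes a new leader. (The scaled distances of the text are omitted.)"""
    leaders, clusters = [], []
    for p in points:
        d = [np.linalg.norm(p - q) for q in leaders]
        if d and min(d) <= t_d:
            clusters[int(np.argmin(d))].append(p)
        else:
            leaders.append(p)
            clusters.append([p])
    return [np.array(c) for c in clusters]

def one_generation(pop, f, tau=0.3, t_d=3.5):
    """One monotonic-IDEA generation (minimization) with a mixture of
    normal pdfs over the clusters of the selected samples."""
    n = len(pop)
    sel = pop[np.argsort([f(y) for y in pop])][: int(tau * n)]  # truncation selection
    clusters = leader_cluster(sel, t_d)
    # Mixing coefficients proportional to cluster sizes (cf. equation 35).
    coef = np.array([len(c) for c in clusters], dtype=float)
    coef /= coef.sum()
    new = []
    for _ in range(n - len(sel)):  # offspring replace the worst n - |sel| samples
        c = clusters[rng.choice(len(clusters), p=coef)]
        mean = c.mean(axis=0)
        cov = np.cov(c, rowvar=False) if len(c) > 1 else np.zeros((c.shape[1], c.shape[1]))
        cov = cov + 1e-6 * np.eye(c.shape[1])  # regularize tiny clusters
        new.append(rng.multivariate_normal(mean, cov))
    return np.vstack([sel, new])
```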
Because of time restrictions, we only allowed 2.5 · 10^6 evaluations for the clustered approaches, with the exception of the Euclidean leader algorithms on Griewank's function. If all of the solutions differed by less than 5 · 10^-7, a run was terminated. We searched for both unconditional and conditional factorizations, with both the AIC metric and the BIC metric. For the normal mixture pdf, it was empirically observed that inference with respect to finding a factorization is extremely computationally expensive. Therefore, we have fixed the factorization structure to the univariate one for the normal mixture pdf. However, we did test various amounts of components in the normal mixture pdf, namely w ∈ {2, 5, 10}. We applied clustering using both the leader algorithm and the k-means clustering algorithm. In the case of the leader algorithm, we applied both the scaled Euclidean distance and the Mahalanobis distance. For the k-means clustering algorithm, we only used the scaled Euclidean distance. We used k ∈ {2, 5, 10} for the k-means clustering algorithm and T_d ∈ {3.5, 5} for the variants of the leader algorithm. This gives on average 7 and 2 clusters respectively for the scaled Euclidean distance and 11 and 3 clusters respectively for the Mahalanobis distance. The factorization mixture coefficients are determined using equation 35.

For each IDEA instance, we present the best obtained average of the best results over 10 independent runs (the column C in the tables) and sort all of the IDEA instances primarily by this index. We also show the population size n for the best obtained result, the average amount of evaluations and the Relative Run Time RT. Let FT(x) be the time to perform x random function evaluations and TRT the Total Run Time on the same processing system. Then RT(x) = TRT/FT(x). We determined RT as RT(10^6). The RT index is a processing system independent, fair comparison metric of the total required time. We sort the results secondarily by this index instead of by the amount of function evaluations, as it truly reflects the required amount of time. In the result tables, the column f denotes the factorization type: (π, ω) for a conditional factorization, ν for an unconditional factorization and univ. for the univariate factorization; a dash marks an entry that does not apply.

6.1 Griewank's function

In figure 19, a summary of the results on Griewank's function is given. The unique minimum is 0, which is found by quite a few of the tested approaches within the allowed precision. Quite a lot of approaches without clustering are able to find the unique minimum. This gives us the impression that Griewank's function is not extremely epistatic or non-linear. On the other hand, the amount of required function evaluations is very large when compared to clustering approaches or normal mixture approaches. The non-clustered normal mixture pdf with w = 10 for instance requires only 51832 evaluations on average to minimize the function, whereas the approaches that only use a single normal pdf require at least 945971 evaluations on average. However, the RT index points out an important fact. It can be seen from the table that the running time required for the normal mixture pdf is quite a lot larger. Even though the single normal pdf requires 20 times more evaluations, it runs 2 to 3 times as fast as the normal mixture pdf in which only the univariate factorization is used. We should note at this point that if the evaluation time for a single solution had been significantly larger, the normal mixture pdf would have been preferable. If we take a look at the different clustering approaches, we first note that the successful entries in the table use the normal mixture pdf.
It seems that we should mostly attribute the success of these approaches to the use of the normal mixture pdf rather than to the use of clustering. If we take a look at clustering approaches in combination with only a normal pdf, it seems that the Mahalanobis distance does not result in superior clustering within the IDEA framework. What is very interesting however, is that the k-means clustering approaches seem to perform significantly worse than the leader approach, unless clustering is combined with the normal mixture pdf. The main reason for this is the difference in the allowed amount of function evaluations. Otherwise, there does not seem to be a large difference between the two approaches. The only reason why they differ in their outcome is that the amount of clusters differs. If we then regard the Euclidean leader algorithm with the BIC metric, T_d = 3.5 and unconditional dependencies, as well as the k-means algorithm with k = 5, we see that the running time of the leader algorithm is only slightly larger than that of the k-means algorithm, even though the population size is equal and the amount of evaluations is 3 times as large. This makes the leader algorithm preferable here. On the basis of the presented results, it is hard to say whether conditional or unconditional dependencies are to be preferred. The same holds for the search metric. The reason for this is that the amount of dimensions l is very small. The choice of the metric and the type of dependencies becomes truly significant as l goes up. Still, it seems that for expressional power, the conditional dependencies should be preferred.

Clustering   Distance     f      Metric  pdf     n     C         evals    RT
-            -            (π,ω)  AIC     normal  175   0.000000  945971   2.21
-            -            ν      AIC     normal  175   0.000000  1039070  2.55
-            -            (π,ω)  BIC     normal  150   0.000000  1403463  3.19
-            -            ν      BIC     normal  200   0.000000  1492826  3.48
-            -            univ.  -       mix 10  1100  0.000000  51832    6.61
k-means 2    Euclidean    univ.  -       mix 10  1750  0.000000  72552    8.86
Leader 3.5   Mahalanobis  univ.  -       mix 5   3250  0.000000  156311   9.97
Leader 3.5   Mahalanobis  univ.  -       mix 10  2250  0.000000  89718    10.30
k-means 5    Euclidean    univ.  -       mix 10  2500  0.000000  98805    11.43
Leader 5     Euclidean    (π,ω)  AIC     normal  250   0.000000  4522165  11.65
k-means 5    Euclidean    univ.  -       mix 5   3250  0.000000  172812   11.81
-            -            univ.  -       mix 5   1600  0.000000  163024   13.01
Leader 5     Euclidean    univ.  -       mix 10  2250  0.000000  106660   13.65
Leader 5     Mahalanobis  univ.  -       mix 10  2250  0.000000  110206   14.40
Leader 5     Euclidean    univ.  -       mix 5   2500  0.000000  226102   16.42
k-means 2    Euclidean    univ.  -       mix 5   3000  0.000000  228857   16.75
k-means 10   Euclidean    univ.  -       mix 10  3750  0.000000  156386   17.58
Leader 3.5   Euclidean    univ.  -       mix 10  4750  0.000000  199321   23.74
Leader 5     Mahalanobis  univ.  -       mix 5   3250  0.000000  353185   26.94
Leader 3.5   Euclidean    univ.  -       mix 5   4750  0.000047  360632   22.05
k-means 10   Euclidean    univ.  -       mix 5   5000  0.000072  274139   17.00
Leader 5     Euclidean    ν      AIC     normal  500   0.000636  7477291  18.96
Leader 5     Euclidean    ν      BIC     normal  250   0.000905  3706986  10.29
Leader 5     Euclidean    univ.  -       mix 2   4000  0.001232  3239155  106.93
Leader 5     Euclidean    (π,ω)  BIC     normal  250   0.001278  6631367  16.59
Leader 3.5   Mahalanobis  univ.  -       mix 2   4750  0.004618  698637   23.12
-            -            univ.  -       mix 2   2750  0.005451  4818328  123.82
Leader 3.5   Euclidean    ν      BIC     normal  1000  0.005755  6620613  18.40
Leader 5     Mahalanobis  ν      BIC     normal  250   0.005956  1628976  7.51
k-means 5    Euclidean    univ.  -       mix 2   3000  0.006179  653522   24.00
Leader 3.5   Euclidean    (π,ω)  BIC     normal  750   0.006705  7268071  19.47
Leader 5     Mahalanobis  univ.  -       mix 2   2000  0.007339  1826277  54.79
Leader 3.5   Euclidean    ν      AIC     normal  1000  0.007477  2775053  18.38
Leader 3.5   Euclidean    (π,ω)  AIC     normal  1250  0.007640  8992514  25.56
k-means 2    Euclidean    univ.  -       mix 2   1750  0.007930  979638   25.97
Leader 5     Mahalanobis  ν      AIC     normal  250   0.009353  1338268  5.81
Leader 5     Mahalanobis  (π,ω)  BIC     normal  250   0.009428  2499626  10.28
Leader 3.5   Mahalanobis  (π,ω)  AIC     normal  750   0.010497  2127631  10.52
Leader 3.5   Euclidean    univ.  -       mix 2   5000  0.011837  751763   21.90
Leader 3.5   Mahalanobis  ν      BIC     normal  750   0.011987  2007309  9.70
k-means 2    Euclidean    ν      AIC     normal  250   0.012384  736524   2.48
k-means 2    Euclidean    (π,ω)  BIC     normal  500   0.012742  2381377  9.06
Leader 5     Mahalanobis  (π,ω)  AIC     normal  250   0.012981  1793646  7.61
Leader 3.5   Mahalanobis  (π,ω)  BIC     normal  500   0.013007  877386   4.11
k-means 2    Euclidean    (π,ω)  AIC     normal  250   0.013202  1172454  3.89
k-means 10   Euclidean    univ.  -       mix 2   4750  0.013400  958065   31.46
k-means 2    Euclidean    ν      BIC     normal  250   0.013514  793724   2.79
k-means 10   Euclidean    (π,ω)  AIC     normal  2500  0.013601  1718042  19.35
Leader 3.5   Mahalanobis  ν      AIC     normal  750   0.014672  2296346  11.27
k-means 10   Euclidean    ν      BIC     normal  2250  0.015987  703016   17.55
k-means 10   Euclidean    ν      AIC     normal  5000  0.015921  1662724  18.00
k-means 10   Euclidean    (π,ω)  BIC     normal  2500  0.016360  2054672  23.03
k-means 5    Euclidean    ν      AIC     normal  1000  0.016410  1094122  6.40
k-means 5    Euclidean    ν      BIC     normal  1000  0.019069  2361092  17.05
k-means 5    Euclidean    (π,ω)  BIC     normal  1000  0.020145  2236489  14.47
k-means 5    Euclidean    (π,ω)  AIC     normal  250   0.101171  1725182  6.63

Figure 19: Results on Griewank's function, sorted on C (primary) and RT (secondary).

6.2 Michalewicz's function

Observing the results on the highly epistatic Michalewicz function in figure 20, it becomes clear immediately that clustering is required. The non-clustering approaches only work well if the normal mixture pdf is used, which is in a way comparable to using clustering approaches. There is no significant difference between conditional and unconditional dependencies, but in the case of conditional dependencies, the required amount of function evaluations is often lower. There seems to be a slight preference for the Mahalanobis distance, but the Euclidean distance seems to work quite well. A more significant difference is present between the leader clustering approach and the k-means clustering approach. The k-means approaches all seem to be inferior to the leader approaches. The only explanation for this can be the adaptiveness of the leader algorithm through the threshold T_d. This adaptiveness allows the leader algorithm to be flexible in the amount of required clusters, which in turn results in a better performance. We note that the normal mixture pdf requires on average quite a lot fewer evaluations than does any other approach using only the normal pdf. However, the computation time required to estimate the parameters of the normal mixture significantly increases the running time. The main thing to note is that whereas for Griewank's function we were still able to get good results even with only the normal pdf, clustering becomes much more important here.

6.3 Rosenbrock's function

Rosenbrock's function has proven to be a real challenge for any continuous optimization algorithm. It has a valley along which the quality of solutions is much better than that of the solutions in its neighborhood. Furthermore, this valley has a unique minimum of 0 itself. The gradient along the bottom of the valley is only very slight.
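As a quick numerical illustration of this claim (our own sketch, using the Rosenbrock definition from the test-function table in section 6), the gradient norm at a point on the valley floor y_{i+1} = y_i² is markedly smaller than at a point just off the floor:

```python
import numpy as np

def rosenbrock(y):
    """Rosenbrock's function as defined in the test-function table."""
    return np.sum(100.0 * (y[1:] - y[:-1] ** 2) ** 2 + (1.0 - y[:-1]) ** 2)

def num_grad(f, y, h=1e-6):
    """Central-difference gradient, adequate for a rough comparison."""
    g = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = h
        g[i] = (f(y + e) - f(y - e)) / (2 * h)
    return g

# A point exactly on the valley floor (each coordinate the square of the
# previous one) and the same point nudged off the floor in one dimension.
on_floor = np.array([0.5, 0.25, 0.0625, 0.00390625, 0.0000152587890625])
off_floor = on_floor + np.array([0.0, 0.1, 0.0, 0.0, 0.0])

print(np.linalg.norm(num_grad(rosenbrock, on_floor)))   # weak pull along the valley
print(np.linalg.norm(num_grad(rosenbrock, off_floor)))  # much steeper pull back to the floor
```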
Any gradient approach is therefore doomed to follow the long road along the bottom of the valley. For a density estimation algorithm, capturing the valley in a probabilistic model is difficult, even if all of the points within the valley are known. The reason for this is that the valley is non-linear in the coding space. Therefore, we expect that to get any reasonable results, clustering is required.

In figure 21, the results on Rosenbrock's function for l = 5 are given. It is clear from this table that any approach that does not use clustering does not give useful results. Using 10 clusters allows us to effectively process the non-linearity of the valley and optimize the problem using only very few evaluations. We note that using a mixture of normal pdfs in this case does not help the optimization process at all. In some cases it even results in more evaluations than when using a single normal pdf. The main issue is the non-linearity of the valley and its slight gradient. Because only the univariate factorization is used for the normal mixture pdf, the non-linearity cannot be captured effectively.

The problem with density estimation in this continuous search space is that it does not use the gradient information that can be extremely useful for finding the minimum. Given the continuity of the search space, it is even unwise not to exploit this information in some way, which is demonstrated by the results in figure 21: quite a large amount of clusters is required to effectively optimize the problem. Note that the amount of dimensions is only very small. If the amount of dimensions goes up, the amount of clusters will have to increase accordingly. However, by doing so, the amount of samples will have to increase as well to ensure a large enough cluster size. Therefore, the sole use of density estimation on search spaces such as the one defined by Rosenbrock's function is not effective enough to compete with gradient approaches. However, a hybrid combination of both approaches will most likely result in very effective continuous optimization techniques.

We close this section by noting that the k-means clustering algorithm now outperforms the leader algorithm. The main reason for this is that clustering is extremely important for this function. With k = 10, we have the largest amount of clusters that we tested. If we had used a threshold lower than T_d = 3.5, we would most likely have obtained better results with the leader algorithm as well. Finally, it is clear that the Euclidean distance measure significantly outperforms the Mahalanobis distance measure.

7 Discussion

The use of the normal mixture pdf in combination with the EM algorithm as proposed in this paper requires a vast amount of computation time. For highly epistatic search spaces however, it has been shown to be effective with respect to the required amount of evaluations as well as optimization performance. Since the EM algorithm gives a general approach to finding the involved parameters, the normal mixture pdf is preferable over the normal kernels pdf that was proposed earlier [12]. The use of clustering in combination with the normal pdf also results in a normal mixture pdf. However, the results in this paper show different behavior of the two techniques on different landscapes. On non-linear landscapes, clustering is the most effective way to estimate the non-linearity, unless higher order factorizations are allowed for the normal mixture pdf.
However, allowing such factorizations results in vast additional amounts of computation time, which is undesirable.

Clustering   Distance     f      Metric  pdf     n     C          evals    RT
Leader 5     Mahalanobis  (π,ω)  AIC     normal  750   -4.687658  95364    0.32
Leader 3.5   Euclidean    univ.  -       mix 2   1500  -4.687658  38416    0.37
Leader 5     Mahalanobis  ν      BIC     normal  1000  -4.687658  117979   0.38
Leader 5     Mahalanobis  (π,ω)  BIC     normal  1000  -4.687658  122711   0.41
Leader 3.5   Euclidean    (π,ω)  BIC     normal  4000  -4.687658  180813   0.42
Leader 5     Mahalanobis  ν      AIC     normal  1000  -4.687658  131561   0.43
Leader 5     Mahalanobis  univ.  -       mix 5   1000  -4.687658  22118    0.45
Leader 3.5   Euclidean    ν      BIC     normal  3500  -4.687658  193759   0.45
Leader 3.5   Euclidean    ν      AIC     normal  3750  -4.687658  194792   0.46
Leader 3.5   Euclidean    (π,ω)  AIC     normal  4000  -4.687658  201120   0.47
Leader 5     Euclidean    ν      BIC     normal  2250  -4.687658  221314   0.47
k-means 2    Euclidean    univ.  -       mix 2   2000  -4.687658  49634    0.52
Leader 3.5   Mahalanobis  (π,ω)  BIC     normal  3500  -4.687658  129114   0.54
Leader 5     Mahalanobis  univ.  -       mix 2   2000  -4.687658  51035    0.55
Leader 5     Euclidean    univ.  -       mix 2   2250  -4.687658  58789    0.58
k-means 5    Euclidean    univ.  -       mix 5   1250  -4.687658  28954    0.59
Leader 3.5   Mahalanobis  ν      BIC     normal  3750  -4.687658  139646   0.59
-            -            univ.  -       mix 5   1300  -4.687658  28903    0.61
Leader 5     Euclidean    (π,ω)  BIC     normal  1500  -4.687658  301297   0.63
k-means 5    Euclidean    univ.  -       mix 2   2500  -4.687658  63128    0.67
Leader 3.5   Mahalanobis  ν      AIC     normal  4250  -4.687658  155654   0.68
k-means 2    Euclidean    univ.  -       mix 10  750   -4.687658  16596    0.70
k-means 10   Euclidean    univ.  -       mix 5   1750  -4.687658  43894    0.88
Leader 3.5   Mahalanobis  univ.  -       mix 2   3500  -4.687658  86528    0.92
Leader 5     Euclidean    ν      AIC     normal  3250  -4.687658  423172   0.92
Leader 5     Euclidean    univ.  -       mix 10  1000  -4.687658  21680    0.96
k-means 2    Euclidean    univ.  -       mix 5   2000  -4.687658  45606    0.96
Leader 3.5   Mahalanobis  univ.  -       mix 5   2250  -4.687658  52288    1.04
Leader 5     Euclidean    (π,ω)  AIC     normal  2000  -4.687658  512489   1.08
k-means 10   Euclidean    univ.  -       mix 10  1250  -4.687658  30925    1.10
Leader 3.5   Mahalanobis  univ.  -       mix 10  1500  -4.687658  34475    1.25
Leader 3.5   Euclidean    univ.  -       mix 5   3000  -4.687658  70232    1.33
k-means 10   Euclidean    univ.  -       mix 2   4250  -4.687658  115106   1.34
Leader 5     Euclidean    univ.  -       mix 5   3000  -4.687658  68131    1.38
-            -            univ.  -       mix 2   5000  -4.687658  135938   1.45
k-means 5    Euclidean    univ.  -       mix 10  1750  -4.687658  40522    1.51
-            -            univ.  -       mix 10  1800  -4.687658  39756    1.61
Leader 3.5   Euclidean    univ.  -       mix 10  2000  -4.687658  45431    1.68
Leader 5     Mahalanobis  univ.  -       mix 10  2500  -4.687658  58751    2.18
Leader 3.5   Mahalanobis  (π,ω)  AIC     normal  4250  -4.682400  258326   1.14
k-means 2    Euclidean    (π,ω)  AIC     normal  2500  -4.661225  114406   0.48
k-means 5    Euclidean    (π,ω)  BIC     normal  5000  -4.660579  825985   2.63
k-means 2    Euclidean    ν      BIC     normal  4750  -4.659790  543562   1.44
-            -            ν      BIC     normal  325   -4.659300  1032504  2.13
-            -            (π,ω)  AIC     normal  1100  -4.657363  4064810  8.20
k-means 2    Euclidean    (π,ω)  BIC     normal  3500  -4.656492  615331   1.54
k-means 2    Euclidean    ν      AIC     normal  3750  -4.656120  646464   1.66
-            -            (π,ω)  BIC     normal  1700  -4.654513  4088736  8.26
-            -            ν      AIC     normal  175   -4.654248  13693    0.03
k-means 10   Euclidean    ν      BIC     normal  2000  -4.650213  377993   1.48
k-means 5    Euclidean    (π,ω)  AIC     normal  4000  -4.643722  773225   2.35
k-means 5    Euclidean    ν      AIC     normal  2750  -4.639595  443323   1.38
k-means 10   Euclidean    ν      AIC     normal  3500  -4.638577  856142   3.31
k-means 10   Euclidean    (π,ω)  BIC     normal  4250  -4.637136  551834   2.35
k-means 5    Euclidean    ν      BIC     normal  3750  -4.635958  499408   1.59
k-means 10   Euclidean    (π,ω)  AIC     normal  2250  -4.622931  1262262  4.52

Figure 20: Results on Michalewicz's function, sorted on C (primary) and RT (secondary).

Clustering   Distance     f      Metric  pdf     n     C         evals    RT
k-means 10   Euclidean    (π,ω)  AIC     normal  2500  0.000000  67506    3.03
k-means 10   Euclidean    (π,ω)  BIC     normal  3000  0.000000  82575    3.65
k-means 10   Euclidean    ν      AIC     normal  4250  0.000000  157886   6.30
k-means 10   Euclidean    ν      BIC     normal  2750  0.001550  168627   5.35
Leader 3.5   Euclidean    (π,ω)  AIC     normal  4750  0.002011  166477   1.81
Leader 3.5   Euclidean    ν      AIC     normal  5000  0.033155  773470   8.84
Leader 3.5   Euclidean    (π,ω)  BIC     normal  5000  0.050435  646558   7.36
k-means 10   Euclidean    univ.  -       mix 5   3500  0.062745  309262   31.00
Leader 5     Euclidean    univ.  -       mix 10  4000  0.065469  341870   72.56
Leader 3.5   Euclidean    ν      BIC     normal  5000  0.074347  742398   8.52
k-means 5    Euclidean    (π,ω)  BIC     normal  1250  0.078283  53482    1.18
k-means 5    Euclidean    ν      AIC     normal  4750  0.079437  904017   15.03
k-means 5    Euclidean    (π,ω)  AIC     normal  4750  0.083104  615903   10.53
Leader 5     Euclidean    univ.  -       mix 5   2250  0.085702  191173   19.92
Leader 5     Euclidean    univ.  -       mix 2   3500  0.092902  278625   13.86
k-means 10   Euclidean    univ.  -       mix 10  4250  0.106256  610238   89.13
k-means 10   Euclidean    univ.  -       mix 2   2000  0.125456  272568   13.00
Leader 5     Euclidean    ν      AIC     normal  3750  0.128021  478400   4.81
Leader 3.5   Euclidean    univ.  -       mix 2   3500  0.143837  621458   20.36
k-means 5    Euclidean    ν      BIC     normal  4750  0.160205  1246595  19.71
k-means 5    Euclidean    univ.  -       mix 5   4750  0.160414  531505   45.24
Leader 3.5   Euclidean    univ.  -       mix 10  4500  0.172262  472424   79.29
Leader 3.5   Euclidean    univ.  -       mix 5   3000  0.189304  286372   25.40
k-means 2    Euclidean    ν      BIC     normal  5000  0.189420  1621587  19.09
Leader 5     Mahalanobis  univ.  -       mix 5   2250  0.191094  261108   27.56
Leader 5     Euclidean    (π,ω)  AIC     normal  4500  0.209030  1361006  13.73
Leader 3.5   Mahalanobis  ν      AIC     normal  2000  0.209074  808100   20.10
Leader 3.5   Mahalanobis  (π,ω)  AIC     normal  2500  0.247123  1765538  43.24
k-means 5    Euclidean    univ.  -       mix 10  5000  0.259675  551594   86.79
Leader 5     Mahalanobis  univ.  -       mix 2   4250  0.260017  457718   24.28
Leader 5     Mahalanobis  univ.  -       mix 10  4500  0.261190  735926   145.22
Leader 5     Euclidean    ν      BIC     normal  2250  0.285851  1338881  13.80
-            -            univ.  -       mix 5   450   0.288066  37074    3.52
Leader 5     Euclidean    (π,ω)  BIC     normal  5000  0.304032  1427719  14.91
Leader 3.5   Mahalanobis  ν      BIC     normal  2500  0.325525  918273   23.56
-            -            univ.  -       mix 10  1400  0.345850  133443   25.19
Leader 5     Mahalanobis  (π,ω)  BIC     normal  3000  0.363407  2252646  49.28
k-means 5    Euclidean    univ.  -       mix 2   4250  0.366888  600938   24.30
Leader 3.5   Mahalanobis  univ.  -       mix 10  3500  0.367077  555588   100.59
Leader 3.5   Mahalanobis  univ.  -       mix 2   2750  0.396008  378320   20.99
k-means 2    Euclidean    univ.  -       mix 2   5000  0.421393  1166456  45.95
Leader 3.5   Mahalanobis  (π,ω)  BIC     normal  1750  0.445489  1915536  46.29
Leader 3.5   Mahalanobis  univ.  -       mix 5   2500  0.459225  406324   40.36
k-means 2    Euclidean    univ.  -       mix 10  3250  0.481794  291164   52.68
k-means 2    Euclidean    univ.  -       mix 5   2250  0.495617  180535   17.21
Leader 5     Mahalanobis  ν      AIC     normal  3250  0.605696  1506264  32.43
Leader 5     Mahalanobis  (π,ω)  AIC     normal  5000  0.615485  2316097  53.52
Leader 5     Mahalanobis  ν      BIC     normal  4250  0.774399  1978082  43.80
k-means 2    Euclidean    (π,ω)  AIC     normal  4750  0.923545  1057429  12.74
-            -            univ.  -       mix 2   975   1.083728  92292    5.01
k-means 2    Euclidean    (π,ω)  BIC     normal  1250  1.281874  79419    1.53
k-means 2    Euclidean    ν      AIC     normal  3250  1.298608  919056   10.63
-            -            (π,ω)  BIC     normal  2750  1.859228  2370574  19.30
-            -            (π,ω)  AIC     normal  675   1.867434  79619    0.72
-            -            ν      BIC     normal  4000  1.906433  1592727  13.52
-            -            ν      AIC     normal  900   1.971504  135492   1.10

Figure 21: Results on Rosenbrock's function, sorted on C (primary) and RT (secondary).

Given the low computational requirements of the factorized normal pdf, introducing clustering is the most cost effective way to improve the performance of IDEAs on a variety of non-linear and epistatic continuous optimization functions. To further improve the effectiveness of the algorithms, local gradient information must be taken into account. By only estimating probability distributions and generating more samples from them, gradient information is ignored. On truly differentiable continuous search spaces however, this information is a very important guide for finding local minima or maxima. Implicitly, approaches such as Evolution Strategies make use of such gradient information, be it completely local or on a less detailed scale. Furthermore, many classical gradient search techniques exist that have proven to be very effective. Therefore, to improve the optimization performance of IDEA approaches on differentiable continuous search spaces, a local search method that uses gradient information should be incorporated. Combining the global approach of the IDEA and local gradient search will likely be very effective.

For the experiments in this paper, we have set the mixing coefficients of the factorization mixture to the proportional cluster size. To increase the optimization pressure, the coefficients should be set to the proportional average cluster fitness. Preliminary results have shown that this is effective for many different cost functions.

In this paper, we have introduced general mixtures of factorizations. To learn them, we have applied clustering and subsequent factorization learning in each of the clusters separately. The amount of clusters has so far been determined either directly by means of the k-means clustering algorithm or indirectly through a distance threshold in the leader algorithms. Even though the latter is a form of adaptive clustering, it is not truly adaptive in the sense that new clusters are introduced as they are required to estimate the probability distribution better. If we had such a mechanism that is effective with respect to computational requirements, clustering in density estimation for evolutionary optimization in IDEAs would become even more valuable.

From the results in this paper, we find that the Mahalanobis distance for clustering does not introduce a significant optimization performance increase over the use of the Euclidean distance measure. Since the Euclidean distance measure can be evaluated faster, this measure should be preferred. Furthermore, the use of the simple and fast leader clustering algorithm gives results that are comparable to those of the k-means clustering algorithm. Therefore, the leader algorithm is preferable.

Since we have only tested low dimensional problems, the use of unconditional factorizations versus conditional factorizations can hardly be commented upon. However, it seems that using conditional factorizations often leads to superior results. The use of either the AIC or the BIC metric to search for a factorization should be tested on problems of a higher dimensionality. Moreover, finding a good factorization becomes more important for problems in which the problem structure is made up of true building blocks.
On such problems, the linkage between the problem variables has to be exploited in the best way possible to efficiently process the generated solutions and traverse the search space [9]. Such problems occur both in differentiable continuous optimization and in spaces such as the one defined by using random keys. Random keys are continuous values that are used to encode permutations. Therefore, we are actually dealing with a discrete space in which problem structure plays a subtly different role than is the case in conventional continuous differentiable problems.

8 Conclusions

We have expanded the use of continuous probabilistic models for any factorization from a single normal pdf to mixtures of normal pdfs. To learn normal mixture pdfs, we have used the EM algorithm. Next to the normal mixture pdf, we have shown how efficient clustering algorithms can be used to break the non-linearity of the search space. By treating each cluster as a separate sample vector, we again obtain a normal mixture pdf. The two approaches to reaching the mixture are however fundamentally different. Furthermore, the EM algorithm has been found to be computationally more expensive. For non-linear search spaces, clustering is a cost effective way to break this non-linearity, in order for the promising regions of the search space to be explored efficiently. The resulting probabilistic model structure is then a mixture of factorizations.

We have clarified the use of a set of intuitive search metrics and have shown how these fit in with the basic notion of likelihood, which is a measure of how well a probability distribution fits a vector of samples. Starting from there, we have shown how well known measures such as the AIC and BIC metric can be derived. It becomes clear that these metrics are quite similar and only differ in the amount of regularization that is applied to ensure a good generalization performance of the estimated probability distribution. In subsequent research, the use of the AIC and BIC metric in higher dimensional search spaces and non-continuous search spaces is to be investigated to get a notion of their performance quality.

Even though nice optimization results have been achieved for varying epistatic and non-linear differentiable continuous search spaces, the use of gradient information to help guide the search towards promising regions of the search space will most likely result in even more effective continuous optimization algorithms. By only using the estimation of and sampling from probability distributions, as is done at this point, the local gradient information is ignored. In truly differentiable continuous search spaces however, it is a waste to ignore such important information.

Summarizing, in this paper we have extended the set of tools available to IDEAs for performing continuous optimization. This extension consists of new and more complex tools that we have shown to allow for a better performance of IDEAs.

References

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki, editors, Proceedings of the 2nd International Symposium on Information Theory, pages 267-281. Akademiai Kiado, 1973.
[2] T. Bäck and H.-P. Schwefel. An overview of evolutionary algorithms for parameter optimization. Evolutionary Computation, 1(1):1-23, 1993.
[3] T. Bäck and H.-P. Schwefel. Evolution strategies I: Variants and their computational implementation. In G. Winter, J. Périaux, M. Galán, and P.
Cuesta, editors, Genetic Algorithms in Engineering and Computer Science, Proceedings of the First Short Course EUROGEN'95, pages 111-126. Wiley, 1995.
[4] S. Baluja and R. Caruana. Removing the genetics from the standard genetic algorithm. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 38-46. Morgan Kaufmann Publishers, 1995.
[5] S. Baluja and S. Davies. Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. In D.H. Fisher, editor, Proceedings of the 1997 International Conference on Machine Learning. Morgan Kaufmann Publishers, 1997.
[6] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. ICSI Technical Report TR-97-021. http://www.icsi.berkeley.edu/~bilmes/papers/em.ps.gz, 1997.
[7] J.S. De Bonet, C. Isbell, and P. Viola. MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, 9, 1996.
[8] P.A.N. Bosman and D. Thierens. An algorithmic framework for density estimation based evolutionary algorithms. Utrecht University Technical Report UU-CS-1999-46. ftp://ftp.cs.uu.nl/pub/RUU/CS/techreps/CS-1999/1999-46.ps.gz, 1999.
[9] P.A.N. Bosman and D. Thierens. Linkage information processing in distribution estimation algorithms. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 60-67. Morgan Kaufmann Publishers, 1999.
[10] P.A.N. Bosman and D. Thierens. Continuous iterated density estimation evolutionary algorithms within the IDEA framework. In M. Pelikan, H. Mühlenbein, and A.O. Rodriguez, editors, Proceedings of the Optimization by Building and Using Probabilistic Models OBUPM Workshop at the Genetic and Evolutionary Computation Conference GECCO-2000. Morgan Kaufmann Publishers, 2000.
[11] P.A.N. Bosman and D. Thierens. Expanding from discrete to continuous estimation of distribution algorithms: The IDEA. In M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J.J. Merelo, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN VI, pages 767-776. Springer, 2000.
[12] P.A.N. Bosman and D. Thierens. IDEAs based on the normal kernels probability density function. Utrecht University Technical Report UU-CS-2000-11. ftp://ftp.cs.uu.nl/pub/RUU/CS/techreps/CS-2000/2000-11.ps.gz, 2000.
[13] P.A.N. Bosman and D. Thierens. Negative log-likelihood and statistical hypothesis testing as the basis of model selection in IDEAs. In A. Feelders, editor, Proceedings of the Tenth Dutch-Netherlands Conference on Machine Learning. Tilburg University, 2000.
[14] S. Brandt. Statistical and Computational Methods in Data Analysis. North-Holland, 1970.
[15] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[16] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39:1-38, 1977.
[17] W. Feller. An Introduction to Probability Theory and Its Applications, Volume 1. Wiley, 1968.
[18] M. Gallagher, M. Frean, and T. Downs. Real-valued evolutionary optimization using a flexible probability density estimator. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E.
Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 840-846. Morgan Kaufmann Publishers, 1999.
[19] G. Harik. Linkage learning via probabilistic modeling in the ECGA. IlliGAL Technical Report 99010. ftp://ftp-illigal.ge.uiuc.edu/pub/papers/IlliGALs/99010.ps.Z, 1999.
[20] J.A. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.
[21] M.G. Kendall and A. Stuart. The Advanced Theory of Statistics, Volume 2, Inference and Relationship. Charles Griffin & Company Limited, 1967.
[22] P. Larrañaga, R. Etxeberria, J.A. Lozano, and J.M. Peña. Optimization by learning and simulation of Bayesian and Gaussian networks. University of the Basque Country Technical Report EHU-KZAA-IK-4/99. http://www.sc.ehu.es/ccwbayes/postscript/kzaa-ik-04-99.ps, 1999.
[23] P.C. Mahalanobis. On tests and measures of group divergence I. Journal of the Asiatic Society of Bengal, 26:541, 1930.
[24] H. Mühlenbein and T. Mahnig. FDA - a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7:353-376, 1999.
[25] M. Pelikan and D.E. Goldberg. Genetic algorithms, clustering, and the breaking of symmetry. In M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J.J. Merelo, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN VI, pages 385-394. Springer, 2000.
[26] M. Pelikan, D.E. Goldberg, and E. Cantú-Paz. BOA: The Bayesian optimization algorithm. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 525-532. Morgan Kaufmann Publishers, 1999.
[27] M. Pelikan, D.E. Goldberg, and K. Sastry. Bayesian optimization algorithm, decision graphs and Occam's razor. Also available as IlliGAL Report 2000020. ftp://ftp-illigal.ge.uiuc.edu/pub/papers/IlliGALs/2000020.ps.Z, 2000.
[28] M. Pelikan and H. Mühlenbein. The bivariate marginal distribution algorithm. In R. Roy, T. Furuhashi, K. Chawdry, and K. Pravir, editors, Advances in Soft Computing - Engineering Design and Manufacturing. Springer-Verlag, 1999.
[29] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
[30] M. Sebag and A. Ducoulombier. Extending population-based incremental learning to continuous search spaces. In A.E. Eiben, T. Bäck, M. Schoenauer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN V, pages 418-427. Springer, 1998.
[31] I. Servet, L. Travé-Massuyès, and D. Stern. Telephone network traffic overloading diagnosis and evolutionary computation technique. In J.K. Hao, E. Lutton, E. Ronald, M. Schoenauer, and D. Snyers, editors, Proceedings of Artificial Evolution '97, pages 137-144. Springer-Verlag, LNCS 1363, 1997.
[32] K.S. Trivedi. Probability & Statistics with Reliability, Queuing and Computer Science Applications. Prentice-Hall, 1982.
[33] X. Yin and N. Germay. A fast genetic algorithm with sharing scheme using cluster analysis methods in multimodal function optimization. In Proceedings of the 1993 International Conference on Artificial Neural Nets and Genetic Algorithms, pages 450-457, 1993.

A Notation glossary

General

A = {·, ·, ..., ·}   A set A of size |A| whose elements cannot be enumerated.
|A|                  The amount of elements in set A.
∅                    The empty set.
a ∈ A