Mixed IDEAs

Peter A.N. Bosman                        Dirk Thierens
peterb@cs.uu.nl                          Dirk.Thierens@cs.uu.nl

Institute of Information and Computing Sciences, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

December 2000

Abstract

Building and using probabilistic models to perform stochastic optimization in the case of continuous random variables has so far been limited to the use of factorizations as the structure of probabilistic models. Furthermore, the only probability density function (pdf) that has been successfully tested on a multitude of problems is the normal pdf. The normal pdf however strongly generalizes the data and cannot cope with non-linear interactions among the samples. In this paper, we show how clustering algorithms can be used to overcome this problem. We also show how the normal mixture pdf can be used in the case of a general factorization instead of the normal pdf. We formalize the notion of a probabilistic model and propose to use two practical instances for the model structure, which are the factorization and the mixture of factorizations. We propose to use metrics to find good factorizations and thereby eliminate a complexity parameter κ that was required in previous continuous approaches in the case of a general factorization. We also show the background of the metrics through general model selection on the basis of likelihood maximization, which demonstrates their connection with previously used factorization selection algorithms. We use the IDEA framework for iterated density estimation evolutionary algorithms to construct new continuous evolutionary optimization algorithms based on the described techniques. Their performance is evaluated on a set of well known epistatic continuous optimization problems.

1 Introduction

The Iterated Density Estimation Evolutionary Algorithm (IDEA) framework has been used to apply probabilistic models to continuous stochastic optimization [8, 10, 11, 12, 13]. Algorithms within the IDEA framework build a probabilistic model from a selection of samples and subsequently draw more samples from the probability distribution imposed by the probabilistic model. These samples are then incorporated among the currently available samples, after which selection takes place again. As this process is iterated, the algorithm is intended to converge to a good solution. Estimating and sampling from a probability distribution each iteration can be seen as a combination of the recombination and mutation steps in evolutionary algorithms. Since a selection of the available samples is used to create more solutions, the IDEA framework can be seen as an evolutionary algorithm with an explicit use of statistics.

In previous work [8, 12], the normal pdf and the normal kernels pdf have been used as elementary probability density functions (pdfs). Using these pdfs, the estimated probability distribution over the complete domain of problem variables is constructed. The normal pdf has proven to be computationally effective and to result in acceptable optimization performance as well [10, 11]. The normal kernels pdf on the other hand has proven to be promising in optimization performance, but also to be very hard to handle [10]. Furthermore, its computational requirements are quite large. It seems that the use of a normal mixture pdf is a good trade-off between the normal pdf and the normal kernels pdf. The computational requirements of the normal mixture pdf are smaller than those of the normal kernels pdf.
Furthermore, its parameters can be estimated using the EM algorithm, which eliminates the need for fine-tuning certain parameters by hand. In this paper, we show how the normal mixture pdf can be used within the IDEA framework.

The normal pdf itself is unable to describe non-linear interactions in a sample set. The normal mixture pdf can be used to overcome this problem. Another approach is to cluster the samples in a preprocessing phase and to use the normal pdf in each cluster. Assuming that the clustering phase effectively breaks up the non-linear interactions between the variables, this should result in an effective estimation of the selected samples. In a pilot study by Pelikan and Goldberg, clustering has been applied to optimization algorithms that use probabilistic models [25]. Here, we formalize the notion of clustering as part of the model selection process and investigate what clustering algorithms are effective in both computational running time as well as clustering performance.

In previous optimization algorithms that use continuous probabilistic models, little attention has been paid to search metrics that guide the search for a good probabilistic model. In this paper, we propose to use metrics that effectively prefer simpler models if they can describe the samples in a similar manner as can more complex models. We also show their correspondence with significance testing using statistical hypotheses and the notion of likelihood as a description of the goodness of an estimation.

We show how to use a higher level of modeling in the case of continuous probabilistic models using mixtures of distributions. Furthermore, we also provide useful search metrics to find continuous probabilistic models within iterated density estimation evolutionary algorithms. The combination of the proposed techniques leads to new algorithms. In this paper, we evaluate their performance using a set of test functions.

The remainder of this paper is organized as follows. In section 2 we formalize the notion of a probabilistic model. We also show how previous approaches have used a special instance for the probabilistic model structure, which is called the factorization. In section 3, we go into model selection. First, we discuss two approaches to model selection in general. Subsequently, we show how these approaches can be used to find a good factorization. Finally, we investigate the use of clustering algorithms to use a new higher order instance of probabilistic models in IDEAs, which is a mixture of factorizations. In section 4, we show how the normal pdf and the normal mixture pdf can be fit to any factorization and how we can draw samples from them. Subsequently, in section 5, we go over the IDEA framework and show the difference between previous approaches and the new approaches described in this paper. We also elaborate on the difference between the IDEA approach and Evolution Strategies. In section 6, we test the new algorithms on a set of well known test functions. In section 7, we discuss some topics of importance and possible future research. Finally, this paper is concluded in section 8.

2 Probabilistic models

Given a vector of samples from a space defined by a cost function that, without loss of generality, we seek to minimize, our aim is to build and use a probabilistic model that describes these samples. The construction of this model should be effective in computation time requirements. Furthermore, the model itself should be effective in descriptive power.
Once a model has been built, more samples are acquired using the probabilistic model. Subsequently, a new vector of samples is selected from all of the available samples, after which again a probabilistic model is built. As this process is iterated, the requirement that the computation time for model building should be reasonable is clear, as this is the main ingredient of our non-deterministic inductive search. On the other hand, the probabilistic model should efficiently describe the vector of selected samples in order for the algorithm to sample from the promising regions of the search space. If the model is not effective in its description, the regions may not be separated properly and the search does not propagate effectively over most of the regions. Instead, only a single region could for instance be explored and the algorithm is more likely to be misled.

Before we can discuss how to effectively find a good probabilistic model, we have to define exactly what such a model consists of. In section 2.1 we therefore formally define the concept of probabilistic models that we shall use in our algorithms. One of the most fundamental concepts is the factorization. We elaborate on the use of this basic concept in section 2.2. Finally, in section 2.3 we give an overview of the use of probabilistic models in previous work. Also, we indicate how we shall expand this use to more complex probabilistic models and present a general overview that classifies previous work and the current work on the basis of the probabilistic model.

2.1 Inside a probabilistic model

We write a = (a_0, a_1, ..., a_{|a|-1}) to denote a vector a of length |a|. Given a vector of indices c, we introduce the notation a⟨c⟩ = (a_{c_0}, a_{c_1}, ..., a_{c_{|c|-1}}), where c_i ∈ (0, 1, ..., |a|-1). We let L = (0, 1, ..., l-1) denote the vector of all l problem variable indices and S = (y^0, y^1, ..., y^{|S|-1}) the vector of selected samples. The probabilistic model that we define in this section describes a probability distribution over the vector of continuous random variables Y = (Y_0, Y_1, ..., Y_{l-1}). It is this probability distribution that we seek to estimate from S. To this end, we note that any probability distribution over continuous random variables consists of probability density functions. The probability density function (pdf) is the most fundamental part of a probability distribution and therefore of a probabilistic model.

Let a ⊑ L, meaning that a contains only elements of L. As a basis, we first state the definition of the multivariate joint pdf. Let dy⟨a⟩ = ∏_{i=0}^{|a|-1} dy_{a_i}. A multivariate joint pdf f_a(y⟨a⟩) over the random variables Y⟨a⟩ is then any non-negative function that integrates to one over the domain of Y⟨a⟩:

∫ f_a(y⟨a⟩) dy⟨a⟩ = 1   (1)

We write P(Y⟨a⟩)(y⟨a⟩) for f_a(y⟨a⟩), making P(Y⟨a⟩) a pdf. In this way, we can use the common probability notation P(Y⟨a⟩) over random variables Y⟨a⟩ instead of writing f_a(y⟨a⟩) for a density function. Furthermore, the use of this notation allows us to catch the properties of probability distributions in both the elementary case of a single pdf as well as higher order cases in which the distribution is factorized. One such definition is that of conditional probability, which will prove to be one of the most important. We let b ⊑ L and define a ⊔ b to be the splicing of the two vectors so that the elements of b are placed behind the elements of a, giving |a ⊔ b| = |a| + |b|. Using the definition of multivariate probability, we can write conditional probability as:

P(Y⟨a⟩ | Y⟨b⟩) = P(Y⟨a ⊔ b⟩) / P(Y⟨b⟩)   (2)

Combining the multivariate joint pdf from equation 1 with the multivariate conditional probability from equation 2, we can construct a probability distribution over Y. This can be done by specifying a single multivariate joint pdf, but this can also be done by constructing a factorization.
Since the product of two pdfs is again a pdf, the probability distribution over Y can be factored such that it becomes a product of multivariate pdfs (either joint or conditional). Actually, we can restrict any factorization to use only multivariate conditional pdfs from equation 2, since we have that:

P(Y⟨a⟩) = ∏_{i=0}^{|a|-1} P(Y_{a_i} | Y⟨(a_{i+1}, a_{i+2}, ..., a_{|a|-1})⟩)   (3)

If we do not replace a joint pdf P(Y⟨a ⊔ b⟩) with a product of two joint pdfs P(Y⟨a⟩)P(Y⟨b⟩), there is a dependency between variables Y⟨a⟩ and Y⟨b⟩. In this case, we speak of an unconditional dependency. For any pair of variables Y_i and Y_j, we have that it is either beneficial to use P(Y_i | Y_j) instead of P(Y_i) or it is not. If it is more beneficial to use the conditional probability, we have a conditional dependency between variables Y_i and Y_j. The most notable difference between the two dependencies is that conditional dependencies are directed, whereas unconditional dependencies are not. Even though we can express any multivariate joint pdf with multivariate conditional pdfs, it can still be beneficial to use only multivariate joint pdfs in a factorization. The reason for this is that conditional pdfs can be analytically and computationally more difficult to handle.

To find a good probabilistic model, we can thus for instance attempt to find a good factorization of the probability distribution and then fit a pdf over each multivariate joint or multivariate conditional pdf in the factorization. We have however refrained so far from giving the actual definition of a factorization. In section 2.2 we go into the details of factorizations and formalize them. Here, we shall restrict ourselves to an example of a factorization and subsequently use it in the definition of a probabilistic model. We denote a factorization by f. An example of a factorization is P_f(Y_0, Y_1, Y_2) = P(Y_0)P(Y_1, Y_2). Again, this is a pdf, but as we use a product of elementary pdfs, we refer to it as a probability distribution. On the other hand, the probability distribution that is factorized into a single factor P_f(Y_0, Y_1, Y_2) = P(Y_0, Y_1, Y_2) is equal to the unfactorized probability distribution and is equal to an elementary pdf. Still, we assume for each element in the factorization that we fit a pdf over it. We call the resulting probability distribution a factorized probability distribution.

With the exception of the work by Pelikan [25], the approaches in the field of building and using probabilistic models have so far used a factorization as the structure of the probabilistic model. The usefulness of a factorization becomes apparent when we increase the dimensionality of the problem. For instance, in the case of discrete variables, we estimate the full joint probabilities by counting the amount of times some combination occurs in the sample vector and dividing it by the total amount of samples. One of these combinations does not have to be computed, as it is one minus the sum over the probabilities of the other combinations. For l dimensions and binary random variables, this means that we have to estimate 2^l - 1 probabilities; for l = 30 this already amounts to over a billion values to fit. This exponential scaling behavior is the most important reason why a factorization can be useful. If the interactions between the problem variables can be effectively caught in a bounded complexity order probabilistic model, the optimization process is efficient. However, if the amount of parameters that has to be computed to fit the pdf is computationally expensive, such as exponential, optimization by using probabilistic models is inherently a non-scalable approach.
Note that whether or not this is possible depends both on the algorithm that searches for a factorization as well as on the optimization problem. A factorization of a probability distribution, along with the parameters for the pdfs to fit over the factors, could thus serve as the definition of a probabilistic model. Even though this definition has been used in most approaches so far, it does not allow for modeling at a higher level, being a mixture of factorizations. Such a mixture allows us to apply a clustering strategy first, after which a factorization can be found in each cluster separately. This way, non-linearity and symmetry in the sample vector can effectively be tackled and processed.

Even though a mixture of factorizations is a very general structure that allows for many probabilistic models, we use a more general notation that allows for any type of structure. In this paper, the instances of the structure in the probabilistic model so defined will be the factorization and the mixture of factorizations. In general, we define a probabilistic model to have some sort of structure ς and a vector of parameters θ for the pdfs as implied by ς. We call ς the probabilistic model structure and θ the probabilistic model parameters. As noted before, ς is likely to be constrained, as using a maximum complexity structure does not scale up. Therefore, if ς is a factorization, it is usually constrained to bounded order interactions between variables. We denote the space of all ς that are allowed within such constraints by C_ς. Usually, such constraints are not placed on the parameters θ, but we denote the constrained space for the parameters by C_θ. Formally, a probabilistic model M can thus be defined as a tuple, just as can be done for its complete constrained space C of allowed models:

M = (ς, θ) ∈ C = (C_ς, C_θ)   (4)

Figure 1: An empty factorization containing only the variables.

In the case that we use a single factorization for ς, we have ς = f. We shall also make use of a mixture of factorizations. We denote the amount of components in the mixture by k and let K = (0, 1, ..., k-1), so that the mixture of factorizations can be written as f⟨K⟩ = (f^0, f^1, ..., f^{k-1}). As our goal is to use the probabilistic model in optimization, we have to approximate the probability distribution of the given sample vector S. The resulting probability distribution that is based on the model that we find is an approximation to the true probability distribution. Therefore, we write the probability distribution implied by model M as P̂_M(Y), which is an approximation to the true underlying probability distribution P(Y) over S.

The amount of parameters |θ| depends on the choice of the pdfs to use as well as on the structure ς. However, the pdf to fit over every factor implied by ς is chosen beforehand. The way in which the parameters θ are fit is also predefined beforehand. We denote the parameter vector that is obtained in this manner by θ^fit. Once we fit the parameters θ^fit, we can sample from the resulting probability distribution to achieve an iterated search procedure. However, we have already remarked that we shall restrict our attention to the use of factorizations and combinations thereof. To be able to build factorizations, we first elaborate on them in the next section.

2.2 Factorizations and their graphs

A factorization implies a product of pdfs. If such a product is a function of variables y⟨a⟩, it is a valid probabilistic model structure over random variables Y⟨a⟩. The product itself is closely related to the dependencies between the variables.
We can start by noting that two variables are either unconditionally dependent or they are not. Furthermore, we may state that variable Y_i is either conditionally dependent on variables Y⟨a⟩ with i ∉ a or it is not. These relations constitute a factorization. This structure can be modeled using graphs. If we identify a vertex for every random variable Y_i and introduce an arc (Y_i, Y_j) if and only if Y_j is conditionally dependent on Y_i, we get the conditional factorization graph. In the case of unconditional dependence, a connected component in the graph implies that all variables in the connected component are jointly dependent. As such, a connected component models a multivariate joint pdf. A conditional dependence arc imposes a conditional probability in the probability distribution over Y.

For example, figure 1 shows an empty factorization over 5 variables. The underlying factorization can be specified as P(Y) = P(Y_0)P(Y_1)P(Y_2)P(Y_3)P(Y_4). Using only unconditional dependencies, figure 2 is an example of an unconditional factorization graph that models P(Y) = P(Y_0, Y_2)P(Y_1, Y_3, Y_4). As a final example, figure 3 shows a factorization graph using conditional dependencies that models P(Y) = P(Y_0|Y_2)P(Y_1|Y_2 Y_3 Y_4)P(Y_2|Y_3 Y_4)P(Y_3|Y_4)P(Y_4) = P(Y_0|Y_2)P(Y_1, Y_2, Y_3, Y_4). The latter equality holds because of equation 3.

Figure 2: A non-empty unconditional factorization representing only joint dependencies.

Figure 3: A non-empty conditional factorization representing conditional dependencies.

It follows from equation 3 that using unconditional factorizations that imply joint probabilities is a special case of using conditional factorizations. Using these conditional factorizations, we can specify any factorization. Note that a factorization is valid if and only if its factorization graph is acyclic. This can be formalized [8] by introducing a function π(·) that returns a vector of parent variable indices π(i) = (π(i)_0, π(i)_1, ..., π(i)_{|π(i)|-1}) for variable Y_i, together with an ordering of the variables ω = ω⟨L⟩. Using (π, ω), constraints can be specified to enforce that the factorization graph is acyclic. Furthermore, by scanning the variables in the order Y_{ω_{l-1}}, Y_{ω_{l-2}}, ..., Y_{ω_0}, the variables that some variable is conditioned on will already have been regarded. A conditional factorization can be uniquely specified by a pair (π, ω) as follows:

P_{(π,ω)}(Y) = ∏_{i=0}^{l-1} P(Y_{ω_i} | Y⟨π(ω_i)⟩)   (5)

such that for every i ∈ L we have ω_i ∈ L, all entries of ω are distinct, and every variable in π(ω_i) appears later in the ordering ω than ω_i does, which ensures that the factorization graph is acyclic. Using this definition, the factorization in figure 3 can be specified as ω_0 = 0, ω_1 = 1, ω_2 = 2, ω_3 = 3, ω_4 = 4 and π(0) = (2), π(1) = (2, 3, 4), π(2) = (3, 4), π(3) = (4) and π(4) = ().

If we use a product of multivariate joint pdfs, the factorization becomes a marginal product model. We define the node vector ν to be a vector of vectors of connected component indices ν_i = ((ν_i)_0, (ν_i)_1, ..., (ν_i)_{|ν_i|-1}). Any unconditional factorization can be uniquely specified using ν:

P_ν(Y) = ∏_{i=0}^{|ν|-1} P(Y⟨ν_i⟩)   (6)

such that the ν_i are non-empty, mutually disjoint, and jointly contain every index in L. The above constraints enforce that the vector of available random variables Y is partitioned into |ν| disjoint subvectors. Over each of these subvectors, a single multivariate joint pdf is fit to obtain the factorized probability distribution. Using this definition, the factorization in figure 2 can be specified as ν_0 = (0, 2), ν_1 = (1, 3, 4), |ν| = 2.

We conclude this section by noting that we have formalized two general types of factorizations. One is defined using multivariate conditional probabilities, whereas the other is defined using multivariate joint probabilities. These are the two factorizations that we use:

f ∈ {(π, ω), ν}   (7)
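For concreteness, the following minimal Python sketch (ours; the naming and representation are illustrative assumptions, not part of the formalization above) encodes the factorizations of figures 2 and 3 in the two structures just defined and verifies the acyclicity constraint of the conditional factorization graph with a topological sort.

# Unconditional factorization (figure 2): the node vector nu partitions
# the variable indices L = (0, ..., l-1) into disjoint groups.
nu = [[0, 2], [1, 3, 4]]

# Conditional factorization (figure 3): pi maps each variable to its
# parent indices; omega is the order in which the variables are regarded.
pi = {0: [2], 1: [2, 3, 4], 2: [3, 4], 3: [4], 4: []}
omega = [0, 1, 2, 3, 4]

def is_acyclic(pi):
    """Check that the parent function pi defines an acyclic graph by
    attempting a topological sort (Kahn's algorithm)."""
    indeg = {i: len(parents) for i, parents in pi.items()}
    children = {i: [] for i in pi}
    for i, parents in pi.items():
        for p in parents:
            children[p].append(i)
    ready = [i for i, d in indeg.items() if d == 0]
    seen = 0
    while ready:
        v = ready.pop()
        seen += 1
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return seen == len(pi)

assert is_acyclic(pi)  # figure 3 contains no cycles, so (pi, omega) is valid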
2.3 An overview of things that were and of things to come

In previous work on continuous models [8, 10, 11, 12, 13, 22, 18, 30, 31], a factorization f was used as the structure ς of the probabilistic model M. In some cases, constraints in addition to the general acyclic graph constraints have been applied. Furthermore, with the exception of the work by Bosman and Thierens [12] and Gallagher, Frean and Downs [18], a single multivariate normal pdf was used. In the work by Bosman and Thierens [12], the normal kernels pdf is proposed. However, the normal kernels pdf is very hard to handle and has therefore not been put to practice as much as has been done for the normal pdf. Gallagher, Frean and Downs [18] propose the use of the flexible multivariate normal mixture pdf, but only with the univariate factorization in which every variable is regarded independent of the others. In this paper, we show how the multivariate normal mixture pdf can be used in the case of a general factorization.

Roughly, we can say that the research on the use of probabilistic models in continuous domains has been limited to models where the structure is a factorization and the parameters are those of a normal pdf (at least in the general factorization case). If we denote the normal pdf by f_N, we can indicate this by writing ς = f and θ ← f_N. In this paper, we show how the normal mixture pdf can be used in the general factorization case, giving θ ← f_NM. By using the normal mixture pdf, we can apply a powerful density estimator in our optimization procedure. However, this does not necessarily allow for the breaking of symmetry or the fitting of non-linearity in the optimization problem. To achieve this, we propose to use a mixture of factorizations. To this end, we cluster the sample vector into subvectors and apply density estimation to each subvector. We thereby change the structure of the model to ς = f⟨K⟩.

Figure 4 graphically depicts the previous work and the work that we describe in this paper. The approaches in the top row make use of a single factorization f. This is what has been done so far for continuous optimization problems. In the case of a general factorization, the approaches have been limited to the use of a single normal pdf, which is the top left entry in the table in figure 4. The approaches in the right column make use of the normal mixture pdf. The approaches in the bottom row use a mixture of factorizations. In this paper, we use clustering algorithms to partition the sample vector into subvectors. A factorization is found for each so constructed subvector, thereby achieving the mixture of factorizations. Concluding, in the general case, the approaches so far for continuous optimization problems have used the top left entry in the table in figure 4. In this paper, we show how the remaining three entries in the table of figure 4 can be established in search of an improvement over the approaches for continuous domains so far.

3 Model selection

Given a vector S of data points, it is very valuable to be able to infer from what probability distribution these data points were sampled. Given such information, we can draw conclusions on the source of these data points and make predictions for future data coming from similar sources. In some applications we are not concerned with finding the full probability distribution, but with certain aspects of it under certain conditions.
The use of linear regression to find out whether variables or events are correlated is an example of such an approach. Analysis of this sort is applied whenever induction is to be performed, given certain observations. Real life examples are for instance observations of hospitalized patients and the question of whether one symptom is caused by another or whether a certain disease is always accompanied by high fever, so that this can be taken into account whenever a new patient with this disease arrives. The least restrictive of these inductive questions, which is the desire to know the underlying probability distribution, is called model selection.

Figure 4: An overview of the algorithms that build and use probabilistic models. The rows correspond to the model structure (a single factorization ς = f or a mixture of factorizations ς = f⟨K⟩) and the columns to the parameters (normal pdf θ ← f_N or normal mixture pdf θ ← f_NM). The top left entry represents the current algorithms in the case of a general factorization. The other entries represent algorithms that are proposed in this paper.

The data points in vector S are the result of sampling from some model. It is this model that we set out to find or approximate. To be more precise, we utilize this inferred model information in order to perform global optimization. We start out in section 3.1 by going over two general model selection techniques. In section 3.2, we formulate an algorithm that systematically and incrementally finds a factorization given a vector of samples. Subsequently, in section 3.3, we go into the selection of a mixture of factorizations using clustering techniques.

3.1 General techniques

In order to decide what model we wish to use to represent the data, many automated approaches can be used. Depending on the choice of the pdf, specialized statistical tests can be derived to find a factorization, for example. In this section however, we wish to refrain from selecting any instances or additional constraints for the model. Instead, we give two general approaches to selecting an appropriate model based upon the data.

In this paper, we see model selection as an iterative process. In any iteration, we have some candidate models that may possibly replace the current model. In section 3.2 we show how such an iterative algorithm can be formalized in order to select a model with ς = f. Here, we concern ourselves with a single iteration of such an algorithm. From a set of candidate models, we wish to select the most promising one to replace the current model. If no candidate model seems to be promising, the iterative algorithm halts and the current model is taken to be the result of model selection.

We assume that a set of candidate models for some given model is available. In section 3.2 we shall show how such a set can be obtained in the case where we want to find a suitable factorization. In the iterative process, we denote the current model by M_0. In order to select the best model from the set of candidates, we go over each of the candidate models and compare them to M_0. In the remainder of this section, we focus on one such comparison and denote the candidate model by M_1.

In order to compare M_1 to M_0, we distinguish two approaches. One approach is to test whether the use of M_1 is a significant improvement over M_0 in representing S. To this end, statistical hypothesis tests can be used. A gentle introduction to statistical hypothesis testing is given in appendix B.
We define the general statistical hypothesis test for the comparison of the models as follows:

Model Selection Test Hypothesis (MSTH) Probabilistic model M_1 cannot better describe the given samples than can probabilistic model M_0.

We can use statistical techniques to test the MSTH. Such a test returns whether or not the hypothesis should be accepted. A certain statistic is shown to follow some distribution. A hypothesis is tested by using so-called critical values for the statistic, given a certain significance threshold. The MSTH is accepted if the statistic is for instance below a critical value associated with the threshold, or rejected if it is above the critical value. In such a case we speak of a right-sided test.

Statistical hypothesis tests can be used to select a replacement model from the candidate set. As the test returns whether M_1 is a better descriptive model than is M_0 for the given samples, we can go over the candidates and take the first model from the candidate set that results in rejecting the MSTH to replace M_0. However, the set of candidates will be generated in an automated fashion. To prevent going over the candidates in the same manner every time, so that possibly a certain type of candidate model is always tested before some other type of candidate model, the candidates should be accessed in a random order. In section 3.1.1 we show how a general instance for the MSTH can be formalized based on the likelihood of a probabilistic model.

The second approach to comparing M_1 with M_0 is to use some metric. If we have a metric that informs us of how well some model represents S, we can search amongst the candidate models to find the model that results in the largest increase or decrease of the metric, depending on whether we have to respectively maximize or minimize it. If the metric can no longer be improved upon, since every candidate leads to a degradation of the metric, the iterative model selection procedure can be halted. This approach has become a very popular one, especially within the field of Bayesian statistics. In section 3.1.2 we discuss a general, transparent and useful metric that has a clear correspondence with the MSTH instance in section 3.1.1.

Before moving to describe both a general MSTH as well as a metric, we note that an MSTH can actually also serve as a metric. Assuming such a statistic s, a critical value s̄ and a right-sided test, a statistical hypothesis test returns true if and only if s > s̄. However, the amount by which s is larger (or smaller, depending on the type of test) than s̄ can be seen as a measure of how strongly the MSTH is rejected. This can be used in some sort of metric for which we may find the maximum over all candidate models. The metric to maximize is then equal to s - s̄.

3.1.1 Likelihood

In this section, we focus on finding a general statistical hypothesis test so as to perform model selection by testing the MSTH. In appendix B more details are presented on statistical hypothesis testing in general. Also an example is given of how unconditional dependencies can be tested using well known statistical measures. Here however, we focus on a hypothesis test that is directly based on the goodness of two fits. Assuming that the samples were drawn independently from the underlying distribution, the likelihood of the samples, given the estimated probability distribution P̂_M(Y), is defined as:

L(S | P̂_M(Y)) = ∏_{i=0}^{|S|-1} P̂_M(Y)(y^i)   (8)

It is common practice to use the negative logarithm of the likelihood measure, as it is often computationally more convenient.
The resulting expression is called the negative log-likelihood:

-ln(L(S | P̂_M(Y))) = -∑_{i=0}^{|S|-1} ln(P̂_M(Y)(y^i))   (9)

We would like to maximize L(S | P̂_M(Y)) or, equivalently, to minimize the negative log-likelihood. However, this minimization is unbounded if the model is unconstrained. Consider for instance a pdf that concentrates all of its density arbitrarily closely around the sample points, so that

∫_{a_0}^{b_0} ∫_{a_1}^{b_1} ... ∫_{a_{l-1}}^{b_{l-1}} P̂_M(Y)(y) dy⟨L⟩ = 1   (10)

with each interval [a_i, b_i] chosen as tightly around the samples as possible. Such a pdf attains an ever larger likelihood, but it describes no more than the sample vector and is therefore said to overfit the data. This is a very important aspect of density estimation and is termed generalization. Overfitting tends to occur not only in such specific cases as in equation 10, but also often in cases where a combination of many pdfs is used, such as a kernel method. If overfitting takes place, the optimization algorithm that samples more points from the pdf will tend to converge rapidly into a fixed form and give poor results [10]. Therefore, a parametric model with few parameters, such as the normal pdf, is usually preferred, as it cannot easily overfit a set of samples.

Given a parametric model and thus a set of parameters for the involved pdfs, a general iterative method can be derived to estimate the set of pdf parameters so as to minimize the negative log-likelihood and get a maximum likelihood fit of the estimated probability distribution. Such a method is given by the EM algorithm, which we discuss in more detail in section 4.3.2. In the EM algorithm however, the model structure is fixed beforehand. In order to find the model structure, we can test whether some candidate model M_1 has a lower negative log-likelihood value than the current model M_0. If this is the case, M_1 is a better description than is M_0. Therefore, M_1 should replace M_0. In order to perform the negative log-likelihood test, the involved pdfs first have to be computed. To test whether the likelihood of the probability distribution implied by model M_1 is larger than that of the probability distribution implied by model M_0, we observe the negative log-likelihood difference:

-ln(L(S | P̂_{M_0}(Y))) - (-ln(L(S | P̂_{M_1}(Y)))) = ∑_{i=0}^{|S|-1} ln(P̂_{M_1}(Y)(y^i)) - ∑_{i=0}^{|S|-1} ln(P̂_{M_0}(Y)(y^i))   (11)

Observe that, for incrementally extended models fit with maximum likelihood, the negative log-likelihood difference as stated in equation 11 is always larger than or equal to 0. This implies that if we incrementally build a model by increasing its complexity, we will always prefer the more complex model if we are able to fit the data with a maximum likelihood. This is however not desirable. If the negative log-likelihood difference is only very small, the more complex model does indeed fit the data slightly better, but this comes at the expense of a larger running time since the model is more complex. By selecting the more complex model, we require a lot more time to perhaps gain a hardly noticeable difference in goodness of fit. This poses a well known trade-off between running time and expressional power.

What we require is to be able to test whether some negative log-likelihood difference is significant. If it is, this implies that we should select the more complex model, as it would significantly improve the resulting fit. We have then justified the selection of the more complex model. To this end, we use a statistical hypothesis test. If we have some random variable R along with N values r^0, r^1, ..., r^{N-1} for it, then as a result of the central limit theorem the sample mean R̄ is approximately normally distributed. If we now assume to have two such random variables R_0 and R_1 along with a hypothesized value h for the difference of their means, we observe:

T = √N (R̄_0 - R̄_1 - h) / σ̂_{R_0 - R_1}   (12)

where σ̂_{R_0 - R_1} denotes the sample standard deviation of the observed differences. Under the hypothesis, T approximately follows a Student t-distribution with N - 1 degrees of freedom:

T ∼ t_{N-1}   (13)

In our case, we let r_j^i be the negative log-likelihood of sample y^i under probability distribution P̂_{M_j}(Y), j ∈ {0, 1}. We then thus have that:

r_j^i = -ln(P̂_{M_j}(Y)(y^i))   (14)

We want to test whether model M_1 is to be preferred over model M_0. To do this, we set h = 0 and perform a right-sided test. This means that we are testing whether the expected value of R_0 - R_1 is significantly larger than 0, with which each step towards a more complex model is justifiable.
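As an illustration of the test just described, the following minimal Python sketch (ours, using normal pdfs fit by maximum likelihood) compares a univariate factorization against the full joint model on correlated data by performing the right-sided test on the per-sample negative log-likelihood differences.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
S = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)

# Model M0: univariate factorization P(Y0)P(Y1); model M1: joint P(Y0, Y1).
mu = S.mean(axis=0)
cov_full = np.cov(S, rowvar=False)
cov_diag = np.diag(np.diag(cov_full))

# r_j^i = -ln(P_Mj(y^i)): per-sample negative log-likelihoods (equation 14)
r0 = -stats.multivariate_normal(mu, cov_diag).logpdf(S)
r1 = -stats.multivariate_normal(mu, cov_full).logpdf(S)

# Right-sided test of the MSTH with h = 0 (equation 12)
d = r0 - r1
T = np.sqrt(len(d)) * d.mean() / d.std(ddof=1)
critical = stats.t.ppf(0.95, df=len(d) - 1)
print("T =", T, "reject MSTH:", T > critical)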
However, in all previous continuous approaches that use factorizations of a higher order than the univariate factorization as the model structure, such justification has not been used directly. Instead, the Kullback-Leibler (KL) divergence from the approximated factorization to the full joint approximated factorization is used. By doing so, it can be shown [13] that the differential entropy over the approximated factorization is minimized. Given maximum likelihood estimates, it can furthermore be shown [13] that the differential entropy difference is the same as the negative log-likelihood difference from equation 11.

The equality of the KL divergence and the negative log-likelihood difference under the assumption of a maximum likelihood estimate implies that the previous approaches can be seen to have used the negative log-likelihood without justifying any model selection decisions. However, unless strong restrictions are imposed on the probabilistic model, using this metric will unnecessarily increase the complexity, since any more complex model will always result in a fit with a lower negative log-likelihood. We can therefore conclude that justification is truly required in order for the resulting optimization algorithm to be scalable on decomposable problems.

If we do not use any justification, we can refrain from using the average negative log-likelihood and use the differential entropy instead. Computing the average negative log-likelihood is a time consuming task, as it requires |S| evaluations. However, for the normal pdf, the differential entropy over variables Y⟨a⟩ can be shown (see for instance [15]) to be ½(|a| + ln((2π)^{|a|} det(S(a)))), where S(a) is the sample covariance matrix. However, in this section we do require justification. In such a case, it does not suffice to only compute the average negative log-likelihood, since we also require the variance over the negative log-likelihood to compute the test statistic T. Still, we shall see in section 3.2.3 that this remark can yet be of practical concern. For more involved pdfs such as the normal mixture model, we cannot efficiently compute the differential entropy. Furthermore, neither can we guarantee a maximum likelihood estimate using the EM algorithm. To compute the negative log-likelihood difference, we thus have to evaluate the implied probability distribution for each sample.

3.1.2 Complexity penalization

In the previous section, we have seen that we can minimize the error of the fit and justify each minimization based upon a large enough increase in the negative log-likelihood difference. However, even though one fit may indeed be better than the other, this does not mean that we are interested in using it from either a generalization or a computational effectiveness point of view. To enforce this, a penalty term that increasingly penalizes more complex models can be introduced. The metric M that should be minimized can then be formalized as follows:

M(P̂_M(Y) | S) = Error(P̂_M(Y) | S) + Complexity(P̂_M(Y) | S)   (15)

The use of a complexity term is often called regularization in data modeling fields. Regularization is used to prevent the model from overfitting the data. Classically, the regularization equation has a regularization parameter λ as a factor times the regularization term. Setting λ = 0 then reduces the metric to an unconstrained one with respect to error minimization. Here, we have used a general complexity term that could be of such a form.
However, we leave the actual contents completely unspecified, which places no restrictions on the complexity term. Furthermore, the metric from equation 15 is thereby completely transparent from the start.

To derive a metric of the form we have in equation 15, we note that we are interested in the probability of a probability distribution implied by a model M, given a sample vector S. The model that implies a probability distribution with the largest probability is the most preferable one, if we disregard complexity and generalization for the moment. The probability that we are interested in can be written as follows using Bayes' rule:

P(P̂_M(Y) | S) = P(P̂_M(Y)) P(S | P̂_M(Y)) / P(S)   (16)

As the sample vector S is fixed, P(S) is a constant and we can discard it from equation 16. The probability of S given some probability distribution P̂_M(Y) implied by a model M can be seen as the likelihood as defined in equation 8, meaning P(S | P̂_M(Y)) = L(S | P̂_M(Y)). In order to add a preference for simpler models, we want to set the probability of a probability distribution P̂_M(Y) implied by a model M in such a way that it gets smaller if its complexity increases. The complexity of P̂_M(Y) is given by the amount of parameters θ that is required to define every pdf in the model M. We can define the probability of some probability distribution P̂_M(Y) implied by M = (ς, θ) by using a function ϑ(θ), ϑ(θ) ≥ 0, of the amount of parameters |θ|:

P(P̂_M(Y)) = ϑ(θ) / ∑_{ς' ∈ C_ς} ϑ(θ'^fit)   (17)

Combining the above, the metric to maximize becomes:

A(P̂_M(Y) | S) = ϑ(θ) ∏_{i=0}^{|S|-1} P̂_M(Y)(y^i)   (18)

In order to compare one model to another, the normalization factor ∑_{ς' ∈ C_ς} ϑ(θ'^fit) is not required, since it is the same for every model; it has therefore been dropped in equation 18. If we set ϑ(θ) = 1, equation 18 becomes just the likelihood from equation 8. Alternatively, we can let the probability of some probability distribution P(P̂_M(Y)) implied by a model M decrease exponentially with its complexity by setting:

ϑ(θ) = e^{-|θ|}   (19)

The objective is to maximize the metric. Alternatively, we can take the negative logarithm of the metric and minimize it. Just as is the case for the likelihood, using the negative logarithm is computationally more convenient. This can directly be seen from the definition of the metric, since the exponentials disappear and the product of a large amount of values in [0, 1] becomes a sum. By doing so, we get an instance of the general metric expression in equation 15:

-ln(A(P̂_M(Y) | S)) = -ln(L(S | P̂_M(Y))) + |θ| = Error(P̂_M(Y) | S) + Complexity(P̂_M(Y) | S)   (20)

In the metric that we have arrived at, the negative log-likelihood is the error to minimize and the complexity term is the amount of parameters in the probabilistic model. The resulting metric is similar to the Akaike Information Criterion (AIC) [1]. The AIC equals 2 times the expression in equation 20. This constant factor does not influence the result when computing the difference and checking whether it is larger than 0. So basically, we have arrived at the AIC metric.

The AIC metric elegantly favors simpler models. If two probability distributions have the same likelihood, the one with the least amount of parameters is favored. At this point, we turn back to our note on regularization theory. Even though the AIC metric favors simpler models by increasingly penalizing more complex ones, the amount of penalization is fixed. By using a regularization parameter λ times the AIC metric, the regularization could be increased. However, this would only be by a constant factor, assuming that λ is a free parameter λ ∈ R. Other than by increasing some regularization parameter, we can also use a different metric that penalizes the complexity differently. The above derivation is a very general one.
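For concreteness, the metric of equation 20 can be computed as in the following minimal Python sketch (ours, assuming maximum likelihood normal fits); the λ-weighted variant anticipates the BIC metric introduced next.

import numpy as np
from scipy import stats

def neg_log_likelihood(S, mu, cov):
    # -ln(L(S | P_M(Y))), evaluated per sample and summed (equation 9)
    return -stats.multivariate_normal(mu, cov).logpdf(S).sum()

def n_params(cov):
    # |theta| for a normal pdf: mean entries plus the distinct entries
    # of the symmetric covariance matrix
    d = cov.shape[0]
    return d + d * (d + 1) // 2

def aic_metric(S, mu, cov):
    # equation 20: error term plus complexity term
    return neg_log_likelihood(S, mu, cov) + n_params(cov)

def bic_metric(S, mu, cov, lam=0.5):
    # lam plays the role of the regularization parameter lambda;
    # lam = 0.5 gives the classical BIC weighting
    return neg_log_likelihood(S, mu, cov) + lam * np.log(len(S)) * n_params(cov)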
To demonstrate its transparent use, we propose to set:

ϑ(θ) = e^{-λ ln(|S|) |θ|}

whereby we have introduced a parameter λ that has the same semantics as the regularization parameter. The resulting metric that we arrive at can be written as follows:

-ln(A(P̂_M(Y) | S)) = -ln(L(S | P̂_M(Y))) + λ ln(|S|) |θ| = Error(P̂_M(Y) | S) + Complexity(P̂_M(Y) | S)   (21)

The metric in equation 21 is commonly known as the Bayesian Information Criterion (BIC) [29]; for λ = ½ one obtains its classical form. This metric has been proposed before in iterated density estimation approaches [24, 19, 27]. The BIC metric has empirically been observed to give good results. It penalizes more complex models more heavily than does the AIC metric, but in such a way that the most important structural building blocks are found. The AIC metric is less efficient because it tends to let such building blocks be merged as well, which gives an unnecessarily complex model structure.

Before we conclude this section, we note that there is a correspondence between the metrics in equations 20 and 21 and the general hypothesis test in the previous section. In both cases, the negative log-likelihood has to be computed. In the case of the hypothesis test, the average negative log-likelihood difference is compared to a critical value. In the case of the penalization metrics, the negative log-likelihood difference is compared to a value as well. In this case, the value is based on the difference in amount of parameters. The two approaches thus result in a similar test for candidacy that can be written as:

ln(L(S | P̂_{M_1}(Y))) - ln(L(S | P̂_{M_0}(Y))) > τ   (22)

In the case of the general statistical hypothesis test, τ is equal to the critical value times the amount of samples |S|. This is an important difference between the two approaches. Beyond a certain size of the sample vector, the hypothesis test no longer depends on |S|, since the negative log-likelihood is divided by it before testing it against the critical value, and the critical value doesn't change significantly anymore above a certain size of |S|. However, the size of the sample vector is not factored out of the equation for the penalization metric. This means that if the average negative log-likelihood difference is some value ε ≥ 0, then the difference itself becomes |S|ε. This expression can grow to any value, given a large enough sample vector. For the BIC metric, τ also grows as the size of the sample vector increases, but logarithmically. Empirically, this has been observed to be effective in practice. The setting of τ in the case of the metrics is done in a desirable way that is not the case for the hypothesis test. Given enough samples, more complex models are allowed. The amount of parameters that a more complex model conveys is thereby thus in some way justified, as there are enough samples to estimate the parameters properly.
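The different growth of the threshold τ with |S| under the two approaches can be made concrete with a small sketch (ours; the critical value and the parameter difference are assumed, illustrative numbers):

import numpy as np

critical = 1.645     # assumed fixed critical value for large |S|
delta_params = 3     # assumed difference in the amount of parameters

for n in (50, 200, 1000, 10000):
    tau_test = critical * n                    # hypothesis test: grows linearly in |S|
    tau_bic = 0.5 * np.log(n) * delta_params   # BIC (lambda = 1/2): grows logarithmically
    print(n, tau_test, tau_bic)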
3.2 Factorization selection

In section 2.2, we observed that in order to obtain a valid factorization, we have to make sure that we do not introduce any cycles into the factorization graph. This is the most general and only necessary condition on the graph, both in the unconditional and the conditional case. In addition, other constraints can be imposed, such as that the graph must be a chain, a tree, or that it may have no interactions of an order above some maximum κ. Such additional constraints are especially useful if the decisions that regard introducing dependencies are not primarily based on the intrinsic data. In such a case, overly complex models are easily introduced if these additional constraints are absent.

To be able to use a general graph search algorithm effectively, we either have to justify the introduction of more complex models or we have to penalize them. In this way, we avoid introducing unnecessarily complex models that take up a lot of computation time. The idea of such a general graph search algorithm is to maintain a set of edges or arcs that may still be added to the factorization graph without introducing cycles. From this set, the next edge or arc is selected according to certain criteria and the set is filtered again, until either the set is empty or no edge or arc meets the criteria for being added anymore. If a scoring metric can be assigned to the factorization graph, we can add the arc or edge that increases the metric the most. When the set of addable edges or arcs is empty, or no edge or arc can be added anymore so that the scoring metric is further increased, the algorithm terminates. This becomes more attractive if the metric is decomposable, so that the metric increase can be computed based upon the involved variables only instead of upon the entire graph. This approach was first used in evolutionary optimization by Pelikan, Goldberg and Cantu-Paz [26] and subsequently by Muhlenbein and Mahnig [24] as well as Bosman and Thierens [8, 10, 11], all using different scoring metrics. In the general case, we leave the presence of such a metric to specific implementation details.

This brings us to define the general incremental factorization search algorithm. As the notions of conditional dependence and unconditional dependence are intrinsically different, as are the corresponding graph types, we will specify the general algorithm for both cases separately. Even though the unconditional case can be written as a special version of the conditional case, the importance of specifying different algorithms can also be seen from the fact that computing the conditional pdf can be very difficult in contrast to the multivariate joint pdf.
3.2.1 An incremental general factorization search algorithm

We start with the unconditional version of the general graph search. In this case, there is at most one edge between every pair of nodes. Initially, this means that a triangle (e.g. lower or upper) of the edge adjacency matrix without the diagonal is set to true. If an edge is set to true, it means that it may still be added to the graph as it does not introduce any cycles. Subsequently, the edge selection process is iterated. Each iteration, a single edge is selected to be added to the graph. This edge is to be chosen from the set of allowed edges {(i, j) | a[i, j] ∧ (i, j) ∈ L²}. If no selection can be made at some point, the iterated edge selection process is halted. If an edge is selected however, it is added to the graph and the set of edges that are still allowed to be added is updated. This is done by traversing the connected components belonging to the nodes that are incident to the edge that is to be added. Each edge between vertices of these opposite connected components is marked as false. If there are no more addable edges, the iterated selection stops. The actual factorization is determined afterwards.

In the case of conditional dependencies, we can directly make use of the definition in equation 5. In the case of unconditional dependencies, we can do this indirectly by using equation 3. However, we will end up with the requirement of specifying a conditional pdf for each element of the so defined factorization. Therefore, we note that in theory we may very well write unconditional dependencies using (π, ω), but in practice we shall use a different structure. The result of the general factorization search algorithm in the unconditional case is the node vector ν. This is also the case in the ECGA by Harik [19]. The resulting factorization in this case corresponds to the vertices in the connected components of the factorization graph. If we now leave the edge selection procedure to be defined exterior to the general factorization search algorithm and write separate functions for adding an edge to the graph and finding the vertices in the connected components, we have a general factorization search algorithm for the case of unconditional dependencies.

However, the algorithm in this case can be simplified. To see this, let ν_0 and ν_1 be two connected components in the graph. If we only observe the variables within these components, the probability distribution is given by the factorized distribution P({Y_v | v ∈ ν_0})P({Y_v | v ∈ ν_1}). Any edge (u, v) with u ∈ ν_0 ∧ v ∈ ν_1 will now result in testing the factorized distribution P({Y_v | v ∈ (ν_0 ⊔ ν_1)}) against the former factorized distribution. Hence, if the decision is made that the edge should be added to the graph, all other edges between ν_0 and ν_1 become unallowed. Moreover, if the decision is made that the edge should not be added to the graph, all other edges between ν_0 and ν_1 do not have to be tested anymore. In other words, the individual edges that connect a connected component are not important, as the vertices inside a single connected component make up a joint pdf in the resulting factorization. A graphical impression of this is depicted in figure 5. This implies that we can simplify the general algorithm in the case of unconditional dependencies by first placing every variable in a singleton vector. Subsequently, we find 2 vectors that are to be spliced. If no such combination of vectors can be found, the algorithm terminates. Otherwise, the splice operation is executed and the process is iterated. As each splice operation reduces the amount of available vectors by 1, the amount of splice operations in this algorithm is clearly bounded by l - 1.

Figure 5: Simplifying the general case of unconditional dependencies.

UnconditionalFactorizationGraphSearch()
1 ν ← new vector of (vector of integer)
2 |ν| ← l
3 for i ← 0 to l - 1 do
3.1 ν_i ← (i)
4 while |ν| > 1 do
4.1 (c_0, c_1) ← FindNodeVectors(ν)
4.2 if c_0 < 0 then
4.2.1 breakwhile
4.3 for i ← 0 to |ν_{c_1}| - 1 do
4.3.1 (ν_{c_0})_{|ν_{c_0}|} ← (ν_{c_1})_i
4.4 ν_{c_1} ← ν_{|ν|-1}
4.5 |ν| ← |ν| - 1
5 return(ν)

In the conditional case, both an arc (i, j) and an arc (j, i) may be present in the graph now. The amount of arcs initially is therefore twice as large as the amount of edges in the unsimplified unconditional case. The algorithm that finds an arc to add from the set of allowed arcs is named FindArc. Furthermore, we now require a cycle detection algorithm, which we specify as AddArc. This algorithm returns the amount of arcs that have become unallowed because of the addition of an arc. In order to define this algorithm, we require the vector of predecessor nodes v_p and the vector of successor nodes v_s. Algorithm AddArc is given in appendix D.
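Since algorithm AddArc itself is only given in appendix D, the following Python sketch (ours, illustrative only) shows the kind of bookkeeping it must perform: after arc (v_0, v_1) is added, every arc from a transitive successor of v_1 to a transitive predecessor of v_0 would close a cycle and is marked as no longer allowed.

def add_arc(v0, v1, allowed, pred, succ):
    """allowed: l x l boolean matrix of arcs that may still be added;
    pred[i] / succ[i]: sets of transitive predecessors / successors of i.
    Returns the number of arcs that become unallowed."""
    allowed[v0][v1] = False
    sources = pred[v0] | {v0}          # everything that now precedes v1
    targets = succ[v1] | {v1}          # everything that now succeeds v0
    for s in sources:
        succ[s] |= targets
    for t in targets:
        pred[t] |= sources
    removed = 1
    for t in targets:
        for s in sources:
            if allowed[t][s]:
                allowed[t][s] = False  # arc (t, s) would close a cycle
                removed += 1
    return removed

# toy usage on l = 3 variables
l = 3
allowed = [[i != j for j in range(l)] for i in range(l)]
pred = [set() for _ in range(l)]
succ = [set() for _ in range(l)]
add_arc(0, 1, allowed, pred, succ)
add_arc(1, 2, allowed, pred, succ)
assert not allowed[2][0]               # arc (2, 0) would create a cycle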
The resulting factorization can now be specified using (π, ω). The parent variable indices in the vector function π can directly be computed from the graph. As the graph contains no cycles, computing a topological sort gives the required ordering ω:

ConditionalFactorizationGraphSearch()
1 a ← new array of boolean in 2 dimensions with size l × l
2 v_p, v_s ← 2 new arrays of (vector of integer) with size l
3 for i ← 0 to l - 1 do
3.1 for j ← 0 to l - 1 do
3.1.1 a[i, j] ← (i ≠ j)
4 (v_0, v_1) ← FindArc(a, v_p, v_s)
5 while v_0 ≥ 0 do
5.1 AddArc(v_0, v_1, a, v_p, v_s)
5.2 (v_0, v_1) ← FindArc(a, v_p, v_s)
6 return(TopologicalSort(v_p))

We now have a detailed outline of an incremental general factorization search algorithm for both the conditional and the unconditional case. In order to complete the search algorithms, we must specify how to find two vectors to splice or how to find an arc to add to the graph. A general algorithm for this is dependent on the type of information we wish to guide model selection by. If we have a statistical hypothesis test through which we can decide whether or not two vectors should be spliced or an arc should be added to the graph, we can randomly order the possible splices or arcs still to be added to the graph and test them in this order until an object is found for which the test evaluates to true. If every test fails, the resulting graph has been found. We can formalize this algorithm for the unconditional case as follows (Test denotes the exterior splice test):

FindNodeVectorsByTesting(ν)
1 splices ← new array of (integer, integer) with size ½|ν|(|ν| - 1)
2 k ← 0
3 for i ← 0 to |ν| - 1 do
3.1 for j ← i + 1 to |ν| - 1 do
3.1.1 splices[k] ← (i, j)
3.1.2 k ← k + 1
4 for i ← 0 to ⌈½k⌉ - 1 do
4.1 r_1 ← RandomNumber(k)
4.2 r_2 ← RandomNumber(k)
4.3 (v_0, v_1) ← splices[r_1]
4.4 splices[r_1] ← splices[r_2]
4.5 splices[r_2] ← (v_0, v_1)
5 for i ← 0 to k - 1 do
5.1 if Test(splices[i], ν) then
5.1.1 return(splices[i])
6 return((-1, -1))

The conditional case is nearly the same, with slight changes. These changes involve using the predecessor nodes vector v_p and the successor nodes vector v_s instead of the node vector ν, as well as making the for loops regard variables 0 up to and including l - 1 and only selecting arcs that are still allowed:

FindArcByTesting(a, v_p, v_s)
1 arcs ← new array of (integer, integer) with size l(l - 1)
2 k ← 0
3 for i ← 0 to l - 1 do
3.1 for j ← 0 to l - 1 do
3.1.1 if a[i, j] then
3.1.1.1 arcs[k] ← (i, j)
3.1.1.2 k ← k + 1
4 for i ← 0 to ⌈½k⌉ - 1 do
4.1 r_1 ← RandomNumber(k)
4.2 r_2 ← RandomNumber(k)
4.3 (v_0, v_1) ← arcs[r_1]
4.4 arcs[r_1] ← arcs[r_2]
4.5 arcs[r_2] ← (v_0, v_1)
5 for i ← 0 to k - 1 do
5.1 if Test(arcs[i], v_p, v_s) then
5.1.1 return(arcs[i])
6 return((-1, -1))

If on the other hand we have a metric for how well a factorization fits as a model, we can try to optimize this metric in a greedy fashion by performing the splice or by adding the arc that increases this metric the most. If there is no such possible splice or arc left, the resulting graph has been found. This approach is derived from the Bayesian approach we mentioned earlier. We can formalize the unconditional variant of the algorithm as follows (MetricIncrease denotes the exterior metric computation):

FindNodeVectorsByMetric(ν)
1 δ_max ← 0
2 i_max ← -1
3 j_max ← -1
4 for i ← 0 to |ν| - 1 do
4.1 for j ← i + 1 to |ν| - 1 do
4.1.1 δ ← MetricIncrease(ν, i, j)
4.1.2 if δ > δ_max then
4.1.2.1 δ_max ← δ
4.1.2.2 i_max ← i
4.1.2.3 j_max ← j
5 return((i_max, j_max))

The conditional variant of the above algorithm is quite the same. Instead of the node vector ν, the predecessor nodes vector v_p and the successor nodes vector v_s are required. The for loops go over every combination of two variables (i, j) and only compute the metric change if the arc (i, j) is still allowed to be added to the graph. This results in the following algorithm:

FindArcByMetric(a, v_p, v_s)
1 δ_max ← 0
2 i_max ← -1
3 j_max ← -1
4 for i ← 0 to l - 1 do
4.1 for j ← 0 to l - 1 do
4.1.1 if a[i, j] then
4.1.1.1 δ ← MetricIncrease(i, j, v_p, v_s)
4.1.1.2 if δ > δ_max then
4.1.1.2.1 δ_max ← δ
4.1.1.2.2 i_max ← i
4.1.1.2.3 j_max ← j
5 return((i_max, j_max))

The only ingredient missing from our general factorization search algorithms is either a metric or a test. In most previous approaches [5, 7, 8, 12, 19, 24, 26], a metric was used in combination with a search algorithm. In a selection of these cases, a general factorization search algorithm was used [8, 12, 24, 26]. The use of a test to find a factorization has been employed in a single case using discrete random variables [28] and in previous work on continuous random variables [13].
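To connect the above pseudocode to a runnable form, the following minimal Python sketch (ours) combines the simplified unconditional search with a greedy, metric-driven selection in the spirit of FindNodeVectorsByMetric; metric_gain is an assumed placeholder for the decomposable metric increase of a splice.

def search_unconditional(l, metric_gain):
    """metric_gain(c0, c1) -> float: improvement of splicing node
    vectors c0 and c1 (positive means the splice is beneficial)."""
    nu = [[i] for i in range(l)]
    while len(nu) > 1:
        best, pair = 0.0, None
        for i in range(len(nu)):
            for j in range(i + 1, len(nu)):
                gain = metric_gain(nu[i], nu[j])
                if gain > best:
                    best, pair = gain, (i, j)
        if pair is None:
            break                      # no splice improves the metric
        i, j = pair
        nu[i] = nu[i] + nu[j]          # splice the two node vectors
        nu.pop(j)
    return nu

# toy metric: reward grouping indices of equal parity (illustrative only);
# the result is a marginal product model over the even and the odd indices
print(search_unconditional(4, lambda a, b: 1.0 if (a[0] - b[0]) % 2 == 0 else -1.0))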
3.2.2 Negative log-likelihood statistical hypothesis testing

Let f_0 be the current factorization and f_1 be the factorization that results when the arc is added to the graph or when the two vectors are spliced. Given these two factorizations, we want to have a test that tells us which of these two factorizations significantly better fits the given sample points. To this end, we specify the following hypothesis that is derived from the MSTH:

Factorization Selection Test Hypothesis (FSTH) A probability distribution based on factorization f_1 cannot better describe the given samples than can a probability distribution based on factorization f_0.

If the FSTH is rejected, the test returns that the arc (v_0, v_1) should be added to the graph or that the two node vectors ν_{c_0} and ν_{c_1} should be spliced. A statistical hypothesis testing metric for selecting a factorization is readily available, since in section 3.1.1 a general statistical hypothesis test on the basis of the negative log-likelihood difference was given for any probabilistic model. Combining this test with the factorization search algorithms from section 3.2.1, we now have a factorization selection method based on statistical hypothesis tests.

Regarding efficiency, we note that computing the negative log-likelihood is decomposable over the factorization. This means that if we test to alter the model over some arc or by merging some sets, we only have to compute the negative log-likelihood difference for the involved variables. In the case of undirected dependencies, the negative log-likelihood can be written as:

-ln(L(S | P̂_ν(Y))) = -∑_{i=0}^{|S|-1} ln(∏_{j=0}^{|ν|-1} P̂(Y⟨ν_j⟩)(y^i⟨ν_j⟩)) = -∑_{j=0}^{|ν|-1} ∑_{i=0}^{|S|-1} ln(P̂(Y⟨ν_j⟩)(y^i⟨ν_j⟩))   (23)

Assume that ν^0 is the node vector for the base model and that ν^1 is the node vector for the new model. Furthermore, we assume that they differ only in vectors c_0 and c_1, which are separate elements of ν^0 and appear spliced as c_0 ⊔ c_1 in ν^1. The negative log-likelihood difference then becomes:

-ln(L(S | P̂_{ν^0}(Y))) - (-ln(L(S | P̂_{ν^1}(Y)))) = ∑_{i=0}^{|S|-1} (ln(P̂(Y⟨c_0 ⊔ c_1⟩)(y^i⟨c_0 ⊔ c_1⟩)) - ln(P̂(Y⟨c_0⟩)(y^i⟨c_0⟩)) - ln(P̂(Y⟨c_1⟩)(y^i⟨c_1⟩)))   (24)

since all factors that are not involved in the splice operation cancel out in the difference.

In the case of conditional probabilities, assume we add arc (v_0, v_1) to the graph, meaning that Y_{v_1} is made conditionally dependent on Y_{v_0}. In an analogous way, we find that given conditional factorizations (π^0, ω^0) and (π^1, ω^1) such that π^1(v_1) = (v_0) ⊔ π^0(v_1), the negative log-likelihood difference becomes:

-ln(L(S | P̂_{(π^0,ω^0)}(Y))) - (-ln(L(S | P̂_{(π^1,ω^1)}(Y)))) = ∑_{i=0}^{|S|-1} (ln(P̂(Y_{v_1} | Y⟨π^1(v_1)⟩)(y^i⟨(v_1) ⊔ π^1(v_1)⟩)) - ln(P̂(Y_{v_1} | Y⟨π^0(v_1)⟩)(y^i⟨(v_1) ⊔ π^0(v_1)⟩)))   (25)

because the parts in the factorization that do not change cancel out in the difference.
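As an illustration of equation 24, the following minimal Python sketch (ours, with maximum likelihood normal pdfs over each factor) computes the negative log-likelihood gain of a candidate splice from the involved factors only:

import numpy as np
from scipy import stats

def nll(S_part):
    mu = S_part.mean(axis=0)
    cov = np.atleast_2d(np.cov(S_part, rowvar=False))
    return -stats.multivariate_normal(mu, cov).logpdf(S_part).sum()

def splice_gain(S, c0, c1):
    """Negative log-likelihood decrease of replacing P(Y<c0>)P(Y<c1>)
    by the joint P(Y<c0 + c1>); only the involved factors are evaluated."""
    return nll(S[:, c0]) + nll(S[:, c1]) - nll(S[:, c0 + c1])

rng = np.random.default_rng(1)
S = rng.multivariate_normal([0, 0, 0], [[1, .9, 0], [.9, 1, 0], [0, 0, 1]], 500)
print(splice_gain(S, [0], [1]))  # large: Y0 and Y1 are strongly dependent
print(splice_gain(S, [0], [2]))  # near zero: Y0 and Y2 are independent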
Since each conditional pdf uses a multivariate joint pdf over all the involved variables, the difference would be 5. However, the parameters in the factor P(Y_0|Y_1 Y_2 Y_3) are not only present in this factor. The factors themselves overlap, which is not the case for the unconditional dependencies. Because of this overlap, for instance the full covariance matrix entry (3, 3) is already required, regardless of the fact that Y_3 is added as a conditional variable to the first factor. Therefore, the number of additional parameters is not equal to 5, but less. However, the additional computational resources that are required do increase in the same way as in the unconditional case with respect to joint pdfs. The reason for this is that the covariance matrix will have to be inverted. Furthermore, if we do not take a normal pdf but a normal mixture pdf, we estimate the parameters anew using the EM algorithm anyway, since we cannot suffice with a single full covariance matrix in which the required entries are stored. Also, since applying the EM algorithm is computationally expensive and its performance degrades with increasing dimension, we do not want to apply it directly to the full joint distribution to find all involved parameters, but apply it anew as the factorization is built during the search algorithm. Concluding, decomposing the number of parameters over the conditional factorization is a rational operation, as it penalizes the additional computational resources. The resulting metric difference can be written as:

\ln(A(\hat{P}_{(\nu_1,\omega_1)}(Y)\,|\,S)) - \ln(A(\hat{P}_{(\nu_0,\omega_0)}(Y)\,|\,S)) = -\sum_{i=0}^{|S|-1}\Big(\ln(\hat{P}(Y_{v_1}|Y\langle\nu_1(v_1)\rangle)(y^{i}\langle(v_1)\sqcup\nu_1(v_1)\rangle)) - \ln(\hat{P}(Y_{v_1}|Y\langle\nu_0(v_1)\rangle)(y^{i}\langle(v_1)\sqcup\nu_0(v_1)\rangle))\Big) + |\theta_{fit}(\nu_1,\omega_1)| - |\theta_{fit}(\nu_0,\omega_0)|

where the difference in the number of parameters is itself computed per factor, giving a decomposable metric. If we can derive the differential entropy analytically to find an efficiently computable expression, we can use the entropy over the estimated model instead of the negative log-likelihood. This is for instance the case for the normal pdf. As a result, the computations for model selection based on minimizing the metric can be done efficiently when using a normal pdf.

3.3 Factorization mixture selection

So far, we have discussed finding relations between variables based on a sample vector and using a factorization to build a useful probabilistic model. However, the structure of the sample vector may be highly non-linear. This non-linearity can force us to use probabilistic models of a high complexity in order to retain some of it. Moreover, when relatively simple pdfs such as the normal pdf are used, the non-linear interactions cannot be captured even with higher order models. For instance, a two dimensional sample vector in which the samples are aligned in a V shape cannot be fit adequately by a product of one dimensional normal pdfs, nor by a single two dimensional normal pdf. However, if we identify each line of the V shape as a single cluster, we can adequately model the sample vector using a sum of two dimensional normal pdfs (see the sketch after this paragraph). The key issue is the notion of clusters, which are possibly overlapping subvectors of the original sample vector, optionally augmented with additional data. When we do not allow the subvectors to overlap, we classify each point in the sample vector to be explicitly part of some cluster or class. This instance of clustering is called partitioning. The use of clusters allows us to efficiently break up non-linear interactions so that we can use simple models to get an adequate representation of the sample vector. Furthermore, computationally efficient clustering algorithms exist that provide useful results.
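The following illustrative Python sketch (with numpy; not part of the algorithms in this paper) generates a V-shaped sample vector and compares the average log-likelihood of a single maximum-likelihood normal fit against that of a two-component cluster-wise fit. For simplicity the split uses the known arm labels, where the algorithms below would have to find the clusters automatically:

import numpy as np

rng = np.random.default_rng(42)
t = rng.uniform(0.0, 1.0, size=500)
left = np.c_[-t, t] + rng.normal(0.0, 0.02, size=(500, 2))
right = np.c_[t, t] + rng.normal(0.0, 0.02, size=(500, 2))
samples = np.vstack([left, right])

def normal_pdf(x, mean, cov):
    # Row-wise density of a maximum-likelihood multivariate normal fit.
    d = x - mean
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    expo = -0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
    return np.exp(expo - 0.5 * (logdet + x.shape[1] * np.log(2 * np.pi)))

single = np.log(normal_pdf(samples, samples.mean(0),
                           np.cov(samples, rowvar=False))).mean()
mixture = np.log(0.5 * normal_pdf(samples, left.mean(0), np.cov(left, rowvar=False))
                 + 0.5 * normal_pdf(samples, right.mean(0), np.cov(right, rowvar=False))).mean()
print(single, mixture)  # the cluster-wise mixture scores distinctly higher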
There are exact algorithms for partitioning [20], but their running times make them of no practical use in our case. What we require is a fast approximate assessment of clusters that we can fit well using some pdf. Within these clusters, we can infer what factorization is best and get a probability distribution over the cluster. As all of this is part of the larger algorithm that attempts to infer the structure of the search space and use it in optimization, the clustering should be effective in representation as well as in computation time requirements.

Each cluster is processed separately in order to have a probability distribution fit over it. As such, we have a mixture of probability distributions. By assigning each of these probability distributions a weight in [0, 1] and letting the sum over all these weights equal 1, the result is again a probability distribution. We write these weights as β_i; they are commonly known as mixing coefficients. We write K = (K^0, K^1, …, K^{|K|−1}) for the vector of clusters, where each cluster is a vector of samples K^i = (K^i_0, K^i_1, …, K^i_{|K^i|−1}) with K^i_j ∈ S, so K^i_j = ((K^i_j)_0, (K^i_j)_1, …, (K^i_j)_{l−1}). We can now write the resulting probability distribution as:

\hat{P}_{f\langle K\rangle}(Y) = \sum_{i=0}^{|K|-1}\beta_i\,\hat{P}^{i}_{f^{i}}(Y)    (29)

The reason why we require an i in the superscript of \hat{P}^{i}_{f^{i}}(Y) is to be able to indicate that the i-th probability distribution is based upon the i-th cluster of samples. To determine the β_i and the \hat{P}^{i}_{f^{i}}(Y) in equation 29, two approaches can be taken. If we cluster the sample vector into possibly overlapping subvectors, we can for instance determine the mixing coefficients as the proportional subvector sizes. If we do not explicitly cluster the sample vector, but allow for each sample a probability of assigning it to a certain cluster, we can mathematically derive a way of finding the maximum expectation of this probability over all samples. By doing so, the clusters can be seen as copies of the original sample vector, augmented with likelihood values for each sample that indicate the probability that the sample belongs to a certain cluster. As such, the resulting clusters can be seen as fuzzy clusters, since there is no clear border between the clusters as is the case in partitioning. In practical applications, the first approach leads to the use of partitioning algorithms and the second approach leads to expectation maximization (EM) algorithms. If we use such an EM algorithm however, we do not get any clusters to which we can apply our factorization selection procedures for constructing the final probability distribution. Furthermore, we have noted that we only use the normal pdf. If we identify a single cluster equal to the entire sample vector and fit a normal mixture model to the cluster, we also have to use the EM algorithm, as we shall see in section 4.3.2. This gives an approach that is similar to identifying fuzzy clusters using the EM algorithm and subsequently using the result as the probability distribution. Therefore, we only focus on partitioning algorithms, so as to be able to apply factorization selection to the identified partitions.

3.3.1 Partitioning algorithms

One of the most fundamental aspects of partitioning is the distance function. The distance of a point to a cluster determines to which cluster the point will actually be assigned. Often, the Euclidean distance is used. However, if we are to cluster two variables that lie on different scales, the Euclidean distance might not be the best choice.
Optically, if we present the sample vector in a rectangular graph where the axes are rescaled to fit within the rectangle, the clusters we observe are not the clusters as they appear in the data based upon the Euclidean distance. The latter clusters are found if both axes are of the same scale. One axis might become a lot longer in that case, and clusters that were obvious in the rectangular graph may not be obvious at all anymore. The Euclidean distance is in this case still useful, but only after the sample data has been rescaled in every dimension. Rescaling on the other hand is sensitive to outliers. For two |a| dimensional points y^i⟨a⟩ and y^j⟨a⟩, the Euclidean distance d_E(y^i⟨a⟩, y^j⟨a⟩) between them is given by:

d_E(y^{i}\langle a\rangle, y^{j}\langle a\rangle) = \sqrt{(y^{i}\langle a\rangle - y^{j}\langle a\rangle)^{T}(y^{i}\langle a\rangle - y^{j}\langle a\rangle)} = \sqrt{\sum_{k=0}^{|a|-1}(y^{i}_{a_k} - y^{j}_{a_k})^{2}}

The scaled Euclidean distance is obtained by computing the sample variance in each dimension separately:

d_{ES}(y^{i}\langle a\rangle, y^{j}\langle a\rangle) = \sqrt{(y^{i}\langle a\rangle - y^{j}\langle a\rangle)^{T} D(a)^{-1}(y^{i}\langle a\rangle - y^{j}\langle a\rangle)},\quad D(a) = \mathrm{diag}(s^{2}_{a_0}, s^{2}_{a_1}, \ldots, s^{2}_{a_{|a|-1}})

To compute the distance between a point and a cluster, the distance can be taken between the cluster centroid and the point. A problem with this approach is that the distance to a cluster has ellipsoid shaped isolines that are aligned along the axes. Therefore, even samples with a linear interaction through correlation will likely be broken up into multiple axis-aligned, ellipsoid shaped clusters. To overcome this problem, the Mahalanobis distance metric [23] can be used. Whereas the Euclidean distance metric can be seen as the result of fitting a normal pdf with zero covariances, the Mahalanobis distance metric can be seen as the result of fitting a normal pdf with a full covariance matrix. The clusters can therefore potentially be shaped better to the form of the data. Let \overline{K}^{i}\langle a\rangle be the centroid of cluster i and let S(K^i, a) be the sample covariance matrix of the points in cluster i over the features selected by a:

\overline{K}^{i}\langle a\rangle = \frac{1}{|K^{i}|}\sum_{j=0}^{|K^{i}|-1} K^{i}_{j}\langle a\rangle

The Mahalanobis distance between a point y^i⟨a⟩ and cluster j can now be written as:

d_M(y^{i}\langle a\rangle, j) = \sqrt{(y^{i}\langle a\rangle - \overline{K}^{j}\langle a\rangle)^{T} S(K^{j}, a)^{-1}(y^{i}\langle a\rangle - \overline{K}^{j}\langle a\rangle)}

The isolines of the Mahalanobis distance are again ellipsoids, but they can be aligned along any direction. In the special case where the samples are uncorrelated, these ellipsoids are aligned along the axes and the Mahalanobis distance becomes equivalent to the scaled Euclidean distance. The Mahalanobis distance overcomes two limitations of the scaled Euclidean distance: on the one hand, it corrects for correlation between the different features; on the other hand, it can provide curved as well as linear decision boundaries. The disadvantage of the Mahalanobis distance is that the covariance matrices have to be determined, which becomes less reliable as the number of samples in a cluster becomes smaller. Furthermore, the memory and computation time requirements are O(l²) instead of O(l).
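A minimal numpy sketch of the three distance measures just discussed (illustrative function names; the per-dimension variances and the cluster covariance matrix are assumed to have been estimated from the samples beforehand):

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def scaled_euclidean(x, y, variances):
    # Equivalent to the Euclidean distance after rescaling each dimension
    # by its sample standard deviation.
    return np.sqrt(np.sum((x - y) ** 2 / variances))

def mahalanobis(x, centroid, cov):
    # Full covariance matrix; reduces to the scaled Euclidean distance
    # when cov is diagonal.
    d = x - centroid
    return np.sqrt(d @ np.linalg.inv(cov) @ d)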
The mixing coefficients β_i can be determined in different ways. One of the two most common ways is to use a probabilistic approach and set β_i to the ratio of the size of the i-th cluster to the total size of all clusters:

\forall i \in \{0, \ldots, |K|-1\}:\; \beta_i = \frac{|K^{i}|}{\sum_{j=0}^{|K|-1}|K^{j}|}    (35)

A problem arises when using equation 35 if the peaks of the cost function are not equally high in the case of maximization, since selection will prefer the larger peak. In the case of equal peaks, the search may still converge to concentrate on a single peak because of finite statistical instability. To this end, the other common way to determine the mixing coefficients can be used. In this case, the mixing coefficients are determined as the average cost of the niche divided by the sum of the average costs over every niche:

\forall i \in \{0, \ldots, |K|-1\}:\; \beta_i = \frac{\frac{1}{|K^{i}|}\sum_{j=0}^{|K^{i}|-1} C(K^{i}_{j})}{\sum_{q=0}^{|K|-1}\frac{1}{|K^{q}|}\sum_{j=0}^{|K^{q}|-1} C(K^{q}_{j})}    (36)

The mixing coefficients of equations 35 and 36 were later used in the first approach [25] that applied clustering to iterated density estimation evolutionary algorithms. However, the formula in equation 36 does not work if the objective is to minimize, or if cost values drop below zero. It is reasonable to assume, however, that the optimum value will be far from the initial sample cost average. If we compute the average fitness for niching with respect to this initial sample cost average C_I, and take the absolute value to account for maximization as well as minimization, we get:

\forall i \in \{0, \ldots, |K|-1\}:\; \beta_i = \frac{\big|\frac{1}{|K^{i}|}\sum_{j=0}^{|K^{i}|-1} C(K^{i}_{j}) - C_I\big|}{\sum_{q=0}^{|K|-1}\big|\frac{1}{|K^{q}|}\sum_{j=0}^{|K^{q}|-1} C(K^{q}_{j}) - C_I\big|}    (37)

Before moving to the actual clustering algorithms, we place a final note on the use of distance measures. In general, clusters should unite sample points that constitute promising regions in the search space. These regions do not necessarily have to correspond to the Euclidean space in which they are coded. This coding space is called the genotypic space. The decoded space in which the parameters have their true meaning and directly contribute to the cost function, is called the phenotypic space. It should be noted that clustering in phenotypic space is in general more desirable, as it is more likely to cluster the samples in the most effective way. For most continuous problems, the two spaces are identical. We now focus on the partitioning algorithms themselves.

The leader algorithm  The leader algorithm is one of the fastest clustering algorithms. Its use can thus be beneficial if the overhead introduced by the factorization mixture selection methods is to remain small. Furthermore, there is no need to specify in advance how many clusters there should be. The leader algorithm goes over the sample vector exactly once. For each sample it encounters, it finds the first cluster that has a leader closer to the sample than a given threshold T_d; the first sample to make a new cluster is appointed to be its leader. If no such cluster can be found, a new cluster is created containing only this single sample. Unless the order in which the clusters are inspected is randomized, the first clusters will always be quite a lot larger than the later ones. Moreover, the asymptotic running time for finding the first cluster with a leader closer than T_d is the same as that of going over all clusters and finding the one with the smallest distance. Therefore, we prefer to find the nearest cluster based on the leaders of the clusters; if that nearest leader is closer than T_d, we assign the sample to its cluster. Another problem that arises is that if we have two clusters to which a sample is equally close, one cluster will receive all such samples if we go over the clusters in a deterministic order. To overcome this problem, the order in which the clusters are scanned should be randomized as well. One of the major drawbacks of the algorithm is that it is not invariant under permutations of the input samples. Most clustering algorithms lack this invariance, but not as strongly as does the leader algorithm. Therefore, to be sure that the ordering of the sample vector is not subject to large repeating sequences of samples, which would result in undesired fixed density estimations, we propose to randomize the ordering of the samples that are input to the leader algorithm. This can be done by allocating an array of indices that is randomly accessed in the algorithm. In order to use the scaled Euclidean distance measure, we first have to compute the sample variance s²_i in each dimension.
The resulting algorithm can be formalized as follows:

EuclideanLeaderClustering(T_d)
1  L ← new array of (vector of real) with size |S|
2  K ← ()
3  O_s, O_c ← 2 new arrays of integer with size |S|
4  for i ← 0 to l − 1 do
4.1   \overline{Y}_i ← (\sum_{j=0}^{|S|-1} y^{j}_{i}) / |S|
4.2   s²_i ← (\sum_{j=0}^{|S|-1} (y^{j}_{i} - \overline{Y}_i)²) / |S|
5  for i ← 0 to |S| − 1 do O_s[i] ← i
6  for i ← 0 to |S| − 1 do
6.1   q_s ← RandomNumber(|S| − i)
6.2   for j ← 0 to |K| − 1 do O_c[j] ← j
6.3   c_min ← −1; d_min ← T_d
6.4   for j ← 0 to |K| − 1 do
6.4.1    q_c ← RandomNumber(|K| − j)
6.4.2    if d_ES(y^{O_s[q_s]}, L[O_c[q_c]]) < d_min then
6.4.2.1     d_min ← d_ES(y^{O_s[q_s]}, L[O_c[q_c]])
6.4.2.2     c_min ← O_c[q_c]
6.4.3    O_c[q_c] ↔ O_c[|K| − j − 1]
6.5   if c_min ≥ 0 then
6.5.1    append y^{O_s[q_s]} to K^{c_min}
6.6   else
6.6.1    L[|K|] ← y^{O_s[q_s]}
6.6.2    create a new cluster K^{|K|} containing only y^{O_s[q_s]}
6.7   O_s[q_s] ↔ O_s[|S| − i − 1]
7  return(K)

The scaled Euclidean distance measure is used in the above algorithm. We cannot directly exchange it for the Mahalanobis distance measure. The reason for this is that in order to evaluate the Mahalanobis distance, we require the sample covariance matrix of the samples that belong to a cluster, as well as their sample mean. For the Euclidean measure, it suffices to have a single point serving as a centroid. To use the Mahalanobis distance, we can determine the required parameters as the clustering algorithm progresses through the samples. As a point is added to a cluster, the sample covariance matrix and the sample mean of that cluster are updated. However, the covariance matrix can only be computed and used in the Mahalanobis distance if we have at least two samples. Furthermore, based on only a few samples, the Mahalanobis distance may give unstable results, leading to unnatural clusters. Therefore, we propose that a cluster must first grow to some minimum size T_s before the Mahalanobis distance is used for it. The leader aspect of the Euclidean variant of the algorithm lies in the fact that the first sample to define a new cluster is taken to be the leader of the cluster; it serves as a centroid for the Euclidean distance. In the variant that results when we use the Mahalanobis distance, the sample mean takes over the role of the leader for a cluster as soon as the size of that cluster has reached T_s. The sample mean of a cluster changes whenever a sample is added to that cluster. Therefore, the term leader is somewhat misleading in the resulting algorithm:

MahalanobisLeaderClustering(T_d, T_s)
1–5  as in EuclideanLeaderClustering(T_d)
6  for i ← 0 to |S| − 1 do
6.1   q_s ← RandomNumber(|S| − i)
6.2   for j ← 0 to |K| − 1 do O_c[j] ← j
6.3   c_min ← −1; d_min ← T_d
6.4   for j ← 0 to |K| − 1 do
6.4.1    q_c ← RandomNumber(|K| − j)
6.4.2    if |K^{O_c[q_c]}| ≥ T_s then d ← d_M(y^{O_s[q_s]}, O_c[q_c])
6.4.3    else d ← d_ES(y^{O_s[q_s]}, L[O_c[q_c]])
6.4.4    if d < d_min then
6.4.4.1     d_min ← d
6.4.4.2     c_min ← O_c[q_c]
6.4.5    O_c[q_c] ↔ O_c[|K| − j − 1]
6.5   if c_min ≥ 0 then
6.5.1    update the sample mean \overline{K}^{c_min} and the symmetric sample covariance matrix S(K^{c_min}, L) incrementally to include y^{O_s[q_s]}
6.5.2    append y^{O_s[q_s]} to K^{c_min}
6.6   else
6.6.1    L[|K|] ← y^{O_s[q_s]}
6.6.2    create a new cluster K^{|K|} containing only y^{O_s[q_s]}
6.7   O_s[q_s] ↔ O_s[|S| − i − 1]
7  return(K)

A compact executable sketch of the Euclidean variant is given below.
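This is a minimal Python sketch of the Euclidean leader algorithm described above (illustrative only; it randomizes both the sample order and the cluster scanning order as proposed, and uses the scaled Euclidean distance through per-dimension variances):

import numpy as np

def euclidean_leader_clustering(samples, t_d, rng=np.random.default_rng()):
    # samples: array of shape (n, l); t_d: distance threshold.
    variances = samples.var(axis=0)
    leaders = []   # leader point of each cluster
    clusters = []  # lists of sample indices
    for i in rng.permutation(len(samples)):      # randomized input order
        x = samples[i]
        best, d_min = -1, t_d
        for c in rng.permutation(len(leaders)):  # randomized cluster order
            d = np.sqrt(np.sum((x - leaders[c]) ** 2 / variances))
            if d < d_min:
                best, d_min = c, d
        if best >= 0:
            clusters[best].append(i)
        else:  # no leader within t_d: start a new cluster, x is its leader
            leaders.append(x)
            clusters.append([i])
    return clusters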
The only clustering algorithm that is invariant under permutations of the sample vector is the joining algorithm. This algorithm initially generates a cluster for each sample and then iteratively joins clusters until no two clusters are close enough to each other anymore. The algorithm is however not really suitable for large sample vectors, since |S|² distances must be computed and examined.

A faster algorithm that is less subject to the input ordering than the leader algorithm, but not independent of it as is the joining algorithm, is the k-means clustering algorithm. This algorithm constructs exactly k clusters. As we mostly have little a priori notion of how many clusters are appropriate to fit to the data, the adaptive k-means algorithm is usually preferred. Based on distance thresholds, the number of clusters k is then determined by the algorithm itself, just as is done by the leader algorithm. The number of clusters influences the running time of the subsequent distribution estimation, so a bound on k might be preferred. We restrict ourselves to the non-adaptive version of the clustering algorithm and elaborate on the value of k in section 3.3.2.

The k-means algorithm computes k clusters. A difference with respect to the leader algorithm is that it uses the cluster centroids. The general procedure is as follows. First, k clusters are picked at random. This can be done by partitioning the sample vector at random into k subvectors. The resulting centroids are however expected to lie close to each other. Therefore, the initial clusters are usually taken to consist of a single sample each, chosen at random from the sample vector. Subsequently, the algorithm iterates until the means no longer change to within a significance of ε. One iteration consists of assigning each point to the nearest cluster, based on the distance to the cluster centroid. If multiple clusters are equally close, one of them is picked at random. To ensure this, the clusters are scanned in a random order for each new sample point. Once all of the points have been assigned, the means of the clusters are recomputed. An advantage over the leader algorithm is that the result of the k-means algorithm is less dependent on the input ordering. A disadvantage is that the algorithm takes N_I times as long as the leader algorithm, where N_I is the number of iterations of the k-means clustering algorithm. To avoid spending large computational efforts, a maximum of T_i iterations is allowed:

EuclideanKMeansClustering(k, ε, T_i)
1  O ← new array of integer with size k
2  K ← ()
3  for i ← 0 to l − 1 do
3.1   \overline{Y}_i ← (\sum_{j=0}^{|S|-1} y^{j}_{i}) / |S|
3.2   s²_i ← (\sum_{j=0}^{|S|-1} (y^{j}_{i} - \overline{Y}_i)²) / |S|
4  for i ← 0 to k − 1 do
4.1   q ← RandomNumber(|S|)
4.2   \overline{K}^{i} ← y^{q}
4.3   K^{i} ← ()
4.4   O[i] ← i
5  first ← true
6  ready ← false
7  t ← 0
8  while ¬ready do
8.1   for i ← 0 to k − 1 do a^{i} ← \overline{K}^{i}; K^{i} ← ()
8.2   for i ← 0 to |S| − 1 do
8.2.1    q ← RandomNumber(k)
8.2.2    d_min ← d_ES(y^{i}, \overline{K}^{O[q]})
8.2.3    c_min ← O[q]
8.2.4    O[q] ↔ O[k − 1]
8.2.5    for j ← 1 to k − 1 do
8.2.5.1     q ← RandomNumber(k − j)
8.2.5.2     if d_ES(y^{i}, \overline{K}^{O[q]}) < d_min then update d_min and c_min accordingly
8.2.5.3     O[q] ↔ O[k − j − 1]
8.2.6    append y^{i} to K^{c_min}
8.3   ready ← true
8.4   for i ← 0 to k − 1 do
8.4.1    \overline{K}^{i} ← (\sum_{j=0}^{|K^{i}|-1} K^{i}_{j}) / |K^{i}|
8.4.2    if |a^{i}_{j} − \overline{K}^{i}_{j}| > ε in any dimension j then ready ← false
8.5   t ← t + 1
8.6   if t = T_i then ready ← true
9  return(K)
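For comparison with the leader sketch, this is a compact Python sketch of the Euclidean k-means loop just formalized (illustrative; it uses single random samples as initial centroids and breaks distance ties deterministically instead of by the randomized scan of the pseudo-code):

import numpy as np

def euclidean_kmeans(samples, k, eps=1e-3, t_i=10, rng=np.random.default_rng()):
    variances = samples.var(axis=0)
    centroids = samples[rng.choice(len(samples), size=k, replace=False)].copy()
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(t_i):
        # Assign every sample to the nearest centroid (scaled Euclidean).
        d = (((samples[:, None, :] - centroids[None, :, :]) ** 2)
             / variances).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = samples[labels == c]
            if len(members) > 0:
                new_centroids[c] = members.mean(axis=0)
        converged = np.all(np.abs(new_centroids - centroids) <= eps)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids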
Just as for the leader algorithm, we can alter the k-means clustering algorithm to make use of the Mahalanobis distance measure. If a cluster has grown to some minimum size, the Mahalanobis distance may be used; otherwise, the Euclidean distance is used. In subsequent generations, the Mahalanobis distance is always used, as the covariance values are then kept fixed within any single loop. Again, the maximum number of iterations that the algorithm is allowed to make is given by T_i. The resulting algorithm can be formalized as follows:

MahalanobisKMeansClustering(k, ε, T_s, T_i)
1  a ← new vector of real with size l
2  B ← new matrix of real with size l × l
3  O ← new array of integer with size k
4  K ← ()
5  for i ← 0 to l − 1 do compute \overline{Y}_i and s²_i as before
6  for i ← 0 to k − 1 do
6.1   q ← RandomNumber(|S|)
6.2   \overline{K}^{i} ← y^{q}
6.3   K^{i} ← ()
6.4   O[i] ← i
7  first ← true
8  ready ← false
9  t ← 0
10  while ¬ready do
10.1   for i ← 0 to k − 1 do remember the old centroid of K^{i} and empty K^{i}
10.2   for i ← 0 to |S| − 1 do
10.2.1    q ← RandomNumber(k)
10.2.2    if ¬first ∧ |K^{O[q]}| ≥ T_s then
10.2.2.1     d_min ← d_M(y^{i}, O[q])
10.2.3    else
10.2.3.1     d_min ← d_ES(y^{i}, \overline{K}^{O[q]})
10.2.4    c_min ← O[q]
10.2.5    O[q] ↔ O[k − 1]
10.2.6    for j ← 1 to k − 1 do
10.2.6.1     q ← RandomNumber(k − j)
10.2.6.2     d ← d_M(y^{i}, O[q]) if ¬first ∧ |K^{O[q]}| ≥ T_s, and d ← d_ES(y^{i}, \overline{K}^{O[q]}) otherwise
10.2.6.3     if d < d_min then d_min ← d; c_min ← O[q]
10.2.6.4     O[q] ↔ O[k − j − 1]
10.2.7    append y^{i} to K^{c_min}
10.2.8    if first ∧ |K^{c_min}| = T_s then initialize the sample mean \overline{K}^{c_min} and the symmetric sample covariance matrix S(K^{c_min}, L) from the members of K^{c_min}
10.3   ready ← true
10.4   for i ← 0 to k − 1 do recompute \overline{K}^{i} and, once |K^{i}| ≥ T_s, the covariance matrix S(K^{i}, L); if any centroid changed by more than ε, then ready ← false
10.5   if first then
10.5.1    first ← false
10.5.2    ready ← false
10.6   t ← t + 1
10.7   if t = T_i then ready ← true
11  return(K)

3.3.2 Setting parameters

On the one hand, we now have clustering algorithms on the basis of which we can define a mixture of factorizations. On the other hand however, we have introduced new parameters that determine the number of mixture components. For the k-means clustering algorithms, this number equals exactly k. For the leader algorithms, it depends on the value of the distance threshold T_d. In order to perform factorization mixture selection, we clearly have to determine what values to use for these parameters. Setting the parameters requires some experience. To automate this, we can apply the general approaches for the detection of dependence between variables from section 3.2. If we for instance take the k parameter of the k-means clustering algorithms, we can increment the value of k and compute the negative log-likelihood value. If, based on this value, the probability distribution is significantly better than for a smaller value of k, the larger value is accepted and the search continues. Alternatively, a scoring metric can be increasingly penalized as the number of clusters increases. A similar scheme can be used for the distance threshold T_d. It should be noted that using such methods requires a vast amount of computation time. Factorization selection for a single cluster can already take up a significant amount of computation time; over k clusters, we thus require k times as much. It therefore seems by far the most efficient choice to fix k or T_d beforehand.

3.3.3 Examples

In this section, we present some examples of the partitioning algorithms described in the previous sections. We apply each of the clustering algorithms from section 3.3 to a set of sample vectors and observe the results. We use sample vectors that can be displayed graphically, to gain better insight into the workings of the algorithms. Sample vectors of a dimensionality higher than 3 cannot be visualized in an intuitive manner. Such sample vectors will appear when we apply iterated density estimation to continuous optimization in section 6. We go over the sample vectors and apply the four clustering algorithms that we discussed in section 3.3. We use 8 sample vectors. The first 7 of these are two dimensional and consist of various structures. The first sample vector is sampled from the uniform distribution.
The second sample vector is sampled from a multiple of local uniform distributions, creating clusters. The third sample vector is a filled circle with a higher concentration of samples in the center; it can be seen as having been sampled from a cone. The fourth sample vector is an outlined circle, which is difficult for most approaches. Even though the space can be described by a single parameter, namely the angle around the center, this structure is hardly ever detected by any probability density estimation procedure. The fifth sample vector is a line that exemplifies full correlation between the two variables. The sixth and seventh sample vectors express dependency, but in such a way that it usually cannot be accounted for by a single factorization. The last sample vector shows different behavior in different regions of the plane. In one region, the density is uniform, whereas in the other it is fully correlated. For any probabilistic model to efficiently describe this sample vector, it should be clustered into at least two sample vectors that are processed individually.

We have applied the four clustering algorithms from section 3.3 to each sample vector. The results are shown in figures 6 through 13. Each figure shows, from left to right, the result of Euclidean leader clustering, Mahalanobis leader clustering, Euclidean k-means clustering and Mahalanobis k-means clustering. The distance threshold for the Euclidean leader clustering algorithm was T_d = 1.0. For the Mahalanobis leader clustering algorithm, the distance threshold was T_d = 2.5 and the minimum cluster size threshold was T_s = 10. For both k-means clustering algorithms, we used k = 10, ε = 0.001 and T_i = 10. All of the settings were chosen so that a visually intuitive number of clusters was obtained.

[Figure 6: Cluster results for sample vector 0.]
[Figure 7: Cluster results for sample vector 1.]
[Figure 8: Cluster results for sample vector 2.]
[Figure 9: Cluster results for sample vector 3.]
[Figure 10: Cluster results for sample vector 4.]
[Figure 11: Cluster results for sample vector 5.]
[Figure 12: Cluster results for sample vector 6.]
[Figure 13: Cluster results for sample vector 7.]

If we compare the leader results to the k-means results, it is clear that the cluster boundaries of the k-means results are strict, in the sense that there is no overlap between the different clusters. Our randomized approach to the leader algorithm proves to be effective in the sense that there are no extremely large clusters in any of its results and the boundaries do not overlap disturbingly. This means that the densities estimated on either of the two clustering results will not differ much. Especially for our application, in which a correct density estimation is not as important as a good approximation, the results of the leader algorithm seem at least as promising as those of the k-means clustering algorithm. Since the running time of the leader algorithm is smaller than that of the k-means algorithm, the leader algorithm seems preferable.

If we compare the results between the use of the Euclidean distance measure and the Mahalanobis distance measure, it is not directly clear which of these approaches leads to more desirable results. One of the main drawbacks of using the Mahalanobis distance measure is that small clusters are more likely to appear. Such clusters can for instance lie completely inside another cluster, since the normal pdf that is based upon such a small cluster is sharply peaked. It seems that the k-means algorithm in this case leads to somewhat better formed clusters with less overlap. The true advantage of the Mahalanobis distance, namely that it can better describe linear relations, is unfortunately not exploited in a useful way by the described clustering algorithms, as becomes clear from figure 12 for instance. In this figure, the horizontal bars of sample points are not separated by the clusters, even though this is possible when using the Mahalanobis distance. This means that the resulting estimated density will not be effective. As the Mahalanobis distance measure takes more computational effort, the Euclidean distance measure seems preferable. Given the computational speed of the leader algorithm and the acceptable two dimensional results we have seen so far, this algorithm seems the most promising to use for factorization mixture selection based on clustering. The experiments in section 6 will bring the use of clustering within the IDEA framework into practice, which will indicate the actual computational usefulness of the proposed clustering algorithms.

4 Model fitting and sampling

Once a factorization has been selected, a pdf has to be fit for each element in the factorization. If we have used statistical hypothesis tests directly, through for instance the correlation coefficient, we have yet to find such a fit. Then again, in the case of the Kullback-Leibler divergence, the required parameters have been observed to have been computed already [8]. In general, we assume that our model structure has been selected, but that the pdf information is not yet available. In this section, we concentrate on computing the pdf information. In general, whereas finding a model structure is called model selection, estimating a pdf for each element in the structure, as we do in this section, is called model fitting.
4.1 Observing pdf characteristics

In this article, we have described algorithms that use conditional densities and unconditional densities. We have also shown that conditional densities are more general than unconditional densities, as the latter can be written as a special case of the former. From equation 3 we have that we only require unconditional densities. However, simplifying the actual conditional density can lead to expressions that can be computed more efficiently. We shall therefore give the conditional density for each pdf that we discuss. First however, we discuss some pdfs in general. From these, we select the pdfs to actually work with. In the subsequent sections, we go into more detail on the selected pdfs.

We restrict ourselves to the use of the normal pdf and variants thereof. One of the main reasons for this is that the normal pdf has relatively simple analytical properties. Furthermore, the central limit theorem tells us that, given enough samples, the approximation by the normal pdf is often quite good for several components that represent some source. Based on the normal pdf, we distinguish three different pdfs: the single multivariate normal pdf, the normal kernels pdf and the normal mixture pdf. Both the multivariate normal pdf and the normal kernels pdf have been proposed previously within the IDEA framework [8, 10, 11, 12]. The single multivariate normal pdf is defined as follows:

f_N(y\langle a\rangle; \mu\langle a\rangle, \Sigma(a)) = (2\pi)^{-\frac{|a|}{2}}(\det\Sigma(a))^{-\frac{1}{2}}\, e^{-\frac{1}{2}(y\langle a\rangle-\mu\langle a\rangle)^{T}\Sigma(a)^{-1}(y\langle a\rangle-\mu\langle a\rangle)}    (38)

[Figure 14: Fitting sample vectors 0 (left) and 1 (right) with a joint normal pdf.]

An example of fitting two sample vectors with the normal pdf is shown in figure 14. The normal kernels pdf is obtained by placing a multivariate normal pdf over each sample point. The covariance matrices of these normal kernels are fixed to have non-zero entries on the diagonal and zero entries off the diagonal. The entries on the diagonal are the individual variances of the normal pdf in each dimension. We denote such a variance in dimension j by s²_j. The so constructed matrix implies that the normal kernels can be scaled in each dimension separately, but that each multivariate normal pdf cannot be aligned along any other axes than the standard ones. We denote the resulting matrix over variables Y⟨a⟩ by S(a):

S(a) = \mathrm{diag}(s^{2}_{a_0}, s^{2}_{a_1}, \ldots, s^{2}_{a_{|a|-1}})

f_{NK}(y\langle a\rangle; S, S(a)) = \frac{1}{|S|}\sum_{i=0}^{|S|-1} f_N(y\langle a\rangle; y^{i}\langle a\rangle, S(a))

In earlier work, setting the variances with respect to the range of the samples in each dimension was proposed. One interesting property of the normal kernels pdf is that by increasing the standard deviations s_i, we allow for less detail in the final density estimation. Furthermore, an advantage over the use of the normal pdf is that we have a better way of modeling clusters. A resulting fit over the two sample vectors can be seen in figure 15. If we take M normal pdfs instead of |S| and fit them to the sample data as well as possible, while allowing for full covariance matrices, we have a normal mixture pdf. Note that we are fitting such a mixture over a subset of the variables, instead of over all variables as we did in section 3.3.
It may be clear that fitting such a normal mixture pdf over a subset of the available variables can be done in the same way as described earlier in section 3.3, namely through the use of the EM algorithm or the adaptive EM algorithm. An advantage of the normal mixture pdf over the normal pdf and the normal kernels pdf is that it is neither as global and cluster insensitive as the normal pdf, nor as over-sensitive to clusters as the normal kernels pdf. Therefore, the normal mixture pdf can be seen as a trade-off.

[Figure 15: Fitting sample vectors 0 (left) and 1 (right) with a joint normal kernels pdf with s_0 = s_1 = 1.0.]

The normal mixture pdf requires a vector of mean vectors as well as a vector of covariance matrices, since we have multiple multivariate normal pdfs. We assume to have w such normal components in the mixture and let W = (0, 1, …, w − 1). The vector of mean vectors is μ⟨a⟩⟨W⟩ and the vector of covariance matrices is Σ(a)⟨W⟩. We define μ^i⟨a⟩ to be the mean vector of the i-th component, μ^i⟨a⟩ = (μ⟨a⟩⟨W⟩)_i, and Σ^i(a) to be the covariance matrix of the i-th component, Σ^i(a) = (Σ(a)⟨W⟩)_i. The normal mixture pdf can be formalized as in equation 41. A resulting fit over the two sample vectors can be seen in figure 16.

f_{NM}(y\langle a\rangle; \beta\langle W\rangle, \mu\langle a\rangle\langle W\rangle, \Sigma(a)\langle W\rangle) = \sum_{i=0}^{w-1}\beta_i\, f_N(y\langle a\rangle; \mu^{i}\langle a\rangle, \Sigma^{i}(a))    (41)

In equation 41, β⟨W⟩ is a vector of mixing coefficients. The same condition holds for these mixing coefficients as for the mixing coefficients of the factorization mixture discussed in section 3.3, namely that they are non-negative and that \sum_{i=0}^{w-1}\beta_i = 1.

The problem with the normal pdf is its strong generalization of the sample vector, which becomes clear from figure 14, where the clustered set is clearly underfit. The normal kernels pdf overcomes this, but at quite a high cost. Firstly, evaluating the function takes |S| times more time than the single normal pdf; this scales with the number of available samples. However, getting more samples should imply that we get a better description of the source from which the samples were drawn, not that we necessarily get a much more costly pdf. Sampling from the distribution takes much more time as well. In order to solve higher dimensional problems, we require increasingly more samples. As the complexity of the normal kernels pdf increases with this number, and since the pdf is used very often throughout the optimization algorithm, the normal kernels pdf is not a very efficient way of reducing the strong generalization of the normal pdf. Secondly, it is not straightforward what values to use for the variances of each kernel. It has been observed [10] that the performance of the resulting algorithm strongly depends on these variances. This renders the normal kernels pdf hard to handle, even though it has been observed [10] that better results can be obtained with it on epistatic problems than with the normal pdf.

[Figure 16: Fitting sample vectors 0 (left) and 1 (right) with a joint normal mixture pdf using the EM algorithm with 10 components and ε = 10^{−5}.]

We point out that in terms of modeling, the normal mixture pdf generalizes both the normal pdf and the normal kernels pdf. If we have w = 1, we have a single normal pdf. On the other hand, if we have w = |S| and Σ^i(a) = S(a), we get the normal kernels pdf.
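A small numpy sketch of equation 41 (illustrative; betas plays the role of the mixing coefficients) makes these two limiting cases explicit: with one component it reduces to the single normal pdf, and with one component per sample and the fixed diagonal matrix S(a) it reduces to the normal kernels pdf.

import numpy as np

def normal_pdf(y, mean, cov):
    q = len(y)
    d = y - mean
    norm = (2 * np.pi) ** (-q / 2) * np.linalg.det(cov) ** (-0.5)
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def normal_mixture_pdf(y, betas, means, covs):
    # Equation 41: a convex combination of w multivariate normal pdfs.
    return sum(b * normal_pdf(y, m, c) for b, m, c in zip(betas, means, covs))

def normal_kernels_pdf(y, samples, variances):
    # Special case w = |S| with one kernel per sample and covariance S(a).
    n = len(samples)
    cov = np.diag(variances)
    return normal_mixture_pdf(y, [1.0 / n] * n, samples, [cov] * n)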
We conclude that for simplicity and efficiency, it is beneficial to continue to use the single normal pdf. We refrain from using the normal kernels pdf any further, since it is very sensitive to parameter settings and a good fit is hard to find. For the normal mixture pdf on the other hand, the EM algorithm gives us an automated procedure to obtain a fit. Therefore, we continue to regard only the normal pdf and the normal mixture pdf. In the next subsections, we present some important derivations for these pdfs in order for them to be used in the IDEA framework. For each pdf, we first present its multivariate conditional version, in which a single variable is conditioned on a multiple of others, and show how we can sample from it, as well as from the multivariate joint pdf. Subsequently, we show how the parameters θ of the pdf can be estimated.

4.2 The normal pdf

The definition of the multivariate normal joint pdf, given a vector of variables y⟨a⟩, a mean vector μ⟨a⟩ and a covariance matrix Σ(a), is given in equation 38. Using this definition, we can directly account for unconditional dependencies, as these result in multivariate joint pdfs. Therefore, we can evaluate a sample in a marginal product model that is fit with normal pdfs. However, in the case of conditional dependencies, we must be able to evaluate the multivariate normal conditional pdf. Furthermore, to complete the resulting algorithms, we need to be able to sample from the multivariate normal joint pdf as well as from the multivariate normal conditional pdf. In this section, we first show how to perform the actual sampling. Once this is known, we know how to sample from both the special case multivariate conditional normal pdf and the multivariate joint normal pdf. However, given a model structure, we still have to estimate the parameters of the normal pdfs, which is the actual model fitting. How to best estimate the parameters is what we show second in this section.

4.2.1 Sampling

We first want to know how to sample from the special case multivariate conditional normal pdf. Usually, only sampling from a one dimensional normal pdf is available. However, it was pointed out in earlier work [8] that the multivariate normal conditional pdf, in which a single variable y_{a_0} is conditioned on the |a| − 1 other variables, is again a normal pdf:

f_N(y_{a_0}\,|\,y\langle a\rangle - (y_{a_0}); \mu\langle a\rangle, \Sigma(a)) = f_N(y_{a_0}; \tilde{\mu}, \tilde{\sigma}^{2})    (42)

where \tilde{\sigma}^{2} = \frac{1}{(\Sigma(a)^{-1})_{00}} and \tilde{\mu} = \mu_{a_0} - \tilde{\sigma}^{2}\sum_{i=1}^{|a|-1}(\Sigma(a)^{-1})_{0i}(y_{a_i} - \mu_{a_i})

In the case of unconditional dependencies however, we have to draw samples from a product of multivariate joint normal pdfs. We have assumed that only sampling from a one dimensional normal pdf is available. Therefore, if we have to sample from a multivariate normal pdf, we use equation 3 to write the multivariate pdf as a product of multivariate conditional pdfs, for which we know that they can be written as one dimensional normal pdfs using equation 42. This procedure is known as sampling by simulation.

The multivariate differential entropy of the multivariate joint normal pdf can be derived analytically. It can be shown (see for instance [15]) that it equals:

h(f_N(y\langle a\rangle; \mu\langle a\rangle, \Sigma(a))) = \frac{1}{2}\big(|a| + \ln((2\pi)^{|a|}\det\Sigma(a))\big)    (43)

4.2.2 Parameter estimation

From equations 38 and 42 it follows that we have to estimate the values of μ⟨a⟩ and Σ(a). It can be shown that a maximum likelihood fit of the normal pdf is obtained by using the sample average \overline{Y}\langle a\rangle and the sample covariance matrix S(a) for them, respectively:

\overline{Y}\langle a\rangle = \frac{1}{|S|}\sum_{i=0}^{|S|-1} y^{i}\langle a\rangle,\qquad S(a) = \frac{1}{|S|}\sum_{i=0}^{|S|-1}(y^{i}\langle a\rangle - \overline{Y}\langle a\rangle)(y^{i}\langle a\rangle - \overline{Y}\langle a\rangle)^{T}

A compact sketch of sampling by simulation with these conditional normal pdfs is given below.
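This is a minimal numpy sketch of sampling by simulation (illustrative): the joint normal is written as a product of one dimensional conditional normal pdfs, and each variable is sampled given the ones already drawn, using the precision matrix of the marginal over the remaining variables, as in equation 42.

import numpy as np

def sample_joint_normal_by_simulation(mu, cov, rng=np.random.default_rng()):
    # mu: mean vector of length q; cov: q x q covariance matrix.
    q = len(mu)
    y = np.empty(q)
    for i in range(q - 1, -1, -1):
        # Precision matrix of the marginal normal over y_i .. y_{q-1}.
        prec = np.linalg.inv(cov[i:, i:])
        var = 1.0 / prec[0, 0]
        mean = mu[i] - var * prec[0, 1:] @ (y[i + 1:] - mu[i + 1:])
        y[i] = rng.normal(mean, np.sqrt(var))
    return y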
4.3 The normal mixture pdf

The definition of the multivariate normal mixture pdf is given in equation 41. Just as in the case of a single normal pdf, we can thereby directly account for unconditional dependencies. In this section, we first derive an expression for the multivariate conditional normal mixture pdf and subsequently note how to sample from it. We also show how to sample from the multivariate joint normal mixture pdf. Secondly, we show how the EM algorithm can be used to estimate the parameters of a multivariate normal mixture pdf.

4.3.1 Sampling

To sample from a model that incorporates conditional dependencies, we must first derive the multivariate conditional normal mixture pdf, in which a single variable y_{a_0} is conditioned on the multiple of others y⟨a − (a_0)⟩:

f_{NM}(y_{a_0}\,|\,y\langle a\rangle - (y_{a_0}); \beta\langle W\rangle, \mu\langle a\rangle\langle W\rangle, \Sigma(a)\langle W\rangle) = \frac{f_{NM}(y\langle a\rangle; \beta\langle W\rangle, \mu\langle a\rangle\langle W\rangle, \Sigma(a)\langle W\rangle)}{f_{NM}(y\langle a-(a_0)\rangle; \beta\langle W\rangle, \mu\langle a-(a_0)\rangle\langle W\rangle, \Sigma(a-(a_0))\langle W\rangle)}    (46)

The expression in the denominator of equation 46 depends only on the variables y⟨a − (a_0)⟩, which are given. We therefore define a constant:

c_0 = f_{NM}(y\langle a-(a_0)\rangle; \ldots) = \sum_{i=0}^{w-1}\beta_i\, f_N(y\langle a-(a_0)\rangle; \mu^{i}\langle a-(a_0)\rangle, \Sigma^{i}(a-(a_0)))    (47)

As in the derivation for the single normal pdf, we would like to move constant factors outside of the summation. This time however, the corresponding factor is additionally dependent on i. Therefore, we cannot move it outside the summation. The same is the case for the β_i factors. We introduce a "constant" that is a function of i:

c_1(i) = \beta_i\, f_N(y\langle a-(a_0)\rangle; \mu^{i}\langle a-(a_0)\rangle, \Sigma^{i}(a-(a_0)))

Writing out the numerator of equation 46, the conditional pdf becomes a mixture of one dimensional conditional normal pdfs:

f_{NM}(y_{a_0}\,|\,y\langle a\rangle - (y_{a_0}); \ldots) = \sum_{i=0}^{w-1}\frac{c_1(i)}{c_0}\, f_N(y_{a_0}; \tilde{\mu}^{i}, (\tilde{\sigma}^{i})^{2})

where (\tilde{\sigma}^{i})^{2} = \frac{1}{(\Sigma^{i}(a)^{-1})_{00}} and \tilde{\mu}^{i} = \mu^{i}_{a_0} - (\tilde{\sigma}^{i})^{2}\sum_{j=1}^{|a|-1}(\Sigma^{i}(a)^{-1})_{0j}(y_{a_j} - \mu^{i}_{a_j})

To sample from the multivariate conditional normal mixture pdf, we first have to compute the weights c_1(i)/c_0. We then select the i-th normal pdf with probability c_1(i)/c_0. Since the complete expression is a pdf and each element in the sum is a weight times a normal pdf, the weights have to sum to 1. As the numerator of each weight is a component of the mixture in the denominator, we have c_0 = \sum_{i=0}^{w-1} c_1(i), so this is indeed the case.

Next to being able to sample from the required multivariate conditional normal mixture pdf, we also have to be able to sample from the multivariate joint normal mixture pdf. This is quite simple compared to the conditional case. The joint pdf in equation 41 is a sum of weighted multivariate joint normal pdfs. Given the constraints that the β_i have to sum to 1 and may not be negative, we can select the i-th multivariate normal pdf to sample from with probability β_i. How to sample from a single multivariate normal pdf has already been discussed in section 4.2.

4.3.2 Parameter estimation

Estimating the parameters of each multivariate joint normal pdf, given w, is not straightforward. To this end, the EM algorithm has been proposed [16]. This algorithm does not guarantee a maximum likelihood fit, but it is at least a general approach to finding an approximation. Even though we get an approximation by using the EM algorithm, the choice of w still has to be made. This is a choice similar to selecting the number of clusters in factorization mixture selection. An adaptive version of the EM algorithm can be constructed, so as to relieve the user from having to specify the number of normal pdfs to use in a fit over the data. Instead, the user must provide a threshold value, in the same fashion as for the leader algorithms. Just as we did in the case of the k-means clustering algorithms however, we restrict ourselves to the non-adaptive algorithm.

The EM algorithm is an iterative parameter estimation procedure. Given a certain instance of the parameters, termed the old parameters, a single iteration consists of finding new estimates for these parameters. This recomputation of the parameters is repeated until no changes occur anymore within a certain precision, or until a maximum number of iterations has been reached.
In appendix C, the update equations for a single iteration of the EM algorithm for a multivariate normal mixture model are derived, given a vector of indices a. The resulting equations concern the mixing coefficients β_j, the mean vectors μ^j⟨a⟩ and the covariance matrices Σ^j(a) for each component j ∈ W of the mixture:

\beta_j^{new} = \frac{1}{|S|}\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}    (49)

\{\mu^{j}\langle a\rangle\}^{new} = \frac{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}\, y^{i}\langle a\rangle}{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}}    (50)

\{\Sigma^{j}(a)\}^{new} = \frac{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}\,(y^{i}\langle a\rangle - \{\mu^{j}\langle a\rangle\}^{new})(y^{i}\langle a\rangle - \{\mu^{j}\langle a\rangle\}^{new})^{T}}{\sum_{i=0}^{|S|-1}\frac{\beta_j^{old}\{\hat{P}^{j}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}{\{\hat{P}(Y\langle a\rangle)(y^{i}\langle a\rangle)\}^{old}}}    (51)

Even though the EM algorithm is derived so as to obtain a maximum likelihood estimate, the resulting estimate may not be one of maximum likelihood. The minimization problem contains many local minima and the algorithm may get stuck in one of them. Since the EM algorithm is essentially a gradient algorithm, this can easily happen if the problem has many local optima. To tackle this issue, an evolutionary algorithm could for instance be applied. However, this yields a chicken and egg problem, since within the evolutionary algorithms that learn structure, tools such as the EM algorithm are used again.

We close this section by presenting pseudo-code for the EM algorithm for a normal mixture with full covariance matrices. The algorithm is based upon the update equations given above, as derived in appendix C. The algorithm is defined with respect to an indices vector a. These indices define the variables for which a multivariate joint normal mixture pdf must be estimated. Just as was the case for the k-means algorithm, we require an a priori specified number of components k. We also use a convergence significance level ε, as well as a maximum number of iterations T_i:

NormalMixtureEMAlgorithm(k, ε, T_i, a)
1  a_o, a_n ← 2 new arrays of real with size k
2  m_o, m_n ← 2 new arrays of (vector of real with size |a|) with size k
3  S_o, S_n ← 2 new arrays of (matrix of real with size |a| × |a|) with size k
4  p ← new array of real in 2 dimensions with size k × |S|
5  p_m ← new array of real with size |S|
6  d ← new array of real with size k
7  r, min ← 2 new vectors of real with size |a|
8  for i ← 0 to |a| − 1 do compute the minimum min_i and the range r_i of the samples in dimension a_i
9  for j ← 0 to k − 1 do
9.1   a_o[j] ← 1/k
9.2   initialize m_o[j] at random within the sample range and S_o[j] with the sample variances on the diagonal
10  ready ← false
11  t ← 0
12  while ¬ready do
12.1   for i ← 0 to |S| − 1 do
12.1.1    p_m[i] ← 0
12.1.2    for j ← 0 to k − 1 do
12.1.2.1     p[j, i] ← f_N(y^{i}⟨a⟩; m_o[j], S_o[j])
12.1.2.2     p_m[i] ← p_m[i] + a_o[j]p[j, i]
12.2   for j ← 0 to k − 1 do
12.2.1    d[j] ← \sum_{i=0}^{|S|-1} a_o[j]p[j, i]/p_m[i]
12.2.2    a_n[j] ← d[j]/|S|
12.2.3    m_n[j] ← (\sum_{i=0}^{|S|-1}(a_o[j]p[j, i]/p_m[i])\, y^{i}⟨a⟩)/d[j]
12.2.4    S_n[j] ← (\sum_{i=0}^{|S|-1}(a_o[j]p[j, i]/p_m[i])(y^{i}⟨a⟩ − m_n[j])(y^{i}⟨a⟩ − m_n[j])^{T})/d[j]
12.3   ready ← true
12.4   for j ← 0 to k − 1 do
12.4.1    if any entry of a_n[j], m_n[j] or S_n[j] differs by more than ε from its old value then ready ← false
12.4.2    a_o[j] ← a_n[j]; m_o[j] ← m_n[j]; S_o[j] ← S_n[j]
12.5   t ← t + 1
12.6   if t = T_i then ready ← true
13  for j ← 0 to k − 1 do
13.1   β_j ← a_n[j]
13.2   μ^{j}⟨a⟩ ← m_n[j]
13.3   Σ^{j}(a) ← S_n[j]
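The following compact numpy sketch implements the same EM updates (illustrative; it initializes the means from random samples and omits the regularization and other practical safeguards a robust implementation would need):

import numpy as np

def normal_pdf(x, mean, cov):
    # Row-wise multivariate normal density of x (shape (n, q)).
    q = x.shape[1]
    d = x - mean
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    expo = -0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
    return np.exp(expo - 0.5 * (q * np.log(2 * np.pi) + logdet))

def em_normal_mixture(x, k, eps=1e-5, t_i=100, rng=np.random.default_rng()):
    n, q = x.shape
    beta = np.full(k, 1.0 / k)
    means = x[rng.choice(n, size=k, replace=False)].copy()
    covs = np.array([np.cov(x, rowvar=False)] * k)
    for _ in range(t_i):
        # E step: responsibilities r[i, j] of component j for sample i.
        p = np.column_stack([beta[j] * normal_pdf(x, means[j], covs[j])
                             for j in range(k)])
        r = p / p.sum(axis=1, keepdims=True)
        # M step: equations 49-51.
        d = r.sum(axis=0)
        new_beta = d / n
        new_means = (r.T @ x) / d[:, None]
        new_covs = np.empty_like(covs)
        for j in range(k):
            c = x - new_means[j]
            new_covs[j] = (r[:, j, None] * c).T @ c / d[j]
        moved = np.abs(new_means - means).max()
        beta, means, covs = new_beta, new_means, new_covs
        if moved <= eps:
            break
    return beta, means, covs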
5 IDEAs

Our aim is to apply the use of probabilistic models to continuous optimization. So far, we have formalized useful techniques and algorithms to find such models. However, we have not yet elaborated on how these probabilistic models can be used. In this section, we present the IDEA framework for Iterated Density Estimation Evolutionary Algorithms. In this framework, probabilistic models are built and used in optimization with an evolutionary algorithm. In section 5.1, we first go over the framework itself. Second, in section 5.2, we elaborate on the algorithmic instances of the framework that were used in earlier work. Third and finally, we summarize what techniques we apply for the first time to get new algorithmic instances of the IDEA framework.

5.1 The IDEA framework

We write the cost function of our continuous optimization problem as C(y⟨L⟩) and, without loss of generality, we assume that we want to minimize C(y⟨L⟩). For every problem variable y_i, we introduce a continuous random variable Y_i and get Y = Y⟨L⟩. Without any prior information on C(y⟨L⟩), we might as well assume a uniform distribution over Y. Therefore, we generate an initial (population) vector of n samples at random. Now we let P^θ(Y) be a probability distribution that is uniform over all vectors y⟨L⟩ with C(y⟨L⟩) ≤ θ. Sampling from P^θ(Y) gives more samples that evaluate to a value below θ. Moreover, if we know θ* = min_{y⟨L⟩}{C(y⟨L⟩)}, a single sample gives an optimal solution. To use this in an iterated algorithm, we select ⌊τn⌋ samples in each iteration t and let θ_t be the cost of the worst selected sample. We then estimate the distribution of the selected samples and thereby find \hat{P}^{θ_t}_ς(Y) as an approximation to the true distribution P^{θ_t}(Y). New samples can then be drawn from \hat{P}^{θ_t}_ς(Y) and be used to replace some of the current samples. This rationale has led to the definition of the IDEA framework [8]. The largest difference between other similar approaches and the IDEA is that the IDEA has mostly been used to focus on continuous optimization problems [8, 10, 11, 12, 13]. The definition of the IDEA framework is the following:

IDEA(n, τ, m, sel(), rep(), ter(), sea(), est(), sam())
1  Initialize an empty vector of samples: P ← ()
2  Add and evaluate n random samples: for i ← 0 to n − 1 do add a random sample y^{i}⟨L⟩ to P and compute c[y^{i}⟨L⟩] ← C(y^{i}⟨L⟩)
3  Initialize the iteration counter: t ← 0
4  Iterate until termination: while ¬ter() do
4.1   Select ⌊τn⌋ samples: (y^{0}⟨L⟩, y^{1}⟨L⟩, …, y^{⌊τn⌋−1}⟨L⟩) ← sel(), such that c[y^{i}⟨L⟩] ≤ c[y^{k}⟨L⟩] for every selected sample i and unselected sample k
4.2   Search for a structure: ς ← sea()
4.3   Estimate the parameters: θ_fit ← est()
4.4   Create m new samples with sam() and replace part of P with them using rep()
4.5   Evaluate the new samples in P: for each unevaluated P_i do c[P_i] ← C(P_i)
4.6   Update the iteration counter: t ← t + 1
5  Denote the required number of iterations by t_end

In the IDEA framework, N* = (0, 1, …, ⌊τn⌋ − 1) is the vector of indices of the selected samples, and the sampling procedure sam() draws a single sample using the estimated pdfs. The evolutionary algorithm characteristic of the IDEA lies in the fact that a population of individuals is used, from which individuals are selected to generate new offspring with. Using these offspring along with the parent individuals and the current population, a new population is constructed. If we set m to n − ⌊τn⌋, let sel() select the ⌊τn⌋ best samples and let rep() replace the other m samples with the newly drawn ones, we have that θ_{t+1} ≤ θ_t, so that the search proceeds through a monotonically decreasing series θ_0 ≥ θ_1 ≥ … ≥ θ_{t_end}. We call an IDEA with m, sel() and rep() so chosen, a monotonic IDEA. A minimal sketch of this main loop is given below.
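This is an illustrative Python sketch of a monotonic IDEA main loop; as a stand-in for the factorization search, fitting and sampling procedures described above, the model here is simply a single maximum-likelihood joint normal pdf:

import numpy as np

def monotonic_idea(cost, l, n=1000, tau=0.3, iterations=100,
                   lo=-5.0, hi=5.0, rng=np.random.default_rng()):
    pop = rng.uniform(lo, hi, size=(n, l))
    costs = np.apply_along_axis(cost, 1, pop)
    n_sel = int(tau * n)
    for _ in range(iterations):
        order = np.argsort(costs)              # best samples first
        pop, costs = pop[order], costs[order]
        selected = pop[:n_sel]
        # Model selection and fitting, here one joint normal pdf.
        mean = selected.mean(axis=0)
        cov = np.cov(selected, rowvar=False)
        # Replace the worst n - n_sel samples by draws from the model.
        new = rng.multivariate_normal(mean, cov, size=n - n_sel)
        pop[n_sel:] = new
        costs[n_sel:] = np.apply_along_axis(cost, 1, new)
    return pop[np.argmin(costs)]

# Example: minimize the sphere function in 5 dimensions.
best = monotonic_idea(lambda y: float(np.sum(y ** 2)), l=5)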
5.2 Previous IDEAs

All of the prior algorithms that use continuous probabilistic models have used a single factorization as the probabilistic model structure ς. The first few approaches all used the univariate factorization [18, 30, 31]. Simultaneously, the IDEA framework was introduced [8], alongside the work of Larrañaga, Etxeberria, Lozano and Peña [22]. Both of these works propose to use the Kullback-Leibler divergence between the estimated distribution and the full joint estimated distribution, which leads to minimizing the entropy over all possible factorizations. This approach is inspired by the MIMIC algorithm [7], in which only a certain constrained class of factorizations is allowed. In addition, in the work by Larrañaga et al., edge exclusion tests and a continuous version of the Bayesian Dirichlet metric for Gaussian networks (BGe) are used to learn a factorization that is not constrained. On the other hand, in the work by Bosman and Thierens [8], a general factorization is allowed in which each variable may interact with at most κ other variables. The work by Bosman and Thierens has been tested and has proven to be an effective approach [10, 11]. Apart from the recent work by Larrañaga et al. and by Bosman and Thierens, there have been no attempts to use higher order probabilistic model structures for continuous optimization.

Most of the approaches so far have solely used the normal pdf as the building block of the probability distribution over Y. The only exception to this has been the work by Gallagher, Frean and Downs [18], in which a normal mixture model is used. However, the model structure there is fixed to the univariate factorization. Bosman and Thierens have described the use of the normal kernels pdf [12]. This approach has however proven to be hard to handle [10]. On the other hand, using the normal kernels pdf has given results on certain highly epistatic problems that could not be obtained using only a single normal pdf. Therefore, a normal mixture pdf was suggested for future research. Finally, Bosman and Thierens also investigated the use of a histogram distribution [8]. Using this pdf however does not scale up [10], and it has therefore been abandoned.

5.3 New IDEAs

The novelty of the algorithms that we propose in this paper lies in the complexity of the probabilistic models. First of all, we use, for the first time, a mixture of factorizations instead of a single factorization, by finding a factorization for each of the clusters that we construct first. By effectively removing non-linear interactions between the problem variables in this way, techniques previously shown to be effective regarding the normal pdf and the learning of factorizations can be used to obtain effective probability distribution estimations. The learning of factorizations is different as well. We no longer need additional constraints on the factorization, as we use the metrics to penalize more complex models. Therefore, we no longer have a variety of factorization search algorithms with different constraints. Finally, we have shown for the first time how the normal mixture pdf can be used given any factorization, and thus in an IDEA instance in which any factorization is allowed. As the normal mixture pdf can account for additional non-linearities that may still remain within the clusters, the new algorithms are more flexible and effectively use more complex probabilistic models.

The techniques required to obtain the new optimization algorithms have been described in the previous sections of this paper. Using those techniques, we are now ready to apply our new IDEAs to continuous optimization problems, which is done in section 6.

5.4 Methodological comparison with Evolution Strategies

Approaches such as the IDEA, which build and use probabilistic models in evolutionary optimization, were first proposed as an improvement over GAs [4, 7]. The IDEA framework itself has been used to focus on continuous models for problems with real valued variables. This has resulted in a new line of continuous evolutionary algorithms. However, for continuous optimization, evolutionary algorithms have been around almost as long as the original simple binary GA [2]. Currently, we can identify two variants of this continuous evolutionary algorithm, namely Evolutionary Programming (EP) and Evolution Strategies (ES). Generally, we can state that these algorithms also make use of normal pdfs.
One might therefore say that there already exist algorithms that operate in a similar fashion to the IDEA. For this reason, it is important to question the relevance of a new approach such as the IDEA instances presented so far. In this section, we describe how ES differs from IDEA and point out that there are indeed fundamental differences that markedly distinguish the approaches from one another. To this end, we first give a brief algorithmic introduction to Evolution Strategies in section 5.4.1. Subsequently, we describe the actual differences with our IDEA approach in section 5.4.2.

5.4.1 Brief introduction to Evolution Strategies

In this section, we give a birdseye overview of Evolution Strategies. Our goal is not to give a detailed description of these approaches; for a detailed introduction, the interested reader is referred to the work by Bäck and Schwefel [3], for instance. At the top level, we can distinguish two ES variants based on the selection scheme that is used. These variants are commonly referred to as the (μ, λ)-ES and the (μ + λ)-ES. In both cases, μ denotes the number of parents and λ the number of offspring produced by recombination and mutation. In the case of the (μ, λ)-ES, the μ parents are selected from the λ offspring (λ ≥ μ). In the case of the (μ + λ)-ES, the μ parents are selected from both the parents of the previous generation and the λ offspring (λ ≥ 1). In ES, the parents are recombined and subsequently mutated, as is the case in simple GAs. However, the mutation operator has always been the most important one in ES. This operator samples from normal pdfs, which makes the ES appear similar to the IDEA with normal pdfs.

With each variable, a normal pdf is associated. Each normal pdf can either be required to have identical standard deviations in all dimensions, be aligned with the axes of the search space without necessarily identical standard deviations, or be allowed to be an arbitrary l-dimensional normal pdf. This comes down to a covariance matrix with, respectively, identical entries on the diagonal and zero entries off the diagonal, non-identical entries on the diagonal and zero entries off the diagonal, or an arbitrary symmetric covariance matrix. The variables that code the matrix are incorporated into the genome and are subject to mutation and recombination themselves. The mutation of the entries of the covariance matrix is performed through multiplication by an exponential of a normally distributed value. After the values in the covariance matrix have been updated, they are used to mutate the values of the problem variables. These values are updated by adding a normally distributed l-dimensional variable to the current sample.

An ES individual contains real values for the l dimensions of the problem. These values represent the mean of a normal pdf that is used to sample offspring from after recombination. The covariance matrix is also stored in the genome (in either restricted or unrestricted form). These real values are mutated according to a fixed strategy. For a covariance matrix with zero entries off the diagonal, the ES genome can be specified by (y_0, y_1, …, y_{l−1}, σ_0, σ_1, …, σ_{l−1}), and mutation proceeds as follows (the normally distributed variables are sampled anew for each dimension):

s = e^{N(0,\tau)}
\sigma'_i = \sigma_i\, e^{N(0,\tau')}\, s
y'_i = y_i + N(0, \sigma'_i)    (53)

For τ and τ', the usual rule of thumb is to set τ = 1/\sqrt{2\sqrt{n}} and τ' = 1/\sqrt{2n}.
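A small numpy sketch of this self-adaptive mutation (equation 53; illustrative, with the rule-of-thumb settings for τ and τ' computed from the dimension):

import numpy as np

def es_mutate(y, sigma, rng=np.random.default_rng()):
    # Self-adaptive ES mutation for a diagonal covariance matrix
    # (equation 53): first mutate the step sizes, then the variables.
    n = len(y)
    tau = 1.0 / np.sqrt(2.0 * np.sqrt(n))       # shared, global factor
    tau_prime = 1.0 / np.sqrt(2.0 * n)          # per-dimension factor
    s = np.exp(rng.normal(0.0, tau))
    new_sigma = sigma * np.exp(rng.normal(0.0, tau_prime, size=n)) * s
    new_y = y + rng.normal(0.0, new_sigma)
    return new_y, new_sigma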
Recombination in ES is generally done in one of two ways: discrete and intermediate crossover. With discrete crossover, the values themselves are swapped without being altered. With intermediate crossover, the values of the parents are averaged. Recombination is applied both to the problem values and to the covariance matrix values.

5.4.2 ES and IDEA: similarities and differences

One top level aspect that becomes clear is that a monotonic IDEA can be seen as a (μ + λ) algorithm. In evolutionary computation, such algorithms are said to be elitist. Since many algorithms in the field of EAs share such a selection strategy, this is not the most important thing to note about the two continuous evolutionary approaches.

In figure 17, a landscape is plotted in which we desire to locate the unique minimum. We assume that we have a set of samples available, indicated by crosses in the image. The IDEA approach is to fit a probability distribution over these samples as well as possible, to get a good representation of the samples. In figure 17 this is shown for a normal pdf and a normal mixture pdf. Once such a probability distribution has been estimated, λ offspring are generated by sampling from it. As selection favors better solutions, the probability distribution is meant to model the promising regions of the search space. Note that this is a global approach to optimization, as it attempts to globally capture the structure of the landscape and refines the model as the process is iterated.

We now turn our attention to the ES approach. In figure 18, a sketch is given of the mutation operation. If we set μ = 1, we have only a single solution to recombine and mutate. By doing so, we are modeling a normal pdf, just as is the case for the IDEA approach. However, its use is now different. If we only focus on sampling new solutions, we indeed get the same result as with the IDEA, since the λ offspring will be sampled from the normal pdf that is modeled. However, the way in which the normal pdf is found in the next generation differs. In the IDEA approach, the best ⌊τn⌋ samples are selected. Based on these samples, a maximum likelihood normal pdf is fit. For the ES approach however, the single parent is selected from either the offspring, or the offspring together with the single parent of the previous generation. The result is a single point in the search space, since selection is based on the function value only. This means that we can regard the ES approach with μ = 1 as moving the normal pdf in some direction. The direction as well as the length of the step over which we move the normal pdf is itself subject to evolution. In other words, the ES in this case can be seen as evolving to find local gradient information on how to best traverse the search space.

Figure 18: Operational sketch of mutation-based ES for (1, λ) (left) and (μ, λ) (right).

In the case of μ > 1 for ES, we get the representation depicted on the right-hand side of figure 18. Regarding only mutation for the moment, generating the offspring involves sampling from the normal pdfs that lie centered around the μ parent samples. Therefore, this can be seen as the equivalent of the normal mixture pdf or a clustered normal pdf in the IDEA approach. However, the way in which the μ parents make up the mixture distribution through evolution of the covariance matrices is similar to the case in which μ = 1.
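To make this mixture view concrete: mutation-only offspring generation in a (μ, λ)-ES can be read as sampling from a mixture of μ normal pdfs centered on the parents. A minimal sketch of this reading (our own, hypothetical simplification with axis-aligned pdfs and uniformly chosen parents):

```python
import numpy as np

rng = np.random.default_rng(0)

def es_offspring(parents, lam):
    """parents: list of (y, sigma) pairs; each pair defines a normal pdf
    centered on y with per-dimension standard deviations sigma. Drawing a
    child by picking a parent uniformly and mutating it is exactly a draw
    from the mixture p(y) = (1/mu) * sum_j N(y; y_j, diag(sigma_j^2))."""
    children = []
    for _ in range(lam):
        y, sigma = parents[rng.integers(len(parents))]  # uniform mixing coefficients
        children.append(y + rng.normal(0.0, sigma))
    return children
```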
As in the μ = 1 case, we can therefore see the movement of the μ parent samples as moving the normal pdfs in a certain direction with a certain step size. Finding this direction and step size is subject to evolution. ES mutation can therefore also be taken as a more locally based approach to finding and using the structure of the landscape.

Recombination in ES quite drastically changes this view of its operational semantics. If we for instance regard intermediate crossover, we are drawing lines between the μ parent samples in the l-dimensional space and placing the new samples in the middle of these lines. It becomes clear that by doing so, the search is directed more towards using the global structure of the search space to find promising regions. This strongly aids the ES in efficiently tackling a great amount of continuous optimization problems. Recombination allows for global exploration of the search space, while mutation attempts to use more local landscape information. Moreover, it should be noted that the initial span of the normal pdfs in ES is usually taken to be quite large. This further improves the initial global exploration of the landscape.

The IDEA approach is a global procedure that attempts to use the structure of the problem landscape to explore the most promising regions. On the other hand, mutation based ES approaches are local procedures that use evolution to explore the inside of promising regions. By adding recombination, additional means of globally searching the landscape are introduced, but it is not as evident as for the IDEA to what extent this helps to use global structure.

Concluding, the two approaches are fundamentally different. This becomes clear for instance on problems in which local information is important, such as Rosenbrock's function, which has a narrow valley. The bottom of this valley has a unique minimum. However, the gradient along the bottom of the valley is very small. This means that an IDEA based approach will quite easily find the valley itself, but will by no means be able to traverse the valley to find the global minimum, unless points were sampled near the global minimum. The reason for this is that density estimation converges on a certain part of the valley, since samples are only available in that part of the search space. On the other hand, once the ES is inside the valley, it can adapt its mutation direction and step size to follow the valley to its minimum in a gradient descent fashion. Even though this is a time consuming process, the ES is not as likely to prematurely converge on such a problem as is the IDEA approach.

6 Experiments

The continuous function optimization problems we used for testing are the following:

Name         Definition                                                                      Domain
Griewank     (1/4000) Σ_{i=0}^{l-1} (y_i - 100)² - Π_{i=0}^{l-1} cos((y_i - 100)/√(i+1)) + 1  [-600, 600]^l
Michalewicz  -Σ_{i=0}^{l-1} sin(y_i) (sin((i+1) y_i² / π))^20                                 [0, π]^l
Rosenbrock   Σ_{i=0}^{l-2} (100 (y_{i+1} - y_i²)² + (1 - y_i)²)                               [-5.12, 5.12]^l

Each of these functions is to be minimized. We tested a large variety of IDEA variants for l = 5. In sections 6.1, 6.2 and 6.3 we go over the results for the individual functions. In all our testing, we used a monotonic IDEA. We used the rule of thumb by Mühlenbein and Mahnig [24] for FDA and set τ to 0.3. Furthermore, for variants that use clustering, we let n increase from 250 to 5000 in steps of 250. For variants that do not use clustering, we increased the population size in steps of 25. The reason for this is that by introducing clusters, we also need to increase the population size by a larger amount to allow a significant change in cluster size. We allowed each run a maximum of 1 · 10^7 evaluations for all non-clustered approaches.
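For concreteness, the following sketch shows the shape of one generation of the kind of clustered, monotonic IDEA tested here: truncation selection of the best ⌊τn⌋ samples, clustering of the selection, a maximum-likelihood normal pdf per cluster, and sampling from the resulting mixture. This is our own minimal, hypothetical illustration; the simplified leader algorithm, the full-covariance normal fit (in place of per-cluster factorization learning) and the cluster-size mixing coefficients are our simplifications, not the code used for the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def leader_cluster(points, t_d):
    """Simplified leader algorithm: the first point starts a cluster; each
    next point joins the nearest leader within distance t_d, otherwise it
    becomes a new leader. (The scaled distances of the text are omitted.)"""
    leaders, clusters = [], []
    for p in points:
        d = [np.linalg.norm(p - q) for q in leaders]
        if d and min(d) <= t_d:
            clusters[int(np.argmin(d))].append(p)
        else:
            leaders.append(p)
            clusters.append([p])
    return [np.array(c) for c in clusters]

def one_generation(pop, f, tau=0.3, t_d=3.5):
    """One monotonic-IDEA generation (minimization) with a mixture of
    normal pdfs over the clusters of the selected samples."""
    n = len(pop)
    sel = pop[np.argsort([f(y) for y in pop])][: int(tau * n)]  # truncation selection
    clusters = leader_cluster(sel, t_d)
    # Mixing coefficients proportional to cluster sizes (cf. equation 35).
    coef = np.array([len(c) for c in clusters], dtype=float)
    coef /= coef.sum()
    new = []
    for _ in range(n - len(sel)):  # offspring replace the worst n - |sel| samples
        c = clusters[rng.choice(len(clusters), p=coef)]
        mean = c.mean(axis=0)
        cov = np.cov(c, rowvar=False) if len(c) > 1 else np.zeros((c.shape[1], c.shape[1]))
        cov = cov + 1e-6 * np.eye(c.shape[1])  # regularize tiny clusters
        new.append(rng.multivariate_normal(mean, cov))
    return np.vstack([sel, new])
```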
Because of time restrictions, we only allowed 2.5 · 10^6 evaluations for the clustered approaches, with the exception of the Euclidean leader algorithms on Griewank's function. If all of the solutions differed by less than 5 · 10^-7, a run was terminated. We searched for both unconditional and conditional factorizations, with both the AIC metric and the BIC metric. For the normal mixture pdf, it was empirically observed that inference with respect to finding a factorization is extremely computationally expensive. Therefore, we have fixed the factorization structure to the univariate one for the normal mixture pdf. However, we did test various amounts of components in the normal mixture pdf, namely w ∈ {2, 5, 10}. We applied clustering using both the leader algorithm and the k-means clustering algorithm. In the case of the leader algorithm, we applied both the scaled Euclidean distance and the Mahalanobis distance. For the k-means clustering algorithm, we only used the scaled Euclidean distance. We used k ∈ {2, 5, 10} for the k-means clustering algorithm and T_d ∈ {3.5, 5} for the variants of the leader algorithm. This gives on average 7 and 2 clusters respectively for the scaled Euclidean distance and 11 and 3 clusters respectively for the Mahalanobis distance. The factorization mixture coefficients are determined using equation 35.

For each IDEA instance, we present the best obtained average of the best results over 10 independent runs (the column C in the tables) and sort all of the IDEA instances primarily by this index. We also show the population size n for the best obtained result, the average amount of evaluations and the Relative Run Time RT. Let FT(x) be the time to perform x random function evaluations and TRT the Total Run Time on the same processing system. Then RT(x) = TRT/FT(x). We determined RT as RT(10^6). The RT index is a processing system independent, fair comparison metric of the total required time. We sort the results secondarily by this index instead of by the amount of function evaluations, as it truly reflects the required amount of time. In the result tables, the column f denotes the factorization type: (π, ω) for a conditional factorization, ν for an unconditional factorization and univ. for the univariate factorization; a dash marks an entry that does not apply.

6.1 Griewank's function

In figure 19, a summary of the results on Griewank's function is given. The unique minimum is 0, which is found by quite a few of the tested approaches within the allowed precision. Quite a lot of approaches without clustering are able to find the unique minimum. This gives us the impression that Griewank's function is not extremely epistatic or non-linear. On the other hand, the amount of required function evaluations is very large when compared to clustering approaches or normal mixture approaches. The non-clustered normal mixture pdf with w = 10 for instance requires only 51832 evaluations on average to minimize the function, whereas the approaches that only use a single normal pdf require at least 945971 evaluations on average. However, the RT index points out an important fact. It can be seen from the table that the running time required for the normal mixture pdf is quite a lot larger. Even though the single normal pdf requires 20 times more evaluations, it runs 2 to 3 times as fast as the normal mixture pdf in which only the univariate factorization is used. We should note at this point that if the evaluation time for a single solution had been significantly larger, the normal mixture pdf would have been preferable. If we take a look at the different clustering approaches, we first note that the successful entries in the table use the normal mixture pdf.
It seems that we should mostly attribute the success of these approaches to the use of the normal mixture pdf rather than to the use of clustering. If we take a look at clustering approaches in combination with only a normal pdf, it seems that the Mahalanobis distance does not result in superior clustering within the IDEA framework. What is very interesting however, is that the k-means clustering approaches seem to perform significantly worse than the leader approach, unless clustering is combined with the normal mixture pdf. The main reason for this is the difference in the allowed amount of function evaluations. Otherwise, there does not seem to be a large difference between the two approaches. The only reason why they differ in their outcome is that the amount of clusters differs. If we then regard the Euclidean leader algorithm with the BIC metric, T_d = 3.5 and unconditional dependencies, as well as the k-means algorithm with k = 5, we see that the running time of the leader algorithm is only slightly larger than that of the k-means algorithm, even though the population size is equal and the amount of evaluations is 3 times as large. This makes the leader algorithm preferable here. On the basis of the presented results, it is hard to say whether conditional or unconditional dependencies are to be preferred. The same holds for the search metric. The reason for this is that the amount of dimensions l is very small. The choice of the metric and the type of dependencies becomes truly significant as l goes up. Still, it seems that for expressional power, the conditional dependencies should be preferred.

Clustering   Distance     f      Metric  pdf     n     C         evals    RT
-            -            (π,ω)  AIC     normal  175   0.000000  945971   2.21
-            -            ν      AIC     normal  175   0.000000  1039070  2.55
-            -            (π,ω)  BIC     normal  150   0.000000  1403463  3.19
-            -            ν      BIC     normal  200   0.000000  1492826  3.48
-            -            univ.  -       mix 10  1100  0.000000  51832    6.61
k-means 2    Euclidean    univ.  -       mix 10  1750  0.000000  72552    8.86
Leader 3.5   Mahalanobis  univ.  -       mix 5   3250  0.000000  156311   9.97
Leader 3.5   Mahalanobis  univ.  -       mix 10  2250  0.000000  89718    10.30
k-means 5    Euclidean    univ.  -       mix 10  2500  0.000000  98805    11.43
Leader 5     Euclidean    (π,ω)  AIC     normal  250   0.000000  4522165  11.65
k-means 5    Euclidean    univ.  -       mix 5   3250  0.000000  172812   11.81
-            -            univ.  -       mix 5   1600  0.000000  163024   13.01
Leader 5     Euclidean    univ.  -       mix 10  2250  0.000000  106660   13.65
Leader 5     Mahalanobis  univ.  -       mix 10  2250  0.000000  110206   14.40
Leader 5     Euclidean    univ.  -       mix 5   2500  0.000000  226102   16.42
k-means 2    Euclidean    univ.  -       mix 5   3000  0.000000  228857   16.75
k-means 10   Euclidean    univ.  -       mix 10  3750  0.000000  156386   17.58
Leader 3.5   Euclidean    univ.  -       mix 10  4750  0.000000  199321   23.74
Leader 5     Mahalanobis  univ.  -       mix 5   3250  0.000000  353185   26.94
Leader 3.5   Euclidean    univ.  -       mix 5   4750  0.000047  360632   22.05
k-means 10   Euclidean    univ.  -       mix 5   5000  0.000072  274139   17.00
Leader 5     Euclidean    ν      AIC     normal  500   0.000636  7477291  18.96
Leader 5     Euclidean    ν      BIC     normal  250   0.000905  3706986  10.29
Leader 5     Euclidean    univ.  -       mix 2   4000  0.001232  3239155  106.93
Leader 5     Euclidean    (π,ω)  BIC     normal  250   0.001278  6631367  16.59
Leader 3.5   Mahalanobis  univ.  -       mix 2   4750  0.004618  698637   23.12
-            -            univ.  -       mix 2   2750  0.005451  4818328  123.82
Leader 3.5   Euclidean    ν      BIC     normal  1000  0.005755  6620613  18.40
Leader 5     Mahalanobis  ν      BIC     normal  250   0.005956  1628976  7.51
k-means 5    Euclidean    univ.  -       mix 2   3000  0.006179  653522   24.00
Leader 3.5   Euclidean    (π,ω)  BIC     normal  750   0.006705  7268071  19.47
Leader 5     Mahalanobis  univ.  -       mix 2   2000  0.007339  1826277  54.79
Leader 3.5   Euclidean    ν      AIC     normal  1000  0.007477  2775053  18.38
Leader 3.5   Euclidean    (π,ω)  AIC     normal  1250  0.007640  8992514  25.56
k-means 2    Euclidean    univ.  -       mix 2   1750  0.007930  979638   25.97
Leader 5     Mahalanobis  ν      AIC     normal  250   0.009353  1338268  5.81
Leader 5     Mahalanobis  (π,ω)  BIC     normal  250   0.009428  2499626  10.28
Leader 3.5   Mahalanobis  (π,ω)  AIC     normal  750   0.010497  2127631  10.52
Leader 3.5   Euclidean    univ.  -       mix 2   5000  0.011837  751763   21.90
Leader 3.5   Mahalanobis  ν      BIC     normal  750   0.011987  2007309  9.70
k-means 2    Euclidean    ν      AIC     normal  250   0.012384  736524   2.48
k-means 2    Euclidean    (π,ω)  BIC     normal  500   0.012742  2381377  9.06
Leader 5     Mahalanobis  (π,ω)  AIC     normal  250   0.012981  1793646  7.61
Leader 3.5   Mahalanobis  (π,ω)  BIC     normal  500   0.013007  877386   4.11
k-means 2    Euclidean    (π,ω)  AIC     normal  250   0.013202  1172454  3.89
k-means 10   Euclidean    univ.  -       mix 2   4750  0.013400  958065   31.46
k-means 2    Euclidean    ν      BIC     normal  250   0.013514  793724   2.79
k-means 10   Euclidean    (π,ω)  AIC     normal  2500  0.013601  1718042  19.35
Leader 3.5   Mahalanobis  ν      AIC     normal  750   0.014672  2296346  11.27
k-means 10   Euclidean    ν      BIC     normal  2250  0.015987  703016   17.55
k-means 10   Euclidean    ν      AIC     normal  5000  0.015921  1662724  18.00
k-means 10   Euclidean    (π,ω)  BIC     normal  2500  0.016360  2054672  23.03
k-means 5    Euclidean    ν      AIC     normal  1000  0.016410  1094122  6.40
k-means 5    Euclidean    ν      BIC     normal  1000  0.019069  2361092  17.05
k-means 5    Euclidean    (π,ω)  BIC     normal  1000  0.020145  2236489  14.47
k-means 5    Euclidean    (π,ω)  AIC     normal  250   0.101171  1725182  6.63

Figure 19: Results on Griewank's function, sorted on C (primary) and RT (secondary).

6.2 Michalewicz's function

Observing the results on the highly epistatic Michalewicz function in figure 20, it becomes clear immediately that clustering is required. The non-clustering approaches only work well if the normal mixture pdf is used, which is in a way comparable to using clustering approaches. There is no significant difference between conditional and unconditional dependencies, but in the case of conditional dependencies, the required amount of function evaluations is often lower. There seems to be a slight preference for the Mahalanobis distance, but the Euclidean distance seems to work quite well. A more significant difference is present between the leader clustering approach and the k-means clustering approach. The k-means approaches all seem to be inferior to the leader approaches. The only explanation for this can be the adaptiveness of the leader algorithm through the threshold T_d. This adaptiveness allows the leader algorithm to be flexible in the amount of required clusters, which in turn results in a better performance. We note that the normal mixture pdf requires on average quite a lot fewer evaluations than does any other approach using only the normal pdf. However, the computation time required to estimate the parameters of the normal mixture significantly increases the running time. The main thing to note is that whereas for Griewank's function we were still able to get good results even with only the normal pdf, clustering becomes much more important here.

6.3 Rosenbrock's function

Rosenbrock's function has proven to be a real challenge for any continuous optimization algorithm. It has a valley along which the quality of solutions is much better than that of the solutions in its neighborhood. Furthermore, this valley has a unique minimum of 0 itself. The gradient along the bottom of the valley is only very slight.
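As a quick numerical illustration of this claim (our own sketch, using the Rosenbrock definition from the test-function table in section 6), the gradient norm at a point on the valley floor y_{i+1} = y_i² is markedly smaller than at a point just off the floor:

```python
import numpy as np

def rosenbrock(y):
    """Rosenbrock's function as defined in the test-function table."""
    return np.sum(100.0 * (y[1:] - y[:-1] ** 2) ** 2 + (1.0 - y[:-1]) ** 2)

def num_grad(f, y, h=1e-6):
    """Central-difference gradient, adequate for a rough comparison."""
    g = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = h
        g[i] = (f(y + e) - f(y - e)) / (2 * h)
    return g

# A point exactly on the valley floor (each coordinate the square of the
# previous one) and the same point nudged off the floor in one dimension.
on_floor = np.array([0.5, 0.25, 0.0625, 0.00390625, 0.0000152587890625])
off_floor = on_floor + np.array([0.0, 0.1, 0.0, 0.0, 0.0])

print(np.linalg.norm(num_grad(rosenbrock, on_floor)))   # weak pull along the valley
print(np.linalg.norm(num_grad(rosenbrock, off_floor)))  # much steeper pull back to the floor
```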
Any gradient approach is therefore doomed to follow the long road along the bottom of the valley. For a density estimation algorithm, capturing the valley in a probabilistic model is difficult, even if all of the points within the valley are known. The reason for this is that the valley is non-linear in the coding space. Therefore, we expect that to get any reasonable results, clustering is required.

In figure 21, the results on Rosenbrock's function for l = 5 are given. It is clear from this table that any approach that does not use clustering does not give useful results. Using 10 clusters allows us to effectively process the non-linearity of the valley and optimize the problem using only very few evaluations. We note that using a mixture of normal pdfs in this case does not help the optimization process at all. In some cases it even results in more evaluations than when using a single normal pdf. The main issue is the non-linearity of the valley and its slight gradient. Because only the univariate factorization is used for the normal mixture pdf, the non-linearity cannot be captured effectively.

The problem with density estimation in this continuous search space is that it does not use the gradient information that can be extremely useful for finding the minimum. Given the continuity of the search space, it is even unwise not to exploit this information in some way, which is demonstrated by the results in figure 21: quite a large amount of clusters is required to effectively optimize the problem. Note that the amount of dimensions is only very small. If the amount of dimensions goes up, the amount of clusters will have to increase accordingly. However, by doing so, the amount of samples will have to increase as well to ensure a large enough cluster size. Therefore, the sole use of density estimation on search spaces such as the one defined by Rosenbrock's function is not effective enough to compete with gradient approaches. However, a hybrid combination of both approaches will most likely result in very effective continuous optimization techniques.

We close this section by noting that the k-means clustering algorithm now outperforms the leader algorithm. The main reason for this is that clustering is extremely important for this function. With k = 10, we have the largest amount of clusters that we tested. If we had used a threshold lower than T_d = 3.5, we would most likely have obtained better results with the leader algorithm as well. Finally, it is clear that the Euclidean distance measure significantly outperforms the Mahalanobis distance measure.

7 Discussion

The use of the normal mixture pdf in combination with the EM algorithm as proposed in this paper requires a vast amount of computation time. For highly epistatic search spaces however, it has been shown to be effective with respect to the required amount of evaluations as well as optimization performance. Since the EM algorithm gives a general approach to finding the involved parameters, the normal mixture pdf is preferable over the normal kernels pdf that was proposed earlier [12]. The use of clustering in combination with the normal pdf also results in a normal mixture pdf. However, the results in this paper show different behavior of the two techniques on different landscapes. On non-linear landscapes, clustering is the most effective way to estimate the non-linearity, unless higher order factorizations are allowed for the normal mixture pdf.
However, allowing such factorizations results in vast additional amounts of computation time, which is undesirable.

Clustering   Distance     f      Metric  pdf     n     C          evals    RT
Leader 5     Mahalanobis  (π,ω)  AIC     normal  750   -4.687658  95364    0.32
Leader 3.5   Euclidean    univ.  -       mix 2   1500  -4.687658  38416    0.37
Leader 5     Mahalanobis  ν      BIC     normal  1000  -4.687658  117979   0.38
Leader 5     Mahalanobis  (π,ω)  BIC     normal  1000  -4.687658  122711   0.41
Leader 3.5   Euclidean    (π,ω)  BIC     normal  4000  -4.687658  180813   0.42
Leader 5     Mahalanobis  ν      AIC     normal  1000  -4.687658  131561   0.43
Leader 5     Mahalanobis  univ.  -       mix 5   1000  -4.687658  22118    0.45
Leader 3.5   Euclidean    ν      BIC     normal  3500  -4.687658  193759   0.45
Leader 3.5   Euclidean    ν      AIC     normal  3750  -4.687658  194792   0.46
Leader 3.5   Euclidean    (π,ω)  AIC     normal  4000  -4.687658  201120   0.47
Leader 5     Euclidean    ν      BIC     normal  2250  -4.687658  221314   0.47
k-means 2    Euclidean    univ.  -       mix 2   2000  -4.687658  49634    0.52
Leader 3.5   Mahalanobis  (π,ω)  BIC     normal  3500  -4.687658  129114   0.54
Leader 5     Mahalanobis  univ.  -       mix 2   2000  -4.687658  51035    0.55
Leader 5     Euclidean    univ.  -       mix 2   2250  -4.687658  58789    0.58
k-means 5    Euclidean    univ.  -       mix 5   1250  -4.687658  28954    0.59
Leader 3.5   Mahalanobis  ν      BIC     normal  3750  -4.687658  139646   0.59
-            -            univ.  -       mix 5   1300  -4.687658  28903    0.61
Leader 5     Euclidean    (π,ω)  BIC     normal  1500  -4.687658  301297   0.63
k-means 5    Euclidean    univ.  -       mix 2   2500  -4.687658  63128    0.67
Leader 3.5   Mahalanobis  ν      AIC     normal  4250  -4.687658  155654   0.68
k-means 2    Euclidean    univ.  -       mix 10  750   -4.687658  16596    0.70
k-means 10   Euclidean    univ.  -       mix 5   1750  -4.687658  43894    0.88
Leader 3.5   Mahalanobis  univ.  -       mix 2   3500  -4.687658  86528    0.92
Leader 5     Euclidean    ν      AIC     normal  3250  -4.687658  423172   0.92
Leader 5     Euclidean    univ.  -       mix 10  1000  -4.687658  21680    0.96
k-means 2    Euclidean    univ.  -       mix 5   2000  -4.687658  45606    0.96
Leader 3.5   Mahalanobis  univ.  -       mix 5   2250  -4.687658  52288    1.04
Leader 5     Euclidean    (π,ω)  AIC     normal  2000  -4.687658  512489   1.08
k-means 10   Euclidean    univ.  -       mix 10  1250  -4.687658  30925    1.10
Leader 3.5   Mahalanobis  univ.  -       mix 10  1500  -4.687658  34475    1.25
Leader 3.5   Euclidean    univ.  -       mix 5   3000  -4.687658  70232    1.33
k-means 10   Euclidean    univ.  -       mix 2   4250  -4.687658  115106   1.34
Leader 5     Euclidean    univ.  -       mix 5   3000  -4.687658  68131    1.38
-            -            univ.  -       mix 2   5000  -4.687658  135938   1.45
k-means 5    Euclidean    univ.  -       mix 10  1750  -4.687658  40522    1.51
-            -            univ.  -       mix 10  1800  -4.687658  39756    1.61
Leader 3.5   Euclidean    univ.  -       mix 10  2000  -4.687658  45431    1.68
Leader 5     Mahalanobis  univ.  -       mix 10  2500  -4.687658  58751    2.18
Leader 3.5   Mahalanobis  (π,ω)  AIC     normal  4250  -4.682400  258326   1.14
k-means 2    Euclidean    (π,ω)  AIC     normal  2500  -4.661225  114406   0.48
k-means 5    Euclidean    (π,ω)  BIC     normal  5000  -4.660579  825985   2.63
k-means 2    Euclidean    ν      BIC     normal  4750  -4.659790  543562   1.44
-            -            ν      BIC     normal  325   -4.659300  1032504  2.13
-            -            (π,ω)  AIC     normal  1100  -4.657363  4064810  8.20
k-means 2    Euclidean    (π,ω)  BIC     normal  3500  -4.656492  615331   1.54
k-means 2    Euclidean    ν      AIC     normal  3750  -4.656120  646464   1.66
-            -            (π,ω)  BIC     normal  1700  -4.654513  4088736  8.26
-            -            ν      AIC     normal  175   -4.654248  13693    0.03
k-means 10   Euclidean    ν      BIC     normal  2000  -4.650213  377993   1.48
k-means 5    Euclidean    (π,ω)  AIC     normal  4000  -4.643722  773225   2.35
k-means 5    Euclidean    ν      AIC     normal  2750  -4.639595  443323   1.38
k-means 10   Euclidean    ν      AIC     normal  3500  -4.638577  856142   3.31
k-means 10   Euclidean    (π,ω)  BIC     normal  4250  -4.637136  551834   2.35
k-means 5    Euclidean    ν      BIC     normal  3750  -4.635958  499408   1.59
k-means 10   Euclidean    (π,ω)  AIC     normal  2250  -4.622931  1262262  4.52

Figure 20: Results on Michalewicz's function, sorted on C (primary) and RT (secondary).

Clustering   Distance     f      Metric  pdf     n     C         evals    RT
k-means 10   Euclidean    (π,ω)  AIC     normal  2500  0.000000  67506    3.03
k-means 10   Euclidean    (π,ω)  BIC     normal  3000  0.000000  82575    3.65
k-means 10   Euclidean    ν      AIC     normal  4250  0.000000  157886   6.30
k-means 10   Euclidean    ν      BIC     normal  2750  0.001550  168627   5.35
Leader 3.5   Euclidean    (π,ω)  AIC     normal  4750  0.002011  166477   1.81
Leader 3.5   Euclidean    ν      AIC     normal  5000  0.033155  773470   8.84
Leader 3.5   Euclidean    (π,ω)  BIC     normal  5000  0.050435  646558   7.36
k-means 10   Euclidean    univ.  -       mix 5   3500  0.062745  309262   31.00
Leader 5     Euclidean    univ.  -       mix 10  4000  0.065469  341870   72.56
Leader 3.5   Euclidean    ν      BIC     normal  5000  0.074347  742398   8.52
k-means 5    Euclidean    (π,ω)  BIC     normal  1250  0.078283  53482    1.18
k-means 5    Euclidean    ν      AIC     normal  4750  0.079437  904017   15.03
k-means 5    Euclidean    (π,ω)  AIC     normal  4750  0.083104  615903   10.53
Leader 5     Euclidean    univ.  -       mix 5   2250  0.085702  191173   19.92
Leader 5     Euclidean    univ.  -       mix 2   3500  0.092902  278625   13.86
k-means 10   Euclidean    univ.  -       mix 10  4250  0.106256  610238   89.13
k-means 10   Euclidean    univ.  -       mix 2   2000  0.125456  272568   13.00
Leader 5     Euclidean    ν      AIC     normal  3750  0.128021  478400   4.81
Leader 3.5   Euclidean    univ.  -       mix 2   3500  0.143837  621458   20.36
k-means 5    Euclidean    ν      BIC     normal  4750  0.160205  1246595  19.71
k-means 5    Euclidean    univ.  -       mix 5   4750  0.160414  531505   45.24
Leader 3.5   Euclidean    univ.  -       mix 10  4500  0.172262  472424   79.29
Leader 3.5   Euclidean    univ.  -       mix 5   3000  0.189304  286372   25.40
k-means 2    Euclidean    ν      BIC     normal  5000  0.189420  1621587  19.09
Leader 5     Mahalanobis  univ.  -       mix 5   2250  0.191094  261108   27.56
Leader 5     Euclidean    (π,ω)  AIC     normal  4500  0.209030  1361006  13.73
Leader 3.5   Mahalanobis  ν      AIC     normal  2000  0.209074  808100   20.10
Leader 3.5   Mahalanobis  (π,ω)  AIC     normal  2500  0.247123  1765538  43.24
k-means 5    Euclidean    univ.  -       mix 10  5000  0.259675  551594   86.79
Leader 5     Mahalanobis  univ.  -       mix 2   4250  0.260017  457718   24.28
Leader 5     Mahalanobis  univ.  -       mix 10  4500  0.261190  735926   145.22
Leader 5     Euclidean    ν      BIC     normal  2250  0.285851  1338881  13.80
-            -            univ.  -       mix 5   450   0.288066  37074    3.52
Leader 5     Euclidean    (π,ω)  BIC     normal  5000  0.304032  1427719  14.91
Leader 3.5   Mahalanobis  ν      BIC     normal  2500  0.325525  918273   23.56
-            -            univ.  -       mix 10  1400  0.345850  133443   25.19
Leader 5     Mahalanobis  (π,ω)  BIC     normal  3000  0.363407  2252646  49.28
k-means 5    Euclidean    univ.  -       mix 2   4250  0.366888  600938   24.30
Leader 3.5   Mahalanobis  univ.  -       mix 10  3500  0.367077  555588   100.59
Leader 3.5   Mahalanobis  univ.  -       mix 2   2750  0.396008  378320   20.99
k-means 2    Euclidean    univ.  -       mix 2   5000  0.421393  1166456  45.95
Leader 3.5   Mahalanobis  (π,ω)  BIC     normal  1750  0.445489  1915536  46.29
Leader 3.5   Mahalanobis  univ.  -       mix 5   2500  0.459225  406324   40.36
k-means 2    Euclidean    univ.  -       mix 10  3250  0.481794  291164   52.68
k-means 2    Euclidean    univ.  -       mix 5   2250  0.495617  180535   17.21
Leader 5     Mahalanobis  ν      AIC     normal  3250  0.605696  1506264  32.43
Leader 5     Mahalanobis  (π,ω)  AIC     normal  5000  0.615485  2316097  53.52
Leader 5     Mahalanobis  ν      BIC     normal  4250  0.774399  1978082  43.80
k-means 2    Euclidean    (π,ω)  AIC     normal  4750  0.923545  1057429  12.74
-            -            univ.  -       mix 2   975   1.083728  92292    5.01
k-means 2    Euclidean    (π,ω)  BIC     normal  1250  1.281874  79419    1.53
k-means 2    Euclidean    ν      AIC     normal  3250  1.298608  919056   10.63
-            -            (π,ω)  BIC     normal  2750  1.859228  2370574  19.30
-            -            (π,ω)  AIC     normal  675   1.867434  79619    0.72
-            -            ν      BIC     normal  4000  1.906433  1592727  13.52
-            -            ν      AIC     normal  900   1.971504  135492   1.10

Figure 21: Results on Rosenbrock's function, sorted on C (primary) and RT (secondary).

Given the low computational requirements of the factorized normal pdf, introducing clustering is the most cost effective way to improve the performance of IDEAs on a variety of non-linear and epistatic continuous optimization functions. To further improve the effectiveness of the algorithms, local gradient information must be taken into account. By only estimating probability distributions and generating more samples from them, gradient information is ignored. On truly differentiable continuous search spaces however, this information is a very important guide for finding local minima or maxima. Implicitly, approaches such as Evolution Strategies make use of such gradient information, be it completely local or on a less detailed scale. Furthermore, many classical gradient search techniques exist that have proven to be very effective. Therefore, to improve the optimization performance of IDEA approaches on differentiable continuous search spaces, a local search method that uses gradient information should be incorporated. Combining the global approach of the IDEA and local gradient search will likely be very effective.

For the experiments in this paper, we have set the mixing coefficients of the factorization mixture to the proportional cluster size. To increase the optimization pressure, the coefficients should be set to the proportional average cluster fitness. Preliminary results have shown that this is effective for many different cost functions.

In this paper, we have introduced general mixtures of factorizations. To learn them, we have applied clustering and subsequent factorization learning in each of the clusters separately. The amount of clusters has so far been determined either directly by means of the k-means clustering algorithm or indirectly through a distance threshold in the leader algorithms. Even though the latter is a form of adaptive clustering, it is not truly adaptive in the sense that new clusters are introduced as they are required to estimate the probability distribution better. If we had such a mechanism that is effective with respect to computational requirements, clustering in density estimation for evolutionary optimization in IDEAs would become even more valuable.

From the results in this paper, we find that the Mahalanobis distance for clustering does not introduce a significant optimization performance increase over the use of the Euclidean distance measure. Since the Euclidean distance measure can be evaluated faster, this measure should be preferred. Furthermore, the use of the simple and fast leader clustering algorithm gives results that are comparable to those of the k-means clustering algorithm. Therefore, the leader algorithm is preferable.

Since we have only tested low dimensional problems, the use of unconditional factorizations versus conditional factorizations can hardly be commented upon. However, it seems that using conditional factorizations often leads to superior results. The use of either the AIC or the BIC metric to search for a factorization should be tested on problems of a higher dimensionality. Moreover, finding a good factorization becomes more important for problems in which the problem structure is made up of true building blocks.
On such problems, the linkage between the problem variables has to be exploited in the best way possible to efficiently process the generated solutions and traverse the search space [9]. Such problems occur both in differentiable continuous optimization and in spaces such as the one defined by using random keys. Random keys are continuous values that are used to encode permutations. Therefore, we are actually dealing with a discrete space in which problem structure plays a subtly different role than is the case in conventional continuous differentiable problems.

8 Conclusions

We have expanded the use of continuous probabilistic models for any factorization from a single normal pdf to mixtures of normal pdfs. To learn normal mixture pdfs, we have used the EM algorithm. Next to the normal mixture pdf, we have shown how efficient clustering algorithms can be used to break the non-linearity of the search space. By treating each cluster as a separate sample vector, we again obtain a normal mixture pdf. The two approaches to reaching the mixture are however fundamentally different. Furthermore, the EM algorithm has been found to be computationally more expensive. For non-linear search spaces, clustering is a cost effective way to break this non-linearity, in order for the promising regions of the search space to be explored efficiently. The resulting probabilistic model structure is then a mixture of factorizations.

We have clarified the use of a set of intuitive search metrics and have shown how these fit in with the basic notion of likelihood, which is a measure of how well a probability distribution fits a vector of samples. Starting from there, we have shown how well known measures such as the AIC and BIC metric can be derived. It becomes clear that these metrics are quite similar and only differ in the amount of regularization that is applied to ensure a good generalization performance of the estimated probability distribution. In subsequent research, the use of the AIC and BIC metric in higher dimensional search spaces and non-continuous search spaces is to be investigated to get a notion of their performance quality.

Even though nice optimization results have been achieved for varying epistatic and non-linear differentiable continuous search spaces, the use of gradient information to help guide the search towards promising regions of the search space will most likely result in even more effective continuous optimization algorithms. By only using the estimation of and sampling from probability distributions, as is done at this point, the local gradient information is ignored. In truly differentiable continuous search spaces however, it is a waste to ignore such important information.

Summarizing, in this paper we have extended the set of tools available to IDEAs for performing continuous optimization. This extension consists of new and more complex tools that we have shown to allow for a better performance of IDEAs.

References

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki, editors, Proceedings of the 2nd International Symposium on Information Theory, pages 267-281. Akademiai Kiado, 1973.
[2] T. Bäck and H.-P. Schwefel. An overview of evolutionary algorithms for parameter optimization. Evolutionary Computation, 1(1):1-23, 1993.
[3] T. Bäck and H.-P. Schwefel. Evolution strategies I: Variants and their computational implementation. In G. Winter, J. Périaux, M. Galán, and P.
Cuesta, editors, Genetic Algorithms in Engineering and Computer Science, Proceedings of the First Short Course EUROGEN'95, pages 111-126. Wiley, 1995.
[4] S. Baluja and R. Caruana. Removing the genetics from the standard genetic algorithm. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 38-46. Morgan Kaufmann Publishers, 1995.
[5] S. Baluja and S. Davies. Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. In D.H. Fisher, editor, Proceedings of the 1997 International Conference on Machine Learning. Morgan Kaufmann Publishers, 1997.
[6] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. ICSI Technical Report TR-97-021. http://www.icsi.berkeley.edu/~bilmes/papers/em.ps.gz, 1997.
[7] J.S. De Bonet, C. Isbell, and P. Viola. MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, 9, 1996.
[8] P.A.N. Bosman and D. Thierens. An algorithmic framework for density estimation based evolutionary algorithms. Utrecht University Technical Report UU-CS-1999-46. ftp://ftp.cs.uu.nl/pub/RUU/CS/techreps/CS-1999/1999-46.ps.gz, 1999.
[9] P.A.N. Bosman and D. Thierens. Linkage information processing in distribution estimation algorithms. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 60-67. Morgan Kaufmann Publishers, 1999.
[10] P.A.N. Bosman and D. Thierens. Continuous iterated density estimation evolutionary algorithms within the IDEA framework. In M. Pelikan, H. Mühlenbein, and A.O. Rodriguez, editors, Proceedings of the Optimization by Building and Using Probabilistic Models OBUPM Workshop at the Genetic and Evolutionary Computation Conference GECCO-2000. Morgan Kaufmann Publishers, 2000.
[11] P.A.N. Bosman and D. Thierens. Expanding from discrete to continuous estimation of distribution algorithms: The IDEA. In M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J.J. Merelo, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN VI, pages 767-776. Springer, 2000.
[12] P.A.N. Bosman and D. Thierens. IDEAs based on the normal kernels probability density function. Utrecht University Technical Report UU-CS-2000-11. ftp://ftp.cs.uu.nl/pub/RUU/CS/techreps/CS-2000/2000-11.ps.gz, 2000.
[13] P.A.N. Bosman and D. Thierens. Negative log-likelihood and statistical hypothesis testing as the basis of model selection in IDEAs. In A. Feelders, editor, Proceedings of the Tenth Dutch-Netherlands Conference on Machine Learning. Tilburg University, 2000.
[14] S. Brandt. Statistical and Computational Methods in Data Analysis. North-Holland, 1970.
[15] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[16] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39:1-38, 1977.
[17] W. Feller. An Introduction to Probability Theory and Its Applications, Volume 1. Wiley, 1968.
[18] M. Gallagher, M. Frean, and T. Downs. Real-valued evolutionary optimization using a flexible probability density estimator. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E.
Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 840-846. Morgan Kaufmann Publishers, 1999.
[19] G. Harik. Linkage learning via probabilistic modeling in the ECGA. IlliGAL Technical Report 99010. ftp://ftp-illigal.ge.uiuc.edu/pub/papers/IlliGALs/99010.ps.Z, 1999.
[20] J.A. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.
[21] M.G. Kendall and A. Stuart. The Advanced Theory of Statistics, Volume 2, Inference and Relationship. Charles Griffin & Company Limited, 1967.
[22] P. Larrañaga, R. Etxeberria, J.A. Lozano, and J.M. Peña. Optimization by learning and simulation of Bayesian and Gaussian networks. University of the Basque Country Technical Report EHU-KZAA-IK-4/99. http://www.sc.ehu.es/ccwbayes/postscript/kzaa-ik-04-99.ps, 1999.
[23] P.C. Mahalanobis. On tests and measures of group divergence I. Journal of the Asiatic Society of Bengal, 26:541, 1930.
[24] H. Mühlenbein and T. Mahnig. FDA - a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7:353-376, 1999.
[25] M. Pelikan and D.E. Goldberg. Genetic algorithms, clustering, and the breaking of symmetry. In M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J.J. Merelo, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN VI, pages 385-394. Springer, 2000.
[26] M. Pelikan, D.E. Goldberg, and E. Cantú-Paz. BOA: The Bayesian optimization algorithm. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 525-532. Morgan Kaufmann Publishers, 1999.
[27] M. Pelikan, D.E. Goldberg, and K. Sastry. Bayesian optimization algorithm, decision graphs and Occam's razor. Also available as IlliGAL Report 2000020. ftp://ftp-illigal.ge.uiuc.edu/pub/papers/IlliGALs/2000020.ps.Z, 2000.
[28] M. Pelikan and H. Mühlenbein. The bivariate marginal distribution algorithm. In R. Roy, T. Furuhashi, K. Chawdry, and K. Pravir, editors, Advances in Soft Computing - Engineering Design and Manufacturing. Springer-Verlag, 1999.
[29] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
[30] M. Sebag and A. Ducoulombier. Extending population-based incremental learning to continuous search spaces. In A.E. Eiben, T. Bäck, M. Schoenauer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN V, pages 418-427. Springer, 1998.
[31] I. Servet, L. Travé-Massuyès, and D. Stern. Telephone network traffic overloading diagnosis and evolutionary computation technique. In J.K. Hao, E. Lutton, E. Ronald, M. Schoenauer, and D. Snyers, editors, Proceedings of Artificial Evolution '97, pages 137-144. Springer-Verlag, LNCS 1363, 1997.
[32] K.S. Trivedi. Probability & Statistics with Reliability, Queuing and Computer Science Applications. Prentice-Hall, 1982.
[33] X. Yin and N. Germay. A fast genetic algorithm with sharing scheme using cluster analysis methods in multimodal function optimization. In Proceedings of the 1993 International Conference on Artificial Neural Nets and Genetic Algorithms, pages 450-457, 1993.

A Notation glossary

General

A = {·, ·, ..., ·}   A set A of size |A| whose elements cannot be enumerated.
|A|                  The amount of elements in set A.
∅                    The empty set.
a ∈ A