Negative Log-Likelihood And Statistical Hypothesis Testing As The Basis Of Model Selection In IDEAs (Full (Technical Report) Version)

Peter A.N. Bosman (peterb@cs.uu.nl)
Dirk Thierens (Dirk.Thierens@cs.uu.nl)
Department of Computer Science, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

August 2000

Abstract

In this paper, we analyze the most prominent features in model selection criteria that have been used so far in iterated density estimation evolutionary algorithms (IDEAs, EDAs, PMBGAs). These algorithms build probabilistic models and estimate probability densities based upon a selection of available points. We show that the negative log-likelihood is the basis of the inference features when the Kullback-Leibler divergence is used. We show how issues previously found to be problematic in the case of continuous random variables can be resolved by starting from the derived basics. By doing so, we obtain a probabilistic model search metric that can be justified through the use of statistical hypothesis tests. This in turn reduces the need for additional complexity penalties.

1 Introduction

Over the past few years, new probabilistic optimization algorithms inspired by evolutionary algorithms have appeared [17]. The algorithms that have been introduced in this field apply search criteria to the space of probability distributions to find a suitable probabilistic model, given a vector of selected samples. The resulting probability distribution is subsequently used to draw more samples, after which selection takes place again. Even though efficient algorithms have already been proposed in this field, the motivation of the search criteria for finding a suitable probabilistic model is often not thoroughly investigated. Our goal in this paper is to take a closer look at one such search criterion, which is based on the Kullback-Leibler (KL) divergence. Different algorithms [2, 3, 4, 6, 7, 14] have been proposed that use this divergence. We want to emphasize the background of the KL divergence through its correspondence with likelihood maximization in probability theory. By doing so, we give a unifying background regarding these topics. Using the basic notion of likelihood, we aim to derive general probabilistic model selection criteria that, in addition to previous approaches, can be justified through the use of statistical hypothesis testing.

The remainder of this paper is organized as follows. First, we introduce some notation in section 2 and give a general framework for the algorithms themselves in section 3. Next, in section 4, we formalize the statistical tools that have been used in these algorithms and show how they together constitute some of the methods of induction used so far. In the same section, we unify these statistical tools by looking at their relation with a basic notion of fit in probability theory, being the likelihood. By taking a closer look at the negative log-likelihood, we come across a few subtle details. We go into these details and show how some of the previous algorithms can be simplified by statistical hypothesis testing on the basis of the negative log-likelihood. As a result, some problems that have been encountered in the expansion of discrete to continuous optimization algorithms [7] disappear. In section 5, we give some practical examples of the theoretic work described in this paper. Subsequently, in section 6, we summarize our results and discuss some topics of interest. Finally, we conclude this paper in section 7.
2 Some notation and statistics

We write $a = (a_0, a_1, \ldots, a_{|a|-1})$ for a vector $a$, where the ordering of the elements in a vector is relevant. We assume to have $l$ random variables available, meaning that each sample point is an $l$-dimensional vector. We introduce the notation $a\langle c\rangle = (a_{c_0}, a_{c_1}, \ldots, a_{c_{|c|-1}})$. We let $\mathcal{L} = (0, 1, \ldots, l-1)$ be a vector of $l$ numbers and let $a \sqsubseteq \mathcal{L}$, meaning that $a$ contains only elements of $\mathcal{L}$. We assume the alphabet for each discrete random variable to be the same, just as we do for the range of each continuous random variable. We denote the alphabet for the discrete random variables by $A$. We can now define the multivariate joint probability mass function (pmf):

    $p_a(x\langle a\rangle) = P(\forall_{i\in a}\langle X_i = x_i\rangle)$ such that $\sum_{c\in A^{|a|}} p_a(c) = 1$ and $p_a(\cdot) \geq 0$    (1)

We write $P(X\langle a\rangle)(x\langle a\rangle)$ for $p_a(x\langle a\rangle)$, making $P(X\langle a\rangle)$ a pmf. In the continuous case, we cannot use a pmf. Let $dy\langle a\rangle = \prod_{i=0}^{|a|-1} dy_{a_i}$. Using the notational shorthand $\int$ for $\int_{-\infty}^{\infty}$, we define the multivariate joint probability density function (pdf) $f_a$ by:

    $\int\!\!\int\cdots\int_{\mathcal{A}} f_a(y\langle a\rangle)\,dy\langle a\rangle = P(Y\langle a\rangle \in \mathcal{A})$    (2)

such that $\int\!\!\int\cdots\int f_a(y\langle a\rangle)\,dy\langle a\rangle = 1$ and $f_a(\cdot) \geq 0$, where $\mathcal{A}$ is a region of $\mathbb{R}^{|a|}$. We write $P(Y\langle a\rangle)(y\langle a\rangle)$ for $f_a(y\langle a\rangle)$, making $P(Y\langle a\rangle)$ a pdf. We use $Y$ to denote continuous random variables and $X$ to denote discrete random variables. If we do not distinguish between these cases, we write $Z$. We let $b \sqsubseteq \mathcal{L}$ and define $a \sqcup b$ to be the splicing of the two vectors, so that the elements of $b$ are placed behind the elements of $a$, giving $|a \sqcup b| = |a| + |b|$. Using the definition of multivariate probability, we can define conditional probability by:

    $P(Z\langle a\rangle \mid Z\langle b\rangle) = \frac{P(Z\langle a \sqcup b\rangle)}{P(Z\langle b\rangle)}$    (3)

Shannon [20] has defined the multivariate entropy measure. In the case of discrete random variables, this measure is defined as:

    $H(P(X\langle a\rangle)) = -\sum_{c\in A^{|a|}} P(X\langle a\rangle)(c)\,\ln(P(X\langle a\rangle)(c))$    (4)

In the continuous case, the corresponding differential entropy is defined as:

    $h(P(Y\langle a\rangle)) = -\int\!\!\int\cdots\int P(Y\langle a\rangle)(y\langle a\rangle)\,\ln(P(Y\langle a\rangle)(y\langle a\rangle))\,dy\langle a\rangle$    (5)

Let $\boldsymbol{X} = X\langle\mathcal{L}\rangle$, $\boldsymbol{Y} = Y\langle\mathcal{L}\rangle$ and $\boldsymbol{Z} = Z\langle\mathcal{L}\rangle$. A well known distance metric from one probability distribution $P^0(\boldsymbol{Z})$ to another probability distribution $P^1(\boldsymbol{Z})$ is the Kullback-Leibler (KL) divergence. The KL divergence is also called relative entropy [13]. In the case of discrete random variables, this distance metric equals:

    $D(P^0(\boldsymbol{X}) \,\|\, P^1(\boldsymbol{X})) = \sum_{c\in A^l} P^0(\boldsymbol{X})(c)\,\ln\!\left(\frac{P^0(\boldsymbol{X})(c)}{P^1(\boldsymbol{X})(c)}\right)$    (6)

In the continuous case, the KL metric is equal to:

    $d(P^0(\boldsymbol{Y}) \,\|\, P^1(\boldsymbol{Y})) = \int\!\!\int\cdots\int P^0(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)\,\ln\!\left(\frac{P^0(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)}{P^1(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)}\right)dy\langle\mathcal{L}\rangle$    (7)

As a final statistical tool, we make use of the sample vector $S$. We let $S = (z^0, z^1, \ldots, z^{|S|-1})$, with $z^i = (z^i_0, z^i_1, \ldots, z^i_{l-1})$, be a vector of independently drawn samples from a distribution $P(\boldsymbol{Z})$. We denote this by $S \sim P(\boldsymbol{Z})$. The sample vector is used to find some probability distribution $\hat P(\boldsymbol{Z})$ as an approximation to the true distribution $P(\boldsymbol{Z})$. The likelihood that the samples were drawn from some given distribution $\hat P(\boldsymbol{Z})$ is defined as:

    $\mathcal{L}(S \mid \hat P(\boldsymbol{Z})) = \prod_{i=0}^{|S|-1} \hat P(\boldsymbol{Z})(z^i)$    (8)

It is common practice to use the negative logarithm of the likelihood measure, as it is often computationally more convenient. The resulting expression is called the negative log-likelihood:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Z}))) = -\sum_{i=0}^{|S|-1} \ln(\hat P(\boldsymbol{Z})(z^i))$    (9)
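As a small illustration of equations 8 and 9, the following Python sketch (ours; the pmf and the sample size are arbitrary) computes both quantities for a sample vector under a candidate estimate, and shows why the logarithmic form is computationally more convenient: the raw product of equation 8 underflows for large sample vectors, while the sum of logarithms in equation 9 remains well behaved.

    import numpy as np

    rng = np.random.default_rng(0)

    # A candidate estimated pmf over a small discrete domain
    p_hat = {0: 0.5, 1: 0.3, 2: 0.2}

    # A sample vector S of independently drawn points
    S = rng.choice(list(p_hat), size=2000, p=list(p_hat.values()))

    # Equation 8: the likelihood underflows to 0.0 for |S| this large
    likelihood = np.prod([p_hat[z] for z in S])

    # Equation 9: the negative log-likelihood remains finite and usable
    nll = -np.sum([np.log(p_hat[z]) for z in S])

    print(likelihood, nll)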
3 Evolutionary optimization by iterated density estimation

Evolutionary optimization algorithms that make use of iterated density estimation attempt to capture the probability distribution of a selected vector of samples. Using the estimated probability distribution, more samples are generated and mixed with the existing samples to get a new sample vector. Assume that we have an $l$-dimensional cost function $C(z\langle\mathcal{L}\rangle)$, which without loss of generality we seek to minimize. We let $P^{\theta}(\boldsymbol{Z})$ be a probability distribution that is uniform over all vectors $z\langle\mathcal{L}\rangle$ with $C(z\langle\mathcal{L}\rangle) \leq \theta$ and $0$ otherwise. Sampling from $P^{\theta}(\boldsymbol{Z})$ gives more samples that evaluate to a value below $\theta$. Moreover, if we know $\theta^* = \min_{z\langle\mathcal{L}\rangle}\{C(z\langle\mathcal{L}\rangle)\}$, sampling from $P^{\theta^*}(\boldsymbol{Z})$ gives an optimal solution. This rationale has led to the definition of the IDEA (Iterated Density Estimation Evolutionary Algorithm) framework [4]. The greatest difference between other similar approaches and the IDEA is that the IDEA has mostly been used to focus on continuous optimization problems [4, 6, 7].

We cannot expect to efficiently solve every structured optimization problem with just any probabilistic model. This has been demonstrated on problems in which multiple variables interact [5, 18]. Therefore, a higher order probability distribution such as $\hat P(Z_0, Z_1)$ instead of $\hat P(Z_0)\hat P(Z_1)$ is sometimes required. However, the higher the order of complexity, the larger the amount of computational resources required to compute the probabilistic model. Therefore, the interactions between the problem variables must be inferred from the given sample vector in order to find a probabilistic model $\mathcal{M}$. The notion of a probabilistic model is used as a computational implementation of a probability distribution. A probabilistic model $\mathcal{M}$ uniquely corresponds to a probability distribution. It can be seen to consist of some structure $\varsigma$ and a vector of parameters $\theta$ that defines the pdfs to be fit over each joint multivariate structure implied by $\varsigma$. An example of a structure $\varsigma$ is the notion of a factorization, which we denote by $f$. A factorization factors the probability distribution over $\boldsymbol{Y}$ to get a product of pdfs. An example in the case of $l = 3$ is $P(\boldsymbol{Y}) = P(Y_0, Y_1)P(Y_2)$. Once a structure $\varsigma$ is given, the parameters that have to be estimated can be derived from the multivariate pdfs that have to be fit. Since the way in which the parameters $\theta$ are fit is predefined beforehand together with the pdfs to fit, we denote the parameter vector that is obtained in this manner by $\theta_{fit}$. When we characterize a probability distribution using only the structure $\varsigma$, we denote it by $\hat P_\varsigma(\boldsymbol{Y})$. The definition of the IDEA framework can now be given as follows:

    IDEA(n, τ, m, sel(), rep(), ter(), sea(), est(), sam())
        Initialize an empty vector of samples: P ← ()
        Add and evaluate n random samples: for i ← 0 to n − 1 do
            P ← P ⊔ (z^i⟨L⟩ ← random sample); c[P_i] ← C(P_i)
        Initialize the iteration counter: t ← 0
        Iterate until termination: while ¬ter() do
            Select ⌊τn⌋ samples: (z^0⟨L⟩, z^1⟨L⟩, …, z^{⌊τn⌋−1}⟨L⟩) ← sel(), such that
                ∀_{i∈N} ∀_{k∉N} ⟨c[z^i⟨L⟩] ≤ c[z^k⟨L⟩]⟩
            Search for a structure: ς ← sea()
            Estimate the parameters: θ_fit ← est()
            Generate m new samples: for i ← 0 to m − 1 do O_i ← sam()
            Replace a subset of P with a subset of O: rep()
            Evaluate the new samples in P: for each unevaluated P_i do c[P_i] ← C(P_i)
            Update the generation counter: t ← t + 1
        Denote the required iterations by t_end: t_end ← t

In the IDEA framework, we have that $N = (0, 1, \ldots, \lfloor\tau n\rfloor - 1)$ and $\tau \in [\frac{1}{n}, 1]$; sel() is the selection operator, rep() replaces a subset of $P$ with a subset of $O$, ter() is the termination condition, sea() is a model structure search algorithm, est() estimates the parameters that are implied by both the structure $\varsigma$ and the predefined pdfs, and sam() generates a single sample using the estimated densities. The evolutionary algorithm characteristic of the IDEA lies in the fact that a population of individuals is used, from which individuals are selected to generate new offspring. Using these offspring along with the parent individuals and the current population, a new population is constructed. Once $\varsigma$ has been selected, the actual pdfs are computed. During the course of deriving $\varsigma$, however, the required pdfs will often already have been computed [4]. In any case, using equation 3, computing the pdfs can functionally be restricted to computing multivariate joint pdfs; a minimal instantiation of the full loop is sketched below.
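The framework deliberately leaves sel(), rep(), ter(), sea(), est() and sam() open. The following minimal Python sketch (ours, not a reference implementation) instantiates the loop with illustrative choices: truncation selection for sel(), a fixed univariate normal structure so that sea() is trivial, maximum likelihood mean and standard deviation for est(), and replacement of all non-selected samples by new offspring for rep().

    import numpy as np

    def idea_sketch(cost, l, n=100, tau=0.3, iterations=50, seed=1):
        rng = np.random.default_rng(seed)
        P = rng.uniform(-5.0, 5.0, size=(n, l))   # n random initial samples
        c = np.apply_along_axis(cost, 1, P)       # evaluate
        for t in range(iterations):               # ter(): fixed iteration budget
            best = np.argsort(c)[: int(tau * n)]  # sel(): floor(tau*n) best samples
            S = P[best]
            # est(): fit the predefined pdfs; here one normal per variable
            mu, sigma = S.mean(axis=0), S.std(axis=0) + 1e-12
            # sam() and rep(): offspring O replace the non-selected samples
            O = rng.normal(mu, sigma, size=(n - len(best), l))
            P = np.vstack([S, O])
            c = np.apply_along_axis(cost, 1, P)
        return P[np.argmin(c)], float(c.min())

    # Minimize the sphere function over 5 continuous variables
    best, best_cost = idea_sketch(lambda z: float(np.sum(z * z)), l=5)
    print(best, best_cost)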
In the discrete case, there has been only one way in which these multivariate joint pdfs are computed, which is the most straightforward way. The probabilities equal the frequency with which some combination of values appears in the sample vector¹, divided by the total number of samples:

    $\hat P(X\langle a\rangle)(c\langle a\rangle) = \frac{1}{|S|}\sum_{i=0}^{|S|-1} \begin{cases} 1 & \text{if } \forall_{j\in a}\langle x^i_j = c_j\rangle \\ 0 & \text{otherwise} \end{cases}$    (10)

In the case of continuous random variables, such a most straightforward way to compute the probabilities does not exist. In general, some model is fit to the data as well as possible by specifying its parameters. For instance, a multivariate normal pdf can be used, which is fully specified by its covariance matrix and its mean vector. This discrepancy between discrete and continuous random variables is important in certain model selection techniques, as we shall see.

¹ Note that in the IDEA framework the samples we are referring to are the result of the selection operation $S \leftarrow$ sel().

4 Negative log-likelihood as the basis of statistical model selection

The algorithms in the IDEA field that have been proposed so far can roughly be divided into three categories. One category consists of the methods that use a univariate distribution in which none of the variables interact with other variables. We make no statement of any kind on these algorithms, as they perform no induction on the model since $\varsigma$ is fixed. The methods in another category use a higher order structure and for instance use the KL divergence with the full joint probability distribution to guide the search. For factorizations, it has been noted before [6] that using the KL divergence should be accompanied by strong restrictions on $f$, because otherwise minimizing the divergence will give just the full joint probability distribution. In the case of a factorization, the heavy restrictions are usually embodied by a parameter $\kappa$ which denotes the number of variables any one variable is allowed to interact with. To do away with $\kappa$, the methods in a third category have been invented. These methods add a penalty term to the search metric in order to penalize more complex models. The amount of penalization is however usually parameterized again.

4.1 Negative log-likelihood and sample entropy

To start out, we first note that there is a correspondence between equations 4 and 9, as well as between equations 5 and 9. We first focus on the discrete case, which is the simplest. We split up $S$ into $|A|^l$ mutually disjoint subvectors so that each subvector contains only one type of truncated sample point. If we define $a \sqcap b$ to be the order dependent intersection of vectors $a$ and $b$, such that each element of $a$ is preserved if it occurs in $b$, we can rewrite equation 9 by summing over all possible subvectors of $S$, as together they are exactly equal to $S$:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{X}))) = -\sum_{c\in A^l} |S \sqcap c|\,\ln(\hat P(\boldsymbol{X})(c))$    (11)

Note that $|S \sqcap c|$ is just the number of times that vector $c$ occurs in the sample vector. Using equation 10, this means that we have $|S \sqcap c| = |S|\,\hat P(\boldsymbol{X})(c)$, so we may conclude:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{X}))) = -|S|\sum_{c\in A^l} \hat P(\boldsymbol{X})(c)\,\ln(\hat P(\boldsymbol{X})(c)) = |S|\,H(\hat P(\boldsymbol{X}))$    (12)

We observe that the negative log-likelihood equals $|S|$ times the multivariate entropy of the estimated model. The reason for this is that the estimated model in the discrete case fits the sample points with maximum likelihood. The entropy measures are exact in the sense that they go over the probabilities of every domain element.
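The identity in equation 12 is easy to verify numerically. The following sketch (ours; the binary domain and sample size are arbitrary) builds the frequency estimate of equation 10 over a sample vector and confirms that the negative log-likelihood of equation 9 coincides with $|S|$ times the entropy of equation 4.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    l, n = 3, 1000
    S = [tuple(rng.integers(0, 2, size=l)) for _ in range(n)]  # binary samples

    # Equation 10: frequency estimate of the full joint pmf
    p_hat = {c: k / n for c, k in Counter(S).items()}

    # Equation 9: negative log-likelihood of S under the estimate
    nll = -sum(np.log(p_hat[z]) for z in S)

    # Equation 4: entropy of the estimated pmf
    H = -sum(p * np.log(p) for p in p_hat.values())

    print(nll, n * H)  # the two values coincide (equation 12)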
Alternatively, we can define sample entropies that compute the measure given a vector of samples $S'$, $S' \sim \hat P(\boldsymbol{Z})$. Note that it is essential that the samples were sampled from the probability distribution for which we are computing the entropy. As the sample entropy is the same in the discrete as well as the continuous case, we have only a single definition for it:

    $H(S', \hat P(\boldsymbol{Z})) = -\frac{1}{|S'|}\sum_{i=0}^{|S'|-1} \ln(\hat P(\boldsymbol{Z})(S'_i))$ such that $S' \sim \hat P(\boldsymbol{Z})$    (13)

Note that the definition of the sample entropy is closely related to that of the negative log-likelihood. To be exact, apart from the normalization by $|S'|$, the only difference is the requirement $S' \sim \hat P(\boldsymbol{Z})$, which we do not require in the case of the negative log-likelihood. In the derivation of equation 12 from equation 11, we have used that $|S \sqcap c| = |S|\,\hat P(\boldsymbol{X})(c)$. But $S \sim P(\boldsymbol{X})$ and not $S \sim \hat P(\boldsymbol{X})$. Therefore it should be noted that only in the limit of $|S'| \to \infty$ do we have that $H(S', \hat P(\boldsymbol{X})) \to H(\hat P(\boldsymbol{X}))$, $S' \sim \hat P(\boldsymbol{X})$. Furthermore, in the limit of $|S| \to \infty$, we have that $|S \sqcap c| = |S|\,P(\boldsymbol{X})(c)$, since then $\hat P(\boldsymbol{X})(c) \to P(\boldsymbol{X})(c)$, and thus in the additional limit of $|S'| \to \infty$, $S' \sim \hat P(\boldsymbol{X})$, we have that $H(S', \hat P(\boldsymbol{X})) \to H(P(\boldsymbol{X}))$. We therefore have:

    $\lim_{|S|\to\infty,\,|S'|\to\infty} H(S', \hat P(\boldsymbol{X})) = -\sum_{c\in A^l} P(\boldsymbol{X})(c)\,\ln(\hat P(\boldsymbol{X})(c)) = H(P(\boldsymbol{X}))$    (14)

In the continuous case, things are more complex. We discretize each variable so that it consists of equidistant intervals of length $m$ and let $M = \langle\ldots, [-m, 0), [0, m), [m, 2m), \ldots\rangle$. We let $(S \sqcap \mathcal{A})_i$ be the $i$-th sample from the sample vector that falls into area $\mathcal{A}$. Recalling that $S \sim P(\boldsymbol{Y})$, equation 9 becomes:

    $-\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Y}))) = -\sum_{\mathcal{A}\in M^l}\;\sum_{i=0}^{|S\sqcap\mathcal{A}|-1} \ln(\hat P(\boldsymbol{Y})((S \sqcap \mathcal{A})_i))$    (15)

Now in the limit of $|S| \to \infty$, we have that $|S \sqcap \mathcal{A}| \to |S|\,P(\boldsymbol{Y} \in \mathcal{A})$. We may thus write:

    $\lim_{|S|\to\infty} -\frac{1}{|S|}\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Y}))) = -\sum_{\mathcal{A}\in M^l} P(\boldsymbol{Y} \in \mathcal{A})\,\ln(\hat P(\boldsymbol{Y})(y_{\mathcal{A}}))$    (16)

where $y_{\mathcal{A}}$ is a point in area $\mathcal{A}$. In the additional limit of $m \downarrow 0$, each area becomes infinitely small and is therefore shrunk to a single point. The discrete sum over all areas then becomes an integral over all points:

    $\lim_{|S|\to\infty} -\frac{1}{|S|}\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Y}))) = -\int\!\!\int\cdots\int P(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)\,\ln(\hat P(\boldsymbol{Y})(y\langle\mathcal{L}\rangle))\,dy\langle\mathcal{L}\rangle$    (17)

From equation 17 it follows that in the limit of $|S'| \to \infty$, $H(S', \hat P(\boldsymbol{Y})) \to h(\hat P(\boldsymbol{Y}))$ if and only if $S' \sim \hat P(\boldsymbol{Y})$. Note that we cannot enforce the same further relation as we derived for the discrete case. The relation that we have in the discrete case, namely that in the limits of $|S| \to \infty$, $|S'| \to \infty$ the sample entropy over $\hat P(\boldsymbol{X})$ becomes equal to the entropy over the actual underlying probability distribution $P(\boldsymbol{X})$, does not hold in general in the continuous case. Actually, it does not hold in general in the discrete case either, but we have assumed that in the discrete case we always approximate probabilities by using definition 10. As the definition in equation 10 becomes equal to $P(\boldsymbol{X})$ in the limit of $|S| \to \infty$, this is clearly true. However, in the continuous case we hardly ever have that in the limit of $|S| \to \infty$ we get $\hat P(\boldsymbol{Y}) \to P(\boldsymbol{Y})$. So neither can we conclude that the entropy of the true distribution becomes equal to the entropy of the estimated distribution. In order for that to hold in general, we require a special pdf:

    $\int_{a_0}^{b_0}\int_{a_1}^{b_1}\cdots\int_{a_{l-1}}^{b_{l-1}} \hat P(\boldsymbol{Y})(y\langle\mathcal{L}\rangle)\,dy\langle\mathcal{L}\rangle = \frac{1}{|S|}\sum_{i=0}^{|S|-1} \begin{cases} 1 & \text{if } \forall_{j\in\mathcal{L}}\langle a_j \leq y^i_j \leq b_j\rangle \\ 0 & \text{otherwise} \end{cases}$    (18)

Such a pdf follows every detail of the sample vector and is therefore said to overfit the data. This is a very important aspect of density estimation and is termed generalization. As in a practical application we shall use a more generalizing model for $\hat P(\boldsymbol{Y})$, we cannot enforce that in the limits of $|S| \to \infty$, $|S'| \to \infty$, $H(S', \hat P(\boldsymbol{Y})) \to h(P(\boldsymbol{Y}))$, $S' \sim \hat P(\boldsymbol{Y})$.
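Equations 13 and 17 suggest a practical recipe: approximate the differential entropy of an estimated model by the sample entropy over samples drawn from that same model. The following sketch (ours; a univariate normal is chosen only because its differential entropy has a closed form to compare against) shows $H(S', \hat P(\boldsymbol{Y}))$ converging to $h(\hat P(\boldsymbol{Y}))$ as $|S'|$ grows.

    import numpy as np

    rng = np.random.default_rng(0)

    # Fit a normal pdf to data by maximum likelihood (mean and variance)
    S = rng.normal(3.0, 2.0, size=500)
    mu, sigma = S.mean(), S.std()

    def log_pdf(y):
        return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

    # Equation 13: sample entropy, with S' drawn from the *estimated* model
    for n_prime in (100, 10_000, 1_000_000):
        S_prime = rng.normal(mu, sigma, size=n_prime)
        print(n_prime, -np.mean(log_pdf(S_prime)))

    # Closed-form differential entropy of the fitted normal, for comparison
    print("exact:", 0.5 * np.log(2 * np.pi * np.e * sigma**2))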
There is however one final important issue. In the discrete case, we found that the negative log-likelihood over the sample vector equals $|S|$ times the entropy of the estimated distribution (equation 12). We have however not established this in the continuous case. In our derivation, we have directly derived that, given an infinite number of samples, the sample entropy becomes equal to the actual differential entropy over the estimated model. As a matter of fact, we do not in general have the same relation between the entropy and the negative log-likelihood in the continuous case as we do in the discrete case. The reason why it does hold in the discrete case is that equation 10 is a maximum likelihood estimate over the sample vector. As we have not assumed that our fit in the continuous case is a maximum likelihood fit unless we use equation 18, we have not enforced this relation in the continuous case. However, Kullback [13] has shown that the equality holds, in the discrete as well as the continuous case, if and only if $\hat P(\boldsymbol{Z})$ is a maximum likelihood estimate over $S$. Let $\vartheta$ be the vector of parameters for $\hat P(\boldsymbol{Z})$ and let $\Theta$ be the vector space of all possible values for the parameters $\vartheta$; we can then formalize this by:

    $\hat P(\boldsymbol{Z}) = \arg\max\{\mathcal{L}(S \mid P) \mid P \in \{\hat P(\boldsymbol{Z} \mid \vartheta) \mid \vartheta \in \Theta\}\} \iff -\ln(\mathcal{L}(S \mid \hat P(\boldsymbol{Z}))) = |S|\,H(\hat P(\boldsymbol{Z}))$    (19)

4.2 KL divergence and negative log-likelihood model selection

In the methods in the second mentioned category, model selection is performed subject to certain constraints. Let $C_\varsigma$ be the set of all model structures that satisfy the constraints, $\varsigma \in C_\varsigma$. The constraints on the model have varied over the past from none [1, 19] to those of chain dependencies [3], tree dependencies [2], conditional dependencies with a maximum of $\kappa$ conditionals per variable [4, 6, 7, 14, 15, 16] and marginal product models [11]. Prior to the IDEA framework, all of the higher order models were regarded in the case of discrete (binary) random variables. The basic idea is then to minimize the distance between the true probability distribution and the estimated probability distribution. As a distance measure, the KL divergence between two probability distributions can be used, in which one probability distribution is the most complex probability distribution $\hat P(\boldsymbol{Z})$ and the other is the estimate $\hat P_\varsigma(\boldsymbol{Z})$. In the case of factorizations, an algorithm to approximately find the best factorization can start from the complete univariate factorization, in which each variable is taken independently, and incrementally build a more complex factorization. Each step in the incremental algorithm is based upon the change that brings about the largest decrease in the KL divergence to $\hat P(\boldsymbol{Z})$. In the case of a model structure in general, this means that if we let $\hat P_{\varsigma^0}(\boldsymbol{Z})$ be the current model, we test it against a (more complex) model $\hat P_{\varsigma^1}(\boldsymbol{Z})$ by observing the change in the KL divergence to the most complex probability distribution. The model that brings about the greatest decrease is selected as the new current model. The metric that has been used in the discrete case has only been applied to factorizations and is computed with respect to the full joint factorization:

    $D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^0}(\boldsymbol{X})) - D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^1}(\boldsymbol{X})) = \sum_{c\in A^l} \hat P(\boldsymbol{X})(c)\,\ln(\hat P_{f^1}(\boldsymbol{X})(c)) - \sum_{c\in A^l} \hat P(\boldsymbol{X})(c)\,\ln(\hat P_{f^0}(\boldsymbol{X})(c))$    (20)

A factorized probability distribution is a product of pmfs over its elemental building blocks. With equation 3, we therefore have that both sums in equation 20 are actually sums over sums over certain vectors $a$ of $\hat P(\boldsymbol{X})(c)\,\ln(\hat P(X\langle a\rangle)(c\langle a\rangle))$ with $c \in A^l$. Since, given some vector $k \sqsupseteq a$, the theorem of total probability gives us that $\sum_{c\in A^{|k|}} P(X\langle k\rangle)(c)\,\ln(P(X\langle a\rangle)(c\langle a\rangle)) = \sum_{c\in A^{|a|}} P(X\langle a\rangle)(c)\,\ln(P(X\langle a\rangle)(c))$, equation 20 transforms into a difference of entropies that can be computed from the marginals implied by the factorizations alone:

    $D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^0}(\boldsymbol{X})) - D(\hat P(\boldsymbol{X}) \,\|\, \hat P_{f^1}(\boldsymbol{X})) = H(\hat P_{f^0}(\boldsymbol{X})) - H(\hat P_{f^1}(\boldsymbol{X}))$    (21)
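Equation 21 implies that the greedy structure search never needs the full joint distribution explicitly: every candidate step can be scored using marginal entropies of the frequency estimates alone. The following sketch (ours) illustrates this for marginal product models over binary variables, greedily merging the pair of factors whose merge most decreases the entropy of the factorized model; the threshold eps is our stand-in for the restrictions discussed above and is not part of the metric itself.

    import numpy as np
    from collections import Counter
    from itertools import combinations

    def marginal_entropy(S, a):
        # Entropy (equation 4) of the frequency estimate over the variables in a
        counts = Counter(tuple(z[i] for i in a) for z in S)
        p = np.array(list(counts.values())) / len(S)
        return float(-(p * np.log(p)).sum())

    def greedy_factorization(S, l, eps=0.05):
        # Start from the complete univariate factorization; merge while the
        # entropy decrease (equation 21) exceeds eps.
        f = [(i,) for i in range(l)]
        while True:
            best, best_gain = None, eps
            for a, b in combinations(f, 2):
                gain = (marginal_entropy(S, a) + marginal_entropy(S, b)
                        - marginal_entropy(S, a + b))
                if gain > best_gain:
                    best, best_gain = (a, b), gain
            if best is None:
                return f
            a, b = best
            f.remove(a)
            f.remove(b)
            f.append(a + b)

    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, 300)
    x1 = x0 ^ (rng.random(300) < 0.05)   # strongly coupled to x0
    x2 = rng.integers(0, 2, 300)         # independent
    S = [tuple(z) for z in np.column_stack([x0, x1, x2])]
    print(greedy_factorization(S, l=3))  # expect (0, 1) merged, (2,) alone

With eps shrunk towards 0, the procedure keeps merging until it reaches the full joint probability distribution, since the sample-based entropy decrease is almost surely positive. This is precisely the behaviour noted above that motivates the restriction $\kappa$, the penalty terms, or the hypothesis tests pursued in this paper.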
The same metric has been proposed in the continuous case, $d(\hat P(\boldsymbol{Y}) \,\|\, \hat P_{f^0}(\boldsymbol{Y})) - d(\hat P(\boldsymbol{Y}) \,\|\, \hat P_{f^1}(\boldsymbol{Y}))$ [4, 6, 7], as a logical expansion over the discrete case. This however requires the computation of a multivariate integral over a given pdf. For the normal pdf [4], this expression can be derived analytically. However, in the case of the normal kernels pdf [7], this is no longer possible. The same is the case for the use of a normal mixture pdf. The latter two pdfs are however very useful in the IDEA approach. In order to use them anyhow, the integrals may be computed numerically. There are many ways to do this, and mostly this requires an amount of time that is exponential in the dimensionality of the integral. However, we have seen in this paper that we can use an approximation. If we can efficiently sample points from the estimate $\hat P_f(\boldsymbol{Y})$, equation 17 tells us that we may use $H(S', \hat P_f(\boldsymbol{Y}))$ with $S' \sim \hat P_f(\boldsymbol{Y})$ as an approximation to the true entropy over the estimated model (a concrete sketch of this approximation follows below). This resolves the problems regarding the computation of the entropy in the search metric as encountered so far in continuous IDEAs [6, 7]. Summarizing, for finding $f$, the methods using the KL divergence have used the entropy of the factorized model as the quantity to minimize: $H(\hat P_f(\boldsymbol{X}))$ in the discrete case and $h(\hat P_f(\boldsymbol{Y}))$ in the continuous case.
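As referenced above, the sample-entropy approximation can be made concrete. The following sketch (ours; "normal kernels" is taken here to be a Gaussian kernel density with one kernel per sample point and an arbitrarily chosen bandwidth) estimates the differential entropy of a two dimensional kernel model via equation 13, avoiding numerical integration entirely.

    import numpy as np

    rng = np.random.default_rng(0)

    # A 2-d normal kernels model fit over a sample vector S
    S = rng.normal(0.0, 1.0, size=(200, 2))
    h = 0.4                                # illustrative bandwidth choice

    def kde_log_pdf(y):
        # log of the average of 2-d spherical normal kernels centered on S
        d2 = ((S - y) ** 2).sum(axis=1)
        log_k = -d2 / (2 * h * h) - np.log(2 * np.pi * h * h)
        m = log_k.max()
        return m + np.log(np.exp(log_k - m).mean())

    def kde_sample(n):
        # Sampling is cheap: pick a kernel center, perturb with normal noise
        centers = S[rng.integers(0, len(S), size=n)]
        return centers + rng.normal(0.0, h, size=(n, 2))

    # Equation 13: Monte Carlo estimate of the differential entropy
    S_prime = kde_sample(20_000)
    H_mc = -np.mean([kde_log_pdf(y) for y in S_prime])
    print("estimated differential entropy:", H_mc)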