Numerical analysis of EM estimation of mixture model parameters

Optimisation of distribution parameters is a very common problem. There are many sorts of distributions which can be used to model environment processes, biological functions or graphical data. However, it is common that parameters of those distribution may be, partially or completely unknown. Mixture models composed of a few distributions are easier to solve. In such a case simple estimation methods may be used to obtain results. Usually models are composed of several distributions. Those distributions may be of the same or diﬀerent type. Such models are called mixture models. Finding their parameters may be complicated. Usually in such cases iterative methods need to be used. The paper gives a brief survey of algorithms designed for solving mixtures of distributions and problems connected with their usage. One of the most common method used to obtain mixture model parameters is Expectation-Maximization (EM) algorithm. EM is the iterative algorithm performing maximum likelihood estimation. The authors present the results of adjusting the Gaussian mixture models to the data. It is done with the usage of EM algorithm. The article gives advantages and disadvantages of EM algorithm. Improvements of EM applied in the case of large data are also presented. They help increase eﬃciency and decrease operation time of the algorithm. Another considered issue is the problem of optimal input parameters selection and its inﬂuence on the adjustment results. The authors also present algorithm performance observations.


Introduction
Mixture models [1] are popular methods of data-sets presentation and analysis. The most common application of mixture models are natural phenomena, biological processes, graphical data, damaged and incomplete data [2,3], classification problems. Single distribution usually represents one process. However, if a sequence of processes occurs, several distributions may be combined. Genes reactions on tissues damage exemplify this. Activation of one set of genes makes the other react. Genes which do not take part in the process have small weight and their rate of activity does not change. The most popular probability distribution is a Gaussian one. According to the Central Limit Theorem if a number of sample distribution is huge and its variance is finite, distribution statistics may be approximated by the Gaussian function [4].
The single Gaussian distribution ( Fig. 1) has two parameters: mean and standard deviation. The distribution is given by the formula: Fig. 1. Gaussian distribution [5] The mixture model, consisting of several Gaussian components is represented by the formula: One can notice that apart from the parameters of Gaussian components, the mixture model must be also described by weights. Every Gaussian component in the model has its own weight, which determines height and importance of this single distribution. Fig. 2 illustrates the influence of parameter values on simple, composed of two Gaussians mixture model. The stronger line identifies the envelope, the thinner -single Gaussian. Small difference between the means of components can cause missing or merging Gaussians (Fig. 2c). Weights have also strong influence on the model characteristics. Gaussians with small weights are harder to model and solve (Fig. 2b). Small weights in combination with similar means may lead to more complicated model formation (Fig. 2d).

Mixture model parameters
One of the most problematic issues concerned with mixture models is estimation of their parameters. It is a hard task due to the number and properties of the estimated parameters. Fig. 3 illustrates the example of typical model.
There are several optimization methods which can be used to solve the problem of unknown parameters. The best of them are based on iterative algorithms. Such methods are Newton, quasi-Newton or gradient ones [6]. However, those algorithms need partial derivatives vector, which makes them hard to use due  The expectation-Maximization algorithm [7] is the most common method of computing mixture model parameters. It is an iterative method, composed of two main steps: Expectation (E) and Maximization (M). The standard version of EM algorithm allows for obtaining initial values from randomization. The first step, E, is responsible for calculation of probability value [8]: where p k|x n , p old -probability, that sample x n belongs to k th component of mixture p old -set of input parameters

127
The second step, M, includes calculation of new parameter values. For the Gaussian mixtures this step is given by formulas [8]: The EM algorithm uses the likelihood function for finding the best possible results. The likelihood function f (x n , p) is a common way of evaluation used in the estimation of probability distribution parameters. The function f (x n , p) describes the likelihood of x n observation. It is a good idea to use Maximum Likelihood [1] (ML) principle in the process of parameter values calculation. It states, that the best parameters estimation is the one which is most probable. The most probable set of parameters is the one, which is computed from maximizing of the likelihood function [9]. The ML principle is given by: To make calculations more efficient, the log-likelihood function [8] is used. This allows summation instead of multiplication and it does not change the results because monotonicity of l (x 1 , x 2 , . . . .x N , p) leads to the same location of maximum as L(p, x) does. The Gaussian interpretation of ML principle is as follows: Usage of ML principle [10] gives a certainty of stability. The ML value is always ascending or stable -it never descends (Fig. 4). It guarantees that there will be no deterioration during algorithm work regardless of any circumstances.  Good estimators of components are found quickly in the first several algorithm iterations. In most cases the researcher is interested only in high-weighted component estimation. Low-weighted components need more time to estimate. However, too high accuracy results in lengthened calculation time and excessive concentration on the low-weighted components. The next interesting issue is the shape of envelope subtraction lines, shown in Fig. 5. In all cases these lines have similar, sinus-shaped look in the areas corresponding to the overlapping components. This property can be used in error prediction simulations.
Estimation mistakes can be also a result of errors in Maximum Likelihood principle operation. The ML principle always tries to find global maximum of likelihood function but sometimes it sticks to a local one [11]. Good illustration of this process is presented in Fig. 6a. The charts show the dependence of means handle the local maximum problem. It offers to use many ML calculations (for example 50 or 100) instead of one and choose estimation with the highest value of ML. Another improvement is to repeat the whole above process many times and each time draw the best parameters (those corresponding to the highest likelihood) in the chart. As a result (Fig. 6b) the obtained parameters should oscillate between on the corresponding parameters of the true model. Another issue is the influence of the stop criterion choice on the performance of the algorithm. The most common criterion is that based on the likelihood function values. If the difference between two c consecutive calculated likelihood function values is smaller than the defined accuracy ε, ε > 0, the algorithm terminates. However, there are also other stopping conditions consisting in the rate of the estimators change. This rate can be obtained with different measures and combinations of those measures. Table 1 presents the results of comparison of: maximum change of single parameter values (relative and absolute changes), Euclidean and chi2 distance. The table contains the number of iterations needed to gain the results and charts illustrating the shape of parameters and comparing the obtained parameters.
The results Table 1   According to those data, the best results are given by simple relative changes of parameters.
One needs to remember that the defined accuracy has great influence on algorithm efficiency and speed. The same accuracy may be optimal for one distance type whereas for the others may not be good enough. This may lead to differences in number of iterations or quality of estimations. Every distance method needs another accuracy due to different values of distance parameters. The accuracy depends on a number of model components and points. There is a need to estimate the accuracy by empirical testing.

Methods of EM improvement
Initial convergence of EM is satisfactory because estimation gets to the vicinity of the limit values very fast. But after that, the progress decreases and the algorithm approaches the solution quite arduously (Fig. 7). To decrease the computational time analytical work should be done, which, unfortunately, leads to increasing the complexity. There are many variants of the EM algorithm. One of them is the incremental version of EM algorithm (IEM). This implementation is based on dividing the observed data into equal B blocks. The procedure of IEM takes the E-step for only one block of the observed data at a time and next the B-step is taken. A simple "scan" of algorithm consist of B partial E-steps and B M-steps. As a result new information is collected faster. In simulation of McLachlan and Ng performance of IEM Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 20/05/2022 01:38:50 U M C S [12] the algorithm was tested against the simple EM algorithm. A sample set n = 256 × 256 was generated (Table 2). Only in the case where the number of blocks was established to a size of the data set the IEM was slow. All other simulations show that EM has slow convergence and incremental implementation, IEM, is faster.
Another variant of algorithm is Lazy EM [12]. The main idea is to specify a threshold for selecting subsets of the data upon which E-step and M-step will be performed. In other words, the method assumes that for each iteration not all data is of equal significance.
The third method used for accelerating the EM algorithm is sparse EM. In E-step some posterior probabilities are often close to zero. The sparse method selects only relative probabilities of a given data point. This algorithm can be combined with the incremental version by performing partial E-step and sparse E-step.

Conclusions
EM is one of the best algorithms of mixture parameters estimation. It can be also used in grouping and clustering tasks. It is a stable method, giving good results in the case of huge amount of data processing. Many improvements of EM have been found, which makes EM more efficient. They are helpful in acceleration or dealing with huge data-sets. However, EM is not free of disadvantages and difficulties. It is very sensitive to the initial values -improper values may lengthen the time of work or cause a local maximum problem. Another issue is slow convergence and high complexity, especially in the M step. There is also a need of multiple repetitions, which has additional influence on working time.