An Application of Expectation-Maximization for Model Verification

– A description that summarizes an entire, usually large, set of data is called its model. The problem investigated in the paper consists in the verification of models of data coming from a simulation experiment for selecting candidates for mobile robot operators (more precisely, in building a reliable predictive model of the data). The models are validated using the train-and-test method and verified with the help of the EM (expectation-maximization) algorithm, which was originally designed for solving clustering problems with missing data. The selection is in fact a clustering problem, because the candidates are assigned to the 'chosen', 'accepted' or 'rejected' subgroups. In such a case the missing data is the category (the subgroup) to which a candidate should be assigned on the basis of his activity measured during the simulation experiment. The paper explains the procedure of model verification. It also shows experimental results and draws conclusions.


Introduction
Nowadays, in the age of terabyte disk drives and the Internet, large sets of data are nothing unusual. The key to making use of such data is to extract useful information from them. The process of extracting this information is called data exploration. Its results, relationships and summaries, are called models and patterns.
The paper explains the procedure of model verification with the help of the EM algorithm. It presents experimental results and draws conclusions. The procedure can be extended to the derivation of predictive models for similar problems. Section 2 contains a short introduction to modelling and data mining. In Section 3 the experiment with training mobile robot operators is described. The motivation for the research is given in Section 4. In Section 5 the results of model verification with the help of the EM algorithm are given. Section 6 summarizes the research conclusions.

Modelling Data
There are two main types of data models: descriptive and predictive [1,2]. Predictive models help to draw conclusions about the whole population of objects described by a set of variables, or about their probable future values. One of the variables is expressed as a function of other variables.
Predictive modelling has two forms: classification and regression. In classification the predicted variable is categorical; in regression it is quantitative.
Let Y denote the resulting variable and X_1, ..., X_n the predictive variables. Predictive models are the following [1]:
• Linear model - represented by a linear function Y = aX + b. Linear modelling relies on approximating a discrimination or regression function with a hypersurface (in the simplest case a straight line). Simple optimization techniques may be used here, but the linear model is often not realistic enough.
• Additive model - represented by a sum of components, Y = a_1 X_1 + ... + a_n X_n + a_0.
• Multiplicative model - represented by a product of components, e.g. Y = a_0 · X_1^{a_1} · ... · X_n^{a_n}.
• Model with locally segmented structure - containing different local relationships in different areas of the space (e.g. tree structures):
- Partially linear model - where Y is locally a linear function of X;
- Function of compound curves - where segments are low-degree polynomials.
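As a concrete illustration of the simplest case above, the linear model Y = aX + b can be fitted by least squares. The sketch below uses synthetic data invented only for this example; the closed-form estimates are standard.

```python
import random

# A minimal sketch of fitting the linear model Y = a*X + b by least squares.
# The data are synthetic (true slope 2.0, intercept 1.0, Gaussian noise).
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

# Closed-form least-squares estimates of the slope a and the intercept b.
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(round(a, 1), round(b, 1))  # estimates close to the true values 2.0 and 1.0
```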
All of the above models are parametric; hence, the result of modelling is a function, or some kind of line (straight or curved), reflecting relationships in the data. Another type of model is non-parametric: it does not reflect data relationships explicitly, but determines an object's value using the values of its nearest neighbours. Non-parametric models are as follows:
• Local neighbourhood models - where Y is determined as the average Y value of the nearest neighbours. Such models are of no use for summarizing data.
• Kernel models.
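A local-neighbourhood model of the kind mentioned above can be sketched in a few lines. The data points and the choice k = 3 are invented purely for illustration.

```python
# A toy sketch of a local-neighbourhood (k-nearest-neighbour) model:
# Y at a query point is the average Y of its k nearest training points.
def knn_predict(train, x_query, k=3):
    # train is a list of (x, y) pairs; distance is plain |x - x_query|
    nearest = sorted(train, key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / k

train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (10.0, 20.0)]
print(knn_predict(train, 2.5))  # averages y of x = 2.0, 3.0, 1.0 → 4.0
```

Note that, as the text says, such a model gives a prediction at any query point but produces no explicit summary of the data.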
Classification predictive models divide the whole X space of objects into separable decision areas (one area for each class of objects). A descriptive model characterizes all the data or the process by which these data were generated. Descriptive models summarize or condense data. A descriptive model makes it possible to identify the real structure of data burdened with errors. The following descriptive models can be distinguished:
• global data probability distribution models (density estimation),
• models dividing p-dimensional space into groups (cluster analysis, segmentation),
• models describing relationships between data (relationship modelling).
Usually, density function models are used in this case. There are parametric models, such as those defined by a position parameter (the mean) and a scale parameter of the density function or distribution. There are also non-parametric models, where the distribution or density is estimated from the data.
Building a model consists of two steps: (1) choosing the model form; (2) determining the values of its parameters by estimation (maximization or minimization of a ranking function reflecting the fit of the model to the data).

Data mining
In the literature, different methods for data collection and analysis are proposed, including methods of publishing results and deriving benefits from a data mining project.
One of the data mining process models is CRISP (Cross-Industry Standard Process for data mining), proposed in the mid-1990s by a European industry consortium as a public data mining standard [2]. In the CRISP model, the project steps shown in Fig. 1 are proposed. Six Sigma proposes a different strategy. Six Sigma is a well organized, data-based strategy for the avoidance of defects and quality problems in all kinds of production, services, management and other business activity. Recently, owing to many successful implementations, Six Sigma has become more and more popular in the USA and around the world. Six Sigma recommends the following stages of data mining (the so-called DMAIC): Define, Measure, Analyze, Improve, Control. A different methodology, similar to a certain degree to Six Sigma, is SEMMA, proposed by the SAS Institute. SEMMA is focused on the technical side of data mining projects and consists of the stages Sample, Explore, Modify, Model, Assess. The above methodologies are focused on using data mining in an organization.
They try to answer the questions: how to convert data into knowledge, how to involve the right people (company owners, managers) in data mining, and how to use and publish knowledge in a form that can easily be used in the decision process.
Ranking functions are used to evaluate how well a model fits the data set. The ranking function may be, for example, the likelihood, the total square error (the sum of squared differences between the real and predicted values), or the classification error rate.
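Two of the ranking functions named above are simple enough to sketch directly; the toy data below are invented for illustration.

```python
# Toy sketches of two ranking functions used to score model fit.
def total_square_error(actual, predicted):
    # Sum of squared differences between real and predicted values.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def classification_error_rate(actual, predicted):
    # Fraction of objects assigned to the wrong class.
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

print(total_square_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))         # 0.25 + 1.0 = 1.25
print(classification_error_rate(["a", "b", "b"], ["a", "b", "a"]))  # 1 of 3 wrong
```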
The maximum likelihood method [1, 2] is a general method for estimating population parameters by the values which maximize the likelihood L of a sample. For a sample of n observations x_1, x_2, ..., x_n of discrete random variables, L is the joint probability p(x_1, x_2, ..., x_n). If all of the variables are continuous, L is the joint probability density f(x_1, x_2, ..., x_n). If L is a function of parameters θ_1, θ_2, ..., θ_k and L(θ) is differentiable, the maximum likelihood estimate is found at the maximum of L(θ).
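For a normal sample the maximum of L(θ) has a closed form: the sample mean and the (biased) sample standard deviation. The sketch below, on synthetic data invented for illustration, checks numerically that the log-likelihood is indeed highest at these estimates.

```python
import math
import random

# Maximum-likelihood estimation for a normal sample (true mu=5.0, sigma=2.0).
random.seed(1)
xs = [random.gauss(5.0, 2.0) for _ in range(1000)]

# Closed-form MLEs: sample mean and biased sample standard deviation.
mu_hat = sum(xs) / len(xs)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / len(xs))

def log_likelihood(mu, sigma):
    # log L(mu, sigma) for an i.i.d. normal sample.
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

# The MLE attains a higher likelihood than nearby parameter values.
print(log_likelihood(mu_hat, sigma_hat) > log_likelihood(mu_hat + 0.5, sigma_hat))
```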

Mobile robots
Mobile robots (mobots) are often used for exploring areas dangerous or inaccessible to human beings, and also for monitoring, watching and guarding [3-8]. When such an area changes dynamically (e.g. is devastated by uncontrolled forces), the mobot must be remotely controlled by a human operator.
The operator should detect all objects in the area fast and precisely. The question arises how to select such a person from a group of candidates applying for this position. Since the guarded property and the equipment used are valuable, the guards should be as good as possible. To train and rank the operators a simulator, similar to a flight simulator, was developed. The mobot controlled by an operator moves in virtual reality (a simulated room) and takes photos. The simulator uses computer-generated graphics instead of real images that would be taken by the mobot in a real room. The task of the operator is to find all scene changes in a limited time [3-7].
A simulation experiment, in which the candidates are trained, is used as a source of data. The candidates are trained in virtual reality and afterwards the best of them are chosen. The selection procedure consists of the following steps [7]:
• Basic operator activities are measured (the numbers of changes, errors, moves, photos, etc.).
• After the training, unreliable results and candidates are 'rejected' (too many errors, too many moves and photos, cheating, etc.).
• From the rest of the candidates ('accepted'), those with the largest number of discoveries are 'chosen'.
A predictive model, built on the basis of the available data, allows choosing the best candidates.
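The selection steps above can be sketched as a simple rule-based classifier. The attribute names and all thresholds below are hypothetical, chosen only to illustrate the reject/accept/choose logic; the paper does not state concrete threshold values.

```python
# A toy sketch of the three-step selection procedure.
# Thresholds (max_errors, max_moves, min_discoveries) are hypothetical.
def classify(c, max_errors=10, max_moves=400, min_discoveries=80):
    # Step 2: reject unreliable candidates (too many errors or moves).
    if c["errors"] > max_errors or c["moves"] > max_moves:
        return "rejected"
    # Step 3: among the accepted, choose those with many discoveries.
    return "chosen" if c["discoveries"] >= min_discoveries else "accepted"

candidates = [
    {"name": "c1", "discoveries": 90, "errors": 2,  "moves": 120},
    {"name": "c2", "discoveries": 95, "errors": 15, "moves": 500},
    {"name": "c3", "discoveries": 70, "errors": 3,  "moves": 150},
]
for c in candidates:
    print(c["name"], classify(c))  # c1 chosen, c2 rejected, c3 accepted
```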

Problem statement
The data coming from the experiment must be processed to select the best candidates. To this end, expert knowledge is used: the data are supplemented with a new 'decision' attribute provided by the expert. The most desired result of data mining for the experiment is then a predictive data model allowing us to classify candidates and estimate the decision value.
The data supplemented with the expert decisions were used to build a few predictive data models. Different algorithms yielded different models, so the problem is to choose the best, most reliable and most universal one. Each of the candidates has associated attributes: the number of detected changes (%), the number of moves, the number of photos taken and the number of errors made. There are two kinds of errors - invalid discoveries (errors_1) and invalid moves (errors_2). To build and train a model, another attribute, the so-called 'decision', is added. The 'decision' is a categorical variable, except for linear regression, where it becomes a numerical (quantitative) three-state variable.
The whole experimental data set consists of records for 145 candidates, of which 67 records form the 'train' set and the remaining 78 records form the 'test' set. The 'train' set was used for building the models, which were verified later with the 'test' set. The characteristics of the 'train' set and the 'test' set are very similar to those of the whole set (Fig. 2).
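The train-and-test split described above can be sketched as follows; the record identifiers are stand-ins, and the random shuffle is an assumption about how such a split is typically made.

```python
import random

# A sketch of the train-and-test split: 145 records divided into a 'train'
# part (67 records) for building the model and a 'test' part (78 records)
# for verification. Integer ids stand in for the real candidate records.
random.seed(0)
records = list(range(145))
random.shuffle(records)
train, test = records[:67], records[67:]
print(len(train), len(test))  # 67 78
```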
There are many software applications for data exploration, such as Weka, Statistica, etc. The analysis of the models obtained from Weka leads to the following conclusions:
• for each of the models a set of required attributes was properly chosen;
• the models differ considerably;
• none of the models is accurate.
Therefore, all of the models should be verified.

Expectation-Maximization (EM) Algorithm
The models are verified with the help of the EM (expectation-maximization) algorithm, which was designed for solving grouping (clustering) problems with missing data [1,2]. The selection is in fact a grouping problem, because the candidates are assigned to the 'chosen', 'accepted' or 'rejected' subgroups. In such a case the missing data is the category (the subgroup) to which a candidate should be assigned on the basis of his activity measured during the experiment. If the EM algorithm is applied to this problem, we obtain a model derived from our experimental data. Table 1. Comparison of the predictive data models generated with Weka, built using the 'train' set and tested with the 'test' set (5 attributes were available).


EM background
The EM algorithm is used for finding maximum likelihood estimates of the parameters of a model built on the basis of unobserved variables (missing data). EM consists of two repeated steps:
• Expectation (E) step, in which the expected likelihood is computed, using the expected values of the missing variables.
• Maximization (M) step, in which parameters maximizing the expected likelihood computed in the E step are found.
The parameters found in the M step are then used to perform another E step, and so on. This iteration converges to an optimum (which may, unfortunately, be only a local one). In practice, the EM algorithm stops when the values of the missing variables and the parameters do not change significantly between two consecutive EM iterations.
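The alternating E and M steps can be sketched for the classic case of a two-component one-dimensional Gaussian mixture. The data, initial parameters and fixed iteration count below are synthetic choices for illustration, not the configuration used in the paper's experiments.

```python
import math
import random

# A compact sketch of EM for a two-component 1-D Gaussian mixture.
# Synthetic data: two groups with true means 0.0 and 6.0.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(6.0, 1.0) for _ in range(200)])

def pdf(x, mu, sigma):
    # Normal density, used to weight points in the E step.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Initial guesses for the means, standard deviations and mixing weights.
mu, sigma, w = [min(data), max(data)], [1.0, 1.0], [0.5, 0.5]

for _ in range(50):  # a fixed iteration count stands in for a convergence test
    # E step: responsibility of each component for each point.
    resp = []
    for x in data:
        p = [w[k] * pdf(x, mu[k], sigma[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M step: re-estimate parameters from the responsibilities.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2
                                 for r, x in zip(resp, data)) / nk)
        w[k] = nk / len(data)

print([round(m, 1) for m in sorted(mu)])  # close to the true means 0.0 and 6.0
```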
Let D = {x(1), ..., x(n)} denote a set of n observed data vectors and H = {z(1), ..., z(n)} a set of missing data (z(i) corresponds to x(i)). The observed data are described by a probabilistic model p(D, H | θ), where θ are unknown model parameters; in this case the values of both θ and H are unknown. The logarithmic likelihood of the observed data may be expressed as

l(θ) = log p(D | θ) = log Σ_H p(D, H | θ).     (1)

If Q(H) is a probability distribution over the missing data H, the logarithmic likelihood may be bounded as

l(θ) ≥ Σ_H Q(H) log( p(D, H | θ) / Q(H) ) = F(Q, θ).     (2)

The function F(Q, θ) is a lower bound of the likelihood function l(θ), and it is this bound that is maximized. The EM algorithm alternately maximizes F as a function of Q with θ fixed (the E step, which yields Q = p(H | D, θ)), and F as a function of θ with the distribution Q fixed (the M step).
In computer implementations of the EM algorithm (as in Weka [9]), different distributions of the missing variable may be used to get the best results, such as the Gaussian (normal) or Poisson distribution, or a mixture of several distributions.

EM results for the mobot operator data
The experimental data were evaluated with the help of Weka and its EM clusterer [9]. The 'train' and 'test' data sets were concatenated to form a single set. The EM algorithm was run many times with different parameters, giving very similar results. The results for the data set with all attributes are given below. Surprisingly, the decision values seem to be of no use, as the resulting clusters do not reflect the decision attribute. In the best case, one cluster per decision was expected; instead, all clusters cover the 'rejected' decision. Moreover, over 57% of the data was classified incorrectly, so the EM algorithm seems inappropriate for our data (or vice versa).
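Scoring a clustering against the decision attribute requires matching arbitrary cluster ids to decision labels. One simple way (an assumption about the scoring, not a method stated in the paper) is to try every mapping of the three clusters to the three decisions and keep the best agreement; the toy labels below are invented.

```python
from itertools import permutations

# Score a clustering against decision labels: since cluster ids are
# arbitrary, try every cluster-to-decision mapping and keep the best.
decisions = ["chosen", "accepted", "rejected", "rejected", "accepted", "chosen"]
clusters = [0, 1, 2, 2, 2, 0]

labels = ["chosen", "accepted", "rejected"]
best = max(
    sum(d == mapping[c] for d, c in zip(decisions, clusters))
    for mapping in permutations(labels)
)
print(f"misclassified: {1 - best / len(decisions):.0%}")
```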

EM results for the predictive data models
To verify the predictive data models, the data set was supplemented with the decisions made by each of the models, and the clustering results were compared with each data model. The comparison is similar to the verification against the expert knowledge. For none of the models does the clustering reflect the model decisions (Table 2). As a few of the predictive models do not use all attributes for computing the result, some attributes were also excluded from the data set for the EM algorithm. Further analysis led to the results shown in Table 3. The results from the EM algorithm are much less dependable than those from any predictive model. The question arises why the results are not reliable. The answer may be one of the following:

EM results for selected data
To verify whether the EM algorithm can give good results for different data, the verification was repeated for a selected subset of the data: 65 of the 145 total records (45%). The records were chosen so that they form 'clusters', and the EM clustering algorithm was run again. The results are as good as expected: the data form three distinct clusters reflecting the decision attribute (the decision based on expert knowledge). The clusters are almost 100% accurate in this case. Moreover, it would be easy to reach 100% accuracy by removing the two confusing records from the selected data set.

Conclusions
The EM algorithm is a well known and reliable method for data clustering, especially for data similar to those of our experiment (such as medical or sociological data). Choosing the EM algorithm to verify our predictive data models (and the expert knowledge itself) was a new but very promising idea.
However, the EM algorithm gives good results only if records with the same decision form a cluster, which was demonstrated for the selected subset of the data coming from the experiment. Unfortunately, the whole data set does not meet this requirement. This means that the results from the EM algorithm cannot be used for any verification. All the predictive data models achieve better classification results than EM (compare Tables 1, 2 and 3).
The experiment studied in the paper, that is, the training of mobot operators, is a new problem for data mining. For medical data, for instance, the data models and their parameters are well known and thoroughly verified, so for such data the EM algorithm gives good results.