PARAMETRIC ESTIMATION OF WATER RETENTION USING MGMDH METHOD AND PRINCIPAL COMPONENT ANALYSIS

12345Abstract: Performing a primary analysis, such as principal component analysis (PCA) may increase accuracy and reliability of developed pedotransfer functions (PTFs). This study focuses on the usefulness of the soil penetration resistance (PR) and principal components (PCs) as new inputs along with the others to develop the PTFs for estimating the soil water retention curve (SWRC) using a multi-objective group method of data handling (mGMDH). The Brooks and Corey (1964) SWRC model was used to give a description of the water retention curves and its parameters were determined from experimental SWRC data. To select eight PCs, PCA was applied to all measured or calculated variables. Penetration resistance, organic matter (OM), aggregates mean weight diameter (MWD), saturated hydraulic conductivity (Ks), macro porosity (Mp), micro porosity (Mip) and eight selected PCs were used as predictors to estimate the Brooks and Corey model parameters by mGMDH. Using PR or OM, Ks and MWD, improved the estimation of SWRC in some cases. Using the predicted PR can be useful in the estimation of SWRC. Using either the MP and Mip or the eight


INTRODUCTION
Pedotransfer functions (PTFs) are functional relationships that transfer easily, routinely, or cheaply measured properties (e.g.soil texture, structure and organic matter content) into missing soil properties (e.g.soil water retention curve, saturated hydraulic conductivity, etc.) [6].Many attempts have been made to correlate the parameters of the soil hydraulic models of Brooks and Corey [8] [9,10] and van Genuchten [51] [55] to basic soil properties.The studies include grouping soil into classes [30], developing alternative methods to derive or fit the PTFs [22,40], finding the most influential soil attributes as predictors [19,44,47] and finally using state-of-the-art techniques to develop PTFs [29].About using state-of-the-art techniques regression trees technique has been used to estimate water retention by Pachepsky, et al. [31].Nemes, et al. [25] and Nemes, et al. [27] used k-Nearest Neighbor technique to estimate water retention.Twarakavi, et al. [48] used support vector machines to derive a new set of PTFs and found that all the support vector machines-based PTFs performed better than the Rosetta PTF program.Jana and Mohanty [17] used Bayesian neural networks to predict the soil water retention.
The soil hydraulic properties are necessary in many fields such as hydrology and agronomy [46].Also, water flow and solute transport models need hydraulic properties for accurate simulations [32].In spite of many attempts that have been made to develop parametric PTFs [9,10,36,38,55], accurate prediction of SWRC remains a challenging object in soil physics.
Predicting some of the soil physical or chemical properties and using them as inputs for the parametric PTFs, along with performing some primary analysis could be helpful in raising the accuracy and reliability of the developed PTFs.In the light of the high spatial variability of penetration resistance (PR), it is difficult to collect data measured with an adequate accuracy [53].Moreover, the soil databases are frequently devoid of PR values, which may be useful to predict SWRC [29].On the other hand, some researchers such as To and Kay [45] and Bayat, et al. [3] used regression and artificial neural networks, respectively to estimate PR from the readily available soil data.The question of how reliable it is to apply the published PTFs to predict PR and then, using it as an input for the parametric PTFs, remains unanswered.
Many researchers have encountered the problem of multi-collinearity by using parametric PTFs [40,43].Van den Berg, et al. [50] suggested that it could be caused by interdependency among the input variables.Principal component analysis (PCA) can solve the problem of multi-collinearity [14] among the predictor variables.It is expected that using principal components (PCs) as predictors can improve the accuracy and reliability of estimates.
This paper is a part of Ph.D. thesis presented by Bayat, et al. [3].Several parts of the research have been published previously.The indicated paper and other reported parts of the research are complementary of one another.Fractal parameters were used to estimate water retention through the Brooks and Corey [7] model by Bayat, et al. [2] (parametric method) using multi-objective group method of data handling and artificial neural networks (ANN).Also, fractal parameters have been used as predictors to develop point PTFs using ANNs with adding penetration resistance and principal components to the input parameters [4,5].More recently Neyshaburi, et al. [28] developed PTFs (point method) to estimate the soil water retention using fractal parameters and multi-objective group method of data handling.The new feature of this article was using multi-objective group method of data handling for the point estimation of water content and comparing its results with those of the PTFs developed by ANNs.But, the effect of the principal components and the penetration resistance in the estimation of the parameters of the Brooks and Corey [7] model has not been reported so far.Therefore, this research is designated to complete the previously reported researches by using penetration resistance and principal components as inputs to estimate the parameters of the Brooks and Corey [7] model.Then, in order to be able to compare the results, the same data set was used in this study.
The objectives of the present work aimed at investigating: (i) the utility of applying PCA on original variables and using PCs as predictors to estimate the Brooks and Corey water retention parameters, and (ii) the efficiency of the predicted PR to improve the parametric PTFs.

Sampling and Measurements
In this study a data set of 148 samples were used to develop the PTFs in order to estimate the Brooks and Corey model parameters.The study areas representing a wide range of the soil properties were two provinces of Iran: Hamedan, 124 data (east of Iran) and Guilan, 24 data (north of Iran).This study is a resumption of the research in which PTFs were developed by ANNs and a multi-objective group method of data handling (mGMDH) to predict the Brooks and Corey water retention parameters [2] and all of the data were taken from Bayat, et al. [2], Bayat, et al. [4], Bayat, et al. [5].Therefore, the details of all measurements for the soil properties are given in the cited papers and supplementary materials, and we do not repeat them here.The mineralogical composition of the clay were greatly (2:1 clay types) vermiculites and illite for Guilan [11] and Hamedan [16] provinces, respectively.
Penetration resistance (PR) was measured in -100 kPa matric potential, only on 24 core samples (5.5 cm in diameter and a height of 4.0 cm) collected from Guilan province.For the other 124 samples collected from Hamedan province, PR was estimated using the PTF that had been developed by Bayat, et al. [3] using BD, TP, θ and relative saturation, that are usually available in many databases.Bayat, et al. [3] developed several PTFs to estimate PR.We describe the PTFs used in this study.A data set of 293 data from Hamedan province, Iran, was divided into training and validation groups.The first set of relative saturation (θ v TP -1 ), BD and θ m served as inputs to ANN S in order to predict PR, and the second group used for it ' s validation.The RMSE and coefficient of correlation of PR prediction were 202.2 kPa and 0.80, respectively.
The Brooks and Corey [8] model was used to give a description of the water retention curves: (1) [ ] where is the volumetric water content at the matric suction h and S is the degree of saturation.The parameters to be determined are: the residual ( r θ ), saturated ( s θ ) water contents (m 3 m -3 ), the inverse of the air entry pres- sure α (1 cm -1 ), and the pore size distribution index λ (-).The model parame- ters were estimated with the RETC program [52] using experimental soil water retention data.
Because some of the parameters exhibited non-normal distributions, the following transformations were made: ln (α), ln (λ), log (K s ) and log (PR).
Principal Component Analysis Principal component analysis was applied to all measured and calculated variables and 8 PCs were selected to be used as inputs in the PTFs.Detailed information about the application of principal component analysis and selecting 8 PCs was given in Bayat, et al. [4] and supplementary materials.

Developing PTFs
We followed the procedure by Minasny and McBratney [22], who developed "neuropath" software to create PTFs, in developing mGMDH models.100 data were selected randomly for development and remaining 48 data were used for validation.The PTFs were developed using a multi-objective group method of data handling (mGMDH) technique [24] to estimate θ r , θ s , α and λ in the Eq. 1.
The parameters were predicted by the combination of mGMDH and bootstrap method [13] using 50 bootstrap data sets.Details are given in Bayat, et al. [4] and supplementary materials.
The mGMDH algorithm uses multi-objective optimization.The mGMDH algorithms are given in detail by Atashkari, et al. [1] and Nariman-Zadeh, et al. [24].Genetic algorithms have been used in a multi-objective GMDH neural network for each neuron searching its optimal set of connections with the preceding layer [1].For the multi-objective optimization evolutionary algorithms were applied.The training and testing mean square errors were two objective functions which determined to be minimized in order to find the most accurate and reliable PTFs.In order to perform the mGMDH algorithm, a program was written in MATLAB (The MathWorks, Natick, MA).* 6 * The models would be available upon requesting from corresponding author.Four primary PTFs (from Step 1 to Step 4) were developed before creating the PTF1.Detailed information about the aforementioned four steps is given in the earlier study [2] and supplementary materials.In this paper we focus on the effect of PR, MWD, OM, K s , Mp, Mip, and PCs in the estimation of SWRC.
For each parameter in Eq. 1 four PTFs were developed using various input variables including PCs.To our knowledge no one has used the estimated PR to predict SWRC parameters.Then, PR was used as a new predictor to develop PTF1, along with the inputs in the previous steps (from Step 1 to Step 4).Therefore, to make the inputs clear for each PTF, they were shown in the related tables.PTF2 was built by using OM, MWD, and K s as new inputs.If the addition of an input significantly improved the reliability of the PTF, it was used as an input in all of the following steps; if not, it was not used as a predictor in the following steps (it was also applied to develop PTF1-PTF4).To develop PTF3, Mp and Mip were included in the list of inputs.To develop PTF4, eight selected PCs were used as new inputs.
Since Mp and Mip were calculated from the 4 kPa water retention, a question would rise about using it towards fitting a SWRC model or calculating pore size distribution index λ, rather than estimating them.The answer is that the SWRC models and/or λ cannot be fitted or calculated by using only one point of the water retention data.Instead, it's using could improve the SWRC estimation substantially.

Evaluation Criteria
In this study the accuracy and reliability of each PTF refer to the precision of the estimations for the development and validation data sets, respectively.The improvement of each PTF performance in comparison with the previous one has been evaluated using the evaluation criteria.The coefficient of determination (R 2 ), the root mean square error (RMSE) [57], and the Morgan-Granger-Newbold (MGN) test [12] have been used to evaluate the accuracy and reliability of PTFs.Detailed information about the MGN test is given in the Bayat, et al. [4] and supplementary materials.The MGN values for significant differences for accuracy and reliability are 1.99 and 2.01, respectively.It means that if the calculated MGN values are greater than the aforesaid values, the difference between two PTFs would be significant.
To evaluate the overall performance of the developed PTFs, the estimated parameters of Eq. 1 for each step were used to compute the volumetric water contents corresponding to the experimental values of matric suctions.These water contents were compared with the experimental ones on an individual basis (i.e.curve by curve) by computing the integral mean error, IME (m 3 m -3 ) and the integral root mean square error, IRMSE (m 3 m -3 ).The differences were calculated with the numerical quadrature of the following integrals [43]: PARAMETRIC ESTIMATION OF WATER RETENTION USING MGMDH METHOD… (3) ( ) ( ) where h is the matric suction (kPa) and p θ and are the PTF predicted and the measured water contents at that potential.Limits a and b define the pF range over which the experimental curve is measured.

RESULTS AND DISCUSSION
The statistics of the soil properties for the development and validation of data sets are shown in Table 1.Since both data sets were selected randomly, their parameter ranges were the same and the t-test demonstrated their similarity.The results of PCA are shown in Table 2 and the discussion concerning them is given in the Bayat, et al. [4] and supplementary materials.The coefficients of the parameters show their effect on each PC (Table 3).

Saturated Water Content ( )
Table 4 depicts the four developed PTFs to predict θ s using the input variables including PR and PCs.
PTF1.By going over the step 1-4 of the previous study [2] only addition of detailed information of PSD (s g and M d ) to the basic soil properties could improve the estimation of θ s .Therefore, attributed to PTF1 was compared with the PTF with input variables of C, TP, BD, s g and M d .In line with the MGN value no significant improvement occurred when the PR was included in the list of inputs (Table 4).It can be estimated that PR did not improve the θ s because high water PR content was relatively insensitive to soil structural parameters [23], and therefore PR could not differentiate the structural effects on θ s .The indicated finding is similar to the one reported by Pachepsky, et al. [29], who found that PR did not lead to a significant improvement in θ s estimates.a) Refer to the notation list for the variables description.b) MGN values for significant differences (P < 0.05) for accuracy and reliability are 1.99 and 2.01 respectively.
PTF2.Using OM, K s and MWD did not lead to a significant improvement either (Table 4).It could result from the inter dependencies between these parameters and the other input variables (e.g.correlation between BD and OM) [35] or the pronounced effect of OM, K s and MWD on water retention at matric suctions other than 0 kPa (e.g.θ FC or θ r ) [18].Mayr and Jarvis [21], also found that OM was excluded as a predictor for θ s by regression technique.Starks, et al. [42] also reported that using K s improved the water retention estimates at some sites, but deteriorated the estimates at some other sites.
PTF3.Using the Mp and Mip caused an improvement, even though it was not significant in the case of PTF reliability (Table 4).This is in agreement with many authors who successfully used water contents at one or two matric suctions to predict SWRC [34,39,56].
PTF4.A significant improvement of the PTF reliability occurred when 8 selected PCs were included in the list of inputs (Table 4).Using original variables themselves did not lead to a significant improvement of PTF reliability, whereas using PCs values did.This may explain the complexity of the soil system that is difficult to capture even by the mGMDH technique.It could be the reason why Schaap and Bouten [37] could not estimate θ s from basic soil properties by ANNs.However, Ungaro, et al. [49] found the simplest structure of GMDH network for θ s .

Residual Water Content ( )
Table 5 depicts the five developed PTFs to predict θ r using input variables including PR and PCs.a) Refer to the notation list for the variables description.
b) MGN values for significant differences (P < 0.05) for accuracy and reliability are 1.99 and 2.01 respectively.
PTF1 and PTF2.Since in the step 1-4 of the previous study (Bayat et al., 2011) only addition of fractal parameters of Millan's model (C 2 M, d c M, D 1 M, D 2 M and Se 2 M) to the basic soil properties could improve the estimation of θ r therefore, attributed to PTF1 was compared with the PTF with input variables of S, Si, C, TP, C 2 M, d c M, D 1 M, D 2 M and Se 2 M. According to the MGN value no significant improvement occurred when the PR or OM, K s and MWD were included in the list of inputs (Table 5).
PTF3.No significant improvement occurred when Mp and Mip were included in the list of inputs (Table 5).In physical terms this may suggest that the θ r dominated by hygroscopic water, depends only on the micropores size distribution, whereas the threshold suction between Mp and Mip is 4 kPa, that could improve the estimation of θ s and water content at lower suctions, but which did not improve θ r estimates.
PTF4.No significant improvement occurred when 8PCs were included in the list of inputs (Table 5).It could result from the fact that no significant correlation was found between θ r and 8PCs (data is not shown).Also, poor correlation of the θ r with other soil physical properties was reported by Luckner, et al. [20] and Tietje and Tapkenhinrichs [43].We thought it might be in the light of the complexity of the relationship between θ r and soil physical properties or PCs, thus using more related variables such as clay mineralogy may help to solve the problem.Since, the estimation of θ r was not improved, using different input variables in various PTFs then, we decided to estimate θ r using 20 PCs, explaining 99.48% of the total variation of original variables.
PTF5.Only twenty PCs, were used as predictors (without any other original variable), and caused a highly significant improvement of PTF accuracy and reliability (Table 5).Particular features of the soil properties may affect θ r in such a way that they could not be recognized by using original variables as inputs.Nevertheless, PCA could put these features in some PCs that explain a high percentage of the total variance, and using them as predictors increased development and validation R 2 from 0.543 and 0.608 to 0.82 and 0.912, respectively, which are quite high.Some of the values of R 2 for θ r estimates found in the literature are 0.722 [55], 0.41 [38], 0.791 [46], 0.68 [49] and 0.599 [41].

Reverse of Air Entry Value ( )
Table 6 depicts the three developed PTFs to predict α using input variables including PR.  a) Refer to the notation list for the variables description.b) MGN values for significant differences (P < 0.05) for accuracy and reliability are 1.99 and 2.01 respectively.
PTF1.In the step 1-4 of the previous study [2] only TP, BD, S, Si, C and Se 2 B were selected as input variables to estimate the α.Therefore, attributed to PTF1 was compared with the PTF with those input variables.According to the MGN value no significant improvement occurred when PR was included in the list of inputs (Table 6).A reason could be that PR was predicted from other soil physical properties (e.g.TP, BD) and they had been included in the list of inputs.It is conjectured that the mGMDH procedure, in effect, used TP and BD values to estimate PR, and these estimates had improved estimates of α, thus using PR as another input did not induce further improvement.
PTF2.No significant improvement of PTF accuracy and reliability occurred when OM, K s and MWD were included in the list of inputs (Table 6).
Selecting OM, K s and MWD in the equation to estimate α explains that there are correlations between α and the indicated parameters, the correlations, nevertheless, are not strong enough to induce a significant improvement.
PTF3.A highly significant improvement was achieved when Mp and Mip were included in the list of inputs (Table 6).Air entry values were in the range between 0.125 and 607 kPa for 70 % of samples and the threshold between Mp and Mip is 4 kPa.It is inferred from the literature [26,34] that when water content at a matric suction is used to estimate water content at other matric suctions, the closer the two points are the more accurate the estimates.Therefore the closeness of 4 kPa with air entry values could explain the highly significant effect of Mp and Mip on the α estimates.Using 8PCs did not induce a significant improvement.This is the reason why results have not been shown.

PORE SIZE DISTRIBUTION INDEX (Λ)
Table 7 depicts the four developed PTFs to predict λ using input variables including PR and PCs.a) Refer to the notation list for the variables description.b) MGN values for significant differences (P < 0.05) for accuracy and reliability are 1.99 and 2.01 respectively.
PTF1.In the step 1-4 of the previous study [2] only basic soil properties were selected as input variables to estimate the pore size distribution index.Therefore, attributed to PTF1 was compared with the PTF with input variables of S, Si, C, TP and BD.Using PR improved the PTF accuracy and reliability, however its effect was not significant in line with the MGN value (Table 7).It could result from the correlations between PR and the basic soil physical properties [15] or the error in the PR estimates.It is suggested that using the PR measured values can improve the λ estimates.
PTF2.No significant improvement was achieved by the inclusion of OM, K s and MWD in the list of inputs (Table 7).The situation is similar to that pointed out by Vereecken, et al. [54] who found that OM and K s were not selected in the regression to estimate λ.PTF3.A significant improvement of the estimates occurred when Mp and Mip were included in the list of inputs (Table 7).This suggests that λ has been strongly influenced by the pore size distribution.Water content at one or two matric suctions have been successfully used to estimate the WRC [26,29,33,38,39].
PTF4.A significant improvement of PTF accuracy was achieved when 8PCs were included in the list of inputs; however the improvement of PTF reliability was not significant (Table 7).This suggests that applying PCA on the original variables and using PCs as predictors to estimate λ was very useful.Nevertheless using most of original variables did not induce a significant improvement, whereas using PCs did.
The effect of each input variable on the PTF estimated water content has been evaluated by IRMSE and IME (Table 8).Using the predicted PR led to a small improvement of the reliability.It shows that using the predicted PR can be useful in the estimation of SWRC and further investigation is needed.Using OM, K s and MWD did not improve the water content estimates, whereas Mp and Mip or 8PCs did.The most reliable and accurate water contents estimates were achieved when 8PCs were included in the list of inputs, and using 8 PCs considerably improved the PTFs.
CONCLUSION 1.Using the predicted PR or OM, K s and MWD improved the estimation of the Brooks and Corey water retention parameters in some cases but they did not induce a significant improvement in the water content estimates.
2. Using either the Mp and Mip or the 8 PCs significantly improved the PTF accuracy and reliability, and they improved the water content estimates.This suggest that using direct information on pore size distribution could be useful to develop parametric PTFs.
3. Applying PCA on the original variables could be used as a primary analysis to develop parametric PTFs.
4. Overall, using PCs, Mp and Mip (if available) as inputs along with the basic soil properties, provide better water retention estimates.

TABLE 1 .
STATISTICS OF THE SOIL PHYSICAL PROPERTIES FOR THE DEVELOPMENT AND VALIDATION OF DATA SETS.
*) Refer to the notation list for the variables description.

TABLE 2 .
THE FIRST EIGHT PRINCIPAL COMPONENTS OF ALL ORIGINAL VARIABLES.
*) Refer to the notation list for the variables description.

TABLE 3 .
COEFFICIENT OF EACH STANDARDIZED VARIABLE IN THE PCS.
*) Refer to the notation list for the variables description.

TABLE 4 .
DEVELOPMENT AND VALIDATION RESULTS AND ERROR STATISTICS OF Θ S ESTIMATES MADE WITH DISTINCT INPUTS BY MGMDH PTF2.

TABLE 5 .
DEVELOPMENT AND VALIDATION RESULTS AND ERROR STATISTICS OF Θ R ESTIMATES MADE WITH DISTINCT INPUTS BY MGMDHPTF1 AND PTF2.

TABLE 6 .
DEVELOPMENT AND VALIDATION RESULTS AND ERROR STATISTICS OF LN Α ESTIMATES MADE WITH DISTINCT INPUTS BY MGMDHPTF1.

TABLE 7 .
DEVELOPMENT AND VALIDATION RESULTS AND ERROR STATISTICS OF LN Λ ESTIMATES MADE WITH DISTINCT INPUTS BY MGMDHPTF1.

TABLE 8 .
THE COMPARISON OF THE PTF ESTIMATED AND MEASURED WATER CONTENTS FOR EACH STEP.
[2]Refer to the notation list for the variables description.b)Results of the step4 in the study of Bayat, et al.[2].