PERFORMANCE EVALUATION OF REMOTE SENSING DATA WITH MACHINE LEARNING TECHNIQUE TO DETERMINE SOIL COLOR

Abstract. This study aims to determine soil color from spectral bands and indices derived from MODIS images. Soil samples were collected from East Azerbaijan Province (Iran), and their color and texture were determined with the Munsell color system and the hydrometer method, respectively. Stepwise regression, principal component analysis and sensitivity function methods were employed to find the dominant indices and bands, using an artificial neural network (ANN) as one of the machine learning techniques. Improved indices performed better as model inputs: for example, the correlation coefficient with hue increased by 51.48% from the normalized difference vegetation index (NDVI) to the modified soil-adjusted vegetation index (MSAVI), and by 54.54% from the soil-adjusted vegetation index (SAVI) to MSAVI. Selecting inputs by stepwise regression lowered the error criteria and may therefore enhance the performance of the soil color model. Compared with multivariate regression, the ANN model performed better (with a 12.61% decline in mean absolute error [MAE]). The temporal variation of the modified perpendicular drought index (MPDI) and of band 31 could explain the variations of the Munsell soil color components, specifically chroma and hue. MPDI and thermal bands could be employed as precise indicators in soil color analysis. Thus, remote sensing data combined with a machine learning technique show potential for soil color determination.


INTRODUCTION
As one of the major soil attributes, soil color is used by the USDA Natural Resources Conservation Service (NRCS) to describe soil morphology (Stiglitz et al. 2017). Soil properties and internal processes can be inferred from soil color. Moreover, it is a comprehensive indicator of water drainage, aeration, and organic matter content of soils (Henry 1991, Han et al. 2016). An efficient method to determine soil color is therefore of crucial importance. A common method is to match the soil sample with the standard Munsell color charts, which describe color by three components: hue, value and chroma (HVC). Hue describes the dominant wavelength, while chroma and value represent saturation and brightness, respectively (Leone and Escadafal 2001). The condition of the sample and the proficiency of the person who performs the matching can influence the precision of the visual Munsell color system (Singh et al. 2004). Indeed, this method is highly sensitive to the user's skill (Stiglitz et al. 2016). Using Munsell charts to determine soil color over large areas is time-consuming and expensive, and it requires a large number of samples. This problem can be addressed with multi-spectral reflectance. In this context, satellite data have shown high potential for presenting, mapping, and forecasting processes such as soil classification and erosion, and for providing soil maps over vast areas (Singh et al. 2004). Remote sensing can thus be a useful tool because it overcomes the limitations of ground observation (Huesca et al. 2014).
Remote sensing has been an appropriate alternative for determining soil color (Singh et al. 2004). Satellite images with various spectral bands can be employed to obtain the desired information, and satellite indices can be derived by applying mathematical operations to the spectral bands (Raghavendra and Mohammad Aslam 2017). Depending on the governing conditions, the indices have to be improved; for instance, soil background conditions can influence partial canopy spectra and the calculated indices, so the soil background effect has to be minimized. To this end, Huete (1988) introduced a vegetation index with a soil-adjustment factor that depends on the vegetation level. The precision of these indices has been validated in numerous applications: for instance, Park et al. (2017) calculated the crop coefficient and evapotranspiration with a vegetation index in Northeast Asia, and Bijaber et al. (2018) used satellite indices for drought monitoring. Studies on soil color determination with satellite-based indices are summarized in the following part.
The correlation between the normalized difference vegetation index (NDVI) and the components of the Munsell soil color chart has been studied in Brazil using data from the Advanced Very High-Resolution Radiometer (AVHRR) of the National Oceanic and Atmospheric Administration (NOAA). Based on a linear empirical model, good correlations were observed among NDVI, hue and chroma; NDVI was more strongly associated with hue and chroma than with value. It has been shown that various indices are valid under specific environmental conditions (Singh et al. 2004). The determination of the Munsell soil color components has also been investigated in arid regions using Landsat TM data for 20 sample sites, re-sampled to windows of 3 × 3 pixels. A strong correlation of the first and seventh TM bands with the Munsell notation indicated the dependence of hue, value and chroma on the visible bands of Landsat data. All three Munsell components were modeled, and value exhibited the strongest correlation compared with hue and chroma. However, further studies were recommended to obtain accurate results (Matinfar et al. 2010). Several studies have addressed the relationship between soil texture and the three components of the Munsell soil color system. According to these studies, value is very weakly correlated with clay content, and the correlation between sand and the color components is also very weak. However, all color components are well correlated with silt content (Gunal et al. 2008).
The combination of machine learning techniques and remote sensing data has been widely employed to predict soil features (e.g. soil organic matter, cation exchange capacity, magnesium and potassium content, and pH). Combining satellite data with machine learning algorithms can yield highly accurate estimates of soil features, which highlights their applicability in mapping soil properties (Khanal et al. 2018). Because spectral bands and satellite indices are the modeling tools in remote sensing applications, their comprehensive study is vital for the determination of soil color; however, only a limited number of studies have addressed this issue. Based on the literature, the determination of soil color requires a complete assessment of satellite data performance (Singh et al. 2004, Matinfar et al. 2010).
The goal of the present study is the evaluation of the applicability of remote sensing information for determination of the soil color leading to a precise comparison of the spectral and thermal bands as well as various indices. For this purpose, the samples were collected from East Azerbaijan Province, Iran.
Stepwise regression, principal components analysis and sensitivity function methods were applied to find the most important indices or bands. The performance of methods was also assessed using an artificial neural network (ANN) and multivariate regression models.

Study area
The soil samples were collected from the lands of Azarbaijan Shahid Madani University in Azarshahr, located in East Azerbaijan Province (northwest of Iran), as shown in Fig. 1. According to the De Martonne and Thornthwaite climate classification indices, the predominant climate of the region is semiarid. The proximity of the studied area to Urmia Lake was one of the serious challenges, as the drying of the lake has had an adverse impact on the climate of the region (Mahsafar et al. 2017). In this context, assessing soil parameters (e.g. soil color) in the region is crucial, as measures taken according to these parameters may improve the soil condition. A total of 41 topsoil samples were obtained, with sampling locations chosen where the soil showed apparent changes. The hydrometer method and Munsell color charts were employed to evaluate soil texture and color, respectively. The soil texture of the studied area was mainly sand and loamy sand with very low organic matter content.

Indices of satellite images
The data derived from satellite images, spectral and thermal bands, were expressed as indices which were calculated based on the ratios or normalized differences of 2 or 3 bands. The used indices are represented in Table 1 (note that land surface temperature (LST) shows the performance of the thermal bands).

NDVI is a highly potent index but suffers from several drawbacks. NDVI takes high values in heterogeneously covered lands where the soil, climate and ecosystem are more favorable (Rymuza et al. 2012). SAVI and MSAVI were improved and developed by means of an iterative and variable L function which minimizes the soil background effect (Jiang et al. 2007). The relationship between red and near-infrared reflectance can be expressed by the soil line, which characterizes in a plot the spectral behavior of pixels without variations in vegetation or substantial moisture content. The perpendicular drought index (PDI) is defined as the perpendicular distance, in the near-infrared and red spectral space, from a partially covered pixel to the line passing through the coordinate origin. Accordingly, parameters such as type of land cover, land use and soil heterogeneity are not taken into account (Ghulam et al. 2007). Shahabfar et al. (2012) introduced the modified perpendicular drought index (MPDI). LST indicates the energy balance of the processes on the earth's surface at both regional and global levels (Ghulam et al. 2007). The temperature condition index (TCI) was used to estimate the contribution of the thermal band to condition evaluation (Du et al. 2013), while TVX simply shows the ratio of thermal and reflected radiation.
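Table 1 is not reproduced in this text, so the exact equations used in the study cannot be confirmed here. As a minimal sketch, the widely used forms of NDVI, SAVI, the closed-form MSAVI and PDI can be written as below. The function names, the default L = 0.5 and the soil-line slope argument M are illustrative assumptions; for MODIS, red and near-infrared reflectance correspond to bands 1 and 2.

```python
import numpy as np

def ndvi(red, nir):
    """Normalized difference vegetation index."""
    red, nir = np.asarray(red, float), np.asarray(nir, float)
    return (nir - red) / (nir + red)

def savi(red, nir, L=0.5):
    """Soil-adjusted vegetation index; L = 0.5 is a common default (assumption)."""
    red, nir = np.asarray(red, float), np.asarray(nir, float)
    return (1.0 + L) * (nir - red) / (nir + red + L)

def msavi(red, nir):
    """Closed-form MSAVI (often called MSAVI2), which removes the need to choose L."""
    red, nir = np.asarray(red, float), np.asarray(nir, float)
    a = 2.0 * nir + 1.0
    return (a - np.sqrt(a ** 2 - 8.0 * (nir - red))) / 2.0

def pdi(red, nir, M):
    """Perpendicular drought index; M is the slope of the soil line (assumed known)."""
    red, nir = np.asarray(red, float), np.asarray(nir, float)
    return (red + M * nir) / np.sqrt(M ** 2 + 1.0)
```

These functions accept scalars or arrays of reflectance, so they can be applied to a whole image at once. MPDI additionally requires the vegetation fraction and vegetation reflectance, which are not given in this text and are therefore left out of the sketch.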
The indices and bands with a substantial impact on the three components of the Munsell color charts (HVC) play a more significant role in soil color determination. In this regard, stepwise regression, principal component analysis and sensitivity function methods were applied to identify the effective indices or bands. The details of these methods are discussed in the following sections. The association of the Munsell color system (HVC) components with soil texture was also evaluated in this research.

Stepwise regression
Stepwise regression applies an automatic model selection procedure and is particularly useful in cases with a large number of potential explanatory variables (Sharma and Jin 2015). It is a member of the linear regression family that builds the model step by step in an iterative manner. In this method, the independent variables are selected automatically, and the regression is constructed recursively by adding or removing independent variables with the aid of matrix tools (Peng et al. 2018).
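The selection loop can be sketched as follows. This is a simplified illustration and not the procedure used in the study: it performs forward selection only (no removal step), it uses a fixed F-to-enter threshold of 4.0, and the names `forward_stepwise`, `rss` and `f_enter` are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an ordinary least squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_stepwise(X, y, names, f_enter=4.0):
    """Add, one at a time, the variable with the largest partial F statistic."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))
    current = rss(X[:, selected], y)  # intercept-only model at the start
    while remaining:
        best = None
        for j in remaining:
            trial = rss(X[:, selected + [j]], y)
            df = n - (len(selected) + 1) - 1  # residual degrees of freedom
            if df <= 0 or trial <= 0.0:
                continue
            F = (current - trial) / (trial / df)
            if best is None or F > best[0]:
                best = (F, j, trial)
        if best is None or best[0] < f_enter:
            break  # no remaining variable improves the fit enough
        selected.append(best[1])
        remaining.remove(best[1])
        current = best[2]
    return [names[j] for j in selected]
```

Here each column of `X` would hold one candidate band or index and `y` one Munsell component; the returned list gives the variables in the order in which they entered the model.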

Principal component analysis
Principal component analysis (PCA) is known as one of the most significant multivariate statistical methods. It transforms a large number of variables into a smaller number of dominant variables called "principal components". The analysis is based on the fact that a few components, which depend strongly on the primary variables, can explain most of the total variance of the dataset. The first principal component has the highest possible variance. The second component is computed in such a way that it is orthogonal to the first component and has the highest possible inertia. The other components are calculated in the same way (Abdi and Williams 2010, Rymuza et al. 2012). The Kaiser-Meyer-Olkin (KMO) test indicates sampling adequacy and whether PCA can be performed; according to Wuttichaikitcharoen and Babel (2014), its minimum value for an acceptable PCA is 0.5.
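As a minimal sketch of these two steps, PCA can be computed from the eigen-decomposition of the correlation matrix, and the overall KMO measure from the correlations and partial correlations. The function names are illustrative, and the use of the correlation (rather than covariance) matrix is an assumption that fits inputs measured in different units.

```python
import numpy as np

def pca_from_correlation(X):
    """Eigenvalues, eigenvectors and explained-variance ratios, sorted descending."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)  # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvecs, eigvals / eigvals.sum()

def kmo(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale  # partial correlations (off-diagonal entries)
    off = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)
```

With the eigenvalues in hand, the number of components with eigenvalue above 1 is `int(np.sum(eigvals > 1))`, and `np.cumsum(explained)` gives the cumulative share of variance.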

Sensitivity function
A univariate regression analysis has to be employed to evaluate the relationship between two variables and thus to determine the performance of the indices or bands relative to the Munsell color system components (HVC). This implies investigating the sensitivity of the indices or bands throughout the full range of the Munsell color system components (HVC). In a bivariate regression model, the Munsell color system components (HVC) are considered as independent variables, while the satellite data are taken as dependent ones. The error in the estimation of the dependent variable (ŷ) can be formulated through the standard error of ŷ. The standard error of ŷ can be estimated by Eq. 1 for linear regression and curvilinear models and by Eq. 2 for nonlinear regression. The sensitivity function (s) is defined in Eq. 3.
Where: σ² is the MSE, X represents the matrix of independent variables, X_i denotes the i-th row of X, F stands for the derivative matrix and
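Eqs. 1-3 are not reproduced in this text. As a hedged sketch for the linear case only, one common formulation that is consistent with the symbols defined above takes the standard error of ŷ at observation i as sqrt(σ² X_i (XᵀX)⁻¹ X_iᵀ) and the sensitivity as the slope dŷ/dx divided by that standard error. Both expressions are assumptions about the missing equations, and the function name is hypothetical.

```python
import numpy as np

def sensitivity_linear(x, y):
    """Sensitivity s = (d y_hat / dx) / se(y_hat) for a simple linear regression.

    x: Munsell component (independent variable), y: index or band (dependent).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (len(y) - 2)  # sigma^2 estimate
    XtX_inv = np.linalg.inv(X.T @ X)
    # Quadratic form X_i (X'X)^-1 X_i' evaluated for every row i
    se_fit = np.sqrt(mse * np.einsum("ij,jk,ik->i", X, XtX_inv, X))
    s = beta[1] / se_fit  # slope is constant in the linear case
    return s, se_fit
```

In this linear form the standard error is smallest at the mean of x, so the absolute sensitivity peaks there and decreases toward the ends of the range.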

ANN model
ANN can be employed to assess how efficiently the methods select the bands or indices. ANN was developed as a mathematical representation of the biological neural network. Using ANN, complicated associations between input and output data can be described. Similar to the calibration of a mathematical model, ANN involves learning or training. An activation function is a function that transforms the activation level of a unit (neuron) into an output signal. The weights of the ANN are determined by minimizing the error between observed and simulated data (Ozgür 2005, Akhtar et al. 2009). The feed-forward back propagation network was applied in this study.
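To illustrate the training principle, the sketch below implements a small feed-forward network with one hidden layer, trained by back propagation with plain gradient descent. The architecture and settings (five hidden units, tangent sigmoid hidden layer, linear output, learning rate, number of epochs) are assumptions chosen for illustration and are not the configuration reported in the study.

```python
import numpy as np

def train_mlp(X, y, hidden=5, lr=0.05, epochs=2000, seed=0):
    """Train a one-hidden-layer network; returns a predict function and the loss history."""
    X = np.asarray(X, float)
    y = np.asarray(y, float).reshape(-1, 1)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W1 = rng.normal(scale=0.5, size=(p, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: tanh hidden layer, linear output layer
        H = np.tanh(X @ W1 + b1)
        out = H @ W2 + b2
        err = out - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the mean squared error
        g_out = 2.0 * err / n
        gW2 = H.T @ g_out
        gb2 = g_out.sum(axis=0)
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)  # derivative of tanh
        gW1 = X.T @ g_hid
        gb1 = g_hid.sum(axis=0)
        # Gradient descent update
        W2 -= lr * gW2
        b2 -= lr * gb2
        W1 -= lr * gW1
        b1 -= lr * gb1

    def predict(X_new):
        X_new = np.asarray(X_new, float)
        return (np.tanh(X_new @ W1 + b1) @ W2 + b2).ravel()

    return predict, losses
```

In practice the inputs would be standardized first, and part of the samples would be held out to check that the network generalizes rather than memorizes the training data.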

Error criteria of model performance evaluation
Equations 4-7 describe the error criteria used to assess ANN performance for the different input selection methods. The criteria with the lowest values reflect the best performance. Fig. 2 shows the research procedure in a flowchart.
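Eqs. 4-7 are not reproduced in this text. As a sketch, three criteria named elsewhere in the paper (RMSE, MAE and RRMSE) can be computed as follows; defining RRMSE as RMSE divided by the mean of the observations, in percent, is an assumption, and the fourth criterion is unknown and therefore omitted.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def mae(obs, pred):
    """Mean absolute error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(pred - obs)))

def rrmse(obs, pred):
    """Relative RMSE in percent of the observed mean (assumed definition)."""
    obs = np.asarray(obs, float)
    return 100.0 * rmse(obs, pred) / float(np.mean(obs))
```

For example, with observations [1, 2, 3] and predictions [1, 2, 5] the only error is 2 on the last sample, which gives an MAE of 2/3 and an RMSE of sqrt(4/3).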

Satellite images data and their relationship with Munsell system components
Satellite images from selected days in April, May, June and October of 2015, 2016 and 2017 were used; these months were chosen because soil sampling was easier in them. The input variables were bands 1, 2, 3, 4, 31 and 32 as well as the indices listed in Table 1, covering visible, near-infrared and thermal bands. Given the variety of near-infrared and thermal bands, the resolution was set to 1 km. The data derived from the satellite images were averaged over 9 pixels (3 × 3) centered on the sampling point. LST was computed using the split-window algorithm developed by Price (1984).
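The 3 × 3 averaging step can be sketched as below, assuming the band or index is available as a 2-D array and the sampling point has already been converted to a row and column. The names `raster`, `row` and `col` are hypothetical, and clipping the window at the image edge and ignoring missing values are assumptions.

```python
import numpy as np

def window_mean(raster, row, col, half=1):
    """Mean of the (2*half+1) x (2*half+1) window centered on (row, col).

    The window is clipped at the raster edges and NaN pixels are ignored.
    """
    r0, r1 = max(row - half, 0), min(row + half + 1, raster.shape[0])
    c0, c1 = max(col - half, 0), min(col + half + 1, raster.shape[1])
    return float(np.nanmean(raster[r0:r1, c0:c1]))
```

With `half=1` the window covers the 9 pixels described above; a pixel on the border of the image simply uses the neighbors that exist.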

Fig. 2. Flowchart of the research procedure: (1) download of MODIS images (2015-2017); (2) derivation of the satellite image information and calculation of the indices with the equations in Table 1; (3) selection of the dominant indices or bands using three methods (stepwise regression, principal component analysis, sensitivity function); (4) selection of the precise method using ANN and multivariate regression with some error criteria to determine the dominant inputs related to HVC.

The effect of soil texture on the Munsell color system (HVC) components was also evaluated; the results of the soil texture determination of the samples are given in Table 2. The soil texture of the studied area was dominantly composed of sand and loamy sand, with loamy sand less widely distributed. The correlation of the satellite-derived data and soil particle size with the three Munsell color components (HVC) is tabulated in Table 3. Following Hurst (1977), hue data were expressed numerically as 7.5YR = 17.5, 10YR = 20, 5YR = 15, and 2.5YR = 12.5. For hue, the significant parameters were LST, VCI and bands 31 and 32, while for chroma they were silt, LST, VCI, PDI, MPDI and bands 1, 2, 4, 31 and 32. The value component showed no correlation with the other data. The low correlation between soil texture and the Munsell components could be attributed to the homogeneous distribution of soil texture, as the texture of the studied area had poor diversity. Chroma had a higher number of significant correlation coefficients than hue. Hue and chroma showed the strongest correlation with band 31 and MPDI, respectively, and the correlation of chroma with MPDI was stronger than that of hue. In general, the reflectance and thermal bands were effective for both hue and chroma, although the influence of the thermal bands was more profound for hue. Among the color components, value exhibited the lowest correlation with clay content. Hue was negatively correlated with clay. Moreover, the visible spectral bands were weakly correlated with hue. Furthermore, the improved indices increased the correlation coefficient. For hue, the correlation coefficient increased by 51.48% from NDVI to MSAVI and by 54.54% from SAVI to MSAVI. For chroma, it improved by 6.25% from SAVI to MSAVI and by 7.2% from PDI to MPDI. Given the variety of the studied indices and bands, three methods were applied to find the major inputs of the models.
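The numerical hue coding quoted above from Hurst (1977) can be sketched as a small conversion function. The offset of 10 for the YR family follows directly from the four values given in the text; the offsets for the R and Y families are assumptions added for completeness, and the function name is hypothetical.

```python
import re

# YR offset follows from the values quoted in the text (e.g. 7.5YR = 17.5).
# The R and Y offsets are assumptions extending the same scheme.
HUE_OFFSETS = {"R": 0.0, "YR": 10.0, "Y": 20.0}

def hue_to_number(hue):
    """Convert a Munsell hue notation such as '7.5YR' to a number."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(R|YR|Y)\s*", hue.upper())
    if match is None:
        raise ValueError(f"unsupported hue notation: {hue!r}")
    return float(match.group(1)) + HUE_OFFSETS[match.group(2)]
```

On this scale a larger number means a yellower hue and a smaller number a redder one, which is how the sign of a correlation with hue is interpreted.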

Stepwise regression
Independent variables were selected automatically by stepwise regression analysis to find the variables that are important for the soil color components. The effective variable for hue was band 31, while MPDI and band 31 were effective for chroma. Band 31 thus influences two components of Munsell color.

Principal component analysis
The effective factors of the Munsell soil color components (HVC) were determined by PCA. The Kaiser-Meyer-Olkin measure was calculated as 0.599, showing suitability in terms of sampling adequacy and the possibility of performing PCA. The trend of the eigenvalues is depicted in Fig. 3, which clearly shows that the eigenvalue decreases as the component number increases. Moreover, the first five components account for 95.25% of the entire variation (their eigenvalues were higher than 1). Table 4a also lists the PCA results. As Table 4b shows, PDI, MPDI and bands 1, 2 and 4 had the maximum coefficient values for the first component; bands 31 and 32 had the maximum coefficients for the second component; and NDVI, sand and clay exhibited the maximal coefficient values in the third, fourth and fifth components. The coefficient values related to the first component were, however, greater than the others.

Sensitivity function
The sensitivity function is calculated from the relationship between the three Munsell color components (HVC) and the satellite information. Fig. 4 shows the sensitivity function results and the impact of the three Munsell components (HVC) on the sensitivity trend of each index or band. For s > 2 (a t-score of 2 is the critical value), the sensitivity of the indices or bands to the Munsell soil color components (HVC) becomes significant. Chroma exhibited high sensitivity to band 1, MPDI and PDI, while hue was highly sensitive to MPDI and LST. Value showed high dependence on LST and PDI. The sensitivity of SAVI and MPDI remained relatively consistent throughout the value range. The indices, however, showed decreased sensitivity for chroma, hue and value less than 7, 19.5 and 7.5, respectively. MPDI was highly sensitive with regard to two Munsell components.

ANN sensitivity analysis
The variables selected with the three methods were used as ANN inputs. The sensitivity of the ANN to the type of activation function (Log Sigmoid, Tangent Sigmoid and Pure Linear) was examined through the error criteria. Fig. 5 illustrates the RMSE variation for the different activation functions of the ANN model with inputs from the PCA method. For value, the minimum RMSE was obtained with Tangent Sigmoid in the hidden layer and Linear in the output layer; for chroma, the lowest RMSE was obtained with Log Sigmoid in the hidden layer and Linear in the output layer. The maximum RMSE for the value and chroma components was obtained with Tangent Sigmoid in the hidden layer and Log Sigmoid in the output layer.

Evaluation of the performance of input selection methods
The performance of the stepwise regression, principal component and sensitivity function methods was assessed using ANN and multivariate regression, as presented in Fig. 6 (the abbreviations in the figure are as follows: HSM - hue stepwise multivariate, HPM - hue principle multivariate, HSFM - hue sensitivity function multivariate, VPM - value principle multivariate, VSFM - value sensitivity function multivariate, CSM - chroma stepwise multivariate, CPM - chroma principle multivariate, CSFM - chroma sensitivity function multivariate, HAS - hue ANN stepwise, HPA - hue principle ANN, HSFA - hue sensitivity function ANN, VPA - value principle ANN, VSFA - value sensitivity function ANN, CSA - chroma stepwise ANN, CPA - chroma principle ANN, CSFA - chroma sensitivity function ANN-RRMSE). In terms of the error criteria, stepwise regression had the lowest error values compared with the other methods. With stepwise regression, band 31 was selected for hue, while MPDI and band 31 were selected for chroma. MPDI and the thermal bands were therefore selected as the dominant variables for soil color determination. A comparison of the statistical methods revealed that the ANN model could reduce the errors. The performance of the satellite indices in terms of temporal variation was also investigated. June had the maximum LST among the studied months. Rising temperature reduced the soil moisture content, and soil reflectance increased as soil moisture decreased. Increased soil reflectance resulted in a lighter color. The chroma values ranged from 0 (neutral color) to 8 (maximal color intensity). During June, the maximum chroma values were associated with the highest MPDI and LST. Wang et al. (2004) stated that NDVI is negatively correlated with LST. The highest LST values were recorded in June, which did not coincide with the months of minimum NDVI. This indicates the higher efficiency of the thermal bands compared with NDVI. Low TCI values reflect a serious moisture condition (Du et al. 2013), so the minimum TCI values would be expected in June; however, the observed minimum did not occur in that month. Such an inconsistency raises some doubts about the efficacy of TCI.

DISCUSSION
Correlation coefficient analysis was used to investigate the effect of soil texture on the three components of the Munsell chart. A positive correlation implies that a Munsell component increases as the content of the given soil particle fraction increases. The analysis indicated the lowest coefficients between clay content and value; Gunal et al. (2008) also reported a poor correlation between value and clay content. Value was negatively correlated with sand and positively correlated with clay. Silt and clay have high visible light reflectance, resulting in light-colored soils. Furthermore, the specific surface area of sand particles is lower than that of silt or clay, so organic matter can easily cover the sand particles, giving rise to darker colors (Ibanez-Asensio et al. 2013). Hue was positively correlated with sand, while its correlation with clay and silt was negative, which is in line with the reports of Kone et al. (2009) and Leone and Escadafal (2001). A positive correlation between a variable and hue means that the variable makes the soil more yellowish, while a negative correlation implies an increase in the red ratio of the soil color. The negative correlation between hue and clay could result from weathered soil particles. Soils with finer texture exhibited more saturated colors, which might be attributed to reddish fine-grained sedimentary rocks developed as a result of elevated iron (in oxide form). Higher chroma values were detected in sandy soils (Ibanez-Asensio et al. 2013), which was verified by a positive correlation between chroma and sand content. Escadafal et al. (1989) studied Munsell soil color and soil reflectance with Landsat MSS and reported a weak correlation between hue and the visible spectral bands. The soil color components always exhibited less than 50% correlation with the spectral bands, while their correlation with MPDI, PDI and band 31 always exceeded 50%. Therefore, MPDI and the thermal bands were considered the dominant inputs. In terms of the sensitivity function, MPDI showed high sensitivity to two components of the Munsell color system. According to Baret and Guyot (1991), noise can be dramatically decreased by sensitivity analysis with modified indices. Mohamed et al. (2018) recommended visible and near-infrared spectroscopy as a quick tool for mapping soil properties. PDI and MPDI showed similar results; accordingly, Ghulam et al. (2007) stated that MPDI and PDI lead to similar results in lands with decreased vegetation cover. Moreover, MPDI is capable of precisely identifying the water and energy cycle as well as the cover type of the surface (Ghulam et al. 2007).
The stepwise procedure showed improved performance as it solves the problem of multicollinearity by reducing the total number of variables (Gilabert et al. 2002). ANN exhibited better performance than multivariate regression, as it can solve complicated, interactive, non-linear and multivariate systems (Ozgür 2005). Multivariate regression methods have been compared with ANN in terms of soil parameter determination. For instance, Marashi et al. (2017) estimated soil aggregate stability indices, and Shabani and Norouzi (2015) predicted the soil's cation exchange capacity. All these researchers agreed on the superiority of ANN over multivariate regression. Thus, combining satellite image information and machine learning techniques can enhance the precision of results, as confirmed in the study by Khanal et al. (2018).

CONCLUSIONS
Soil color is one of the significant soil features, capable of characterizing the separation of horizons and identifying different types of soil. The limited number of Munsell color chips as well as the user-dependent matching process has restricted the application of the Munsell color system. Therefore, this study applied a different methodology, a combination of satellite images and machine learning techniques, to improve the determination of soil color. The modified indices (i.e. MSAVI and MPDI) improved the values of the correlation coefficients. MPDI resulted in the maximum correlation coefficient, reflecting the necessity of defining different soil lines for each soil type. Thermal bands could effectively describe soil moisture. The mathematical procedures and parameters of the indices are the major factors in modeling the behavior of the bands, as confirmed by the performance of MPDI. Specific equations should be defined depending on the purpose. The use of stepwise regression to select the dominant variables resulted in the lowest error values. Compared with multivariate regression, ANN performed better in the estimation of soil color as it had lower errors. Thus, it can be concluded that MPDI and the thermal band can be employed as precise tools for soil color analysis. The relationship between satellite data and soil color depends strongly on the information regarding the soil line and its parameters. The temporal variation of MPDI and band 31 can explain the variations of the Munsell soil color components. The dominant spectral bands and indices are powerful tools in the satellite-based evaluation of soil color, which can be exploited in various fields of soil and water sciences.