Computer network traffic analysis with the use of statistical self-similarity factor

Optimal computer network performance models require accurate traffic models that can capture the statistical characteristics of actual traffic. If the traffic models do not represent traffic accurately, one may overestimate or underestimate the network performance. The paper presents confirmation of the self-similar nature of selected protocols in the computer network communication layer. It shows that a good measure of self-similarity is the Hurst factor.


Introduction
Statistical analysis of network traffic measurements shows a clear presence of fractal or self-similar properties in computer networks [1]. This means that similar statistical patterns may occur at different time scales, which can vary by many orders of magnitude. The statistical characteristics of computer network traffic have been of great interest to scientists for many years, not least to obtain a better understanding of the factors that affect the performance and scalability of large systems such as the Internet. Network traffic is inherently fractal or long-range dependent (LRD). That fact raises the question of the extent to which the results of these studies are applicable in practice. Is it possible to diagnose network traffic and predict congestion risk? At present, there is mounting evidence that LRD is of fundamental importance for a number of engineering problems, such as traffic measurements [2,3,4] and queuing behaviour. Similar processes have been observed and analyzed in a number of other areas, for instance hydrology, economics and biophysics. A self-similar phenomenon represents a process displaying structural similarities across a wide range of scales of a specific dimension. Recent measurements of network traffic have shown that traffic exhibits variability over a wide range of scales. The reference structure repeats itself over a wide range of scales of diverse dimensions (geometrical, statistical or temporal), and the statistics of the process do not change with time. In reality, simple systems do not exist. In the case of real, complex systems, in contrast to simple systems, one can indicate the following features of processes: thermodynamic non-equilibrium, heterogeneous topologies, the small-world phenomenon, long-range dependencies, bursty and self-similar traffic, scale-free (power-law) distributions, packet switching, structure hierarchy, percolation, clustering, self-organization, parameter degradation and collapse [5,6,7].
The statistical characteristics of teleinformatic systems changed considerably when human behaviour was replaced by a hierarchical, complex system, the computer. Voice traffic was quite static and exhibited low variability (short-range dependent), but data traffic is much more variable, with both extremely short and extremely long calls (self-similar and long-range dependent) [8,9]. Thus, it can be noticed that both the original "pure" human behaviour and the original "pure" traffic nature (a simple stream) are lost when higher (nested) layers of the stack are successively added and the simple computer system becomes a complex system of a fractal nature. It is well known that such nesting concerns all areas of computer engineering (networks, computer hardware, operating systems, programming languages and queuing systems) and inevitably turns short-range dependent processes into long-range dependent ones. This is particularly intensified in complex large-scale systems, i.e. distributed systems and computer networks. The positive features of the system, namely heterogeneity, openness, security, scalability, failure handling, concurrency and transparency, are accompanied by negative complex-system features such as degradation and collapse [10,5,11,12].
2 Self-similarity statistical factor

Self-similarity and fractals are notions pioneered by Benoit B. Mandelbrot. Self-similarity can be associated with "fractals", which are objects whose appearance is unchanged over different scales. In the case of statistical fractals, it is the probability density that repeats on every scale. A dynamical fractal, on the other hand, is generated by a low-dimensional dynamical system with chaotic solutions. The research related to traffic self-similarity can be classified into four categories: measurement-based traffic modelling, physical modelling, queuing analysis, and traffic control as well as resource provisioning [1,13]. In order to review LRD processes, several definitions are introduced.
A self-similar time series has the property that when aggregated (leading to a shorter time series in which each point is the sum of multiple original points), the new series has the same autocorrelation function as the original.
That is, given a stationary time series X = (X_t; t = 0, 1, 2, ...), we define the m-aggregated series X^(m) = (X^(m)_k : k = 1, 2, 3, ...) by summing the original series X over non-overlapping blocks of size m:

X^(m)_k = X_(km-m+1) + X_(km-m+2) + ... + X_(km).

Then if X is self-similar, the aggregated series has the same autocorrelation function as the original, r^(m)(k) = r(k), for all m. This means that the series is self-similar: the distribution of the aggregated series is the same (except for changes in scale) as that of the original [10,3].
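The aggregation step above can be sketched in a few lines of Python with NumPy (the function names are ours, chosen for illustration, not taken from the paper):

```python
import numpy as np

def aggregate(x, m):
    """m-aggregated series: sum the original series over
    non-overlapping blocks of size m."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // m) * m          # drop any incomplete trailing block
    return x[:n].reshape(-1, m).sum(axis=1)

def autocorr(x, k):
    """Sample autocorrelation of x at lag k."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-k], x[k:]) / np.dot(x, x))
```

For an exactly self-similar series, autocorr(aggregate(x, m), k) stays close to autocorr(x, k) at every aggregation level m; for short-range dependent traffic it flattens out quickly as m grows.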
A process with long-range dependence has an autocorrelation function r(k) ~ k^(-β) as k → ∞, where 0 < β < 1. Thus the autocorrelation function of such a process decays hyperbolically (as compared to the exponential decay exhibited by traditional traffic models). Hyperbolic decay is much slower than exponential decay, and since β < 1, the sum of the autocorrelation values of such a series approaches infinity. This has a number of implications. First, the variance of the sample mean of n observations from such a series does not decrease as 1/n (as predicted by basic statistics for uncorrelated datasets) but rather as n^(-β). Second, the power spectrum of such a series is hyperbolic, rising to infinity at frequency zero, reflecting the "infinite" influence of long-range dependence in the data.
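The diverging autocorrelation sum can be checked numerically (a minimal Python sketch, not from the paper): partial sums of a hyperbolically decaying r(k) = k^(-β) keep growing, while an exponentially decaying autocorrelation sums to a finite value almost immediately.

```python
import math

beta = 0.5

# hyperbolic decay: partial sums of k**(-beta) grow like n**(1 - beta)
lrd_sum = sum(k ** (-beta) for k in range(1, 10_001))

# exponential decay: the partial sum converges after a handful of terms
srd_sum = sum(math.exp(-beta * k) for k in range(1, 10_001))
```

Doubling the number of terms increases lrd_sum by roughly a factor of 2^(1-β), while srd_sum is unchanged to many decimal places.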
The main advantage of using self-similar models of time series is that the degree of self-similarity of a series is expressed by only one parameter, which expresses the speed of decay of the series autocorrelation function. For historical reasons, the parameter used is the Hurst parameter H = 1 − β/2. For self-similar series, 1/2 < H < 1, and as H → 1 the degree of self-similarity increases. Thus, the main criterion for self-similarity of a series reduces to the question of whether H is significantly different from 1/2.
There are several ways to estimate H. We can use the variance-time plot, based on the slowly decaying variance of a self-similar series. The variance of X^(m) is plotted against m on a log-log plot; a straight line with slope (−β) greater than −1 is indicative of self-similarity, and the parameter H is given by H = 1 − β/2. We can use the R/S method. The R/S plot uses the fact that for a self-similar dataset the rescaled range, or R/S statistic, grows according to a power law with exponent H as a function of the number of points included (n). Thus the plot of R/S against n on a log-log scale has a slope which is an estimate of H. The last approach, the periodogram method, uses the slope of the power spectrum of the series as frequency approaches zero. On a log-log plot the periodogram is a straight line with slope β − 1 = 1 − 2H close to the origin.
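The R/S method can be sketched as follows (our own minimal Python implementation, not the one used in the paper; it uses dyadic block sizes and a least-squares fit on the log-log plot):

```python
import numpy as np

def rs_statistic(block):
    """Rescaled range R/S of one block of the series."""
    block = np.asarray(block, dtype=float)
    dev = np.cumsum(block - block.mean())   # cumulative deviations from the mean
    r = dev.max() - dev.min()               # range of the deviations
    s = block.std()                         # standard deviation
    return r / s if s > 0 else 0.0

def hurst_rs(x, min_block=16):
    """Estimate H as the slope of log(R/S) versus log(n)."""
    x = np.asarray(x, dtype=float)
    sizes, rs_means = [], []
    n = min_block
    while n <= len(x) // 4:
        blocks = x[: (len(x) // n) * n].reshape(-1, n)
        rs_means.append(np.mean([rs_statistic(b) for b in blocks]))
        sizes.append(n)
        n *= 2                              # dyadic block sizes
    slope, _ = np.polyfit(np.log(sizes), np.log(rs_means), 1)
    return float(slope)
```

For uncorrelated noise the slope should come out near 0.5; a value clearly above 0.5 indicates long-range dependence. (The small-sample R/S statistic is known to be biased slightly upward, which is one reason several estimators are usually compared.)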
These methods are not robust to faulty assumptions (such as non-stationarity in the dataset) and they do not provide confidence intervals. A fourth method, the Whittle estimator, does provide a confidence interval, but has the drawback that the form of the underlying stochastic process must be supplied.

In our study, we use data collected in the private computer network of a small company. The collected data were the result of normal operation between 10 am and 10 am the following day. The company has an eight-hour working time in two shifts, from 7:00 to 15:00 and from 8:00 to 16:00. It is possible that employees stay after regular working hours. At night all computers should be turned off, but this is not strictly obeyed. The network comprises 19 computers and network devices. The analyzed network topology is shown in Fig. 1. To collect the data we used one of the packet-capturing sniffer programs, Wireshark. This program can record traffic from the level of the data link layer [5]. The captured traffic samples contain information such as the location of the file, its size, format, type of encapsulation and packet size limit, the time of the first packet (that is, the start time of the test procedure) and of its completion, and the total length of the capture. In addition, it provides information about the number and type of packets. Over 24 hours the analysed network recorded 7 818 848 packets; the average rate was 90.447 packets per second. Exemplary statistics are shown in Fig. 2.
The study aimed at observing network traffic and determining whether there are long-term dependencies across the whole network working time and over hourly intervals. To carry out the work, from all captured packets we isolated those that had the greatest impact on the network. They were divided, in terms of services and protocols, into five main groups. For each group the number of packets, the total length of the packets and the average packet length in hourly intervals were calculated, as well as the largest and the smallest packet sizes. The next step was to calculate the Hurst factor after first estimating β using the Benoit and Power Spectrum methods.
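The per-hour bookkeeping described above can be illustrated with a short Python sketch (the input layout, plain arrays of packet timestamps and lengths, is our assumption, not the actual Wireshark export format):

```python
import numpy as np

def hourly_stats(timestamps, lengths):
    """Per-hour packet count, total bytes and mean packet length for a
    24-hour trace; timestamps are seconds from the capture start."""
    edges = np.arange(0, 25 * 3600, 3600)                 # 24 one-hour bins
    counts, _ = np.histogram(timestamps, bins=edges)
    totals, _ = np.histogram(timestamps, bins=edges, weights=lengths)
    means = np.divide(totals, counts,
                      out=np.zeros_like(totals, dtype=float),
                      where=counts > 0)                   # avoid divide-by-zero
    return counts, totals, means
```

The same binning, applied per protocol group, yields the hourly series on which the Hurst factor is then estimated.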
We first studied the HTTP protocol. HTTP is the most commonly used protocol on the Internet. Its data are carried only over TCP, and the default configuration uses port 80. Each object (e.g. website, video, audio) downloaded from a Web server is sent through a single session.
An essential element of communication for business users is e-mail. Sent messages can also include spam traffic. Text messages are sent using standard protocols (POP3 and SMTP) and do not significantly affect network traffic. However, if a message contains a large attachment, sending and retrieving it may have a significant impact on the operation of the entire network. We analyzed the total traffic generated by the company employees within 24 hours (Fig. 4 b). From the analysed samples of traffic packets we estimate the exchange of electronic mail at 2.52% of the total. In total, the protocols responsible for correspondence carried up to 878.44 MB. The largest increase in traffic occurred during business hours.
The third analyzed service is the Secure Sockets Layer (SSL). SSL technology was originally developed by Netscape Communications to ensure the safety and privacy of sessions established on the Internet. It introduces data stream encryption. In 1999 the Transport Layer Security (TLS) standard was published, which provides security at the transport layer and solves some of the SSL problems. It is used to encapsulate higher-level application traffic such as HTTP, Lightweight Directory Access Protocol (LDAP), FTP, SMTP, POP3 and IMAP. It provides authentication and integrity through certificates and digital signatures.
Programs for the analysis of network traffic are not always able to recognize all the protocols present in the captured files. Groups of such unrecognized packets are described as unknown (Fig. 4). Typically, the data recognized as unknown are sent by programs that use their own proprietary protocols. They can also be sent by recognized protocols operating on changed port numbers. In the case of the enterprise network tests, the packets marked as unknown account for the largest share of traffic: 51.2%, and 17.88 GB of data sent within 24 hours.
A sample power-spectrum density graph for the same hours is shown in Fig. 5.

Results of investigations
Obviously, the choice of method is connected with the theoretical considerations presented in Section 2. It is usually hard to calculate the real value of the spectral density slope, because the least-squares fit used is not necessarily well suited to log-log plots, but it is commonly accepted that such an approach can be used for a rough estimation of the H parameter. In order to assess the possible existence of long-range dependencies, the data are divided into one-hour intervals. We decided to divide our collection into 24 subcollections (each representing traffic during one hour) and calculate for each the slope of the spectral density. This is shown in Fig. 6. Generally, it is considered that this method estimates the degree of long-range dependence regardless of whether the process belongs to the Gaussian or power-law probability distributions domains of attraction.
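The spectral-slope calculation can be sketched as follows (a minimal Python periodogram-regression estimator built on the relation slope = 1 − 2H from Section 2; it is our own illustration, not the exact Benoit/Power Spectrum procedure used in the paper):

```python
import numpy as np

def hurst_periodogram(x, low_freq_frac=0.2):
    """Estimate H from the slope of the log-log periodogram near the
    origin, where log I(f) ~ (1 - 2H) log f."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    freqs = np.fft.rfftfreq(n)[1:]                  # skip f = 0
    power = (np.abs(np.fft.rfft(x)) ** 2 / n)[1:]   # periodogram I(f)
    k = max(2, int(low_freq_frac * len(freqs)))     # lowest frequencies only
    slope, _ = np.polyfit(np.log(freqs[:k]), np.log(power[:k]), 1)
    return float((1.0 - slope) / 2.0)               # slope = 1 - 2H
```

Applied to each one-hour subcollection, this yields one H value per hour, so the estimate can be tracked against the time of day.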
The obtained results of this experiment show that, contrary to the previously existing belief, the traffic can have a self-similar nature, as can be seen in Table 1. The analysis of selected protocols shows that the network traffic is self-similar. The degree of self-similarity, the Hurst exponent, lies in the range 0.5 to 1. The shorter the average packet length, the more the Hurst exponent tends to 0.5 (white noise). The average value of the Hurst exponent for the e-mail traffic is 0.799, with a maximum value of 0.976 and a minimum of 0.513. It can be seen that reduced flow in the network (for example, overnight) causes large fluctuations of the Hurst exponent, which tends towards the value 0.5. For SSL the average Hurst exponent is 0.721; over the entire test range it oscillates between 0.54 and 0.98. The analysis of the Hurst exponent for the flow of packets marked as unknown shows that this traffic

Conclusions
In this paper we have reported the results of the analysis of computer network traffic using the statistical self-similarity factor. The results confirmed that the analyzed traffic has a self-similar nature, with the degree of self-similarity in the range 0.5 to 1. The measurement and analysis have shown that the self-similar nature of computer network traffic, expressed by fractional Brownian motion or fractional Gaussian noise, together with a holistic approach to queuing analysis, made it possible to determine the power spectral density, which can serve as an internal measure of highly variable traffic in the whole system. We observe that burstiness is present across many time scales. The parameter H is larger when network utilization is higher. The network performance is dominated by the self-similarity property of the network traffic. Some significant physical phenomena may give rise to LRD: user behaviour, data generation, organization and retrieval, traffic aggregation, network controls, etc. The results of the analytical considerations and the experiment show that the self-similarity factor can be successfully used in computer network traffic analysis.