Specialized Genetic Algorithm Based Simulation Tool Designed For Malware Evolution Forecasting

– From the security point of view malware evolution forecasting is very important, since it provides an opportunity to predict malware epidemic outbreaks, develop eﬀective countermeasure techniques and evaluate information security level. Genetic algorithm approach for mobile malware evolution forecasting already proved its eﬀectiveness. There exists a number of simulation tools based on the Genetic algorithms, that could be used for malware forecasting, but their main disadvantages from the user’s point of view is that they are too complicated and can not fully represent the security entity parameter set. In this article we describe the specialized evolution forecasting simulation tool developed for security entities, such as diﬀerent types of malware, which is capable of providing intuitive graphical interface for users and ensure high calculation performance. Tool applicability for the evolution forecasting tasks is proved by providing mobile malware evolution forecasting results and comparing them with the results we obtained in 2010 by means of MATLAB.


Introduction
Nowadays malware, i.e. software created with malicious purposes in order to harm the computer software or to be installed on computer without allowance of the legal user [1], is considered to be one of the major threats to information security, information systems and modern communication methods. Significant shift in motivation for malicious activity has taken place over the past several years: from vandalism and recognition in the hacker community, to attacks and intrusions for financial gain. This

Mobile Malware Evolution and Technical Analysis
According to [17] the first mobile virus was the "Cabir" virus which appeared on the 15th of June 2004, infected mobile phones running the "Symbian" OS and used of Bluetooth wireless network as a propagation channel. After the successful infection, the virus appended the telephone software with its code, activated the Bluetooth and started searching for another Bluetooth device to forward the infected file. Since Bluetooth network coverage is limited to 10 meters, the propagation rate of the first mobile virus was rather limited. The first Trojan malware ("Skulls") also appeared in 2004, November [18,19]. It infected NOKIA mobile phones, running the "Symbian" operating system. "Skulls" propagated by pretending to be a software update, usually as the "Macromedia Flash" update file with .sis extension. When the phone user activated the Trojan it changed the phone configuration settings and depicted the skulls on the screen. It also blocked many functions, such as SMS, MMS, calendar, camera, etc. The phone user could only perform telephone calls. The mobile Trojan evolution continued in 2005. A new Trojan "Locknut.A" was detected [20]. Also created for the Symbian platform it was particular in size. The "patch.sis" file that contained the infection was only 2KB size, making it the smallest known Trojan for the mobile platform.
The first mobile malware that started using propagation methods, other than Bluetooth, was the "Commwarrior.A" virus, also running on the "Symbian" platform [21,22]. It was using much quicker propagation by MMS, since this method does not have limitations by distance, although Bluetooth was also supported. The MMS message included text in English, which proposed the phone user a new game, update for the antivirus software or similar. The message was sent to all contacts, found in the phone address book. In this case virus authors relied on the social engineering since when the recipient receives the message from his friend or a familiar person, the probability of opening it is higher than when it comes from the unknown number. An interesting thing is that Bluetooth was activated during the working hours and MMS were sent in the evening and at night. After each successful infection, the virus makes a one minute delay and after that starts searching for a new victim. In 2009 the Kaspersky Labs discovered a new mobile malware named "sms.python.flocker", written in the Python language and designed to manipulate the mobile phone accounts. The main malware functionality is dedicated to the financial gain. The virus sends SMS messages to the specific number, which allows transferring money from the account of the infected phone to the account of the malware author [17].
According to [23], the Android mobile operating system (OS) became the most "popular" platform for the new malware. Malware on Android could be separated into 3 different types: the first, SMS-sending Trojans (such as Android/Wapaxy, Android/LoveTrp, and Android/HippoSMS), which could subscribe victims to the subscription services; the second, Maliciously modified apps, such as Android/PJApp, which could collect sensitive information and disclose it to the unauthorized persons; the third, call recording malware, like Android/NickiSpy.A and Android/GoldenEagle.A. This malware records the user's conversations and forwards them to the attacker.
The non-GA models mainly cover malware epidemic consequences modeling, i.e. forecasting the number of infected computers, simulating malware behaviour or economic propagation aspects and are based only on current malware propagation strategies or oriented to other malware types. The first epidemiological model for computer virus propagation was proposed by Kephart [24]. Epidemiological models abstract from the individuals, and consider them units of a population. Each unit can only belong to a limited number of states. The SIR model assumes the Susceptible-Infected-Recovered state chain and the SIS model -the Susceptible-Infected-Susceptible chain. The technical report [25] described a model of e-mail worm propagation. The authors model the Internet e-mail service as an undirected graph of relationship between people. In order to build a simulation of this graph, they assume that each node degree is distributed on a power-law probability function.
Malware propagation in the Gnutella type P2P networks was described in [8]b y Ramachandran et al. An analytical model that emulates the mechanics of the decentralized Gnutella type of peer network was formulated and the study of malware spread on such networks was performed. The Random Constant Spread (RCS) model [26]was developed by Staniford et al. using the empirical data derived from the outbreak of the CodeRed worm. The model assumes that a machine cannot be compromised multiple times and operates the constant average compromise rate K, which is dependent on worm processor speed, network bandwidth and location of the infected host, etc. The model can predict the number of infected hosts at time t if K is known. As [27] states, that although more complicated models can be derived, most network worms will follow this trend. Other authors [28] propose the AAWP discrete time model, hoping to better capture the discrete time behaviour of a worm. However, according to [9] the continuous model is appropriate for large scale models. On the other hand, in [9] Zanero et al propose a sophisticated compartment based model, which treats the Internet as the interconnection of autonomous systems, i.e. sub networks. Interconnections are so-called "bottlenecks". The model assumes that inside a single autonomous system the worm propagates unhindered, following the RCS model. Zou et al in [5] propose a two-factor propagation model, which is more precise in modelling the satiation phase taking into consideration the human countermeasures and the decreased scan as well as the infection rate due to the large amount of scan-traffic. The same authors have also published an article on modelling worm propagation under dynamic quarantine defense [29] and evaluated the effectiveness of several existing and perspective worm propagation strategies [30].
Lelarge in [31] introduces an economic approach to malware epidemic modelling (including botnets). Li et al. [32] model botnet-related cybercrimes as a result of profit-maximizing decision-making from the perspectives of both botnet masters and renters/attackers. From this economic model, they derive the effective rental size and the optimal botnet size. Fultz in [33] describes the DDoS attacks organized with the help of botnets as economic security games.
Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 18/08/2023 05:01:34 The increase of mobile device popularity has called out the appearance of models dedicated to the mobile malware modelling. Ruitenbeek et al. in [34] simulate virus propagation using the parameterized stochastic models of a network of mobile phones, created with the help of Mobius tool and provide insight into the relative effectiveness of each response mechanism. Two models of the propagation of mobile phone viruses were designed to study the impact of viruses on the dependability and security of mobile phones: the first model quantifies the propagation of MMS viruses and the secondof Bluetooth viruses. Bulygin in [35] analyses two viruses using different propagation methods (MMS and Bluetooth) in the SI (Susceptible->Infected) model.
Although it is widely accepted that malware evolution forecasting is an important information security task, the first model-based research paper on this topic appeared only in 2008 [6], which discussed the Internet worm evolution trends. It was shown that GA may be used for malware characteristic evolution forecasting. The tests have proved the effectiveness of the model in evaluating propagation rates and have shown the tendencies of worm evolution. Rather a similar concept was proposed almost one year later in [36]. The authors validate the notion of evolution in viruses on a wellknown Bagle virus family. The results of the proof of-concept study showed that new viruses-previously unknown of Bagle family have successfully evolved starting from a random population.
In spite of a large number of GA simulation environments existing on the market, not all of them could be used for the security entities evolution forecasting. During the assessment of simulation tools based on the genetic algorithms four existing tools were evaluated: MATLAB, Global optimization packet "GMJ", EGALT, JGap packet.
As it is known, MATLAB is a powerful mathematical programming packet, capable of effective work with the data representing graphic and matrix views, also it could be easily integrated with the additional tools written in C, C++, Fortran or Java. The genetic algorithm and the direct search toolbox expanded MATLAB possibilities and could be used for the evolution forecasting.
The global optimization packet from the Genetic algorithms point of view "GMJ" has various possibilities and every year this packet is expanded by new tools for the method optimization and results analysis. The main optimization methods implemented in the "GMJ" packet are [37]: Genetic algorithm with parameters; genetic algorithm without parameters; neural network optimization task.
EGALT tools were created as educational application, which should help to understand the genetic algorithm internal processes. This tool was developed with the C++ programming language and worked only on the Windows platform. It could be found here: www.ewh.ieee.org/soc/es/May2001/14/Begin.htm. These tools allow students to choose gens amount, population size, iteration number, mutation and crossing possibilities, also a crossing type could be chosen.
The main idea of the Java genetic algorithm packet (JGap) is to develop a common genetic application function system (library) for the implementation of main genetic Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 18/08/2023 05:01:34 U M C S operations. Such system could be used within the application development process and integrated into a new application if such functionality is necessary [38].
Standard simulation tools have some common disadvantages. Often the genetic algorithm is implemented in applications as the additional functionality and such approach reduces the flexibility and calculation performance of the application. Another problem is that the user cannot often add easily a new parameter or modify existing fitness function or gens probabilities. As a result, some application could not be re-used to calculate evolution forecasting for other security entities or for the same entity with new parameters. Besides, simulation tools are quite complicated and require from the user additional knowledge about the tool interface or internal functions and procedures before he will be able to start working with them. Trying to solve the described problem or reduce its influence, it was decided to develop the basic simulation tool based on the genetic algorithm, which could be effectively used only by security specialists for the evolution simulations and forecasting of information security entities.

Requirements for specialized simulation tools
According to common application development methodologies, the functional and non-functional requirements were described in detail before starting the development process of simulation tools.

Functional requirements
The main attention in specifying requirements for the tool was paid to the simplification of the security entities evolution simulation process and making it more intuitive and effective. Seeking to achieve the specified goals, the following requirements should be implemented: • Possibility to insert and modify the initial data, such as parameters' names, from the data sources (csv documents); • Provide reports on all evolution steps and best evolution results, which are noticed during the simulation.
Detailed description of the functional requirements: 1. Fitness function: • the population individual amount -how many chromosome sets will be generated; • iteration amount -number of calculation loops; • mutation probability -with what probability gens could mutate. 4. Results output: 4.1 Graphical representation with the best fitness function result, generated during each iteration step. According to this graphic description we can analyze how evolution process developed and changed the population individual fitness during this process; 4.2 At the end of all calculations, the application provides information about the best population individual and its gens values. The result is saved in a separate file.

Non -functional requirements
User interface requirements: • Intuitive graphical user interface (GUI); • Interface management performs by keyboard and mouse; • Main windows do not have the input fields, which are not related to the main data or calculation process.
Usage requirements: • Data input should be intuitive and easy; • Application should inform user about errors or prevent wrong information input; • Application should be intuitive and ensure that user could use it without the additional training; • Compatibility with Internet services should be ensured.
Calculation requirements: • Application should implement genetic algorithms methods and perform calculations in optimized way; • Application should put calculation results into the csv data files. Additional requirements could be used to ensure that data transfer is performed in an optimal way.

Developed application
According to the described functional and non-functional requirements, simulation tools were developed. The user interface was designed as one main window, where the user can see all parameters, which will be used during the simulation. Input implemented by using the csv files and such approach allows to re-use the existing data in the future. All results simulated during the evolution forecasting process are also stored in the csv file and it means that the calculation results could be easily re-used in other simulations as an input data. In the case of GA modelling the main task consists of three parts: appropriate selection of chromosome structure, which represents the solution, definition of the fitness function and GA operating conditions, such as population size, mutation rates, parent selection, etc. In our case we define the propagation strategy as combination of methods and techniques, used by malware to insure malware population increase. In the current study, we have chosen to model strategies for a theoretical mobile virus, which aims at infecting the largest number of mobile devices during a fixed relatively short period of time.
GA consists of initialization, selection and evolution stages. During the initialization stage initial population of strategies is generated. Each strategy is represented as a chromosome. In the selection stage strategies are selected through a fitness-based process and in the case the termination condition is not met, evolutionary mechanisms are started. If the termination condition is satisfied, algorithm execution is finished. If not -evolutionary mechanisms are activated.
The initial population is generated on a random basis, i.e. each individual, representing a separate strategy is combined from random genes' values. The population size N is equal to 50. The population size remains constant after each new generation. The algorithm would stop producing new generations in the case the number of generations reached 100. Parent selection is random. The mutation operator is activated to each newly generated individual with a 0.05 probability.

Experiment results
During the experiment [37] GA consisted of initialization, selection and evolution stages. At the initialization stage the initial population of strategies was generated. Each strategy was represented as/by a chromosome. In the selection stage strategies were selected through a fitness-based process and in the case the termination condition was not met, evolutionary mechanisms were started. If the termination condition was satisfied, the algorithm execution process was terminated. If not -evolutionary mechanisms were activated.
The initial population was generated on a random basis, i.e. each individual, representing a separate strategy was combined from random genes' values. The population size N was equal to 50. The population size remained constant after each new generation. The algorithm would stop producing new generations in the case the number of generations reached 100. Parent selection was random. The mutation operator was activated to each newly generated individual with a 0.05 probability.
Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 18/08/2023 05:01:34 Modelling performed in 2010 showed that the best fitness result achieved during the algorithm test was equal to F (S d )=0.023 and the best result was achieved in the 42 nd generation [37]. The quality assurance process of a new simulation tool with old parameters showed the following results: the result bias was smaller than 0.01. The best fitness result was F (S d )=0.0229. The best strategy fitness change looks like this: This shows that the achieved results are rather similar to those in 2010 [37] and a small difference could be explained by "random" nature of genetic algorithms.
The second quality assurance step was to find out how the developed simulation tool could be re-used to calculate the evolution forecasting for other or changed security entities. The gens parameters matrix was modified according to the changes on the mobile market and new parameters like Android OS were added, also some gens probabilities were changed (Table 1). New simulation results generate the new best fitness strategy. The best fitness result was F (S d )=0.0059. We ought to mention that the best fitness result is smaller than it was with the previous parameters, but it is natural as during the last simulation new parameters and probabilities were used.
According to the data saved in the result.csv file we can extract the best fitness strategy chromosome and in our case it is: S d =(TRANSF1="SMS", TRANSF2='"Bluetooth", TRANSF3="e-mail", TRANSF4="Wi-Fi", TRANSF5="MMS", NR="Address book", BT="Scan BT", EMAIL="E-mail address", WIFI=" Scan", OS_PLATF="Google Android", TEL="HTC, Samsung, Sony Ericsson, Motorola" EN_EXPL_N="EXP1_ON_ALL", EXPL_1=" Expl1", EXPL_2=" Expl2", EXPL_3=" Expl3", NR_TIME="Always", BT_TIME="Always", WIFI_TIME="Always", EXEC=" Stand. Funct.", EXEC_CHAN="web-update",) Comparing the last simulation results with the previous one, we should conclude that gens such as TEL and OS_PLATFORM have changed. Currently they show domination of Google Android OS and there are even three dominating telephone brands, such as HTC, Samsung, Sony Ericsson, Motorola, that could be evaluated by malware developers as perspective for malware spreading. These changes appeared due to the changes in Mobile devices and Mobile OS markets. Large part of this market presently belongs to the devices working on the Android OS platform. In our previous simulation OS_PLATF was "Symbian" and Nokia devices were dominating. The fitness results decreased by 3.7 times in comparison with our previous simulation, but such results were predictable because the Mobile devices OS market is divided among a few big players and malware working on one OS could not work on another OS. Also lately a lot of countermeasures, like OS and application patch and fix, antivirus software for the Mobile device have appeared, which decreased the malware threat probability and its propagation rates.

Conclusions
In this article the specialized tool based on the principles of the genetic algorithms for malware and other information security entities evolution forecasting was presented. The main tool difference from the existing modelling tools is its usability and tuning for the specific problem area to be modelled, possibility to represent specific information security issues, such as interrelated gens while modelling malware evolution and others.
Practical applicability of the developed tool was tested by modelling evolution of rapidly evolving mobile malware. Tool functioning correctness was proved by comparing the earlier modelling results with those obtained by the tool supplied with the same initial parameters.
Evolution modelling with the changed initial parameters has shown changes in the results we have obtained in our previously performed modelling tests. These changes depict the one in the currently popular mobile device platforms (the Android platform is currently more prospective for malware spreading than the Symbian one, that is no longer popular) and general decrease of malware spreading efficiency compared to the 2010 results due to the increase of a number of mobile OS platforms. Still that does not show that the total population of infected mobile devices in the world has decreased, but it rather shows the infection spreading speed. The decrease of spreading efficiency could be also explained by the increased awareness about mobile malware and countermeasures applied.