On the capacity of survival analysis with the R language

– In order to make the big data mining analysis we meet the limit of computer capacity. We concentrate here on such a situation. We describe the problem, test the key fragment of the algorithm and conclude on the possibilities of similar computations


Introduction
In this paper we present the use of computer methods in order to investigate the influence of genes and other demographic and clinical characteristics on the clinical outcome of non-small cell lung cancer (NSCLC) patients treated with first line chemotherapy based on platinum compounds. The investigations were carried out on the sample of 201 persons observed in the Department of Pneumology, Oncology and Allergology of Public Hospital No 1 in Lublin. The investigations concerned the influence made by 13 genes and such characteristics as: Age, Sex, Smoking, Comorbidities, History of chemotherapy, Complications after chemotherapy and others on lung cancer treatment. The main difficulty lies in the possible interactions between the investigated attributes, although a number of persons was not large, but taking into account the possible interactions, the number of possible investigations was 2 numgen −1 where numgen is the quantity of considered attributes. Computation was done in Rprogramistic language, well fitted to statistical and data mining modelling.
Although the medical remarks seem to be very interesting, we restrict our attention (except in the first section) to the computer science side of fragment of these computations.
In the next subsection we will briefly sketch the biological significance of these investigations and the applied statistical tools.

Biological background
Lung cancer is the most frequent malicious tumour in Poland and in the world (in 2012 around 1.8 million new diseases and over 1.6 million deaths were reported due to it). Its most prevalent histological subtype is non-small cell lung cancer (NSCLC) [1,2]. In Poland for different reasons, the majority of patients have their lung cancer diagnosed in its advanced stage. As a result, the overwhelming majority of patients (80 − 85%) can be qualified only for chemotherapy or radiotherapy, which are, however, of moderate effectiveness. For those patients who are treated with standard chemotherapy of line 1 the median of progression free survival is 6 − 8 months, the median of overall survival is 8 − 10 months, and the percentage of 5-year survival is within the 10 − 15% range [3,4]. Currently the highest effectiveness of treatment of advanced inoperable lung cancer can be achieved in specially selected groups of patients who are given drugs which are molecularly targeted (e.g., Erlotinib, Crizotinib). These drugs, however, are reserved only for a very small group of patients who have specific molecular aberrations (around 10% for Caucasians). This is the reason why the overwhelming majority of patients (80 − 85%) get standard, double-drug chemotherapy. As is shown by research, also this group can be treated taking into account genetic predispositions (e.g., polymorphisms of genes coding proteins crucial for DNA repair) so as to choose potentially the most effective scheme of chemotherapy [5,6]. Many previous studies proved that marking particular SNPs (single nucleotide polymorphisms) in genes coding DNA repair proteins may be useful for qualification of the most appropriate (and the most effective) chemotherapy scheme. Due to the fact that DNA repair is a complex and multi-stage process, and the genes which are responsible for it are characterised by "low penetration", which is a significant yet limited effect of a single change on the evaluated process (in this case, metabolism of cytostatic agents), in order to find strong relationships there is a need to carry out comprehensive research encompassing many different SNPs located in the line of genes belonging to a few DNA repair mechanisms. Finding the right combination of variants of particular genes will make it possible to qualify tumour patients for proper treatment in the most effective way. Owing to the fact that the current state of knowledge about genetic disturbances is extremely wide, the implementation of advanced bioinformatics tools in the process of searching for clinically-relevant changes seems to be absolutely essential.

Mathematical background
We are interested in the survival function, conventionally denoted S, which is defined as S(t) = P [T > t], where t is some time, T is a random variable denoting the time of Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 27/08/2022 12:58:59 U M C S death. That is, the survival function is the probability that the time of death is later than some specified time t. The survival function is also called the survivor function or the survivorship function in problems of biological survival, and the reliability function in mechanical survival problems. All attributes are used to categorize the set of persons. For example the people who are 0 − 18 years old and have a gene on loci XPD2 written by GG (two guanine) have the survival function S 1 , the other group 19 − 30 years old and a gene of loci XPD2 written as AG (adenine and guanine) have the survival function S 2 . When the attributes Age and Gen XPD2 influenced the stage of illness of lung cancer, then the survival functions S 1 and S 2 should be "essentially" different. Because 1 − S i , i = 1, 2, are the distribution functions thus the Kolmorov's-Smirnov or Chi-square tests can be applied but because due to discrete observations, S i , i = 1, 2, are the discrete functions, thus only the Chi-square test can be used. This test has restrictions on the number of observations, it should be greater than 5. The groups which contain fewer than 5 observations are deleted at the moment of this analysis only.
The R constructs the Kaplan-Meier estimate of the survival function, S(t), corresponds to the non-parametric MLE estimate of S(t). The resulting estimate is a step function that has jumps at the observed event times, t i , 1 ≤ i ≤ n. In general, it is assumed that the t i are ordered: 0 < t 1 < t 2 < . . . < t n . If the number of individuals with an observed event time t i is d i , and the number of individuals at risk (i.e. those who have not experienced the event) at the time before t i is Y i , then the Kaplan-Meier estimate of the survival function and its estimated variance is given by Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 27/08/2022 12:58:59 On the capacity of survival analysis with the R language

Data
We prepare the data for the MySQL database (for further applications, cf. [12]) but for the analysed fragment of program usually a CSV format file is taken as input. The investigation comprised 201 persons with different stages of lung cancer. They are described by almost 100 attributes which can be grouped into: Survival analysis data:: Time of life, Fixed-right censoring where Time of life is given in months.
The genetic results were obtained by the amplifications PCR method and analysed by the GeneMapperSoftware Version 4.0 produced by Applied Biosystems. Figure  1 presents the analysis of 6 SNPs of genes involved in DNA repair (ERCC1, XPC, XPD, XPA, XPG, XRCC1), genotypes from the left: heterozygote GA, homozygous Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 27/08/2022 12:58:59 U M C S GG homozygous AA, homozygous GG homozygous CC homozygous AA. This is an example of a peak electropherogram, obtained by the separation of the reaction of products SNaPshot PCR (capillary electrophoresis was performed on a 3130 Genetic Analyzer, Applied Biosystems). As can be seen in Figure 1 a similar analysis was made for all persons and later written in databases (for simplicity coded as 0, 1, 2 where 0 and 2 are homozygosity and 1 means heterozygosity). Such prepared data were written in the MySQL table.

Programs in R
The parameters work and the introductory was set as follows: Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 27/08/2022 12:58:59 On the capacity of survival analysis with the R language Listing 1. Introduction c h i _ l e v e l =6 p _ l e v e l =0.05 r e s u l t =" r e s u l t . t x t " c a t a l o g ="C: / Workspace / c a n c e r " d a t a s="gen . c s v " from_column=6 numgen=16 l i b r a r y ( s u r v i v a l ) setwd ( c a t a l o g ) d a t a r=r e a d . t a b l e ( datas , s e p =";" , h e a d e r=T) f i l e . c r e a t e ( r e s u l t ) c a t ( c ( " s e t " , " c h i^2" ," d e g r e e " , " p−v a l u e " , " rho " , "Comb" , " Number " ) , "\n " , f i l e =r e s u l t , append=T) where chi_level is the minimal number of observations for which the chi 2 test is doing, p_level is the level of statistical significance, from_column is the number of column from which there are gene's (or may be other) attributes of persons and numgen is the number of genes (or may be other) attributes of investigated persons. The ASCII file for output (result) is prepared with the header defined in lines 13 − 14.

Algorithm
The essential algorithm reads as follows: where A T denotes the transposition of the matrix A. The instruction in row 6 paste together the choice in a attributes, in rows 7-9 all pasted codes, in which there is at least one NA value change the whole value on NA, in row 10 there are fixed (in the vector sl) numbers of class, in row 11 there are computed quantities of each class whereas in rows 12-14 there are excluded classes with the quantities lower than chi_level. The next instructions (15-27) are done if only there exist at least two classes. In this case two tests of difference of survival function were done: one with ρ = 0 and the other with ρ = 1. The parameter ρ belonging to the interval [0, 1] allows to weigh times such that every moment of death is multiplied by S(t) ρ . For ρ = 0 we obtain usually a log-rank test.

Optimization
Each fragment of code 1 was verified with respect to possible optimization. We show two such attempts. Consider the fragment of 12 − 14 of algorithm 1: gives the running times We see that the running times of fragment B are significantly better than those of fragment A and in consequence, fragment B replaces fragment A in algorithm 2. It is natural, due to speed operation on indices of big data frames and vector. The attempt of replacing iteration by the instruction lapply does not give satisfactory results: with the running times:

29
The profiler Rprof("profiler.out", interval=0.1, memory.profiling=TRUE) applied for numgen in 17:19 gives the following (we present only the results with time higher than 300 sec): Thus the conversion factor, operations on strings paste and other operations on factors (summary, levels) are most time-consuming, but factor type in R is not so flexible in order to change Algorithm 2.

Results
We run algorithm 1 in three cases: for N = 50 where 151 randomly chosen items were deleted, for N = 201 -the original database of investigated persons and with N = 1000 where 799 persons were drawn (every attribute was drawn according to the attribute distribution). For the numgen< 13 we take the randomly chosen genes whereas for numgen > 13 we take all gene values and additionally randomly chosen attributes. The time (given in seconds) of this running on computer (processor -Intel(R) Core(TM)2 Quad CPU Q8200 2.33 GHz 2.34 GHz, RAM 4GB, 64 bit, system Debian, R version 2.15.2) is as follows: Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 27/08/2022 12:58:59 U M C S On the basis of Table 5, using the regression methods in R, we find the following formula on time (in sec.) of execution of algorithm 2: thus if we allow the computer to work for 10min, 1h, 24h or 1 week for the given quantity of persons N we may take the maximum number of attributes as in Table 5 Time  N 10 min 1h 24h 1week  40  17 20  24  27  100  16 19  23  26  200  15 18  23  25  500  14 17  21  24  1000 13 16 21 23 Neglecting very interesting medical conclusions we remark that • Language R is the fastest among the data mining and statistical tools (compared with STATISTICA, SPSS and others), • From the mathematical viewpoint it would be better to investigate all attributes of persons but informatics restrictions described in Tables 5 and 6 make this often impossible • Everyone planning similar computations should be aware of the above mentioned restrictions.