A LOW-COST MULTICOMPUTER FOR SOLVING THE RCPSP

Abstract: In this paper, it is shown that the time necessary to solve the NP-hard Resource-Constrained Project Scheduling Problem (RCPSP) can be considerably reduced using a low-cost multicomputer. We consider an extension of the problem in which resources are only partially available and a deadline is given, but the cost of the project should be minimized. In such a case, finding an acceptable solution (optimal or even semi-optimal) is computationally very hard. To reduce this complexity, a distributed processing model of a metaheuristic algorithm, previously adapted by us for working with human resources and the CCPM method, was developed. Then, a new implementation of the model on a low-cost multicomputer built from PCs connected through a local network was designed and compared with a regular implementation of the model on a cluster. Furthermore, to examine communication costs, an implementation of the model on a single multi-core PC was tested as well. The comparative studies proved that the implementation is as efficient as the one on the more expensive cluster. Moreover, it has a balanced load and scales well.


INTRODUCTION
Resource allocation, known as the Resource-Constrained Project Scheduling Problem (RCPSP), aims to schedule project tasks efficiently using limited renewable resources while minimising the maximal completion time of all activities [3-5]. A single project consists of m tasks which are precedence-related by finish-start relationships with zero time lags. The relationship means that all predecessors have to be finished before a task can be started. To be processed, each task requires a human resource (HR). The resources are limited to one unit and therefore have to perform different tasks sequentially. The RCPSP is an NP-hard problem. In most cases, branch-and-bound is the only exact method which allows the generation of optimal solutions within acceptable computational effort, and only for rather small projects (usually containing fewer than 60 tasks and not highly constrained) [1,5]. The investigation of Hartmann and Kolisch [8] showed that the best performing heuristics were the GA of Hartmann [7] and the SA procedure of Bouleimen and Lecocq [2]. Their latest research revealed that the forward-backward improvement technique applied to X-pass methods, metaheuristics, and other approaches produces good results, and that the most popular metaheuristics were GAs and TS methods.
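To make the finish-start precedence relation concrete, the sketch below computes earliest start and finish times for a tiny four-task project by a forward pass in topological order; all task data are invented purely for illustration.

```java
import java.util.List;

// Worked example of finish-start precedence with zero time lags: a task may
// start only after all of its predecessors have finished. All task data are
// hypothetical and chosen only for illustration.
public class PrecedenceExample {
    public static void main(String[] args) {
        int[] duration = {3, 2, 4, 1};               // durations of tasks 0..3
        List<List<Integer>> preds = List.of(
                List.of(),                           // task 0: no predecessors
                List.of(0),                          // task 1 follows task 0
                List.of(0),                          // task 2 follows task 0
                List.of(1, 2));                      // task 3 follows tasks 1 and 2
        int[] start = new int[duration.length];
        for (int t = 0; t < duration.length; t++) {  // forward pass (topological order)
            for (int p : preds.get(t)) {
                start[t] = Math.max(start[t], start[p] + duration[p]);
            }
        }
        for (int t = 0; t < duration.length; t++) {
            System.out.printf("task %d: start=%d finish=%d%n",
                    t, start[t], start[t] + duration[t]);
        }
    }
}
```

The unconstrained makespan of this example is 8 time units; if all four tasks required the same one-unit resource, tasks 1 and 2 could no longer overlap and the makespan would grow to 10, which is precisely the effect the resource constraint has on the problem.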
In our previous works, cost-efficient project management based on the critical chain method (CCPM) was investigated. The CCPM is one of the newest scheduling techniques [19]. It was used to solve a variant of the RCPSP. The goal of the management was to allocate resources so as to minimise the total cost of the project and complete it in a given time. A sequential metaheuristic from Deniziak [6] was adapted to take into account specific features of the human resources participating in a project schedule. The research showed high efficiency of this adaptation for resource allocation [12]. An extension of the problem, where HRs are only partially available since they may be involved in many projects, was also investigated [14]. The research proved that the adaptation is efficient, but the minimization was still time-consuming and would require acceleration to cope with bigger real-life problems. Our latest research showed that the algorithm has an inherent parallelism. Hence, a distributed processing model for solving the extension of the RCPSP was developed and tested on regular PCs [13]. It reduced the scheduling time by up to a factor of 10 compared with sequential processing. Therefore, in this research we present a new implementation of the model on a low-cost multicomputer built from PCs connected through a local network. Furthermore, we compare it with a regular implementation of the model on a cluster and show that it may be just as efficient while avoiding the high cost that might limit practical applicability.
The next section of the paper contains a brief overview of related work. Motivation for the research is given in section 3. An implementation of the distributed processing model for the algorithm is presented in section 4. Evaluation of the implementation in both distributed and parallel environments is given in section 5. The paper ends with conclusions.

RELATED WORK
Researchers have studied the problem and suggested solutions which can be divided into exact procedures and heuristics. Branch-and-bound methods are an example of the exact procedures (see e.g. [3], [4]). In [11], another method, a tree search algorithm, was presented. It is based on a new mathematical formulation that uses lower bounds and dominance criteria. An in-depth study of the performance of the latest RCPSP heuristics can be found in [10]. The heuristics described by the authors include X-pass methods, also known as priority rule-based heuristics, and classical metaheuristics such as Genetic Algorithms (GAs), Tabu Search (TS), Simulated Annealing (SA), and Ant Colony Optimisation (ACO). Non-standard metaheuristics and other methods were presented as well. The former consist of local search and population-based approaches which have been proposed to solve the RCPSP. The authors also investigated a heuristic which applies forward-backward and backward-forward improvement passes. For a detailed description of the heuristic schedule generation schemes, priority rules, and representations, refer to [8].
The effectiveness of scheduling methods can be further improved using parallel processing. Some implementations of parallel TS [15-17] and SA [18] algorithms for different combinatorial problems have already been proposed. The most common one is based on dividing (partitioning) the problem such that several partitions can be run in parallel and then merged. Parallelism in GAs can be achieved at the level of single individuals, the fitness functions, or independent runs [21,22]. All of the parallel approaches fall into three categories: the first uses a global model, the second a coarse-grained (island) model, and the third a fine-grained (grid, cellular) model [20]. In the global model, a master process manages the whole population by assigning subsets of individuals to slave processes. In the island model, the population is divided into sub-populations that are evolved separately; during evolution, some individuals are periodically exchanged between them. In the grid model, the population is represented as a network of interconnected individuals where only neighbors may interact. It was observed that parallel GAs (PGAs) usually provide better efficiency than sequential ones [20]. The same parallel approaches can be applied to ACO. In [23], five strategies of parallel processing are described, which are mainly based on the well-known master/slave approach [24].
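To make the coarse-grained model concrete, here is a compact, self-contained Java sketch, with an invented toy objective fitness(x) = -x² and parameters of our own choosing, in which sub-populations evolve independently and the best individual migrates to the next island in a ring every few generations:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Compact sketch of the coarse-grained (island) PGA model: sub-populations
// evolve independently and periodically migrate their best individual to the
// next island in a ring. Individuals are plain doubles; the toy objective
// fitness(x) = -x^2 and all parameters are invented for illustration.
public class IslandModelSketch {
    static final Random RND = new Random(3);
    static double fitness(double x) { return -x * x; }     // to be maximised

    public static void main(String[] args) {
        final Comparator<Double> byFitness =
                Comparator.comparingDouble(IslandModelSketch::fitness);
        int islands = 4, popSize = 10, generations = 200, migrateEvery = 20;

        List<List<Double>> pops = new ArrayList<>();
        for (int i = 0; i < islands; i++) {
            List<Double> pop = new ArrayList<>();
            for (int j = 0; j < popSize; j++) pop.add(RND.nextDouble() * 10 - 5);
            pops.add(pop);
        }
        for (int g = 1; g <= generations; g++) {
            for (List<Double> pop : pops) {                    // independent evolution
                pop.sort(byFitness);                           // worst individual first
                double parent = pop.get(pop.size() - 1);       // best individual
                pop.set(0, parent + RND.nextGaussian() * 0.1); // mutant replaces worst
            }
            if (g % migrateEvery == 0) {                       // periodic ring migration
                for (int i = 0; i < islands; i++) {
                    List<Double> src = pops.get(i), dst = pops.get((i + 1) % islands);
                    src.sort(byFitness);
                    dst.sort(byFitness);
                    dst.set(0, src.get(src.size() - 1));       // src's best replaces dst's worst
                }
            }
        }
        double best = pops.stream().flatMap(List::stream).max(byFitness).get();
        System.out.printf("best individual: %.4f, fitness: %.6f%n", best, fitness(best));
    }
}
```

Because each island communicates only once every migrateEvery generations, the model needs very little synchronization, which is one reason it is a popular choice for distributed PGAs [20].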

MOTIVATION
The sequential algorithms are time-consuming, which considerably limits their usefulness. Speeding up the calculations would be desirable for project managers because it may allow managing complex projects in acceptable time. Parallel models offer the advantage of reducing the execution time and give an opportunity to solve new problems which have been unreachable for sequential models. The most popular parallel strategies are based on the master/slave approach [24], with centralized management of distributing tasks and gathering results. The master can efficiently coordinate the system, avoiding potential conflicts before they take place, and react to failures of the slaves. However, global gathering and re-broadcasting of large configurations can be time-consuming. The costs of synchronization between slaves have to be considered as well: some slaves may have to wait for other tasks to complete, which is necessary to retain data integrity. Moreover, the master is the weakest point of the system. The system will slow down if the master cannot handle incoming requests, and if the master crashes, the whole system will crash with it. Another problem is load imbalance caused by the unpredictable processing time of each slave. Summarizing, the gain coming from parallelization of the algorithm may be significantly reduced.
From our research it also follows that parallel processing can efficiently reduce the amount of time consumed by the metaheuristic algorithm [13]. Usually, such a reduction requires the use of a cluster and is therefore expensive, which may limit its popularity. The key idea to overcome this inconvenience is to make use of the multi-core architecture of low-cost PCs instead of a cluster. Such a multicomputer is cheap, easily assembled, and might be very useful for practical reasons. However, it should be proven that the implementation is as efficient as on the cluster, that it has a balanced load, and that it scales well.

OPTIMIZATION ALGORITHM
The metaheuristic algorithm starts from an initial point and searches for the cheapest solution satisfying given time constraints. The initial schedule is generated by greedy procedures that try to find a resource for each task based on the smallest increase of the project duration or the project total cost. It is a suboptimal solution which the algorithm tries to enhance. In each pass of the iterative process, the current project schedule is modified in order to get closer to the optimum. In the first, add, stage, a new HR which is not in the schedule is attached to it. Tasks of HRs which have already been engaged in the schedule are moved to the new HR, but only when a positive gain is achieved. Afterwards, if there are HRs without allocated tasks, they are removed from the schedule. The best schedule goes to the next stage and the procedure is repeated until no more free HRs are available. In the second, rem, stage, all tasks allocated to an HR are moved onto other HRs still remaining in the schedule, but only when a positive gain is achieved. Then again, HRs without allocated tasks are removed from the schedule. Finally, the best project schedule coming from all stages is chosen. The iterative process is repeated for every resource from the resource library until no improvement can be found. At the very end, project tasks may be shifted right, into their forward free slack, to the latest feasible position by means of an As Late As Possible (ALAP) schedule.
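The control flow of this iterative process can be summarized with the sketch below. The Schedule class, the resource library, and the gain evaluation are hypothetical stubs (a random number stands in for the real CCPM-based cost computation); only the add/rem loop structure mirrors the description above, and the final ALAP shift is omitted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Control-flow sketch of the iterative improvement loop. The Schedule class,
// the resource library, and gainOfModification() are invented stubs; only
// the add/rem structure follows the algorithm described in the text.
public class OptimizationLoopSketch {
    static final Random RND = new Random(7);

    static class Schedule {
        double cost;
        List<Integer> engagedHrs;
        Schedule(double cost, List<Integer> hrs) {
            this.cost = cost;
            this.engagedHrs = new ArrayList<>(hrs);
        }
        Schedule copy() { return new Schedule(cost, engagedHrs); }
    }

    // Stub: the real algorithm recomputes the project cost after moving tasks.
    static double gainOfModification() { return RND.nextDouble() - 0.4; }

    public static void main(String[] args) {
        List<Integer> library = List.of(0, 1, 2, 3, 4);     // resource library (stub)
        Schedule best = new Schedule(500.0, List.of(0));    // greedy initial schedule (stub)

        boolean improved = true;
        while (improved) {                                  // repeat until no improvement
            improved = false;
            for (int hr : library) {
                Schedule cand = best.copy();
                double gain;
                if (!best.engagedHrs.contains(hr)) {        // add stage: attach a free HR
                    cand.engagedHrs.add(hr);                // and move tasks onto it
                    gain = gainOfModification();
                } else if (best.engagedHrs.size() > 1) {    // rem stage: move tasks off an
                    cand.engagedHrs.remove(Integer.valueOf(hr)); // HR, drop it if emptied
                    gain = gainOfModification();
                } else {
                    continue;
                }
                if (gain > 0) cand.cost -= gain;            // accept moves only on positive gain
                if (cand.cost < best.cost) { best = cand; improved = true; }
            }
        }
        System.out.printf("final cost: %.2f, HRs engaged: %s%n", best.cost, best.engagedHrs);
    }
}
```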

Distributed processing model
The distributed processing model is shown in Figure 1.

Implementation of the model
The distributed processing model (Figure 2) was implemented in Java. One application, the tasks dispatcher (D), manages a pool of threads responsible for communication with the worker applications located on remote computers. At the beginning, workers notify the dispatcher about their readiness to execute tasks. The tasks dispatcher creates a new thread for each worker and adds it to the pool. The pool contains as many threads as needed, but reuses previously constructed threads when they are available. On the remote computers, workers run as independent processes, which makes them available for direct communication. Therefore, the tasks dispatcher can split the computational tasks uniformly, so the workload is easily balanced. Each remote computer runs as many processes as it has processor cores, in order to use the whole computing power of multi-core machines. While executing an iteration of the algorithm, the tasks dispatcher sends schedule modification requests to the first free worker, using Remote Method Invocation (RMI) for communication. If a worker is not responding, it is removed from the pool and the request is sent to another free worker. Workers receive the project data and the search parameters and invoke a method that performs the add or the rem stage. Afterwards, the results of the modifications are sent back to the dispatcher and the thread can be reused. Synchronization occurs at the end of each iteration because all the rem stages have to be finished in order to choose the best schedule.
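A minimal sketch of the dispatcher/worker interaction is shown below. The Worker interface, the class names, and the single in-process registry are our illustrative assumptions; the real system additionally handles worker registration, failure handling, splitting of requests among many remote processes, and end-of-iteration synchronization:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical remote interface: one call performs an add or rem stage on the
// given project data (both stubbed as Strings) and returns the result.
interface Worker extends Remote {
    String modifySchedule(String projectData, String stage) throws RemoteException;
}

class WorkerImpl extends UnicastRemoteObject implements Worker {
    WorkerImpl() throws RemoteException { super(); }
    public String modifySchedule(String projectData, String stage) {
        return stage + " stage applied to " + projectData;  // placeholder computation
    }
}

public class DispatcherSketch {
    public static void main(String[] args) throws Exception {
        // In the real system, each multi-core PC runs one worker process per
        // core and registers with the dispatcher; here everything is local.
        Registry registry = LocateRegistry.createRegistry(1099);
        WorkerImpl worker = new WorkerImpl();
        registry.rebind("worker-0", worker);

        // Cached pool: reuses previously constructed threads when available,
        // matching the thread-per-worker design described in the text.
        ExecutorService pool = Executors.newCachedThreadPool();
        Worker remote = (Worker) registry.lookup("worker-0");
        Future<String> result = pool.submit(() -> remote.modifySchedule("project-1", "add"));
        System.out.println(result.get());

        pool.shutdown();
        UnicastRemoteObject.unexportObject(worker, true);   // let the JVM exit
    }
}
```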

COMPARATIVE STUDIES
The efficiency of the algorithm described in the paper was estimated on 100 randomly generated project plans containing from 30 to 60 tasks and from 8 to 16 HRs with random data. Each project plan was scheduled several times and the results were averaged. Tasks in a project plan may have at most 4 precedence relationships, each inserted with probability 0.35. Such tasks can be easily scheduled because they have few predecessors or none. If the probability of inserting the precedence relationships were lower, the project plan would contain mostly unconnected tasks; on the other hand, tasks with two or more predecessors significantly decrease the search space. In each project, resource availability was reduced by allocating 30 tasks from PSPLIB, developed by Kolisch and Sprecher [9]. The set with 30 non-dummy activities is currently the hardest standard set of RCPSP instances for which all optimal solutions are known [4]. However, we considered an extension of the RCPSP where resources already have their own schedules and the cost of the project, not the project duration, should be minimized. So even though we take the project instances from PSPLIB, the results cannot be compared directly. The initial schedule was generated by the two greedy procedures mentioned at the beginning of section 4.
Implementation of the distributed model was run on two distributed systems:
• a multicomputer built from PCs (ClusterPCs) that comprises 10 multi-core computers with an Intel Core i5-760 processor (8M cache, 2.80 GHz) and 2 GB of RAM, connected via a Gigabit Ethernet TCP/IP local network,
• a regular cluster that comprises 1 head node with an Intel Xeon E5410 @ 2.33 GHz and 16 GB of RAM, and 10 processing nodes with an Intel Xeon E5205 @ 1.86 GHz and 6 GB of RAM, connected via a Gigabit Ethernet TCP/IP local network.
Furthermore, to examine communication costs, an implementation of the model on a single multi-core PC was tested, too.

Tests which examine the implementation of the model in distributed environments
The algorithm's scalability depends on the number of HRs because it is related to the number of schedule modifications. The number of independent requests, and consequently the need for workers, increases along with the number of HRs. The influence of changing the number of workers on the computation time, with respect to the number of tasks, is shown in Figure 3. In both distributed environments, the computation time falls significantly as the number of workers grows. The decline is particularly visible when only a few workers are used. Eventually, the computation time reaches its minimum, no matter how many more workers are used. In both environments, the increase of the number of tasks also influences the drop of the scheduling time. However, the cluster, despite slower CPUs, copes better with the increase of the number of tasks: in the cluster, the growth of the scheduling time in more complex projects is slower, especially when only a few workers are used. In general, the reduction of the computation time looks similar in both environments. It is worth noticing that the computation time was reduced even to 6% of the sequential computation time for the project with 60 tasks and 12 HRs (Figure 3b). The processor load in both environments was also measured (Figure 4). The CPU usage was monitored every 50 ms and the readings were averaged at the end of the calculations; more frequent readings could influence the processor load. The number of HRs was chosen so that enough simultaneous attempts were provided to keep the workers busy. The PCs were running 4 workers each (one worker assigned to every core).

Tests which examine the influence of the communication cost on algorithm performance
Distributed tests were executed in order to examine how the network latency influences the algorithm's performance. To that end, 4 workers were run on a ClusterPCs configuration comprising 2 multi-core PCs and compared with 4 workers on 2 processing nodes in the cluster and 4 workers on a single PC (the so-called LocalPC). All workers were using RMI for communication. At first, the number of modification requests was counted with respect to the number of resources and the number of tasks (Table 1). The number of requests increases as the number of resources increases and varies along with the increase of the number of tasks. The more requests are sent, the greater the impact of the communication cost on the performance. The average scheduling time for a project with 30 tasks is shown in Table 2. It is clear that the scheduling time decreases when the number of workers grows. Yet, the decline is very small between 3 and 4 workers on the LocalPC because the computer resources start to be overloaded when 4 workers and the tasks dispatcher run on the same machine. On average, the LocalPC is about 13% faster than the corresponding ClusterPCs (for fewer than 4 workers), due to low communication costs. On the other hand, the ClusterPCs is better when the number of workers exceeds the number of processor cores; it is also not limited in the number of workers it can run. Even the usage of 4 workers reduced the scheduling time by 54% in the ClusterPCs and by 48% in the cluster, in the project with 30 tasks and 10 HRs. However, the reduction ratio in the former decreases along with the increasing number of resources, while it does not change in the latter. This means that the cluster also copes better than the PCs with the increase of the number of resources.
The average time of transferring data between the tasks dispatcher and 3 workers is shown in Table 3. It increases when the number of tasks increases, because more data needs to be transferred, and when the number of resources increases, due to the greater number of requests that the tasks dispatcher has to handle. Yet, the increase of the time is much faster in the ClusterPCs than in the cluster. Consequently, the data transfer in the ClusterPCs gets slower in the projects with more than 35 tasks and 10 HRs. On average, the data transfer is about 2.2 times slower in the ClusterPCs than within a single multi-core PC. On a single machine, it may be further reduced, to less than 0.5 ms, by the use of threads instead of processes in the LocalPC (the so-called Threads variant; see the sketch below). Threads are much lighter than processes and share the process's resources. Thus, even if only one multi-core machine is available, the scheduling time with the use of 4 workers may be reduced by about 47%. The scheduling time on a single machine with the use of 4 threads is comparable to the scheduling in the ClusterPCs on 2 multi-core PCs with 4 workers on each. Still, if the need for workers is greater, the ClusterPCs is better. Moreover, running more than 5 threads on a 4-core processor is no longer as efficient. The comparison of the time needed to transfer data between the tasks dispatcher and 3 workers, averaged over all attempts, is shown in Figure 5.

Figure 5. Time needed to transfer data between the tasks dispatcher and 3 workers, averaged over all attempts [ms]
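A minimal sketch of the Threads variant follows: on a single machine, the workers can live inside the dispatcher's JVM and be invoked directly, avoiding RMI serialization and loopback-socket overhead. The WorkerLogic class and all names are illustrative stand-ins for the real schedule-modification code:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the "Threads" variant: workers run as threads in the dispatcher's
// JVM and are called directly, with no RMI or network hop. WorkerLogic is an
// illustrative stand-in for the real schedule-modification code.
public class ThreadsVariantSketch {
    static class WorkerLogic {
        String modifySchedule(String projectData, String stage) {
            return stage + " stage applied to " + projectData;  // placeholder
        }
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores); // one thread per core
        WorkerLogic worker = new WorkerLogic();                     // shared, stateless logic

        List<Callable<String>> requests = List.of(
                () -> worker.modifySchedule("project-1", "add"),
                () -> worker.modifySchedule("project-1", "rem"));
        for (Future<String> f : pool.invokeAll(requests)) {
            System.out.println(f.get());                            // in-process call
        }
        pool.shutdown();
    }
}
```

Because the threads share one address space, project data does not have to be serialized for every request, which is consistent with the sub-millisecond transfer times reported above.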

CONCLUSIONS
In this research, a distributed model was used in order to reduce the computation time for solving the RCPSP when resources are partially available. An implementation of the model on a multicomputer built from PCs was tested and compared with a regular implementation of the model on a cluster. The tasks dispatcher and the workers were connected through a local network and used RMI for communication. The tasks dispatcher used multithreading for spreading and gathering data while, at the same time, the workers were calculating different schedule modifications and sending back the results. The workers ran on remote computers as independent processes and hence did not have to be synchronized. Workers were gathered in a pool managed by the tasks dispatcher and were available for direct use. The best efficiency was obtained when there were as many processes running as computer cores. Hence, the more cores inside a computer, the more workers can run on it and the fewer PCs are needed. Consequently, the more workers, the shorter the computation time, but only when there is enough work for the workers to do. Too few workers cannot handle the rapidly growing number of calculation requests after the first stage of the algorithm. The maximum number of workers depends on the number of HRs because it is related to the number of schedule modifications. Thus, the project scheduling cannot be sped up if there are a lot of resources and not enough workers, and vice versa.
The research showed that a multicomputer built from multi-core PCs may be successfully used to reduce the scheduling time. The obtained results are comparable with the cluster, and in both environments the reduction of time looks similar. However, the cluster copes better with the increase of the number of tasks and the number of resources. In the cluster, the communication cost is lower than in the ClusterPCs for projects with more than 35 tasks and 10 HRs. On a single machine, the scheduling is about 13% faster than through a local network (for fewer than 4 workers), due to the lack of network latency. It can be further reduced by about 47% by the use of threads instead of processes. However, the computer resources start to be overloaded when the tasks dispatcher and more than 3 processes, or more than 5 threads, run on the same 4-core processor. Therefore, the ClusterPCs outperforms the LocalPC when more than 3 workers are used, and the Threads variant when more than 7 workers are used.
The experimental results showed that the distributed model is well-balanced. The computational tasks are uniformly split among workers and, if the number of workers increases, the load spreads over the available PCs. The distributed algorithm scales well, adjusting to the number of workers. Moreover, if any of the workers crashes, its task is taken over by another worker and the processing continues. Projects of various complexity were tested, and in each of them the scheduling time was significantly reduced by the distributed calculations, even down to 6% of the sequential time. In comparison with sequential computing, the number of fully used cores was 10 times higher during the scheduling of a project with 30 tasks and 16 HRs by 36 workers.