Transposition Rearrangement: Linear Algorithm for Length-Cost Model

Contemporary computational biology motivates the study of dependencies between finite sequences. Primary structures of DNA or proteins are represented by such sequences (also called words or strings). In this paper we present a linear algorithm that computes the distance between two words. The model operates with transpositions of single letters. The cost of a single transposition is equal to the distance that the transposed letter has to cover. The best algorithms given in other papers concerning this model have time complexity O(n log n). The complexity of our algorithm is O(nk), where k is the size of the alphabet, and O(n) when the alphabet size is fixed.


Introduction
The problem of describing similarities, or differences, between two strings has been deeply studied over the years. One of the main motivations for these studies is the rapid development of computational biology. There is a need for good models to compare sequences of genes, nucleotides or amino acids, and for fast algorithms computing distances in such models.
There are two well-studied approaches. The first is based on erasing or changing letters. The classical measures are the Levenshtein distance and the Hamming distance [1,2]. The second focuses on rearranging the order of letters. This approach is closely related to the old problem of sorting sequences. In both problems the same types of operation can be allowed. The best examples are reversals and transpositions [3,4,5].
In this paper we deal with strings composed of the same multisets of letters. Because the Parikh vectors of such strings are equal, we say that they are Parikh equivalent. We consider one of the models for rearranging strings by transpositions, the length-cost model. This model of measuring the distance between two strings was recently introduced in [6]. It assumes that shorter transpositions are cheaper. In the simplest case, transposing a letter from the i-th to the j-th position in a string costs |i − j|. The authors give a solution of semilinear complexity. A similar problem for the interchange rearrangement was introduced in [7]. Once more, the model with the cost based on a simple difference of positions of the exchanged letters is an important special case. The algorithm given there is quadratic and yields a description of the rearrangement. The authors also claim that this measure could be computed in linear time. Both algorithms are based on the observation that it is enough to exchange adjacent letters. They first solve the permutation case and then extend it to the general case by numbering the occurrences of each letter.
We look at the problem of string rearrangement. We start with the binary alphabet and show a linear algorithm for that case. Subsequently, we extend the binary alphabet case to the general case, using partial solutions for all projections on the binary subalphabets. The time complexity of the algorithm depends linearly on the length of the compared strings. However, it also depends linearly on the size of the alphabet.

Basic Notions and Definitions
We use some basic notions of mathematical language theory. By $\Sigma$ we denote an arbitrary finite set, called the alphabet. Elements of the alphabet are called letters. Words, or strings, are arbitrary finite sequences over the alphabet $\Sigma$; the empty word is denoted by $\varepsilon$. By $u_n$ we mean the $n$-th letter of the word $u$. The set of all finite words is denoted by $\Sigma^*$. By $|u|$ we denote the length of the word $u$ and by $|u|_a$ the number of occurrences of the letter $a$ in the word $u$.
A useful operation on words is the projection $\Pi : \Sigma^* \times 2^\Sigma \to \Sigma^*$. The projection of a word $u$ on the subalphabet $S \subseteq \Sigma$, written $\Pi_S(u)$, is the word obtained by erasing from $u$ all letters from $\Sigma \setminus S$. More precisely, we can give an inductive definition, assuming that $c$ is an arbitrary letter and $u$ is an arbitrary word:

$$\Pi_S(\varepsilon) = \varepsilon, \qquad \Pi_S(cu) = \begin{cases} c\,\Pi_S(u) & \text{if } c \in S, \\ \Pi_S(u) & \text{if } c \notin S, \end{cases}$$

where $\varepsilon$ is the empty word.

Definition 1. Let $u, v \in \Sigma^*$ be two Parikh equivalent words (i.e. $\forall_{a \in \Sigma}\ |u|_a = |v|_a$). Then the canonical permutation from the word $u$ to the word $v$ is a one-to-one function $P_{uv} : \{1, \ldots, |u|\} \to \{1, \ldots, |v|\}$ such that $v_{P_{uv}(i)} = u_i$ and $\forall_{i<j}\ u_i = u_j \Rightarrow P_{uv}(i) < P_{uv}(j)$. Whenever it is not confusing, the index will be omitted. Moreover, whenever we speak about the canonical permutation from $u$ to $v$, we assume that the words $u$ and $v$ are Parikh equivalent.
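The projection and the canonical permutation of Definition 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from the paper: the function names `project` and `canonical_permutation` are hypothetical, words are plain Python strings, and positions are 1-based to match the text.

```python
from collections import Counter, defaultdict, deque


def project(u, subalphabet):
    """Erase from u every letter that does not belong to subalphabet."""
    return "".join(c for c in u if c in subalphabet)


def canonical_permutation(u, v):
    """Return P as a list with P[i-1] = P(i), using 1-based positions.

    The i-th occurrence of a letter in u is sent to the i-th occurrence
    of the same letter in v, so equal letters keep their relative order.
    """
    if Counter(u) != Counter(v):
        raise ValueError("words are not Parikh equivalent")
    # For every letter, queue its positions in v from left to right.
    positions = defaultdict(deque)
    for j, c in enumerate(v, start=1):
        positions[c].append(j)
    # Each letter of u takes the leftmost unused position of that letter in v.
    return [positions[c].popleft() for c in u]
```

For instance, `project("abac", {"a", "b"})` keeps only the letters a and b, and `canonical_permutation("abac", "cbaa")` lists the images P(1), ..., P(4) in order.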
Definition 2. Let $u, v \in \Sigma^*$ and let $P$ be the canonical permutation from $u$ to $v$. We say that a pair $(i, j)$ is a reversed pair if and only if $i < j \wedge P(i) > P(j)$. The set of all reversed pairs for the words $u$ and $v$ is denoted by $RP(u, v)$. By $\#RP(u, v)$ we denote the number of elements in the set $RP(u, v)$.
Directly from the definitions, only indices of distinct letters can form a reversed pair. Moreover, there is a strict connection between the reversed pairs of two Parikh equivalent words and the reversed pairs of their projections to binary subalphabets. The two following facts describe this connection formally.

Proof Sketch. Projections, as morphisms, preserve the order of appearances of letters. It means that the $n$-th $a$ occurs before the $m$-th $b$ in a word $u$ iff the $n$-th $a$ occurs before the $m$-th $b$ in the projection $\Pi_{\{a,b\}}(u)$. The thesis of the lemma follows from that simple observation.

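Proposition 1. The number of reversed pairs for two words $u$ and $v$ is equal to the sum of the numbers of reversed pairs for their projections to all binary subalphabets. More formally:

$$\#RP(u, v) = \sum_{\{a,b\} \subseteq \Sigma,\ a \neq b} \#RP\big(\Pi_{\{a,b\}}(u), \Pi_{\{a,b\}}(v)\big).$$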
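On the other hand, the reversed pairs counted from different projections are formed by different pairs of letters, so on the right-hand side we count every pair at most once. It means that

$$\#RP(u, v) \geq \sum_{\{a,b\} \subseteq \Sigma,\ a \neq b} \#RP\big(\Pi_{\{a,b\}}(u), \Pi_{\{a,b\}}(v)\big),$$

which ends the proof.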
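Proposition 1 can be checked by brute force on small inputs. The sketch below is a quadratic sanity check, not the algorithm of the paper; all function names are hypothetical, the inputs are assumed to be Parikh equivalent Python strings, and positions are 1-based as in Definition 2.

```python
from collections import defaultdict, deque
from itertools import combinations


def canonical_permutation(u, v):
    # Match the i-th occurrence of each letter in u with its i-th
    # occurrence in v (assumes u and v are Parikh equivalent).
    positions = defaultdict(deque)
    for j, c in enumerate(v, start=1):
        positions[c].append(j)
    return [positions[c].popleft() for c in u]


def reversed_pairs(u, v):
    # All pairs (i, j) with i < j and P(i) > P(j), found in O(n^2) time.
    p = canonical_permutation(u, v)
    n = len(u)
    return {(i, j)
            for i in range(1, n + 1)
            for j in range(i + 1, n + 1)
            if p[i - 1] > p[j - 1]}


def project(w, subalphabet):
    # Erase every letter outside the chosen subalphabet.
    return "".join(c for c in w if c in subalphabet)


def sum_over_binary_projections(u, v):
    # Right-hand side of Proposition 1: one term per unordered pair of letters.
    letters = sorted(set(u))
    return sum(
        len(reversed_pairs(project(u, {a, b}), project(v, {a, b})))
        for a, b in combinations(letters, 2)
    )
```

For the words abac and cbaa, `len(reversed_pairs("abac", "cbaa"))` and `sum_over_binary_projections("abac", "cbaa")` both evaluate to 4.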
Example 1. Let us consider two strings $u = abac$ and $v = cbaa$. These strings are Parikh equivalent (both of them consist of two letters $a$, one letter $b$ and one letter $c$). The canonical permutation looks as follows:

$$P(1) = 3, \quad P(2) = 2, \quad P(3) = 4, \quad P(4) = 1.$$

The set of reversed pairs is $RP(u, v) = \{(1, 2), (1, 4), (2, 4), (3, 4)\}$. The projections to binary subalphabets of $\Sigma$ look as follows:

$$\Pi_{\{a,b\}}(u) = aba, \quad \Pi_{\{a,b\}}(v) = baa, \qquad \Pi_{\{a,c\}}(u) = aac, \quad \Pi_{\{a,c\}}(v) = caa, \qquad \Pi_{\{b,c\}}(u) = bc, \quad \Pi_{\{b,c\}}(v) = cb.$$

The sets of reversed pairs for these projections are, respectively,

$$RP(aba, baa) = \{(1, 2)\}, \qquad RP(aac, caa) = \{(1, 3), (2, 3)\}, \qquad RP(bc, cb) = \{(1, 2)\}.$$

According to [6], we have to label letters in both strings and compute the number of reversed pairs (in the sense of Definitions 1 and 2). Every transposition of length $l$ can be decomposed into $l$ transpositions of length 1; the cost of the long transposition is equal to the sum of the costs of the short transpositions. We can therefore consider only these short jumps over a single letter.

The next observation refers to the reversed pairs. For each reversed pair, at least one jump has to be made to put this pair in the correct order. It is also possible to carry out the whole transformation with transpositions of length 1, where the number of such transpositions is equal to the number of reversed pairs.

In the case of a binary alphabet, we can count the number of reversed pairs by simply counting "how many b's are before each a" in the strings $u$ and $v$. This way we get two vectors $U(u)$ and $V(v)$ of length $|u|_a = |v|_a$. Then we compute the distance between the two produced vectors using the Manhattan metric:

$$\sum_{i=1}^{|u|_a} \big| U(u)_i - V(v)_i \big|.$$
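The counting scheme for the binary alphabet described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the listing from the paper: the names `b_counts_before_each_a` and `binary_distance` are hypothetical, and the two letters are passed explicitly as `a` and `b`.

```python
def b_counts_before_each_a(w, a, b):
    """For each occurrence of a in w, left to right, count the b's before it."""
    counts = []
    seen_b = 0
    for c in w:
        if c == b:
            seen_b += 1
        elif c == a:
            counts.append(seen_b)
        else:
            raise ValueError("word is not over the binary alphabet {a, b}")
    return counts


def binary_distance(u, v, a="a", b="b"):
    """Manhattan distance between the vectors U(u) and V(v); one pass per word."""
    vec_u = b_counts_before_each_a(u, a, b)
    vec_v = b_counts_before_each_a(v, a, b)
    if len(u) != len(v) or len(vec_u) != len(vec_v):
        raise ValueError("words are not Parikh equivalent")
    return sum(abs(x - y) for x, y in zip(vec_u, vec_v))
```

For the projections aba and baa the vectors are (0, 1) and (1, 1), so `binary_distance("aba", "baa")` returns 1, which agrees with the single reversed pair of that projection.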

Binary Alphabet Case
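The straightforward algorithm (without checking the correctness of data) counts, in a single pass over each word, how many b's precede each a, and then sums the absolute differences of the two resulting vectors. Applying it to the projections on all binary subalphabets, as Proposition 1 allows, brings the time complexity to O(nk), where n is the length of a rearranged word and k is the size of the alphabet.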
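One way to reach the O(nk) bound for a general alphabet is sketched below. It is an assumption about how the binary case can be combined over all pairs of letters, not a reproduction of the paper's procedure: for every letter b it builds prefix counts of b in both words in O(n) time, and then compares them at the matched occurrences of every smaller letter a. The names `prefix_counts` and `length_cost_distance` are hypothetical.

```python
from collections import Counter, defaultdict


def prefix_counts(w, letter):
    """counts[i] is the number of occurrences of letter in w[:i]."""
    counts = []
    seen = 0
    for c in w:
        counts.append(seen)
        if c == letter:
            seen += 1
    return counts


def occurrences(w):
    """Map every letter to the increasing list of its 0-based positions in w."""
    occ = defaultdict(list)
    for i, c in enumerate(w):
        occ[c].append(i)
    return occ


def length_cost_distance(u, v):
    """Number of reversed pairs between u and v, in O(n * k) time."""
    if Counter(u) != Counter(v):
        raise ValueError("words are not Parikh equivalent")
    alphabet = sorted(set(u))
    occ_u, occ_v = occurrences(u), occurrences(v)
    total = 0
    for idx, b in enumerate(alphabet):       # k iterations
        before_u = prefix_counts(u, b)       # O(n)
        before_v = prefix_counts(v, b)       # O(n)
        for a in alphabet[:idx]:             # each unordered pair {a, b} once
            # The i-th a in u is matched with the i-th a in v.
            for pos_u, pos_v in zip(occ_u[a], occ_v[a]):
                total += abs(before_u[pos_u] - before_v[pos_v])
    return total
```

For each fixed b the inner loops visit every position of a smaller letter at most once, so one iteration of the outer loop costs O(n) and the whole computation costs O(nk). On the words of Example 1, `length_cost_distance("abac", "cbaa")` returns 4.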

Conclusions
A linear algorithm computing the cost of rearrangement of finite sequences has been presented. The only allowed operations are transpositions, and the cost of a single transposition is given by its range. The running time of the algorithm depends on the size of the alphabet. This is a serious disadvantage when the size of the alphabet is close to the length of the sequence. However, many practical situations involve small alphabets and long sequences. For instance, DNA chains are sequences over a set of cardinality four.