Query Optimization by Indexing in the ODRA OODBMS

We present features and samples of use of the index optimizer module which has been implemented and tested in the ODRA prototype system. The ODRA index implementation is based on linear hashing and works in a scope of a standalone database. The solution is adaptable to distributed environments in order to optimally utilize data grid computational resources. The implementation consists of transparent optimization, automatic index updating and management facilities


Introduction
Indices are auxiliary (redundant) data structures stored at a server. A database administrator manages a pool of indices generating a new or removing an existing one depending on the current needs w.r.t. improving overall performance of applications. As indices at the end of a book are used for quick page finding, a database index makes quick retrieving objects (or records) matching given criteria possible. As indices have a relatively small size (comparing to the inheriting from the one with the index can inherit the index. To index subclass attributes, it is necessary to specifically set indices on each subclass. This results in the need for providing index consistency by a database administrator.
In the Objectstore DBMS [5] there are two types of improving query performance by indexing, i.e., with indices and with superindices. The first solution involves building indices on a collection of objects. A superindex is an index kind that is specially used for optimizing queries involving types that have many subtypes. By default, adding an index on a type results in recursive adding of indices to all its subtypes. Still for queries with a large and intricate hierarchy of subtypes the regular indexing can seriously deteriorate processing. Adding a superindex to a type with many subtypes differs from a default index in one essential feature: the superindex is only one. It eliminates a recursion; consequently, only one parent query operation occurs in contrast to multiple queries when using the regular index.
There is also a possibility to create a query that uses a multistep index, which is an index on a complex navigational path that accesses multiple public data members. It optimizes queries that use the same path. For example, if a query concerns all employees who works in the Sales department, an index on WorksIn.Name to Emp collection can be used. However, updating an index entry after data modification must be explicitly determined by the programmer. It is a serious drawback of an ObjectStore multistep index.
The ObjectStore ODBMS automatically optimizes a query applied to a collection. If an index is added to a collection, then the database first evaluates indexed fields and establishes a preliminary result set. Then, ObjectStore applies non-indexed fields and methods to the elements in the preliminary result set. In ObjectStore the optimization can be done manually by preparing a query or automatically otherwise. This means that a query is optimized to use exactly indices which are available on the collection being queried. The automatic optimization is convenient and effective. Moreover, it supports data independence, i.e., the database administrator is not constrained in establishing new or removing indices because application programs do not refer to them explicitly.
Let us consider the index usage in Objectivity/DB [6]. The main goal of an index is to optimize predicate scans and this is how it is implemented in Objectivity. The predicate used in the scan can be one of the following: • A single optimized condition (=, ==, >, <, >=, <=, =∼ -string match) that tests the first key field of the index • A conjunction (&&) of conditions in which the first conjunct is an optimized condition that tests the first key fields of the index (no disjunction -OR ) Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 03/08/2023 05:31:27 In the case of Objectivity/DB creation of an index for a class means indexing also objects references of all classes derived from the indexed one. The index structure maintains references to persistent objects of a particular class (so called indexed class) and its derived classes. An indexed class is specified during creation of an index. Objectivity/DB additionally supports concatenated index on several attributes (key fields). The order of key values of an index is very relevant regarding the proper activity of predicate.
While considering indexing in OODBMSs the way the GemStone database server handles the issue should also be noticed [7,8]. GemStone indices address path-expressions. A variable name appearing in the beginning of a path is called path prefix. Then, a path contains a sequence of links and a path suffix; e.g. Employee.worksIn.manager. For each link (for an instance variable of an object) in the path suffix one index is available thus forming a sequence of index components. In GemStone identity indices directly support exact match lookups; whereas, equality indices and identity indices on Boolean, characters and integers directly support =, >, >=, <, <= and range lookups. Fig. 1 shows the ODRA query optimization process in the context of a query evaluation process. The input for the optimization process is an abstract syntax tree (AST) of a query. The optimization modules are divided into optimization by rewriting and optimization by indices. The theoretical idea for these methods is presented in several documents, see e.g. [9, 10, 2, 3].

Query Optimization Engine Architecture
The rewriting optimization process modifies a query during compile-time with the use of information stored in the metabase augmented with static query evaluation results. Currently ODRA supports several rewriting methods: changing the order of execution of algebraic operators; view rewrite (replacing a view invocation by a view body); removing dead sub-queries; factoring out independent sub-queries; shifting conditions as close as possible to the proper operator; methods based on the distributivity property of some query operators.
Optimization by indices searches for parts of an input query that can be transparently replaced with an index call. If such an index exists (added previously by the administrator) the query is rewritten to the form where the target part is replaced with an index invocation.

The idea of ODRA indexing
In general, an index can be considered a two-column  Key values can be also calculated with the use of expressions that can contain build-in query language functions or user defined functions (function-based indices [11]). This approach enables the administrator to create an index matching exactly predicates within frequently occurring queries, so their evaluation is faster and uses the minimal amount of I/O operations.
In query optimization indices are used in the context of a where operator, when the left operand is indexed by key values of the right operand selection predicates. Let us make an example using the database store structure presented in Fig. 2. If the administrator will set an index named For big databases, replacing the where clause evaluation with an index function call may cause performance gain even orders of magnitude. However, to achieve this the database server should ensure index transparency and automatic index updating.

Index Transparency.
In the common approach a programmer should not involve explicit operations on indices into an application program. To make indexing transparent from the point of view of a database application programmer, the database management system should ensure two important functionalities index-based optimisation and automatic index updating.
The first functionality means that indices are used automatically during query evaluation. Therefore, the administrator of a database can freely establish new indices and remove them without changing the codes of applications. The Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 03/08/2023 05:31:27 U M C S responsibility in ensuring such transparency lies in query optimisation and particularly in the index optimiser. The second functionality, i.e. an automatic index updating, is required due to possible changes in a database. Indices, like all redundant structures, can lose cohesion if a database is updated. An automatic mechanism should improve, eliminate or generate a new index in the case of database updates. This paper focuses on the first functionality, i.e. index optimisation, which is the main topic of Section 6.

Index Classification.
The most common classification of indices distinguishes primary and secondary ones or dense and range ones. From the query optimizer point of view the distinction between primary and secondary indices is less crucial because it does not lead to significant differences in optimizer algorithms, whereas the division into dense and range indices is essential: • a dense index is applied when for each value in the object attributes a separate position in an index is created, e.g. for a person objects index, where any name occurring in the database can be a key-value, • a range index means that index items concern values within a given range, e.g. a range index for a salary attribute is a table where each index item describes a range of salaries: < 0 − 500), < 500 − 1000), < 1000 − 1500), . . . etc (Table 1). Similarly, range index items for names can take the following form: "names starting with a letter A", "names starting with a letter B",. . ., "names starting with a letter Z".

Features of ODRA indices.
Currently the implementation supports indices based on Linear Hashing [12] which can be easily extended to its distributed version SDDS [13] in order to optimally utilize data grid computational resources. Nevertheless, there is a wide range of different index structures that could be used in indexing in object-oriented databases similarly to those in the solutions occurring in relational ones [11,1,14,15]: B-Trees, bitmap indices, etc.
An extended idea of an ODRA index works with multiple key indices. Additionally to the key types mentioned earlier (dense and range) enum type was introduced to improve multiple key indexing (among other things). Moreover, thanks to properties of the SBQL language, i.e. orthogonality and compositionality, the implemented solution provides generic support for variety of index definitions including usage of complex expressions with polymorphic methods and aggregate operators.
ODRA supports local indexing which ensures index transparency by providing a mechanism (optimization framework) to automatically utilize an index before query runtime evaluation and therefore to take the advantage of indices. ODRA C.R.U.D. (Create, Read, Update and Delete) is also equipped with triggers to ensure automatic index updating so existing indices are consistent with the database state.

Index Management
All indices existing in a database are registered and managed by the ODRA index manager. The list of all indices and auxiliary information needed by the index optimizer are stored inside a special admin module. Each index is associated with a module where it was created and its name has to be unique. Therefore, the index manager checks whether a given index exists in the list of references to meta-base objects describing indices using the combination of a module name and an index name: "module n ame.index n ame".

Example Schema.
The schema in Fig

Index Types.
The syntax for creating index allows the administrator to specify general index key properties, i.e. concerning key values or the goal of optimization. These are achieved by introducing optional type indicators: dense, range and enum.
The dense indicator implies that the optimization of queries which use the given key as a condition will be applied only for selection predicates based on '=' or in operators. Therefore the distribution of indexed objects in index (e.g. in hash table) can be more random. The order of key values has no significance for indexing. The dense indicator is always used for reference values (regardless of an indicator set by the administrator). Moreover, it is the default type indicator for integer, string, double or reference key values.
The range indicator implies that optimized selection predicates will be based not only on '=' or in operators but also on range operators: '>', '≥', '<' and '≤'. Within an index a hash function groups objects according to key value ranges. In the current implementation, ranges are dynamically split because each range is associated with an individual bucket of a linear hash map.
The idxDeptSalary index returns references to departments according to a value (or a value range) of a sum of department employees salaries. Its advantage is avoiding calculation of a complex selection predicate multiple times because it is already calculated during index creation. On the other hand, the maintenance of the idxDeptSalary index is very expensive and can cause serious deterioration during database updating.
The enum indicator is introduced in order to take the advantage of keys with a countable limited set of distinct values, i.e. keys with low values cardinality. The performance of an index can be strongly deteriorated if key values have low cardinality e.g. person eye colour, marriage status (Boolean value) or the year of birth. Using the enum key type index internally stores all possible key values (or range for integer values) and uses this information to optimize the index structure.
Another important property of enum keys occurring when index is set on multiple keys is that the optimizer can omit them if necessary during optimization of queries. If enum is set on all index keys and the number of indexed objects is large then index call evaluation should prove great efficiency (each key value combination points to a separate object references array called bucket) Other examples of creating indices commands are as follows: The enum index which returns Person objects queried by a zip attribute of its subobject address. It is important to note that a zip attribute is optional and therefore this index stores only Person objects containing this attribute.
The index returns Person objects according to the value of expression 2009 -age. It is assumed that this index is capable of processing range queries.
The dense index uses the Emp class method getTotalIncomes() as a key for selecting Emp objects. This method is overridden for instances of the EmpStudent class.
The only action required from the administrator in order to take advantage of indexing is creation of proper indices since the rest of optimization is transparent for programmers. The next section describes the rules used by the Index Optimizer.

Query Optimization
In ODRA the use of indices is entirely transparent for an application code. The programmer may be aware or not of existence of indices, but the code does not depend on it. The index optimizer automatically applies all possible indices during query compilation process.
Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 03/08/2023 05:31:27 U M C S Besides this possibility, a user can also use indices explicitly. This feature is introduced for testing purposes in order to check semantic equivalence of introduced index optimizations and research into new possibilities in indexing.
In the following we briefly describe ODRA indices optimization engine module used for query optimization based on indices.

Index Usage Syntax.
From the SBQL syntax point of view an index invocation is simply a procedure invocation: The number of parameters is equal to the number of index keys. Each key parameter defines a desirable value of a key. An index function call returns references to objects matching specified criteria.
A key parameter expression can define a single value as a criterion. In that case its evaluation should return integer, double, string, reference or Boolean value or reference to such a value. Below we present an example calls for the sample index idxDeptName: A single value key parameter can be passed through a value of a binder named "$equal". Binders are used to increase readability and to make introducing new types of parameters for index calls easier.
To specify a range as a key value criterion parameter, an expression should return a structure consisting of four parameters: (< lower − limit >, < upper − limit >, < lower − closed >, < upper − closed >) where: • < lower − limit > and < upper − limit > are key values specifying range, • < lower − closed > is a Boolean value indicating whether < lower − limit > belongs to a criterion range, • < upper − closed > is a Boolean value indicating whether < upper − limit > belongs to a criterion range.
Examples of index calls: The last example returns references to persons whose year of birth is below the average of all the persons from the database. Like in the case of single value key parameters, parameters specifying a range are passed using the value of a binder named "$range".

U M C S
A key parameter can specify also collection of single key values as a criterion. This is done when a key parameter returns a bag of key values.
The binder named "$in" is used to pass a collection of key values. If a criterion parameter returns an empty bag then the index call returns an empty bag too.

Transparent Index-based Optimization.
The mechanism responsible for index transparency during query evaluation is called the index optimizer. Its function is to replace a part of a query with an index call in order to minimize amount of data processed.
This section describes general rules used in solving the problem of semantic equivalence of queries rewritten by the index optimizer and original input queries. Most of the following rules concern optimizing range queries. The index optimizer analyzing the right operand of a where non-algebraic operator takes into consideration all selection predicates joined with conjunction (and) or disjunction (or) operators.

Optimization procedure.
The basic index optimizer procedure works on selective queries where left side of the where operator is < object e xpression > indexed by one or more indices. The algorithm analyses all selection predicates joined with and operators and tries to find an index that keys matches the predicates. If more than one index is found, the optimizer selects one with the best selectivity.

Semantic Equivalence Issue in Optimization Involving Optional Keys.
Firstly, let us to consider how [0..1] key cardinality affects optimization. Using criteria with the discussed cardinalities may cause runtime errors because selection predicates based on '=', '>', '≥', '<' and '≤' operators force using single values as left and right operands. An unexpected number of operand Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 03/08/2023 05:31:27 U M C S values causes a runtime error. Using an index call in optimization with those predicates would eliminate threat of error and therefore optimized query would not be semantically equivalent to the original. In these cases the optimization is allowed only if in operator is used as a predicate because it does not constrain the cardinality of a right operand.
The example of an unsafe predicate evaluation that may cause a run-time error (left side of selection predicate has cardinality [0..1] due to zip attribute) is presented below: To avoid the possibility of a run-time error the "safe" in operator should be used: In the discussed case the index optimizer supports optimization when predicates are defined using '=', '>', '≥', '<' and '≤' operators and only if a proper exists predicate is used.
The example of safe predicate evaluation when '=' operator is used can be as follows: Only in the case of two previous examples of queries the Index Optimizer can apply the following query transformation: The minimal cardinality of a key equal to zero indicates that the index may not contain references to all objects defined by an index < object − expression >. In the case of multiple key index, if such a key is omitted in selection predicates, it is possible that evaluation of the where operator may return references to the objects that are not stored inside the index. Therefore, the index optimizer would not apply optimization using such an index. To sum up, keys with the minimal cardinality equal to zero are obligatory even if they are declared with enum type indicator. Currently the maximum cardinality of keys greater than one is not supported by the ODRA indices. However theoretically it would imply that an index call may return the same object reference more than onece. To prevent such problems in the future, the index optimizer uses the uniqueref operator to remove redundant object references.

Aspects of Range Predicates Optimization.
If optimized query selection predicates specify only one limit of a range (lower or upper) then the second limit is generated automatically i.e. a possible smallest or biggest value for a given key. For example, the following query concerns the departments located in the Warsaw city whose employees together earn less than the best paid employee of the whole company.
Original query: Optimized query: If there are more than one predicate or two opposite predicates describing the range on a given key then min, max, union and comparison expressions are used to obtain a correct key range parameter.
Original query: Optimized query: In some cases, the index optimizer can use if then expression to predict whether a given query returns no result (and calling the index is unnecessary) i.e. if selection predicates are in contradiction. This is to be checked e.g. when for a given key there exists more than one selection predicate and at least one is based on '=' or in operator. If any of these selection predicates contradicts with a predicate based on '=' or in operator then such a query returns an empty bag: Original query: Optimized query: This procedure is used also when the key cardinality is different from [1..1], i.e. in the case of two or more selection predicates based on in operator.

Omitting Individual Index Keys.
For multiple keys indices, enum keys may be usually omitted in an index call. The index optimizer, in order to omit a key when no selection predicates were specified, sets both lower and upper bounds to the smallest and largest key values: Original query: Optimized query: To omit the Boolean key in an index call, set key parameter criteria are used (false union true). Original query: Optimized query: The index optimizer is also prepared to deal with queries where selection predicates are joined with or operators. As disjunction weakens a selection, it also makes optimization more complex. Therefore if the application of an index is possible without considering predicates joined with or operator then the optimizer may skip deeper analysis. In another case, in order to check all possibilities for indexing, the optimizer removes or operator and splits nonalgebraic where operator expression on two partial selection expressions. The objects returned by both these expressions can be duplicated so it is necessary to leave only distinct object references which is achieved using a uniqueref expression. Indexing reduces the amount of data processed in a query only if it can be applied to both partial expressions. This procedure is recursive if there is more than one or operator. Let us consider the following example of optimization: The query can be split by the index optimizer into the following form: and depending on a current cost model and existing indices, the optimizer can apply the transformation: Pobrane z czasopisma Annales AI-Informatica http://ai.annales.umcs.pl Data: 03/08/2023 05:31:27 U M C S Selection predicates based on age, married and address.city expressions concern EmpClass's superclass, i.e. PersonClass, and for that reason the administrator can equip the whole Person collection with the idxPerAge&Mar&City index. It can return instances that do not belong to EmpClass thus the optimizer has to introduce a facility removing non-EmpClass instances from the index invocation result. This can be done using an SBQL coerce operator. The syntax of the coerce operator was taken from the typical syntactic convention that is known from the languages such as C, C++, Java, etc. as cast. Consequently, the result of the idxPerAge&Mar&City index call is automatically cast to Emp collection because the original query concerns only employees.
In the presented approach to reusing an index in inheritance, indices considering the class which introduces the given key are more versatile as they can be used for optimising selection queries addressing subclasses collections.

Optimization Gain
Let us discuss the following test example. If an index call is located on the right side of a non-algebraic operator, e.g. a dot, then it is likely to be evaluated more then once during the query execution. This is shown using the following example with an idxEmpTotalIncomes index: Query 1a. For 61 year old, married employees living in Lódź, working in Lódź or Wroc law retrieves a name concatenated with a surname and a number of employees with an equal amount of total incomes In Figs 4 and 5 the logarithmic scale is used also on the y-axis. The dependency between the optimization gain and the number of persons is close to linear and grows to 457 for 300000 objects.

Conclusion and Future Work
In the paper the rules concerning creating and taking advantage of indices in the ODRA prototype have been briefly described. In the presented approach the optimization is achieved through the described query transformation. The proposed implementation of indexing in ODRA enables creation and transparent automatic maintenance of indices facilitating processing of selection predicates based on arbitrary deterministic expressions consisting of path expressions, aggregate functions, class method invocations (taking into consideration inheritance and polymorphism). All functionalities necessary to provide the desired behaviour of indices are already implemented and functional. Still, the ODRA indexing is under development and requires further research. Future works include employing different index structures (e.g. B-Trees) and implementing new optimization methods taking advantage of indices (e.g. optimisation of rank queries). Additionally we consider extending indexing capabilities onto distributed environment using the SDDS method and currently developed volatile indexing technique.