XML Search Without an Index

By: Keith Swenson
Chief Architect
Fujitsu Software

Imagine you have a customer who has hired you to put together a solution for managing a huge quantity of XML information. The firm's team is using XML because it gives them flexibility in how the data is structured. They like the fact that they do not need to specify a given record structure up front, and they can change the XML structure of records whenever they need to. Still, the question remains, "How do you manage and search for records?"

One choice, of course, would be to put the data into a relational database. This approach is not always convenient because you need to decide ahead of time what the schema of the tables will be. Such a database will need to be managed, and any changes in requirements will demand changes in the schema as well as migration of the data. A second option is to put the data into the XML column of a relational database. This will work for moderate to large quantities of data, but what if you have truly huge amounts of data?

Now there is an alternative to a traditional database for storing and searching huge quantities of XML encoded data, and it is based on the use of a grid paradigm that eliminates the need for indices and provides the ability to guarantee the search response time.

The concept of searching XML data without using an index may well sound counter-intuitive. After all, we learned in "computer science 101" that an index is a way of making searches go faster. The accepted practice is that if your query performance times are taking too long, reexamine your schema and figure out how to construct the right index. In practice, judicious index construction can often make certain queries run much faster. However, several challenges arise when the number of records extends into the terabyte range.

Consider first that, no matter how carefully a database is coded, there will be an n^2 (or higher) order component to the time needed to maintain the index. Whenever a record is added to the database, the proper place in the index must be found, and an entry must be added. Whenever a record is removed, the index must be similarly modified to remove the entry. Of course, the ability to search on any part of an XML record then requires that every part must be indexed. The amount of extra processing involved to keep the indices up to date can be very significant if the index is large.

The second effect of an index is one of partial results. If the query involves several different fields, and if one of the values being searched for is commonly found in the data, the result set for that part of a query could be huge as well. The "and" operation in a search query effectively can cause the database to generate two huge partial result sets and then find the intersection of the large set - all in order to end up with what might be a relatively small final result set. The partial result sets must be managed in and out of memory, which can demand a great deal of processing, and the index itself can take up a lot of disk space requiring even more I/O operations.

It doesn't stop there. Once the result set is found from the indices, the database must go back to the disk and read those records into memory, causing yet more random I/O. It turns out that this random access nature of retrieving data contributes to slowing result times. A disk can be read sequentially about 10 times faster than reading random sectors because when reading sequentially, there is no need to seek and reposition the head across the disk.

The last effect of the index is to make the data coupled. An index by its nature must be an ordered set of references to all the records. An index over part of the records is just not as useful for searching as a complete index.

Previous Page | Next Page
1 | 2 | 3

If you found this page useful, bookmark and share it on: