XML Search Without an Index
The solution is remarkably simple: eliminate the challenges of using indices by starting from the idea that data can be searched without an index. One benefit is that nothing has to be set up in advance. The XML is written to files on a disk, so it can be searched immediately.
At the same time, offer the ability to add and search XML data of any record structure, thereby eliminating the need to convert or prepare the XML records. If over time the application is modified to handle additional data, the new records can be added to the database without changing any of the existing records.
Simplicity and flexibility are nice, but how do you actually get the required performance? Two techniques are being employed today.
The first is to spread the data across any number of low-cost Linux blade servers, with a controlling service and a set of search services deployed on them. Since there is no index, there is no need for any of these servers to know anything about the other servers; each blade server is given its part of the data, and it searches that part alone. In this scenario, the controlling service collects the search queries and submits them in parallel to each of the search servers. It then collects all of the results from the search servers, merges them, and returns them to the requester.
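The controller-and-search-server arrangement can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: each "search server" is modeled as an in-memory list of XML strings, threads stand in for separate machines, and the names `search_shard` and `controlled_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

def search_shard(records, field, value):
    """Scan one shard front to back; return the full text of matching records."""
    hits = []
    for xml_text in records:
        root = ET.fromstring(xml_text)
        if root.findtext(field) == value:
            hits.append(xml_text)
    return hits

def controlled_search(shards, field, value):
    """Submit the same query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(search_shard, shard, field, value) for shard in shards]
        merged = []
        for future in futures:
            merged.extend(future.result())
    return merged

# Hypothetical data: two shards that know nothing about each other.
shards = [
    ["<listing><state>NY</state><price>500000</price></listing>",
     "<listing><state>TX</state><price>300000</price></listing>"],
    ["<listing><state>NY</state><price>750000</price></listing>"],
]
ny_listings = controlled_search(shards, "state", "NY")
```

Because no shard depends on another, adding a server in this sketch amounts to adding one more list to `shards`.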
One of the great advantages of this approach is that new search servers can be added at any time. If the response time gets longer, deploy more servers and the entire search operation speeds up again. There are no dependencies between the search servers, so there is almost no limit to how much this can be scaled. More importantly, this property can be used to guarantee response time. As the quantity of data grows, the amount of processing power can grow, guaranteeing that the response time remains constant.
Even so, spreading the data across many search servers would still leave the response time unacceptable were it not for the second technique: the controlling service collects hundreds or thousands of queries and submits them together, so that many queries are evaluated in a single pass through the data without slowing down the search.
Imagine you have 20 queries looking for home listings in New York and another 20 queries looking for homes in Texas. As soon as a record is tested and found to be about a house in New York, the search mechanism continues processing on the first 20 queries, but the queries for houses in Texas can be ignored for this record, and cause no additional overhead. The search through a large quantity of data might take five to ten seconds. However, the ability to return 1,000 result sets from a single pass through the data means you can maintain very high speed, even without an index.
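One way to picture the home-listings example is the sketch below. It assumes, purely for illustration, that every query has the form "state plus maximum price" and that records carry `state` and `price` elements; the function name `batch_search` and the grouping strategy are hypothetical, not taken from the product. Queries are grouped by state up front, so a New York record is only ever compared against the New York queries.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def batch_search(records, queries):
    """Answer many queries in one pass.

    queries maps a query id to a (state, max_price) pair (an assumed query shape).
    """
    # Group queries by the condition they share, so it is tested once per record.
    by_state = defaultdict(list)
    for query_id, (state, max_price) in queries.items():
        by_state[state].append((query_id, max_price))

    results = {query_id: [] for query_id in queries}
    for xml_text in records:
        root = ET.fromstring(xml_text)
        state = root.findtext("state")
        price_text = root.findtext("price")
        if price_text is None:
            continue  # records of other structures simply do not match
        price = int(price_text)
        # Queries for other states are never touched for this record.
        for query_id, max_price in by_state.get(state, ()):
            if price <= max_price:
                results[query_id].append(xml_text)
    return results

records = [
    "<listing><state>NY</state><price>500000</price></listing>",
    "<listing><state>TX</state><price>300000</price></listing>",
    "<listing><state>NY</state><price>750000</price></listing>",
]
queries = {"q1": ("NY", 600000), "q2": ("NY", 1000000), "q3": ("TX", 250000)}
results = batch_search(records, queries)
```

Each record is parsed once no matter how many queries are in the batch, which is where the single pass saves work.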
Here is how it works. When a record is found that matches a query condition, the entire record is already in memory, so it can be added to the result set immediately, without additional I/O. This adds up to significant I/O savings. Since there is no index to bring in and out of memory, there are no huge partial result sets to manage, and there is no need to go back and retrieve the record contents separately. Instead, the disk is read sequentially, which is the fastest way to read a disk.
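A minimal sketch of such a sequential scan follows. The storage layout is an assumption made for the example (one XML record per line in a plain file), and `sequential_scan` is a hypothetical name; the article does not describe the actual on-disk format.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def sequential_scan(path, field, value):
    """Read the file front to back once, yielding each matching record whole."""
    with open(path, encoding="utf-8") as data_file:
        for line in data_file:
            line = line.strip()
            if not line:
                continue
            if ET.fromstring(line).findtext(field) == value:
                # The record text is already in memory: no second lookup needed.
                yield line

# Hypothetical data file with records of two different structures.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "listings.xml")
    with open(path, "w", encoding="utf-8") as out:
        out.write("<listing><state>NY</state><price>500000</price></listing>\n")
        out.write("<listing><state>TX</state><beds>3</beds></listing>\n")
    matches = list(sequential_scan(path, "state", "TX"))
```

The scan never seeks backward and never consults a separate structure to fetch record contents; the match and the result come from the same read.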
Additionally, XML records can be added and removed at any time. There is none of the overhead associated with updating an index, which in traditional index-based solutions can demand a significant amount of processing in certain situations. Therefore, when a record is added, it becomes immediately available to the next search. Similarly, records can be removed quickly and easily.
