Full Text Search Engines Vs DBMS


Introduction

Full text search engines and relational databases each have unique strengths as development tools but also have overlapping capabilities. Both can provide for storage and update of data and both support search of the data. Full text systems are better for quickly searching high volumes of unstructured text for the presence of any word or combination of words. They provide rich text search capabilities and sophisticated relevancy ranking tools for ordering results based on how well they match a potentially fuzzy search request. Relational databases, on the other hand, excel at storing and manipulating structured data — records of fields of specific types (text, integer, currency, etc.). They can do so with little or no redundancy. They support flexible search of multiple record types for specific values of fields, as well strong tools for quickly and securely updating individual records. Some of the fields in a table’s records may in fact be free form text, such as a product description, and most relational databases today provide support for doing full text searching on the unstructured data. However, the relevancy ranking of results for unstructured text search for most relational databases is not on par with that of the best full text search systems.

Some applications will naturally be served best by one or the other technology. Others may start in one of the widely installed relational databases but be better served by a full text search engine, a deficiency that may only become noticeable or problematic once requirements change or the volume of data grows. Many applications will rely on both technologies in tandem, either for different parts of the data or by duplicating some of the data in the two systems to get the advantages of both.

The rest of this article explains some of the strengths and weaknesses of full text search engines in more detail and provides some guidelines for choosing the right development model.

What full text search engines can do

Full text search engines excel at quickly and effectively searching large volumes of unstructured text — documents or other ‘records’ containing free form text — and returning these documents based on how well they match the user’s query. They may also have the ability to quickly facet, or categorize, data or search results based on specific values of specific fields. The text search capabilities of the best systems are rich and flexible, and include support for basic keyword searching, Internet-style +/- syntax, use of Boolean operators, limited real or pseudo-natural language processing, proximity operations, find-similar, etc. Relevancy ranking capabilities that determine the best match for a query include using the frequency of query terms in the document, their frequency in the database as a whole (the presence of rarer query terms in the document are usually more significant in indicating a good match), proximity of query terms near each other in the document, special weightings for particular terms, fields or documents and more.

These documents are typically of one type or structure. This structure often consists of a primary free form text field (e.g., the main body of a document, or the main description of a product), additional secondary free form text fields (e.g., a title or abstract) and some non-text or more constrained text fields (e.g., date of publication, size, price, product code, etc.). It’s common to refer to the main text field as the primary data or text and to the other fields (title, price, etc.) pertaining to it as metadata. However, these documents or records — the terms can be used interchangeably to refer to a single indexed ‘item’ in a full text search system — can also be viewed as simply a concatenation of fields. Any number of these fields may be free form text, and any of them may be non-text or more constrained textual data (e.g., one of some number of product codes).

Full text search systems generally adopt this view of the data they are indexing and searching: each document/record is simply a collection of fields. A given search is always run against a single field or some combination of fields, though the end-user may not be aware of this, especially if the default is to search all fields together.

Full text search systems generally depend on some type of index in order to perform queries. Most common is an inverted index, which effectively lists every term — every word, number, etc. — in every document together with an indication of which documents contain that term (and where, if searching on phrases or other proximity operations are supported, as they usually are). There may be a separate index for each field, or all fields may be contained in a single index. In a given document there may be no value for one or more fields. However, the structure is the same for all documents in that the set of possible fields is the same for all documents for a given full text index. The full text search system usually contains some capabilities for handling non-text fields, such as range searching, the ability to sort results by any field, etc. But these capabilities are not as strong as they are in a relational database.

Search results will come from that one set of indexed documents with one set of fields. That set may be an aggregation of documents from many sources and of many types. The set of fields defined for the index will need to include all the fields to be searched on from any of the document sources..This can mean that certain information needs to be repeated for certain fields, e.g., if the city field is Boston and there’s a need to also store or search state information, the state will need to be Massachusetts for every record for which the city is Boston. In a ‘normalized’ relational database, the fact that Boston is in Massachusetts would only be stored once. Some full text systems may provide some capability for effectively ‘joining’ data of different types, minimizing such redundancy. This might be done by additional special index structures, by pre-defined ‘filter’ queries that retain information about common query constraints (effectively joining queries with stored queries) or by multi-pass search techniques. These multi-record type capabilities are not as rich or as straightforward as they are in a relational database 해외문자

In addition to indexing the data, most modern full text search systems, including Lucene/Solr, let you actually store and retrieve the data in its original form. One reason they do so is to be able to easily populate a search result list with actual data — e.g. a document’s title or summary — to make it easier to see which documents are most relevant and worth opening for full review. Selected documents/records are often then opened from their original location, but one can also store the entire document/record in the search system and view it from within the system.

Modern full text search systems also support incremental indexing, including the ability to add, delete or update records. Still, full text systems are somewhat limited in their ability to rapidly and securely process transactional updates. This is partly because the speed and scale advantage of full text systems for text search are due in good measure to sophisticated index compression to represent that a given word might occur in millions of specific documents. This compression limits the ability for selective index update. Some full text search engines nevertheless support near real-time updating, often in a memory-based index partition that is rolled into a fuller disk-based index at some point in the background.


Leave a Reply

Your email address will not be published.