Search Discussion Article

FREE SEARCH HELP

On Site Resources

Search Tool Guide

BUY THE BOOK
ABOUT THE BOOK
FAQ's
Audiences
User Benefits
Overview & Contents
Book Excerpts
Awards-Reviews
Updates
OTHER
Contact Us
Authors
Discussion Topics
Sales Affiliates
 
by Darlene Canning

Part 2: Evolution of Web Searching

This section outlines the evolution of searching on the Web: from its origins in the research community in the early 1990s to present-day investigations into new and enhanced ways to search. Many events had to occur in a short time period for the Web to become the information source that it is today.

Keyword searching comes to the Web

Given the magnitude of the Web, search engines that rely exclusively on keywords provided by users have long been known to be inadequate. Despite this, using keywords to search the Web was an innovation when it first became available. Prior to this, systems such as Archie and Veronica (see above) were the search tools for locating information on the Internet. They were not user-friendly; they provided limited access to FTP or Gopher servers where files were stored. Until the appearance of search engines, which permitted users to retrieve information by keyword searching, information retrieval on the Web was limited and tended to be restricted to the research community.

Search retrieval based on single word or phrase matching within Web documents has its roots in the field of information retrieval. Information retrieval systems had their broadest application in database systems for indexes to bibliographic references; originally, they were used mainly by reference librarians. Even in the research community, researchers frequently came to librarians with their search requests. Prior to the Web, the general public had limited access to information. Information retrieval was a vital part of research, but members of the public tended to seek advice from a local expert on a topic or they would go to a library, if one was available in their community.

Librarians became efficient at searching bibliographic systems by spending countless hours learning the structure of each database and the best way to search based on the underlying indexing and database structure. These information retrieval systems were effective as search tools because they were applied to a small resource with a particular discipline or application bias. Many of these systems made use of tools such as controlled vocabularies and thesauri to achieve precise information retrieval. They were not particularly effective when searching full-text documents unless index terms were added to clarify and define the meaning of the full-text article.

With the widespread use of the Web today, few people have the patience to search an index of bibliographic references. They may consider using such an index if it includes a unique searching capability, such as the molecular structure searching for chemists found in SciFinder (Trademark name of American Chemical Society Online Chemical Abstracts, at www.cas.org/SCIFINDER/scicover2.html). Even in this kind of searching, users expect to locate a ready link to the full text. On the Web, the closest approximation to this type of index searching is found in a directory-type search such as Yahoo, where broad categories of topics are selected and then the user follows a tree or drills down to a level where truly meaningful information is found.

Now that the Web has become a pervasive part of society, users expect to turn to the Web and locate the information they need with the same skill as librarians. Many of the search engines have come a long way toward achieving this but much remains to be done. The Web itself lacks the inherent structure that was an integral part of the bibliographic systems that were previously described. There is a movement on the Web to add the kind of structure that will enable precise information retrieval previously limited to these library research tools. To create a structured information environment, many resources have to be made available and worldwide cooperation and agreement on standards must be in place. Work is progressing in this area to allow for the development of what is known as the semantic web (www.w3.org/2001/sw/).

Relevancy ranking enhances web searching

With the development of keyword searching and more user-friendly web search engines, access to the Web moved out of the research lab and into use by the general public. It was only when the number of users reached a critical mass that conclusions could be drawn about the patterns of user behavior. Prior to these developments, the Web was an information resource for only a few people and it was impossible to predict the behavior of the population at large.

With the appearance of these first search engines, it became apparent that most users were unwilling to work their way through numerous screens of Web addresses until they found an interesting site. Previously, this was unknown and it was thought that users would browse through several pages of websites and assess the value of each Web link. This has proven not to be the case and most users will browse only the first two or three screens.

When designing web search engines, the browsing behavior of the users must be taken into account. Once it was discovered that information found after the first couple of screens would be largely ignored, the value of a good relevancy ranking tool became apparent. Relevancy ranking was found to enhance user satisfaction with web searching, and there was a move to those web search engines offering relevancy ranking.

Originally, relevancy was calculated by counting the number of times the keywords were found in the document. The more times keywords were found in a document, the greater its relevancy and the higher the document would appear on the list of websites. It was this knowledge that contributed to the use of "spamming," as described in the following section.
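
The count-based ranking described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names, the tokenizing rule, and the example.org pages are hypothetical, and no actual search engine's code is being reproduced.

```python
import re
from collections import Counter

def keyword_count(text, keywords):
    """Count how many times the query keywords occur in the text."""
    words = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(words[k.lower()] for k in keywords)

def rank_by_count(pages, query):
    """Return (url, count) pairs, highest raw keyword count first."""
    keywords = query.split()
    scored = [(url, keyword_count(text, keywords)) for url, text in pages.items()]
    # Pages that never mention the keywords are not retrieved at all.
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# Hypothetical pages, invented for this illustration.
pages = {
    "example.org/a": "Used car listings. Compare car prices and car reviews.",
    "example.org/b": "Train timetables and one car rental desk.",
    "example.org/c": "A simple recipe for bread.",
}
ranking = rank_by_count(pages, "car")
print(ranking)
```

The page that repeats the keyword most often is listed first, whatever the quality of its content.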

Spamming

Not all new developments on the Web contributed to the retrieval of pertinent information in the way that might have been envisioned. As the Web grew and matured, a new phenomenon appeared: spamming. Once it became known that web search engine retrieval was based solely on matching the keywords found in Web documents (webpages), attempts were made to doctor Web documents to increase the likelihood they would be found.

Spamming is the process of embedding Web documents with excessive and repeated instances of the same word to ensure retrieval by keyword-based search engines. This might be used as a form of advertising.

Companies selling cars might want to embed their websites with numerous occurrences of the word car to increase the likelihood that their site appears on the list of websites retrieved when someone searches using the word car.
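
A small sketch shows why this works against an engine that scores pages purely by keyword count. The helper names and the page texts are hypothetical, and the figure of 50 repetitions is arbitrary.

```python
def count_word(text, word):
    """Raw number of occurrences of `word` in `text` (case-insensitive)."""
    return text.lower().split().count(word.lower())

def stuff(text, word, times):
    """Append `times` extra copies of `word`, as a doctored page might."""
    return text + " " + " ".join([word] * times)

# Hypothetical page texts, invented for this illustration.
honest = "Independent car reviews and safety ratings"
dealer = "Visit our showroom today"
stuffed_dealer = stuff(dealer, "car", 50)

honest_score = count_word(honest, "car")
before = count_word(dealer, "car")
after = count_word(stuffed_dealer, "car")
print(honest_score, before, after)
```

Before the change the dealer page would not be retrieved for car at all; afterwards it outscores the page that actually discusses cars.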

This abuse of the Web is given as the reason why many commercial web search engine providers refuse to fully disclose the exact nature of the search algorithms they use in programming their search engines. They argue that if they describe in detail the way in which their search engine retrieves information, then it may be possible to manipulate the content of Web documents to ensure their retrieval, even if the document itself may be considered of poor quality or of questionable value.

With the development of second-generation search engines, the calculation of relevancy ranking has been further refined, and it is now based on a composite of factors. In addition to keyword counts, links from pages are used in the calculation of relevancy weights; information about the search words themselves may also be included in the calculations. If the keywords appear in the title of a document, or if they are set in a larger font size or in bold, one could assume that these words are of greater significance to the content of the document. The algorithms used to compute the relevancy weights may include any, or all, of this information in their calculations.
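
One way to picture such a composite is a weighted sum of the factors just listed. The sketch below is an assumption-laden illustration: the `Page` fields, the weights, and the use of a simple inbound link count are all hypothetical, since commercial engines do not disclose their formulas.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    title: str
    body: str
    emphasized: str      # text shown in bold or in a larger font
    inbound_links: int   # number of other pages linking to this one

# Hypothetical weights, chosen only to make the example readable.
WEIGHTS = {"body": 1.0, "title": 5.0, "emphasized": 2.0, "links": 0.5}

def count(text, keyword):
    return text.lower().split().count(keyword.lower())

def composite_score(page, keyword):
    """Combine keyword counts, keyword placement, and link information."""
    text_score = (
        WEIGHTS["body"] * count(page.body, keyword)
        + WEIGHTS["title"] * count(page.title, keyword)
        + WEIGHTS["emphasized"] * count(page.emphasized, keyword)
    )
    if text_score == 0:
        return 0.0  # a page that never mentions the keyword is not retrieved
    return text_score + WEIGHTS["links"] * page.inbound_links

stuffed = Page("example.org/stuffed", "Welcome", "car " * 6, "", 0)
guide = Page("example.org/guide", "Car buying guide", "How to choose a car", "car", 8)
print(composite_score(stuffed, "car"), composite_score(guide, "car"))
```

With these made-up weights the guide scores 12.0 and the stuffed page 6.0, so repeating a word is no longer enough to come first.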

Hyperlink analysis used to enhance web search engines

Currently, all search engines purport to use some form of hyperlink analysis. Hyperlinks are used in different ways to create the collection of documents to be searched when a user invokes a search using a web search engine. The beauty of the Web is the way in which it is organized to simulate human thinking. The mind tends to wander off on tangents as thoughts are processed; it is known that this process does not follow a strictly logical path. When webpages are written, the author of the document adds links to direct the reader to other websites that he or she considers useful or pertinent to the content of the original document. Currently, search engines use the links to locate other pages to add to their core collection of searchable pages. The collection is built as the search engine crawls the Web. In fact, one of the reasons for the speed of access of web search engines is that they are not searching in "real time": when a web search is executed, the search engine is working its way through an index of a static set of webpages, albeit a very large one. Although the set is static for searching purposes, it is updated on a regular schedule.
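
The idea of searching a prebuilt index rather than the live Web can be illustrated with a small inverted index, a standard information retrieval structure that maps each word to the pages containing it. The sketch assumes a simple in-memory layout; the function names and pages are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word, using only the index."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical snapshot of crawled pages.
pages = {
    "example.org/a": "web search engines",
    "example.org/b": "search the library catalogue",
}
index = build_index(pages)

# A page published after the crawl is invisible until the index is rebuilt.
pages["example.org/c"] = "new search page"
stale = search(index, "search")
fresh = search(build_index(pages), "search")
print(sorted(stale), sorted(fresh))
```

Answering a query touches only the index, which is why it is fast, and also why a new page cannot be found until the next scheduled update.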

Web search engines are programmed to crawl the Web on a regular schedule to locate new sources or to discover updated versions of existing pages. Monika Henzinger, Google, Inc., describes the process:

The crawling process usually starts from a set of source web pages. The Web crawler follows the source page hyperlinks to find more web pages. Search engine developers use the metaphor of the spider "crawling" along the Web creating hyperlinks. This process is repeated on each new set of pages and continues until no more new pages are discovered or until a predetermined number of pages have been collected. [1]

Not only does this provide an excellent description of the way in which search engines crawl the Web to build the collection of websites, but it also demonstrates the way in which web search engines are dependent on humans. The source pages are chosen by means of human intervention, or the crawlers are programmed to begin with websites that are considered to be important by the people programming the search engines. These source pages are initially selected based on criteria created by human guidance and are considered important sites that warrant inclusion. This crawling process creates the static entity to be searched each time a user interacts with a web search engine.
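
The crawling loop in the quotation can be sketched as a breadth-first traversal from a set of source pages. To keep the example self-contained, the "Web" here is a hypothetical dictionary of links rather than real network requests; the `crawl` function, its parameters, and the example URLs are assumptions made for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "seed.example/home": ["seed.example/about", "news.example/today"],
    "seed.example/about": ["seed.example/home"],
    "news.example/today": ["blog.example/post", "seed.example/home"],
    "blog.example/post": [],
    "island.example/page": ["seed.example/home"],  # no page links to this one
}

def crawl(seeds, fetch_links, max_pages):
    """Follow hyperlinks from the seed pages until no new pages are
    found or max_pages pages have been collected."""
    collected = []
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier and len(collected) < max_pages:
        url = frontier.popleft()
        collected.append(url)
        for target in fetch_links(url):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return collected

def fetch_links(url):
    return LINKS.get(url, [])

all_pages = crawl(["seed.example/home"], fetch_links, max_pages=100)
first_two = crawl(["seed.example/home"], fetch_links, max_pages=2)
print(all_pages)
```

The page that nothing links to is never collected, which shows how much the resulting collection depends on the source pages people choose.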

Problems not resolved by relevancy ranking

Relevancy ranking does not compensate for the natural ambiguity that exists in language itself. Since the same word may have different meanings depending on the context, keyword searching without other enhancements will continue to retrieve information that is not pertinent.

With the word banking, it is easy to see the problems that arise in search engines. Banking may refer to the transactions made at the financial institution called a bank or it may be used to describe the action of banking an airplane to change its direction. If the word banking is searched, all sites which contain the word banking will be retrieved regardless of its meaning. Unless the user includes other keywords to restrict the search to the intended meaning of the word, many sites will be retrieved that do not pertain to the search.

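
The banking example can be reproduced with a minimal keyword matcher. The two pages and the `keyword_search` function are hypothetical and stand in for a real engine only loosely.

```python
def keyword_search(pages, query):
    """Return URLs of pages containing every query word (case-insensitive)."""
    words = query.lower().split()
    return [
        url
        for url, text in pages.items()
        if all(word in text.lower().split() for word in words)
    ]

# Hypothetical pages using the two senses of the word.
pages = {
    "bank.example/accounts": "online banking for savings and chequing accounts",
    "flight.example/turns": "banking the aircraft changes its direction in a turn",
}
both = keyword_search(pages, "banking")
narrowed = keyword_search(pages, "banking aircraft")
print(both, narrowed)
```

A search on banking alone retrieves both pages; only the extra keyword restricts the results to the intended sense.
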
Locating pertinent information in a fast and efficient way is what most people expect when they turn to the Web for information. The efficiency of the search is dependent on at least two factors:
  • First, if the information needed is very complex, it may be difficult to articulate it in one or two keywords. Studies have shown that most people search using two or three words in their search string. However, this often does not produce the desired results on the first page of findings.
  • Second, if the information needed is a simple fact, the retrieval may depend on the currency of that information and on knowing the right keywords to enter to bring the user to the right place on the Web.

If you are looking for the phone number and address of someone living in Canada, one of the most efficient places to locate that information will be www.411.ca; nevertheless, it may take several searches to locate that site. It is highly unlikely that it will appear on the first or second screen of search results unless the keywords are carefully chosen. Furthermore, this site will only provide useful information if the person in question has a listed telephone number. This demonstrates that even a very useful site may have restrictions on the quality and quantity of information it provides.

Research to enhance search capabilities is continuing. The confusion that arises in keyword searching will persist unless web search engines are developed to address the issues of the meaning of words. A truly effective web search engine will have to account for the natural ambiguity that exists in language; the semantics of language must be considered.

Work is progressing on the development of the semantic web (Part 4: The Vision of the Semantic Web). This is only one approach to eliminate the mix-up that is possible in searching even a simple term like banking. In addition to the vision of the semantic web, other solutions are being investigated to address the problems of language and the definition of words. In the future, search engines might query the user to clarify the meaning of the keywords used in a search (Part 3: Toward the use of semantics on the Web).



Effective Internet Search: E-Searching Made Easy!     © Baylin Systems, Inc., 2006