by Darlene Canning
This section outlines the evolution of searching on the Web:
from its origins in the research community in the early 1990s
to present-day investigations into the creation of new and
enhanced ways to search. Many events had to occur in a short
time period for the Web to become the information source that it
is today.
Given the magnitude of the Web, search engines that rely
exclusively on keywords provided by users have long been known
to be inadequate. Despite this, using keywords to search the Web
was an innovation when it first became available. Prior to this,
systems such as Archie and Veronica (see above) were the search
tools for locating information on the Internet. They were not
user-friendly; they provided limited access to FTP or Gopher
servers where files were stored. Until the appearance of search
engines, which permitted users to retrieve information by
keyword searching, information retrieval on the Web was limited
and tended to be restricted to the research community.
Search retrieval based on single word or phrase matching
within Web documents has its roots in the field of information
retrieval. Information retrieval systems had their broadest
application in database systems for indexes to bibliographic
references; originally, they were used mainly by reference librarians.
Even in the research community, researchers frequently came to
librarians with their search requests. Prior to the Web, the
general public had limited access to information. Information
retrieval was a vital part of research but members of the public
tended to seek advice from a local expert on a topic or they
would go to a library, if one were available in their community.
Librarians became efficient at searching bibliographic
systems by spending countless hours learning the structure of
each database and the best way to search based on the underlying
indexing and database structure. These information retrieval
systems were effective as search tools because they were applied
to a small resource with a particular discipline or application
bias. Many of these systems made use of tools such as controlled
vocabularies and thesauri to achieve precise information
retrieval. They were not particularly effective when searching
full text documents unless index terms were added to clarify and
define the meaning of the full text article.
With the widespread use of the Web today, few people have the
patience to search an index of bibliographic references. They
may consider using such an index if it includes a unique
searching capability, such as the molecular structure searching
for chemists found in SciFinder (trademark name of the
American Chemical Society's online Chemical Abstracts, at
www.cas.org/SCIFINDER/scicover2.html).
Even in this kind of searching, users expect to locate a ready
link to the full text. On the Web, the closest approximation to
this type of index searching is found in a directory-type search
such as Yahoo, where broad categories of topics are selected and
the user then follows a tree, drilling down to a level where
truly meaningful information is found.
Now that the Web has become a pervasive part of society,
users expect to turn to the Web and locate the information they
need with the same skill as librarians. Many of the search
engines have come a long way toward achieving this but much
remains to be done. The Web itself lacks the inherent structure
that was an integral part of the bibliographic systems that were
previously described. There is a movement on the Web to add the
kind of structure that will enable precise information retrieval
previously limited to these library research tools. To create a
structured information environment, many resources have to be
made available and worldwide cooperation and agreement on
standards must be in place. Work is progressing in this area to
allow for the development of what is known as the semantic web (www.w3.org/2001/sw/).
With the development of keyword searching and more
user-friendly web search engines, access to the Web moved from
the research lab and into use by the general public. It was
only when the number of users reached a critical mass that
conclusions could be drawn about the patterns of user behavior.
Prior to these developments, the Web was an information resource
for only a few people and it was impossible to predict the
behavior of the population at large.
With the appearance of these first search engines, it became
apparent that most users were unwilling to work their way
through numerous screens of Web addresses until they found an
interesting site. Previously, this was unknown and it was
thought that users would browse through several pages of
websites and assess the value of each Web link. This has proven
not to be the case and most users will browse only the first two
or three screens.
When designing web search engines, the browsing behavior of
the users must be taken into account. Once it was discovered
that information found after the first couple of screens would
be largely ignored, the value of a good relevancy ranking tool
became apparent. Relevancy ranking was found to enhance user
satisfaction with web searching, and there was a move to those
web search engines offering relevancy ranking.
Originally, relevancy was calculated by counting the number
of times the keywords were found in the document. The more times
keywords were found in a document, the higher the document would
appear on the list of websites and the greater its relevancy. It
was this knowledge that contributed to the use of "spamming," as
described in the following frame.
| Spamming
Not all new
developments on the Web contributed to the retrieval of
pertinent information in the way that might have been
envisioned. As the Web grew and matured, spamming emerged
as a new phenomenon. Once it became known that web
search engine retrieval was based solely on matching the
keywords found in the Web documents (webpages), attempts
were made to doctor Web documents to increase the
likelihood they would be found.
Spamming is the process of embedding Web documents
with excessive and repeated instances of the same word
to ensure retrieval by keyword-based search engines.
This might be used as a form of advertising.
Companies selling cars might want to embed their
websites with numerous occurrences of the word car to
increase the likelihood that their site appears on
the list of websites retrieved when someone searches
using the word car.
This abuse of the Web is given as the reason why many
commercial web search engine providers refuse to fully
disclose the exact nature of the search algorithms they
use in programming their search engines. They argue that
if they describe in detail the way in which their search
engine retrieves information, then it may be possible to
manipulate the content of Web documents to ensure their
retrieval, even if the document itself may be considered
of poor quality or of questionable value. |
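The keyword-count ranking described before the frame, and the way spamming exploits it, can be illustrated with a minimal sketch. The page names and texts below are invented for illustration, and the scoring function is an assumption about how a purely count-based engine might work, not the algorithm of any real search engine.

```python
import re
from collections import Counter

def keyword_score(text, query):
    """Count how many times the query keywords occur in the text."""
    words = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(words[term] for term in query.lower().split())

def rank(pages, query):
    """Return page names ordered from highest to lowest keyword count."""
    return sorted(pages, key=lambda name: keyword_score(pages[name], query),
                  reverse=True)

# Hypothetical pages: an honest review and a keyword-stuffed advertisement.
pages = {
    "review": "An independent review of this year's family car models.",
    "dealer": "Car car car! Cheap car, best car, buy a car today. Car car car.",
}

print(rank(pages, "car"))  # ['dealer', 'review']
```

Because the score depends only on repetition, the stuffed page outranks the more informative one, which is exactly the weakness spamming takes advantage of.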
With the development of second-generation search engines, the
calculation of relevancy ranking has been further refined, and
it is now based on a composite of factors. In addition to
keyword counts, links from pages are used in the calculation of
relevancy weights as well; information about the search words
themselves may also be included in the calculations. If the
keywords appear in the title of a document or if they are a
larger font size or in bold, one could assume that these words
are of greater significance to the content of the document. The
algorithms used to compute the relevancy weights may include
any, or all, of this information in their calculations.
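As a rough sketch of such a composite score, the example below combines a keyword count with title, bold-text, and link signals. The `Page` structure, the `WEIGHTS` values, and the choice of inbound link counts as the link signal are all hypothetical, chosen only to make the idea concrete; commercial engines do not disclose their actual factors or weights.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    title: str
    body: str
    bold_terms: set = field(default_factory=set)  # words shown in bold
    inbound_links: int = 0                        # links pointing at this page

# Hypothetical weights; real engines keep theirs undisclosed.
WEIGHTS = {"body": 1.0, "title": 5.0, "bold": 2.0, "link": 0.5}

def composite_score(page, keyword):
    """Combine several relevancy signals into one weight for a keyword."""
    kw = keyword.lower()
    body_hits = page.body.lower().split().count(kw)
    in_title = kw in page.title.lower().split()
    in_bold = kw in page.bold_terms
    return (WEIGHTS["body"] * body_hits
            + WEIGHTS["title"] * in_title
            + WEIGHTS["bold"] * in_bold
            + WEIGHTS["link"] * page.inbound_links)

stuffed = Page(title="Great deals", body="car car car car")
guide = Page(title="Car buying guide", body="how to choose a car",
             bold_terms={"car"}, inbound_links=6)

print(composite_score(stuffed, "car"))  # 4.0
print(composite_score(guide, "car"))    # 11.0
```

With these assumed weights, a page that merely repeats the keyword scores lower than one where the keyword appears in the title, in bold, and on a page that other pages link to.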
Hyperlink Analysis used to enhance Web Search Engines
Currently, all search engines purport to use some form of
hyperlink analysis. These links are used in different ways to
create the collection of documents to be searched when a user
invokes a search using a web search engine. The beauty of the
Web is the way in which it is organized to simulate human
thinking. The mind tends to wander off in tangents as thoughts
are processed; it is known that this process does not follow a
strictly logical path. When webpages are written, the author of
the document adds links to direct the reader to other websites
that he/she considers useful and/or pertinent to the content of
the original document. Currently, search engines use the links
to locate other pages to add to their core collection of
searchable pages. The collection is built as the search engine
crawls the Web. In fact, one of the reasons for the speed of
access of web search engines is that they are not searching in
"real time." When a web search is executed, the search engine is
working its way through an index of a static set of webpages,
albeit a very large one. Although the set is
static for searching purposes, it is updated on a regular
schedule.
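The idea of searching a prebuilt index rather than the live Web can be sketched with a small inverted index, a structure that maps each word to the pages containing it. This is a minimal illustration under assumed data: the URLs and page texts are invented, and real engines use far more elaborate index structures.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Build the index once, ahead of time, from a static set of pages."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Answer a query from the index alone, returning pages with every term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical collection gathered by an earlier crawl.
pages = {
    "example.org/gopher": "Gopher servers stored files",
    "example.org/web": "Web search engines index files and pages",
}
index = build_index(pages)

print(search(index, "search files"))  # {'example.org/web'}
```

At query time only dictionary lookups and set intersections are performed, which is why the search is fast; the cost of reading the pages is paid when the index is rebuilt on the update schedule.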
Web search engines are programmed to crawl the Web on a
regular schedule to locate new sources or to discover updated
versions of existing pages. Monika Henzinger, Google, Inc.,
describes the process:
The crawling process usually starts from a set of
source web pages. The Web crawler follows the source page
hyperlinks to find more web pages. Search engine developers
use the metaphor of the spider "crawling" along the Web
creating hyperlinks. This process is repeated on each new
set of pages and continues until no more new pages are
discovered or until a predetermined number of pages have
been collected. [1]
Not only does this provide an excellent description of the
way in which search engines crawl the Web to build the
collection of websites, but it also demonstrates the way in which
web search engines are dependent on humans. The source pages are
chosen by human intervention, or the crawlers are programmed to
begin with websites that the people programming the search
engines consider important. These source pages are
initially selected based on criteria created by human guidance
and are considered important sites that warrant inclusion. This
crawling process creates the static entity to be searched each
time a user interacts with a web search engine.
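The crawling process quoted above can be sketched as a breadth-first traversal from a set of human-chosen source pages. To keep the example self-contained, it walks an in-memory link graph rather than fetching real pages; the `link_graph` contents, the function name `crawl`, and the `max_pages` parameter are assumptions made for illustration.

```python
from collections import deque

def crawl(link_graph, seeds, max_pages):
    """Follow hyperlinks outward from the seed pages.

    Stops when no new pages are discovered or when max_pages
    pages have been collected, whichever comes first.
    """
    queue = deque(dict.fromkeys(seeds))  # seeds, duplicates removed
    seen = set(queue)
    collected = []
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        collected.append(url)
        for target in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return collected

# Hypothetical Web: each page maps to the pages it links to.
link_graph = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
    "e": ["a"],  # links out, but nothing links to it
}

print(crawl(link_graph, ["a"], max_pages=10))  # ['a', 'b', 'c', 'd']
```

Page "e" is never collected because no page reachable from the seed links to it, which shows how strongly the resulting collection depends on the human choice of source pages.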
Relevancy ranking does not compensate for the natural
ambiguity that exists in language itself. Since the same word
may have different meanings depending on the context, keyword
searching without other enhancements will continue to retrieve
information that is not pertinent.
With the word banking, it is easy to see the problems that
arise in search engines. Banking may refer to the transactions
made at the financial institution called a bank or it may be
used to describe the action of banking an airplane to change
its direction. If the word banking is searched, all sites which
contain the word banking will be retrieved regardless of its
meaning. Unless the user includes other keywords to restrict
the search to the intended meaning of the word, many sites will
be retrieved that do not pertain to the search.
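The banking example can be reproduced with a few lines of code. The two page texts are invented, and the matching rule (every keyword must appear as a word on the page) is a simplifying assumption rather than the behavior of any particular search engine.

```python
def matches(text, keywords):
    """True if every keyword appears as a word in the text."""
    words = set(text.lower().split())
    return all(k.lower() in words for k in keywords)

def search(pages, *keywords):
    """Return the names of all pages containing every keyword."""
    return [name for name, text in pages.items() if matches(text, keywords)]

# Hypothetical pages using the word "banking" in two different senses.
pages = {
    "finance": "online banking lets you check your account balance",
    "aviation": "banking an airplane changes its direction in a turn",
}

print(search(pages, "banking"))              # ['finance', 'aviation']
print(search(pages, "banking", "airplane"))  # ['aviation']
```

The single keyword retrieves both senses because matching is blind to meaning; only the additional keyword narrows the results to the intended one.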
Locating pertinent information in a fast and efficient way is
what most people expect when they turn to the Web for
information. The efficiency of the search is dependent on at
least two factors:
- First, if the information needed is very complex, it may
be difficult to articulate the information needed in one or
two keywords. Studies have shown that most people search
using two or three words in their search string. However,
this often does not produce the desired results on the first
page of findings.
- Second, if the information needed is a simple fact,
the retrieval may depend on the currency of the information
needed and knowing the right keywords to enter to bring the
user to the right place on the Web.
If you are looking for the phone number and address of someone
living in Canada, one of the most efficient places to
locate that information will be www.411.ca;
nevertheless, it may take several searches to locate
that site. It is highly unlikely that it will appear on
the first or second screen of search results unless the
keywords are carefully chosen. Furthermore, this site
will only provide useful information if the person in
question has a listed telephone number. This
demonstrates that even a very useful site may have
restrictions on the quality and quantity of information
it provides.
Research to enhance search capabilities is continuing. The
confusion that arises in keyword searching will persist unless
web search engines are developed to address the issues of the
meaning of words. To achieve a truly effective web search
engine, it will have to account for the natural ambiguity that
exists in language; the semantics of language must be
considered.
Work is progressing on the development of the semantic web
(Part 4: The Vision of the Semantic Web). This is
only one approach to eliminate the mix-up that is possible in
searching even a simple term like banking. In
addition to the vision of the semantic web, other solutions are
being investigated to address the problems of language and the
definition of words. In the future, search engines might query
the user to clarify the meaning of the keywords used in a search
(Part 3: Toward the use of semantics on the Web).
|