Search Discussion Article

FREE SEARCH HELP

On Site Resources

Search Tool Guide

BUY THE BOOK
ABOUT THE BOOK
FAQ's
Audiences
User Benefits
Overview & Contents
Book Excerpts
Awards-Reviews
Updates
OTHER
Contact Us
Authors
Discussion Topics
Sales Affiliates
by Darlene Canning

Part 4: Vision of the Semantic Web

Introduction

It is impossible to predict the exact evolution of the Web; however, exciting research is underway to develop the semantic web. The semantic web is a vision for the future of the Web as it grows and adapts to the changing demands of the information world. Many groups around the world are working on the semantic web, particularly in the areas of e-commerce, digital libraries and knowledge management; however, it has only been implemented on a small scale in each of these areas. Examples of these experiments are described in the case studies at the end of this section (Part 4: Case studies: applications of the semantic web).

Proponents of the semantic web, including Tim Berners-Lee (www.w3.org/People/Berners-Lee/) and others at the World Wide Web Consortium (W3C) project (www.W3C.org), are convinced it will solve many of the existing problems encountered in retrieving information on the Web. W3C, with a full-time staff of more than sixty experts and over 500 member organizations, is developing the tools needed for the semantic web. Before the semantic web can become reality, these tools must be readily available and easy to apply; this does not appear to be the case at present. Although the semantic web is a brilliant idea, many issues remain, and only time will tell whether it will lead to another revolution in the information world.

What is the semantic web?

The semantic web is a work in progress, so it is difficult to provide a precise definition. It is not intended to replace the existing World Wide Web; rather, it is meant to add another layer on top of what currently exists. According to Berners-Lee:

The semantic web is not a separate web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. [7]

Confusion arises from the fact that much of the semantic web is still under development and it is largely experimental in nature. Basically, to have a semantic web, documents must be coded in languages other than HTML (the standard upon which the Web developed). An infrastructure will have to exist in which terms are more clearly defined to provide meaning and outline relationships in the use of language. This will permit information on the Web to be located based on its meaning and also its relationship to other information. These relationships will be far more structured and include more data than the simple inclusion of metadata descriptors such as the Dublin Core. These relationships are to be made available in such a way that they may be processed automatically; that is, computer applications, or agents, as they are often called, will be able to explore the Web to retrieve pertinent information. These agents are the future search engines. Once a search is made, the information will be retrieved based on the semantics embedded in the coding of the documents; this, together with other computer tools, will link this information to related Web information sources.

The word semantic pertains to the use of language, and language is of even greater importance in both the creation and the retrieval of information on the semantic web. Applying language in this way involves both human and machine-created components. In fact, the semantic web may require the user to define a context before a search begins, and the author of a document may have to assign a context to the document when it is created for the Web. This would eliminate the confusion in using a term such as the one in the banking example (Part 2: Problems not resolved by relevancy ranking) found earlier. To avoid this kind of confusion, the searcher would specify a context such as financial or navigation before a web search is initiated; the author would assign the document to a context before placing it on the semantic web. From there, the computer could search the Web for banking based on the context of the search itself as well as the context assigned to the Web resource.
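The idea of matching a searcher's context against an author-assigned context can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Document` class, the `search` function and the two sample documents are hypothetical, and a real semantic web agent would draw contexts from shared ontologies rather than plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    text: str
    context: str  # context assigned by the author, e.g. "financial"


def search(documents, term, context):
    """Return titles of documents that mention `term` and share the searcher's context."""
    term = term.lower()
    return [
        doc.title
        for doc in documents
        if doc.context == context and term in doc.text.lower()
    ]


# Hypothetical collection: both documents mention "banking".
DOCUMENTS = [
    Document("Online accounts", "Internet banking lets customers pay bills.", "financial"),
    Document("Flying a turn", "Banking the aircraft changes its heading.", "navigation"),
]

print(search(DOCUMENTS, "banking", "financial"))
print(search(DOCUMENTS, "banking", "navigation"))
```

The same keyword returns a different document for each context, which is the behaviour the paragraph above describes.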

Problems to be addressed before the semantic web can exist

Many obstacles remain to be overcome before the semantic web becomes a reality. Protocols and standards must be accepted and made readily available to enable the semantic web to move from the experimental stage to a fully developed information resource. The resources needed to create the infrastructure of the semantic web include mark-up languages, the Resource Description Framework (RDF, www.w3.org/RDF/) and ontologies, including classification schemes and thesauri.

Applications of the semantic web are still in their infancy and, in general, each is applied to a single discipline or to an area with a narrow focus. At this stage, the semantic web is experimental in nature, and it is not known how long it will take to develop to its full potential.

One might expect the semantic web to be used in a corporate setting where there is a general understanding of the nature of the information to be shared. The semantic web lends itself to use in the areas of e-commerce and knowledge management. Thus, one of the case studies (Part 4: Case studies: applications of the semantic web) provides an outline of the work that is being undertaken in several European corporate settings.

Any taxonomy or classification scheme that is sufficiently broad to cover all disciplines suffers from a lack of currency when new terms appear or usage changes; this is especially true in areas of research that are considered cutting edge. At the same time, classification schemes that cover all disciplines may lack the specificity needed for very precise applications; for example, they may not have been updated frequently and they may not contain the specific terms. The creation and maintenance of ontologies must overcome similar problems to ensure that they remain current; furthermore, they must be updated automatically for the semantic web to become a reality.

Despite the problems, exciting research is being conducted on the semantic web. In China, where there is widespread acceptance of a national classification scheme, the development of a semantic web has already begun. Nevertheless, the application that is described in this chapter (Part 4: Case studies: applications of the semantic web) is limited in scope and was used only for the discipline of computer science.

The strength of the Web arises from its grassroots development: publishing on the Web is almost as accessible to the individual as it is to large companies. Posting documents on the Web does not require a huge investment in time and money. HTML code can be learned or adapted quickly, which allows for a speedy way to publish on the Web. Consequently, the Web is a vast information resource known for its multiplicity of sources and for its lack of structure. To create the semantic web, the various elements of the infrastructure must be easily accessible to all. Can an XML (Extensible Markup Language) document be created as readily as an HTML document? In XML, the elements of a document, once coded, can be used as searchable data. XML provides the common syntax for machine-understandable statements. All parts of the document might be tagged to become data. HTML, by contrast, provided a simple page-formatting language with relatively few tags, such as <title>, <b> (bold) and <p> (paragraph), to control the layout on screen. It was easy to learn since the codes had a fairly simple application. A large body of information currently on the Web is not in XML format, and it is highly unlikely that all of it will be converted to XML.

Proponents of the semantic web expect that it will create another revolution in information retrieval should it become fully functional. It is expected to change the way in which information is created and retrieved.

Infrastructure of the semantic web

The semantic web proposes to restructure the Web to permit better information retrieval. To implement it, the supporting infrastructure must first be built. Documents will need to be written in a standard mark-up language other than HTML; ontologies must be readily available to define meaning and provide relationships among terms and Web documents; and computer agents will be needed to search the Web to locate this knowledge. The semantic web consists of data elements together with inference rules that govern the relationships among those data elements. Mark-up languages will be used to create documents in which the data is coded; ontologies will be developed to provide the standards for decision-making rules about the use of terms.

Mark-up languages

When mark-up languages such as XML are employed, metadata (data about the content of the document) can be embedded and coded in the document. HTML was basically a language to provide layout information; XML, on the other hand, is derived from SGML (Standard Generalized Markup Language) and is designed to describe the structure of the data in a document, much as a database does. A database implies a rigid structure in which records and fields are used as descriptive entities to describe objects. The process of creating an XML document for the semantic web would be analogous to using the entries in a personal address book to create a computer database. For each entry in the book, there would be an individual record in the database. Each record pertains to the address of one individual or one company. For each record, you might have a field for first name, last name, street address, city, state/province, and so forth. A document written in XML would contain many embedded codes to tag the data in the document. If a document described an author such as Tim Berners-Lee, the person creating the document might use a tag called author or software programmer, or any number of other tags, depending on the data that should be tagged in the published document. These tags are hidden codes used to identify information embedded in the webpages or text of the document. XML provides the structure; nevertheless, there must be consensus on the meaning of these tags if they are to be used automatically by computers.
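The address book analogy can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree` module. This is a rough illustration, not a prescribed format: the tag names (`address_book`, `record`, `first_name` and so on) and the sample entries are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical address-book entries; each dict plays the role of one database record.
ENTRIES = [
    {"first_name": "Ada", "last_name": "Example", "city": "Ottawa"},
    {"first_name": "Sam", "last_name": "Sample", "city": "Boston"},
]


def build_address_book(entries):
    """Wrap each entry in a <record> element with one child element per field."""
    book = ET.Element("address_book")
    for entry in entries:
        record = ET.SubElement(book, "record")
        for field_name, value in entry.items():
            ET.SubElement(record, field_name).text = value
    return book


def cities(book):
    """Read a single field back from every record, as a program could."""
    return [record.findtext("city") for record in book.findall("record")]


book = build_address_book(ENTRIES)
print(ET.tostring(book, encoding="unicode"))
print(cities(book))
```

Because every field is tagged, a program can pull out just the cities without understanding the rest of the document. It can only do so, however, if it already knows what a `city` tag means, which is the consensus problem noted above.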

Steven Cherry provides an excellent explanation of the way in which the XML documents will be used to retrieve information on the semantic web.

Right now, HTML coding serves mostly to control appearance and arrangement of text and images on a web page, so that only a few elements are tagged such as <title> and <bold>. With XML tags <price>, for instance a software agent might be able to comparison shop across different web sites, or update an account ledger after an e-purchase. [8]
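Cherry's comparison-shopping scenario can be sketched as follows. This is a simplified illustration only: the two shop names and their XML fragments are hypothetical stand-ins for pages an agent would fetch, and the sketch assumes both sites agree on the meaning of the `<price>` tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragments, standing in for pages fetched from two shops.
SHOP_PAGES = {
    "shop-a.example": "<item><name>Widget</name><price>12.50</price></item>",
    "shop-b.example": "<item><name>Widget</name><price>11.75</price></item>",
}


def cheapest(pages):
    """Return (site, price) for the lowest <price> found across the pages."""
    prices = {
        site: float(ET.fromstring(xml).findtext("price"))
        for site, xml in pages.items()
    }
    site = min(prices, key=prices.get)
    return site, prices[site]


print(cheapest(SHOP_PAGES))
```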

Resource Description Framework (RDF) is a term that is often found in descriptions of the semantic web. RDF repositories of metadata and ontologies must be available to be searched to assign meaning to the content of Web information. In RDF, a document makes assertions that particular things are related. Because RDF uses URIs (Uniform Resource Identifiers) to identify the things in these assertions, one Web resource can be described in relation to others on the Web. Ontologies build on such assertions by defining thesaurus-like relationships between words and phrases.
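RDF assertions are commonly written as subject, predicate, object triples. The sketch below models them as plain Python tuples rather than with an RDF library, so it shows only the shape of the data. The `example.org` URIs are hypothetical; the two predicate URIs are the Dublin Core `title` and `creator` elements.

```python
# Each statement is a (subject, predicate, object) triple.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
BERNERS_LEE = "http://www.w3.org/People/Berners-Lee/"

# The example.org resources are hypothetical placeholders.
TRIPLES = [
    ("http://example.org/article", DC_TITLE, "The Semantic Web"),
    ("http://example.org/article", DC_CREATOR, BERNERS_LEE),
    ("http://example.org/other", DC_CREATOR, "http://example.org/people/someone"),
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """Return every subject that is related to `obj` by `predicate`."""
    return [s for s, p, o in triples if p == predicate and o == obj]


print(objects(TRIPLES, "http://example.org/article", DC_TITLE))
print(subjects(TRIPLES, DC_CREATOR, BERNERS_LEE))
```

Because the creator is identified by a URI rather than a name string, an agent can follow it to find other resources related to the same person.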

Ontologies

Ontologies are also needed to facilitate this use of language on the semantic web. Ontologies define relationships between terms and they include classification schemes, thesauri and similar language tools.

The best definition of an ontology comes from the group working on the semantic web (www.w3.org/TR/2002/WD-webont-req-20020307/#onto-def). In their document [9], they provide the following description of an ontology.

The word ontology has been used to describe artifacts with different degrees of structure. These range from simple taxonomies (such as the Yahoo hierarchy), to metadata schemes (such as the Dublin Core), to logical theories. The Semantic Web needs ontologies with a significant degree of structure. These need to specify descriptions for the following kinds of concepts:
  • Classes (general things) in the many domains of interest;
  • The relationships that can exist among things;
  • The properties (or attributes) those things may have.
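The three kinds of concepts listed in this description (classes, relationships and properties) can be illustrated with a toy data structure. The sketch below is an assumption-laden illustration, not part of the W3C document: the class names, the `written_by` relation and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    name: str
    parent: Optional[str] = None  # broader class, if any
    properties: set = field(default_factory=set)


# Hypothetical mini-ontology: classes and the properties they may have.
ONTOLOGY = {
    c.name: c
    for c in [
        OntologyClass("Document", properties={"title", "language"}),
        OntologyClass("Book", parent="Document", properties={"publisher"}),
        OntologyClass("Person", properties={"name"}),
    ]
}

# Relationships that can exist among things: (class, relation, class).
RELATIONS = [("Document", "written_by", "Person")]


def ancestors(ontology, name):
    """Walk up the parent links and list the broader classes, nearest first."""
    chain = []
    parent = ontology[name].parent
    while parent is not None:
        chain.append(parent)
        parent = ontology[parent].parent
    return chain


def all_properties(ontology, name):
    """Collect a class's own properties plus those inherited from broader classes."""
    props = set(ontology[name].properties)
    for ancestor in ancestors(ontology, name):
        props |= ontology[ancestor].properties
    return props


print(ancestors(ONTOLOGY, "Book"))
print(sorted(all_properties(ONTOLOGY, "Book")))
```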

Standardized ontologies are being created. www.W3C.org provides examples of ontologies under development. Work on an ontology for Mathematics has already started. With the tools created for the semantic web, computer agents will be able to process the language of the user's search query.

Mapping of featured search engines to Dublin Core

Although much of the semantic web is still futuristic, one type of ontology has been implemented to a certain extent in today's Web. The Dublin Core is an example of a metadata ontology. Ed Baylin explains how it has been implemented in relation to the search engines featured in this book.

Ed's Explanation:
Featured Search Engines versus the Dublin Core
[10]

To facilitate building search engine indices, various standardized ways of setting up metadata for unstructured (and also structured) data have been proposed, but only one has been implemented to any extent so far. This is the Dublin Core (http://dublincore.org/documents/dces/) ontology of 15 fields, including, among others, fields for author, document language, creation or last change date, and period of applicability. Each field can be refined for particular uses, and, with the approval of the organization responsible for the Dublin Core, further fields can be added.
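To make the fifteen fields concrete, the following Python sketch lists the element names and renders a record as HTML `<meta>` tags using the common `DC.` name prefix. It is a minimal sketch: the `dc_meta_tags` helper is hypothetical, the sample record describes this article, and its `language` value is an assumption.

```python
from html import escape

# The fifteen elements of the Dublin Core metadata element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dc_meta_tags(record):
    """Render a dict of Dublin Core values as HTML <meta> tags, in element order."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return [
        f'<meta name="DC.{name}" content="{escape(record[name], quote=True)}">'
        for name in DC_ELEMENTS
        if name in record
    ]


# Sample record; the language code is an assumption.
RECORD = {
    "title": "Vision of the Semantic Web",
    "creator": "Darlene Canning",
    "language": "en",
}

for tag in dc_meta_tags(RECORD):
    print(tag)
```

A crawler that recognizes these names can index the title, creator and language as separate fields instead of guessing them from the page text.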

The Dublin Core metadata scheme is of particular interest to reference librarians because it describes data from a bibliographic perspective: its fields are applicable to just about any Web document (called a "resource" in Web terminology).

While the Dublin Core's scheme can be implemented in varying degrees of complexity, as the need and opportunity arise, even the Dublin Core metadata structure at its most complex is too limited to fully suit all disciplinary and application domains. The RDF (Resource Description Framework) method, based on XML (Extensible Markup Language), provides another, more flexible and complex scheme for bibliographic and other kinds of metadata. RDF requires more expertise than even the more complex kinds of Dublin Core refinements. It allows the users in each domain (discipline or application area) to define their own schemes for indexing and classifying data. The Dublin Core can be included as a part of any particular RDF scheme.

 

The cell entries in the leftmost column of the table refer to appropriate rows or sets of contiguous rows in the Features Control Center of our book, "Effective Internet Search."

Ed's table below maps the tenuous relationships between the Dublin Core's fifteen standard fields and the metadata-based filters of the featured search engines in this book. From it one can conclude that the relationship between the metadata parameters in the Dublin Core and those in the general-purpose search engines varies considerably by search engine.

In general, some of the 15 Dublin Core fields are better supported than others. The better mapping fields, shown first in the table below, are those for Title, Resource Identifier, Date, Relation, and Language.

Computer agents

We already use language to query the Web, but the semantic web will involve the use of computer agents to go out to the Web to seek more pertinent information. Much of what is accessible on the Web right now was designed for human interpretation with a search engine providing the access. The semantic web, on the other hand, will automate the query process by allowing data and programs to be processed automatically by computers. This requires a level of structure and standardization previously unseen on the Web.
 


DUBLIN CORE METADATA FIELD versus SEARCH ENGINE

Disclaimer: In all cases, the cells in the following table contain nothing more than our rough guesses, and depend to some degree on subjective interpretation of what is meant versus what the search engine provides.

Fields that map fairly well

| Dublin Core metadata field | AlltheWeb | AltaVista | Copernic | Google | MSN Search |
| --- | --- | --- | --- | --- | --- |
| Title: Name given to the document or "resource" | Document title section | Document title section | Document title section | Document title section | Document title section |
| Resource identifier: e.g., the Internet address (URL), ISBN | URL and certain parts of the URL | URL and certain parts of the URL | URL and certain parts of the URL | URL and certain parts of the URL | URL and certain parts of the URL |
| Date: An important date in the history of the resource, usually of creation or last update | File age based on last site crawl date | File age based on last site crawl date | File age based on last site crawl date | File age based on last site crawl date | Filter not supported |
| Relation: Reference to a related resource | Various links from the findings display | Various links from the findings display | Links provided by keyword analysis | Various links from the findings display; and related sites filter | Various links from the findings display |
| Language: Document language | Document language | Document language | Document language | Document language | Document language |

Fields that map a little

| Dublin Core metadata field | AlltheWeb | AltaVista | Copernic | Google | MSN Search |
| --- | --- | --- | --- | --- | --- |
| Format: e.g., file format, media format, type of device needed to record the file | File format? | File format? | File format by search tool category? | File format? | File format and maybe embedded content type? |
| Coverage: Resource scope, e.g., time period, area of space, or jurisdiction | Geographical region of the server? | Geographical region of the server? | Geographical region of the server? | Geographical region of the server? | Geographical region of the server? |
| Subject and keywords: Subject covered and keywords | No | Subject directory? | Search tool categories; keywords extracted? | Subject directory? | Subject directory? |
| Resource type: A meaningful way, given the domain of interest, of classifying the resource | Specialized search entry interfaces? | Specialized search entry interfaces? Subject directory? | Search tool categories? Keywords extracted? | Specialized search entry interfaces? Subject directory? | Specialized search entry interfaces? Subject directory? |
| Description: e.g., a table of contents, an abstract, some introductory text | Metatag descriptions extracted by crawler? | Metatag descriptions extracted by crawler or submitted to subject directory? | Depends on data from other search engines? | Metatag descriptions extracted by crawler or submitted to subject directory? | Metatag descriptions extracted by crawler or submitted to subject directory? |
| Creator: Entity responsible for creating the document or file | No, except for news source? | No | No | No, except for message author in discussion groups? | No |

Fields that do not map

| Dublin Core metadata field | AlltheWeb | AltaVista | Copernic | Google | MSN Search |
| --- | --- | --- | --- | --- | --- |
| Source: Larger work from which the document or other file is derived | No | No | No | No | No |
| Publisher: Party responsible for making the "resource" available | No | No | No | No | No |
| Contributor: Party other than the creator or publisher | No | No | No | No | No |
| Rights management: e.g., identifier of intellectual property | No | No | No | No | No |


Case studies: applications of the semantic web

Project "Vision" at Peking University Library

Jun [11] describes an application of the semantic web. His article provides an example of one way to create a semantic web and to apply the tools of the semantic web to the knowledge management of digital libraries.

Jun reports on the project called Vision, which was undertaken in Peking University Library. This application is limited to the discipline of computer science. The classification and indexing terms from the computing domain found in the Chinese Classification and Thesaurus (CCT) were combined with the bibliographic information for all Chinese materials in computer science held in Peking University Library and published between 1990 and 1999. These were used to create a database of more than 5000 bibliographic records for the Vision system. In Vision, the CCT terms were integrated with the metadata to create a knowledge network. In the semantic web, the process of maintaining the ontologies should be automated and carried out by computer programs.

One of the drawbacks of classification and thesaurus schemes lies in the static nature of the terms found in these tools. They reflect language and terms in use at a certain point in time; a process has to exist to update and add new terms. In Vision, the new concepts and categories are extracted from the titles of the computer books. The words found in the titles of the books in a discipline such as computer science reflect the changes and new developments in terminology that are a landmark of that discipline. The words from the titles provide the essential dynamic element that will be needed in creating the tools that contain current language. This is one of the core components of a true semantic web.

Why is Vision considered an application of the semantic web? The best way to explain this is by using the example provided by Jun. He describes the way in which the words from the title of the book called "Internet Firewall Technologies" are integrated into the knowledge network of Vision. In this example, it is assumed that the term firewall is new and it has been encountered for the first time. Decisions must be made to determine if it should be integrated into the ontology as a term. From the bibliographic record, the book was found to have been indexed under the subject classification of network security which existed already in the ontology. It is now possible to assume the term firewall might be a co-concept of network security. These two terms are narrower terms located under the category of computer networks. To further add to the creation of the ontology for this project, the author name is added to the author category to become the authoritative name for this author; the publisher is treated in the same way. For each new book title, it is possible that the publisher name may already be present, whereas the author names need to be updated and revised on a regular basis, since many authors only write one or two books.
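The firewall example can be sketched in Python as follows. This is not Jun's implementation; it is a loose illustration under simplifying assumptions, with the ontology reduced to a dictionary of broader categories and their narrower terms. The function names are hypothetical.

```python
# Narrower-term lists keyed by broader category; contents follow the example in the text.
ONTOLOGY = {"computer networks": ["network security"]}


def broader_of(ontology, term):
    """Return the broader category that lists `term`, or None if it is unknown."""
    for broader, narrower in ontology.items():
        if term in narrower:
            return broader
    return None


def add_co_concept(ontology, new_term, indexed_under):
    """Place `new_term` beside the existing subject term the book was indexed under."""
    broader = broader_of(ontology, indexed_under)
    if broader is None:
        raise KeyError(f"{indexed_under!r} is not in the ontology")
    if new_term not in ontology[broader]:
        ontology[broader].append(new_term)
    return broader


# "firewall" comes from the book title; "network security" from its bibliographic record.
add_co_concept(ONTOLOGY, "firewall", "network security")
print(ONTOLOGY)
```

In this sketch the existing subject heading is what decides where a new title word is placed, which mirrors the reasoning in the example. A real system would also need a way to decide whether a title word is worth adding at all.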

In the scheme described, the author and publisher names are linked to the document title; the terms firewall and network security are found under the category of computer networks. Thus, if you were to search any of these terms or names, you would be presented with the now familiar tree view of the Windows operating system, in which a "plus" or a "minus" sign beside a term indicates whether there is more information to be found if you continue. In this case, the links would take you to other concepts or subject categories; from the author or publisher lists, you could browse to locate, for example, all of the books by a particular publisher.

Semantic web application in knowledge management

Fensel gives an overview of the issues involved in the creation of knowledge management from the large quantity of data that is a part of the corporate environment. In this application called On-to-Knowledge, Fensel states:

We are building an ontology-based tool suite that efficiently processes the large numbers of heterogeneous, distributed, and semi-structured documents typically found in large company intranets and on the internet. [12]

A brief description is provided for the tools that are needed to convert data into vital knowledge. These tools include those needed to extract the data, to represent it and to allow for subsequent easy access. This is a complex system but it is considered well worth the effort involved in its creation.

This paper demonstrates an application of the semantic web that has already started. Swiss Life is using On-to-Knowledge to provide ready access to documentation, including accounting standards and accurate and functional job descriptions. BT, the large British telecommunications corporation, is investigating it for its customer service operations. They are developing best practices for client service personnel and also learning users' interests and preferences with minimal feedback from the user. A brief synopsis is also given of the work being done with EnterSearch Ab "to improve knowledge transfer between in-house researchers and outside specialists via its Web site" [12, pg. 59].

Fensel does not state that this is a simple process, but it is implied that once the systems needed to implement the semantic web in each of these scenarios are fully studied, enough knowledge will have been gained to allow for their application in many areas. One gathers from this that these computer applications should be able to do most of the work once the complexity of the systems is fully analyzed and the right tools are applied. This is one of the most fascinating live applications of the semantic web seen to date. Its potential for success appears to be good, but only in those corporations where adequate resources are devoted to the project to make it work.
 



Effective Internet Search: E-Searching Made Easy!     © Baylin Systems, Inc., 2006