The Languages of the Semantic Web


June 2002

To create the Web as we know it, Tim Berners-Lee put aside much of the existing research on hypertext technologies and built a simple system that was easy to understand, use, and maintain. This simplification became an important factor in the Web's rapid growth. Despite this success, the realities of information management are illuminating some problems of simplification. While the Web continues to be useful for retrieving information from individuals or organizations of close collaborators, it is much harder to use if you want to gain a broad understanding of a particular subject.

For example, while we can visit the Burton snowboards Web site to find out what products the company offers, read about its corporate policies and philosophies, and even browse a selection of links about snowboarding in general, it's much more difficult to find a wider perspective on snowboarding as an industry and interest. It's even harder to bind together the many Web sites that discuss snowboarding.

This is where the Semantic Web comes in. The Semantic Web is a vision of a next-generation network that lets content publishers provide notations designed to express a crude "meaning" of the page, instead of merely dumping arbitrary text onto a page. Autonomous agent software can then use this information to organize and filter data to meet the user's needs.

Since the current Web's success, there has been much effort to refactor it along these lines. Proponents of this goal often refer to it as the Intelligent Web. For those who focus on the problem of how to express the context, or semantics, of content in distributed systems like the Web, this goal is called the Semantic Web.

Even though this next-generation Web has yet to become a reality, much of the current work on the Semantic Web centers on a variety of technologies that are already in widespread, practical use. Chief among them is the Resource Description Framework (RDF), which lets content creators express structured metadata statements describing the resources that URIs identify.

Limits of Today's Web

With the current state of the Web, there are only two real methods of gaining broader information about documents. The first is to use a directory or portal site, and thus rely on human editors to scour the Web and appropriately categorize pages and their associated links.

Such portals are the heroes of today's Web. After all, the most effective information management tool on Earth is still the human librarian, and probably will be for years to come. The problem is that directories take tremendous effort to maintain. Finding new links, updating old ones, and maintaining the database technology add to a portal's administrative burden and operating costs.

Search engines are the alternative. Good search engines pay special attention to metadata in the pages that they spider and add to their index databases. In the simplest case, this metadata might take the form of content in <meta> tags. More advanced search engines, like Google, rely on more subtle information. For instance, Google's widely touted algorithm evaluates not only the occurrence of keywords on a page, but also the number of outside links to the page itself, as a measure of its importance or popularity.
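As a rough illustration of the simplest case, the following Python sketch pulls name/content pairs out of <meta> tags using only the standard library. The sample page and the MetaTagCollector class are invented for this example, and a real spider would have to cope with far messier markup.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.metadata[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of meta name -> content found in html_text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.metadata


# Hypothetical page, used only for this example.
SAMPLE_PAGE = """
<html><head>
<title>Snowboard reviews</title>
<meta name="keywords" content="snowboarding, reviews, gear">
<meta name="description" content="Independent snowboard reviews">
</head><body></body></html>
"""

if __name__ == "__main__":
    for name, content in extract_meta(SAMPLE_PAGE).items():
        print(name, "=", content)
```

Nothing in the page tells the indexer what "keywords" or "description" mean beyond convention, which is the gap the rest of this article is concerned with.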

Search engines take less human effort on the content management end, but they require a frightfully large resource investment. It's also very difficult to produce valuable indices efficiently. It's no secret that even some of the most advanced search engines are still primitive enough that queries often turn up an unmanageable number of poorly differentiated hits. A user who tries to finely craft his or her search to zero in on a point risks filtering out potentially relevant search results.

The Web needs to support something in between portals and search engines. Of course, until there's a server as sophisticated as HAL 9000 (but, hopefully, not as neurotic), we probably won't be able to completely replace the human portal editor with a computer program. But if we could provide standardized means for Web publishers to catalog and classify their own content, then we could develop more effective agents that work on this substrate of better-organized information.

The result of having better standard metadata would be a Web where users and agents could directly tap the latent information in linked and related pages. This would help free us from having to scour for information site by site, and from relying on portals and search engines. It wouldn't be hard to outfit each user with personal portal generators and search agents tailored to their particular interests, needs, and constraints. These agents might even be configured to learn and respond to personal details with the help of artificial intelligence techniques.

The Semantic Web's Challenges

It's fine to talk about enabling each Web publisher to properly place content in context, but there are several problems to overcome before any such initiative will gain critical mass:

  • Complexity. Any technology that the average Web developer can't grasp in a day and apply proficiently in a week is doomed. In addition, a successful technology will have to be integrated into current Web development and maintenance tools. Semantics are quite arcane, and it won't be easy for semantic technologies to meet these criteria.
  • Abuse. Practices like meta-tag spamming, and even trademark hijacking, show that any system that lets people set their own context is subject to abuse. Knowing the value of the Burton snowboards brand, an unscrupulous competitor might want to tell an agent that it is the Burton company in hopes of directing some undeserved attention to its site. Semantic Web technologies will need a mostly automated system for establishing trust in the assertions that Web publishers make. This concept is often referred to as the Web of trust.
  • Proprietary Technology. Because of the diversity in developers and development tools, Semantic Web technology will have to be politically and technically open for implementation and use. If it requires royalty payments to any party, open source advocates and competing Web technology vendors will boycott it. If it requires a specific plug-in or module, most developers and users won't even bother installing it.

Semantic Web proponents are looking to XML and RDF to meet these challenges. XML would let a publisher use markup that differentiates a catalog entry of a snowboard product from an independent review of the same item. However, this method relies on custom tags, and agents need a way to grasp the "meaning" in such tags, a facility called semantic transparency. Web metadata is the key to providing it. Because of its importance, the W3C developed RDF as a standard for Web metadata.

Inside RDF

RDF is indeed quite simple at its core, though it can get hairy in short order. It is a model of statements made about resources. A resource is anything with an associated URI. In practice, it's most often a document on the Web, but it can be anything to which people have agreed to assign a URI. In this way, one could even use RDF to make statements about abstractions like peace, or even imaginary entities like Gandalf the Wizard. RDF's statements are hardly as complex as those we use in natural language. They have a uniform structure of three parts: predicate, subject, and object. For example: The author [predicate] of The Lord of the Rings [subject] is J.R.R. Tolkien [object].
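The three-part structure is easy to mirror in code. The following is a minimal Python sketch, not an RDF library: the Statement type and both example.org URIs are assumptions made up for illustration.

```python
from collections import namedtuple

# Every statement has the same three parts, whatever it is about.
Statement = namedtuple("Statement", ["subject", "predicate", "object"])

# Hypothetical URIs, invented for this example.
LOTR = "http://example.org/books/the-lord-of-the-rings"
AUTHOR = "http://example.org/terms/author"

# "The author of The Lord of the Rings is J.R.R. Tolkien."
statement = Statement(subject=LOTR, predicate=AUTHOR, object="J.R.R. Tolkien")

if __name__ == "__main__":
    print(statement.subject)
    print(statement.predicate)
    print(statement.object)
```

Note that the subject and predicate are both identified by URIs, while the object here is a plain text value.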

This simplicity and uniformity make RDF's statements generic. They can be used to encode the above natural-language statement, as well as, say, an object-oriented model. For example, if you had written a class called Person, and that class was instantiated as an object called jrrt, your statements would be: Person's type is Class; and jrrt's type is Person. The connections between various subjects and objects can be much more complex than this, of course, especially if you think about inheritance and properties and other attributes that classes take on. As you might guess, this approach can be tedious if done by hand.
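The Person and jrrt statements might be written down as follows. This is only a sketch: the example.org namespace is invented, while rdf:type and rdfs:Class are the standard RDF and RDF Schema terms that seem the natural fit for "type" and "Class" here.

```python
# Standard RDF / RDF Schema terms.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"

# Hypothetical namespace for the object model in this example.
EX = "http://example.org/model#"

STATEMENTS = [
    (EX + "Person", RDF_TYPE, RDFS_CLASS),    # Person's type is Class
    (EX + "jrrt", RDF_TYPE, EX + "Person"),   # jrrt's type is Person
]


def types_of(resource, statements):
    """Return every type asserted for the given resource."""
    return [o for s, p, o in statements if s == resource and p == RDF_TYPE]


if __name__ == "__main__":
    print(types_of(EX + "jrrt", STATEMENTS))
```

Inheritance and properties would add further statements of the same shape, which is where writing them by hand starts to become tedious.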

RDF lets you express such statements in a formal way that software agents can read and act on. It lets us express a collection of statements as a graph, as a series of (subject, predicate, object) triples, or even in XML form. The first form is the most convenient for communication between people, the second for efficient processing, and the third for flexible communication with agent software.
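To make the first two forms concrete, here is a small Python sketch that holds a collection of statements as triples, answers simple pattern queries over them, and regroups them into a graph-like view. The triples and the prefixed names in them (ex:, dc:, s:) are invented for illustration, and the helper functions are assumptions rather than any standard API.

```python
from collections import defaultdict

# Hypothetical triples; prefixed names stand in for full URIs.
TRIPLES = [
    ("ex:snowboarding", "dc:title", "Snowboarding"),
    ("ex:snowboarding", "s:SubTopic", "ex:reviews"),
    ("ex:reviews", "dc:title", "Reviews"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None matches anything."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def as_graph(triples):
    """Group triples by subject: each node maps to its labeled arcs."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)


if __name__ == "__main__":
    print(match(TRIPLES, predicate="dc:title"))
    print(as_graph(TRIPLES))
```

The flat list is what a program would index and search, while the grouped view is closer to the node-and-arc picture a person would draw.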

If a portal were to create a directory of snowboarding sites, it could use such an RDF/XML document to help RDF-enabled agents and tools better understand the information that the sites offer. Example 1 is loosely based on the format used by the Open Directory Project (www.dmoz.org), a community effort to build a universal Web site directory.

The first document element, rdf:RDF, tells an RDF parser that the child elements can be interpreted as RDF constructs. Note the namespace declarations. The first one defines the core namespace for RDF constructs. The second is a special namespace that is controlled by our fictitious snowboarding Webmaster community.

The first child is an rdf:Description element, which tells the RDF parser that we have a resource to describe. The rdf:about attribute notes the URI of the resource being described. In this case, it's a local resource to the community site that refers to the overall topic of snowboarding (according to some agreement the community will have made).

The dc:title child is known as a property element, and gives the predicate of a statement to be made. The object of the statement is given by the text content. In this case, we are making a statement similar to this: Snowboarding is the title of the topic identified at http://rdfinference.org/eg/snowboard/metadata/topics/references.

Next comes another property element, s:SubTopic, whose object is another resource rather than a text string. It relates the topic to its sub-topics, establishing the directory hierarchy. The next description is for this sub-topic. It also has a title, and it relates the sub-topic to a link through the s:Link property. Here the linked resource is given as a child of the property element, and so becomes the object of the link statement. In this case, the link is to the site www.geocities.com/iliktoast/, which we further describe with a title and a description, both defined according to the Dublin Core Metadata Initiative (DCMI).
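The way an agent might read such a directory can be sketched in Python with the standard library's XML parser. The embedded document is a hypothetical stand-in written to match the description above, not the article's own listing: the example.org URIs, the s: namespace URI, and the title and description text are all invented. The walker handles only the three constructs discussed here (text objects, rdf:resource attributes, and nested descriptions) and is nowhere near a complete RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = f"{{{RDF_NS}}}Description"
ABOUT = f"{{{RDF_NS}}}about"
RESOURCE = f"{{{RDF_NS}}}resource"

# Hypothetical directory fragment, invented for this example.
DIRECTORY_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:s="http://example.org/snowboard/schema#">
  <rdf:Description rdf:about="http://example.org/snowboard/topics/snowboarding">
    <dc:title>Snowboarding</dc:title>
    <s:SubTopic rdf:resource="http://example.org/snowboard/topics/reviews"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/snowboard/topics/reviews">
    <dc:title>Reviews</dc:title>
    <s:Link>
      <rdf:Description rdf:about="http://www.geocities.com/iliktoast/">
        <dc:title>Sample site title</dc:title>
        <dc:description>Sample site description</dc:description>
      </rdf:Description>
    </s:Link>
  </rdf:Description>
</rdf:RDF>
"""


def property_uri(tag):
    """Turn ElementTree's '{namespace}local' tag into a predicate URI."""
    namespace, local = tag[1:].split("}", 1)
    return namespace + local


def triples_from(description):
    """Yield (subject, predicate, object) for one rdf:Description element."""
    subject = description.get(ABOUT)
    for prop in description:
        predicate = property_uri(prop.tag)
        nested = prop.find(DESCRIPTION)
        if prop.get(RESOURCE) is not None:
            # Object is another resource, named by attribute.
            yield (subject, predicate, prop.get(RESOURCE))
        elif nested is not None:
            # Object is a resource described in place, as a child element.
            yield (subject, predicate, nested.get(ABOUT))
            yield from triples_from(nested)
        else:
            # Object is the text content of the property element.
            yield (subject, predicate, (prop.text or "").strip())


def parse_rdf(xml_text):
    """Collect triples from every top-level rdf:Description."""
    root = ET.fromstring(xml_text)
    result = []
    for description in root.findall(DESCRIPTION):
        result.extend(triples_from(description))
    return result


if __name__ == "__main__":
    for triple in parse_rdf(DIRECTORY_RDF):
        print(triple)
```

Each property element contributes exactly one statement, so the nested description of the linked site simply adds its own statements after the link statement that points to it.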

Of course, it would be best for the community if each Web page could maintain its own metadata. The RDF specification provides a convention for people to place RDF within HTML pages. Example 2 illustrates how the maintainers of various snowboarding sites might use RDF to do this for their own pages.

The empty rdf:about="" attribute is a special URI convention that refers to the current document. Other than that, the code is similar to that in the RDF directory. Note that this data can be maintained in tandem with regular HTML <meta> tags to support existing search engines and RDF agents. One hopes that vendors of popular Web authoring tools will soon produce products that automatically represent metadata in both RDF and <meta> formats.
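A tool that keeps both representations in step might look roughly like the Python sketch below. It is an assumption-laden illustration, not the output of any existing authoring product: the field names, the sample values, and the helper functions are invented, and the field names are assumed to be valid both as <meta> names and as Dublin Core element names. The resolve_about helper shows how an agent could treat the empty rdf:about as a relative reference to the page it was found in.

```python
from html import escape
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def resolve_about(about, document_url):
    """Resolve rdf:about against the page URL; "" yields the page itself."""
    return urljoin(document_url, about)


def meta_tags(fields):
    """Render the fields as classic HTML <meta> tags."""
    return "\n".join(
        f'<meta name="{escape(name)}" content="{escape(value)}">'
        for name, value in fields.items()
    )


def rdf_block(fields):
    """Render the same fields as an RDF description of the current page."""
    properties = "\n".join(
        f"    <dc:{name}>{escape(value)}</dc:{name}>"
        for name, value in fields.items()
    )
    return (
        f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">\n'
        '  <rdf:Description rdf:about="">\n'
        f"{properties}\n"
        "  </rdf:Description>\n"
        "</rdf:RDF>"
    )


# Hypothetical page metadata, used only for this example.
FIELDS = {"title": "Toast & Powder", "description": "Snowboarding trip reports"}

if __name__ == "__main__":
    print(meta_tags(FIELDS))
    print(rdf_block(FIELDS))
    print(resolve_about("", "http://example.org/snowboard/index.html"))
```

Generating both forms from one set of fields keeps them from drifting apart, which is the main risk of maintaining them by hand.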

