Ever since the Digital Big Bang -- that is, the creation of the World Wide Web -- the cyber universe has been expanding at an explosive rate. The internet has doubled in size every year for the past eight years and shows no signs of slowing down. The WWW, the most colourful part of the internet, now contains more than 10 terabytes of information. Hundreds of times more information is spinning on the hard drives of computers that can be reached via the internet.
The problem is keeping up with the useful information without drowning in a sea of irrelevancies. The current tools for finding information on the internet are known as search engines. Most require one to formulate a query -- for example, "Find all articles in which IBM's Research labs are mentioned." The inquirer then runs the query and sifts through the results.
A typical query returns hundreds of results, most of them irrelevant. "If you do a query on 'rainbow' and 'weather' using AltaVista," says Norm Pass, a manager at IBM's Almaden Research Center, "you can end up getting references to bridges!"
To make matters worse, keeping track of changing information requires that searches be repeated frequently. Most people simply do not have the time. "If you didn't do that this morning," says Almaden's Daniel Ford, "you are behind the game."
Gathering and summarising
To help users of the internet to keep up, Pass, Ford, Qi Lu, Toby Lehman and other Almaden researchers have launched a project called Grand Central Station. "The idea is to enable people to express an interest just in something that's important to them," explains Lehman. "For me, it's knowing the IBM stock price and developments in my project. I'd like to know whenever the project gets mentioned in the external world or internally at IBM. Also, I want to know if my copyright is showing up someplace." Once these individual requests are formulated, Grand Central Station builds a profile of the user and keeps him or her informed whenever something new appears on the digital horizon.
The GCS system consists of two basic components. One constantly gathers and summarises new information. The other matches this information against the profiles of individual users and delivers the relevant titbits.
An innovative aspect of GCS is that it does not deal only with the user's desktop machine. The technology can also push the information to devices such as Personal Digital Assistants (PDAs) that have a connection to the internet. The Almaden team calls the concept of having information available as needed "just-in-time information". The parts of GCS that go out and look for information on sales figures, airport directions, patent citations and box scores are devices -- actually computer programmes running on workstations -- known as Gatherers and based on the University of Colorado's Harvest system. To handle the information explosion, GCS splits up the task of searching among several Gatherers. "The idea is that, because the digital universe is big, there's a lot of room for expanding search," says Ford.
Vacuuming the web
The first thing a Gatherer does is vacuum up all available information. All search engines currently available on the internet work in one of two ways. "Crawlers", such as Digital's AltaVista and Wired's HotBot, try to visit every site on the web, indexing all the information they find there. Hierarchical engines like Yahoo! are more like card catalogs.
Crawlers suffer from Pass's "rainbow" syndrome, producing too many irrelevant hits. Hierarchical engines suffer from the opposite problem: they can miss information that does not fit into their carefully constructed scheme. But both types of engine share a major shortcoming -- they simply ignore most of the information in the digital universe. GCS uses a crawler designed to sniff out obscure information that other search engines miss. The crawler can communicate using most of the popular network protocols, which enables it to access information from a variety of data sources such as web servers, FTP servers, database systems, news servers and even CICS transaction servers. It can, for example, track down vast file systems on machines in dozens of formats that are not part of the graphical WWW. This data can take the form of corporate presentations, database files, Java bytecode, tape archives and much more.
The crawler passes the information that it discovers to the next part, called the Recogniser. The Recogniser determines what kind of information -- database files, web documents, e-mails, graphics or sounds -- the crawler has unearthed, and passes it on to the Selector. The Selector filters the information to remove irrelevant material before handing it off to the Summariser.
The Summariser is actually a collection of plug-in programmes, one for each of the data types the Recogniser can recognise. Each produces a summary represented in a metadata format known as SOIF (Summary Object Interchange Format).
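The hand-off from Recogniser to Selector to Summariser can be pictured with a small sketch. It is an illustration only, not the GCS code: the names RawItem, recognise, select and SUMMARISERS are hypothetical, and recognising a data type from a file extension is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RawItem:
    """Something the crawler found: a location plus its raw text content."""
    url: str
    content: str


def recognise(item: RawItem) -> str:
    """Recogniser: guess the data type (naively, from the file extension)."""
    lowered = item.url.lower()
    if lowered.endswith((".html", ".htm")):
        return "web"
    if lowered.endswith(".eml"):
        return "email"
    return "unknown"


def summarise_web(item: RawItem) -> Dict[str, str]:
    """Plug-in for web documents: use the first line as the summary."""
    first_line = item.content.strip().splitlines()[0]
    return {"type": "web", "url": item.url, "summary": first_line}


def summarise_email(item: RawItem) -> Dict[str, str]:
    """Plug-in for e-mails: use the Subject header as the summary."""
    subject = next(
        (line[len("Subject:"):].strip()
         for line in item.content.splitlines()
         if line.startswith("Subject:")),
        "(no subject)",
    )
    return {"type": "email", "url": item.url, "summary": subject}


# Plug-in registry: supporting a new data type means adding one entry here.
SUMMARISERS: Dict[str, Callable[[RawItem], Dict[str, str]]] = {
    "web": summarise_web,
    "email": summarise_email,
}


def select(item: RawItem, data_type: str) -> bool:
    """Selector: drop empty items and types that have no summariser."""
    return data_type in SUMMARISERS and bool(item.content.strip())


def gather(items: List[RawItem]) -> List[Dict[str, str]]:
    """Run each crawled item through recognise -> select -> summarise."""
    summaries = []
    for item in items:
        data_type = recognise(item)
        if select(item, data_type):
            summaries.append(SUMMARISERS[data_type](item))
    return summaries
```

Every plug-in returns a record with the same keys, so whatever consumes the summaries never needs to know which data type produced them.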
Future versions of the Summariser will produce summaries in XML/RDF (eXtensible Markup Language/Resource Description Framework), an emerging standard for metadata representation. The metadata for a web page, for example, might contain its title, date of creation and an abstract if one is available, or the first paragraph of text if it's not. As new programmes are developed that are more intelligent about understanding documents, they can easily be incorporated into the open architecture of GCS.
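A web-page summary of this kind might be built and written out as in the sketch below. The record layout loosely follows the attribute{size}: value form documented for Harvest's SOIF; the attribute names (Title, Date-Created, Description) and both function names are assumptions made for illustration, not the fields GCS actually emits.

```python
from typing import Dict, List, Optional


def page_metadata(title: str, created: str, paragraphs: List[str],
                  abstract: Optional[str] = None) -> Dict[str, str]:
    """Use the abstract if there is one, otherwise the first paragraph."""
    description = abstract if abstract else (paragraphs[0] if paragraphs else "")
    return {"Title": title, "Date-Created": created, "Description": description}


def to_soif_like(url: str, attributes: Dict[str, str]) -> str:
    """Render attributes as a SOIF-style record: name{byte-size}:<TAB>value."""
    lines = ["@FILE { " + url]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))  # size of the value in bytes
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines)
```

Because every record has the same shape whatever the data type, the same few lines of code can collect and search summaries of web pages, e-mails or database files alike.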
Regardless of the data type, all SOIF summaries look the same. That makes them easy to collect, classify and search. A web server associated with each Gatherer makes the SOIFs available to a central component called the Collector. From the SOIFs, the Collector creates a database that is essentially a map of the digital universe (see below). The Collector also makes sure that the Gatherers do not step on each other's toes. For example, when the Gatherer looking for information in North America comes across a link to Japan, it informs the Collector, which passes this information on to the Japan Gatherer. Gatherers are initially assigned by a GCS administrator to specific domains in the digital universe, but over time they may migrate dynamically to distribute the overall load of the system.
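The Collector's traffic-cop role can be sketched in a few lines. This is a guess at the idea rather than the GCS design: the Collector class, its method names and the choice to carve up the digital universe by host-name suffix (".jp" for Japan, say) are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlparse


class Collector:
    """Routes newly discovered links to the Gatherer assigned to their domain."""

    def __init__(self, assignments: Dict[str, str], default_gatherer: str) -> None:
        # assignments maps a host-name suffix (e.g. ".jp") to a Gatherer name.
        self.assignments = assignments
        self.default_gatherer = default_gatherer
        # Links waiting to be picked up, keyed by Gatherer name.
        self.queues: Dict[str, List[str]] = defaultdict(list)

    def owner_of(self, url: str) -> str:
        """Return the Gatherer responsible for the host in this URL."""
        host = urlparse(url).hostname or ""
        for suffix, gatherer in self.assignments.items():
            if host.endswith(suffix):
                return gatherer
        return self.default_gatherer

    def report_link(self, found_by: str, url: str) -> str:
        """Called by a Gatherer that found a link; forward it if it is not theirs."""
        owner = self.owner_of(url)
        if owner != found_by:
            self.queues[owner].append(url)
        return owner
```

A North America Gatherer that reports a link to a host ending in ".jp" would see it queued for the Japan Gatherer rather than crawling it itself. Migrating a Gatherer to balance load would amount to editing the assignments table.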
Making matches
The Gatherers and the Collector make up the GCS search engine. The real power of GCS, however, lies in its ability to match this information to the interests and needs of users. A programme known as the Profile Engine carries out that task. Starting with the user's queries, it constructs information profiles that it constantly matches against the incoming wave of information. It distributes the matches to Administration Servers, which deliver them to the client's desktop machine or PDA.
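In its simplest form, matching profiles against incoming summaries might look like the sketch below. The article does not say how the Profile Engine represents a profile; treating it as a set of keywords that must all appear in a summary is an assumption, as are the names Profile, tokens and match.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Profile:
    """A standing interest: every keyword must appear for a summary to match."""
    user: str
    keywords: Set[str]


def tokens(text: str) -> Set[str]:
    """Lower-case the words in text and strip surrounding punctuation."""
    return {word.strip(".,;:!?\"'()").lower() for word in text.split()} - {""}


def match(profiles: List[Profile], summary: Dict[str, str]) -> List[str]:
    """Return the users whose profile is satisfied by this incoming summary."""
    words = tokens(summary.get("title", "") + " " + summary.get("summary", ""))
    return [p.user for p in profiles if p.keywords <= words]
```

The important difference from an ordinary search engine is the direction of the lookup: the queries are stored and the documents stream past them, so a user is told about a new item without having to repeat the search.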
At present, Ford admits, the client side of GCS is "a grab bag of ideas." The basic interface looks like a television remote control. Commercially available systems like PointCast already push channels of information to a user's desktop using the free PointCast browser. However, those channels are predefined, broad and unfiltered. PointCast has a sports channel, for example, but it doesn't have a Chicago Bulls channel. GCS users can create channels that are exactly as narrow or as broad as they like.
As the user switches from channel to channel, the information scrolls by in "tickers," just like the ticker tapes on Wall Street. The group is exploring means of programming tickers to alert users to important information -- stock market crashes or traffic jams, for instance -- as it comes in. Users can also specify that some important or useful information be pushed to their PDA or other device.
The quality of the information delivered by GCS will improve with use, thanks to a concept known as a Relevance Tracker. GCS inevitably delivers a lot of information unrelated to the initial query. Technology being developed by Almaden's Rob Barrett will someday permit GCS to analyze information that the user accepts and rejects, to refine queries and cut down on the irrelevant hits.
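The article does not describe how Barrett's technology works, so the sketch below substitutes a generic relevance-feedback scheme to show the principle: terms from accepted items gain weight, terms from rejected items lose it, and later items are scored against the adjusted weights. The RelevanceTracker class, its methods and the step size are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def words(text: str) -> List[str]:
    """Lower-case words with surrounding punctuation stripped."""
    cleaned = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in cleaned if w]


class RelevanceTracker:
    """Adjusts term weights from accept/reject feedback (generic sketch)."""

    def __init__(self, query_terms: Iterable[str], step: float = 0.5) -> None:
        # Every original query term starts with weight 1.0.
        self.weights: Counter = Counter({t.lower(): 1.0 for t in query_terms})
        self.step = step

    def feedback(self, text: str, accepted: bool) -> None:
        """Raise the weight of each term in an accepted item, lower it if rejected."""
        delta = self.step if accepted else -self.step
        for term in set(words(text)):
            self.weights[term] += delta

    def score(self, text: str) -> float:
        """Sum the weights of the distinct terms in text; unseen terms count as 0."""
        return sum(self.weights[t] for t in set(words(text)))
```

With a query on "rainbow" and "weather", rejecting one item about a Rainbow Bridge drives the weight of "bridge" below zero, so later bridge stories score lower than weather stories and can be filtered out.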
Early Adopters
Such refinements lie in the future. The GCS project is less than a year old -- though in that time, Ford notes, the internet has doubled in size. The team has filed about 10 invention disclosures and has produced about 200,000 lines of Java code. A few IBM researchers are already using prototypes of the technology to help with their work. IBM employees and consultants will probably use early versions of GCS, but the Almaden group is looking at applications in the wider world.
However rapidly GCS takes shape, it will not be finished too quickly for Ford. "The world is blossoming with information you cannot keep up with," he says. "What we're trying to do is eliminate that." Maybe one could board the information superhighway at Grand Central Station.
-- www.research.ibm.com
Copyright © 1999 Indian Express Newspapers (Bombay) Ltd.