Research in
WWW Architecture and Technology:

A Next-Generation WWW Server



John B. Smith & F. Don Smith


Research problem

In the four years since the World Wide Web (WWW) appeared, it has become the largest service on the Internet in terms of both volume of data and number of packets. Conservative estimates of current WWW users are in the tens of millions. Once security problems are solved, its use as an infrastructure for commerce may well stimulate even faster rates of growth.

To date, the Web has been most successful as a vehicle for delivering relatively short segments of information, such as enrollment information for a university, current weather images, personal homepages, or up-to-the-minute pictures of a comet striking a planet. Surprisingly, it is not being used as a vehicle for scientific and technical research as much as it could be. For example, the Web has the capability to make papers and results immediately available and to distribute them at a fraction of the cost of conventional publication. A more specialized but related use is support for remote collaboration among relatively small research communities. Why are such uses of the Web relatively rare?

There are several aspects of the current Web architecture that make it less than ideal as an infrastructure for serious scientific inquiry and communication. Those limitations include lack of support for authorship and lack of link integrity. While these limitations have particularly important consequences for scientific and technical research, they affect all Web users. Consequently, finding solutions to the problems they pose will have extensive ramifications.

At present, authorship lies outside the WWW architecture. Building and maintaining structures of information must be done with shell commands that create directories and files, rather than with Web tools (e.g., browsers) that change the structure of information stored within Web servers. Relating pages to one another is done by inserting embedded "links" with text editors. To construct extended documents, an author needs to see the structure of ideas and components; to add, delete, and reorganize sections of the discourse; and to construct references through direct manipulation. WYSIWYG HTML editors will help with formatting Web pages, but they will not help with these more fundamental structural aspects of authorship.

The WWW architecture needs to be extended to provide functions and semantics both for creating new objects directly and for creating links among them. The architecture acknowledges this need through the POST, LINK, and UNLINK methods included in the HTTP protocol, but the authorship provisions of these methods have not been standardized and their semantics have not been agreed upon. At present, they are "placeholders" for work in progress.
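
Because the semantics of LINK and UNLINK were never standardized, there is no agreed-upon request format. The following minimal sketch only illustrates the kind of request a client might send; the host name, path, and Link header shown are illustrative assumptions, and an actual server's response would depend on semantics that have yet to be defined.

    import java.io.*;
    import java.net.Socket;

    /*
     * Illustrative sketch only: sends a hypothetical HTTP LINK request asking a
     * server to record a relationship between two of its pages. The Link header
     * format and the server's behavior are assumptions, since these methods are
     * only placeholders in the protocol.
     */
    public class LinkRequestSketch {
        public static void main(String[] args) throws IOException {
            Socket s = new Socket("www.example.edu", 80);    // hypothetical server
            Writer out = new OutputStreamWriter(s.getOutputStream(), "US-ASCII");
            out.write("LINK /papers/draft.html HTTP/1.1\r\n");
            out.write("Host: www.example.edu\r\n");
            out.write("Link: </papers/results.html>; rel=\"related\"\r\n");
            out.write("Connection: close\r\n\r\n");
            out.flush();

            // Print whatever status line and headers the server returns.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), "US-ASCII"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            s.close();
        }
    }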

Links are embedded within the data of WWW pages. They consist of symbolic addresses, expressed as a host address and a path to a particular file in some Web server's file space. If the owner moves the file -- to reorganize his or her files, to move them to a new physical site, etc. -- all of the links to that file become invalid. To understand the seriousness of this problem, consider, by analogy, what it would mean if large numbers of the references contained within published scientific articles suddenly became invalid. Indeed, the notion of cumulative knowledge is one of the two cornerstones, along with the scientific method, on which all of science rests. Consequently, until links can be guaranteed, the WWW cannot serve as a basic vehicle for scientific knowledge, comparable to or replacing conventional publication, in spite of its many advantages.
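
The fragility of such links can be seen in how little a client can do about them: because a link is nothing more than a host plus a path, the best a client can do after a file has moved is detect the failure. The sketch below, with a hypothetical URL, shows that after-the-fact detection; nothing in the architecture tells the client where the file went.

    import java.net.HttpURLConnection;
    import java.net.URL;

    /*
     * Illustrative sketch only: checks whether a WWW link still resolves.
     * If the file behind the path has been moved, the server can only report
     * an error (typically 404); it keeps no record of the new location.
     */
    public class LinkChecker {
        public static boolean isAlive(String address) {
            try {
                HttpURLConnection c =
                    (HttpURLConnection) new URL(address).openConnection();
                c.setRequestMethod("HEAD");      // ask for headers only
                int status = c.getResponseCode();
                c.disconnect();
                return status < 400;             // 4xx/5xx means a broken link
            } catch (Exception e) {
                return false;                    // unreachable host, malformed URL, ...
            }
        }

        public static void main(String[] args) {
            String url = "http://www.example.edu/papers/results.html"; // hypothetical
            System.out.println(url + (isAlive(url) ? " is reachable" : " is broken"));
        }
    }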

Providing the needed flexibility while maintaining link integrity will require that links become first-class objects, similar to pages. This architectural change would also address several other limitations, for example by providing a means to add annotations and to distinguish between private and public links. However, it would require link servers, comparable to existing Web servers. It would also require a link domain service, similar to the current IP Domain Name Service. Such changes would thus go to the heart of the current Web and Internet architectures. Considerable research will be needed to develop an approach that can scale in accord with the current size of the Web and its anticipated growth.
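
A minimal sketch of what a first-class link might look like if it were stored and named by a link server rather than embedded in page data appears below. All class, field, and method names are illustrative assumptions, not the design of the systems described later in this document.

    import java.io.Serializable;

    /*
     * Minimal sketch of a link as a first-class object, stored and named by a
     * link server rather than embedded in page data. Names are assumptions.
     */
    public class FirstClassLink implements Serializable {
        private final String linkId;   // the link's own global name
        private String sourceNodeId;   // stable node identifiers, not file paths,
        private String targetNodeId;   //   so moving a file does not break the link
        private String annotation;     // e.g., a reader's comment on the link
        private boolean shared;        // distinguishes public from private links

        public FirstClassLink(String linkId, String source, String target) {
            this.linkId = linkId;
            this.sourceNodeId = source;
            this.targetNodeId = target;
            this.shared = false;       // private until the owner publishes it
        }

        /* A link server would resolve node identifiers to current locations at
           traversal time, much as the Domain Name Service resolves host names. */
        public String getLinkId()         { return linkId; }
        public String getSourceNodeId()   { return sourceNodeId; }
        public String getTargetNodeId()   { return targetNodeId; }
        public void annotate(String note) { this.annotation = note; }
        public void share()               { this.shared = true; }
    }

Because such an object has its own identity and storage, annotations and private versus public status become properties of the link itself rather than edits to someone else's page.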

Thus, in spite of the obvious utility of the WWW and its extraordinary growth, the Web is constrained by several fundamental limitations. Removing those constraints will require significant changes in the Web's architecture, in some cases reversing the earlier simplifying decisions that enabled the Web's development and extraordinary growth in the first place. Obviously, such changes need to be carefully thought through and tested before they are put into widespread use.

In the research project described here, we are examining several fundamental limitations in the current Web architecture, and we are exploring ways in which those limitations might be removed. In the rest of this section, we state the research questions we are examining and then describe our work plan to answer them.

Research issues

Specifically, we are examining three questions. First, how can the WWW architecture be extended to support authorship directly, so that users can create pages and the links among them with Web tools rather than external ones? Second, how can links be made first-class objects so that their integrity can be guaranteed and so that related functions, such as annotations and private versus public links, can be supported? Third, can architectures and strategies developed by the hypermedia community for small numbers of users be adapted to the scale required by the Web?

Research plan

Our approach to these issues must be viewed in the context of current research in the Web and hypermedia research communities, which remain relatively separate from one another.

One reason the Web has been as successful as it has is that its designers made several simplifying design decisions, as suggested above. First, the Web was designed as a tool for accessing information, not as a tool for creating new information. Thus, issues such as authorship tools, access control for vast numbers of users within a server's domain, and physical storage for users' writing needs were bypassed.

Second, by embedding links within documents, the Web architecture became responsible only for providing node services, through access to portions of existing file systems. It did not have to provide alternative storage facilities, nor did it have to provide link services. However, the cost of not including link services within the architecture is the current unreliable nature of links and the limits it places on new link-related functions, such as users' annotations and personalized sets of anchors/links.

Third, the architecture was based on stateless transactions between client and server processes supported by basic Internet services. This permitted widespread decentralization of services, with the associated benefits of distributed storage and processing. But it also made updating data objects (such as embedded links) impractical and precluded functions based on server callbacks to clients.

We are now seeing efforts to extend the Web architecture to address these and other limitations. However, these are difficult problems when considered in the context of the scaling requirements posed by the Web, and solutions are not on the immediate horizon.

Over the past ten years, various research groups, including our own, have examined many of these issues. A number of researchers have examined hypermedia authorship and first-class links and have developed systems that support them. However, those studies and systems have focused on individual users or on groups numbering in the tens of users. The situation posed by the WWW, in contrast, requires that system design issues be evaluated in terms of tens of millions of users. It is not clear that architectures and strategies that work on a small scale will work on the much larger scale posed by the WWW.

Thus, on the one hand, the Web community is now trying to solve many of the problems the hypermedia community has been working on for years; on the other hand, the hypermedia community has solutions that work for small numbers of users but have not been tested at the scale the Web requires. What is needed is a program of research that combines the best of both worlds. Prior hypermedia work is unlikely to meet the Web's needs directly, but the body of experience and some of the results can be adapted. Determining which results are adaptable, and making the needed modifications, will take a significant research effort.

In prior work, we have developed a hypermedia server, the Distributed Graph Storage System (DGS), that has been demonstrated to support approximately 500 simultaneous users while providing authorship and first-class links. In the project described here, we will try to apply those prior research results to the problems of authorship and link integrity in the WWW.

Our project has two components. First, we are building a Web interface to our existing DGS server, along with user tools to work with that data. This work is nearing completion. Tellis Ellis has designed and implemented a WWW/DGS server that provides WWW access to DGS graph objects. He has also built a JavaScript user interface that allows users to create and edit graphs. We will conclude this part of the project by Summer 1997.

Second, we are currently applying the knowledge and experience we have gained working with DGS and WWW/DGS to a new architecture and implementation based on Java. It will use the basic WWW infrastructure to access Java graph applets, but will then rely on Java Remote Method Invocation and Object Serialization features to access graph objects and edit their components, such as nodes and links.
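
The sketch below suggests the general shape of such an arrangement: an applet obtains a reference to a remote graph service through Java RMI and receives node state by Object Serialization. The interface, class, and registry names are assumptions made for illustration, not the actual GraphDomain interfaces.

    import java.io.Serializable;
    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    /*
     * Minimal sketch of the kind of remote interface a Java graph applet might
     * use via Remote Method Invocation. All names below are illustrative.
     */
    interface GraphService extends Remote {
        GraphNode getNode(String nodeId) throws RemoteException;
        void moveNode(String nodeId, String targetGraphId) throws RemoteException;
        String createLink(String sourceNodeId, String targetNodeId) throws RemoteException;
    }

    /* Node state travels between server and applet by Object Serialization. */
    class GraphNode implements Serializable {
        String nodeId;
        String title;
        String[] linkIds;
    }

    public class GraphClientSketch {
        public static void main(String[] args) throws Exception {
            // Look up the (hypothetical) graph service exported by the server.
            GraphService service =
                (GraphService) Naming.lookup("rmi://graphs.example.edu/GraphService");
            GraphNode node = service.getNode("node-42");
            System.out.println("Fetched node: " + node.title);
        }
    }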

The approach we are taking is to build islands of enhanced function within the overall WWW address space. Within these islands, users will have authorship capabilities similar to those found in the general computing environment, but included as an integral part of the WWW. Additionally, links within those islands will be guaranteed by the system, even when nodes/files are moved from one graph/directory to another.
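
One way to see why such a guarantee is possible within an island is sketched below: links name stable node identifiers, and a per-island registry maps each identifier to its current graph, so moving a node updates only the registry and every link to it remains valid. The class and method names are assumptions for illustration, not the project's implementation.

    import java.util.HashMap;
    import java.util.Map;

    /*
     * Illustrative sketch of link integrity inside an "island": links refer to
     * stable node identifiers, and the island resolves identifiers to current
     * locations at traversal time.
     */
    public class IslandRegistrySketch {
        private final Map<String, String> nodeLocations = new HashMap<>();

        public void register(String nodeId, String graphId) {
            nodeLocations.put(nodeId, graphId);
        }

        /* Reorganizing content changes the mapping, never the links themselves. */
        public void moveNode(String nodeId, String newGraphId) {
            nodeLocations.put(nodeId, newGraphId);
        }

        /* Link traversal resolves the identifier at access time. */
        public String resolve(String nodeId) {
            return nodeLocations.get(nodeId);
        }

        public static void main(String[] args) {
            IslandRegistrySketch island = new IslandRegistrySketch();
            island.register("node-42", "graph-A");
            island.moveNode("node-42", "graph-B");          // reorganize the island
            System.out.println(island.resolve("node-42"));  // links still find it: graph-B
        }
    }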

Additional information is available for the WWW/DGS and the Java-based GraphDomain projects.


email: jbs@cs.unc.edu
url: http://www.cs.unc.edu/~jbs/