The World Wide Web

Whereas the Internet provides an underlying infrastructure for reliable communication between remote computers, the World Wide Web provides a user interface to a collection of network-oriented functions. These functions include fetching files from remote computers (the Web's most frequently used function), providing input parameters to a remote process such as a database, and invoking common network applications such as e-mail, news, ftp, and remote login.

An extremely important function of the World Wide Web is its role in delivering and providing a context for Java programs that can run on the client machine, on the server machine, or on both. In this respect, the Web is becoming a general context for network-oriented programs. Learning to program in this environment is a major goal for this course.

The discussion that follows begins with a brief history of the World Wide Web. After that, several key technologies used by the Web are described. That discussion extends the description of Internet technology begun in the preceding chapter.

History

The original concept for the World Wide Web began with Tim Berners-Lee. He described it in a paper entitled Information Management: A Proposal, which is still available through the World Wide Web Consortium (W3C) at http://www.w3.org/History/1989/proposal.html. The paper describes a hypertext system intended for use by high energy physicists associated with CERN.

Berners-Lee's paper was redistributed during 1990 and work began on a proof of concept system. A first prototype server, written for a NeXT machine, was demoed during the fall and work began on a line-mode browser by Nicola Pellow. Also during that fall the name, World Wide Web, was coined by Berners-Lee.

1991 was a year of demonstration and promotion of the idea. These efforts met with success in the scientific community but less so in the hypertext community. A paper describing the Web was submitted by Tim Berners-Lee to the Hypertext'91 Conference and was rejected. However, Berners-Lee did present his work on WWW through a "poster" presentation there.

1992 saw public release of the line-mode browser and distribution of the server software. Interest in the scientific community began to build with a growing number of live demonstrations through various scientific conferences and organizations.

1993 was the year the Web really took off. A large part of the growing public awareness of the Web can be traced to the release of an alpha version of the graphical browser, Mosaic, written by Marc Andreessen. This work was done in conjunction with the supercomputer center at the University of Illinois, Urbana-Champaign and sponsored by DARPA. During that year, Web traffic increased by a factor of 10, although it still accounted for only 1% of the traffic on the NSF backbone network.

During 1994, the Web went commercial. Mosaic, Inc., was founded by James Clark, who had earlier founded Silicon Graphics, and by Andreessen. Mosaic soon changed its name to Netscape. During that year, the World Wide Web Consortium was founded jointly by CERN and MIT. Before the end of the year, however, CERN withdrew support from the WWW project for financial reasons and transferred its interest to the Institut National de Recherche en Informatique et en Automatique (INRIA) in France. The center of WWW activity, along with the W3C and Tim Berners-Lee, soon moved to MIT.

From 1995 on, the Web has continued to grow at an exponential rate. It has also become more and more diverse in the uses to which it is being put. To gain a sense of this extraordinary breadth, scan the lists of topics included in the various International WWW Conferences (see http://w3c.org/Conferences/Overview-WWW.html). One particularly important development during 1995 was the introduction of Java, enabling the Web to deliver programs, in the form of Java applets, to be run on a browser.

As we look to the future, we can expect this pattern of growth to continue, both in diversity of use and in volume of activity. An increasing number of general purpose programs that use network resources are being written so that they can be invoked through a Web browser. In the process, the defining ideas of the Web as it was first conceived are being stretched further and further. It's almost like an old house being replaced room by room and board by board by new constructions. Consequently, the Web that will exist five years from now will not be the Web that exists today. But, whatever else it may become, it will be an interface to the network and to a world of information and function that is global in scale. It's truly an exciting time to be working in this environment!

Technology

The World Wide Web is based on three key technical components. Uniform Resource Locators (URLs) provide an addressing scheme whereby a browser may request a file from a remote computer over the Internet. A Web browser and a Web server, identified through a URL, use a special "language" or protocol to communicate, called the Hypertext Transfer Protocol (HTTP). The third leg of the Web's technology triangle is HTML, the Hypertext Markup Language. It is a standard that has been implemented by a number of different browsers to display formatted pages of text and graphics in approximately the same way on different hardware and software platforms.

The discussion that follows begins with a brief description of the Web's client/server architecture and then describes the three key components introduced above.

WWW Client/Server Architecture

The basic WWW architecture is shown in the figure below. It is based on the client/server model of distributed systems. In this model, a client process makes a request to a server process, normally running on a different machine and using a network such as the Internet for communication. The server process receives the request, establishes a connection with the client, performs the desired function, returns the result to the client, and breaks the connection. The server is then available to receive requests from other clients and perform similar services for them.

WWW Client/Server Architecture. A Web browser runs as a client, normally on the user's workstation. It communicates with a Web server running somewhere on the Internet. Information is exchanged between the two using a protocol called HTTP.

In some implementations, the server creates a copy of itself (called forking) immediately upon receiving the request. The child process then establishes the connection with the client and performs the desired service while the parent process goes back to listening for other requests. This design enables a single server to provide services for a number of clients at the same time.
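The request/response cycle and the concurrent design just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular Web server is implemented: it serves each connection in its own thread as a portable stand-in for forking, the "service" is simply upper-casing one line of text, and the names UpperCaseHandler, ConcurrentServer, ask, and demo are hypothetical.

```python
import socket
import socketserver
import threading


class UpperCaseHandler(socketserver.StreamRequestHandler):
    """Serves one client connection: read a request line, reply, disconnect."""

    def handle(self):
        request = self.rfile.readline().strip()     # receive the request
        self.wfile.write(request.upper() + b"\n")   # perform the service and return the result
        # Returning from handle() lets the framework close the connection.


class ConcurrentServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    # Each accepted connection is served in its own thread, so the main loop
    # goes straight back to listening, which is the role the parent process
    # plays in a forking server.
    daemon_threads = True
    allow_reuse_address = True


def ask(host, port, text):
    """Client side: connect, send one request line, read the one-line reply."""
    with socket.create_connection((host, port)) as conn, conn.makefile("rb") as reply:
        conn.sendall(text.encode("ascii") + b"\n")
        return reply.readline().decode("ascii").strip()


def demo():
    # Port 0 asks the operating system for any free port (an assumption made
    # so the sketch does not collide with a real service).
    with ConcurrentServer(("127.0.0.1", 0), UpperCaseHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        host, port = server.server_address
        replies = [ask(host, port, "hello"), ask(host, port, "web")]
        server.shutdown()
    return replies


if __name__ == "__main__":
    print(demo())
```

On systems that support it, replacing ThreadingMixIn with socketserver.ForkingMixIn gives the forking behavior described in the text; threads are used here only so the sketch runs on any platform.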

With respect to the WWW, the client is normally a Web browser, such as Netscape or Internet Explorer. The server may be any of several Web servers, such as NCSA's HTTPD, Netscape's Enterprise server, or Sun's JavaServer. The server's data, in the form of pages, is stored as files on the server's file system. Servers also provide an interface, called the Common Gateway Interface or, more frequently, CGI, whereby data submitted by users through forms can be processed by programs and the results returned to the user. A CGI program may also use the server's file system. The next chapter provides additional details about CGI.

URL

The addressing system used by the World Wide Web, which enables a Web browser to specify a particular Web server and a file accessible to that server, is based on Uniform Resource Locators, usually referred to as URLs.

A particular URL identifies a host machine and a path to a particular file located within the file system of that host. It also designates the particular protocol being used, usually HTTP but occasionally FTP and several other possible protocols. Optionally, a URL may include the port on the host machine where Web requests are expected. The default port is 80 and is not usually specified unless a different port is being used, in which case the port must be designated.

The basic form of a URL is the following:

http://host[:port]/path

The square brackets ( [ ] ) indicate an optional element.

An example URL is:

http://www.cs.unc.edu/~jbs/aw-wwwp/site

This example indicates that the HTTP protocol is being used, that the host machine is www.cs.unc.edu, and that the file in question can be located by following a path that begins at the author's home directory and goes down two levels from there.
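The same decomposition can be tried out with Python's standard urllib.parse module. The sketch below is illustrative only: the helper name describe_url and the port 8080 in the second call are assumptions made for the example, and the fallback to port 80 applies only to HTTP URLs.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # used when an HTTP URL names no port


def describe_url(url):
    """Split a URL into the protocol, host, port, and path it designates."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port if parts.port is not None else DEFAULT_HTTP_PORT,
        "path": parts.path,
    }


# The example URL from the text: no port is given, so the default applies.
print(describe_url("http://www.cs.unc.edu/~jbs/aw-wwwp/site"))

# A hypothetical variant that designates a non-default port explicitly.
print(describe_url("http://www.cs.unc.edu:8080/~jbs/aw-wwwp/site"))
```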

The host machine is normally specified in the form supported by the Internet's Domain Name System (DNS). These names are hierarchically structured, beginning with a specific machine or service, then one or two administrative units, and concluding with a type designation, such as edu for education or com for commercial. In the example above, cs refers to the Computer Science Department and unc to the University of North Carolina.

When a browser encounters one of these host designations in a URL, the first thing it does is call the Internet DNS to translate it into a 32-bit IP address, as discussed in the preceding chapter. It is the IP form of the address that browsers and servers use to establish connections.
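To make the idea of a 32-bit address concrete, the following sketch converts between the familiar dotted form and the single 32-bit number it represents. The function names are hypothetical, and 192.0.2.1 is an address reserved for documentation, not the address of any host mentioned here. The resolve function shows one way a program can ask the local resolver for a host's address; its result depends on the network, so it is defined but not called.

```python
import ipaddress
import socket


def to_32_bit(dotted_quad):
    """Return the 32-bit integer behind a dotted address such as '192.0.2.1'."""
    return int(ipaddress.IPv4Address(dotted_quad))


def to_dotted_quad(value):
    """Return the dotted form of a 32-bit address."""
    return str(ipaddress.IPv4Address(value))


def resolve(host):
    """Ask the resolver for a host's IPv4 address (requires network access)."""
    return socket.gethostbyname(host)


address = to_32_bit("192.0.2.1")
print(address, format(address, "032b"), to_dotted_quad(address))
```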

URLs may also include several other optional fields. These allow a URL to designate a particular place within a file, such as a particular heading or section. A query that may be processed by a CGI program may also be attached to the end of a URL. We will look more closely at the latter form in the chapter on CGI that follows. For the former, the reader should consult a standard HTML reference.
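As a rough illustration of these two optional fields, the sketch below pulls the query and the place-within-a-file designation (the part after #) out of a URL. Both URLs use the reserved host www.example.com and made-up paths and parameters; they are assumptions for the example, not addresses from this chapter.

```python
from urllib.parse import parse_qs, urlsplit


def optional_fields(url):
    """Return (query parameters, fragment) for a URL; either may be empty."""
    parts = urlsplit(url)
    return parse_qs(parts.query), parts.fragment


# A query attached to the end of a URL, as a CGI program might receive it.
print(optional_fields("http://www.example.com/cgi-bin/search?topic=web&year=1995"))

# A designation of a particular place within a file.
print(optional_fields("http://www.example.com/notes.html#history"))
```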

HTTP

The language spoken by browsers and servers to one another is called the Hypertext Transfer Protocol, or, more commonly, HTTP. Like the network protocols discussed earlier, HTTP messages normally include header and data components. A blank line separates the two. For details of the HTTP protocol, see the World Wide Web Consortium specification at http://w3c.org/Protocols/. A short summary of its main features follows.

HTTP messages fall into two main groups, requests and responses. A request header includes a verb, such as GET or POST, that informs the server of the action requested by the browser. It may include a dozen or so other parameters, depending on the particular verb. With requests for a file, the header must also include a URL for the file to be fetched. Other optional header parameters include a designation of the browser software, a list of the types of data it can accept, and authorization information. Most requests do not use the data component. One notable exception is for one type of CGI request in which the query is placed in the data component as opposed to being tacked onto the end of the URL, as discussed above. The details of these CGI options will be discussed in the chapter that follows.
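The layout of a request (a line with the verb and the file wanted, further header lines, a blank line, then the optional data component) can be sketched as follows. This is a simplified HTTP/1.0-style illustration rather than a complete client: build_request is a hypothetical helper, and the User-Agent value, the /cgi-bin/search path, and the host www.example.com are made up for the example.

```python
def build_request(verb, path, host, headers=None, body=b""):
    """Assemble a request: header lines, a blank line, then the data component."""
    lines = [f"{verb} {path} HTTP/1.0", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # Tell the server how many bytes of data follow the blank line.
        lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines) + "\r\n\r\n"   # the final empty line ends the header
    return head.encode("ascii") + body


# A request for a file: the data component is empty.
get_request = build_request(
    "GET",
    "/~jbs/aw-wwwp/site",
    "www.cs.unc.edu",
    headers={"User-Agent": "sketch/0.1", "Accept": "text/html"},
)

# A request that carries its query in the data component instead of the URL.
post_request = build_request(
    "POST", "/cgi-bin/search", "www.example.com", body=b"topic=web"
)

print(get_request.decode("ascii"))
print(post_request.decode("ascii"))
```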

HTTP response messages are generated by the server in response to requests, as just discussed. They, too, include header and data components. The header begins with a status line that specifies the version of HTTP being used and a coded indication of the action taken by the server. For example, a status code of 200 indicates that the request was processed successfully, whereas a 404 indicates that the requested file could not be found. Additional header lines may specify the content length, type, language, title, etc. The file or generated data returned for a request is placed in the data component of the HTTP response message.
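Going the other way, a response can be taken apart into its status line, header lines, and data component. The sketch below assumes a small, well-formed message held entirely in memory; parse_response and the sample message are hypothetical, and a real client would also have to cope with malformed or partial input.

```python
def parse_response(raw):
    """Split a response into (version, status code, reason, headers, data)."""
    head, _, data = raw.partition(b"\r\n\r\n")        # the blank line separates the two parts
    lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = lines[0].split(" ", 2)    # the status line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, data


sample = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>Hello</p>\n"
)

version, code, reason, headers, data = parse_response(sample)
print(version, code, reason)
print(headers)
print(data)
```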

A frequent interaction between a browser and a WWW server is to request a particular document or "page" of information. After parsing the request, the server fetches the requested data from its local file system, constructs an HTTP message, places the file contents in the data portion of the message, and returns the entire package to the client.

In some cases, the request sent to a server will not be to fetch an existing file but to execute some program, such as a database access program. In this case, the WWW server uses a special interface, called the Common Gateway Interface (CGI), to run the desired program. That program is then responsible for constructing the HTTP message that includes the data requested by the user, which it passes back to the server for delivery to the client.

HTTP messages are sent back and forth between a Web browser client and a Web server through the Internet using TCP/IP protocols. Thus, an HTTP message is embedded as the data portion of TCP segments which, in turn, are embedded in IP datagrams, which in turn are embedded in Ethernet frames. The whole thing is like a set of nested Russian dolls.

HTML

The third leg of the technology stool on which the World Wide Web rests is the Hypertext Markup Language, or HTML as it is more commonly called. HTML is a language that enables an author to specify the format and appearance of a document. The reason it is so important is that it allows a document to be displayed by different browsers running on different hardware and software platforms and have the same basic appearance on all of them. Thus, it underscores a major premise of the Web -- that it function as a standard that supports interoperability through multiple implementations that provide similar operation. The intent here is to provide a general understanding of HTML. If you are not familiar with writing HTML, several excellent tutorials are available on-line; an especially good resource is the set of tutorials provided by Netscape (see http://www.pageresource.com/).

Among the textual features that can be specified are headings of various weights, several kinds of lists, a primitive form of indentation, and specification of both the type and size of fonts.

One of the most important capabilities of HTML is its support for anchors. Anchors allow one to associate a string of text or a graphic image with a URL for another page or document. By clicking on the associated text or graphic in one document, a user can instruct the browser to request the page designated by the URL and display it. It was this capability that provided the metaphor of a web of links joining information stored world wide.

Additional features have been added to HTML over the years. These include tables and frames, which provide multiple areas or multiple windows within a single browser. One recent addition that plays a large role in this course is the inclusion of Java programs, called applets, in HTML pages.

As the Web becomes a more pervasive medium, pressure is increasing to provide more extensive formatting capabilities, similar to those available for hardcopy printing. Consequently, several major extensions have been proposed (see W3C's documentation on the subject at http://w3c.org/MarkUp/).

As the name states, HTML is a markup language. Consequently, its primary mechanism for specifying formatting and related information is through tags that are inserted directly into a document. The basic form of a tag is as follows:

<tag>content to be formatted</tag>

Tags are marked by opening and closing angle brackets ( < > ). Inside the angle brackets are specific labels that identify a particular formatting option, such as <ol> for an ordered list, or <font> for font information. Notice that tags are generally used in pairs, with the "end marker" including a slash as a prefix to the label string. Some tags may also include additional parameters in the form of attribute/value pairs. For example, a font tag can specify the particular font to be used as well as its size and color.
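One way to see paired tags and attribute/value pairs is to run a fragment of markup through Python's standard html.parser module and list what it finds. The sketch below is only an illustration: TagLister and list_tags are hypothetical names, and the sample fragment (a font tag with size and color attributes followed by an anchor whose href attribute holds a URL) was written for this example.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Records start tags (with their attributes), end tags, and text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():                      # ignore whitespace between tags
            self.events.append(("text", data.strip()))


def list_tags(markup):
    lister = TagLister()
    lister.feed(markup)
    lister.close()
    return lister.events


sample = '<font size="4" color="red">Hello</font> <a href="http://www.w3.org/">W3C</a>'
for event in list_tags(sample):
    print(event)
```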

Additional Reading

The topics discussed above are moving targets. Below are several suggestions for further reading, including both printed and on-line sources. Check the on-line ones for current information.

URL

Virtually every book on the Web includes a section on URLs. The most complete and authoritative source is the set of documents maintained by the W3C. They include prior and proposed standards as well as discussions. The current point of entry to these is http://w3.org/Protocols/.

HTTP

Most introductions to the Web provide only cursory discussions of HTTP. The most complete and authoritative source is the set of documents maintained by the W3C. They include prior and proposed standards as well as discussions of issues. The current point of entry to these is http://w3c.org/Protocols/.

HTML

There are numerous books on HTML. One that I have found particularly helpful is Raggett, D.; Lam, J.; & Alexander, I., (1996), HTML 3: Electronic publishing on the World Wide Web. Reading, MA: Addison-Wesley Longman. An excellent set of on-line tutorials is available from Netscape at http://www.pageresource.com/. For the current standard for HTML as well as earlier versions and discussions of issues, see the W3C's documents on the subject (http://w3c.org/MarkUp/).