Abstract: We report the results of a large-scale empirical study of web traffic. Our study is based on over 500 GB of TCP/IP protocol-header traces collected in 1999 and 2000 (approximately one year apart) from the high-speed link connecting a large university to its Internet service provider. For comparison, we also use a set of smaller traces from the NLANR repository taken at approximately the same times. The principal results from this study are: (1) empirical data suitable for constructing traffic-generating models of contemporary web traffic, (2) new characterizations of TCP connection usage showing the effects of HTTP protocol improvements, notably persistent connections (e.g., about 50% of web objects are now transferred on persistent connections), and (3) new characterizations of web usage and content structure that reflect the influences of "banner ads," server load balancing, and content distribution. A novel aspect of this study is its demonstration that a relatively lightweight methodology, based on passive tracing of only TCP/IP headers and off-line analysis tools, can provide timely, high-quality data about web traffic. We hope this will encourage more researchers to undertake ongoing data collection programs and provide the research community with data about the rapidly evolving characteristics of web traffic.