HTTP trace processing scripts INFO
Nearly all scripts are kept in my directory at
with the exception of some source code at
and some Perl library routines at:
An all-important file in the latter directory is
This file contains names, location, and misc information associated
with each trace file in the Oct/Nov 1999 OC3mon data set. The
information is kept in the form of a Perl data structure which is used
by several high-level scripts to selectively traverse the entire trace
set based on various criteria. You should also look here to locate
any files manually.
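
To make the idea concrete, here is a minimal sketch of such a catalog and of
selecting traces by criteria. It is written in Python rather than Perl, and
every field name, trace name, and path in it is hypothetical (invented for
illustration, not copied from the real file):

```python
# Hypothetical trace catalog. The field names, trace names, and paths are
# invented for illustration; the real catalog is a Perl data structure.
TRACES = [
    {"name": "trace-a", "host": "michigan", "format": "crl",
     "path": "/example/crl/trace-a.crl.gz"},
    {"name": "trace-a", "host": "michigan", "format": "tcpdump",
     "path": "/example/tcpdump/trace-a.dmp.gz"},
    {"name": "trace-b", "host": "dirt", "format": "text",
     "path": "/example/text/trace-b.txt"},
]


def select_traces(traces, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        t for t in traces
        if all(t.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Walk only the traces stored on one host.
    for trace in select_traces(TRACES, host="dirt"):
        print(trace["name"], trace["path"])
```

A high-level script would loop over the selected records in the same way,
handing each path to the lower-level scripts.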
At a high level, the central scripts are:
process_crl.pl - convert raw OC3mon dumps to tcpdump format
process_ipslice.pl - slice the first 10 seconds off each trace
process_ip.pl - compile IP stats about each trace
process_endpts.pl - get stats about http connection endpoints
process_http.pl - process http traffic data
Each of the scripts iterates over multiple files in the trace set and
performs pipelined operations to come up with some result. They often
call several scripts in /home/ott/bin-http/* to work on an individual
data set. The results in many cases are also in a table format for
easy entry into an Excel spreadsheet.
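
The general shape of these scripts can be sketched roughly as follows. This
is a Python illustration, not the actual Perl code: the stage functions, the
"records" column, and the sample data are all assumptions chosen only to show
a per-trace pipeline feeding a tab-separated table that a spreadsheet can
import.

```python
import csv
import io


def split_lines(text):
    """Stage 1 (hypothetical): break one trace's text into lines."""
    return text.splitlines()


def drop_blank(lines):
    """Stage 2 (hypothetical): discard empty lines."""
    return [line for line in lines if line.strip()]


def count(lines):
    """Stage 3 (hypothetical): reduce the remaining lines to a count."""
    return len(lines)


PIPELINE = [split_lines, drop_blank, count]


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


def summarize(traces):
    """Build a tab-separated table with one row per trace.

    `traces` maps a trace name to that trace's text contents.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["trace", "records"])
    for name in sorted(traces):
        writer.writerow([name, run_pipeline(traces[name], PIPELINE)])
    return out.getvalue()


if __name__ == "__main__":
    print(summarize({"trace-a": "x\n\ny\n", "trace-b": "z\n"}), end="")
```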
Since development proceeded in phases, these scripts are often not
really meant to be run one after the other, so you need to look at
them carefully if you plan to do so. Some coordination between start
and finish points will be needed.
Also, I just recently changed our data to text-converted tcpdump
format, with filtering done for server-to-client http traffic only,
and all files located on dirt.cs.unc.edu at
This change means that scripts expecting tcpdump binary format as a
starting point may need to be modified a bit. The format change
usually just means that several steps can be skipped in the script,
making the processing proceed at a much faster rate. Right now our
trace data is in three forms:
gzipped crl format on michigan
gzipped tcpdump format on michigan
tcpdump text format on dirt
Deleting one of the first two sets is an obvious next step, but I've
put it off temporarily until we are sure that we won't need to refer
back to them during debugging or to rerun a data set for some reason.
It has happened before.
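
For reference, the server-to-client filtering mentioned above amounts to
something like the sketch below. It is in Python and is not the filter that
was actually used. It assumes that each text line carries numeric endpoints
in the usual tcpdump form "src.port > dst.port:" and that the server side is
port 80. Both are assumptions, so check them against the real files.

```python
import re

# Assumed line layout: "<timestamp> <src>.<port> > <dst>.<port>: <rest>".
ENDPOINTS = re.compile(r"(\S+)\.(\d+) > (\S+)\.(\d+):")


def is_server_to_client(line, server_port=80):
    """True if the packet's source port is the (assumed) server port."""
    match = ENDPOINTS.search(line)
    return bool(match) and int(match.group(2)) == server_port


def filter_server_to_client(lines, server_port=80):
    """Keep only lines for packets sent by the server to the client."""
    return [line for line in lines if is_server_to_client(line, server_port)]


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real traces.
    sample = [
        "941234567.000100 10.0.0.2.1234 > 10.0.0.1.80: S 1:1(0) win 8760",
        "941234567.000900 10.0.0.1.80 > 10.0.0.2.1234: S 5:5(0) ack 2 win 8760",
    ]
    for line in filter_server_to_client(sample):
        print(line)
```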
In general, both high- and low-level scripts have a header which gives
a statement of purpose and an approximate development date, followed
by a configuration section which lists associated scripts and
directory locations. (I believe that most summary statements in the
header are accurate, but perhaps with a few exceptions.)
Most scripts are meant to be run on dirt. Look for the directory
initializations in the config file to see where input and output are
located.
The latest results for process_endpts, and process_http.pl are located
The first two are fairly complete. The http results, however,
represent a step en route toward a more complete analysis we've been
chasing. In particular, progress was made on document-level analysis
but not completed before I needed to leave for my summer commitment.
Some tentative results, however, have been included.
Two programs are of central importance to work we did during the
last few months. First is Don's http_connect program. Source code
for the current version is at:
A post-processing script, process_http.pl, is in
This program is similar in nature to Don's http_sessions, but teases out
statistics using Perl.
This summary has given pointers to some of the most important scripts and
where to find them. What it doesn't include, however, is ALL the details on
ALL the scripts I've written and included in ~ott/bin-http/*. Let me know
if you need some additional details about something, and I'll send some
information your way. ;-)
Last modified: 4 June 2000