

Replaying Traces at the Source-Level

Our approach to traffic generation is illustrated in Figure 5.1. Given a packet header trace \(\mathcal{T}_h\) collected from some Internet link, we first use the methods described and evaluated in Chapters 3 and 4 to analyze this trace and describe its content. This description is a collection of connection vectors \(\mathcal{T}_c\). Each vector describes the source-level behavior of one of the TCP connections in \(\mathcal{T}_h\) using either the sequential or the concurrent a-b-t model. In addition, each vector records the connection's relative start time, measured round-trip time, TCP receiver window sizes and loss rate. The basic approach for generating traffic according to \(\mathcal{T}_c\) is to replay each connection vector: we start a TCP connection, carefully preserving its relative start time, and reproduce its ADUs and inter-ADU quiet times. We call this traffic generation method source-level trace replay, and we have implemented it in a network testbed. In our environment, source-level trace replay requires first partitioning \(\mathcal{T}_c\) into disjoint subsets and assigning each subset to a pair of traffic generators. Partitioning is necessary because the high throughput and the large number of simultaneously active connections in our real traces prevent us from using a single pair of traffic generators. We provide further details on our partitioning method in Section 5.1.1.
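To make the structure of a connection vector concrete, the following sketch shows one possible in-memory representation of a sequential a-b-t connection vector and its per-connection parameters. The field names and the two-epoch example values are purely illustrative assumptions, not the actual tmix input format.

    /* Hypothetical sketch of a sequential a-b-t connection vector record.
     * Field names and values are illustrative, not the actual tmix input format. */
    #include <stdio.h>

    #define MAX_EPOCHS 4   /* small bound, for this example only */

    struct epoch {
        long a_size;       /* bytes sent by the connection initiator (ADU a_i) */
        long b_size;       /* bytes sent by the connection acceptor  (ADU b_i) */
        long ta_usec;      /* quiet time between a_i and b_i, in microseconds  */
        long tb_usec;      /* quiet time after b_i, before the next epoch      */
    };

    struct conn_vector {
        double start_sec;  /* relative start time within the trace             */
        double rtt_msec;   /* measured round-trip time                         */
        int    win_bytes;  /* TCP receiver window size                         */
        double loss_rate;  /* measured loss rate                               */
        int    num_epochs;
        struct epoch epochs[MAX_EPOCHS];
    };

    int main(void) {
        /* A two-epoch connection, e.g., an HTTP-like request/response exchange. */
        struct conn_vector cv = {
            .start_sec = 12.345, .rtt_msec = 80.0, .win_bytes = 65535,
            .loss_rate = 0.0, .num_epochs = 2,
            .epochs = { { 329, 403, 0, 120000 }, { 403, 25821, 0, 0 } }
        };
        for (int i = 0; i < cv.num_epochs; i++)
            printf("epoch %d: a=%ld bytes, b=%ld bytes\n", i + 1,
                   cv.epochs[i].a_size, cv.epochs[i].b_size);
        return 0;
    }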

Figure 5.1: Overview of Source-level Trace Replay.
\includegraphics[width=4.5in]{fig/pkt-cvec-tmix2.eps}

The goal of the direct source-level trace replay of \(\mathcal{T}_c\) is to reproduce the source-level characteristics of the traffic in the original link, generating the traffic in a closed-loop fashion. Closed-loop traffic generation requires simulating the behavior of applications, using regular network stacks to translate source-level behavior (the input of the generation) into network traffic (the output of the generation). In our implementation, described in Section 5.1.2, this is accomplished by relying on the standard socket interface to reproduce the communication in each connection vector. This is a closed-loop manner of generating traffic in the sense that it preserves the feedback mechanisms in the TCP layer, which adapt their behavior to changes in network conditions, such as congestion. In contrast, packet-level trace replay, which directly reproduces \(\mathcal{T}_h\), is an open-loop traffic generation method in which TCP and lower layers are not used, and the traffic does not adapt to network conditions.

A new packet header trace \(\mathcal{T}_h^\prime \) can be obtained from the source-level trace replay of \(\mathcal{T}_c\). Our validation of the traffic generation method is then based on analyzing this trace using the same methods used to transform \(\mathcal{T}_h\) into \(\mathcal{T}_c\). We then compare the resulting set of connection vectors \(\mathcal{T}_c^\prime \) with the original \(\mathcal{T}_c\). In principle, they should be identical, since \(\mathcal{T}_c\) represents the invariant source-level characteristics of \(\mathcal{T}_h\). Section 5.2 studies the results from the source-level trace replay of three traces, assessing how closely \(\mathcal{T}_c^\prime \) approximates \(\mathcal{T}_c\). In contrast, \(\mathcal{T}_h^\prime \) is necessarily different from \(\mathcal{T}_h\): besides the stochastic nature of network traffic, \(\mathcal{T}_h^\prime \) is generated according to \(\mathcal{T}_c\), which is a simplified description of the source-level behavior and network parameters in the original trace \(\mathcal{T}_h\). It is nevertheless important to understand the difference between \(\mathcal{T}_h\) and \(\mathcal{T}_h^\prime \) in order to understand to what extent \(\mathcal{T}_c\) describes the original traffic. Chapter 6 is an in-depth study of this question.


Trace Partitioning

Figure 5.2: Diagram of the network testbed where the experiments of this dissertation were conducted.
\includegraphics[width=5in]{fig/expnet.eps}

The focus of our traffic generation work is the generation of wide-area traffic in a closed-loop manner. This type of generation process requires driving a large number of connections by simulating the behavior of the applications on the endpoints. For example, the experiments presented in the latter part of this chapter involve several million TCP connections, each behaving in the manner specified by its connection vector. At any given point in time during the generation, tens of thousands of connections are active. Given CPU, memory and bus speed limitations, a single pair of traffic generators cannot handle such loads, so we generate traffic in our experiments in a distributed fashion. Experiments are conducted in the environment illustrated in Figure 5.2. The goal of each experiment is to generate traffic on the link between the two routers. Traffic is generated by 42 traffic generators, 21 on each side of the network. This type of topology is usually known as the ``dumbbell'' topology.

Each pair of traffic generators (one on each side) is responsible for replaying the source-level behavior of a disjoint subset of the connection vectors in \(\mathcal{T}_c\). In our experience, assigning connection vectors to subsets in a round-robin fashion works well. While the resulting subsets are far from perfectly balanced, this simple partitioning technique results in subsets that can be easily handled by a pair of traffic generators. We carefully collected statistics on CPU and memory utilization from our source-level trace replay experiments, and found that no pair of traffic generators was ever overloaded. For the results in this dissertation, CPU utilizations were never above 60%, and usually well below that. The use of network connections involves allocating and deallocating pieces of memory known as ``mbufs'' for buffering purposes. No request for this type of memory was ever denied in the experiments reported in this dissertation. While larger traces than the ones we use in this dissertation could certainly overload our specific environment, our approach is fully scalable, in the sense that \(\mathcal{T}_c\) can be partitioned into an arbitrary number of subsets. This means that the number of traffic generators can increase as much as necessary to handle the replay of any trace without running into resource constraints, as long as no individual connection requires more resources than an entire traffic generator end host provides.
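The round-robin assignment described above can be sketched as follows. The constants and the printed summary are illustrative assumptions; the real partitioning step writes each subset to the input of one pair of tmix instances.

    /* Minimal sketch of round-robin partitioning of connection vectors across
     * traffic generator pairs; constants and output are illustrative only. */
    #include <stdio.h>

    #define NUM_PAIRS 21   /* one subset per generator pair (21 pairs in our testbed) */

    int main(void) {
        long num_vectors = 10;               /* stand-in for the size of T_c */
        long count[NUM_PAIRS] = { 0 };

        for (long i = 0; i < num_vectors; i++) {
            int pair = (int)(i % NUM_PAIRS); /* round-robin assignment */
            count[pair]++;
            /* the real tool would append connection vector i to the input
             * subset handled by generator pair `pair` at this point */
        }
        for (int p = 0; p < NUM_PAIRS; p++)
            printf("pair %d: %ld connection vectors\n", p, count[p]);
        return 0;
    }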


Conducting Experiments

We have developed a traffic generation tool, tmix, which accurately replays the source-level behavior of an input set of connection vectors using real TCP sockets in a FreeBSD environment. In addition, we make use of a modified version of dummynet [Riz97] to apply arbitrary packet delays and packet drop rates to the segments in each connection. Our version of dummynet, which we will call usernet in the rest of this text, implements a user-level interface that tmix instances can use to assign per-connection delays and loss rates read from the input set of connection vectors. Finally, a single program, treplay, is used to control the setup of the experimental environment, configure and start tmix instances (assigning them a subset of \(\mathcal{T}_c\) and a traffic generation peer), and collect the results.

Figure 5.3: End-host architecture of the traffic generation system.
\includegraphics[width=4in]{fig/tmix-usernet.eps}

Tmix is a user-level program that receives a collection of connection vectors as input and generates traffic according to their source-level behavior. Figure 5.3 illustrates the relationship between tmix and the network layers in the traffic generation end host on which a tmix instance runs. Tmix instances rely on the standard socket interface to create a connection, send and receive ADUs, and close the connection. The socket interface is an API that enables user-level programs, such as tmix, to communicate with other end hosts using a programming abstraction similar to a file. Calls to the socket interface are translated by the kernel into requests to use the process-to-process communication service provided by the transport layer (TCP). The transport layer itself uses the host-to-host communication service provided by the network layer (IP), and the network layer uses the link layer (Ethernet in our case) to handle the network interface and create physical packets.
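As a sketch of this socket-call sequence, the following minimal program opens a TCP connection, writes one ADU, reads the reply, and closes the connection. The peer address, port, and ADU size are placeholders rather than values used in our experiments.

    /* Minimal sketch of the socket-call sequence used by a connection initiator:
     * create a socket, connect, write an ADU, read the reply, close.
     * The peer address, port and ADU size are placeholders. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);        /* TCP socket            */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in peer = { 0 };
        peer.sin_family = AF_INET;
        peer.sin_port = htons(8080);                     /* placeholder port      */
        inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr); /* placeholder acceptor  */

        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
            perror("connect"); close(fd); return 1;
        }

        char adu[512];                                   /* ADU a_1 (filler data) */
        memset(adu, 'x', sizeof adu);
        if (write(fd, adu, sizeof adu) < 0) perror("write");

        char reply[512];                                 /* wait for ADU b_1      */
        if (read(fd, reply, sizeof reply) < 0) perror("read");

        close(fd);                                       /* end of the connection */
        return 0;
    }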


Usernet

Our experiments also require a special simulation service, usernet, a modified version of dummynet that provides a highly scalable way of imposing per-connection round-trip times and loss rates. These per-connection round-trip times and loss rates are directly controlled from the user level by tmix instances. This requires direct communication between the tmix instance and the usernet layer, which is not directly supported by the network stack. In order to overcome this difficulty, we use a covert communication channel: the source port number of each replayed connection. By having tmix assign specific source port numbers to each connection, we can then use ioctl calls to modify a table at the usernet layer that maps source port numbers to round-trip times and loss rates. When usernet receives a segment from the higher layer, it uses the source port number to decide which network parameters should be applied. Source port numbers are unique for each active connection on the same end host, and they are always present in TCP segments. The user-level program, i.e., the tmix instance, therefore has to keep track of the (dynamic) source port number used for each new TCP connection it opens. Using this technique, usernet can determine the delay and loss rate that should be applied to each segment simply by reading an entry in a table indexed by source port number, so the lookup time is \(O(1)\). The number of source port numbers is small (\(2^{16}\)), so this table does not require much kernel memory (524 KB). No special infrastructure was required to accurately replay the receiver window sizes measured for each connection, since tmix instances can set this parameter directly using a FreeBSD system call. This approach has worked very well in our experiments.
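The essence of this per-connection table can be sketched at user level as follows. The structure layout and function names are illustrative assumptions; the actual table lives inside the usernet kernel code and is updated through ioctl calls from tmix.

    /* User-level sketch of the idea behind the usernet table: a flat array indexed
     * by TCP source port giving per-connection delay and loss rate, so that the
     * per-segment lookup is O(1).  This is an illustration, not the kernel code. */
    #include <stdio.h>
    #include <stdint.h>

    struct net_params {
        uint32_t delay_usec;   /* delay to impose on segments of this connection */
        uint32_t loss_ppm;     /* drop probability, in parts per million         */
    };

    /* 2^16 entries x 8 bytes = 524,288 bytes, matching the ~524 KB in the text. */
    static struct net_params port_table[65536];

    /* What the ioctl handler would do when tmix registers a new connection. */
    static void set_params(uint16_t sport, uint32_t delay_usec, uint32_t loss_ppm) {
        port_table[sport].delay_usec = delay_usec;
        port_table[sport].loss_ppm = loss_ppm;
    }

    /* What the per-segment path would do: a single array index, O(1). */
    static struct net_params lookup(uint16_t sport) {
        return port_table[sport];
    }

    int main(void) {
        set_params(49152, 40000, 2500);      /* e.g., 40 ms delay, 0.25% loss */
        struct net_params p = lookup(49152);
        printf("delay=%u us, loss=%u ppm, table size=%zu bytes\n",
               p.delay_usec, p.loss_ppm, sizeof port_table);
        return 0;
    }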

An alternative solution using traditional dummynet would be to use the programmable API of ipfw, which makes it possible to add new dummynet rules from a user-level program. The idea would be to add a new rule for each connection, again using the source port number to map delays and loss rates to individual connections. However, this would mean an \(O(n)\) lookup cost for each segment, where \(n\) is the number of rules, since the current implementation of ipfw searches through the rules sequentially. Given the large number of connections that each end host handles during the experiments, this per-segment lookup cost is unacceptable.

Another way of introducing per-connection round-trip times was used by Le et al. [LAJS03]. This study used random sampling from a uniform distribution whose parameters were set at the start of the experiment. As seen in Section 4.1.1, the uniform distribution is not a good approximation of real round-trip times. A later refinement enabling sampling from an empirical distribution was rather inflexible, since it required modifying the dummynet source code and recompiling it for each experiment. The use of usernet, which is fully controllable from the user level, is far more convenient.

Replaying an a-b-t Connection Vector

Two instances of tmix can replay an arbitrary subset of \(\mathcal{T}_c\) by establishing one TCP connection for each connection vector in the trace, with one instance of the program playing the role of the connection initiator and the other instance playing the role of the connection acceptor. To begin, the connection initiator opens the connection and performs one or more socket writes in order to send exactly the number of bytes specified in the first ADU \(a_1\). The other endpoint accepts the connection and reads as many bytes as specified in the ADU \(a_1\). For efficiency, the size of these read and write operations was chosen to be a multiple of the maximum segment size in our Ethernet testbed (1,460 bytes). We made no attempt to actually measure and reproduce the size of the I/O operations in the original connections. The impact of this simplification is likely to be small, given the results in Section 3.4.
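A minimal sketch of sending one ADU with socket writes sized in multiples of the 1,460-byte maximum segment size is shown below. The chunk size and the use of /dev/null as a stand-in for a connected TCP socket are assumptions for illustration; error handling is simplified.

    /* Minimal sketch of sending one ADU using socket writes whose sizes are
     * multiples of the 1,460-byte MSS; the chunk size is an assumption and error
     * handling is simplified.  /dev/null stands in for a connected TCP socket. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define MSS   1460
    #define CHUNK (8 * MSS)      /* write in multiples of the MSS */

    /* Send adu_size bytes of filler data on descriptor fd. */
    static int send_adu(int fd, long adu_size) {
        char buf[CHUNK];
        memset(buf, 'x', sizeof buf);        /* ADU content itself is irrelevant */
        long left = adu_size;
        while (left > 0) {
            size_t n = left < (long)sizeof buf ? (size_t)left : sizeof buf;
            ssize_t w = write(fd, buf, n);
            if (w < 0) { perror("write"); return -1; }
            left -= w;
        }
        return 0;
    }

    int main(void) {
        int fd = open("/dev/null", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }
        int rc = send_adu(fd, 3 * MSS + 100);  /* e.g., a 4,480-byte ADU */
        close(fd);
        return rc == 0 ? 0 : 1;
    }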

One important issue is how to synchronize the two endpoints (i.e., instances of tmix) of the connection so that they replay exactly the same connection vector. This is accomplished by having the first ADU in each generated connection include a 32-bit connection vector id in the ADU's first four bytes. Connection vector ids are unique and are assigned to each connection vector prior to the traffic generation. Since this id is part of the content of the first data unit, the acceptor can unambiguously identify the connection vector that is to be replayed in this new connection. If \(a_1\) is less than 4 bytes in length, the connection initiator will open the connection using a special port number designated for connections for which the id is provided by the connection acceptor. This approach guarantees that the two tmix instances always remain properly synchronized (i.e., they agree on the \(C_i\) they replay within each TCP connection) even if connection establishment segments are lost or reordered. It also makes it possible to generate traffic without introducing any control traffic into the experiment, i.e., traffic comes only from the replay of connection vectors and not from any need to manage the behavior of the tmix instances.
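The following sketch illustrates how a 32-bit connection vector id could be embedded in the first four bytes of ADU \(a_1\) and recovered by the acceptor. The byte layout shown here is an assumption for illustration, not the exact tmix wire format.

    /* Sketch of carrying a 32-bit connection vector id in the first four bytes of
     * ADU a_1 so the acceptor knows which vector to replay.  The byte layout is
     * illustrative, not the exact tmix wire format. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* Initiator side: place the id at the start of the ADU buffer. */
    static void embed_id(char *adu, uint32_t cvec_id) {
        uint32_t wire = htonl(cvec_id);      /* network byte order */
        memcpy(adu, &wire, sizeof wire);
    }

    /* Acceptor side: recover the id from the first four received bytes. */
    static uint32_t extract_id(const char *adu) {
        uint32_t wire;
        memcpy(&wire, adu, sizeof wire);
        return ntohl(wire);
    }

    int main(void) {
        char adu[64] = { 0 };                /* the rest of a_1 is filler bytes */
        embed_id(adu, 123456);
        printf("acceptor identifies connection vector id %u\n", extract_id(adu));
        return 0;
    }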

One important design consideration in the implementation of our traffic generation approach is the assumption of independence among flows. While this is not completely realistic, the level of aggregation at which we generate traffic makes it a reasonable approximation (see Hohn et al. [HVA02] for a related discussion). This assumption makes traffic generation fully scalable, since \(\mathcal{T}_c\) can be partitioned into an arbitrary number of subsets. As long as there are enough traffic generation hosts, we can replay traffic from arbitrarily large traces.

Data Collection

We obtain two types of data from each experiment. First, we collect a new packet header trace \(\mathcal{T}_h^\prime \), which can be directly compared with the original packet header trace \(\mathcal{T}_h\) and analyzed with our methods to extract a new set of connection vectors \(\mathcal{T}_c^\prime \). This new set can be directly compared to \(\mathcal{T}_c\). Second, tmix instances create a number of logs. Some tmix logs can be used to verify that the traffic generation hosts did not run out of resources during traffic generation and that they successfully replayed their subsets of \(\mathcal{T}_c\). Other tmix logs report on the performance of the TCP connections in the experiments, including connection and epoch response times and a list of connections that did not complete, with a description of their progress by the end of the experiment.


