A brief write-up about the tcpdump to connection vector format – prepared for Long – some of this has been directly taken from Felix’s dissertation Chapter 7 on “Trace Resampling and Load Scaling”

Notes on converting a given tcpdump trace to the connection vectors that could be used by tmix and fed into treplay:

Step 1:

Convert the two binary tcpdump traces (one in each direction at the captured link) into two ascii files filtering so as to retain only TCP connections and discard all other protocols. Extract the connection vectors from these two traces, partition them into disjoint subsets. This extraction process separates out the connection vectors into sequential connections that were fully captured and concurrent connections that were fully captured. All other connections, including millions of empty TCP connections that are present in every trace, are discarded in this process. Note that the replay only uses fully captured connections.

Step 2

Convert connection vectors into tmix format

Step 3

Use the techniques of block resampling and load scaling to scale the connection vectors obtained above so they would generate a target load and replay the trace for a specified duration.

This is a high level view of tmix – much of this has been directly taken from Felix’s dissertation Chapter 7 on “Trace Resampling and Load Scaling”.

Given a captured trace T_c, the connection vectors can be generated, and then replayed in the laboratory testbed, for a specified average target load, and a specified duration. Without resampling of the connections and scaling the original load of the captured trace, traffic is generated according to a trace T_c = {T_i,C_i}. Each augmented connection vector C_i is replayed starting at time T_i. This implies that two different replays of T_c using the same hardware and the same physical network result in very similar synthetic traffic. In both cases, the synthetic traffic has the same number of TCP connections, replaying the same source-level behaviors under the same network-level parameters, and starting exactly at the same times. Only tiny variation would be introduced on the end-systems by changes in clock synchronization, operating system scheduling and interrupt handling, and at switches and routers by the stochastic nature of packet multiplexing. This reproducibility has been shown in Felix’s dissertation.

However, in the practice of experimental networking, experimenters often want to introduce more variability in their experiments. The technique used here is called block resampling.

This technique involves resampling blocks of connection vectors, while preserving arrival dependencies among the connections inside the same block. Each block is a group of connections observed in an interval of fixed duration, e.g., 1 minute in the original trace. It has been demonstrated that this technique is sufficient to reproduce the long-range dependent nature of the connection arrival process, which can never be captured by sampling independently from an exponential (or any other) distribution.

The term resampling refers to randomly selecting connection vectors, and the term sampling refers to randomly drawing values from a parametric distribution, such as the exponential distribution.

There is a common need in experimental networking research, where the performance of a network mechanism or protocol is often affected by the amount of traffic to which it is exposed. For example, active queue management mechanisms have very different performance depending on the level of utilization of the input link, so researchers generally perform experiments with offered loads that range from 50% to 110% of the link bandwidth. Rather than trying to find or collect traces with the exact range of loads needed (something generally difficult), this method produces a collection of resampled traces with the intended range of offered loads.

Average load l is defined as the total number of bytes injected in the network s divided by the total duration of the experiment d. Changing the average load in an experiment of constant duration therefore implies creating a scaled trace T_c^’ with a higher or a lower total number of bytes. The assumption is that it is possible to create a scaled trace T_c^’ “which looks like” the original T_c but has more or fewer bytes. This implies a model of the traffic that is general enough to encompass traces of different loads.

By using the block resampling method, the resampling of T_c to create a scaled T_c^’ can be driven by a target average load, making the load created by replaying scaled traces predictable. A key observation here is that the average load is a deterministic function of the total number of bytes in a trace T_c. In the case of Block resampling, constructing a new trace T_c^’ with a specific target offered load involves subsampling blocks (“thinning") to decrease load, or combining two or more blocks (“thickening”) to increase load.

Common application protocols make use of more than one connection, creating correlations among some start times. For example, web browsers often open several connections for downloading a web page. On the other hand, if the focus is on traces of highly aggregated traffic, where a large number of hosts start hundreds or even thousands of connections every second, aggregation could diminish or completely eliminate any correlations structure in the connection arrival processes.

The analysis of traces reveals non-negligible correlations in the connection arrival process. It was noted, however, that these dependencies between connection start times are far more likely to occur within relatively small periods. For example, web browsing results in new connections started according to the sequence of web page downloads and the way the browser opens new connections to the servers in which these pages are found. This results in bursts of connection starts that are correlated, but correlations are not expected to last very long. This intuition is used to develop a resampling method wherein the resampled objects are not individual connections, but groups of connections started during the same period. The key idea of the Block resampling method is to sample blocks of connections rather than individual connections, preserving the relative offsets of connection start times within blocks.

Given a trace Tc, it is divided in blocks of duration β, so that the first block B₁ groups together connections started in the interval [0; β), the second block B₂ groups together connections started in the interval [β; 2 β), and so on. The block resampled trace T_c^’ is obtained by concatenating randomly sampled blocks, and adjusting the start time of connections by the offset of the new block location. For example, if the random resampling put block B₂ as the first block of T_c^’, the start times of the connection vectors in this block are set to T_i - β. Similarly, if B₂ is placed in the fourth location of T_c^’, the start times of the connections in this block are set to T_i + 2β. More formally, when the j-th block Bj in the original trace becomes the k-th block Bk in the block resampling, the start time Ti of each connection vector in Bj is set to

T_i’ = T_i + (k - j) β

Block resampling chooses blocks for T_c^’ with replacement, making it possible to create new traces that are longer than the original T_c from which the blocks are obtained. Blocks with duration of 30 seconds or 1 minute provide the best trade off between blocks that are large enough to ensure realistic long-range dependence in the connection arrival process and blocks that are short enough to provide a large number of distinct resamplings.

Block resampling makes it possible to construct a resampled T_c^’ of arbitrary duration but it does not provide a method for adjusting its offered load. In order to perform this task, “load scaling” is used, where the blocks are “thinned", when the offered load of Tc is above the intended offered load, and “thickened", when the offered load of Tc is below the intended offered load. Thinning involves randomly removing connections from T_c^’. Theoretical work has shown that the thinning of a long-range dependent process does not change its long-range dependence structure [HV03 in Felix’s dissertation reference list].

For example, to double the load, the connection vectors of two randomly chosen blocks will be placed in the first interval, those from another pair of randomly chosen blocks will be placed in the second interval, and so on. It is also possible to achieve offered loads that are not a multiple of the original load by combining basic thickening and thinning in a two-step process. The first step is to create a preliminary version of T_c^’ by combining as many blocks as possible without exceeding the target load. The second step is to “complete" this trace by combining it with another block-resampled trace that has been thinned in such a manner that the combined load of the two traces matches the intended load. For example, in order to create a T_c^’ with 2.5 time the load of Tc, a first thickened trace T_c^tk is created by combining two blocks in each position. This trace is then combined with another trace T_c^tn that has been thinned to half of the offered load of Tc. In practice, the exact thinning of T_c^tn is tuned further to make the offered load of T_c^’ = T_c^tk + T_c^tn equal to 2.5 time that of Tc. One can therefore achieve any intended load with the block resampling method, so it is very flexible.

The results of several testbed experiments with block-resampled traces are shown in “Felix’s dissertation”. The achieved loads are very good approximations of the intended target offered loads.