

Abstract Source-Level Measurement

The a-b-t model provides an intuitive way of describing source behavior in an application-neutral manner that is relevant for the performance of TCP. However, this would be of little use without a method for measuring real network traffic and casting TCP connections into the a-b-t model. We have developed an efficient algorithm that can convert an arbitrary trace of TCP/IP protocol headers into a set of connection vectors. The algorithm makes use of the wealth of information that segment headers provide to extract an accurate description of the abstract source-level behavior of the applications driving each TCP connection in the trace. It should be noted that this algorithm is a first solution to a complex inference problem in which we are trying to understand application behavior from the segment headers of a measured TCP connection without examining payloads, and hence without any knowledge of the identity of the application driving the connection. This implies ``reversing'' the effects of TCP and the network mechanisms that determine how ADUs are converted into the observed segments that carry the ADU. The presented algorithm is by no means the only one possible, or the most sophisticated one. However, we believe it is sufficiently accurate for our purpose, and we provide substantial experimental evidence in this and later chapters to support this claim.


From TCP Sequence Numbers to Application Data Units

The starting point of the algorithm is a trace of TCP segment headers, \(\mathcal{T}_h\), measured on some network link. Our technique applies to TCP connections for which both directions are measured (known as a bidirectional trace), but we will also comment on the problem of extracting a-b-t connection vectors from a trace with only one measured direction (a unidirectional trace). While most public traces are bidirectional (e.g., those in the NLANR repository [nlaa]), unidirectional traces are sometimes collected when resources (e.g., disk space) are limited. Furthermore, routing asymmetries often result in connections that only traverse the measured link in one direction.

Figure 3.10: A first set of TCP segments for the connection vector in Figure 3.1: lossless example.
\includegraphics[width=3.0in]{fig/seg-diagram/infer.eps}

We will use Figure 3.10 to describe the basic technique for measuring ADU sizes and quiet time durations. The figure shows a set of TCP segments representing the exchange of data illustrated in the a-b-t diagram of Figure 3.1. After connection establishment (first three segments), a data segment is sent from the connection initiator to the connection acceptor. This data segment carries ADU \(a_1\), and its size is given by the difference between the end sequence number and the beginning sequence number assigned to the data (bytes 1 to 341). In response to this data segment, the other endpoint first sends a pure acknowledgment segment (with cumulative acknowledgment number 342), followed by two data segments (with the same acknowledgment numbers). This change in the directionality of the data transmission makes it possible to establish a boundary between the first data unit \(a_1\), which was transported using a single segment and had a size of 341 bytes, and the second data unit \(b_1\), which was transported using two segments and had a size of 2,555 bytes.
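Under the lossless, in-order conditions of Figure 3.10, the measurement rule reduces to summing bytes between changes in data directionality. The following is a minimal sketch in Python, assuming an idealized segment stream given as (direction, first byte sequence number, last byte sequence number) tuples; the representation and function name are illustrative, not the dissertation's tool:

```python
# Sketch: ADU sizes from a lossless, in-order stream of data segments.
# Each segment is (direction, start_seq, end_seq) with inclusive byte ranges.

def adu_sizes(data_segments):
    """Split the segment stream into ADU sizes at changes in directionality."""
    adus = []
    cur_dir, cur_bytes = None, 0
    for direction, start_seq, end_seq in data_segments:
        size = end_seq - start_seq + 1            # bytes carried by this segment
        if cur_dir is not None and direction != cur_dir:
            adus.append((cur_dir, cur_bytes))     # directionality change: ADU boundary
            cur_bytes = 0
        cur_dir = direction
        cur_bytes += size
    if cur_dir is not None:
        adus.append((cur_dir, cur_bytes))         # final ADU at connection end
    return adus

# The exchange of Figure 3.10: a_1 in one segment (bytes 1-341), then b_1 in
# two segments (bytes 1-1460 and 1461-2555) of the opposite direction.
print(adu_sizes([("a", 1, 341), ("b", 1, 1460), ("b", 1461, 2555)]))
# → [('a', 341), ('b', 2555)]
```

The output reproduces the sizes inferred in the text: 341 bytes for \(a_1\) and 2,555 bytes for \(b_1\).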

The trace of TCP segments \(\mathcal{T}_h\) must include a timestamp for each segment that reports the time at which the segment was observed at the monitoring device. Timestamps provide a way of estimating the duration of quiet times between ADUs. The duration of \(ta_1\) is given by the difference between the timestamp of the 4th segment (the last and only segment of \(a_1\)), and the timestamp of the 6th segment (the first segment of \(b_1\)). The duration of \(tb_1\) is given by the difference between the timestamp of the last data segment of \(b_1\) (7th segment in the connection) and the timestamp of the first FIN segment (8th segment in the connection).

Note that the location of the monitoring point between the two endpoints affects the measured duration of \(ta_1\) and \(tb_1\) (but not the measured sizes of \(a_1\) and \(b_1\)). Measuring the duration of \(ta_1\) from monitoring point 1 shown in Figure 3.10 results in an estimated time \(t_1\) that is larger than the estimated time \(t_2\) measured at monitoring point 2. Inferring application-layer quiet time durations is always complicated by this kind of measurement variability (among other causes), so short quiet times (with durations up to a few hundred milliseconds) should not be taken into account. Fortunately, the larger the quiet time duration, the less significant the measurement variability becomes, and the more important the effect of the quiet time on the lifetime of the TCP connection. We can therefore choose to assign a value of zero to any measured quiet time whose duration is below some threshold, e.g., 1 second, or simply use the measurement, disregarding the minor impact of its inaccuracy.
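The thresholding rule above can be stated in a couple of lines. A sketch, assuming timestamps in seconds and a hypothetical 1-second default threshold (the function and parameter names are illustrative):

```python
# Sketch: quiet-time estimation between two consecutive ADUs, zeroing out
# gaps too short to be measured reliably.

def quiet_time(last_ts_prev_adu, first_ts_next_adu, threshold=1.0):
    """Estimate the quiet time (seconds) between two consecutive ADUs.

    The gap is the difference between the timestamp of the last segment of
    one ADU and that of the first segment of the next; gaps below the
    threshold are treated as zero.
    """
    gap = first_ts_next_adu - last_ts_prev_adu
    return gap if gap >= threshold else 0.0

print(quiet_time(10.0, 14.0))   # → 4.0  (a significant quiet time)
print(quiet_time(10.0, 10.25))  # → 0.0  (below threshold: measurement noise)
```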

Figure 3.11: A second set of TCP segments for the connection vector in Figure 3.1: lossy example.
\includegraphics[width=3.0in]{fig/seg-diagram/infer2.eps}

If all connections were as ``well-behaved'' as the one illustrated in Figure 3.10, it would be trivial to create an algorithm to extract connection vectors from segment header traces. This could be done by simply examining the segments of each connection and counting the bytes sent between data directionality changes. In practice, segment reordering, loss, retransmission, duplication, and concurrency make the analysis much more complicated. Figure 3.11 shows a second set of segment exchanges that carry the same a-b-t connection vector of Figure 3.1. The first data segment of the ADU sent from the connection acceptor, the 6th segment, is lost somewhere in the network, forcing this endpoint to retransmit this segment some time later as the 9th segment. Depending on the location of the monitor (before or after the point of loss), the collected segment header trace may or may not include the 6th segment. If this segment is present in the trace (as in the trace collected at monitoring point 2), the analysis program must detect that the 9th segment is a retransmission and ignore it. This ensures we compute the correct size of \(b_1\), i.e., 2,555 bytes rather than 4,015 bytes. If the lost segment is not present in the trace (as in the trace collected at monitoring point 1), the analysis must detect the reordering of segments using their sequence numbers and still output a size for \(b_1\) of 2,555 bytes. Measuring the duration of \(ta_1\) is more difficult in this case, since the monitor never saw the 6th segment. The best estimate is the time \(t_1\) between the segment with sequence number 341 and the segment with sequence number 2555. Note that if the 6th segment is seen (as in a trace collected at monitoring point 2), the best estimate is the time \(t_2\) between the 5th and 6th segments.
A data acquisition algorithm capable of handling these two cases cannot simply rely on counts and data directionality changes, but must keep track of the start of the current ADU, the highest sequence number seen so far, and the timestamp of the last data segment. In our analysis, rather than trying to handle every possible case of loss and retransmission, we rely on a basic property of TCP to conveniently reorder segments and still obtain the same ADU sizes and inter-ADU quiet time durations. This makes our analysis simpler and more robust.


Logical Order of Data Segments

A fundamental invariant that underlies our previous ADU analyses is that every byte of application data in a TCP connection receives a sequence number, which is unique for its direction3.7. This property also means that data segments transmitted in the same direction can always be logically ordered by sequence number, and this order is independent of both the time at which segments are observed and any reordering present in the trace. The logical order of data segments is a very intuitive notion. If segments 6 and 7 in Figure 3.10 carried an HTML page, segment 6 carried the first 1,460 characters of this page, while segment 7 carried the remaining 1,095. Segment 6 logically preceded segment 7. When the same page is transmitted in Figure 3.11, the first half of the HTML is in segment 6 (which was lost) and again in segment 9. Both segments 6 and 9 (which were identical) logically precede segment 7, which carried the second half of the HTML page.

The notion of logical order of data segments can also be applied to segments flowing in opposite directions of a sequential TCP connection. Each new data segment in a sequential connection must acknowledge the final sequence number of the last in-order ADU received in the opposite direction. If this is not the case, then the new data is not sent in response to the previous ADU, and the connection is not sequential (i.e., two ADUs are being sent simultaneously in opposite directions). In the previous examples in Figures 3.10 and 3.11, we can see that both data segments comprising \(b_1\) acknowledge the final sequence number of \(a_1\). Intuitively, no data belonging to \(b_1\) can be sent by the server until \(a_1\) is completely received and processed. The data in \(a_1\) logically precede the data in \(b_1\), and therefore the segment carrying \(a_1\) logically precedes the segments carrying \(b_1\). Given the sequence and acknowledgment numbers of two data segments, flowing in the same or in opposite directions, we can always say whether the two segments carried the same data, or one of them logically preceded the other.

Connections that fit into the sequential a-b-t model are said to preserve a total order of data segments with respect to the logical flow of data:

For any pair of data segments \(p\) and \(q\), such that \(p\) is not a retransmission of \(q\) or vice versa, either the data in \(p\) logically precedes the data in \(q\), or the data in \(q\) logically precedes the data in \(p\).
In the example in Figure 3.11, the data in segment 9 logically precedes the data in segment 7 (e.g., segment 9 carries the first 1,460 bytes of a web page, and segment 7 carries the rest of the bytes). We know this because the sequence numbers of the bytes in segment 9 are below the sequence numbers of the bytes in segment 7. The first monitoring point observes segment 7 before segment 9, so the temporal order of these two segments did not match their logical data order. A total order also exists between segments that flow in opposite directions. In the example in Figure 3.11, the data in segment 4 logically precedes the data carried in the rest of the data segments in the connection. Timestamps and segment reordering play no role in the total order that exists in any sequential connection.

Logical data order is not present in data-concurrent connections, such as the one shown in Figure 3.8. For example, the segment that carried the last b-type ADU (the ``438 don't send'' ADU) may have been sent roughly at the same time as another segment carrying some of the new data of the data unit sent in the opposite direction (such as a CHECK ADU). Each segment would use new sequence numbers for its new data, and would acknowledge the data received so far by its endpoint. Since the endpoints have not yet seen the segment sent from the opposite endpoint, the two segments cannot acknowledge each other. Therefore, there is no causality between the segments, and neither segment can be said to precede the other. This observation provides a way of detecting data concurrency purely from the analysis of TCP segment headers. The idea is that a TCP connection that violates the total order of data segments described above can be said to be concurrent with certainty. This happens whenever a pair of data segments, sent in opposite directions, do not acknowledge each other, and therefore cannot be ordered according to the logical data order.

Formally, a connection is considered to be concurrent when there exists at least one pair of data segments \(p\) and \(q\) that either flow in opposite directions and satisfy

\begin{displaymath}p.seqno > q.ackno
\end{displaymath} (3.1)

and
\begin{displaymath}q.seqno > p.ackno,
\end{displaymath} (3.2)

or that flow in the same direction and satisfy
\begin{displaymath}p.seqno > q.seqno
\end{displaymath} (3.3)

and
\begin{displaymath}q.ackno > p.ackno.
\end{displaymath} (3.4)

Here \(p.seqno\) and \(q.seqno\) are the sequence numbers of \(p\) and \(q\) respectively, and \(p.ackno\) and \(q.ackno\) are the acknowledgment numbers of \(p\) and \(q\) respectively. Note that, for simplicity, our \(.ackno\) refers to the cumulative sequence number accepted by the endpoint (which is one unit below the actual acknowledgment number stored in the TCP header [Pos81]). The first pair of inequalities defines the bidirectional test of data concurrency, while the second pair defines the unidirectional test of data concurrency. We next discuss why a connection satisfying one of these tests must necessarily involve a concurrent exchange of data.

We consider first the case where \(p\) and \(q\) flow in opposite directions, assuming without loss of generality that \(p\) is sent from initiator to acceptor and \(q\) from acceptor to initiator. If they are part of a sequential connection, either \(p\) is sent after \(q\) reaches the initiator, in which case \(p\) acknowledges \(q\), so \(q.seqno = p.ackno\); or \(q\) is sent after \(p\) reaches the acceptor, in which case \(p.seqno = q.ackno\). Otherwise, a pair of data segments that do not acknowledge each other exists, and the connection exhibits data concurrency.

In the case of segments \(p\) and \(q\) flowing in the same direction, we assume without loss of generality that \(p.seqno < q.seqno\). The only way in which \(q.ackno\) can be less than \(p.ackno\) is when \(p\) is a retransmission sent after \(q\), and at least one data segment \(k\) with new data sent from the opposite direction arrives between the sending of \(p\) and the sending of \(q\). The arrival of \(k\) increases the cumulative acknowledgment number in \(p\) with respect to \(q\), which means that \(q.ackno < p.ackno\). In addition, \(k\) cannot acknowledge \(p\), or \(p\) would not be retransmitted. This implies that the connection is not sequential, since the opposite side sent new data in \(k\) without waiting for the new data in \(p\).

Thus, only data-concurrent connections have a pair of segments that can simultaneously satisfy inequalities (3.1) and (3.2) or inequalities (3.3) and (3.4). These inequalities provide a formal test of data concurrency, which we will use to distinguish sequential and concurrent connections in our data acquisition algorithm. Data-concurrent connections exhibit a partial order of data segments, since segments flowing in the same direction can always be ordered according to sequence numbers, but not all pairs of segments flowing in opposite directions can be ordered in this manner.
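The two tests can be expressed as a single predicate over a pair of data segments. The sketch below assumes each segment is represented as a (direction, seqno, ackno) tuple, where ackno is already the cumulative accepted sequence number as defined above; the representation is an assumption for illustration:

```python
# Sketch: the data-concurrency tests of inequalities (3.1)-(3.4).
# p and q are (direction, seqno, ackno) tuples; ackno is the cumulative
# sequence number accepted (header ackno minus one, as in the text).

def is_concurrent(p, q):
    """True if the pair (p, q) violates the logical data order."""
    p_dir, p_seq, p_ack = p
    q_dir, q_seq, q_ack = q
    if p_dir != q_dir:
        # Bidirectional test (3.1)-(3.2): neither segment acknowledges the other.
        return p_seq > q_ack and q_seq > p_ack
    # Unidirectional test (3.3)-(3.4): logically earlier data carries a later
    # acknowledgment (checked in both orders, since p and q are symmetric).
    return (p_seq > q_seq and q_ack > p_ack) or (q_seq > p_seq and p_ack > q_ack)
```

A connection would be flagged concurrent as soon as any pair of its data segments satisfies the predicate; the sequential exchange of Figure 3.10 never does, because each new ADU's segments acknowledge the final sequence number of the previous ADU.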

Situations in which all of the segments in a concurrent data exchange happen to be sent sequentially are not detected by the previous test. This can happen purely by chance, when applications send very little data or send it so slowly that concurrent data sent in the reverse direction is always acknowledged by each new data segment. Note also that the test detects concurrent exchanges of data, and not concurrent exchanges of segments, in which a data segment and an acknowledgment segment are sent concurrently. In the latter case, the logical order of data inside the connection is never broken because there is no data concurrency. Similarly, the simultaneous connection termination mechanism in TCP, in which two FIN segments are sent concurrently, is usually not associated with data concurrency. In the most common case, neither FIN segment carries data, or only one of them does, so the data concurrency definition is not applicable. It is however possible to observe a simultaneous connection termination where both FIN segments carry data, which is considered concurrency if these segments satisfy inequalities (3.1) and (3.2).


Data Analysis Algorithm

We have developed an efficient data analysis algorithm that can determine whether a connection is sequential or concurrent, and can measure ADU sizes and quiet time durations in the presence of arbitrary reordering, duplication, and loss. Rather than trying to analyze every possible case of reordering, duplication/retransmission, and loss, we rely on the logical data order property, which does not depend on the original order and timestamps.

Given the set of segment headers of a TCP connection sorted by timestamp, the algorithm performs two passes:

  1. Insert each data segment as a node into the data structure ordered_segments. This is a list of nodes that orders data segments according to the logical data order (bidirectional order for sequential connections, unidirectional order for concurrent connections). The insertion process also serves to detect data exchange concurrency: connections are initially considered sequential, so their segments are ordered bidirectionally, until a segment that cannot be inserted according to this order is found. No backtracking is needed after this finding, since bidirectional order implies unidirectional order for both directions.
  2. Traverse ordered_segments and output the a-b-t connection vector (sequential or concurrent) for the connection. This is a straightforward process, since segments in the data structure are already ordered appropriately.

The first step of the algorithm creates a doubly-linked list, ordered_segments, in which each list node represents a data segment using the following four fields:

The list always preserves the following invariant that we call unidirectional logical data order: for any pair of segments \(p\) and \(q\) sent in the same direction \(D\), the ordered_segments node of \(p\) precedes the ordered_segments node of \(q\) if and only if \(p.seqno_D < q.seqno_D\). At the same time, if the connection is sequential, the data structure will preserve a second invariant that we call bidirectional logical data order, which is the opposite of the data concurrency conditions defined above: for any pair of segments \(p\) and \(q\), the ordered_segments node of \(p\) precedes the ordered_segments node of \(q\) if and only if

\begin{displaymath}(p.seqno_A < q.seqno_A) \wedge (p.seqno_B = q.seqno_B)\end{displaymath}

or

\begin{displaymath}(p.seqno_A = q.seqno_A) \wedge (p.seqno_B < q.seqno_B).\end{displaymath}

Insertion of a node into the list starts backward from the tail of ordered_segments, looking for an insertion point that satisfies the first invariant. If the connection is still considered sequential, the insertion point must also satisfy the second invariant. This implies comparing the sequence numbers of the new segment with those of the last segment in ordered_segments. The comparison can result in the following cases:

Since TCP segments can be received out of order by at most \(W\) bytes (the size of the maximum receiver window), the backward search for an insertion point never goes backward more than \(W\) segments. Therefore, the insertion step takes \(O(sW)\) time, where \(s\) is the number of TCP data segments in the connection.
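The insertion step can be sketched as follows, with a Python list standing in for the doubly-linked ordered_segments and each node reduced to a (direction, seqno, ackno) tuple. This simplified version preserves only the per-direction order, drops exact duplicates as retransmissions, and raises a concurrency flag using the tests of inequalities (3.1)-(3.4); it is a sketch under these assumptions, not the dissertation's implementation:

```python
# Sketch: first pass of the analysis. Segments arrive sorted by timestamp;
# each is inserted by scanning backward from the tail of the ordered list.

def build_ordered_segments(segments):
    """Return (ordered list of segments, concurrency flag)."""
    ordered = []
    concurrent = False
    for seg in segments:
        d, seq, ack = seg
        i = len(ordered)
        while i > 0:
            pd, pseq, pack = ordered[i - 1]
            if pd == d:
                if pseq < seq:
                    break                 # per-direction order satisfied here
                if pseq == seq:
                    i = None              # retransmission: drop the duplicate
                    break
                if ack > pack:
                    concurrent = True     # unidirectional test (3.3)-(3.4)
            elif pseq <= ack:
                break                     # new segment acknowledges this one
            elif seq > pack:
                concurrent = True         # bidirectional test (3.1)-(3.2)
            i -= 1                        # keep scanning backward
        if i is not None:
            ordered.insert(i, seg)
    return ordered, concurrent

# Figure 3.11 at monitoring point 1: the retransmitted first half of b_1
# (seqno 1460) is observed after the second half (seqno 2555), yet ends up
# in logical order, and the connection is still classified as sequential.
print(build_ordered_segments([("i", 341, 0), ("a", 2555, 341), ("a", 1460, 341)]))
# → ([('i', 341, 0), ('a', 1460, 341), ('a', 2555, 341)], False)
```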

The second step is to walk through the linked list and produce an a-b-t connection vector. This can be accomplished in \(O(s)\) time using ordered_segments. For concurrent connections, the analysis goes through the list keeping separate data for each direction of the connection. When a long enough quiet time is found (or the connection is closed), the algorithm outputs the size of the ADU. For sequential connections, the analysis looks for changes in directionality and outputs the amount of data in between the change as the size of the ADU. Sufficiently long quiet times also mark ADU boundaries, indicating an epoch without one of the ADUs.

Reordering makes the computation of quiet times more complex than it seems. As shown in Figure 3.11, if the monitor does not see the first instance of the retransmitted segment, the quiet times should be computed based on the segments with sequence numbers 341 and 2555. This implies adding two more fields to the list nodes: \(min\_ts\) and \(max\_ts\), which record the earliest and the latest timestamps associated with the data of the node's segment.

These fields can be computed during the insertion step without any extra comparison of segments. The best possible estimate of the quiet time between two ADUs becomes

\begin{displaymath}q.min\_ts - p.max\_ts\end{displaymath}

for \(p\) being the last segment (in the logical data order) of the first ADU, and \(q\) being the first segment (in the logical data order) of the second ADU. For the example in Figure 3.11, at monitoring point 1, the value of \(min\_ts\) for the node for the 9th segment (that marks a data directionality boundary when segment nodes are sorted according to the logical data order) is the timestamp of the 7th segment. Therefore, the quiet time \(ta_1\) is estimated as \(t_1\). Note that the use of more than one timestamp makes it possible to handle IP fragmentation elegantly. Fragments have different timestamps, so a single timestamp would have to be arbitrarily set to the timestamp of one of the fragments. With our algorithm, the first fragment provides sequence numbers and usually \(min\_ts\), while the last fragment usually provides \(max\_ts\).

Other Issues in Trace Processing

Our trace processing algorithm makes two assumptions. First, it assumes we can isolate the segments of individual connections. Second, it assumes that no wraparound of sequence numbers occurs (otherwise, logical data order would not be preserved). These two assumptions can be satisfied by preprocessing the trace of segment headers. Isolating the segments of individual TCP connections is accomplished by sorting packet header traces on five keys: source IP address, source port number, destination IP address, destination port number, and timestamp. The first four keys can separate segments from different TCP connections as long as no source port number is reused. When a client establishes more than one connection to the same server (and service), these connections share IP addresses and destination port numbers, but not source port numbers. This is true unless the client is using so many connections that it reuses a previous source port number at some point. Such source port number reuse is relatively common in our long traces, which are at least one hour long. Since segment traces are sorted by timestamp, it is possible to look for pure SYN segments and use them to separate TCP connections that reuse source port numbers. However, SYN segments can suffer from retransmissions, just like any other segment, so the processing must keep track of the sequence number of the last SYN segment observed. Depending on the operating system of the connection initiator, this sequence number is either incremented or randomly set for each new connection. In either case, the probability of two connections sharing SYN sequence numbers is practically zero.
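The five-key sort and the SYN-based splitting can be sketched as follows, assuming one dict per segment with illustrative field names (src, sport, dst, dport, ts, flags, seqno); Python's sorted stands in for the Unix sort utility:

```python
# Sketch: isolate individual TCP connections by sorting on five keys, then
# split runs that share a 4-tuple whenever a SYN with a new sequence number
# appears (a retransmitted SYN repeats the previous SYN's sequence number).

def isolate_connections(segments):
    """Group segments into per-connection lists."""
    keyed = sorted(segments, key=lambda s: (s["src"], s["sport"],
                                            s["dst"], s["dport"], s["ts"]))
    connections, current = [], []
    prev_key, last_syn = None, None
    for s in keyed:
        key = (s["src"], s["sport"], s["dst"], s["dport"])
        new_conn = (key != prev_key or
                    ("SYN" in s["flags"] and s["seqno"] != last_syn))
        if new_conn and current:
            connections.append(current)   # close the previous connection
            current = []
        if "SYN" in s["flags"]:
            last_syn = s["seqno"]         # retransmitted SYNs repeat this
        current.append(s)
        prev_key = key
    if current:
        connections.append(current)
    return connections
```

With this sketch, two connections reusing the same source port are separated by their distinct SYN sequence numbers, while a retransmitted SYN does not split a connection in two.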

Segment sorting according to the previous five keys requires \(O(s \log s)\) time (we use the Unix sort utility for our work). It is also possible to process the data without an initial sorting step by keeping state in memory for each active connection. On the one hand, this can potentially eliminate the costly \(O(s \log s)\) step, making the entire processing run in linear time. On the other hand, it complicates the implementation and increases the memory requirements substantially3.8. Detecting the existence of distinct connections with identical source and destination IP addresses and port numbers requires \(O(s)\) time, simply by keeping track of SYN sequence numbers as discussed above. In our implementation, this detection is done at the same time as segments are inserted into the ordered_segments data structure, saving one pass.

TCP sequence numbers are 32-bit integers, and the initial sequence number of a TCP connection can take any value between 0 and \(2^{32} - 1\). This means that wraparounds are possible, and relatively frequent. One way to handle sequence number wraparound is by keeping track of the initial sequence number and performing a modular subtraction. However, if the SYN segment of a connection is not observed (and therefore the initial sequence number is unknown), using modular arithmetic will fail whenever the connection suffers from reordering of the first observed segments. In this case the subtraction would start in the wrong place, i.e., from the sequence number of the first segment seen, which is not the lowest sequence number due to the reordering. One solution is to use backtracking, which complicates the processing of traces.

A related problem is that representing sequence numbers as 32-bit integers is not sufficient for connections that carry more than \(2^{32}\) bytes of data (4 GB). The simplest way to address this measurement problem is to encode sequence numbers using more than 32 bits in the ordered_segments data structure. In our implementation we use 64 bits to represent sequence numbers, and rely on the following algorithm3.9 to accurately convert 32-bit sequence numbers to 64-bit integers even in the presence of wraparounds. The algorithm makes use of a wraparound counter and a pair of flags for each direction of the connection. The obvious idea is to increment the counter each time a transition from a high sequence number to a low sequence number is seen. However, due to reordering, the counter could be incorrectly incremented more than once. For example, we could observe four segments with sequence numbers \(2^{32}-1000, 1000, 2^{32}-500\), and \(2000\). Wraparound processing should convert them into \(2^{32}-1000, 2^{32}+1000, 2^{32}-500\), and \(2^{32}+2000\). However, if the wraparound counter is incremented every time a transition from a high sequence number to a low sequence number is seen, the counter would be incremented once for the segment with sequence number \(1000\) and again for the segment with sequence number \(2000\). In this case, the wraparound processing would result in four segments with sequence numbers \(2^{32}-1000, 2^{32}+1000, 2^{32}-500\), and \(2^{32}+2^{32}+2000\). The second increment of the counter would be incorrect.

The solution is to use a flag that is set after a ``low'' sequence number is seen, so the counter is incremented only once after each ``crossing'' of \(2^{32}\). This opens up the question of when to unset this flag so that the next true crossing increments the counter. This can be solved by keeping track of the crossing of the middle sequence number. In our implementation, we use two flags, low_seqno and high_seqno, which are set independently. If the next segment has a sequence number in the first quarter of \(2^{32}\) (i.e., in the range between 0 and \(2^{30}-1\)), the flag low_seqno is set to true. If the next segment has a sequence number in the fourth quarter of \(2^{32}\) (i.e., in the range between \(3 \cdot 2^{30}\) and \(2^{32} - 1\)), the other flag, high_seqno, is set to true. These flags are unset, and the counter incremented, when both flags are true and the next segment is not in the first or the fourth quarter of \(2^{32}\). Sequence numbers in the first quarter are incremented by \(2^{32}\) times one more than the counter, while the rest are incremented by \(2^{32}\) times the counter. This handles the pathological reordering case in which the sequence number of the first segment in a connection is very close to zero, and the next segment is very close to \(2^{32}\). In this case the low sequence number would be incremented by \(2^{32}\). This algorithm requires no backtracking, and runs in \(O(s)\) time. In our implementation, the sequence number conversion algorithm has been integrated into the same pass as the insertion step of the ADU analysis.
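A literal sketch of this conversion, assuming the fourth quarter starts at \(3 \cdot 2^{30}\) and keeping one counter-and-flags state per direction. Note that this literal reading uniformly offsets first-quarter numbers by one extra lap, so a production implementation would need additional care for connections whose sequence numbers legitimately begin in the first quarter:

```python
# Sketch: 32-bit to 64-bit sequence number conversion with a wraparound
# counter and the low_seqno/high_seqno flags described in the text.

Q1 = 1 << 30          # end of the first quarter of 2^32
Q4 = 3 * Q1           # start of the fourth quarter (assumed boundary)
WRAP = 1 << 32

class SeqnoExtender:
    """Per-direction state: wraparound counter plus the two flags."""

    def __init__(self):
        self.wraps = 0
        self.low_seqno = False
        self.high_seqno = False

    def extend(self, seqno):
        if seqno < Q1:
            self.low_seqno = True
        elif seqno >= Q4:
            self.high_seqno = True
        elif self.low_seqno and self.high_seqno:
            # A middle-range number after both quarters were seen: the
            # crossing of 2^32 has completed, so count it exactly once.
            self.wraps += 1
            self.low_seqno = self.high_seqno = False
        # First-quarter numbers belong to the next "lap" around 2^32;
        # the rest belong to the current one.
        lap = self.wraps + 1 if seqno < Q1 else self.wraps
        return lap * WRAP + seqno

# The reordered example from the text.
ext = SeqnoExtender()
seqs = [(1 << 32) - 1000, 1000, (1 << 32) - 500, 2000]
print([ext.extend(s) for s in seqs] ==
      [(1 << 32) - 1000, (1 << 32) + 1000, (1 << 32) - 500, (1 << 32) + 2000])
# → True
```

The counter itself is only incremented once the connection's sequence numbers move back into the middle range, which prevents the double increment described in the example above.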

Our data acquisition techniques have been implemented in the analysis program tcp2cvec. The program also handles a number of other difficulties that arise when processing real traces, such as TCP implementations that behave in non-standard ways. In addition, it also implements the analysis of network-level parameters described in the next chapter.



Doctoral Dissertation: Generation and Validation of Empirically-Derived TCP Application Workloads
© 2006 Félix Hernández-Campos