Our methodology relies strongly on *non-parametric modeling*. Parametric models are far more
compact and can often provide deeper insights than non-parametric ones. However, this choice
has little bearing on the quality of synthetic traffic. A non-parametric model can produce
traffic that is as realistic as, or more realistic than, that of a parametric model, without the risk of oversimplification.
In any case, our a-b-t connection vectors offer a good foundation for building a parametric
model of Internet traffic mixes. Our analysis of the relationship between ADU sizes and
numbers of epochs in Section 3.5.1 uncovered substantial complexity
and a striking lack of consistency among the different links considered in our study.
Techniques like Hidden Markov Modeling could perhaps provide the right approach.
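To make the HMM suggestion concrete, the following is a minimal generative sketch in that spirit: a connection's epochs are sampled from a two-state model in which hidden states correspond to communication modes and each epoch emits an (a-size, b-size) pair. The states, transition probabilities, stopping probability, and size ranges below are purely illustrative assumptions, not values derived from our traces:

```python
import random

random.seed(7)

# Illustrative 2-state generative model: hidden states are communication
# modes; each epoch emits a hypothetical (a-size, b-size) pair in bytes.
STATES = ("interactive", "bulk")
TRANS = {"interactive": {"interactive": 0.8, "bulk": 0.2},
         "bulk":        {"interactive": 0.3, "bulk": 0.7}}
EMIT = {"interactive": lambda: (random.randint(10, 200), random.randint(10, 500)),
        "bulk":        lambda: (random.randint(100, 1000), random.randint(10_000, 1_000_000))}
STOP = 0.1  # probability of ending the connection after an epoch

def sample_connection():
    """Sample one a-b-t-style connection vector as a list of (a, b) epochs."""
    state, epochs = random.choice(STATES), []
    while True:
        epochs.append(EMIT[state]())
        if random.random() < STOP:
            return epochs
        # draw the next hidden state from the transition distribution
        r, acc = random.random(), 0.0
        for nxt, p in TRANS[state].items():
            acc += p
            if r < acc:
                state = nxt
                break

conn = sample_connection()
print(len(conn), conn[0])
```

Fitting such a model to measured connection vectors (rather than fixing its parameters by hand, as here) is precisely the open problem.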

Our own related work explored the possibility of attacking this complex modeling problem
by decomposing traffic mixes into a set of fundamental patterns of communication
[HCNSJ05].
The idea was to use statistical clustering to find applications that behave in a similar
manner, *i.e.*, that follow the same ``communication pattern'', and to separately model each of
the identified *traffic clusters*. For example, interactive applications such as telnet and
SSH are very different from file-sharing applications such as Kazaa or Gnutella, so it seems much
easier to develop separate models for ``interactive applications'' and ``file-sharing applications''
than a single model to encompass both of them. In our exploratory study, we followed a two-step
process to find traffic clusters. First, we computed a vector of features for each
connection, which included statistics such as the median size of the ADUs in the connection,
a measure of the directionality of the data exchanges, and the correlation between the sizes
of a-type and b-type ADUs. Feature vectors provide a way to compare connections, even when their
a-b-t connection vectors have very different forms, and make it possible to use a distance
metric to quantify the similarity between the source behaviors in two connections.
Second, we used a hierarchical clustering algorithm to construct a taxonomy of traffic classes
based on the similarity among connections. The results of our analysis demonstrated that
some clear and intuitive traffic clusters emerged when this procedure was applied to sets of
connection vectors derived from real traces. We believe this type of approach can simplify
the modeling of traffic mixes. Furthermore, it can also provide a more flexible way of
resampling traces, where the fraction of connection vectors from each of the traffic clusters
can be changed at will (*e.g.*, increasing or decreasing the fraction of file-sharing-like traffic).
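The two-step procedure can be sketched as follows. The feature vectors and the single-linkage agglomerative clustering below are illustrative assumptions (our study used richer feature sets and a hierarchical clustering algorithm); the sketch only shows how a distance metric over per-connection features separates interactive-like from file-sharing-like behavior:

```python
from math import dist  # Euclidean distance between feature vectors

# Hypothetical per-connection feature vectors:
# (median ADU size, directionality measure, a/b size correlation).
# Values are illustrative, not taken from any trace.
features = {
    "ssh-1":   (120.0, 0.9, 0.1),
    "ssh-2":   (150.0, 0.8, 0.2),
    "kazaa-1": (500_000.0, 0.10, 0.00),
    "kazaa-2": (480_000.0, 0.15, 0.05),
}

def single_linkage(points, k):
    """Naive agglomerative (single-linkage) clustering down to k clusters."""
    clusters = [[name] for name in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    return clusters

print(single_linkage(features, 2))
# → [['ssh-1', 'ssh-2'], ['kazaa-1', 'kazaa-2']]
```

Cutting the resulting hierarchy at different levels yields the taxonomy of traffic classes, and resampling can then draw connection vectors from each cluster in any desired proportion.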

There are other open questions in the modeling of Internet traffic mixes, and their solution is complicated by the need to devise better measurement methods. We can cite the following examples:

- Our modeling of concurrent connections employs two separate connection vectors, one for each direction, eliminating any dependencies among ADUs flowing in opposite directions. These dependencies are certainly present in some cases, at least when concurrency is used to implement pipelining. A refined version of the a-b-t model, in which the causality between ADUs is specified using an acyclic graph, could capture this type of structure. The analysis of sequence and acknowledgment numbers can provide a starting point for understanding ADU dependencies. However, such an approach would result in a substantial number of spurious dependencies that were never part of the application behavior.
- The a-b-t model has no mechanism to specify dependencies between ADUs in different
connections. While more complex forms of the model are possible, there is again
great difficulty in determining when these dependencies exist. By analyzing ADU arrival
times for the same endpoint, we could hypothesize a dependency. We could further strengthen
such an analysis by requiring several instances of the same dependency pattern,
*i.e.*, only accepting a timing dependency when several pairs of connections with ``similar'' ADU sizes and numbers of epochs are observed.
- An important problem that has received very limited attention in the source-level
modeling literature is the possibility of changes in user behavior as a function of network
conditions. Such a possibility would break the assumption of network independence in source-level
models. Our work in this area [PHCM$^+$06] revealed significant difficulties in measuring
such dependencies. Even a simple question such as whether users with higher access bandwidths
tended to download larger objects was statistically problematic. Our results showed that
this trend does not appear to be present in the UNC trace. While substantial differences
exist in the access bandwidth of different UNC endpoints (*e.g.*, between wireless and
wired end hosts), the number of endpoints with severely limited bandwidth is very small
(*e.g.*, few endpoints were behind a modem).

A final question is how to combine source-level modeling and unwanted traffic modeling.
Our analysis in Section 4.2.1 showed the need to carefully separate connections with
regular data exchanges, for which the a-b-t model is applicable, from other types of
connections (*i.e.*, failed connection establishment attempts, port and network scans, *etc.*).
While our filtering for regular connections removed only a tiny fraction of the bytes
in the traces, the number of individual connections removed was very large, which may be
detrimental to certain studies. In addition, we did not consider how to generate malicious
traffic. Our literature review discussed some relevant efforts on this topic. However,
they tend to be open-loop. Since malicious traffic can have a dramatic effect on
network conditions, understanding its impact on source behavior seems critical. We know
of no study that has considered this question.
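A minimal sketch of the kind of filtering discussed above, separating regular connections from failed establishments and scan-like connections. The field names and thresholds are hypothetical illustrations, not the actual criteria used in our analysis:

```python
# Hypothetical per-connection summary: whether the TCP handshake completed
# and how many payload bytes were exchanged in total.
def classify(conn):
    """Classify a connection summary as regular or as unwanted traffic."""
    if not conn["handshake_completed"]:
        return "failed-establishment"   # e.g., unanswered or reset SYNs
    if conn["payload_bytes"] == 0:
        return "scan-like"              # connected but exchanged no data
    return "regular"                    # usable under the a-b-t model

conns = [
    {"handshake_completed": False, "payload_bytes": 0},
    {"handshake_completed": True,  "payload_bytes": 0},
    {"handshake_completed": True,  "payload_bytes": 42_000},
]
print([classify(c) for c in conns])
# → ['failed-establishment', 'scan-like', 'regular']
```

Only the ``regular'' subset would feed the a-b-t analysis; generating the other classes in a closed-loop manner remains open.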

Doctoral Dissertation: *Generation and Validation of Empirically-Derived TCP Application Workloads*

© 2006 Félix Hernández-Campos