Due: 11:59pm Thursday December 12, 2013
The project will be a non-trivial original piece of work, produced by an individual student or by two students working together as a team. The project could be structured in any of several ways; the end result should be at the level of a decent workshop paper. I will offer several ideas for projects related to issues we have been covering in lecture. However, you are welcome to select a topic of your own choosing if you prefer. Please let me know ASAP when you have chosen a topic so I can offer suggestions on how to proceed before you (attempt to) do all the work.
The purpose of the project is for you to show me that you understand the concepts and techniques covered in the class, and that you can apply them in ways that go beyond simply "plugging numbers into formulas" or retelling what someone else has written. In particular, do not take material directly from other sources (e.g., written descriptions, figures, computer code, experimental data, etc.) and use it as if it were your own work! That is called "plagiarism" and it is an academic offence for which I give a failing grade.
Almost every year I catch someone who turned in a "survey paper" that was copied word-for-word from another document (or perhaps pieced together from sections copied from 2 or 3 documents). That is just as wrong as copying someone else's computer program. Moreover, keeping the same structure while rewriting each sentence using different words with the same meaning is still wrong -- like changing the variable names, or reordering the function definitions, to hide a program's origin.
It is fine to include a few specific items taken directly from other sources -- a short definition, a well-done figure, etc. -- as long as you state the source of the item and provide a reference. Anything longer than a few sentences should be rewritten and reorganized using your own words (yes, even if you are not a native speaker of English!) and possibly your own terminology and/or mathematical notation. (Developing your own terminology is surprisingly important when trying to compare related papers from different authors, since their terminologies are often unrelated and/or conflicting.) More importantly, your goal should be to explain each author's approach, specific contribution, point of view, etc., in a way that highlights the differences and similarities among the authors.
TCP's standard sliding-window mechanism creates a major performance bottleneck on "Long-Fat Networks" (LFNs). Long ago, a 1.5 Mbps connection through a geostationary satellite link would have been considered an LFN, because its 0.5 second minimum RTT demands an advertised window size of at least 100 Kbytes to avoid starvation, whereas 64 Kbytes is the largest value that can fit in the 16-bit field in the TCP header. RFC 1323 offered some relief from the problem by defining the window-scaling option, protection against sequence number wrap-around, and recommendations for RTT calculation. Today, however, there is a demand for multi-gigabit per second connection speeds between supercomputer centers located on different continents -- which is far beyond the capabilities of TCP's additive-increase, multiplicative-decrease congestion control algorithm. Therefore, let's ignore the issue of fitting a very large advertised window and/or byte-level sequence number into a TCP header and focus on adapting TCP's additive-increase, multiplicative-decrease (AIMD) congestion control algorithm to work well on LFNs.
For concreteness, assume the target is to support a continuous transmission rate of about 1 Gigabit/sec (i.e., 10^9 bps) across a trans-continental path (i.e., the RTT is 100 msec). In this case, the TCP sender would need to use a congestion window size of about 8,500 packets (based on an Ethernet-like 1,500 byte MSS) in order to maintain the target rate, since the bandwidth-delay product is 10^9 bps x 0.1 sec / (1,500 x 8 bits per packet) ~ 8,333 packets. Consider one of the following sub-topics:
(1a) Notice that AIMD's linear growth (one MSS per RTT) is proportionately very small as the size of the congestion window increases, so the penalty (in lost throughput) for each packet loss is very severe for LFNs. Therefore, evaluate the benefits of adding the following RAID-inspired form of error correction to the TCP stream. Following each block of B consecutive TCP segments, assume the TCP transmitter sends a special "error correction segment", consisting of the bitwise exclusive-OR of the entire block, to the TCP receiver. Thus, if at least B out of the B+1 segments associated with this block reach the receiver, it can reconstruct a single lost TCP segment. However, two or more losses from the same block cannot be fixed. Furthermore, transmitting the special error correction segments scales the throughput of the TCP connection by a factor of B/(B+1), whether or not there were any lost segments. Evaluate the effectiveness of this error-correction method on the throughput for the target TCP flow. Estimate the optimal value of B as a function of the probability of loss, P, between 10^-3 and 10^-6.
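As a starting point, here is a minimal back-of-the-envelope sketch in Python of one way to attack (1a). It assumes independent segment losses with probability P, treats a segment as unrecoverable only when two or more of the B+1 segments in its block are lost, and plugs the residual loss rate into the well-known Mathis et al. approximation (rate ~ (MSS/RTT) * sqrt(3/2) / sqrt(p)), capped at the 1 Gbit/s target; the function names and the brute-force search over B are my own illustration, not a required method.

```python
import math

LINK_BPS = 1e9          # target rate from the problem statement
MSS_BITS = 1500 * 8
RTT_S = 0.1

def residual_loss(B: int, P: float) -> float:
    """Probability a given data segment is lost AND cannot be rebuilt
    from the XOR parity: the segment itself is lost, and at least one
    of the other B segments in its block of B+1 is also lost
    (independent losses assumed)."""
    return P * (1.0 - (1.0 - P) ** B)

def goodput_bps(B: int, P: float) -> float:
    """Mathis-style TCP rate ~ (MSS/RTT)*sqrt(3/2)/sqrt(p), capped at
    the link rate, then scaled by the B/(B+1) parity overhead."""
    rate = (MSS_BITS / RTT_S) * math.sqrt(1.5 / residual_loss(B, P))
    return (B / (B + 1.0)) * min(LINK_BPS, rate)

for P in (1e-3, 1e-4, 1e-5, 1e-6):
    best = max(range(1, 50_000), key=lambda B: goodput_bps(B, P))
    print(f"P = {P:.0e}: best B ~ {best}, "
          f"goodput ~ {goodput_bps(best, P) / 1e9:.2f} Gbit/s")
```

A more careful evaluation would replace the Mathis approximation with a packet-level simulation, since bursty losses violate the independence assumption built into this sketch.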
(1b) This time, consider how to modify the linear growth in AIMD's congestion-avoidance phase to better handle a very large congestion window. For example, you could try increasing the congestion window by more than one MSS per RTT, such as the square root or logarithm of the current window size. Find a function that produces a ratio between the rates of decrease (-50%) and increase similar to what AIMD would give for small window sizes (say 4-32). Evaluate what happens if two LFNs using your modified version of TCP meet at a common bottleneck.
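To get a feel for the fairness question at the end of (1b), here is a minimal discrete-time sketch (one step per RTT) of two flows sharing a bottleneck. The square-root increase rule and the synchronized-loss assumption (both flows halve whenever the sum of their windows exceeds the pipe size) are simplifying assumptions of mine, not part of the assignment.

```python
import math

def simulate(pipe_pkts, increase, w_a, w_b, rtts=5000):
    """Two flows at one bottleneck, one step per RTT. Whenever the
    combined window exceeds the pipe size, both flows see a loss and
    halve their windows (synchronized losses -- a strong assumption)."""
    for _ in range(rtts):
        w_a += increase(w_a)
        w_b += increase(w_b)
        if w_a + w_b > pipe_pkts:
            w_a, w_b = w_a / 2.0, w_b / 2.0
    return w_a, w_b

rules = {"classic +1/RTT": lambda w: 1.0,
         "+sqrt(w)/RTT":   lambda w: math.sqrt(w)}

# Start the flows far apart and see whether they converge to a fair share.
for name, rule in rules.items():
    w_a, w_b = simulate(17_000, rule, w_a=100.0, w_b=12_000.0)
    print(f"{name}: flow A ~ {w_a:.0f} pkts, flow B ~ {w_b:.0f} pkts")
```

One thing to look at is whether the smaller flow still catches up (as it does under classic AIMD), and how your candidate growth function changes the convergence speed and the oscillation amplitude.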
(1c) Finally, consider the option of changing AIMD's unit of window adjustment from a single maximum-sized segment (MSS) to the spacing within a stream of packets. Unlike part (1b), this time we assume that the TCP sender uses AIMD to adjust the gap between consecutive packets while transmitting a continuous stream, i.e., double the gap after each loss, otherwise reduce the gap linearly.
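A minimal sketch of this update rule, with placeholder step sizes of my own choosing; note that since the sending rate is MSS/gap, an additive decrease in the gap is a super-linear increase in the rate, which is one of the behaviors worth evaluating.

```python
def update_gap(gap_s: float, loss: bool,
               step_s: float = 1e-6, min_gap_s: float = 1e-5) -> float:
    """Rate-based analogue of AIMD: on a loss, double the inter-packet
    gap (i.e., halve the sending rate); otherwise shrink the gap by a
    small additive step (gradually raising the rate). The step sizes
    are placeholders, not tuned values."""
    if loss:
        return 2.0 * gap_s
    return max(min_gap_s, gap_s - step_s)
```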
(2) Modify TCP to handle the exchange of "direct access/block oriented" data (like a disk file) rather than "sequential" data (like a magnetic tape).
Recall that a TCP session transfers a sequence of bytes [a, a+1, ..., a+n] -- reliably -- from Alice to Bob, where the parameters a and n are determined by Alice's SYN and FIN packets. Bob only releases the data to his client in sequential order, and Bob's advertised window restricts the maximum amount by which Alice's transmissions can deviate from this strict ordering. However, this reliable byte-stream abstraction is not appropriate for every application. For example, in a live video chat application, Bob may prefer a momentary gap in the picture over a long pause in the event of a packet loss. More generally, suppose the data represents a decision tree, and after seeing node 1 Bob knows he wants to jump directly to the subtree rooted at node 50. Why not let him skip the intermediate nodes if he already knows the information is useless to him?
Thus, consider the alternative of abstracting the data transfer as a sequence of reads by Bob from a direct-access file at Alice. At connection establishment, Alice would announce both the first and last sequence numbers in the file. (Later on, Alice could update these values if the attributes of the file changed, for example, a live video stream where new images are constantly being added and old ones are thrown away.) Similarly, starting at connection establishment, and instead of acknowledging new data as it is received, Bob would respond to Alice with updated "wish lists" of content from the file that he wants Alice to send. Clearly, Bob can adjust the size of his "wish list" to apply flow control, similar to the "advertised window" in TCP acknowledgements. However, since Bob may choose not to ask for data to be sent sequentially (and he is also free to change his mind, and remove items from his wish list that he never received), Alice cannot identify lost packets from the way in which Bob changes his "wish list". Therefore, Bob must provide some other feedback mechanism beyond the "wish list" so that Alice can use a TCP-like congestion-control algorithm.
Modify the TCP header fields that are carried in connection-establishment packets, data packets, and acknowledgements to support the "direct access" model described above. Show how your modified header fields would handle a variety of situations, including: (a) normal TCP-like sequential, reliable, flow-controlled data transfers, (b) real-time unreliable flow-controlled multimedia streaming, and (c) receiver-controlled non-sequential access to a file (e.g., Bob carrying out a binary search over a large file stored at Alice, without transferring the whole data set).
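To make the design space concrete, here is one hypothetical shape the modified fields could take, written as Python dataclasses purely for illustration. Every field name below is my own invention; a real design would still have to settle the wire encoding, option negotiation, and the exact congestion-feedback mechanism.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Range = Tuple[int, int]   # [first, last] byte sequence numbers, inclusive

@dataclass
class DirectAccessSyn:
    """Connection setup: Alice announces the file's extent instead of
    just an initial sequence number (both values may be updated later,
    e.g., for a live stream)."""
    first_seq: int
    last_seq: int

@dataclass
class WishListUpdate:
    """Bob's replacement for a cumulative ACK: the ranges he still
    wants, in the order he wants them. Shrinking or reordering the
    list doubles as flow control; the received-segment counter gives
    Alice the loss feedback that a wish list alone cannot provide."""
    wanted: List[Range] = field(default_factory=list)
    max_outstanding: int = 64 * 1024   # flow-control budget, in bytes
    segments_received: int = 0         # cumulative count, for congestion control
```

Under this sketch, case (a) is a wish list of one big range [(a, a+n)]; case (b) has Bob repeatedly dropping stale ranges as the playback point advances; and case (c) keeps exactly one small range in the list at any time, hopping around the file as the binary search proceeds.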
(4) Fun with Network Trace Data.
For many years, researchers have been studying the statistical properties of network traffic collected by a "sniffer" attached to a single link in the Internet core. As we will see in lecture, there has been great interest in understanding the packet inter-arrival time distributions, i.e., Poisson vs. long-range dependence. However, the properties of the packet size distribution have largely been ignored, other than recognizing that sizes tend to be bimodal (e.g., data segments vs. ACKs). In this project, you will look at publicly available sources of network traces, such as the WIDE project, to determine the degree (if any) and nature of the statistical correlation in the lengths of sequentially transmitted packets.
More specifically, let f(x) = Pr[packet size is x bytes] be the unconditional probability density function for packet sizes (for example, see the summary page for a typical WIDE dataset). In general, classical analytical models in queueing theory must assume that the size of each packet is chosen independently from f(x), without regard for the sizes of any other packets. Another possibility, inspired by the study of long-range dependence in interarrival times, would be to assume that the sizes of the i-th and j-th packets in the sequence are correlated in a way that depends on their separation, |i-j|. A third possibility might be that the packet size distribution is related to its location relative to the start of a busy period, i.e., if x_i and x_j represent the sizes of the i-th and j-th packets (if any) transmitted during a single busy period, then how are their respective packet size distributions f_i(x_i) and f_j(x_j) related?
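As a concrete starting point for the second hypothesis, here is a minimal sketch of the measurement involved. It assumes you have already exported one packet length per line, in transmission order, from a trace into a text file (e.g., with tshark -r trace.pcap -T fields -e frame.len); the filename and the list of lags are placeholders.

```python
import numpy as np

# One packet length (bytes) per line, in transmission order.
sizes = np.loadtxt("sizes.txt")          # placeholder filename
x = sizes - sizes.mean()

def autocorr(x: np.ndarray, lag: int) -> float:
    """Sample autocorrelation of the (mean-removed) size sequence at
    the given lag; values near zero at every lag > 0 are what the
    classical independence assumption predicts."""
    return float(np.dot(x[:-lag], x[lag:]) / ((len(x) - lag) * x.var()))

for lag in (1, 2, 5, 10, 100, 1000):
    print(f"lag {lag:5d}: r = {autocorr(x, lag):+.3f}")
```

For the third hypothesis you would first have to segment the trace into busy periods (e.g., by thresholding the inter-arrival gaps) and then compare the empirical size distributions position by position.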
(5) Data Center Networking
In recent years, there has been considerable interest in how to design specialized networks for large data centers. Start with the idea that a data center is a large cluster with thousands (perhaps hundreds of thousands) of "nodes" (rack-mount PCs, file servers, etc.) connected by a fast local area network (such as 1G or 10G Ethernet). Each of those rack-mount PCs may further serve as the platform for hosting tens, hundreds, or even more virtual machines. Most of those virtual machines are configured to serve some specific task. However, the number of active copies of each VM configuration, and their placement on physical nodes, may change dynamically as a function of workload. In addition, the data center may provide "cloud computing" services to various outside clients, each of which has service level agreements that may require the data center to dedicate certain resources to that client while isolating it from everyone else. Therefore, in the ideal case, a data center network should be able to move lots of data quickly from any subset of nodes to any other subset of nodes. (For example, search-engine query processing is based on the MapReduce paradigm, which follows a broadcast/convergecast data exchange pattern over a multi-level tree structure. Others have suggested "striping" a large file across multiple file servers to reduce a client's access time, which may lead to network overload due to "incast" traffic.)
Orthogonal to the transport-layer issues that arise in high-throughput, low-latency networking, data center networking must also overcome the issues of scaling, in which (what could be viewed as) a single local area network might need to support a million nodes (including virtual machines, not all of which may be active at the same time). For example, suppose we try to build such a network using commodity layer-2 bridges, each of which has 50 1G Ethernet ports (say) connected by an internal non-blocking interconnection fabric.
Using a single switch, we could create an optimally-connected, isolated 50-node "island", where we can move data from any subset of nodes to any other, subject to a maximum of 1 Gigabit/sec entering and leaving each node.
If we reserved one port as an "uplink", then we would need at least 20,835 switches to connect every node into a single tree structure, using 20,408 level-0 "edge" switches, 417 level-1 switches, nine level-2 switches, and a single level-3 switch in the middle. Unfortunately, the (single) uplink from a level-k switch must carry all the traffic going to and from the 49^(k+1) nodes belonging to its subtree, and is therefore vastly overloaded unless the applications are written (and nodes are placed) in a way that restricts data exchanges to occur only between (physical) neighbors. Indeed, each uplink from a level-2 switch carries all the traffic between the ~112,000 nodes that belong to its level-2 subtree and the ~888,000 nodes that belong to other level-2 subtrees. Thus, if we assume a uniform routing pattern, these links will be overloaded by a factor of about 10^5.
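The overload estimate is easy to reproduce; the helper below is just the arithmetic from the paragraph above (uniform any-to-any traffic at 1 Gbit/s per node), not a general topology planner.

```python
NODES = 1_000_000

def uplink_load_gbps(subtree_nodes: float, total_nodes: int = NODES) -> float:
    """Under uniform any-to-any traffic at 1 Gbit/s per node, a subtree
    of n nodes pushes n * (total - n) / total Gbit/s through its
    uplink(s) in each direction."""
    return subtree_nodes * (total_nodes - subtree_nodes) / total_nodes

# Single-uplink tree: each of the nine level-2 subtrees holds about
# 1,000,000 / 9 ~ 112,000 nodes, all squeezed through one 1G uplink.
print(f"level-2 uplink: ~{uplink_load_gbps(NODES / 9):,.0f} Gbit/s "
      f"on a 1 Gbit/s link")   # ~10^5 overload
```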
If we use faster uplink speeds and/or employ link aggregation in the higher levels of the tree, then we can trade additional network hardware costs for a (slight) reduction in the severity of the bottleneck. For example, suppose we provide each level-k switch with 2^(k+1) uplinks. Then we need 20,833 level-0 "edge" switches (each supporting 48 single downlinks and a double uplink), 906 level-1 switches (each supporting 23 double downlinks and a quadruple uplink), 91 level-2 switches (each supporting 10 quadruple downlinks and an octal uplink), 23 level-3 switches (each supporting 4 octal downlinks and a degree-16 uplink) -- and our switches are too small to continue with level-4! The best we can do is to create eight disconnected "islands" of about 125,000 nodes each, using 48 ports from one "island root" switch to join three level-3 switches. Thus, if we assume a uniform routing pattern, each uplink group from a level-3 switch must carry 2/3 of the traffic generated by 42,000 hosts over its 16 parallel links. (It is interesting to note that a 16-fold speed increase for the most-heavily loaded links only gave us a 4-fold reduction in the congestion at the bottleneck, because the limited port count per switch forced an 8-fold reduction in the total network size.)
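Using the same uplink_load_gbps helper from the sketch above, we can check the level-3 figure for the aggregated design: a subtree of ~42,000 hosts inside a 125,000-node island sends about 2/3 of its traffic across the uplink group, i.e., roughly 28,000 Gbit/s spread over 16 parallel 1G links.

```python
island, subtree, links = 125_000, 42_000, 16
per_link = uplink_load_gbps(subtree, total_nodes=island) / links
print(f"per-link load: ~{per_link:,.0f} Gbit/s on each 1 Gbit/s link")
```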
In addition to the bottleneck problem, a tree-structured network with "flat" addressing also has other disadvantages:
- Since there are no alternate paths, a single switch failure or unplugging a single inter-switch link (group) will partition the network.
- Every switch would need a MAC-address filtering table that can store 1 million entries -- which is prohibitively expensive. Moreover, since the default time-out for deleting "stale" addresses from the cache is 5 minutes, "address refresh" packets would flood the network.
Several research groups have proposed novel network architectures to deal with these limitations. For example, PortLand, a system developed at UC San Diego, uses a structured assignment of "pseudo-MAC addresses" to each node, so layer-2 switches only need to store address ranges, rather than individual addresses. Similarly, Microsoft Research developed a system called Monsoon, which replaces the higher levels of the "tree" architecture with a mesh of switches, and introduces randomized routing through the "mesh" to protect the static structure of the mesh from local congestion due to dynamic changes in the traffic patterns.
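To illustrate the idea behind structured addressing (the details below are my own simplification, not PortLand's exact format), here is a sketch in which a node's position in the tree is encoded directly into its pseudo-MAC address, so a switch can forward on a prefix covering a whole subtree instead of storing one entry per host.

```python
def make_pmac(pod: int, position: int, port: int, vmid: int) -> int:
    """Pack a hierarchical location into a 48-bit pseudo-MAC.
    The 16/8/8/16-bit field split is illustrative only."""
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def same_pod(pmac_a: int, pmac_b: int) -> bool:
    """A core switch only needs the 16-bit pod prefix to decide where
    to send a frame -- one table entry per pod, not one per host."""
    return (pmac_a >> 32) == (pmac_b >> 32)

a = make_pmac(pod=3, position=7, port=1, vmid=42)
b = make_pmac(pod=3, position=9, port=0, vmid=7)
print(hex(a), hex(b), same_pod(a, b))   # same pod -> one shared route
```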