Due: 11:59pm Thursday December 12, 2013
The project will be a non-trivial original piece of work, produced by an individual student or by two students working together as a team. The project could be structured in any of several ways; the end result should be at the level of a decent workshop paper. I will offer several ideas for projects related to issues we have been covering in lecture. However, you are welcome to select a topic of your own choosing if you prefer. Please let me know ASAP when you have chosen a topic so I can offer suggestions on how to proceed before you (attempt to) do all the work.
The purpose of the project is for you to show me that you understand the concepts and techniques covered in the class, and that you can apply them in ways that go beyond simply "plugging numbers into formulas" or retelling what someone else has written. In particular, do not take material directly from other sources (e.g., written descriptions, figures, computer code, experimental data, etc.) and use it as if it were your own work! That is called "plagiarism" and it is an academic offence for which I give a failing grade.
Almost every year I catch someone who turned in a "survey paper" that was copied word-for-word from another document (or perhaps pieced together from sections copied from 2 or 3 documents). That is just as wrong as copying someone else's computer program. Moreover, keeping the same structure while rewriting each sentence using different words with the same meaning is still wrong -- like changing the variable names, or reordering the function definitions, to hide a program's origin.
It is fine to include a few specific items taken directly from other sources -- a short definition, a well-done figure, etc. -- as long as you state the source of the item and provide a reference. Anything longer than a few sentences should be rewritten and reorganized using your own words (yes, even if you are not a native speaker of English!) and possibly your own terminology and/or mathematical notation. (Developing your own terminology is surprisingly important when trying to compare related papers from different authors, since their terminologies are often unrelated and/or conflicting.) More importantly, your goal should be to explain each author's approach, specific contribution, point of view, etc., in a way that highlights the differences and similarities among the authors.
TCP's standard sliding-window mechanism creates a major performance bottleneck on "Long-Fat Networks" (LFNs). Long ago, a 1.5 Mbps connection through a geostationary satellite link would have been considered an LFN, because its 0.5 second minimum RTT demands an advertised window size of at least 100 Kbytes to avoid starvation, whereas 64 Kbytes is the largest value that can fit in the 16-bit field in the TCP header. RFC 1323 offered some relief from the problem by defining the window-scaling option, protection against sequence number wrap-around, and recommendations for RTT calculation. Today, however, there is a demand for multi-gigabit per second connection speeds between supercomputer centers located on different continents -- which is far beyond the capabilities of TCP's additive-increase, multiplicative-decrease congestion control algorithm. Therefore, let's ignore the issue of fitting a very large advertised window and/or byte-level sequence number into a TCP header and focus on adapting TCP's additive-increase, multiplicative-decrease (AIMD) congestion control algorithm to work well on LFNs.
For concreteness, assume the target is to support a continuous transmission rate of about 1 Gigabit/sec (i.e., 10^9 bps) across a trans-continental path (i.e., the RTT is 100 msec). In this case, the TCP sender would need to use a congestion window size of about 8,500 packets (based on an Ethernet-like 1,500 byte MSS) in order to maintain the target rate, since the bandwidth-delay product is 10^9 bps x 0.1 sec / (1,500 x 8 bits per packet) ~ 8,333 packets. Consider one of the following sub-topics:
(1a) Notice that AIMD's linear growth (one MSS per RTT) is proportionately very small as the size of the congestion window increases, so the penalty (in lost throughput) for each packet loss is very severe for LFNs. Therefore, evaluate the benefits of adding the following RAID-inspired form of error correction to the TCP stream. Following each block of B consecutive TCP segments, assume the TCP transmitter sends a special "error correction segment", consisting of the bitwise exclusive-OR of the entire block, to the TCP receiver. Thus, if at least B out of the B+1 segments associated with this block reach the receiver, it can reconstruct a single lost TCP segment. However, two or more losses from the same block cannot be fixed. Furthermore, transmitting the special error correction segments scales the throughput of the TCP connection by a factor of B/(B+1), whether or not there were any lost segments. Evaluate the effectiveness of this error-correction method on the throughput for the target TCP flow. Estimate the optimal value of B as a function of the probability of loss, P, between 10^-3 and 10^-6.
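As a starting point, here is a minimal back-of-the-envelope sketch in Python of one way to attack (1a). It assumes independent segment losses with probability P, treats a segment as unrecoverable only when two or more of the B+1 segments in its block are lost, and plugs the residual loss rate into the well-known Mathis et al. approximation (rate ~ (MSS/RTT) * sqrt(3/2) / sqrt(p)), capped at the 1 Gbit/s target; the function names and the brute-force search over B are my own illustration, not a required method.

```python
import math

LINK_BPS = 1e9          # target rate from the problem statement
MSS_BITS = 1500 * 8
RTT_S = 0.1

def residual_loss(B: int, P: float) -> float:
    """Probability a given data segment is lost AND cannot be rebuilt
    from the XOR parity: the segment itself is lost, and at least one
    of the other B segments in its block of B+1 is also lost
    (independent losses assumed)."""
    return P * (1.0 - (1.0 - P) ** B)

def goodput_bps(B: int, P: float) -> float:
    """Mathis-style TCP rate ~ (MSS/RTT)*sqrt(3/2)/sqrt(p), capped at
    the link rate, then scaled by the B/(B+1) parity overhead."""
    rate = (MSS_BITS / RTT_S) * math.sqrt(1.5 / residual_loss(B, P))
    return (B / (B + 1.0)) * min(LINK_BPS, rate)

for P in (1e-3, 1e-4, 1e-5, 1e-6):
    best = max(range(1, 50_000), key=lambda B: goodput_bps(B, P))
    print(f"P = {P:.0e}: best B ~ {best}, "
          f"goodput ~ {goodput_bps(best, P) / 1e9:.2f} Gbit/s")
```

A more careful evaluation would replace the Mathis approximation with a packet-level simulation, since bursty losses violate the independence assumption built into this sketch.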
(1b) This time, consider how to modify the linear growth in AIMD's congestion-avoidance phase to better handle a very large congestion window. For example, you could try increasing the congestion window by more than one MSS per RTT, such as the square root or logarithm of the current window size. Find a function that produces a ratio between the rates of decrease (-50%) and increase similar to what AIMD would give for small window sizes (say 4-32). Evaluate what happens if two LFNs using your modified version of TCP meet at a common bottleneck.
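To get a feel for the fairness question at the end of (1b), here is a minimal discrete-time sketch (one step per RTT) of two flows sharing a bottleneck. The square-root increase rule and the synchronized-loss assumption (both flows halve whenever the sum of their windows exceeds the pipe size) are simplifying assumptions of mine, not part of the assignment.

```python
import math

def simulate(pipe_pkts, increase, w_a, w_b, rtts=5000):
    """Two flows at one bottleneck, one step per RTT. Whenever the
    combined window exceeds the pipe size, both flows see a loss and
    halve their windows (synchronized losses -- a strong assumption)."""
    for _ in range(rtts):
        w_a += increase(w_a)
        w_b += increase(w_b)
        if w_a + w_b > pipe_pkts:
            w_a, w_b = w_a / 2.0, w_b / 2.0
    return w_a, w_b

rules = {"classic +1/RTT": lambda w: 1.0,
         "+sqrt(w)/RTT":   lambda w: math.sqrt(w)}

# Start the flows far apart and see whether they converge to a fair share.
for name, rule in rules.items():
    w_a, w_b = simulate(17_000, rule, w_a=100.0, w_b=12_000.0)
    print(f"{name}: flow A ~ {w_a:.0f} pkts, flow B ~ {w_b:.0f} pkts")
```

One thing to look at is whether the smaller flow still catches up (as it does under classic AIMD), and how your candidate growth function changes the convergence speed and the oscillation amplitude.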
(1c) Finally, consider the option of changing AIMD's unit of window adjustment from a single maximum-sized segment (MSS) to the spacing within a stream of packets. Unlike part (1b), this time we assume that the TCP sender uses AIMD to adjust the gap between consecutive packets while transmitting a continuous stream, i.e., double the gap after each loss, otherwise reduce the gap linearly.
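A minimal sketch of this update rule, with placeholder step sizes of my own choosing; note that since the sending rate is MSS/gap, an additive decrease in the gap is a super-linear increase in the rate, which is one of the behaviors worth evaluating.

```python
def update_gap(gap_s: float, loss: bool,
               step_s: float = 1e-6, min_gap_s: float = 1e-5) -> float:
    """Rate-based analogue of AIMD: on a loss, double the inter-packet
    gap (i.e., halve the sending rate); otherwise shrink the gap by a
    small additive step (gradually raising the rate). The step sizes
    are placeholders, not tuned values."""
    if loss:
        return 2.0 * gap_s
    return max(min_gap_s, gap_s - step_s)
```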
(2) Modify TCP to handle the exchange of "direct access/block oriented" data (like a disk file) rather than "sequential" data (like a magnetic tape).
Recall that a TCP session transfers a sequence of bytes [a, a+1, ..., a+n] -- reliably -- from Alice to Bob, where the parameters a and n are determined by Alice's SYN and FIN packets. Bob only releases the data to his client in sequential order, and Bob's advertised window restricts the maximum amount by which Alice's transmissions can deviate from this strict ordering. However, this reliable byte-stream abstraction is not appropriate for every application. For example, in a live video chat application, Bob may prefer a momentary gap in the picture over a long pause in the event of a packet loss. More generally, suppose the data represents a decision tree, and after seeing node 1 Bob knows he wants to jump directly to the subtree rooted at node 50. Why not let him skip the intermediate nodes if he already knows the information is useless to him?
Thus, consider the alternative of abstracting the data transfer as a sequence of reads by Bob from a direct-access file at Alice. At connection establishment, Alice would announce both the first and last sequence numbers in the file. (Later on, Alice could update these values if the attributes of the file changed, for example, a live video stream where new images are constantly being added and old ones are thrown away.) Similarly, starting at connection establishment, and instead of acknowledging new data as it is received, Bob would respond to Alice with updated "wish lists" of content from the file that he wants Alice to send. Clearly, Bob can adjust the size of his "wish list" to apply flow control, similar to the "advertised window" in TCP acknowledgements. However, since Bob may choose not to ask for data to be sent sequentially (and he is also free to change his mind, and remove items from his wish list that he never received), Alice cannot identify lost packets from the way in which Bob changes his "wish list". Therefore, Bob must provide some other feedback mechanism beyond the "wish list" so that Alice can use a TCP-like congestion-control algorithm.
Modify the TCP header fields that are carried in connection-establishment packets, data packets, and acknowledgements to support the "direct access" model described above. Show how your modified header fields would handle a variety of situations, including: (a) normal TCP-like sequential, reliable, flow-controlled data transfers, (b) real-time unreliable flow-controlled multimedia streaming, and (c) receiver-controlled non-sequential access to a file (e.g., Bob carrying out a binary search over a large file stored at Alice, without transferring the whole data set).
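To make the design space concrete, here is one hypothetical shape the modified fields could take, written as Python dataclasses purely for illustration. Every field name below is my own invention; a real design would still have to settle the wire encoding, option negotiation, and the exact congestion-feedback mechanism.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Range = Tuple[int, int]   # [first, last] byte sequence numbers, inclusive

@dataclass
class DirectAccessSyn:
    """Connection setup: Alice announces the file's extent instead of
    just an initial sequence number (both values may be updated later,
    e.g., for a live stream)."""
    first_seq: int
    last_seq: int

@dataclass
class WishListUpdate:
    """Bob's replacement for a cumulative ACK: the ranges he still
    wants, in the order he wants them. Shrinking or reordering the
    list doubles as flow control; the received-segment counter gives
    Alice the loss feedback that a wish list alone cannot provide."""
    wanted: List[Range] = field(default_factory=list)
    max_outstanding: int = 64 * 1024   # flow-control budget, in bytes
    segments_received: int = 0         # cumulative count, for congestion control
```

Under this sketch, case (a) is a wish list of one big range [(a, a+n)]; case (b) has Bob repeatedly dropping stale ranges as the playback point advances; and case (c) keeps exactly one small range in the list at any time, hopping around the file as the binary search proceeds.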
(4) Fun with Network Trace Data.
For many years, researchers have been studying the statistical properties of network traffic collected by a "sniffer" attached to a single link in the Internet core. As we will see in lecture, there has been great interest in understanding the packet inter-arrival time distributions, i.e., Poisson vs. long-range dependence. However, the properties of the packet size distribution have largely been ignored, other than recognizing that sizes tend to be bimodal (e.g., data segments vs. ACKs). In this project, you will look at publicly available sources of network traces, such as the WIDE project, to determine the degree (if any) and nature of the statistical correlation in the lengths of sequentially transmitted packets.
More specifically, let f(x) = Pr[packet size is x bytes] be the unconditional probability density function for packet sizes (for example, see the summary page for a typical WIDE dataset). In general, classical analytical models in queueing theory must assume that the size of each packet is chosen independently from f(x), without regard for the sizes of any other packets. Another possibility, inspired by the study of long-range dependence in interarrival times, would be to assume that the sizes of the i-th and j-th packets in the sequence are correlated in a way that depends on their separation, |i-j|. A third possibility might be that the packet size distribution is related to its location relative to the start of a busy period, i.e., if x_i and x_j represent the sizes of the i-th and j-th packets (if any) transmitted during a single busy period, then how are their respective packet size distributions f_i(x_i) and f_j(x_j) related?
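As a concrete starting point for the second hypothesis, here is a minimal sketch of the measurement involved. It assumes you have already exported one packet length per line, in transmission order, from a trace into a text file (e.g., with tshark -r trace.pcap -T fields -e frame.len); the filename and the list of lags are placeholders.

```python
import numpy as np

# One packet length (bytes) per line, in transmission order.
sizes = np.loadtxt("sizes.txt")          # placeholder filename
x = sizes - sizes.mean()

def autocorr(x: np.ndarray, lag: int) -> float:
    """Sample autocorrelation of the (mean-removed) size sequence at
    the given lag; values near zero at every lag > 0 are what the
    classical independence assumption predicts."""
    return float(np.dot(x[:-lag], x[lag:]) / ((len(x) - lag) * x.var()))

for lag in (1, 2, 5, 10, 100, 1000):
    print(f"lag {lag:5d}: r = {autocorr(x, lag):+.3f}")
```

For the third hypothesis you would first have to segment the trace into busy periods (e.g., by thresholding the inter-arrival gaps) and then compare the empirical size distributions position by position.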
(5) Data Center Networking
In recent years, there has been considerable interest in how to design specialized networks for large data centers. Start with the idea that a data center is a large cluster with thousands (perhaps hundreds of thousands) of "nodes" (rack-mount PCs, file servers, etc.) connected by a fast local area network (such as 1G or 10G Ethernet). Each of those rack-mount PCs may further serve as the platform for hosting tens, hundreds, or even more virtual machines. Most of those virtual machines are configured to serve some specific task. However, the number of active copies of each VM configuration, and their placement on physical nodes, may change dynamically as a function of workload. In addition, the data center may provide "cloud computing" services to various outside clients, each of which has service level agreements that may require the data center to dedicate certain resources to that client while isolating it from everyone else. Therefore, in the ideal case, a data center network should be able to move lots of data quickly from any subset of nodes to any other subset of nodes. (For example, search-engine query processing is based on the MapReduce paradigm, which follows a broadcast/convergecast data exchange pattern over a multi-level tree structure. Others have suggested "striping" a large file across multiple file servers to reduce a client's access time, which may lead to network overload due to "incast" traffic.)
Orthogonal to the transport-layer issues that arise in high-throughput, low-latency networking, data center networking must also overcome the issues of scaling, in which (what could be viewed as) a single local area network might need to support a million nodes (including virtual machines, not all of which may be active at the same time). For example, suppose we try to build such a network using commodity layer-2 bridges, each of which has 50 1G Ethernet ports (say) connected by an internal non-blocking interconnection fabric.
Using a single switch, we could create an optimally-connected, isolated 50-node "island", where we can move data from any subset of nodes to any other, subject to a maximum of 1 Gigabit/sec entering and leaving each node.
If we reserved one port as an "uplink", then we would need at least 20,835 switches to connect every node into a single tree structure, using 20,408 level-0 "edge" switches, 417 level-1 switches, nine level-2 switches, and a single level-3 switch in the middle. Unfortunately, the (single) uplink from a level-k switch must carry all the traffic going to and from the 49^(k+1) nodes belonging to its subtree, and is therefore vastly overloaded unless the applications are written (and nodes are placed) in a way that restricts data exchanges to occur only between (physical) neighbors. Indeed, each uplink from a level-2 switch carries all the traffic between the ~112,000 nodes that belong to its level-2 subtree and the ~888,000 nodes that belong to other level-2 subtrees. Thus, if we assume a uniform routing pattern, these links will be overloaded by a factor of about 10^5.
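The overload estimate is easy to reproduce; the helper below is just the arithmetic from the paragraph above (uniform any-to-any traffic at 1 Gbit/s per node), not a general topology planner.

```python
NODES = 1_000_000

def uplink_load_gbps(subtree_nodes: float, total_nodes: int = NODES) -> float:
    """Under uniform any-to-any traffic at 1 Gbit/s per node, a subtree
    of n nodes pushes n * (total - n) / total Gbit/s through its
    uplink(s) in each direction."""
    return subtree_nodes * (total_nodes - subtree_nodes) / total_nodes

# Single-uplink tree: each of the nine level-2 subtrees holds about
# 1,000,000 / 9 ~ 112,000 nodes, all squeezed through one 1G uplink.
print(f"level-2 uplink: ~{uplink_load_gbps(NODES / 9):,.0f} Gbit/s "
      f"on a 1 Gbit/s link")   # ~10^5 overload
```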
If we use faster uplink speeds and/or employ link aggregation in the higher levels of the tree, then we can trade additional network hardware costs for a (slight) reduction in the severity of the bottleneck. For example, suppose we provide each level-k switch with 2^(k+1) uplinks. Then we need 20,833 level-0 "edge" switches (each supporting 48 single downlinks and a double uplink), 906 level-1 switches (each supporting 23 double downlinks and a quadruple uplink), 91 level-2 switches (each supporting 10 quadruple downlinks and an octal uplink), 23 level-3 switches (each supporting 4 octal downlinks and a degree-16 uplink) -- and our switches are too small to continue with level-4! The best we can do is to create eight disconnected "islands" of about 125,000 nodes each, using 48 ports from one "island root" switch to join three level-3 switches. Thus, if we assume a uniform routing pattern, each uplink group from a level-3 switch must carry 2/3 of the traffic generated by 42,000 hosts over its 16 parallel links. (It is interesting to note that a 16-fold speed increase for the most-heavily loaded links only gave us a 4-fold reduction in the congestion at the bottleneck, because the limited port count per switch forced an 8-fold reduction in the total network size.)
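Using the same uplink_load_gbps helper from the sketch above, we can check the level-3 figure for the aggregated design: a subtree of ~42,000 hosts inside a 125,000-node island sends about 2/3 of its traffic across the uplink group, i.e., roughly 28,000 Gbit/s spread over 16 parallel 1G links.

```python
island, subtree, links = 125_000, 42_000, 16
per_link = uplink_load_gbps(subtree, total_nodes=island) / links
print(f"per-link load: ~{per_link:,.0f} Gbit/s on each 1 Gbit/s link")
```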
In addition to the bottleneck problem, a tree-structured network with "flat" addressing also has other disadvantages:
- Since there are no alternate paths, a single switch failure or unplugging a single inter-switch link (group) will partition the network.
- Every switch would need a MAC-address filtering table that can store 1 million entries -- which is prohibitively expensive. Moreover, since the default time-out for deleting "stale" addresses from the cache is 5 minutes, "address refresh" packets would flood the network.
Several research groups have proposed novel network architectures to deal with these limitations. For example, PortLand, a system developed at UC San Diego, uses a structured assignment of "pseudo-MAC addresses" to each node, so layer-2 switches only need to store address ranges, rather than individual addresses. Similarly, Microsoft Research developed a system called Monsoon, which replaces the higher levels of the "tree" architecture with a mesh of switches, and introduces randomized routing through the "mesh" to protect the static structure of the mesh from local congestion due to dynamic changes in the traffic patterns.
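To illustrate the idea behind structured addressing (the details below are my own simplification, not PortLand's exact format), here is a sketch in which a node's position in the tree is encoded directly into its pseudo-MAC address, so a switch can forward on a prefix covering a whole subtree instead of storing one entry per host.

```python
def make_pmac(pod: int, position: int, port: int, vmid: int) -> int:
    """Pack a hierarchical location into a 48-bit pseudo-MAC.
    The 16/8/8/16-bit field split is illustrative only."""
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def same_pod(pmac_a: int, pmac_b: int) -> bool:
    """A core switch only needs the 16-bit pod prefix to decide where
    to send a frame -- one table entry per pod, not one per host."""
    return (pmac_a >> 32) == (pmac_b >> 32)

a = make_pmac(pod=3, position=7, port=1, vmid=42)
b = make_pmac(pod=3, position=9, port=0, vmid=7)
print(hex(a), hex(b), same_pod(a, b))   # same pod -> one shared route
```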