U.S. patent application number 12/486667 was filed with the patent office on 2009-06-17 and published on 2009-12-10 for layered multicast and fair bandwidth allocation and packet prioritization.
This patent application is currently assigned to Internet Broadcasting Corporation. Invention is credited to Erik Van Zijst.
Application Number | 20090303997 12/486667 |
Family ID | 36741088 |
Filed Date | 2009-06-17 |
United States Patent
Application |
20090303997 |
Kind Code |
A1 |
Zijst; Erik Van |
December 10, 2009 |
LAYERED MULTICAST AND FAIR BANDWIDTH ALLOCATION AND PACKET
PRIORITIZATION
Abstract
Embodiments include an overlay multicast network. The overlay
multicast network may provide a set of features to ensure reliable
and timely arrival of multicast data. The embodiments include a
congestion control system that may prioritize designated layers of
data within a data stream over other layers of the same data
stream. Each data stream transmitted over the network may be given
an equal share of the bandwidth. Addressing in routing tables
maintained by routers in the network may utilize summarized addressing
based on the difference in location of the router and destination
address. Summarization levels may be adjusted to minimize travel
distances for packets in the network. Data from high priority data
stream layers may also be retransmitted upon request from a
destination machine to ensure reliable delivery of data.
Inventors: |
Zijst; Erik Van;
(Amstelveen, NL) |
Correspondence
Address: |
PERKINS COIE LLP
P.O. BOX 1208
SEATTLE
WA
98111-1208
US
|
Assignee: |
Internet Broadcasting
Corporation
Schiphol Oost
NL
|
Family ID: |
36741088 |
Appl. No.: |
12/486667 |
Filed: |
June 17, 2009 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11342167 | Jan 26, 2006 | |
12486667 | | |
60647601 | Jan 26, 2005 | |
Current U.S.
Class: |
370/390 ;
370/431 |
Current CPC
Class: |
H04L 47/2416 20130101;
H04L 47/32 20130101; H04L 1/18 20130101; H04L 47/125 20130101; H04N
21/6405 20130101; H04L 12/1868 20130101; H04L 47/30 20130101; H04L
47/34 20130101; H04N 21/64322 20130101; H04L 45/18 20130101; H04L
45/04 20130101; H04L 47/2441 20130101; H04L 47/11 20130101; H04L
61/103 20130101; H04L 12/1836 20130101; H04L 29/12801 20130101;
H04L 45/02 20130101; H04L 45/123 20130101; H04L 45/64 20130101;
H04L 45/302 20130101; H04L 45/00 20130101; H04N 21/23605 20130101;
H04N 19/39 20141101; H04L 29/12009 20130101; H04L 47/10 20130101;
H04L 45/12 20130101; H04L 45/48 20130101; H04N 21/2383 20130101;
H04L 43/0829 20130101; H04L 47/15 20130101; H04N 19/34 20141101;
H04L 47/2408 20130101; H04N 21/6125 20130101; H04L 1/1887 20130101;
H04L 45/16 20130101; H04L 1/1874 20130101; H04N 21/234327 20130101;
H04L 61/6004 20130101; H04L 29/12028 20130101; H04L 12/1881
20130101 |
Class at
Publication: |
370/390 ;
370/431 |
International
Class: |
H04L 12/56 20060101
H04L012/56; H04L 12/28 20060101 H04L012/28 |
Claims
1. A system comprising: a primary communication network; a first
router in communication with the primary communication network; and
a second router in communication with the primary network, the
second router configured to communicate with the first router via a
reliable link over the primary network, the first and second router
to form an overlay network over the primary network, the second
router to manage bandwidth by equal allotment to each data stream
transmitted via the overlay network.
2. The system of claim 1, wherein the overlay network transmits
data categorized into logical layers within a data stream.
3. The system of claim 1, further comprising: a first application
in communication with the first router, the first application to
generate a layered data stream and forward the data stream to the
first router.
4. The system of claim 3, further comprising: a second application
in communication with the second router, the second application to
receive the layered data stream from the second router and
reassemble the layered data stream.
5. The system of claim 4, wherein the first router and second
router implement an extended Bellman-Ford routing algorithm.
6. The system of claim 1, wherein the second router requests
retransmission of missing packets from high priority logical
layers.
7. The system of claim 1, wherein the reliable link is a TCP
connection.
8. The system of claim 1, wherein the second router processes a
hierarchical dotted string address system.
9. The system of claim 1, wherein the second router summarizes
addresses of neighbor nodes in the overlay network.
10. The system of claim 1, wherein the second router maintains a
subscription table for multicast data.
Description
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application is a divisional of copending U.S. patent
application Ser. No. 11/342,167, filed Jan. 26, 2006, which claims
priority to U.S. Provisional Application No. 60/647,601, filed Jan.
26, 2005, which is incorporated by reference as if set forth in
full herein.
FIELD OF THE INVENTION
[0002] The invention relates to network management. Specifically,
the invention relates to the management of data packets to support
multicasting.
BACKGROUND
[0003] Despite the versatility in digital communication,
interoperability and internationally accepted communication
protocols of the Internet, its fundamental design has not changed
much since its conception and does not excel in everything.
Watching live TV for example is something which is not typically
done over the Internet, even though television has been around
almost twice as long as the Internet Protocol and represents a huge
market. The reasons for this are based on the design of the
Internet and Internet Protocol (IP).
[0004] The Internet is a packet-switching network where data is
exchanged in small units or packets that are independently
transported over the network and concatenated again at the receiver
into its original form. A strength of packet-switching is that it
allows for very flexible use of the physical network wires. When
two communicating parties have no data to exchange for a certain
period of time, no packets are sent and the wires can carry packets
from other parties. On the Internet, bandwidth is not reserved, but
available to and shared by everyone. The consequence is that it
cannot guarantee a minimum amount of end-to-end bandwidth, making
live video streams often appear jerky because frames are skipped
due to congestion that delays or prevents delivery.
[0005] Even though with help from specialized protocols such as
Distance Vector Multicast Routing Protocol (DVMRP) or Protocol
Independent Multicast (PIM), the Internet Protocol allows for data
packets to be multicast to a large number of receivers
simultaneously, using this feature to successfully realize a live
video broadcast is a challenge. A video stream is transmitted at a
fixed high rate and not all parts of the network are likely to have
sufficient bandwidth available to forward the stream.
[0006] When a bandwidth bottleneck is reached, the router discards
the packets that cannot immediately be forwarded. This causes two
problems. The data stream that is eventually received by one or
more receivers further down the network is corrupt and the
congestion also has a negative impact on communication sessions of
other nodes that communicate through the bottleneck router. The
only way to avoid this problem using the Internet Protocol and
standard multicast is to find a transmission rate that is supported
by all parts of the network. However, since the network is
available to anyone, this rate will continuously change. A
transmission rate is selected and the packet loss is accepted.
However, when packets are dropped randomly by overloaded routers
the data stream will suffer packet loss. If additional packets are
sent through the bottleneck router, there is a larger chance that
the router will choose one of these packets when ready to send
another packet, implicitly rewarding heavy streams during
congestion. This encourages sending redundant data thereby
exacerbating the problem.
[0007] A more fundamental problem of flow control using the
Internet Protocol is that slowing down the data may not be an
option for certain types of live data streams. However, packet loss
is unavoidable using the Internet Protocol and while data types
such as audio and video data can usually withstand some packet loss
without becoming too corrupted to play, this does not apply to all
types of live data. Real-time financial data, for example, will
become useless and even dangerous to use if random packets of
trades are lost.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Embodiments of the invention are illustrated by way of
example and not by way of limitation in the figures of the
accompanying drawings in which like references indicate similar
elements. It should be noted that references to "an" or "one"
embodiment in this discussion are not necessarily to the same
embodiment, and such references mean at least one.
[0009] FIG. 1 is a diagram of one embodiment of an overlay
multicast network.
[0010] FIG. 2 is a diagram of one embodiment of the basic
components of a router daemon in the overlay multicast system.
[0011] FIG. 3A is a diagram of one example embodiment of a layered
multicast system network divided into logical clusters.
[0012] FIG. 3B is a diagram of a routing table for the network of
FIG. 3A.
[0013] FIG. 4 is a diagram of an example network with hierarchical
structure.
[0014] FIG. 5A is a diagram demonstrating summarization of an
example network with an S2 domain.
[0015] FIG. 5B is a diagram demonstrating the stretch factor for
the example network of FIG. 5A and shows the inner network of the
S2 domain.
[0016] FIG. 6 is a diagram of an example embodiment of a multicast
distribution or "sink" tree.
[0017] FIG. 7 is a diagram of an overlay multicast network with a
failed link.
[0018] FIG. 8A is a flowchart of one embodiment of a process for
managing congestion in the overlay multicast system.
[0019] FIG. 8B is a flowchart of one embodiment of a process for
handling layer repair.
[0020] FIG. 9A is a diagram of two router daemons connected by a
link.
[0021] FIG. 9B is a diagram of two router daemons connected by a
link where a retransmission is requested.
[0022] FIG. 10 is a diagram that shows the process of selective
packet repair and ordering.
[0023] FIG. 11A is a diagram showing thinning of a stream of
data.
[0024] FIG. 11B is a diagram of a multi-level token ring
structure.
[0025] FIG. 11C is a diagram of a multi-level token ring structure
showing thinning of data.
DETAILED DESCRIPTION
[0026] To provide multicast services a network needs to support
one-to-many communication that can send data packets from a data
source to more than one receiver, ideally without putting extra
stress on the network or source when the number of receivers
increases. Multicast routing can be offered by different methods.
One method is to let receivers tell the network, but not
necessarily the source, which data streams they want to receive and
let the network compute data distribution paths that deliver just
the right packets to each receiver. Multicasting can also be done
by letting the source encode the list of receivers in each data
packet, thereby freeing the network from the potentially
computationally intensive task of maintaining multicast
distribution paths. However, this method does not scale to handle a
large number of receivers. A third method relies on logic at the
receivers by letting the network apply a broadcast mechanism
whereby each packet is delivered to every connected node and
letting the receivers filter out only those packets that are
interesting. This method may also generate a heavy load on a larger
network but it is simple.
[0027] In one embodiment, a multicast network is constructed as an
overlay network. In one embodiment, the overlay network includes a
number of software implemented routers connected by normal IP or
TCP connections. A mesh is created in which every software router
is connected to one or more other software routers by means of
virtual connections that appear to be direct connections to the
other software routers, but are likely implemented by a number of
intervening traditional TCP/IP routers situated between the
software routers. Two routers that are connected this way are
adjacent from the perspective of the overlay, but in reality may be
many physical hops away from one another. Also, a software router that
has three software router neighbors has three independent virtual
links. However, it is possible that this software router only has a
single physical network connection that is shared by the three
virtual links.
[0028] To the underlying network, the overlay network is nothing
more than a collection of applications that send data between
static pairs. Beyond each router pair (identifiable by their TCP
connection) there is no relation between the individual daemons. As
such, the overlay network can easily work through firewalls, NAT
(IP masquerading), proxies and VPNs. Firewalls cannot control
which software router POPs can talk to each other. The system
shares some principles with an HTTP proxy server tunneling traffic
to and from web browsers. In an example HTTP proxy server system,
an intranet has two web browser machines. They both browse the
Internet using the proxy server also on the local intranet.
Although the individual web browser machines can surf the net
individually, they never make a direct connection with a remote
webserver, but only with the local proxy server that acts like a
relay point. As such, the firewall that sits between the proxy
server and the remote webserver can only choose to either allow the
proxy server to talk to the webserver or deny it. It is generally
unable to enforce unique policies for individual browser machines.
It generally cannot tell on behalf of which client the proxy server
is fetching a webpage. Returning to the overlay network, because
each software router is a relay point that tunnels data traffic for
different senders and receivers similar to an HTTP proxy server,
firewalls have no fine-grained control over communication over the
overlay network. As soon as a firewall allows even a single TCP
connection between an internal and an external POP, all software
routers connected to the internal one can talk to all POPs
connected to the external one, and vice versa, without
restriction.
[0029] In one embodiment, the overlay multicast routing system also
manages flow control and timely delivery. Non-interactive live data
streams do not actively anticipate network congestion. To manage
congestion the network manages the available bandwidth to allow for
fair or equal division among the streams. Without management, high
volume streams are assigned a larger capacity percentage on
overloaded links resulting in little benefit in keeping the
bandwidth requirements of a stream low, as the packet loss
percentage is determined by how much the network is overloaded by
all streams combined and not by the requirements of the individual
streams. An alternative to letting the network handle the flow
control and congestion is to put the responsibility at the source
and receivers. However, letting data streams anticipate network
conditions requires a form of feedback information from the network
or the receivers. In this case it is beneficial that the amount of
feedback does not grow linearly with the size of the audience, as
that would reduce the scalability of the multicast. Even when a
scalable form of feedback information can be realized and the data
stream adapts its transmission rate according to the network
conditions, the problem remains that live streams lose their value
when they are slowed down and delivered late.
[0030] In one embodiment, an overlay multicast system also
implements or manages delivery of packets including an option for
guaranteed delivery. It would be ideal if every receiver would
receive the data stream without loss or corruption. However, when
the content is `live` and cannot be slowed down, but the network
has insufficient capacity, packet loss is difficult to avoid. In
fact, even when there is sufficient bandwidth at all times,
store-and-forward packet-switched networks are not able to
guarantee the delivery of all packets. For example, when a router
crashes, all packets in its buffers may be irrecoverably lost. If a
network uses dynamic routing, packets are also dropped when routing
policies change or packets are trapped in an occasional, temporary
routing loop. In cases where packets are `accidentally` lost, an
end-to-end mechanism of retransmissions can be applied that can
compensate for the loss. However, since this requires a form of
feedback information, it is beneficial for reasons of scalability
that the overhead involved with retransmissions is not linearly
related to the size of the audience.
[0031] End-to-end retransmission feedback may be avoided in at
least two ways. First, it is possible to let the network components
keep a copy of the most recently forwarded packets and let them
participate in retransmissions by intercepting the retransmission
requests and servicing them locally. This approach often utilizes
greater storage and increased processing power requirements at the
network components.
[0032] The second alternative to end-to-end retransmission requests
is that of encoding redundant information in the data packets. If
enough redundancy is encoded, a lost packet's content may be
entirely recovered from the extra information in the other packets.
The downside of this system is that it comes with a constant level
of bandwidth overhead that is related to the level of packet loss
tolerance, regardless of whether packets are actually lost. Each of
these approaches to packet loss fails in the case of a live data
stream that produces more bytes than the network can forward. When
local or end-to-end retransmission requests are used, the problem
may even be increased as the retransmission requests use extra
bandwidth, causing more data packets to be lost.
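The redundancy approach can be illustrated with a minimal sketch. It assumes the simplest possible code, a single XOR parity packet per group of equal-length packets, which the text does not prescribe; the function names are hypothetical. The parity packet is sent whether or not anything is lost, which is the constant overhead described above, and it can repair at most one loss per group.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: list[bytes]) -> bytes:
    """Build one parity packet for a group of equal-length packets."""
    return reduce(xor_bytes, packets)


def recover(received: dict[int, bytes], parity: bytes, group_size: int) -> dict[int, bytes]:
    """Rebuild at most one missing packet of the group from the parity packet.

    `received` maps a packet's position in the group to its content.
    """
    missing = [i for i in range(group_size) if i not in received]
    if len(missing) > 1:
        raise ValueError("a single parity packet cannot repair more than one loss")
    repaired = dict(received)
    if missing:
        # XOR of the parity with every surviving packet yields the lost one.
        repaired[missing[0]] = reduce(xor_bytes, received.values(), parity)
    return repaired
```

Tolerating more than one loss per group would require a stronger code and proportionally more overhead.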
[0033] Embodiments of the overlay multicast system are designed to
offer an end-to-end solution to efficiently multicast live data
including stock market data to any machine connected to the
Internet. The system is capable of giving certain hard guarantees
over what is delivered to receivers. If packets must be dropped due
to congestion or other irrecoverable problems, it is done in a
fully deterministic way that does not corrupt the data. Where a
receiver of data such as a viewer of a film may accept the random
loss of one or two video frames, this type of data loss may wreak
havoc in financial data when the missed packet contains an
important trade. The system supports deterministic delivery of
designated parts of a data stream when the network lacks sufficient
capacity. The system utilizes a layered multicast with data streams
subdivided into individual streams or layers. This allows receivers
to subscribe to only those layers that the network can handle, so
that random packet loss can largely be avoided.
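One way to picture deterministic thinning of layered streams is the following sketch. It is an illustration under stated assumptions, not the method of any particular embodiment: every stream on a link is assumed to receive an equal share of the link capacity, layers are assumed to be ordered from highest to lowest priority, and unused share is not redistributed. The name `thin_streams` and its parameters are hypothetical.

```python
def thin_streams(capacity: float, streams: dict[str, list[float]]) -> dict[str, int]:
    """Return, per stream, how many of its layers are forwarded.

    `streams` maps a stream name to the bit rates of its layers, ordered
    from highest priority (index 0) to lowest. Every stream gets an equal
    share of `capacity`; layers are kept in priority order while they fit.
    """
    if not streams:
        return {}
    share = capacity / len(streams)
    kept = {}
    for name, layer_rates in streams.items():
        used = 0.0
        count = 0
        for rate in layer_rates:
            if used + rate > share:
                break  # this layer and all lower-priority layers are dropped
            used += rate
            count += 1
        kept[name] = count
    return kept
```

Because whole layers are dropped in a fixed order, the layers that do get through arrive complete rather than with random gaps.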
[0034] In another embodiment, an enhanced form of layered multicast
is used that guarantees complete delivery for certain layers to
avoid random loss altogether, making it suitable for certain types
of critical data such as market data. The system is characterized
as controlling two primary activities. The first activity is
running and managing a robust and scalable overlay network that
uses its own routing algorithms and supports multicast, while the
second activity is managing flow control and congestion when live
streams overload the network and ensuring layered multicast can be
offered with guarantees.
[0035] FIG. 1 is a diagram of one example embodiment of an overlay
multicast system. In one embodiment, the system includes an overlay
network 100 that includes a number of software router daemons 109,
115, that are interconnected by normal TCP connections 119 or
similar reliable communication connections. The overlay multicast
system forms an intricate web of routers and virtual links similar
to the Internet, but operating on the application layer. This
system operating at the application layer `overlays` the physical
network and other layers of the Internet providing its own system
of routing, controlling the lower levels of the network. Any number
of routers and client applications may be a part of the system. In
the example, two of the routers 109, 115 are in communication with
local client applications 101, 111.
[0036] In one embodiment, each of the routers has a unique,
string-based address 117. In another embodiment, the routers may
have other types of unique addresses such as numerical addresses.
Each of the router daemons 109, 115 executes a routing algorithm
that computes the shortest paths that allows each router to send
data packets to any other router in the overlay multicast system as
well as any machine in communication with these routers.
[0037] In one embodiment, the system includes a runtime client
library 103, 113, application programming interface (API) or
similar system for sending and receiving packets that can be
utilized by user applications to communicate over the system. This
library connects an application to a nearby system router daemon
109, 115 through a TCP connection 107, 121, native inter-process
communication or similar communication method. The native
inter-process communication method may be used if both the router
and the client application run on the same physical machine. When
connected to a router daemon, the client application 101, 111 can
send and receive data packets from the network under the router's
unique address. Router daemons have a unique address, while user
applications connected to a router daemon are identified through
logical ports.
[0038] In one embodiment, the topology of the overlay network may
be configured at router daemon startup. The topology may be
relatively static after configuration. In one embodiment, the
relationship of router daemons to applications may be one to many,
with a single router daemon serving multiple applications. In
another embodiment, the relationship may be many to many.
[0039] In one embodiment, the data packets in the overlay network
may be addressed to a single communication endpoint or to abstract
multicast addresses. Single communication endpoints are network
addresses used by user applications. Access to them is exclusive such
that only one user application can use them for sending and
receiving packets at a time.
[0040] In one embodiment, a unique network address in the overlay
network that is used by an application is the combination of a
logical node address assigned to the overlay router and the port
name chosen by the application. If an application that is connected
to an overlay router with the logical node address "n1.mxs.1" wants
to use a port named "myport" for receiving unicast data packets,
the fully qualified network address becomes "n1.mxs.1:myport."
Other applications connected to the overlay network that want to
send packets to this application can use this as a destination
address. Ports may be referred to as tports. Overlay network
addresses must be bound, prior to sending or receiving packets.
When a data packet is sent from an endpoint address it will contain
this information as its source, allowing other routers or the
receiving application to send a response.
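The address composition described here can be sketched as follows, assuming the node address and the tport name are joined by a single colon and that the tport name itself contains no colon. The type `OverlayAddress` and the helper functions are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class OverlayAddress(NamedTuple):
    node: str   # logical node address of the overlay router, e.g. "n1.mxs.1"
    tport: str  # port name chosen by the application, e.g. "myport"


def parse_address(text: str) -> OverlayAddress:
    """Split "node:tport" at the last colon into its two parts."""
    node, sep, tport = text.rpartition(":")
    if not sep or not node or not tport:
        raise ValueError(f"not a fully qualified overlay address: {text!r}")
    return OverlayAddress(node, tport)


def format_address(addr: OverlayAddress) -> str:
    """Join a node address and a tport name into a fully qualified address."""
    return f"{addr.node}:{addr.tport}"
```

A router receiving a packet would use the `node` part for routing and the `tport` part only for local delivery to the bound application.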
[0041] In one embodiment, the overlay network may support packets
being sent to abstract multipoint destinations, or multicast
addresses. An overlay network multicast address is a destination
that each machine on the network may subscribe to. The overlay
network software routers ensure that a copy of each data packet
published to this multicast address is delivered to every user
application that subscribed to it. In one embodiment, multipoint
destinations are single sourced. Only one application can publish
data packets to the multicast address, making the overlay network
suitable for one to many, but not many to many communications.
Because a tport session of a multicast address can be freely
chosen, each multicast address explicitly contains the location of
the source on the overlay network. This makes multicast
communication less flexible because publishing is not anonymous,
but it greatly simplifies subscription management. In another
embodiment, multiple applications may publish to a multicast
address and many to many communication is supported.
[0042] In one embodiment, both unicast and multicast addresses are
syntactically equal. For example, the address "n1.mxs.1:timeservice"
could be a unicast address used and bound by an application that
provides the local date and time in response to any data packet it
receives. However, it could also be a multicast address that anyone
can subscribe to. For example, subscription to the address may
provide periodic date and time broadcasts from the user application
that bound this address as a multicast address. In one embodiment,
each packet may contain a flag or similar indicator that indicates
whether its destination address should be interpreted as a unicast
or multicast address.
[0043] In one embodiment, the overlay network provides a set of
higher layered protocols as services over the packet oriented base
layer. These services may be referred to as protocol endpoints. Any
number of protocol endpoints may be defined and supported. In one
embodiment, five protocol endpoints may be supported: an unreliable
unicast protocol (UUP), a reliable unicast protocol (RUP), an
unreliable multicast protocol (UMP), an ordered layered multicast
protocol (OLMP) and a reliable ordered layered multicast protocol
(ROLMP). The UUP offers a best effort unicast
datagram service to applications. The RUP offers reliable unicast
communication between peers. The UMP offers best effort multicast
datagram services to applications on the overlay network. The OLMP
offers multicast communication with receiver driven rate control.
Complete delivery is not guaranteed, but the packets that are
received are guaranteed to be in their original order. The ROLMP
offers reliable multicast communication with receiver driven rate
control. Stream layering allows each subscriber to receive the data
stream in the highest possible quality, while the source never has
to slow down.
[0044] FIG. 2 is a diagram of one embodiment of the basic
components of a router daemon in the overlay multicast system. Each
overlay network router 109 includes a packet switching core or
kernel 201. In one embodiment, packets that are received, either
from a connection to a neighbor router or from a connected
application, pass through the kernel 201. The kernel 201 may not
handle some specialized control packets. The task of the kernel
201 is to inspect the destination of each packet and use a set of
routing tables to determine how to forward the packet.
[0045] In one embodiment, the kernel 201 forwards packets to
neighbors through a set of interface modules 203, 205. In one
embodiment, the kernel 201 may execute on its own thread and be
event driven. The kernel 201 remains idle until it is notified by
an interface module 203, 205 of an incoming packet or by an
application 227 that is sending a packet. The kernel thread is
woken up, reads the packet from the interface module or application
and processes it. If the kernel decides that the packet must be
sent out through an interface, it passes it to that interface and
waits for the next notification.
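The event-driven behavior of the kernel thread can be sketched as below. This is a simplified illustration, assuming that notifications are modeled by a blocking queue, that packets are plain dictionaries with a "dest" key, and that each interface is represented by a list of outbound packets. The `STOP` sentinel exists only so the sketch can terminate.

```python
import queue
import threading

STOP = object()  # sentinel used only by this sketch to end the loop


def kernel_loop(inbox: queue.Queue, routing_table: dict[str, str],
                interfaces: dict[str, list]) -> None:
    """Wait for a packet, look up its next hop, hand it to that interface."""
    while True:
        packet = inbox.get()  # the kernel thread stays idle here until notified
        if packet is STOP:
            break
        next_hop = routing_table.get(packet["dest"])
        if next_hop is not None:
            interfaces[next_hop].append(packet)  # pass to the outbound interface
        # packets without a known route are silently dropped in this sketch
```

An interface module or application would notify the kernel by calling `inbox.put(packet)`, and the kernel would run in its own thread, for example `threading.Thread(target=kernel_loop, args=(inbox, table, interfaces))`.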
[0046] In one embodiment, each router daemon in the network is
connected to one or more neighbors. This is done by establishing
connections between the routers. In one embodiment, the connections
may be long lived TCP connections or similar communication
connections. In one embodiment, a router daemon 109 runs one
interface module 203, 205 instance for each configured neighbor or
configured communication channel. In another embodiment, the router
109 may run multiple interface modules for a configured neighbor or
communication channel or a single interface module for multiple
configured neighbors or communication channels. The responsibility
of an interface module 203, 205 is to establish a connection, e.g.,
a TCP connection, and to pass packets from the kernel 201 to the
connection and vice versa. Packets of the first kind are referred
to as outbound packets, while the latter are inbound packets.
[0047] In one embodiment, the kernel 201 maintains a unicast
routing table that is used for packet switching. To make it
possible for the network to find shortest paths as well as
adjusting these paths when the underlying physical network's
characteristics change, each interface module may measure the
quality of its virtual connection. These measurements are passed on
to the kernel 201 when the link quality is found to have changed.
Inside the kernel 201, the measurements may be fed to a routing
algorithm to determine if the changed link quality alters any of
the entries in the routing table. If the routing table is changed,
it is advertised to the neighbors by encoding the entire table or a
portion of the table in a data packet and passing it to each
interface. In one embodiment, if this type of packet is received
from a neighbor and propagated to the kernel 201 through the
receiving interface, the kernel inspects the message and analyzes
it through the routing algorithm. If the new information leads to
changes in routing entries, the router sends its own related
routing entries to all its neighbors.
[0048] In one embodiment, if a neighbor router crashes, the
interface detects this through an error on the virtual link and
passes an indicator to the routing algorithm. The routing algorithm
then changes the cost indication associated with the link to an
infinite value or similar value indicating that the link should not
be utilized. In one embodiment, the kernel 201 does not distinguish
between a crashed neighbor and an infinitely slow link. The kernel
201 only detects that the link is not to be utilized when reading
the entry in the routing table.
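A minimal sketch of this behavior follows, assuming link costs and advertised distances are kept in plain dictionaries and that an unusable link is represented by `math.inf`. The function names and the table layout are hypothetical. The next-hop selection never needs to know why a link has infinite cost, which mirrors the point that a crashed neighbor and an infinitely slow link look the same.

```python
import math
from typing import Optional


def mark_link_down(link_costs: dict[str, float], neighbor: str) -> None:
    """Record that the virtual link to `neighbor` must not be used."""
    link_costs[neighbor] = math.inf


def best_next_hop(link_costs: dict[str, float],
                  advertised: dict[str, dict[str, float]],
                  dest: str) -> Optional[str]:
    """Pick the neighbor with the lowest link cost plus advertised distance.

    `advertised[n][d]` is the distance to destination d that neighbor n
    last reported. Returns None when the destination is unreachable.
    """
    best, best_cost = None, math.inf
    for neighbor, cost in link_costs.items():
        total = cost + advertised.get(neighbor, {}).get(dest, math.inf)
        if total < best_cost:
            best, best_cost = neighbor, total
    return best
```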
[0049] In one example, when a source application connected to
router s wants to publish a live data stream to multicast group
s:mygroup, where applications on nodes p and q want to receive it,
the source may first bind the group in publish-mode. Binding a
multicast group in publish-mode means that the group is owned by
the binding process. Only the owner of the group is able to publish
to it. The receiving or "sink" applications connected to routers p
and q now bind group s:mygroup in subscribe-mode. Binding a
multicast group in subscribe-mode results in the router node being
connected to a virtual multicast distribution tree. The subscribers
receive a copy of any packet published by the source.
[0050] In one embodiment, multicast groups do not necessarily need
to be bound in publish-mode first. Any application may subscribe to
any multicast group at any time. If there is no source, there will
be no data to receive. Binding a multicast group in publish-mode
and binding it in subscribe-mode are distinct operations. When the
source also wants to receive a copy of its own stream, it binds its
group in subscribe-mode and uses the resulting tsocket to read the
data back. Data packets carry both the address of their source
application, as well as the network address of their
destination.
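The binding rules of the two preceding paragraphs can be summarized in a small sketch. It is a deliberate simplification: a single in-memory registry stands in for the distributed subscription state and the distribution tree, and the class and method names are hypothetical.

```python
class GroupRegistry:
    """Tracks who owns and who subscribes to each multicast group."""

    def __init__(self) -> None:
        self.owner: dict[str, str] = {}
        self.subscribers: dict[str, set[str]] = {}

    def bind_publish(self, group: str, app: str) -> None:
        """Bind in publish-mode: the binding application becomes the owner."""
        if self.owner.get(group, app) != app:
            raise PermissionError(f"{group} is already owned by {self.owner[group]}")
        self.owner[group] = app

    def bind_subscribe(self, group: str, app: str) -> None:
        """Bind in subscribe-mode; allowed even if no source has bound the group."""
        self.subscribers.setdefault(group, set()).add(app)

    def publish(self, group: str, app: str, packet: bytes) -> dict[str, bytes]:
        """Deliver a copy of `packet` to every subscriber; only the owner may publish."""
        if self.owner.get(group) != app:
            raise PermissionError("only the owner of a group may publish to it")
        return {sub: packet for sub in self.subscribers.get(group, set())}
```

Because the two bindings are independent, a source that wants its own stream back simply calls both `bind_publish` and `bind_subscribe` for the same group.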
[0051] For unicast packets, the router uses its unicast routing
table to find the preferred next hop, while multicast packets are
routed according to a subscription list in a multicast subscription
table. In one embodiment, a unicast data packet contains the node
address of the source router, the tport of the source application,
the address of the destination router and the tport of the
destination application. A multicast data packet contains the
multicast group it was published to, represented by the node
address of the source router and the group identifier that was
bound by the source application. It does not contain any other
addressing information.
[0052] In one embodiment, the overlay multicast system determines
unicast routing tables and multicast subscription tables for each
router. The system may utilize any type of routing algorithm or
protocol to determine routing tables. Protocols that may be utilized
include distance-vector protocols and link-state protocols.
Link-state protocols take the relatively simple approach of
propagating the state of the entire network as a list of all known
links in the network with their cost. Distance-vector protocols are
based on the Bellman-Ford algorithm. They work by letting every
router periodically advertise its own routing table to its
neighbors. In one embodiment, the Extended Bellman-Ford protocol,
hereafter referred to as ExBF, is used as the basis of the overlay
multicast network. For the sake of convenience, the embodiments of
the overlay multicast system are described as utilizing ExBF;
however, those of ordinary skill in the art would understand that
other routing algorithms may also be utilized.
[0053] The ExBF protocol keeps slightly more information for
every destination. In addition to storing the distance to each
destination for every neighbor, ExBF also stores the address of
the pre-final node of each path. Hence, besides the collection
of distances {D.sub.i(j)}, where j represents a destination, D the
distance between this node and j, and i ranges over the node's
neighbors, each router also stores {P.sub.i(j)}, where P is the
pre-final, or so-called `head-of-path`, node in the shortest path
to destination j. Because the router
knows the pre-final node of every destination, it can backtrack the
full path to any destination by recursively looking at the
head-of-path of a destination and treating it as a new
destination.
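By way of illustration only, the backtracking described above may be sketched as follows. The dictionary `head_of_path`, the function name and the node names are assumptions made for this sketch and are not part of the described embodiments; the sketch walks the table iteratively rather than recursively, which yields the same path.

```python
def backtrack_path(local, destination, head_of_path):
    """Rebuild the full path from `local` to `destination`.

    `head_of_path[j]` holds the pre-final node on the shortest path to j
    (the P(j) of the text). The table layout is an assumption.
    """
    path = [destination]
    node = destination
    while node != local:
        node = head_of_path[node]       # treat the head-of-path as a new destination
        if node in path:                # defensive check for an inconsistent table
            raise ValueError("loop in head-of-path table")
        path.append(node)
    path.reverse()
    return path


# Hypothetical table kept by router "a" for the chain a - b - c - d.
table = {"b": "a", "c": "b", "d": "c"}
print(backtrack_path("a", "d", table))
```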
[0054] In one embodiment, the interface modules 203, 205 implement
a simple algorithm to establish the communication connection to the
neighbor machine or device. The interface module 203, 205 attempts
to connect to the configured neighbor router by actively trying to
connect to its network address, e.g., its TCP/IP network address.
If the connection is aborted with a connection refused or other
error, it is assumed that the neighbor is not yet running and the
interface module 203, 205 starts to listen on its configured IP
port so that the neighbor can connect to it when it is started up.
The interface module 203, 205 waits for an incoming connection
request for a brief period of time. After this time period expires,
the interface module 203, 205 returns to actively connecting to its
neighbor. To avoid a situation where both neighbors continue to
switch between the listen and active states at the same time, the
duration of the listening state is influenced by a random factor.
An advantage of allowing each neighbor to switch between the active
connection and listening states when establishing a connection is
that it allows the connection of routers even if one of the routers
is on a masqueraded IP network. A router on a masqueraded IP
network is unable to accept incoming connections such as TCP
connections.
[0055] In one embodiment, interface modules 203, 205 can be
configured at router deployment through configuration files. This
will make the router daemon automatically create the specified
interfaces at startup. If the configured neighbor routers are
online, all long-lived connections will automatically be
established. It is also possible to add new neighbor connections
and interface module instances dynamically at runtime. This way the
overlay network topology can be changed flexibly and new routers
can be added to the network. If an interface module 203, 205
manages to establish a connection with a neighbor router, the
interface modules of the routers exchange overlay network node
address information to inform each other of their presence. When
the interface module 203, 205 receives the node address of its
neighbor, it passes this information, together with an announcement
that a connection has been made to the kernel 201. This information
allows the kernel routing algorithms to build dynamic routing
tables.
[0056] In one embodiment, a role of the interface modules 203, 205
is to establish the connection with a configured neighbor router
and to send data packets from the router kernel 201 to the neighbor
connection and vice versa. The interface module 203, 205
incorporates a framework that allows custom software plug-ins to
influence the stream of packets that flows between the network and
the kernel 201. This mechanism is referred to as the interceptor
pipeline 251, 257. In one embodiment, each software plug-in
component that needs to control the packet flow is a class that
implements a simple programming interface. In another embodiment,
the software plug-ins may have any implementation structure,
including object-oriented structures. This interface allows
interceptor instances to be chained together, forming a packet
processing pipeline. The contract of an interceptor is that it
receives a packet, applies its operations and then passes the
modified packet to the next interceptor in the chain.
[0057] In one embodiment, an interceptor pipeline 251, 257 sits
between the router switching core and the network connection. When
the router kernel 201 delivers a packet to the interface module
203, 205 for transmission to the neighbor router, the interface
module runs the packet through the interceptor pipeline, giving the
interceptors the chance to modify the packet. Each packet that
comes out of the pipeline is transmitted to the neighbor
router.
[0058] In one embodiment, each interface module 203, 205 has two
interceptor pipelines 251, 257. The first interceptor pipeline 257
is used to process outbound packets. The second interceptor
pipeline 251 is used for inbound packets. These pipelines are
independent of one another: the ordering of the interceptors and
the number of processing steps can be different for inbound and
outbound packets. Each pipeline can be configured uniquely.
Interceptor pipelines may have any number of interceptors, provided
they do not add too much latency.
[0059] In one embodiment, an example of an interceptor is one that
filters packets from a specific overlay network router. When this
interceptor is implemented as a manageable component that can be
configured dynamically at runtime, it can be used to implement
basic firewall functionality on the overlay network. If it receives
a packet that matches its rejection pattern, it discards the packet
by not passing it to the next interceptor in the pipeline. Another
type of interceptor that may be used is a traffic monitor that
counts the size of each packet that passes by and uses this to log
traffic activity and compute bandwidth statistics. This plug-in
mechanism allows an overlay network router to be extended with
additional functionality without modifications to the underlying
software.
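As a non-limiting sketch, an interceptor chain with a firewall interceptor and a traffic monitor interceptor might look as follows. The class names, the `handle` and `process` methods and the dictionary-based packet representation are assumptions made for this illustration only.

```python
class Interceptor:
    """Base class: process a packet, then hand it to the next interceptor."""

    def __init__(self):
        self.next = None

    def handle(self, packet):
        result = self.process(packet)
        if result is None:              # packet was discarded, stop the chain
            return None
        if self.next is None:           # end of the pipeline reached
            return result
        return self.next.handle(result)

    def process(self, packet):
        return packet


class FirewallInterceptor(Interceptor):
    """Discards packets that originate from a blocked overlay router."""

    def __init__(self, blocked_sources):
        super().__init__()
        self.blocked_sources = set(blocked_sources)

    def process(self, packet):
        if packet["source"] in self.blocked_sources:
            return None                 # not passed to the next interceptor
        return packet


class TrafficMonitorInterceptor(Interceptor):
    """Counts packets and payload bytes that pass by."""

    def __init__(self):
        super().__init__()
        self.packet_count = 0
        self.byte_count = 0

    def process(self, packet):
        self.packet_count += 1
        self.byte_count += len(packet["payload"])
        return packet


def build_pipeline(*interceptors):
    """Chain the interceptors in the given order and return the first one."""
    for current, following in zip(interceptors, interceptors[1:]):
        current.next = following
    return interceptors[0]


monitor = TrafficMonitorInterceptor()
pipeline = build_pipeline(FirewallInterceptor({"s1.x"}), monitor)
pipeline.handle({"source": "s1.x", "payload": b"dropped"})
pipeline.handle({"source": "s2.c", "payload": b"hello"})
print(monitor.packet_count, monitor.byte_count)
```

Because the firewall sits before the monitor in this hypothetical ordering, rejected packets never reach the monitor and are not counted.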
[0060] In one embodiment, the interceptor pipelines 251, 257 act as
a packet buffer between the router kernel 201 and the network. The
interface modules 203, 205 temporarily buffer inbound and outbound
packets. This ensures that the kernel 201 is not blocked when
sending a packet to an interface. In one embodiment, the interface
modules have a separate thread or set of threads that continuously
dequeue and serialize packets from the buffer and write them to the
communication connection, or enqueue received packets into a
buffer.
[0061] In one embodiment, the interceptor pipeline provides a
temporary packet buffer and offers inter-thread communication
between the kernel thread and the interface thread. Interceptors
may be divided into two categories: interceptors that block and
interceptors that return immediately. The first category is
referred to as blocking or synchronous interceptors. In one
embodiment, to avoid situations where a router kernel 201 is
blocked for an arbitrarily long time, an interceptor pipeline may
contain one non-blocking interceptor. A non-blocking interceptor
guarantees immediate return of control to a caller by storing the
packet in an internal buffer. The packets in the buffer may be
discarded if the buffer exceeds a certain threshold size.
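A minimal sketch of such a non-blocking buffer interceptor is given below. It assumes, purely for illustration, that the newly arriving packet is the one discarded when the buffer is full (tail drop); the description does not specify which packet is dropped. The class name, method names and the `deliver` callback are likewise hypothetical.

```python
import queue


class BufferInterceptor:
    """Non-blocking interceptor backed by a bounded packet buffer."""

    def __init__(self, max_packets, deliver):
        self._queue = queue.Queue(maxsize=max_packets)
        self._deliver = deliver         # next stage, e.g. the socket writer
        self.dropped = 0

    def handle(self, packet):
        """Called by the kernel thread; returns immediately."""
        try:
            self._queue.put_nowait(packet)
        except queue.Full:
            self.dropped += 1           # assumed policy: drop the new packet

    def drain_once(self):
        """Called by the interface thread; forwards one buffered packet.

        Blocks while the buffer is empty, which is acceptable because only
        the interface thread waits here, never the kernel thread.
        """
        packet = self._queue.get()
        self._deliver(packet)


sent = []
buffer_interceptor = BufferInterceptor(max_packets=2, deliver=sent.append)
for payload in (b"p1", b"p2", b"p3"):
    buffer_interceptor.handle(payload)
buffer_interceptor.drain_once()
buffer_interceptor.drain_once()
print(sent, buffer_interceptor.dropped)
```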
[0062] In one embodiment, a maximum size limit is placed on the
packet buffers to prevent them from exhausting the router's memory.
Storing packets before processing them means that their delivery
will be delayed. The larger the buffer gets, the longer the packets
are delayed. Because of this, the interceptor drops packets when
the buffer reaches its maximum size. In one embodiment, the system
uses reliable TCP or similar connections for transmitting packets
between routers, so that packets are only dropped inside the buffer
interceptors of the interface modules.
[0063] In one embodiment, the interceptor plays a role in making
packet loss deterministic. Packets are not explicitly dropped in
any other part of the layered multicast system network, except the
buffer interceptor in the interface pipelines. However, that may
not guarantee that a packet that successfully makes it through all
buffer interceptors of the network's routers is delivered at its
destination. Aside from controllable packet loss, the network may
also occasionally lose packets in a non-deterministic way, for
example when a router crashes with pending packets in its buffers,
or when a connection between adjacent routers is unexpectedly
closed during transmission of a packet.
[0064] In one example, an inbound interceptor pipeline 251 may be
structured such that traffic throughput monitor interceptors 253
are positioned after a buffer interceptor 255 and a firewall
interceptor 265 may be positioned before the buffer interceptor
255. In an example outbound interceptor pipeline 257, a throughput
limiting interceptor 259 may be positioned before a buffer
interceptor 261 and a traffic monitor interceptor 267 may be
positioned after the buffer interceptor 261.
[0065] In one embodiment, the overlay network is accessible to
applications through the use of the client-side programming library
221. This library connects to a router daemon 109 at application
startup and communicates with it using a remote procedure call (RPC)
protocol or similar communication protocols. The communication between the
library 221 and the router daemon 109 is driven by the application
227. The application 227 invokes functions in the router 109
through the library 221. In one embodiment, when the kernel 201
receives packets addressed to a tport that is bound by the
application 227, it stores them until the client actively picks
them up. The application 227 through the library 221 continuously
polls the router for packets. To minimize the overhead of the
polling mechanism, the router poll function does not return until
there is at least one packet delivered by the kernel 201. If more
than one packet is waiting, the poll function returns all waiting
packets at the time it is invoked. In another embodiment, the
router may send an indication, such as invoking a marshaled stub
from the client, an event notification or a similar indicator, to the
application 227 through the library 221 to indicate the reception
of a data packet for the application 227.
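The blocking poll described above may be sketched, under stated assumptions, with a condition variable. The class `ClientMailbox`, its method names and the optional `timeout` argument are hypothetical; the timeout is included only so that the sketch can be exercised without blocking forever.

```python
import threading


class ClientMailbox:
    """Per-tport packet store on the router side (illustrative only)."""

    def __init__(self):
        self._packets = []
        self._cond = threading.Condition()

    def deliver(self, packet):
        """Called by the kernel when a packet arrives for the bound tport."""
        with self._cond:
            self._packets.append(packet)
            self._cond.notify()

    def poll(self, timeout=None):
        """Block until at least one packet is waiting, then return them all."""
        with self._cond:
            self._cond.wait_for(lambda: self._packets, timeout=timeout)
            waiting, self._packets = self._packets, []
            return waiting


mailbox = ClientMailbox()
mailbox.deliver(b"first")
mailbox.deliver(b"second")
print(mailbox.poll())                   # both waiting packets in one call
```

Returning every waiting packet in a single call keeps the number of round trips between the library and the router low when packets arrive in bursts.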
[0066] In one embodiment, a user application 227 uses a client
library 221 through instances of the overlay multicast system
communication sockets 225. If a user wants to be able to receive
packets, they reserve a network address. In one embodiment, a
network address in the overlay multicast system is represented as a
combination of the address of the router and a unique port
identifier, reserved for the socket of the router. For sake of
convenience, a port in the layered multicast system will be
referred to as a tport and a socket as a tsocket. When a user
application creates a tsocket for receiving normal unicast packets,
the tsocket automatically binds a local tport at the router daemon
through a remote call to the client IO multiplexor 207 at the
router. In one embodiment, a tport is bound exclusively and other
clients may not use it concurrently.
[0067] In one example embodiment, communication sockets 225
communicate with a local input output (IO) multiplexor 229 that
coordinates the handling of communication between the sockets and
the router kernel through the formation of RPC calls, native
inter-process communication or similar systems. A local IO
multiplexor 229 utilizes an RPC stub 223 or similar program to
communicate with the router via an RPC skeleton 219 and client
adaptor 209. A client IO multiplexor 207 at the router manages the
relay of these socket requests to the kernel 201.
[0068] In one embodiment, the router daemon 109 uses a packet
buffer to temporarily store packets for each connected client
application until they are picked up. An interceptor pipeline in
the client adaptor 209 or similar process may be utilized for this
buffering function.
[0069] To overcome the problem of growing routing tables,
computation time and excessive advertisement overhead, large
networks can be partitioned into smaller sub sections, connected by
gateways. While the gateway routers maintain routing information
necessary to route packets to nodes in other network sections,
nodes inside a section or domain only maintain information for
those nodes inside the same domain. By substituting logical ranges
of hosts in the routing table by one single condensed entry, the
size of the routing table is reduced. This is called address
summarization. This mechanism introduces a form of hierarchy that
allows the network as a whole to grow far beyond the practical
limits of standard distance-vector or link-state algorithms. The
farther away a destination host is, the more efficiently it can be
condensed together with other remote hosts. The more summarization
is applied, though, the less efficient the paths become on average.
The factor by which the actual data paths on summarized networks
differ from the optimal paths, is known as the stretch-factor: the
maximum ratio between the length of a route computed by the routing
algorithm and that of a shortest path connecting the same pair of
nodes.
[0070] In one embodiment, the overlay multicast system takes a
relatively straightforward approach to address summarization. An
administrator decides at deploy time which nodes form clusters and
which clusters form aggregations of clusters. This is done by
encoding hierarchy in the layered multicast system node addresses
using a dotted or similar notation. Node addresses may be ASCII
strings. In one embodiment, the strings are limited to at most 127
characters. In one embodiment, only [a-z] and [0-9] are available.
In another embodiment, any characters, numbers or similar symbols
may be utilized. Addressing may be case-sensitive or
case-insensitive.
[0071] FIG. 3A is a diagram of one example embodiment of a layered
multicast system network divided into logical clusters. The example
illustrates a network of eight nodes, divided into three clusters.
Assigning nodes to clusters may be based on geographical
properties, administrative boundaries, wide area links and similar
considerations. For example, nodes inside a corporate network are
all assigned the same logical domain, whereas a network that
connects nodes from different corporations, would usually assign a
separate domain to each corporate network. Another criterion for
assigning nodes is that nodes that often disconnect and reconnect
again later, possibly because the overlay multicast system routers
run on personal computers, are placed in a subdomain, to prevent
the routing updates triggered by their changing state from
propagating far into the network.
[0072] Given the logical domains that cluster groups of nearby
nodes, each node can treat domains other than its own as a single
entity and use a wildcard address that matches every host inside
that domain. This reduces the size of the routing table, as well as
the amount of routing information that needs to be exchanged
between the nodes when the topology changes inside a domain. For
example, when a new host is added to domain S1, there is no need to
propagate that information to the other domains, as they already
have a wildcard entry that will match the new address.
[0073] FIG. 3B is a diagram of a routing table for the network of
FIG. 3A. The example routing table shows the effect of address
summarization in this network on the routing table of node S2.C.
The cost value that is listed in the third column represents the
cost of the shortest path to the nearest node inside that domain.
In the fourth routing entry, the number in the cost column is the
cost to reach S1.R from S2.C. If it is assumed that both
interdomain links S2.C-S1.R and S2.A-S1.R have the same weight or
cost, S2.C will route all traffic for domain S1 through neighbor
S1.R.
[0074] In one embodiment, when nodes exchange distance vectors,
either because of a link change, or as a normal, periodic exchange,
the receiving node first summarizes the destination address of each
distance vector relative to its own address. Summarization is done
by matching the destination address with its own address
field-by-field and when a field does not match, the rest of the
address is substituted by a wildcard. For example, when S2.C
receives a distance-vector from S1.R that contains a path to
destination S1.P, it is immediately summarized to S1.* upon arrival
at node S2.C. This is because the first field differs from the
first field of the local address and as such the remaining fields
are replaced by a wildcard. This wildcard value is then fed to the
distance vector algorithm that checks whether the new cost or
path-length is shorter than the cost of the entry that was already
in the routing table. In the present example there already is an
S1.* wildcard entry in the routing table that was derived from
neighbor destination S1.R. Since the path to S1.P runs through
S1.R, the path to S1.R will always be shorter than the path to
S1.P, so the entry in the routing table will not be changed and no
further routing updates will be propagated to S2.C's neighbors.
[0075] In general, when a local address is 1.2.3.4 and an incoming
distance-vector advertises destination 1.2.2.2, it will be
summarized to 1.2.2.*. Destination 2.1.6 becomes 2.*, destination
1.2.3.4.5.6 becomes 1.2.3.4.5.*, destination 1.2.3.5 stays 1.2.3.5
and destination 1.2.3.4.5 also stays 1.2.3.4.5. Note that 1.2.3 and
1.2.3.* are two different addresses. The first only matches the
exact address 1.2.3 while the second is a wildcard that matches
everything that starts with 1.2.3 and has at least 4 address
fields. This includes 1.2.3.4 and 1.2.3.4.5.6, but not 1.2.3. If
this mechanism of address summarization is used in an ExBF
implementation that carries tuples containing destination, cost and
head-of-path attributes in its vectors, then the destination
address is summarized, as well as the head-of-path address.
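The summarization and wildcard matching rules above can be sketched as follows. The function names are assumptions made for this illustration; the behavior follows the examples given in the text for local address 1.2.3.4.

```python
def summarize(local, destination):
    """Summarize `destination` relative to the `local` node address.

    Fields are compared one by one; everything after the first field that
    does not match is replaced by a single wildcard.
    """
    local_fields = local.split(".")
    dest_fields = destination.split(".")
    for i, field in enumerate(dest_fields):
        mismatch = i >= len(local_fields) or field != local_fields[i]
        if mismatch:
            if i + 1 < len(dest_fields):    # fields remain after the mismatch
                return ".".join(dest_fields[: i + 1] + ["*"])
            return destination              # nothing left to condense
    return destination


def matches(entry, address):
    """Return True if routing entry `entry` covers `address`."""
    if not entry.endswith(".*"):
        return entry == address             # exact entries match only themselves
    prefix = entry[:-2].split(".")
    fields = address.split(".")
    # A wildcard needs at least one more field than its prefix.
    return len(fields) > len(prefix) and fields[: len(prefix)] == prefix


for dest in ("1.2.2.2", "2.1.6", "1.2.3.4.5.6", "1.2.3.5", "1.2.3.4.5"):
    print(dest, "->", summarize("1.2.3.4", dest))
```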
[0076] FIG. 4 is a diagram of an example network with a
hierarchical structure. When the example network first starts to
converge using the ExBF algorithm, node B.X advertises the
following routing information to neighbor A.S: DV.sub.B.X,A.S:
{(B.X, *, 0), (B.Y, B.X, 1), (B.Z, B.X, 1)}. When A.S receives the
vectors, it summarizes their entries. What remains is the single
vector DV.sub.B.X,A.S: {(B.*, *, 0)}. The routing table of node A.S
now contains RT.sub.A.S: {(A.S, *, *, 0), (A.R, A.R, A.S, 1), (B.*,
B.*, A.S, 1)}. All addresses are summarized before processing. This
includes neighbor addresses. The consequence of this is that when a
node has more than one connection with a foreign domain, both
neighbor addresses will be summarized into the same wildcard. This
leads to ambiguities and non-deterministic routing when this
wildcard is listed as the preferred hop in a routing entry, as it
cannot identify a single outgoing link. This problem is solved by
assigning a local identifier to each link and using these numbers
in the preferred hop column, rather than the addresses of the links'
peers. The routing update A.S sends to A.S.Y contains
DV.sub.A.S,A.S.Y: {(A.S, *, 0), (A.R, A.S, 1), (B.*, A.S, 1)} and
leads to A.S.Y's routing table RT.sub.A.S.Y: {(A.S.Y, *, *, 0),
(A.S.X, A.S.X, A.S.Y, 1), (A.S.Z, A.S.Z, A.S.Y, 1), (A.S, A.S,
A.S.Y, 1), (A.R, A.S, A.S, 2), (B.*, A.S, A.S, 2)}. When A.S.Y has
finished updating its routing table, it advertises DV.sub.A.S.Y,A.S:
{(A.S.Y, *, 0), (A.S.X, A.S.Y, 1), (A.S.Z, A.S.Y, 1), (B.*, *, *),
(A.S, *, *), (A.R, *, *)} back to neighbor A.S, where the asterisks
in the last three records indicate unreachable, to avoid long-lived
loops. These loops were detected through the normal back trace
mechanism of ExBF.
[0077] In one embodiment, summarization means that only a single
path for a range of remote nodes is maintained. The consequence of
this is that packets will not always be routed according to the
real shortest path between source and destination. To illustrate
this effect in the multicast network, consider the example network
of FIG. 4. Node S1.R receives the distance-vectors of both S2.A and
S2.C and, after summarization, learns that both neighbors offer a
route for wildcard S2.*. If it is assumed that both interdomain
links (S1.R-S2.A and S1.R-S2.C) have equal costs, then S1.R will
choose S2.A to be the preferred hop for S2.* because of the fact
that its address logically comes first. When S1.R needs to forward
a data packet to S2.B, it uses the S2.* wildcard entry and forwards
the packet to neighbor S2.A. Unfortunately this is not the shortest
path to S2.B, as S2.A first has to route the packet through S2.C.
Instead, S1.R should have sent the packet directly through neighbor
S2.C. The ratio between the length of the actual path and the
length of the optimal path between the endpoints is called the
stretch factor.
[0078] In one embodiment, path stretching on the overlay multicast
system occurs when packets are forwarded between nodes that are in
different logical domains. An entire subnet is treated as a single
node with several outgoing links. Because a virtual node often
contains a large number of nodes, connected by its own internal
network structure, it is sometimes better to choose a different
interdomain link when sending packets to the domain.
[0079] Since the overlay network runs its own adaptive routing
algorithms, the content streams through the network are constantly
rerouted to avoid the network's hot spots and congestion. This can
be particularly useful on wide area networks that are used for very
different types of applications and hot spots are dynamic (i.e.,
moving around). Another advantage of having custom routing
algorithms is the freedom of substituting them with others in the
future.
[0080] In one embodiment, the software routers implement
load-balancing inside the network. Traditionally, routing
algorithms seek "best paths" through the network and send
datastreams over these. However, it is sometimes much more
desirable not to just send streams over this optimal path, but to
also select several sub-optimal paths and divide the stream over
all of them. This also prevents the optimal path from getting
congested when the stream requires more bandwidth than this single
path can provide.
[0081] FIG. 5A is a diagram demonstrating summarization of an
example network with an S2 domain. When node S1.P needs to forward
a packet for destination S2.F, it will send the packet directly
down its interdomain link to the S2 domain. In this example both
interdomain links are assumed to have equal costs. FIG. 5B is a
diagram demonstrating the stretch factor for the example network of
FIG. 5A and shows the inner network of the S2 domain. FIG. 5B
clearly shows that the stretch-factor is quite high for packets
from S1.P to S2.F, as the optimal path runs through S1.R instead.
Although the overlay multicast system address summarization
technique cannot guarantee an upper bound on the
stretch-factor, it can manipulate the stretch-factor by changing
the summarization opacity.
[0082] By default, an address is summarized after the first field
that differs from the local address. However, if that is changed to
the second field, the overlay multicast system can look inside
other domains for one level. A node with an address 1.2.3.4 will
then summarize 2.3.4.5 into 2.3.*, rather than 2.* and 1.3.4.5 into
1.3.4.*, rather than 1.3.*. Doing this at least at the border nodes
in the overlay multicast network that have the interdomain links
reduces the stretch-factor under certain circumstances. As the
overlay multicast network was designed to be administrated by
independent parties, administrators are free to experiment with
different summarization opacity levels without jeopardizing the
other subnets or domains.
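One possible way to model the summarization opacity is as a count of extra fields that are kept after the first differing field. The parameter name `extra_fields` and the function name are assumptions made for this sketch; a value of 0 corresponds to the default behavior and a value of 1 to looking one level inside other domains.

```python
def summarize_with_opacity(local, destination, extra_fields=0):
    """Summarize `destination` relative to `local`.

    `extra_fields` is the number of additional fields kept after the first
    field that differs from the local address (hypothetical parameter).
    """
    local_fields = local.split(".")
    dest_fields = destination.split(".")
    for i, field in enumerate(dest_fields):
        if i >= len(local_fields) or field != local_fields[i]:
            keep = i + 1 + extra_fields
            if keep < len(dest_fields):     # something is left to condense
                return ".".join(dest_fields[:keep] + ["*"])
            return destination
    return destination


for dest in ("2.3.4.5", "1.3.4.5"):
    print(dest,
          summarize_with_opacity("1.2.3.4", dest),
          summarize_with_opacity("1.2.3.4", dest, extra_fields=1))
```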
[0083] One advantage of the summarization of addresses in routing
tables is that the number of links that are present between nodes
from different logical domains is irrelevant with respect to the
number of routing entries in the routing table of a distance-vector
protocol or similar protocols. Since these tables only contain
destinations with a single forwarding strategy, the number of
interdomain links does not affect the size of the routing tables.
Let y be the number of entries in a node's routing
table, x the total number of nodes in the entire network, n
the number of entities (nodes or nested domains) inside a
domain and m the depth of the hierarchy. From this it follows
that the total number of nodes in the network can be calculated
with:
\[ x = n^{m}, \quad \text{where } n \in \mathbb{N},\ n > 1 \text{ and } m \in \mathbb{N},\ m > 0 \]
[0084] The relation between domain depth, number of nodes and
domain density can be expressed by the following formula for a
network with a uniform topology:
\[ y = \frac{\ln(x)}{\ln(n)} \cdot (n - 1) + 1 \]
[0085] Given this formula for the effectiveness of address
summarization, it can be shown that the number of routing entries
in any network node remains under 100 even when the network as a
whole grows to well over 10 million nodes. It can also be shown that the domain
size n has no spectacular effect on the scalability. While a small
value n yields a relatively deeply nested network hierarchy, which
implies more routing entries, it also means that each hierarchy
level only contains a small number of entities. A large value n in
a network of the same size yields a relatively flat hierarchy with
few levels, where each level contains a large number of entities. As a
routing table can only contain a natural number of entries, its
number remains constant as nodes are added to domains that are
already captured by the existing wildcard entries.
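As a worked illustration of the formula, a network of x = 10,000,000 nodes with n = 10 entities per domain has a depth of ln(x)/ln(n) = 7, giving y = 7 * 9 + 1 = 64 routing entries. The helper below evaluates the formula for other values; the function name is an assumption, and the uniform topology of the text is presumed.

```python
import math


def routing_entries(total_nodes, domain_size):
    """Evaluate y = (ln(x) / ln(n)) * (n - 1) + 1 for a uniform topology."""
    depth = math.log(total_nodes) / math.log(domain_size)   # m in the text
    return depth * (domain_size - 1) + 1


print(round(routing_entries(10_000_000, 10)))
```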
[0086] An advantage of using a string-based node addressing scheme
is its large address space and flexible naming. New subdomains may
be added at any time as there is no need for predefined or fixed
hierarchical levels. All nodes inside a flat domain can be
reconfigured into nested subdomains without the need to reconfigure
routers or compute netmasks. Also, as address fields are not bound
to a limited number of characters, they can contain location or company
names to make administration of the network easier and more
transparent. An example might be the address
n1.amsterdam.level3.marketxs.node4, or simply n1.ams.13.mxs.n3 to
keep the address relatively short. The drawback of these kinds of
unrestricted addresses is the number of bytes they require. Because
every data packet always contains a source and a destination
address, the amount of overhead per packet can be excessive at
times.
[0087] In one embodiment, the string-based node addressing scheme is
implemented by letting the first bytes of each address header
specify the length of the address in bytes, followed by the address
itself as a normal 8-bit ASCII string with a maximum of, for
example, 256 bytes. In one alternative, null-terminated strings are
used.
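A minimal sketch of the length-prefixed variant is shown below. It assumes a single length byte, which limits an address to 255 bytes; the actual width of the length field is not stated in the description, and the function names are hypothetical.

```python
def encode_address(address):
    """Encode an address as one length byte followed by ASCII bytes."""
    raw = address.encode("ascii")
    if len(raw) > 255:
        raise ValueError("address too long for a one-byte length prefix")
    return bytes([len(raw)]) + raw


def decode_address(buffer, offset=0):
    """Decode one address at `offset`; return (address, next_offset)."""
    length = buffer[offset]
    start = offset + 1
    end = start + length
    if end > len(buffer):
        raise ValueError("truncated address")
    return buffer[start:end].decode("ascii"), end


# Source and destination addresses laid out back to back in a header.
header = encode_address("s1.r") + encode_address("s2.c")
source, position = decode_address(header)
destination, position = decode_address(header, position)
print(source, destination)
```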
[0088] In another embodiment, a sufficiently large address space is
used and divided into logical subsets, similar to IPv4 and IPv6
addresses. In one embodiment, a 64-bit address space is used and
divided into 8-bit fields (identical to IPv6) used to identify
logical subnets. In this embodiment, addresses might look like
120-23-61-201-43-146-128-132. To make distinct addresses, IPv4 uses
the dot "." for separation, IPv6 uses the colon ":" and the present
system uses a hyphen "-". When each address field represents
exactly one logical subnet level, this scheme provides a simple,
but at the same time somewhat limited, way of addressing. It means
that the complete network can address as many individual hosts as
IPv6, while grouping them in at most 8 levels of sub domains, where
each sub domain can contain a maximum of 256 hosts or sub domains.
The sub domains can be utilized to indicate a network hierarchy in
the topology, including a geographical or similarly based
hierarchy.
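For illustration, the hyphen-separated notation can be converted to and from a single 64-bit integer as sketched below. The function names and the big-endian field order (first field most significant) are assumptions made for this sketch.

```python
def pack_address(text):
    """Pack eight hyphen-separated 8-bit fields into one 64-bit integer."""
    fields = [int(part) for part in text.split("-")]
    if len(fields) != 8 or any(not 0 <= f <= 255 for f in fields):
        raise ValueError("expected eight fields in the range 0-255")
    return int.from_bytes(bytes(fields), "big")


def unpack_address(value):
    """Render a 64-bit integer in the hyphen-separated notation."""
    return "-".join(str(b) for b in value.to_bytes(8, "big"))


packed = pack_address("120-23-61-201-43-146-128-132")
print(hex(packed), unpack_address(packed))
```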
[0089] In one embodiment, to reduce the amount of bandwidth wasted
on this overhead, a mechanism is used for replacing the address
strings with small 16-bit values or similar small-sized values when
transmitting a packet to a neighbor. Also, a translation table is
kept at both sides of the connection that is used to put the
original address string back in the packet before passing it to the
router's packet switching kernel. This optimization works on a
per-link basis and may be totally independent from the
communication with other neighbors. When it is used on more than
one interface, each interface maintains its own translation
table.
[0090] If a packet is transmitted to a neighbor, all address
strings are stripped from the packet header. For each address
string a 16-bit or similar value is generated and stored in the
translation table. These 16-bit or similar values replace the
original addresses and, before the packet is transmitted, its
protocol version field in the first header of its wire-level
representation is changed to a higher version. This version number
is reserved for optimized packets that would be unreadable for
routers without support for address substitution. To ensure proper
translation by the peer, the interface first sends a special packet
that contains the new address substitutions. Such a packet is
tagged with the higher protocol version number and contains one or
more tuples of address strings and their substitute values. In one
embodiment, such a packet takes the form:
##STR00001##
Type 0x1 is used for issuing new substitutions and type 0x2 is used
to invalidate an earlier or unknown substitution. The system
attempts to ensure that this (or similar) packet is received by the
peer prior to the data packet itself, which is something that
cannot be guaranteed by every transport layer. TCP or similar
guaranteed communication method may be used for communication
between neighbor nodes in the overlay multicast network.
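The per-link substitution may be sketched as a pair of in-memory tables. The sketch does not reproduce the wire format of the substitution packet; it models the substitution tuples as plain Python tuples, and the class and method names are assumptions made for illustration only.

```python
class SubstitutionSender:
    """Sending side of one link: maps address strings to 16-bit values."""

    def __init__(self):
        self._ids = {}

    def outgoing(self, addresses):
        """Return (new substitution tuples, substituted values).

        The tuples must reach the peer before the data packet itself.
        """
        announcements = []
        substituted = []
        for address in addresses:
            if address not in self._ids:
                if len(self._ids) >= 0x10000:
                    raise RuntimeError("16-bit substitution space exhausted")
                self._ids[address] = len(self._ids)
                announcements.append((address, self._ids[address]))
            substituted.append(self._ids[address])
        return announcements, substituted


class SubstitutionReceiver:
    """Receiving side of one link: restores the original address strings."""

    def __init__(self):
        self._addresses = {}

    def learn(self, announcements):
        for address, value in announcements:
            self._addresses[value] = address

    def restore(self, substituted):
        return [self._addresses[value] for value in substituted]


sender, receiver = SubstitutionSender(), SubstitutionReceiver()
new_tuples, values = sender.outgoing(["s1.r", "s2.c"])
receiver.learn(new_tuples)              # substitution packet arrives first
print(receiver.restore(values))
```

On a later packet that reuses the same addresses, `outgoing` returns an empty list of tuples, so only the short values cross the link.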
[0091] In one embodiment, configuring an interface for address
substitution can be done manually, or automatically. In the latter
case the interface uses a special handshake packet that is part of
the higher protocol version. In one embodiment, the handshake
packet takes the form:
##STR00002##
The version field has the higher protocol version. The type for the
handshake packet is 0x0 and the length is 4. A peer transmits such
a (or similar) packet when the connection is first established.
When the packet is echoed back by the peer, the interface knows
that the neighbor also supports substitution and goes into
substitution mode. After a connection with a neighbor has been lost
and is re-established, the handshake is performed again as the peer
could have been replaced by software that does not support
substitution.
[0092] In one embodiment, entries in the translation table all have
a time-out or similar tracking mechanism to ensure timelines and
accuracy of entries. After this period the entry is removed from
the table and any packet still using the substituted value will not
be processed and cause the receiving node to respond with a special
packet containing the substituted values that weren't recognized,
allowing the peer to synchronize translation tables by sending a
packet containing the substitution tuples. Since the invalidation
of timed-out entries is automatic, both peers use the same time-out
or similar values in tracking mechanisms. In one embodiment, when
the translation table is full, entries are removed according to a
least recently used scheme. The address substitution is implemented
in a substitution interceptor that is the lowest interceptor or a
part of the lowest interceptor or other interceptor reconstructing
packets immediately on arrival in an interface module of a router.
Substitution introduces additional processing overhead and enabling
it is optional for each node.
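The translation table behavior described above (time-outs plus least recently used eviction) can be sketched as follows. This is only an illustrative sketch: the class name `TranslationTable`, its parameters, and the choice not to refresh the time-out on lookup are assumptions, not details taken from the embodiment.

```python
import time
from collections import OrderedDict


class TranslationTable:
    """Hypothetical table mapping substitute values to full address strings."""

    def __init__(self, capacity, timeout_s, clock=time.monotonic):
        self.capacity = capacity
        self.timeout_s = timeout_s
        self.clock = clock
        # substitute -> (address, expiry time); order tracks recency of use
        self._entries = OrderedDict()

    def add(self, substitute, address):
        # When the table is full, evict the least recently used entry.
        if substitute not in self._entries and len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)
        self._entries[substitute] = (address, self.clock() + self.timeout_s)
        self._entries.move_to_end(substitute)

    def lookup(self, substitute):
        """Return the address, or None if the value is unknown or timed out."""
        entry = self._entries.get(substitute)
        if entry is None:
            return None
        address, expires = entry
        if self.clock() >= expires:
            # Timed-out entries are removed automatically (assumed: no refresh on use).
            del self._entries[substitute]
            return None
        self._entries.move_to_end(substitute)  # mark as most recently used
        return address
```

A `None` result corresponds to the case in which the receiving node would respond with the packet listing the unrecognized substituted values.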
[0093] In one embodiment, the overlay multicast system may be used
solely for routing. The overlay multicast system offers multicast
support over existing IP networks. The overlay multicast system
goes beyond offering plain multicast functionality and additionally
focuses on supporting live multicast streams with limited jitter
and delay through the combination of overlay multicast, multicast
congestion control and layered multicast.
[0094] In the overlay multicast system the ExBF protocol or similar
routing algorithm implicitly contains information on which links in
the network are child links. For example, this is a by-product of
the long-lived loop avoidance mechanism in ExBF. In ExBF, a node
will inform its neighbors about the fastest route it offers to a
destination S, except the neighbor that is in the path towards S.
Instead, an infinite distance is advertised to this neighbor. This
information is used by a router to conclude that each neighbor
advertising that node S is unreachable, is actually using it as
their parent node towards S and hence will immediately mark those
links as child links in the shortest path tree rooted at S.
[0095] Neighbors that do advertise reachable routes to destination
S are not flagged as child links and therefore not used when
forwarding multipoint packets from S. In another embodiment, this
implicit information is available through any distance-vector
protocol that uses poisoned reverse or similar techniques. Both
methods are used to quickly deliver a copy of any multipoint packet
to all nodes in the network. A disadvantage of both mechanisms
however is that a copy of every packet is delivered to every node,
regardless of whether or not that node is actually interested in
multipoint packets from that source.
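As a rough illustration of how child links can be inferred from poisoned-reverse advertisements, the following sketch marks every neighbor that advertises an infinite distance to S as a child link. The function name and the dictionary representation of advertisements are assumptions made for this example.

```python
import math

INFINITY = math.inf  # stands in for the advertised "unreachable" distance


def child_links(advertised):
    """Return the neighbors that use this node as their parent towards S.

    advertised: mapping of neighbor name -> distance to S that the neighbor
    advertised to this node (hypothetical representation).
    """
    return {nbr for nbr, dist in advertised.items() if dist == INFINITY}
```

Multipoint packets from S would then be forwarded only over the links in the returned set.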
[0096] In one embodiment, the overlay multicast system is equipped
with a sparse forwarding mechanism. The overlay multicast system
network is used for a single source multicast for high volume
market data, video streams or similar data to be sent to large
numbers of receivers. In these types of uses it is not necessary that
receivers themselves or other nodes be capable of publishing data
packets to the multicast group. Instead, for security reasons it is
better to know that only the real source can publish to the group.
Given these requirements, together with the fact that multi-source
protocols are more complex than their single-source alternatives,
in one embodiment, the overlay multicast system preferably uses a
custom single-source protocol for multicast distribution on the
network. In another embodiment, the overlay multicast system
supports multiple sources. For sake of convenience an embodiment of
a single source multicast network is discussed. One of ordinary
skill in the art would understand that the principles of operation
can also be applied to a multi-source application as well.
[0097] FIG. 6 is a diagram of an example embodiment of a multicast
distribution or "sink" tree. In the example embodiment, the
multicast system starts with a source q that is active, but because
this example uses a sparse-mode protocol and no receiver has yet
subscribed, no distribution tree exists. When the first receiver application
p.sub.0 subscribes, its router initiates the reverse computation of
one branch of the sink tree rooted at source q. It does so by first
marking that p.sub.0 has a local subscriber for q:G by setting the
bit in LS.sub.p0[q:G] (LS.sub.p[ ] is a local array at node p that
keeps track of all subscriptions of the local user applications
connected to router p) to true and then sending a join (e.g., a
<join (q:G)>) packet to the preferred hop towards source q.
This neighbor p.sub.1 receives the <join (q:G)> packet and
marks the link with p.sub.0 as a child link for multicast group
q:G. It then forwards the packet to the next hop p.sub.2 in the
shortest path towards q. Eventually node p.sub.n sends the <join
(q:G)> packet to q. On receipt, q marks the link on which the
packet was received as a child link for its local multicast group G
and starts forwarding all packets addressed to G over the link to
node p.sub.n. The multipoint packets addressed to G are inserted
into the overlay multicast network at node q by the user
application that previously bound the local multicast group q:G in
publish mode. Nodes p.sub.n, p.sub.n-1, . . . , p.sub.1 all forward
the packets to their neighbors from which a <join (q:G)>
packet was received earlier. Node p.sub.0 has no child links for
q:G, but does have LS.sub.p0[q:G] set, so it delivers the packets
to the local application. When another node p.sub.i becomes
interested in q:G while LS.sub.pi[q:G] is not already set, it sends
a <join (q:G)> packet to its preferred hop towards q. Let
p.sub.i+n be in the path from p.sub.i to q that receives the
<join (q:G)> packet and suppose that p.sub.i+n is also in the
path between p.sub.0 and q. In this case, p.sub.i+n is already in
the established part of the sink tree of q, so it does not need to
forward the <join (q:G)> packet further to q. Instead, it
only marks the link on which it received the packet as a child link
for q:G. In general, a node u only forwards a <join (S:G)>
packet towards S if it is not already subscribed to S:G (hence,
LS.sub.u[S:G] is not set and the collection of child links for S:G
is empty).
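The join propagation rule can be sketched as below: a node forwards a join towards the source only if it was not already on the tree for that group. The `Node` class, its field names, and the `joins_sent` log are hypothetical and exist only for illustration.

```python
class Node:
    """Hypothetical router node taking part in sparse-mode tree building."""

    def __init__(self, name, next_hop=None):
        self.name = name
        self.next_hop = next_hop   # preferred hop towards the source (None at the source)
        self.local_sub = set()     # plays the role of LS[]: groups with a local subscriber
        self.children = {}         # group -> set of child-link neighbor names
        self.joins_sent = []       # log of joins this node sent upstream (illustration only)

    def _on_tree(self, group):
        return group in self.local_sub or bool(self.children.get(group))

    def subscribe_local(self, group):
        was_on_tree = self._on_tree(group)
        self.local_sub.add(group)
        if not was_on_tree:
            self._send_join(group)

    def receive_join(self, group, from_neighbor):
        was_on_tree = self._on_tree(group)
        # Mark the link on which the join arrived as a child link.
        self.children.setdefault(group, set()).add(from_neighbor.name)
        # Only forward the join if this node was not yet part of the tree.
        if not was_on_tree:
            self._send_join(group)

    def _send_join(self, group):
        if self.next_hop is not None:
            self.joins_sent.append(group)
            self.next_hop.receive_join(group, self)
```

With a chain p0, p1, p2, q, a later subscriber attached to p1 causes p1 to record a second child link without sending another join upstream.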
[0098] The example sparse-mode distribution tree generated when
processes subscribe to an S:G group originally equals a part or all
of the optimal sink tree rooted at S. However, when the underlying
unicast routing algorithm detects changes in the network
performance and updates some of the routing table entries, the
optimal sink tree changes accordingly and may no longer be matched
by the multicast distribution tree, rendering the tree as well as
the multicast performance sub-optimal. An extreme example of this
is when a link that is part of the multicast distribution fails
entirely.
[0099] FIG. 7 is a diagram of an example overlay multicast network
with a failed link. In the example a node u currently in the
distribution tree for q:G only has a single route to q via w. The
link between u and w fails. The node u concludes that it can no
longer be part of the distribution tree (as u and q are no longer
connected) and invalidates all child link subscription information.
In case LS.sub.u[q:G] is set (i.e., the subscription table
indicates that u is subscribed to group G from node q), u will
rejoin the distribution tree as soon as a new path to q is found.
In another embodiment, the algorithm sends a notification to the
locally subscribed application to indicate that u is no longer
connected to q.
[0100] In the example, it is assumed that all links have weight 1,
while the link between p and w has weight 100. The result is that
the link connecting p and w is not in the optimal sink tree rooted
at q. Since only t and v have a local subscriber (LS.sub.t[q:G] and
LS.sub.v[q:G] are set), the sparse distribution tree in this
example equals the dotted arrows of FIG. 7.
[0101] In the example, nodes w, u, r and s all have only a single
route to q. Node p has a path through both s and w, while s is the
preferred hop towards q. When the link between u and w fails, u
becomes disconnected from q and invalidates the subscription
information for link u-r. Node u informs neighbor r of this fact as
part of an immediate distance-vector exchange. This routing update
implicitly tells node r that its join for q:G through u is no
longer valid and that it should look for an alternative neighbor to
rejoin.
[0102] In the example, an explicit leave packet to u is
unnecessary. Since r is also left with no alternative path to q and
has no locally subscribed applications, it invalidates its q:G
subscription and is no longer part of the distribution tree.
Neighbors t and s are implicitly informed about this through the
ExBF routing update or similar routing update sent by r. Eventually
w, u, r and s all leave the distribution tree, while nodes t and v
schedule a rejoin when a new path to q is found. Again, they inform
their local subscriber applications about this. When p receives the
routing update from s, it loses its preferred hop for q and
switches to neighbor w. The new cost to q is now 100 plus distance
(w, q) and neighbor s is informed. Upon receipt, s discovers the
new path to q and informs its neighbors r and v. Node v then sends
the <join (q:G)> packet to s. Nodes s, p and w then construct
the first part of the new, optimal distribution tree rooted at q.
When the new path reaches r and t, t sends a join packet to r and r
to s, reconnecting all receivers to the data stream.
[0103] In one embodiment, recovering from link failure or similar
communication errors is divided into three steps. In the first step
the underlying ExBF or similar routing protocol starts a wave that
propagates a link failure through the network. Depending on the
actual topology, this takes up to N-1 packets, where N represents
the number of nodes in the network. The worst case time complexity
for recovering from a link failure is 3N when ExBF or similar
algorithm is used as the underlying unicast routing algorithm. A
best case time complexity would be 2N.
[0104] The example discusses a scenario where the system recovers
from a failure of a link that is part of the multicast distribution
tree. More often however, links will not fail completely, but
rather fluctuate in quality, causing the underlying routing
algorithm to reassign preferred neighbors. In this example case it
is not a requirement that the distribution tree is changed to
reflect these changes, as it is not partitioned. However, since the
tree becomes sub-optimal, it is changed.
[0105] In one embodiment, the decision whether or not to recompute
a distribution tree after a routing table update, is made according
to the quality of the used path. When a node p in the tree detects
that the neighbor in the optimal path towards the multicast source
has a cost that is only marginally lower than the neighbor that is
currently p's parent in the actual distribution tree, the overhead
of recomputation and the risk of packet loss outweighs the
increased performance of the new tree. Also, since the quality of
each link is continuously re-evaluated, the updated sink tree may
only be temporary. Two straightforward solutions to this can be
used. In the first, the algorithm sets a threshold on recomputation so
that subscriptions are only moved to the neighbor with the
currently optimal path towards the source if the decrease in cost
is at least a ratio x, where x>1. A larger value of x then
postpones tree adjustments until a substantial improvement can be
gained, while a smaller x makes the tree actively follow the
changing characteristics of the underlying network.
[0106] Another solution is to postpone recomputation of the tree
for a period inversely proportional to the cost decrease ratio x.
The latter has the advantage that the optimal distribution tree is
always guaranteed to be reached in finite time since the last
routing table change.
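The two recomputation policies can be sketched as follows. The function names, the interpretation of the ratio as current cost divided by best available cost, and the `base_delay_s` scaling constant are assumptions made for this example.

```python
def should_switch(current_cost, best_cost, x):
    """Threshold policy: switch parents only if the cost improves by at least ratio x."""
    if x <= 1:
        raise ValueError("x must be greater than 1")
    return current_cost >= x * best_cost


def switch_delay(current_cost, best_cost, base_delay_s):
    """Delay policy: wait for a period inversely proportional to the improvement ratio.

    base_delay_s is a hypothetical constant; best_cost is assumed to be positive.
    """
    ratio = current_cost / best_cost
    return base_delay_s / ratio
```

Under the delay policy a large improvement is adopted quickly, while a marginal one is still adopted eventually, which matches the finite-time property noted above.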
[0107] In one embodiment, to run the tree building protocol, two
new messages are introduced: the join message and the leave
message. Both messages use a GroupData control message of the
following format:
##STR00003##
This message is used to send a list of subscriptions to a neighbor
node or to cancel an aggregated list of subscriptions with a
neighbor. The action field (9.sup.th byte) is used to distinguish
between join and leave. The 10.sup.th and 11.sup.th byte indicate
the number of (S, G) groups in the message. The action field is
either "join" (0) or "leave" (1). In an alternative embodiment,
this is extended to "stale", "dead", etc, to indicate the state of
the publishing application.
[0108] Many multicast applications require some level of reliable
communication. Examples of this are uploading files to several
receivers simultaneously or replicating web server caches. Without
guaranteed delivery, a multicast transport service has limited
applicability. In one embodiment, the overlay multicast network
utilizes standard delivery control based on moderating the sending
of data based on the lowest bandwidth available on the route to a
destination. However, this delivery system is unsuitable for
certain applications.
[0109] One example is financial data distribution, which needs a
transport service that can multicast live data without packet loss
or corruption and without substantial end-to-end delay when parts
of the network are temporarily or permanently congested. While live
market data cannot tolerate random packet loss, it can be thinned.
Under certain circumstances it is acceptable to omit certain stock
quote updates. An example is a desktop application that displays
real-time updates for hundreds or thousands of financial
instruments. If all quote updates were delivered to this
application, this would require a substantial amount of bandwidth
and would cause the value of the more volatile instruments to
change faster than a human can read. The data may be thinned
through a process that involves inspection of the individual stock
quotes and encoding them in individual data packets, tagging each
with a priority value. For this application, as long as all packets
labeled with the highest priority number are received, the partial
stream can be considered intact. Additionally, when all updates of
the second highest priority are also received, the quality of the
partial stream is increased, usually in the form of lower
latency.
[0110] In one embodiment of the overlay multicast system, priority
numbers are associated with data packets. The priority numbers
represent a logical layer inside a data stream. Data provided by a
multicast application may be in the form of a stream of data. This
data is subdivided into categories or priorities based on the
nature of the data. When packets in a data stream are labeled with
priorities in the range 0 to 3, the stream is said to have 4
layers. Also, the convention is to treat 0 as the highest priority
and 3 as the lowest. Any other system of identifying priority
levels may be utilized including alpha numeric indicators or
similar identifiers. If a stream only contains a single layer, all
packets are labeled with priority 0.
[0111] To software router daemons, a priority value of a packet is
relative to the stream and becomes relevant when a decision to
discard data at a congested router is made. The priority value has
no meaning other than as a criterion for discarding packets on
congested links. When an outgoing link of a router has insufficient
bandwidth to transmit all pending packets, it forwards only those
packets with a designated priority or the highest priority. This
technique is also applied to any data type, including audio/video
data and financial data. Using this system, the packet priority
numbers cannot be misused by sources to give their data packets a
greater chance of prioritized transmission by giving them the
highest priority to gain advantage over other sources. Packet
priorities are only compared between packets that are part of the
same data stream.
[0112] In one embodiment, knowing that routers will use the packet
priority numbers when making forwarding selections on congested
parts of the network, a source carefully divides its data packets
over different layers or priorities, in a way that a subset of the
layers still contain a useful, uncorrupted representation of the
data. This system is especially useful when multicasting a live
data stream with a high data rate to a large number of receivers,
scattered over a heterogeneous wide area network. Receivers that
are on congested parts of the network will then receive the highest
priority parts of the data only. This eliminates the need to lower
the publishing rate to match the slowest receiver, while still
being able to offer live, uncorrupted, thinned data streams to
clients suffering from insufficient bandwidth. Example applications
of the system include audio and video codecs that divide live
multimedia content over layers to enhance the user experience over
wide area networks.
[0113] In one embodiment, incoming messages are sorted by unicast
or multicast sender address. As described above, a sender address
is the combination of source router address and application
session. An example of a source address is
n1.mxs.office.erik:video.an2 which could be used by a user
application broadcasting a video channel. In this case, only the
source address is relevant, not the (uni- or multicast) destination
address. Each incoming message is added to the queue that holds
messages sent by that particular sender. If this is the first
message from a sender, a new queue is automatically created to
store it.
[0114] FIG. 8A is a flowchart of one embodiment of a process for
managing congestion in the overlay multicast system. The congestion
can be managed at each individual router. In one embodiment, the
congestion is managed at the interface module level in each router.
Each interface module has an inbound pipeline and outbound pipeline
discussed above for processing inbound and outbound data. Each
pipeline buffers data that is awaiting further processing. However,
if either pipeline is unable to keep up with the pace of incoming
data that needs to be processed, some data must be dropped.
[0115] In one embodiment, data is received as a set of data streams
at each router (block 851). The data is then buffered in the
inbound pipeline buffer (block 853). The same process applies to
outbound data that is received from the kernel by the outbound
pipeline. This data is stored in the outbound buffer. After the
data has been stored a check is made of the inbound or outbound
buffer to determine if it is full (block 855). In one embodiment,
data is stored in the data structures in the buffers that organize
the data packets into a set of queues. Each source address, data
stream or layer in a data stream has a separate queue. A queue is
sorted with highest priority and oldest data packets at the front
of the queue. If the buffer is full then a decision is made to drop
a designated amount of data in the form of packets from the buffer
to make room for incoming data packets (block 857).
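A per-stream queue kept in the order described (highest priority first, then oldest first) might be sketched as below. The class name, the global arrival counter, and the assumption that the "least urgent" packet sits at the back of the queue are illustrative choices, not details from the embodiment.

```python
import bisect
import itertools


class StreamQueue:
    """Hypothetical per-stream queue; priority 0 is the highest priority."""

    _arrival = itertools.count()  # shared arrival counter: smaller means older

    def __init__(self):
        # Sorted list of (priority, arrival number, payload bytes).
        self._items = []

    def push(self, priority, payload):
        # Arrival numbers are unique, so payloads are never compared.
        bisect.insort(self._items, (priority, next(self._arrival), payload))

    def pop_most_urgent(self):
        """Remove and return the highest-priority, oldest packet (front)."""
        return self._items.pop(0)[2]

    def pop_least_urgent(self):
        """Remove and return the lowest-priority, newest packet (back)."""
        return self._items.pop()[2]

    def size_bytes(self):
        return sum(len(payload) for _, _, payload in self._items)
```

Transmission would take packets from the front, while dropping under congestion would take them from the back.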
[0116] In one embodiment, a queue is chosen randomly or in a round
robin to have data dropped. In another embodiment, the queue with
the most data is chosen to have data dropped. In a further
embodiment, a weighting factor is calculated to determine which
queue is selected to have data dropped. The weighting factor is
based on the amount of data in a queue, size of packets in a queue
and similar factors. The weighting factor counteracts unfair
distribution that is caused by selecting a queue by a round robin,
random or similar method of selection. Data streams with large
packets are unfairly affected by other methods because a
disproportionate amount of data is dropped in comparison with other
queues with smaller packets. Queue selection is also influenced by
the size and the amount of data that is intended to be dropped. A
large queue that can drop close to the amount of data desired is
weighted for selection.
[0117] In one embodiment, data is dropped if a total amount of data
stored in all queues exceeds a threshold value. This threshold
value is set by an administrator or is a set value. These systems
enforce the fair allotment of bandwidth between data streams. Also,
this system implements the prioritization of logical layers by
dropping lower priority level layers when congestion occurs. If the
buffer is not full then the pipelines may continue to store data in
the buffers (block 851).
[0118] In one embodiment, while the queues grow in size as packets
are received, a background thread constantly dequeues packets from
the queues and transmits them over the network.
[0119] In one embodiment, the system that manages in- and output
for the queues is divided in two parts. The first part is run by
the background thread that constantly dequeues packets from the
queues and transmits them over the network, while the second part
is in charge of queuing new packets that are to be transmitted. The
latter also implements the logic that defines when and which packet
should be discarded (due to a buffer overflow).
[0120] The dequeuing part of the system, when selecting a packet
for transmission, only looks at the first packet (most urgent) of
each queue. To make sure each stream gets an equal share of
bandwidth, it takes the individual packet sizes into account. When
all packets (note that only the first packet of each queue is
observed) have equal size, the dequeuing thread simply selects a
random queue with equal probability and dequeues one of its
packets. Since each queue has the same probability of being
selected, each queue will deliver an equal amount of packets per
time unit. Hence, each source will transmit an equal amount of
bytes per second.
[0121] Since individual packets can have any size between 1 and
some determined maximum number of bytes, their size must be
considered when the transmit thread selects queues. Queues with
many large packets should generally have a smaller probability of
getting selected than queues with lots of small packets in order to
keep the bandwidth division fair.
[0122] In one embodiment, only the first packet of each queue is
considered to represent its queue and give it a probability of
getting selected that is inversely proportional to its size in
bytes. In an example where there are three queues, P, Q and R where
the first packet in P is 100 bytes, the first packet from sender Q
is 300 bytes, while the packet from R is 500 bytes, to calculate
the selection probabilities, first their respective selection
weights are defined. For example, the weight of the packet in P is
computed by dividing the total size of P, Q and R by P's size:
(100+300+500)/100=9. Q's weight is 3 and R's weight is 9/5. The
weights are converted into selection probabilities by dividing them
by the sum of all weights. This gives P a probability of
9/(9+3+1.8)=0.65, Q a probability of 0.22 and R of 0.13. When a
queue becomes empty after dequeuing, it is removed.
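The arithmetic of this example can be reproduced with a short sketch. The function names and the dictionary of first-packet sizes are hypothetical; the weighted draw uses the standard library's `random.choices`.

```python
import random


def selection_probabilities(first_packet_sizes):
    """Map queue name -> probability, inversely proportional to first-packet size."""
    total = sum(first_packet_sizes.values())
    # Weight of a queue: total size of all first packets divided by its own size.
    weights = {q: total / size for q, size in first_packet_sizes.items()}
    weight_sum = sum(weights.values())
    return {q: w / weight_sum for q, w in weights.items()}


def pick_queue(first_packet_sizes, rng=random):
    """Draw one queue name according to the selection probabilities."""
    probs = selection_probabilities(first_packet_sizes)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
```

Calling `selection_probabilities({"P": 100, "Q": 300, "R": 500})` yields approximately 0.65, 0.22 and 0.13, in agreement with the figures given above.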
[0123] In one embodiment, to keep the delay introduced by buffering
packets under control, the second part of the system enforces a
maximum total queue size. The sum of all packets in all queues may
never exceed this threshold. When a new packet comes in, it is
always accepted at first. If necessary a new queue is created for
it, or it is added to its designated, existing queue. After adding
a new packet, the total size of all queues is checked. If it is
larger than the configured maximum, the algorithm runs a removal
round in which it first selects a queue and then tells that queue
to remove one packet. If the queues are still too big after removal
of one packet, the process of selecting a queue and removing a
packet is repeated iteratively until the total queue size is
smaller than or equal to the configured maximum.
[0124] In one embodiment, shrinking the queues is always done after
a packet was added, never preemptively. The reason for this is to
let the new packet immediately participate in the removal selection
process, rather than discarding it or making room for it at the
expense of the other queues. Contrary to de-queuing packets, where
the algorithm tries to select the "most urgent" packet, we now need
to select the "least urgent" one. This packet is determined by
looking at the size of the individual queues (the largest queue
should generally be shrunk to keep resource division among the
streams fair) and which queue can match the required size most
accurately. This comes from the fact that individual packets can
differ greatly in size, so removing the "least urgent" packet from
queue P could result in freeing 8 kilobytes, while removing the
"least urgent" packet from queue Q yields 40 bytes of space for
example. Now if the queues in total exceed the maximum size by only
a couple of bytes, it seems logical to remove the small packet from
Q. This policy comes with a consequence, namely that it implicitly
promotes the use of larger packets. After all, streams with lots of
very small packets are more likely to accurately match the amount
of buffer space that must be freed than stream queues with few very
large packets. This property may encourage developers to use larger
packets, increasing the efficiency and throughput of the network.
In short, when selecting a queue for shrinking, the system favors
queues that will expunge the least amount of bytes in order to
match the total buffer capacity and that are large compared to the
others.
[0125] In one embodiment, before the selection can be made, the
absolute weight factor for each queue is computed. The weight
factor combines the queue's size and ability to accurately match
the overcapacity of the buffer by removing packets. The latter is
exposed in v. It represents the sum of the sizes of the packets
that a queue must remove in order to eliminate the buffer's
overcapacity. For example, if a queue P has 10 packets of 100 bytes
each while the buffer currently has an overcapacity of 170 bytes
(suppose the maximum is 10000 bytes, while all queues combined add
up to 10170 bytes), P would need to remove at least 2 packets
(2*100 bytes). In this case, v.sub.p is 200. The formula to compute
the absolute weight factor of queue m is defined:
$$w_m = \frac{\left(\sum_{i=0}^{n} v_i\right) s_m}{v_m}$$
In this formula, n represents the total number of queues, v
represents the number of bytes that would be removed if the queue
was selected and s represents the total size of a particular queue.
The formula captures the direct relation to queue size and the
inverse relation to the amount of bytes that would be removed by
the queue if it was selected.
[0126] When the absolute weight factors of all queues have been
derived, they are converted to weighted selection probabilities.
This is done by dividing the individual weight values by the sum of
all weights. The selection probability of queue m is expressed in
P.sub.m.
$$P_m = \frac{w_m}{\sum_{i=0}^{n} w_i}$$
After the selection probabilities have been computed, a
roulette-wheel algorithm is run to select a queue and let that
queue remove its v bytes. If v is smaller than the current
overcapacity of the buffer in total (note that this implies that
the selected queue became empty and was thus removed
automatically), another selection round is done until the buffers
have shrunk sufficiently. Note that the queue selection process is
implemented as an atomic operation. During selection rounds no new
packets may be added to or removed from the queues.
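One way to picture the shrink selection is the sketch below, which computes v for each queue, derives the absolute weight factors and selection probabilities, and then makes a roulette-wheel draw. The representation of a queue as a list of packet sizes (least urgent last), the function names, and the second queue used in the usage note are assumptions for illustration; the sketch assumes the buffer is over capacity and that no queue is empty.

```python
import random


def bytes_to_remove(packet_sizes, overcapacity):
    """v: bytes a queue would drop, taking packets from the back until the
    overcapacity is covered or the queue is empty."""
    removed = 0
    for size in reversed(packet_sizes):
        if removed >= overcapacity:
            break
        removed += size
    return removed


def shrink_probabilities(queues, max_total):
    """Return (v, P): bytes each queue would remove and its selection probability."""
    overcapacity = sum(sum(q) for q in queues.values()) - max_total
    v = {m: bytes_to_remove(q, overcapacity) for m, q in queues.items()}
    v_sum = sum(v.values())
    # Absolute weight: grows with queue size s, shrinks with bytes removed v.
    w = {m: v_sum * sum(queues[m]) / v[m] for m in queues}
    w_sum = sum(w.values())
    return v, {m: w[m] / w_sum for m in queues}


def pick_queue_to_shrink(queues, max_total, rng=random):
    """Roulette-wheel draw over the shrink probabilities."""
    _, probs = shrink_probabilities(queues, max_total)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
```

For instance, with queue P holding ten 100-byte packets, a hypothetical queue Q holding a single 9170-byte packet, and a maximum of 10000 bytes, v for P is 200 bytes and P is selected with probability 5/6, since Q could only match the 170-byte overcapacity by discarding all 9170 of its bytes.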
[0127] In one embodiment, the overlay multicast system is a layered
multicast in combination with a scalable, selective packet
retransmission mechanism to offer a service that can meet the
demands of real-time financial data and similar applications.
Selective packet retransmissions are crucial so that the overlay
multicast system is guaranteed that subscribers will receive one or
more layers that are completely intact. When an occasional packet
in a high priority layer is missed, it is repaired. However, the
receiver may decide not to attempt to recover missing packets from
low priority layers that are missing due to network congestion.
Trying to repair these layers would require additional bandwidth,
resulting in additional packet loss. Yet, as more end-to-end
bandwidth becomes available, the receiver detects this and repairs
additional layers, so more intact layers are delivered to the
application.
[0128] FIG. 8B is a flowchart of one embodiment of a process for
handling layer repair. In one embodiment, the overlay multicast
system supports a receiver-side multicast socket that receives the
raw packets from the multicast stream and is in charge of repairing
damaged stream layers before forwarding them to the user
application (block 801). This method includes a retransmission
request packet. When the receiver detects a missing packet in a
layer that should be repaired (block 803), it sends a
retransmission request for this packet towards the source (block
805).
[0129] In one embodiment, detecting packet loss is done in the
conventional way by sequencing each packet with an incrementing
number. In another embodiment, each layer has its own sequence
number, so all packets in one layer are sequenced independently of
the other layers. This way, packet loss is detected in each layer,
regardless of the conditions and packet loss in other layers.
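Per-layer loss detection can be sketched as a small gap detector that keeps one expected sequence number per layer. The class name is hypothetical, and the sketch assumes sequence numbers increase by one per packet, arrive in order within a layer, and do not wrap around.

```python
class LayerLossDetector:
    """Hypothetical receiver-side detector with an independent sequence per layer."""

    def __init__(self):
        self._next_expected = {}  # layer -> next sequence number expected

    def on_packet(self, layer, seq):
        """Record a received packet and return the sequence numbers found missing."""
        # The first packet seen in a layer establishes the starting point.
        expected = self._next_expected.get(layer, seq)
        missing = list(range(expected, seq))
        self._next_expected[layer] = max(expected, seq + 1)
        return missing
```

A gap reported for a high priority layer would trigger retransmission requests, while gaps in low priority layers could simply be ignored during congestion.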
[0130] To avoid a cascade of retransmission requests when a packet
is dropped close to the source, a tree-based negative
acknowledgment is used. This technique provides localized repair
without communication with the source, because an intermediate
router that has a copy of the requested data packet responds by
sending it again and suppressing the retransmission request. When
the next router closer to the source in the distribution tree receives the
retransmission request it checks its buffer to determine if the
requested packet is still stored therein (block 807). If the
requested packet is there, it is retransmitted (block 809). If the
packet is not in the buffer then the retransmission request is
forwarded to the next router in the distribution tree toward the
source (block 811).
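The forwarding of a retransmission request along the tree can be sketched as a walk towards the source that stops at the first router still holding the packet. The function name and the representation of the path as a list of (router name, packet store) pairs are assumptions made for this example.

```python
def handle_retransmission_request(path_to_source, layer, seq):
    """Walk from the receiver's router towards the source.

    path_to_source: list of (router_name, store) pairs ordered towards the
    source, where store maps (layer, seq) -> payload (hypothetical layout).
    Returns (router_name, payload) for the router that answers, or
    (None, None) if no router on the path still holds the packet.
    """
    for name, store in path_to_source:
        payload = store.get((layer, seq))
        if payload is not None:
            # This router retransmits and suppresses the request.
            return name, payload
        # Otherwise the request is forwarded to the next router upstream.
    return None, None
```

The earlier a router on the path answers, the less repair traffic reaches the source.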
[0131] In one embodiment, when layers are dropped due to
congestion, the programming API that is exposed to user
applications notifies the user by means of an exception or special
return value when it is reading packets. These notifications may
not be fatal and only serve to inform the user that the quality of
the stream has changed.
[0132] In one embodiment, localized repair works by having each
router daemon store a window of transmitted packets. Packets are
either stored for a fixed period of time, after which they are
discarded, or a fixed amount of buffer space is reserved to store
the most recently forwarded packets. Since the overlay multicast
network by default uses reliable connections between router daemons,
packets never get lost while they are in transit between nodes. The
only place where packets are purposely discarded is in the
interface buffer interceptors. Thus, there is no need to buffer
packets that were actually transmitted over the reliable
connection, as those packets are guaranteed to have reached the
neighbor node. In one embodiment, a router daemon copies packets to
its retransmission packet store before those packets reach a buffer
interceptor that discards packets.
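The fixed-buffer-space variant of the retransmission packet store might look like the following sketch, which discards the oldest stored packets once a byte budget is exceeded. The class name, the `(layer, seq)` key, and the `max_bytes` parameter are hypothetical; a time-based window would be an equally valid reading of the text.

```python
from collections import OrderedDict


class PacketStore:
    """Hypothetical store keeping the most recently forwarded packets within a byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._bytes = 0
        self._packets = OrderedDict()  # (layer, seq) -> payload, oldest first

    def add(self, layer, seq, payload):
        key = (layer, seq)
        if key in self._packets:
            # Replace an existing copy so its size is not counted twice.
            self._bytes -= len(self._packets.pop(key))
        self._packets[key] = payload
        self._bytes += len(payload)
        # Discard the oldest packets until the store fits the budget again.
        while self._bytes > self.max_bytes and self._packets:
            _, old = self._packets.popitem(last=False)
            self._bytes -= len(old)

    def get(self, layer, seq):
        """Return the stored payload, or None if it has already been discarded."""
        return self._packets.get((layer, seq))
```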
[0133] The router's buffer interceptors copy packets to be stored
for retransmission. In one embodiment, localized packet repair in
the overlay multicast system is implemented by adding an
interceptor to both interceptor pipelines of each interface and
letting those interceptors simply store a copy of each packet.
Additionally, when a retransmission request passes through the
router on its way to the multicast source, the interceptor inspects
the request packet and if it has the requested packet in its
temporal packet store, it does not forward the retransmission
request to the next interceptor. The requested data packet is
injected into the network again.
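
As a minimal illustrative sketch of the interceptor behavior described above, and not the implementation of any embodiment, the following Python fragment stores a copy of each passing packet and answers a retransmission request locally when the copy is still available. The class name, the method names, the bounded `capacity` and the `(group, layer, seq)` key are assumptions made for illustration only.

```python
from collections import OrderedDict


class PacketRetransmissionInterceptor:
    """Hypothetical PRI: keeps recent packets and answers requests locally."""

    def __init__(self, capacity=1024):
        self.capacity = capacity          # assumed fixed amount of buffer space
        self.store = OrderedDict()        # (group, layer, seq) -> payload

    def on_data(self, group, layer, seq, payload):
        """Store a copy of the packet, then hand it to the next interceptor."""
        key = (group, layer, seq)
        self.store[key] = payload
        self.store.move_to_end(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)    # discard the oldest copy
        return payload

    def on_retransmission_request(self, group, layer, seq):
        """Answer locally on a hit; otherwise let the request travel upstream."""
        key = (group, layer, seq)
        if key in self.store:
            return ("retransmit", self.store[key])   # request is not forwarded
        return ("forward", None)
```

A caller would inject the returned payload back into the outbound stream on `"retransmit"` and pass the request to the next interceptor on `"forward"`.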
[0134] FIG. 9A is a diagram of two router daemons connected by a
link. There are two interceptor pipelines 903, 905 inside each
interface 907. Each interface contains both an outbound 903 and an
inbound 905 interceptor pipeline. The latter processes packets
received from the network, while the outbound pipeline processes
packets just before they leave the router daemon. In one
embodiment, the pipelines contain two different interceptors; the
CCI (Congestion Control Interceptor) 909 and the PRI (Packet
Retransmission Interceptor) 911. In another embodiment, a router
daemon contains additional interceptor instances per pipeline.
[0135] The CCI 909 implements bandwidth allocation rules and is
responsible for discarding packets. During normal operation the
source 913 publishes its multicast packet stream to its router
daemon 901 (path 1) that sends the packets to all interfaces that
lead to interested receivers according to the multicast routing
table. In the present example, the packets are passed to the first
outbound interceptor (PRI) of the interface (path 2). The PRI is
responsible for storing a copy of each recoverable packet that is
transmitted. It stores the copies in the private temporary packet
store 915 of its interface (path 3). Then the interceptor passes
the packets to the next interceptor (CCI) (path 4) where it is
buffered until the connection link with the adjacent router daemon
has time to transfer it.
[0136] In one embodiment, if the packets are not dropped in the
outbound CCI, they are received by the neighbor router daemon
(paths 5 and 6) and sent through that router's inbound interceptor
pipeline 921 (path 7). They first enter the PRI 923 (path 7), which
stores copies in its inbound buffer 924 (path 8) and then passes
them on to the next interceptor (CCI) 925 (path 9) where they are
temporarily parked until the router's kernel 927 thread picks them
up (path 10) and delivers the packets to the subscribed multicast
tsocket which forwards them to their next destination 929 (path
11).
[0137] FIG. 9B is a diagram of two router daemons connected by a
link where a retransmission is requested. If a packet is lost in
the outbound packet interceptor pipeline of the source router
because the connection between the routers was not fast enough to
transmit all the packets, the receiving tsocket notices the loss
and sends a retransmission request (path 1). The destination address
of this unicast retransmission request packet is that of the
multicast group. The network packet switching kernels recognize the
retransmission packets and use the multicast group to route the
packet towards the source of the group. The retransmission request
packet of the example passes the router kernel 927 and reaches the
outbound packet interceptor pipeline 931 and goes into the PRI 933
(path 2). This interceptor scans specifically for retransmission
requests and tries to answer them locally. To this end it inspects
the packet, looks up the multicast group, the stream layer
identifier and the layer's sequence number and then checks the
inbound packet buffer 935 to see if it contains the packet (path
3). If this is not the case, the packet is passed on to the rest of
the interceptors (path 4) and eventually transmitted to the
adjacent router (paths 5 and 6) where it is fed to the inbound
interceptor pipeline 941 (path 7).
[0138] In one embodiment, the PRI 943 checks the interface's
outbound packet buffer 945 (path 8) and finds the multicast packet
that was dropped by the congested outbound CCI 909 earlier. It is
then re-inserted into the outbound packet stream via the outbound
PRI 911 (path 9). The retransmission request packet is then dropped
and not forwarded any further. Assuming the packet is not dropped
again, it travels the normal way towards the receiver (paths 10,
11, 12, 13). In one embodiment, the receiver's PRI sees the
multicast packet for the first time and stores a copy in its
outbound packet buffer (path 14). The packet then continues on to
destination 929 (paths 15, 16, 17). This mechanism of unicast
retransmission packets and interceptors that offer localized repair
is flexible in the way that not all router daemons need to support
local packet repair. A router that does not support it simply
forwards the requests and lets the upstream routers handle them. The
network as a whole becomes more efficient when more routers support
it.
[0139] In one embodiment, because storing a copy of every multicast
data packet requires storage capacity, the overlay multicast system
links the packet buffers of the packet retransmission interceptors
to a central data structure that will only store unique packets.
This is done because in multicast transmissions a router that is a
branch point in the multicast distribution tree otherwise ends up
storing at least three copies of every data packet. This is because
the packet is received on one interface and sent out over at least
two other interfaces. If every interface would individually store
its packets, the inbound packet buffer of the receiving interface
as well as the outbound packet buffers of the forwarding interfaces
contain the same packets.
[0140] In one embodiment, the central packet store eliminates
duplicate packets by storing each packet only once and by keeping
reference tables that point to packets for every interface. When an
interface buffer stores a new packet in the central store, while
that same packet was already stored by another interface, the
central packet store merely adds a pointer to that packet instance
to the interface's packet reference table.
[0141] In one embodiment, each entry in the reference tables has a
time-out attached to it. This is to ensure packets are only
temporarily stored. In one embodiment, the same packet could pass
different interfaces at different times. The time-outs are not
attached to the packets in the central store but rather to the
packet references in the interface reference tables. In one
embodiment, packets are only expunged from the central store when
all references to the packet have timed out. Another optimization
that can substantially reduce the amount of required storage space
is to only store packets that were explicitly discarded by the
Congestion Control Interceptors (CCIs) because of bandwidth
constraints. Storing other packets has little value, as those are
guaranteed to have arrived at the next router. No retransmissions
may be requested for them unless some non deterministic packet loss
occurs as a result of a crashing node.
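
The central packet store with per-interface reference tables and per-reference time-outs may be sketched as follows. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, time is passed in explicitly as `now`, and a packet is identified by an opaque `packet_id`.

```python
class CentralPacketStore:
    """Hypothetical store: each packet kept once, referenced per interface."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.packets = {}       # packet_id -> payload (single shared copy)
        self.references = {}    # interface -> {packet_id: expiry time}

    def store(self, interface, packet_id, payload, now):
        """Add the packet if new; otherwise only add a reference to it."""
        self.packets.setdefault(packet_id, payload)
        self.references.setdefault(interface, {})[packet_id] = now + self.timeout

    def lookup(self, interface, packet_id, now):
        """Return the payload if this interface still holds a live reference."""
        expiry = self.references.get(interface, {}).get(packet_id)
        if expiry is None or expiry <= now:
            return None
        return self.packets[packet_id]

    def expire(self, now):
        """Drop timed-out references, then packets that nothing references."""
        for table in self.references.values():
            for packet_id in [p for p, exp in table.items() if exp <= now]:
                del table[packet_id]
        live = set()
        for table in self.references.values():
            live.update(table)
        for packet_id in [p for p in self.packets if p not in live]:
            del self.packets[packet_id]
```

Because the time-out is attached to the reference and not to the packet, a packet that passed two interfaces at different times stays available to the later interface after the earlier reference has expired.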
[0142] In one embodiment, if a packet was lost close to the
multicast source and, because of this, all receivers
simultaneously send a retransmission request, the upstream routers
still apply conventional negative acknowledgement ("nack")
suppression techniques to combine all concurrent retransmission
requests from their child links into a single request that is
forwarded upstream towards the source. Applying local packet stores
at every router reduces the latency of packet recovery.
[0143] In one embodiment, time to live values are adjustable per
router, set system wide or similarly configured. A short time-out
minimizes storage requirements, but will also lead to more overhead
towards the source as more retransmission requests need to be
propagated further upstream. Large time-outs will be more tolerant
to late retransmissions, but require more storage. When a time-out
of 60 seconds is used by the routers that means a lost packet can
still be recovered after a minute. However, when the data stream
contains real-time data such as stock quotes, such a delay in
delivery cannot be tolerated. Problems get worse if the data
packets also have to be delivered to the user applications in their
original order. In that case the user does not receive any data for
up to a minute while the missed packet is recovered and all newer
packets are waiting for it. In one embodiment, a ten second timeout
setting is utilized.
[0144] If all routers in the network use the same time-out value,
there is no use in propagating these requests as the upstream
routers are likely to have purged their copies as well. To avoid
nack-implosions from slow receivers right after a packet is purged,
each router interface remembers which packets it has purged. When a
retransmission request is received for such a packet, the interface
may respond by sending a special packet to the receiver that
indicates that the requested packet had timed-out and cannot be
recovered anymore. The retransmission request in this scenario is
not propagated further upstream. When a receiver gets this time-out
notification, it knows it will not have to wait any longer for the
missed packet. It will notify the user application of the fact that
data was lost and continue to deliver the next packets. Whether or
not an application can tolerate packet loss will depend on the type
of content. Audio and video will usually not be severely impacted
by the loss of an occasional packet, while real-time stock quotes
become dangerous to use when it is not known what quote was
lost.
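
The three possible outcomes of a retransmission request at a router interface (local repair, time-out notification, or upstream propagation) may be sketched as below. The function names and the returned action labels are hypothetical, and the packet store is modeled as a plain dictionary for brevity.

```python
def purge(store, purged, key):
    """Discard a timed-out copy but remember that it was purged."""
    if key in store:
        del store[key]
        purged.add(key)


def answer_retransmission_request(store, purged, key):
    """Decide what an interface does with a retransmission request."""
    if key in store:
        return ("retransmit", store[key])   # repair locally
    if key in purged:
        # Tell the receiver to stop waiting; do not propagate upstream.
        return ("timed_out", None)
    return ("forward", None)                # unknown here, ask upstream
```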
[0145] In one embodiment, where layered multicast is combined with
localized packet retransmission, a service is realized that applies
packet recovery only to those layers that can be delivered with the
current bandwidth capacity. When the network has insufficient
resources to deliver all layers of a stream, it is beneficial if
the receivers know this and will not attempt to repair all missed
packets from all layers. Instead, the receivers may use a mechanism
that provides them feedback about the current network capacity and
use that to decide for which layers they will recover missed packets.
The algorithm monitors the status of the total stream and from this
information derives how many layers of the stream can be reliably
delivered to the user without stressing the network. It then marks
these layers as being intact, reliable or similarly labeled and
sends retransmission requests when an occasional packet is lost
from these layers.
[0146] In one embodiment, the algorithm only delivers packets from
the reliable layers to the application, but not before they have
been put back into their original global order. In order for the
algorithm to decide which layers are safely marked as intact, it
requires some status information about the reception of the stream
as a whole as well as the network conditions. To provide this,
every packet contains the number of the most recent packet from all
other layers or similar sequence data. Aside from carrying its own
sequence number, each packet contains the current sequence number
of all the other layers as well. By inspecting these sequence
numbers each time a packet arrives, the receiver determines whether
it has missed any packets from the layers it marked as reliable. If
that is the case, these packets may be recovered. Each time a
missed packet is detected, a countdown timer is started for it. If
the packet is recovered before the time-out expired, the timer is
canceled and removed.
[0147] If however the timer manages to expire, the layer is
considered impossible to repair and may be, together with all
higher reliable layers (lower in priority), removed from the list
of reliable layers. How long the time-out interval should be can be
application or content dependent. The interval determines how long
delay is tolerated by the user. Setting the interval to a low value
means a temporary congestion may corrupt a layer long enough to
cause the receivers to drop it. Since packets occasionally get lost
when unicast routing tables converge, changing the shape of
multicast distribution trees, or when a router crashes that had
pending packets in its interface buffers, appropriate time-out
values are adjustable by an administrator according to network
conditions and may be found experimentally.
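
A receiver-side sketch of the loss detection and repair time-out described above is given below. It is illustrative only and makes several assumptions: each packet carries a tuple `vector` whose entry `k` is the latest sequence number of layer `k` (the entry for the packet's own layer being its own sequence number), the receiver is told the last sequence number already accounted for per layer, and timers are started only for layers in the reliable-list. All names are hypothetical.

```python
class LayeredReceiver:
    """Hypothetical receiver that detects loss from per-packet sequence vectors."""

    def __init__(self, last_seen, reliable_layers, repair_timeout):
        self.known = list(last_seen)            # highest sequence known per layer
        self.received = [set() for _ in last_seen]
        self.reliable = set(reliable_layers)
        self.repair_timeout = repair_timeout
        self.timers = {}                        # (layer, seq) -> deadline

    def on_packet(self, layer, vector, now):
        """Record the packet; return (layer, seq) pairs to request again."""
        requests = []
        self.received[layer].add(vector[layer])
        self.timers.pop((layer, vector[layer]), None)   # recovered in time
        for k, seq in enumerate(vector):
            for missing in range(self.known[k] + 1, seq + 1):
                if missing not in self.received[k] and k in self.reliable:
                    self.timers[(k, missing)] = now + self.repair_timeout
                    requests.append((k, missing))
            self.known[k] = max(self.known[k], seq)
        return requests

    def on_tick(self, now):
        """Drop a layer, and all higher layers, once a repair timer expires."""
        expired = [k for (k, _), deadline in self.timers.items() if deadline <= now]
        if expired:
            lowest = min(expired)
            self.reliable = {k for k in self.reliable if k < lowest}
            self.timers = {key: d for key, d in self.timers.items()
                           if key[0] < lowest}
```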
[0148] In an overlay multicast network that uses packet
prioritization, routers explicitly introduce out of order delivery
when a burst of packets queues in the outbound interceptor pipeline
of a router interface with a slow connection. The time-out may be
large enough to allow for this. Setting the time-out to a long
period makes the stream much more resilient to congestion, but also
increases the time for the receiver to discover that a layer must
be dropped due to bandwidth constraints. Until congestion is
finally detected, delivery of the previously received packets that
causally depend on missed packets is postponed. Starting timers
when packet loss is detected allows for congestion detection.
However, it cannot be used to guarantee low end-to-end latency.
[0149] When no packets are lost and all are received according to
their global order, the receiver cannot measure the total
transmission delay. As such, a receiver cannot distinguish between
an idle and a crashed source. In one embodiment, aside from
removing layers from the reliable-list, the algorithm is also able
to detect when more bandwidth becomes available and new higher
layers can be added to this list, so that a higher quality stream
can be delivered to the user. This is done by passively or actively
monitoring the layers that are not currently in the reliable-list
and not under repair. Every time a packet is received from these
layers, this fact is stored for a period that is equal to the
repair time-out discussed earlier. When a packet from a layer is
received with a sequence number that shows that a packet has been
lost from a higher layer that is not in the reliable-list, no
retransmission request is sent, but the fact that this packet
should have been received is recorded and stored for the same
repair time-out period.
[0150] If during the time-out period the packet is received after
all, possibly because it was delayed by a router, the stored record
is marked as received. Because the state of each packet from every
layer is recorded, the overlay multicast system builds a packet
arrival history that is used to determine whether the reception
quality was high enough to add a higher layer to the reliable-list
and repair any further packet loss. In the overlay multicast system
the reception history or similar data may be used to calculate a
moving average that describes the amount of packet loss over the
last x seconds, where x is equal to the repair timeout. With a
small moving average history, the receiver will quickly respond to
increased network capacity, while a longer history will only add
layers when the network's capacity was sufficiently high for a
longer period of time. The receiver may keep track of the reception
quality of each layer by inspecting the list of sequence numbers
that is attached to each data packet.
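
The reception history and its moving average may be sketched as follows. The sketch assumes one record per expected packet of a monitored layer, with `window` equal to the repair time-out; the threshold `max_loss` used to decide whether a layer may be added is a hypothetical parameter that does not appear in the description.

```python
from collections import deque


class ReceptionHistory:
    """Hypothetical per-layer history of received and missed packets."""

    def __init__(self, window):
        self.window = window        # history length in seconds
        self.records = deque()      # (time, received) pairs, oldest first

    def record(self, now, received):
        self.records.append((now, received))

    def loss_ratio(self, now):
        """Fraction of packets lost during the last `window` seconds."""
        while self.records and self.records[0][0] <= now - self.window:
            self.records.popleft()
        if not self.records:
            return None             # nothing observed, no basis for a decision
        lost = sum(1 for _, ok in self.records if not ok)
        return lost / len(self.records)


def may_add_layer(history, now, max_loss=0.01):
    """Assumed policy: add the layer when recent loss is at most max_loss."""
    ratio = history.loss_ratio(now)
    return ratio is not None and ratio <= max_loss
```

A shorter `window` makes the receiver react faster to new capacity, while a longer one requires the capacity to remain available for longer before a layer is added.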
[0151] This causality information adds overhead to each packet,
linearly related to the number of layers used in the stream. For
example, when sequence numbers are 32 bits and the publisher uses
all 256 available layers, each packet comes with a kilobyte of
causality information, which amounts to at least 12.5% of an 8
KB packet that is filled to the brim. To ease layer administration
in the receiver socket, an overlay multicast stream only supports a
static number of layers. When a publisher binds a multicast group
address for reliable, layered communication, it may explicitly
specify the number of layers that will be used.
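
The overhead figure in the example can be checked with a short calculation. The function name is hypothetical, and the packet size of the example is assumed to be 8192 bytes.

```python
def causality_overhead(num_layers, seq_bits, packet_bytes):
    """Return (overhead in bytes, overhead as a fraction of the packet)."""
    overhead_bytes = num_layers * seq_bits // 8    # one sequence number per layer
    return overhead_bytes, overhead_bytes / packet_bytes


# 256 layers with 32-bit sequence numbers in a packet assumed to be 8192 bytes
overhead, fraction = causality_overhead(num_layers=256, seq_bits=32,
                                        packet_bytes=8 * 1024)
```

Here `overhead` is 1024 bytes and `fraction` is 0.125; smaller packets carry the same 1024 bytes and therefore a larger relative overhead.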
[0152] In one embodiment, by default, packets are delivered to the
user application in their original order. Not only are the packets
inside the individual layers restored to their natural order using
their sequence numbers, but the total ordering across the layers
may also be restored. When the source sends three packets with
different priority (each in a different layer), all three are
received by the user application in the exact same order. A
prioritized, layered packet is described as P.sub.1(9,4,6) where 1
represents the packet's priority or layer, 9 the sequence number of
the last packet of layer 0 (the highest priority layer) that was
sent at the time this packet was published, 4 represents the
sequence number of this packet and 6 represents the sequence number
of the last packet of layer 2. Also, the packet tells that the
stream uses 3 layers in total: layer 0 up to and including layer 2.
It is concluded that this packet causally depends on packet 9 from
layer 0 and packet 6 from layer 2 and will only be delivered to the
user application after those packets have been delivered. In
another embodiment, the overlay multicast system does not reorder
the packets, but may leave this task to the application using the
network.
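
The packet notation can be modeled directly. In the sketch below, a hypothetical `LayeredPacket` holds the layer and the tuple of sequence numbers, so that P.sub.1(9,4,6) becomes `LayeredPacket(1, (9, 4, 6))`; the class and its members are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayeredPacket:
    layer: int       # priority or layer of this packet
    vector: tuple    # latest sequence number of every layer, in layer order

    @property
    def sequence(self):
        """Sequence number of this packet within its own layer."""
        return self.vector[self.layer]

    @property
    def num_layers(self):
        """Total number of layers used by the stream."""
        return len(self.vector)

    def dependencies(self):
        """Packets from other layers that must be delivered first."""
        return {k: seq for k, seq in enumerate(self.vector) if k != self.layer}


p = LayeredPacket(1, (9, 4, 6))
```

For this packet, `p.sequence` is 4, `p.num_layers` is 3 and `p.dependencies()` names packet 9 of layer 0 and packet 6 of layer 2.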
[0153] FIG. 10 is a diagram that shows one embodiment of a process
of selective packet repair and ordering. For purposes of
explanation, the notation {P.sub.0(9,2,6); P.sub.1(9,3,6), . . . }
is used to describe a sequence of packets in which P.sub.0(9,2,6)
was sent prior to P.sub.1(9,3,6). The first segment
1001, shows the sequence of packets originally published by the
source. It reads from left to right, so P.sub.0(9,2,6) was
published first, followed by P.sub.1(9,3,6), etc. In this example
the source has published eight packets, divided over three priority
levels or layers (0, 1 and 2, where 0 is the lowest layer with
the highest priority). The second segment 1003 shows the packet
stream as received by one of the subscribers. It shows two lost
packets and an incorrect ordering. Before the packets are delivered
to the user application, they are stored in internal buffers during
the repair and reordering process. This state is depicted in
segment 1005. Here the packets are in separate buffers, each
representing a layer. The packets are ordered inside the buffers.
The illustration shows the missing packets P.sub.1(9,4,7) and
P.sub.2(9,4,8).
[0154] If the receiver currently only has layers 0 and 1 in its
reliable-list, it will attempt to repair the hole in layer 1 by
sending a retransmission request. Note that if the given sequence
was just a snapshot of a running stream, the receiver would have
detected the missing packet P.sub.1(9,4,7) when P.sub.0(10,4,8) was
received, because this packet says it depends on packet #4 from
layer 1. So even before P.sub.1(10,5,9) from layer 1 was received
in our example, the receiver already detected loss in layer 1 and
immediately scheduled a repair timeout for the missing packet and
sent a retransmission request. In fact, if it is assumed that the
receiver had received packet #2 from layer 1 prior to our snapshot
of FIG. 10, then the conclusion would be that packet #4 as well as
packet #3 were lost.
[0155] In the example, shortly after P.sub.0(10,4,8) was received,
packet P.sub.1(9,3,6) is received. The receiver places the delayed
packet in the appropriate receiver buffer and cancels the repair
time-out it started earlier when it detected that the packet was
missing. This is illustrated in segment 1007. Note that the hole in
layer 2 may also be detected when P.sub.0(10,4,8) is received, as
that packet claims to be sent after packet #8 of layer 2 was
published, so either packet #8 from layer 2 got lost in the
network, or was delayed. However, since layer 2 is not in the
reliable-list, a retransmission request is not sent. However, as
for all packets, a timer is started for packet #8 of layer 2. When
the user application is ready to read packets, the algorithm
returns only packets from layers that are in the reliable-list.
Even though some of the layer 2 packets were received, they are
discarded and not delivered. The resulting stream of packets that
is delivered to the user is equal to the stream originally
published by the source, with all layer 2 packets removed.
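
The final delivery step of the example may be sketched as a filter followed by a sort. The ordering key used here is an assumption that is not stated in the description: if every published packet advances exactly one layer counter by one, the sum of a packet's sequence numbers grows by one per published packet and can serve as a global ordering key. The sketch also assumes that repair has already completed and does not model waiting for missing packets.

```python
def deliverable(buffered, reliable_layers):
    """Return buffered (layer, vector) packets of reliable layers in global order."""
    kept = [(layer, vector) for layer, vector in buffered
            if layer in reliable_layers]
    # Assumed ordering key: the sum of the per-layer sequence numbers.
    return sorted(kept, key=lambda packet: sum(packet[1]))


buffered = [
    (0, (9, 2, 6)),
    (0, (10, 4, 8)),
    (1, (9, 3, 6)),     # arrived late
    (2, (9, 3, 7)),     # hypothetical layer 2 packet, not in the reliable-list
    (1, (10, 5, 9)),
    (1, (9, 4, 7)),     # recovered through a retransmission request
]
delivered = deliverable(buffered, reliable_layers={0, 1})
```

With layers 0 and 1 in the reliable-list, `delivered` holds the five layer 0 and layer 1 packets in their publication order and omits the layer 2 packet.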
[0156] In this example, despite the fact that the network has
insufficient capacity, having dropped packets from every layer and
delivered the packets out of order, the user received a
deterministic subset of the stream that is uncorrupted. Because the
overlay multicast system sources are live data streams that cannot
slow down or pause and because real-time data are not buffered by
the source too long, receivers will only attempt to recover lost
packets for a limited period of time. Whether a packet is received,
recovered or lost, its buffer slot will be freed after this
timeout, resulting in a natural size limit of the receive buffer.
How large the buffer can get is related to the average speed of the
stream and the length of the repair time-out.
[0157] In one embodiment, the algorithm implementation does not
enforce a hard size limit during packet repairs. A more troublesome
situation occurs when the user application does not read data from
the socket fast enough. When this happens, the amount of pending
data builds up in the socket and both storage requirements and the
transmission delay increase. In one embodiment, this may be
handled by removing a layer from the reliable-list, causing those
packets to be discarded from the buffers, while decreasing the
amount of data that is delivered to the application. If a clean
feedback algorithm that can keep the number of layers in balance
with the application's reading speed is not used, the overlay
multicast system throws a fatal exception to the user or provides a
similar indicator to the user application and closes the socket when
the receive buffer reaches a certain maximum size, or when the
total time between arrival of packets and their actual delivery
reaches a threshold.
[0158] Although restoring the original global packet ordering
before delivering the data to the user application is assumed to be
appropriate for most types of data, there is content for which each
packet invalidates all previous packets. This is the case for stock
quotes and similar time sensitive data. For example, when a new
stock quote update is received for a financial ticker symbol, it
renders the previous updates useless. For most applications that
process real-time financial data, only the most recent information
is interesting. When global ordering is restored, the algorithm
will postpone the delivery of the most recent data until all prior
packets have been received also. For a typical application, such as
a market data terminal, that merely displays quote updates to the
screen, this postponing adds little value. When the burst of
pending updates is finally delivered by the tsocket, the
application updates the symbol's last value on the screen, thereby
leaving only the last and most recent update visible and
overwriting all pending updates immediately.
[0159] In one embodiment, because the reordering process adds
additional delay to the data delivery, it can be switched off by
applications that do not benefit from it. Without reordering,
global ordering (causal relations between packets from different
layers) and local ordering (order of individual packets inside a
single layer) are ignored and packets are delivered to the
application immediately after they have been received. Disabling
reordering has no impact on the reliability. Lost or delayed
packets may still be recovered for all layers in the reliable-list;
the only consequence is that recovered packets may be delivered
with additional delay. Whether or not this makes them
useless is up to the application to decide.
[0160] For example, with market data it is useful to know whether
an update is older or newer than the one previously received. This
is because a stock quote may be overwritten by a newer one, but not
the other way around. In this case, the source could add a logical
time stamp or similar sequence indicator to each stock quote, so
the receiver can decide how to handle the update. In one
embodiment, aside from configuring a layered multicast tsocket to
restore global ordering or no ordering at all, a receiver can
configure a tsocket to only restore local ordering. Whether or not
this is useful will depend on the type of application and the
content, but it offers the advantage that delivery of packets from
lower (high priority) layers is not delayed during recovery of
packets from higher (lower priority) layers.
[0161] In one embodiment, an example market data publishing
application tracks market financial updates in a linked list or
similar data structure. Every time a new update is received, it is
stored in the appropriate slot, overwriting the previous update.
This way, the linked list always contains the most recent trade for
every symbol. A virtual token traverses the list at a fixed speed.
If the stream is to be thinned to one update per second, the token
will visit one slot every second. When the token visits a slot, it
removes the quote update and forwards it, leaving an empty slot.
Empty slots will be skipped by the token without any delay. When
one of the symbols has two updates per second in the incoming
stream, the second incoming update after the last visit of the
token will overwrite the update pending in the slot and the older
pending update is dropped. This allows the leaky bucket algorithm
to always provide the thinned clients with the most recent, live
update of every symbol. While the virtual token visits the slots
that have a recent quote update pending to be sent out, a thread
receives the incoming market data stream and inserts new trades in
the slots. To limit the outgoing transmission rate to one quote per
second, the token thread sleeps for one second after it moved and
sent an update from a slot. When the token is able to complete one
full circle between two incoming updates of every symbol, there
will be no data loss. Instead of limiting the outgoing bandwidth to
a fixed maximum, the dequeuing thread could also be configured to
run as fast as possible, relying on external feedback such as a
blocking write call from the network, to reduce publishing speed.
This could be particularly useful when the algorithm is used on the
server side of a point-to-point TCP connection to a client
application. In one embodiment, the server uses the algorithm to
send the data at original speed, until network saturation slows
down the links. A potential problem with this technique is that
the original order of the updates is lost. In fact, when every
symbol updates more than once while the token circles the list, the
token will find every slot filled with a quote update, causing the
order of the output stream to match the fixed order of the slots in
the linked list. For quote updates from different symbols, this is
usually not a problem, as they are not related to each other.
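
The single-ring thinning behavior may be sketched as below. This is an illustration under stated assumptions, not the implementation of the embodiment: the slots are held in a dictionary keyed by symbol with a fixed symbol order in place of a linked list, the class and method names are hypothetical, and the caller is expected to sleep for the configured interval between calls to `next_update`.

```python
class ThinningRing:
    """Hypothetical ring of per-symbol slots visited by a single token."""

    def __init__(self, symbols):
        self.symbols = list(symbols)               # fixed slot order
        self.slots = {s: None for s in self.symbols}
        self.position = 0                          # slot the token visits next

    def insert(self, symbol, update):
        """Store the newest update, overwriting any older pending one."""
        self.slots[symbol] = update

    def next_update(self):
        """Skip empty slots, then dequeue and return (symbol, update)."""
        count = len(self.symbols)
        for step in range(count):
            index = (self.position + step) % count
            symbol = self.symbols[index]
            update = self.slots[symbol]
            if update is not None:
                self.slots[symbol] = None          # leave an empty slot behind
                self.position = (index + 1) % count
                return symbol, update
        return None                                # every slot is empty
```

If a symbol is updated twice before the token reaches its slot, only the second update is ever returned, which is the thinning effect described above.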
[0162] In another embodiment, an alternative algorithm may be used
that provides another way of thinning a stream of data, such as
live market data, with varying bandwidth to a fixed maximum rate,
without the risk of providing stale data, for example stale market
symbols. This algorithm may be used to produce enhancement layers
that form a layered data stream, such as a layered market data
stream. One example method of doing this is to have several
instances of the algorithm running that each receive a copy of the
input stream, letting their token run at different speeds and
tagging their updates with a layer number. The layers may then be
combined into a single stream. Unfortunately, this may introduce
redundant information in the stream because the higher layers
contain updates that are also available in the lower layers. This
technique of encoding an entire stream in different qualities and
sending them to clients concurrently to meet their individual
bandwidth capacities may be referred to as simulcasting.
[0163] In one embodiment, the thinning algorithm may be modified to
have each layer contain only information that is not already
present in the lower layers, thereby eliminating redundancy among
the layers. To achieve this the leaky bucket algorithm may be
extended to a multi-level ring as illustrated in FIG. 11B, creating
a virtual cylinder of vertically stacked, circular lists, where
each ring represents a layer of the output stream. Each ring has
its own token. In one embodiment, the tokens all circulate their
ring at the same speed. In other embodiments, the tokens in each
ring may be set to circulate at different speeds. This data
structure also identifies columns. A column is the collection of
vertically stacked slots that can all store an update for the same
symbol.
[0164] FIG. 11A is a diagram of an example method of thinning a
stream of live market data. The gray slots represent symbols for
which a recent quote update was received, while the transparent
slots are empty. FIG. 11B is a diagram illustrating the token ring
data structure. Multiple instances of the thinning algorithm and
data structure may be conceived as being vertically stacked; each
instance represents a stream layer and each symbol has a column
that can store more than one update. FIG. 11C is a diagram of one
embodiment of a representation of the stacked thinning algorithm of
FIG. 11A at work. In the example, the gray slots contain a pending
quote update, while the transparent slots are empty. The
illustration shows how the updates of a volatile symbol are spread
out over several higher layers.
[0165] In one embodiment, the thinning algorithm has a single input
thread that receives the original stream. Newly received updates
may be inserted in the lowest ring. If a new update arrives for a
slot in the lowest ring that already contains an update for that
symbol, this older update is vertically shifted into the next ring,
so that the lower slot can always store the latest update. Once an
update is shifted upwards, it cannot return to a lower ring
anymore, even if all other slots beneath it were emptied.
[0166] In one example embodiment, symbols that have more than one
trade during the time it takes the token to complete a full
rotation of the ring will be spread out to two or more higher
layers, while symbols that have very infrequent trades will only
occupy a single slot in the bottom ring. An example result of this
process is depicted in FIG. 11C. FIG. 11C shows the thinning
algorithm using three layers to thin a market data stream that
contains updates for eight symbols. Four symbols are included in
the data structure and assigned to the four visible columns in the
front. The figure also shows that all four symbols have pending
updates in the bottom ring, while the `MSFT` symbol is so volatile
that it shifted updates into all three layers before the tokens
could dequeue any of the pending updates.
[0167] In a case where a symbol is so actively traded that it
manages to shift an update out of the top ring, that update is
irrecoverably lost. If required, loss of updates can be avoided by
letting the token in the top ring run without dequeuing delay,
letting it empty every slot immediately after an update was queued.
The result is that the transmission rate of the upper layer will
equal the transmission rate of the original stream, minus the
combined rate of all lower layers. Because the transmission rate of
the lower layers may have a fixed maximum, the upper layer's rate
may fluctuate and follow the variations of the original stream.
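
The upward shifting of older updates through the stacked rings may be sketched as follows. The class and attribute names are hypothetical, the rings are modeled as dictionaries keyed by symbol, and the per-ring tokens are omitted for brevity; updates shifted out of the top ring are collected in `lost` only to make the behavior observable.

```python
class LayeredThinner:
    """Hypothetical stack of rings; ring 0 is the lowest layer."""

    def __init__(self, symbols, num_layers):
        self.rings = [{s: None for s in symbols} for _ in range(num_layers)]
        self.lost = []      # updates shifted out of the top ring

    def insert(self, symbol, update):
        """Place the newest update in ring 0, shifting older ones upwards."""
        carry = update
        for ring in self.rings:
            # Swap: the slot takes the carried update, the old content moves up.
            carry, ring[symbol] = ring[symbol], carry
            if carry is None:
                return      # reached an empty slot, nothing more to shift
        self.lost.append(carry)
```

Inserting three updates for one symbol into a three-layer stack leaves the newest in ring 0 and the oldest in ring 2; a fourth insert pushes the oldest update out of the top ring.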
[0168] In one embodiment, if the tokens all rotate at the same
speed, this does not imply that they visit the same symbol slots at
the same time. Instead, they will usually be at different positions
in their ring. This is a result of empty slots that are skipped
without waiting.
[0169] In one embodiment, the layered thinning algorithm has out of
order transmission and may cause additional delay. Every ring can
contain an update for the same symbol at a certain time and because
the tokens can have different column locations inside their ring,
it is undefined which ring will dequeue and transmit its update
first, which will be second and so on. This means that not only the
order of symbols is shuffled, but even the updates for a single
symbol are no longer guaranteed to be transmitted
chronologically.
[0170] While new updates for a symbol are queued, shifted and sent
out, in some cases an update can survive in the rings substantially
longer than its younger successors. This can occur when an update
that is just about to be dequeued by the token gets shifted to the
next layer by a more recent update that entered through the lower
ring. In this case it is likely that the more recent update is
dequeued sooner than the older update that now has to wait for the
token of its new ring to reach its slot. To address this issue, the
system provides a way of putting the outgoing, layered updates back
in the original, chronological order. This is done by labeling
incoming updates with sequential numbers and having the tokens pass
their dequeued, labeled updates to a transmit window that uses the
sequence numbers to restore the chronological order. When updates
are shifted out of the upper ring and therefore considered lost, the
transmit window is notified not to wait for their sequence
numbers.
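One possible shape for such a transmit window is sketched below. The class `TransmitWindow` and its methods `push` and `mark_lost` are hypothetical names chosen for the example, and the sketch assumes sequence numbers are consecutive integers assigned when updates enter the bottom ring.

```python
from typing import Dict, List, Set


class TransmitWindow:
    """Hypothetical reorder buffer keyed by sequence number."""

    def __init__(self, first_seq: int = 0) -> None:
        self.next_seq = first_seq          # Next sequence number to send.
        self.pending: Dict[int, str] = {}  # Dequeued updates waiting their turn.
        self.lost: Set[int] = set()        # Sequence numbers not to wait for.

    def push(self, seq: int, update: str) -> List[str]:
        """Accept an update dequeued by a token; return updates now sendable."""
        self.pending[seq] = update
        return self._release()

    def mark_lost(self, seq: int) -> List[str]:
        """Record that `seq` was shifted out of the top ring."""
        self.lost.add(seq)
        return self._release()

    def _release(self) -> List[str]:
        # Emit updates in sequence order, stepping over lost numbers, and
        # stop at the first number that is neither pending nor lost.
        out: List[str] = []
        while True:
            if self.next_seq in self.pending:
                out.append(self.pending.pop(self.next_seq))
            elif self.next_seq in self.lost:
                self.lost.remove(self.next_seq)
            else:
                break
            self.next_seq += 1
        return out
```

In this sketch a late update with the lowest outstanding sequence number releases everything queued behind it in one burst, which is the bursty behavior the window is known for.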
[0171] When applying order reconstruction across all rings, the
transmit window must have the same capacity as all rings combined,
size.sub.tw=Nj, where size.sub.tw is the maximum capacity of the
transmit window, N the number of symbols in the original stream
(and therefore the number of slots per ring) and j the number of
layers used. The potential size of the transmit window, together
with its bursty behavior when a delayed quote update releases a
substantial number of pending updates from the window, may justify
a mechanism that applies order reconstruction to achieve
chronological transmission per symbol only, similar to the output of
the base layer encoder.
[0172] Like the algorithm depicted in FIG. 11A, the layered system
introduces delay. The model is essentially a collection of
instances of the first algorithm; thus, its worst-case delay is the
sum of each ring's worst-case delay:
$$t_{\max} = \sum_{i=0}^{j-1} \frac{N}{f_{\text{update}}(i)}$$
[0173] where t.sub.max is the maximum delay in seconds that the
algorithm could introduce for a single quote update, N the number
of symbols in the original stream, j the number of used layers and
f.sub.update(i) the update frequency of the token in layer i.
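As a worked illustration of this bound, the sketch below sums N divided by the token update frequency over all layers. The function name `worst_case_delay` and the example figures are assumptions for illustration only, with each frequency taken to mean slots visited per second.

```python
from typing import Sequence


def worst_case_delay(num_symbols: int, token_frequencies: Sequence[float]) -> float:
    """Sum each ring's worst-case delay, N / f_update(i), over all layers.

    `token_frequencies[i]` is the assumed update frequency of the token in
    layer i, in slots per second; the result is in seconds.
    """
    return sum(num_symbols / f for f in token_frequencies)


# Illustrative numbers only: 8 symbols and three layers whose tokens
# visit 4, 2 and 1 slots per second.
t_max = worst_case_delay(8, [4.0, 2.0, 1.0])  # 2 + 4 + 8 = 14 seconds
```

The slowest token dominates the sum, so under these assumptions a slow-rotating layer sets most of the worst-case delay.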
[0174] In the foregoing specification, the invention has been
described with reference to specific embodiments thereof. It will,
however, be evident that various modifications and changes can be
made thereto without departing from the broader spirit and scope of
the invention as set forth in the appended claims. The
specification and drawings are, accordingly, to be regarded in an
illustrative rather than a restrictive sense.
* * * * *