U.S. patent application number 13/040338 was filed with the patent office on 2011-09-08 for peer-to-peer live content delivery. Invention is credited to Evan Pedro Greenberg and Bo Yang.

Application Number: 20110219137 (13/040338)
Family ID: 44532232
Filed Date: 2011-09-08

United States Patent Application 20110219137
Kind Code: A1
Yang; Bo; et al.
September 8, 2011
PEER-TO-PEER LIVE CONTENT DELIVERY
Abstract
A peer-to-peer live content delivery system and method enables
peer-to-peer sharing of live content such as, for example,
streaming video or audio. Nodes receive broadcasts of available
data from neighboring nodes and determine which data blocks to
request. Nodes receiving requests for data determine whether or not
to accept the requests and provide the requested blocks when
accepted. To enable sharing of live content, sharing of data blocks
is constrained such that a node attempts to receive a particular
data block prior to a playback deadline for the data block. This
allows a node to continuously provide an output stream of the received
data such as, for example, an output of live video content to a
display.
Inventors: Yang; Bo (Palo Alto, CA); Greenberg; Evan Pedro (Palo Alto, CA)
Family ID: 44532232
Appl. No.: 13/040338
Filed: March 4, 2011
Related U.S. Patent Documents

Application Number: 61311141
Filing Date: Mar 5, 2010
Current U.S. Class: 709/231
Current CPC Class: H04L 29/12103 20130101; H04L 67/14 20130101; H04L 63/029 20130101; H04L 29/12528 20130101; H04L 61/1535 20130101; H04L 61/2575 20130101
Class at Publication: 709/231
International Class: G06F 15/16 20060101 G06F015/16
Claims
1. A method for distributing streaming data in a peer-to-peer
network, the method performed by a first node in the peer-to-peer
network, the method comprising: receiving, by the first node, a
data availability broadcast message from a first neighboring node,
the data availability broadcast message identifying one or more
data blocks that the first neighboring node has available for
transmission, the one or more data blocks comprising time-localized
portions of the streaming data; determining, by the first node, a
desired data block selected from the one or more data blocks
specified in the data availability broadcast message, the desired
data block selected such that the first node receives the desired
data block prior to a playback deadline for the desired data block;
transmitting, by the first node, a data request message to the
first neighboring node specifying the desired data block; and
receiving, by the first node, the desired block from the first
neighboring node.
2. The method of claim 1, wherein the streaming data comprises
streaming video data and wherein the first node prioritizes an
order of the data request message such that the first node
continuously outputs the streaming video data to a display.
3. The method of claim 1, wherein determining the desired data
block comprises: maintaining an incoming queue of incoming data
blocks scheduled for transmission to the first node within a time
window; and determining the desired data block from among the data
blocks within the time window that are absent from the incoming
queue.
4. The method of claim 1, further comprising: receiving directly
from a server, a pre-burst of a plurality of data blocks, the
pre-burst corresponding to a beginning of a new data stream.
5. The method of claim 1, further comprising: determining a desired
data block that is unavailable from neighboring nodes; transmitting
a request for the desired data block to a server; and receiving the
desired data block from the server.
6. The method of claim 1, further comprising: transmitting a data
availability broadcast message to a plurality of neighboring nodes,
the data availability broadcast message identifying one or more
data blocks available for transmission by the first node; receiving
from a second neighboring node, a data request specifying at least
one desired data block selected from among the one or more data
blocks available for transmission by the first node; determining
whether or not to accept the data request; and responsive to
determining to accept the data request, transmitting the desired
data block to the second neighboring node.
7. The method of claim 6, further comprising: rejecting the data
request responsive to determining that the first node is already
currently transmitting at least a first threshold number of data
blocks; and rejecting the data request responsive to determining
that the first node has previously transmitted the desired data
block to at least a second threshold number of nodes.
8. A computer-readable storage medium storing computer-executable
instructions for distributing streaming data in a peer-to-peer
network, the instructions when executed by a processor cause the
processor to perform steps including: receiving a data availability
broadcast message from a first neighboring node, the data
availability broadcast message identifying one or more data blocks
that the first neighboring node has available for transmission, the
one or more data blocks comprising time-localized portions of the
streaming data; determining a desired data block selected from the
one or more data blocks specified in the data availability
broadcast message, the desired data block selected such that the
first node receives the desired data block prior to a playback
deadline for the desired data block; transmitting a data request
message to the first neighboring node specifying the desired data
block; and receiving the desired block from the first neighboring
node.
9. The computer-readable storage medium of claim 8, wherein the
streaming data comprises streaming video data and wherein the first
node prioritizes an order of the data request message such that the
first node continuously outputs the streaming video data to a
display.
10. The computer-readable storage medium of claim 8, wherein
determining the desired data block comprises: maintaining an
incoming queue of incoming data blocks scheduled for transmission
to the first node within a time window; and determining the desired
data block from among the data blocks within the time window that
are absent from the incoming queue.
11. The computer-readable storage medium of claim 8, wherein the
instructions when executed further cause the processor to perform
steps including: receiving directly from a server, a pre-burst of a
plurality of data blocks, the pre-burst corresponding to a
beginning of a new data stream.
12. The computer-readable storage medium of claim 8, wherein the
instructions when executed further cause the processor to perform
steps including: determining a desired data block that is
unavailable from neighboring nodes; transmitting a request for the
desired data block to a server; and receiving the desired data
block from the server.
13. The computer-readable storage medium of claim 8, wherein the
instructions when executed further cause the processor to perform
steps including: transmitting a data availability broadcast message
to a plurality of neighboring nodes, the data availability
broadcast message identifying one or more data blocks available for
transmission by the first node; receiving from a second neighboring
node, a data request specifying at least one desired data block
selected from among the one or more data blocks available for
transmission by the first node; determining whether or not to
accept the data request; and responsive to determining to accept the
data request, transmitting the desired data block to the second
neighboring node.
14. The computer-readable storage medium of claim 13, wherein the
instructions when executed further cause the processor to perform
steps including: rejecting the data request responsive to
determining that the first node is already currently transmitting
at least a first threshold number of data blocks; and rejecting the
data request responsive to determining that the first node has
previously transmitted the desired data block to at least a second
threshold number of nodes.
15. A system for distributing streaming data in a peer-to-peer
network, the system comprising: one or more processors; and a
computer-readable storage medium storing computer-executable
instructions, the instructions when executed by the one or more
processors cause the one or more processors to perform steps
including: receiving a data availability broadcast message from a
first neighboring node, the data availability broadcast message
identifying one or more data blocks that the first neighboring node
has available for transmission, the one or more data blocks
comprising time-localized portions of the streaming data;
determining a desired data block selected from the one or more data
blocks specified in the data availability broadcast message, the
desired data block selected such that the first node receives the
desired data block prior to a playback deadline for the desired data
block; transmitting a data request message to the first neighboring
node specifying the desired data block; and receiving the desired
block from the first neighboring node.
16. The system of claim 15, wherein the streaming data comprises
streaming video data and wherein the first node prioritizes an
order of the data request message such that the first node
continuously outputs the streaming video data to a display.
17. The system of claim 15, wherein determining the desired data
block comprises: maintaining an incoming queue of incoming data
blocks scheduled for transmission to the first node within a time
window; and determining the desired data block from among the data
blocks within the time window that are absent from the incoming
queue.
18. The system of claim 15, wherein the instructions when executed
further cause the one or more processors to perform steps
including: receiving directly from a server, a pre-burst of a
plurality of data blocks, the pre-burst corresponding to a
beginning of a new data stream.
19. The system of claim 15, wherein the instructions when executed
further cause the one or more processors to perform steps
including: determining a desired data block that is unavailable
from neighboring nodes; transmitting a request for the desired data
block to a server; and receiving the desired data block from the
server.
20. The system of claim 15, wherein the instructions when executed
further cause the one or more processors to perform steps
including: transmitting a data availability broadcast message to a
plurality of neighboring nodes, the data availability broadcast
message identifying one or more data blocks available for
transmission by the first node; receiving from a second neighboring
node, a data request specifying at least one desired data block
selected from among the one or more data blocks available for
transmission by the first node; determining whether or not to
accept the data request; and responsive to determining to accept the
data request, transmitting the desired data block to the second
neighboring node.
Description
RELATED APPLICATIONS
[0001] This application claims priority from U.S. provisional
application No. 61/311,141 entitled "High Performance Peer-To-Peer
Assisted Live Content Delivery System and Method" filed on Mar. 5,
2010, the content of which is incorporated by reference herein in
its entirety.
BACKGROUND
[0002] 1. Field of the Invention
[0003] The invention relates generally to peer-to-peer networking
and more particularly to distributing data such as live content
over a network within some time constraint.
[0004] 2. Description of the Related Art
[0005] Peer-to-peer networking provides an efficient network
architecture for sharing information by creating direct connections
between "nodes" without requiring information to pass through a
centralized server. In a conventional peer-to-peer network, a node
receives different portions of a file from a plurality of different
neighboring nodes. Thus, when sharing a video file, for example, a
node may receive different segments of the video from different
nodes. Once all of the portions are received, the node can
reconstruct the file from the separate portions.
[0006] Conventional peer-to-peer networking systems are not adapted to sharing live or streaming media content such as live video or audio. Rather, these conventional networks operate only on discrete files, not continuous data streams, and their sharing protocols cannot handle the time constraints associated with delivery of streaming content. Conventional systems therefore do not provide any way to distribute data such as live content in a peer-to-peer network where portions of the data must be received within some time constraint.
SUMMARY
[0007] A system, method, and computer-readable storage medium
enable nodes in a peer-to-peer network to share streaming data
(e.g., video) and impose time constraints such that nodes are able
to continuously output the streaming data. A first node receives a
data availability broadcast message from a neighboring node. The
data availability broadcast message identifies one or more data
blocks that the neighboring node has available for sharing. The
data blocks each comprise a time-localized portion of the streaming
data. The first node determines a desired data block selected from
the one or more data blocks specified in the data availability
broadcast message. In one embodiment, the first node selects the desired data block to ensure that it will receive all data blocks in the stream prior to their playback deadlines, i.e., before the first node is scheduled to output each data block (e.g., when streaming a video to a display). The first node then transmits a
data request message to the neighboring node specifying the desired
data block. Assuming the neighboring node accepts the request, the
first node receives the desired block from the neighboring
node.
[0008] Beneficially, the data sharing system and method enables
sharing of live content such as video or audio. By constraining the
sharing of data blocks to occur within a limited time period, a
node can provide a continuous output stream of the received data
such as, for example, an output of live video content to a
display.
[0009] The features and advantages described in the specification
are not all inclusive and, in particular, many additional features
and advantages will be apparent to one of ordinary skill in the art
in view of the drawings, specification, and claims. Moreover, it
should be noted that the language used in the specification has
been principally selected for readability and instructional
purposes, and may not have been selected to delineate or
circumscribe the inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The teachings of the embodiments of the present invention
can be readily understood by considering the following detailed
description in conjunction with the accompanying drawings.
[0011] FIG. 1 illustrates an example configuration of a
peer-to-peer network, in accordance with an embodiment of the
present invention.
[0012] FIG. 2 illustrates examples of data structures of streaming
data for sharing in the peer-to-peer network, in accordance with an
embodiment of the present invention.
[0013] FIG. 3 illustrates an example of a message passing
protocol for sharing data between nodes in a peer-to-peer network,
in accordance with an embodiment of the present invention.
[0014] FIG. 4 illustrates a distribution tree structure for
modeling distribution of data blocks in the peer-to-peer network,
in accordance with an embodiment of the present invention.
[0015] FIG. 5 illustrates an example architecture for a computing
device for use as a server or node in a peer-to-peer network, in
accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
[0016] Reference in the specification to "one embodiment" or to "an
embodiment" means that a particular feature, structure, or
characteristic described in connection with the embodiments is
included in at least one embodiment of the invention. The
appearances of the phrase "in one embodiment" or "an embodiment" in
various places in the specification are not necessarily all
referring to the same embodiment.
[0017] Some portions of the detailed description that follows are
presented in terms of algorithms and symbolic representations of
operations on data bits within a computer memory. These algorithmic
descriptions and representations are the means used by those
skilled in the data processing arts to most effectively convey the
substance of their work to others skilled in the art. An algorithm
is here, and generally, conceived to be a self-consistent sequence
of steps (instructions) leading to a desired result. The steps are
those requiring physical manipulations of physical quantities.
Usually, though not necessarily, these quantities take the form of
electrical, magnetic or optical signals capable of being stored,
transferred, combined, compared and otherwise manipulated. It is
convenient at times, principally for reasons of common usage, to
refer to these signals as bits, values, elements, symbols,
characters, terms, numbers, or the like. Furthermore, it is also
convenient at times, to refer to certain arrangements of steps
requiring physical manipulations or transformation of physical
quantities or representations of physical quantities as modules or
code devices, without loss of generality.
[0018] However, all of these and similar terms are to be associated
with the appropriate physical quantities and are merely convenient
labels applied to these quantities. Unless specifically stated
otherwise as apparent from the following discussion, it is
appreciated that throughout the description, discussions utilizing
terms such as "processing" or "computing" or "calculating" or
"determining" or "displaying" or the like, refer to the action and
processes of a computer system, or similar electronic computing
device (such as a specific computing machine), that manipulates and
transforms data represented as physical (electronic) quantities
within the computer system memories or registers or other such
information storage, transmission or display devices.
[0019] Certain aspects of the present invention include process
steps and instructions described herein in the form of an
algorithm. It should be noted that the process steps and
instructions of the present invention could be embodied in
software, firmware or hardware, and when embodied in software,
could be downloaded to reside on and be operated from different
platforms used by a variety of operating systems. The invention can also be embodied in a computer program product which can be executed on a computing system.
[0020] The present invention also relates to an apparatus for
performing the operations herein. This apparatus may be specially
constructed for the purposes, e.g., a specific computer, or it may
comprise a general-purpose computer selectively activated or
reconfigured by a computer program stored in the computer. Such a
computer program may be stored in a computer readable storage
medium, such as, but not limited to, any type of disk including
floppy disks, optical disks, CD-ROMs, magnetic-optical disks,
read-only memories (ROMs), random access memories (RAMs), EPROMs,
EEPROMs, magnetic or optical cards, application specific integrated
circuits (ASICs), or any type of media suitable for storing
electronic instructions, and each coupled to a computer system bus.
Memory can include any of the above and/or other devices that can
store information/data/programs. Furthermore, the computers
referred to in the specification may include a single processor or
may be architectures employing multiple processor designs for
increased computing capability.
[0021] The algorithms and displays presented herein are not
inherently related to any particular computer or other apparatus.
Various general-purpose systems may also be used with programs in
accordance with the teachings herein, or it may prove convenient to
construct more specialized apparatus to perform the method steps.
The structure for a variety of these systems will appear from the
description below. In addition, the present invention is not
described with reference to any particular programming language. It
will be appreciated that a variety of programming languages may be
used to implement the teachings of the present invention as
described herein, and any references below to specific languages
are provided for disclosure of enablement and best mode of the
present invention.
[0022] In addition, the language used in the specification has been
principally selected for readability and instructional purposes,
and may not have been selected to delineate or circumscribe the
inventive subject matter. Accordingly, the disclosure of the
present invention is intended to be illustrative, but not limiting,
of the scope of the invention.
Overview
[0023] A peer-to-peer (P2P) type distributed architecture is
composed of participants, e.g., individual computing resources that
make a portion of their resources (e.g., processing power, disk
storage or network bandwidth) available to other participants. FIG.
1 illustrates an example configuration of a peer-to-peer network
100 for distributing streaming content. The peer-to-peer network
100 comprises a server 110 and a number of nodes 120 (e.g., nodes
A, B, C, D, and E). The server 110 is a computing device that
provides coordinating functionality for the network. In one
embodiment, the server 110 may be controlled by an administrator
and has specialized functionality as will be described below. The
nodes 120 are computing devices communicatively coupled to the
network via connections to the server 110 and/or to one or more
other nodes 120. Examples of computing devices that may act as a
server or a node in the peer-to-peer network 100 are described in
further detail below with reference to FIG. 5.
[0024] In this peer-to-peer network 100, each node 120 has a
neighboring relationship with and maintains information about a
bounded subset of neighboring nodes (alternatively referred to as
"peer nodes" or "peers" or "partners") to which it can directly
communicate data or from which it can directly receive data. A node
120 does not necessarily have a direct connection to every other
node 120 in the network 100. However, information can flow between
nodes 120 that are not directly connected via hops between
neighboring nodes. Each node 120 maintains a list of its
neighboring nodes in the network graph. In one embodiment, the
server 110 shares a direct connection with each of nodes 120.
[0025] The neighboring relationship between nodes 120 is determined
by a membership management protocol. An example of a membership
management protocol for a peer-to-peer network is described in U.S.
Patent Application No. ______ to Yang, et al. filed on Mar. 4, 2011
and entitled "Network Membership Management for Peer-to-Peer
Networking," which is incorporated by reference herein. In one
embodiment, the membership management protocol determines the
neighboring relationships between nodes 120 randomly according to a
probabilistic formula. Furthermore, the neighboring relationships
may change whenever a new node joins the network, when an existing
node 120 leaves the network, or during periodic re-subscriptions
which serve to rebalance the network graph. When operated in one
such implementation, the probabilistically expected average subset size is (c+1)·log₂(N), where c is a design parameter (e.g., a fixed value or a value configured by a network administrator, typically a small integer value) and N is the total number of nodes in the system. The protocol establishes that in such a network, for any constant k, sending a multicast message to log₂(N)+k nodes (which then recursively forward the message to log₂(N)+k other nodes that have not already seen the message) will reach every node in the network graph with theoretical probability e^(-e^(-k)). In one embodiment, the backend server
110 manages nodes according to a pod-based management scheme. An
example of a pod-based management scheme is described in U.S.
Patent Application No. ______ to Yang, et al. filed on Mar. 4,
2011, and entitled "Pod-Based Server Backend Infrastructure for
Peer-Assisted Applications," which is incorporated by reference
herein. Generally in the pod-based management scheme, each "pod"
comprises a plurality of nodes and only nodes within the same pod
can directly share data. The server dynamically allocates nodes to
pods and dynamically allocates computing resources for pushing data
to the pods based on characteristics of the incoming data stream
and performance of the peer-to-peer sharing. By dynamically
adjusting the pod structure and resources available to them based
on monitored characteristics, the server 110 can optimize
performance of the peer-to-peer network.
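The neighbor-set sizing and multicast reach formulas above can be checked numerically. In this sketch the function names and the choice c = 2 are illustrative assumptions, not values from the application:

```python
import math

def expected_neighbor_count(n_nodes: int, c: int = 2) -> float:
    """Probabilistically expected average neighbor-set size: (c + 1) * log2(N)."""
    return (c + 1) * math.log2(n_nodes)

def multicast_reach_probability(k: float) -> float:
    """Theoretical probability e^(-e^(-k)) that a multicast forwarded to
    log2(N) + k not-yet-reached nodes per hop reaches every node."""
    return math.exp(-math.exp(-k))
```

For N = 1024 and c = 2 this gives an expected 30 neighbors, and even a small overshoot such as k = 5 already pushes the reach probability above 99%.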
[0026] In one embodiment, the peer-to-peer network 100 distributes
streaming data (e.g., audio, video, or other time-based content) to
the nodes 120. The server 110 receives the streaming data 130 from
a streaming data source (not shown). The streaming data source may
be, for example, an external computing device or a storage device
local to the server 110. In one embodiment, the streaming data 130
comprises an ordered sequence of data blocks. Each data block
comprises a time-localized portion of the data stream 130. For
example, for an input video stream, a data block may comprise a
sequence of consecutive video frames from the input video stream
(e.g., a 0.5 second chunk of video). For an input audio stream, a
data block may comprise a time-localized portion of the audio.
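A time-localized data block might be modeled as below; the field names and the 0.5-second default are illustrative assumptions, since the application does not specify a block layout:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataBlock:
    """One time-localized portion of the stream (e.g., 0.5 s of video)."""
    slot: int          # position in the ordered sequence (time slot)
    duration_s: float  # length of the time-localized portion, in seconds
    payload: bytes     # encoded frames or samples for this slot

    def start_time_s(self) -> float:
        """Stream time at which playback of this block begins."""
        return self.slot * self.duration_s
```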
[0027] The server 110 then distributes a received data block to one
or more nodes 120. The receiving node(s) in turn may distribute the
data block to one or more additional nodes, and so on. In this
manner, the data block is distributed throughout the peer-to-peer
network 100. The number of nodes to which the server 110
distributes a data block may vary depending on the network
configuration, and may vary for different data blocks in the
input data stream 130. Similarly, the number of nodes to which a
particular node distributes a data block may vary depending on the
network configuration, and may also be different for different data
blocks.
[0028] In one embodiment, the data distribution protocol constrains
the timing of the distribution of data blocks in a manner optimized
for streaming data. Distribution of streaming data differs from
distribution of complete files in that the nodes generally do not
store the received data blocks indefinitely, but rather may store
them only temporarily until they are outputted. Thus, unlike file
sharing protocols where the order and timing of data block
distribution is not important, the distribution protocol for
streaming data should attempt to provide data blocks within a
specified time constraint such that they can be continuously
outputted.
[0029] The data distribution protocol may be useful, for example,
to distribute broadcasts of "live" video or other time-based
streams. As used herein, the term "live" does not necessarily
require that the video stream be distributed concurrently with its
capture (as in, for example, a live sports broadcast). Rather, the
term "live" refers to data for which it is desirable that all
participating nodes receive data blocks in a roughly synchronized
manner (e.g., within a time period T of each other). Thus, examples
of live data may include both live broadcasts (e.g. sports or
events) that are distributed as the data is captured, and
multicasts of previously stored video or other data.
[0030] In one embodiment, the distribution protocol attempts to
ensure delivery of a data block to each node 120 on the network 100
within a time period T seconds from when the server 110 initially
outputs the data block. For example, in different embodiments, T
could be a few seconds, 5 minutes, or one hour. In one embodiment,
if a node 120 cannot receive the data block within the time period T (e.g., due to bandwidth constraints or latency), the block is no longer considered useful to the node 120 and the node may drop
its request for the data block in favor of later blocks.
Furthermore, the order in which data blocks are requested and
distributed to various nodes 120 may be prioritized in order to
optimize the nodes' ability to meet the time constraints, with the
goal of enabling the nodes 120 to continuously output the streaming
data.
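The dropping of late requests described above can be sketched as a simple filter over pending time slots. The slot-based representation and function names are illustrative assumptions:

```python
def still_useful(block_slot: int, playback_slot: int) -> bool:
    """A block remains useful only if its time slot has not yet been played back."""
    return block_slot >= playback_slot

def prune_requests(pending_slots: list[int], playback_slot: int) -> list[int]:
    """Drop requests for blocks whose playback deadline has passed,
    freeing capacity for later blocks in the stream."""
    return [s for s in pending_slots if still_useful(s, playback_slot)]
```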
Operation
[0031] In one embodiment, each neighboring node pair has a pair of
connections: one for control and one for data. It is generally
undesirable to have data transfer delay any control message. It is
therefore helpful that control traffic be independent of data
traffic, because a node may need to quickly request data from a
neighboring node (e.g., because it is not able to get the data from
its originally selected neighboring node). To accommodate this in
one implementation, a node can open a pair of TCP connections and
multiplex control packets onto one connection, and data packets
onto the other. In this implementation, data and control packets
may be out-of-order with respect to one another (though still
in-order with respect to packets of the same type).
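One way to realize the control/data connection pair is to open two TCP sockets per neighbor, as sketched below. The port layout and the TCP_NODELAY option are assumptions for illustration, not details from the application:

```python
import socket

def open_peer_connections(host: str, control_port: int, data_port: int):
    """Open the control/data TCP connection pair to a neighboring node.

    Keeping control traffic on its own connection means a large data
    transfer in flight cannot delay a small, urgent control message
    (e.g., a re-request after a neighbor fails to deliver a block).
    """
    control = socket.create_connection((host, control_port))
    # Disable Nagle's algorithm so small control packets go out immediately.
    control.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    data = socket.create_connection((host, data_port))
    return control, data
```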
[0032] In one embodiment, the peer-to-peer sharing protocol
includes (a) exchanging data availability with a set of nodes, (b)
retrieving locally unavailable data from one or more nodes, and (c)
supplying locally available data to nodes, as will be described in
further detail below.
[0033] For each data block received by the server 110, the server
110 selects one or more "partner" nodes to which to distribute the
data block. For live streaming, each node should receive the data
block by a "playback deadline" such that the nodes can maintain a
continuous output stream of the data. For example, for video data,
the playback deadline refers to the point in time when a node
providing a live output of the video is scheduled to output a
particular time-localized block of video. Thus, distribution of
data is prioritized with the goal of attempting to ensure that each
node in the network receives data blocks by their respective
playback deadlines.
[0034] Although nodes attempt to receive each data block prior to
their respective playback deadline, the nodes do not necessarily
receive the data blocks in the exact order that they appear in the
streaming data. Rather, a node may receive a data block
corresponding to a number of different time slots within a
window of time past the current playback deadline. An example is
illustrated in FIG. 2. Streaming data 130 comprises a sequence of
data blocks with each data block corresponding to a particular time
slot. The server injection point 255 is a time slot corresponding
to the data block that the server 110 is currently pushing out to
the peer-to-peer network. Thus, in the illustrated example, the
server 110 has already pushed all data blocks up to the data block in
time slot t+W corresponding to the server injection point 255. Over
time, the server injection point 255 advances 256 as the server
continuously receives and outputs the streaming data. Received data
250 illustrates the data blocks already received by a node A. The
playback deadline 253 indicates the time slot corresponding to the
data block currently being output by the node A. The playback
deadline 253 advances 257 over time as node A continuously outputs
the received data blocks. Thus, to enable node A to continuously
output the streaming data (e.g., a streaming video), node A
attempts to ensure that it receives a data block before the
playback deadline 253 advances to the time slot corresponding to that
data block. Node A does not necessarily receive the data blocks in
the original data stream order. Thus, in the illustrated example,
node A has received data blocks corresponding to time slots t+1,
t+2, t+4, and t+7, but is still missing the remaining data blocks
in the window 259 between the current playback deadline 253 and the
server injection point 255. Node A attempts to ensure that it
receives each of these data blocks before the playback deadline 253
advances to the corresponding time slots. In one embodiment, nodes
cease sharing of data blocks once the playback deadline 253 has
passed the time slot corresponding to those data blocks. Thus, the
data blocks that are being shared in the peer-to-peer network at
any given moment correspond to the data blocks within a current
share window 259 between the current playback deadline 253 (at time
t) and the server injection point 255 (at time t+W).
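The share-window constraint described above can be sketched as follows. This is an illustrative sketch only; the function and variable names are assumptions, as the application does not specify an implementation.

```python
def shareable_slots(playback_deadline, injection_point, received):
    """Return the time slots a node still needs within the share window.

    Blocks at slots <= the playback deadline are no longer shared;
    blocks beyond the server injection point do not exist yet.
    """
    window = range(playback_deadline + 1, injection_point + 1)
    return {slot for slot in window if slot not in received}

# Example matching FIG. 2: node A holds t+1, t+2, t+4, and t+7
# out of a share window running from t+1 to t+W.
t, W = 100, 10
received = {t + 1, t + 2, t + 4, t + 7}
missing = shareable_slots(t, t + W, received)
```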
[0035] In one embodiment, each node maintains an "incoming queue"
and a "commit queue." The incoming queue identifies a set of data
blocks that the node expects to receive within a time window (e.g.,
the next N time slots in the data stream). The commit queue
identifies a set of data blocks the node expects to transmit to
other nodes within the time window. In one embodiment, the commit
queue can be implemented to impose a rate limit on transmission of
the data blocks. Specifically, an interval can be imposed between
transmissions of data blocks so that they are not sent out
back-to-back. The interval
imposed allows capacity to be reserved on the network link to allow
other data to go through. This rate control can be implemented at
the application level.
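The interval-based rate limiting of the commit queue might be sketched as below. This is a minimal illustration at the application level; the class and method names are assumptions, not part of the application.

```python
from collections import deque

class CommitQueue:
    """Sketch of a commit queue that spaces out block transmissions.

    `interval` seconds are imposed between sends so that data blocks
    are not pushed back-to-back, reserving link capacity for other
    traffic on the network link.
    """

    def __init__(self, interval):
        self.interval = interval
        self.queue = deque()
        self._last_send = 0.0

    def commit(self, block_id):
        """Add a block the node has agreed to transmit."""
        self.queue.append(block_id)

    def next_block(self, now):
        """Return the next block to transmit, or None if rate-limited."""
        if not self.queue or now - self._last_send < self.interval:
            return None
        self._last_send = now
        return self.queue.popleft()
```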
[0036] FIG. 3 illustrates an example of a message passing protocol
between two neighboring nodes in the peer-to-peer network (e.g.,
node 120-A and node 120-B). In the illustrated example, node 120-A
acts as a requesting node (also referred to herein as a "child
node") with respect to a particular data block, and node 120-B acts
as a transmitting node (also referred to herein as a "parent node").
However, each node can perform the functions attributed to both
node 120-A and node 120-B, i.e., any node can act as either a
parent node or a child node with respect to different data blocks.
Furthermore, although the illustrated example only shows two nodes,
similar message passing may occur concurrently between a node and
any of its neighboring nodes.
[0037] Initially, both nodes 120-A, 120-B optionally receive a
PARAMETERIZATION message (not shown) specifying various parameters
that will be used in the communication protocol. For example, in
one embodiment, the PARAMETERIZATION message includes: (a) a data
availability broadcast period (e.g., 0-10 min.) (in one embodiment,
corresponding to the size of window 259) and (b) a data block length
corresponding to the size of the data block (e.g., 0-5 min.). The
PARAMETERIZATION messages are optionally sent by the server 110
prior to data sharing between the nodes.
[0038] Node 120-B periodically sends a DATA AVAILABILITY BROADCAST
message 301 to its neighboring nodes (including node 120-A). The
message 301 may be sent, for example, once per data availability
broadcast period. This message identifies the data blocks that node
120-B has available for sharing. The DATA AVAILABILITY BROADCAST
message 301 may include, for example, (a) the number of free time
slots node 120-B has available for transmission over the next time
window (e.g., the next N time slots with each time slot
corresponding to a data block), (b) the earliest free time slot
node 120-B has, and (c) whether node 120-B has a specified data
block and whether it has been scheduled for transmission. Optionally, the
node 120-B may communicate a measurement of its current or average
bandwidth and transmission delay. In one embodiment, a node
maintains a data block availability structure corresponding to the
window of blocks it has available for sharing with other nodes. For
example, in one embodiment, the data block availability structure
comprises a maximum of X data blocks, where X is a fixed integer
value or a customizable parameter.
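The contents of the DATA AVAILABILITY BROADCAST message 301 listed above might be laid out as in the following sketch. The field names are assumptions; the application lists the message contents but not a wire format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DataAvailabilityBroadcast:
    """Illustrative layout of the DATA AVAILABILITY BROADCAST message 301."""
    free_slots: int                # (a) free transmission slots in next window
    earliest_free_slot: int        # (b) earliest free time slot
    # (c) maps block_id -> whether that block is already scheduled
    available_blocks: Dict[int, bool] = field(default_factory=dict)
    bandwidth_estimate: Optional[float] = None  # optional measurement
    delay_estimate: Optional[float] = None      # optional measurement

msg = DataAvailabilityBroadcast(
    free_slots=3, earliest_free_slot=105,
    available_blocks={101: False, 102: True})
```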
[0039] Upon receiving a DATA AVAILABILITY BROADCAST message 301,
node 120-A determines 307 which, if any, of the available data
blocks it wants to request from node 120-B. Assuming node 120-A
wants to request one of the data blocks from node 120-B, node 120-A
requests the desired data block by sending a DATA REQUEST message
303 to node 120-B. In one embodiment, in order to alleviate
potential problems associated with upload bandwidth hogging, the
data protocol may limit a data request to at most one data block
from a neighboring node per request (or similarly, some small
number). This limitation may reduce the worst case delay of a data
block reaching every node in the network. Optionally, the protocol
may also prevent a node from requesting a data block that is too
old. For example, in one embodiment, a data block is considered too
old if it falls into the first N entries of a data map for some
parameter N. A data block may also be considered too old if the
node would not be able to receive the block before the current
playback point reaches the corresponding time slot for the
block.
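The second "too old" test described above can be sketched as a simple timing check. The names and the timing model (one slot per `slot_duration` seconds) are assumptions for illustration.

```python
def is_too_old(block_slot, playback_deadline, transfer_time, slot_duration):
    """Sketch of the 'too old' test: a block is too old if it could
    not arrive before the playback deadline advances to its slot.

    block_slot / playback_deadline are slot indices; transfer_time is
    the estimated seconds to receive the block.
    """
    seconds_until_needed = (block_slot - playback_deadline) * slot_duration
    return seconds_until_needed <= transfer_time
```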
[0040] When a node first joins the peer-to-peer network, the new
node may be limited to requesting only data blocks beyond the server
injection point 255 plus a configured advance window (e.g., an
additional N data blocks). This limitation may help prevent a node
from permanently falling behind trying to catch up with old data blocks.
In another embodiment, when a node has just joined the group, the
node does not move the window 259 ahead for B seconds, where B can
be, for example, 0-300 seconds. B can optionally be chosen by the
consumer of the data, which could be a video player in the video
streaming scenario.
[0041] In one embodiment, a node 120-A may receive DATA
AVAILABILITY BROADCAST messages 301 from a plurality of neighboring
nodes, and these messages may specify one or more of the same data
blocks needed by node 120-A. If node 120-A has a choice of
neighboring nodes from which to request a particular desired block,
node 120-A may first generate a candidate list for each desired
block. The list can be partitioned into different groups based on
previous responses from these neighboring nodes. Each group can be
further sorted using a number of factors, including bandwidth
available, observed performance, network distance, time of previous
rejection or acceptance, etc. The node 120-A may then use these
various factors to determine the neighboring node from which to
request the desired data block.
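The grouping and sorting of the candidate list described above might look like the following. The keys used ('accepted_before' as the grouping criterion, bandwidth and network distance as sort factors) are assumptions chosen from the factors the application enumerates.

```python
def rank_candidates(candidates):
    """Sketch of parent selection for a desired data block.

    `candidates` is a list of dicts. Nodes that previously accepted
    requests are grouped first; within each group, higher bandwidth
    and lower network distance rank earlier.
    """
    return sorted(candidates,
                  key=lambda c: (not c['accepted_before'],  # accepters first
                                 -c['bandwidth'],           # high bandwidth first
                                 c['distance']))            # near nodes first
```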
[0042] For rare data blocks that are not available from most
neighboring nodes, a node may be less likely to request the data
block (i.e., requests for rare data blocks are issued less often).
In one embodiment, a rare data block comprises a data block
available from less than a threshold number or threshold percentage
of neighboring nodes. In one embodiment, the decision to issue a
request for such a block can be determined probabilistically to
achieve a higher percentage of successful requests. In an
alternative embodiment, an entirely different approach may be used
in which a node pursues a rare data block more aggressively (i.e.,
requests it more frequently) than data blocks that are more widely
available.
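One possible probabilistic policy for rare blocks is sketched below. The rarity threshold and the probability model (request probability proportional to availability) are assumptions; the application leaves both open.

```python
import random

def should_request_rare_block(holders, neighbors,
                              rare_threshold=0.25, rng=random.random):
    """Sketch of the probabilistic request policy for rare blocks.

    If fewer than `rare_threshold` of neighbors hold the block,
    request it only with probability equal to its availability, so
    requests for rare blocks are issued less often.
    """
    availability = holders / neighbors
    if availability >= rare_threshold:
        return True               # common block: always request
    return rng() < availability   # rare block: request less often
```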
[0043] Upon receiving a request for data via a DATA REQUEST message
303, a node 120-B determines 309 whether or not to accept the request.
In one embodiment, a node can collectively commit at most a first
threshold X data blocks concurrently to all of its neighboring
nodes, where X is a parameterized constant equal to or less than
the window size of the data block availability data structure.
Furthermore, in one embodiment, a node may commit at most a second
threshold N total copies of the same data block to its neighboring
nodes over the lifetime of the data block, where N is some
parameterized constant. Thus once a node has committed N copies of
a data block to neighboring nodes, it will no longer accept
requests for that data block. This limitation provides a more even
fanout distribution across delivery trees of different data blocks.
In one embodiment, a node estimates whether a data
block can be transmitted to and received by another node by the
playback deadline 253 of the block before committing to transfer
the block. Furthermore, in deciding whether a data block request can
be accepted, a node may first check for any potential
conflict, for example, whether accepting a new request for a data block
would result in any of the data blocks already in the transmit queue
missing their playback deadlines.
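The accept/reject checks of paragraph [0043] can be sketched as follows. The function signature is an assumption, and the conflict check against other queued blocks' deadlines is omitted for brevity.

```python
def can_accept(block_id, commit_queue, copies_sent,
               max_concurrent, max_copies,
               estimated_arrival, deadline):
    """Sketch of the accept decision for a DATA REQUEST.

    Rejects if the node is already committed to `max_concurrent`
    blocks (threshold X), has sent `max_copies` copies of this block
    (threshold N), or could not deliver before the playback deadline.
    """
    if len(commit_queue) >= max_concurrent:
        return False
    if copies_sent.get(block_id, 0) >= max_copies:
        return False
    return estimated_arrival < deadline
```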
[0044] If node 120-B accepts a request, node 120-B sends a DATA
RESPONSE message 305 to the requesting node 120-A indicating which
data block(s) it approved for transmission. For accepted requests,
node 120-B adds the data block(s) to its commit queue and the
requesting node 120-A adds the data block(s) to its incoming queue.
Optionally, data blocks can be stamped by nodes each time they are
transmitted from one node to another, for example, by incrementing
a relay count tracking the number of times the block has been
transmitted. This information can be used to maximize
performance.
[0045] If node 120-B determines to reject node 120-A's request
for a data block, the requesting node 120-A may wait for a period
of time or immediately try to request the data block from another
node. If node 120-A cannot obtain the data block from any of its
neighboring nodes, it may request the data block directly from the
server 110.
[0046] In one embodiment, the server 110 can pre-burst data to
newly joined nodes (i.e., for the most recent N data blocks of the
streaming data 130 for some parameter N). The amount of data to be
pre-bursted can be adjusted based on the particular stream or
configuration. Furthermore, nodes can request missing data from the
server 110 at a later time if they are unable to obtain the data
from other nodes. For example, in one embodiment, a node determines
to request a data block from the server 110 when it would not
otherwise be able to obtain the data block in time to meet the
playback deadline. Such requests can be batched for a number of
blocks. A server 110 may also implement logic to prevent a node
from requesting too much data in this way.
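The batched server fallback of paragraph [0046] might be sketched as below. The names and the policy details (per-request batch limit, per-slot peer delivery estimates) are assumptions for illustration.

```python
def blocks_to_fetch_from_server(missing, playback_deadline,
                                peer_eta, slot_duration, batch_limit=8):
    """Sketch of batched server fallback.

    A block is requested from the server when no peer can deliver it
    before its deadline; requests are batched up to `batch_limit` to
    limit the load any one node places on the server.
    """
    urgent = []
    for slot in sorted(missing):
        deadline_in = (slot - playback_deadline) * slot_duration
        eta = peer_eta.get(slot, float('inf'))  # inf: no peer offers it
        if eta >= deadline_in:
            urgent.append(slot)
    return urgent[:batch_limit]
```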
[0047] Optionally, data blocks in the transmit queue can be sorted
based on a number of criteria such as, for example, their sequence
in the original data stream, deadline of delivery, urgency of
delivery, etc. Optionally, commitments may be prioritized based on
tree level, which is lowest at the server and incremented each time
the data block is delivered to the next level.
[0048] In the video streaming scenario, the following parameters
can be adjusted to balance data sharing efficiency
and video startup latency while maintaining the same playback point
among nodes: pod size, size of the data blocks, and number of
blocks in the window of active sharing among nodes. This scheme is
flexible to support different tradeoffs between peer-to-peer
sharing efficiency, small startup latency, and synchronized
playback.
Analysis
1. Modeling
[0049] The distribution of an individual data block can be modeled
using a distribution tree structure as illustrated in FIG. 4. For
any given data block, a distribution tree 400 illustrates the flow
of the data block between interconnected nodes. In the illustrated
example, a root node 401 receives a data block 403. The root node
401 may correspond to the server 110. The root node distributes the
data block 403 to one or more first level nodes 405, which may be
referred to herein as "partners" of the root node 401 with respect
to the particular data block 403. The first level nodes 405 then
distribute the data block 403 to one or more second level nodes
407, and so on for any number of levels.
[0050] Each data block may flow through the nodes in an entirely
different manner. Thus, for a window size of W data blocks, there
will be W such distribution trees, with each tree corresponding to
one of the data blocks. For a given data block, a physical node may
appear in a distribution tree multiple times. This occurs, for
example, if the node receives the data block from two or more
different neighboring nodes. To distinguish between a physical node
and a point in the distribution tree, the points in the
distribution tree are referred to herein as "s-nodes" while
physically separate nodes are referred to as "nodes." Thus, a node
may appear multiple times in a given distribution tree and may
therefore correspond to multiple s-nodes in the same tree.
[0051] The distribution tree may change for each data block
distributed by the server 110. Thus, a node does not always receive
data blocks from the same nodes and a node (or server) does not
always distribute data blocks to the same nodes. Thus, for each
data block distributed from the server, a distribution tree may
arise that looks totally different than the distribution tree for
previous or subsequent data blocks. As a result, a node may be
located at different levels in distribution trees of different data
blocks. A node affects more children in a tree where it is closer
to the server than in a tree where it is closer to the bottom.
Also, queuing delay introduced by a node adds to the overall delay
experienced by the leaves (nodes that only receive but do not
transmit the data block).
2. Tree Depth, Overlay Radius, and Coverage Ratio
[0052] Tree Depth, or Overlay Radius, for a particular group size
directly affects delay and robustness. Related to this is Coverage
Ratio, which is the percentage of nodes covered within a certain
depth.
2.1 Proof
[0053] In the distribution tree, the root node is at level 0. As
previously discussed, nodes may appear multiple times at different
levels or within the same level. In the following discussion, nodes
are numbered based on their appearances in the breadth-first
search.
[0054] For an s-node t, the identifier, or index, assigned to
it is denoted $P_t$. The root node's identifier is 1, the identifier
of the root node's first (e.g., leftmost) child is 2, and so on.
For simplicity, a homogeneous link bandwidth condition may be
assumed for all nodes. For an individual node, the depth of its
first appearance in the distribution tree is an important factor.
The expected tree depth of a group with N nodes is thus the average
tree depth of all N nodes.
[0055] An auxiliary function $\delta(t)$ is defined as:

$$\delta(t) = \begin{cases} 1, & \text{if } P_t \neq P_{t'} \text{ for all } 0 < t' < t \\ 0, & \text{otherwise} \end{cases}$$

[0056] In other words, $\delta(t)$ is 1 if and only if the s-node t
corresponds to the first appearance of its node in the tree.
[0057] Another auxiliary function is defined as:
[0058] f(t)=total number of unique nodes associated with s-nodes 1
through t.
[0059] Because membership and partnership are formed randomly among
all nodes, the probability of an s-node t being the first
appearance of a node is given by:
$$\Pr[\delta(t) = 1] = \frac{N - f(t-1)}{N} \qquad (1)$$
[0060] Note that:
$$f(t) - f(t-1) = \delta(t) \qquad (2)$$
[0061] Taking expectations of (1) and (2), yields:
$$E[f(t) - f(t-1)] = E[\delta(t)] = \frac{N - E[f(t-1)]}{N}$$
[0062] Thus:
$$E[f(t)] = E[f(t-1)] + \frac{N - E[f(t-1)]}{N}, \qquad E[f(t)] = 1 + \frac{N-1}{N}E[f(t-1)] \qquad (3)$$
[0063] which gives the iteration for the expected number of unique
nodes from one s-node to the next. Note that:
$$f(1) = 1 \qquad (4)$$
[0064] Also, note that:
$$E[f(2)] = 1 + \frac{N-1}{N} = N\left(\frac{2N-1}{N^2}\right) = N\left(1 - \left(\frac{N-1}{N}\right)^2\right) \qquad (5)$$
[0065] This forms the iteration base (t=2). Assume (6) holds up to
t-1:

$$E[f(t-1)] = N\left[1 - \left(\frac{N-1}{N}\right)^{t-1}\right] \qquad (6)$$
[0066] Thus:

$$E[f(t)] = 1 + \frac{N-1}{N} \cdot N\left(1 - \left(\frac{N-1}{N}\right)^{t-1}\right) = 1 + (N-1)\left(1 - \left(\frac{N-1}{N}\right)^{t-1}\right) = N - \frac{(N-1)^t}{N^{t-1}} = N\left(1 - \left(\frac{N-1}{N}\right)^t\right)$$
[0067] Thus:

$$E[f(t)] = N\left(1 - \left(\frac{N-1}{N}\right)^t\right) \qquad (7)$$
[0068] Note that:

$$\left(\frac{N-1}{N}\right)^t < e^{-t/N} \qquad (8)$$
[0069] Thus:

$$E[f(t)] > N\left(1 - e^{-t/N}\right) \qquad (9)$$
$t_k$ denotes the identifier of the last s-node at level
k. The number of new unique nodes at level k, but not at levels
0 through (k-1), is then:

$$f(t_k) - f(t_{k-1}) \qquad (10)$$
[0071] The expected distance (depth) d of all nodes is then:

$$d = \frac{1}{N}\sum_{k=1}^{\infty} k \cdot E[f(t_k) - f(t_{k-1})] \qquad (11)$$
[0072] Note:

$$\lim_{k \to \infty} E[f(t_k)] = N \qquad (12)$$
[0073] Thus:

$$\lim_{k \to \infty} k\left(1 - \frac{E[f(t_k)]}{N}\right) = 0 \qquad (13)$$
[0074] From (11), taking N into the summation yields:

$$d = \sum_{k=1}^{\infty} k\left(\frac{E[f(t_k)]}{N} - \frac{E[f(t_{k-1})]}{N}\right) = \sum_{k=1}^{\infty} k\left(\left(1 - \frac{E[f(t_{k-1})]}{N}\right) - \left(1 - \frac{E[f(t_k)]}{N}\right)\right) \qquad (14)$$
[0075] Another way to reach (14) is by noting the following
equivalent of (10) in expectation:

$$\frac{E[f(t_k)] - E[f(t_{k-1})]}{N} = \frac{N - E[f(t_{k-1})]}{N} - \frac{N - E[f(t_k)]}{N} = \left(1 - \frac{E[f(t_{k-1})]}{N}\right) - \left(1 - \frac{E[f(t_k)]}{N}\right) \qquad (15)$$

This is because the probability of a unique new node appearing after
level (k-1) is:

$$\frac{N - E[f(t_{k-1})]}{N} \qquad (16)$$

and the probability of a unique new node appearing after level k
is:

$$\frac{N - E[f(t_k)]}{N} \qquad (17)$$
[0076] Eq. (14) may be solved by expanding each term in the
summation. Expansion of the term with k = i "offsets" (i-1) copies
of the right-hand part of the expansion of the term with k = i-1, so
the sum telescopes (using (13) for the vanishing limit term):

$$d = \sum_{k=1}^{\infty} k\left(\left(1 - \frac{E[f(t_{k-1})]}{N}\right) - \left(1 - \frac{E[f(t_k)]}{N}\right)\right) = \sum_{k=0}^{\infty}\left(1 - \frac{E[f(t_k)]}{N}\right) \qquad (18)$$
[0077] Combining (7) and (18) yields:

$$d = \sum_{k=0}^{\infty}\left(1 - \frac{N\left(1 - \left(\frac{N-1}{N}\right)^{t_k}\right)}{N}\right) = \sum_{k=0}^{\infty}\left(\frac{N-1}{N}\right)^{t_k}$$
[0078] Then, combining (7), (8) and (18) yields:

$$d < \sum_{k=0}^{\infty} e^{-t_k/N} \qquad (19)$$
[0079] Before solving this summation, note that:

$$t_k = 1 + M + M(M-1) + M(M-1)^2 + \dots + M(M-1)^{k-1} = 1 + M\sum_{i=0}^{k-1}(M-1)^i = 1 + M\,\frac{(M-1)^k - 1}{M-2} = \frac{M(M-1)^k - 2}{M-2}$$

[0080] Thus:

$$t_k = \frac{M(M-1)^k - 2}{M-2} \qquad (20)$$
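The closed form (20) can be checked numerically against its defining sum. This is a small verification sketch; $t_k$ counts s-nodes through level k with fanout M at the root and M-1 at inner nodes, as in the derivation above.

```python
def t_k_sum(M, k):
    """t_k as the explicit sum: 1 + M * sum_{i=0}^{k-1} (M-1)^i."""
    return 1 + M * sum((M - 1) ** i for i in range(k))

def t_k_closed(M, k):
    """Closed form (20): (M(M-1)^k - 2) / (M - 2), for M > 2.

    The numerator is always divisible by M-2, so integer division
    is exact.
    """
    return (M * (M - 1) ** k - 2) // (M - 2)
```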
[0081] Dividing (19) into two parts, the first part from k=0 to
$k=\log_{M-1} N$, the second part from $k=\log_{M-1} N + 1$ to
$k=\infty$:

$$d < \sum_{k=0}^{\log_{M-1} N} e^{-\frac{M(M-1)^k - 2}{(M-2)N}} + \sum_{k=\log_{M-1} N + 1}^{\infty} e^{-\frac{M(M-1)^k - 2}{(M-2)N}} \qquad (21)$$

$$< \log_{M-1} N + 1 + \sum_{k=0}^{\infty} e^{-\frac{MN(M-1)^k - 2}{(M-2)N}} \qquad (22)$$
[0082] Note that:

$$\frac{MN(M-1)}{(M-2)N} > 1$$
[0083] So:

$$d \leq \log_{M-1} N + 1 + \sum_{k=0}^{\infty} e^{-(M-1)^k} \qquad (23)$$
[0084] For $M \geq 3$:

$$(M-1)^k \geq (M-1)k \qquad (24)$$
[0085] Thus:

$$e^{-(M-1)^k} \leq e^{-(M-1)k} \qquad (25)$$
[0086] So:

$$d < \log_{M-1} N + 1 + \sum_{k=0}^{\infty} e^{-(M-1)k} \qquad (26)$$
$$= \log_{M-1} N + 1 + \frac{1}{1 - e^{-(M-1)}} \qquad (27)$$
$$< \log_{M-1} N + 3 \qquad (28)$$
2.2 Tree Depth/Overlay Radius.
[0087] (28) shows that the average distance of a node from the root
node is bounded by O(log(N)):
$$d = O(\log N) \qquad (29)$$
2.3 Coverage Ratio.
[0088] From (7), (9) and (20), we also have:

$$\text{coverage ratio at level } k = \frac{E[f(t_k)]}{N} > \frac{N\left(1 - e^{-t_k/N}\right)}{N} = 1 - e^{-t_k/N} = 1 - e^{-\frac{M(M-1)^k - 2}{(M-2)N}} \qquad (30)$$
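The coverage bound (30) is easy to evaluate numerically. A small sketch, assuming M > 2:

```python
import math

def coverage_ratio_bound(M, N, k):
    """Lower bound (30) on the fraction of nodes covered by level k."""
    t_k = (M * (M - 1) ** k - 2) / (M - 2)
    return 1 - math.exp(-t_k / N)
```

For example, with M=4 and N=1000, coverage rises steeply with depth, consistent with the O(log N) radius result above.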
2.4 Notes
[0089] A heterogeneous environment, where different nodes have
different uplink bandwidth, affects the fanout of each node. The
larger fanout of a node with a fat upload link offsets, to a degree,
the smaller fanout of another node. This does not affect the
analysis herein.
[0090] Nodes that cannot support any upload, because of either a
narrow uplink or a firewall, can only serve as leaves. This does not
affect the conclusion above for a single tree. Its impact on the
multi-tree analysis is described below.
3. Multi-Tree in a Window
[0091] The section above examines characteristics of the tree for
one data block. Next, the relation between trees within the same
window is discussed. Specifically, the following discussion
illustrates: (a) whether there is enough bandwidth to support the
streaming needs of all nodes while the uplink bandwidth constraint
of each node is met; (b) what requirements an individual node should
meet in allocating its uplink bandwidth in order to achieve optimal
performance for the group collectively; and (c) what the root node
can do in selecting its immediate partners and distributing data
across the partners to provide the best performance possible for the
group.
[0092] The tree can be modeled as described below. At any given
point, taking a snapshot of the random graph and following the flow
of data, one distribution tree is obtained for each data block. For
a window size of W data blocks, there will be W such trees, each
tree corresponding to one data block. A tree here is different from
a regular tree in that a node may appear in the tree multiple times
because the nodes form a graph (i.e., a node may receive the same
data block from more than one neighboring node). This type of tree
is referred to herein as a "Tree with Redundant Nodes", or simply a
"tree" hereinafter, unless specified otherwise.
[0093] The trees are numbered from 1 to W and each tree is denoted
as T.sub.1, T.sub.2, T.sub.3, . . . T.sub.i, . . . T.sub.w.
3.1 Partner Selection and Data Block Scheduling Policy at Root
[0094] Various policies may be used to determine how the root node
should select its partners for the initial injection of data blocks
into the network. In the very simple case, where the root node
selects only one node, A, as its sole partner, A will receive all
data blocks within the window and will further deliver all the data
blocks to other nodes. For any tree T.sub.i, fanout at level 0 is
1. Also, due to A's uplink bandwidth limit, A can support at most
one child in each tree for a maximum of W children across W trees.
Thus, the fanout is 1 at level 1 for A in any tree T.sub.i. This
means that every tree actually would reduce to a line.
[0095] In a second case, the root node selects M peers, P.sub.1,
P.sub.2, . . . , P.sub.M, as its partners (M>1). Furthermore,
the root node sends out exactly one copy of each data block to one
of these partners. In alternative embodiments, the root node can
send out multiple copies of the same data block to multiple
partners. This may add a level of robustness, but generally does
not affect the major performance characteristics. Different
policies may be applied at the root node for this distribution. For
example, the root node could distribute the data blocks to its
partners in a round-robin fashion, one data block at a time per
partner. Alternatively, the root node may send W/M consecutive data
blocks to one partner, then move on to the next partner. For the
description below, it is assumed that the round-robin approach is
used.
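The round-robin injection policy assumed for the analysis below can be sketched in a few lines. The function name and representation are illustrative only.

```python
def round_robin_assignment(window, partners):
    """Sketch of the root's round-robin injection policy.

    Block i of the window goes to partner ((i-1) mod M) + 1, one
    block per partner per turn, so P_1 receives blocks
    1, M+1, 2M+1, ..., W-M+1 (its "tree group").
    """
    M = len(partners)
    return {i: partners[(i - 1) % M] for i in range(1, window + 1)}
```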
3.2 Bandwidth Utilization
[0096] In the second case of section 3.1 above (where the root node
selects M peers, P.sub.1, P.sub.2, . . . , P.sub.M, as its
partners), fanout from the root node for any tree T.sub.i is 1.
Fanout from level 1 is (M-1). P.sub.1 would appear in W/M trees
corresponding to T.sub.1, T.sub.M+1, T.sub.2M+1, . . . ,
T.sub.W-M+1. This group of trees is referred to herein as the Tree
Group of P.sub.1. P.sub.1 would then utilize
$$\frac{(M-1)W}{M}$$

units worth of its upload bandwidth in total to distribute these
data blocks, with W/M units worth of bandwidth left (a node can
support a maximum of W units of upload bandwidth). This means that in
other trees, $P_1$ would mostly be located at the leaf level, not
contributing much upload. Further, note that this applies to
children of $P_1$ in these trees as well, i.e., those nodes are
inner nodes and contribute most of their uplink bandwidth in these
trees. They will be located at the leaf level in most other trees.
This is generally feasible, though, given the ratio of inner nodes
to leaf nodes in an (M-1)-way tree. For a tree with N nodes, leaf
nodes account for a

$$\frac{M-1}{M}$$

portion of all nodes, while inner nodes take up 1/M (fanout at
level 0 is M; that does not change the portion in a material
way).
[0097] In trees T.sub.1, T.sub.M+1, T.sub.2M+1, . . . , T.sub.W-M+1
or the tree group of P.sub.1, P.sub.1's children could vary,
depending on the dynamics of the partnership formation. If M is
small, (e.g., 4), P.sub.1's children in those trees could well be
the same. These children of P.sub.1 would almost use up their
uplink bandwidth in P.sub.1's tree group. Each of them has W/M
units worth of bandwidth left, for a total of
$$\frac{W}{M}(M-1)$$
units worth of bandwidth for other tree groups. This is the total
download bandwidth P.sub.1 would need in the other (M-1) tree
groups. The overall bandwidth consumption is balanced at the
highest level, since each node is contributing at least as much as
it uses.
3.3 Delay Within a Single Tree
[0098] Performance in peer-to-peer sharing may be measured by
various characteristics including, for example, how long it takes a
data block to be delivered to a leaf node, and how long it takes
for a node to accumulate all W data blocks in order to start
playback.
[0099] Within a single tree T.sub.i, the expected tree depth is
bounded by O(log N). An inner node also needs to support (M-1)
children. In the worst case, data block i would take
(M-1)*tree-depth to reach a leaf node, if all ancestors of this
leaf node are the last one in the transmission queue of their
corresponding parents. This delay is given by:
$$\Delta_{\text{one-tree}} = (M-1) \cdot \text{tree-depth} \leq (M-1) \cdot O(\log N)$$
3.4 Delay Within a Tree Group
[0100] Within a tree group, (e.g., the tree group of P.sub.1) data
blocks can be transferred down each tree in a "pipelined" fashion.
The timing and delay of delivery of those data blocks are
interdependent because they share a lot of common inner nodes. Each
inner node there "serializes" transfer of data blocks 1, 1+M, 1+2M,
. . . , W-M+1. Assuming a node always transfers data blocks in
order of data block ID number, there would be an (M-1)-second
shift in delivery time between consecutive trees in the group. This
shift, $\Delta_{\text{inter-tree}}$, is constrained as follows:

$$1 \leq \Delta_{\text{inter-tree}} \leq (M-1)$$
[0101] The last data block in the group, namely data block (W-M+1),
would arrive at a leaf node
$$(M-1)\left(\frac{W}{M} - 1\right)$$
seconds later than data block 1. If the root node starts sending
out data block 1 and data block (W-M+1) at time t, then the last
leaf node receiving data block 1 will receive it at time
t+(M-1)log(N), and the last node receiving data block (W-M+1) would
get that data block at time
$$t + (M-1)\log N + (M-1)\left(\frac{W}{M} - 1\right).$$
3.5 Delay Across Tree Groups
[0102] Across tree groups, the transfer can take place in parallel
to a large degree because there is very little overlap between
inner nodes in these tree groups. An inner node in one tree group
still contributes W/M worth of bandwidth in other tree groups. So
the extra shift delay across tree groups is W/M seconds. This
delay, $\Delta_{\text{inter-group}}$, is constrained as:

$$\Delta_{\text{inter-group}} \leq \frac{W}{M} \text{ seconds.}$$
3.6 Overall Delay, Buffering Time at Startup
[0103] The maximum overall shift delay caused by "serialization"
among all W trees is:
$$\Delta_{\text{inter-tree}} \times (\text{trees in one tree group}) + \Delta_{\text{inter-group}} = (M-1)\left(\frac{W}{M} - 1\right) + \frac{W}{M} = W - (M-1) \text{ seconds}$$
[0104] This means a peer can receive a first data block in at most
(M-1)log N seconds, and receive all the other (W-1) data blocks in
the window within W-(M-1) seconds thereafter. The buffering time
$\Delta_{\text{buffering}}$ needed by a newly joined node is:

$$\Delta_{\text{buffering}} = \Delta_{\text{one-tree}} + \Delta_{\text{inter-tree}} \times (\text{trees in one tree group}) + \Delta_{\text{inter-group}} \leq (M-1) \cdot O(\log N) + W - (M-1)$$
[0105] There are two cases. In case (A), for a node close to the
bottom of all trees, the (M-1)O(log N) seconds delay is largely
invisible to the end-user, because the differential in data block
arrival time across tree groups is not big. As a result, the
buffering time needed by a node is around W-(M-1) seconds.
Intuitively, a node tends to be located at about the same level
across trees and tree groups. For example, a newly joined node will
logically be located close to bottom of all trees. Thus, case (A)
is expected to be the majority of the cases. In case (B), for a
node that is at level 1 in a tree group and at leaf level in other
tree groups, this node could receive its first data block very
fast, then wait for an additional (M-1)O(log N)+W-(M-1) seconds to
receive the rest of the data blocks in the window.
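The buffering bound of section 3.6 can be evaluated directly. A small sketch, using the natural logarithm as a stand-in for the O(log N) tree-depth bound (the constant hidden in the O-notation is an assumption):

```python
import math

def buffering_time_bound(M, N, W):
    """Upper bound on startup buffering time from section 3.6:
    (M-1)*log(N) + W - (M-1) seconds.
    """
    return (M - 1) * math.log(N) + W - (M - 1)
```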
3.7 Ordering of Data Blocks Transfer by a Node
[0106] The order in which a node transfers its data blocks affects
overall performance of the peer-to-peer network. Suppose a node is
ready to transfer data blocks 1, 5, 7, and 8 to its partners. Under
different policies, the node may, for example, transfer the data
blocks based on the data block ID (i.e., lowest to highest with the
lowest ID corresponding to an earlier time slot in the streaming
data), or the node may transfer the data blocks in the order the
requests are received. From the analysis above, the exact order of
transfer does not affect the overall throughput. All data blocks
should be able to arrive within the W-second window, assuming there
are no transmission errors. Yet, since data block 1 would be needed
earlier than data blocks 5, 7 and 8, it appears reasonable to
always transfer 1 first, thus providing bigger headroom for its
delivery. Thus, in one embodiment, a preferred approach is to order
data block transfers simply based on their data block ID numbers.
This also suggests that the root node should perform round-robin
with one data block per unit between its M partners instead of
transferring W/M consecutive data blocks to a partner node. The
reason is that the arrival time would then favor those earlier data
blocks needed for playback.
4. Discontinuity
[0107] The following parameters may be used to derive the
probability of a node experiencing discontinuity: [0108]
$P_f$--the probability of node failure. A node leaving is also a form
of "node failure" and is accounted for in this. [0109]
$P_o$--the probability that a dependent node cannot find an
alternative supplier partner for a particular data block within
$\Delta t$ time; [0110] $P_s$--the probability that a partner can
support full-rate streaming for a dependent node once needed.
[0111] $P_d$--the probability that a node experiences
discontinuity. This is also the percentage of nodes in a group that
may suffer discontinuity.
[0112] $P_d$ is a summation, over each node failure case, of the
probability of not finding an alternative supplier node. The
probability of a particular set of exactly i failed partners is:

$$(1-P_f)^{M-i} P_f^i$$

[0113] The probability that none of the (M-i) non-failing partners
can be an alternative supplier node is:

$$(1-P_s)^{M-i}$$
[0114] If none of the existing partners can become the alternative
supplier, the probability of not finding a new partner that can
supply the data is P.sub.o. Thus:
P_d = P_o\left[\sum_{i=1}^{M}\binom{M}{i}(1-P_f)^{M-i}P_f^{i}(1-P_s)^{M-i}\right] ##EQU00036##
[0115] If the peer-to-peer sharing protocol can manage to keep
P.sub.o small and P.sub.s large, then P.sub.d can also be kept
small. P.sub.o and P.sub.s depend on the protocol operation.
P.sub.s can be reasonably high (e.g., 0.5). However, using P.sub.s
here could be a conservative estimate, because a node can retrieve
data from multiple partners collectively.
[0116] P.sub.o can indeed be very small, especially if a node keeps
a large cache containing many "backup" mode nodes. These can be
contacted immediately when existing partners cannot supply all the
data. In one embodiment, a P.sub.o of 10% is used.
[0117] The expected number of nodes suffering from discontinuity is
then:
N\cdot P_d = N\cdot P_o\left[\sum_{i=1}^{M}\binom{M}{i}(1-P_f)^{M-i}P_f^{i}(1-P_s)^{M-i}\right] ##EQU00037##
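The discontinuity formulas above can be evaluated numerically. The sketch below follows EQU00036 and EQU00037 directly; the example parameter values (P.sub.f=0.05, P.sub.s=0.5, P.sub.o=0.1, M=4) are illustrative assumptions, not values taken from the application.

```python
from math import comb

def discontinuity_probability(p_f, p_s, p_o, m):
    """P_d = P_o * sum_{i=1..M} C(M,i) (1-P_f)^(M-i) P_f^i (1-P_s)^(M-i):
    probability that a node experiences discontinuity, given node-failure
    probability p_f, partner full-rate-support probability p_s, and
    probability p_o of not finding a new supplier within delta-t."""
    total = sum(
        comb(m, i) * (1 - p_f) ** (m - i) * p_f ** i * (1 - p_s) ** (m - i)
        for i in range(1, m + 1)
    )
    return p_o * total

def expected_discontinuous_nodes(n, p_f, p_s, p_o, m):
    """Expected number of nodes in a group of N suffering discontinuity."""
    return n * discontinuity_probability(p_f, p_s, p_o, m)

# Illustrative parameters: P_f=0.05, P_s=0.5, P_o=0.1, M=4 partners.
p_d = discontinuity_probability(0.05, 0.5, 0.1, 4)
```

With these values P.sub.d works out to about 0.0025, so in a group of 1000 nodes only a few nodes would be expected to see discontinuity.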
System Architecture
[0118] FIG. 5 is a high-level block diagram illustrating an example
of a computing device 500 that could act as a node or a server 102
(or sub-server 106) on the peer-to-peer network 100. Illustrated
are at least one processor 502, an input controller 504, a network
adaptor 506, a graphics adaptor 508, a storage device 510, and a
memory 512. Other embodiments of the computer 500 may have
different architectures with additional or different components. In
some embodiments, one or more of the illustrated components are
omitted.
[0119] The storage device 510 is a computer-readable storage medium
such as a hard drive, compact disk read-only memory (CD-ROM), DVD,
or a solid-state memory device. The memory 512 stores instructions
and data used by the processor 502. The pointing device 526 is a
mouse, track ball, or other type of pointing device, and is used in
combination with the keyboard 524 to input data into the computer
system 500. The graphics adapter 508 outputs images and other
information for display by the display device 522. The network
adapter 506 couples the computer system 500 to a network 530.
[0120] The computer 500 is adapted to execute computer program
instructions for providing functionality described herein. In one
embodiment, program instructions are stored on the storage device
510, loaded into the memory 512, and executed by the processor 502
to carry out the processes described herein.
[0121] The types of computers 500 operating on the peer-to-peer
network can vary substantially. For example, a node comprising a
personal computer (PC) may include most or all of the components
illustrated in FIG. 5. Another node may comprise a mobile computing
device (e.g., a cell phone) which typically has limited processing
power, a small display 522, and might lack a pointing device 526. A
server 110 may comprise multiple processors 502 working together to
provide the functionality described herein and may lack an input
controller 504, keyboard 524, pointing device 526, graphics adapter
508 and display 522. In other embodiments, the nodes or the server
could comprise other types of electronic devices such as, for
example, a personal digital assistant (PDA), a mobile telephone, a
pager, a television "set-top box," etc.
[0122] The network 530 enables communications among the entities
connected to it (e.g., the nodes and the server). In one
embodiment, the network 530 is the Internet and uses standard
communications technologies and/or protocols. Thus, the network 530
can include links using a variety of known technologies, protocols,
and data formats. In addition, all or some of links can be
encrypted using conventional encryption technologies. In another
embodiment, the entities use custom and/or dedicated data
communications technologies.
[0123] Upon reading this disclosure, those of skill in the art will
appreciate still additional alternative designs for membership
management having the features described herein. Thus, while
particular embodiments and applications of the present invention
have been illustrated and described, it is to be understood that
the invention is not limited to the precise construction and
components disclosed herein and that various modifications, changes
and variations which will be apparent to those skilled in the art
may be made in the arrangement, operation and details of the method
and apparatus of the present invention disclosed herein without
departing from the spirit and scope of the invention as defined in
the appended claims.
* * * * *