U.S. patent application number 12/714480 was filed with the patent office on 2010-02-27 and published on 2010-09-02 for adaptive network with automatic scaling.
This patent application is currently assigned to YOTTAA INC. Invention is credited to COACH WEI.
United States Patent Application 20100220622
Kind Code: A1
Inventor: WEI; COACH
Publication Date: September 2, 2010
Application Number: 12/714480
Family ID: 42666263
ADAPTIVE NETWORK WITH AUTOMATIC SCALING
Abstract
A method for automatically scaling the processing capacity and
bandwidth capacity of a network includes providing a network
comprising a plurality of traffic processing units and a plurality
of network links. Next, providing monitoring means for monitoring
processing capacity demand and bandwidth capacity demand of the
network. Next, providing managing means for adding traffic
processing units to the network, removing traffic processing units
from the network, connecting links to the network and disconnecting
links from the network. Next, monitoring processing capacity demand
and bandwidth capacity demand of the network via the monitoring
means and then dynamically adjusting processing capacity of the
network by selectively adding or removing traffic processing units
in the network via the managing means upon observation of
processing capacity demand increase or processing capacity demand
decrease, respectively. The method also includes dynamically
adjusting bandwidth capacity of the network by selectively
connecting or disconnecting links in the network via the managing
means upon observation of bandwidth capacity demand increase or
bandwidth capacity decrease, respectively.
Inventors: WEI; COACH (CAMBRIDGE, MA)

Correspondence Address:
AKC PATENTS
215 GROVE ST.
NEWTON, MA 02466
US

Assignee: YOTTAA INC (CAMBRIDGE, MA)

Family ID: 42666263
Appl. No.: 12/714480
Filed: February 27, 2010
Related U.S. Patent Documents
Application Number    Filing Date     Patent Number
61156069              Feb 27, 2009
61165250              Mar 31, 2009
Current U.S. Class: 370/252; 370/468
Current CPC Class: H04L 41/0896 20130101; H04L 47/781 20130101; H04L 41/145 20130101; H04L 47/822 20130101
Class at Publication: 370/252; 370/468
International Class: H04L 12/26 20060101 H04L012/26; H04J 3/22 20060101 H04J003/22
Claims
1. A method for automatic scaling of processing capacity and
bandwidth capacity of a network comprising: providing a network
comprising a plurality of traffic processing units and a plurality
of network links; providing monitoring means for monitoring
processing capacity demand and bandwidth capacity demand of said
network; providing managing means for adding traffic processing
units to said network, removing traffic processing units from said
network, connecting links to said network and disconnecting links
from said network; monitoring processing capacity demand and
bandwidth capacity demand of said network via said monitoring
means; dynamically adjusting processing capacity of said network by
selectively adding or removing traffic processing units in said
network via said managing means upon observation of processing
capacity demand increase or processing capacity demand decrease,
respectively; dynamically adjusting bandwidth capacity of said
network by selectively connecting or disconnecting links in said
network via said managing means upon observation of bandwidth
capacity demand increase or bandwidth capacity decrease,
respectively.
2. The method of claim 1, wherein said traffic processing units
comprise virtual machines.
3. The method of claim 2, wherein said virtual machines comprise
virtual computing instances provided by commercial cloud computing
providers.
4. The method of claim 1, wherein said traffic processing units
comprise physical machines.
5. The method of claim 1, wherein said network comprises an overlay
network superimposed over an underlying network.
6. The method of claim 5, wherein said network links comprise
network links of said underlying network.
7. The method of claim 5, wherein said underlying network comprises
one of the Internet, WAN, wireless Network or a private
network.
8. The method of claim 1, wherein said traffic processing units are
distributed at different geographic locations.
9. The method of claim 1, wherein said traffic processing units are
added or removed via an Application Programming Interface
(API).
10. The method of claim 1 wherein said traffic processing units
comprise specially designed traffic processing hardware and general
purpose computers running specially designed traffic processing
software and wherein said traffic processing hardware comprise at
least one of router, switch, or hub.
11. A system for automatic scaling of processing capacity and
bandwidth capacity of a network comprising: a network comprising a
plurality of traffic processing units and a plurality of network
links; monitoring means for monitoring processing capacity demand
and bandwidth capacity demand of said network; managing means for
adding traffic processing units to said network, removing traffic
processing units from said network, connecting links to said
network and disconnecting links from said network; wherein said
monitoring means monitor processing capacity demand and bandwidth
capacity demand of said network and provide processing capacity
demand information and bandwidth capacity demand information to
said managing means; wherein said managing means dynamically adjust
said processing capacity of said network by selectively adding or
removing traffic processing units in said network upon receiving
information of processing capacity demand increase or processing
capacity demand decrease, respectively; and wherein said managing
means dynamically adjust bandwidth capacity of said network by
selectively connecting or disconnecting links in said network upon
receiving information of bandwidth capacity demand increase or
bandwidth capacity decrease, respectively.
12. The system of claim 11, wherein said traffic processing units
comprise virtual machines.
13. The system of claim 12, wherein said virtual machines comprise
virtual computing instances provided by commercial cloud computing
providers.
14. The system of claim 11, wherein said traffic processing units
comprise physical machines.
15. The system of claim 11, wherein said network comprises an
overlay network superimposed over an underlying network.
16. The system of claim 15, wherein said network links comprise
network links of said underlying network.
17. The system of claim 15, wherein said underlying network
comprises one of the Internet, WAN, wireless Network or a private
network.
18. The system of claim 11, wherein said traffic processing units
are distributed at different geographic locations.
19. The system of claim 11, wherein said traffic processing units
are added or removed via an Application Programming Interface
(API).
20. The system of claim 11, wherein said traffic processing units
comprise specially designed traffic processing hardware and general
purpose computers running specially designed traffic processing
software and wherein said traffic processing hardware comprise at
least one of router, switch, or hub.
Description
CROSS REFERENCE TO RELATED CO-PENDING APPLICATIONS
[0001] This application claims the benefit of U.S. provisional
application Ser. No. 61/156,069 filed on Feb. 27, 2009 and entitled
METHOD AND SYSTEM FOR COMPUTER CLOUD MANAGEMENT, which is commonly
assigned and the contents of which are expressly incorporated
herein by reference.
[0002] This application claims the benefit of U.S. provisional
application Ser. No. 61/165,250 filed on Mar. 31, 2009 and entitled
CLOUD ROUTING NETWORK FOR BETTER INTERNET PERFORMANCE, RELIABILITY
AND SECURITY, which is commonly assigned and the contents of which
are expressly incorporated herein by reference.
FIELD OF THE INVENTION
[0003] The present invention relates to network design and
management and in particular to a system and a method for an
adaptive network with automatic capacity scaling in response to
load demand changes.
BACKGROUND OF THE INVENTION
[0004] Networking changed the information technology industry by
enabling different computing systems to communicate, collaborate
and interact. There are many types of networks. The Internet is
probably the biggest network on earth. It connects millions of
computers all over the world. Wide Area Networks (WAN) are networks
that are typically used to connect the computer systems of a
corporation located in different geographies. Local Area Networks
(LAN) are networks that typically provide connectivity in an office
environment.
[0005] The purpose of a network is to enable communications between
the systems that are connected to the network by delivering
information from the source of the information to its destination.
In such a mission, the network itself needs to have sufficient
processing capacity and bandwidth capacity in order to perform
traffic delivery and various processing tasks including figuring
out an appropriate route for the traffic to travel through,
handling of errors and accidents and ensuring the necessary
security measures, among others.
[0006] A typical network includes two types of components: traffic
processing components and connectivity components. Traffic
processing components include the various types of networking
devices such as router, switch and hub, among others. The
connectivity components are typically called "links" that
interconnect two processing components or end points. There are
many ways to classify network links. Physical network links include
those via Ethernet cable, wireless connectivity, satellite
connectivity, optic fiber connections, dial-up phone line and so
on. Virtual network links refer to logic links formed between two
entities and may include many physical links as well as various
processing components along the way. The combination of the
processing capacity of the traffic processing components of a
network determines the network's processing capacity. The bandwidth
capacity of the various links together ultimately determines the
bandwidth capacity of a network.
[0007] FIG. 1 shows a typical network 90 with many traffic
processing components 105, 115, 125, 135 labeled as "router" as
well as many links 101, 111, 121, 131, 141, 151. Through this
network 90, traffic is sent from source 100 to destination 150.
When designing and managing a network, it is crucial to provision
sufficient capacity. When there is not enough capacity, problems ranging from slowness and congestion to packet loss and malfunction occur.
[0008] In the prior art, network design and management are based on
a fixed amount of capacity provisioned beforehand. One would
acquire all the necessary hardware and software components,
configure them, and then build connectivity between them. This
fixed infrastructure provides a fixed amount of capacity. The
problems of such approaches include high acquisition cost and
over-provisioning or under-provisioning of capacity. Acquiring all
the traffic processing components and setting up the links upfront
can be very expensive for a large-scale network. The cost to build
a large-scale network can range from millions of dollars to even
higher. An example is the Internet itself, which cost billions of dollars to build and into which millions of dollars are still being invested to improve its capacity. An important aspect of any network is that traffic demand varies. Peak demand can be several hundred percent of average demand, or even higher. In order to
meet the needs of peak demand, the capacity of the network has to
be over-provisioned. For example, a rule of thumb in designing a
network is to provision 3-5 times the capacity of its normal
demand. Such over-provisioning is necessary in order for the
network to function properly and to meet its service agreements.
However, normal bandwidth demand and processing demand are
significantly lower than peak demands. It is not unusual for a typical network's utilization rate to be only 20%. Thus a
significant portion of capacity is wasted. For large-scale
networks, such waste is significant and ranges from thousands of
dollars to millions of dollars or even higher. Further, such
over-provisioning creates a significant carbon footprint. Today's
telecommunication networks are responsible for 1% to 5% of the global carbon footprint, and this percentage has been rising rapidly due
to the rapid growth and adoption of information technology. FIG. 1A
shows the discrepancy for typical networks between the provisioned
capacity and actual capacity demand. Because prior art networks are
based on fixed capacity, service suffers when capacity demand
overwhelms the fixed capacity and waste occurs when demand is below
the provisioned capacity.
[0009] Thus there is an unfulfilled need for new approaches to build and manage a network that can eliminate the expensive upfront costs, reduce capacity waste, and improve utilization efficiency.
SUMMARY OF THE INVENTION
[0010] In general, in one aspect, the invention features a method
for automatically scaling the processing capacity and bandwidth
capacity of a network. The method includes providing a network
comprising a plurality of traffic processing units and a plurality
of network links. Next, providing monitoring means for monitoring
processing capacity demand and bandwidth capacity demand of the
network. Next, providing managing means for adding traffic
processing units to the network, removing traffic processing units
from the network, connecting links to the network and disconnecting
links from the network. Next, monitoring processing capacity demand
and bandwidth capacity demand of the network via the monitoring
means and then dynamically adjusting processing capacity of the
network by selectively adding or removing traffic processing units
in the network via the managing means upon observation of
processing capacity demand increase or processing capacity demand
decrease, respectively. The method also includes dynamically
adjusting bandwidth capacity of the network by selectively
connecting or disconnecting links in the network via the managing
means upon observation of bandwidth capacity demand increase or
bandwidth capacity decrease, respectively.
[0011] Implementations of this aspect of the invention may include
one or more of the following. The traffic processing units include
specially designed traffic processing hardware, such as router,
switch, and hub, among others. The traffic processing units also
include general purpose computers running specially designed
traffic processing software. The traffic processing units utilize
virtual machines and physical machines. The virtual machines are
based on virtualization technology including VMware, Xen and
Microsoft Virtualization. The virtual machines are virtual
computing instances provided by commercial cloud computing
providers. The cloud computing providers include Amazon.com's EC2,
RackSpace, SoftLayer, AT&T, GoGrid, Verizon, Fujitsu, Voxel,
Google, Microsoft, FlexiScale, among others. The network is an
overlay network superimposed over an underlying network. The
network links are virtual network links of the underlying network.
The underlying network may be the Internet, a WAN, a wireless network
or a private network. The traffic processing units are distributed
at different geographic locations. The traffic processing units are
added or removed via an Application Programming Interface
(API).
[0012] In general, in another aspect, the invention features a
system for automatic scaling of the processing capacity and
bandwidth capacity of a network. The system includes a network
comprising a plurality of traffic processing units and a plurality
of network links, monitoring means for monitoring processing
capacity demand and bandwidth capacity demand of the network and
managing means for adding traffic processing units to the network,
removing traffic processing units from the network, connecting
links to the network and disconnecting links from the network. The
monitoring means monitor processing capacity demand and bandwidth
capacity demand of the network and provide processing capacity
demand information and bandwidth capacity demand information to the
managing means. The managing means dynamically adjust the
processing capacity of the network by selectively adding or
removing traffic processing units in the network upon receiving
information of processing capacity demand increase or processing
capacity demand decrease, respectively. The managing means also
dynamically adjust bandwidth capacity of the network by selectively
connecting or disconnecting links in the network upon receiving
information of bandwidth capacity demand increase or bandwidth
capacity decrease, respectively.
[0013] Among the advantages of the invention may be one or more of
the following. The network system is adaptive so that it always
"provisions" optimal capacity in response to demand, eliminating
capacity waste without sacrificing service quality, as shown in
FIG. 2A. The network system is horizontally scalable. Its capacity
increases linearly by just adding more traffic processing nodes to
the system. It is also fault-tolerant. Failure of individual
components within the system does not cause system failure. In
fact, the system assumes component failures as common occurrences
and is able to run on commodity hardware to deliver high
performance and high availability services.
[0014] The details of one or more embodiments of the invention are
set forth in the accompanying drawings and description below. Other
features, objects and advantages of the invention will be apparent
from the following description of the preferred embodiments, the
drawings and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] FIG. 1 shows the current Internet routing (prior art);
[0016] FIG. 1A is a graph of the network capacity demand versus
time in a prior art network with fixed capacity;
[0017] FIG. 2 shows a cloud routing network of the present
invention;
[0018] FIG. 2A shows the global locations of a geographically
distributed network;
[0019] FIG. 2B is a graph of the network capacity demand versus time
in an adaptive network that changes its capacity based on
demand;
[0020] FIG. 3 shows the functional blocks of the cloud routing
system of FIG. 2;
[0021] FIG. 4 shows the traffic processing pipeline in the cloud
routing network of FIG. 2;
[0022] FIG. 5 shows the cloud routing workflow of the present
invention;
[0023] FIG. 6 shows the process of network capacity auto-scaling
and route convergence of the present invention;
[0024] FIG. 7 shows the node management workflow of the present
invention;
[0025] FIG. 8 shows various components in a cloud routing
network;
[0026] FIG. 9 shows a traffic management unit (TMU); and
[0027] FIG. 10 shows the various sub-components of a traffic
processing unit (TPU).
DETAILED DESCRIPTION OF THE INVENTION
Cloud Routing Network
[0028] The present invention describes a cloud routing network that
is implemented as an overlay virtual network or as a physical
network. By way of background, we use the term "cloud routing
network" to refer to a network (virtual or physical) that includes
traffic processing nodes (TPUs) deployed at various locations
inter-connected by network links, through which client traffic
travels to destinations. A cloud routing network can be a virtual
overlay network superimposed on an underlying physical network, a
physical network or a combination of both. Referring to FIG. 2, the
cloud routing network 300 includes router clouds 340, 350 and 360,
which are superimposed over a physical network 370, which in this
case is the Internet. Cloud 340 includes TPUs 342, 344, 346. Cloud
350 includes TPUs 352, 354 and cloud 360 includes TPUs 362, 364.
Each TPU has a certain amount of processing capacity. The TPUs are
connected to each other via network links. Each link possesses a
certain amount of bandwidth. The processing capacity of the cloud
network is the combined processing capacities of all the TPUs. The
bandwidth capacity of the cloud network is the combined bandwidth
capacity of all the links.
[0029] Cloud network 300 also includes a traffic management system
330, a traffic processing system 334, a data processing system 332
and a monitoring system 336. These systems are specialized software
that the traffic processing nodes run in order to perform functions
such as traffic monitoring, TPU node management, traffic
re-direction, traffic splitting, load balancing, traffic
inspection, traffic cleansing, traffic optimization, route
selection, route optimization, among others. In one example, cloud
network 300 is implemented as a virtual network that includes
virtual machines at various commercially available cloud computing
data centers, such as Amazon.com's Elastic Computing Cloud (EC2),
SoftLayer, RackSpace, GoGrid, FlexiScale, AT&T, Verizon,
Fujitsu, Voxel, among others. These cloud computing data centers
provide the physical infrastructure to add or remove TPU nodes
dynamically, which further enables the virtual network to scale
both its processing capacity and network bandwidth capacity. When
traffic grows to a certain level, the network starts up more TPUs,
adds links to these new TPU nodes and thus increases the network's
processing power as well as bandwidth capacity. When traffic level
decreases to a certain threshold, the network shuts down certain
TPUs to reduce its processing and bandwidth capacity.
[0030] The traffic management system 330 directs network traffic to
its traffic processing units (TPU). The traffic monitoring system
336 monitors the network traffic, the traffic processing system 334
inspects and processes the network traffic and the data processing
332 gathers data from different sources and provides global
decision support and means to configure and manage the system.
Referring to FIG. 3, the functional components of the cloud routing
system 300 include a traffic management interface unit 410, a
traffic redirection unit 420, a traffic routing unit 430, a node
management unit 440, a monitoring unit 450 and a data repository
460. The traffic management interface unit 410 includes a
management user interface (UI) 412 and a management API 414.
[0031] For a virtual overlay network based cloud routing network,
most TPU nodes are virtual machines running specialized traffic
handling software. Various TPU nodes may belong to different
clouds. Each cloud itself is a collection of nodes located in the
same data center (or the same geographic location). Some nodes
perform traffic management. Some nodes perform traffic processing.
Some nodes perform monitoring and data processing. Some nodes
perform management functions to adjust the network's capacity. Some
nodes perform access management and security control. These nodes
are connected to each other via the underlying network 370. The
connection between two nodes may contain many physical links and
hops in the underlying network, but these links and hops together
form a "virtual link" that conceptually connects these
two nodes directly. All these virtual links together with the TPU
nodes form a virtual network. Each node has only a fixed amount of
bandwidth and processing capacity. The capacity of the network is
the sum of the capacity of all nodes, and thus a cloud routing
network has only a fixed amount of processing and network capacity
at any given moment. This fixed amount of capacity may be
insufficient or excessive for the traffic demand. By adjusting the
capacity of individual nodes or by adding or removing nodes, the
network is able to adjust its processing power as well as bandwidth
capacity.
[0032] In the case when a cloud routing network is primarily a
physical network, most TPU nodes are physical machines running
specialized traffic handling software, including general purpose
computers as well as specially designed hardware appliances. Again,
various TPU nodes may belong to different clouds. In each cloud,
some nodes perform traffic management. Some nodes perform traffic
processing. Some nodes perform monitoring and data processing. Some
nodes perform management functions to adjust the network's
capacity. Some nodes perform access management and security
control. These nodes are connected to each other via network links.
These links together with the TPU nodes form a network. Each node
has only a fixed amount of bandwidth and processing capacity. The
capacity of this network is the sum of the capacity of all nodes,
and thus a cloud routing network has only a fixed amount of
processing and network capacity at any given moment. This fixed
amount of capacity may be insufficient or excessive for the
traffic demand. By adjusting the capacity of individual nodes or by
adding or removing nodes, the network is able to adjust its
processing power as well as bandwidth capacity.
Traffic Processing
[0033] The invention uses a cloud routing network service to
process traffic and thus delivers "conditioned" traffic from source
to destination according to delivery requirements. FIG. 2 shows a
typical traffic processing service. When a client 305 issues a
request to a network service running on servers 550 and 560, the
cloud routing network 300 processes the request by doing the
following steps: [0034] 1. Traffic management service 330
intercepts the requests and routes the request to a TPU node;
[0035] 2. The TPU node checks the service's specific policy and
performs the pipeline processing shown in FIG. 4; [0036] 3. If
necessary, a global data repository 332 is used for data collection
and data analysis for decision support; [0037] 4. If necessary, the
client request is routed to the next TPU node, i.e., from TPU 342
to 352; and then [0038] 5. The request is sent to an "optimal" server 550 for processing (see the sketch after this list).
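The pipeline processing of step 2 can be pictured as a chain of traffic-handling stages. The following Python sketch is purely illustrative: the stage names mirror the processing tasks listed earlier (inspection, cleansing, optimization), but every identifier is an assumption of this example, not part of the patent.

    # Minimal sketch of a TPU's pipeline processing (cf. FIG. 4).
    # All names are hypothetical; a request is modeled as a dict.
    def inspect(req):
        # Examine the traffic before forwarding (e.g. protocol checks).
        req.setdefault("applied", []).append("inspect")
        return req

    def cleanse(req):
        # Remove or sanitize traffic the service policy forbids.
        req.setdefault("applied", []).append("cleanse")
        return req

    def optimize(req):
        # Apply transport optimizations before the next hop.
        req.setdefault("applied", []).append("optimize")
        return req

    def run_pipeline(req, stages):
        for stage in stages:
            req = stage(req)
        return req

    conditioned = run_pipeline({"src": "client 305", "dst": "server 550"},
                               [inspect, cleanse, optimize])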
[0039] More specifically, when a client issues a request to a
server (for example, a consumer enters a web URL into a web browser
to access a web site), the default Internet routing mechanism would
route the request through the network hops along a certain network
path from the client to the target server ("default path"). Using a
cloud routing network, if there are multiple server nodes, the
cloud routing network first selects an "optimal" server node from
the multiple server nodes as the target server node to serve the
request. This server node selection process takes into
consideration factors including load balancing, performance, cost,
and geographic proximity, among others. Secondly, instead of going
through the default path, the traffic management service redirects
the request to an "optimal" TPU within the overlay network
("Optimal" is defined by the system's routing policy, such as being
geographically nearest, most cost effective, or a combination of a
few factors). This "optimal" TPU further routes the request to a second "optimal" TPU within the cloud routing network if necessary.
For performance and reliability reasons, these two TPU nodes
communicate with each other using either the best available or an
optimized transport mechanism. Then the second "optimal" node may
route the request to a third "optimal" node and so on. This process
can be repeated within the cloud routing network until the request
finally arrives at the target. The set of "optimal" TPU nodes
together form a "virtual" path along which traffic travels. This
virtual path is chosen in such a way that a certain routing measure
(such as performance, cost, carbon footprint, or a combination of a
few factors) is optimized. When the server responds, the response
goes through a similar pipeline process within the cloud routing
network until it reaches the client.
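To make the routing measure concrete, the sketch below scores candidate TPU nodes by a weighted combination of latency, cost and distance and picks the minimum. The metric names, weights and sample values are assumptions made for this illustration, not figures from the patent.

    # Hypothetical policy-driven node selection; lower score is better.
    def score(node, weights):
        return (weights["latency"] * node["latency_ms"]
                + weights["cost"] * node["cost_per_gb"]
                + weights["distance"] * node["distance_km"])

    def select_optimal(candidates, weights):
        return min(candidates, key=lambda n: score(n, weights))

    policy = {"latency": 1.0, "cost": 50.0, "distance": 0.01}
    candidates = [
        {"id": "tpu-342", "latency_ms": 40, "cost_per_gb": 0.08, "distance_km": 300},
        {"id": "tpu-352", "latency_ms": 25, "cost_per_gb": 0.12, "distance_km": 900},
    ]
    best = select_optimal(candidates, policy)  # -> tpu-352 for these weights

Applied hop by hop, the same scoring yields the "virtual" path described above.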
[0040] FIG. 5 shows a typical network routing process. In this
embodiment, the traffic management service utilizes a Domain Name
Server (DNS) mechanism. The customer 801 configures the DNS record
for an application so that DNS queries are processed by the cloud
routing network 800, as shown in FIG. 8. Typical ways of
configuring DNS records include setting the DNS server, the CNAME
record or the "A" record of the application to a DNS server
provided by the cloud routing network. When a client wants to
access the application (e.g. www.somesite.com), the client needs to
resolve the hostname to an IP address. The cloud routing network
receives the DNS query. Based on the current routing policy, the
network 800 first selects an "optimal" server node among the
plurality of server nodes that the application is running on, and
then selects an entry router 803. The IP address of the entry
router node 803 is returned as a result of the DNS query. When the
entry router 803 receives a message from the client 801, it selects
an optimal exit router node 804, optimal path 805 as well as an
optimal transport mechanism to deliver the message. The exit router
node 804 receives the message, and further delivers it to the
target server node 820. In this process, client IP, path
information and performance metrics data are collected and logged
in the data processing unit (DPU) 806, which can be used for future
path selection and node selection.
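As a minimal sketch of the entry-router selection behind the DNS answer, assume the traffic management service knows each entry router's location and answers the query with the IP of the geographically nearest one. The router list, coordinates and IP addresses below are placeholders invented for this example.

    import math

    # Hypothetical entry routers: public IP plus (lat, lon).
    ENTRY_ROUTERS = [
        {"ip": "198.51.100.10", "lat": 42.37, "lon": -71.11},
        {"ip": "203.0.113.20", "lat": 37.77, "lon": -122.42},
    ]

    def distance_km(lat1, lon1, lat2, lon2):
        # Great-circle (haversine) distance; adequate for ranking.
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * 6371.0 * math.asin(math.sqrt(a))

    def answer_dns_query(client_lat, client_lon):
        # Return the nearest entry router's IP as the "A" record answer.
        nearest = min(ENTRY_ROUTERS, key=lambda r: distance_km(
            client_lat, client_lon, r["lat"], r["lon"]))
        return nearest["ip"]

    print(answer_dns_query(40.71, -74.01))  # New York client -> 198.51.100.10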
Processing Capacity Scaling and Bandwidth Capacity Scaling
[0041] The invention enables a network to adjust its processing
capacity and bandwidth capacity in response to traffic demand variations.
The cloud routing network 300 monitors traffic demand, load
conditions, network performance and various other factors via its
monitoring service 336. When certain conditions are met, it
dynamically launches new nodes at appropriate locations, activates
links to these new nodes and spreads traffic to these new nodes in
response to increased demand, or shuts down some existing nodes in
response to decreased traffic demand. The net result is that the
cloud routing network dynamically adjusts its processing and
network capacity to deliver optimal results while eliminating
unnecessary capacity waste and carbon footprint.
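A minimal sketch of this demand-driven adjustment, assuming simple utilization thresholds; the threshold values, per-node capacity figure and helper names are all assumptions of the example, not part of the patent.

    # Hypothetical threshold-based capacity scaling. Demand and capacity
    # share a unit (e.g. Mbps); thresholds are illustrative only.
    SCALE_UP_AT = 0.80    # add a node above 80% utilization
    SCALE_DOWN_AT = 0.30  # remove a node below 30% utilization
    NODE_CAPACITY = 100.0

    def launch_tpu():
        # Placeholder for the provider API call that starts a node.
        return {"id": "tpu-new"}

    def drain_and_stop(node):
        # Placeholder: wait for the node's traffic to drain, then stop it.
        pass

    def adjust_capacity(demand, active_nodes):
        capacity = max(len(active_nodes), 1) * NODE_CAPACITY
        utilization = demand / capacity
        if utilization > SCALE_UP_AT:
            active_nodes.append(launch_tpu())   # also activate its links
        elif utilization < SCALE_DOWN_AT and len(active_nodes) > 1:
            drain_and_stop(active_nodes.pop())  # also deactivate its links
        return active_nodes

    nodes = adjust_capacity(demand=250.0,
                            active_nodes=[{"id": "tpu-1"}, {"id": "tpu-2"}])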
[0042] A cloud routing network utilizes an Application Programming
Interface (API) from individual nodes to add or remove nodes from
the network. Cloud computing providers typically provide APIs that allow a third party to manage machine instances. For example, Amazon.com's EC2 provides Amazon Web Services (AWS) based APIs through which a third party can send web services messages to interact with and manage virtual machine instances, such as starting a new node, shutting down an existing node, or checking the status of a node.
The managing means of the cloud routing network typically utilizes
such APIs to add or remove traffic processing nodes and links, thus
adjusting the network's capacity.
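As an illustration only, these operations map naturally onto Amazon EC2. The sketch below uses the present-day boto3 SDK (which postdates this application); the AMI ID, region and instance type are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def start_tpu_node():
        # Launch one VM instance from a (placeholder) TPU machine image.
        resp = ec2.run_instances(ImageId="ami-0123456789abcdef0",
                                 InstanceType="m5.large",
                                 MinCount=1, MaxCount=1)
        return resp["Instances"][0]["InstanceId"]

    def stop_tpu_node(instance_id):
        # Shut down an existing node, e.g. after its traffic has drained.
        ec2.terminate_instances(InstanceIds=[instance_id])

    def node_status(instance_id):
        # Check the status of a node.
        resp = ec2.describe_instances(InstanceIds=[instance_id])
        return resp["Reservations"][0]["Instances"][0]["State"]["Name"]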
[0043] FIG. 6 depicts two important aspects of the cloud routing
network: adaptive scaling and path convergence. Based on the
continuously collected metrics data from monitor nodes and logs,
the node management module 440 (shown in FIG. 3) checks the current
capacity and takes actions. When it detects that capacity is
"insufficient" according to a certain measure, it starts new router
nodes. The router table is updated to include the new routers and
thus spreads traffic to the new routers. When too much capacity is detected, the node management module selectively shuts down some of the router nodes after traffic to these nodes has drained. The
router tables are updated by removing these router nodes from the
tables. At any time, when an event such as router failure or path
condition change occurs, the router table is updated to reflect the
change. The updated router table is used for subsequent
routing.
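The router-table side of this convergence can be sketched as follows; the table layout and identifiers are assumptions for illustration, and in practice the update would be propagated to every node that routes traffic.

    # Hypothetical router table: router id -> address. Routing spreads
    # traffic across whatever entries the table currently holds.
    router_table = {"r1": "10.0.0.1", "r2": "10.0.0.2"}

    def add_router(table, rid, addr):
        # Capacity added: include the new router so traffic spreads to it.
        table[rid] = addr

    def remove_router(table, rid):
        # Excess capacity, failure, or path change: withdraw the router;
        # subsequent routing uses only the remaining entries.
        table.pop(rid, None)

    add_router(router_table, "r3", "10.0.0.3")  # scale up
    remove_router(router_table, "r1")           # scale down or failure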
[0044] Further, the cloud routing network can quickly recover from "faults". When a fault such as a node failure or link failure occurs,
the system detects the problem and recovers from it by either
starting a new node or selecting an alternative route. As a result,
though individual components may not be reliable, the overall
system is highly reliable.
Traffic Processing Unit Node Management
[0045] Node management module 440 provides services for managing
the TPU nodes, such as starting a virtual machine (VM) instance,
stopping a VM instance and recovering from a node failure, among
others. In accordance with the node management policies in the system, this service launches new nodes when traffic demand is high and shuts down nodes when it detects that they are no longer necessary.
[0046] The node monitoring module 450 monitors the TPU nodes over
the network, collects performance and availability data, and
provides feedback to the cloud routing system 300. This feedback is
then used to make decisions such as when to scale up and when to
scale down. Data repository 460 contains data for the cloud routing
system, such as Virtual Machine Image (VMI), application artifacts
(files, scripts, and configuration data), routing policy data, and
node management policy data, among others.
[0047] FIG. 7 shows the node management workflow. When the system
receives a node status change event from its monitoring agents, it
first checks whether the event signals a node down. If so, the node
is removed from the system. If the system policy says "re-launch
failed nodes", the node controller will try to launch a new node.
Then the system checks whether the event indicates that the current
set of server nodes is getting overloaded. If so, at a certain
threshold, and if the system's policy permits, a node manager will
launch new nodes and notify the traffic management service to
spread load to the new nodes. Finally, the system checks to see
whether it is in the state of "having too much capacity". If so and
the node management policy permits, a node controller will try to
shut down a certain number of nodes to eliminate capacity
waste.
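The workflow reduces to an event handler along these lines; the event kinds, policy flags and helper names are hypothetical, chosen only to mirror FIG. 7.

    # Hypothetical node-management event handler (cf. FIG. 7).
    def launch_node():
        # Placeholder for a provider API call (see the EC2 sketch above).
        return {"id": "node-new"}

    def spread_traffic_to(node):
        pass  # placeholder: notify the traffic management service

    def shut_down(node):
        pass  # placeholder: drain sessions/traffic, then stop the node

    def on_node_event(event, nodes, policy):
        if event["kind"] == "node_down":
            nodes.remove(event["node"])        # take the failed node out
            if policy.get("relaunch_failed_nodes"):
                nodes.append(launch_node())
        elif event["kind"] == "overloaded":
            if policy.get("allow_scale_up"):
                new = launch_node()
                nodes.append(new)
                spread_traffic_to(new)
        elif event["kind"] == "excess_capacity":
            if policy.get("allow_scale_down") and len(nodes) > 1:
                shut_down(nodes.pop())         # eliminate capacity waste
        return nodes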
[0048] In launching new nodes, the system picks the best geographic
region to launch the new node. Globally distributed cloud
environments such as Amazon.com's EC2 cover several continents, as
shown in FIG. 2A. Launching new nodes at appropriate geographic locations helps spread application load globally, reduces network traffic and improves application performance. In shutting down nodes
to reduce capacity waste, the system checks whether session
stickiness is required for the application. If so, shutdown is
delayed until all current sessions on these nodes have expired.
Monitoring
[0049] The cloud routing network contains a monitoring service 336
(that includes monitoring module 450) that provides the necessary
data to the cloud routing network 300 as the basis for its
decisions. Various embodiments implement a variety of techniques
for monitoring. The following lists a few examples of monitoring
techniques: [0050] 1. Internet Control Message Protocol (ICMP)
Ping: A small IP packet that is sent over the network to detect
route and node status; [0051] 2. traceroute: a technique commonly
used to check network route conditions; [0052] 3. Host agent: an
embedded agent running on host computers that collects data about
the host; [0053] 4. Web performance monitoring: a monitor node,
acting as a normal user agent, periodically sends HTTP requests to
a web server and processes the HTTP responses from the web server. The monitor node records metrics along the way, such as DNS resolution time, request time, response time, page load time, number of requests, number of JavaScript files, or page footprint, among others (see the sketch after this list). [0054] 5. Security monitoring: A monitor node
periodically scans a target system for security vulnerabilities, using techniques such as network port scanning and network service scanning to
determine which ports are publicly accessible and which network
services are running, further determining whether there are
vulnerabilities. [0055] 6. Content security monitoring: a monitor node periodically crawls a web site and scans its content to detect infected content, such as malware, spyware, undesirable adult content, or viruses, among others.
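As one concrete sketch of technique 4 above, a monitor node can time an HTTP round trip using only the Python standard library; the URL is a placeholder, and a real monitor would also record DNS resolution time, page load time and the other metrics listed.

    import time
    import urllib.request

    def probe(url, timeout=10.0):
        # Act as a normal user agent: fetch the page and record metrics.
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            status = resp.status
        return {"status": status,
                "response_time_s": time.monotonic() - start,
                "page_bytes": len(body)}

    print(probe("http://www.example.com/"))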
[0056] The above examples are for illustration purposes. The present invention is agnostic to, and accommodates, a wide variety of monitoring methods. An embodiment of the present invention employs all of the above techniques to monitor different target systems: ICMP, traceroute and host agents monitor the cloud routing network itself, while web performance monitoring, security monitoring and content security monitoring track the availability, performance and security of target network services such as web applications. A data processing system (DPS) aggregates data from the monitoring service and gives all other services global visibility into that data.
[0057] Several embodiments of the present invention have been
described. Nevertheless, it will be understood that various
modifications may be made without departing from the spirit and
scope of the invention. Accordingly, other embodiments are within
the scope of the following claims.
* * * * *