U.S. patent number 8,352,482 [Application Number 12/506,965] was granted by the patent office on 2013-01-08 for system and method for replicating disk images in a cloud computing based virtual machine file system.
This patent grant is currently assigned to VMware, Inc.. Invention is credited to Jacob Gorm Hansen.
United States Patent |
8,352,482 |
Hansen |
January 8, 2013 |
System and method for replicating disk images in a cloud computing
based virtual machine file system
Abstract
A replicated decentralized storage system comprises a plurality
of servers that locally store disk images for locally running
virtual machines as well as disk images, for failover purposes, for
remotely running virtual machines. To ensure that disk images
stored for failover purposes are properly replicated upon an update
of the disk image on the server running the virtual machine, a hash
of a unique value known only to the server running the virtual
machine is used to verify the origin of update operations that have
been transmitted by the server to the other servers storing
replications of the disk image for failover purposes. If verified,
the update operations are added to such failover disk images.
Inventors: |
Hansen; Jacob Gorm (Ryomgaard,
DK) |
Assignee: |
VMware, Inc. (Palo Alto,
CA)
|
Family
ID: |
43498175 |
Appl.
No.: |
12/506,965 |
Filed: |
July 21, 2009 |
Prior Publication Data
|
|
|
|
Document
Identifier |
Publication Date |
|
US 20110022574 A1 |
Jan 27, 2011 |
|
Current U.S.
Class: |
707/758; 713/189;
707/812; 380/286; 707/698 |
Current CPC
Class: |
G06F
9/45558 (20130101); G06F 11/2097 (20130101); G06F
16/23 (20190101); G06F 16/188 (20190101); G06F
16/10 (20190101); G06F 16/178 (20190101); G06F
16/20 (20190101); G06F 2009/45595 (20130101); G06F
11/2025 (20130101); G06F 11/2041 (20130101); G06F
2009/45583 (20130101); G06F 11/1484 (20130101); G06F
11/2048 (20130101) |
Current International
Class: |
G06F
7/00 (20060101); G06F 17/30 (20060101) |
References Cited
[Referenced By]
U.S. Patent Documents
Other References
Hakim Weatherspoon et al., "Antiquity: Exploiting a Secure Log for
Wide-Area Distributed Storage." EuroSys '07: Proceedings of the
2007 Conference on EuroSys, pp. 371-384, New York, NY 2007. ACM
Press. cited by examiner .
Moni Naor et al., "Access Control and Signatures via Quorum Secret
Sharing." Parallel and Distributed Systems, IEEE Transactions,
9(9):909-922 Sep. 1998. cited by other .
David Mazieres et al., "Building secure file systems out of
Byzantine storage." PODC '02: Proceedings of the Twenty-First
Annual Symposium on Principles of Distributed Computing, pp.
108-117, New York, NY 2002. ACM Press. cited by other .
Notification of Transmittal of the International Search Report and
the Written Opinion of the International Searching Authority, or
the Declaration, Mar. 2, 2011, Patent Cooperation Treaty "PCT",
KIPO, Republic of Korea. cited by other.
|
Primary Examiner: Nguyen; Loan T
Attorney, Agent or Firm: Lin; Daniel
Claims
I claim:
1. A computer system configured to replicate a log file stored in a
primary server, the computer system comprising: a local storage
unit for storing a replication of the log file; one or more
computer processors; and a non-transitory computer-readable storage
medium comprising instructions for controlling the one or more
computer processors to be configured to: receive an update
operation sent from a server, wherein the update operation includes
a public unique id and a private unique id from a previous update
operation, obtain a previous public unique id from a last update
operation stored in the replication of the log file stored in the
local storage unit, wherein the previous public unique id was
generated using the private unique id from the previous update
operation, generate a hash of the private unique id, compare the
generated hash of the private unique id to the previous public
unique id to confirm the server that sent the received update
operation is designated as a primary server, wherein the private
unique id is previously known only to the primary server prior to
transmission of the update operation to the computer system, and
append the update operation to the replication of the log file upon
confirmation that the generated hash of the private unique id
equals the previous public unique id.
2. The computer system of claim 1, wherein the public unique id is
a hash of a new private unique id generated from a master secret
value stored in the primary server.
3. The computer system of claim 2, wherein the new private unique
id comprises a hash of a bitwise intersection of the master secret
value, the private unique id from the previous update operation,
and data corresponding to the update operation.
4. The computer system of claim 1, wherein the log file comprises a
list of update operations.
5. The computer system of claim 4, wherein the log file and the
replication of the log file correspond to a disk image of a virtual
machine running on the primary server.
6. The computer system of claim 5, wherein the update operation
further includes a data portion corresponding to a write operation
issued by the virtual machine and comprising a logical address of
the virtual machine and data to be written to the logical
address.
7. The computer system of claim 1, configured to reject the update
operation upon confirmation that the generated hash of the private
unique id does not equal the previous public unique id.
8. A method for updating a replication of a log file, wherein the
log file is stored in a local storage unit of a primary server and
the replication is stored in a local storage unit of a secondary
server, the method comprising: receiving an update operation sent
from a server, wherein the update operation includes a public
unique id and a private unique id from a previous update operation;
obtaining a previous public unique id from a last update operation
stored in the replication of the log file stored in the local
storage unit of the secondary server, wherein the previous public
unique id was generated using the private unique id from the
previous update operation; generating a hash of the private unique
id; comparing the generated hash of the private unique id to the
previous public unique id to confirm the server that sent the
received update operation is designated as a primary server,
wherein the private unique id is previously known only to the
primary server prior to transmission of the update operation to the
secondary server; and appending the update operation to the
replication of the log file upon confirmation that the generated
hash of the private unique id equals the previous public unique
id.
9. The method of claim 8, wherein the public unique id is a hash of
a new private unique id generated from a master secret value stored
in the primary server.
10. The method of claim 9, wherein the new private unique id
comprises a hash of a bitwise intersection of the master secret
value, the private unique id from the previous update operation,
and data corresponding to the update operation.
11. The method of claim 8, wherein the log file comprises a list of
update operations.
12. The method of claim 11, wherein the log file and the
replication of the log file correspond to a disk image of a virtual
machine running on the primary server.
13. The method of claim 12, wherein the update operation further
includes a data portion corresponding to a write operation issued
by the virtual machine and comprising a logical address of the
virtual machine and data to be written to the logical address.
14. The method of claim 8, further comprising rejecting the update
operation upon confirmation that the generated hash of the private
unique id does not equal the previous public unique id.
15. A non-transitory computer-readable storage medium including
instructions that, when executed by a processing unit of a
secondary server having a local storage unit storing a replication
of a log file stored on in a local storage unit of a primary
server, causes the processing unit to update the replication by
performing steps of: receiving an update operation sent from a
server, wherein the update operation includes a public unique id
and a private unique id from a previous update operation; obtaining
a previous public unique id from a last update operation stored in
the replication of the log file stored in the local storage unit of
the secondary server, wherein the previous public unique id was
generated using the private unique id from the previous update
operation; generating a hash of the private unique id; comparing
the generated hash of the private unique id to the previous public
unique id to confirm the server that sent the received update
operation is designated as a primary server, wherein the private
unique id is previously known only to the primary server prior to
transmission of the update operation to the secondary server; and
appending the update operation to the replication of the log file
upon confirmation that the generated hash of the private unique id
equals the previous public unique id.
16. The non-transitory computer readable storage medium of claim
15, wherein the public unique id is a hash of a new private unique
id generated from a master secret value stored in the primary
server.
17. The non-transitory computer readable storage medium of claim
16, wherein the new private unique id comprises a hash of a bitwise
intersection of the master secret value, the private unique id from
the previous update operation, and data corresponding to the update
operation.
18. The non-transitory computer readable storage medium of claim
15, wherein the log file comprises a list of update operations.
19. The non-transitory computer readable storage medium of claim
18, wherein the log file and the replication of the log file
correspond to a disk image of a virtual machine running on the
primary server.
20. The non-transitory computer readable storage medium of claim
19, wherein the update operation further includes a data portion
corresponding to a write operation issued by the virtual machine
and comprising a logical address of the virtual machine and data to
be written to the logical address.
21. The computer system of claim 1, further configured to store the
public unique id included in the update operation for use to
confirm a next update operation including a new private ID used to
generate the public unique ID.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is related to and filed on the same day as U.S.
patent application entitled "Method for Voting with Secret Shares
in a Distributed System" Ser. No. 12/507,013, which application is
assigned to the assignee of the present invention.
BACKGROUND OF THE INVENTION
Current enterprise level virtual machine file systems, such as
VMware Inc.'s VMFS, are typically shared disk file systems that
utilize an external storage device, such as a storage area network
(SAN), to provide storage resources to virtual machines. These
virtual machines are instantiated and run on one or more servers
(sometimes referred to as a server cluster) that store their
virtual machines' disk images as separate files in the SAN. Each
server in the cluster runs a virtualization layer (sometimes
referred to as a hypervisor) that includes an implementation of a
virtual machine file system that coordinates the interaction of the
server with the SAN. For example, each virtual machine file system
on each server in a cluster implements and follows a common
per-file locking protocol that enables virtual machines running on
multiple servers to simultaneously access (e.g., read and write)
their disk images in the SAN without fear that other servers may
simultaneously access the same disk image at the same time.
FIG. 1 depicts one example of a network architecture for a cluster
of virtualization servers utilizing a SAN. Each virtualization
server 100.sub.A to 100.sub.J is networked to SAN 105 and
communicates with SAN 105 using SCSI-based protocols. As previously
discussed, each virtualization server 100.sub.A to 100.sub.J
includes a hypervisor, such as 110.sub.A, that includes a virtual
machine file system, such as 115.sub.A. Hypervisor 110.sub.A
provides virtualization support to enable its server 100.sub.A to
instantiate a number of virtual machines, such as 120.sub.A through
125.sub.A. The disk images for each of virtual machines .sup.120A
through 125.sub.A are stored in SAN 105.
The network architecture of FIG. 1 provides protection against
server failures because SAN 105 serves as a central storage
resource that stores disk images for virtual machines of all the
servers in the cluster. For example, if server 100.sub.A
experiences a hardware failure, any of the other servers in the
cluster can "failover" any of virtual machines 120.sub.A through
125.sub.A by instantiating a new virtual machine and associating
the newly created virtual machine with the failed virtual machine's
disk image stored in SAN 105 (i.e., provided such server has
sufficient computing resources to support the virtual machine).
However, SAN 105 itself becomes a potential bottleneck and a single
point of failure. Furthermore, by its nature, the use of a central
SAN limits the capability to scale the number of servers in a
cluster and/or distribute the servers in the cluster over a
wide-area network (WAN). Additionally, SANs have traditionally been
one of the most expensive components of a data center, often
costing more than the aggregate cost of the virtualization software
and server cluster.
SUMMARY OF THE INVENTION
One or more embodiments of the invention provide a virtual machine
file system that employs a replicated and decentralized storage
system. In this system, as in warehouse-style or "cloud" computing
systems, multiple networked servers utilize cheaper local storage
resources (such as SATA disks) rather than a centralized SAN, even
though they may be less reliable, because such a replicated and
decentralized storage system eliminates the bottleneck and single
point of failure of a SAN and also provide the potential for both
incremental and large-scale data center growth by simply adding
more servers. However, use of such local storage resources is also
less reliable than use of a SAN. To improve reliability, data
replication techniques that provide high availability and ensure
the integrity and consistency of replicated data across the servers
are employed.
A computer system according to an embodiment of the present
invention is configured to replicate a log file stored in a primary
server. The computer system comprises a local storage unit for
storing a replication of the log file and a software module
configured to receive an update operation from the primary server,
wherein the update operation includes a public unique id and a
private unique id from a previous update operation, obtain a
previous public unique id from a last update operation stored in
the replication of the log file stored in the local storage unit,
generate a hash of the private unique id, compare the generated
hash of the private unique id to the previous public unique id, and
append the update operation to the replication of the log file upon
confirmation that the generated hash of the private unique id
equals the previous public unique id. In one embodiment of such a
computer system, the log file and the replication of the log file
correspond to a disk image of a virtual machine running on the
primary server.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 depicts a network architecture for a cluster of
virtualization servers utilizing a SAN.
FIG. 2 depicts a network architecture using a replicated and
decentralized storage system for a virtual machine file system,
according to one or more embodiments of the present invention.
FIG. 3 depicts disk image replication for a virtual machine running
on a server according to one or more embodiments of the present
invention.
FIG. 4A depicts a log structure of a disk image of a virtual
machine stored on local storage, according to one or more
embodiments of the present invention.
FIG. 4B depicts the internal data structure of an update operation
of a disk image, according to one or more embodiments of the
present invention.
FIG. 5 depicts a flow chart for replicating a primary data image to
secondary servers, according to one or more embodiments of the
present invention.
FIG. 6 depicts a sequence of update operations to a data image,
according to one or more embodiments of the present invention.
FIG. 7 depicts a flow chart for sharing a master secret token
across a number of servers, according to one or more embodiments of
the present invention.
DETAILED DESCRIPTION
FIG. 2 depicts a network architecture using a replicated and
decentralized storage system for a virtual machine file system,
according to one or more embodiments of the present invention. In
contrast to the network architecture of FIG. 1, in which
virtualization servers 100', each including a virtual file system
115', a hypervisor 110' and one or more virtual machines 120', 125'
communicate with a centralized SAN 105 to access stored disk images
corresponding to their respective instantiated virtual machines,
each of the virtualization servers 200.sub.A to 200.sub.H in the
cluster of FIG. 2 has its own directly attached local storage, such
as local storage 205.sub.A for virtualization server 200.sub.A. As
such, virtual machines 210.sub.A to 215.sub.A running on server
200.sub.A store their disk images in local storage 205.sub.A.
Storage in such a network architecture can therefore be considered
"decentralized" because disk image data (in the aggregate) is
stored across the various local storages residing in the servers.
Each of virtualization servers 200.sub.A to 200.sub.H includes
virtualization software, for example, a hypervisor such as
210.sub.A, that supports the instantiation and running of virtual
machines on the server. Hypervisor 210.sub.A further includes a
virtual machine file system 220.sub.A that coordinates and manages
access to local storage 205.sub.A by virtual machines 210.sub.A to
215.sub.A (i.e., to read from or write to their respective disk
images).
Each of servers 200.sub.A to 200.sub.H is further networked to one
or more of the other servers in the cluster. For example, server
200.sub.A is networked to server 200.sub.B, server 200.sub.C,
server 200.sub.G, and server 200.sub.H. As depicted in the network
topology of FIG. 2, each server is networked to four other servers
in the cluster and can reach another server in no more than one
hop. It should be recognized, however, that the network topology of
FIG. 2 is a simplified illustration for exemplary purposes and that
any network topology that enables communication among the servers
in a cluster can be used consistent with the teachings herein,
including, without limitation, any ring, mesh, star, tree,
point-to-point, peer-to-peer or any other network topology, whether
partially connecting or fully connecting the servers. By removing a
centralized SAN from the network architecture, embodiments of the
present invention remove a potential bottleneck and point of
failure in the architecture and are more easily able to scale
storage for a virtualized data center in a cost efficient manner by
incrementally adding servers utilizing local storage to the
cluster.
An embodiment of the invention that utilizes a network architecture
similar to that of FIG. 2 replicates disk images across the local
storages of servers in a cluster to provide server failure
protection. If a server fails, another server in the cluster that
has a locally stored replica of the disk image of a virtual machine
in the failed server can failover that particular virtual machine.
In one embodiment, a designated server in the cluster has
responsibilities as a replication manager and may, for example,
instruct server 200.sub.A to replicate the disk image for virtual
machine 210.sub.A to the local storages of servers 200.sub.B,
200.sub.C, and 200.sub.H. As referred to herein, a server that is
running a virtual machine is the "primary server" with respect to
the virtual machine, and other servers that store replications of
the virtual machine's disk image for failover purposes are
"secondary servers." Similarly, a copy of the disk image of a
virtual machine that is stored in the local storage of the primary
server is a "primary" copy, replica or disk image, and a copy of
the disk image of a virtual machine that is stored in the local
storage of a secondary server is a "secondary" copy, replica or
disk image. FIG. 3 depicts disk image replication for a virtual
machine running on a server using a decentralized storage system,
according to one or more embodiments of the present invention. In
particular, virtual machine 210.sub.A running on primary server
200.sub.A utilizes a primary disk image 300 stored on local storage
205.sub.A of server 200.sub.A during normal operations. Primary
disk image 300 is replicated as secondary disk images 305, 310 and
315, respectively, in the local storages of secondary servers
200.sub.B, 200.sub.C, and 200.sub.H.
FIG. 4A depicts a log structure of a disk image of a virtual
machine stored on local storage, according to one or more
embodiments of the present invention. As illustrated in FIG. 4A,
disk image 300 for virtual machine 210.sub.A running on server
200.sub.A is structured as a temporally ordered log of update
operations made to the disk. For example, when virtual machine
210.sub.A issues a write operation (e.g., containing a logical
block address from the virtual address space of the virtual machine
and data to be written into the logical block address) to its disk,
virtual machine file system 220.sub.A receives the write operation
and generates a corresponding update operation, such as update
operation 400, and appends update operation 400 to the end of the
log structure of disk image 300. In one embodiment, virtual machine
file system 220.sub.A further maintains a B-tree data structure
that maps the logical block addresses referenced in write
operations issued by virtual machine 210.sub.A to physical
addresses of local storage 205.sub.A that reference locations of
the update operations (and data residing therein) corresponding to
the issued write operations. In such an embodiment, when virtual
machine file system 220.sub.A receives a write operation from
virtual machine 210.sub.A, it additionally inserts the physical
address corresponding to the update operation in the log structure
of the disk image into the B-tree such that the physical address
can be found by providing the logical block address of the write
operation to the B-tree. This B-tree enables virtual machine file
system 220.sub.A to handle read operations issued by virtual
machine 210.sub.A. For example, when virtual machine 210.sub.A
issues a read operation (e.g., containing a logical block address
from the virtual address space of the virtual machine from which to
read data) to its disk, virtual machine file system 220.sub.A
receives the read operation, obtains a physical address from the
B-tree that corresponds to a previous update command 405 (e.g.,
from a prior completed write operation) stored in the log structure
that contains the requested data, and retrieves the data for
virtual machine 210.sub.A. Instead of a B-tree data structure,
other similar tree or search data structure, such as but not
limited to lookup tables, radix trees and the like, may be
used.
FIG. 4B depicts the internal data structure of an update operation
of a disk image, according to one or more embodiments of the
present invention. An update operation stored in disk image 300,
such as update operation 410 in FIG. 4B, contains a header portion
415 and data portion 420. Header portion 415 includes an id entry
425 that stores a public unique identification or id for the update
operation, a "parent" id entry 430 that stores a private unique id
of the preceding update operation stored in the log of disk image
300, and data information entry 435 that stores descriptive
information about data portion 420 (e.g., amount of data, address
locations, etc.).
In one embodiment of the present invention, a replicated
decentralized storage system, such as that depicted in FIGS. 2 and
3, performs replication of a primary data image to secondary
servers in a manner that avoids split-brain scenarios. A
split-brain scenario can occur, for example, if the network
connections of server 200.sub.A fail, but virtual machine 210.sub.A
of server 200.sub.A continues to otherwise operate normally and
issue write operations that are stored as update operations in
primary data image 300. Because server 200.sub.A is no longer
accessible by any other server in the cluster, in one embodiment, a
designated server responsible for failover management may conclude
that server 200.sub.A has failed and therefore instruct server
200.sub.B to failover virtual machine 210.sub.A utilizing its
secondary disk image 305. In the event that the network connections
for 200.sub.A are subsequently restored, two different
instantiations of virtual machine 210.sub.A will be running on
servers 200.sub.A and 200.sub.B. Furthermore, the respective disk
images 300 and 305 for virtual machine 210.sub.A in server
200.sub.A and server 200.sub.B will not be properly synchronized.
In order to prevent such split-brain situations, in which secondary
servers inappropriately update their secondary replicas of a data
image, a virtual machine file system of the primary server,
according to an embodiment of the present invention, employs a
master secret token that is known only to the primary server to
ensure that only update operations propagated by the primary server
are accepted by the secondary servers.
FIG. 5 depicts a flow chart for replicating a primary data image on
secondary servers, according to one or more embodiments of the
present invention. While the steps of the flow chart reference
structures of FIGS. 2, 3, 4A, and 4B, it should be recognized that
any other network architectures, virtualization servers, disk image
formats and update operation structures that are consistent with
the teachings herein may be used in conjunction with the flow chart
of FIG. 5. In step 500, virtual machine file system 220.sub.A of
primary server 200.sub.A receives a write operation from virtual
machine 210.sub.A. In step 505, virtual machine file system
220.sub.A generates a private unique id for an update operation for
the write operation. In one embodiment, the private unique id is
generated by hashing a bitwise intersection of the primary server's
200.sub.A master secret token, a parent id relating to the
preceding update operation (stored as the last entry in the primary
and secondary disk images), and the data for the write operation
(or otherwise combining the data, parent id, master secret token in
an alternative bitwise fashion such as concatenation, XOR, etc.), H
(s|parent|data), where H is a cryptographic one way hash function
such as SHA-1 or SHA-256, s is the master secret token, and parent
is the parent id. In step 510, the private unique id is then hashed
again (e.g., with the same or a different hashing function,
depending upon the embodiment) to obtain a public unique id, H
(H(s|parent|data)). In step 515, virtual machine file system
220.sub.A obtains a stored copy of the previous private unique id
generated from the previous update operation stored in primary disk
image 300. In step 520, virtual machine file system 220.sub.A
constructs an update operation structure corresponding to the
received write operation in which: (i) id entry 425 of the update
operation structure is the public unique id generated in step 510;
(ii) parent id entry 430 of the update operation structure is the
previous private unique id obtained in step 515; and (iii) the data
of the update operation structure is the data of the received write
operation. In step 525, virtual machine file system 220.sub.A
appends the update operation structure to the end of primary disk
image 300. In step 530, virtual machine file system 220.sub.A
further transmits the update operation structure to each of
secondary servers 200.sub.B, 200.sub.C, and 200.sub.H. In one
embodiment, the update operation structure is transmitted to the
secondary servers using HTTP or other similar network communication
protocols. In step 535, virtual machine file system 220.sub.A
replaces the stored copy of the previous private unique id obtained
in step 515 with the private unique id of the current update
operation generated in step 505 (i.e., H (s|parent|data), not
H(H(s|parent|data)). In step 540, virtual machine file system
220.sub.A obtains the physical address corresponding to the
appended update operation in primary disk image 300 and inserts the
physical address into its B-tree data structure such that the
physical address can be found by providing the logical block
address of the write operation to the B-tree data structure.
In step 545, the virtual machine file system for each of the
secondary servers receives the update operation structure. In step
550, each virtual machine file system of the secondary servers
extracts the parent id entry 430, which is the private unique id of
the previous update operation, known only to primary server
200.sub.A prior to transmission of the update operation structure
to the secondary servers in step 530, from the received update
operation structure and generates, in step 555, a hash of the
parent id entry 430. In step 560, each virtual machine file system
of the secondary servers extracts the id entry 425 from the last
update operation in its secondary disk image replica. Similar to
the id entry 425 of the update operation structure constructed in
step 520, id entry 425 extracted in step 560 is the public unique
id that was created by virtual machine file system 220.sub.A for
the prior update operation. In step 565, if the generated hashed
parent id equals the public unique id stored as the id entry 425 of
the last update operation of the secondary disk image, then in step
570, the virtual machine file system of the secondary server
confirms that the received update operation structure originated
from primary server 220.sub.A and appends the received update
operation structure to the end of its secondary data image
(respectively, 305, 310 and 315 for primary disk image 300). In
step 575, the virtual machine file system of the secondary server
obtains the physical address corresponding to the appended update
operation in the secondary data image and inserts the physical
address into its B-tree data structure. However, if, in step 565,
the generated hashed parent id does not equal the public unique id
stored as the id entry 425 of the last update operation of the
secondary disk image, then the received update operation structure
is rejected in step 580.
The steps depicted in FIG. 5 ensure that only update operations
generated by the primary server will be accepted and appended by
secondary servers to their respective secondary disk images.
Specifically, only the virtual machine file system of primary
server possesses a copy of the current update operation's private
unique id that can be provided as a parent id in a subsequent
update operation. All other secondary servers can only obtain the
corresponding public unique id that is stored as id entry 425 of
the update operation in the secondary disk image. To further
illustrate the relationship between update operations, FIG. 6
depicts a sequence of update operations to a data image, according
to one or more embodiments of the present invention. While update
operations in FIG. 6 have been illustrated with only the id entry
425, parent id entry 430 and data portion 420 for exemplary
purposes, it should be recognized that update operations, in
accordance with one or more embodiments of the invention, may
include additional fields and information, including, for example,
data information entry 435. As previously discussed, the primary
server keeps a memory buffer 600 that stores the current private
unique id corresponding to the last entry of the primary data
image. This is the stored copy of the private unique id that is
obtained in step 515 and subsequently replaced in step 535. Of
note, this stored copy of the current private unique id is an
unhashed version of the public unique id that is generated in step
510 and stored in the id entry 425 of the corresponding update
operation. For example, if a current private unique id is
H(s|parent|data), then id entry 425 for the corresponding update
operation in the primary and secondary disk images contains a
derived public unique id,H (H(s|parent|data)). As is apparent due
to the nature of hash functions, only a primary server has access
to private unique id stored in buffer 600 and no other server in a
cluster, including the secondary servers that have access to the
corresponding public unique id in id entry 425 of the last update
operation in their secondary disk images, can determine or
otherwise derive the private unique id stored in buffer 600. Update
operation U.sub.0 of
FIG. 6 represents a first update operation of a disk image that is
currently stored on the primary disk images and all secondary disk
images. A private unique id 605, H (s|data.sub.0) , is generated by
the virtual memory file system as in step 505 and then hashed, in
step 510, prior to being stored as a public unique id in the id
entry 425 of update operation U.sub.0. Private unique id 605 is
then subsequently stored in memory buffer 600 of primary server in
step 535. Parent id entry 430 of update operation U.sub.0 is NULL
because it is the first update operation for the disk image. The
primary server generates the next update operation U.sub.1 by
creating a new private unique id 610 by hashing that intersection
of its master secret token s, the new data for the update operation
U.sub.1, and the current id, id.sub.0, stored in buffer 600, H
(s|id.sub.0|data.sub.1) , where id.sub.0 is H (s|data.sub.0) . The
parent id entry 430 of update operation U.sub.1 is the id.sub.0.
When update operation U.sub.1 is forwarded to the secondary servers
in step 530, the secondary servers are able to confirm that update
operation U.sub.1 originates from primary server by verifying in
step 565 that the hash of the parent id of received update
operation U.sub.1, H (id.sub.0) , is equal to the id entry 425 of
currently stored update operation U.sub.0, H (H (s|data.sub.0)).
The primary server generates another update operation U.sub.2 by
creating a new private unique id 615 by hashing that intersection
of its master secret token s, the new data for the update operation
U.sub.2, and the current id, id.sub.1, stored in buffer 600,
H(H(s|id.sub.1|data.sub.2)) , where id.sub.1 is H (s|id.sub.0
data.sub.1). The parent id entry 430 of update operation U.sub.2 is
the id.sub.1. When update operation U.sub.2 is forwarded to the
secondary servers in step 530, the secondary servers are able to
confirm that update operation U.sub.2 originates from primary
server by verifying in step 565 that the hash of the parent id of
received update operation U.sub.2, H (s|id.sub.0 |data.sub.1)), is
equal to the id entry 425 of currently stored update operation
U.sub.1, H (H (s|id.sub.0|data.sub.1)).
To avoid losing the master secret token in the event that a primary
server fails, one or more embodiments of the present invention
utilize a secret sharing protocol to distribute the master secret
token across other servers in a manner that does not actually
reveal the master secret token. FIG. 7 depicts a flow chart for
sharing a master secret token across a number of servers, according
to one or more embodiments of the present invention. In step 700, a
virtual machine file system of a primary server, such as virtual
machine file system 220.sub.A, generates a master secret token, s,
to be used to propagate update operations to secondary servers to
be stored in secondary disk images, for example, in accordance with
the flow of FIG. 5. Prior to utilizing the master secret token s
(e.g., in accordance with the flow of FIG. 5), in step 705, the
virtual memory file system divides the master secret token s into n
parts or shares. The n shares have a characteristic that the
combination of any threshold number t of the n shares can recreate
the master secret token s. In step 710, the virtual memory file
system of the primary server distributes each of the n shares to a
different server in the cluster. It should be recognized that known
secret sharing techniques such as Shamir's secret sharing,
Blakley's secret sharing and other similar secret sharing methods
may be used to divide and reconstruct master secret token s in
accordance with embodiments of the invention.
Upon a failure of primary server 200.sub.A, as in step 715, a
secondary server, such as secondary server 200.sub.B, may recognize
the failure of primary server 200.sub.A in step 720. For example,
in one embodiment, a designated server with failover management
responsibilities may inform secondary server 200.sub.B of the
failure of primary server 200.sub.A and instruct secondary server
200.sub.B to become the new primary server and initiate failover
procedures. In an alternative embodiment, secondary server
200.sub.B may itself discover the failure of primary server
200.sub.A (i.e., using its own monitoring capabilities) and
initiate voting procedures, for example, by utilizing Lamport's
Paxos algorithm or similar known voting algorithms, to become the
new primary server, potentially competing with other secondary
servers that have also recognized the failure of the primary server
and initiated their own voting procedures to become the new primary
server. For example, in step 725, secondary server 200.sub.B issues
a request to other servers in the cluster for their respective
shares of the master secret token s possessed by failed primary
server 200.sub.A. In steps 730 and 735, secondary server 200.sub.B
continues to receive master secret token shares until it has
received a threshold t of master secret token shares. In an
embodiment having competing secondary servers, another secondary
server may obtain the threshold t of master secret token shares
before secondary server 200.sub.B, for example, if the secondary
servers follow the rules of acceptance in accordance with Lamport's
Paxos algorithm or similar algorithms. In step 740, secondary
server 200.sub.B is able to generate master secret token s from the
t shares. In step 745, secondary server 200.sub.B generates a
correct parent id for a new update operation by hashing the
intersection of master secret token s, the parent id of the last
update operation in its secondary disk image, and the data from the
last update operation: H (s|parent|data). In step 750, secondary
server 200.sub.B notifies all the other secondary servers that it
has assumed responsibilities as the new primary server by
transmitting a "view-change" update operation that contains the
correct version of the parent id generated in step 745. In step
755, the secondary server 200.sub.B instantiates a new virtual
machine and associates it with its secondary disk image for the
failed virtual machine of the failed primary server, assumes
responsibility as the new primary server and generates and
subsequently propagates a newly generated master key token by
returning to step 700.
It should be recognized that various modifications and changes may
be made to the specific embodiments described herein without
departing from the broader spirit and scope of the invention as set
forth in the appended claims. For example, although the foregoing
embodiments have described in the context of updating virtual
machine disk images in a replicated and decentralized
virtualization data center, it should be recognized that any system
having any log files or objects (or files or object that may be
structured as logs according to the teachings herein) that are
replicated over multiple computers or devices may utilize the
techniques disclosed herein to ensure exclusive access to such file
or object. Similarly, alternative embodiments may transmit other
types of operations to be appended into a disk image instead of or
in addition to update operations. For example, one embodiment may
include a "branch" and a delete operation, where the branch
operation enables a new disk image to be created based on the
current disk image without requiring knowledge of the master secret
token such that any server in the cluster can request the creation
of such a new disk image (for example, for snapshotting purposes)
and the delete operation enables the deletion of an entire disk
image. Alternative embodiments may utilize other techniques to
generate a unique id. For example, rather than creating a hash of
the intersection of the master secret token, parent id and current
data, alternative embodiments may create a hash of the intersection
of the master secret token and the current data or the parent id,
or generate a unique id in any other manner consistent with its use
as described herein. In one embodiment, the unique id may be a 160
bit value. In another alternative embodiment, a virtual machine
file system may utilize a 64 bit indexed B-tree that tracks entire
extents rather than individual block locations. Server clusters of
alternative embodiments may employ a combination of shared storage,
such as a SAN, and local storage in the servers themselves. For
example, in one such embodiment, a primary server both stores a
primary disk image for a virtual machine on a SAN such that other
servers networked to the SAN can failover the virtual machine, and
also propagates update operations corresponding to the virtual
machine to secondary disk images in the local storage units of
other secondary servers in order to provide additional safeguards
in the event of a failure of the SAN. In yet another alternative
embodiment, each server of a cluster includes its own local storage
and is also networked to a shared SAN. Severs in such an embodiment
may utilize local storage consistent with the teachings herein and
access the SAN in the event that its local storage fails or is
otherwise full. Alternatively, servers in such an embodiment may
utilize the SAN as its primary storage and resort to local storage
only upon a failure of the SAN. It should be recognized that
various other combinations of using both a shared storage and local
storage units may be utilized consistent with the teachings
herein.
The various embodiments described herein may employ various
computer-implemented operations involving data stored in computer
systems. For example, these operations may require physical
manipulation of physical quantities usually, though not
necessarily, these quantities may take the form of electrical or
magnetic signals where they, or representations of them, are
capable of being stored, transferred, combined, compared, or
otherwise manipulated. Further, such manipulations are often
referred to in terms, such as producing, identifying, determining,
or comparing. Any operations described herein that form part of one
or more embodiments of the invention may be useful machine
operations. In addition, one or more embodiments of the invention
also relate to a device or an apparatus for performing these
operations. The apparatus may be specially constructed for specific
required purposes, or it may be a general purpose computer
selectively activated or configured by a computer program stored in
the computer. In particular, various general purpose machines may
be used with computer programs written in accordance with the
teachings herein, or it may be more convenient to construct a more
specialized apparatus to perform the required operations.
The various embodiments described herein may be practiced with
other computer system configurations including hand-held devices,
microprocessor systems, microprocessor-based or programmable
consumer electronics, minicomputers, mainframe computers, and the
like.
One or more embodiments of the present invention may be implemented
as one or more computer programs or as one or more computer program
modules embodied in one or more computer readable media. The term
computer readable medium refers to any data storage device that can
store data which can thereafter be input to a computer system
computer readable media may be based on any existing or
subsequently developed technology for embodying computer programs
in a manner that enables them to be read by a computer. Examples of
a computer readable medium include a hard drive, network attached
storage (NAS), read-only memory, random-access memory (e.g., a
flash memory device), a CD (Compact Discs) CD-ROM, a CD-R, or a
CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other
optical and non-optical data storage devices. The computer readable
medium can also be distributed over a network coupled computer
system so that the computer readable code is stored and executed in
a distributed fashion.
Although one or more embodiments of the present invention have been
described in some detail for clarity of understanding, it will be
apparent that certain changes and modifications may be made within
the scope of the claims. Accordingly, the described embodiments are
to be considered as illustrative and not restrictive, and the scope
of the claims is not to be limited to details given herein, but may
be modified within the scope and equivalents of the claims. In the
claims, elements and/or steps do not imply any particular order of
operation, unless explicitly stated in the claims.
Plural instances may be provided for components, operations or
structures described herein as a single instance. Finally,
boundaries between various components, operations and data stores
are somewhat arbitrary, and particular operations are illustrated
in the context of specific illustrative configurations. Other
allocations of functionality are envisioned and may fall within the
scope of the invention(s). In general, structures and functionality
presented as separate components in exemplary configurations may be
implemented as a combined structure or component. Similarly,
structures and functionality presented as a single component may be
implemented as separate components. These and other variations,
modifications, additions, and improvements may fall within the
scope of the appended claims(s).
* * * * *