U.S. patent application number 13/602,779 was filed with the patent office on September 4, 2012 and published on 2014-03-06 as publication number 20140068127 for a shared locking mechanism for storage centric leases.
This patent application is currently assigned to Red Hat Israel, Ltd. The applicants listed for this patent are Ayal Baron, Federico Simoncelli, and Eduardo Warszawski. Invention is credited to Ayal Baron, Federico Simoncelli, and Eduardo Warszawski.
Application Number: 20140068127 / 13/602,779
Family ID: 50189083
Publication Date: 2014-03-06

United States Patent Application 20140068127
Kind Code: A1
Baron; Ayal; et al.
March 6, 2014
SHARED LOCKING MECHANISM FOR STORAGE CENTRIC LEASES
Abstract
A computing device receives a request from a host for a shared
lock on a resource. The computing device obtains an exclusive lock
on the resource using a locking data structure that is stored on
a storage domain. The computing device subsequently obtains a
shared lock on the resource for the host by writing a flag to the
locking data structure, wherein the flag indicates that the host
has the shared lock on the resource. The computing device then
releases the exclusive lock on the resource.
Inventors: Baron; Ayal (Kiryat Ono, IL); Simoncelli; Federico (Fano, IT); Warszawski; Eduardo (Kfar Saba, IL)

Applicants: Baron; Ayal (Kiryat Ono, IL); Simoncelli; Federico (Fano, IT); Warszawski; Eduardo (Kfar Saba, IL)

Assignee: Red Hat Israel, Ltd. (Raanana, IL)

Family ID: 50189083

Appl. No.: 13/602,779

Filed: September 4, 2012

Current U.S. Class: 710/200

Current CPC Class: G06F 9/526 20130101; G06F 2209/523 20130101; G06F 2209/522 20130101

Class at Publication: 710/200

International Class: G06F 12/14 20060101 G06F012/14
Claims
1. A method comprising: receiving, by a processing device, a
request from a host for a shared lock on a resource; obtaining, by
the processing device, an exclusive lock on the resource for the
host using a locking data structure that is stored on a storage
domain; subsequently obtaining a shared lock on the resource for
the host by writing a flag to the locking data structure, wherein
the flag indicates that the host has the shared lock on the
resource; and releasing the exclusive lock on the resource after
obtaining the shared lock.
2. The method of claim 1, wherein the exclusive lock is obtained
and released using a Paxos locking algorithm.
3. The method of claim 1, wherein writing the flag to the locking
data structure comprises setting a bit in a bitmap, wherein the set
bit is associated with the resource and the host.
4. The method of claim 1, wherein the storage domain comprises a
block domain and the locking data structure comprises a logical
volume on the block domain, and wherein the logical volume
comprises a plurality of regions, each of the plurality of regions
being associated with a different resource and comprising a
plurality of subregions, a first subregion comprising a shared lock
bitmap identifying hosts having a shared lock on the resource.
5. The method of claim 1, wherein the storage domain comprises a
file domain and the locking data structure comprises a file in the
file domain associated with the resource, the file comprising a
plurality of regions, one of the plurality of regions comprising a
shared lock bitmap identifying hosts having a shared lock on the
resource.
6. The method of claim 1, further comprising: prior to obtaining
the shared lock for the host, obtaining for that host a lock to a
host identifier (ID) in a lock space associated with the storage
domain, wherein the host ID uniquely identifies the host to the
storage domain.
7. The method of claim 1, further comprising: receiving a request
from the host for an exclusive lock on a second resource;
momentarily granting the exclusive lock on the second resource to
the host; reading the locking data structure to determine whether
any hosts have shared locks on the second resource; and in response
to determining that a second host has a shared lock on the second
resource and that an additional criterion is satisfied, revoking
the exclusive lock from the host and providing a fail result to the
host.
8. The method of claim 7, wherein the additional criterion is that
the second host is responsive, the method further comprising:
checking whether the second host is non-responsive; and in response
to determining that the second host is non-responsive, revoking the
shared lock from the second host and granting the exclusive lock to
the host.
9. A computer readable storage medium having instructions that,
when executed by a processing device, cause the processing device
to perform a method comprising: receiving, by the processing
device, a request from a host for a shared lock on a resource;
obtaining, by the processing device, an exclusive lock on the
resource for the host using a locking data structure that is stored
on a storage domain; subsequently obtaining a shared lock on the
resource for the host by writing a flag to the locking data
structure, wherein the flag indicates that the host has the shared
lock on the resource; and releasing the exclusive lock on the
resource after obtaining the shared lock.
10. The computer readable storage medium of claim 9, wherein the
exclusive lock is obtained and released using a Paxos locking
algorithm.
11. The computer readable storage medium of claim 9, wherein
writing the flag to the locking data structure comprises setting a
bit in a bitmap, wherein the set bit is associated with the
resource and the host.
12. The computer readable storage medium of claim 9, wherein the
storage domain comprises a block domain and the locking data
structure comprises a logical volume on the block domain, and
wherein the logical volume comprises a plurality of regions, each
of the plurality of regions being associated with a different
resource and comprising a plurality of subregions, a first
subregion comprising a shared lock bitmap identifying hosts having
a shared lock on the resource.
13. The computer readable storage medium of claim 9, wherein the
storage domain comprises a file domain and the locking data
structure comprises a file in the file domain associated with the
resource, the file comprising a plurality of regions, one of the
plurality of regions comprising a shared lock bitmap identifying
hosts having a shared lock on the resource.
14. The computer readable storage medium of claim 9, the method
further comprising: prior to obtaining the shared lock for the
host, obtaining for that host a lock to a host identifier (ID) in a
lock space associated with the storage domain, wherein the host ID
uniquely identifies the host to the storage domain.
15. The computer readable storage medium of claim 9, the method
further comprising: receiving a request from the host for an
exclusive lock on a second resource; momentarily granting the
exclusive lock on the second resource to the host; reading the
locking data structure to determine whether any hosts have shared
locks on the second resource; and in response to determining that a
second host has a shared lock on the second resource and that an
additional criterion is satisfied, revoking the exclusive lock from
the host and providing a fail result to the host.
16. The computer readable storage medium of claim 15, wherein the
additional criterion is that the second host is responsive, the
method further comprising: checking whether the second host is
non-responsive; and in response to determining that the second host
is non-responsive, revoking the shared lock from the second host
and granting the exclusive lock to the host.
17. An apparatus comprising: a memory; and a processing device
coupled to the memory, wherein the processing device is configured
to: receive a request from a host for a shared lock on a resource;
obtain an exclusive lock on the resource for the host using a
locking data structure that is stored on a storage domain;
subsequently obtain a shared lock on the resource for the host by
writing a flag to the locking data structure, wherein the flag
indicates that the host has the shared lock on the resource; and
release the exclusive lock on the resource after obtaining the
shared lock.
18. The apparatus of claim 17, wherein the processing device is
further configured to: prior to obtaining the shared lock for the
host, obtain for that host a lock to a host identifier (ID) in a
lock space associated with the storage domain, wherein the host ID
uniquely identifies the host to the storage domain.
19. The apparatus of claim 17, wherein the processing device is
further configured to: receive a request from the host for an
exclusive lock on a second resource; momentarily grant the
exclusive lock on the second resource to the host; read the locking
data structure to determine whether any hosts have shared locks on
the second resource; and in response to determining that a second
host has a shared lock on the second resource and that an additional
criterion is satisfied, revoke the exclusive lock from the host and
provide a fail result to the host.
20. The apparatus of claim 19, wherein the additional criterion is
that the second host is responsive, and wherein the processing
device is further configured to: check whether the second host is
non-responsive; and in response to determining that the second host
is non-responsive, revoke the shared lock from the second host and
grant the exclusive lock to the host.
Description
TECHNICAL FIELD
[0001] Embodiments of the present invention relate to locking of
shared storage, and more specifically to a locking mechanism that
manages shared locks using reserved spaces on shared storage that
represent locking states for resources.
BACKGROUND
[0002] For shared storage, all hosts on a cluster may have access
to the same data. Multiple hosts may attempt to read from or write
to specific resources at the same time. This can create errors and
cause the hosts to malfunction or crash due to unsynchronized
resources. Accordingly, locking managers are used to assign locks
(also known as leases) to individual hosts. This ensures that while
a host with a lock is using a resource, another host will not be
able to modify the resource. Additionally, lock managers may manage
locks for types of resources other than shared storage.
[0003] Some locking mechanisms manage locks on resources using the
storage domain that stores those resources. However, such locking
mechanisms typically only provide exclusive locks to the resources.
One such locking mechanism is the Paxos protocol.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] The present invention is illustrated by way of example, and
not by way of limitation, and can be more fully understood with
reference to the following detailed description when considered in
connection with the figures in which:
[0005] FIG. 1 illustrates an exemplary network architecture in
which embodiments of the present invention may operate.
[0006] FIG. 2 is a block diagram of a lock manager, in accordance
with one embodiment of the present invention.
[0007] FIG. 3 is a block diagram of one embodiment of a locking
data structure.
[0008] FIG. 4 is a flow diagram illustrating one embodiment of a
method for managing a shared lock for a resource on a storage
domain.
[0009] FIG. 5 is a flow diagram illustrating one embodiment of a
method for managing an exclusive lock for a resource on a storage
domain.
[0010] FIG. 6 illustrates a diagrammatic representation of a
machine in the exemplary form of a computer system.
DETAILED DESCRIPTION
[0011] Described herein is a method and system for managing locks
on shared resources (e.g., resources in a clustered environment). A
lock manager may generate one or more locking data structures that
identify resources that are locked and hosts holding the locks.
These locking data structures may be stored in storage domains,
which may be the same storage domains that store some or all of the
resources. Accordingly, a current lock state for any resource can
be determined simply by reading the locking data structure
associated with that resource in the storage domain. This enables
locking to be performed without the use of any centralized locking
mechanism, which might act as a bottleneck, could crash, etc.
Additionally, this enables different hosts with different
capabilities and configurations to all share usage of resources
using the locking data structures.
[0012] The lock manager may grant both shared locks (also referred
to as read only locks) and exclusive locks (also referred to as
read/write locks). If a host has a shared lock to a resource, then
other hosts can still read the resource and can also obtain
shared locks on the resource. However, no host can acquire an
exclusive lock for, or write to, a resource that has a shared lock
on it. In contrast, if a host has an exclusive lock to a resource,
then no other host can read from or write to the resource. Nor
can other hosts acquire shared or exclusive locks on the
resource while there is an exclusive lock on the resource.
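By way of illustration, the compatibility rules described in this paragraph can be condensed into a short check. The following is a minimal Python sketch; the LockType enumeration and the is_compatible function are illustrative names, not part of the described embodiments.

    from enum import Enum

    class LockType(Enum):
        SHARED = "shared"        # read-only lock; many holders allowed
        EXCLUSIVE = "exclusive"  # read/write lock; single holder only

    def is_compatible(requested, held):
        """Return True if the requested lock may coexist with held locks.

        A free resource accepts anything; an exclusive request needs a
        free resource; a shared request needs all holders to be shared.
        """
        if not held:
            return True
        if requested is LockType.EXCLUSIVE:
            return False
        return all(h is LockType.SHARED for h in held)

    # A shared request against a shared holder succeeds; an exclusive
    # request against the same state fails.
    assert is_compatible(LockType.SHARED, [LockType.SHARED])
    assert not is_compatible(LockType.EXCLUSIVE, [LockType.SHARED])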
[0013] In one embodiment, when the lock manager receives a request
from the host for a shared lock, the lock manager initially obtains
an exclusive lock to the resource for the host. The lock manager
then obtains a shared lock by writing a flag to a locking data
structure at a region that is associated with the host and the
resource in question. Subsequently, the lock manager releases the
exclusive lock on the resource. Thereafter, any other host can also
obtain a shared lock on the resource, following the same procedure.
Once the host no longer needs the shared lock, it may signal the
lock manager to release the lock. In response, the lock manager may
remove the flag from the region of the locking data structure
associated with the host and the resource.
[0014] In the following description, numerous details are set
forth. It will be apparent, however, to one skilled in the art,
that the present invention may be practiced without these specific
details. In some instances, well-known structures and devices are
shown in block diagram form, rather than in detail, in order to
avoid obscuring the present invention.
[0015] FIG. 1 illustrates an exemplary network architecture 100, in
which embodiments of the present invention may operate. The network
architecture 100 includes a cluster of host machines 105 (also
referred to as "cluster 105") coupled to one or more clients
120-122 over a network 115. The network 115 may be a private
network (e.g., a local area network (LAN), a wide area network
(WAN), intranet, etc.), a public network (e.g., the Internet), or a
combination thereof. The cluster 105 includes multiple host
machines 130, 131, with each host machine 130, 131 including one or
more virtual machines 145-148. The cluster 105 is also coupled to
shared storage 110. The shared storage 110 may include one or more
mass storage devices (e.g., disks), which may form a storage pool
shared by all of the host machines 130-131 in the cluster 105.
[0016] In one embodiment, the shared storage 110 is a network-based
storage system, such as network attached storage (NAS), a storage
area network (SAN), or other storage system. Another example of a
network-based storage system is cloud storage (also known as
storage as a service (SaaS)), as provided by Amazon.RTM. Simple
Storage Service (S3.RTM.), Rackspace.RTM. Cloud Storage, etc.
Network-based storage systems are commonly used for a variety of
purposes, such as providing multiple users with access to shared
data, backing up critical data (e.g., by data mirroring), etc.
[0017] Shared storage 110 may include multiple different storage
domains 125-128. Each storage domain may be a physically or
logically distinct storage device. Storage domains 125-128 may be
block domains that handle data at a block level. Such storage
domains may be accessible via small computer system interface
(SCSI), internet small computer system interface (iSCSI), Fibre
Channel Protocol (FCP), ATA over Ethernet (AoE), or other block I/O
protocols. Storage domains 125-128 may also be file domains that
handle data at a file level. Such storage domains may include a
file system such as, for example, a network file system (NFS), a
common internet file system (CIFS), a fourth extended file system
(EXT4), an XFS file system, a hierarchical file system (HFS), a
BTRFS file system, or other file system.
[0018] Each storage domain 125-128 may contain locking data
structures for a collection of resources. The locking data
structures may be reserved spaces on shared storage that represent
locking states for the resources. The resources may include data
and objects that are stored in the storage domains 125-128 as well
as logical resources that are not stored in any storage domain. One
standard type of resource is a virtual disk image. However,
resources may be any type of object, such as anything that can be
stored in shared storage and/or anything whose lock state can be
managed via the shared storage. Some resources may be a single
file, set of files or sequence of data (e.g., a contiguous or
non-contiguous set of blocks in a block device) that contains the
contents and structure representing the resource (e.g., a virtual
image). Examples of resources include libraries, files, logical
volumes, processes, threads, roles, capabilities, services, and so
on. For example, the cluster may have a single storage pool manager
(SPM) that is responsible for performing operations in the cluster
such as moving data, changing configurations, and so forth. Any
host may assume the role of the SPM by acquiring an exclusive lock
on the SPM resource. The actual composition of the resources on the
storage domains 125-128 may depend on a storage type for that
storage domain (e.g., whether or not it includes a file system, a
type of file system, etc.) as well as a type of resource.
[0019] Each host machine 130-131 may be a rackmount server, a
workstation, a desktop computer, a notebook computer, a tablet
computer, a mobile phone, a palm-sized computing device, a personal
digital assistant (PDA), etc. The host machines 130-131 include
host hardware, which includes one or more processing devices,
memory, and/or additional devices such as a graphics card, hardware
RAID controller, network controller, hard disk drive, universal
serial bus (USB) device, internal input/output (I/O) device,
keyboard, mouse, speaker, etc.
[0020] Each host machine 130-131 may include a hypervisor 135 (also
known as a virtual machine monitor (VMM)) that emulates the
underlying hardware platform for the virtual machines 145-148. In
one embodiment, hypervisor 135 is a component of a host operating
system (OS). Alternatively, the hypervisor 135 may run on top of a
host OS, or may run directly on host hardware without the use of a
host OS.
[0021] The hypervisor 135 manages system resources, including
access to memory, devices, storage devices (e.g., shared storage),
and so on. The hypervisor 135, though typically implemented in
software, may emulate and export a bare machine interface (host
hardware) to higher level software. Such higher level software may
comprise a standard or real-time operating system (OS), may be a
highly stripped down operating environment with limited operating
system functionality, may not include traditional OS facilities,
etc. The hypervisor 135 presents to other software (i.e., "guest"
software) the abstraction of one or more virtual machines (VMs)
145-148, which may provide the same or different abstractions to
various guest software (e.g., guest operating system, guest
applications, etc.). Some examples of hypervisors include quick
emulator (QEMU.RTM.), kernel mode virtual machine (KVM.RTM.),
VMWare.RTM. Workstation, VirtualBox.RTM., and Xen.RTM..
[0022] Each host machine 130-131 hosts any number of virtual
machines (VM) 145-148 (e.g., a single VM, one hundred VMs, etc.). A
virtual machine 145-148 is a combination of guest software that
uses an underlying emulation of the host machine 130-131 (e.g., as
provided by hypervisor 135). The guest software may include a guest
operating system, guest applications, guest device drivers, etc.
Virtual machines 145-148 can be, for example, hardware emulation,
full virtualization, para-virtualization, and operating
system-level virtualization virtual machines. The virtual machines
145-148 may have the same or different guest operating systems,
such as Microsoft.RTM. Windows.RTM., Linux.RTM., Solaris.RTM.,
etc.
[0023] Each VM 145-148 may be associated with a particular virtual
disk image or set of virtual disk images, each of which may be a
resource in a storage domain 125-128. These disk images may appear
to the virtual machine 145-148 as a contiguous block device, which
may have a file system installed thereon. The guest operating
system, guest applications, user data, and so forth may be included
in one or more of the disk images.
[0024] The clients 120-122 may include computing devices that have
a wide range of processing capabilities. Some of the clients
120-122 may be thin clients, which may have limited processing and
memory capacities. For example, a thin client may be a tablet
computer, cellular phone, personal digital assistant (PDA), a
re-purposed desktop computer, etc. Some of the clients 120-122 may
be thick (fat) clients, which have powerful CPUs and large memory.
For example, a thick client may be a dual-core or multi-core
computer, workstation, graphics workstation, etc. The clients
120-122 may run client applications such as a Web browser and a
graphic user interface (GUI). The clients 120-122 may also run
other client applications, which receive multimedia data streams or
other data from one or more host machines 130-131 and re-direct the
received data to a local display or other user interface.
[0025] Each virtual machine 145-148 can be accessed by one or more
of the clients 120-122 over the network 115. In one scenario, each
virtual machine 145-148 provides a virtual desktop for a connected
client 120-122. From the user's point of view, the virtual desktop
may function as a physical desktop (e.g., a personal computer) and
be indistinguishable from a physical desktop.
[0026] In one embodiment, the host machines 130-131 each include a
lock manager 140. The lock manager 140 may manage locks, including
both exclusive locks and shared locks, to resources in the shared
storage 110 for virtual machines 145-148 that are collocated on a
host machine 130, 131 with the lock manager 140. Lock manager 140
may also manage locks for the host machine 130, 131 on which it is
located, as well as other applications or processes running on the
host machine 130, 131.
[0027] Lock manager 140 manages locks by maintaining locking data
structures in the storage domains 125-128. Each storage domain
125-128 may include multiple different locking data structures. The
locking data structures may include flags that identify specific
hosts to which specific resources are locked. Any lock manager 140
may read the locking data structures to determine whether a
particular resource has an exclusive lock, a shared lock, or is
free of locks. Additionally, any lock manager 140 may read the
locking data structures to determine which hosts hold the locks to
the various resources.
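As a rough illustration, the read-only classification that any lock manager 140 performs when reading a locking data structure might look like the following Python sketch, where the two inputs are in-memory copies of a resource's exclusive lock flags and shared lock bitmap (the names and representation are assumptions for illustration).

    def lock_state(exclusive_flags, shared_bitmap):
        """Classify a resource as 'exclusive', 'shared', or 'free'.

        exclusive_flags[i] is True if host ID i holds an exclusive lock;
        shared_bitmap[i] is True if host ID i holds a shared lock.
        """
        if any(exclusive_flags):
            return "exclusive"
        if any(shared_bitmap):
            return "shared"
        return "free"

    # Example: one shared holder, no exclusive holders.
    assert lock_state([False, False], [False, True]) == "shared"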
[0028] When a virtual machine 145-148 is to be loaded onto a host
machine, lock manager 140 may obtain a lock for the host to the
appropriate resources (e.g., disk images) associated with that
virtual machine. Some virtual machines may be associated with a
chain of disk images (e.g., a disk image that refers back to one or
more previous disk images). A first disk image in a chain may be a
live image to which changes may be recorded. All other disk images
in the chain may be read only snapshots. The lock manager 140 may
obtain shared locks for each of the snapshots in the chain, and
obtain an exclusive lock for the live image in the chain.
Therefore, hosts may still be able to run virtual machines that
depend on one or more snapshots in the chain while the host holds
the shared locks to those snapshots. This enables multiple virtual
machines that are based on the same snapshot or set of snapshots to
be run in parallel without generating copies of the snapshots.
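A sketch of how a lock manager might lock such a chain when loading a virtual machine follows; acquire_shared and acquire_exclusive stand in for the locking procedures described later in this document, and the ordering convention assumed for the chain is stated in the docstring.

    def lock_image_chain(chain, acquire_shared, acquire_exclusive):
        """Lock a disk image chain for one virtual machine.

        chain is assumed to be ordered [live_image, snapshot_1, ...,
        snapshot_n]. Only the live image is writable, so it alone
        receives an exclusive lock; snapshots receive shared locks so
        that other VMs depending on the same snapshots can run
        concurrently.
        """
        live, snapshots = chain[0], chain[1:]
        for snapshot in snapshots:
            acquire_shared(snapshot)
        acquire_exclusive(live)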
[0029] FIG. 2 is a block diagram of a lock manager 205, in
accordance with embodiments of the present invention. In one
embodiment, lock manager 205 corresponds to lock manager 140 of
FIG. 1. In one embodiment, lock manager 205 includes an exclusive
locking module 245, a shared locking module 250, a lock space
module 255 and a locking data structure (LDS) generating module
248. Alternatively, the functionality of one or more of the
exclusive locking module 245, shared locking module 250, lock space
module 255 and LDS generating module 248 may be subdivided into
multiple modules or may be combined into a single module.
[0030] LDS generating module 248 generates locking data structures
in storage domains 210, 215 for representing a lock state of
resources on those storage domains 210, 215. For storage resources
(e.g., virtual disk images, files, etc.), a locking data structure
may be generated in the same storage domain that contains the
resources being managed. However, the locking data structure may
not be contiguous with those managed resources in the storage
domain. There are two standard classes of storage domains 210, 215
on which LDS generating module 248 may create locking data
structures. A first class of storage domain is a block domain 210.
A block domain 210 includes a block level storage device that may
be accessed using protocols such as SCSI, iSCSI, Fibre Channel, and
so forth. A block domain 210 is divided into a sequence of blocks.
Each block may represent a contiguous (or in some instances
non-contiguous) region of memory (e.g., a sequence of bytes or
bits). Typically, each block in a block level storage domain 210
will be equally sized. For example, each block may be 512 bytes, 1
megabyte (MB), 2 MB, 4 MB or another size.
[0031] A block domain 210 may be divided into a collection of
logical volumes by a logical volume manager (not shown). Each
logical volume may be a region of storage that is virtual and
logically separated from an underlying physical storage device.
Each logical volume may contain one or more resources and/or other
information such as a locking data structure. Each logical volume
may include a specified number of blocks. In one embodiment, block
level storage domain 210 includes a separate logical volume for
each disk image (referred to herein as a disk image logical volume
224-226). A disk image may be a single file, set of files or
sequence of data (e.g., a contiguous or non-contiguous set of
blocks in a block device) that contains the contents and structure
representing a storage device such as a hard drive. Each disk image
may contain all the information that defines a particular virtual
machine.
[0032] In one embodiment, LDS generating module 248 generates two
locking data structures that together can fully identify a locking
state of every host that is attached to the block level storage
domain 210 and of every resource (e.g., disk image logical volumes
224-228) that is contained in the block domain 210.
[0033] A first locking data structure may be a lock space logical
volume 220. Each storage domain may be associated with a particular
lock space. Each host that wants access to resources in the storage
domain (e.g., in block domain 210) first registers with the lock
space for that storage domain. A host may register to the lock
space by acquiring a lock on a particular host identifier (ID) in
the lock space. The lock space logical volume 220 may contain an
entry for each host that is registered to the block level storage
domain 210 indicating the host ID associated with that host. This
host ID may be used to uniquely identify the host on the lock
space. Additionally, each entry may include a timestamp that
indicates a last time that the host having a particular host ID
renewed its lease (also known as lock) on the host ID. In one
embodiment, each block in the lock space logical volume 220 is
associated with a particular host ID.
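One plausible on-disk encoding of a lock space entry, consistent with the description above (one block per host ID holding the host's identity and renewal timestamp), is sketched below; the exact field layout and struct format are assumptions for illustration.

    import struct
    import time

    BLOCK_SIZE = 512  # one lock space block per host ID (assumed)

    # Hypothetical entry layout: 64-byte host address, 8-byte renewal
    # timestamp, 8-byte expiration period in seconds.
    ENTRY_FORMAT = "<64sQQ"

    def pack_lockspace_entry(host_address, expiry_seconds):
        """Build the block written when a host registers to a lock space."""
        entry = struct.pack(ENTRY_FORMAT, host_address.encode()[:64],
                            int(time.time()), expiry_seconds)
        return entry.ljust(BLOCK_SIZE, b"\x00")

    def entry_offset(host_id):
        """Byte offset of a host ID's entry in the lock space volume."""
        return host_id * BLOCK_SIZE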
[0034] Lock space module 255 may be responsible for managing the
lock space logical volume 220. When a host requests access to block
domain 210, lock space module 255 may obtain a lock to a particular
host ID for that host, and may write information identifying that
host to a block in the lock space logical volume 220 associated
with the particular host ID. The information written to the block
may include an address of a host machine that is registering to the
lock space of the block domain 210. The information may also
include a time stamp indicating when the host registered to the
block level storage domain and/or an expiration period indicating a
time and date when the lease on the host ID will expire. The host
may periodically renew its lease on the host ID, causing the
expiration period to be reset. If a host's lease on a host ID
expires, lock space module 255 may revoke that host's lock on the
host ID.
[0035] In one embodiment, lock space module 255 acquires a delta
lease on a lock space associated with the particular host ID for
the host. A delta lease is relatively slow to acquire, and may
involve a regular exchange of messages (e.g., input/output
operations) to shared storage to confirm that the host is alive.
Acquiring a delta lease involves performing reads and writes to a
particular sector (e.g., a block) of storage separated by specific
delays. Once acquired, a delta lease is periodically renewed by
updating a timestamp in the block. Granting leases to host IDs
prevents two hosts from using the same host ID and provides basic
host liveliness information based on the renewals.
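A highly simplified sketch of the renewal side of a delta lease appears below; the interval and timeout values are assumptions, and the read-delay-reread sequence used during acquisition is omitted.

    import time

    RENEWAL_INTERVAL = 20  # seconds between renewals (assumed)
    LEASE_TIMEOUT = 80     # staleness threshold before lapse (assumed)

    def renew_delta_lease(write_timestamp):
        """Keep a delta lease alive by periodically rewriting its timestamp.

        write_timestamp is a callable that writes the given time into the
        host's block in the lock space logical volume.
        """
        while True:
            write_timestamp(int(time.time()))
            time.sleep(RENEWAL_INTERVAL)

    def lease_is_live(last_timestamp):
        """Other hosts treat a recently renewed lease as evidence of life."""
        return time.time() - last_timestamp < LEASE_TIMEOUT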
[0036] The second locking data structure that LDS generating module
248 creates in the block domain 210 is a resource management
logical volume 222. The resource management logical volume 222
maintains lease information on each resource in or associated with
the block domain 210. The resource management logical volume 222
may contain a separate region for each resource stored in the block
level storage domain 210. Each region may be divided into a series
of subregions, with some or all subregions being associated with
particular host IDs. In one embodiment, each subregion is a block
(e.g., 512 bytes in one embodiment) in the block level storage
domain 210. The subregions may include a first subregion that
includes a leader block, a second subregion that includes a shared
lock bitmap, and an additional separate subregion for each host ID
that is used to track exclusive locks. In one embodiment, the
additional separate subregions are used for the Paxos exclusive
locking algorithm. For subregions associated with particular host
IDs (e.g., each block in the resource management logical volume
222), a flag indicating that a particular host ID has an exclusive
lock on a particular resource may be written. Additionally, the
shared lock bitmap subregion may include flags indicating
particular hosts having shared locks on the resource.
[0037] In an example, there may be 2000 possible resources, the
block size may be 512 bytes, and there may be 2000 host IDs. Note
that other numbers of resources and host IDs and other block sizes
are also possible. In this example, the resource management logical
volume 222 may be 2 gigabytes (GB), with 2000 1 MB regions, each
having 2000 separate 512 byte subregions. In one embodiment, each
region of the resource management logical volume 222 includes 1
leader block, 4 blocks for the shared flags bitmap, and 2000
exclusive blocks, one for each host. Bytes at offsets 513-1024 may
correspond to a first
region and subregion associated with a particular resource and a
particular host. If these bytes have a first state, this may
indicate that the particular host has an exclusive lock on the
particular resource. Additionally, four blocks (or a different
number of blocks) within the resource management logical volume 222
may represent a shared lock bitmap. Each bit in the shared lock
bitmap may be a shared lock flag that is associated with a
particular host and resource pair. If the bit has a first state
(e.g., a 1), this may indicate that the particular host has a
shared lock on the particular resource. If the bit has a second
state (e.g., a 0), this may indicate that the particular host does
not have a shared lock on the particular resource.
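Using the numbers from this example (512-byte blocks, 1 MB regions, one leader block, a 4-block shared lock bitmap, then one exclusive block per host ID), the byte offsets might be computed as sketched below. The ordering of subregions within a region is an assumption; the description does not fix an exact layout.

    BLOCK = 512
    REGION = 1 << 20  # 1 MB per resource region

    def leader_offset(resource_id):
        return resource_id * REGION                # block 0 of the region

    def shared_bitmap_offset(resource_id):
        return resource_id * REGION + BLOCK        # blocks 1-4: bitmap

    def exclusive_block_offset(resource_id, host_id):
        # blocks 5..2004: one exclusive-lock block per host ID (assumed)
        return resource_id * REGION + (5 + host_id) * BLOCK

    def shared_bit_position(host_id):
        """(byte within the bitmap, bit index) of a host's shared flag."""
        return host_id // 8, host_id % 8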
[0038] Note that in an alternative embodiment, two different
resource management logical volumes may be maintained. A first
resource management logical volume may be maintained to track and
manage shared locks on resources, and a second resource management
logical volume may be maintained to track and manage exclusive
locks on the resources.
[0039] Exclusive locking module 245 and shared locking module 250
may be responsible for managing the resource management logical
volume 222. In one embodiment, exclusive locking module 245
acquires exclusive locks for hosts by writing exclusive lock flags
to an appropriate region in the resource management logical volume
associated with a host ID leased to the host and a resource ID for
the resource in question. Exclusive locking module 245 may first
check the resource management logical volume 222 to determine if
any other hosts have an exclusive or shared lock on the resource
before obtaining the exclusive lock for the host. If any other
hosts have exclusive or shared locks on the resource, then
exclusive locking module 245 may return a failure to a host that
requested the lock. In one embodiment, the Paxos protocol is
followed for performing exclusive locking.
[0040] In one embodiment, shared locking module 250 acquires shared
locks for hosts by writing shared lock flags to an appropriate
region in the resource management logical volume 222 associated
with a host ID leased to the host and a resource ID for the
resource in question. The appropriate region may be a particular
bit in a shared lock bitmap. Shared locking module 250 may first
check the resource management logical volume 222 to determine if
any other hosts have an exclusive lock on the resource before
obtaining the shared lock for the host. If any other hosts have an
exclusive lock on the resource, then shared locking module 250 may
return a failure to the requesting host. In one embodiment, shared
locking module 250 first briefly obtains an exclusive lock to the
resource before acquiring a shared lock to the resource. This may
ensure that no other hosts acquire an exclusive lock to the
resource while the shared locking module 250 is in the process of
acquiring a shared lock for the resource. Once the shared lock is
successfully obtained, then shared locking module 250 may release
the exclusive lock on the resource.
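The bracketing described in this paragraph, i.e., take the exclusive lock, set the shared flag, release the exclusive lock, might look like the sketch below; the storage accessor and the two lock primitives are hypothetical stand-ins for the structures described above.

    def acquire_shared(storage, resource_id, host_id,
                       acquire_exclusive, release_exclusive):
        """Obtain a shared lock by briefly holding the exclusive lock.

        Holding the exclusive lock while the shared bit is written
        ensures no other host can obtain an exclusive lock midway
        through the acquisition. All callables are hypothetical.
        """
        if not acquire_exclusive(resource_id, host_id):
            return False  # another host currently holds the exclusive lease
        try:
            bitmap = bytearray(storage.read_shared_bitmap(resource_id))
            byte, bit = host_id // 8, host_id % 8
            bitmap[byte] |= 1 << bit  # set this host's shared lock flag
            storage.write_shared_bitmap(resource_id, bytes(bitmap))
            return True
        finally:
            release_exclusive(resource_id, host_id)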
[0041] In one embodiment, shared locking module 250 and exclusive
locking module 245 acquire locks on resources using Paxos leases.
Paxos leases are generally fast to acquire, and may be made
available to hosts as general purpose resource leases. Acquiring a
Paxos lease involves reads and writes to sectors associated with a
maximum number of hosts in a specific sequence specified by the
Disk Paxos algorithm.
[0042] As mentioned, a second class of storage domain is a file
domain 215. A file domain includes a file system over a block
device. Accordingly, data is managed as files as opposed to as
blocks. In a file domain 215, each resource may be represented as a
file or directory (e.g., disk image files 234-236). LDS generating
module 248 may generate a lock space file 230 that is used to
manage lock spaces on the file system level storage domain 215. The
lock space file 230 may operate similarly to the lock space logical
volume, except that it is structured as a file rather than a
logical volume. LDS generating module 248 may additionally generate
a separate resource management (RM) file 238-240 for each resource
in the file system level storage domain 215. Each RM file 238-240
may include a separate region for each host ID. Each region in an
RM file 238-240 may have exclusive lock flags written thereto. Each
RM file 238-240 may additionally include a shared lock bitmap
indicating which hosts have a shared lock on the resource. In an
alternative embodiment, LDS generating module 248 may generate a
single RM file that contains locking information for all resources
on the file domain 215, similar to the resource management logical
volume 222.
[0043] As previously mentioned, lock space module 255 may determine
whether hosts that have locks to host IDs are still alive. When a
host requests a shared lock to a resource, lock space module 255
may determine whether the host having the exclusive lock to the
resource is still alive. This may include consulting the lock space
logical volume 220 or lock space file 230 (as appropriate) to
determine if that host recently updated its lease on the host ID
assigned to the host. If not, then lock space module 255 may query
that host. If no reply is received in response to the query, lock
space module 255 may determine that the host with the exclusive
lock is dead or otherwise unresponsive. Alternatively, lock space
module 255 may assume that a remote host watchdog killed the
machine if it wasn't able to renew the lease. Lock space module 255
may then release the exclusive lock to that host and release the
lock to the host ID for the host. This frees other hosts to be able
to use the freed host ID, and additionally frees hosts to be able
to acquire shared or exclusive locks on the resource.
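The liveness test just described might be sketched as follows, assuming each lock space entry records a last renewal time and an expiration period, and representing a direct query of the host as an optional callback.

    import time

    def host_is_alive(last_renewal, expiry, query_host=None):
        """Decide whether a lock-holding host is still alive.

        A lease renewed within its expiration period means the host is
        alive. If the lease has lapsed, the host may be queried directly
        before it is declared dead or otherwise unresponsive.
        """
        if time.time() - last_renewal < expiry:
            return True
        if query_host is not None and query_host():
            return True
        return False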
[0044] Lock space module 255 may perform a similar procedure if a
request for an exclusive lock to a resource that already has an
exclusive lock or shared lock to another host is received.
Additionally, lock space module 255 may periodically or
continuously perform checks to confirm the liveness of hosts (e.g.,
as per the delta lease algorithm).
[0045] FIG. 3 is a block diagram of one embodiment of a locking
data structure (LDS) 305. In one embodiment, the locking data
structure 305 corresponds to resource management logical volume 222
of FIG. 2. Alternatively, the locking data structure may correspond
to a resource management file. As shown, the LDS 305 includes a
series of resource regions 1 through N (labeled as resource 1
region 310 to resource N region 315). Each resource region includes
a sequence of host subregions 1 through M (labeled as host 1
subregion 320 through host M subregion 325 and host 1 subregion 330
through host M subregion 335). Each resource region 310, 315
additionally includes a shared lock subregion 342, 344, and may
also include a leader subregion (not shown).
[0046] Each host subregion may include an exclusive lock flag 340,
350, 360, 370. Solid lines indicate that a flag is set in a host
subregion, while dashed lines indicate a placeholder where a flag
may be set. In the illustrated embodiment, within resource region N
315, host 1 subregion 330 has an exclusive lock flag 360 set in
resource N region 315, indicating that host 1 has an exclusive lock
on resource N. However, no hosts have shared locks on resource N
(e.g., no bits are set in shared lock subregion 344). In contrast,
within resource 1 region 310 no hosts have exclusive locks, but
shared lock subregion 342 has two shared lock flags set (host 1
flag 380 and host M flag 382) indicating that two hosts have shared
locks on resource 1.
[0047] FIGS. 4-5 are flow diagrams showing various methods for
managing locking by reading and writing to blocks and/or files on
shared storage. These methods enable multiple hosts to share
storage and resources without communicating with one another via a
network. The hosts may have synchronized access to resources simply
by reading from and writing to particular data structures in the
shared storage. The methods may be performed by a computer system
that may comprise hardware (e.g., circuitry, dedicated logic,
programmable logic, microcode, etc.), software (e.g., instructions
run on a processing device to perform hardware simulation), or a
combination thereof. In one embodiment, at least some operations of
the methods are performed by the lock manager 205 of FIG. 2.
[0048] FIG. 4 is a flow diagram illustrating one embodiment of a
method for obtaining a shared lock to a resource for a host. At
block 405 of method 400, processing logic receives a request from a
host for a shared lock on a resource. The host may be a host
machine or a particular process (e.g., a virtual machine) running
on a host machine.
[0049] At block 410, processing logic determines whether any other
hosts have an exclusive lock on the resource. This check may be
performed by reading a locking data structure. The locking data
structure may be stored on the same storage domain as the resource,
and may include a region associated with the resource. If the
region in the locking data structure associated with the resource
includes an exclusive lock flag (e.g., if the Paxos algorithm
detects the presence of an exclusive lock), then processing logic
may determine that another host already has an exclusive lock on
the resource. If another host has an exclusive lock on the
resource, the method proceeds to block 415. Otherwise, the method
continues to block 430.
[0050] At block 415, processing logic determines whether the other
host with the exclusive lock to the resource is alive. The
exclusive lock flag in the locking data structure may indicate a
host ID that has a lock on the resource. Processing logic may read
a second locking data structure that grants locks to host IDs.
Processing logic may look up the host ID that has the exclusive
lock on the resource in the second locking data structure. This may
reveal a time stamp indicating a last time that the host associated
with that host ID refreshed its lock to the host ID and/or an
expiration period. If the expiration period has lapsed, then the
host may no longer be alive. If the host is still alive, the method
proceeds to block 445, and processing logic reports to the host
that no shared lock was obtained. If the other host is not alive,
the method continues to block 420.
[0051] At block 420, processing logic releases the exclusive lock
on the resource for the other host. In one embodiment, processing
logic kills the other host. Alternatively, a host may kill itself
(e.g., using a watchdog application) after not being able to renew
its liveness on the storage for longer than the timeout. Processing
logic may also release a lock for the other host to a specific host
ID. The method then continues to block 430.
[0052] At block 430, processing logic obtains an exclusive lock on
the resource for the host. This may include writing to a region of
a locking data structure. The region that is written to may be
associated with a host ID that was granted to the host and to the
specific resource that the lock is to be obtained for. In one
embodiment, processing logic writes an exclusive lock flag to the
region.
[0053] At block 435, processing logic obtains a shared lock on the
resource for the host. In one embodiment, processing logic writes a
shared lock flag to a region of the locking data structure
associated with the resource and the host. This may include setting
an appropriate bit in a shared lock bitmap. At block 440,
processing logic then releases the exclusive lock on the resource
(e.g., by removing the exclusive lock flag from an exclusive lock
region of the locking data structure associated with the host and
the resource).
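Putting the blocks of method 400 together, an end-to-end sketch might read as follows; lds stands for a hypothetical locking-data-structure accessor, and every method on it is a placeholder for the behavior described in the preceding paragraphs.

    def method_400(resource, host, lds):
        """FIG. 4 flow: obtain a shared lock on a resource for a host."""
        other = lds.exclusive_holder(resource)           # block 410
        if other is not None:
            if lds.host_is_alive(other):                 # block 415
                return "fail: no shared lock obtained"   # block 445
            lds.release_exclusive(resource, other)       # block 420
            lds.release_host_id(other)
        lds.acquire_exclusive(resource, host)            # block 430
        lds.set_shared_flag(resource, host)              # block 435
        lds.release_exclusive(resource, host)            # block 440
        return "shared lock obtained"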
[0054] FIG. 5 is a flow diagram illustrating one embodiment of a
method for obtaining an exclusive lock to a resource for a host. At
block 505 of method 500, processing logic receives a request from a
host for an exclusive lock on a resource.
[0055] At block 510, processing logic determines whether any other
hosts have an exclusive lock on the resource. This check may be
performed by reading a region of a locking data structure that is
associated with the resource. If the region in the locking data
structure associated with the resource includes an exclusive lock
flag, then processing logic may determine that another host already
has an exclusive lock on the resource. If another host has an
exclusive lock on the resource, the method proceeds to block 520.
Otherwise, the method continues to block 512.
[0056] At block 512, processing logic temporarily or momentarily
obtains an exclusive lock to the resource for the host (e.g., by
setting bits at an offset in a logical volume or file, where the
offset is associated with a host ID assigned to the host and to the
resource).
[0057] At block 515, processing logic determines whether any other
hosts have shared locks on the resource. This check may also be
performed by reading a region of a locking data structure that is
associated with the resource. If the region in the locking data
structure associated with the resource includes any shared lock
flags, then processing logic may determine that one or more other
hosts have a shared lock on the resource. If another host has a
shared lock on the resource, the method proceeds to block 520.
Otherwise, the method continues to block 535.
[0058] At block 520, processing logic determines whether any of the
other hosts with exclusive locks or shared locks to the resource
are alive. Processing logic may read a second locking data
structure that grants locks to host IDs. Processing logic may look
up the host IDs for hosts that have locks on the resource in the
second locking data structure. For each such host, processing logic
may determine whether the host is alive either from a time stamp
and/or expiration period associated with that host or by querying
the host. If any host with a lock to the resource is still alive,
the method proceeds to block 545. If none of the hosts with locks
on the resource are alive, the method continues to block 525.
[0059] At block 545, processing logic revokes the temporary
exclusive lock on the resource. At block 548, processing logic
reports to the host that requested the exclusive lock that no
exclusive lock was obtained.
[0060] At block 525, processing logic releases the exclusive lock
or shared lock (or locks) on the resource for the other host or
hosts. Processing logic may also release a lock for the other hosts
to specific host IDs for the shared flags. The method then
continues to block 535.
[0061] At block 535, processing logic leaves the exclusive lock in
place (e.g., extends the temporary exclusive lock to a standard
exclusive lock). The method then ends.
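Method 500 might similarly be sketched as below, with the same caveats; the momentary exclusive lock of block 512 either becomes the granted lock (block 535) or is revoked (block 545), depending on the shared lock bitmap and the liveness checks.

    def method_500(resource, host, lds):
        """FIG. 5 flow: obtain an exclusive lock on a resource for a host."""
        other = lds.exclusive_holder(resource)                # block 510
        if other is not None:
            if lds.host_is_alive(other):                      # block 520
                return "fail: no exclusive lock obtained"     # block 548
            lds.release_exclusive(resource, other)            # block 525
            lds.release_host_id(other)
        lds.acquire_exclusive(resource, host)                 # block 512
        holders = lds.shared_holders(resource)                # block 515
        if any(lds.host_is_alive(h) for h in holders):        # block 520
            lds.release_exclusive(resource, host)             # block 545
            return "fail: no exclusive lock obtained"         # block 548
        for h in holders:                                     # block 525
            lds.clear_shared_flag(resource, h)
            lds.release_host_id(h)
        return "exclusive lock obtained"                      # block 535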
[0062] FIG. 6 illustrates a diagrammatic representation of a
machine in the exemplary form of a computer system 600 within which
a set of instructions, for causing the machine to perform any one
or more of the methodologies discussed herein, may be executed. The
computer system 600 may correspond to host machines 130-131 of FIG. 1.
In embodiments of the present invention, the machine may be
connected (e.g., networked) to other machines in a Local Area
Network (LAN), an intranet, an extranet, or the Internet. The
machine may operate in the capacity of a server or a client machine
in a client-server network environment, or as a peer machine in a
peer-to-peer (or distributed) network environment. The machine may
be a personal computer (PC), a tablet PC, a set-top box (STB), a
Personal Digital Assistant (PDA), a cellular telephone, a web
appliance, a server, a network router, switch or bridge, or any
machine capable of executing a set of instructions (sequential or
otherwise) that specify actions to be taken by that machine.
Further, while only a single machine is illustrated, the term
"machine" shall also be taken to include any collection of machines
(e.g., computers) that individually or jointly execute a set (or
multiple sets) of instructions to perform any one or more of the
methodologies discussed herein.
[0063] The exemplary computer system 600 includes a processing
device 602, a main memory 604 (e.g., read-only memory (ROM), flash
memory, dynamic random access memory (DRAM) such as synchronous
DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 606
(e.g., flash memory, static random access memory (SRAM), etc.), and
a secondary memory 616 (e.g., a data storage device), which
communicate with each other via a bus 608.
[0064] The processing device 602 represents one or more
general-purpose processing devices such as a microprocessor,
central processing unit, or the like. The processing device may
include multiple processors. The processing device 602 may include
a complex instruction set computing (CISC) microprocessor, reduced
instruction set computing (RISC) microprocessor, very long
instruction word (VLIW) microprocessor, processor implementing
other instruction sets, or processors implementing a combination of
instruction sets. The processing device 602 may also be one or more
special-purpose processing devices such as an application specific
integrated circuit (ASIC), a field programmable gate array (FPGA),
a digital signal processor (DSP), network processor, or the
like.
[0065] The computer system 600 may further include a network
interface device 622. The computer system 600 also may include a
video display unit 610 (e.g., a liquid crystal display (LCD) or a
cathode ray tube (CRT)), an alphanumeric input device 612 (e.g., a
keyboard), a cursor control device 614 (e.g., a mouse), and a
signal generation device 620 (e.g., a speaker).
[0066] The secondary memory 616 may include a machine-readable
storage medium (or more specifically a computer-readable storage
medium) 624 on which is stored one or more sets of instructions 654
embodying any one or more of the methodologies or functions
described herein (e.g., virtual disk image manager 690, which may
correspond to lock manager 205 of FIG. 2). The instructions 654 may
also reside, completely or at least partially, within the main
memory 604 and/or within the processing device 602 during execution
thereof by the computer system 600; the main memory 604 and the
processing device 602 also constituting machine-readable storage
media.
[0067] While the computer-readable storage medium 624 is shown in
an exemplary embodiment to be a single medium, the term
"computer-readable storage medium" should be taken to include a
single medium or multiple media (e.g., a centralized or distributed
database, and/or associated caches and servers) that store the one
or more sets of instructions. The term "computer-readable storage
medium" shall also be taken to include any medium that is capable
of storing or encoding a set of instructions for execution by the
machine that cause the machine to perform any one or more of the
methodologies of the present invention. The term "computer-readable
storage medium" shall accordingly be taken to include, but not be
limited to, solid-state memories, and optical and magnetic
media.
[0068] The computer system 600 may additionally include a lock
manager module (not shown) for implementing the
functionalities of the lock manager 205. The modules,
components and other features described herein (for example in
relation to FIG. 1) can be implemented as discrete hardware
components or integrated in the functionality of hardware
components such as ASICS, FPGAs, DSPs or similar devices. In
addition, the modules can be implemented as firmware or functional
circuitry within hardware devices. Further, the modules can be
implemented in any combination of hardware devices and software
components, or only in software.
[0069] Some portions of the detailed descriptions which follow are
presented in terms of algorithms and symbolic representations of
operations on data bits within a computer memory. These algorithmic
descriptions and representations are the means used by those
skilled in the data processing arts to most effectively convey the
substance of their work to others skilled in the art. An algorithm
is here, and generally, conceived to be a self-consistent sequence
of steps leading to a desired result. The steps are those requiring
physical manipulations of physical quantities. Usually, though not
necessarily, these quantities take the form of electrical or
magnetic signals capable of being stored, transferred, combined,
compared, and otherwise manipulated. It has proven convenient at
times, principally for reasons of common usage, to refer to these
signals as bits, values, elements, symbols, characters, terms,
numbers, or the like.
[0070] It should be borne in mind, however, that all of these and
similar terms are to be associated with the appropriate physical
quantities and are merely convenient labels applied to these
quantities. Unless specifically stated otherwise, as apparent from
the following discussion, it is appreciated that throughout the
description, discussions utilizing terms such as "receiving",
"obtaining", "determining", "releasing", "performing", or the like,
refer to the action and processes of a computer system, or similar
electronic computing device, that manipulates and transforms data
represented as physical (electronic) quantities within the computer
system's registers and memories into other data similarly
represented as physical quantities within the computer system
memories or registers or other such information storage,
transmission or display devices.
[0071] Embodiments of the present invention also relate to an
apparatus for performing the operations herein. This apparatus may
be specially constructed for the required purposes, or it may
comprise a general purpose computer system selectively programmed
by a computer program stored in the computer system. Such a
computer program may be stored in a computer readable storage
medium, such as, but not limited to, any type of disk including
floppy disks, optical disks, CD-ROMs, and magnetic-optical disks,
read-only memories (ROMs), random access memories (RAMs), EPROMs,
EEPROMs, magnetic disk storage media, optical storage media, flash
memory devices, other type of machine-accessible storage media, or
any type of media suitable for storing electronic instructions,
each coupled to a computer system bus.
[0072] The algorithms and displays presented herein are not
inherently related to any particular computer or other apparatus.
Various general purpose systems may be used with programs in
accordance with the teachings herein, or it may prove convenient to
construct more specialized apparatus to perform the required method
steps. The required structure for a variety of these systems will
appear as set forth in the description above. In addition, the
present invention is not described with reference to any particular
programming language. It will be appreciated that a variety of
programming languages may be used to implement the teachings of the
invention as described herein.
[0073] It is to be understood that the above description is
intended to be illustrative, and not restrictive. Many other
embodiments will be apparent to those of skill in the art upon
reading and understanding the above description. Although the
present invention has been described with reference to specific
exemplary embodiments, it will be recognized that the invention is
not limited to the embodiments described, but can be practiced with
modification and alteration within the spirit and scope of the
appended claims. Accordingly, the specification and drawings are to
be regarded in an illustrative sense rather than a restrictive
sense. The scope of the invention should, therefore, be determined
with reference to the appended claims, along with the full scope of
equivalents to which such claims are entitled.
* * * * *