U.S. patent application number 14/866037 was filed with the patent
office on 2015-09-25 and published on 2017-03-30 as publication
number US 2017/0091107 A1 for cache load balancing by reclaimable
block migration. This patent application is currently assigned to
DELL PRODUCTS, L.P. The applicant listed for this patent is Dell
Products, L.P. The invention is credited to Noelan Olson and Scott
Peterson.
Application Number: 14/866037
Publication Number: US 2017/0091107 A1
Family ID: 58407272
United States Patent Application
Kind Code: A1
Peterson; Scott; et al.
March 30, 2017
CACHE LOAD BALANCING BY RECLAIMABLE BLOCK MIGRATION
Abstract
Systems and methods for cache load balancing by reclaimable
block migration are described. In some embodiments, a computer
system may include a processor; and a memory coupled to the
processor, the memory having program instructions stored thereon
that, upon execution by the processor, cause the computer system
to: maintain a first list of reclaimable blocks that reside in a
first caching device and a first advertised age for the oldest
reclaimable block of the first list; maintain a second list of
reclaimable blocks that reside in a second caching device and a
second advertised age for the oldest reclaimable block of the
second list; determine that the second advertised age is older than
the first advertised age; and cause the oldest reclaimable block on
the first list to be migrated from the first caching device to the
second caching device.
Inventors: Peterson; Scott (Beaverton, OR); Olson; Noelan (Portland, OR)
Applicant: Dell Products, L.P., Round Rock, TX, US
Assignee: DELL PRODUCTS, L.P., Round Rock, TX
Family ID: 58407272
Appl. No.: 14/866037
Filed: September 25, 2015
Current U.S. Class: 1/1
Current CPC Class: G06F 2212/1024 (20130101); G06F 12/0888 (20130101);
G06F 2212/502 (20130101); G06F 9/5083 (20130101); G06F 12/123
(20130101); G06F 2212/601 (20130101); G06F 12/084 (20130101); G06F
12/0806 (20130101); G06F 12/0813 (20130101); G06F 16/172 (20190101)
International Class: G06F 12/08 (20060101)
Claims
1. A computer system, comprising: a processor; and a memory coupled
to the processor, the memory having program instructions stored
thereon that, upon execution by the processor, cause the computer
system to: maintain a first list of reclaimable blocks that reside
in a first caching device and a first advertised age for the oldest
reclaimable block of the first list; maintain a second list of
reclaimable blocks that reside in a second caching device and a
second advertised age for the oldest reclaimable block of the
second list; determine that the second advertised age is older than
the first advertised age; and cause the oldest reclaimable block on
the first list to be migrated from the first caching device to the
second caching device.
2. The computer system of claim 1, wherein all reclaimable blocks
have no read or write references, are clean, and have no
corresponding replica blocks.
3. The computer system of claim 1, wherein the first advertised age
indicates how long ago the oldest reclaimable block of the first
list became reclaimable, and wherein the second advertised age
indicates how long ago the oldest reclaimable block of the second
list became reclaimable.
4. The computer system of claim 1, wherein the program
instructions, upon execution by the processor, further cause the
computer system to determine that a difference between the first
and second advertised ages is greater than a threshold time value
prior to the migration.
5. The computer system of claim 1, wherein the program
instructions, upon execution by the processor, further cause the
computer system to add a reference to the migrated reclaimable
block to the second list and to remove a reference to the migrated
reclaimable block from the first list.
6. The computer system of claim 5, wherein the reference to the
migrated reclaimable block is added to the second list among other
references to other reclaimable blocks in order by age.
7. The computer system of claim 1, wherein the program
instructions, upon execution by the processor, further cause the
computer system to cause the second caching device to discard the
oldest reclaimable block of the second list prior to the migration
in response to the second caching device being full.
8. The computer system of claim 1, wherein the program
instructions, upon execution by the processor, further cause the
computer system to determine that the first caching device has a
storage utilization above a minimum threshold level prior to the
migration.
9. The computer system of claim 7, wherein the program
instructions, upon execution by the processor, further cause the
computer system to determine that the second caching device has a
storage utilization below a maximum threshold level prior to the
migration.
10. The computer system of claim 1, wherein the program
instructions, upon execution by the processor, further cause the
computer system to update the first and second advertised ages over
time.
11. A memory device having program instructions stored thereon
that, upon execution by a processor of a computer system, cause the
computer system to: maintain a first list of reclaimable blocks
that reside in a first caching device and a second list of
reclaimable blocks that reside in a second caching device, wherein
all reclaimable blocks have no read or write references, are clean,
and have no corresponding replica blocks; determine that the second
caching device is under-utilized with respect to the first caching
device; migrate at least one reclaimable block from the first
caching device to the second caching device; update the first and
second lists of reclaimable blocks; maintain a first advertised age
for the oldest reclaimable block of the first list, wherein the
first advertised age indicates how long ago the oldest reclaimable
block of the first list became reclaimable; maintain a second
advertised age for the oldest reclaimable block of the second list,
wherein the second advertised age indicates how long ago the oldest
reclaimable block of the second list became reclaimable; and
determine that the second advertised age is older than the first
advertised age.
12. The memory device of claim 11, wherein the second caching
device is under-utilized with respect to the first caching device
when the first caching device has less available storage space than
the second caching device.
13. (canceled)
14. The memory device of claim 11, wherein the program
instructions, upon execution by the processor, further cause the
computer system to determine that a difference between the first
and second advertised ages is greater than a threshold time value
prior to the migration.
15. The memory device of claim 11, wherein the program
instructions, upon execution by the processor, further cause the
computer system to add a reference to the migrated reclaimable
block to the second list and to remove a reference to the migrated
reclaimable block from the first list after the migration.
16. The memory device of claim 15, wherein the reference to the
migrated reclaimable block is added to the second list among other
references to other reclaimable blocks in order by age.
17. The memory device of claim 11, wherein the program
instructions, upon execution by the processor, further cause the
computer system to determine that the first caching device has a
storage utilization above a minimum threshold level and that the
second caching device has a storage utilization below a maximum
threshold level prior to the migration.
18. A method, comprising: in a clustered memory cache having a
plurality of caching devices, maintaining lists of reclaimable blocks
that reside in each caching device, wherein all reclaimable blocks
have no read or write references, are clean, and have no
corresponding replica blocks; migrating a subset of the reclaimable
blocks among two or more caching devices to effect a load balancing
operation; maintaining a first list of reclaimable blocks that
reside in a first caching device and a second list of reclaimable
blocks that reside in a second caching device; determining that the
second caching device is under-utilized with respect to the first
caching device; maintaining a first advertised age for the oldest
reclaimable block of the first list, wherein the first advertised
age indicates how long ago the oldest reclaimable block of the
first list became reclaimable; maintaining a second advertised age
for the oldest reclaimable block of the second list, wherein the
second advertised age indicates how long ago the oldest reclaimable
block of the second list became reclaimable; determining that the
second advertised age is older than the first advertised age;
migrating at least one reclaimable block from the first caching
device to the second caching device; and updating the first and
second lists of reclaimable blocks.
19. (canceled)
20. (canceled)
Description
FIELD
[0001] This disclosure relates generally to computer systems, and
more specifically, to systems and methods for cache load balancing
by reclaimable block migration.
BACKGROUND
[0002] As the value and use of information continues to increase,
individuals and businesses seek additional ways to process and
store information. One option available to these users is an
information handling system. An information handling system
generally processes, compiles, stores, and/or communicates
information or data for business, personal, or other purposes
thereby allowing users to take advantage of the value of the
information. Because technology and information handling needs and
requirements vary between different users or applications,
information handling systems may vary with respect to the type of
information handled; the methods for handling the information; the
methods for processing, storing or communicating the information;
the amount of information processed, stored, or communicated; and
the speed and efficiency with which the information is processed,
stored, or communicated. The variations in information handling
systems allow for information handling systems to be general or
configured for a specific user or specific use such as financial
transaction processing, airline reservations, enterprise data
storage, or global communications. In addition, information
handling systems may include or comprise a variety of hardware and
software components that may be configured to process, store, and
communicate information and may include one or more computer
systems, data storage systems, and networking systems.
[0003] The information handling system may include one or more
operating systems. An operating system serves many functions, such
as controlling access to hardware resources and controlling the
execution of application software. Operating systems also provide
resources and services to support application software. These
resources and services may include a file system, a centralized
configuration database (such as the registry found in Microsoft
Windows operating systems), a directory service, a graphical user
interface, a networking stack, device drivers, and device
management software. In some instances, services may be provided by
other application software running on the information handling
system, such as a database server.
[0004] Some information handling systems are designed to interact
with other information handling systems over a computer network
connection. In particular, certain information handling systems may
be designed to monitor, configure, and adjust the features,
functionality, and software of other information handling systems
by communicating with those information handling systems over a
network connection. For example, one information handling system
might be configured to manage a shared, distributed cache.
SUMMARY
[0005] Embodiments of systems and methods for cache load balancing by
reclaimable block migration in an Information Handling System (IHS)
are described herein. In an illustrative, non-limiting embodiment, a
computer system may include a processor; and a memory coupled to
the processor, the memory having program instructions stored
thereon that, upon execution by the processor, cause the computer
system to: maintain a first list of reclaimable blocks that reside
in a first caching device and a first advertised age for the oldest
reclaimable block of the first list; maintain a second list of
reclaimable blocks that reside in a second caching device and a
second advertised age for the oldest reclaimable block of the
second list; determine that the second advertised age is older than
the first advertised age; and cause the oldest reclaimable block on
the first list to be migrated from the first caching device to the
second caching device.
[0006] In various embodiments, all reclaimable blocks have no read
or write references, are clean, and have no corresponding replica
blocks. The first advertised age may indicate how long ago the
oldest reclaimable block of the first list became reclaimable, and
the second advertised age may indicate how long ago the oldest
reclaimable block of the second list became reclaimable.
[0007] The computer system may determine that a difference between
the first and second advertised ages is greater than a threshold
time value prior to the migration. The program instructions may
further cause the computer system to add a reference to the
migrated reclaimable block to the second list and to remove a
reference to the migrated reclaimable block from the first list. In
some cases, the reference to the migrated reclaimable block is
added to the second list among other references to other
reclaimable blocks in order by age. Additionally or alternatively,
the second caching device may discard the oldest reclaimable block
of the second list prior to the migration in response to the second
caching device being full.
[0008] The computer system may determine that the first caching
device has a storage utilization above a minimum threshold level
prior to the migration. Additionally or alternatively, the computer
system may determine that the second caching device has a storage
utilization below a maximum threshold level prior to the migration.
The program instructions, upon execution by the processor, may
further cause the computer system to update the first and second
advertised ages over time.
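By way of illustration, a minimal sketch of this age-based migration
decision follows. It assumes each caching device keeps its reclaimable
blocks ordered oldest-first and tracks a simple block-count
utilization; the class names, thresholds, and interfaces below are
hypothetical and not part of the application.

import time
from collections import deque

class CachingDevice:
    """Hypothetical model of one caching device's reclaimable-block bookkeeping."""
    def __init__(self, name, capacity_blocks):
        self.name = name
        self.capacity_blocks = capacity_blocks
        self.used_blocks = 0
        # Entries are (block_id, reclaimable_since); the oldest sits at the left.
        self.reclaimable = deque()

    def advertised_age(self, now=None):
        """Seconds since the oldest reclaimable block became reclaimable (0 if none)."""
        if not self.reclaimable:
            return 0.0
        now = time.time() if now is None else now
        return now - self.reclaimable[0][1]

    def utilization(self):
        return self.used_blocks / self.capacity_blocks

def maybe_migrate(first, second, age_threshold=5.0, min_util=0.5, max_util=0.9):
    """Migrate the first device's oldest reclaimable block to the second device when
    the second advertised age is older than the first by more than a threshold and
    the utilization checks described above are satisfied."""
    if not first.reclaimable:
        return None
    if first.utilization() < min_util or second.utilization() > max_util:
        return None
    if second.advertised_age() - first.advertised_age() <= age_threshold:
        return None
    block_id, since = first.reclaimable.popleft()   # remove the reference from the first list
    first.used_blocks -= 1
    # Add a reference to the second list, keeping the list ordered by age.
    second.reclaimable.append((block_id, since))
    second.reclaimable = deque(sorted(second.reclaimable, key=lambda entry: entry[1]))
    second.used_blocks += 1
    return block_id

Moving the oldest reclaimable block from the busier first device to
the colder second device frees space where it is scarce while placing
the block where it is least likely to force an eviction.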
[0009] In another illustrative, non-limiting embodiment, a memory
device may have program instructions stored thereon that, upon
execution by a processor of a computer system, cause the computer
system to: maintain a first list of reclaimable blocks that reside
in a first caching device and a second list of reclaimable blocks
that reside in a second caching device, wherein all reclaimable
blocks have no read or write references, are clean, and have no
corresponding replica blocks; determine that the second caching
device is under-utilized with respect to the first caching device;
migrate at least one reclaimable block from the first caching
device to the second caching device; and update the first and
second lists of reclaimable blocks. For example, the second caching
device may be under-utilized with respect to the first caching
device when the first caching device has less available storage space
than the second caching device.
[0010] The program instructions may cause the computer system to,
prior to the migration: maintain a first advertised age for the oldest
reclaimable block of the first list, wherein the first advertised
age indicates how long ago the oldest reclaimable block of the
first list became reclaimable; maintain a second advertised age for
the oldest reclaimable block of the second list, wherein the second
advertised age indicates how long ago the oldest reclaimable block
of the second list became reclaimable; and determine that the
second advertised age is older than the first advertised age.
[0011] The program instructions may further cause the computer
system to determine that a difference between the first and second
advertised ages is greater than a threshold time value prior to the
migration. The computer system may add a reference to the migrated
reclaimable block to the second list and it may remove a reference
to the migrated reclaimable block from the first list after the
migration. The reference to the migrated reclaimable block may be
added to the second list among other references to other
reclaimable blocks in order by age. The computer system may also
determine that the first caching device has a storage utilization
above a minimum threshold level and that the second caching device
has a storage utilization below a maximum threshold level prior to
the migration.
[0012] In yet another illustrative, non-limiting embodiment, a
method may include, in a clustered memory cache having a plurality
of caching devices, maintaining lists of reclaimable blocks that
reside in each caching device, where all reclaimable blocks have no
read or write references, are clean, and have no corresponding
replica blocks; and migrating a subset of the reclaimable blocks
among two or more caching devices to effect a load balancing
operation.
[0013] The method may also include maintaining a first list of
reclaimable blocks that reside in a first caching device and a
second list of reclaimable blocks that reside in a second caching
device; determining that the second caching device is
under-utilized with respect to the first caching device; migrating
at least one reclaimable block from the first caching device to the
second caching device; and updating the first and second lists of
reclaimable blocks.
[0014] The method may further include, prior to migrating at least
one reclaimable block from the first caching device to the second
caching device: maintaining a first advertised age for the oldest
reclaimable block of the first list, wherein the first advertised
age indicates how long ago the oldest reclaimable block of the
first list became reclaimable; maintaining a second advertised age
for the oldest reclaimable block of the second list, wherein the
second advertised age indicates how long ago the oldest reclaimable
block of the second list became reclaimable; and determining that
the second advertised age is older than the first advertised
age.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The present invention(s) is/are illustrated by way of
example and is/are not limited by the accompanying figures, in
which like references indicate similar elements. Elements in the
figures are illustrated for simplicity and clarity, and have not
necessarily been drawn to scale.
[0016] FIG. 1 is a block diagram of an example network with
distributed shared memory according to some embodiments.
[0017] FIG. 2 is a block diagram of an example of a memory manager
according to some embodiments.
[0018] FIG. 3 is a block diagram of an example of another memory
manager according to some embodiments.
[0019] FIG. 4 is a block diagram of an example of a distributed
shared memory environment with a clustered memory resource
distributed across multiple network segments according to some
embodiments.
[0020] FIG. 5 is a flowchart of an example of a method for using a
distributed shared memory resource according to some
embodiments.
[0021] FIGS. 6 and 7 are block diagrams of examples of
communication stack configurations that may be employed to enable
devices to access a distributed shared memory resource according to
some embodiments.
[0022] FIG. 8 is a flowchart of an example of a method for
performing cache load balancing by reclaimable block migration
according to some embodiments.
DETAILED DESCRIPTION
[0023] FIG. 1 depicts an example computer network 20 with
distributed memory. The memory resource and supporting systems may
be configured in a variety of different ways and for different
applications. Caching is one example of a use of computer network
20. Accordingly, the distributed memory resource in the example of
FIG. 1, and in other examples discussed herein, includes a
clustered memory cache 22. Referring specifically to FIG. 1,
clustered memory cache 22 may be aggregated from and comprised of
physical memory locations 24 on a plurality of physically distinct
computing systems 26 (individually designated as Computing System
1, Computing System 2, etc.) and associated local memory managers
34 (individually designated as MM1, MM2, etc.). In particular
embodiments, physical memory 24 may include one or more solid state
devices (SSDs) including, for example, one or more SSDs compliant
with a standard such as the Peripheral Component Interconnect
Express (PCIe) standard. Physical memory 24 may include persistent
or non-volatile memory devices 24 including, for example, flash and
magnetic disk. In particular embodiments, each type of physical
memory 24 (e.g., RAM, flash, magnetic disk) on a computing system
26 may have its own local memory manager 34. Additionally, physical
memory 24 may have hot plug capabilities, such that physical memory
24 may be inserted into, removed from, or swapped between computing
systems 26 without the need for pausing the operation of computer
network 20 or clustered cache 22. Computer network 20 also includes
a metadata service 30, a plurality of clients 32 (only one of which
is shown in the example embodiment of FIG. 1), and, as described
above, a plurality of local memory managers 34 (individually
designated as MM1, MM2, etc.). In particular embodiments, metadata
service 30 may be located on one or more computing systems 26. Each
of the local memory managers 34 is local to and associated with a
different portion of clustered memory cache 22. The metadata
service, clients and local memory managers are all operatively
coupled with each other via network 40. In addition, one or more
configuration managers 42 (only one is shown in the example of FIG.
1), a policy manager 44, and an admin interface 46 may also be
provided as part of computer network 20 (and may, in particular
embodiments, be operatively coupled to other elements via network
40), to provide various functions that will be described below. In
particular embodiments, configuration manager 42 may be located on
one or more computing systems 26. Computer network 20 includes an
auxiliary store 50 which may also be coupled to other elements in
computer network 20 via network 40. Auxiliary store 50 may include
one or more storage devices or systems at various locations (local
or remote), including but not limited to hard disks, file servers,
disk arrays, storage area networks, and the like. Auxiliary store
50 may, in particular embodiments, include DAS backing devices
(used by a particular computing system 26), SAN backing devices
(shared among computing systems 26), or a combination of the
two.
[0024] Clustered memory cache 22 provides a shared memory resource
that can be accessed and used by the clients. Depending on the mode
of operation, clients 32 can read from and write to the clustered
memory cache and cause insertion and/or eviction of data items
to/from the cache.
[0025] As used herein, "client" may refer broadly to any
hardware or software entity that makes use of the shared memory
resource. For example, clients may include personal computers,
workstations, servers and/or applications or other software running
on such devices.
[0026] "Client" may also more specifically refer to a driver or
other software entity that facilitates access to the shared memory
resource. For example, as will be described in more detail, a
driver can be loaded into memory of a networked computer, allowing
applications and the operating system of that computer to recognize
or make use of the clustered cache.
[0027] The distributed shared memory described herein may be
operated in a variety of modes. Many of the examples discussed
herein will refer to a mode where clustered memory cache 22
provides caching functionality for data used by clients 32. In
particular, data items read from an auxiliary store 50 may be
cached in clustered memory cache 22, and data items to be written
to auxiliary store 50 may also be cached in clustered memory cache
22. Thus, even though a particular client may have ready access to
the auxiliary store (e.g., access to a file system stored on a hard
disk), it may be desirable to place requested data in the clustered
memory cache, so as to provide faster access to the data.
[0028] Local Memory Managers
[0029] Regardless of the particular mode of operation, the
clustered memory cache may span multiple physically distinct
computing systems. For example, in FIG. 1, clustered memory cache
22 includes memory from N different computing systems 26 (Computing
System 1, Computing System 2, etc., through Computing System N).
The individual computing systems can be of varying configurations,
for example ranging from relatively low-powered personal devices to
workstations to high-performance servers. SMP or other
multiprocessor architectures may be employed as well, in which one
or more of the computing systems employ multiple processors or
cores interconnected via a multiprocessor bus or other
interconnect. As described in detail herein, physical memory 24
from these physically distinct systems 26 may be aggregated via
network 40 and made available to clients 32 as a unified logical
resource.
[0030] Referring particularly to local memory managers 34, each
memory manager may be local to and associated with a different
portion of clustered memory cache 22. The memory managers typically
are independent of one another, and each is configured to allocate
and manage individual units of physical memory in its associated
portion of clustered memory cache 22.
[0031] The local memory managers can be configured to manage client
references and access to cached data items. As an illustration,
assume a particular client 32 needs access to a data item cached in
the portion of clustered cache 22 that is managed by memory manager
MM1. The client may query metadata service 30 to identify which
local memory manager 34 (in this case, MM1) manages the desired
cached data item, as described in further detail below. Once the
client knows the memory location for the cached item is managed by
MM1, the client contacts MM1 via network 40 to gain access to the
cached item. If access is permitted, the memory manager MM1 grants
access and maintains a record of the fact that the requesting
client has a reference to the memory location. The record may
indicate, for example, that the client has a read lock on a
particular block of memory that is managed by memory manager
MM1.
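As a purely hypothetical sketch of that flow (none of the calls below
correspond to an actual API in the application), a client-side
cache-hit read might proceed as:

def read_cached_item(client, path, metadata_service):
    """Hypothetical cache-hit read path: resolve the owning local memory manager,
    then negotiate a read reference with it over the network."""
    mm_info = metadata_service.lookup(path)      # which local memory manager holds the item?
    if mm_info is None:
        return None                              # not cached; a cache-miss path would apply instead
    mm = client.connect(mm_info.address)         # contact that memory manager via the network
    lock = mm.acquire_read_lock(mm_info.block)   # the memory manager records the client reference
    try:
        return mm.read_block(mm_info.block)
    finally:
        mm.release(lock)                         # releasing the reference may later make the block reclaimable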
[0032] In some embodiments, clustered memory cache 22 may be
implemented using Remote Direct Memory Access (RDMA). RDMA
implementations that may be employed include, but are not limited
to, the Virtual Interface Architecture, InfiniBand, RDMA over
Converged Ethernet (RoCE), RDMA over TCP/IP, and iWARP. In such a
setting, the local memory manager may be configured to provide RDMA
keys to requesting clients or otherwise manage the respective
access controls of the RDMA implementation.
[0033] For any given memory manager, the associated portion of the
clustered cache will often include many different blocks or other
units of memory. In particular, referring to FIG. 2, an exemplary
memory manager 34 is depicted, including a cache store 60. In the
depicted example, cache store 60 is schematically represented as a
table, with a record (row entry) for each block or other unit of
physical memory managed by the memory manager. In particular
embodiments of clustered cache 22 having cache data replication
functionality, one cache store 60 may be created in memory manager
34 for non-replica portions of clustered cache 22 managed by memory
manager 34. Separate cache stores 60 may be created in memory
manager 34 for each replica store managed by memory manager 34. The
first column in the example is simply an index, tag or other
identifier used to designate a particular block of memory.
[0034] The remaining column or columns may contain metadata or
other information associated with the corresponding unit of memory
and/or the data stored in that unit of memory. As depicted in FIG.
2, memory manager 34 may also include a monitor thread 62 to
facilitate the acquisition and updating of the cache store
information. The associated information may include, by way of
example, information about read locks, write locks and/or other
client references to the unit of memory; a filename/path hash or
other mechanism for identifying the cached data item(s); status
indicators; rates of eviction and insertion; temporal information
such as time resident in the cache, time since last access, etc.;
block size or other capacity information relating to the unit of
memory; and/or other information concerning the memory unit, such
as statistical information regarding usage of the memory unit or
the items cached in the memory unit. These are but illustrative
examples. Also, it should be understood that while cache store 60
is depicted schematically to include the information in a table, a
variety of other data structures or mechanisms may be employed to
maintain the information store.
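For illustration, one possible shape for a single cache store record,
assuming fields of the kind listed above (the names are not taken from
the application), is:

from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class CacheStoreRecord:
    """One row of a per-memory-manager cache store such as cache store 60."""
    block_index: int                     # index, tag, or other identifier for the physical block
    item_hash: Optional[str] = None      # e.g., a filename/path hash identifying the cached item
    status: str = "empty"                # status indicator for the block
    read_refs: Set[str] = field(default_factory=set)    # clients holding read locks
    write_refs: Set[str] = field(default_factory=set)   # clients holding write locks
    dirty: bool = False                  # written but not yet flushed to the backing store
    inserted_at: float = 0.0             # when the item became resident in the cache
    last_access: float = 0.0             # basis for time-since-last-access statistics
    block_size: int = 8192               # capacity of the memory unit, in bytes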
[0035] Local memory managers 34 may also be configured to receive
and respond to requests to insert particular data items into
clustered memory cache 22. As will be explained in more detail
below, these cache insertion requests can arise from and be
initiated by actions of metadata service 30 and clients 32. In some
cases, the local memory manager may deny the cache insertion
request. One situation where an insertion request can be denied is
if the request is directed to a block containing an item that
cannot be immediately evicted, for example because there are active
client references to the cached item.
[0036] Assuming, however, that the insertion request is grantable
by the local memory manager, the local memory manager acknowledges
and grants the request. The memory manager also coordinates the
population of the respective memory block with the data item to be
cached, and appropriately updates any associated information for
the block in the cache store (e.g., cache store 60).
[0037] Similarly, each local memory manager 34 is configured to
receive and respond to requests to evict items from its associated
portion of clustered memory cache 22. As with insertion requests,
the eviction requests can arise from actions of the metadata
service 30 and one or more of clients 32, as will be explained in
more detail below. Assuming the request is grantable, the memory
manager acknowledges and grants the request, and flushes the memory
block or takes other appropriate action to make the memory block
available for caching of another item.
[0038] In various situations, before a block can be evicted and
freed, it first becomes reclaimable. A client reference release
signal, which is not an eviction signal but an indication that a
client is no longer actively using the block, may be used to
determine that the block is now reclaimable prior to its eviction.
Any block may have read or write references held by any number of
clients; the block becomes reclaimable only when all of those read
and write references have been released.
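Restated as a small sketch, with the block's reference counts and
dirty/replica state passed in explicitly (names are illustrative):

import time

def is_reclaimable(read_refs, write_refs, dirty, replica_count):
    """A cached block is reclaimable once every client read and write reference has
    been released, its data is clean, and it has no corresponding replica blocks."""
    return not read_refs and not write_refs and not dirty and replica_count == 0

def on_reference_release(block_id, read_refs, write_refs, dirty, replica_count,
                         reclaimable_list):
    """Handle a client reference-release signal (not an eviction signal): if the block
    has just become reclaimable, record when, which feeds the advertised age."""
    if is_reclaimable(read_refs, write_refs, dirty, replica_count):
        reclaimable_list.append((block_id, time.time()))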
[0039] In some example embodiments, it will be desirable to notify
clients 32 when items are to be evicted from the clustered memory
cache. Accordingly, the local memory managers may also be
configured to maintain back references to clients accessing items
in the cache. For example, assume a client requests access to an
item in a portion of the cache managed by a memory manager, and
that the memory manager has responded by granting a read lock to
the client. Having maintained a back reference to the client (e.g.,
in cache store 60), the local memory manager can then notify the
client in the event of a pending eviction and request that the
client release the lock.
[0040] As discussed above, each local memory manager may be local
to and associated with a different portion of the clustered memory
cache. Although memory managers may be referred to herein as
"local" memory managers, they need not be physically proximate to
the physical memory. The local memory managers may be located
elsewhere in some embodiments. In the example of FIG. 1, each of
the distinct computing systems 26 has an individual memory manager
responsible for the physical memory 24 contributed by the computing
system 26 to the clustered cache. Alternatively, multiple local
memory managers may be employed within a computing system.
[0041] In particular embodiments, clustered memory cache 22 may
operate in a write-through mode; that is, write operations
(initiated, for example, by client 32) are not completed until data
that has been written to cache 22 is also flushed to a backing
store such as auxiliary store 50. In other embodiments, clustered
memory cache 22 may operate in a write-back mode; that is, write
operations (initiated, for example, by client 32) are completed as
soon as the data is written to cache 22, and write data is flushed
to a backing store such as auxiliary store 50 at a later time. This
later time may occur, for example, when a client 32 issues a flush
on all cache blocks to which it has written.
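A schematic write path distinguishing the two modes, with purely
illustrative cache and backing-store interfaces, might look like:

def write_block(cache, backing_store, key, data, mode="write-through"):
    """Illustrative write path for the two caching modes described above."""
    cache.put(key, data)
    if mode == "write-through":
        # The write does not complete until the data is also on the backing store.
        backing_store.write(key, data)
        cache.mark_clean(key)
    else:
        # Write-back: the write completes immediately; the dirty block is flushed
        # later, for example when the client issues a flush on its written blocks.
        cache.mark_dirty(key)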
[0042] In particular embodiments, clustered memory cache 22 may
include cache data replication functionality, described in further
detail below. In an embodiment including the cache data replication
functionality, physical memory 24 may include data representing a
portion of clustered memory cache 22 as well as one or more replica
stores of data representing another portion or portions of
clustered memory cache 22, with both the data and the replica
stores managed by local memory manager 34. As an example, with
reference to FIG. 1, computing system 1 includes local memory
manager MM1. The physical memory 24 associated with MM1 may include
both data representing a portion of clustered memory cache 22, as
well as a replica store of data representing the portion of
clustered memory cache 22 associated with local memory manager MM2.
Additionally, in an embodiment with cache data replication
functionality, each unit of physical memory 24 may include certain
metadata including, for example, memory 24 identifier (e.g.,
manufacturer ID, worldwide name, etc.); for each replica store
hosted by memory 24, the identifier, state, and primary store; for
each replica store replicating data in memory 24, the replica store
identifier and host memory 24; and for each cache block in memory
24, whether the cache block is dirty/unflushed or clean (and if
dirty, when the cache block became dirty), and if dirty/unflushed,
the replica stores where this block is replicated.
[0043] FIG. 3 depicts an example of an alternate memory manager
configuration. As in the previous example, computing system 70 is
one of several physically distinct computing systems contributing
physical memory 24 to a distributed memory resource. The example of
FIG. 3 illustrates two configuration variations that may be applied
to any of the examples discussed herein. First, the figure
demonstrates a configuration in which the memory contributed from a
single computing system is allocated into multiple different
segments. The individual segments, which may or may not be
contiguous, are each managed by a different memory manager 34
(individually and respectively designated as MMa, MMb and MMc). As
described below, the use of multiple memory managers and memory
segments on a single computing system may be used to allow
exportation of physical memory to multiple different aggregate
memory resources. On the other hand, it may be desirable to employ
multiple memory managers even where the memory is contributed to a
single cache cluster or other shared memory resource.
[0044] Secondly, the figure demonstrates the use of multiple
different clusters. Specifically, each local memory manager and
memory segment pairing in the FIG. 3 example belongs to a different
cache cluster (i.e., clusters 22a, 22b and 22c). Multiple cluster
configurations may be employed for a variety of reasons, such as
for security reasons, access control, and to designate specific
clusters as being usable only by specific applications.
[0045] Local memory managers 34 may also be configured to report
out information associated with the respective portions of
clustered memory cache 22. As discussed above with reference to
FIG. 2, each memory manager may include a cache store 60 with
information about the memory manager's memory locations. This
information may be provided from time to time to metadata service
30, configuration manager 42, and/or other components of the
systems described herein.
[0046] In particular embodiments, a local memory manager may examine
all possible local memory 24 devices upon startup or upon a
plug-and-play event (indicating that memory 24 has been added to
the associated computing system 26) to determine which memory 24
belongs to clustered cache 22. This may, in particular embodiments,
be determined by examining the memory identifier in the metadata of
memory 24. If it is determined that memory 24 belongs to clustered
cache 22, local memory manager 34 may update entries in its cache
store 60 and communicate data regarding memory 24 to metadata
service 30 or configuration manager 42 (including, for example, the
journal in configuration manager 42). The determination whether
memory 24 belongs to clustered cache 22 may, in some embodiments,
be determined by examining an entry in the journal of configuration
manager 42. In particular embodiments, local memory manager 34 may
not allow access to the newly-added memory 24 until the memory 24
has been approved by the configuration manager 42 (e.g., approved
as not being obsolete after an examination of an entry in the
journal of the configuration manager).
[0047] Metadata Service Data Store
[0048] For example, as will be described in more detail below,
metadata service 30 can provide a centralized, or relatively
centralized, location for maintaining status information about the
clustered cache. In particular, in FIG. 1, memory managers MM1,
MM2, etc. through MMN may be considered to all be within a domain
that is assigned to metadata service 30. Metadata service 30 can
monitor the domain, for example by maintaining information similar
to that described with reference to cache store 60, but for all of
the memory managers in the domain.
[0049] More particularly, metadata service 30 may include a
metadata service data store 80 for maintaining information
associated with the memory locations in its domain that form the
clustered cache. In one class of examples, and as shown in FIG. 1,
metadata service data store 80 may include multiple records 82.
Specifically, a record 82 is provided for each of the physical
memory units 24 of clustered memory cache 22. For example, assume
clustered memory cache 22 includes 64 million 8-kilobyte memory
blocks (512 gigabytes of addressable cache memory) spread across
computing systems 1 through N and local memory managers MM1 through
MMN. In this example, metadata service data store 80 could be
configured with 64 million records (rows), with each pertaining to
one of the cache memory blocks in the cluster. In an alternate
example, each record could apply to a grouping of memory locations.
Numerous other arrangements are possible.
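The sizing in that example can be checked directly (decimal units are
used below; the same 64 x 8 = 512 relationship holds in binary units):

BLOCK_SIZE_BYTES = 8_000          # one 8-kilobyte cache block
NUM_BLOCKS = 64_000_000           # one metadata record per cache block
assert BLOCK_SIZE_BYTES * NUM_BLOCKS == 512_000_000_000   # 512 gigabytes of addressable cache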
[0050] Various additional information may be associated with the
records of metadata service data store 80. In particular, the
metadata service may store a tag for each of the memory locations
of the cache, as shown in the figure. In one example, the tag
allows a requesting entity, such as one of clients 32, to readily
determine whether a particular data item is stored in the cache.
Specifically, the tag column entries may each be a hash of the
path/filename for the data item resident in the associated memory
block. To determine whether a requested data item (e.g., a file) is
present in the cache, the path/filename of the requested item may
be hashed using the same hash routine and the resulting hash
compared to the tag column entries of the metadata service data
store 80. The path and filename hash described above is provided by
way of example; hash methodologies may be employed on other data,
and/or other identification schemes may be employed.
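A sketch of this lookup, with the tag taken to be a hash of the
path/filename (the particular hash function and data-store layout are
assumptions):

import hashlib

def path_tag(path):
    """Tag for a cached item: a hash of its path/filename (one possible scheme)."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()

def find_cached_item(data_store, path):
    """Compare the hash of the requested path against the tag column of a metadata
    data store, modelled here as a dict mapping block id -> {"tag": ..., "mm": ...}."""
    tag = path_tag(path)
    for block_id, record in data_store.items():
        if record.get("tag") == tag:
            return block_id, record.get("mm")   # the block and its owning local memory manager
    return None                                 # no match: a cache miss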
[0051] Metadata service data store 80 may also indicate an
associated local memory manager for each of its records, as shown
at the exemplary column designated "MM." For example, data store 80
could indicate that a first memory block or range of memory blocks
was managed by memory manager MM1, while a second block or range of
blocks was managed by local memory manager MM2. With such a
designation, in the event that a query for a particular item
reveals the item is present in the cache (e.g., via a match of the
path/filename hash described above), then the response to that
query can also indicate which local memory manager 34 should be
dealt with to read or otherwise access the cached item.
[0052] In the example of FIG. 1, data store 80 also includes a
status indication for each of the cache blocks. In one example,
each of the cache blocks is indicated as having one of the
following statuses: (1) empty, and therefore available to be
populated; (2) insertion pending, indicating that the memory block
is in the process of being populated with a newly-inserted cached
item; (3) active, indicating that the memory block presently
contains an active cached data item; or (4) deletion pending,
indicating that the data item in the cache block is being deleted.
It will be appreciated that these are illustrative examples, and
other status information and flags may be employed. The specific
exemplary status indications referred to above will be described in
further detail below.
[0053] The tag, memory manager and status entries described above
with reference to the cache blocks in data store 80 are
non-limiting examples. As described in more detail below, metadata
service 30 and its policy engine 90 typically play a role in
implementing various policies relating to the configuration and
usage of clustered memory cache 22. Application of various policies
can be dependent upon rates of eviction and insertion for a cache
block or data item; temporal information such as the time a data
item has been cached in a particular block, time since last access,
etc.; and/or other information concerning the cache block, such as
statistical information regarding usage of the cache block or the
data items cached therein.
[0054] It will thus be appreciated that the information maintained
in metadata service data store 80 may overlap to some extent with
the information from the various cache stores 60 (FIG. 2) of the
local memory managers. Indeed, as previously indicated, the
described system can be configured so that the memory managers
provide periodic updates to maintain the information in the
metadata service data store 80.
[0055] Also, the metadata service may be distributed to some extent
across the network infrastructure. For example, multiple mirrored
copies of the metadata service may be employed, with each being
assigned to a subset of local memory managers. Memory manager
assignments could be dynamically reconfigured to achieve load
balancing and in the event of failure or other changes in operating
conditions of the environment.
[0056] Operational Examples: Cache Hit, Cache Miss
[0057] Various examples will now be described illustrating how
clients 32 interact with metadata service 30 and local memory
managers 34 to access clustered memory cache 22. The basic context
of these examples is as follows: a particular client 32 (FIG. 1) is
running on an applications server executing a data-intensive
financial analysis and modeling program. To run a particular
analysis, the program may need to access various large data files
residing on auxiliary store 50.
[0058] In a first example, the financial analysis program makes an
attempt to access a data file that has already been written into
clustered memory cache 22. This may have occurred, for example, as
a result of another user causing the file to be loaded into the
cache. In this example, client 32 acts as a driver that provides
the analysis program with access to the clustered memory cache 22.
Other example embodiments include client 32 operating in user mode,
for example as an API for interacting with the clustered
resource.
[0059] In response to the client request for the data file,
metadata service 30 determines that the requested file is in fact
present in the cache. This determination can be performed, for
example, using the previously-described filename/path hash method.
Metadata service 30 then responds to the request by providing
the client with certain metadata that will enable the client to look
to
the appropriate portion of the clustered memory cache (i.e., the
portion containing the requested file).
[0060] In particular, metadata service 30 responds to the request
by identifying the particular local memory manager 34 which is
associated with the portion of the cache containing the requested
file. This identification may include the network address of the
local memory manager, a logical block address or a cache block
number, or another identifier allowing derivation of the address.
Once the client has this information, the client proceeds to
negotiate with the local memory manager to access and read the
requested file from the relevant block or blocks managed by the
memory manager. This negotiation may include granting of a read
lock or other reference from the local memory manager to the
client, and/or provision of RDMA keys as described above.
[0061] When caching small files, an entire file may fit in a single
cache block. When caching larger files or whole block devices, each
file may require many cache blocks. As such, as part of a lookup
process, a client may determine which cache block sized region of
the file/LUN it will include in the query, which is part of the
hash for the cache block.
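One way a client might form such a per-region lookup key, assuming a
fixed cache block size, is sketched below:

import hashlib

CACHE_BLOCK_SIZE = 8 * 1024   # assumed cache block size, in bytes

def block_key(path_or_lun, byte_offset, block_size=CACHE_BLOCK_SIZE):
    """Key for one cache-block-sized region of a file or LUN: the identifier is hashed
    together with the index of the region covered by the request."""
    region_index = byte_offset // block_size
    return hashlib.sha1(f"{path_or_lun}:{region_index}".encode("utf-8")).hexdigest()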
[0062] As shown in FIG. 1, client 32 may include a local store 92
of metadata. In the above example, this local store may be used by
the client to record the association between the requested data
file and the corresponding local memory manager and respective
portion of the clustered cache. Thus, by consulting local store 92,
subsequent cache accesses to the cached file can bypass the step of
querying metadata service 30. Indeed, clients 32 may be implemented
to first consult local store 92 before querying metadata service
30, thereby allowing clients to more directly and efficiently
access cached items. Metadata service 30 may thus function in one
respect as a directory for the clustered memory cache 22. Clients
having up-to-date knowledge of specific entries in the directory
can bypass the directory and go directly to the relevant local
memory manager.
[0063] In particular embodiments, local store 92 may include
metadata such as a list of client write or read references to
portions of clustered cache 22. As an example, client 32 may keep
track of which cache blocks it holds write references to (as well
as which local memory manager 34 manages these cache blocks) in
local store 92. By keeping track of these write references, client
32 may be able to communicate with the corresponding local memory
managers 34 and, upon request by a local memory manager 34, release
certain of its write references to allow the local memory manager
34 to make room in its corresponding memory 24 for new data to be
cached. Local store 92 may also contain a queue of which cache
blocks are most- or least-recently used by client 32. Thus, if a
particular cache block is the least recently used cache block by
client 32, then it will be at the front of the least-recently-used
(LRU) queue in local store 92 and may be the first write reference
that client 32 releases, either voluntarily or when asked by a
local memory manager 34. If there is a pending input/output request
on a particular cache block, then the reference to that cache block
may move to the back of the least-recently-used (LRU) queue in
local store 92. In particular embodiments, there may be a limit on
the number of cache block references (write, read, or some
combination of both) that a client 32 is allowed to have in using
clustered memory cache 22. This limit may be tracked, for example,
by metadata service 30 (e.g., the policy engine 90), by one or more
local memory managers 34 (described below), or may be tracked and
enforced at client 32 itself.
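One possible shape for that per-client bookkeeping, keeping the
least-recently-used reference at the front of the queue (the names and
the reference limit are illustrative):

from collections import OrderedDict

class ClientReferenceQueue:
    """Sketch of a client-side LRU queue of held cache-block references, as might be
    kept in a local store such as local store 92."""
    def __init__(self, max_refs):
        self.max_refs = max_refs      # limit on references the client may hold
        self.refs = OrderedDict()     # block_id -> owning local memory manager

    def touch(self, block_id, memory_manager):
        """Record use of a block; a pending I/O request moves its reference to the back."""
        self.refs.pop(block_id, None)
        self.refs[block_id] = memory_manager
        while len(self.refs) > self.max_refs:
            self.release_oldest()

    def release_oldest(self):
        """Release the least-recently-used reference, voluntarily or when a local
        memory manager asks the client to make room."""
        block_id, mm = self.refs.popitem(last=False)
        return block_id, mm           # the caller would notify mm that the reference is released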
[0064] Another example will now be considered, in which the file
requested by the analysis program is not present in clustered
memory cache 22. As before, the analysis program and/or client 32
cause the file request to issue, and the request is eventually
received at metadata service 30. Before the request is sent
to metadata service 30, however, the local client store 92 of
metadata is consulted. In this case, because the requested file is
not present in the cache, no valid metadata will be present in the
local store. The request is thus forwarded to metadata service
30.
[0065] In response to the request, metadata service 30 cannot
respond with a memory manager identification, as in the previous
example, because the requested file is not present in the clustered
memory cache. Accordingly, the hash matching operation, if applied
to metadata service data store 80, will not yield a match.
[0066] The metadata service can be configured to implement system
policies in response to this type of cache miss situation.
Specifically, policies may be implemented governing whether the
requested item will be inserted into the clustered memory cache,
and/or at what location in the cache the item will be written.
Assuming clustered cache 22 is populated with the requested item,
the metadata service data store 80 will be updated with metadata
including the designation of the responsible memory manager 34.
This metadata can then be supplied in response to the original
request and any subsequent requests for the item, so that the
cached version can be accessed through client interactions with the
appropriate memory manager.
[0067] Policies
[0068] The systems and methods described herein may be configured
with various policies pertaining to the shared memory resource.
Policies may control configuration and usage of the clustered
memory cache; client access to the cache; insertion and eviction of
items to and from the cache; caching of items in particular
locations; movement of cached items from one location to another
within the cache; etc. Policies may also govern start/stop events,
such as how to handle failure or termination of one of the
computing systems contributing memory locations to the cluster.
These are non-limiting examples--a wide variety of possibilities
exist.
[0069] In the example of FIG. 1, configuration manager 42, admin
interface 46 and policy manager 44 perform various functions in
connection with the policies. In particular, admin interface 46 can
provide a command-line, graphical or other interface that can be
used by a system administrator to define policies and control how
they are applied. Configuration manager 42 typically is adapted to
coordinate startup events, such as the login or registration of
entities as they come on-line. In many settings, startup procedures
will also include distribution of policies.
[0070] For example, in FIG. 1, initialization of clients 32 is
handled by configuration manager 42. Specifically, when coming
on-line, each client 32 initializes and registers with
configuration manager 42. Configuration manager 42 provides the
initializing client with addresses of the appropriate metadata
service 30. Configuration manager 42 may also retrieve relevant
policies from policy manager 44 and distribute them to the client,
which stores them locally for implementation via client policy
engine 94 (FIG. 1).
[0071] Configuration manager 42 typically also coordinates
registration and policy distributions for metadata service 30 and
local memory managers 34. The distributed policies may be stored
locally and implemented via metadata service policy engine 90 (FIG.
1) and memory manager policy engines 64 (FIG. 2), respectively.
From time to time during operation, the size and underlying makeup
of the clustered memory resource may change as local memory
managers launch and terminate, either intentionally or as a result
of a failure or other unintentional system change. These startups
and terminations may be handled by the configuration manager, to
provide for dynamic changes in the shared memory resource. For
example, during periods where heavier usage volume is detected
(e.g., an escalation in the number of cache insertion requests),
the configuration manager may coordinate with various distributed
devices and their associated memory managers to dynamically scale
up the resource. On the other hand, performance lags or other
circumstances may dictate a dynamic adjustment where one or more
memory managers are taken off-line. As described in more detail
below, the present system may be configured to permit migration of
cache data from one location to another in the shared resource. The
startups and terminations described above provide examples of
situations where such data migration may be desirable.
[0072] In particular embodiments, configuration manager 42 may
include a journal (or any suitable data structure) containing state
information about clustered cache 22, stored locally in persistent
or non-volatile memory. Because the journal is maintained in
persistent memory in configuration manager 42, even if the
configuration manager fails (or, in the case of multiple
configuration managers, if any or all of the configuration managers
42 of network 20 fail), cache state information may still be
maintained. In particular embodiments, the journal may be mirrored
elsewhere in network 20 or in clustered memory cache 22. Even in
the case of a complete failure of all copies of the journal, the
journal may be reconstructed from metadata information stored in
memory 24 (described above); if memory 24 is non-volatile memory,
then the journal may be reconstructed even after a complete
shutdown of cache 22.
[0073] The journal of the configuration manager 42 may include the
following information about each memory unit 24 of the clustered
cache 22: one or more memory 24 identifiers (e.g., manufacturer ID,
worldwide name, cache-specific name, etc.), memory 24 type (e.g.,
RAM, flash, persistent local disk), memory 24 size, memory 24 state
(e.g., inactive, active, failed, failed and recovered, removed), an
identifier of the local memory manager 34 that manages memory 24
(e.g., the local memory manager that most recently registered
memory 24 with the journal), associated replica store identifiers
(e.g., physical IDs of memory 24 containing any associated replica
stores, cache-specific IDs of memory 24 containing replica stores),
an identifier of the local memory manager(s) 34 of the associated
replica store(s), associated replica store states, and replica
stores that are currently being re-hosted on associated replica
stores. Additionally, the journal may also include information
about the one or more metadata services 30 that are part of the
clustered cache 22 including, for example, the identifiers of any
metadata servers that have been expelled from cache 22. The journal
may also include partition map generation numbers, local memory
manager 34 membership generation numbers, and, for each auxiliary
store 50 (or each device in auxiliary store 50), a device pathname
and a device state.
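For illustration, a single journal record carrying the fields listed
above could be modelled as follows (the field names are assumptions):

from dataclasses import dataclass, field
from typing import List

@dataclass
class JournalEntry:
    """One journal record describing a memory unit of the clustered cache."""
    memory_ids: List[str]           # e.g., manufacturer ID, worldwide name, cache-specific name
    memory_type: str                # "RAM", "flash", or "persistent local disk"
    size_bytes: int
    state: str                      # inactive, active, failed, failed and recovered, or removed
    owning_memory_manager: str      # manager that most recently registered this memory
    replica_store_ids: List[str] = field(default_factory=list)       # replica stores hosted by this memory
    replica_memory_managers: List[str] = field(default_factory=list) # managers of the associated replica stores
    replica_states: List[str] = field(default_factory=list)
    rehosted_replica_stores: List[str] = field(default_factory=list) # replica stores being re-hosted here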
[0074] The configuration manager 42 may communicate with metadata
service 30 (including, for example, data store 80), clients 32,
local memory managers 34 (including, e.g., cache store 60), or any
other part of network 20 to obtain information to update entries in
its journal. Additionally, entries in the journal may be examined
by configuration manager 42 to communicate information to metadata
service 30 (including, for example, data store 80), clients 32,
local memory managers 34 (including, e.g., cache store 60), or any
other part of network 20.
[0075] As an example, if a local memory manager 34 communicates to
configuration manager 42 that a new physical memory 24 has been
detected (e.g., upon startup or upon a plug-and-play event) and
also communicates the memory identifier in the metadata of new
memory 24, the configuration manager 42 may examine its journal to
determine whether the memory identifier corresponds to an existing
memory unit in cache 22 or whether a new entry must be created for
the new memory 24. Additionally, the configuration manager may also
determine, if the identifier corresponds to an existing memory unit
in cache 22, whether the existing memory unit is valid for use
(e.g., based on the memory state--whether failed, recovered,
removed, etc.). Configuration manager 42 may then communicate to
local memory manager 34 whether the "new" memory 24 may be used by
local memory manager 34. If so, local memory manager 34 may update
entries in its cache store 60 and communicate data regarding memory
24 to metadata service 30 or configuration manager 42.
[0076] As another example, a local memory manager 34 may report the
failure of a unit of memory 24. Configuration manager 42 may update
its journal to record the new state of the memory 24, and may
examine its journal to determine whether a replica store exists for
memory 24, and if so, which local memory manager manages this
replica store. Configuration manager 42 may communicate with the
local memory manager managing the replica store and tell it to
"absorb" the replica as a normal (non-replica) portion of the cache
22, and subsequently the journal may be updated. Configuration
manager 42 may also communicate with yet another local memory
manager to create a new replica store for the absorbed replicas
(e.g., in the same physical memory 24 containing replica stores for
the local memory manager that has "absorbed" the replica), and
subsequently update the journal.
[0077] As indicated above, policy manager 44 may be configured to
provide a master/central store for the system policy definitions,
some or all of which may be derived from inputs received via admin
interface 46. Policy manager 44 may also validate or verify
aggregate policies to ensure that they are valid and to check for
and resolve policy conflicts. The policy manager 44 typically also
plays a role in gathering statistics relating to policy
implementations. For example, the policy manager may track the
number of policy hits (the number of times particular policies are
triggered), and/or the frequency of hits, in order to monitor the
policy regime, provide feedback to the admin interface, and make
appropriate adjustments. For example, removal of unused policies
may reduce the processing overhead used to run the policy
regime.
[0078] As should be appreciated from the foregoing, although the
policies may be defined and managed centrally, they may also be
distributed and implemented at various locations in the system.
Furthermore, the policy ruleset in force at any given location in
the system may vary based on the nature of that location. For
example, relative to any one of memory managers 34 or clients 32,
metadata service 30 has a more system-wide global view of clustered
memory cache 22. Accordingly, policy rulesets affecting multiple
clients or memory managers can be distributed to and implemented at
metadata service 30.
[0079] Client Filter
[0080] Referring to clients 32, and more particularly to the client
policy engines 94 incorporated into each client, various exemplary
client-level policy implementations will be described. Many example
policies implemented at the clients operate as filters to
selectively control which client behaviors are permitted to impact
the shared memory resource. More specifically, the client policy
engine may be configured to control whether requests for data items
(e.g., an application attempting to read a particular file from
auxiliary store 50) are passed on to metadata service 30, thereby
potentially triggering an attempted cache insertion or other action
affecting the clustered cache.
[0081] The selective blocking of client interactions with metadata
service 30 operates effectively as a determination of whether a
file or other data item is cacheable. This determination and the
corresponding policy may be based on a wide variety of factors and
criteria. Non-limiting examples include:
[0082] (1) Size--i.e., items are determined as being cacheable by
comparing the item size to a reference threshold. For example,
files larger than N bytes are cacheable.
[0083] (2) Location--i.e., items are determined as being cacheable
depending on the location of the item. For example, all files in a
specified path or storage device are cacheable.
[0084] (3) Whitelist/Blacklist--a list of files or other items may
be specifically designated as being cacheable or non-cacheable.
[0085] (4) Permission level or other flag/attribute--for example,
only read-only files are cacheable.
[0086] (5) Application ID--i.e., the cacheable determination is
made with respect to the identity of the application requesting the
item. For example, specified applications may be denied or granted
access to the cache.
[0087] (6) User ID--e.g., the client policy engine may be
configured to make the cacheable determination based on the
identity of the user responsible for the request.
[0088] (7) Time of Day.
[0089] In addition, these examples may be combined (e.g., via
logical operators). Also, as indicated above, the list is
illustrative only, and the cacheability determination may be made
based on parameters other than the cited examples.
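For illustration only, the following Python sketch shows one way a client policy engine might combine several of the criteria above with logical operators. The request fields, thresholds, blacklist entries, and application names are hypothetical assumptions introduced for this sketch, not part of the described system.

from dataclasses import dataclass

# Hypothetical request descriptor; the field names are illustrative only.
@dataclass
class ItemRequest:
    path: str
    size_bytes: int
    read_only: bool
    app_id: str

# Assumed policy parameters (not taken from the specification).
MIN_CACHEABLE_SIZE = 64 * 1024            # criterion (1): size threshold
CACHEABLE_PREFIXES = ("/shared/data/",)   # criterion (2): location
BLACKLIST = {"/shared/data/tmp.bin"}      # criterion (3): blacklist
DENIED_APPS = {"backup-agent"}            # criterion (5): application ID

def is_cacheable(req: ItemRequest) -> bool:
    """Return True if the request may be passed on to the metadata service."""
    if req.path in BLACKLIST or req.app_id in DENIED_APPS:
        return False
    # Remaining criteria combined with logical AND, as suggested above.
    return (req.size_bytes >= MIN_CACHEABLE_SIZE
            and req.path.startswith(CACHEABLE_PREFIXES)
            and req.read_only)             # criterion (4): read-only attribute

# A large, read-only file under the cacheable path passes the filter.
print(is_cacheable(ItemRequest("/shared/data/model.bin", 10_000_000, True, "analytics")))

Requests that fail such a filter would simply be satisfied through the auxiliary store, without ever reaching metadata service 30.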
[0090] Cache Insertion and Cache Eviction
[0091] Cache insertion policies may determine whether or not a file
or other data item may be inserted into clustered memory cache 22.
For example, cache insertion policies may be applied by metadata
service 30 and its policy engine 90, though application of a given
policy may be based upon requests received from one or more clients
32, and/or upon metadata updates and other messaging received from
the local memory managers 34 and maintained in metadata service
data store 80 (FIG. 1).
[0092] In some examples, administrators or other users are able to
set priorities for particular items, such as assigning relatively
higher or lower priorities to particular files/paths. In addition,
the insertion logic may also run as a service in conjunction with
metadata service 30 to determine priorities at run time based on
access patterns (e.g., file access patterns compiled from
observation of client file requests).
[0093] Further non-limiting examples of cache insertion policies
include:
[0094] (1) Determining at metadata service 30 whether to insert a
file into clustered memory cache 22 based on the number and/or
frequency of requests received for the file. The metadata service
can be configured to initiate an insertion when a threshold is
exceeded.
[0095] (2) Determining at metadata service 30 whether to insert a
file into clustered memory cache 22 based on available space in the
cache. This determination typically will involve balancing of the
size of the file with the free space in the cache and the
additional space obtainable through cache evictions. Assessment of
free and evictable space may be based on information in metadata
service data store 80.
[0096] (3) Determining at metadata service 30 whether to insert a
file into clustered memory cache 22 based on relative priority of
the file.
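As a minimal sketch of how insertion policies (1) and (2) above might be combined, the following Python fragment weighs a request-count threshold against free plus evictable space. The function name and threshold value are assumptions for illustration only.

# Assumed parameter: minimum number of requests before insertion is attempted.
REQUEST_THRESHOLD = 3

def should_insert(file_size: int, request_count: int,
                  free_space: int, evictable_space: int) -> bool:
    """Decide whether the metadata service should attempt a cache insertion."""
    # Policy (1): only insert items requested often enough.
    if request_count < REQUEST_THRESHOLD:
        return False
    # Policy (2): balance the item size against free plus evictable space.
    return file_size <= free_space + evictable_space

# Example: a 1 MiB file requested 5 times, with 512 KiB free and 768 KiB evictable.
print(should_insert(1 << 20, 5, 512 * 1024, 768 * 1024))  # True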
[0097] Metadata service 30 may also implement eviction policies for
the clustered memory cache 22. Eviction policies determine which
data items to evict from the cache as the cache reaches capacity.
Eviction policies may be user-configured (e.g., by an administrator
using admin interface 46) based on the requirements of a given
setting, and may be applied based on metadata and other information
stored at metadata service 30 and/or memory managers 34.
[0098] In particular, metadata service 30 may reference its data
store 80 and predicate evictions based on which memory location
within its domain has been least recently used (LRU) or least
frequently used (LFU). Other possibilities include evicting the
oldest record, or basing evictions on age and frequency based
thresholds. These are provided as examples, and evictions may be
based upon a wide variety of criteria in addition to or instead of
these methods.
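The LRU and LFU predicates mentioned above could be realized as in the following sketch, which selects an eviction victim from hypothetical metadata records of the form (block id, last access time, access count); the record layout is an assumption, not the format of data store 80.

def pick_eviction_victim(records, policy="LRU"):
    """records: iterable of (block_id, last_access_time, access_count) tuples."""
    if policy == "LRU":
        # Least recently used: smallest (oldest) last-access timestamp.
        return min(records, key=lambda r: r[1])[0]
    if policy == "LFU":
        # Least frequently used: smallest access count.
        return min(records, key=lambda r: r[2])[0]
    raise ValueError(f"unknown eviction policy: {policy}")

records = [("b0", 100.0, 7), ("b1", 95.0, 12), ("b2", 130.0, 2)]
print(pick_eviction_victim(records, "LRU"))  # b1 (least recently accessed)
print(pick_eviction_victim(records, "LFU"))  # b2 (least frequently accessed)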
[0099] As previously mentioned, although metadata service 30 has a
global view of the cache and is therefore well-positioned to make
insertion/eviction determinations, the actual evictions and
insertions are carried out by the memory managers 34 in some
embodiments. Indeed, the insertion/eviction determinations made by
metadata service 30 are often presented to the memory managers as
requests that the memory managers can grant or deny. In other
cases, the memory manager may grant the request, but only after
performing other operations, such as forcing a client to release a
block reference prior to eviction of the block.
[0100] In other cases, metadata service 30 may assign higher
priority to insertion/eviction requests, essentially requiring that
the requests be granted. For example, the overall policy
configuration of the system may assign super-priority to certain
files. Accordingly, when one of clients 32 requests a
super-priority file, if necessary the metadata service 30 will
command one or more memory managers 34 to evict other data items
and perform the insertion.
[0101] In many embodiments, however, the local memory managers have
authority over the cache memory locations that they manage, and are
able in certain circumstances to decline requests from metadata
service 30. One reason for this is that the memory managers may
have more accurate and/or current information about their
associated portion of the cache. Information at the memory managers
may be more granular, or the memory managers may maintain certain
information that is not stored at or reported to metadata service
30. On the other hand, there may be delays between changes
occurring in the cache and the reporting of those changes from the
respective memory manager to metadata service 30. For example,
metadata service 30 might show that a particular block is
evictable, when in fact its memory manager had granted multiple
read locks since the last update to the metadata service. Such
information delays could result from conscious decisions regarding
operation of the clustered cache system. For example, an
administrator might want to limit the reporting schedule so as to
control the amount of network traffic associated with managing the
shared memory resource.
[0102] The above-described distribution of information,
functionality and complexity can provide a number of advantages.
The highly-distributed and non-blocking nature of many of the
examples discussed herein may allow them to be readily scaled in
large datacenter environments. The distributed locking and
insertion/eviction authority carried out by the memory managers may
allow for many concurrent operations and reduce the chance of any
one thread blocking the shared resource. Also, the complicated
tasks of actually accessing the cache blocks are distributed across
the cluster. This distribution is balanced, however, by the
relatively centralized metadata service 30, and the global
information and management functionality it provides.
[0103] Furthermore, it should be appreciated that various different
persistence modes may be employed in connection with the clustered
memory resource described herein. In many of the examples discussed
herein, a read-only caching mode is described, where the clustered
resource functions to store redundant copies of data items from an
underlying auxiliary store. This may enhance performance because
the cluster provides a shareable resource that is typically faster
than the auxiliary store where the data originates. However, from a
persistence standpoint, the data in the cluster may be flushed at
any time without concern for data loss because the cluster does not
serve as the primary data store. Alternatively, the cluster may be
operated as a primary store, with clients being permitted to write
to locations in the cluster in addition to performing read
operations. In this persistence mode, the cluster data may be
periodically written to a hard disk or other back-end storage
device.
[0104] A further example of how the clustered memory resource may
be used is as a secondary paging mechanism. Page swapping
techniques employing hard disks are well known. The systems and
methods described herein may be used to provide an alternate paging
mechanism, where pages are swapped out to the high performance memory
cluster.
[0105] Locality within Clustered Cache
[0106] The exemplary policy regimes described herein may also
operate to control the location in clustered memory cache 22 where
various caching operations are performed. In one class of examples,
metadata service 30 selects a particular memory manager 34 or
memory managers to handle insertion of a file or other item into
the respective portion of the cache. This selection may be based on
various criteria, and may also include spreading or striping an
item across multiple portions of the cluster to provide increased
security or protection against failures.
[0107] In another class of examples, the metadata service
coordinates migration of cached items within clustered memory cache
22, for example from one location to another in the cache. This
migration may be necessary or desirable to achieve load balancing
or other performance benefits.
[0108] A variety of exemplary locality policies will now be
described, at times with reference to FIG. 1 and FIG. 4. FIG. 4
depicts another example of a shared-memory computer network 20. The
depicted example is similar in many respects to the example of FIG.
1, except that network 40 includes multiple segments. Two segments
are depicted: Segment A and Segment B. The segments may be
separated by a router, switch, etc. As before, clustered memory
cache 22 is comprised of memory 24 from multiple physically
distinct computing systems 26, however some portions of the cache
are local to network Segment A, while others are local to network
Segment B. Clients 32a, auxiliary store 50a and metadata service
30a are on Segment A, while Clients 32b, auxiliary store 50b and
metadata service 30b are on Segment B.
[0109] In a first example, cache insertion locality is determined
based on relative usage of memory locations 24. Usage information
may be gathered over time and maintained by memory managers 34 and
the metadata services, and maintained in their respective stores.
Usage may be based on or derived from eviction rates, insertion
rates, access frequency, numbers of locks/references granted for
particular blocks, etc. Accordingly, when determining where to
insert an item in clustered memory cache 22, the metadata service
may select a less utilized or underutilized portion of the cache to
achieve load balancing.
[0110] The metadata service may also coordinate migration of cache
items from one location to another based on relative usage
information. For example, if information in metadata service data
store 80 (FIG. 1) indicates unacceptable or burdensome over-usage
at memory managers MM2 and MM3, metadata service 30 can coordinate
relocation of some of the data items to other memory managers
(e.g., memory managers MM1 or MM4).
[0111] In another example, locality policies are implemented based
on location of the requesting client. Assume for example, with
reference to FIG. 4, that a cache insertion request is triggered
based on an application associated with one of clients 32a (Segment
A). The policy configuration could be implemented such that this
would result in an attempted insertion at one of the Segment A
memory managers (MM1, MM2 or MM3) instead of the Segment B
managers.
[0112] In yet another example, if a client 32a has an application
that is located on a computing system 26 on Segment A, a policy
configuration could be implemented such that this would result in
an attempted insertion at the Segment A memory manager (MM1, MM2,
or MM3) that is co-located on the same computing system 26 as the
application.
[0113] In particular embodiments, a locality policy may be
implemented based on the location of the client most likely to
access the data. As an example, in the case of virtualization
environments, it is often the case that a single virtual machine (a
type of client application) accesses a cache block without
overlapping or sharing this cache block with another client 32 or
client application. Thus, as described above, one locality policy
may be to locate the requested data from auxiliary store 50 in a
cache block in the memory 24 of the same computing system 26
hosting the virtual machine application. Because it is unlikely (in
the case of a virtual machine application) that a request for that
same data would come from another client application, if a
different memory manager 34 (or computing system 26) seeks to
access this same data due to a client request, it is likely that
the virtual machine application has actually migrated to a portion
of network 20 associated with this different memory manager 34 (or
computing system 26). Thus, in one implementation of this locality
policy (whether for virtual machine applications or general client
applications), a timer is started when a second memory manager (or
computing system) seeks to access (at the request of a client
application) the same data that is stored in a cache block
co-located with a first client application and managed by a first
memory manager that created (or allocated or wrote) the cache
block. Metadata associated with the cache block (located, e.g., in
cache store 60 or in memory 24 itself) may contain an identifier
for the client or client application who initially requested the
cache block. If a certain amount of time has passed (e.g., several
seconds or several milliseconds) since the first memory manager or
client application has accessed the cache block, it may be
determined that the first client application has actually migrated
to a second portion of network 20 associated with the second memory
manager. The cache block may then be migrated to the second memory
manager's associated memory in order to serve the client application
in its new location. In particular embodiments, once a cache block
has been migrated, a second timer is started, such that the cache
block cannot be migrated (for locality policy reasons) again until
the second timer reaches a predetermined value (e.g., one hour).
The pattern of access to a particular cache block by client
applications (or memory managers) may, in particular embodiments,
be stored and tracked (e.g. in cache stores 60) before it is
determined whether a migration of a client application has occurred
and whether the cache block should also be migrated. Additionally,
a variety of statistics regarding accesses to individual cache
blocks or groups of associated or correlated cache blocks may also
be tracked by memory managers 34 and stored in cache store 60. The
locality policy may be turned on or off depending on a variety of
factors, and it may be applied globally within memory cache 22 or
locally within certain segments of network 40. For example, the
policy may be turned on or off depending on whether a particular
logical volume contains support for virtualized data. Additionally,
certain clients may have more or less priority in terms of the
locality policy than other clients. For example, even if a
particular client application accesses a cache block frequently, if
it is a low priority client application, it will not trigger a
migration event for the cache block. In yet another embodiment,
data relating to the performance of access times (collected, e.g.,
from clients 32) may be used to determine whether network 20 has
slow or fast links, and to use this information in determining
whether and where to migrate cache blocks within the network.
Metadata relating to this locality policy (stored, e.g., in cache
store 60 or in memory 24) may include bits indicating the type of
placement policy, a time stamp for the last access to the cache
block, and the network address (e.g., IP address) for the last
accessor. Any or all of this data may be communicated to or stored
in metadata service 30 (including data store 80) or configuration
manager 42 (including a journal), and any locality policy may be
controlled by metadata service 30, configuration manager 42, policy
manager 44, or any other suitable component of computer network
20.
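A minimal sketch of the two timers described above follows, assuming per-block metadata that records the last access by the owning client and the time of the last locality migration; the field names and threshold values are illustrative assumptions only, since the specification leaves the exact values to policy.

# Assumed thresholds for this sketch.
INACTIVITY_THRESHOLD = 5.0      # seconds since the owning client last accessed the block
MIGRATION_COOLDOWN = 3600.0     # e.g., one hour between locality migrations

def should_migrate_block(last_owner_access: float,
                         last_migration: float,
                         now: float) -> bool:
    """Decide whether a cache block should follow an apparently migrated client."""
    # Second timer: do not migrate again until the cooldown has elapsed.
    if now - last_migration < MIGRATION_COOLDOWN:
        return False
    # First timer: owner inactivity suggests the client application has moved.
    return now - last_owner_access >= INACTIVITY_THRESHOLD

# Example: the owner has been idle for 10 s and the block last migrated hours ago.
print(should_migrate_block(last_owner_access=7290.0, last_migration=0.0, now=7300.0))  # True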
[0114] In another example, the relative location of the underlying
data item is factored into the locality policy. Referring to FIG.
4, policies may be configured to specify that files located on
auxiliary store 50b (on Segment B) are to be cached with the
Segment B memory managers 34. This may be the case even where the
requesting client is located on Segment A. Where policy
implementations compete, as in this example, other aspects of the
policy configuration can resolve the conflict, for example through
prioritization of various components of the overall policy
regime.
[0115] From the above, it should be understood that locality may be
determined by tracking usage patterns across the cluster and
migrating memory blocks to nodes optimized to reduce the total
number of network hops involved in current and anticipated uses of
the cluster. In many cases, such optimization will significantly
reduce latency and potential for network congestion. The usage data
may be aggregated from the clients by the configuration manager and
propagated to the metadata service(s) as a form of policy that
prioritizes various cache blocks.
[0116] The policy implementation may also be employed to detect
thrashing of data items.
[0117] For example, upon detecting high rates of insertion and
eviction for a particular data item, the system may adjust to relax
eviction criteria or otherwise reduce the thrashing condition.
[0118] A further locality example includes embodiments in which a
block or data item is replicated at numerous locations within the
clustered memory resource, described further below. In certain
settings, such replication will improve fault tolerance,
performance, and may provide other advantages. For example, in a
caching system, multiple copies of a given cache block could be
sited at multiple different locations within the clustered cache. A
metadata service query would then result in identification of one
of the valid locations. In some embodiments, the second valid
location may be maintained as a replica purely for fault tolerance
purposes and may not be directly accessible to clients.
[0119] Shared Memory Method
[0120] Referring now to FIG. 5, an example shared memory method 120
will be described, in the context of client entities accessing a
clustered memory cache. As before, the clustered memory cache may
be aggregated from and comprised of physical memory on multiple
physically distinct computing systems. The context further includes
attempts by the clients to access data items that are stored in an
auxiliary store, but which may also be inserted into the clustered
memory cache.
[0121] The method may generally include running a local memory
manager on each of a plurality of physically distinct computing
systems operatively coupled with each other via network
infrastructure. One or more metadata services are instantiated, and
operatively coupled with the network infrastructure. Communications
are conducted between the metadata service(s) and the local memory
managers to provide the metadata service with metadata (e.g.,
file/path hashes, usage information/statistics, status, etc.)
associated with the physical memory locations. The metadata service
may then be operated to provide a directory service and otherwise
coordinate the memory managers, such that the physical memory
locations are collectively usable by clients as an undifferentiated
memory resource.
[0122] Referring specifically to the figure, at 122, method 120 may
also include issuing of a client request. As in the examples
described above, the request may originate or issue from an
operating system component, application, driver, library or other
client entity, and may be directed toward a file or other data item
residing on a file server, disk array or other auxiliary store.
[0123] As shown at 124, method 120 may also include checking a
local store to determine whether metadata is already available for
the requested item. The existence of local metadata indicates that
the requested item is currently present and active in the clustered
memory cache, or at least that it was at some time in the past. If
local metadata is available, a read lock is obtained if necessary
(126) and the item is read from its location in clustered memory
cache (128).
[0124] In the context of FIG. 1, these steps could correspond to an
application request, via client 32, for a particular file located
on auxiliary store 50. In response to the request, client 32 would
retrieve valid metadata for the requested file from local metadata
store 92. The retrieved metadata would indicate the particular
memory manager 34 for the data item, and/or would otherwise
indicate the location of the data item in clustered memory cache
22. The requesting client would then access the item from its
location in the cache, for example by interacting with the
respective memory manager to obtain a read lock and perform an RDMA
read of the cached item.
[0125] Continuing with FIG. 5, if it cannot be determined from the
local store that the requested item is or had been cached in the
shared memory resource, method 120 may include a determination of
whether the item is eligible for caching, as shown at 130.
Referring again to FIG. 1, client 32 and its policy engine 94
provide examples of components configured to make the eligibility
determination of step 130. Specifically, as discussed above, the
client and policy engine may filter the passing of requests to
metadata service 30, and thereby filter the usage of clustered
memory cache.
[0126] If the requested item is not eligible for caching, the
request is satisfied by means other than through the clustered
memory cache. In particular, as shown at 132, the client request is
satisfied through auxiliary access, for example by directly
accessing a back-end file system residing on auxiliary store 50
(FIG. 1).
[0127] Proceeding to 134, a metadata service may be accessed for
eligible requests that cannot be initiated with locally stored
metadata. Similar to the inquiry at step 124, the metadata service
is queried at 136 to determine whether metadata exists
corresponding to the client request. If the metadata service has
current metadata for the request (e.g., the address of a local
memory manager overseeing a portion of cache 22 where the requested
item is cached), then the metadata is returned to the requesting
entity (138), and the access and read operations may proceed as
described above with reference to steps 126 and 128.
[0128] The absence of current metadata at the queried metadata
service is an indication that the requested item is not present in
the shared memory resource (e.g., clustered memory cache 22 of FIG.
1 does not contain a non-stale copy of a file requested by one of
clients 32). Accordingly, as shown at 140, method 120 may include
determining whether an attempt will be made to insert the requested
item into the shared memory. If the item will not be inserted, the
client request may be serviced other than through use of the shared
resource, as previously described and shown at 132.
[0129] Continuing with FIG. 5, if an insertion is to be made,
method 120 may include determining the locality of the insertion,
as shown at 142. More particularly, an assessment may be made as to
a specific location or locations within the shared memory resource
where the item is to be placed.
[0130] As in the various examples discussed with reference to FIG.
1, the locality determination may be made based on various
parameters and in accordance with system policy configurations. In
some cases, locality will also be determined in response to data
gathered during operation, for example usage statistics accumulated
at a metadata service based on reports from memory managers.
[0131] As also shown at 142, the cache insertion may also include
messaging or otherwise conferring with one or more local memory
managers (e.g., memory managers MM1, MM2, etc. of FIG. 1). This
communication may include requests, acknowledgments and the like.
As an illustration, metadata service 30 might determine, based on
usage statistics and certain metadata, to attempt to cache a
requested block of data in a memory location managed by memory
manager MM4. Metadata service 30 would send the insertion request
to memory manager MM4, which could then grant the request and
permit the requested block to be written into its managed memory
location 24. The interaction of metadata service 30 and memory
manager MM4 can also include receiving an acknowledgment at the
metadata service, as shown at 144.
[0132] As previously discussed, the memory manager in some cases
may deny the insertion request, or may honor the request only after
performing an eviction or other operation on its managed memory
location(s). Indeed, in some cases, insertion requests will be sent
to different memory managers, successively or in parallel, before
the appropriate insertion location is determined. In any event, the
insertion process will typically also include updating the metadata
service data store, as also shown at 144. For example, in the case
of a cached file, the data store 80 of metadata service 30 (FIG. 1)
may be updated with a hash of the path/filename for the file.
[0133] As shown at 146, if the insertion is successful, metadata
may be provided to the client and the access and read operations
can then proceed (138, 126, 128). On the other hand, failed
insertion attempts may result in further attempts (142, 144) and/or
in auxiliary access of the requested item (132).
[0134] Client Configuration: Libraries, Drivers, Virtual Memory,
Page Fault Handling
[0135] Referring now to FIGS. 6 and 7, the figures depict exemplary
architectures that may be employed to provide clients 32 with
access to the shared memory resource(s). The figures depict various
components of client 32 in terms of a communications stack for
accessing data items, and show access pathways for reading data
items from an auxiliary store (e.g., auxiliary store 50 of FIG. 1)
or from a clustered memory resource (e.g., clustered memory cache
22 of FIG. 1), which typically provides faster and more efficient
access than the auxiliary store access.
[0136] In the example of FIG. 6, cluster interface 602 is disposed
in the communications stack between application 600 and file system
abstraction layer 604. Auxiliary store access may be made by the
file system layer through known mechanisms such as TCP/IP--Ethernet
layers 606, SCSI--Fibre Channel layers 608, and the like. As
discussed above, auxiliary store access may occur for a variety of
reasons. The file requested by application 600 might be of a type
that is not eligible for loading into clustered memory cache.
Cluster interface 602 may apply a filter that blocks or prevents
access to the shared memory resource, as in step 130 of the
exemplary method of FIG. 5. Alternatively, auxiliary store access
may be performed after a failed cluster insertion attempt, as shown
at steps 146 and 132 of FIG. 5.
[0137] Alternatively, cluster interface 602 is configured to bypass
file system layer 604 in some cases and read the requested data
from a location in the shared memory resource (e.g., a memory
location 24 in clustered memory cache 22), instead of from the
auxiliary store 50. As indicated, this access of the clustered
resource may occur via a client RDMA (over Infiniband/iWarp/RoCE)
layer 610 and a target host channel adapter 612.
[0138] Cluster interface 602 may perform various functions in
connection with the access of the shared memory resource. For
example, interface 602 may search for and retrieve metadata in
response to a request for a particular file by application 600
(e.g., as in step 124 or steps 134, 136 and 138 of FIG. 5).
Interface 602 may also interact with a metadata service to insert a
file into the clustered cache, and then, upon successful insertion,
retrieve metadata for the file to allow the cluster interface 602
to read the file from the appropriate location in the clustered
cache.
[0139] In one example embodiment, cluster interface 602 interacts
with the virtual memory system of the client device, and employs a
page-fault mechanism. Specifically, when a requested item is not
present in the local memory of the client device, a virtual memory
page fault is generated. Responsive to the issuance of the page
fault, cluster interface 602 performs the previously described
processing to obtain the requested item from the auxiliary store 50
or the shared memory cluster. Cluster interface 602 may be
configured so that, when use of the clustered cache 22 is
permitted, item retrieval is attempted by the client simultaneously
from auxiliary store 50 and clustered memory cache 22.
Alternatively, attempts to access the clustered cache 22 may occur
first, with auxiliary access occurring only after a failure.
[0140] FIG. 7 alternatively depicts a block-based system, where
cluster interface 602 is positioned between the file layer 604 and
block-based access mechanisms, such as SCSI--Fibre Channel layer
608 and SRP 620, ISER 622 and RDMA--Infiniband/iWarp (or RoCE)
layers 610. In this example, the mechanisms for storing and
accessing blocks are consistent with the file-based example of FIG.
6, though the data blocks are referenced from the device with an
offset and length instead of via the file path. In particular
embodiments, application 600 may be a virtual machine.
Additionally, cluster interface 602 may be part of a virtual
appliance with which a virtual machine communicates. In particular
embodiments, a combination of iSER and RDMA transports may be used
(in conjunction with iSER target devices in the virtual machine).
In yet other embodiments, a native driver (operable to function
with cache cluster 22) may be placed inside a hypervisor itself,
and may use the RDMA stack instead of iSER in its data path. In
these example embodiments, I/O flows from a virtual machine file
system (e.g., 604) to a native driver and then to a local memory
manager 34, for example, running inside a virtual machine.
[0141] Depending on the particular configuration employed at the
client, block-level or file-level invalidation may be employed. For
example, in the event that an application is writing to a data item
that is cached in the clustered resource, the cached copy is
invalidated, and an eviction may be carried out at the local
memory/cache manager in the cluster where the item was stored.
Along with the eviction, messaging may be sent to clients holding
references to the cached item notifying them of the eviction.
Depending on the system configuration, the clients may then perform
block or file-level invalidation.
[0142] Furthermore, it will be appreciated that variable block
sizes may be employed in block-based implementations. Specifically,
block sizes may be determined in accordance with policy
specifications. It is contemplated that block size may have a
significant effect on performance in certain settings.
[0143] Finally, configurations may be employed using APIs or other
mechanisms that are not file or block-based.
[0144] Load Balancing: Reclaimable Block Migration
[0145] In various implementations, systems and methods described
herein may be configured to place new cache blocks with a
round-robin mechanism or the like--which usually results in a
fairly even distribution of blocks across cache devices 24.
However, when cache devices are added, removed, and/or fail (such
that there is a change in the number of devices in the pool), some
cache devices can become very over-utilized or under-utilized.
Moreover, these non-ideal conditions can persist until the normal
eviction and recreation process results in enough new block
placement operations to fill the under-utilized devices, thus
relieving the space pressure on the over-utilized devices.
Unfortunately, the normal eviction and recreation process can take
a long time to complete, and, while it is happening, cache
performance is impaired by its backing store operations (e.g.,
flushing, filling, and write-through).
[0146] As used herein, the term "reclaimable blocks" refers to
valid cache blocks (that is, up to date) which can nonetheless be
quickly discarded to make space, or quickly made accessible for
application I/O to that region of a Logical Unit Number (LUN) (a
unique identifier to designate an individual or collection of
physical or virtual storage devices that execute I/O commands). As
such, reclaimable blocks (or the valid portions thereof) contain an
up-to-date copy of that region of the SAN LUN.
[0147] Generally speaking, a reclaimable block in any distributed
host cache would be free of any states or conditions (e.g., ongoing
application or fill/flush I/O, any incomplete internal bookkeeping
for the block, locks, etc.) that would prevent it from being
immediately discarded. For example, in some embodiments, a block
may be deemed reclaimable when it has no client read or write
references (no application can currently read or write that cache
block), is clean, and has no replica block. If any of these
conditions is not true, a block is instead deemed to be
"active".
[0148] A reclaimable block can be freed immediately and its cache
device space reused for a different cache block (or a replica of
some other cache block). A reclaimable block can also be made
readable and writable (with certain restrictions) again by any
client with very little delay--at which point, however, the block
then has a client reference and is deemed active and no longer
"reclaimable" (once any client gets a reference on a block, then
that cache block is no longer in the reclaimable list).
Specifically, the use of reclaimable blocks may avoid the delay of
allocation, and any eviction required to accomplish that (or the
risk that eviction will fail and force us to fall back to slower
mechanisms to access that region of the SAN LUN). If a reclaimable
block is already completely valid (say, because the last client was
reading it), it also avoids the delay of a fill read.
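A minimal sketch of the reclaimable test described in the preceding paragraphs appears below; the CacheBlock fields are assumptions standing in for whatever bookkeeping a memory manager actually keeps.

from dataclasses import dataclass

@dataclass
class CacheBlock:
    read_refs: int       # outstanding client read references
    write_refs: int      # outstanding client write references
    dirty: bool          # True if the block has unflushed writes
    has_replica: bool    # True if a replica block exists elsewhere

def is_reclaimable(block: CacheBlock) -> bool:
    """A block is reclaimable only if it has no client references, is clean,
    and has no corresponding replica block; otherwise it is deemed active."""
    return (block.read_refs == 0
            and block.write_refs == 0
            and not block.dirty
            and not block.has_replica)

print(is_reclaimable(CacheBlock(0, 0, False, False)))  # True  -> reclaimable
print(is_reclaimable(CacheBlock(1, 0, False, False)))  # False -> active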
[0149] This "completely valid" condition refers to the ways the
up-to-date contents of the cache block arrive there. If an
application reads from a cache block, the cache server must fill
the whole block from the SAN LUN, and the block becomes completely
valid. To clarify, the application generally reads only a portion
of a cache block. Generally, a typical size of a cache block may be
many times larger than a client read portion (e.g., 512 k bytes;
which indicates why cache blocks may be in a partially valid
state). In some embodiments, selective incremental filling may be
used such that reads do not cause a whole block to become
valid.
[0150] If the first access to a cache block is a write, it is
created (in WRITE ONLY mode) in a completely invalid state in which
none of its regions are consistent with the SAN LUN. The regions
the application writes to become valid, and are eventually flushed
to the SAN LUN. After some number of writes and no reads, the block
is probably still only partially valid. Now when an application
needs to read from that cache block, the cache server must fill the
invalid portions of the cache block from the SAN LUN (converting
the WRITE ONLY block to a READ WRITE block). At that time we may
discover the whole block has been written to and no fill read is
necessary. More commonly, some portion of the block must be read.
Blocks that become reclaimable may or may not be completely valid.
If the next access to a reclaimable block is a read, the fill step
still has to be done as the block becomes active again, but there
will be less fill work to do than if the block had not already
existed because portions of the block are already valid.
[0151] In short, systems and methods described herein may include a
mechanism for maintaining a pool of quickly reusable cache blocks
on each cache device to satisfy requests to create blocks not
already in the cache. This is called the "reclaimable list." The
contents of these reusable blocks are preserved until they are
actually reused, so demand for that same cached region can be
satisfied without any backing store I/O.
[0152] FIG. 8 shows method 800 for performing cache load balancing
by reclaimable block migration. In various embodiments, method 800
may be performed at least in part, by memory manager 34,
configuration managers 42, and/or policy manager 44. For example,
configuration managers 42 may distribute cache device utilization
and/or reclaimable list age information from each cache server to
all the other cache servers, whereas the actual measurements of
utilization and reclaimable list age, and the actual reclaimable
block migration may be performed by memory manager 34.
[0153] At block 801, method 800 includes maintaining reclaimable
lists for various caching devices 24. At block 802, method 800
determines whether any under-utilized and over-utilized devices
exist in the system and, if so, block 804 performs reclaimable
block migration and update operations using a "Space-Based Load
Balancing" technique; otherwise control returns to block 801. At
block 803, method 800 determines whether older reclaimable blocks
exist in the system and, if so, block 804 performs reclaimable
block migration and update operations using an "Age-Based Load
Balancing" technique; otherwise again control returns to block
801.
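A high-level Python rendering of this control flow might look as follows; the helper functions are placeholders (assumptions) standing in for the operations of blocks 801 through 804, not actual interfaces of memory manager 34 or configuration manager 42.

def maintain_reclaimable_lists(devices):       # block 801 (placeholder)
    pass

def utilization_imbalance_exists(devices):     # block 802 test (placeholder)
    return False

def older_reclaimable_blocks_exist(devices):   # block 803 test (placeholder)
    return False

def migrate_and_update(devices, mode):         # block 804 (placeholder)
    pass

def balance_once(devices):
    """One pass of the load-balancing loop sketched in FIG. 8."""
    maintain_reclaimable_lists(devices)                     # block 801
    if utilization_imbalance_exists(devices):               # block 802
        migrate_and_update(devices, mode="space-based")     # block 804
    elif older_reclaimable_blocks_exist(devices):           # block 803
        migrate_and_update(devices, mode="age-based")       # block 804
    # otherwise control simply returns to block 801 on the next pass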
[0154] The "Space-Based Load Balancing" technique and the
"Age-Based Load Balancing" technique are each described in turn
below, followed by a discussion of various implementation
considerations, as well as a couple of operational examples.
[0155] Space-Based Load Balancing
[0156] A cache space balancing technique discussed herein adds a
mechanism for advertising under-utilized cache devices to other
cache devices with high utilization. For example, memory manager 34
may advertise utilization of each cache device that it manages and
the reclaimable blocks for each cache device. When a cache device
with high utilization places a block on its reclaimable list (or
notices a reclaimable block and a migration opportunity for any
other reason), memory manager 34 may notify a selected
under-utilized cache device about this block. The IHS hosting the
under-utilized device may then trigger the block migration
mechanism to move the reclaimable block to its under-utilized
device. Each successfully migrated reclaimable block is an avoided
cache block eviction, with corresponding avoided backing store
reads.
[0157] Each time this happens, the under-utilized device becomes a
little more utilized, and the highly utilized device gains a little
free space. As more blocks are created on all cache devices, some
free space is consumed by these new blocks. On the highly utilized
cache devices this will cause the reclaimable list mechanism to
request all clients to release some of their least recently used
references on blocks in that cache device. As these blocks are
dereferenced and become reclaimable, they too will become targets
for migration. This way the space pressure on highly utilized
devices results not in eviction, but in migration to another cache
device with free space.
[0158] In various implementations, such a space-based load
balancing mechanism is best-effort, and failure to migrate a
reclaimable block has no adverse consequences beyond the backing
I/O not avoided.
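The selection of a migration target under this space-based mechanism might be sketched as follows; the utilization cutoffs and function name are assumptions (compare the AST discussion further below), and a skipped or failed migration simply means the backing I/O is not avoided.

# Assumed utilization cutoffs for this sketch.
HIGH_UTILIZATION = 0.90   # "from" cutoff: only migrate away from busy devices
LOW_UTILIZATION = 0.70    # "to" cutoff: only migrate onto devices with free space

def pick_migration_target(source_utilization, device_utilizations):
    """device_utilizations: dict of device id -> fraction of space in use.
    Returns the least utilized eligible device, or None (best effort)."""
    if source_utilization < HIGH_UTILIZATION:
        return None
    candidates = {d: u for d, u in device_utilizations.items()
                  if u < LOW_UTILIZATION}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

utilizations = {"1.A": 0.95, "1.B": 0.40, "2.A": 0.65}
print(pick_migration_target(0.95, utilizations))  # 1.B adopts the reclaimable block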
[0159] Age-Based Load Balancing
[0160] In various implementations, load balancing efforts may be
performed even in the absence of under-utilized cache devices. For
example, a mechanism that advertises the under-utilized cache
devices to all cache servers may also advertise the ages of the
oldest reclaimable blocks (least recently made reclaimable) on each
cache device. Cache devices with the least recently reclaimable
(the ones that have been reclaimable the longest) blocks can free a
number of them, making migration opportunities (space) for more
recently referenced reclaimable blocks on other cache devices. The
time spent by the block on the reclaimable list may be preserved
when reclaimable blocks are migrated. In this manner, the most
recently referenced cache blocks may be preserved in the cache
wherever that space happens to be.
[0161] Additionally or alternatively, age-based load balancing may
be used instead of space-based balancing. Over-utilized cache
devices tend to have very recently added blocks on their
reclaimable lists, while under-utilized ones tend to have
reclaimable blocks that have been there longer (or have no
reclaimable blocks, which we can treat as infinitely old). If cache
servers simply seek to migrate newer reclaimable blocks to
themselves, freeing their oldest reclaimable blocks if necessary to
do so, an empty cache device will gradually fill itself with the
most recently reclaimable blocks from around the cluster. Once
full, the balancing will continue so the oldest reclaimable blocks
may be replaced by newer ones throughout the cluster.
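The age comparison behind this behavior can be sketched as follows; ages are seconds since a block became reclaimable, an empty reclaimable list advertises an infinite age, and the 30-second AAT value anticipates the adoption age threshold assumed in the examples below.

import math

AAT = 30.0  # assumed adoption age threshold (seconds)

def advertised_age(reclaim_list):
    """Age of the oldest reclaimable block, or infinity for an empty list."""
    return max((age for _block, age in reclaim_list), default=math.inf)

def should_adopt(my_reclaim_list, peer_advertised_age):
    """Offer to adopt when a peer's reclaimable blocks are at least AAT
    seconds newer than our own oldest reclaimable block."""
    return advertised_age(my_reclaim_list) - peer_advertised_age >= AAT

mine = [("b21", 50.0), ("b22", 2.0)]   # oldest block has been reclaimable for 50 s
print(should_adopt(mine, peer_advertised_age=7.0))  # True: 50 - 7 >= 30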
[0162] Variants and Implementation
[0163] In some implementations, the age-based load balancing
technique may be used first, and the space-based technique may be
used as an alternative approach.
[0164] Because reclaimable blocks may be placed in a list, with the
newest arrivals on one end and the oldest on the other, the oldest
reclaimable block is readily identified. For example, an entry
timestamp for each block may be added and used to indicate when the block
became reclaimable. This timestamp can be converted into an age
when communicating with other IHSs, and then back into a time when
a block is migrated in--which overcomes the problem of every
device's zero timestamp being a different absolute time.
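The conversion mentioned here might be sketched as below: ages rather than raw timestamps cross the wire, because every device's clock has a different zero point. The function names are assumptions for illustration.

import time

def age_from_timestamp(became_reclaimable_at, now=None):
    """Convert a local 'became reclaimable' timestamp into an age for advertising."""
    now = time.monotonic() if now is None else now
    return now - became_reclaimable_at

def timestamp_from_age(age, now=None):
    """Convert an advertised age back into a timestamp in the adopting device's clock."""
    now = time.monotonic() if now is None else now
    return now - age

# A block reclaimable for 12 s on the source keeps that age on the target,
# even though the two devices' clocks started at different absolute times.
age = age_from_timestamp(100.0, now=112.0)   # 12.0 seconds on the source device
print(timestamp_from_age(age, now=500.0))    # 488.0 in the target device's clock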
[0165] In some cases, a predetermined number of reclaimable blocks
may be placed at the "new" end of the reclaimable list (blocks used
by write-same, for example) so they get reused ahead of those that
involve backing store work to create. In some situations, a policy
mechanism may be used to avoid migrating these blocks unless no
other reclaimable blocks outside the "early reuse" category are
available; otherwise many of them could be migrated at the expense
of better candidates that should remain reclaimable.
[0166] Still referring to age-based balancing, a cutoff may be
implemented for the age difference below which it is not worth
migrating reclaimable blocks. In some cases, the cutoff may be many
minutes.
[0167] Moreover, a migrated block may be inserted in the
reclaimable list of the destination in order by age with the rest
of the population. Otherwise, if migrated blocks are added at the
old end of a list, the list appears to have the newest reclaimable
blocks (our older ones become hidden). Conversely, if migrated
blocks are added to the new end of a list, they may actually be
older than the newest reclaimable blocks on the list, and it may
cause those to get recycled or migrated away instead. Therefore, to
find its right position on the list, an algorithm may walk down the
list one by one, or in any other suitable manner. For instance, an
aggregate structure (such as a tree or the like) for the
reclaimable pool may be used.
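As one possible realization of such an ordered insert, the sketch below keeps the destination's reclaimable list sorted oldest-first and places a migrated block by age; an aggregate structure such as a tree could replace the linear search, as noted above. The list layout is an assumption.

import bisect

def insert_by_age(reclaim_list, block_id, age):
    """Insert (age, block_id) so the list stays ordered oldest (largest age) first."""
    # bisect expects ascending keys, so search on the negated ages.
    keys = [-a for a, _ in reclaim_list]
    position = bisect.bisect_left(keys, -age)
    reclaim_list.insert(position, (age, block_id))

destination = [(16.0, "b21"), (8.0, "b10"), (3.0, "b22")]   # oldest ... newest
insert_by_age(destination, "b12", 6.0)
print(destination)  # [(16.0, 'b21'), (8.0, 'b10'), (6.0, 'b12'), (3.0, 'b22')]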
[0168] With space-based balancing, thresholds may be used for both
the "to" and "from" cutoff decision. That is, if a cache device is
less than x % full, it is not worth migrating reclaimable blocks
away from it. If the cache device is more than y % full, it is not
worth migrating reclaimable blocks to it.
[0169] In some cases, when a cache server chooses a block to adopt,
the cache server it is adopting from can optionally give it a
"better" block. That is, if server A decides that server B has
block X with age Y that's a good candidate for server A to adopt,
when server A asks server B for block X, server B can give server A
block Z instead, as long as block Z is not older than age Y of
block X. In other cases, cache server A may ask for any block from
cache server B that meets a given age criteria.
[0170] Space-Based Example
[0171] To illustrate the mechanism for Space-Based techniques
discussed above, consider the following example. At time=0,
TABLE I (Time = 0)
Dev    % F    % F + R    Adv    Reclaim (old . . . new)
1.A    0      1          9      b0,9 b1,9 b2,8 b3,1
1.B    0      1          7      b10,7 b11,7 b12,5
2.A    0      1          50     b20,50 b21,15 b22,2
2.B    0      1          15     b30,15 b31,15 b32,10
[0172] In Table 1, each row represents a different caching device.
As to the columns, "Dev" indicates a caching device ID using a
<server#>.<storage_device#> nomenclature (for purposes
of simplicity, it is assumed that all caching devices have the same
storage capacity or size), "% F" is the percentage of available
storage in the device, "% F+R" indicates a percentage of the device
that is free or reclaimable, "Adv" is the reclaimable block age
advertised by each device (that is, the time its oldest reclaimable
block has been reclaimable), and "Reclaim" is the reclaimable list
stored on that device, from oldest to newest. Each block on a given
reclaimable list is listed with a <block#>,<age>
nomenclature, where <age> is the time in seconds that
<block#> has been reclaimable.
[0173] In this example, the "adoption age threshold" (AAT) is the
difference in "advertised age" below which the adoption of a block
will not occur. That is, someone else's reclaimable block must be
AAT seconds less than your advertised age for you to offer to adopt
the block; AAT is assumed here to have a value of 30 seconds. The
"adoption space threshold" (AST) is the amount of free space a
cache device must have before it will offer to adopt reclaimable
blocks of any age, assumed to be 20%. And the "reclaimable
threshold" (RT) is the amount of free space plus reclaimable space
each cache device tries to maintain for low block allocation
latency, assumed to be 10%.
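Taken together, AAT and AST might drive the adoption decision roughly as in the following sketch; the function name, the tie-breaking by dictionary order, and the parameter values mirror the assumptions of this example rather than a prescribed implementation.

import math

AAT = 30.0   # adoption age threshold, in seconds
AST = 0.20   # adoption space threshold, as a fraction of the device

def adoption_offer(my_free, my_advertised_age, peers):
    """peers: dict of device id -> (free_fraction, advertised_age).
    Returns the peer to adopt a reclaimable block from, or None."""
    if my_free > AST:
        # Plenty of free space: adopt a block of any age from the fullest peer,
        # excluding peers with at least as much free space (ties broken arbitrarily here).
        fuller = {d: p for d, p in peers.items() if p[0] < my_free}
        return min(fuller, key=lambda d: fuller[d][0]) if fuller else None
    # Otherwise adopt only from peers whose blocks are at least AAT seconds newer,
    # preferring the highest such advertised age.
    newer = {d: p for d, p in peers.items()
             if my_advertised_age - p[1] >= AAT}
    return max(newer, key=lambda d: newer[d][1]) if newer else None

peers = {"1.A": (0.0, 10.0), "1.B": (0.0, 8.0), "2.A": (0.0, 16.0)}
print(adoption_offer(my_free=0.99, my_advertised_age=math.inf, peers=peers))  # a full peer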
[0174] Referring back to Table I, device 2.A has the oldest
reclaimable block, and device 1.B has the newest. Device 1.B's
advertised age is 43 seconds less than device 2.A's. When device
2.A notices this, it may offer to adopt a block with age AAT
seconds (or more) less than device 2.A's currently advertised age
(50-30=20, newer than device 2.A's oldest reclaimable block by an
amount of time enough to be worth the overhead).
[0175] At time=0.1 seconds,
TABLE II (Time = 0.1)
Dev    % F    % F + R    Adv    Reclaim (old . . . new)
1.A    0      1          9      b0,9 b1,9 b2,8 b3,1
1.B    0      1          7      b11,7 b12,5
2.A    0      1          15     b21,15 b10,7 b22,2
2.B    0      1          15     b30,15 b31,15 b32,10
[0176] As shown in Table II, block b20 on device 2.A has been freed
to adopt block b10 from device 1.B (this is not necessarily
required, but let us assume that happened here because 2.A did not
have space for more than three reclaimable blocks). Block b10 has
been inserted into device 2.A's reclaim list in age order (older
and newer blocks existed there in this example). Less than one
second has elapsed in this process, so all blocks still have the
same age. No other changes to the reclaim list have occurred.
[0177] No device now has an advertised age AAT seconds larger than
any other, so no adoption will begin. At time=1 second,
TABLE III (Time = 1)
Dev    % F    % F + R    Adv     Reclaim (old . . . new)
1.A    0      1          10      b0,10 b1,10 b2,9 b3,2
1.B    0      1          8       b11,8 b12,6
2.A    0      1          16      b21,16 b10,8 b22,3
2.B    0      1          16      b30,16 b31,16 b32,11
3.A    100    100        NONE
3.B    100    100        NONE
[0178] Table III shows that 1 second has elapsed since time 0, so
all reclaimable blocks are now 1 s older. No changes have occurred
to the reclaimable lists on devices 1.A through 2.B.
[0179] Moreover, at this time devices 3.A and 3.B have been added,
but are completely empty. (Assume these are the same size as all
others.) In some implementations, devices with no reclaimable
blocks advertise an age of "NONE". Hence, devices 3.A and 3.B will
both notice that they have more than AST space free, and will both
offer to adopt a block of any age from the device with the least
free space (choosing randomly among ties, and excluding devices
with the same or more free space as they have). Here devices 1.A
through 2.B each have 0% free, so assume device 3.A offers to adopt
a block of any age from device 1.B, and device 3.B makes that same
offer to device 2.A.
[0180] Shortly thereafter, at time=1.1 seconds:
TABLE IV (Time = 1.1)
Dev    % F    % F + R    Adv    Reclaim (old . . . new)
1.A    0      1          10     b0,10 b1,10 b2,9 b3,2
1.B    0      1          6      b12,6
2.A    0      1          8      b10,8 b22,3
2.B    0      1          16     b30,16 b31,16 b32,11
3.A    99     100        8      b11,8
3.B    99     100        16     b21,16
[0181] This may continue until no device has anything reclaimable,
or devices 3.A and 3.B had less than AST space free. At time=1.2
seconds:
TABLE V (Time = 1.2)
Dev    % F    % F + R    Adv     Reclaim (old . . . new)
1.A    1      1          NONE
1.B    1      1          NONE
2.A    1      1          NONE
2.B    1      1          NONE
3.A    99     100        10      b0,10 b1,10 b2,9 b11,8 b10,8 b22,3 b3,2
3.B    99     100        16      b21,16 b30,16 b31,16 b32,11 b12,6
[0182] In Table V, because devices 3.A and 3.B have greater than
AST space free, they have both adopted blocks older than their
oldest reclaimable blocks.
[0183] Even with no I/O in the cluster creating new blocks on
devices 1.A through 2.B, those cache servers would try to maintain
their RT, so would request their clients to release their oldest
blocks to do so. This may continue to make new reclaimable blocks
until % F+R reaches the RT (which again we assume here is 10%). Now
let us assume that this process continues on for 99 seconds, and by
then devices 1.A through 2.B have reached % F+R=RT. Devices 3.A
& 3.B would continue to adopt all these reclaimable blocks.
Since we are assuming these are all the same size, that could
happen and devices 3.A & 3.B would still have greater than AST
free space.
[0184] At time=100 seconds,
TABLE VI (Time = 100)
Dev    % F    % F + R    Adv     Reclaim (old . . . new)
1.A    10     10         NONE
1.B    10     10         NONE
2.A    10     10         NONE
2.B    10     10         NONE
3.A    79     100        110     b0,110 b1,110 b2,109 b11,108 b10,108 b22,103 b3,102 . . . <many much newer blocks>
3.B    79     100        116     b21,116 b30,116 b31,116 b32,111 b12,106 . . . <many much newer blocks>
[0185] Now devices 3.A and 3.B have all the reclaimable blocks in
the cluster, and they still have a lot of free space. They each
still have the oldest reclaimable blocks they ever acquired, and
they will keep them until they evict them to create a new block, or
evict them to adopt a newer reclaimable block. If work starts in
the cluster again, it will produce a stream of new blocks on all
cache devices. It may also result in some reclaimable blocks
becoming referenced by clients (and therefore no longer
reclaimable).
[0186] Assume this scenario goes on for another 100 seconds, by
which time devices 3.A and 3.B have less than AST free space and
are no longer adopting reclaimable blocks of any age, but only
those AAT seconds newer than the oldest one they already have. That
is, at time=200 seconds,
TABLE VII (Time = 200)
Dev    % F    % F + R    Adv    Reclaim (old . . . new)
1.A    0      1          1      b6,1 b7,1 b8,1
1.B    0      1          5      b12,5 b13,1 b14,1
2.A    0      1          5      b23,5 b24,5 b25,1
2.B    0      1          1      b33,1 b34,1
3.A    20     50         210    b0,210 b22,203 b3,202 . . . b4,100 . . . b5,1 . . . <many much newer blocks>
3.B    20     50         211    b32,211 b12,206 . . . <many much newer blocks>
[0187] Here blocks b1,210 b2,209 b11,208 b10,208 b21,216 b30,216
b31,216 were all referenced and are no longer reclaimable. The
oldest reclaimable block on device 3.B is now b32, and the
advertised age has been adjusted accordingly.
[0188] Devices 1.A through 2.B can tell there are currently no
other devices with an advertised age greater than AAT seconds less
than theirs, so they will not adopt any blocks. Devices 3.A and 3.B
both have 4 devices to adopt from. Because they have less than AST
free space and are no longer adopting blocks of any age, they
ignore the free space on the other cache devices, and choose the
one with the highest advertised age that is AAT newer than theirs.
Devices 1.B and 2.A are tied, so they choose randomly. Assume
device 3.A adopts from device 1.B and device 3.B adopts from device
2.A. At time 200.1 seconds,
TABLE VIII (Time = 200.1)
Dev    % F    % F + R    Adv    Reclaim (old . . . new)
1.A    0      1          1      b6,1 b7,1 b8,1
1.B    0      1          1      b13,1 b14,1
2.A    0      1          5      b24,5 b25,1
2.B    0      1          1      b33,1 b34,1
3.A    20     50         50     b0,210 b22,203 b3,202 . . . b4,100 . . . b12,5 b5,1
3.B    20     50         50     b32,211 b12,206 . . . b23,5
[0189] As new blocks continue to be created and reclaimable blocks
continue to be re-referenced, the full cache devices may continue
to produce new reclaimable blocks. While old reclaimable blocks
remain on devices 3.A and 3.B, newer reclaimable blocks will
continue to be adopted there, possibly replacing the old
reclaimable blocks. The process continues until the advertised
reclaimable block ages of every pair of cache devices differ by
less than AAT seconds, at which point migration stops until, for
example, the next configuration change or availability event
produces another empty cache device.
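The stopping condition lends itself to a compact check. Another
sketch under the same assumed names, reusing the AAT constant above
and, for simplicity, ignoring devices whose reclaimable lists are
empty:

    def migration_stopped(devices):
        """True when the advertised reclaimable-block ages of every
        pair of cache devices differ by less than AAT seconds, so no
        device will adopt from any other."""
        ages = [d.advertised_age() for d in devices]
        ages = [a for a in ages if a is not None]  # skip empty lists
        if len(ages) < 2:
            return True
        return max(ages) - min(ages) < AAT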
[0190] Age-Based Example
[0191] If we omit the use of AST, the previously described example
in Tables I-III proceeds in the same manner until time=1 second, at
which point devices 3.A and 3.B behave differently. In particular,
devices 3.A and 3.B both notice that all other devices have newer
reclaimable blocks than they do (since they each have NONE), and
that devices 2.A and 2.B are tied for having the oldest. They may
each choose randomly between the best candidates. Assume device 3.A
chooses device 2.A and device 3.B chooses device 2.B. Then, at
time=1.1 seconds,
TABLE IX (Time = 1.1)

  Dev    % F    % F + R    Adv     Reclaim (old . . . new)
  1.A    0      1          10      b0,10 b1,10 b2,9 b3,2
  1.B    0      1          8       b11,8 b12,6
  2.A    0      1          16      b10,8 b22,3
  2.B    0      1          16      b31,16 b32,11
  3.A    99     100        16      b21,16
  3.B    99     100        16      b30,16
[0192] Before devices 3.A and 3.B adopted any blocks, devices 1.A
through 2.B each had an advertised age within AAT of one another,
so no adoption took place. Now that devices 3.A and 3.B have each
adopted something, they too have an advertised age within AAT of
the other nodes. This would happen as long as the first several
reclaimable blocks on any device had about the same age, which is
what would be expected with eviction happening uniformly before the
devices were added.
[0193] Therefore, even though devices 3.A and 3.B have ample free
space, they will not adopt anything from the full cache devices
until something changes. If load on the system continues to create
new blocks and make more blocks reclaimable, some reclaimable
blocks on devices 1.A through 2.B may be evicted to make space, and
other blocks may take their place on the reclaimable list. Once AAT
seconds have gone by, blocks placed on the reclaimable list of
devices 1.A through 2.B may be adopted by device 3.A or 3.B if they
are the oldest on the reclaimable list.
[0194] Assume that 29 more seconds go by, during which devices 1.A
through 2.B evict at a fair rate, so that at time=30 seconds their
oldest reclaimable blocks are at most 10 seconds old:
TABLE X (Time = 30)

  Dev    % F    % F + R    Adv     Reclaim (old . . . new)
  1.A    0      1          10      b4,10 b5,2 b6,2 b7,2
  1.B    0      1          9       b13,9 b14,6
  2.A    0      1          1       b23,1 b24,1
  2.B    0      1          5       b33,5 b34,4
  3.A    99     100        46      b21,46
  3.B    99     100        46      b30,46
[0195] Now devices 3.A and 3.B may adopt from any of the other
cache devices, because those devices have all evicted (or
re-referenced) the reclaimable blocks that were less than AAT
seconds newer than the advertised ages of devices 3.A and 3.B.
[0196] From this point on, the example proceeds as it did above in
Tables VII and VIII. This variant omits the AST logic and the need
for each caching device to learn every other device's % F. The only
cost is continued eviction by the full cache devices for AAT
seconds after an empty cache device is added, although eviction was
already happening for some unspecified amount of time before
devices 3.A and 3.B arrived anyway.
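Omitting AST reduces the adoption choice to a single age
comparison. As a sketch under the same assumed names as above (and
again reusing the AAT constant and the random tie-break), the
age-based variant might look like this:

    def choose_adoption_source_age_only(adopter, peers):
        """Age-based variant: no AST test and no knowledge of peers'
        % F. A device with no reclaimable blocks (advertised age
        NONE) treats every peer with reclaimable blocks as a
        candidate."""
        my_age = adopter.advertised_age()
        candidates = []
        for peer in peers:
            age = peer.advertised_age()
            if age is None:
                continue
            if my_age is None or my_age - age > AAT:
                candidates.append((age, peer))
        if not candidates:
            return None
        best = max(age for age, _ in candidates)
        tied = [peer for age, peer in candidates if age == best]
        return random.choice(tied)

Under this rule, devices 3.A and 3.B adopt b21,16 and b30,16 at
time=1.1 and then stall until the other devices' advertised ages
fall more than AAT seconds below their own, exactly as Tables IX
and X showed.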
[0197] It should be understood that various operations described
herein may be implemented in software executed by processing
circuitry, hardware, or a combination thereof. The order in which
each operation of a given method is performed may be changed, and
various operations may be added, reordered, combined, omitted,
modified, etc. It is intended that the invention(s) described
herein embrace all such modifications and changes and, accordingly,
the above description should be regarded in an illustrative rather
than a restrictive sense.
[0198] Although the invention(s) is/are described herein with
reference to specific embodiments, various modifications and
changes can be made without departing from the scope of the present
invention(s), as set forth in the claims below. Accordingly, the
specification and figures are to be regarded in an illustrative
rather than a restrictive sense, and all such modifications are
intended to be included within the scope of the present
invention(s). Any benefits, advantages, or solutions to problems
that are described herein with regard to specific embodiments are
not intended to be construed as a critical, required, or essential
feature or element of any or all the claims.
[0199] Unless stated otherwise, terms such as "first" and "second"
are used to arbitrarily distinguish between the elements such terms
describe. Thus, these terms are not necessarily intended to
indicate temporal or other prioritization of such elements. The
terms "coupled" or "operably coupled" are defined as connected,
although not necessarily directly, and not necessarily
mechanically. The terms "a" and "an" are defined as one or more
unless stated otherwise. The terms "comprise" (and any form of
comprise, such as "comprises" and "comprising"), "have" (and any
form of have, such as "has" and "having"), "include" (and any form
of include, such as "includes" and "including") and "contain" (and
any form of contain, such as "contains" and "containing") are
open-ended linking verbs. As a result, a system, device, or
apparatus that "comprises," "has," "includes" or "contains" one or
more elements possesses those one or more elements but is not
limited to possessing only those one or more elements. Similarly, a
method or process that "comprises," "has," "includes" or "contains"
one or more operations possesses those one or more operations but
is not limited to possessing only those one or more operations.
* * * * *