U.S. patent number 10,852,959 [Application Number 16/001,077] was granted by the patent office on 2020-12-01 for data storage system, process and computer program for such data storage system for reducing read and write amplifications.
This patent grant is currently assigned to Hitachi, Ltd. The grantee listed for this patent is Hitachi, Ltd. Invention is credited to Christopher James Aston, Simon Latimer Benham, Mitsuo Hayasaka, Yuko Matsui, Jonathan Mark Smith, Trevor Edward Willis.
![](/patent/grant/10852959/US10852959-20201201-D00000.png)
![](/patent/grant/10852959/US10852959-20201201-D00001.png)
![](/patent/grant/10852959/US10852959-20201201-D00002.png)
![](/patent/grant/10852959/US10852959-20201201-D00003.png)
![](/patent/grant/10852959/US10852959-20201201-D00004.png)
![](/patent/grant/10852959/US10852959-20201201-D00005.png)
![](/patent/grant/10852959/US10852959-20201201-D00006.png)
![](/patent/grant/10852959/US10852959-20201201-D00007.png)
![](/patent/grant/10852959/US10852959-20201201-D00008.png)
![](/patent/grant/10852959/US10852959-20201201-D00009.png)
![](/patent/grant/10852959/US10852959-20201201-D00010.png)
United States Patent 10,852,959
Hayasaka, et al.
December 1, 2020
Data storage system, process and computer program for such data
storage system for reducing read and write amplifications
Abstract
The present disclosure relates to a data storage system, and
processes and computer programs for such data storage system, for
example including processing of: managing one or more metadata tree
structures for storing data to one or more storage devices of the
data storage system in units of blocks, each metadata tree
structure including a root node pointing directly and/or indirectly
to blocks, and a leaf tree level having one or more direct nodes
pointing to blocks, and optionally including one or more
intermediate tree levels having one or more indirect nodes pointing
to indirect nodes and/or direct nodes of the respective metadata
tree structure; maintaining the root node and/or nodes of at least
one tree level of each of at least one metadata structure in a
cache memory; and managing I/O access to data based on the one or
more metadata structures, including obtaining the root node and/or
nodes of the at least one tree level of the metadata structure
maintained in the cache memory from the cache memory and obtaining
at least one node of another tree level of the metadata structure
from the one or more storage devices.
Inventors: Hayasaka; Mitsuo (Tokyo, JP), Aston; Christopher James (Tokyo, JP), Smith; Jonathan Mark (Tokyo, JP), Matsui; Yuko (Tokyo, JP), Benham; Simon Latimer (Tokyo, JP), Willis; Trevor Edward (Tokyo, JP)
Applicant: Hitachi, Ltd. (Tokyo, JP)
Assignee: Hitachi, Ltd. (Tokyo, JP)
Family ID: 1000005215528
Appl. No.: 16/001,077
Filed: June 6, 2018
Prior Publication Data
US 20180285002 A1, published Oct 4, 2018
Related U.S. Patent Documents
Application Ser. No. 15/373,686, filed Dec 9, 2016, now U.S. Pat. No. 9,996,286
PCT/US2016/031811, filed May 11, 2016
Current U.S. Class: 1/1
Current CPC Class: G06F 3/065 (20130101); G06F 3/0685 (20130101); G06F 3/0683 (20130101); G06F 3/0619 (20130101); G06F 3/064 (20130101); G06F 12/0811 (20130101); G06F 16/128 (20190101); G06F 3/061 (20130101); G06F 16/1873 (20190101); G06F 16/183 (20190101); G06F 12/0246 (20130101); G06F 2212/283 (20130101)
Current International Class: G06F 3/06 (20060101); G06F 12/0811 (20160101); G06F 16/18 (20190101); G06F 16/182 (20190101); G06F 16/11 (20190101); G06F 12/02 (20060101)
References Cited
U.S. Patent Documents
Primary Examiner: Baughman; William E.
Attorney, Agent or Firm: Mattingly & Malur, PC
Parent Case Text
CROSS-REFERENCE TO RELATED PATENT APPLICATIONS
The present application is a continuation application of U.S.
application Ser. No. 15/373,686, filed Dec. 9, 2016, which is a
continuation application of PCT/US2016/031811, filed May 11, 2016,
the contents of which are hereby incorporated by reference into
this application.
Claims
The invention claimed is:
1. A data storage system connectable to one or more client
computers, comprising: a processing unit including a processor or a
programmable logic device; a cache memory; and one or more storage
devices and an interface to communicably connect with one or more
storage devices; the processing unit being adapted to execute:
managing metadata of operation management information for storing
data to the one or more storage devices in units of blocks by
configuring a tree of metadata for managing data, determining a
capacity of the cache memory, based on the determined capacity of
the cache memory, determining a tree level to be stored in the
cache memory according to the determined capacity of the cache
memory and an amount of metadata corresponding to the determined
tree level, and storing the metadata of the tree up to the
determined tree level, managing I/O access to data based on the
metadata maintained in the cache memory and the metadata maintained
in the one or more storage devices, accumulating updates of the
metadata of the operation management information in the cache
memory, updating the metadata of the operation management
information in the storage devices based on the metadata of the
operation management information in the cache memory.
2. The storage system according to claim 1, wherein the operation
management information is divided into plurality of regions, the
processing unit being adapted to execute: updating the metadata of
the operation management information in the storage devices based
on the metadata of the operation management information in the
cache memory on a region-by-region basis.
3. The storage system according to claim 1, wherein the operation
management information is allocation management information.
4. The storage system according to claim 3, wherein the allocation
management information indicates free space objects.
5. The storage system according to claim 2, wherein the processing
unit updates the metadata of the operation management information
in the storage devices when a criteria of a particular region of
the operation management information is satisfied.
6. A method in a data storage system connectable to one or more
client computers, the data storage system comprising: a processing
unit including a processor or a programmable logic device; a cache
memory; and one or more storage devices and an interface to
communicably connect with one or more storage devices; the method,
executed by the processing unit, and comprising the steps of:
managing metadata of operation management information for storing
data to the one or more storage devices in units of blocks by
configuring a tree of metadata for managing data, determining a
capacity of the cache memory, based on the determined capacity of
the cache memory, determining a tree level to be stored in the
cache memory according to the determined capacity of the cache
memory and an amount of metadata corresponding to the determined
tree level, and storing the metadata of the tree up to the
determined tree level, managing I/O access to data based on the
metadata maintained in the cache memory and the metadata maintained
in the one or more storage devices, accumulating updates of the
metadata of the operation management information in the cache
memory, and updating the metadata of the operation management
information in the storage devices based on the metadata of the
operation management information in the cache memory.
7. The method according to claim 6, wherein the operation
management information is divided into plurality of regions, the
method further comprising the step of: updating the metadata of the
operation management information in the storage devices based on
the metadata of the operation management information in the cache
memory on a region-by-region basis.
8. The method according to claim 6, wherein the operation
management information is allocation management information.
9. The method according to claim 8, wherein the allocation
management information indicates free space objects.
10. The method according to claim 7, further comprising the step
of: updating the metadata of the operation management information
in the storage devices when a criteria of a particular region of
the operation management information is satisfied.
DESCRIPTION
The present disclosure relates to a data storage system and/or a
data storage apparatus connectable to one or more host computers,
and in particular a data storage system and/or a data storage
apparatus processing I/O requests.
Further, the present disclosure relates to methods of control of
such data storage system and/or a data storage apparatus. Other
aspects may relate to computer programs, computer program products
and computer systems to operate software components including
executing processing I/O requests at such data storage system
and/or a data storage apparatus.
BACKGROUND
When managing I/O requests from clients to data stored in units of
blocks on storage devices based on a metadata tree structure
including a root node directly or indirectly pointing to blocks
e.g. via indirect nodes pointing to direct nodes and via direct
nodes pointing to blocks of data, in particular in connection with
a log write method which writes modified data to newly allocated
blocks, it has been recognized that by referring to the metadata
nodes by processing the metadata tree structure may lead to
significant read and write amplifications due to random reads
and/or random writes in connection with metadata nodes.
In view of the above problem, it is an object of the present invention to provide aspects in a data storage system which provides and updates a metadata tree structure of plural metadata nodes for managing I/O requests, allowing read and write amplifications to be reduced or avoided, preferably while achieving high efficiency in handling I/O requests from a high number of clients and in connection with multiple types of I/O access protocols, economical use of storage resources and memories, efficient scalability for clustered systems of multiple node apparatuses, highly reliable and efficient data consistency and data protection, and efficient and reliable recovery functions in case of failures.
SUMMARY
According to the invention, there are proposed a computer program, a method and a data storage system according to the independent claims. The dependent claims relate to preferred embodiments.
According to exemplary aspects, there may be provided a computer
program including instructions to cause a computer to execute a
method for managing a data storage system.
The method may comprise: managing one or more metadata tree
structures for storing data to one or more storage devices of the
data storage system in units of blocks, each metadata tree
structure including a root node pointing directly and/or indirectly
to blocks, and a leaf tree level having one or more direct nodes
pointing to blocks, and optionally including one or more
intermediate tree levels having one or more indirect nodes pointing
to indirect nodes and/or direct nodes of the respective metadata
tree structure; maintaining the root node and/or metadata nodes of
at least one tree level of each of at least one metadata structure
in a cache memory; and managing I/O access to data based on the one
or more metadata structures, including obtaining the root node
and/or nodes of the at least one tree level of the metadata
structure maintained in the cache memory from the cache memory and
obtaining at least one node of another tree level of the metadata
structure from the one or more storage devices.
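As a non-authoritative illustration of such a tree walk, the following Python sketch resolves a read through a root node, one or more indirect nodes and a direct node, taking nodes of cached tree levels from memory and all other nodes from the storage devices. The class and function names (Node, Storage, read_data_block) and the single-branch path argument are assumptions made for the example, not terminology of this disclosure.

```python
# Illustrative sketch only; names and structure are assumptions.

class Node:
    """One metadata node: level 0 is a direct node (points to data blocks),
    higher levels are indirect nodes or the root (point to other nodes)."""
    def __init__(self, level, pointers):
        self.level = level
        self.pointers = pointers          # child block addresses


class Storage:
    """Stand-in for the storage devices, keyed by block address."""
    def __init__(self, nodes, blocks):
        self.nodes, self.blocks = nodes, blocks

    def read_node(self, addr):
        return self.nodes[addr]           # one device read

    def read_block(self, addr):
        return self.blocks[addr]          # one device read


def read_data_block(root, path, cache, storage, cached_levels):
    """Walk one branch from the root down to a data block.

    path[i] is the child index chosen at each step; nodes whose tree
    level is in cached_levels are taken from the cache without a device
    read, all other nodes are read from the storage devices.
    """
    node = root                           # the root node is kept in cache
    for child in path[:-1]:
        addr = node.pointers[child]
        if node.level - 1 in cached_levels:
            node = cache[addr]            # cached tree level: no device read
        else:
            node = storage.read_node(addr)
    return storage.read_block(node.pointers[path[-1]])
```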
According to exemplary aspects, the root node and/or metadata nodes
of at least one tree level of each of at least one metadata
structure are preferably systematically maintained in the cache
memory preferably for managing the I/O access to data based on the
one or more metadata structures.
For example, "systematically maintaining" a certain data unit in
cache memory may mean that the data unit is kept in cache memory
until reset or re-boot of the system, and is updated whenever
modified in cache memory. Specifically, data systematically
maintained in cache memory may be kept in cache memory for a long
time (e.g. until manual reset or system shutdown or re-boot), in
particular independent of whether the data is frequently accessed,
less frequently accessed or accessed at all. Other data may be
commonly kept in cache memory temporarily (e.g. according to FIFO
management), and such data is only kept longer in cache memory when
used or accessed regularly. At system start, data systematically
maintained in cache memory may be automatically loaded into the
cache memory independent of access to the data, while other data is
only loaded to cache memory when actually needed.
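The distinction can be pictured with a small cache model, sketched below under the assumption of a FIFO policy for the temporarily cached entries; MetadataCache and its method names are illustrative only.

```python
from collections import OrderedDict


class MetadataCache:
    """Cache with two kinds of entries, as described above: pinned entries
    are loaded at system start and kept until reset or re-boot, temporary
    entries are evicted (here FIFO) regardless of how often they are used."""

    def __init__(self, temp_capacity):
        self.pinned = {}                   # systematically maintained nodes
        self.temp = OrderedDict()          # temporarily cached nodes
        self.temp_capacity = temp_capacity

    def load_pinned(self, addr, node):
        """Called at system start, independent of any access to the node."""
        self.pinned[addr] = node

    def put_temp(self, addr, node):
        """Called when an uncached node is needed for an I/O request."""
        self.temp[addr] = node
        if len(self.temp) > self.temp_capacity:
            self.temp.popitem(last=False)  # evict the oldest temporary entry

    def get(self, addr):
        if addr in self.pinned:
            return self.pinned[addr]       # never evicted until reset
        return self.temp.get(addr)
```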
According to exemplary aspects, metadata nodes of at least one
other tree level of each of at least one metadata structure are
preferably temporarily loaded to the cache memory, preferably when
required for managing the I/O access to data based on the one or
more metadata structures.
According to exemplary aspects, metadata nodes of a first group
associated with one or more lowest tree levels of each of at least
one metadata structure, in particular including at least a tree
level of direct nodes, are preferably temporarily loaded to the
cache memory when required for managing the I/O access to data
based on the one or more metadata structures.
According to exemplary aspects, metadata nodes of a second group
associated with one, more or all higher tree levels above the one
or more lowest tree level in each of at least one metadata
structure are systematically maintained in the cache memory.
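A simple way to split the tree levels into these two groups is to pin levels from the top down while they still fit into the cache budget, as in the following sketch; the function name, the per-level node counts and the 4 KiB node size in the example are assumptions for illustration.

```python
def choose_pinned_levels(nodes_per_level, node_size, cache_capacity):
    """Return the tree levels of the second group (systematically cached).

    Levels are added from the root level downwards while the accumulated
    metadata still fits into cache_capacity; the remaining lower levels,
    including the direct-node level 0, form the first group that is only
    loaded temporarily when needed.
    """
    pinned, used = [], 0
    for level in sorted(nodes_per_level, reverse=True):
        needed = nodes_per_level[level] * node_size
        if used + needed > cache_capacity:
            break
        pinned.append(level)
        used += needed
    return pinned


# Example: a root level, two indirect levels and a direct level with 4 KiB
# nodes and a 1 GiB cache budget -> the three upper levels are pinned.
nodes_per_level = {3: 1, 2: 128, 1: 16384, 0: 2097152}
print(choose_pinned_levels(nodes_per_level, 4096, 1 << 30))   # [3, 2, 1]
```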
According to exemplary aspects, writing modified metadata nodes of
the first group to the one or more storage devices is preferably
controlled on the basis of taking a first-type of checkpoint.
According to exemplary aspects, writing modified metadata nodes of
the second group to the one or more storage devices is preferably
controlled on the basis of taking a second-type of checkpoint.
According to exemplary aspects, taking a new first-type checkpoint
(and preferably writing metadata nodes of the first group which
have been modified in a previous first-type checkpoint to the one
or more storage devices upon taking the new first-type checkpoint),
is preferably performed more frequent than taking a new second-type
checkpoint (and preferably writing metadata nodes of the second
group which have been modified in a previous second-type checkpoint
to the one or more storage devices upon taking the new second-type
checkpoint).
According to exemplary aspects, modifying one or more metadata
nodes of the first group preferably includes writing the one or
more modified metadata nodes to a non-volatile memory.
According to exemplary aspects, modifying one or more metadata
nodes of the second group preferably includes writing respective
delta data for each of the one or more modified metadata nodes to
the non-volatile memory, each respective delta data preferably
being indicative of a difference between the respective modified
metadata node of the second group as stored in the cache memory and
the respective non-modified metadata node as stored on the one or
more storage devices.
According to exemplary aspects, the size of a delta data unit is
preferably smaller than a size of an associated metadata node.
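The different treatment of the two groups when a node is modified can be sketched as follows; the NVRAM layout (two dictionaries) and the delta format (changed pointer slot mapped to its new block address) are assumptions made only to keep the example concrete.

```python
def stage_modification(addr, old_pointers, new_pointers, group, nvram):
    """Stage a metadata-node modification in the non-volatile memory.

    A modified first-group node is written to NVRAM in full; for a
    modified second-group node only the difference between the cached
    (modified) node and the on-disk (unmodified) node is written, which
    is smaller than a whole node when only a few pointers change.
    """
    if group == "first":
        nvram.setdefault("nodes", {})[addr] = list(new_pointers)
        return
    delta = {slot: new
             for slot, (old, new) in enumerate(zip(old_pointers, new_pointers))
             if old != new}
    nvram.setdefault("deltas", {}).setdefault(addr, []).append(delta)
```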
According to exemplary aspects, taking a new first-type checkpoint
is preferably performed when an amount of data of metadata nodes of
the first group in the non-volatile memory exceeds a first
threshold.
According to exemplary aspects, taking a new second-type checkpoint
is preferably performed when an amount of delta data associated
with metadata nodes of the second group in the non-volatile memory
exceeds a second threshold.
According to exemplary aspects, the second threshold is preferably
larger than the first threshold.
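Under these assumptions the checkpoint triggers reduce to two threshold comparisons, as in this minimal sketch (function and parameter names are illustrative):

```python
def checkpoints_to_take(staged_node_bytes, staged_delta_bytes,
                        first_threshold, second_threshold):
    """Decide which checkpoint types to take from the NVRAM usage.

    staged_node_bytes is the amount of full first-group nodes staged in
    NVRAM, staged_delta_bytes the amount of second-group delta data.
    Because the second threshold is larger than the first, first-type
    (minor) checkpoints fire more often than second-type (major) ones.
    """
    take_minor = staged_node_bytes > first_threshold
    take_major = staged_delta_bytes > second_threshold
    return take_minor, take_major
```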
According to exemplary aspects, when performing a recovery operation, the method may include recovering a previously modified metadata node of the first group, which preferably includes reading the modified metadata node of the first group from the non-volatile memory.
According to exemplary aspects, when performing a recovery operation, the method may include recovering a previously modified metadata node of the second group, which preferably includes reading the corresponding non-modified metadata node from the one or more storage devices, reading corresponding delta data from the non-volatile memory, and modifying the non-modified metadata node based on the corresponding delta data.
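A recovery path following these two rules could look like the sketch below; the delta format matches the earlier staging sketch and remains an assumption.

```python
def apply_delta(pointers, delta):
    """Apply one logged delta (changed pointer slot -> new block address)."""
    out = list(pointers)
    for slot, new_addr in delta.items():
        out[slot] = new_addr
    return out


def recover_node(addr, group, nvram_nodes, nvram_deltas, on_disk_nodes):
    """Rebuild the latest state of a metadata node after a failure.

    First group: the full modified node was staged in NVRAM, so it is read
    back directly. Second group: the unmodified node is read from the
    storage devices and the delta data staged in NVRAM is re-applied.
    """
    if group == "first":
        return nvram_nodes[addr]
    node = on_disk_nodes[addr]
    for delta in nvram_deltas.get(addr, []):
        node = apply_delta(node, delta)
    return node
```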
According to exemplary aspects, the method may include changing a
highest node tree level of the metadata nodes of the first group to
become a new lowest node tree level of the metadata nodes of the
second group preferably based on monitoring a cache capacity, in
particular preferably if a data amount of metadata nodes of the
second group falls below a third threshold.
According to exemplary aspects, the method may include changing a
lowest node tree level of the metadata nodes of the second group to
become a new highest node tree level of the metadata nodes of the
first group preferably based on monitoring a cache capacity, in
particular preferably if a data amount of metadata nodes of the
second group exceeds a fourth threshold.
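The two rules can be expressed as a small re-balancing step run whenever cache usage is monitored, as sketched below; the level numbering (0 for the direct-node level) and the parameter names are assumptions.

```python
def adjust_group_boundary(pinned_levels, second_group_bytes,
                          third_threshold, fourth_threshold):
    """Move the boundary between the first and second group of tree levels.

    If the systematically cached (second-group) metadata has shrunk below
    the third threshold, the highest first-group level is promoted into
    the second group; if it has grown beyond the fourth threshold, the
    lowest second-group level is demoted to the first group.
    """
    lowest_pinned = min(pinned_levels)
    if second_group_bytes < third_threshold and lowest_pinned > 0:
        return sorted(pinned_levels + [lowest_pinned - 1], reverse=True)
    if second_group_bytes > fourth_threshold and len(pinned_levels) > 1:
        return [level for level in pinned_levels if level != lowest_pinned]
    return pinned_levels
```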
According to exemplary aspects, when modifying a metadata node of
the second group associated with a new second-type checkpoint
before a respective corresponding modified metadata node of the
second group associated with a previous second-type checkpoint is
written to the one or more storage devices, the respective modified
metadata node of the second group associated with the new
second-type checkpoint and corresponding reverse delta data is
preferably stored in the cache memory, the corresponding reverse
delta data being preferably indicative of a difference between the
respective modified metadata node of the second group as stored in
the cache memory and the respective corresponding modified metadata
node of the second group associated with the previous second-type checkpoint.
According to exemplary aspects, writing the respective
corresponding modified metadata node of the second group associated
with the previous second-type checkpoint to the one or more storage
devices preferably includes modifying the respective modified
metadata node of the second group as stored in the cache memory
based on the corresponding reverse delta data as stored in the
cache memory.
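A sketch of this reverse-delta handling is given below; representing node contents as plain pointer lists and the reverse-delta format (slot mapped to its previous address) are illustrative assumptions.

```python
def modify_for_new_checkpoint(addr, new_pointers, cache,
                              reverse_deltas, pending_flush):
    """Modify a second-group node for the new checkpoint while the version
    belonging to the previous checkpoint has not yet been written out.

    The cache keeps only the newest version plus reverse delta data that
    describes how to get back to the previous-checkpoint version."""
    if addr in pending_flush:
        current = cache[addr]
        reverse_deltas[addr] = {slot: old
                                for slot, (old, new)
                                in enumerate(zip(current, new_pointers))
                                if old != new}
    cache[addr] = list(new_pointers)


def flush_previous_version(addr, cache, reverse_deltas, storage):
    """Write the previous-checkpoint version of the node to the storage
    devices by undoing the newer modification with the reverse delta."""
    previous = list(cache[addr])
    for slot, old_addr in reverse_deltas.pop(addr, {}).items():
        previous[slot] = old_addr
    storage[addr] = previous
```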
According to further aspects there may be provided a method for
managing a data storage system, comprising: managing one or more
metadata tree structures for storing data to one or more storage
devices of the data storage system in units of blocks, each
metadata tree structure including a root node pointing directly
and/or indirectly to blocks, and a leaf tree level having one or
more direct nodes pointing to blocks, and optionally including one
or more intermediate tree levels having one or more indirect nodes
pointing to indirect nodes and/or direct nodes of the respective
metadata tree structure; maintaining the root node and/or metadata
nodes of at least one tree level of each of at least one metadata
structure in a cache memory; and/or managing I/O access to data
based on the one or more metadata structures, including obtaining
the root node and/or nodes of the at least one tree level of the
metadata structure maintained in the cache memory from the cache
memory and obtaining at least one node of another tree level of the
metadata structure from the one or more storage devices.
In the following, further aspects are described, which may be
provided independently of the above aspects or in combination with
one or more of the above aspects.
According to exemplary aspects, there may be provided a computer program including instructions to cause a computer to execute a method for managing a data storage system.
The method may further comprise managing a data structure (such as
e.g. allocation management information, a free space object, and/or
a free space bit map). Such data structure may preferably be
indicative of an allocation status of each of a plurality of blocks
of storage, the allocation status of a block preferably being free
or used. For example, such data structure may include a plurality
of indicators (such as bits, groups of bits, bytes or groups of
bytes), wherein each indicator is associated with a respective
storage block and each indicator is indicative of an allocation
status of its associated storage block.
Preferably, if an allocation status of a block is indicated as
"free", the corresponding storage block is preferably available for
allocation, e.g. for writing a data block of user data or a
metadata node to the storage block upon allocation.
Preferably, if an allocation status of a block is indicated as
"used", the corresponding storage block is preferably storing
previously written data of a data block of user data or a metadata
node to the storage block, thereby not being available for
re-allocation, so that the "used" block is preferably not allocated
for writing another data block until being freed (e.g. when the
previously written data is not needed anymore and the block can be
made available again for allocation for writing new data).
Further, in some exemplary embodiments, if the allocation status of
a block is indicated as "used", the allocation management
information may be further indicative of a reference count of the
block. Such reference count may preferably be indicative of a
number of how many pointers of other objects, metadata structures
and/or metadata nodes of one or more metadata structures point to
the respective block. For example, a block can be allocated again
once a reference count of a block is decremented to zero, and no
other objects, metadata structures and/or metadata nodes point to
the respective block.
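As a minimal model of such allocation management information, the sketch below keeps one reference count per storage block, with 0 meaning "free"; the class name and the linear free-block search are assumptions for illustration.

```python
class AllocationInfo:
    """One reference count per storage block; 0 means the block is free."""

    def __init__(self, n_blocks):
        self.refcount = [0] * n_blocks

    def allocate(self):
        """Pick a free block and mark it used (reference count 0 -> 1)."""
        for blk, count in enumerate(self.refcount):
            if count == 0:
                self.refcount[blk] = 1
                return blk
        raise RuntimeError("no free blocks available")

    def add_reference(self, blk):
        """Another object or metadata node now also points to this block."""
        self.refcount[blk] += 1

    def release(self, blk):
        """Drop one reference; at zero the block may be allocated again."""
        self.refcount[blk] -= 1
        return self.refcount[blk] == 0
```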
The method may preferably comprise updating a data structure, such
as e.g. allocation management information, a free space object,
and/or a free space bit map, indicative of an allocation status of
each of a plurality of blocks of storage.
The method may comprise performing, during managing I/O access,
allocation operations which may include changing a status of one or
more blocks from "free" to "used" and/or incrementing a reference
count of one or more blocks from zero to one (or more).
The method may comprise performing, during managing I/O access,
non-allocation operations which may include changing a status of
one or more blocks from "used" to "free" and/or incrementing and/or
decrementing a reference count of one or more blocks.
Preferably, when changing a status of one or more blocks (or,
preferably, after changing a status of one or more blocks, for
non-allocation operations), the method may include performing an
update operation of modifying the data structure (such as e.g.
allocation management information, a free space object, and/or a
free space bit map) to be indicative of the changed status of the
block.
Furthermore, the data structure (such as e.g. allocation management
information, a free space object, and/or a free space bit map) may
be logically divided into a plurality of regions, each region being
preferably associated with a respective group of storage
blocks.
The method may further comprise managing, for each of the plurality
of regions, respective update operation management information
being indicative of one or more non-allocation update operations to
be applied to update the data structure (such as e.g. allocation
management information, a free space object, and/or a free space
bit map).
The method may further include accumulating, for each region, data
entries of respective update operation management information
associated with the respective region, each data entry being
indicative of a non-allocation update operation to be applied to
update the respective region of the data structure (such as e.g.
allocation management information, a free space object, and/or a
free space bit map) before updating the data structure according to
the accumulated non-allocation update operations to be applied.
Accordingly, a region of the data structure can be updated by
applying plural or all of accumulated non-allocation update
operations based on the respective update operation management
information associated with the respective region.
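The following sketch models such per-region accumulation of non-allocation updates over a plain list of reference counts; the region size, the operation names and the apply_region entry point are assumptions for illustration.

```python
from collections import defaultdict


class RegionBatchedUpdates:
    """Accumulate non-allocation updates per region of the allocation
    management information and apply them one region at a time, instead
    of updating the structure for every single operation."""

    def __init__(self, refcounts, blocks_per_region):
        self.refcounts = refcounts                 # one count per block
        self.blocks_per_region = blocks_per_region
        self.pending = defaultdict(list)           # region -> [(op, block)]

    def queue(self, op, blk):
        """Record a non-allocation update ('free', 'increment', 'decrement')."""
        self.pending[blk // self.blocks_per_region].append((op, blk))

    def apply_region(self, region):
        """Apply all accumulated updates of one region, then forget them."""
        for op, blk in self.pending.pop(region, ()):
            if op == "free":
                self.refcounts[blk] = 0
            elif op == "increment":
                self.refcounts[blk] += 1
            elif op == "decrement":
                self.refcounts[blk] -= 1
```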
Preferably, updating the data structure (such as e.g. allocation
management information, a free space object, and/or a free space
bit map) by applying non-allocation update operations is performed
on a region-by-region basis.
Preferably, updating a region of the data structure is performed
when an applying criteria is met.
For example, the number of entries and/or the number of accumulated
update operation entries in update operation management information
per region may be monitored, and when the number of entries and/or
the number of accumulated update operation entries in update
operation management information exceed a threshold, the
accumulated update operations of the respective region can be
applied. Then, the applying criteria may be fulfilled when the
number of entries and/or the number of accumulated update operation
entries in update operation management information exceed a
threshold for at least one region.
Also, in addition or alternatively, the applying criteria may
involve a periodic update such that the applying criteria is
fulfilled whenever a periodic time to update expires, and at that
time, the one or more regions being associated with the highest
number of entries and/or the highest number of accumulated update
operation entries in update operation management information are
selected to be updated.
Also, in addition or alternatively, the applying criteria may
involve a check of an amount of available free blocks that can be
used for allocation according to the allocation management
information of the data structure (such as e.g. allocation
management information, a free space object, and/or a free space
bit map), and when the amount of available free blocks falls below
a threshold, one or more regions of the allocation management
information are updated, e.g. until the amount of free blocks that
can be used for allocation according to the allocation management
information of the data structure is sufficiently increased, e.g.
until the amount of free blocks exceeds a second threshold. Again,
at that time, the one or more regions being associated with the
highest number of entries and/or the highest number of accumulated
update operation entries in update operation management information
can be selected to be updated.
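Combining the three criteria, the region selection could be sketched as follows; the parameter names and the choice to pick only the single fullest region on a periodic or low-free-space trigger are assumptions.

```python
def regions_to_update(pending, free_blocks, entry_threshold,
                      free_low_watermark, periodic_due):
    """Select regions of the allocation management information to update.

    pending maps a region to its accumulated update entries. A region is
    selected when its entry count exceeds the threshold; on a periodic
    trigger, or when too few blocks are known to be free, the region with
    the most accumulated entries is selected as well.
    """
    selected = {region for region, ops in pending.items()
                if len(ops) > entry_threshold}
    if (periodic_due or free_blocks < free_low_watermark) and pending:
        fullest = max(pending, key=lambda region: len(pending[region]))
        selected.add(fullest)
    return selected
```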
Preferably, applying a non-allocation update operation in a region
of the data structure changes an indication of an allocation status
of an associated block, e.g. by changing the status of the block
from used to free (thereby indicating the block to be available for
re-allocation), by changing the status of the block to increment a
reference count thereof, or decrement a reference count thereof
(e.g. decrementing the reference count to a non-zero value, or
decrementing the reference count to zero, thereby indicating the
block to be available for re-allocation).
Preferably, the respective update operation management information
for one or more or all of the regions of the data structure may be
stored in a cache memory. Furthermore, the respective update
operation management information for one or more or all of the
regions of the data structure may be stored in a cache memory
and/or on storage devices.
Such data structure as above (such as e.g. allocation management
information, a free space object, and/or a free space bit map) may
be managed as data stored to storage blocks, and the data structure
may be managed based on a metadata structure similar to metadata
structures of data objects in the sense of the present disclosure,
e.g. on the basis of a metadata tree structure preferably including
a root node pointing directly and/or indirectly to blocks, and a
leaf tree level having one or more direct nodes pointing to blocks,
and optionally including one or more intermediate tree levels
having one or more indirect nodes pointing to indirect nodes and/or
direct nodes of the respective metadata tree structure.
When managing I/O access to data based on the one or more metadata
structures, the method may comprise allocating one or more blocks
for writing user data in units of data blocks and/or metadata nodes
in units of data blocks and/or at a size equal or smaller than a
block size. Accordingly, such allocation of blocks may occur in
connection with writing user data (e.g. in units of blocks to
storage blocks), and/or when modifying a metadata structure
associated with user data in connection with writing one or more
metadata nodes (e.g. in units of blocks to storage blocks).
In the above, the method may preferably comprise managing I/O
access to data based on the one or more metadata structures,
including managing one or more metadata tree structures for storing
data to one or more storage devices of the data storage system in
units of blocks, each metadata tree structure preferably
including a root node pointing directly and/or indirectly to
blocks, and a leaf tree level having one or more direct nodes
pointing to blocks, and optionally including one or more
intermediate tree levels having one or more indirect nodes pointing
to indirect nodes and/or direct nodes of the respective metadata
tree structure.
The method may preferably comprise managing I/O access to data
based on the one or more metadata structures, including obtaining
the root node and/or metadata nodes of one or more tree levels of
the metadata structure.
According to further aspects there may be provided a data storage system connectable to one or more client computers, comprising a
processing unit including a processor and/or a programmable logic
device; a cache memory; and one or more storage devices and/or an
interface to communicably connect with one or more storage devices;
the processing unit being preferably adapted to execute one or more
methods according to one or more of the above aspects and/or one or
more methods of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A exemplarily shows a schematic diagram of a data storage
apparatus according to exemplary embodiments;
FIG. 1B exemplarily shows a schematic diagram of a data storage
system comprising plural data storage apparatuses according to
exemplary embodiments;
FIG. 1C exemplarily shows a schematic diagram of another data
storage apparatus according to exemplary embodiments;
FIG. 2A exemplarily shows a schematic diagram of a data storage
system layer architecture according to exemplary embodiments;
FIG. 2B exemplarily shows a schematic diagram of another data
storage system layer architecture according to exemplary
embodiments;
FIG. 2C exemplarily shows a schematic diagram of another data
storage system layer architecture according to exemplary
embodiments;
FIG. 3A exemplarily shows a schematic diagram of an exemplary
metadata tree structure, and FIGS. 3B and 3C exemplarily illustrate
occurrences of read amplifications in data read operations and read
and write amplifications in data write operations based on such
exemplary metadata tree structure;
FIG. 4A exemplarily shows a schematic diagram of an exemplary
metadata tree structure in connection with an example of a metadata
subtree caching, and FIGS. 4B and 4C exemplarily illustrate
reduction of occurrences of read amplifications in data read
operations and read and write amplifications in data write
operations based on such exemplary metadata tree structure
according to some exemplary embodiments;
FIG. 5A exemplarily shows a schematic diagram of an exemplary
metadata tree structure in connection with an example of a metadata
subtree caching, and FIGS. 5B and 5C exemplarily illustrate
reduction of occurrences of read amplifications in data read
operations and read and write amplifications in data write
operations based on such exemplary metadata tree structure
according to some exemplary embodiments;
FIGS. 6A to 6C exemplarily show an exemplary metadata tree
structure in connection with further examples of a metadata subtree
caching according to further exemplary embodiments;
FIG. 7A exemplarily shows a schematic diagram of another exemplary
metadata tree structure, and FIGS. 7B to 7E exemplarily show an
exemplary metadata tree structure in connection with further
examples of a metadata subtree caching according to further
exemplary embodiments;
FIG. 8A exemplarily shows a schematic diagram of another exemplary
metadata tree structure, and FIG. 8B exemplarily illustrates the
metadata tree structure of FIG. 8A being grouped in a cached upper
metadata tree portion and a lower metadata portion in connection
with checkpoint processing based on such exemplary metadata tree
structure according to some exemplary embodiments;
FIGS. 8C and 8D exemplarily illustrate the metadata tree structure
of FIG. 8A being grouped in a cached upper metadata tree portion
and a lower metadata portion in connection with checkpoint
processing based on such exemplary metadata tree structure
according to some further exemplary embodiments;
FIG. 9A exemplarily illustrates a flow chart of processing a read
request in connection with checkpoint processing according to some
exemplary embodiments, and FIG. 9B exemplarily illustrates a flow
chart of processing walking down a tree branch of a metadata tree
structure according to some exemplary embodiments;
FIG. 10 exemplarily illustrates a flow chart of processing a write
request in connection with checkpoint processing according to some
exemplary embodiments;
FIG. 11A exemplarily illustrates a flow chart of processing of
taking a first-type checkpoint (minor checkpoint) according to some
exemplary embodiments, and FIG. 11B exemplarily illustrates a flow
chart of processing of taking a second-type checkpoint (major
checkpoint) according to some exemplary embodiments;
FIG. 12A exemplarily illustrates a flow chart of processing a
recovery operation according to some exemplary embodiments, and
FIG. 12B exemplarily illustrates a flow chart of processing a
recovery operation according to further exemplary embodiments;
FIG. 13 exemplarily illustrates a flow chart of processing a write request in connection with checkpoint processing according to some further exemplary embodiments;
FIG. 14 exemplarily illustrates a flow chart of processing a second-type checkpoint (major checkpoint) according to some further exemplary embodiments;
FIG. 15A exemplarily illustrates a flow chart of processing a read
request, including metadata subtree caching according to some
exemplary embodiments;
FIG. 15B exemplarily illustrates a flow chart of processing a write
request, including metadata subtree caching according to some
exemplary embodiments;
FIG. 16A exemplarily illustrates a flow chart of dynamic metadata
subtree caching according to some exemplary embodiments, and FIG.
16B exemplarily illustrates a flow chart of dynamic metadata
subtree caching in connection with checkpoint processing according
to some further exemplary embodiments;
FIGS. 17A to 17C exemplarily show schematic drawings of allocation
management information of the free space object FSO being divided
into plural regions accumulating updates to be applied over time,
according to some exemplary embodiments;
FIGS. 18A to 18C exemplarily illustrate examples of update
management information according to exemplary embodiments;
FIG. 19 exemplarily illustrates a flow chart of efficient
allocation information management according to exemplary
embodiments; and
FIG. 20 exemplarily shows a flow chart of a process applying update
operations to a region according to some exemplary embodiments.
DETAILED DESCRIPTION OF THE ACCOMPANYING DRAWINGS AND EXEMPLARY
EMBODIMENTS
In the following, preferred aspects and exemplary embodiments will
be described in more detail with reference to the accompanying
figures. Same or similar features in different drawings and
embodiments are sometimes referred to by similar reference
numerals. It is to be understood that the detailed description below relating to various preferred aspects and preferred embodiments is not meant to limit the scope of the present invention.
Terminology
As used in this description and the accompanying claims, the
following terms shall have the meanings indicated, unless the
context otherwise requires:
A "storage device" is a device or system that is used to store
data. A storage device may include one or more magnetic or
magneto-optical or optical disk drives, solid state storage
devices, or magnetic tapes. For convenience, a storage device is
sometimes referred to as a "disk" or a "hard disk". A data storage
system may include the same or different types of storage devices
having the same or different storage capacities.
A "RAID controller" is a device or system that combines the storage
capacity of several storage devices into a virtual piece of storage
space that may be referred to alternatively as a "system drive"
("SD"), a "logical unit" ("LU" or "LUN"), or a "volume". Typically,
an SD is larger than a single storage device, drawing space from
several storage devices, and includes redundant information so that
it can withstand the failure of a certain number of disks without
data loss. In exemplary embodiments, each SD is associated with a
unique identifier that is referred to hereinafter as a "logical
unit identifier" or "LUID", and each SD will be no larger than a
predetermined maximum size, e.g., 2 TB-64 TB or more.
When commands are sent to an SD, the RAID controller typically
forwards the commands to all storage devices of the SD at the same
time. The RAID controller helps to overcome three of the main
limitations of typical storage devices, namely that the storage
devices are typically the slowest components of the storage system,
they are typically the most likely to suffer catastrophic failure,
and they typically have relatively small storage capacity.
A "RAID system" is a device or system that includes one or more
RAID controllers and a number of storage devices. Typically, a RAID
system will contain two RAID controllers (so that one can keep
working if the other fails, and also to share the load while both
are healthy) and a few dozen storage devices. In exemplary
embodiments, the RAID system is typically configured with between
two and thirty-two SDs. When a file server needs to store or
retrieve data, it sends commands to the RAID controllers of the
RAID system, which in turn are responsible for routing commands
onwards to individual storage devices and storing or retrieving the
data as necessary.
With some RAID systems, mirror relationships can be established
between SDs such that data written to one SD (referred to as the
"primary SD") is automatically written by the RAID system to
another SD (referred to herein as the "secondary SD" or "mirror
SD") for redundancy purposes. The secondary SD may be managed by
the same RAID system as the primary SD or by a different local or
remote RAID system. Mirroring SDs effectively provides RAID 1+0
functionality across SDs in order to provide recovery from the loss
or corruption of an SD or possibly even multiple SDs in some
situations.
A "file system" is a structure of files and directories (folders)
stored in a file storage system. Within a file storage system, file
systems are typically managed using a number of virtual storage
constructs, and in exemplary embodiments, file systems are managed
using a hierarchy of virtual storage constructs referred to as
ranges, stripesets, and spans. File system functionality of a file
server may include object management, free space management (e.g.
allocation) and/or directory management.
A "block" is generally a unit of storage of predetermined size. A
"storage block" may be a unit of storage in the file system that
corresponds to a portion of physical storage in which user data
and/or system data is stored. A file system object (discussed
below) generally includes one or more blocks. A "data block" may
refer to a unit of data (e.g. user data or metadata) to be written
to one storage block. Typically the terms "block", "data block" or
"data storage block" may be used interchangeably in the framework
of the present disclosure since usually the allocation of a storage
block is followed by writing the data to the storage block, hence
"data block" may also refer to the unit of storage in the file
system that corresponds to a portion of physical storage in which
user data and/or system data is stored.
Exemplary embodiments of the present invention are described with
reference to an exemplary file system of the type used in various
file servers e.g. as sold by Hitachi Data Systems and known
generally as BLUEARC TITAN™ and MERCURY™ file servers,
although it should be noted that various concepts may be applied to
other types of data storage systems.
An exemplary file server is described in U.S. Pat. No. 7,457,822,
entitled "Apparatus and Method for Hardware-based File System",
which is incorporated herein by reference, and PCT application
publication number WO 01/28179 A2, published Apr. 19, 2001,
entitled "Apparatus and Method for Hardware Implementation or
Acceleration of Operating System Functions", which is incorporated
herein by reference. Another implementation of an exemplary file
server and hardware-implemented file system management is set forth
in U.S. application Ser. No. 09/879,798, filed Jun. 12, 2001,
entitled "Apparatus and Method for Hardware Implementation or
Acceleration of Operating System Functions", which is incorporated
herein by reference. An exemplary file storage system is described
in WO 2012/071335 and U.S. application Ser. No. 13/301,241 entitled
"File Cloning and De-Cloning in a Data Storage System", which was
filed on Nov. 21, 2011, which are incorporated herein by
reference.
An exemplary file server including various hardware-implemented
and/or hardware-accelerated subsystems, for example, is described
in U.S. patent application Ser. Nos. 09/879,798 and 10/889,158,
which are incorporated by reference herein, and such file server
may include a hardware-based file system including a plurality of
linked sub-modules, for example, as described in U.S. patent
application Ser. Nos. 10/286,015 and 11/841,353, which are
incorporated by reference herein.
I. Exemplary Architectures of Data Storage Systems of Exemplary
Embodiments
FIG. 1A exemplarily shows a schematic diagram of a data storage
apparatus 1000 in a data storage system according to exemplary
embodiments. One or more such data storage apparatuses 1000 may be
used to realize a functional layer structure of any of FIGS. 2A to
2C below.
The data storage apparatus 1000 exemplarily includes an I/O
interface 1010 (e.g. front-end interface) exemplarily having
physical ports 1011, 1012 and 1013 and being connectable to one or
more input/output devices 200 (such as e.g. the clients 200, and/or
a management computer 300). Such I/O interface 1010 functions
and/or functional handling thereof may be included in an
interface/protocol layer 110 of any of FIGS. 2A to 2C below.
The data storage apparatus 1000 exemplarily further includes an
external storage interface 1020 (e.g. back-end interface)
exemplarily having physical ports 1021, 1022 and 1023 and being
connectable to one or more externally connected storage devices 600
(e.g. one or more storage disks and/or storage flash modules) for
storing metadata (e.g. system metadata) and data (e.g. user data)
and/or to an external storage system 400 (which may include one or
more externally connected storage devices such as storage disks
and/or storage flash modules) for storing metadata (e.g. system
metadata) and data (e.g. user data). Such external storage
interface 1020 functions and/or functional handling thereof may be
included in a storage device layer 140 of any of FIGS. 2A to 2C
below.
The connections to the above interfaces 1010 and 1020 may be
direct, via wired connections or wireless connections, and/or via
communication networks, such as e.g. networks 500 in FIG. 1A.
Furthermore, exemplarily, the data storage apparatus 1000 further
includes one or more internal storage devices 1031, 1032, 1033 and
1034 (e.g. one or more storage disks and/or storage flash modules),
summarized as internal storage devices 1030, for storing metadata
(e.g. system metadata) and data (e.g. user data).
In further exemplary embodiments, the data storage apparatus(es)
may only include internal storage devices (not being connected to
external storage devices/systems) and in further exemplary
embodiments, the data storage apparatus(es) may only be connected
to external storage devices/systems (not having internal storage
devices).
The data storage apparatus 1000 exemplarily further includes a
processing unit 1060A and optionally another processing unit 1060B.
The processing units 1060A and 1060B exemplarily communicate with
the interfaces 1010 and 1020, as well as with the internal storage
devices 1030, via internal bus systems 1040 and 1050.
Each of the processing units 1060A and 1060B exemplarily includes a
processor 1061 (e.g. central processing unit, or CPU), a memory
controller 1065, a disk controller 1066 and memories such as e.g.
the cache memory 1062, the system memory 1063 and the non-volatile
memory 1064 (e.g. NVRAM). The memory controller 1065 may control
one or more of the memories such as e.g. the cache memory 1062, the
system memory 1063 and the non-volatile memory 1064 (e.g.
NVRAM).
The I/O requests/responses to/from the internal storage devices
1030 and/or to/from the external storage devices/systems 400 and
600 (via the interface 1020) are exemplarily controlled by the disk
controller 1066 of the data storage apparatus 1000. Accordingly,
the disk controller 1066 and/or its functions and/or functional
handling thereof may be included in a storage device layer 140 of
any of FIGS. 2A to 2C below.
Exemplarily, e.g. for mirroring purposes, the NVRAMs 1064 of the
processing units 1060A and 1060B of the data storage apparatus 1000
are exemplarily connected to each other to transfer data between
the NVRAMs 1064. For example, each NVRAM 1064 may be divided into
two portions of similar size, and one portion of each NVRAM 1064 is
provided to store data and/or metadata handled by its respective
processing unit 1060 and the other portion of each NVRAM 1064 is
provided to store mirrored data from the other NVRAM via the
connection, respectively. For example, the connection between the
non-volatile memories 1064 may be exemplarily realized as a
non-transparent bridge connection, e.g. by PCIe connection.
Further exemplarily, each of the processing units 1060A and 1060B
exemplarily includes a system memory 1063 (e.g. for storing
processing related data or program data for execution by the
respective processing units) and a cache memory 1062 for
temporarily storing data such as e.g. cache data related with
metadata and/or data for handling I/O access messages.
For controlling the system memory 1063, the cache memory 1062
and/or the non-volatile memory 1064 (NVRAM), each of the processing
units 1060A and 1060B exemplarily includes a memory controller
1065.
For processing, handling, converting, and/or encoding headers of
messages, requests and/or responses, the data storage apparatus
1000 exemplarily further includes the processor 1061 (or other type
of processing unit which may include one or more processors, one or
more programmable logic devices such as integrated circuits, Field
Programmable Gate Arrays (FPGAs), or the like, and/or one or more
processors such as e.g. CPUs and/or microprocessors).
For temporarily storing data (including metadata and/or user data),
the data storage apparatus 1000 includes the non-volatile memory
1064 (e.g. one or more NVRAMs). The non-volatile memory and/or
NVRAM(s) may also be referred to as "cache memory" in exemplary
embodiments, e.g. if the cache memory 1062 is formed as a portion
of the non-volatile memory.
For example, in some embodiments, the difference between cache
memory and the non-volatile memory may be that the data stored in
the non-volatile memory may be mirrored to another non-volatile
memory (e.g. one or more NVRAMs of the other processing unit or
another connected data storage apparatus).
The processing unit(s) 1060A and/or 1060B and/or its functions
and/or functional handling thereof may be included in a metadata
layer 120 and/or a data protection layer 130 of any of FIGS. 2A to
2C below.
FIG. 1B exemplarily shows a schematic diagram of a data storage
system comprising plural data storage apparatuses 1000A and 1000B
in a data storage system according to further exemplary
embodiments.
The data storage apparatuses 1000A and 1000B may be realized as
node apparatuses in a storage system cluster of plural node
apparatuses, which may be communicably connected with each other
via the network interfaces 1010 (or via other front-end or back-end
interfaces).
A difference to the data storage apparatus 1000 of FIG. 1A is that
the non-volatile memory 1064 (e.g. NVRAM) of the respective
processing units 1060 of both data storage apparatuses 1000A and
1000B are connected via a connection between the respective
interfaces 1090 of the data storage apparatuses 1000A and 1000B, in
particular for mirroring data of the non-volatile memory 1064 (e.g.
NVRAM) of the data storage apparatus 1000A in the non-volatile
memory 1064 (e.g. NVRAM) of the data storage apparatus 1000B, and
vice versa.
Exemplarily, the interfaces 1020 of the data storage apparatuses
1000A and 1000B are not shown in FIG. 1B, but additional interfaces
1020 for connection to external storage devices and/or storage
systems may be provided.
Exemplarily, e.g. for mirroring purposes, the NVRAMs 1064 of the
processing units 1060 of both data storage apparatuses 1000A and
1000B are exemplarily connected to each other to transfer data
between the NVRAMs 1064. For example, each NVRAM 1064 may be
divided into two portions of similar size, and one portion of each
NVRAM 1064 is provided to store data and/or metadata handled by its
respective processing unit 1060 and the other portion of each NVRAM
1064 is provided to store mirrored data from the other NVRAM via
the connection, respectively.
FIG. 1C exemplarily shows a schematic diagram of another data
storage apparatus 1000 according to exemplary embodiments.
Exemplarily, in FIG. 1C, in addition to the processing units 1060A
and 1060B which may be provided similar as in FIG. 1A, the data
storage apparatus 1000 includes, for hardware acceleration
purposes, further processing units 1070A and 1070B which may be
provided with respective programmable logic devices 1071 (e.g.
instead or in addition to processors) for processing data movement,
data handling or request/response handling in addition to or in
support of the processors 1061 of the processing units 1060A and
1060B.
The programmable logic devices 1071 may be realized by one or more
integrated circuits such as e.g. including one or more Field
Programmable Gate Arrays (FPGAs). The processing units 1070A and
1070B may include own memories 1073 and non-volatile memories 1074
(e.g. NVRAMs), as well as e.g. their own memory controllers 1072.
However, the programmable logic devices 1071 may also be
responsible for the control of the memories 1073 and 1074.
Exemplarily, e.g. for mirroring purposes, the NVRAMs 1074 of the
processing units 1070A and 1070B of the data storage apparatus 1000
are exemplarily connected to each other to transfer data between
the NVRAMs 1074. For example, each NVRAM 1074 may be divided into
two portions of similar size, and one portion of each NVRAM 1074 is
provided to store data and/or metadata handled by its respective
processing unit 1070 and the other portion of each NVRAM 1074 is
provided to store mirrored data from the other NVRAM via the
connection, respectively.
For example, the connection between the non-volatile memories 1074
may be exemplarily realized as a non-transparent bridge connection,
e.g. by PCIe connection.
In all of the above configurations, the processing unit/units of
the data storage apparatus(es) may be configured, by one or more
software programs and/or based on hardware implemented processing
(e.g. by support of programmable logic devices), to execute, by
themselves or in combination with one or more further processing
unit(s), the processing and methods of examples of control and
management processes described herein.
II. Exemplary Layer Structures of Data Storage Systems of Exemplary
Embodiments
FIG. 2A exemplarily shows a schematic diagram of a data storage
system layer architecture 100 according to exemplary
embodiments.
Such functional data storage system layer architecture 100 (which
may be provided by software, hardware or any combination thereof)
can be realized on any one of the data storage apparatuses 1000
(1000A, 1000B) of FIGS. 1A to 1C.
Some or all respective layers may use shared resources (such as
sharing processing units, processors, programmable logic devices,
memories such as system memories, cache memories and/or
non-volatile memories or NVRAMs, controllers and/or storage
devices), or some or all layers may be provided on their own
respective resources (e.g. having their own dedicated processing
units, processors, programmable logic devices, memories such as
system memories, cache memories and/or non-volatile memories or
NVRAMs, controllers and/or storage devices). Also, the layers may
share some resources with other layers for some functions while
they own other resources for other functions by themselves.
The data storage system layer architecture 100 exemplarily includes
an interface/protocol layer 110, a metadata layer 120, a data
protection layer 130 and a storage device layer 140. The data
storage system layer architecture 100 may be realized on one or
more servers, file servers, computers, storage devices, storage
array devices, cluster node apparatuses etc., in particular
exemplarily according to configurations of any of FIGS. 1A to
1C.
The interface/protocol layer 110 can exemplarily be communicably
connected to client computers 200 and/or an exemplary optional
management computer 300, e.g. via physical ports and/or
communication networks (e.g. via front-end interfaces 1010 above,
such as network interfaces or the like).
The interface/protocol layer 110 may include one or more physical
interfaces including one or more physical ports, physical switches,
physical connectors, physical interface boards, wireless interfaces
etc. for physical connection, network connection and/or wireless
connection to one or more networks, computers (clients, hosts,
management computers, etc.), servers, or the like.
Also, the interface/protocol layer 110 may include functions,
executed on one or more processing units (e.g. processing units of
any of FIGS. 1A to 1C), for example, to receive, process, convert,
handle, and/or forward messages, requests, instructions, and/or
responses in multiple protocols and I/O access types.
Specifically, the interface/protocol layer 110 is preferably
configured to receive, process, convert, and handle one or more
(and preferably all) of: file-access I/O messages (including
file-access I/O requests directed to files and/or directories of
one or more file systems) according to one or more file access protocols
(such as e.g. one or more of AFP, NFS, e.g. NFSv3, NFSv4 or higher,
or SMB/CIFS or SMB2 or higher); block-access I/O messages
(including block-access I/O requests directed to blocks of virtual,
logical or physical block-managed storage areas) according to one
or more block access protocols (such as e.g. one or more of iSCSI, Fibre
Channel and FCoE which means "Fibre Channel over Ethernet"); and
object-access I/O messages (including object-access I/O requests
directed to objects of an object-based storage) according to one or more
object-based access protocols (such as e.g. IIOP, SOAP, or other
object-based protocols operating over transport protocols such as
e.g. HTTP, SMTP, TCP, UDP, or JMS).
The above connection types and communication functions may include
different interfaces and/or protocols, including e.g. one or more
of Ethernet interfaces, internet protocol interfaces such as e.g.
TCP/IP, network protocol interfaces such as e.g. Fibre Channel
interfaces, device connection bus interfaces such as e.g. PCI
Express interfaces, file system protocol interfaces such as NFS
and/or SMB, request/response protocol interfaces such as e.g. HTTP
and/or HTTP REST interfaces, system interface protocols such as
e.g. iSCSI and related interfaces such as e.g. SCSI interfaces, and
NVM Express interfaces.
The interface/protocol layer 110 is exemplarily configured to
connect to and communicate with client computers 200 and/or the
management computer 300 to receive messages, responses, requests,
instructions and/or data, and/or to send messages, requests,
responses, instructions and/or data from/to the client computers
200 and/or the management computer 300, preferably according to
plural different protocols for file access I/Os, block access I/Os
and/or object access I/Os.
Accordingly, in some exemplary embodiments, such requests and
responses exchanged between the data storage system layer
architecture 100 and the client computers 200 may relate to I/O
requests to one or more file systems (e.g. based on file access
protocol I/O messages) and/or to I/O requests to blocks of
physical, logical or virtual storage constructs of one or more
storage devices (e.g. based on block access protocol I/O messages)
of the data storage system 100. Also, such requests and responses
exchanged between the data storage system layer architecture 100
and the client computers 200 may relate to I/O requests to objects
of object-based storage (e.g. based on object access protocol I/O
messages) provided by the data storage system 100.
The I/O requests on the basis of file access protocols may include e.g. read requests to read stored data in a file system (including reading file data, reading file system metadata, reading file and/or directory attributes) or write requests to write data into a file system (including creating files and/or directories, modifying files, modifying attributes of files and/or directories, etc.).
The I/O requests on the basis of block access protocols may include e.g. read requests to read stored data in one or more blocks of a block-based storage area (including reading data or metadata from blocks of a virtual, logical or physical storage area divided into blocks based on block addresses such as e.g. logical block addresses (LBAs) and/or block numbers, e.g. reading data blocks of logical units (LUs)) and write requests to write data to blocks of a block-based storage area (including writing data blocks to newly allocated blocks of a virtual, logical or physical storage area divided into blocks based on block addresses such as e.g. logical block addresses (LBAs) and/or block numbers, e.g. writing data blocks of logical units (LUs); or modifying data of previously written data blocks in blocks of the block-based storage area).
In the context of block-based storage on virtual, logical and/or physical storage devices organized in one or more storage areas provided in units of blocks, it is emphasized that the terms "storage block" and "data block" refer to related but distinct aspects: the "storage block" is the construct for storing data as such, e.g. having a certain block size and being configured to store an amount of data according to that block size, whereas the "data block" refers to the block-sized unit of data that is written to (or can be read from) one "storage block". When the term "block" is used as such, this typically refers to the "storage block" in the sense above.
As mentioned above, the I/O requests/responses exchanged between
clients 200 and the interface/protocol layer 110 may include
object-related I/O requests/responses relating to data objects of
object-based storage (which may also include an object-based
managed file system), file-system-related I/O requests/responses
relating to files and/or directories of one or more file systems,
and/or block-related I/O requests/responses relating to data stored
in storage blocks of block-managed storage areas (provided
virtually, logically or physically) on storage devices.
The interface/protocol layer 110 communicates with the metadata
layer 120, e.g. for sending requests to the metadata layer 120 and
receiving responses from the metadata layer 120.
In exemplary embodiments, the communication between
interface/protocol layer 110 and metadata layer 120 may occur in an
internal protocol which may be file-based, block-based or
object-based. However, standard protocols may be used. The
interface/protocol layer 110 may receive messages (such as I/O
requests) from the clients in many different protocols, and the
interface/protocol layer 110 is configured to convert messages of
such protocols, or at least headers thereof, to the messages to be
sent to the metadata layer 120 according to the protocol used by
the metadata layer 120. In some exemplary embodiments, the metadata
layer 120 may be configured to handle object-related I/O
requests.
The metadata layer 120 may then preferably be configured to convert
object-related I/O requests relating to data objects (which may
relate to block-based storage areas managed as data objects, to
file-based files and/or directories of one or more file systems
managed as file system objects, and/or to data objects or groups of
data objects managed as data objects) into corresponding
block-related I/O requests (according to a block access protocol)
relating to data stored in storage blocks of virtually, logically
or physically provided storage areas of storage devices, and vice
versa.
In some exemplary embodiments, the metadata layer 120 may be
configured to hold and manage metadata on a data object structure
and on data objects of the data object structure in a metadata
structure and/or metadata tree structure according to later
described examples and exemplary embodiments.
The metadata layer 120 preferably communicates with the data
protection layer 130, e.g. for sending requests to the data
protection layer 130 and receiving responses from the data
protection layer 130, preferably as block-related I/O requests
(according to a block access protocol).
The data protection layer 130 communicates with the storage device
layer 140, e.g. for sending requests to the storage device layer
140 and receiving responses from the storage device layer 140,
preferably as block-related I/O requests (according to a block
access protocol).
The data protection layer 130 may include processing involved in
connection with data protection, e.g. management of data
replication and/or data redundancy for data protection. For
example, the data protection layer 130 may include data redundancy
controllers managing redundant data writes, e.g. on the basis of
RAID configurations including mirroring, and redundant striping
with parity. The data protection layer 130 could then be configured
to calculate parities.
The storage device layer 140 may execute reading data from storage
devices and writing data to storage devices based on messages,
requests or instructions received from the data protection layer
130, and may forward responses based on and/or including read data
to the data protection layer 130.
In general, I/O processing may be realized by the layer
architecture such that the interface/protocol layer 110 receives an
I/O request (file-access, block-access or object-access) and
converts the I/O request (or at least the header thereof) to a
corresponding I/O request in the protocol used by the metadata
layer 120 (e.g. object-based, object access).
The metadata layer 120 uses address information of the received I/O
request and converts the address information to the address
information used by the data protection layer 130. Specifically,
the metadata layer 120 uses address information of the received I/O
request and converts the address information to related block
addresses used by the data protection layer 130. Accordingly, the
metadata layer 120 converts received I/O requests to block access
I/O in a block-based protocol used by the data protection layer
130.
The data protection layer 130 receives the block access I/O from
the metadata layer 120, and converts the logical block address
information to physical block address information of related data
(e.g. taking into account RAID configurations, and parity
calculations, or other error-code calculations) and issues
corresponding block access I/O requests in a block-based protocol
to the storage device layer 140 which applies the block access I/O
to the storage device (e.g. by reading or writing data from/to the
storage blocks of the storage devices).
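The request/response flow just described may be pictured in a short, purely illustrative sketch. Python is used here only for illustration; the class and method names, the hash-based address mapping and the identity RAID mapping are assumptions and not part of the disclosed implementation. Each layer translates the addressing of the request and delegates to the layer below, and responses travel back the same way.

```python
# Illustrative sketch of the layered I/O path described above (assumed names).

class StorageDeviceLayer:
    def read_block(self, physical_address):
        # stand-in for an actual block read from a storage device
        return f"<data stored at physical block {physical_address}>"

class DataProtectionLayer:
    def __init__(self, devices):
        self.devices = devices
    def read(self, logical_address):
        # placeholder for RAID / parity / redundancy handling
        physical_address = logical_address
        return self.devices.read_block(physical_address)

class MetadataLayer:
    def __init__(self, protection):
        self.protection = protection
    def read_object(self, object_id, offset):
        # placeholder for the metadata tree lookup mapping (object, offset)
        # to a logical block address
        logical_address = hash((object_id, offset)) % 1024
        return self.protection.read(logical_address)

class InterfaceProtocolLayer:
    def __init__(self, metadata):
        self.metadata = metadata
    def handle_client_read(self, path_or_lun_or_object, offset):
        # file-access, block-access or object-access requests are converted
        # into the (here object-based) internal protocol of the metadata layer
        return self.metadata.read_object(path_or_lun_or_object, offset)

stack = InterfaceProtocolLayer(MetadataLayer(DataProtectionLayer(StorageDeviceLayer())))
print(stack.handle_client_read("/fs1/file.txt", 4096))
```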
For response messages, e.g. based on read requests to read user
data, the corresponding response (e.g. with the user data to be
read) can be passed the other way around, for example, in that the
storage device layer 140 returns the read user data in a
block-based protocol to the data protection layer 130, the data
protection layer 130 returns the read user data in a block-based
protocol to the metadata layer 120, the metadata layer 120 returns
the read user data preferably in an object-based protocol to the
interface/protocol layer 110, and the interface/protocol layer 110
returns the final read response to the requesting client.
However, for the above processing, the metadata layer 120 may make
use of large amounts of metadata (which is managed in metadata tree
structures according to the preferred embodiments herein), which is
also stored to storage devices (i.e. in addition to the actual user
data of the object-based storage, file system based storage or
block-based storage shown to the client).
Accordingly, when handling I/O requests such as write requests
and/or read requests, the metadata layer may need to obtain
metadata, which may lead to read and write amplifications in the
communications between the metadata layer 120 and the data
protection layer 130 (or directly with the storage device layer, in
exemplary embodiments which store metadata directly on storage
devices without additional data protection schemes). Such read and
write amplifications shall preferably be avoided or at least be
reduced according to an object of the present disclosure.
FIG. 2B exemplarily shows a schematic diagram of another data
storage system layer architecture 100 according to further
exemplary embodiments.
Exemplarily, the data storage system layer architecture 100 of FIG.
2B is proposed for scale-out purposes, in which multiple node
apparatuses (which may also operate as single data storage
apparatus, preferably) may be connected to form a cluster system
which may be extended (scale-out) by adding further node
apparatuses, when needed.
In this connection, it is indicated that the term "node apparatus" in the present context refers to a device entity which forms a part of a cluster system of inter-connectable "node apparatuses". This needs to be distinguished from "metadata nodes" (e.g. "root nodes", "direct nodes" or "indirect nodes") as described later, as such "metadata nodes" form data constructs (data elements) which are units of metadata managed in metadata tree structures as described below. Sometimes, "metadata nodes" are also referred to as onodes or inodes.
Exemplarily, FIG. 2B shows two node apparatuses N1 and N2 included
in a cluster of two or more node apparatuses (i.e. including at
least N1 and N2), each node apparatus having an interface/protocol
layer 110, a metadata layer 120B (similar to the metadata layer 120
above), a data protection layer 130 and a storage device layer 140,
similar to the exemplary embodiment of FIG. 2A.
However, in order to scale out the request/response handling to the
cluster node apparatuses, preferably between the interface/protocol
layer 110 of the data storage system layer architecture 100 and the
metadata layers 120B of the node apparatuses N1 and N2, the data
storage system layer architecture 100 of FIG. 2B further includes a
scale-out metadata layer 120A preferably provided between the
interface/protocol layer 110 and the metadata layer 120B, to
communicate I/O access messages (e.g. I/O requests or responses)
between the scale-out metadata layers 120A of the node apparatuses
of the cluster.
By such structure, the clients can send I/O requests to each of the node apparatuses (i.e. to whichever one or more node apparatuses they are themselves connected) independent of which node apparatus actually stores the target data of the I/O access or actually manages the storage device(s) storing the target data, and the scale-out metadata layers 120A respectively handle metadata managing mapping information locating the target data on the cluster.
Accordingly, the client may issue the I/O access request to either
one of the cluster node apparatuses, and the scale-out metadata
layer 120A of the receiving node apparatus identifies the node
apparatus storing the target data based on scale-out metadata
(which may also be stored in storage devices), and issues a
corresponding I/O access request to the scale-out metadata layer
120A of the identified node apparatus.
The identified node apparatus handles the I/O request and responds
to communicate an I/O response to the scale-out metadata layer 120A
of the initial receiving node apparatus to return a corresponding
response via the interface/protocol layer 110 of the initial
receiving node apparatus to the requesting client.
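A minimal sketch of this request forwarding, under assumed names (NodeApparatus, Cluster, locate) and a trivially simplified ownership map, may look as follows; the scale-out metadata layer of the receiving node apparatus looks up the owning node and forwards the I/O access message to it:

```python
# Hypothetical sketch of scale-out request routing; ownership lookup and
# forwarding are simplified to direct method calls between node objects.

class NodeApparatus:
    def __init__(self, name, cluster, owned_objects):
        self.name, self.cluster, self.owned = name, cluster, set(owned_objects)

    def client_read(self, object_id, offset):
        # scale-out metadata layer 120A: locate the node storing the target data
        owner = self.cluster.locate(object_id)
        if owner is self:
            return self.local_read(object_id, offset)
        # forward the I/O access message to the identified node apparatus and
        # relay its response back to the requesting client
        return owner.local_read(object_id, offset)

    def local_read(self, object_id, offset):
        # metadata layer 120B / data protection / storage device layers of this node
        return f"{object_id} at offset {offset} served by node {self.name}"

class Cluster:
    def __init__(self):
        self.nodes = []
    def locate(self, object_id):
        return next(n for n in self.nodes if object_id in n.owned)

cluster = Cluster()
n1 = NodeApparatus("N1", cluster, ["obj-A"])
n2 = NodeApparatus("N2", cluster, ["obj-B"])
cluster.nodes.extend([n1, n2])
print(n1.client_read("obj-B", 0))   # received by N1, data served by N2
```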
Other layers in FIG. 2B may have functions similar to the
corresponding layers of the layer architecture of FIG. 2A.
FIG. 2C exemplarily shows a schematic diagram of another data
storage system layer architecture 100 according to further
exemplary embodiments.
Again, the data storage system layer architecture 100 of FIG. 2C is
proposed for scale-out purposes, in which multiple node apparatuses
(which may also operate as single data storage apparatus,
preferably) may be connected to form a cluster system which may be
extended (scale-out) by adding further node apparatuses, when
needed.
However, in addition to the layers of FIG. 2B, the layer architecture of FIG. 2C exemplarily further includes another scale-out data protection layer 130A between the scale-out metadata layer 120A and the metadata layer 120B (which communicates with the data protection layer 130B), wherein the scale-out data protection layers 130A communicate I/O access messages (e.g. I/O requests or responses) between the scale-out data protection layers 130A of the node apparatuses of the cluster. This may include another data protection scheme in which data may be redundantly stored on multiple node apparatuses, as managed by the scale-out data protection layers 130A of the node apparatuses of the cluster.
In the above exemplary configurations, the metadata layer 120
(and/or 120B) may make use of large amounts of metadata (which is
managed in metadata tree structures according to the preferred
embodiments herein), which is also stored to storage devices (i.e.
in addition to the actual user data of the object-based storage,
file system based storage or block-based storage shown to the
client).
Accordingly, when handling I/O requests such as write requests
and/or read requests, the metadata layer may need to obtain
metadata, which may lead to read and write amplifications in the
communications between the metadata layer 120 and the data
protection layer 130 (or directly with the storage device layer, in
exemplary embodiments which store metadata directly on storage
devices without additional data protection schemes). Such read and
write amplifications shall preferably be avoided or at least be
reduced according to an object of the present disclosure.
III. Exemplary Metadata Tree Structure Management (e.g. at a
Metadata Layer)
III.1 Exemplary Metadata Tree Structure
FIG. 3A exemplarily shows a schematic diagram of an exemplary
metadata tree structure as may, for example, be handled by a data
storage apparatus 1000, a file server managing metadata of one or
more file systems, and/or by a metadata layer of one of the above
exemplary embodiments.
For example, in connection with file-based I/O access from clients,
in a file system including one or more file-system objects such as
files and directories, each file system object (such as file
objects related with files of the file system and/or system objects
related to metadata and/or management data of the file system) may
be managed by a corresponding metadata tree structure associated
with the file system object. Accordingly, a file system object
(such as a file or a directory) may be associated with a data
object being managed on the basis of such metadata tree
structure(s).
Furthermore, in connection with object-based I/O access from
clients, data objects or groups of data objects accessed by the
clients may be associated with a data object being managed on the
basis of such metadata tree structure(s).
Furthermore, in connection with block-based I/O access from
clients, virtual, logical or physical storage areas, being divided
into plural blocks, accessed by the clients may be associated with
a data object being managed on the basis of such metadata tree
structure(s). For example, a data object may be associated with a
block-managed logical unit (LU).
For example, for all of the above, if the metadata layer receives
an object-related I/O request (from the interface/protocol layer
based on a client's file access I/O, block access I/O or object
access I/O) relating to a data object, the metadata layer may refer
to the metadata tree structure associated with the respective data
object to find one or more block addresses of data storage
corresponding to the data addressed in the object-related I/O
request on storage devices (as handled by the data protection layer
and/or the storage device layer, for example).
Accordingly, for each data object, the corresponding metadata tree
structure provides information on a relationship between the data
object and its data and block addresses of blocks storing data
blocks of the data of the data object.
Exemplarily, for each data object, there may be provided a root
node RN (which may include a header) and pointers of the root node
RN may point to indirect nodes of the corresponding metadata tree
structure, such as e.g. the indirect nodes IN 0 and IN 1 in FIG.
3A.
Pointers of indirect nodes may, for example, point to other
indirect nodes of a lower generation (tree level) or to direct
nodes (also referred to as "leaf nodes" of a leaf tree level).
Direct nodes are metadata nodes that include pointers pointing to
data blocks including the actual data of the corresponding data
object.
Typically, such metadata tree structure may include multiple tree
levels starting with a root node tree level downwards to a direct
node tree level, optionally having one or more intermediate
indirect node tree levels in between.
Exemplarily, in FIG. 3A, the indirect nodes IN 0 and IN 1 include pointers pointing to the lower generation (tree level) of indirect nodes IN 10, IN 11, IN 12, and IN 13. The pointers of the indirect nodes IN 10, IN 11, IN 12, and IN 13 respectively point to a corresponding pair of the direct nodes DN 0 to DN 7. The pointers of the direct nodes DN 0 to DN 7 respectively point to a corresponding pair of blocks storing data blocks of data referred to as DATA 0 to DATA 15, exemplarily.
Of course, the example having only two pointers in each of the root
node, indirect nodes and direct nodes according to FIG. 3A is
purely for exemplary purposes, and each node may include two or
more pointers.
Also, root nodes, indirect nodes and direct nodes may include
different numbers of pointers.
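The tree of FIG. 3A can be pictured with a few small data structures. The following sketch (the node class names and the two-pointer fan-out are illustrative assumptions only) builds the branch RN, IN 1, IN 13, DN 6/DN 7 referred to in the following subsections:

```python
# Minimal illustrative model of the metadata tree of FIG. 3A: a root node,
# indirect nodes and direct nodes, where only direct nodes point to the
# block addresses of data blocks. Names are hypothetical.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class DirectNode:
    block_addresses: List[str]                           # pointers to blocks storing data

@dataclass
class IndirectNode:
    children: List[Union["IndirectNode", DirectNode]]    # next lower tree level

@dataclass
class RootNode:
    children: List[IndirectNode]                         # uppermost pointers of the object

dn6 = DirectNode(block_addresses=["DATA 12", "DATA 13"])
dn7 = DirectNode(block_addresses=["DATA 14", "DATA 15"])
in13 = IndirectNode(children=[dn6, dn7])
in12 = IndirectNode(children=[DirectNode(["DATA 8", "DATA 9"]),
                              DirectNode(["DATA 10", "DATA 11"])])
in1 = IndirectNode(children=[in12, in13])
# IN 0 (covering DATA 0 to DATA 7) would be built analogously; the root
# node points to both upper indirect nodes:
root = RootNode(children=[IndirectNode(children=[]), in1])
```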
III.2 Read Amplifications in Handling Object-Related Read
Requests
FIG. 3B exemplarily illustrates occurrences of read amplifications
in data read operations based on such exemplary metadata tree
structure of FIG. 3A.
Exemplarily, it is assumed that the metadata layer receives an
object read request directed to the data object being associated
with the metadata tree structure of FIG. 3A, e.g. to read the data
of data block DATA 12 thereof.
In such situation, the metadata layer may be configured to refer to
the root node RN of the metadata tree structure, based on the
object read request being directed to data of the associated data
object.
Based on address information (e.g. based on an indication relating
to an offset of the position of data to be read), the metadata
layer may refer to a pointer in the root node RN being related to
data of data blocks DATA 8 to DATA 15, including the target data of
block DATA 12. By such reference to the corresponding pointer in
the root node RN, the metadata layer may refer to the indirect node
IN 1 referenced by such corresponding pointer.
Based on address information (e.g. based on an indication relating
to an offset of the position of data to be read), the metadata
layer will refer to a pointer in the indirect node IN 1 being
related to data of data blocks DATA 12 to DATA 15, including the
target data of block DATA 12. By such reference to the
corresponding pointer in the indirect node IN 1, the metadata layer
may refer to the indirect node IN 13 referenced by such
corresponding pointer.
Based on address information (e.g. based on an indication relating
to an offset of the position of data to be read), the metadata
layer will refer to a pointer in the indirect node IN 13 being
related to data of data blocks DATA 12 to DATA 13, including the
target data of block DATA 12. By such reference to the
corresponding pointer in the indirect node IN 13, the metadata
layer may refer to the direct node DN 6 referenced by such
corresponding pointer.
Based on address information (e.g. based on an indication relating
to an offset of the position of data to be read), the metadata
layer will refer to a pointer in the direct node DN 6 being related
to the target data of block DATA 12. By such reference to the
corresponding pointer in the direct node DN 6, the metadata layer
may refer to block DATA 12 referenced by such corresponding
pointer, to then issue to the data protection layer (or to the
storage device layer in other embodiments) a block-related read
request to read the data stored at block address of block DATA 12
in the storage device.
However, from the above, it becomes clear that the read operation of reading the data of block DATA 12 requires reading pointers in each of the nodes RN, IN 1, IN 13 and DN 6. In total, the read operation to read the data of block DATA 12 of the associated data object, on the basis of a single object-related read request received at the metadata layer, exemplarily leads to five (random) read operations to read data from the storage device(s) in the present example, namely to read the data of the nodes RN, IN 1, IN 13 and DN 6 and of the block DATA 12 from the storage device(s), e.g. in connection with five block-related read requests to read the corresponding data in the storage device(s).
Such increase of a number of read operations is referred to as read
amplifications in the present disclosure, and exemplary embodiments
are provided to achieve reducing corresponding read
amplifications.
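The walk just described, and the resulting count of random reads, can be captured in a short sketch. The nested-list tree and the helper below are simplifying assumptions (two pointers per node, 16 data blocks) and not the actual implementation:

```python
# Build the illustrative 16-block tree of FIG. 3A as nested lists: inner
# lists represent metadata nodes, string leaves represent data blocks.
def build_tree(level, first_block, span):
    if level == 0:                                       # direct node level
        return [f"DATA {first_block + i}" for i in range(span)]
    half = span // 2
    return [build_tree(level - 1, first_block, half),
            build_tree(level - 1, first_block + half, half)]

def read_from_device(item):
    return item                                          # stand-in for one random read

def resolve_read(root, block_index):
    """Walk RN -> IN -> IN -> DN -> data block for one object-related read."""
    node, span, device_reads = read_from_device(root), 16, 1        # read RN
    while isinstance(node[0], list):                     # still above the direct nodes
        span //= 2
        node = read_from_device(node[(block_index // span) % 2])    # read IN / DN
        device_reads += 1
    data = read_from_device(node[block_index % 2])       # read the data block itself
    return data, device_reads + 1

print(resolve_read(build_tree(3, 0, 16), 12))            # ('DATA 12', 5)
```

Resolving a single object-level read of DATA 12 thus costs one random device read per tree level plus one for the data block itself, which is the read amplification described above.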
III.3 Read and Write Amplifications in Handling Object-Related
Write Requests
FIG. 3C exemplarily illustrates occurrences of read and write amplifications in data write operations based on such exemplary metadata tree structure.
Exemplarily, it is assumed that the metadata layer receives an
object write request directed to the data object being associated
with the metadata tree structure of FIG. 3A, e.g. to write new data
to data stored in block of DATA 15 (i.e. to modify the data block
DATA 15).
According to a log write method, instead of modifying the already
written data block such modified data block is written to a newly
allocated storage block, i.e. the new data DATA 15* is written to a
newly allocated block, and the metadata tree structure is updated
to reflect the new situation, in that related nodes of the metadata
tree structure are updated.
However, this first involves identification of the block storing
the old data DATA 15, by referring to the metadata nodes as
follows.
In such situation, the metadata layer may be configured to refer to
the root node RN of the metadata tree structure, based on the
object write request being directed to data of the associated data
object.
Based on address information (e.g. based on an indication relating
to an offset of the position of data to be written), the metadata
layer will refer to a pointer in the root node RN being related to
data of data blocks DATA 8 to DATA 15, including the target data of
block DATA 15. By such reference to the corresponding pointer in
the root node RN, the metadata layer may refer to the indirect node
IN 1 referenced by such corresponding pointer.
Based on address information, the metadata layer will refer to a
pointer in the indirect node IN 1 being related to data of data
blocks DATA 12 to DATA 15, including the target data of block DATA
15. By such reference to the corresponding pointer in the indirect
node IN 1, the metadata layer may refer to the indirect node IN 13
referenced by such corresponding pointer.
Based on address information, the metadata layer will refer to a
pointer in the indirect node IN 13 being related to data of data
blocks DATA 14 to DATA 15, including the target data of block DATA
15. By such reference to the corresponding pointer in the indirect
node IN 13, the metadata layer may refer to the direct node DN 7
referenced by such corresponding pointer.
So, similar to the read amplifications occurring in connection with a read request as discussed above, the processing of an object-related write request leads to read amplifications. For example, in the present example, writing new data DATA 15* involves four (random) read operations, namely to read the data of the nodes RN, IN 1, IN 13 and DN 7 from the storage device(s), e.g. in connection with four block-related read requests to read the corresponding data in the storage device(s).
However, in addition to writing the new data DATA 15* to a newly allocated block of storage areas of the storage device(s), updating the metadata tree accordingly further includes writing the modified metadata nodes to newly allocated blocks of storage areas of the storage device(s), i.e. writing the root node RN* pointing to indirect node IN 0 and newly written indirect node IN 1*, writing the indirect node IN 1* pointing to indirect node IN 12 and newly written indirect node IN 13*, writing the indirect node IN 13* pointing to direct node DN 6 and newly written direct node DN 7*, and writing the direct node DN 7* pointing to the data blocks of DATA 14 and the newly written DATA 15*.
However, from the above, it becomes clear that the write operation of writing the data of block DATA 15* also requires writing the modified metadata nodes of the corresponding branch of the metadata tree structure. In total, the write operation to write the data of block DATA 15* of the associated data object, on the basis of a single object-related write request received at the metadata layer, exemplarily leads to five (random) write operations to write data to the storage device(s) in the present example, namely to write the data of the nodes RN*, IN 1*, IN 13* and DN 7* in addition to DATA 15* to the storage device(s), e.g. in connection with five block-related write requests to write the corresponding data in the storage device(s).
So, in addition to generating read amplifications occurring in
connection with a write request, the processing of an
object-related write request further leads to write
amplifications.
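Under the log write method described above, a modification of DATA 15 therefore first re-reads the branch and then writes new copies of the data block and of every node on that branch. A hedged sketch of this copy-on-write update, reusing the illustrative two-pointer, nested-list tree of the read example (all names and counts refer only to that illustrative tree, not to the actual implementation):

```python
def build_tree(level, first_block, span):                # same illustrative helper as above
    if level == 0:
        return [f"DATA {first_block + i}" for i in range(span)]
    half = span // 2
    return [build_tree(level - 1, first_block, half),
            build_tree(level - 1, first_block + half, half)]

def log_write(root, block_index, new_data):
    """Copy-on-write update of one data block and of its whole tree branch."""
    node, span, path, reads = root, 16, [], 1             # random read of RN
    while isinstance(node[0], list):
        span //= 2
        idx = (block_index // span) % 2
        path.append((node, idx))
        node = node[idx]                                   # random read of IN / DN
        reads += 1
    writes = 1                                             # DATA 15* to a new block
    new_node = list(node)                                  # DN 7* with updated pointer
    new_node[block_index % 2] = new_data
    writes += 1
    for parent, idx in reversed(path):                     # IN 13*, IN 1*, RN*
        copy = list(parent)
        copy[idx] = new_node
        new_node, writes = copy, writes + 1
    return new_node, reads, writes                         # new root RN*

_, reads, writes = log_write(build_tree(3, 0, 16), 15, "DATA 15*")
print(reads, writes)                                       # 4 random reads, 5 random writes
```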
Summarizing the above, processing an object-related read request leads to read amplifications of block-related read requests, and processing an object-related write request leads to read amplifications of block-related read requests and to write amplifications of block-related write requests, exemplarily exchanged between a metadata layer and a data protection layer (and/or storage device layer).
Such amplifications may further lead to amplifications of I/O
requests and parity calculations in the data protection layer
and/or the storage device layer.
IV. Metadata Subtree Caching
IV.1 Upper Tree Levels Subtree Caching
According to some exemplary embodiments, the metadata layer may
manage metadata of one or more data objects in metadata tree
structures, each exemplarily including a root node pointing to one
or more storage blocks storing data blocks, to one or more indirect
nodes, and/or to one or more direct nodes, and optionally including
one or more indirect nodes pointing to one or more indirect nodes
and/or to one or more direct nodes, and/or optionally including one
or more direct nodes pointing to one or more storage blocks.
According to some exemplary embodiments, while some portions of the metadata and/or of the metadata tree structures may be stored on storage devices, at least a part (portion) of the metadata and/or of the metadata tree structures is preferably stored in a cache memory such as e.g. in a volatile cache memory (and/or a non-volatile memory such as e.g. one or more NVRAMs) of the configurations of any of FIGS. 1A to 1C above, specifically providing the benefit that read and/or write amplifications as discussed above may be avoided or at least be significantly reduced, thereby making handling of object-related I/O requests significantly more efficient in systems handling many clients, many data objects and high amounts of metadata to handle a very high number of I/O requests.
At the same time, since not all of the metadata needs to be kept in
cache memory, it is possible to limit the required cache memory
capacity, which allows provision of a very scalable system with
reasonable cache capacity per node apparatus.
FIG. 4A exemplarily shows a schematic diagram of an exemplary
metadata tree structure in connection with an example of metadata
subtree caching, and FIGS. 4B and 4C exemplarily illustrate
reduction of occurrences of read amplifications in data read
operations and read and write amplifications in data write
operations based on such exemplary metadata tree structure
according to exemplary embodiments.
Exemplarily, for one or more or all data objects, the metadata
layer may hold (maintain) all metadata nodes of a certain metadata
tree structure node tree level and all metadata nodes of metadata
tree structure node tree levels above the certain tree level in
cache (such as e.g. in volatile cache and/or the non-volatile
memory, such as e.g. one or more NVRAMs).
Exemplarily, in FIG. 4A, all metadata nodes above the direct node tree level of a metadata tree structure similar to FIG. 3A may be held in cache memory (and/or in NVRAM) for efficient access. That is, exemplarily all the metadata nodes of the upper three tree levels, including the root node RN, the indirect nodes IN 0 and IN 1 of the upmost indirect node tree level and the indirect nodes IN 10 to IN 13 of the next lower indirect node tree level, are held/maintained in the cache memory (and/or in NVRAM).
FIG. 4B exemplarily illustrates reduced occurrences of read
amplifications in data read operations based on such exemplary
metadata tree structure of FIG. 4A, which will be seen to be
significantly reduced compared to read amplifications occurring by
processing as described in connection with FIG. 3B.
Exemplarily, it is assumed that the metadata layer receives an object read request directed to the data object being associated with the metadata tree structure of FIG. 4A, to read DATA 12 thereof.
Then, instead of reading the root node RN and the indirect nodes IN
1 and IN 13 of the tree branch leading to the block of the target
data block DATA 12 by random reads from storage device(s), the root
node RN and the indirect nodes IN 1 and IN 13 of the tree branch
leading to the block of the target data block DATA 12 can
efficiently be read from cache memory (without requiring any random
read request to the data protection layer or storage
device(s)).
Such processing significantly reduces the read amplification, in that only the data of the direct node DN 6 of the tree branch of the target block and the data of the block of DATA 12 need to be read from the storage device(s) by random read operations. In addition, only efficient cache reads to read the data of the root node RN and the indirect nodes IN 1 and IN 13 are required. Accordingly, by subtree caching of the upper node generation tree levels (e.g. caching a root node and the nodes of one or more intermediate tree levels below the root node level), a significant reduction of read amplifications in handling of object-related read requests can be achieved.
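A hedged sketch of this cached lookup follows, reusing the nested-list tree of the earlier read-amplification sketch; the level-based cache test is an assumption standing in for whatever cache lookup the system actually uses:

```python
def cached_resolve_read(root, block_index, cached_levels=3):
    """Resolve a read when the uppermost `cached_levels` tree levels are cached."""
    node, span, level = root, 16, 0
    device_reads = 0 if level < cached_levels else 1      # RN from cache
    cache_reads = 1 - device_reads
    while isinstance(node[0], list):
        span //= 2
        node = node[(block_index // span) % 2]
        level += 1
        if level < cached_levels:
            cache_reads += 1                               # IN 1, IN 13 from cache
        else:
            device_reads += 1                              # DN 6 from storage
    device_reads += 1                                      # the data block itself
    return node[block_index % 2], device_reads, cache_reads

# With the tree of the earlier sketch and the upper three levels cached:
#   cached_resolve_read(build_tree(3, 0, 16), 12)  ->  ('DATA 12', 2, 3)
# i.e. only DN 6 and DATA 12 still require random reads from storage.
```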
FIG. 4C exemplarily illustrates reduced occurrences of read and write amplifications in data write operations based on such exemplary metadata tree structure of FIG. 4A, which will be seen to be significantly reduced compared to read and write amplifications occurring by processing as described in connection with FIG. 3C.
Exemplarily, it is assumed that the metadata layer receives an
object write request directed to the data object being associated
with the metadata tree structure of FIG. 4A, to write new data to
data stored in block of DATA 15.
According to a log write method, such data is written to a newly
allocated block, i.e. the new data DATA 15* is written to a newly
allocated block, and the metadata tree structure is updated to
reflect the new situation, in that related nodes of the metadata
tree structure are updated.
However, this first involves identification of the block storing
the old data DATA 15, by referring to the nodes as follows.
In such situation, the metadata layer may be configured to refer to
the root node RN of the metadata tree structure, based on the
object write request being directed to data of the associated data
object, and the root node RN can be read efficiently from cache
memory.
Then, similar to FIG. 3C, the process continues to walk down the
target branch leading to the target block of data block DATA 15, by
following the pointer information and reading the next lower node
of the tree branch, to successively read indirect nodes IN 1 and IN
13. However, in the example of FIG. 4C, since the upper two tree
levels of the indirect nodes below the root node level are
exemplarily held/maintained in cache memory, the indirect nodes IN
1 and IN 13 can be efficiently read from cache memory instead of
requiring random reads, and only data of the direct node DN 7 needs to be read from the storage device by random read.
So, similar to the reduction of read amplifications occurring in
connection with a read request as discussed above, the processing
of an object-related write request leads to a significant reduction
of the number of read amplifications.
To modify the data block, similar as in FIG. 3C, also the data of
block DATA 15 may be read by random read to be modified as
requested, and the modified data block DATA 15* shall be written to
a new place. That is, after allocating a new storage block, the
modified data block DATA 15* is written to a newly allocated block
and the metadata nodes of the target branch are modified to have
updated pointer information according to the new target branch to
the block having the newly written data block DATA 15*.
Accordingly, similar to FIG. 3C above, the modified direct node DN
7* is written to storage device (preferably to another newly
allocated storage block) by random write.
However, instead of also writing the other modified nodes directly
via random write, the indirect node IN 13, the indirect node IN 1
and the root node RN are overwritten in the cache memory (e.g. with
the updated pointers), thereby avoiding random writes to storage
device(s).
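The corresponding write path can be sketched in the same style, again as an assumption-laden illustration rather than the product code: the branch is resolved mostly from cache, the modified data block and the modified direct node are written to newly allocated blocks, and the cached ancestors are simply overwritten in place.

```python
def cached_log_write(root, block_index, new_data, cached_levels=3):
    """Log write when the uppermost `cached_levels` tree levels are cached.
    Reading and modifying the old data block itself is omitted for brevity."""
    node, span, level, path = root, 16, 0, []
    device_reads, cache_reads = 0, 1                       # RN read from cache
    while isinstance(node[0], list):
        span //= 2
        idx = (block_index // span) % 2
        path.append((node, idx))
        node, level = node[idx], level + 1
        if level < cached_levels:
            cache_reads += 1                               # IN 1, IN 13 from cache
        else:
            device_reads += 1                              # DN 7 from storage
    device_writes = 1                                      # DATA 15* to a new block
    new_node = list(node)                                  # DN 7* with updated pointer
    new_node[block_index % 2] = new_data
    device_writes += 1                                     # DN 7* to a new block
    cache_overwrites = 0
    for parent, idx in reversed(path):                     # IN 13, IN 1, RN in cache
        parent[idx] = new_node                             # cache overwrite only
        new_node, cache_overwrites = parent, cache_overwrites + 1
    return device_reads, device_writes, cache_overwrites

# For block 15 with the upper three levels cached:
#   cached_log_write(build_tree(3, 0, 16), 15, "DATA 15*")  ->  (1, 2, 3)
# i.e. one random read (DN 7), two random writes (DATA 15*, DN 7*), and three
# cache overwrites instead of the five random writes of FIG. 3C.
```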
Summarizing the above, read and write amplifications occurring in processing object-related read and write requests, exemplarily exchanged between a metadata layer and a data protection layer, may be significantly reduced by subtree caching of metadata nodes of one or more upper node tree generations/levels, here exemplarily the uppermost three node tree levels (exemplarily all metadata nodes of tree levels above the lowest tree level, being the direct node tree level).
Accordingly, by subtree caching of one or more upper tree levels of
the metadata structure(s) (e.g. caching the root nodes and the
indirect nodes), a significant reduction of read and write
amplifications in handling of object-related read and write
requests can be achieved.
IV.2 Direct Node Subtree Caching
According to some exemplary embodiments, the metadata layer may
manage metadata of one or more data objects in metadata tree
structures, each exemplarily including a root node pointing to one
or more storage blocks storing data blocks, to one or more indirect
nodes, and/or to one or more direct nodes, and optionally including
one or more indirect nodes pointing to one or more indirect nodes
and/or to one or more direct nodes, and/or optionally including one
or more direct nodes pointing to one or more storage blocks.
According to some exemplary embodiments, while some portions of the metadata and/or of the metadata tree structures may be stored on storage devices, at least a part (portion) of the metadata and/or of the metadata tree structures is preferably stored in a cache memory such as e.g. in a volatile cache memory (and/or a non-volatile memory such as e.g. one or more NVRAMs) of the configurations of any of FIGS. 1A to 1C above, specifically providing the benefit that read and/or write amplifications as discussed above may be avoided or at least be significantly reduced, thereby making handling of object-related I/O requests significantly more efficient in systems handling many clients, many data objects and high amounts of metadata to handle a very high number of I/O requests.
At the same time, since not all of the metadata needs to be kept in
cache memory, it is possible to limit the required cache memory
capacity, which allows provision of a very scalable system with
reasonable cache capacity per node apparatus.
FIG. 5A exemplarily shows a schematic diagram of an exemplary
metadata tree structure in connection with an example of metadata
subtree caching, and FIGS. 5B and 5C exemplarily illustrate
reduction of occurrences of read amplifications in data read
operations and read and write amplifications in data write
operations based on such exemplary metadata tree structure
according to exemplary embodiments.
Exemplarily, for one or more data objects, the metadata layer may
hold all nodes of a certain metadata tree structure node tree level
in cache (such as e.g. in volatile cache and/or the non-volatile
memory, such as e.g. one or more NVRAMs).
Exemplarily, in FIG. 5A, all direct nodes DN 0 to DN 7 of a
metadata tree structure similar to FIG. 3A may be held in cache
memory for efficient access.
FIG. 5B exemplarily illustrates reduced occurrences of read
amplifications in data read operations based on such exemplary
metadata tree structure of FIG. 5A, which will be seen to be
significantly reduced compared to read amplifications occurring by
processing as described in connection with FIG. 3B.
Exemplarily, it is assumed that the metadata layer receives an
object read request directed to the data object being associated
with the metadata tree structure of FIG. 5A, to read DATA 12
thereof.
In such situation, the metadata layer may be configured to refer to
the root node RN of the metadata tree structure, based on the
object read request being directed to data of the associated data
object. This may require a block-related read request to read the
data of the corresponding root node RN in the storage device. In
other exemplary embodiments, the root node RN may also be
preliminarily stored in the cache memory, which would avoid such
read operation from storage device to read the corresponding root
node RN.
In this example, since all direct nodes are exemplarily
held/maintained in cache memory, based on address information (e.g.
based on an indication relating to an offset of the position of
data to be read), the metadata layer may directly refer to a direct
node in cache memory which corresponds to the address information
for the associated data object on the basis of pointer information
of the root node RN.
Specifically, based on the address information, the metadata layer
may directly refer to direct node DN 6 stored in the cache memory,
and the metadata layer will refer to a pointer in the direct node
DN 6 being related to the target data of block DATA 12.
By such reference to the corresponding pointer in the direct node
DN 6 stored in cache memory, the metadata layer may refer to block
DATA 12 referenced by such corresponding pointer, to then issue to
the data protection layer (or to the storage device layer in other
embodiments) a block-related read request to read the data stored
at block address of block DATA 12 in the storage device.
Such processing significantly reduces the read amplification, in
that only the data of the root node RN and the data of the block of
DATA 12 needs to be read from the storage device(s) by random read
operations.
In addition, only one efficient cache read to read the data of direct node DN 6 is required. If the data of the root node RN is additionally stored in cache memory, even only one random read operation to read the data of the block of DATA 12 is required, in connection with two efficient cache reads to read the data of the root node RN and the direct node DN 6.
Accordingly, by subtree caching of a lowest node generation (i.e.
caching the direct nodes), potentially combined with also holding
the root nodes in cache, a significant reduction of read
amplifications in handling of object-related read requests can be
achieved.
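Direct node subtree caching lends itself to a simple in-memory map from a block offset range of an object to its cached direct node, so that the intermediate indirect levels need not be touched at all for a lookup. A small sketch under these assumptions (the cache key scheme and the pre-populated entries are purely illustrative):

```python
# Hypothetical direct-node cache: every direct node of an object is held in
# memory, keyed by the range of block offsets it covers.
direct_node_cache = {
    ("object-1", 6): ["DATA 12", "DATA 13"],     # DN 6
    ("object-1", 7): ["DATA 14", "DATA 15"],     # DN 7
}

def cached_direct_read(object_id, block_index, root_cached=False, blocks_per_dn=2):
    device_reads = 0 if root_cached else 1       # root node RN from cache or storage
    dn = direct_node_cache[(object_id, block_index // blocks_per_dn)]   # cache read
    data = dn[block_index % blocks_per_dn]       # pointer to the target block
    device_reads += 1                            # random read of the data block itself
    return data, device_reads

print(cached_direct_read("object-1", 12))                    # ('DATA 12', 2)
print(cached_direct_read("object-1", 12, root_cached=True))  # ('DATA 12', 1)
```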
FIG. 5C exemplarily illustrates reduced occurrences of read and write amplifications in data write operations based on such exemplary metadata tree structure of FIG. 5A, which will be seen to be significantly reduced compared to read and write amplifications occurring by processing as described in connection with FIG. 3C.
Exemplarily, it is assumed that the metadata layer receives an
object write request directed to the data object being associated
with the metadata tree structure of FIG. 5A, to modify data stored
in storage block of DATA 15.
According to a log write method, such modified data block is
written to a newly allocated storage block, i.e. the new data DATA
15* is written to a newly allocated block, and the metadata tree
structure is updated to reflect the new situation, in that related
nodes of the metadata tree structure are updated.
However, this first involves identification of the block storing
the old data DATA 15, by referring to the nodes as follows.
In such situation, the metadata layer may be configured to refer to
the root node RN of the metadata tree structure, based on the
object write request being directed to data of the associated data
object. This may require a block-related read request to read the
data of the corresponding root node RN in the storage device. In
other exemplary embodiments, the root node RN may also be
preliminarily stored in the cache memory, which would avoid such
read operation from storage device to read the corresponding root
node RN.
Based on address information (e.g. based on an indication relating
to an offset of the position of data to be read), the metadata
layer may directly refer to a direct node in cache memory which
corresponds to the address information for the associated data
object on the basis of pointer information of the root node RN.
Specifically, based on the address information, the metadata layer
may directly refer to direct node DN 7 stored in the cache memory,
and the metadata layer will refer to a pointer in the direct node
DN 7 being related to the target data of block DATA 15.
So, similar to the reduction of read amplifications occurring in
connection with a read request as discussed above, the processing
of an object-related write request leads to a significant reduction
of the number of read amplifications.
In addition to writing the new data DATA 15* to a newly allocated
block of storage areas of the storage device(s), for updating the
metadata tree accordingly, such write operation further includes
writing of new metadata nodes. However, in the present example this
exemplarily only requires a cache overwrite of the direct node DN 7
stored in cache so that it points to the formerly referenced data
block of DATA 14 and to the newly written data block of DATA
15*.
However, the upper node generation tree levels of the root node RN and the indirect nodes IN 0, IN 1 and IN 10 to IN 13 do not need to be updated since their pointers are still valid. Specifically, the pointer in indirect node IN 13 pointing to direct node DN 7 in cache memory does not need to be modified, since the pointer is still valid due to the direct node DN 7 being overwritten in place in cache memory.
However, according to checkpoint processing discussed in further
exemplary embodiments below, also the upper level metadata nodes
may be updated on storage devices at certain times, e.g. when a
checkpoint is taken. Such processing may be combined with writing
"deltas" to non-volatile memory such as e.g. the NVRAM.
In the above example, instead of four random read operations and
five random write operations as in FIG. 3C, only one random read
(to read root node RN from storage device(s)), one random write (to
write the new data of DATA 15* to storage device(s)), and one cache
read and one cache overwrite in connection with direct node DN 7
are required, thereby significantly reducing the read and write
amplifications occurring in the processing according to FIG.
3C.
If the data of the root node is additionally held in cache memory,
this would even only require one random write (to write the new
data of DATA 15* to storage device(s)) and two cache reads (to read
data of root node RN and of direct node DN 7 from cache) and one
cache overwrite (to overwrite direct node DN 7 in cache).
Summarizing the above, read and write amplifications occurring in processing object-related read and write requests, exemplarily exchanged between a metadata layer and a data protection layer, may be significantly reduced by subtree caching of the metadata nodes of a node generation/tree level.
Accordingly, by subtree caching of a lowest node generation (i.e.
caching the direct nodes), potentially combined with also holding
the root nodes in cache, a significant reduction of read and write
amplifications in handling of object-related read and write
requests can be achieved.
IV.3 Root Node and Direct Node Subtree Caching
As discussed above, further reductions of read and write
amplifications can be achieved by additionally holding the root
nodes of metadata tree structure of data objects in cache
memory.
FIG. 6A exemplarily shows an exemplary metadata tree structure in
connection with another example of a metadata subtree caching
according to further exemplary embodiments.
Exemplarily, for one or more data objects, the metadata layer may
hold all nodes of a certain metadata tree structure node tree level
in cache (such as e.g. in cache memory and/or the non-volatile
memory, such as e.g. one or more NVRAMs).
Exemplarily, in FIG. 6A, all direct nodes DN 0 to DN 7 of a
metadata tree structure similar to FIG. 3A may be held in cache
memory for efficient access, and in addition, the root node RN of
such metadata tree structure may be held in cache memory for
efficient access.
Accordingly, random read operations to read the root node RN from
storage device(s) in FIGS. 5B and 5C above can additionally be
avoided, and the root node RN can instead be efficiently read from
the cache memory in handling read or write operations from/to the
associated data object.
IV.4 Root Node and Indirect Node Subtree Caching
FIGS. 6B and 6C exemplarily show an exemplary metadata tree
structure in connection with further examples of a metadata subtree
caching according to further exemplary embodiments.
In FIG. 6B, exemplarily the indirect nodes IN 10 to IN 13 of the lower node tree level of indirect nodes are stored in cache memory in addition to the root node RN.
This means that in read operations to read data of the associated
data object, at least read amplifications due to random reads of
the root node RN and one or more of the indirect nodes IN 0 and IN
1 of the upper (higher) level of indirect nodes as well as of the
indirect nodes IN 10 to IN 13 of the lower node level of indirect
nodes can be avoided, so as to significantly reduce the occurrence
of read amplifications in handling read requests and write
requests.
In addition, for write requests, the update of the respective modified indirect node among the indirect nodes IN 10 to IN 13 of the lower node level of indirect nodes can be achieved by cache overwrite, and only the respective corresponding direct node pointing to the newly written data block needs to be newly written by random write, so that also write amplifications in handling write requests can be significantly reduced.
In FIG. 6C, exemplarily the indirect nodes IN 0 and IN 1 of the upper node level of indirect nodes are stored in cache memory in addition to the root node RN.
This means that in read operations to read data of the associated
data object, at least read amplifications due to random reads of
the root node RN and one or more of the indirect nodes IN 0 and IN
1 of the upper (higher) level of indirect nodes can be avoided, so
as to significantly reduce the occurrence of read amplifications in
handling read requests and write requests.
In addition, for write requests, the update of the respective modified indirect node among the indirect nodes IN 0 and IN 1 of the upper node tree level of indirect nodes can be achieved by cache overwrite, and only the respective corresponding indirect node of the lower level and the respective corresponding direct node pointing to the newly written data block need to be newly written by random writes, so that also write amplifications in handling write requests can be significantly reduced.
IV.5 Further Examples of Subtree Caching
FIG. 7A exemplarily shows a schematic diagram of another exemplary metadata tree structure, and FIGS. 7B to 7D exemplarily show exemplary metadata tree structures in connection with further examples of a metadata subtree caching according to further exemplary embodiments.
The metadata tree structure of FIG. 7A differs from the above in
that the root node RN may directly point to data blocks, direct
nodes and indirect nodes, while optional indirect nodes still point
to direct nodes and/or indirect nodes and direct nodes still point
to data blocks.
In FIG. 7B, exemplarily, the root node RN and the direct nodes DN 0
to DN 8 of the metadata structure associated with the data object
are held in cache memory, to significantly reduce read and write
amplifications at least in connection with avoiding random reads
and random writes in connection with the root node, the indirect
nodes and the direct nodes.
In FIG. 7C, exemplarily, the root node RN and the indirect nodes IN
10 to IN 13 of the lower node level of indirect nodes of the
metadata structure associated with the data object are held in
cache memory, to significantly reduce read and write amplifications
at least in connection with avoiding random reads and random writes
in connection with the root node and the indirect nodes.
In FIG. 7D, exemplarily, the root node RN and the indirect node IN
0 (and further indirect nodes) of the upper node level of indirect
nodes of the metadata structure associated with the data object are
held in cache memory, to significantly reduce read and write
amplifications at least in connection with avoiding random reads
and random writes in connection with the root node and the indirect
nodes of the upper node level of indirect nodes.
V. Checkpoint Processing Including Subtree Caching
V.1 Major and Minor Node Management for Checkpoint Processing
FIG. 8A exemplarily shows a schematic diagram of another exemplary
metadata tree structure, and FIG. 8B exemplarily illustrates the
metadata tree structure of FIG. 8A being grouped in a cached upper
metadata tree portion and a lower metadata portion in connection
with checkpoint processing based on such exemplary metadata tree
structure according to some exemplary embodiments.
Exemplarily, while the checkpoint processing of below examples and
exemplary embodiments may be performed in connection with examples
of subtree caching as discussed above, the metadata tree structure
of FIG. 8A exemplarily has at least two object layers, in that a
first object (exemplarily referred to as "index object") is
exemplarily provided with a metadata tree structure having a root
node RN pointing to optionally plural indirect node tree levels
(exemplarily with three indirect node tree levels) which point to a
tree level of direct nodes (exemplarily in the 4th metadata tree level).
However, instead of pointing to blocks storing data of the data
object in data blocks, the direct nodes of the "index object" point
to root nodes RN of plural data objects in a second object layer.
This allows for more efficient management of a high amount of data
objects in a single metadata tree structure including the metadata
tree structure of the index object and the respective metadata
structures of the data objects.
Each data object may again include a root node RN pointing to
optionally plural indirect node tree levels (exemplarily with two
indirect node tree levels) which point to a tree level of direct
nodes (exemplarily in the 8th metadata tree level). Similar to
the above examples, the direct nodes DN of the data objects point
to blocks of data of the respective data objects at the data block
level (e.g. including user data).
Exemplarily, in FIG. 8B, the two lowest tree levels of the metadata structure (i.e. exemplarily the direct nodes of the data objects and the next higher tree level of metadata nodes (lower or minor tree levels), here exemplarily the indirect nodes of the 7th metadata tree level) are exemplarily referred to as minor nodes (minor metadata nodes), which may exemplarily be stored on storage devices and which may exemplarily not generally be maintained in cache memory.
On the other hand, the upper tree portion metadata nodes, e.g. the root node RN of the index object and the metadata nodes of the 1st to 6th metadata tree levels (upper or major tree levels), are exemplarily held in cache memory, and such metadata nodes are exemplarily referred to as major nodes (major metadata nodes).
Such arrangement is similar to at least the configuration of FIGS.
4A and 6C, having an upper cached metadata tree structure portion
of tree levels (major nodes) and a lower metadata tree structure
portion of tree levels (minor nodes).
FIGS. 8C and 8D exemplarily illustrate the metadata tree structure
of FIG. 8A being grouped in a cached upper metadata tree portion
and a lower metadata portion in connection with checkpoint
processing based on such exemplary metadata tree structure
according to some further exemplary embodiments.
Exemplarily, in FIG. 8C, only the metadata nodes of the lowest tree level of the metadata structure (i.e. exemplarily the direct nodes of the data objects) are exemplarily referred to as minor nodes (minor metadata nodes), which may exemplarily be stored on storage devices and which may exemplarily not generally be maintained in cache memory.
On the other hand, the upper tree portion metadata nodes, e.g. the root node RN of the index object and the metadata nodes of the 1st to 7th metadata tree levels (upper or major tree levels), are exemplarily held in cache memory, and such metadata nodes are exemplarily referred to as major nodes (major metadata nodes).
Exemplarily, in FIG. 8D, the three lowest tree levels of the metadata structure (i.e. exemplarily the direct nodes of the data objects and the two next higher tree levels of metadata nodes (lower or minor tree levels), here exemplarily the indirect nodes of the 6th and 7th metadata tree levels) are exemplarily referred to as minor nodes (minor metadata nodes), which may exemplarily be stored on storage devices and which may exemplarily not generally be maintained in cache memory.
On the other hand, the upper tree portion metadata nodes, e.g. the root node RN of the index object and the metadata nodes of the 1st to 5th metadata tree levels (upper or major tree levels), are exemplarily held in cache memory, and such metadata nodes are exemplarily referred to as major nodes (major metadata nodes).
In general, with such subtree caching, one or more of the lowest tree levels (including at least the tree level of the direct nodes) may represent minor metadata nodes, which may exemplarily be stored on storage devices and which may generally not be maintained in cache memory (only if such nodes are read in read/write operations may such minor nodes temporarily be loaded to cache memory, but such minor nodes are preferably not held systematically in cache memory).
Furthermore, with such subtree caching, one or more higher tree levels above the tree levels of the minor metadata nodes (including at least one tree level of indirect nodes) may represent major metadata nodes, which are exemplarily held/maintained systematically in cache memory.
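For illustration only, the following minimal sketch (in Python, not part of the patent disclosure) shows one possible way to model the split between cached major tree levels and storage-resident minor tree levels; the names MetadataNode, SubtreeCachePolicy and deepest_cached_level are hypothetical assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataNode:
    level: int                      # 0 = root of the index object, higher = deeper
    pointers: list = field(default_factory=list)

class SubtreeCachePolicy:
    def __init__(self, deepest_cached_level: int):
        # Levels 0..deepest_cached_level are "major" nodes held systematically in
        # cache; deeper levels (at least the direct nodes) are "minor" nodes.
        self.deepest_cached_level = deepest_cached_level

    def is_major(self, node: MetadataNode) -> bool:
        return node.level <= self.deepest_cached_level

    def is_minor(self, node: MetadataNode) -> bool:
        return not self.is_major(node)

# Example corresponding to FIG. 8B: levels 0..6 cached, levels 7..8 on storage.
policy = SubtreeCachePolicy(deepest_cached_level=6)
assert policy.is_major(MetadataNode(level=3))
assert policy.is_minor(MetadataNode(level=8))
```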
V.2 Read and Write Request Processing in Connection with Checkpoint
Processing
FIG. 9A exemplarily illustrates a flow chart of processing a read
request in connection with checkpoint processing according to some
exemplary embodiments.
In step S901, an object-related I/O read request to a target data
object is received at the metadata layer, and, based on an object
identifier (e.g. an object number) indicated in the object-related
I/O read request, the process may walk down a target branch of the
metadata structure of the index object leading to the root node of
the respective target data object.
For this purpose, the step S902 includes reading the root node RN
of the index object. By referring to the pointer of the root node
RN associated with the target data object, the process successively
reads the next lower node of the target branch and refers to its
pointer associated with the target data object leading to the next
lower node of the target branch until the root node of the target
object is identified and can be read.
By this processing, the process performs step S903 of walking down
the target object's branch of the index object metadata tree
structure by successively reading the metadata nodes of the target
object's branch of the index object metadata tree structure.
After reading the respective direct node of the index object
metadata tree structure of the target object's branch and referring
to its pointer to the root node of the target data object, the
process continues with step S904 to read the target data object's
root node RN.
Based on further address information (e.g. a block identifier such
as e.g. an offset or a logical block number of the target block)
indicated in the object-related I/O read request, the process may
walk down a target branch of the metadata structure of the target
data object leading to the target block.
For this purpose, the step S904 includes reading the root node RN
of the target data object. By referring to the pointer of the root
node RN of the target data object associated with the target block,
the process successively reads the next lower node of the target
branch and refers to its pointer associated with the target data
block leading to the next lower node of the target branch until the
direct node pointing to the target block is identified.
By this processing, the process performs step S905 of walking down
the target block's branch of the target data object metadata tree
structure by successively reading the metadata nodes of the target
branch of the target data object metadata tree structure.
After reading the respective direct node of the target object
metadata tree structure of the target branch and referring to its
pointer to the target data block, the process continues with step
S906 to read the target data block (e.g. by random read).
Upon reading the target data block, step S907 includes returning
the requested user data including the read data of the data block
in an object-related I/O read response.
In the above, the process reads plural major metadata nodes which
can efficiently be read from cache memory, since all major metadata
nodes are systematically maintained in the cache memory, and only
minor metadata nodes of the whole large metadata structure may need
to be read from storage device(s), thereby significantly reducing
read amplifications.
FIG. 9B exemplarily illustrates a flow chart of processing walking
down a tree branch of a metadata tree structure according to some
exemplary embodiments. This may be applied in steps S903 and/or
S905 of the above processing of FIG. 9A.
The process includes (potentially in a loop while walking down the
target branch) the step S950 of identifying the next lower
(indirect or direct) metadata node based on a pointer of a
previously read metadata node associated with the tree branch to
the target (e.g. the root node of the target data object or the
target data block of the target data object).
The process further includes the step S951 of reading the
identified metadata node from cache memory, if available in cache
memory (e.g. when the metadata node is a major metadata node
systematically held/maintained in cache memory, or in some
exemplary embodiments when the metadata node is a minor metadata
node that is coincidentally available in the cache memory), or
otherwise reading the identified metadata node by random read from
the storage device if the metadata node is a minor metadata node
that is not available in cache memory. Of course the latter may
only occur for the lowest tree levels of minor metadata nodes.
Upon reading the identified metadata node, step S952 reads the
node's pointer associated with the target tree branch (leading to
the target data object and/or leading to the target block), and if
the metadata node identified in step S950 is a direct node (step
S953 gives YES), then the process includes step S954 of continuing
with reading the target (which is either the target data object's
root node or the target data block).
Otherwise, if the metadata node identified in step S950 is not a
direct node (step S953 gives NO), then the process repeats step
S950 for the next lower metadata node, until step S953 gives
YES.
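A minimal, hypothetical sketch of the branch walk of FIG. 9B is given below (Python, for illustration only). It assumes simple helper objects that are not part of the disclosure: each node exposes is_direct and child_for(target) returning the next pointer on the target branch, cache is a dictionary keyed by node identifier, and storage.read() stands in for a random read from the storage device(s).

```python
def walk_branch(start_node, target, cache, storage):
    """Walk from an already-read node down to the direct node of the target
    branch and return the direct node's pointer to the target (the target data
    object's root node, or the target data block)."""
    node = start_node
    while not node.is_direct:                     # S953: stop once a direct node is reached
        next_id = node.child_for(target)          # S950: identify the next lower node
        if next_id in cache:                      # S951: cache read (major node, or a minor
            node = cache[next_id]                 #        node coincidentally in cache)
        else:
            node = storage.read(next_id)          # S951: random read of a minor node
    return node.child_for(target)                 # S952/S954: pointer to the target itself
```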
FIG. 10 exemplarily illustrates a flow chart of processing a write
request in connection with checkpoint processing according to some
exemplary embodiments.
In step S1001, an object-related I/O write request to modify a data
block of a target data object is received at the metadata layer
(may also be referred to as a modifying request), and, based on an
object identifier (e.g. an object number) indicated in the
object-related I/O write request, the process may walk down a
target branch of the metadata structure of the index object leading
to the root node of the respective target data object.
For this purpose, the step S1002 includes reading the root node RN
of the index object. By referring to the pointer of the root node
RN associated with the target data object, the process successively
reads the next lower node of the target branch and refers to its
pointer associated with the target data object leading to the next
lower node of the target branch until the root node of the target
object is identified and can be read.
By this processing, the process performs step S1003 of walking down
the target object's branch of the index object metadata tree
structure by successively reading the metadata nodes of the target
object's branch of the index object metadata tree structure, e.g.
exemplarily similar to the process of FIG. 9B.
After reading the respective direct node of the index object
metadata tree structure of the target object's branch and referring
to its pointer to the root node of the target data object, the
process continues with step S1004 to read the target data object's
root node RN.
Based on further address information (e.g. a block identifier such as e.g. an offset or a logical block number of the target block) indicated in the object-related I/O write request, the process may walk down a target branch of the metadata structure of the target data object leading to the target block.
For this purpose, the step S1004 includes reading the root node RN
of the target data object. By referring to the pointer of the root
node RN of the target data object associated with the target block,
the process successively reads the next lower node of the target
branch and refers to its pointer associated with the target data
block leading to the next lower node of the target branch until the
direct node pointing to the target block is identified.
By this processing, the process performs step S1005 of walking down
the target block's branch of the target data object metadata tree
structure by successively reading the metadata nodes of the target
branch of the target data object metadata tree structure, e.g.
exemplarily similar to the process of FIG. 9B.
After reading the respective direct node of the target object
metadata tree structure of the target branch and referring to its
pointer to the target data block, the process continues with step
S1006 to read the target data block (e.g. by random read) to cache
memory, and step S1007 of modifying the target block in cache
memory based on the received object-related I/O write request to
modify the data block.
By steps S1003 and S1005, preferably all minor nodes have been
temporarily loaded to cache memory so that temporarily all metadata
nodes of the target branch of the metadata structure (including the
systematically maintained major nodes, and the only temporarily
loaded minor nodes) are available in cache memory, and the process
includes a step S1008 of updating the pointers in all metadata
nodes of the target data block's branch of the metadata structure
(preferably including nodes of the index object metadata tree and
the target object metadata tree), e.g. upon allocating a new block
for the modified data block and the allocation of new blocks for
the updated metadata nodes.
However, in other exemplary embodiments, the allocation of new
blocks for the metadata nodes of the target branch may be performed
upon taking the respective checkpoints (see below examples).
However, if the blocks for the updated metadata nodes are allocated
at the time of step S1008, the blocks for major metadata nodes are
preferably allocated in different storage regions than the blocks
allocated for minor metadata nodes, to allow for efficient
sequential writes for minor metadata nodes and for major metadata
nodes, when minor metadata nodes and major metadata nodes are
written to the allocated blocks on the storage device(s) at
different times based on different checkpoint types (see e.g. major
and minor checkpoints in exemplary embodiments below).
In step S1009, the modified target data block is written to the
non-volatile memory (e.g. NVRAM, which is preferably mirrored) and
the modified (updated) minor metadata nodes are written to the
non-volatile memory in step S1010. If the minor nodes are already
stored in the non-volatile memory, these are preferably overwritten
with the updated modified minor metadata nodes (preferably without
allocating new blocks for such minor metadata nodes).
In step S1011, the process continues to write metadata deltas for
each updated modified major metadata node to the non-volatile
memory. This has the benefit that not the full data of the modified
major metadata node (e.g. unit of a block size) needs to be written
to the non-volatile memory but only the smaller-sized "delta" needs
to be written. Here, since the respective unmodified major metadata
node is still stored in the storage device, the respective metadata
delta is a smaller sized data unit only describing the currently
updated difference between the unmodified major metadata node still
stored in the storage device and the respective updated modified
major metadata node. In exemplary embodiments, the deltas stored in
the non-volatile memory may only be required for recovery
purposes.
In step S1012, the process continues to return a write
acknowledgement once the updated data (updated data block, updated
minor metadata nodes and respective deltas for the updated major
metadata nodes) is stored in the non-volatile memory (preferably
mirrored in a second non-volatile memory).
However, since the actual write operations to storage devices are not yet performed, and since at least all of the major metadata nodes were efficiently read from cache memory (all major metadata nodes being systematically maintained in the cache memory, so that only minor metadata nodes of the whole large metadata structure may need to be read from storage device(s)), such processing allows read and write amplifications between the metadata layer and the data protection layer (or storage device layer) to be significantly reduced.
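The ordering of the write path of FIG. 10 may be summarized by the following hypothetical sketch, reusing walk_branch from the sketch following FIG. 9B above. The helper objects index, cache, storage and nvram and their methods are illustrative assumptions only; in particular, delta_since_last_checkpoint() merely stands in for computing the small "delta" of a modified major node.

```python
def handle_write(request, index, cache, storage, nvram):
    # S1001-S1003: walk the index object down to the target data object's root node.
    root_id = walk_branch(index.root_node, request.object_id, cache, storage)
    root_node = cache[root_id] if root_id in cache else storage.read(root_id)

    # S1004-S1005: walk the data object down to the direct node of the target block.
    block_id = walk_branch(root_node, request.block_address, cache, storage)

    # S1006-S1007: read the target block (random read) and modify it in cache memory.
    block = storage.read(block_id)
    block.data = request.new_data

    # S1008: update pointers of every node on the target branch (major and minor).
    dirty_nodes = cache.update_branch_pointers(request.object_id, request.block_address)

    # S1009-S1010: write the modified block and the updated minor nodes to NVRAM.
    nvram.write_block(block)
    for node in dirty_nodes:
        if node.is_minor:
            nvram.write_minor_node(node)
        else:
            # S1011: for major nodes only a small "delta" describing the change
            # relative to the on-disk version is written to NVRAM.
            nvram.write_major_delta(node.delta_since_last_checkpoint())

    # S1012: acknowledge once everything is persisted in (mirrored) NVRAM.
    return "ACK"
```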
V.3 Minor Checkpoint Processing
FIG. 11A exemplarily illustrates a flow chart of processing of
taking a first-type checkpoint (minor checkpoint) according to some
exemplary embodiments.
As mentioned above, by step S1010, upon write processing (in
connection with writing data blocks to new blocks) the respective
updated modified minor metadata nodes are stored in the
non-volatile memory.
In step S1101 it is checked whether the data amount of updated minor nodes stored in a minor node metadata portion of the non-volatile memory exceeds a threshold (e.g. once the capacity of the predetermined size of the minor node metadata portion of the non-volatile memory is used up to a predetermined threshold ratio or is fully used, or once the amount exceeds a previously set threshold, which may be configurable).
When the data amount of updated minor nodes stored in the minor node metadata portion of the non-volatile memory exceeds the threshold, a new minor checkpoint is issued in step S1102. This may include, e.g., writing a new incremented minor checkpoint number to minor metadata nodes which will be updated after issuing the new minor checkpoint. The updated minor nodes already stored in the minor node metadata portion of the non-volatile memory may be associated with the previous minor checkpoint number.
Step S1103 then exemplarily allocates blocks for all modified data
blocks and updated minor metadata nodes stored in the non-volatile
memory (e.g. being associated with the previous minor checkpoint
number) in regions of the storage device(s) which are preferably
sequentially arranged (or at least allow for one or more sequential
writes of updated minor metadata nodes). In alternative exemplary
embodiments, the allocation of blocks for the modified data blocks
and/or updated minor metadata nodes may also already be performed
at the respective times of updating the respective minor metadata
nodes and storing them to the non-volatile memory, e.g. in
connection with step S1010 above.
In step S1104, all modified data blocks and updated minor metadata
nodes of the previous minor checkpoint are written from the minor
node metadata portion of the non-volatile memory (or alternatively
from cache memory) to the allocated blocks on storage device(s),
preferably by sequential writes.
In step S1105, upon writing the data blocks and updated minor metadata nodes of the previous minor checkpoint to storage device(s), the minor node metadata portion of the non-volatile memory may be emptied for new updated minor metadata nodes and modified data blocks of the new minor checkpoint, and the process may repeat monitoring whether the data amount of updated minor nodes stored in the minor node metadata portion of the non-volatile memory exceeds the threshold, to issue the next new minor checkpoint according to the above steps.
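The following hypothetical sketch summarizes the minor checkpoint processing of FIG. 11A; the nvram.minor_region object, the checkpoint_state counter and the allocate_sequential()/write_sequential() helpers are assumptions made only for illustration.

```python
def maybe_take_minor_checkpoint(nvram, storage, threshold_bytes, checkpoint_state):
    # S1101: check whether the minor-node portion of NVRAM exceeds its threshold.
    if nvram.minor_region.used_bytes < threshold_bytes:
        return False

    # S1102: issue a new minor checkpoint; subsequent updates get the new number.
    checkpoint_state.minor_number += 1
    previous = checkpoint_state.minor_number - 1
    items = nvram.minor_region.items_for(previous)

    # S1103: allocate (preferably contiguous) blocks for the previous checkpoint's
    # modified data blocks and updated minor metadata nodes.
    extents = storage.allocate_sequential(len(items))

    # S1104: write them out, preferably as one or a few sequential writes.
    storage.write_sequential(extents, items)

    # S1105: the minor-node portion of NVRAM can now be reused.
    nvram.minor_region.discard(previous)
    return True
```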
V.4 Major Checkpoint Processing
FIG. 11B exemplarily illustrates a flow chart of processing of
taking a second-type checkpoint (major checkpoint) according to
some exemplary embodiments.
As mentioned above, by step S1011, upon write processing (in
connection with writing data blocks to new blocks) the respective
deltas of the updated modified major metadata nodes are stored in
the non-volatile memory.
In step S1151 it is checked whether the data amount of deltas of updated major nodes stored in a major node metadata portion of the non-volatile memory exceeds a threshold (e.g. once the capacity of the predetermined size of the major node metadata portion of the non-volatile memory is used up to a predetermined threshold ratio or is fully used, or once the amount exceeds a previously set threshold, which may be configurable).
Here, the benefit may be achieved that the major metadata nodes of the metadata structure are written to storage devices upon a second-type (major) checkpoint less frequently than the updated minor metadata nodes and modified data blocks are written upon taking the first-type (minor) checkpoint, and the read and write amplification may be reduced even further by such less frequent updates of major nodes to storage devices.
According to exemplary embodiments, this may be advantageously
achieved by writing the deltas of smaller size for major nodes to
the non-volatile memory so that more updates for major nodes can be
written to the non-volatile memory before a new major checkpoint is
taken.
In other alternative embodiments, or, in most preferred exemplary embodiments, in addition to the aspect of writing deltas, the threshold of step S1151 may be set larger than the threshold applied in step S1101 for minor checkpoints above, e.g. by providing the major node metadata portion of the non-volatile memory at a larger size than the minor node metadata portion of the non-volatile memory.
By doing so, the benefit of having the minor checkpoints issued more frequently than the major checkpoints may be achieved by writing the smaller-sized deltas for the updated major metadata nodes to the non-volatile memory (e.g. instead of writing the complete data of the updated major metadata nodes) and/or by assigning a larger capacity to the major node metadata portion of the non-volatile memory compared to the smaller capacity of the minor node metadata portion of the non-volatile memory. In the latter case, used as an exemplary alternative, the benefit of having the minor checkpoints issued more frequently than the major checkpoints may be achieved even if complete updated major metadata nodes were written to the non-volatile memory in step S1011.
When the data amount of deltas of updated major nodes stored in the major node metadata portion of the non-volatile memory exceeds the threshold, a new major checkpoint is issued in step S1152. This may include, e.g., writing a new incremented major checkpoint number to major metadata nodes which will be updated after issuing the new major checkpoint. The deltas of the updated major nodes already stored in the major node metadata portion of the non-volatile memory may be associated with the previous major checkpoint number.
Step S1153 then exemplarily allocates blocks for all updated major
metadata nodes for which deltas are stored in the non-volatile
memory (e.g. being associated with the previous major checkpoint
number) in regions of the storage device(s) which are preferably
sequentially arranged (or at least allow for one or more sequential
writes of updated major metadata nodes). In alternative exemplary
embodiments, the allocation of blocks for the updated major
metadata nodes may also already be performed at the respective
times of updating the respective major metadata nodes and storing
their respective deltas to the non-volatile memory, e.g. in
connection with step S1011 above.
In step S1154, all updated major metadata nodes of the previous
major checkpoint are written from the cache memory to the allocated
blocks on storage device(s), preferably by sequential writes. This
has the advantage that the deltas do not need to be applied, and
these may only be needed for recovery purposes, as explained in
further exemplary embodiments below.
In step S1155, upon writing the updated major metadata nodes of the previous major checkpoint to storage device(s), the major node metadata portion of the non-volatile memory may be emptied for deltas of new updated major metadata nodes of the new major checkpoint, and the process may repeat monitoring whether the data amount of deltas of updated major nodes stored in the major node metadata portion of the non-volatile memory exceeds the threshold, to issue the next new major checkpoint according to the above steps.
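Correspondingly, the major checkpoint processing of FIG. 11B may be sketched as follows, again with purely illustrative helper objects; note that the up-to-date major nodes are written straight from cache memory, while the deltas kept in the non-volatile memory are merely discarded.

```python
def maybe_take_major_checkpoint(nvram, cache, storage, threshold_bytes, checkpoint_state):
    # S1151: check whether the delta portion of NVRAM exceeds its (larger) threshold.
    if nvram.major_region.used_bytes < threshold_bytes:
        return False

    # S1152: issue a new major checkpoint number.
    checkpoint_state.major_number += 1
    previous = checkpoint_state.major_number - 1

    # S1153: allocate blocks for every major node that has deltas of the previous
    # checkpoint, preferably in sequentially arranged regions.
    node_ids = nvram.major_region.node_ids_for(previous)
    extents = storage.allocate_sequential(len(node_ids))

    # S1154: write the up-to-date major nodes straight from cache memory.
    storage.write_sequential(extents, [cache[node_id] for node_id in node_ids])

    # S1155: the deltas of the previous major checkpoint are no longer needed.
    nvram.major_region.discard(previous)
    return True
```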
V.5 Recovery Processing Based on Major and Minor Checkpoints
FIG. 12A exemplarily illustrates a flow chart of processing a
recovery operation according to some exemplary embodiments.
In the process of FIG. 12A, it is assumed that the operation of the
data storage apparatus has been interrupted by way of a failure,
and the data of the cache memory previously stored is lost or may
need to be assumed to be corrupted. Then, based on the recovery
processing, normal operation can resume exemplarily after recovery
processing.
In step S1201 (e.g. upon restoring the mirrored data from another
mirror non-volatile memory, or based on the situation that the
non-volatile memory still stores the data of a time prior to the
failure as non-volatile type memory), all data blocks and minor
metadata nodes stored in the non-volatile memory (e.g. in the minor
node metadata portion of the non-volatile memory) are
identified.
Upon allocating storage blocks for all identified data blocks and
minor metadata nodes stored in the non-volatile memory in step
S1202, the data blocks and minor metadata nodes stored in the
non-volatile memory are written from the non-volatile memory to the
respective allocated blocks on the storage device(s) in step S1203
and the minor node metadata portion of the non-volatile memory is
emptied in step S1204.
In step S1205, all major metadata nodes associated with deltas
stored in the non-volatile memory (e.g. in the major node metadata
portion of the non-volatile memory) are identified and all
identified metadata nodes associated with deltas stored in the
non-volatile memory are read from storage device (i.e. in the
non-updated version) and loaded to cache memory in step S1206.
Alternatively, all major nodes of the metadata tree structure can
be loaded into cache, and only the ones for which delta(s) exist
are updated based on the respective delta(s) in cache memory. This
has the advantage that all major metadata nodes are again
maintained systematically in the cache memory for normal operation,
e.g. according to exemplary embodiments of subtree caching
above.
Upon allocating storage blocks for all identified major metadata nodes for which deltas are stored in the non-volatile memory in step S1207, the (non-updated) major metadata nodes loaded to the cache memory are respectively updated by applying the respective delta(s) associated with the respective major metadata node, so as to update the respective major metadata nodes based on the respective delta(s), in step S1208.
Then, the updated major metadata nodes can be written from the
cache memory to the respective allocated blocks on the storage
device(s) in step S1209 and the major node metadata portion of the
non-volatile memory can be emptied in step S1210, and normal
operation can resume (step S1211).
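A condensed, hypothetical sketch of the recovery flow of FIG. 12A is given below; it uses the same illustrative nvram/cache/storage helpers as the sketches above, and node.apply_delta() stands in for replaying one delta onto a major metadata node.

```python
def recover(nvram, cache, storage):
    # S1201-S1204: flush the data blocks and minor nodes surviving in NVRAM.
    items = nvram.minor_region.all_items()                      # S1201
    extents = storage.allocate_sequential(len(items))           # S1202
    storage.write_sequential(extents, items)                    # S1203
    nvram.minor_region.clear()                                  # S1204

    # S1205-S1206: read the stale (pre-failure) major nodes back into cache memory.
    deltas_by_node = nvram.major_region.deltas_by_node()        # S1205
    for node_id in deltas_by_node:
        cache[node_id] = storage.read(node_id)                  # S1206

    # S1207-S1209: apply the deltas in cache and write the updated nodes out.
    extents = storage.allocate_sequential(len(deltas_by_node))  # S1207
    for node_id, deltas in deltas_by_node.items():
        for delta in deltas:
            cache[node_id].apply_delta(delta)                   # S1208
    storage.write_sequential(extents, [cache[n] for n in deltas_by_node])  # S1209

    nvram.major_region.clear()                                  # S1210
    # S1211: normal operation can resume.
```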
FIG. 12B exemplarily illustrates a flow chart of processing a recovery operation according to further exemplary embodiments. As a difference to FIG. 12A, FIG. 12B is a recovery process that advantageously allows normal operation to be resumed immediately in step S1251, prior to completing the recovery process, and to update metadata nodes from non-volatile memory only when they are involved in an I/O request, i.e. only when needed (preferably accompanied by an additional background process according to some steps of FIG. 12A to update all metadata nodes prior to taking a new major or minor checkpoint, respectively).
In step S1251, normal operation is resumed after a failure.
Then, when a read is issued to a data block or minor metadata node
in connection with processing a current I/O request (step S1252
gives YES), it is checked whether a corresponding data block or
minor metadata node is stored in the non-volatile memory in step
S1253.
If step S1253 gives NO (i.e. the corresponding data block or minor
metadata node does not need to be updated), the corresponding data
block or minor metadata node is read from storage device in step
S1255 (e.g. temporarily to cache memory and for further use in the
I/O process as discussed for read/write processing in examples
above) and the process proceeds with normal operation; step
S1251.
However, if step S1253 gives YES (i.e. the corresponding data block
or minor metadata node exists in non-volatile memory as an updated
version from before the failure), the corresponding data block or
minor metadata node is instead loaded from the non-volatile memory
as the updated version in step S1254 (e.g. temporarily to cache
memory and for further use in the I/O process as discussed for
read/write processing in examples above) and the process proceeds
with normal operation; step S1251.
However, when a read is issued to a major metadata node in
connection with processing a current I/O request (step S1256 gives
YES), it is checked whether a corresponding delta associated with
the respective major metadata node is stored in the non-volatile
memory in step S1257.
If step S1257 gives NO (i.e. the corresponding major metadata node does not need to be updated), the corresponding major metadata node is read from storage device in step S1258 (to be loaded and maintained in cache memory and for further use in the I/O process as discussed for read/write processing in examples above) and the process proceeds with normal operation; step S1251.
However, if step S1257 gives YES (i.e. the corresponding major
metadata node has one or more deltas existing in non-volatile
memory from before the failure and needs to be updated based on the
delta(s)), the corresponding (non-updated) major metadata node is
read from storage device in step S1259 and then updated by applying
the associated delta(s) loaded from non-volatile memory in step
S1260 to obtain the corresponding updated major metadata node based
on the associated delta(s), and the updated major metadata node is
then loaded and maintained in the cache memory in step S1261 for
further use in the I/O process as discussed for read/write
processing in examples above, and the process proceeds with normal
operation; step S1251.
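The on-demand recovery reads of FIG. 12B may be sketched as follows; as before, the nvram, storage and cache helpers and their methods are illustrative assumptions only.

```python
def read_minor_or_block(item_id, nvram, storage, cache):
    # S1252-S1255: prefer the updated copy surviving in NVRAM over the stale on-disk copy.
    item = nvram.minor_region.lookup(item_id)            # S1253
    if item is None:
        item = storage.read(item_id)                     # S1255: no update pending
    cache[item_id] = item                                # temporarily cached (S1254/S1255)
    return item

def read_major_node(node_id, nvram, storage, cache):
    # S1256-S1261: read the stale node and bring it up to date with its deltas, if any.
    node = storage.read(node_id)                         # S1258 / S1259
    for delta in nvram.major_region.deltas_for(node_id): # only when S1257 gives YES
        node.apply_delta(delta)                          # S1260
    cache[node_id] = node                                # S1261: maintained in cache (major node)
    return node
```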
V.6 Dirty List Information Processing for Major and/or Minor
Checkpoints
FIG. 13 exemplarily illustrates another flow chart of processing a
write request in connection with checkpoint processing according to
some further exemplary embodiments, e.g. alternative to FIG. 10
above. Specifically, the steps S1301 to S1311 may be performed
similar to steps S1001 to S1011 of FIG. 10 above.
The additional exemplary processing of FIG. 13 allows handling of I/O requests to continue efficiently for a new major checkpoint even while major metadata nodes of a previous major checkpoint are being written to the storage device(s).
In an additional step S1312, the process includes updating a major
node dirty list of a current major checkpoint by adding an entry
for each currently modified major metadata node of step S1308. That
is, the process maintains management information indicating dirty
major nodes (being major metadata nodes that have been modified in
the cache memory but have not yet been written to storage
device(s)).
When processing the issuance of a new major checkpoint and writing
major nodes dirtied in the last major checkpoint to the storage
device(s), the maintained management information such as e.g. a
major node dirty list of a then previous major checkpoint may be
processed entry by entry.
Accordingly, preferably such management information is maintained, such as e.g. a major node dirty list maintained for a previous major checkpoint and another major node dirty list maintained for a current major checkpoint. Similar management information can also be maintained for minor metadata nodes, such as e.g. a minor node dirty list maintained for a previous minor checkpoint and another minor node dirty list maintained for a current minor checkpoint.
However, it may occur that a major node is dirtied (updated) again in the new major checkpoint before the major node has been written to storage device(s) for the previous major checkpoint, i.e. when the major node dirty list maintained for the previous major checkpoint still includes an entry for the respective major node and its delta(s) of the previous major checkpoint are still stored in non-volatile memory.
Such a situation may be solved by, for example: (1) writing the major node to storage device(s) for the previous major checkpoint and removing the previously associated delta(s) before storing the new delta from the new modification of the current checkpoint to the non-volatile memory (e.g. by writing only the respective major node or by also writing neighboring allocated major nodes in a more efficient sequential write to storage device(s)); or (2) copying the cached major node to another cache page and modifying only one copy in cache for the current major checkpoint as a live version, while the non-modified cache page thereof may be used when writing the respective major node to storage device(s) for the previous checkpoint, which however requires more cache capacity.
However, in a most preferable exemplary embodiment, such situation
may be solved by storing reverse deltas in cache memory (and/or
non-volatile memory) according to the below process.
Hence, the process exemplarily includes a step S1313 of storing,
for each updated major metadata node of step S1312 being also still
identified in the major node dirty list of the previous major
checkpoint, a reverse delta corresponding to the delta written to
non-volatile memory in step S1311 in the cache memory (or in
non-volatile memory). Such reverse delta may have the same size and
format as the previously mentioned delta, only for reversing a
corresponding change according to a delta. Accordingly, while
applying the corresponding delta(s) to a metadata node results in
the updated metadata node, applying the corresponding reverse
delta(s) to the updated metadata node would result again in the
non-modified metadata node.
In step S1314, the process continues to return a write acknowledgement once the updated data (updated data block, updated minor metadata nodes and respective deltas for the updated major metadata nodes) is stored in the non-volatile memory (preferably mirrored in a second non-volatile memory).
FIG. 14 exemplarily illustrates a flow chart of processing a second-type checkpoint (major checkpoint) according to some further exemplary embodiments, exemplarily using management information such as major node dirty lists for the current and previous checkpoint, respectively. Similar processing can be provided also for minor nodes and minor checkpoints.
In step S1401 to S1403 the process may include steps similar to
steps S1151 to S1153, wherein step S1403 may be performed based on
the entries of the major node dirty list of the previous
checkpoint, while a new major node dirty list of the current new
checkpoint is maintained upon step S1402.
Then, the following processing may be (successively or in parallel)
performed for each major metadata node having an entry in the major
node dirty list of the previous checkpoint.
When a major node having an entry in the major node dirty list of
the previous checkpoint shall be written to storage device(s), it
is checked in step S1404 whether the corresponding major node has
an entry in the major node dirty list of the current checkpoint
(i.e. whether it has been modified/dirtied again since issuing the
new major checkpoint).
If step S1404 gives NO, the corresponding major node can be
processed similar to FIG. 11B by loading the corresponding major
node from the cache memory in step S1408 and writing the
corresponding major node from the cache memory to the respective
allocated block on the storage device in step S1409.
Then, the corresponding entry of the corresponding major node can
be removed from the major node dirty list of the previous
checkpoint in step S1410 and its associated delta(s) in the
non-volatile memory may be removed in step S1411.
On the other hand, if step S1404 gives YES, the corresponding major
node is loaded from the cache memory in step S1405 (i.e. in the
updated version of the current checkpoint) and the corresponding
reverse delta(s) are loaded from the cache memory in step S1406,
and, in step S1407, the corresponding major node as loaded from the
cache memory can be "updated" backwards to the version of the
previous major checkpoint by applying the corresponding reverse
delta(s) and the "updated" major node according to the version of
the previous checkpoint is written to storage device(s) in step
S1409.
Then, the corresponding entry of the corresponding major node can
be removed from the major node dirty list of the previous
checkpoint in step S1410 and its associated delta(s) in the
non-volatile memory relating to the previous major checkpoint may
be removed in step S1411 (without however removing the delta(s)
associated with the new current major checkpoint).
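The following hypothetical sketch illustrates the write-out of the previous major checkpoint according to FIG. 14 using two dirty lists and reverse deltas. Working on a private copy of the cached node (via copy.deepcopy) when the node has been dirtied again is only one possible interpretation, chosen here so that the live cache version is not modified; all other names are likewise assumptions.

```python
import copy

def write_previous_major_checkpoint(prev_dirty, curr_dirty, reverse_deltas,
                                    cache, storage, nvram, prev_cp):
    for node_id in list(prev_dirty):                    # entries of the previous dirty list
        node = cache[node_id]                           # S1405 / S1408: load from cache
        if node_id in curr_dirty:                       # S1404 gives YES: dirtied again
            node = copy.deepcopy(node)                  # work on a copy, keep the live version
            for rdelta in reverse_deltas[node_id]:      # S1406: load the reverse delta(s)
                node.apply_delta(rdelta)                # S1407: roll back to the previous version
        block = storage.allocate_block()                # allocation from S1403 (per node for brevity)
        storage.write(block, node)                      # S1409: write for the previous checkpoint
        prev_dirty.remove(node_id)                      # S1410: drop the dirty-list entry
        nvram.major_region.discard_deltas(node_id, prev_cp)   # S1411: only the previous CP's deltas
```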
VI. Further Examples Relating to Subtree Caching
VI.1 Read Request Processing Including Subtree Caching
FIG. 15A exemplarily illustrates a flow chart of processing a read
request, including subtree caching according to further exemplary
embodiments.
In step S1501, an object-related I/O read request directed to a
data object is received, e.g. at the metadata layer.
In step S1502, address information is obtained from the
object-related I/O read request which indicates an address of a
data block to be read (e.g. on the basis of a logical block
address).
In step S1503, e.g. based on information (such as an object ID or
the like) identifying the data object, a root node of a metadata
tree structure associated with the data object is identified and
the identified root node is read (e.g. from cache, if available in
cache, or by random read from storage device(s)).
In step S1504, among the (direct or indirect) nodes stored in the
cache memory being related to at least one node level of the
metadata tree structure associated with the data object, the
(direct or indirect) node of a tree branch related to the data
block to be read is identified based on the address information
obtained in step S1502.
In step S1505, the identified (direct or indirect) node is read
from the cache memory via cache read.
If the identified node is a direct node (i.e. when the direct nodes
of the metadata tree structure are stored in cache according to
direct node subtree caching) and step S1506 gives YES, based on the
pointer to the data block to be read among the pointers of the
direct node read from cache memory, the data block to be read is
read from storage device(s) via a random read operation from
storage device(s) in step S1508, e.g. by issuing a block-related
read request to the storage device(s) or the data protection
layer.
On the other hand, if the identified node is an indirect node (i.e.
when the indirect nodes of the metadata tree structure of one level
of indirect nodes are stored in cache according to indirect node
subtree caching) and step S1506 gives NO, based on the pointer to
the next lower direct or indirect node of the tree branch, which
relates to the data block to be read, among the pointers of the
indirect node read from cache memory, the process continues with a
step S1507 to read the one or more (direct and/or indirect) lower
nodes of the tree branch, which relates to the data block to be
read, via a random read operation from storage device(s), e.g. by
issuing corresponding block-related read requests to the storage
device(s) or the data protection layer.
After reading the respective direct node of said tree branch, based
on the pointer to the data block to be read among the pointers of
the direct node, the data block to be read is read from storage
device(s) via a random read operation from storage device(s), e.g.
by issuing a block-related read request to the storage device(s) or
the data protection layer, in step S1508.
Upon returning the data block to be read via block-related read
response and receiving the block-related read response at the
metadata layer in step S1509, an object-related I/O read response
with the requested data is issued in step S1510 at the metadata
layer, e.g. to be returned to the interface/protocol layer for
creating a respective response to be returned to a requesting
host.
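A minimal, hypothetical sketch of the read path of FIG. 15A follows. It assumes the nodes of exactly one tree level are systematically cached (direct nodes, or one level of indirect nodes), a helper cache.lowest_cached_node_for() that locates the cached node of the target branch, and the same illustrative node/storage interfaces as in the sketches above.

```python
def read_block(request, cache, storage):
    address = request.block_address                                    # S1502
    root = cache.get(request.object_id) or storage.read(request.object_id)   # S1503

    # S1504/S1505: among the cached nodes of the data object, locate and read the
    # node of the target branch (the root read above is not walked any further).
    node = cache.lowest_cached_node_for(request.object_id, address)

    # S1506/S1507: if the cached node is an indirect node, the remaining uncached
    # levels of the branch are read from storage by random reads.
    while not node.is_direct:
        node = storage.read(node.child_for(address))

    data = storage.read(node.child_for(address))       # S1508: random read of the data block
    return data                                        # S1509/S1510: returned in the read response
```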
VI.2 Write Request Processing Including Subtree Caching
FIG. 15B exemplarily illustrates a flow chart of processing a write
request, including subtree caching according to further exemplary
embodiments.
In step S1551, an object-related I/O write request directed to a
data object is received, e.g. at the metadata layer.
Then, steps similar to steps S1502 to S1506 above are performed as steps S1552 to S1556.
In a step S1552, address information is obtained from the
object-related I/O write request which indicates an address of a
data block to be newly written (e.g. on the basis of a logical
block address).
In a step S1553, e.g. based on information (such as an object ID or
the like) identifying the data object, a root node of a metadata
tree structure associated with the data object is identified and
the identified root node is read (e.g. from cache, if available in
cache, or by random read from storage device(s)).
In a step S1554, among the (direct or indirect) nodes stored in the
cache memory being related to at least one node tree level of the
metadata tree structure associated with the data object, the lowest
(direct or indirect) node of a target tree branch related to the
data block to be newly written is identified based on the obtained
address information.
In a step S1555, the identified (direct or indirect) node is read
from the cache memory via cache read.
If the identified node is a direct node (i.e. when the direct nodes
of the metadata tree structure are stored in cache according to
direct node subtree caching) and step S1556 gives YES, the data
block is written to a new block address on storage device(s) into a
newly allocated and previously free block in step S1557.
Similarly, when step S1556 gives NO, the data block is written to a
new block address on storage device(s) into a newly allocated and
previously free block in step S1559, however after walking down the
target data block's tree branch for node tree levels below the
lowest cached node tree level of indirect nodes until reading the
direct node from storage device(s) in step S1558.
However, if the identified node of step S1554 is an indirect node
(i.e. when the indirect nodes of the metadata tree structure of one
or more levels of indirect nodes are stored in cache according to
subtree caching) and step S1556 gives NO, the process additionally
allocates blocks and newly writes the modified one or more (direct
and/or indirect) lower nodes of the tree branch, which relates to
the data block to be newly written, via a random write operation(s)
to storage device(s) in step S1560, e.g. by issuing corresponding
block-related write requests to the storage device(s) or the data
protection layer.
Accordingly, for all nodes of the tree branch lower than the
identified node in the cache memory, a new node is written to have
the updated pointer information leading the tree branch to the
newly written data block.
Then, the (direct or indirect) node of the tree branch identified
in step S1554 is updated in step S1561 by a respective
corresponding cache overwrite to update its pointers to have the
updated pointer information leading the tree branch to the newly
written data block.
In step S1562, the corresponding object-related I/O write
acknowledgement is issued at the metadata layer, e.g. to be
returned to the interface/protocol layer.
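The corresponding write path of FIG. 15B may be sketched as follows, under the same illustrative assumptions; set_child_for() and allocate_block() are hypothetical helpers standing in for pointer updates and block allocation.

```python
def write_block(request, cache, storage):
    address = request.block_address                                     # S1552
    cache.get(request.object_id) or storage.read(request.object_id)    # S1553: root node

    cached_node = cache.lowest_cached_node_for(request.object_id, address)   # S1554/S1555
    node, below_cache = cached_node, []
    while not node.is_direct:                           # S1556 gives NO
        node = storage.read(node.child_for(address))    # S1558: walk down to the direct node
        below_cache.append(node)

    new_block = storage.allocate_block()                # S1557/S1559: newly allocated free block
    storage.write(new_block, request.new_data)

    # S1560: rewrite every node below the cached level to newly allocated blocks,
    # bottom-up, so that each one points at its freshly written child.
    child = new_block
    for n in reversed(below_cache):
        n.set_child_for(address, child)
        child = storage.allocate_block()
        storage.write(child, n)

    cached_node.set_child_for(address, child)           # S1561: cache overwrite of the pointer
    return "ACK"                                        # S1562: write acknowledgement
```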
VI.3 Dynamic Subtree Caching
FIG. 16A exemplarily illustrates a flow chart of dynamic metadata
subtree caching according to exemplary embodiments.
In step S1601, a capacity of the cache memory is determined.
Specifically, it is determined which portion of cache can be made
available for caching of portions of metadata tree structure. This
may be calculated based on settings of an administrator adjusting
cache resources or cache policies, or the available cache capacity
may be set manually by corresponding instruction via a management
computer 300.
In step S1602, the metadata amount is determined in one or more or
each node tree level of metadata tree structure of one or more data
objects. This may be done, for example, by calculating the
respective number of (direct or indirect) nodes per each node tree
level, if the node size is fixed.
In step S1603, exemplarily based on a threshold (which may be set
or be determined based on the cache capacity determined in step
S1601), the lowest node tree level of the metadata tree structure
of one or more data objects is identified based on the amount(s)
determined in step S1602 which does not exceed the threshold.
This may be performed by determining the amounts for each node tree
level, or by starting with a lowest node tree level of direct
nodes, and calculating step by step the amounts for each next
higher node tree level if the previous node tree level was
associated with an amount exceeding the threshold.
It should be noted that generally it may be assumed that the amount of data of each node tree level is lower than the amount of data of the next lower node tree level and higher than the amount of data of the next higher node tree level. That is, while it is preferable to store the direct nodes of the lowest node tree level to achieve the most optimal reduction of read and write amplifications, such lowest node tree level would also lead to the largest cache memory consumption, since the node tree level of the direct nodes generally is likely to have the largest number of nodes.
Once the lowest node tree level of the metadata tree structure of
one or more data objects which does not exceed the threshold is
identified in step S1603 based on the amount(s) determined in step
S1602, the metadata of all (direct or indirect) nodes of this
particular identified lowest node tree level of the metadata tree
structure, for which the amount does not exceed the threshold, is
loaded into the cache memory in step S1604, to be systematically
maintained in cache.
In further embodiments, also metadata nodes of some or all node
tree levels above the identified node tree level may be loaded into
the cache memory in step S1604, to be systematically maintained in
cache.
Here, the lowest node level of the metadata tree structure to be
loaded into cache memory may be identified globally for all
metadata of one or more data objects, or independently for metadata
associated with one data object, independently for metadata
associated with a group of data objects, or independently for
metadata associated with each single data object, or for a complete
metadata structure including a metadata structure of an index
object and a metadata structure of one or more data objects being
pointed to by the metadata structure of the index object.
In step S1605, the remaining cache capacity is monitored, e.g. to detect whether the data amount of the cached node level increases.
If the monitored remaining cache capacity falls below a threshold
(which may be set or be determined based on the cache capacity
determined in step S1601), the lowest cached node tree level is
changed such that a next higher node tree level represents the new
lowest cached node tree level of metadata nodes systematically
maintained in the cache memory.
For example, nodes of a next higher node level of the metadata tree
structure may be loaded into the cache memory (instead of the
previously stored lowest node level of the metadata tree structure)
to use a lower capacity of the cache memory, and the process may
continue again with step S1605.
Alternatively, if the further nodes of upper node tree levels above
the previous lowest node tree level of the metadata tree structure
have been previously maintained systematically in cache memory, the
nodes of the lowest cached node tree level are removed from cache
memory (or set as temporarily stored cache data that can be
overwritten by other data in cache) so that the nodes of a next
higher node tree level represent the new lowest node tree level of
the metadata tree structure maintained systematically in cache
memory.
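The level selection of FIG. 16A may be illustrated by the following hypothetical sketch: the deepest node tree level whose total metadata amount fits the available cache budget becomes the lowest systematically cached level, and the boundary is moved one level up whenever the remaining cache capacity becomes too small. The per-level sizes and the budget are assumed example values, not values from the disclosure.

```python
def choose_cached_level(level_sizes, cache_budget_bytes):
    """level_sizes: per-level metadata sizes, index 0 = root level,
    last index = direct-node level. Returns the deepest level that fits."""
    # S1602/S1603: start at the direct-node level and move upwards until the
    # per-level amount no longer exceeds the budget-derived threshold.
    for level in range(len(level_sizes) - 1, -1, -1):
        if level_sizes[level] <= cache_budget_bytes:
            return level
    return 0                                            # at least the root is cached

def on_low_free_cache(current_level):
    # S1605/S1606: if free cache capacity falls below a threshold, move the lowest
    # systematically cached level one level up the tree.
    return max(0, current_level - 1)

# Example: the direct-node level is too large, the next higher level fits 1 GiB.
levels = [4_096, 1_000_000, 60_000_000, 900_000_000, 12_000_000_000]
assert choose_cached_level(levels, 1 << 30) == 3
```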
FIG. 16B exemplarily illustrates a flow chart of dynamic metadata
subtree caching in connection with checkpoint processing according
to some further exemplary embodiments.
In step S1651, a capacity of the cache memory is determined.
Specifically, it is determined which portion of cache can be made
available for systematic caching of portions of metadata tree
structure. This may be calculated based on settings of an
administrator adjusting cache resources or cache policies, or the
available cache capacity may be set manually by corresponding
instruction via a management computer 300.
In step S1652, the metadata amount of major metadata nodes stored
in cache memory is determined or monitored.
In step S1653, it is checked whether the metadata amount of major
metadata nodes stored in cache memory exceeds a threshold (which
may be set based on step S1651 or be pre-set or configurable).
If step S1653 gives YES, the lowest cached major node tree level of cached major metadata nodes is changed to become the new highest minor node tree level of minor metadata nodes, to reduce the cache usage of systematically caching the major metadata nodes, e.g. when the available cache capacity for other processes becomes too low. This may be accompanied by taking a minor checkpoint, for writing the new minor metadata nodes of the new highest minor node tree level to storage device(s).
In step S1656, it is checked whether the metadata amount of major
metadata nodes stored in cache memory falls below a (preferably
lower) threshold (which may be set based on step S1651 or be
pre-set or configurable).
If step S1656 gives YES, the minor metadata nodes of the highest
minor node tree level are read to be loaded into the cache memory
(e.g. from storage device(s) by random reads, or from the
non-volatile memory if available) in step S1657, and in step S1658
the highest minor node tree level is changed to the new lowest
cached major node tree level of cached major metadata nodes, to
increase the cache usage of systematically caching the major
metadata nodes to improve reduction of write and read
amplifications, in view of more efficient usage of free cache
capacities.
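The adjustment loop of FIG. 16B can then be sketched on top of the SubtreeCachePolicy object introduced further above; the thresholds, the storage.read_level() helper and the take_minor_checkpoint callback are illustrative assumptions only.

```python
def adjust_cached_levels(cache, storage, policy, upper_bytes, lower_bytes,
                         take_minor_checkpoint):
    used = cache.major_metadata_bytes()                  # S1652: cached major metadata amount
    if used > upper_bytes:                               # S1653 gives YES
        # Demote the lowest cached major level to minor; its nodes now live on
        # storage, so a minor checkpoint writes them out.
        policy.deepest_cached_level -= 1
        take_minor_checkpoint()
    elif used < lower_bytes:                             # S1656 gives YES
        # Promote the highest minor level back into the cache (S1657/S1658).
        new_level = policy.deepest_cached_level + 1
        for node in storage.read_level(new_level):
            cache[node.node_id] = node
        policy.deepest_cached_level = new_level
```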
VII. Lazy Update
VII.1 Free Space Object
As mentioned in the above, when a data block is to be written (e.g.
when writing user data into a data block and also when writing a
new root, direct or indirect node to the storage device), a
previously free storage block needs to be allocated.
For such metadata to be used for allocation, a data object may be
managed which indicates which blocks are used or free. Free blocks
are blocks that are currently unused and available for allocation
in connection with new data writes. A block is used if it includes
previously written data (user data or metadata such as e.g. data
relating to a root node, an indirect node or a direct node) and the
block is referenced by at least one object or node's pointer. The
number of pointers pointing to the same block may be referred to as
reference count of the respective block. If the reference count is
zero, the block can be considered to be free and available for new
allocation.
In general, a block being unused and available for (re-)allocation
may be referred to as a free block.
The data of such data object, which indicates which blocks are used
or free for allocation, may be exemplarily referred to as free
space object (FSO) and may be provided, for example, as a bitmap in
some exemplary embodiments, and, in general, the data of the free
space object may include, for each storage block of (e.g. connected
or available) storage device(s), a respective indicator which
indicates whether the associated block is in use or free (available
for allocation).
In a simple example, a bitmap may be provided in which each bit is
associated with one storage block (two bit states per block, i.e.
used or free), but in some other exemplary embodiments there may
also be provided more than one bit per storage block to provide
more detailed information on a status of the block, e.g. further
indicating a reference count of the respective block, when in use,
and/or indicating whether the block is referenced by a current
metadata tree structure and/or previous checkpoint versions of a
metadata tree structure.
When the free space object is managed as a data object, a metadata
data tree structure according to e.g. FIG. 3A or 7A may also be
used for managing the metadata associated with the free space
object.
However, when the allocation management information of the free
space object is read or written to in connection with data writes
and allocation of free blocks for the data writes and freeing now
unused blocks to be available for new allocation, this implies that
the metadata tree structure associated with the free space object
is also read and written so that significant read and write
amplifications may occur similar to the read and write
amplifications discussed in connection with FIGS. 3B and 3C.
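For illustration, a free space object kept as a plain bitmap with one bit per storage block could look like the following hypothetical sketch; reference counts or checkpoint-related flags, as mentioned above, would require more than one bit per block.

```python
class FreeSpaceBitmap:
    def __init__(self, block_count: int):
        self.bits = bytearray((block_count + 7) // 8)    # 0 = free, 1 = used

    def is_used(self, block: int) -> bool:
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self, block: int) -> None:
        self.bits[block // 8] |= 1 << (block % 8)        # mark the block as used

    def free(self, block: int) -> None:
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF   # mark the block as free

fso = FreeSpaceBitmap(block_count=1_000_000)
fso.allocate(42)
assert fso.is_used(42) and not fso.is_used(43)
fso.free(42)
assert not fso.is_used(42)
```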
VII.2 Allocation Management Information Update Operations
According to exemplary embodiments, there may be two operations to
update the allocation management information of the free space
object, e.g. an operation to allocate a new block, e.g. indicating
that a previously free block is updated to the status "used", e.g.
by changing/updating the respective indicator associated with the
respective block from "free" to "used", and an operation to free a
block which is not used anymore (e.g. because the data of the block
has been newly written to a newly allocated block or the block has
been de-duplicated).
According to exemplary embodiments, there may be three or more
operations to update the allocation management information of the
free space object, e.g. an operation to allocate a new block, e.g.
indicating that a previously free block is updated to the status
"used", e.g. by changing/updating the respective indicator
associated with the respective block from "free" to "used", an
operation to increase (increment) a reference count of the block
(e.g. when a newly written node has a pointer pointing to the
respective block or when another duplicate block has been
de-duplicated) and an operation to decrease (decrement) a reference
count of the block (e.g. when a node having a pointer pointing to
the respective block is deleted, e.g. when data of the block of
reference count larger than one has been newly written to a newly
allocated block, or the block has been de-duplicated).
In addition, there may be provided an operation to free a block
which is not used anymore (e.g. because the data of the block of
reference count one has been newly written to a newly allocated
block or the block has been de-duplicated), or, alternatively, the
decrement operation may be regarded as an operation to free a
block, if the decrement operation is performed in connection with a
block having reference count one, and will have a reference count
zero after the reference count decrement operation.
In general the above operations to update the allocation management
information of the free space object which do not allocate a new
block may be referred to as "non-allocation update operations"
(including e.g. freeing a certain block, decrementing a reference
count and/or incrementing a reference count of a certain block),
and the operation to update the allocation management information
of the free space object that a previously free block is
used/allocated may be referred to as "allocation update
operation".
Preferably, "allocation update operations" to update the allocation
management information of the free space object, when a previously
free block is (re-)allocated, are applied to the allocation
management information of the free space object at the time of
allocation of the respective block to avoid that a block may be
allocated twice (or more often).
However, the inventors have considered that non-allocation update operations do not need to be applied to the allocation management information of the free space object at the time of their occurrence but can be delayed to achieve further benefits.
VII.3 Region-Based Accumulation of Update Operations
It is proposed to accumulate non-allocation update operation
management information indicating non-allocation update operations
to be performed, and to apply the accumulated non-allocation update
operations at least for portions of the allocation management
information of the free space object by an accumulated update to
avoid or at least reduce read and write amplifications in
connection with updates of the allocation management information of
the free space object.
For such processing, according to some preferred exemplary
embodiments, it is proposed that the allocation management
information of the free space object is divided into a number of
regions, and respective non-allocation update operation management
information indicating non-allocation update operations to be
performed for blocks of the region of the allocation management
information of the free space object may be accumulated for each of
the regions.
The accumulated non-allocation update operations may be regarded as being randomly distributed across the complete allocation management information of the free space object.
FIG. 17A exemplarily shows a schematic drawing of allocation
management information of the free space object FSO being divided
into plural regions R1 to RM.
Whenever a block status is changed (e.g. freeing the block,
allocating the block, decrementing a reference count, incrementing
a reference count, etc.), an indicator in the allocation management
information of the free space object FSO being associated with the
respective block shall be updated so that the respective indicator
reflects the change of the block status.
FIG. 17B exemplarily shows a schematic drawing of allocation
management information of the free space object FSO of FIG. 17A
after a short period of time, and FIG. 17C exemplarily shows a
schematic drawing of allocation management information of the free
space object FSO of FIG. 17A after a longer period of time.
In the regions R1 to RM, each of the accumulating blocks shall
exemplarily represent an indicator to be updated in the allocation
management information based on a change of an associated
block.
As can be seen, since such updates in connection with status changes of blocks in the allocation management information of the free space object FSO relate to plural random reads and random writes to storage devices, the different regions R1 to RM will likely accumulate updates to be applied in a randomly distributed manner across the regions R1 to RM of the allocation management information of the free space object FSO.
Accumulating updates to be applied in the allocation management
information of the free space object FSO may be performed by
managing, for each region, a respective non-allocation update
operation management information such as an update operation list
per region.
VII.4 Update Operation Management Information
In some exemplary embodiments, non-allocation update operation
management information can be realized as update operation lists
provided per region.
In a simple example, when the status of blocks is changed only
between "free" and "used" and back, the non-allocation update
operation management information may be exemplarily provided as an
update operation list 220_i per region R_i, wherein each update
operation list 220_i may indicate logical block addresses of blocks
of the particular region which need to be freed, as exemplarily
shown in FIG. 18A. The logical block address may also be given as a block number in some exemplary embodiments.
Then, for each region, the respective non-allocation update
operation management information for the respective region,
exemplarily indicates accumulated logical block addresses of blocks
to be freed, i.e. of blocks for which the status can be changed to
"free" in the respective region of the allocation management
information of the free space object FSO.
When such an update operation list 220_i indicates plural blocks in
the region R_i for which the update operation shall be applied to
the allocation management information of the free space object FSO,
then, instead of individually and randomly applying such update
operations, the respective region R_i of the allocation management
information of the free space object FSO can be updated for the
plural blocks indicated in the update operation list 220_i,
thereby significantly reducing write and read amplifications.
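As a rough, non-limiting sketch of this simple example (in Python, with
hypothetical names such as BLOCKS_PER_REGION and FreeSpaceObject; the actual
on-device layout of the free space object is not shown), blocks to be freed are
merely noted in a per-region list corresponding to 220_i and only later applied
to the respective region in a single pass:

```python
# Minimal sketch (not the patented implementation): the allocation management
# information is divided into regions, with one "blocks to free" list per region
# (corresponding to the update operation lists 220_i).
from collections import defaultdict

BLOCKS_PER_REGION = 1024            # hypothetical region size


class FreeSpaceObject:
    def __init__(self, num_blocks):
        self.bitmap = bytearray(num_blocks)       # 0 = "free", 1 = "used"
        self.pending_free = defaultdict(list)     # region index -> block numbers to free

    def note_block_to_free(self, block_number):
        # Accumulate the non-allocation update instead of applying it immediately.
        self.pending_free[block_number // BLOCKS_PER_REGION].append(block_number)

    def apply_region(self, region):
        # One pass over the region applies all accumulated frees at once.
        for block_number in self.pending_free.pop(region, []):
            self.bitmap[block_number] = 0         # status "used" -> "free"


fso = FreeSpaceObject(num_blocks=8 * BLOCKS_PER_REGION)
fso.bitmap[5] = fso.bitmap[7] = 1
fso.note_block_to_free(5)
fso.note_block_to_free(7)
fso.apply_region(0)
assert fso.bitmap[5] == 0 and fso.bitmap[7] == 0
```

In this sketch, applying all accumulated frees of a region at once corresponds
to the single update of that region described above, rather than one random
update per block.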
In another example, when the status of blocks is changed by
incrementing and/or decrementing reference counts (e.g. a block
being freed when the reference count is decremented to zero), the
non-allocation update operation management information may be
exemplarily provided as an update operation list 420_i per region
R_i, wherein each update operation list 420_i may indicate logical
block addresses of blocks of the particular region which need to be
updated in the allocation management information of the free space
object FSO by incrementing or decrementing their respective
reference count.
The update operation list 420_i may further indicate the respective
update operation to be performed, e.g. either to decrement or
increment the reference count associated with the respective block,
and, additionally, the update operation list 420_i may further
indicate a respective checkpoint number indicating a checkpoint
(e.g. a checkpoint of a managed associated file system or other
data structure stored on the storage devices), as exemplarily shown
in FIG. 18B. Such checkpoint numbers may also be indicated
additionally in the update operation list 220_i of FIG. 18A in the
simple example in which blocks are only considered "free" or
"used".
Preferably, when freeing blocks (an update operation decrementing
the reference count to zero), the block should not be reused (e.g.
by re-allocation) during the same checkpoint, and so the update
operation to update a status of a block to "free" may be applied
only for blocks for which the checkpoint number indicated in the
update operation list 420_i is smaller than a current checkpoint
number. This may include minor and/or major checkpoint numbers.
For each region, the respective non-allocation update operation
management information exemplarily indicates accumulated logical
block addresses of blocks for which update operations need to be
performed, i.e. of blocks for which the respective reference count
needs to be decremented or incremented in the respective region of
the allocation management information of the free space object FSO.
When such an update operation list 420_i indicates plural blocks in
the region R_i for which the update operation shall be applied to
the allocation management information of the free space object FSO,
then, instead of individually and randomly applying such update
operations, the respective region R_i of the allocation management
information of the free space object FSO can be updated for the
plural blocks indicated in the update operation list 420_i
according to the respective indicated update operation, thereby
significantly reducing write and read amplifications.
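The reference-count variant can be sketched similarly (again only an
illustration under assumed names; ref_counts stands in for the per-block
indicators of a region): each entry of a per-region list corresponding to 420_i
records the block number, the operation and the checkpoint number, and a
decrement that would free a block is deferred while the entry still belongs to
the current checkpoint:

```python
# Sketch only: per-region update lists with (block, operation, checkpoint) entries,
# as in the update operation list 420_i; freeing within the current checkpoint is deferred.
from collections import defaultdict

BLOCKS_PER_REGION = 1024

ref_counts = defaultdict(int)        # stand-in for per-block reference counts of the FSO
pending_ops = defaultdict(list)      # region index -> [(block, op, checkpoint), ...]


def note_refcount_update(block, op, checkpoint):
    pending_ops[block // BLOCKS_PER_REGION].append((block, op, checkpoint))


def apply_region(region, current_checkpoint):
    deferred = []
    for block, op, checkpoint in pending_ops.pop(region, []):
        if op == "dec" and ref_counts[block] == 1 and checkpoint >= current_checkpoint:
            # The decrement would free the block; keep the entry pending so the
            # block is not reused (re-allocated) during the same checkpoint.
            deferred.append((block, op, checkpoint))
            continue
        ref_counts[block] += 1 if op == "inc" else -1
    if deferred:
        pending_ops[region] = deferred


note_refcount_update(3, "inc", checkpoint=7)
note_refcount_update(3, "dec", checkpoint=8)
apply_region(0, current_checkpoint=9)     # checkpoint 8 < 9, so both entries are applied
assert ref_counts[3] == 0
```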
In the above, the entries in the respective update operation lists
may exemplarily not be indexed per block, and so the same block
address may be indicated multiple times in the same update
operation list, e.g. in connection with incrementing and/or
decrementing the reference count more than once.
In further exemplary embodiments, the update operation list 620_i
per region R_i may also be indexed per block, as exemplarily shown
in FIG. 18C. In the update operation list 620_i, the logical block
addresses (or block numbers) of the blocks which need a status
change are exemplarily identified together with a checkpoint number
of the last status change, similar to FIG. 18B.
However, instead of indicating the update operation "increment" or
"decrement", the update operation list 620_i indicates a delta
number of accumulated reference count changes, indicating whether
the reference count is increased or decreased, and by which amount
the reference count is to be changed when applying the update. For
example, if the block indicator of the allocation management
information of the free space object FSO is to be updated after
accumulating n "decrement" operations and m "increment" operations,
the reference count of the block is to be updated by changing the
reference count by the accumulated delta number m-n.
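As a hedged illustration of such a per-block-indexed list 620_i (the structure
and names below are assumptions, not the claimed on-device format), increments
and decrements for the same block can be merged into a single signed delta
together with the checkpoint of the last change:

```python
# Sketch of a per-block indexed update list (620_i style): one entry per block
# holding the accumulated signed reference-count delta and the last checkpoint.
deltas = {}                                   # block number -> [delta, last_checkpoint]


def note_refcount_change(block, change, checkpoint):
    # change is +1 for an "increment" and -1 for a "decrement" operation
    entry = deltas.setdefault(block, [0, checkpoint])
    entry[0] += change
    entry[1] = checkpoint


# Accumulating m = 3 increments and n = 2 decrements for block 42 yields a delta of m - n = +1.
for _ in range(3):
    note_refcount_change(42, +1, checkpoint=5)
for _ in range(2):
    note_refcount_change(42, -1, checkpoint=6)
assert deltas[42] == [1, 6]
```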
VII.5 Update List Management
FIG. 19 exemplarily illustrates a flow chart of efficient
allocation information management according to exemplary
embodiments.
In step S1901, a new block is allocated (e.g. when a new block of
user data is written or when a new node such as a root node,
indirect node or direct node is written).
Preferably the allocation management information is immediately
updated in step S1902 to change the status of the respective block
from "free" to "used" (including e.g. to increment the reference
count from zero to one), in order to avoid that the block is
allocated again for another write.
This may be done by applying the allocation update to the
allocation management information of the free space object FSO on
disk or on storage device(s). Alternatively, a current region from
which the blocks are currently allocated (e.g. a current region in
which the allocation cursor is currently positioned) may be loaded
into cache during allocation of blocks in the current region, and
updates indicating allocation of blocks are applied to the region
in cache (e.g. by cache overwrite), thereby avoiding read and write
amplifications in connection with allocation updates to the
allocation management information.
When the allocation cursor moves to another region, or when the
blocks are allocated from another region, the other region can be
loaded to cache and the previous region can be sequentially written
based on the updated region from cache (having all allocation
updates being applied thereto).
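A minimal sketch of this caching of the current region (helper names such as
RegionCache and the simple storage mapping are assumptions): allocation updates
overwrite the cached copy, and when the allocation cursor moves to another
region the previous region is written back in one sequential write:

```python
# Sketch (assumed names): allocation updates are applied to the cached copy of
# the region the allocation cursor is in; when the cursor moves, the previous
# region is written back in a single sequential write.
BLOCKS_PER_REGION = 1024


class RegionCache:
    def __init__(self, storage):
        self.storage = storage              # maps region index -> bytearray of block states
        self.current_region = None
        self.data = None

    def _load(self, region):
        if self.current_region is not None:
            # Sequential write of the previously cached region, with all
            # allocation updates already applied to it.
            self.storage[self.current_region] = self.data
        self.current_region = region
        self.data = bytearray(self.storage[region])   # sequential read into cache

    def mark_allocated(self, block_number):
        region, offset = divmod(block_number, BLOCKS_PER_REGION)
        if region != self.current_region:
            self._load(region)
        self.data[offset] = 1               # cache overwrite: "free" -> "used"


storage = {0: bytearray(BLOCKS_PER_REGION), 1: bytearray(BLOCKS_PER_REGION)}
cache = RegionCache(storage)
cache.mark_allocated(10)
cache.mark_allocated(BLOCKS_PER_REGION + 3)   # cursor moves: region 0 is flushed
assert storage[0][10] == 1
```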
However, as mentioned above, non-allocation updates (such as
freeing other blocks, or incrementing and/or decrementing reference
counts of other blocks) may subsequently be required as a
consequence of such allocation of a new block, and in step S1903,
such associated non-allocation update operations are determined but
not applied directly.
For example, when data of a block is to be modified, the data block
is written to a new block location (i.e. a new block is allocated)
but the reference count of the previous block may be decremented by
one, or the block may be freed. Also, storage blocks storing nodes
of the metadata tree may need to be freed or the reference count
thereof may need to be decreased. Step S1903 identifies such
related non-allocation updates (such as freeing other blocks,
incrementing and/or decrementing reference counts of other
blocks).
In step S1904, for each of the identified/determined related
non-allocation update operations, it is determined in which region
of the allocation management information of the free space object
FSO the respective non-allocation update operation is to be
applied, and the respective regions are identified, e.g. based on
logical block addresses and/or block numbers.
In step S1905, for each non-allocation update operation determined
in step S1903, the respective update operation is indicated in an
entry (e.g. by adding an entry, e.g. according to FIG. 18A or 18B,
or by modifying a pre-existing entry, e.g. according to FIG. 18C)
of the non-allocation update operation management information
associated with the respective region.
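The following sketch ties steps S1903 to S1905 together (the allocator and
helper names are purely illustrative): the write path allocates a new block
immediately, while the related non-allocation update for the superseded block
is only recorded as an entry in the list of the region derived from its block
number:

```python
# Illustrative sketch of steps S1903-S1905: the new block is allocated right away,
# while the related non-allocation update (here: decrementing the old block's
# reference count) is merely recorded in the list of the affected region.
import itertools
from collections import defaultdict

BLOCKS_PER_REGION = 1024
pending = defaultdict(list)          # region index -> accumulated update entries
_next_free = itertools.count(2048)   # placeholder for the real allocator (S1901/S1902)


def record_update(block, op, checkpoint):
    region = block // BLOCKS_PER_REGION              # S1904: identify the region
    pending[region].append((block, op, checkpoint))  # S1905: add an entry for that region


def modify_block(old_block, checkpoint):
    new_block = next(_next_free)                     # S1901: allocate a new block location
    record_update(old_block, "dec", checkpoint)      # S1903: related non-allocation update
    return new_block


new_block = modify_block(old_block=17, checkpoint=3)
assert pending[0] == [(17, "dec", 3)]
```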
By doing so, update operation information of updates to be applied
to each of the regions is accumulated in the respective
non-allocation update operation management information associated
with the respective regions.
In the above, the non-allocation update operation management
information associated with the respective region may be held in
cache memory.
In some further embodiments, the non-allocation update operation
management information associated with the respective region may in
addition or alternatively be written to a storage device, e.g. in
an optional step S1906, e.g. to save cache capacity and further
delay applying the non-allocation update operations in the free
space object as stored on storage device(s).
In some embodiments, in order to avoid or further reduce read and
write amplifications, new entries of non-allocation update
operation management information associated with the respective
region may be held in cache until a certain predetermined number of
new entries is accumulated and/or until a certain predetermined
amount of data of new entries is accumulated, and then the
accumulated new entries may be written to the non-allocation update
operation management information associated with the respective
region as held on the internal storage device such as an internal
disk or an internal flash module.
For example, when a certain predetermined number of new entries is
accumulated and/or a certain predetermined amount of data of new
entries is accumulated for a certain region such that the data
corresponds to an integer multiple of a block size of the internal
storage device, the accumulated new entries may be efficiently
written to the non-allocation update operation management
information associated with the respective region as held on the
internal storage device.
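A possible sketch of this batching (the entry size, device block size and the
write_blocks callback are assumptions): new entries for a region are cached and
only handed to the storage device once they fill a whole multiple of the device
block size:

```python
# Sketch only: new update entries for a region are held in cache and written to the
# on-device update operation management information one full device block at a time.
DEVICE_BLOCK_SIZE = 4096
ENTRY_SIZE = 16                                  # hypothetical size of one serialized entry
ENTRIES_PER_FLUSH = DEVICE_BLOCK_SIZE // ENTRY_SIZE


class RegionUpdateLog:
    def __init__(self, write_blocks):
        self.write_blocks = write_blocks         # callback that appends whole device blocks
        self.cached_entries = []

    def add_entry(self, entry):
        self.cached_entries.append(entry)
        if len(self.cached_entries) >= ENTRIES_PER_FLUSH:
            batch = self.cached_entries[:ENTRIES_PER_FLUSH]
            self.cached_entries = self.cached_entries[ENTRIES_PER_FLUSH:]
            self.write_blocks(batch)             # exactly one device block's worth of entries


written = []
log = RegionUpdateLog(write_blocks=written.append)
for block in range(ENTRIES_PER_FLUSH):
    log.add_entry((block, "dec", 1))
assert len(written) == 1 and len(written[0]) == ENTRIES_PER_FLUSH
```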
VII.6 Applying Update Operations to a Region
FIG. 20 exemplarily shows a flow chart of a process applying update
operations to a region.
In step S2001, update operation entries are accumulated in update
operation management information associated with the respective
regions, e.g. according to FIG. 19.
In step S2002, it is checked whether an applying criteria is
fulfilled to apply the accumulated update operation entries for one
or more regions. Such applying criteria can be provided in multiple
ways.
For example, the number of entries and/or the number of accumulated
update operation entries in update operation management information
per region may be monitored, and when the number of entries and/or
the number of accumulated update operation entries in update
operation management information exceed a threshold, the
accumulated update operations of the respective region can be
applied. Then, the applying criteria is fulfilled when the number
of entries and/or the number of accumulated update operation
entries in update operation management information exceed a
threshold for at least one region.
Also, in addition or alternatively, the applying criteria may
involve a periodic update such that the applying criteria is
fulfilled whenever a periodic time to update expires, and at that
time, the one or more regions being associated with the highest
number of entries and/or the highest number of accumulated update
operation entries in update operation management information are
selected to be updated.
Also, in addition or alternatively, the applying criteria may
involve a check of an amount of available free blocks that can be
used for allocation according to the allocation management
information of the free space object FSO, and when the amount of
available free blocks falls below a threshold, one or more regions
of the allocation management information are updated, e.g. until
the amount of free blocks that can be used for allocation according
to the allocation management information of the free space object
FSO is sufficiently increased, e.g. until the amount of free blocks
exceeds a second threshold. Again, at that time, the one or more
regions being associated with the highest number of entries and/or
the highest number of accumulated update operation entries in
update operation management information can be selected to be
updated.
In step S2003, when the applying criteria is met (step S2002 gives
YES), the one or more regions to be updated can be selected or
identified. For example, the one or more regions being associated
with the highest number of entries and/or the highest number of
accumulated update operation entries in update operation management
information can be selected to be updated.
That is, in step S2003 the one or more regions of the free space
object to be updated are identified.
In step S2004, the data of the selected region(s) of the allocation
management information of the free space object FSO is read by
sequential read from storage device(s). By doing so, read and write
amplifications can be advantageously avoided or at least be
significantly reduced in connection with updates of the allocation
management information of the free space object FSO. For example,
the data of the selected region(s) can be loaded into cache.
Optionally, in exemplary embodiments which perform the optional
step S1906 above, the update operation management information for
the respective selected region(s) can be read from storage
device(s) in the optional step S2005. Otherwise, the update
operation management information for the respective selected
region(s) will be available in cache memory.
In step S2006, based on the accumulated update operation entries in
update operation management information for the respective selected
region(s), the data of the selected region(s) is updated (e.g. the
update operations are applied) according to the entries of
non-allocation update operations associated with the region
obtained from the respective update operation management
information. By doing so, read and write amplifications can be
advantageously avoided or at least be significantly reduced in
connection with updates of the allocation management information of
the free space object FSO.
Upon updating the data of the selected region(s), the updated data
of the selected region(s) is written as sequential write to storage
device(s) in step S2007. By doing so, read and write amplifications
can be advantageously avoided or at least be significantly reduced
in connection with updates of the allocation management information
of the free space object FSO.
In step S2008, the respective update operation management
information for the selected region(s) is reset (e.g. to remove all
entries of executed update operations, so that such update
operations are not redundantly repeated the next time the same
region is updated).
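Pulling the steps of FIG. 20 together, the following end-to-end sketch
(read_region and write_region are assumed helpers; the applying criteria is
reduced to a simple entry-count threshold, and the optional step S2005 and the
checkpoint rule are omitted for brevity) selects the region with the most
accumulated entries, reads it sequentially, applies the entries, writes it back
sequentially and resets the list:

```python
# Rough sketch of FIG. 20 (steps S2002-S2004 and S2006-S2008) under assumed helpers.
BLOCKS_PER_REGION = 1024


def maybe_apply_updates(pending, read_region, write_region, threshold):
    if not pending or max(len(v) for v in pending.values()) < threshold:
        return None                                        # S2002: criteria not fulfilled
    region = max(pending, key=lambda r: len(pending[r]))   # S2003: select a region
    data = read_region(region)                             # S2004: sequential read into cache
    for block, op, _checkpoint in pending[region]:         # S2006: apply accumulated updates
        data[block % BLOCKS_PER_REGION] += 1 if op == "inc" else -1
    write_region(region, data)                             # S2007: sequential write back
    pending[region] = []                                   # S2008: reset the update list
    return region


regions = {0: [1] * BLOCKS_PER_REGION}                     # toy reference counts per region
pending = {0: [(5, "dec", 2), (7, "inc", 2)]}
maybe_apply_updates(pending, read_region=lambda r: list(regions[r]),
                    write_region=regions.__setitem__, threshold=2)
assert regions[0][5] == 0 and regions[0][7] == 2 and pending[0] == []
```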
As will be appreciated by one of skill in the art, the present
invention and aspects and exemplary embodiments, as described
hereinabove and in connection with the accompanying figures, may be
embodied as a method (e.g., a computer-implemented process, a
business process, or any other process), apparatus (including a
device, machine, system, computer program product, and/or any other
apparatus), or a combination of the foregoing.
Accordingly, exemplary embodiments of the present invention may
take the form of an entirely hardware embodiment, an entirely
software embodiment (including firmware, resident software,
micro-code, etc.), or an embodiment combining software and hardware
aspects that may generally be referred to herein as a "system".
Furthermore, embodiments of the present invention may take the form
of a computer program product on a computer-readable medium having
computer-executable program code embodied in the medium.
It should be noted that arrows may be used in drawings to represent
communication, transfer, or other activity involving two or more
entities. Double-ended arrows generally indicate that activity may
occur in both directions (e.g., a command/request in one direction
with a corresponding reply back in the other direction, or
peer-to-peer communications initiated by either entity), although
in some situations, activity may not necessarily occur in both
directions.
Single-ended arrows generally indicate activity exclusively or
predominantly in one direction, although it should be noted that,
in certain situations, such directional activity actually may
involve activities in both directions (e.g., a message from a
sender to a receiver and an acknowledgement back from the receiver
to the sender, or establishment of a connection prior to a transfer
and termination of the connection following the transfer). Thus,
the type of arrow used in a particular drawing to represent a
particular activity is exemplary and should not be seen as
limiting.
Embodiments of the present invention are described hereinabove with
reference to flowchart illustrations and/or block diagrams of
methods and apparatuses, and with reference to a number of sample
views of a graphical user interface generated by the methods and/or
apparatuses. It will be understood that each block of the flowchart
illustrations and/or block diagrams, and/or combinations of blocks
in the flowchart illustrations and/or block diagrams, as well as
the graphical user interface, can be implemented by
computer-executable program code.
The computer-executable program code may be provided to a processor
of a general purpose computer, special purpose computer, or other
programmable data processing apparatus to produce a particular
machine, such that the program code, which executes via the
processor of the computer or other programmable data processing
apparatus, generates means for implementing the
functions/acts/outputs specified in the flowchart, block diagram
block or blocks, figures, and/or written description.
This computer-executable program code may also be stored in a
computer-readable memory that can direct a computer or other
programmable data processing apparatus to function in a particular
manner, such that the program code stored in the computer readable
memory produces an article of manufacture including instruction
means which implement the function/act/output specified in the
flowchart, block diagram block(s), figures, and/or written
description.
The computer-executable program code may also be loaded onto a
computer or other programmable data processing apparatus to cause a
series of operational steps to be performed on the computer or
other programmable apparatus to produce a computer-implemented
process such that the program code which executes on the computer
or other programmable apparatus provides steps for implementing the
functions/acts/outputs specified in the flowchart, block diagram
block(s), figures, and/or written description. Alternatively,
computer program implemented steps or acts may be combined with
operator or human implemented steps or acts in order to carry out
an embodiment of the invention.
It should be noted that terms such as "server" and "processor" may
be used herein to describe devices that may be used in certain
embodiments of the present invention and should not be construed to
limit the present invention to any particular device type unless
the context otherwise requires. Thus, a device may include, without
limitation, a bridge, router, bridge-router (brouter), switch,
node, server, computer, appliance, or other type of device. Such
devices typically include one or more network interfaces for
communicating over a communication network and a processor (e.g., a
microprocessor with memory and other peripherals and/or
application-specific hardware) configured accordingly to perform
device functions.
Communication networks generally may include public and/or private
networks; may include local-area, wide-area, metropolitan-area,
storage, and/or other types of networks; and may employ
communication technologies including, but in no way limited to,
analog technologies, digital technologies, optical technologies,
wireless technologies (e.g., Bluetooth), networking technologies,
and internetworking technologies.
It should also be noted that devices may use communication
protocols and messages (e.g., messages generated, transmitted,
received, stored, and/or processed by the device), and such
messages may be conveyed by a communication network or medium.
Unless the context otherwise requires, the present invention should
not be construed as being limited to any particular communication
message type, communication message format, or communication
protocol. Thus, a communication message generally may include,
without limitation, a frame, packet, datagram, user datagram, cell,
or other type of communication message.
Unless the context requires otherwise, references to specific
communication protocols are exemplary, and it should be understood
that alternative embodiments may, as appropriate, employ variations
of such communication protocols (e.g., modifications or extensions
of the protocol that may be made from time-to-time) or other
protocols either known or developed in the future.
It should also be noted that logic flows may be described herein to
demonstrate various aspects of the invention, and should not be
construed to limit the present invention to any particular logic
flow or logic implementation. The described logic may be
partitioned into different logic blocks (e.g., programs, modules,
functions, or subroutines) without changing the overall results or
otherwise departing from the true scope of the invention.
Often times, logic elements may be added, modified, omitted,
performed in a different order, or implemented using different
logic constructs (e.g., logic gates, looping primitives,
conditional logic, and other logic constructs) without changing the
overall results or otherwise departing from the true scope of the
invention.
The present invention may be embodied in many different forms,
including, but in no way limited to, computer program logic for use
with a processor (e.g., a microprocessor, microcontroller, digital
signal processor, or general purpose computer), programmable logic
for use with a programmable logic device (e.g., a Field
Programmable Gate Array (FPGA) or other PLD), discrete components,
integrated circuitry (e.g., an Application Specific Integrated
Circuit (ASIC)), or any other means including any combination
thereof. Computer program logic implementing some or all of the
described functionality is typically implemented as a set of
computer program instructions that is converted into a computer
executable form, stored as such in a computer readable medium, and
executed by a microprocessor under the control of an operating
system. Hardware-based logic implementing some or all of the
described functionality may be implemented using one or more
appropriately configured FPGAs.
Computer program logic implementing all or part of the
functionality previously described herein may be embodied in
various forms, including, but in no way limited to, a source code
form, a computer executable form, and various intermediate forms
(e.g., forms generated by an assembler, compiler, linker, or
locator).
Source code may include a series of computer program instructions
implemented in any of various programming languages (e.g., an
object code, an assembly language, or a high-level language such as
Fortran, C, C++, JAVA, or HTML) for use with various operating
systems or operating environments. The source code may define and
use various data structures and communication messages. The source
code may be in a computer executable form (e.g., via an
interpreter), or the source code may be converted (e.g., via a
translator, assembler, or compiler) into a computer executable
form.
Computer-executable program code for carrying out operations of
embodiments of the present invention may be written in an object
oriented, scripted or unscripted programming language such as Java,
Perl, Smalltalk, C++, or the like. However, the computer program
code for carrying out operations of embodiments of the present
invention may also be written in conventional procedural
programming languages, such as the "C" programming language or
similar programming languages.
Computer program logic implementing all or part of the
functionality previously described herein may be executed at
different times on a single processor (e.g., concurrently) or may
be executed at the same or different times on multiple processors
and may run under a single operating system process/thread or under
different operating system processes/threads.
Thus, the term "computer process" refers generally to the execution
of a set of computer program instructions regardless of whether
different computer processes are executed on the same or different
processors and regardless of whether different computer processes
run under the same operating system process/thread or different
operating system processes/threads.
The computer program may be fixed in any form (e.g., source code
form, computer executable form, or an intermediate form) either
permanently or transitorily in a tangible storage medium, such as a
semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or
Flash-Programmable RAM), a magnetic memory device (e.g., a diskette
or fixed disk), an optical memory device (e.g., a CD-ROM), a PC
card (e.g., PCMCIA card), or other memory device.
The computer program may be fixed in any form in a signal that is
transmittable to a computer using any of various communication
technologies, including, but in no way limited to, analog
technologies, digital technologies, optical technologies, wireless
technologies (e.g., Bluetooth), networking technologies, and
internetworking technologies.
The computer program may be distributed in any form as a removable
storage medium with accompanying printed or electronic
documentation (e.g., shrink wrapped software), preloaded with a
computer system (e.g., on system ROM or fixed disk), or distributed
from a server or electronic bulletin board over the communication
system (e.g., the Internet or World Wide Web).
Hardware logic (including programmable logic for use with a
programmable logic device) implementing all or part of the
functionality previously described herein may be designed using
traditional manual methods, or may be designed, captured,
simulated, or documented electronically using various tools, such
as Computer Aided Design (CAD), a hardware description language
(e.g., VHDL or AHDL), or a PLD programming language (e.g., PALASM,
ABEL, or CUPL).
Any suitable computer readable medium may be utilized. The computer
readable medium may be, for example but not limited to, an
electronic, magnetic, optical, electromagnetic, infrared, or
semiconductor system, apparatus, device, or medium.
More specific examples of the computer readable medium include, but
are not limited to, an electrical connection having one or more
wires or other tangible storage medium such as a portable computer
diskette, a hard disk, a random access memory (RAM), a read-only
memory (ROM), an erasable programmable read-only memory (EPROM or
Flash memory), a compact disc read-only memory (CD-ROM), or other
optical or magnetic storage device.
Programmable logic may be fixed either permanently or transitorily
in a tangible storage medium, such as a semiconductor memory device
(e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a
magnetic memory device (e.g., a diskette or fixed disk), an optical
memory device (e.g., a CD-ROM), or other memory device.
The programmable logic may be fixed in a signal that is
transmittable to a computer using any of various communication
technologies, including, but in no way limited to, analog
technologies, digital technologies, optical technologies, wireless
technologies (e.g., Bluetooth), networking technologies, and
internetworking technologies.
The programmable logic may be distributed as a removable storage
medium with accompanying printed or electronic documentation (e.g.,
shrink wrapped software), preloaded with a computer system (e.g.,
on system ROM or fixed disk), or distributed from a server or
electronic bulletin board over the communication system (e.g., the
Internet or World Wide Web). Of course, some embodiments of the
invention may be implemented as a combination of both software
(e.g., a computer program product) and hardware. Still other
embodiments of the invention are implemented as entirely hardware,
or entirely software.
While certain exemplary embodiments have been described and shown
in the accompanying drawings, it is to be understood that such
embodiments are merely illustrative of and are not restrictive on
the broad invention, and that the embodiments of the invention are not
limited to the specific constructions and arrangements shown and
described, since various other changes, combinations, omissions,
modifications and substitutions, in addition to those set forth in
the above paragraphs, are possible.
Those skilled in the art will appreciate that various adaptations,
modifications, and/or combinations of the just-described embodiments
can be configured without departing from the scope and spirit of
the invention. Therefore, it is to be understood that, within the
scope of the appended claims, the invention may be practiced other
than as specifically described herein. For example, unless
expressly stated otherwise, the steps of processes described herein
may be performed in orders different from those described herein
and one or more steps may be combined, split, or performed
simultaneously.
* * * * *