U.S. patent application number 15/153308 was filed with the patent office on 2016-05-12 and published on 2016-09-08 for a method and apparatus for tiered storage.
This patent application is currently assigned to Avere Systems, Inc. The applicants listed for this patent, who are also the credited inventors, are John R. Boyles, Jeffrey Butler, Daniel Clash, Joseph Hosteny, IV, Michael L. Kazar, and Daniel S. Nydick.
Application Number | 15/153308 |
Publication Number | 20160261694 |
Family ID | 44710864 |
Filed Date | 2016-05-12 |
Publication Date | 2016-09-08 |
United States Patent Application | 20160261694 |
Kind Code | A1 |
Clash; Daniel; et al. | September 8, 2016 |
Method and Apparatus for Tiered Storage
Abstract
A system for storing file data and directory data received over
a network includes a network interface in communication with the
network which receives NAS requests containing data to be written
to files from the network. The system includes a first type of
storage. The system includes a second type of storage different
from the first type of storage. The system includes a policy
specification which specifies that a first portion of one or more
files' data, which is less than all of the files' data, is stored in
the first type of storage and that a second portion of the data, which
is less than all of the data of the files, is stored in the second
type of storage. The system comprises a processing unit which executes the
policy and causes the first portion to be stored in the first type
of storage and a second portion to be stored in the second type of
storage. A method for storing file data and directory data received
over a network is also described.
Inventors: | Clash; Daniel (Pittsburgh, PA); Kazar; Michael L. (Pittsburgh, PA); Boyles; John R. (Cranberry Township, PA); Butler; Jeffrey (Sewickley, PA); Hosteny, IV; Joseph (Pittsburgh, PA); Nydick; Daniel S. (Wexford, PA) |

Applicant: |
Name | City | State | Country |
Clash; Daniel | Pittsburgh | PA | US |
Kazar; Michael L. | Pittsburgh | PA | US |
Boyles; John R. | Cranberry Township | PA | US |
Butler; Jeffrey | Sewickley | PA | US |
Hosteny, IV; Joseph | Pittsburgh | PA | US |
Nydick; Daniel S. | Wexford | PA | US |

Assignee: | Avere Systems, Inc. (Pittsburgh, PA) |

Family ID: | 44710864 |
Appl. No.: | 15/153308 |
Filed: | May 12, 2016 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number | Relationship |
12/798,285 | Apr 1, 2010 | 9,342,528 | parent of this application (continuation) |
15/135,164 | Apr 21, 2016 | -- | parent of this application (continuation-in-part) |
12/283,961 | Sep 18, 2008 | 9,323,681 | parent of 15/135,164 (divisional) |
14/175,801 | Feb 7, 2014 | -- | parent of this application (continuation-in-part) |
13/493,701 | Jun 11, 2012 | 8,655,931 | parent of 14/175,801 (continuation) |
12/218,085 | Jul 11, 2008 | 8,214,404 | parent of 13/493,701 (continuation) |
Current U.S. Class: | 1/1 |
Current CPC Class: | G06F 16/182 20190101; G06F 2212/60 20130101; H04L 67/1097 20130101; G06F 3/0643 20130101; G06F 3/065 20130101; G06F 16/1827 20190101; H04L 67/06 20130101; G06F 2212/163 20130101; G06F 3/067 20130101; G06F 12/0813 20130101; G06F 16/22 20190101; G06F 3/0685 20130101; G06F 3/0604 20130101; G06F 2212/62 20130101; G06F 3/0659 20130101; G06F 2212/154 20130101 |
International Class: | H04L 29/08 20060101 H04L029/08; G06F 12/08 20060101 G06F012/08; G06F 17/30 20060101 G06F017/30; G06F 3/06 20060101 G06F003/06 |
Claims
1. A system for storing file data and directory data received over
a network comprising: a network interface in communication with the
network which receives NAS requests containing data to be written
to files from the network; a first type of storage; a second type
of storage different from the first type of storage; a policy
specification which specifies a first portion of one or more files'
data which is less than all of the files' data is stored in the
first type of storage and a second portion of the data which is
less than all of the data of the files is stored in the second type
of storage; and a processing unit which executes the policy and
causes the first portion to be stored in the first type of storage
and a second portion to be stored in the second type of
storage.
2. A system as described in claim 1 where the policy specification
is stored in a file within the first type of storage or the second
type of storage.
3. A system as described in claim 1 including a management database
outside of the first type of storage or the second type of storage
where the policy specification is stored in the management
database.
4. The system as described in claim 1 wherein the policy
specification specifies the first portion and the second portion of
one or more files' data, where either portion may include file
system meta-data.
5. The system as described in claim 4 wherein the policy
specification for the first portion or the second portion of one or
more files' data includes meta-data containing block addressing
information.
6. The system as described in claim 1 including a buffer module
having buffers and which reads and writes data into the
buffers.
7. The system as described in claim 6 including an inode attribute
manager which updates attributes in an inode.
8. The system as described in claim 7 including a directory manager
which treats a directory object as a set of mappings between file
names and inode identifiers.
9. The system as described in claim 7 including a data manager
which copies data to and from the buffers.
10. The system as described in claim 9 including an inode object
allocator which allocates inodes and tags them with policy
specifications.
11. A system as described in claim 10 including a NAS server
operations module which receives incoming NAS requests and invokes
local storage operations interfacing with the inode attribute
manager, the directory manager, the data manager and the inode
object allocator for reading and writing file and directory
attributes, reading and writing file data, and performing directory
operations.
12. The system as described in claim 10 including a NAS cache
operations module which acts as a cache of data stored in one or more
external NFS servers and which creates and maintains cached
versions of actively accessed directories and files stored at the
external NFS server.
13. A system as described in claim 1 wherein at least two files are
stored within a directory, where a Block Allocator allocates some
blocks to a first file in the directory from a first type of
storage, and allocates some blocks to a second file in the
directory from a second type of storage different from the first
type of storage.
14. A system as described in claim 13 including a Block Allocator
that allocates blocks for a file from a first type of storage and
additional blocks for the same file from a second type of storage
different from the first type of storage.
15. A system as described in claim 14 where the Block Allocator
determines the blocks to be allocated to a file from a first type
of storage, and the blocks to be allocated from a second type of
storage different from the first type of storage, based upon the
policy associated with the file.
16. A method for storing file data and directory data received over
a network comprising the steps of: receiving NAS requests
containing data to be written to files from the network at a
network interface; executing with a processing unit the policy
specification which specifies a first portion of one or more files'
data which is less than all of the files' data is stored in a first
type of storage and a second portion of the data which is less than
all of the data of the files is stored in a second type of storage
which is different from the first type of storage; and causing with
the processing unit the first portion to be stored in the first
type of storage and the second portion to be stored in the second
type of storage.
17. The method as described in claim 16 including the step of
writing data to a file located in a directory which is a policy
root directory and has the policy specification.
18. The method as described in claim 17 including the steps of
looking up a subdirectory D, which is a policy root directory
having an associated policy specification, and associating the
policy specification with subdirectory D.
19. A system for storing file data and directory data received over
a network comprising: a network interface in communication with the
network which receives NAS requests from the network, including NAS
requests containing data to be written to files; a first type of
storage; a second type of storage different from the first type of
storage; a policy specification which specifies a first portion of
one or more directories' data which is less than all of the
directories' data is stored in the first type of storage and a
second portion of the data which is less than all of the data of
the directories is stored in the second type of storage; and a
processing unit which executes the policy and causes the first
portion to be stored in the first type of storage and a second
portion to be stored in the second type of storage.
20. A system for storing file data and directory data received over
a network comprising: a network interface in communication with the
network which receives NAS requests containing data to be written
to files from the network; a first type of storage; a policy
specification which specifies a first portion of one or more files'
data which is less than all of the files' data is stored in the
first type of storage; a processing unit which executes the policy
and causes the first portion to be stored in the first type of
storage; an inode attribute manager which updates attributes in an
inode; and an inode object allocator which allocates inodes and
tags them with policy specifications.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This is a continuation of U.S. patent application Ser. No.
12/798,285 filed Apr. 1, 2010, and is a continuation-in-part and
claims priority from U.S. patent application Ser. No. 15/135,164
filed Apr. 21, 2016, which is a divisional of U.S. patent
application Ser. No. 12/283,961 filed Sep. 18, 2008, now U.S. Pat.
No. 9,323,681; this application also is a continuation-in-part and
claims priority from U.S. patent application Ser. No. 14/175,801
filed Feb. 7, 2014, which is a continuation of U.S. patent
application Ser. No. 13/493,701 filed Jun. 11, 2012, now U.S. Pat.
No. 8,655,931, which is a continuation of U.S. patent application
Ser. No. 12/218,085 filed Jul. 11, 2008, now U.S. Pat. No.
8,214,404, all of which are incorporated by reference herein.
FIELD OF THE INVENTION
[0002] This invention is in the field of tiered computer storage
servers--NAS or SAN servers with more than one type of persistent
storage present in the system. The servers in question may be
either NAS file servers or appliances that cache data from NAS file
servers. (As used herein, references to the "present invention" or
"invention" relate to exemplary embodiments and not necessarily to
every embodiment encompassed by the appended claims.)
BACKGROUND OF THE INVENTION
[0003] This section is intended to introduce the reader to various
aspects of the art that may be related to various aspects of the
present invention. The following discussion is intended to provide
information to facilitate a better understanding of the present
invention. Accordingly, it should be understood that statements in
the following discussion are to be read in this light, and not as
admissions of prior art.
[0004] Today, there are many types of persistent storage used by
network attached storage servers, including magnetic disk storage,
solid state storage, and battery-backed RAM. This type of storage
can be used by a typical NAS or SAN server, storing all of the data
in a virtual disk or file system, or it can be used in a tier or
cache server that stores only the most recently accessed data from
a disk or file system.
[0005] In either type of storage system, storing data in the best
type of persistent storage for its reference pattern can result in
a much better ratio of storage system cost per storage system
operation. For example, NVRAM provides the fastest random read or
write rates of the three example storage media mentioned above, but
it is also currently the most expensive, perhaps five times as
expensive as the next most expensive media (flash or solid state
storage). Flash storage provides comparable random read performance
to NVRAM at a small fraction of the cost, but getting good random
write performance from flash storage is a challenge, and also
negatively affects the overall lifetime of the flash device.
Standard magnetic disk storage handles sequential read and write
requests nearly as fast as any other persistent storage media, at
the lowest cost of all, but loses the vast majority of its
performance if the read or write requests are not for sequentially
stored data.
[0006] Thus, if a storage system can place the various types of
data in the appropriate type of storage, a storage system can
deliver a much better price/performance ratio than one that simply
uses a single type of persistent storage.
[0007] Existing systems make use of a mix of types of persistent
storage in a number of ways. Many file servers, going back to Sun
Microsystems' PrestoServe board for its SunOS-based file servers,
have used NVRAM to reduce write latencies by providing temporary
persistent storage for new incoming data. In the SSD arena,
NetApp's PAM2 card is a victim cache made from SSD holding data
that doesn't fit in memory, speeding up random reads to the data
stored in the card. For a number of reasons, even though this
cache is made from persistent storage, the NetApp PAM2 cache does
not hold modified data that is not persistently held elsewhere,
either in an NVRAM card or on rotating disks. And of course, an
obvious use of SSD drives 48 is as a replacement for existing
drives, providing faster read access, especially random read
access, at the cost of some penalty in both cost and write
performance. Systems like ONTAP/GX can also make more intelligent
use of flash or SSD drives 48 by migrating entire volumes to
storage aggregates comprised entirely of SSD; in the ONTAP/GX case,
this would allow portions of a namespace to be moved to SSD,
although only in its entirety, only all at one time, and only at
pre-defined volume boundaries.
[0008] It is in this context that this invention operates. This
invention allows the mixing of SSD and normal rotating magnetic
drives (hereafter referred to as hard disk drives, or HDDs) in the
same file system, instead of as a cache or as a separate type of
aggregate, and provides a number of policy mechanisms for
controlling the exact placement of files within the collection of
storage pools.
[0009] This provides a number of advances over the state of the
art. First, because individual files can be split, at the block
level, between SSD and HDD storage, or between other types of
persistent storage, this server can place storage optimally at a
very fine level of granularity. For example, consider the case of a
media server where files are selected randomly for playing, but
where, once selected, the entire file is typically read
sequentially. This system could store the first megabyte or so in
SSD, with the remainder of what may well be a 200 MB or larger file
stored on much less expensive HDD. The latency for retrieving the
first 1 MB of data would be very low because the data is stored in
SSD, and by the time that this initial segment of data has been
delivered to the NAS client, the HDD could be transferring data at
its full rate, after having performed its high latency seek
operation concurrent with the initial segment's transfer from SSD.
To fully benefit from this flexibility in allocation, the storage
system needs to apply allocation policies to determine where and
when to allocate file space to differing types of storage. This
invention allows policies to be provided globally, or for
individual exports of either NAS or SAN data, or for arbitrarily
specified subtrees in a NAS name space.
[0010] As compared with prior art that allows whole volumes to be
relocated from HDD to SSD, or vice versa, this invention provides
many benefits. First, the invention requires neither entire volumes
nor even entire files to move between storage types, and second,
data can be placed in its optimal location initially, based on the
specified policies. As compared with using flash in a victim cache,
use of flash memory as file system storage allows improvement in
write operation performance, since data written never needs to be
stored on HDD at all. In addition, this invention allows policies
to change at any directory level, not just at volume
boundaries.
BRIEF SUMMARY OF THE INVENTION
[0011] The present invention pertains to a system for storing file
data and directory data received over a network. The system
comprises a network interface in communication with the network
which receives NAS requests containing data to be written to files
from the network. The system comprises a first type of storage. The
system comprises a second type of storage different from the first
type of storage. The system comprises a policy specification which
specifies a first portion of one or more files' data which is less
than all of the files' data is stored in the first type of storage
and a second portion of the data which is less than all of the data
of the files is stored in the second type of storage. The system
comprises a processing unit which executes the policy and causes
the first portion to be stored in the first type of storage and a
second portion to be stored in the second type of storage.
[0012] The present invention pertains to a method for storing file
data and directory data received over a network. The method
comprises the steps of receiving NAS requests containing data to be
written to files from the network at a network interface. There is
the step of executing with a processing unit the policy
specification which specifies a first portion of one or more files'
data which is less than all of the files' data is stored in a first
type of storage and a second portion of the data which is less than
all of the data of the files is stored in a second type of storage
which is different from the first type of storage. There is the
step of causing with the processing unit the first portion to be
stored in the first type of storage and the second portion to be
stored in the second type of storage.
[0013] The present invention pertains to a system for storing file
data and directory data received over a network. The system
comprises a network interface in communication with the network
which receives the NAS requests from the network, including the NAS
requests containing data to be written to files. The system
comprises a first type of storage. The system comprises a second
type of storage different from the first type of storage. The
system comprises a policy specification which specifies a first
portion of one or more directories' data which is less than all of
the directories' data is stored in the first type of storage and a
second portion of the data which is less than all of the data of
the directories is stored in the second type of storage. The system
comprises a processing unit which executes the policy and causes
the first portion to be stored in the first type of storage and a
second portion to be stored in the second type of storage.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING
[0014] In the accompanying drawings, the preferred embodiment of
the invention and preferred methods of practicing the invention are
illustrated in which:
[0015] FIG. 1 is a block diagram of the system and of the claimed
invention.
[0016] FIG. 2 is another block diagram of the system of the claimed
invention.
[0017] FIG. 3 shows an inode table file.
[0018] FIG. 4 shows the contents of an inode file slotted page.
[0019] FIG. 5 shows a tree of directories and files, each of which
has a reference to a policy object file.
[0020] FIG. 6 shows a new policy object Y added to a directory
C.
[0021] FIG. 7 shows a state of a directory and files with a fixed
policy link.
DETAILED DESCRIPTION OF THE INVENTION
[0022] Referring now to the drawings wherein like reference
numerals refer to similar or identical parts throughout the several
views, and more specifically to FIGS. 1 and 2 thereof, there is
shown a system 10 for storing file data and directory data received
over a network. The system 10 comprises a network interface 12 in
communication with the network which receives NAS requests
containing data to be written to files from the network. The system
10 comprises a first type of storage 14. The system 10 comprises a
second type of storage 16 different from the first type of storage
14. The system 10 comprises a policy specification which specifies
a first portion of one or more files' data which is less than all
of the files' data is stored in the first type of storage 14 and a
second portion of the data which is less than all of the data of
the files is stored in the second type of storage. The system 10
comprises a processing unit 18 which executes the policy and causes
the first portion to be stored in the first type of storage 14 and
a second portion to be stored in the second type of storage 16.
[0023] At least two files may be stored within a directory, where a
Block Allocator 42 allocates some blocks to a first file in the
directory from a first type of storage 14, and allocates some
blocks to a second file in the directory from a second type of
storage 16 different from the first type of storage 14. The system
10 may include a Block Allocator 42 that allocates blocks for a
file from a first type of storage 14 and additional blocks for the
same file from a second type of storage 16 different from the first
type of storage 14. The Block Allocator 42 may determine the blocks
to be allocated to a file from a first type of storage 14, and the
blocks to be allocated from a second type of storage 16 different
from the first type of storage 14, based upon the policy associated
with the file.
Some examples of different types of storage include drives made from
different materials, drives that use different formats or structures,
or drives that differ by at least 10% in speed, or by at least a 10%
difference in bandwidth and/or average latency when performing random
read operations. Or, there could be "solid state disks" vs. "fibre
channel disks" (that is, disk drives made from flash chips vs. disk
drives made from magnetic media disks); the flash disks have much
lower latency when doing random IO operations. In general,
different types of storage might have different access latencies
(the time between sending a request and receiving the desired data)
and different transfer rates. Disk drives can rotate at rates
between 5400 RPM and 15000 RPM, with access times inversely
proportional to the rotation speed (one basically has to wait 1/2
of a spin to obtain the desired data, on average).
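As a rough illustration of that half-spin rule of thumb (this sketch is not part of the patent text), the average rotational latency can be computed directly from the rotation rate:

    # Average rotational latency is roughly half a revolution:
    # 0.5 / (RPM / 60) seconds.
    def avg_rotational_latency_ms(rpm: float) -> float:
        return 0.5 / (rpm / 60.0) * 1000.0

    for rpm in (5400, 7200, 15000):
        print(rpm, "RPM ->", round(avg_rotational_latency_ms(rpm), 2), "ms")
    # 5400 RPM -> 5.56 ms, 7200 RPM -> 4.17 ms, 15000 RPM -> 2.0 ms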
[0025] The policy specification may be stored in a file within the
first type of storage 14 or the second type of storage 16. The
system 10 may include a management database 20 outside of the first
type of storage 14 or the second type of storage 16 where the
policy specification is stored in the management database 20.
[0026] The policy specification may specify the first portion and
the second portion of one or more files' data, where either portion
may include file system 10 meta-data. The policy specification for
the first portion or the second portion of one or more files' data
may include meta-data containing block addressing information.
[0027] The system 10 may include a buffer module 22 having buffers
and which reads and writes data into the buffers. The system 10 may
include an inode attribute manager 24 which updates attributes in
an inode. The system 10 can include a directory manager 26 which
treats a directory object as a set of mappings between file names
and inode identifiers. Alternatively, the system 10 may include a
data manager 28 which copies data to and from the buffers. The
system 10 may include an inode object allocator 30 which allocates
inodes and tags them with policy specifications.
[0028] The system 10 may include a NAS server operations module 32
which receives incoming NAS requests and invokes local storage
operations interfacing with the inode attribute manager 24, the
directory manager 26, the data manager 28 and the inode object
allocator 30 for reading and writing file and directory attributes,
reading and writing file data, and performing directory operations.
Alternatively, the system 10 may include a NAS cache operations
module 34 which acts as a cache of data stored in one or more external
NFS servers and which creates and maintains cached versions of
actively accessed directories and files stored at the external NFS
server.
[0029] The present invention pertains to a method for storing file
data and directory data received over a network. The method
comprises the steps of receiving NAS requests containing data to be
written to files from the network at a network interface 12. There
is the step of executing with a processing unit 18 the policy
specification which specifies a first portion of one or more files'
data which is less than all of the files' data is stored in a first
type of storage 14 and a second portion of the data which is less
than all of the data of the files is stored in a second type of storage
which is different from the first type of storage 14. There is the
step of causing with the processing unit 18 the first portion to be
stored in the first type of storage 14 and the second portion to be
stored in the second type of storage 16.
[0030] There may be the step of writing data to a file located in a
directory which is a policy root directory and has the policy
specification. There may be the steps of looking up a subdirectory
D, which is a policy root directory having an associated policy
specification, and associating the policy specification with
subdirectory D. There may be the step of looking up a subdirectory
D of a parent directory having an associated policy specification,
and associating the parent directory's policy specification with
subdirectory D. There may be the step of looking up a file within a
directory having an associated policy specification, and
associating the directory's policy specification with file F. There
may be the step of writing data to a file F located in a directory
having an associated policy specification, and the step of the
Inode space allocator 36 allocating space to hold the written data
according to the associated policy specification.
[0031] The looking up directory D step may include the steps of
consulting a cache of a management database 20, determining that
the directory D is a policy root, and augmenting directory D's file
handle with a policy tag indicating a policy to be applied to files
under directory D in the file system 10's namespace. There may be
the step of looking up a subdirectory D of a parent directory
having an associated policy specification, and associating the
parent directory's policy specification with subdirectory D. There
may be the step of looking up a file within a directory having an
associated policy specification, and associating the directory's
policy specification with file F. There may be the steps of
receiving the lookup for file F within directory D, and propagating
the policy tag from directory D's file handle to file F's file
handle. There may be the step of storing the data in buffers of a
memory until a buf clean module collects modified data to write the
modified data to persistent storage; determining with the buf clean
module which portions of the file F have new data; and calling from
the buf clean module into an inode space allocator 36 to ensure
that blocks have been allocated in persistent storage for an
appropriate block range in the file F. There may be the step of
identifying with the inode space allocator 36 a policy associated
with the file F from the policy tag of file F's file handle;
calling a block allocator 42 module with the inode space allocator
36; obtaining with the block allocator 42 module storage blocks of
an appropriate type in persistent storage; and inserting with the
inode space allocator 36 block addresses associated with the
appropriate storage into an indirect block tree associated with a
file F. There may be the step of starting IO operations with the
buf clean module from buffers in memory having respective data
written to the appropriate storage blocks in persistent
storage.
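The cleaning steps just enumerated can be summarized in a short Python sketch; the helper names (dirty_block_ranges, policy_tag, indirect_tree, and so on) are assumptions for illustration, not the actual interfaces:

    BLOCK_SIZE = 16 * 1024   # illustrative block size

    def clean_inode(f, policies, block_allocator, start_io):
        policy = policies[f.handle.policy_tag]       # policy from the file handle tag
        for block_range in f.dirty_block_ranges():   # portions holding new data
            for block in block_range:
                media = policy(f.attrs, block * BLOCK_SIZE)  # "ssd" or "hdd"
                addr = block_allocator.allocate(media)   # blocks of the right type
                f.indirect_tree.insert(block, addr)      # record the block address
            start_io(f, block_range)   # write the buffers to persistent storage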
[0032] The present invention pertains to a system 10 for storing
file data and directory data received over a network. The system 10
comprises a network interface 12 in communication with the network
which receives the NAS requests from the network, including the NAS
requests containing data to be written to files. The system 10
comprises a first type of storage 14. The system 10 comprises a
second type of storage 16 different from the first type of storage
14. The system 10 comprises a policy specification which specifies
a first portion of one or more directories' data which is less than
all of the directories' data is stored in the first type of storage
14 and a second portion of the data which is less than all of the
data of the directories is stored in the second type of storage. The
system 10 comprises a processing unit 18 which executes the policy
and causes the first portion to be stored in the first type of
storage 14 and a second portion to be stored in the second type of
storage 16.
[0033] In the operation of the invention, a technique is provided
to declare policies used for placing data in one of several types
of persistent storage holding a file system 10. The system 10
provides one or both of a NAS server, or a NAS cache server. A NAS
server exports NFS, CIFS or other network file system 10 protocol
access to one or more file systems. A NAS cache server provides the
same protocol access to data stored on an external NAS system,
while maintaining a cache of frequently accessed files and
directories within the cache server. This invention is a
continuation of <<vseg patent>>, which describes a file
system 10 that can store any block of data in one of a plurality of
types of storage, and of <<cache patent>>, which
describes a cache server appliance that can cache data from one or
more external NAS file servers.
[0034] In this invention, data is exported via one of two
mechanisms, either a NAS Server, or a NAS Cache. Each NAS file
system 10 exported from this invention is exported either from the
NAS server operations module 32, or the NAS cache operations module
34, but not both for a given NAS export.
[0035] Both a NAS Server and a NAS Cache provide a set of exports,
each of which is a root of a file system 10 tree. In this
invention, the administrator can associate a set of data policy
rules with a server export, such that the policy rules are applied
to any object allocated within the export's subtree. Additionally,
for data exported via the NAS Server Operations, the policy rules
can be changed at any directory, by creating a file called
".avere_control" in the directory, and placing within the file a
description of the policy to be applied for that directory, and to
all of its descendents. For data exported via the NAS Cache
Operations, .avere_control files containing policy rules can be
placed at locations in the file system 10 called sub-export roots;
however, any directory can be made into a sub-export root
directory.
[0036] These policies can specify to which type of storage data for
an affected file should be allocated. It should be clear that other
types of policies beyond allocation policies can be specified with
this mechanism. For example, priority information could be
associated with the policies, as well.
[0037] The base system 10 provides a number of pre-defined
policies, including a policy that places all data in HDD storage,
another policy that places all data in SSD storage, and a policy
that places the first N bytes of a file in one type of storage, and
the remainder in another type of storage (typically, these would be
SSD and HDD, respectively).
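A minimal sketch of these three pre-defined policies, written here in Python for illustration (the function names are assumptions; the system's own policies are expressed in the Lisp-like language of the Policy Specifications section below):

    def all_hdd(attrs, offset):
        return "hdd"            # place every block in HDD storage

    def all_ssd(attrs, offset):
        return "ssd"            # place every block in SSD storage

    def first_n_bytes_ssd(n):
        # Place the first n bytes of a file in one type of storage (SSD)
        # and the remainder in another (HDD).
        def policy(attrs, offset):
            return "ssd" if offset < n else "hdd"
        return policy

    media_policy = first_n_bytes_ssd(1 << 20)   # e.g. first megabyte in SSD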
[0038] Structurally, the system 10 is described as follows:
[0039] In FIG. 1, all of the software modules are executed by one
or more general purpose processors in a computer system 10, with
attached Network Interface Cards (NIC cards), attached main memory
accessible to all processors. The computer system 10 includes an
NVRAM card attached to an IO bus (typically PCI Express today), and
as described below, some modules cause data to be transferred over
the PCI Express bus from the computer system 10's main memory to
the NVRAM card to protect said data from loss in the event of a
system crash or power loss. Storage devices are attached to the
computer system 10 via one or more storage controllers on a PCI
Express bus. These controllers attach to magnetic or solid-state
disk drives over either a Serial-attached SCSI (SAS) or
Serial-attached ATA (SATA) bus.
[0040] FIG. 2 shows the hardware components of the system 10
mentioned below. One or more CPU complexes are connected via
internal buses to both main memory and an IO controller, which
provides a number of PCI Express (PCIe) links. One or more of these
PCIe links connects to Network Interface Cards, which send and
receive network traffic. One or more of these PCIe links connects
to a SATA disk controller, which in turn connects to one or more
SATA disk drives over SATA links. One or more of these PCIe links
connects to a SAS controller which connects to one or more disk
drives over SAS links. One or more of these PCIe links connects to
an NVRAM controller card, which holds a copy of some data before it
is written to persistent disk storage.
[0041] Unless otherwise noted below, all of the modules in FIG. 1
execute on one or more of the processor complex, and may access any
of the other components shown in FIG. 2.
[0042] In this system 10, NFS and other NAS protocols are processed
by the NAS Server Operations or the NAS Cache Operations boxes in
FIG. 1, processing requests received by one or more Network
Interface Cards (NICs) in a computer system 10. The NAS server
operations module 32 invokes local storage operations providing an
interface for reading and writing file and directory attributes,
reading and writing file data, and performing directory operations.
The NAS Cache Operations provide similar functionality, but acting
as a cache of data stored at one or more external NFS servers,
where the cache operations module creates and maintains cached
versions of the actively accessed directories and files stored at
the external NFS server. Both of these modules execute on general
purpose processors in the computer system 10.
[0043] In this document, an inode is a file system 10 object that
represents a file, directory, or symbolic link. In
the case of data accessed by the NAS Server Operations (NSO), there
is one such object for each file, directory or symbolic link in the
file system 10. In the case of data accessed by the NAS Cache
Operations (NCO), there is one such inode for each file from the
back-end file server that is currently cached in the server.
[0044] The Buffer module 22 is used by both the NSO and NCO modules
to read and write file system 10 data. The module tags fixed sized
buffers, stored in the processor's main memory, with an inode
object pointer, qualified by a byte offset within that inode. One
distinguished inode represents the data in the collection of
physical disks holding a file system 10, while others represent
individual files, directories or symbolic links. The Buffer module
22 also keeps track of which ranges of bytes within each buffer
need to be copied to a separate NVRAM card on the computer system's
PCI Express bus. This copy is initiated from main memory to the
NVRAM card's memory over the PCI Express bus by the Buffer package
when a buffer's last write reference is released.
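A minimal sketch, with assumed names, of the bookkeeping the Buffer module 22 is described as performing: fixed sized buffers are tagged with an inode identifier and a byte offset, and each buffer remembers which byte ranges must still be copied to the NVRAM card:

    from dataclasses import dataclass, field

    BUF_SIZE = 16 * 1024   # illustrative; matches the 16 KB block size example below

    @dataclass
    class Buffer:
        inode_id: int
        offset: int                   # byte offset of this buffer within the inode
        data: bytearray = field(default_factory=lambda: bytearray(BUF_SIZE))
        nvram_ranges: list = field(default_factory=list)  # (start, end) byte ranges

    buffers = {}   # (inode_id, offset) -> Buffer

    def write_buffer(inode_id, offset, start, payload):
        buf = buffers.setdefault((inode_id, offset), Buffer(inode_id, offset))
        buf.data[start:start + len(payload)] = payload
        buf.nvram_ranges.append((start, start + len(payload)))  # copied on release
        return buf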
[0045] All types of inodes have filling methods and cleaning
methods in the module named "Buffer Fill/Clean"; these methods are
responsible for filling a buffer with data stored in persistent
storage when the buffer is first accessed, and for writing updates
to persistent storage sometime after the buffer has been modified,
respectively. These methods cause data to be transferred over a PCI
or PCI Express bus from disk or SSD drives 48 attached to a SAS or
SATA bus, in the preferred embodiment. However, it should be clear
to someone skilled in the art that other processor IO buses than
PCI or PCI Express can be used, and other storage attachment buses
than SAS or SATA can be used, as well.
[0046] The Inode attribute manager 24 is responsible for updating
attributes in a given inode. These attributes include file access,
change and modification times, file length and ownership and
security information. This module is typically implemented by the
main processor, acting on memory managed by the aforementioned
Buffer module 22.
[0047] The Directory manager 26 treats a directory object as a set
of mappings between file names and inode identifiers. It can create
entries, remove entries, and enumerate entries in a directory. It
executes on the main system 10 processors.
[0048] The Data manager 28 is responsible for copying data to and
from buffers provided by the Buffer module 22, as part of handling
read and write NAS requests.
[0049] The Inode object allocator 30 is responsible for allocating
and tagging inodes in the local file system 10. These inodes may be
tagged by any arbitrary string of bytes, so that the same mechanism
can be used by either the NSO or NCO modules. This module is also
responsible for deleting files, and pruning files from the cache
when space gets low and an NCO is operating.
[0050] The Inode space allocator 36 module is responsible for
adding disk blocks to existing inodes.
[0051] The Inode space truncator 38 is responsible for freeing disk
blocks referenced by existing inodes. The resulting blocks will
then read as zeroes after the truncator has completed.
[0052] The Block allocator 42 is responsible for allocating blocks
of persistent storage for the Inode space allocator 36, and for
freeing blocks of persistent storage for the Inode space truncator
38. The blocks whose allocation status is tracked by this module
are stored in the disk drives shown in FIG. 2.
[0053] The sub-export policy manager 44 is responsible for mapping
incoming file handles tagged with a sub-export tag to the specific
policy associated with that sub-export. Sub-exports are one
mechanism used to associate a specific policy with an arbitrary
directory, and are the only mechanism that can be used to associate
a policy with an arbitrary directory exported via the NAS cache
operations module 34.
[0054] The disk driver module 46 actually performs IO operations to
HDD and SSD drives 48, and any other types of persistent storage
present in the system 10.
[0055] There are two mechanisms for determining the policy to apply
to a particular file, one typically used in NSO subtrees, and one
which is typically used within NCO subtrees. The NSO subtree policy
mechanism has the benefit of requiring less information in the file
handles returned to the NAS client, but does not work in NCO
subtrees. The NCO subtree policy mechanism actually works in
both types of subtrees, and is the only one that works reliably in
NCO subtrees. In the preferred implementation of the invention, the
NSO mechanism is used in any NSO subtrees, and the NCO mechanism is
used in any NCO subtrees, but other implementations might use the
NCO mechanism for both types of subtrees.
[0056] Within NSO subtrees, each inode has a policy object, and
this object is inherited from its parent directory at the time the
object is created or found from a directory search. That is, an NSO
directory lookup operation copies a pointer to the parent directory
inode's policy object into the newly located child's inode, and a
vdisk create operation does the same thing, initializing the newly
created object's inode's policy object to that of the parent
directory's inode.
[0057] Within NCO subtrees, each inode has a sub-export ID, which
is inherited from its parent directory at the time the object is
created or found from a directory search. That is, an NCO directory
lookup operation copies the sub-export ID from the parent directory
inode's sub-export ID, and an NCO create operation does the same
thing, initializing the newly created object's inode's sub-export
ID to the value associated with the parent directory's inode. If
the looked-up object is the root of a sub-export tree, on the other
hand, the new sub-export's ID is used instead of the inherited ID.
Given a sub-export ID, the sub-export manager can locate the
current policy object for that sub-export ID, through a simple
table lookup. Thus, with NCO operations as well, the invention can
obtain a policy object from any inode, typically inherited from the
policy object in use by the object's parent directory.
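A hedged Python sketch of the NCO mechanism (all names are illustrative assumptions): the sub-export ID is inherited on lookup unless the looked-up object is itself a sub-export root, and the policy object is then found through a simple table lookup:

    class Inode:
        def __init__(self, name, sub_export_id=0,
                     is_subexport_root=False, root_id=None):
            self.name = name
            self.sub_export_id = sub_export_id
            self.is_subexport_root = is_subexport_root
            self.root_id = root_id

    sub_export_policies = {0: "global default policy"}   # ID -> policy object

    def nco_lookup(parent, child):
        # Inherit the parent's sub-export ID, unless the child is the root
        # of a sub-export tree, in which case its own ID is used instead.
        if child.is_subexport_root:
            child.sub_export_id = child.root_id
        else:
            child.sub_export_id = parent.sub_export_id
        return child

    def policy_for(inode):
        return sub_export_policies[inode.sub_export_id]   # simple table lookup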
[0058] A global policy object can be provided to define the
behavior of the system 10 in the absence of any overriding policy
control file, as well.
[0059] When a policy object is updated, the new policies will be
directly utilized by cleaning operations for each inode sharing the
policy automatically.
[0060] More complex is the situation in an NSO tree where a new
policy object is placed upon a directory that does not already
contain a policy control file, but which inherited a policy from an
ancestor's control file, and which also has a number of descendents
already using that ancestor's control file. In this case, simply
changing the policy object on the inode whose directory was just
assigned a new policy control file will not automatically update
the policy references in any of that directory's descendents. To
handle this situation, a "path verifier" is associated with version
number with each policy object, and labels each inode with the
policy object's last path verifier. When a new policy is inserted
into a directory tree by creating or deleting a policy control
file, the policy of the parent directory (in the case of the
create) or the policy being deleted (in the case of the deletion of
a policy control file) has its version incremented. When a policy
object is encountered whose path verifier does not match the path
verifier cached in the referencing inode, the system 10 walks up
the directory path, following each directory's ".." entry in turn,
until a directory with a path verifier matching its policy object
is encountered. This policy object is then stamped into the
referencing inode, along with its current path verifier. This
algorithm requires that each non-directory file system 10 object,
that is, regular files and symbolic links, also store a reference
to the object's parent directory. The details of how this is done
are provided below, but the fundamental idea is to store up to two
back pointers from each inode to parent directory entries, and to
store an additional exception table for those files that have more
than two parent links, a fairly rare occurrence. Directory ".."
entries are not counted in the hard link table of the parent
directory.
[0061] Recall that the continued invention provides an aggregate
abstraction containing data segments consisting of moderate sized
arrays of storage blocks comprised of the same type of persistent
storage (within each segment). Given a representation of the policy
to be applied to allocating any given inode's data, this
information is used by the cleaner to control what type of segment
will supply blocks for newly written data. Thus, for example, if a
policy specifies that the first megabyte of a file should be
allocated from solid state disks (SSDs), and a cleaner is cleaning
some blocks in the range of 262144-278528, the cleaner will
allocate the blocks for this data from a segment allocated from
SSD.
[0062] Note that while the preferred realization of this invention
includes both a NAS cache operations module 34 and a NAS server
operations module 32, other combinations are possible. For example,
a system 10 providing NAS Cache Operations alone would be a useful
caching appliance. A system 10 providing a NAS server operations
module 32 alone would be a useful file server.
[0063] In describing the present invention, reference is first made
to a file system 10 as described in U.S. patent application Ser.
No. 12/283,961, incorporated by reference herein.
[0064] That invention provided a file system 10 in which data is
stored in multiple segments, each segment being comprised of a
different type of persistent storage, for example, flash storage,
expensive FC disk drives, inexpensive SATA disk drives, and/or
battery backed RAM. In that system 10, individual blocks, including
both data and meta-data blocks associated with a file can be
allocated from separate segments.
[0065] This invention extends U.S. patent application Ser. No.
12/283,961, incorporated by reference herein, by providing a rich
variety of policies for determining from which segment a particular
block allocation should be performed, and allowing different
policies to apply to different portions of the file system 10 name
space. In addition, this invention extends the first invention to
include file system 10 cache appliances.
[0066] Below, a policy language is described that allows the
specification of how data and meta data blocks should be allocated
for affected files. Then, the detailed implementation of the
storage system 10 is described, including a description of the
mechanisms for providing a global default policy, a default policy
for each exported file system 10, and a mechanism for overriding
these two default policies on a per-directory or per-file
basis.
[0067] Policy Specifications
[0068] A policy specification based on a Lisp-like syntax is
provided. In particular, the definition of a function named
"allocType" is allowed that is called with two parameters, the
attributes of the inode being cleaned, and the offset at which a
particular block is located within the file. The function returns
the atom "ssd" if the data should be located in SSD storage, and
returns the atom "hdd" if the data should be located in HDD
storage. The attributes provided include, but are not limited
to:
[0069] length--the file's length in bytes
[0070] extension--the last component of the file name, for example
"mp3" for a file named "music.mp3"
[0071] type--one of "directory", "file" or "symlink".
[0072] mode--the Unix protection bits
[0073] (defun allocType (attrs offset) (return "flash"))
[0074] or
[0075] (defun alloc-type (attrs offset) (if (or (less-than offset
0x400000) (eq attrs.type "dir"))) (return "flash") (return
"hdd")))
[0076] or
[0077] (defun alloc-type (attrs offset) (if (equal attrs.type
"directory") (return "flash") (return "hdd"))
[0078] or
[0079] (defun alloc-type (attrs offset) (if (greater-eq attrs.size
100000) (return "hdd") (return "flash)))
[0080] The usual comparison and Boolean operators of "eq" "not-eq"
"less-than" "greater-than" "less-eq" "greater-eq" "not" "and" and
"or" are provided. Arithmetic operators "add" "subtract" "times"
"divide" are also provided on integers. Strings can be compared for
equality only, while integers can be compared with any of the
comparison operators.
[0081] Storage Format
[0082] The storage in this invention is divided into different
classes, based on the type of the underlying persistent media. The
storage in this invention is organized as an aggregate, providing
an array of persistent storage blocks of different types; this
organization is shown in FIG. 3. The blocks of each type of storage
are grouped into contiguous sets of fixed sized segments. For each
segment in the aggregate's block address space, the file system 10
has a 16 bit segment descriptor giving the type of persistent
storage backing that portion of the address space. For example, an
aggregate might have a 16 KB block size, and a segment size of 1
GB, holding a contiguous collection of 65536 blocks. The segment
descriptor corresponding to a particular segment gives two pieces
of information, a bit indicating whether there are any free blocks
within this segment, and 15 bits describing the type of storage
underlying this segment.
[0083] The aggregate also stores a bitmap array, providing one bit
per block in the aggregate, where a value of 1 means the
corresponding block is allocated, while a value of 0 means that the
block is free and available for use.
[0084] When a particular block of a particular type of storage
needs to be allocated, the system 10 effectively searches for a
segment descriptor indicating both that the segment is not full,
and that the segment consists of the desired type of persistent
storage. A block is then allocated from that segment by consulting
the allocation bitmap for the segment in question, returning the
desired number of blocks and setting the corresponding bitmap bits.
If the segment is full, the segment descriptor is also modified to
indicate that the segment is full.
[0085] When freeing a block, the corresponding bit is simply
cleared in the allocation bitmap, and the segment descriptor for the
segment containing the freed block is overwritten to indicate
that the segment is not full.
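A minimal Python sketch of this allocate/free scheme under the stated layout (one 16 bit descriptor per segment, packing a free-space bit with 15 bits of storage type; one bitmap bit per block, with 1 meaning allocated); the names are illustrative:

    BLOCKS_PER_SEGMENT = 65536        # e.g. 1 GB segments of 16 KB blocks
    FREE_BIT = 1 << 15                # segment still has at least one free block

    def allocate_block(descriptors, bitmap, wanted_type):
        for seg, desc in enumerate(descriptors):
            if (desc & FREE_BIT) and (desc & 0x7FFF) == wanted_type:
                base = seg * BLOCKS_PER_SEGMENT
                for i in range(BLOCKS_PER_SEGMENT):
                    if not bitmap[base + i]:
                        bitmap[base + i] = 1
                        if all(bitmap[base:base + BLOCKS_PER_SEGMENT]):
                            descriptors[seg] &= ~FREE_BIT   # segment now full
                        return base + i
        raise RuntimeError("no free block of the requested storage type")

    def free_block(descriptors, bitmap, block):
        bitmap[block] = 0
        descriptors[block // BLOCKS_PER_SEGMENT] |= FREE_BIT   # not full anymore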
[0086] Inodes provide the key file abstraction used in this
invention: they provide a way to organize a subset of the storage
into a contiguous range of file blocks that can be read or written
as a contiguous set of data blocks. An inode is a small (256-512
byte) structure describing the location of the data blocks making
up a file. This abstraction is used to store the bitmap array, the
segment descriptor array, the inode hash table and the inode table
itself. Note that these key inodes are stored in the aggregate
header itself, rather than in the inode file.
[0087] With the exception of the meta data files described in the
paragraph above, the inodes of all of the files in the system 10
are stored in the inode table file. Each inode is tagged by a file handle,
a variable length string uniquely identifying the file. The inode
can also be named by its address within the inode file itself in
certain contexts, usually that of internal pointers in the file
system 10 itself.
[0088] The inode table file consists of an array of slotted pages,
with a header at the start of the page specifying how many inodes
are in the file, and specifying how many bytes of file handle are
stored at the end of the slotted page. After the header, there is
an array of inodes, then some free space, and finally the set of
file handles associated with the inodes in this slotted page. The
set of file handles is stored at the end of the slotted page, and
is allocated from back to front. FIG. 4 shows the contents of an
inode file slotted page.
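The space accounting for such a page can be sketched as follows (the field widths and page size are illustrative assumptions, not the on-disk format):

    import struct

    PAGE_SIZE = 16 * 1024
    HEADER = struct.Struct("<HH")   # (inode_count, handle_bytes) -- assumed widths

    def free_space(inode_count, inode_size, handle_bytes):
        # Inodes grow forward from the header; file handles grow backward
        # from the end of the page; free space is whatever lies between.
        used_front = HEADER.size + inode_count * inode_size
        return PAGE_SIZE - used_front - handle_bytes

    # e.g. a page of 40 inodes of 256 bytes with 1200 bytes of handles:
    print(free_space(40, 256, 1200))   # 4940 bytes remaining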
[0089] The system 10 typically locates an inode either by its
offset within the inode table file, or by its file handle. Finding
an inode by offset within the inode table file is very simple:
since inodes do not span block boundaries, the system 10 simply
requests the buffer containing the desired inode offset, and
locates the inode at the appropriate offset within the block. In
order to find an inode by file handle, the file handle is hashed
into a 32 bit value, which is then computed modulo the inode hash
table size. The inode hash table file is then consulted; this file
is treated as an array of 8 byte offsets into the inode table file,
giving the offset of the first inode in that hash conflict chain.
The inode file is then searched, loading one slotted page at a
time, until an inode with the correct file handle is located, if
any.
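A sketch of that lookup path in Python; zlib.crc32 merely stands in for the unspecified 32 bit hash, and the per-inode chain pointer is an assumption about how the conflict chain is threaded through the slotted pages:

    import zlib

    def find_inode(handle, hash_table, read_inode_at):
        slot = (zlib.crc32(handle) & 0xFFFFFFFF) % len(hash_table)
        offset = hash_table[slot]          # 8-byte offset into the inode table file
        while offset is not None:
            inode = read_inode_at(offset)  # loads one slotted page at a time
            if inode.handle == handle:
                return inode
            offset = inode.next_in_chain   # follow the hash conflict chain
        return None                        # caller may then create a new inode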
[0090] If an inode needs to be created with a particular file
handle, then after the search is complete, if no inode has been
located with the desired file handle, a new inode is allocated.
[0091] The header in the inode table file has a list of all slotted
blocks that have room for at least one more inode. This list is
consulted, and the block at the head of the list with room for
another inode is read into the buffer package, and the new inode is
created in that block. If no existing blocks have room for another
inode, a collection of new blocks is allocated to the inode file
by the inode space allocator 36, and the new inode is allocated
from one of these blocks.
[0092] Operation Overviews
[0093] The following paragraphs give a description of the
invention, describing how each of the core modules described in
FIG. 1 operate.
[0094] NAS Server Operations
[0095] The NAS Server Operations (NSO) module receives incoming NAS
calls, in this case, NFS operations, and executes them by making
calls on the Inode attribute manager 24, the Directory manager 26,
the Data manager 28 and the Inode object allocator 30.
[0096] The following describes how each operation is performed, in
detail.
[0097] NSO File Create
[0098] When a file, symbolic link, or directory is created, an
inode is allocated using the inode allocator, which uses a simple
linked list implementation to find a free block of space to hold a
new inode. One of the fields in this inode is a pointer back to the
directory in which the file name is being created, and this field
is initialized at this time with the inode address of the parent
directory's inode. This address is represented as the byte offset
of the inode within the Inode Table file.
[0099] At this time, the policy pointer is also set in the newly
created file. The policy object is inherited from the parent
directory, as follows. First, the policy object named by the parent
directory is referenced, and the create operation compares the
referenced policy object's version number with the policy version
number in the parent directory inode itself. If they match, this
policy object reference (represented as the inode address of the
file storing the policy description), and the corresponding version
number are stored in the newly created file. If the two versions do
not match, the policy path procedure is performed to locate the
current policy object for the parent directory, and the newly
determined policy object address and policy version number are then
stored in the newly created file's inode.
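In outline (names are assumed, reusing the effective_policy sketch from the path-verifier discussion above):

    def inherit_policy_on_create(parent, child):
        policy = parent.policy
        if policy.verifier == parent.verifier:
            # Versions match: copy the policy reference (the inode address
            # of the policy description file) and version number as-is.
            child.policy, child.verifier = policy, parent.verifier
        else:
            # Versions differ: re-derive the current policy first.
            policy = effective_policy(parent)
            child.policy, child.verifier = policy, policy.verifier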
[0100] If the operation is a symbolic link or directory, the first
data block of the newly created file is initialized as well, by
simply updating the buffer cache block tagged with the newly
created inode and offset 0 within the inode, filling it with the
initial contents of the symbolic link, or the header of an empty
directory, respectively.
[0101] NSO File Delete
[0102] When a file, directory, or symbolic link is deleted, first
the operation is validated (the type of the operation matches the
type of the object, and for directories, that the directory is
empty). Once validated, a delete operation then checks whether the
object has a link count of 1, or greater than 1. If the object has
a high link count, then delete will only remove the reference to
the object; the object itself is left untouched except for
decrementing its link count. To maintain the parent pointers from
the inode, the reference to the parent directory is removed from
the set of back pointers from the deleted inode. The parent
directory's contents are also modified to remove the entry pointing
to the deleted file; this results in one or more modified blocks in
the buffer cache that will eventually be written out by the Buffer
Cleaner to the directory's inode.
[0103] If the last link to an object is removed with delete, then
in addition to removing the parent link from the object, the file
system 10 operation invokes the truncate operation by calling the
inode space truncator 38 to free the SSD or HDD blocks allocated to
the file back to the pool of free disk blocks for the aggregate.
Note that each block is allocated from a particular type of
storage, and is freed back to the same type of storage.
[0104] If the object deleted is a policy file, the delete operation
increments the policy inode's version number, forcing all objects
that were using the policy object to perform the policy path
procedure to determine the new correct policy to utilize.
[0105] If the link count on the deleted object is still greater
than zero, the file still exists on another path. In this case, the
policy path procedure is performed on the file to verify that there
is a valid policy associated with the unlinked file.
[0106] NSO File Rename
[0107] File rename is essentially a combination operation that is
equivalent to a delete of the target of the rename, a hard link
from the source object to the target name, and a delete of the
source object (which simply removes the name from the directory and
updates the back pointer from the source object's inode, since the
source object would have a link count>1 in a rename
operation).
[0108] If a file rename occurs within a single directory, no policy
path updates are required. If a normal file is renamed between two
directories, the policy path procedure is performed on this file
alone to regenerate its policy pointer. If a directory is renamed
between two other directories, the policy version number of the
policy associated with the renamed directory is incremented, which
will force the policy path procedure to be performed for all of the
children of the renamed directory; this increment need not occur if
this policy is stored in the directory being renamed itself.
[0109] NSO File Truncate
[0110] A file truncate simply invokes the inode space truncator 38
module to remove all of the disk blocks allocated to the file's
inode, and then updates the disk buffer that stores the inode
itself, setting the file's new length in the inode. Note that the
allocation policy has no effect on the truncate operation, since
the system 10 is only freeing already allocated blocks, and of
course those blocks have to be freed to their original storage.
[0111] NSO File Link
[0112] A file link operation simply adds a new reference from a new
directory entry to an existing file or symbolic link's inode. The
operation adds a new entry to the target directory, increments the
link count on the target inode, and adds a back pointer to the
target inode's back pointer list. The back pointer information is
the storage address of the directory's inode. This back pointer is
used by the policy path procedure to find the appropriate policy to
use for allocating space for the file.
[0113] Adding a file link does not change the default policy in use
by the linked file.
[0114] NSO File Read
[0115] A file read operation operates very simply: the buffers
tagged with the target file's inode and the required offsets to
cover the range being read are loaded into memory, and then
transferred over the network to the client making the request.
[0116] NSO File Write
[0117] A file write operation operates simply: the buffers tagged
with the target file's inode and the required offsets to cover the
range being written are loaded into memory, and then filled with
data received over the network from the client sending the request.
Any buffers that are overwritten in their entirety do not have to
be filled from HDD/SSD, but can be created empty.
[0118] The updated buffers are marked as dirty, and will be cleaned
eventually by a buffer clean operation, as described below.
[0119] Note that dirty buffers modified by this Write operation are
transferred to the NVRAM card, shown in FIG. 2, where they remain
until the buffer clean operation transfers the data to disk
storage, at which point the NVRAM buffers can be freed.
[0120] NAS Cache Operations
[0121] The NAS Cache Operations function as a continuation of
<<Avere cache patent>>. These operations are described
in detail in that invention disclosure, and typically modify the
file system 10 state by reading and writing attributes, directory
contents, and file data, in response to incoming NFS operations,
and responses from calls made to the NFS back-end server whose
contents are being cached.
[0122] The NAS cache operations module 34 provides the same
external interface as the NSO module, although the implementation
is significantly different (and described in detail in the
aforementioned cache patent). However, because this module
implements a cache, files may be discarded from the cache at
various times, changing the algorithms that must be used to track
the policy in use by any given file. To a lesser degree, the
algorithms for determining the storage pools from which to allocate
storage also change. These changes from <<the cache
invention>> are described in detail below.
[0123] In this module, instead of storing policy pointers directly
in the inode, the invention stores sub-export IDs in each inode and
uses the Sub-Export manager to map these IDs into policies as the
inodes are actually used by NAS Cache Operations. This is done
because the system 10 in general has no way to determine the parent
object of a file if the file is no longer in the cache, with the
result that there may be no way to determine the current policy
object in use.
[0124] Sub-export IDs, on the other hand, are returned to the NAS
client as part of the file handle, with the result that once a file
handle is returned to a NAS client, the parent object never needs
to be located again, since the sub-export ID is present in the file
handle. Thus, the use of sub-export IDs as intermediaries between
the inode and the corresponding policy allows the NAS Cache
Operations module to reliably determine any object's current policy, even
when an arbitrary number of objects are discarded from the
cache.
[0125] Inodes that are sub-export roots are located at node boot
time by evaluating the path names configured for the sub-export
roots, and marking the evaluated inodes as sub-export roots. Then,
when an NCO Lookup, Create, Mkdir, Symlink or ReaddirPlus returns
a new file handle, it creates the new inode with a sub-export ID
inherited from the new object's parent directory. Should some other
operation need to create an inode for an object not present in the
cache, the sub-export ID is determined by simply setting the ID to
the sub-export ID present in the incoming file handle.
[0126] All of the NAS Cache Operations are described below in
detail:
[0127] NCO File Create
[0128] The NCO File Create operation is performed with parameters
that provide a parent directory file handle, a new file name and
updated file attributes.
[0129] In order to determine the policy to associate with the newly
created file, the sub-export ID associated with the parent, if any,
is propagated to the new or old child. The sub-export manager is
then consulted to determine the policy object for the new
object.
[0130] NCO File Delete
[0131] The NCO File Delete operation does not need to manage
sub-export IDs.
[0132] NCO File Rename
[0133] As in NSO File Rename, the NCO File Rename operation is the
equivalent of the removal of the target of the rename, if any,
followed by the creation of a hard link from the source file to the
target file, followed by the removal of a hard link from the
source.
[0134] NCO File Truncate
[0135] There are no policy-related changes required to the cache
NCO module for the file truncate operation.
[0136] NCO File Link
[0137] The NCO File Link operation adds a new name to an existing
file. Since no new objects are created or looked-up by this
operation, it requires no specific changes to manage sub-export
IDs.
[0138] NCO File Read
[0139] The NCO File Read operation has no policy management related
changes required, since the incoming sub-export ID determines the
applicable policy with one lookup done by the Sub-export manager.
However, the NCO File Read operation may read data from the
back-end server and write that data to the local cache inode. Thus,
in the case where the data being read is not present in the cache
inode, the NCO File Read operation reads the desired data from the
back-end server and writes it to the cache inode. Before doing
this, the NCO File Read operation sets the inode's policy to the
policy returned by the sub-export manager. Once the required policy
has been set, eventually a cleaner will write that data to the
appropriate type of persistent storage as dictated by the
policy.
[0140] NCO File Write
[0141] The NCO File Write operation has no policy management
related changes, but does need to determine the policy specified by
the sub-export ID, in the same way that NCO File Read does.
[0142] Specifically, the sub-export ID in the file handle is used
to initialize the field in the cache inode, if one is not already
present in the NCO module. Once the sub-export ID is
available, the corresponding policy can be determined with a single
call to the sub-export manager, and the resulting policy is placed
in the inode. Eventually a cleaner will collect modified buffers
from the memory cache and write these modified buffers to the
appropriate type of persistent storage, as dictated by this
policy.
[0143] Note that dirty buffers modified by this Write operation are
transferred to the NVRAM card, shown in FIG. 2, where they remain
until the buffer clean operation transfers the data to disk
storage, at which point the NVRAM buffers can be freed.
[0144] Inode Attribute Manager 24
[0145] The inode attribute manager 24 is responsible for reading
and writing attributes in files stored locally, either as part of a
standard NAS file system 10 for use by the NSO, or as part of a
cached file accessed by the NCO.
[0146] The inode table inode is a file that holds slotted pages
full of inodes, as shown in FIG. 4. The Inode attribute manager 24
(IAM) provides several operations:
[0147] GetInodeByFH--This call locates an inode by file handle, and
returns a referenced inode structure to its caller. It works by
computing the hash value of the file handle, and then using the
buffer package to load the block of the hash file that holds the
head of that hash bucket. The operation then proceeds to iterate
through the linked list of inodes, reading each inode in the list
from the slotted page holding that inode's offset, until either a
null next pointer is encountered, or until the inode with the
desired file handle is located.
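The following C sketch illustrates this hash-chain walk; the
structure layout and the helper functions are assumptions made for
exposition, not the actual on-disk format.

    #include <stdint.h>
    #include <string.h>

    /* Assumed inode fields relevant to the hash-chain walk. */
    typedef struct DiskInode {
        uint8_t  fileHandle[32];
        uint32_t fileHandleLength;
        uint64_t nextOffset;       /* next inode in this hash chain, 0 == end */
    } DiskInode;

    extern uint32_t hashFileHandle(const uint8_t *fh, uint32_t len);
    extern uint64_t readHashBucketHead(uint32_t bucket);  /* via buffer package */
    extern DiskInode *readInodeAtOffset(uint64_t offset); /* from slotted page */

    DiskInode *getInodeByFH(const uint8_t *fh, uint32_t fhLen)
    {
        uint32_t bucket = hashFileHandle(fh, fhLen);
        uint64_t offset = readHashBucketHead(bucket);
        while (offset != 0) {                    /* walk the linked list */
            DiskInode *ip = readInodeAtOffset(offset);
            if (ip->fileHandleLength == fhLen &&
                memcmp(ip->fileHandle, fh, fhLen) == 0)
                return ip;                       /* desired handle located */
            offset = ip->nextOffset;
        }
        return NULL;                             /* null next pointer reached */
    }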
[0148] GetInodeByOffset--This call locates an inode by its offset
in the inode file. The operation simply creates an inode object
whose offset in the inode table is set to the input parameter
specifying the inode offset.
[0149] MapInode--This call takes an inode returned by GetInodeByFH
or GetInodeByOffset, and returns a referenced buffer to the buffer
holding the block of the inode table inode that holds the inode in
question.
[0150] Once an inode has been mapped, the caller can simply modify
the inode returned by MapInode and then release the buffer back to
the buffer system, marking it as modified.
[0151] If the caller needs to just read the contents of the inode,
it can do so after calling MapInode as well. In this case, the
buffer holding the inode table block is released without marking
the underlying buffer as modified.
[0152] Directory Manager 26
[0153] The directory manager 26 implements a simple abstraction of
tagged mappings between file names and file handles. Each mapping
entry is tagged with a 64 bit identifier (sometimes called a
cookie) used by NFS readdir operations as they iterate over the
entries in a directory. The manager provides the following
operations on directory inodes:
[0154] int32_t append(char*namep, uint8_t*fileHandle, uint32_t
fileHandleLength, uint64_t cookie).
[0155] This call adds the name and file handle to the directory,
replacing an existing entry if one exists. The cookie value is also
stored. An error code is returned, or 0 for success.
[0156] int32_t remove(char*namep)
[0157] This call removes the entry from the directory. It returns 0
for success, and otherwise a non-zero error code indicating the
reason for failure, such as the entry not existing.
[0158] int32_t readdir(uint64_t startCookie, char*space, uint32_t
spaceSize, uint32_t*bytesReturnedp)
[0159] This call copies out an integral number of directory entries
from the directory, starting at the entry whose tag matches the
value specified in startCookie. Each entry contains a 32 bit file name
length, followed by that number of bytes of file name, followed by
a 32 bit file handle length, followed by that number of file handle
bytes, followed by a 64 bit integer giving the cookie tag of the
next file name entry.
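A short C sketch of packing one entry in this layout follows; the
function name is hypothetical, and host byte order is used for
simplicity.

    #include <stdint.h>
    #include <string.h>

    /* Pack one directory entry as readdir returns it: name length,
       name bytes, file handle length, file handle bytes, then the
       64 bit cookie tag of the NEXT entry. Illustrative only. */
    uint32_t packDirEntry(uint8_t *out, const char *name,
                          const uint8_t *fh, uint32_t fhLen,
                          uint64_t nextCookie)
    {
        uint32_t nameLen = (uint32_t)strlen(name);
        uint8_t *p = out;
        memcpy(p, &nameLen, 4);     p += 4;
        memcpy(p, name, nameLen);   p += nameLen;
        memcpy(p, &fhLen, 4);       p += 4;
        memcpy(p, fh, fhLen);       p += fhLen;
        memcpy(p, &nextCookie, 8);  p += 8;
        return (uint32_t)(p - out); /* bytes consumed in the output buffer */
    }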
[0160] int32_t clear( )
[0161] This call empties out the contents of a directory. The
resulting directory contains no entries.
[0162] Data Manager 28
[0163] The data manager 28 implements a normal user data file
abstraction for use by the NSO and NCO modules, in terms of
operations on the buffer module 22. It implements a read operation
that takes an inode, an offset and a length, and returns a set of
data buffers for inclusion in a network buffer. It also implements
a write operation that takes an inode, an offset, a length, and a
set of (pointer, length) pairs pointing at some data bytes, and
copies the data into the buffers from the data bytes. In the read
case, the inode's length is consulted to limit the read, and on a
write operation, the inode's length is updated if the file is
extended.
[0164] The operations provided are described in detail below:
[0165] int32_t read(CfsInode*ip, uint64_t offset, uint32_t length,
uint32_t bufCount, BufHandle*outBufspp)
[0166] This call reads data from the inode specified by ip, at the
offset specified by the offset parameter, for the number of bytes
specified by the length parameter. A number of read-referenced
buffers, one for every block's worth of data, aligned to an
integral block boundary, are returned in *outBufspp.
[0167] The CfsInode structure simply stores the offset of the
desired inode within the inode file. The underlying inode is
located by calling the buffer package to read the appropriate
containing block of the inode file.
[0168] int32_t write(CfsInode*ip, uint64_t offset, uint32_t length,
uint32_t iovCount, IovEntry*iovsp)
[0169] This call writes data from one or more
network buffers, whose data portion is described by the
concatenation of a set of IovEntry structures. Each IovEntry
consists of a pointer and a length, and describes a simple array of
bytes to be written to the file.
[0170] Furthermore, if the value of (length+offset) is greater than
the length attribute in the underlying inode, the inode's length is
set to the new value of (length+offset).
[0171] int32_t setLength(CfsInode*ip, uint64_t newLength).
[0172] This call changes the length of the file specified by the
parameter ip. The buffer holding the inode is updated, and if the
length is being reduced, any data blocks allocated to the inode are
freed by calling the truncate function in the inode space truncator
38, described below.
[0173] Inode Object Allocator 30
[0174] The inode object allocator 30 is responsible for allocating
and freeing inodes as files and directories are created and
deleted. Inodes are managed by maintaining a free inode list in a
header in the inode file (stored where the inode at offset 0
would be). If the inode free list is empty, a new block of file
space is allocated to the inode file, and the newly available
inodes are added to the free list.
[0175] Freeing an inode is simple: the inode is simply added to the
inode free list.
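A C sketch of this free-list management follows; the header layout
and helper names are assumptions for illustration.

    #include <stdint.h>

    /* Assumed header stored where the inode at offset 0 would be. */
    typedef struct InodeFileHeader {
        uint64_t freeListHead;    /* offset of first free inode, 0 == empty */
    } InodeFileHeader;

    extern uint64_t readFreeLink(uint64_t inodeOffset);   /* next-free pointer */
    extern void     writeFreeLink(uint64_t inodeOffset, uint64_t next);
    extern uint64_t growInodeFile(InodeFileHeader *hdr);  /* add a block of
                                                             inodes, return one */

    uint64_t allocInode(InodeFileHeader *hdr)
    {
        if (hdr->freeListHead == 0)             /* free list exhausted */
            return growInodeFile(hdr);          /* extend file, refill list */
        uint64_t off = hdr->freeListHead;
        hdr->freeListHead = readFreeLink(off);  /* pop the head */
        return off;
    }

    void freeInode(InodeFileHeader *hdr, uint64_t off)
    {
        writeFreeLink(off, hdr->freeListHead);  /* push onto the free list */
        hdr->freeListHead = off;
    }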
[0176] <<simple picture of a file containing an array of
inodes, including a header where inode at offset 0 would
be>>
[0177] Inode Space Allocator 36
[0178] The inode space allocator 36 is responsible for adding
blocks to an already created inode. An inode stores pointers to a
set of disk blocks, in an indirect block tree similar to that used
by FreeBSD and other BSD-based Unix systems.
[0179] In order to handle multiple types of storage, however, the
inode space allocator 36 needs to be extended to choose blocks from
the desired type of storage.
[0180] As described above, each inode is associated with a policy
having an allocType method that maps offsets within an inode into a
particular type of storage. When the buffer cleaner module needs to
write out the dirty buffers associated with the inode, it queries
the inode's policy object to determine which type of storage is
required for each block. The inode space allocator 36 then calls
the block allocator 42 to allocate the required number of blocks of
each type of storage, and then schedules the writes from the memory
buffers to the newly allocated blocks. Finally, once the writes
complete, any overwritten blocks are freed back to their respective
typed storage pools via the block allocator 42 module.
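The following sketch illustrates, under assumed names, how the
cleaning path might query the policy's allocType method for each
dirty block before calling the block allocator 42.

    #include <stdint.h>

    typedef struct Policy Policy;                   /* opaque for this sketch */
    typedef enum { STORAGE_SSD, STORAGE_HDD } StorageType;

    /* Assumed interfaces: the policy's allocType method and the
       typed block allocator. */
    extern StorageType policyAllocType(const Policy *p, uint64_t offset);
    extern uint64_t    blockAlloc(StorageType t);   /* block allocator 42 */

    /* Allocate one block per dirty buffer offset, typed by the policy. */
    void allocateForClean(const Policy *p, const uint64_t *offsets,
                          uint64_t *blocksOut, int count)
    {
        for (int i = 0; i < count; i++) {
            StorageType t = policyAllocType(p, offsets[i]);
            blocksOut[i] = blockAlloc(t);  /* from the matching typed pool */
        }
    }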
[0181] Inode Space Truncator 38
[0182] The inode space truncator 38 is responsible for freeing
blocks when a file's size is reduced, freeing the disk blocks
allocated past the new end-of-file (EOF). This module is
essentially unmodified in this invention, since it simply frees the
blocks pointed to by the indirect block trees in the inode.
[0183] The block allocator 42 module knows which type of storage
pool is associated with each block, and ensures that each block is
freed back to the appropriate storage pool.
[0184] Block Allocator 42
[0185] The block allocator 42 is responsible for allocating a range
of blocks of a particular type. A typical aggregate has two or
three different types of persistent storage from which blocks of
storage can be allocated. Each type of persistent storage is
addressed by block numbers, and blocks are divided into segments
whose default size is 2^16 blocks (although any power-of-2 multiple
of the block size is an acceptable segment size for an aggregate).
Blocks from different persistent stores are mapped in groups of an
entire segment into the aggregate address space, so that it is
guaranteed that within a particular segment, all of the blocks
share the same type of persistent storage, but two different
adjacent segments may differ in their underlying type of persistent
storage.
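With the default segment size of 2^16 blocks, the storage type of
any block can be found by indexing a per-segment table with the
high bits of the block number. The table name and layout in this
sketch are assumptions for illustration.

    #include <stdint.h>

    #define SEGMENT_SHIFT 16   /* log2 of the default 2^16-block segment */

    typedef enum { STORAGE_SSD, STORAGE_HDD } StorageType;
    extern StorageType segmentTypeTable[];   /* one entry per segment */

    /* All blocks in a segment share one storage type, so the lookup is
       a shift plus a table index. */
    StorageType storageTypeOfBlock(uint64_t blockNum)
    {
        return segmentTypeTable[blockNum >> SEGMENT_SHIFT];
    }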
[0186] Buffer Fill/Clean Module 40
[0187] Buffer Fill
[0188] A buffer fill operation is invoked when a buffer is accessed
that is not marked as valid. The buffer manager reads the data from
SSD or HDD, as determined by the block address, and then the buffer
is marked as valid, and those threads waiting for the data to
become valid are woken.
[0189] Buffer Clean
[0190] A buffer clean operation is invoked when the number of dirty
pages in the buffer cache exceeds a minimum threshold, designed to
ensure that there are sufficient numbers of dirty pages to clean
efficiently.
[0191] Once this minimum threshold has been crossed, cleaner
threads are created in the buffer fill/clean module 40. These
threads collect a number of dirty buffers that belong to one or
more inodes and ensure that the respective blocks are allocated for
the dirty buffers by calling the inode space allocator 36 (which
itself calls the block allocator 42 to allocate new blocks and to
free old, overwritten blocks). The cleaner threads then call the
SSD and HDD drivers to perform a small number of large writes that
actually clean the dirty buffers.
[0192] Driver Modules
[0193] A buffer system above the drivers provides a memory cache
tagged by inode and offset within the inode. Most inodes represent
files in the file system 10, but a few inodes represent physical
disk or SSD drives 48. Both are cached using the buffer cache
system 10.
[0194] <<describe buffer cache operation, including reference
counts, filling flag, dirty flag, cleaning flag>>
[0195] <<describe remaining objects>>
[0196] At the lowest level, there are drivers that provide access
to an arbitrary set of rotating disk drives and solid state disk
drives, although other types of persistent storage providing an
abstraction of an array of disk blocks may also be provided to the
buffer system 10.
[0197] NSO Policy Associations
[0198] In the basic invention, each separate file system 10 has its
own file system 10 identifier, based upon which the default policy
specification can be easily chosen, and for those file systems for
which no default policy has been specified, a global policy
specification can be used.
[0199] The most interesting part of the problem of associating
policies with sets of files arises when putting a policy on a
specific directory and its descendents. In this invention, a file
named .avere_control can be placed in any directory, with
this file holding the storage policy to be applied to all files in
the .avere_control's containing directory, and all of that
directory's descendents. In order to reliably associate a policy
object with an arbitrary inode, the invention must first be able to
detect when an .avere_control policy file is present in a
directory.
[0200] This is straightforwardly done by adding several fields to
each inode, as illustrated by the sketch following this list. The
fields added are: [0201] a policy file inode number,
giving the inode number of the .avere_control policy file in effect
for this object; [0202] a "policy root" bit indicating that the
policy file specified by the policy file inode number is stored in
this directory; [0203] a policy version number, incremented as
described below to detect the insertion and deletion of other
.avere_control policy files during operation of the system 10;
[0204] a parent inode number, specifying the inode of the directory
containing a file; if a file has more than one hard link, the inode
instead contains a pointer to a linked list of blocks storing the
back pointers to the parent directories in question.
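An illustrative C rendering of these fields follows; this is an
assumed layout for exposition, not the actual on-disk format.

    #include <stdint.h>

    /* Assumed layout of the policy-related inode fields. */
    typedef struct InodePolicyFields {
        uint64_t policyFileInode;  /* inode number of the governing
                                      .avere_control policy file */
        uint32_t policyRoot;       /* "policy root" bit: the policy file
                                      is stored in this directory */
        uint32_t policyVersion;    /* compared against the policy object
                                      to detect staleness */
        uint64_t parentInode;      /* containing directory; with more than
                                      one hard link, this instead refers to
                                      a linked list of back-pointer blocks */
    } InodePolicyFields;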
[0205] When a file, directory or symlink is created, the file
system 10 sets its original parent inode number to the directory in
which the object is created. This field is also updated on a
cross-directory rename operation. A hard link operation adds an
additional back pointer to those stored for a particular file, and
may involve adding an additional block of back pointers if the
current representation for the back pointers is at its capacity
limit. Similarly, an unlink operation will remove a back pointer
from the set of back pointers, possibly releasing empty blocks of
back pointer information. Directory back pointers do not have to
use this mechanism, as ".." entries exist in all directories,
pointing back to the single parent directory that the child
directory has. This parent information will be used for locating
policy information in certain cases.
[0206] When an inode is loaded into memory, a pointer is set up to
the corresponding parsed policy, which is found by searching an
in-memory hash table; if the policy object is not present in the
hash table, it is read from the appropriate policy file inode and
entered into the hash table at this time. Once the policy object
has been located, the policy version in the policy object is
compared with the policy version number in the referencing inode.
If the versions do not match, the policy path walk process,
described below, is performed to locate the correct policy object
for the referencing inode, and a reference to that policy object is
stored in the in-memory inode object.
[0207] Special care must be taken when adding or removing a policy
file. When a policy file is created or updated, then before
inserting the new policy object reference into the inode, the
system 10 first walks up the directory tree, starting at the
directory into which the policy file is being inserted, and
continuing by following each directory's ".." parent pointer in
turn, until either the root of the file system 10 is encountered,
or until an existing policy file is encountered. The existing
policy object's version number is incremented at this time, which
will force the performance of the policy path walk process
described below to determine the new correct policy object for all
descendents of that policy object.
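A sketch of this walk up the directory tree follows; the helpers
are hypothetical names standing in for the ".." traversal and the
version increment described above.

    typedef struct Inode Inode;              /* opaque for this sketch */
    extern Inode *followDotDot(Inode *dir);  /* parent via ".." entry */
    extern int    hasPolicyFile(Inode *dir); /* .avere_control present? */
    extern int    isFsRoot(Inode *dir);
    extern void   incrementPolicyVersion(Inode *dir);

    /* Walk up from the directory receiving a new policy file, and bump
       the version of the first existing policy file encountered. */
    void invalidateAncestorPolicy(Inode *startDir)
    {
        for (Inode *d = startDir; ; d = followDotDot(d)) {
            if (hasPolicyFile(d)) {
                incrementPolicyVersion(d);  /* descendents must re-walk */
                break;
            }
            if (isFsRoot(d))
                break;                      /* no enclosing policy file */
        }
    }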
[0208] Similarly, when a policy file is deleted from a directory,
the policy inode's policy version number is also incremented, so
that any files connected to that policy object will re-compute, via
the policy path walk process, the new correct policy object to
use.
[0209] The policy path walk process is used to find the correct
policy to use in the name space hierarchy, and is used when the
structure of the policy name space changes, either because of a
directory rename to a new location, or because of the insertion or
deletion of a relevant policy object (.avere_control file). The
process begins when a data cleaning, or other operation, needs to
determine the policy object for use with a given file. If the inode
being accessed has a policy version number matching the policy
version of the policy object itself, then the already parsed object
can be used. If the inode of the file or directory has the policy
root bit set, this means that the policy file is actually present
in this directory; the policy object simply needs to be rebuilt
from the policy file and the newly parsed policy object associated
with the directory in question. Otherwise, the policy path walk
procedure recursively
applies itself to the parent object of the file or directory in
question, effectively walking up the file system 10 tree until it
encounters a policy object whose version matches the policy version
in the inode itself, at which point a reference to this policy
object is inserted in the file's inode.
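A recursive C sketch of the policy path walk follows; the
structures and helpers are hypothetical illustrations, as in the
earlier sketches.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct Policy { uint64_t inodeAddr; uint32_t version; } Policy;
    typedef struct Inode {
        uint64_t policyInodeAddr;
        uint32_t policyVersion;
        int      policyRoot;
    } Inode;

    extern Policy *parsePolicyFile(Inode *dir);  /* rebuild from .avere_control */
    extern Policy *cachedPolicy(Inode *ip);      /* hash-table lookup, may be NULL */
    extern Inode  *parentOf(Inode *ip);          /* via stored back pointer */

    Policy *policyPathWalk(Inode *ip)
    {
        Policy *p = cachedPolicy(ip);
        if (p != NULL && p->version == ip->policyVersion)
            return p;                            /* version matches: reuse */
        if (ip->policyRoot)
            p = parsePolicyFile(ip);             /* policy file is local */
        else
            p = policyPathWalk(parentOf(ip));    /* walk up toward the root */
        ip->policyInodeAddr = p->inodeAddr;      /* re-link, stamp version */
        ip->policyVersion = p->version;
        return p;
    }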
[0210] In FIG. 5, a tree of directories (A, B, C and D) and files
(E, F and G) is shown, each of which has a reference to a policy
object file, shown as the rectangular box. Each directory or
regular file has the same path version (1), which matches the
policy object's version (also 1). Thus, this is a representation of
the state of the file system 10 where directory A has a policy
object X, and all of its descendent files and directories B-G share
the same policy object.
[0211] If a new policy object Y is added to the directory C, the
picture looks as shown in FIG. 6.
[0212] In FIG. 6, a new policy object, Y, is created and associated
with directory C. Before linking Y into directory C,
the policy creation operation walks up the tree to the next highest
directory having a policy object (directory A), and then increments
the associated policy object's version from 1 to 2. The goal is to
ensure that any references to an old inode, such as D, will detect
that the policy inherited from its ancestor needs to be revalidated
on the next reference to the object. There are a number of files
and directories that are now pointing to the wrong policy, but in
all of these cases, the policy version number in the inode no
longer matches the version number in the policy object, so the link
will not be used. In this case, it can be seen that a new policy Y
has been created for directory C, and spliced into the tree as C's
policy object, and the path versions for both X and Y have been set
to new values. The next access to any of the file system 10 objects
will then walk up the tree to locate the current policy to use.
[0213] Assume there are accesses to directory B, file E and file G.
Each access would notice that the policy link is incorrect, and
would perform the policy path walk procedure, fixing up the policy
link. The resulting state would look as shown in FIG. 7.
[0214] Note that the walks up from B and G locate the same policy
object (X), but update the path policy version to 2 for those
inodes. The walk up from inode E encounters the policy object Y at
node C, and links to that policy, setting inode E's path policy
version to match Y's version (3) at the same time.
[0215] NCO Policy Associations
[0216] The above describes how to associate policies with arbitrary
subsets of a file system 10 name space, but assumes that the system
10 can store and maintain a set of back pointers from each file or
directory in the system 10. In some distributed implementations of
a NAS cache, this requires that learned parent/child relationships
between file objects be passed relatively expensively between
nodes, since many NAS protocols, including NFS version 3, do not
provide operations for determining the parent directories from a
normal file's file handle.
[0217] Thus, this invention provides a second mechanism for
associating policies with files and directories within a name
space, this one primarily to be used in systems utilizing a NAS
cache operations module 34, although the same technique works for
NAS Server Operations as well, at the cost of slightly larger file
handles. The goal will be to ensure that every incoming file handle
will include a 16 bit field called a sub-export tag, along with the
usual information required to determine the back-end file server's
file handle. Each sub-export tag will be associated with a policy
object of the kind described in the discussion of policy
specifications above, with the policy
specification stored in an .avere_control file.
[0218] To enable this mechanism, a predefined collection of
directories are created by the Inode object allocator 30 as
directed by an external management process; these directories are
called sub-export roots. Each sub-export root is assigned a unique
16 bit sub-export tag. The root directory for the file system's
true NFS export is always a sub-export root. Whenever a file handle
is returned from an NCO operation, including NFS create, NFS
lookup, NFS mkdir, NFS symlink, or NFS readdirplus, the parent's
sub-export tag is included in the returned child's file handle,
unless the child object is a sub-export root directory, in which
case the sub-export root's sub-export tag is included instead.
Similarly, when the mount protocol mounts the root directory for
the file system 10, it is augmented to include the export's root
directory's sub-export tag in the returned file handle as
well.
[0219] A sub-export policy manager 44 monitors the contents of the
.avere_control files, if any, present in the corresponding
sub-export root directories. Thus, whenever an .avere_control file
stored in a sub-export root directory is modified, such as by an
NFS write, NFS create or NFS delete operation, the sub-export
policy manager 44 re-reads the .avere_control file and updates the
policy associated with the sub-export root's sub-export tag.
[0220] Thus, every incoming file handle is tagged by a sub-export
tag that can be mapped to a policy specification by calling the
sub-export manager with the incoming file handle's sub-export tag,
and obtaining the sub-export's currently active policy.
[0221] When the inode space allocator 36 needs to allocate space to
a particular inode, the policy associated with the inode's
sub-export tag is used to determine the type of storage from which
space should be allocated to the inode, as described earlier.
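The following sketch shows, under an assumed file handle layout,
how a sub-export tag travels with file handles and resolves to a
policy with a single sub-export manager call; none of these names
are taken from the actual implementation.

    #include <stdint.h>

    /* Assumed file handle layout carrying the 16 bit sub-export tag. */
    typedef struct FileHandle {
        uint16_t subExportTag;      /* maps to the governing policy */
        uint8_t  backEnd[30];       /* back-end server's handle data */
    } FileHandle;

    typedef struct Policy Policy;
    extern Policy *subExportManagerLookup(uint16_t tag);
    extern int isSubExportRoot(const FileHandle *fh, uint16_t *tagOut);

    /* Tag a child handle returned by lookup/create/mkdir/symlink/
       readdirplus: inherit the parent's tag unless the child is itself
       a sub-export root. */
    void tagChildHandle(const FileHandle *parent, FileHandle *child)
    {
        uint16_t rootTag;
        if (isSubExportRoot(child, &rootTag))
            child->subExportTag = rootTag;   /* root supplies its own tag */
        else
            child->subExportTag = parent->subExportTag;
    }

    /* Resolve the active policy for any incoming handle in one call. */
    Policy *policyForHandle(const FileHandle *fh)
    {
        return subExportManagerLookup(fh->subExportTag);
    }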
Example
[0222] The following describes a write operation where an implicit
or explicit policy directs the data to be located on one or more
types of storage.
[0223] An example is provided where the file "/a/b" is opened in
write mode, and then 10 MB of data is written to it. In this
example, the directory "a" is marked as a policy root, and has a
management-assigned sub-export tag, with an associated policy that
specifies that the first 1 MB of storage should be allocated from
SSD storage, and the remaining bytes in the file should be
allocated from normal HDD-resident storage.
[0224] The file open operation turns into two NSO lookup
operations, one looking up the directory "a" in the root directory
to determine the file handle of the directory "a", and the next one
looking up the name "b" in "a". The lookup of "a" in the directory
"/" retrieves the file handle for "a", consults a cache of the
management database 20, determines that "a" is a policy root, and
augments the file handle with a tag indicating the policy to be
applied to files under directory "a" in the name space. When the
NSO lookup for "b" in directory "a" is received, the policy tag on
"a"s file handle is propagated to "b"s file handle as well.
[0225] The client then sends 10 MB of NSO write operations (if, as
is common, each NSO write operation contained 64 KB of data, this
would be 160 separate write calls). This data is commonly written
directly to memory, where it remains until the Buf Clean mechanism
collects the modified data to write to persistent storage. The Buf
Clean module determines which sections of the file have new data,
and calls the Inode space allocator 36 to ensure that blocks have
been allocated for the appropriate block range in the file. In this
example, the Buf Clean module will request that the first 640
blocks of the file "b" be allocated. The Inode space allocator 36
examines the policy associated with file "b" (the policy inherited
from directory "a"'s file handle) and determines that the first 64
blocks should be allocated from SSD-backed segments, and the
remaining 576 blocks should be allocated from HDD-backed segments.
The Inode space allocator 36 then calls the Block allocator 42
module to actually obtain the storage of the appropriate type, and
then inserts these block addresses into the indirect block tree
associated with file "b".
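The arithmetic in this example can be reproduced with a short,
runnable C program; the 16 KB block size is inferred from the
figures above (640 blocks for 10 MB) rather than stated explicitly.

    #include <stdint.h>
    #include <stdio.h>

    /* Reproduces the split in this example: with 16 KB blocks, a 1 MB
       SSD threshold divides a 10 MB write into 64 SSD blocks and
       576 HDD blocks. */
    int main(void)
    {
        const uint64_t blockSize    = 16 * 1024;        /* 16 KB blocks */
        const uint64_t fileSize     = 10 * 1024 * 1024; /* 10 MB written */
        const uint64_t ssdThreshold = 1 * 1024 * 1024;  /* first 1 MB on SSD */

        uint64_t totalBlocks = fileSize / blockSize;     /* 640 */
        uint64_t ssdBlocks   = ssdThreshold / blockSize; /* 64  */
        uint64_t hddBlocks   = totalBlocks - ssdBlocks;  /* 576 */

        printf("total=%llu ssd=%llu hdd=%llu\n",
               (unsigned long long)totalBlocks,
               (unsigned long long)ssdBlocks,
               (unsigned long long)hddBlocks);
        return 0;
    }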
[0226] Once the appropriate blocks have been allocated from the
desired types of persistent storage, and these blocks have been
added to file "b"s indirect block tree, the system 10 schedules the
real IO operations. Specifically, the Buf Clean module starts IO
operations from the buffers in memory being cleaned to the newly
allocated persistent storage blocks just allocated. Once these IO
operations complete, the buffers are marked as cleaned, and the Buf
Clean operation terminates.
[0227] Although the invention has been described in detail in the
foregoing embodiments for the purpose of illustration, it is to be
understood that such detail is solely for that purpose and that
variations can be made therein by those skilled in the art without
departing from the spirit and scope of the invention except as it
may be described by the following claims.
* * * * *