U.S. patent application number 12/143134 was filed with the patent office on 2008-06-20 and published on 2008-12-25 as publication number 20080320097 for a network distributed file system. This patent application is currently assigned to TENOWARE R&D LIMITED. The invention is credited to Tomasz NOWAK and Antoni SAWICKI.
Application Number: 20080320097 / 12/143134
Document ID: /
Family ID: 40137642
Filed: 2008-06-20
Published: 2008-12-25
United States Patent Application 20080320097
Kind Code: A1
SAWICKI; Antoni; et al.
December 25, 2008
NETWORK DISTRIBUTED FILE SYSTEM
Abstract
A storage pool component is operable on a computing device
including a storage medium having an otherwise free storage
capacity for forming a portion of a storage capacity of a storage
pool and being operably connected across a network to at least one
other such component. The component comprises configuration data
identifying at least one other computing device to which the
computing device may connect across the network; and a directory
for identifying file information for files of the storage pool
stored on the storage medium, the file information being stored
with a degree of redundancy across the computing devices of the
storage pool. On instantiation, the component for communicates with
at least one other component operating on one of the other
computing devices to verify the contents of the directory. The
component reconciles file information stored on the storage medium
with file information from the remainder of the storage pool. The
component then acts as a driver, responsive to an access request
for a file stored in the storage pool received across the network
from another component of the storage pool, for determining a
location of the file on the storage medium from the directory and
for accessing the file accordingly.
Inventors: SAWICKI; Antoni; (Dublin, IE); NOWAK; Tomasz; (Warsaw, PL)
Correspondence Address:
CHRISTOPHER & WEISBERG, P.A.
200 EAST LAS OLAS BOULEVARD, SUITE 2040
FORT LAUDERDALE, FL 33301
US
Assignee: TENOWARE R&D LIMITED (Dublin, IE)
Family ID: 40137642
Appl. No.: 12/143134
Filed: June 20, 2008
Current U.S. Class: 709/216; 707/999.01; 707/E17.01; 711/112; 711/E12.001
Current CPC Class: G06F 2201/815 20130101; G06F 11/16 20130101; G06F 3/0607 20130101; G06F 3/067 20130101; G06F 3/0659 20130101; G06F 11/1443 20130101; G06F 3/0644 20130101; G06F 3/065 20130101; G06F 11/1425 20130101; G06F 11/2094 20130101; G06F 3/0619 20130101; G06F 16/176 20190101; H04L 1/1809 20130101; G06F 15/17331 20130101; G06F 11/1076 20130101; G06F 2211/104 20130101; G06F 11/1662 20130101; H04L 67/1097 20130101
Class at Publication: 709/216; 711/112; 707/10; 707/E17.01; 711/E12.001
International Class: G06F 15/167 20060101 G06F015/167; G06F 12/00 20060101 G06F012/00; G06F 17/30 20060101 G06F017/30
Foreign Application Data
Date | Code | Application Number
Jun 22, 2007 | IE | S2007/0453
Claims
1. A storage pool component operable on a computing device
including a storage medium having an otherwise free storage
capacity for forming a portion of a storage capacity of a storage
pool and being operably connected across a network to at least one
other storage pool component, each storage pool component operating
on a computing device providing a respective portion of said
storage pool capacity, said storage pool component comprising:
configuration data identifying said at least one other computing
device to which said computing device may connect across said
network; a directory for identifying file information for files of
said storage pool stored on said storage medium, said file
information being stored with a degree of redundancy across said
computing devices of said storage pool; means responsive to
instantiation of said component for communicating with at least one
other component operating on one of said at least one other
computing devices for verifying the contents of said directory;
means for reconciling file information stored on said storage
medium with file information from the remainder of said storage
pool; and a driver, responsive to an access request for a file
stored in said storage pool received across said network from
another component of said storage pool, for: determining a location
of said file on said storage medium from said directory; accessing
said file accordingly.
2. A component as claimed in claim 1, wherein the component further
comprises a user interface component arranged to enable said
configuration data to be determined.
3. A component as claimed in claim 1, wherein said access request
comprises a read access and wherein said driver is arranged to
return said file information to said requesting component.
4. A component as claimed in claim 1, wherein said access request
comprises a write access including file information and wherein
said driver is arranged to write said file information to said
storage medium and to update said directory accordingly.
5. A component as claimed in claim 1, wherein said configuration
data includes an identifier for said storage pool, storage size
information for said storage pool, an indicator of said redundancy
provision within said storage pool, and network identifiers for
other components of said storage pool.
6. A component as claimed in claim 1, wherein said component is
arranged to operate as a disk device driver on said computing
device, said driver being arranged to receive file access requests
from any applications running on said computing device and in
accordance with said directory to transmit file access requests to
other components of said storage pool, to process responses to said
requests and to communicate the processing of said responses to
said applications.
7. A component as claimed in claim 6, wherein said access request
comprises a request for file information from another component of
said storage pool, said file information being distributed across
N+M computing devices, where N>=1 and determines the amount of
storage available in said storage pool and wherein M>0 and
determines said redundancy provision within said storage pool.
8. A component as claimed in claim 7, wherein said component is
responsive to a file write request to split said file information
into N+M clusters and to transmit file write requests to other
components of said storage pool, each request including at least a
respective write access request to a component.
9. A component as claimed in claim 7, wherein said component is
responsive to a file write request to split said file information
into clusters of a given size, to transmit file write requests to
other components of said storage pool, each request including at
least a respective write access request to a component and to
transmit a write request including residual file information from
said splitting to at least M components of said storage pool.
10. A component as claimed in claim 6, wherein said component is
arranged to determine how many other components of said storage
pool are accessible across said network.
11. A component as claimed in claim 10, wherein said component is
responsive to less than N of N+M components being accessible to
halt access to said storage pool.
12. A component as claimed in claim 10, wherein N<=M and said
component is responsive to less than 50% of said components being
accessible to permit only read access requests to said storage
pool.
13. A component as claimed in claim 1, wherein said component is
arranged to provide storage capacity for respective portions of a
plurality of storage pools, said configuration data including data
for each storage pool.
14. A component as claimed in claim 1, wherein said component is
arranged to make said storage pool available as a disk drive.
15. A component as claimed in claim 1 wherein said file information
is stored in blocks in a directory of said storage medium.
16. A component as claimed in claim 1 wherein said file information
is stored as objects in a transactional database.
17. A component as claimed in claim 16, wherein said verifying
means is arranged to compare transaction log entries stored on said
component with transaction log entries stored on another component
of said storage pool to determine if file information stored on
said component is valid.
18. A component as claimed in claim 1 wherein said verifying means
is arranged to compare at least one of: file name, file size, last
modification date, and file attributes contained in said directory
with corresponding attributes for a file stored on another
component of said storage pool to determine if file information
stored on said component is valid.
19. A component according to claim 1, wherein said component is
arranged to periodically check the accessibility of other
components forming the storage pool to said client.
20. A component according to claim 19, wherein said verifying means
is arranged to communicate with the most accessible of said other
components.
21. A component as claimed in claim 6, wherein said file access
requests are transmitted to all accessible components of said
storage pool.
22. A component according to claim 1, wherein said component is
arranged to communicate with other components of said storage pool
in an encrypted manner determined by
23. A system, comprising: a plurality of computing devices having:
a storage medium; at least one of said computing devices comprising
a storage pool component, said storage pool component being
operable on the computing device, the storage medium having an
otherwise free storage capacity for forming a portion of a storage
capacity of a storage pool and being operably connected across a
network to at least one other storage pool component, each storage
pool component operating on a computing device providing a
respective portion of said storage pool capacity, said storage pool
component comprising: configuration data identifying said at least
one other computing device to which said computing device may
connect across said network; a directory for identifying file
information for files of said storage pool stored on said storage
medium, said file information being stored with a degree of
redundancy across said computing devices of said storage pool;
means responsive to instantiation of said component for
communicating with at least one other component operating on one of
said at least one other computing devices for verifying the
contents of said directory; means for reconciling file information
stored on said storage medium with file information from the
remainder of said storage pool; and a driver, responsive to an
access request for a file stored in said storage pool received
across said network from another component of said storage pool,
for: determining a location of said file on said storage medium
from said directory; and accessing said file accordingly, said
storage pool component being arranged to make said storage pool
available as a disk drive; and said system including one or more
legacy clients accessing said storage pool through a legacy disk
device driver.
Description
CROSS-REFERENCE TO RELATED APPLICATION
[0001] n/a
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0002] n/a
FIELD OF THE INVENTION
[0003] The present invention relates to a method and apparatus for
distributing files within a network.
BACKGROUND OF THE INVENTION
[0004] Traditionally, computer data is stored in the form of individual files on a computer's long-term (non-volatile) storage, most commonly a "hard disk". Hard disks suffer from the following issues:
[0005] Limited capacity
[0006] Prone to damage and failure because of mechanical (moving) parts--short lifetime
[0007] Not shared; generally only one machine can access a disk at a time
[0008] To overcome such problems, disks can be combined into larger pools of storage with data protection, for example, in a RAID array. In terms of their interface to a host computer, disks or pools of disks can either be:
[0009] Internal--e.g. IDE, SATA, SCSI disks.
[0010] External--DAS (Directly Attached Storage) e.g. USB, SCSI, Fibre Channel. However, DAS is only capable of being connected to a very limited (<10) number of servers at a time.
[0011] External--NAS (Network Attached Storage) e.g. Ethernet, TCP/IP, IPX/SPX. NAS is essentially a more advanced file server implemented in dedicated hardware.
[0012] External--SAN (Storage Area Network) e.g. Fibre Channel (FC) network infrastructure. It is acknowledged that a SAN is capable of being connected to multiple machines; however, the infrastructure costs for doing so are prohibitive for desktops and, in spite of improvements such as iSCSI, SAN is typically used only for servers.
[0013] A pool of storage usually needs to be accessible by more than just a single machine. Traditionally the most common way of sharing storage is to use a "file server", which is a dedicated computer on the network, providing its storage pool (connected through any of the above four ways, internal or external) transparently to other computers through a "File Sharing Protocol" over a computer network (LAN/WAN/etc.), with the possibility of adding extra security (access control) and availability (backups) from a central location.
[0014] Some commonly used file sharing protocols are: [0015] CIFS/SMB/Windows File Sharing, introduced by Microsoft with Windows 3.x [0016] NFS, introduced by Sun Microsystems and adopted by almost all Unix operating systems [0017] NetWare, introduced by Novell [0018] AppleShare, used by Apple computers
[0019] However, file servers suffer from some serious issues: [0020] Single point of failure--when the server fails, all clients are unable to access data [0021] Central bottleneck--when multiple clients access the same server, network congestion can occur [0022] Limited capacity and scalability--file servers can run out of space as more clients are connected [0023] Expensive dedicated hardware and high per-gigabyte cost, plus maintenance costs, upgrades, service, repairs, etc.
SUMMARY OF THE INVENTION
[0024] The present invention provides a virtual storage pool formed from a combination of unused disk resources from participating nodes, presented as a single virtual disk device (of combined size) available to and shared with all nodes of a network. Under a host operating system, the virtual storage pool is visible as a normal disk drive (a disk letter on Windows, a mount point on Unix); however, all disk I/O is distributed to participating nodes over the network. In the present specification, this is referred to as a Network Distributed File System (NDFS).
[0025] If any of the peer workstations becomes unavailable (even for a short period of time), the virtual storage pool could become unavailable or inconsistent. In preferred embodiments of the invention, to achieve availability comparable to server or NAS storage, data is distributed in such a way that if up to a predefined number of participating nodes become unavailable, the virtual storage pool is still accessible to the remaining computers.
[0026] The invention is based on a Peer-to-Peer (P2P) network protocol which allows data to be stored and retrieved in a distributed manner. A data redundancy mechanism is used at the protocol level in such a way that if any participating node is inaccessible or slow to respond to requests, it is automatically omitted from processing. The protocol is therefore resistant to network congestion and breaks.
[0027] For example, given a network of 25 workstations, each having a 120 GB disk of which 100 GB is unused, a virtual storage pool of 2.5 TB could be formed and made available to all nodes on the network.
[0028] The storage pool of the preferred embodiments, in contrast to the traditional file server approach, has the following characteristics: [0029] No single point of failure--up to a predefined number of participating nodes (peers) can become inaccessible and the data will still be available. [0030] Reduced network bottlenecks--using a P2P network protocol provides the benefits of parallel I/O. The load is generally evenly and coherently distributed across the network, as opposed to point-to-point transmission with a client-server model. The network protocol also automatically adapts to network congestion by not requiring all data to be retrieved and by ignoring nodes which are slow to respond. [0031] Unlimited capacity and scalability--the size of the storage automatically grows with every node added to the network. [0032] No extra hardware costs; moreover, resources that would otherwise be unused can be utilized.
[0033] In accordance with one aspect, the present invention
provides a storage pool component operable on a computing device
including a storage medium having an otherwise free storage
capacity for forming a portion of a storage capacity of a storage
pool and being operably connected across a network to at least one
other storage pool component. Each storage pool component operates
on a computing device providing a respective portion of the storage
pool capacity. The storage pool component has configuration data
identifying the at least one other computing device to which the
computing device may connect across the network, a directory for
identifying file information for files of the storage pool stored
on the storage medium, the file information being stored with a
degree of redundancy across the computing devices of the storage
pool, means responsive to instantiation of the component for
communicating with at least one other component operating on one of
the at least one other computing devices for verifying the contents
of the directory, means for reconciling file information stored on
the storage medium with file information from the remainder of the
storage pool, and a driver. The driver is responsive to an access
request for a file stored in the storage pool received across the
network from another component of the storage pool, and determines a location of the file on the storage medium from the directory and accesses the file accordingly.
[0034] In accordance with another aspect, the present invention
provides a system having a plurality of computing devices. The
plurality of computing devices each has a storage medium. At least
one of the computing devices includes a storage pool component. The
storage pool component is operable on the computing device and the
storage medium has an otherwise free storage capacity for forming a
portion of a storage capacity of a storage pool and is operably
connected across a network to at least one other storage pool
component. Each storage pool component operates on a computing
device providing a respective portion of the storage pool capacity.
The system also includes one or more legacy clients accessing the
storage pool through a legacy disk device driver.
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] A more complete understanding of the present invention, and
the attendant advantages and features thereof, will be more readily
understood by reference to the following detailed description when
considered in conjunction with the accompanying drawings
wherein:
[0036] FIG. 1 shows schematically a pair of virtual storage pools
(VSP) distributed across a set of nodes according to an embodiment
of the present invention;
[0037] FIG. 2 shows a client application accessing a virtual
storage pool according to an embodiment of the invention;
[0038] FIG. 3 shows the main components within a Microsoft Windows
client implementation of the invention;
[0039] FIG. 4 shows the main components within an alternative
Microsoft Windows client implementation of the invention;
[0040] FIG. 5 shows a write operation being performed according to an embodiment of the invention; and
[0041] FIG. 6 shows a cluster of VSPs in a high availability
group.
DETAILED DESCRIPTION OF THE INVENTION
[0042] In traditional data storage, the term "Storage Pool" refers
to a pool of physical disks or logical disks served by a SAN or
LUNs (Logical Units) in DAS. Such a storage pool can be used, either as a whole or partially, to create one or more higher level "logical volumes" by means of RAID-0, 1, 5, etc. before being finally presented through the operating system.
[0043] According to the present invention, a storage pool is
created through a network of clients communicating through a P2P
network protocol. For the purposes of the present description, the
network includes the following node types: [0044] Server--node
contributing local storage resources (free disk space) to the
storage pool through the P2P protocol. A server receives requests
from clients to either provide or modify data stored at that node.
As such it will be seen that server nodes can be regular
workstations that need not run applications which access the VSP,
dedicated network servers (such as Windows Server, Unix Server,
etc), Network Attached Storage (NAS) or Storage Area Network (SAN)
devices. [0045] Active Client--node accessing the storage pool
through the P2P protocol. An Active Client, when accessing the
data, communicates with all available servers simultaneously. When
one or more of the servers is delayed in responding or is
completely unavailable, the missing piece of data can be rebuilt
using redundant chunks of data obtained from the other servers.
[0046] Peer Node--a node which is both Server and Active Client at the same time. This is the most common node type. [0047] Legacy Client--a node accessing the storage pool through a legacy protocol like CIFS or NFS, via a gateway on an Active Client.
[0048] From the above it will be seen that a virtual storage pool can be created from Peer Nodes or Server Nodes and made accessible to Active Clients or Peer Nodes. A VSP can therefore function, for example, on: [0049] workstations sharing their own LSEs with their shared VSP--Active Nodes; [0050] workstations sharing their own LSEs with external Active (or Legacy) Clients--Server/Peer Nodes to Active Nodes; [0051] network servers sharing their own LSEs with their common shared VSP--Active Nodes; [0052] network servers (for example NAS or SAN) sharing their own LSEs with external network Active (or Legacy) Clients--Server/Peer Nodes to Active Nodes; [0053] a mixed network of active clients and servers sharing their LSEs with their common shared VSP--Active Nodes; or [0054] a mixed network of active clients and servers sharing their LSEs with external Active (or Legacy) Clients--Server/Peer Nodes to Active Nodes.
[0055] Referring now to FIG. 1, a VSP (Virtual Storage Pool), VSP A or VSP B, according to the preferred embodiment is formed from Local Storage Entities (LSEs) served by either Server or Peer Nodes 1 . . . 5. In a simple implementation, an LSE can be just a hidden subdirectory on a disk of the node. However, alternative implementations referred to later could implement an LSE as an embedded transactional database. In general, LSE size is determined by the available free storage space on the various nodes contributing to the VSP. Preferably, the LSE size is the same on every node, and so the global LSE size within a VSP will depend on the smallest LSE in the VSP.
[0056] The size of a VSP is calculated from the VSP Geometry: [0057] If no data redundancy is used (Geometry=N), the size of the VSP is determined by the number N of Server or Peer Nodes multiplied by the size of the LSE. [0058] When mirroring (M replicas) is being used (Geometry=1+M), the size of the VSP is equal to the size of the LSE.
[0059] When RAID3/5 is being used (Geometry=N+1), the size of the VSP equals N multiplied by the size of the LSE.
[0060] When RAID-6 is being used (Geometry=N+2), the size of the VSP again equals N multiplied by the size of the LSE.
[0061] If N+M redundancy is used (Geometry=N+M), the size of the VSP equals N multiplied by the size of the LSE.
[0062] Because the LSE size is the same on every node, a situation may occur where one or a few nodes with a major storage size difference are under-utilized in contributing to the virtual network storage. For example, in a workgroup of 6 nodes, two having 60 GB disks and four having 120 GB disks, the LSE may be limited to 60 GB, and so a single VSP could only be 6*60 GB=360 GB as opposed to 120+120+120+120+60+60=600 GB. In such a situation, multiple VSPs can be defined. So in the above example, two VSPs could be created, one of 6*60 GB and a second of 4*60 GB, and these will be visible as two separate network disks. In fact, multiple VSPs enable different redundancy levels and security characteristics to be applied to different VSPs, enabling greater flexibility for administrators.
[0063] Using the invention, a VSP is visible to an Active Client,
Peer Node or indeed Legacy Client as a normal disk formed from the
combination of LSEs with one of the geometries outlined above. When
a client stores or retrieves data from a VSP it attempts to connect
to every Server or Peer Node of the VSP and to perform an LSE I/O
operation with an offset based on VSP Geometry.
[0064] Before describing an implementation of the invention in
detail, we define the following terms: [0065] LSE Block Size (LBS)
is a minimal size of data that can be accessed on an LSE. Currently
it is hard coded at 1024 bytes. [0066] Network Block Size (NBS) is
a maximum size of data payload to be transferred in a single
packet. Preferably, NBS is smaller than the network MTU (Maximum
Transmission Unit)/MSS (Maximum Segment Size) and in the present
implementations NBS is equal to LBS, i.e. 1024 bytes, to avoid
network fragmentation. (Standard MTU size on an Ethernet type
network is 1500 bytes). [0067] VSP Block Size (VBS) is the size of
data block at which data is distributed within the P2P network:
VBS=LBS*number of non-redundant nodes (N).
[0068] VSP Cluster Size (VCS)--data (the contents of files before redundancy is calculated) is divided into so-called clusters, similar to the data clusters of traditional disk based file systems (FAT, NTFS). Cluster size is determined by the VSP Geometry and NBS (Network Block Size) in the following way:
VCS=Number of Data Nodes*NBS [0069] VCS is a constant data size that a redundancy algorithm can be applied to. If a data unit is smaller than the VCS, mirroring is used. If a data unit is larger than the VCS, it will be wrapped to a new cluster. For example, with reference to FIG. 5, if a VSP has 5 data nodes and the NBS is 1400 bytes, the VCS would be 5*1400=7000 bytes. If a client application performs a write I/O operation of 25 kilobytes of data, the NDFS will split it into three clusters (of 7000 bytes) and the remaining 4000 bytes will be mirrored among the nodes (see the sketch after this list). Another implementation would pad the remaining 4000 bytes with 3000 zeros up to a full cluster size and distribute it among the nodes as a fourth cluster. [0070] Host
Block Size (HBS) is the block size used on a host operating
system.
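By way of illustration only, the cluster arithmetic of the FIG. 5 example can be sketched as follows; the constant names and the code itself are illustrative assumptions and are not taken from the actual NDFS implementation:

/* Illustrative sketch only: cluster arithmetic for the FIG. 5 example
 * (5 data nodes, NBS = 1400 bytes); names are not from the NDFS sources.   */
#include <stdio.h>

#define NBS        1400u                 /* Network Block Size (example)    */
#define DATA_NODES 5u                    /* N, number of data nodes         */
#define VCS        (DATA_NODES * NBS)    /* VSP Cluster Size = N * NBS      */

int main(void)
{
    unsigned write_size    = 25000u;             /* 25 kilobyte client write */
    unsigned full_clusters = write_size / VCS;   /* distributed with N+M     */
    unsigned remainder     = write_size % VCS;   /* mirrored (1+M) to nodes  */

    printf("VCS=%u: %u full clusters, %u bytes mirrored\n",
           VCS, full_clusters, remainder);  /* VCS=7000: 3 clusters, 4000 B  */
    return 0;
}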
[0071] Referring now to the implementation of FIG. 3, only Peer Nodes and a single VSP per network are considered. In this implementation, a simple user mode application (u_ndfs.exe) is used for startup, maintenance, recovery, cleanup, VSP forming, LSE operations and the P2P protocol; however, it will be seen that this functionality could equally be implemented in separate applications.
[0072] Upon startup, u_ndfs.exe reads config.xml, a configuration file which defines the LSE location and the VSP properties, i.e. geometry, disk name and IP addresses of peer nodes. (The configuration file is defined through user interaction with a configuration GUI portion (CONFIG GUI) of u_ndfs.) u_ndfs then spawns a networking P2P protocol thread, NDFS Service. The network protocol used by the thread binds to a local interface on a UDP port and starts network communications with the other nodes contributing to the VSP.
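The patent does not reproduce config.xml itself, so the following layout is purely hypothetical; it merely illustrates the kind of information listed above (LSE location, geometry, disk name and peer IP addresses):

<!-- Hypothetical example only; the real config.xml schema is not disclosed -->
<ndfs>
  <vsp name="VSP_A" geometry="4+1" disk="Z:">
    <lse path="C:\NDFS\lse_a"/>
    <peers>
      <peer ip="192.168.0.11"/>
      <peer ip="192.168.0.12"/>
      <peer ip="192.168.0.13"/>
      <peer ip="192.168.0.14"/>
      <peer ip="192.168.0.15"/>
    </peers>
  </vsp>
</ndfs>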
[0073] If fewer than a quorum N of the N+M nodes are detected by the node on start-up, the VSP is suspended for that node until a quorum is reached.
[0074] Where there is N+M redundancy and where N<=M, it is possible for two separate quorums to exist on two detached networks. In such a case, if N<=50% of N+M but a quorum is reached at a node, the VSP is set to read-only mode at that node.
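The start-up rule of paragraphs [0073]-[0074] (see also claims 11 and 12) can be summarized by a small decision function; this is only a sketch, and the identifiers are illustrative rather than taken from u_ndfs.exe:

/* Sketch of the quorum rule: suspend below quorum N; if N<=M and fewer than
 * 50% of the N+M nodes are reachable, allow read-only access only.          */
typedef enum { VSP_SUSPENDED, VSP_READ_ONLY, VSP_READ_WRITE } vsp_state;

static vsp_state vsp_startup_state(unsigned n, unsigned m, unsigned visible)
{
    if (visible < n)                       /* below quorum: suspend the VSP  */
        return VSP_SUSPENDED;
    if (n <= m && 2u * visible < n + m)    /* quorum, but under half the pool */
        return VSP_READ_ONLY;              /* guards against split-brain     */
    return VSP_READ_WRITE;
}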
[0075] Once a quorum is present, a local LSE to VSP directory comparison is performed by recovering directory metadata from another node.
[0076] If the VSP contains any newer files/directories than the
local LSE (for instance if the node has been off the network and
files/directories have been changed), a recovery procedure is
performed by retrieving redundant network parts from one or more
other nodes and rebuilding LSE data for the given file/directory.
In a simple implementation, for recovery, the node closest to the
requesting node based on network latency is used as the source for
metadata recovery.
[0077] So for example, in an N+M redundancy VSP implementation, a file is split into N+M clusters, each cluster containing a data component and a redundant component. Where one or more of the N+M nodes of the VSP was unavailable when the file was written or updated, then during recovery the previously unavailable node must obtain at least N of the clusters in order to rebuild the cluster which should be stored for the file on the recovering node, so maintaining the overall level of redundancy for all files of the VSP.
[0078] It will also be seen that, after start-up and recovery, the networking protocol should remain aware of network failures and needs to perform an LSE rescan and recovery every time the node is
reconnected to the network. The user should be alerted to expect
access to the VSP when this happens.
[0079] A transaction log can be employed to speed up the recovery
process instead of using a directory scan, and if the number of
changes to the VSP exceeds the log size, a full recovery could be
performed.
[0080] It can also be useful during recovery to perform a full disk scan in the manner of fsck ("file system check" or "file system consistency check" in UNIX) or chkdsk (Windows) to ensure files have not been corrupted.
[0081] When the LSE data is consistent with the VSP, the networking thread begins server operations and u_ndfs.exe loads a VSP disk device kernel driver (ndfs.sys). The disk device driver (NDFS Driver) then listens to requests from the local operating system and applications, while u_ndfs.exe listens to requests from other nodes through the networking thread.
[0082] Referring to FIG. 2, in operation, an application (for instance Microsoft Word) running on the host operating system calls the I/O subsystem in the OS kernel and requests a portion of data with an offset (0 to file length) and size. (If the size is bigger than the HBS, the kernel will fragment the request into smaller subsequent requests.) The I/O subsystem then sends an IRP (I/O request packet) message to the responsible device driver module, the NDFS driver. In the case of a request to the VSP, the kernel device driver receives the request and passes it on to the P2P network protocol thread, NDFS Service, for further processing based on the VSP geometry.
[0083] At the same time, when the server side of the networking
thread receives a request from a client node through the network,
an LSE I/O operation is performed on the local storage.
[0084] Both client and server I/Os can be thought of as normal I/O operations, with the exception that they are intercepted and passed through the NDFS driver and NDFS service like a proxy. N+M redundancy can thus be implemented with the P2P network protocol transparently to both clients and servers.
[0085] Referring now to FIG. 4, in a further refined implementation of the invention, a separate kernel driver, NDFS Net Driver, is implemented for high-speed network communications instead of using Winsock. This driver implements its own layer-3 protocol and only reverts to IP/UDP in case of communication problems.
[0086] Also, instead of using the Windows file system for the LSE, a database, NDFS DB, can be used. Such a database-implemented LSE can also prevent users from manipulating the raw data, which is stored in a hidden directory in the implementation of FIG. 3.
[0087] For the implementation of FIG. 3, a P2P network protocol is
used to provide communications between VSP peer nodes on the
network. Preferably, every protocol packet comprises:
[0088] Protocol ID
[0089] Protocol Version
[0090] Geometry
[0091] Function ID
[0092] Function Data
For the implementations of FIGS. 3 and 4, the following functions
are defined:
TABLE-US-00001
NDFS_FN_READ_FILE_REQUEST      0x0101
NDFS_FN_READ_FILE_REPLY        0x0201
NDFS_FN_WRITE_FILE             0x0202
NDFS_FN_CREATE_FILE            0x0102
NDFS_FN_DELETE_FILE            0x0103
NDFS_FN_RENAME_FILE            0x0104
NDFS_FN_SET_FILE_SIZE          0x0105
NDFS_FN_SET_FILE_ATTR          0x0106
NDFS_FN_QUERY_DIR_REQUEST      0x0207
NDFS_FN_QUERY_DIR_REPLY        0x0203
NDFS_FN_PING_REQUEST           0x0108
NDFS_FN_PING_REPLY             0x0204
NDFS_FN_WRITE_MIRRORED         0x0109
NDFS_FN_READ_MIRRORED_REQUEST  0x0205
NDFS_FN_READ_MIRRORED_REPLY    0x0206
[0093] As can be seen above, every function has a unique id, and
the highest order byte defines whether the given function is
BROADCAST (1) or UNICAST (2) based.
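Purely as a sketch, the packet fields of paragraphs [0088]-[0092] and the function IDs tabulated above might be declared as follows; the field widths are assumptions, since the patent only names the fields and lists the ID values:

#include <stdint.h>

enum ndfs_fn {                                   /* IDs from TABLE-US-00001  */
    NDFS_FN_READ_FILE_REQUEST     = 0x0101,
    NDFS_FN_READ_FILE_REPLY       = 0x0201,
    NDFS_FN_WRITE_FILE            = 0x0202,
    NDFS_FN_CREATE_FILE           = 0x0102,
    NDFS_FN_DELETE_FILE           = 0x0103,
    NDFS_FN_RENAME_FILE           = 0x0104,
    NDFS_FN_SET_FILE_SIZE         = 0x0105,
    NDFS_FN_SET_FILE_ATTR         = 0x0106,
    NDFS_FN_QUERY_DIR_REQUEST     = 0x0207,
    NDFS_FN_QUERY_DIR_REPLY       = 0x0203,
    NDFS_FN_PING_REQUEST          = 0x0108,
    NDFS_FN_PING_REPLY            = 0x0204,
    NDFS_FN_WRITE_MIRRORED        = 0x0109,
    NDFS_FN_READ_MIRRORED_REQUEST = 0x0205,
    NDFS_FN_READ_MIRRORED_REPLY   = 0x0206
};

struct ndfs_packet_hdr {                 /* Protocol ID, Version, Geometry,  */
    uint32_t protocol_id;                /* Function ID, then Function Data; */
    uint16_t protocol_version;           /* the field sizes are hypothetical */
    uint16_t geometry;
    uint16_t function_id;                /* one of enum ndfs_fn              */
};

/* Highest order byte of the function ID: 0x01 = BROADCAST, 0x02 = UNICAST.  */
static inline int ndfs_fn_is_broadcast(uint16_t fn) { return (fn >> 8) == 0x01; }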
[0094] The functions can be categorized as carrying data or
metadata (directory operations). Also defined are control functions
such as PING, which do not directly influence the file system or
data.
Functions which carry data are as follows:
[0095] READ_REQUEST
[0096] READ_REPLY
[0097] WRITE
[0098] WRITE_MIRRORED
[0099] READ_MIRRORED_REQUEST
[0100] READ_MIRRORED_REPLY
whereas functions which carry metadata are as follows: [0101] CREATE--creates a file or directory with a given name and attributes [0102] DELETE--deletes a file or directory together with its contents [0103] RENAME--renames a file or directory or changes its location in the directory structure (MOVE) [0104] SET_ATTR--changes file attributes [0105] SET_SIZE--sets the file size. Note that the file size doesn't imply how much space the file physically occupies on the disk and is only an attribute. [0106] QUERY_DIR_REQUEST [0107] QUERY_DIR_REPLY
[0108] In the present implementations, all metadata (directory information) is available on every participating node. All functions manipulating metadata are therefore BROADCAST based and do not require two-way communications--the modification is sent by the modifying node as a broadcast message to all other nodes to update their metadata. Verification of such operations is performed only on the requesting node.
[0109] The rest of the metadata functions are used to read
directory contents and are used in the recovery process. These
functions are unicast based, because the implementations assume
metadata to be consistent on all available nodes.
[0110] After fragmentation of a file into clusters, the last fragment usually has a random size smaller than the full cluster size (unless the file size is rounded up to the full cluster size). Such a fragment cannot easily be distributed using N+M redundancy and is instead stored using 1+M redundancy (replication) using the function WRITE_MIRRORED. The same applies to files that are smaller than the cluster size. (Alternative implementations may have different functionality, such as padding or reducing the block size to 1 byte.)
[0111] WRITE_MIRRORED is a BROADCAST function because an identical data portion is replicated to all nodes. It should be noted that for READ_MIRRORED operations, all data is available locally (because it is identical on every node) and no network I/O is required for such small portions of data (except for recovery purposes).
[0112] Note that the mirrored block size has to be smaller than the cluster size; however, it can be larger than the NBS. In such cases more than one WRITE_MIRRORED packet has to be sent, each with a different offset for the data being written.
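A sketch of how a mirrored fragment larger than the NBS might be split into several WRITE_MIRRORED packets, each with its own offset, is shown below; the transport call is a placeholder, not a real u_ndfs.exe routine:

#include <stdint.h>
#include <stdio.h>

#define NBS 1400u   /* Network Block Size, as in the FIG. 5 example          */

/* Placeholder for the broadcast transport; prints instead of sending.       */
static void send_write_mirrored(uint32_t offset, const uint8_t *data, uint32_t len)
{
    (void)data;
    printf("WRITE_MIRRORED offset=%u len=%u\n", offset, len);
}

/* Split one mirrored fragment (always smaller than the cluster size) into
 * NBS-sized packets with increasing offsets, as described in [0112].        */
static void write_mirrored_fragment(uint32_t base, const uint8_t *data, uint32_t len)
{
    for (uint32_t sent = 0; sent < len; ) {
        uint32_t chunk = (len - sent > NBS) ? NBS : (len - sent);
        send_write_mirrored(base + sent, data + sent, chunk);
        sent += chunk;
    }
}

int main(void)
{
    static uint8_t tail[4000];                          /* FIG. 5 remainder  */
    write_mirrored_fragment(21000u, tail, sizeof tail); /* 1400+1400+1200    */
    return 0;
}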
[0113] In implementing N+M redundancy, clusters are divided into individual packets. To read data from a file, the broadcast function READ_REQUEST is used. The function is sent to all nodes with the cluster offset to be retrieved. Every node replies with the unicast function READ_REPLY containing its own data for the cluster, at NBS size.
[0114] The node performing the READ_REQUEST waits for a number of READ_REPLY packets (equal to the number of data nodes) sufficient to recover the data. Once enough packets have been received, any following reply packets are discarded. The data is then processed by an N+M redundancy function to recover the original file data.
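The read path of paragraphs [0113]-[0115] might be approximated as below; the structure and function names are illustrative assumptions only:

#include <stdint.h>

#define DATA_NODES 5u        /* N, as in the FIG. 5 example                  */
#define NBS        1400u

struct read_reply {
    uint64_t request_id;     /* 64-bit ID generated from the system clock    */
    uint16_t node;           /* replying node                                */
    uint8_t  payload[NBS];   /* this node's share of the cluster             */
};

/* Collect READ_REPLY packets for one outstanding READ_REQUEST. Replies with
 * an unknown ID are discarded ([0115]); once N replies are held, any further
 * replies are discarded and the N+M redundancy function can rebuild the
 * original cluster data ([0114]). Returns 1 when enough data has arrived.   */
static int collect_reply(const struct read_reply *r, uint64_t want_id,
                         struct read_reply have[DATA_NODES], unsigned *count)
{
    if (r->request_id == want_id && *count < DATA_NODES)
        have[(*count)++] = *r;         /* keep it; otherwise it is discarded */
    return *count >= DATA_NODES;       /* ready for redundancy decode        */
}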
[0115] Functions like REQUEST/REPLY carry a 64-bit unique identification number, generated from the computer's system clock and inserted when the REQUEST is sent. The packet ID is stored in a queue. When the required number of REPLY packets with the same ID has been received, the REQUEST ID is removed from the queue. Packets with IDs not matching those in the queue are discarded.
[0116] The packet ID is also used in functions other than REQUEST/REPLY to prevent execution of functions on the same node as the sending node. When a node receives a REQUEST packet with an ID matching a REQUEST ID in its REQUEST queue, the REQUEST is removed from the queue; otherwise the REQUEST function in the packet will be executed.
[0117] The broadcast function PING_REQUEST is sent when the networking thread is started on a given node. In response, the node receives a number of unicast PING_REPLY responses from the other nodes, and if these are fewer than required, the VSP is suspended until a quorum is reached.
[0118] Every other node starting up sends further PING_REQUEST packets, and these can be used to indicate to a node that the required number of nodes is now available, so that VSP operations can be resumed for read-only or read/write access.
[0119] The PING functions are used to establish the closest (lowest
latency) machine to the requesting node and this is used when
recovery is performed. As explained above, re-sync and recovery are
initiated when a node starts up and connects to the network that
has already reached quorum. This is done to synchronize any changes
made to files when the node was off the network. When the recovery
process is started, every file in every directory is marked with a
special attribute. The attribute is removed after recovery is
performed. During the recovery operation the disk is not visible to
the local user. However, remote nodes can perform I/O operations on
the locally stored files not marked with the recovery attribute.
This ensures that data cannot be corrupted by
desynchronization.
[0120] The recovering node reads the directory from the lowest latency node using the QUERY_DIR_REQUEST/RESPONSE functions. The directory is compared to the locally stored metadata for the VSP. When comparing individual files, the following properties are taken into consideration (see the sketch after this list): [0121] Name--if the file is present on the source machine and not present on the local node, the file will be created using the received metadata and the file recovery process will be performed. If the file exists on the local node and does not exist on the remote node, it will be removed locally. Exactly the same procedure applies to directories (which are accessed recursively). [0122] Size of file--if the locally stored file size differs from that on the source node, the file is removed and recovered. [0123] Last modification time--if the modification time is different, the file is deleted and recovered. [0124] File attributes (e.g. read-only, hidden, archive)--unlike the previous parameters, in the case of a difference in file attributes the file is not deleted and recovered; instead, only the attributes are applied. In more extensive implementations, attributes such as Access Control List (ACL) and security information can be applied. Also, some implementations may include several additional attributes such as file versioning or snapshots.
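The per-file comparison of paragraphs [0121]-[0124] reduces to a small decision table, sketched below; the metadata structure and the action names are illustrative, not taken from the implementation:

#include <stdint.h>

struct ndfs_meta {
    int      present;        /* does the entry exist on this node?           */
    uint64_t size;
    uint64_t mtime;          /* timestamp carried with WRITE requests ([0125]) */
    uint32_t attrs;          /* read-only, hidden, archive, ...              */
};

enum recovery_action { KEEP, RECOVER_FILE, DELETE_LOCAL, APPLY_ATTRS };

static enum recovery_action compare_entry(const struct ndfs_meta *local,
                                          const struct ndfs_meta *remote)
{
    if (remote->present && !local->present) return RECOVER_FILE;  /* create+recover */
    if (!remote->present && local->present) return DELETE_LOCAL;  /* remove locally */
    if (local->size  != remote->size)       return RECOVER_FILE;  /* delete+recover */
    if (local->mtime != remote->mtime)      return RECOVER_FILE;  /* delete+recover */
    if (local->attrs != remote->attrs)      return APPLY_ATTRS;   /* attrs only     */
    return KEEP;
}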
[0125] Note that last modification time recovery would not make sense if local time were used on every machine. Instead, every WRITE and WRITE_MIRRORED request carries a timestamp generated by the requesting node in the packet payload, and this timestamp is assigned to the metadata for the file/directory on every node.
[0126] The per-file data recovery process is performed by first retrieving the file size from the metadata (which, prior to data recovery, has to be "metadata recovered"). Then the file size is divided into cluster sizes and standard READ_REQUESTs are performed to retrieve the data. An exception is the last cluster, which is retrieved from the metadata source node (lowest latency) using READ_MIRRORED_REQUEST. The last part of the recovery process comprises setting the proper metadata parameters (size, attributes, last modification time) on the file.
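A sketch of the per-file recovery flow described in paragraph [0126] follows; every routine called here is a stand-in for behaviour described in the text rather than an actual NDFS function:

#include <stdint.h>
#include <stdio.h>

#define VCS 7000u   /* VSP Cluster Size, as in the FIG. 5 example            */

/* Stand-ins for operations described in the specification.                  */
static uint64_t recovered_file_size(const char *p) { (void)p; return 25000u; }
static void read_cluster(const char *p, uint64_t off)        /* broadcast READ_REQUEST */
{ printf("READ_REQUEST %s @%llu\n", p, (unsigned long long)off); }
static void read_last_mirrored(const char *p, uint64_t off)  /* from lowest-latency node */
{ printf("READ_MIRRORED_REQUEST %s @%llu\n", p, (unsigned long long)off); }
static void apply_metadata(const char *p)                    /* size, attrs, mtime */
{ printf("set size/attributes/mtime on %s\n", p); }

static void recover_file(const char *path)
{
    uint64_t size = recovered_file_size(path);   /* metadata recovered first */
    uint64_t off  = 0;

    for (; off + VCS <= size; off += VCS)        /* full clusters: N of N+M  */
        read_cluster(path, off);

    if (off < size)                              /* mirrored final fragment  */
        read_last_mirrored(path, off);

    apply_metadata(path);
}

int main(void) { recover_file("docs/report.doc"); return 0; }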
[0127] File and attribute comparison is performed recursively for
all files and folders on the disk storage. When recovery is
finished all data is in sync and normal operations are resumed.
[0128] Alternative implementations of the invention can have dynamic recovery as opposed to recovery on startup only. For example, the networking thread can detect that the node has lost communication with the other nodes and perform recovery each time communication is restored.
[0129] As mentioned above, a live transaction log file (journaling) can assist such recovery, and the node could periodically check the journal or its serial number to detect whether any changes have been made that the node was unaware of. Also, the journal checking and the metadata and last-cluster recovery should be performed in a more distributed manner than simply trusting the node with the lowest latency.
[0130] While the above implementations have been described as implemented on Windows platforms, it will be seen that the invention can equally be implemented with other operating systems, as despite operating system differences a similar architecture to that shown in FIGS. 3 and 4 can be used.
[0131] In more extensive implementations of the invention, different security models can be applied to a VSP: [0132] Open Access--no additional security mechanisms; anyone with a compatible client can access the VSP. Only collision detection has to be performed to avoid data corruption. Standard Windows ACLs and Active Directory authentication will apply. [0133] Symmetric Key Access--a node trying to access the VSP will have to provide a shared pass-phrase. The data on the LSE and/or the protocol messages will be encrypted, and the pass-phrase will be used to decrypt data on the fly when doing N+M redundancy calculations. [0134] Certificate Security--in this security model, when forming a VSP, every node will have to exchange its public keys with every other node on the network. When a new node tries to access the VSP it will have to be authorized on every existing participating node (very high security).
[0135] While the implementations above have been described in terms of active clients, servers and peer nodes, it will be seen that the invention can easily be made available to legacy clients, for example, using Windows Share. It may be particularly desirable to allow only such legacy access to clients which are likely not to be highly available, for example a laptop, as becoming a peer in a VSP could place an undue recovery burden, not only on the laptop but also on the other nodes participating in the VSP, as the laptop connects to and disconnects from the network.
[0136] Further variations of the above described implementations are also possible. For example, rather than using an IP or MAC address to identify nodes participating in a VSP, a dedicated NODE_ID could be used. Administration functions could also be expanded to enable one node to be replaced with another node in the VSP, individual nodes to be added to or removed from the VSP, or the VSP geometry to be changed.
[0137] Additionally, the VSP could be implemented in a way that presents a continuous random access device formatted with a native file system such as FAT, NTFS or EXT/UFS on Unix. The VSP could also be used as a virtual magnetic tape device for storing backups using traditional backup software.
[0138] Native file system usage represents a potential problem in that multiple nodes, while updating the same volume, could corrupt the VSP file system metadata due to multi-node locking. To mitigate this, either a clustered file system would be used, or each node could access only a separate virtualized partition at a time.
[0139] For example, in a High Availability cluster such as Microsoft Cluster Server, Sun Cluster or HP Serviceguard, an HA Resource Group traditionally comprises a LUN, Disk Volume or partition residing on shared storage (a disk array or SAN) that is used only by that Resource Group and moves between nodes together with other resources. Referring now to FIG. 6, such a LUN or partition could be replaced with an NDFS VSP formed out of the cluster nodes and their internal disks, so removing the HA cluster software's dependency on shared physical storage.
[0140] It will be appreciated by persons skilled in the art that
the present invention is not limited to what has been particularly
shown and described herein above. In addition, unless mention was
made above to the contrary, it should be noted that all of the
accompanying drawings are not to scale. A variety of modifications
and variations are possible in light of the above teachings without
departing from the scope and spirit of the invention, which is
limited only by the following claims.
* * * * *