U.S. patent application number 12/758245 was filed with the patent office on 2010-04-12 and published on 2010-10-14 for virtual machine data backup.
This patent application is currently assigned to PHD Virtual Technologies. Invention is credited to RONALD T. McKELVEY, ALEXANDER D. MITTELL, JAMES ROSIKIEWICZ.
Application Number | 20100262797 12/758245 |
Document ID | / |
Family ID | 42935159 |
Filed Date | 2010-04-12 |
United States Patent
Application |
20100262797 |
Kind Code |
A1 |
ROSIKIEWICZ; JAMES ; et
al. |
October 14, 2010 |
VIRTUAL MACHINE DATA BACKUP
Abstract
Disclosed is a method and system for efficiently backing up a
virtual machine file. A virtual machine file is logically divided
into a plurality of fixed-size blocks of equal size, for example,
a number of 1 MB data blocks. An MD5 hash value is generated from
the contents of each block. Each block is written to a file having
a filename that includes a filesystem-compliant form (e.g.,
hexadecimal form) of the computed MD5 hash value. A backup device
includes a directory hierarchy having a plurality of first-level
directories corresponding to the first two hexadecimal characters
of the hash value, and a plurality of second-level directories
corresponding to the next two hexadecimal characters of the hash
value. Each unique block is stored once, in the directory
corresponding to the leading character pairs of its hash.
The present disclosure provides data integrity checking and reduces
storage requirements for duplicative, redundant, or null data.
Inventors: |
ROSIKIEWICZ; JAMES;
(Stockton, NJ) ; McKELVEY; RONALD T.; (Morris
Plains, NJ) ; MITTELL; ALEXANDER D.; (Cedar Knolls,
NJ) |
Correspondence
Address: |
CARTER, DELUCA, FARRELL & SCHMIDT, LLP
445 BROAD HOLLOW ROAD, SUITE 420
MELVILLE
NY
11747
US
|
Assignee: |
PHD Virtual Technologies
Mount Arlington
NJ
|
Family ID: |
42935159 |
Appl. No.: |
12/758245 |
Filed: |
April 12, 2010 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61168315 | Apr 10, 2009 | |
61168318 | Apr 10, 2009 | |
61172435 | Apr 24, 2009 | |
Current U.S.
Class: |
711/162 ;
707/640; 707/E17.007; 707/E17.01; 711/E12.001; 711/E12.103;
713/165 |
Current CPC
Class: |
G06F 11/1469 20130101;
G06F 16/10 20190101; G06F 11/1484 20130101; G06F 11/1438
20130101 |
Class at
Publication: |
711/162 ;
707/640; 713/165; 711/E12.001; 711/E12.103; 707/E17.007;
707/E17.01 |
International
Class: |
G06F 12/16 20060101
G06F012/16; G06F 12/00 20060101 G06F012/00; G06F 17/00 20060101
G06F017/00 |
Claims
1. A method for backing up computer data, comprising the steps of:
dividing a source data file into a plurality of fixed size blocks,
wherein each block is of equal blocksize; generating a unique block
identifier relating to the contents of a fixed size block; on a
destination storage device, providing a directory hierarchy having
a plurality of first-level directories corresponding to a first
portion of the unique block identifier and a plurality of
second-level directories corresponding to a second portion of the
unique block identifier; and storing a datablock file
representative of the fixed size block in a corresponding second
level directory.
2. The method in accordance with claim 1, further comprising:
providing an index file corresponding to the source data file; and
storing the unique block identifier in the index file.
3. The method in accordance with claim 1, wherein the fixed block
size is in a range of about 256 KB to about 8 MB.
4. The method in accordance with claim 1, further comprising the
step of compressing the datablock file representative of the fixed
size block.
5. The method in accordance with claim 1, further comprising the
step of encrypting the datablock file representative of the fixed
size block.
6. The method in accordance with claim 1, wherein the unique block
identifier is a hash generated in accordance with an MD5
algorithm.
7. The method in accordance with claim 1, further comprising the
step of naming the datablock file representative of a fixed size
block in accordance with the unique block identifier.
8. The method in accordance with claim 1, further comprising the
steps of: computing a unique block identifier of a stored datablock
file; retrieving a stored unique block identifier corresponding to
the stored datablock; determining a property of the stored
datablock by comparing the computed unique block identifier to the
stored unique block identifier.
9. The method in accordance with claim 1, further comprising:
determining whether the fixed size block consists of a simple data
pattern.
10. The method in accordance with claim 9, wherein the simple data
pattern is selected from a group consisting of all zeros, all ones,
and all nulls.
11. A system for performing data backup, comprising: a processor; a
storage device operably coupled to the processor; and a data backup
module including a set of instructions executable on the processor
for performing a method of data backup comprising the steps of:
dividing a source data file into a plurality of fixed size blocks,
wherein each block is of equal blocksize; generating a unique block
identifier relating to the contents of a fixed size block; on the
storage device, providing a directory hierarchy having a plurality
of first-level directories corresponding to a first portion of the
unique block identifier and a plurality of second-level directories
corresponding to a second portion of the unique block identifier;
and storing a datablock file representative of the fixed size block
in a corresponding second level directory.
12. The system in accordance with claim 11, wherein the method of
data backup further comprises the steps of: providing an index file
corresponding to the source data file; and storing the unique block
identifier in the index file.
13. The system in accordance with claim 11, wherein the fixed block
size is in a range of about 256 KB to about 8 MB.
14. The system in accordance with claim 11, wherein the method of
data backup further comprises the step of compressing the datablock
file representative of the fixed size block.
15. The system in accordance with claim 11, wherein the method of
data backup further comprises the step of encrypting the datablock
file representative of the fixed size block.
16. The system in accordance with claim 11, wherein the unique
block identifier is a hash generated in accordance with an MD5
algorithm.
17. The system in accordance with claim 11, wherein the method of
data backup further comprises the step of naming the datablock file
representative of a fixed size block in accordance with the unique
block identifier.
18. The system in accordance with claim 11, wherein the method of
data backup further comprises the steps of: computing a unique
block identifier of a stored datablock file; retrieving a stored
unique block identifier corresponding to the stored datablock;
determining a property of the stored datablock by comparing the
computed unique block identifier to the stored unique block
identifier.
19. The system in accordance with claim 11, wherein the method of
data backup further comprises the step of determining whether the
fixed size block consists of a simple data pattern.
20. The system in accordance with claim 19, wherein the simple data
pattern is selected from a group consisting of all zeros, all ones,
and all nulls.
21. Machine-readable media comprising a set of instructions
configured to perform the method of data backup in accordance with
claims 1 through 10.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims the benefit of and priority
to U.S. Provisional Application Ser. No. 61/168,315, filed on Apr.
10, 2009, entitled "VIRTUAL MACHINE DATA BACKUP"; U.S. Provisional
Application Ser. No. 61/168,318, filed on Apr. 10, 2009, entitled
"VIRTUAL MACHINE FILE-LEVEL RESTORATION"; and U.S. Provisional
Application Ser. No. 61/172,435, filed on Apr. 24, 2009, entitled
"VIRTUAL MACHINE DATA REPLICATION"; the entirety of each are hereby
incorporated by reference herein for all purposes.
BACKGROUND
[0002] 1. Technical Field
[0003] The present disclosure relates to computer data backup, and
in particular, to a system and method for performing block-level
backups of virtual machines, wherein backed up data is stored in
de-duplicated form in a hierarchical directory structure.
[0004] 2. Background of Related Art
[0005] Continuing advances in storage technology allow vast amounts
of digital data to be stored cheaply and efficiently. However, in
the event of a failure or catastrophe, equally vast amounts of data
can be lost. Therefore, data backup is a critical component of
computer-based systems. As used herein, the term "backup" may refer
to the act of creating copies of data, and may refer to the actual
backed-up copy of the original data. The original data typically
resides on a hard drive, or on an array of hard drives, but may
also reside on other forms of storage media, such as solid state
memory. Data backups are necessary for several reasons, including
disaster recovery, restoring data lost due to storage media
failure, recovering accidentally deleted data, and repairing
corrupted data resulting from malfunctioning or malicious
software.
[0006] A virtual machine (VM) is a software abstraction of an
underlying physical (i.e., hardware) machine which enables one or
more instances of an operating system, or even one or more
operating systems, to run concurrently on a physical host machine.
Virtual machines have become popular with administrators of data
centers, which can contain dozens, hundreds, or even thousands of
physical machines. The use of virtual servers greatly simplifies
the task of configuring and administering servers in a large scale
environment, because a virtual machine may be quickly placed into
service without incurring the expense of provisioning a hardware
machine at a data center. Virtualization is highly scalable,
enabling servers to be allocated or deallocated in response to
changes in demand. Support and administration requirements may be
reduced because virtual servers are readily monitored and accessed
using remote administration tools and diagnostic software.
[0007] In one aspect, a virtual server consists of three
components. The first component is virtualization software
configured to run on the host machine which performs the hardware
abstraction, often referred to as a hypervisor. The second
component is a data file which represents the filesystem of the
virtual machine, which typically contains the virtual machine's
operating system, applications, data files, etc. A virtual machine
data file may be a hard disk image file, such as, without
limitation, a Virtual Machine Disk (VMDK) format file. Thus,
for each virtual machine, a separate virtual machine file is
required. The third component is the physical machine on which the
virtualization software executes. A physical machine may include a
processor, random-access memory, internal or external disk storage,
and input/output interfaces, such as network, storage, and desktop
interfaces (e.g., keyboard, pointing device, and graphic display
interfaces).
[0008] In installations having many machines, traditional methods
of performing backups may become burdensome and tend to be unduly
resource-intensive, particularly in a virtual environment. In
addition, backing up multiple instances of essentially identical
virtual servers (as typically found in, e.g., "server farms" or in
clustered systems) often results in large amounts of redundant
backup data, which becomes difficult to manage and store. A backup
system which performs virtual server backups with increased
efficiency and effectiveness would be a welcome advance.
SUMMARY
[0009] The disclosed method processes 1 MB fixed-length blocks of
data of a virtual machine file. A unique identifier, such as,
without limitation, an MD5 hash, is created for each block of data.
The 1 MB of data can be compressed, or left uncompressed, and is
stored as a single file. The file name includes the MD5 hash value
of the 1 MB data block. The hash of this file is also saved to a
separate index file for later use to retrieve, validate, and
rebuild the backup data. The data blocks, whether in compressed or
uncompressed form, are stored at a storage destination, in a
directory structure consisting of 256 first level directories
designated as 00-FF, each having 256 second level directories
designated as 00-FF within, comprising 65,536 second level
directories in total. The 1 MB compressed (or uncompressed) data
files are stored in the directory structure based on the first four
hexadecimal characters of the hash, e.g.,
[0010] "./00/22/T.002249a8a218ef8a4da87550f388942d.gz".
[0011] The first four hexadecimal characters of the hash in the file
name are "0022". The file is stored in directory "./00/22/". The .gz
extension indicates the file is compressed.
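The naming scheme described above can be sketched in a few lines of Python. This is a minimal illustration only, not the disclosed implementation: the function name block_path is hypothetical, and the "T." prefix and ".gz" extension are simply copied from the example path in this disclosure.

```python
import hashlib
from pathlib import PurePosixPath


def block_path(block: bytes, compressed: bool = True) -> PurePosixPath:
    """Return the relative archive path for one data block (illustrative)."""
    digest = hashlib.md5(block).hexdigest()      # 32 lowercase hex characters
    first, second = digest[0:2], digest[2:4]     # two directory levels
    name = f"T.{digest}" + (".gz" if compressed else "")
    return PurePosixPath(first) / second / name


if __name__ == "__main__":
    # A tiny payload stands in for a 1 MB block.
    print(block_path(b"abc"))
```

Because the path depends only on the block contents, two identical blocks always map to the same file, which is what makes the de-duplication described in the following paragraphs possible.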
[0012] Subsequent backups are performed having as a destination the
same storage location. A hash is generated for each data block as
described above. A file query is made to the storage location to see
if there is already a file existing with the same hash. If the file
does not exist, the source data is written into the directory
hierarchy with the hash as the file name and an index file is
updated. If the file exists, then only the index file is updated
for the current backup being run.
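The write-if-absent behavior of this paragraph might look like the following sketch. It assumes gzip-compressed block files and an in-memory list standing in for the index file; store_block and its parameters are hypothetical names chosen for illustration.

```python
import gzip
import hashlib
from pathlib import Path


def store_block(root: Path, block: bytes, index: list[str]) -> bool:
    """Store one block under root unless an identical block already exists.

    Returns True when a new file was written, False when the block was a
    duplicate. The hash is appended to the index in both cases.
    """
    digest = hashlib.md5(block).hexdigest()
    target = root / digest[0:2] / digest[2:4] / f"T.{digest}.gz"
    is_new = not target.exists()                 # the "file query" step
    if is_new:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(gzip.compress(block))
    index.append(digest)                         # index is always updated
    return is_new
```

A production implementation would likely write to a temporary name and rename into place so that an interrupted backup cannot leave a partial block file under its final name; that refinement is omitted here.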
[0013] Over time the directory structure will accumulate data
blocks from all backups sent thereto. A separate index file is
created for each backup, and is used to keep track of the blocks of
data for, e.g., re-assembling the data blocks of the original source
during restoration.
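As a rough illustration of how an index can drive restoration, the sketch below treats the index as an ordered sequence of hash strings and assumes every block was stored gzip-compressed. The function name restore and the on-disk layout details are assumptions made for the example; the disclosure does not prescribe an index format.

```python
import gzip
import hashlib
from pathlib import Path
from typing import BinaryIO, Iterable


def restore(root: Path, index: Iterable[str], out: BinaryIO) -> None:
    """Re-assemble a source file by concatenating blocks in index order."""
    for digest in index:
        path = root / digest[0:2] / digest[2:4] / f"T.{digest}.gz"
        block = gzip.decompress(path.read_bytes())
        # Re-hash each block so a damaged file is caught during restore.
        if hashlib.md5(block).hexdigest() != digest:
            raise ValueError(f"block {digest} failed its hash check")
        out.write(block)
```

The same hash may appear several times in one index, so a block that occurs repeatedly in the source is stored once but written to the output once per occurrence.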
[0014] The use of a hash also provides a self-checking mechanism
which enables self-validation of the data within the stored file. A
routine is scheduled to run on an ad-hoc or periodic basis that
reads the data within a stored file, and validates the data in the
file to verify a match to the hash file name. If the data does not
match, the block is considered suspect, and is slated to be
deleted. All associated backups that include this data block are
flagged as "bad". The index file corresponding to backups so
flagged may additionally or alternatively include a "bad" flag.
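A validation pass of the kind described above could be sketched as follows. The file-name pattern ("T." prefix, optional ".gz" suffix) follows the examples in this disclosure; the function name find_suspect_blocks is hypothetical, and the sketch only reports suspect files rather than deleting them or flagging backups.

```python
import gzip
import hashlib
import re
import zlib
from pathlib import Path

# "T.<32 hex characters>" with an optional ".gz" suffix.
NAME_RE = re.compile(r"^T\.([0-9a-f]{32})(\.gz)?$")


def find_suspect_blocks(root: Path) -> list[Path]:
    """Return stored block files whose contents do not match their name."""
    suspects = []
    for path in sorted(root.glob("*/*/T.*")):
        match = NAME_RE.match(path.name)
        if match is None:
            continue                              # not a block file
        raw = path.read_bytes()
        try:
            data = gzip.decompress(raw) if match.group(2) else raw
        except (OSError, EOFError, zlib.error):
            suspects.append(path)                 # unreadable archive
            continue
        if hashlib.md5(data).hexdigest() != match.group(1):
            suspects.append(path)
    return suspects
```

The caller would then look up which index files reference each suspect hash in order to flag the affected backups as "bad".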
[0015] In an embodiment, the data blocks (e.g., the 1 MB data
blocks) may be evaluated to determine whether the data contained
therein exhibits a predefined ("special") data pattern. For example,
without limitation, a special data pattern may include a particular or
repeating pattern, e.g., a data block consisting entirely of zero
(00H) bytes. In this instance, a special hash is generated that
represents the special data block containing the particular data
pattern. The special hash may be hard-coded, defined in a database,
and/or defined in a configuration file. Since the contents of a
special data block are predefined, it is only necessary to record
the fact that the data block is special. It is unnecessary to store
the actual contents of a special block. Thus, for each data block
identified as special, the index file is updated accordingly and
the backup proceeds. In this manner, resources are conserved since
special blocks, e.g., null blocks, do not consume space on the
storage device, do not use communication bandwidth during backup
and restoration procedures, do not require as much computational
resources, and so forth. This provides an efficient way to skip
special (e.g., null) data in a given backup set.
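The special-block shortcut can be illustrated with a short sketch that handles only the all-zero case. The marker string ZERO_BLOCK_MARKER is a hypothetical stand-in for the "special hash"; the disclosure leaves its actual value to the implementation.

```python
import hashlib

# Hypothetical marker recorded in the index in place of a real hash.
ZERO_BLOCK_MARKER = "SPECIAL-ZERO"


def index_entry(block: bytes) -> str:
    """Return the index entry for a block, short-circuiting null blocks."""
    # bytes.count(0) counts 0x00 bytes; equality with the length means
    # every byte in the block is zero.
    if block and block.count(0) == len(block):
        return ZERO_BLOCK_MARKER        # nothing needs to be stored
    return hashlib.md5(block).hexdigest()
```

During restoration, an index entry equal to the marker would be expanded back into a block of zero bytes of the configured block size, with no file read required.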
[0016] In another aspect, disclosed is a method for backing up
computer data that includes the steps of dividing a source data
file into a plurality of fixed size blocks, wherein each block is
of equal blocksize. A unique block identifier relating to the
contents of a fixed size block is generated. On a destination
storage device, a directory hierarchy is provided having a
plurality of first-level directories corresponding to a first
portion of the unique block identifier, and a plurality of
second-level directories corresponding to a second portion of the
unique block identifier. A datablock file representative of the
fixed size block is stored in a corresponding second level
directory.
[0017] In yet another aspect, disclosed is machine-readable media
comprising a set of instructions configured to perform a method for
backing up computer data that includes the steps of dividing a
source data file into a plurality of fixed size blocks, wherein
each block is of equal blocksize. A unique block identifier
relating to the contents of a fixed size block is generated. On a
destination storage device, a directory hierarchy is provided
having a plurality of first-level directories corresponding to a
first portion of the unique block identifier, and a plurality of
second-level directories corresponding to a second portion of the
unique block identifier. A datablock file representative of the
fixed size block is stored in a corresponding second level
directory.
[0018] Also disclosed is a system for performing data backup that
includes a processor, a storage device operably coupled to the
processor, and a data backup module. The data backup module
includes a set of instructions executable on the processor for
performing a method of data backup. The method includes the steps
of dividing a source data file into a plurality of fixed size
blocks, wherein each block is of equal blocksize. A unique block
identifier relating to the contents of a fixed size block is
generated. On a destination storage device, a directory hierarchy
is provided having a plurality of first-level directories
corresponding to a first portion of the unique block identifier,
and a plurality of second-level directories corresponding to a
second portion of the unique block identifier. A datablock file
representative of the fixed size block is stored in a corresponding
second level directory.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The above and other aspects, features, and advantages of the
present disclosure will become more apparent in light of the
following detailed description when taken in conjunction with the
accompanying drawings in which:
[0020] FIG. 1 shows a block diagram of an embodiment of a virtual
machine backup system in accordance with the present
disclosure;
[0021] FIG. 2 is a flowchart of an embodiment of a virtual machine
backup method in accordance with the present disclosure;
[0022] FIG. 3 is a block diagram illustrating a directory hierarchy
of an embodiment of a virtual machine backup in accordance with the
present disclosure; and
[0023] FIG. 4 is a flow diagram of an embodiment of a virtual
machine backup in accordance with the present disclosure.
DETAILED DESCRIPTION
[0024] Particular embodiments of the present disclosure are
described hereinbelow with reference to the accompanying drawings;
however, it is to be understood that the disclosed embodiments are
merely examples of the disclosure, which may be embodied in various
forms. Well-known functions or constructions are not described in
detail to avoid obscuring the present disclosure in unnecessary
detail. Therefore, specific structural and functional details
disclosed herein are not to be interpreted as limiting, but merely
as a basis for the claims and as a representative basis for
teaching one skilled in the art to variously employ the present
disclosure in virtually any appropriately detailed structure. In
the discussion contained herein, the terms user interface element
and/or button are understood to be non-limiting, and include other
user interface elements such as, without limitation, a hyperlink,
clickable image, and the like.
[0025] Additionally, the present invention may be described herein
in terms of functional block components, code listings, optional
selections, page displays, and various processing steps. It should
be appreciated that such functional blocks may be realized by any
number of hardware and/or software components configured to perform
the specified functions. For example, the present invention may
employ various integrated circuit components, e.g., memory
elements, processing elements, logic elements, look-up tables, and
the like, which may carry out a variety of functions under the
control of one or more microprocessors or other control
devices.
[0026] Similarly, the software elements of the present invention
may be implemented with any programming or scripting language such
as C, C++, C#, Java, COBOL, assembler, PERL, Python, PHP, or the
like, with the various algorithms being implemented with any
combination of data structures, objects, processes, routines or
other programming elements. The object code created may be executed
by any computer having an Internet Web Browser, on a variety of
operating systems including Windows, Macintosh, and/or Linux.
[0027] Further, it should be noted that the present invention may
employ any number of conventional techniques for data transmission,
signaling, data processing, network control, and the like.
[0028] It should be appreciated that the particular implementations
shown and described herein are illustrative of the invention and
its best mode and are not intended to otherwise limit the scope of
the present invention in any way. Examples are presented herein
which may include sample data items (e.g., names, dates, etc.)
which are intended as examples and are not to be construed as
limiting. Indeed, for the sake of brevity, conventional data
networking, application development and other functional aspects of
the systems (and components of the individual operating components
of the systems) may not be described in detail herein. Furthermore,
the connecting lines shown in the various figures contained herein
are intended to represent example functional relationships and/or
physical or virtual couplings between the various elements. It
should be noted that many alternative or additional functional
relationships or physical or virtual connections may be present in
a practical electronic data communications system.
[0029] As will be appreciated by one of ordinary skill in the art,
the present invention may be embodied as a method, a data
processing system, a device for data processing, and/or a computer
program product. Accordingly, the present invention may take the
form of an entirely software embodiment, an entirely hardware
embodiment, or an embodiment combining aspects of both software and
hardware. Furthermore, the present invention may take the form of a
computer program product on a computer-readable storage medium
having computer-readable program code means embodied in the storage
medium. Any suitable computer-readable storage medium may be
utilized, including hard disks, CD-ROM, DVD-ROM, optical storage
devices, magnetic storage devices, semiconductor storage devices
(e.g., USB thumb drives) and/or the like.
[0030] The present invention is described below with reference to
block diagrams and flowchart illustrations of methods, apparatus
(e.g., systems), and computer program products according to various
aspects of the invention. It will be understood that each
functional block of the block diagrams and the flowchart
illustrations, and combinations of functional blocks in the block
diagrams and flowchart illustrations, respectively, can be
implemented by computer program instructions. These computer
program instructions may be loaded onto a general purpose computer,
special purpose computer, or other programmable data processing
apparatus to produce a machine, such that the instructions that
execute on the computer or other programmable data processing
apparatus create means for implementing the functions specified in
the flowchart block or blocks.
[0031] These computer program instructions may also be stored in a
computer-readable memory that can direct a computer or other
programmable data processing apparatus to function in a particular
manner, such that the instructions stored in the computer-readable
memory produce an article of manufacture including instruction
means that implement the function specified in the flowchart block
or blocks. The computer program instructions may also be loaded
onto a computer or other programmable data processing apparatus to
cause a series of operational steps to be performed on the computer
or other programmable apparatus to produce a computer-implemented
process such that the instructions that execute on the computer or
other programmable apparatus provide steps for implementing the
functions specified in the flowchart block or blocks.
[0032] Accordingly, functional blocks of the block diagrams and
flowchart illustrations support combinations of means for
performing the specified functions, combinations of steps for
performing the specified functions, and program instruction means
for performing the specified functions. It will also be understood
that each functional block of the block diagrams and flowchart
illustrations, and combinations of functional blocks in the block
diagrams and flowchart illustrations, can be implemented by either
special purpose hardware-based computer systems that perform the
specified functions or steps, or suitable combinations of special
purpose hardware and computer instructions.
[0033] One skilled in the art will also appreciate that, for
security reasons, any databases, systems, or components of the
present invention may consist of any combination of databases or
components at a single location or at multiple locations, wherein
each database or system includes any of various suitable security
features, such as firewalls, access codes, encryption,
de-encryption, compression, decompression, and/or the like.
[0034] The scope of the invention should be determined by the
appended claims and their legal equivalents, rather than by the
examples given herein. For example, the steps recited in any method
claims may be executed in any order and are not limited to the
order presented in the claims. Moreover, no element is essential to
the practice of the invention unless specifically described herein
as "critical" or "essential."
[0035] FIG. 1 illustrates a representative operating environment
100 for an example embodiment of a virtual machine backup system
105 in accordance with the present disclosure. Representative
operating environment 100 includes virtual machine backup system
105 which can be a personal computer (PC) or a server, which
further includes at least one system bus 150 which couples system
components, including at least one processor 110; a system memory
115 which may include random-access memory (RAM); at least one
storage device 130, such as without limitation one or more hard
disks, CD-ROMs or DVD-ROMs, or other non-volatile storage devices,
such as without limitation flash memory devices; and a data network
interface 140. System bus 150 may include any type of data
communication structure, including without limitation a memory bus
or memory controller, a peripheral bus, a virtual bus, a software
bus, and/or a local bus using any bus architecture such as without
limitation PCI, USB or IEEE 1394 (Firewire). Data network interface
140 may be a wired network interface such as a 100Base-T Fast
Ethernet interface, or a wireless network interface such as without
limitation a wireless network interface compliant with the IEEE
802.11 (i.e., WiFi), GSM, or CDMA standard.
[0036] Virtual machine backup system 105 may be operated in a
networked environment via data network interface 140, wherein
system 105 is connected to one or more virtual machine hosts 160 by
a data network 180, such as a local area network or the Internet,
for the transmission and reception of data, such as without
limitation backing up and restoring virtual machine data files as
will be further described herein. Each of the one or more virtual
machine hosts 160 may include one or more virtual machines 170
operating therein, as will be appreciated by the skilled
artisan.
[0037] Virtual machine backup system 105 includes a virtual machine
backup module 120 that is configured to perform a method of virtual
machine data backup as described herein. In an embodiment, virtual
machine backup module 120 includes a set of programmable
instructions adapted to execute on processor 110 for performing the
disclosed method of virtual machine data backup. In particular, a
method for backing up a virtual disk file or virtual machine file,
e.g., a VMDK file, is presented herein. With reference to FIG. 2, a
virtual machine file 420 slated for backup may be stored on a
storage device, such as without limitation, hard disk 410. While it
is contemplated that hard disk 410 may be included within a virtual
machine host, it is to be understood that a virtual machine file
420 may be stored on a hard disk array, such as a storage-area
network (SAN), a redundant array of independent disks (RAID),
network-attached storage (NAS) and/or on any storage medium now or
in the future known.
[0038] The virtual machine file 420 is logically divided into a
number of fixed-length blocks 430 of like size. In one embodiment,
a blocksize of 1 MB is used, however, it is to be understood that a
blocksize of less than 1 MB, or greater than 1 MB, may be used
within the scope of the disclosed method. In one aspect, the
blocksize is determined at least in part by a correlation between
performance and blocksize. Other parameters affecting blocksize may
include, without limitation, a data bus speed, a data bus width, a
virtual machine file size, a processor speed, a storage device
bandwidth, and a network throughput. If the size of a virtual machine
file is not an exact multiple of the chosen fixed blocksize, the
remainder may be padded with, e.g., zeros, nulls, or any other fill
pattern, to achieve a set of equal-sized blocks.
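The division and padding step described above may be sketched as a generator. The names iter_blocks and fill are hypothetical, and the sketch assumes a buffered binary stream (such as one returned by open(path, "rb")) whose read() returns a short chunk only at end of file.

```python
import io
from typing import BinaryIO, Iterator

BLOCK_SIZE = 1024 * 1024  # 1 MB, the block size used in one embodiment


def iter_blocks(stream: BinaryIO, block_size: int = BLOCK_SIZE,
                fill: bytes = b"\x00") -> Iterator[bytes]:
    """Yield equal-sized blocks, padding the final one with a fill byte."""
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            return                                  # end of source file
        if len(chunk) < block_size:
            chunk = chunk.ljust(block_size, fill)   # pad the remainder
        yield chunk


if __name__ == "__main__":
    # A 10-byte source with a 4-byte block size yields three blocks.
    for block in iter_blocks(io.BytesIO(b"abcdefghij"), block_size=4):
        print(block)
```

Because padding changes the length of the final block, an implementation would also need to record the original file size (for example, in the index file) so that the padding can be trimmed on restore.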
[0039] An individual backup data file 445 is created from each
fixed-length block 430 of the virtual machine file 420. In an
embodiment, individual backup data file 445 may be given a
temporary filename, and/or stored in a temporary location, e.g.,
/var/tmp/block000001.dat. A hash is generated according to the
contents of each individual backup data file. In an embodiment, a
128 bit MD5 hash is used to create the hash value from the
contents thereof. The resultant hash value is stored in an index
file corresponding to the current backup session, which is stored for
later use during, e.g., data restoration. The index file may
include, without limitation, a list of data blocks comprising the
backup set, hash values corresponding thereto, a date and time of
backup, a source location, and a destination location. A collection
of hash values representative of a backup of a virtual machine file,
and data associated therewith, may be stored in an index file 455.
Such a collection, together with the individual backup data files
comprising the backed-up virtual machine file 420 is known as a
"backup set."
[0040] Additionally or alternatively, the data block 430 may be
compressed during a compression step 432 using any suitable manner
of data compression, including without limitation, LZW, zip, gzip,
rar, and/or bzip. Preferably, lossless data compression is used;
however, in certain embodiments lossy data compression may
advantageously be used.
[0041] The hash value may be regarded as a unique block identifier,
or a unique identifier of a backup data file 445. A non-temporary
("archival") filename of the backup data file may be generated, at
least in part, from the hash value, as illustrated in step 434. For
example, the filename of a backup data file 445 may be created by
combining a hexadecimal representation of the hash value with a file
prefix and/or an appropriate file extension. Each backup data
file 445 comprising the virtual machine file therefore has a unique
filename based upon the hash value.
[0042] As seen in FIG. 3, a hierarchical directory structure 300 is
provided on a backup storage device, e.g., storage device 130, for
storing the backup data files. The disclosed structure has at a
first level thereof a plurality of directories 320 et seq. (e.g.,
folders). Each first level directory contains therein a plurality
of second level directories 330. In an embodiment, the hierarchy
includes 256 first level directories, wherein each first level
directory includes 256 second level directories, for a total number
of 65,536 directories. The first level and second level directories
may be named in accordance with an eight bit hexadecimal value,
e.g., 00-FF. Thus, for example, a plurality of first level
directories may be named in accordance with the series ./00, ./01,
./02 . . . ./FF while a second level of directories may be named
./00/00, ./00/01, ./00/02 . . . ./00/FF. Other directory mapping
schemes are envisioned within the scope of the present disclosure,
such as, without limitation, a directory hierarchy having fewer than
two levels, a directory hierarchy having greater than two levels, a
directory hierarchy having a directory naming convention that
includes fewer than an eight bit hexadecimal value, a directory
hierarchy having a directory naming convention that includes
greater than an eight bit hexadecimal value, and/or a directory
hierarchy having a directory naming convention that includes an
alternative naming encoding, such as octal, ASCII85, and the
like.
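A two-level hierarchy of this kind could be created along the following lines. This is a sketch under stated assumptions: the function names and the optional `names` parameter are hypothetical, and the upper-case directory names follow the "00-FF" series given above. An implementation would need to use the same letter case for directory names and for the hexadecimal form of the hash on case-sensitive filesystems.

```python
from pathlib import Path
from typing import Optional, Sequence


def directory_names() -> list[str]:
    """Return the 256 two-digit hexadecimal names 00 through FF."""
    return [f"{value:02X}" for value in range(256)]


def create_hierarchy(root: Path, names: Optional[Sequence[str]] = None) -> int:
    """Create every first-level/second-level directory pair under root.

    Returns the number of second-level directories, which is 65,536
    when the default 256 names are used.
    """
    if names is None:
        names = directory_names()
    count = 0
    for first in names:
        for second in names:
            (root / first / second).mkdir(parents=True, exist_ok=True)
            count += 1
    return count
```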
[0043] With reference now to FIG. 2, each backup data file may
advantageously be stored (e.g., copied or moved) in the directory
hierarchy in accordance with the first 4 bytes of the hash value
thereof. By way of example only, assume a backup data file
representing a 1 MB block of a virtual machine file has an MD5 hash
value of:
[0044] 010249a8a218ef8a4da87550f388942d
[0045] The backup data file may be compressed with gzip and renamed
in accordance with the present disclosure, e.g.:
[0046] T.010249a8a218ef8a4da87550f388942d.gz
[0047] Taking the first four bytes of the hash value, two at a
time, the destination directory is identified as:
[0048] ./01/02
[0049] The backup data file is stored in the identified destination
directory, hence the full pathname of the backup data file may be
expressed as:
[0050] ./01/02/T.010249a8a218ef8a4da87550f388942d.gz
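The mapping in the worked example can be expressed in a few lines. The sketch below reads the "bytes" of the hash value as characters of its hexadecimal form, which is the reading the example itself supports (01, then 02). The function name and default arguments are illustrative assumptions.

```python
def backup_path(md5_hex: str, prefix: str = "T.", extension: str = ".gz") -> str:
    """Map a hexadecimal MD5 string to its pathname in the hierarchy."""
    first_level = md5_hex[0:2]   # first two hexadecimal characters
    second_level = md5_hex[2:4]  # next two hexadecimal characters
    return f"./{first_level}/{second_level}/{prefix}{md5_hex}{extension}"


# The hash value from the example above maps to ./01/02/T.<hash>.gz
print(backup_path("010249a8a218ef8a4da87550f388942d"))
```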
[0051] In this manner, each unique data block 430 corresponds to a
backup data file 445 uniquely stored within the directory hierarchy
300. The present disclosure also contemplates a filename/directory
mapping which uses greater than, less than, and/or other than the
first four bytes of the hash value. During execution of a
subsequent backup process, a filename is generated as previously
described. A file query is made to the storage device, e.g., it is
determined whether a backup data file having the same filename
exists and if so, it is presumed the block is unchanged from the
prior backup, and the index file corresponding to the subsequent
backup is updated to include the existing (e.g., unchanged) block.
If, however, it is determined that a backup data file having the
same filename does not exist, it is presumed the block changed and
the newly-created backup data file is stored within the directory
hierarchy as previously described herein, and a corresponding entry
is written to the index file. In this manner, by ensuring that
duplicate copies of a data block are stored only once, increased
efficiencies, e.g., increased execution speed and reduced resource
usage, are provided by a backup performed in accordance with the
present disclosure.
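The existence check that drives this behavior might look like the following. It is a minimal sketch, assuming uncompressed ".dat" files and a local filesystem path as the storage device; `store_block` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def store_block(root: Path, block: bytes) -> tuple[str, bool]:
    """Store a block once; return (hash, True) only if a new file was written."""
    digest = hashlib.md5(block).hexdigest()
    target = root / digest[0:2] / digest[2:4] / f"{digest}.dat"
    if target.exists():
        # Same filename already present: the block is presumed unchanged.
        return digest, False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(block)
    return digest, True
```

In either case the caller would record the returned hash in the index file for the current backup, so that an unchanged block is referenced without being written a second time.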
[0052] Advantageously, the disclosed method provides data integrity
validation, which may identify data corruption. During data
integrity validation, a backup data block is read (and, if
required, expanded to an uncompressed form) whereupon a hash value
is generated from the stored contents therein and compared to the
hash value included in the filename. If the computed hash value
corresponds to the filename hash value, it is presumed the archived
data is correct and intact. If, however, a discrepancy is
identified between the expected (filename) hash value and the
actual (computed) hash value, the data block is flagged as bad. Any
backup sets that include a bad backup data file may also be flagged
as bad. Bad backup data files and/or backup sets may be slated for
immediate deletion, or may be scheduled for deletion at a future
time. Integrity validation may be performed on a periodic or
routine basis, or may be performed prior to data restoration from a
backup set.
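A validation pass of this kind could be sketched as below. The sketch assumes that gzip-compressed files carry a ".gz" suffix and that the filename contains exactly one 32-digit hexadecimal hash; the function name is hypothetical. A file that cannot be decompressed is treated as failing validation.

```python
import gzip
import hashlib
import re
import zlib
from pathlib import Path

HASH_IN_NAME = re.compile(r"[0-9a-f]{32}", re.IGNORECASE)


def is_block_intact(path: Path) -> bool:
    """Compare the hash embedded in the filename with the hash of the contents."""
    match = HASH_IN_NAME.search(path.name)
    if match is None:
        raise ValueError(f"no MD5 hash found in filename: {path.name}")
    expected = match.group(0).lower()
    raw = path.read_bytes()
    try:
        # Expand to uncompressed form if required before hashing.
        data = gzip.decompress(raw) if path.suffix == ".gz" else raw
    except (OSError, EOFError, zlib.error):
        return False  # unreadable compressed data is flagged as bad
    return hashlib.md5(data).hexdigest() == expected
```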
[0053] In another aspect, a virtual machine data block may be
evaluated to determine whether it contains all zero bytes, all one
bytes, contains null data, or exhibits some other relatively simple
data pattern which obviates the need to physically store such data
block. In this event, a unique "null" hash is generated and
included within the index file, together with any associated data,
without writing a backup data file to the storage device.
[0054] Turning to FIG. 4, an embodiment of the disclosed method
begins in the step 205 and in the step 210, the first datablock 430
of a virtual machine file 420 is read. In the step 215, the
datablock is evaluated to determine whether it exhibits a special
data pattern, e.g., whether the datablock consists entirely of
zeros (00H). If the datablock exhibits a special data pattern, in
the step 225 a corresponding special unique block identifier is
assigned to the datablock. In an embodiment, the special unique
block identifier is a 32-digit hexadecimal number consisting of all
zeros. The process continues with the step 245, as discussed
below.
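The special-pattern test of steps 215 and 225 can be illustrated as follows, with the ordinary MD5 path of the embodiment shown for contrast. Only the all-zeros pattern is handled in this sketch; the constant and function names are assumptions made for illustration.

```python
import hashlib

NULL_BLOCK_ID = "0" * 32  # 32-digit hexadecimal identifier consisting of all zeros


def block_identifier(block: bytes) -> str:
    """Return the special identifier for an all-zero block, else its MD5 hash."""
    if block.count(0) == len(block):  # every byte is 00H
        return NULL_BLOCK_ID
    return hashlib.md5(block).hexdigest()
```

A block that receives the special identifier needs only an index file entry; no backup data file is written for it.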
[0055] If, however, in the step 215 it is determined the datablock
does not exhibit a special data pattern, then in the step 220 a
hash function is performed on the contents of the datablock to
generate a unique block identifier corresponding to the datablock.
In an embodiment, the hash function is an MD5 hash function. The
step 230 is performed next wherein the destination directory 330 et
seq. within the directory hierarchy 300 is determined. The
destination directory 330 et seq. is based at least in part upon
the value of specific digits within the unique block identifier. In
an embodiment, the first two bytes of the unique block identifier
(e.g., the two most significant digits of the hash) represent the
first level directory 320 et seq. within the directory hierarchy.
The next two bytes of the unique block identifier (e.g., the next
two most significant digits of the hash) represent the second level
directory 330 et seq. within the directory hierarchy. The pathname
of the datablock file as stored within the directory hierarchy may
be formed by concatenating a pathname root string (e.g.,
"/mnt/bck/"), the first two significant hexadecimal digits of the
unique identifier (e.g., "01"), a directory delimiter character
(e.g., "/"), the next two significant hexadecimal digits of the
unique identifier (e.g., "02"), a directory delimiter character
(e.g., "/"), and the file name of the datablock (e.g.,
010249a8a218ef8a4da87550f388942d.dat).
[0056] In the step 235, the datablock 430 is optionally compressed
to reduce the amount of storage resources that will be required to
store the datablock file. In embodiments, the manner of compression
may be hard-coded, defined in a database (e.g., a registry
database), and/or defined in a configuration file (e.g., via a
"preferences" or "options" setting provided by a user interface or
by hand-editing a configuration file) in accordance with user
requirements. Any suitable manner of data compression may be
employed, including without limitation, LZW, zip, gzip, rar, and/or
bzip. Additionally or alternatively, the datablock may be
cryptographically encoded using any suitable cryptosystem,
including without limitation a symmetric-key cryptosystem (e.g.,
DES, Triple-DES, AES, and the like) or a public-key cryptosystem
(e.g., RSA, Diffie-Hellman, elliptic curve techniques, and the
like).
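A configurable compression step might be organized as a lookup from a setting to a compressor. The sketch below is illustrative only: the method names, the file extensions, and the `compress_block` function are assumptions, and Python's `bz2` module (bzip2) stands in for the "bzip" option named above. Encryption is omitted.

```python
import bz2
import gzip

# Maps a configuration value to (compression function, file extension).
COMPRESSORS = {
    "none": (lambda data: data, ".dat"),
    "gzip": (gzip.compress, ".gz"),
    "bzip2": (bz2.compress, ".bz2"),
}


def compress_block(block: bytes, method: str = "gzip") -> tuple[bytes, str]:
    """Compress a block with the configured method; return (data, extension)."""
    try:
        compress, extension = COMPRESSORS[method]
    except KeyError:
        raise ValueError(f"unsupported compression method: {method}") from None
    return compress(block), extension
```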
[0057] In the step 240, the datablock 430 (which may be in its
original form, compressed, encrypted, and/or combinations thereof)
is written to a corresponding datablock file 445 in the destination
directory 335 et seq. In the step 245, an index file entry 446
corresponding to the datablock 430 in an index file 445 is created.
The index file 445 may contain entries relating solely to the
current backup set, or may contain entries relating to a plurality
of backup sets. In an embodiment, the index file 445 includes a
database. For each corresponding datablock 430 identified within
the index file 445, an index file entry 446 may include, without
limitation, a unique block identifier value, a timestamp of the
backup set, a timestamp relating to the backup time of the
individual datablock, a datablock source location, and a datablock
destination location. In embodiments, the datablock source location
may include an identifier relating to the virtual machine from
which the backup set was generated, a virtual machine host
identifier, a machine name, a node name, a network address (e.g.,
an internet protocol address), a software identifier, a hardware
identifier, an encryption key, and the like. In embodiments, the
datablock destination location may include an identifier relating
to the storage device on which the datablock file 445 is stored, a
destination directory in which the datablock file 445 is stored, a
pathname of the datablock file, a filename of the datablock file, a
unique block identifier value, and the like.
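One possible shape for an index file entry is shown below. The field names, the ISO 8601 timestamp strings, and the JSON-lines serialization are all assumptions chosen for illustration; the disclosure does not prescribe a particular layout.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class IndexEntry:
    block_id: str         # unique block identifier (e.g., hexadecimal MD5)
    backup_set_time: str  # timestamp of the backup set
    block_time: str       # timestamp of the individual datablock backup
    source: str           # e.g., virtual machine or host identifier
    destination: str      # e.g., pathname of the datablock file


def serialize_entry(entry: IndexEntry) -> str:
    """Render an entry as a single JSON line suitable for appending to an index."""
    return json.dumps(asdict(entry), sort_keys=True)
```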
[0058] In the step 250, a test is performed whereby it is
determined whether all datablocks 430 of the virtual machine file
420 have been processed. If not, the method 200 iterates to the
step 210 wherein the next datablock 430 of the virtual machine file
420 is read, and processing proceeds as described hereinabove.
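Taken together, steps 210 through 250 amount to a single loop over the datablocks. The following is a simplified sketch rather than a definitive implementation: it assumes gzip compression, a ".dat.gz" extension, a local directory as the storage device, and an in-memory list of block identifiers in place of a full index file. The name `backup_file` is hypothetical.

```python
import gzip
import hashlib
from pathlib import Path
from typing import BinaryIO

NULL_BLOCK_ID = "0" * 32  # special identifier for all-zero blocks


def backup_file(stream: BinaryIO, root: Path, block_size: int = 1024 * 1024) -> list[str]:
    """Back up a stream block by block; return the ordered block identifiers."""
    index: list[str] = []
    while True:
        block = stream.read(block_size)           # step 210: read next datablock
        if not block:                             # step 250: all blocks processed
            break
        if block.count(0) == len(block):          # step 215: special pattern?
            block_id = NULL_BLOCK_ID              # step 225: special identifier
        else:
            block_id = hashlib.md5(block).hexdigest()           # step 220
            target_dir = root / block_id[0:2] / block_id[2:4]   # step 230
            target = target_dir / f"{block_id}.dat.gz"
            if not target.exists():               # store each unique block once
                target_dir.mkdir(parents=True, exist_ok=True)
                target.write_bytes(gzip.compress(block))        # steps 235, 240
        index.append(block_id)                    # step 245: index file entry
    return index
```

The returned list preserves block order, so a restoration routine could rebuild the virtual machine file by reading each identified block in sequence and emitting zeros for the special identifier.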
[0059] The present disclosure is also directed to a computer-based
apparatus and a computing system configured to perform a method of
data backup as described herein. Also disclosed is
computer-readable media comprising a set of instructions for
performing a method of data backup as described herein.
[0060] While several embodiments of the disclosure have been shown
in the drawings and/or discussed herein, it is not intended that
the disclosure be limited thereto, as it is intended that the
disclosure be as broad in scope as the art will allow and that the
specification be read likewise. Therefore, the above description
should not be construed as limiting, but merely as exemplifications
of particular embodiments. The claims can encompass embodiments in
hardware, software, or a combination thereof. Those skilled in the
art will envision other modifications within the scope and spirit
of the claims appended hereto.
* * * * *