Method for bulk deletion through segmented files Chron; Edward Gustav ; et al. [International Business Machines Corporation]

Patent Application Summary

U.S. patent application number 11/156842 was filed with the patent office on 2005-06-20 and published on 2012-02-23 for method for bulk deletion through segmented files. This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Edward Gustav Chron, Frederick Douglis, and Stephen Paul Morgan.

Application Number	20120047188 11/156842
Family ID	37574633
Filed Date	2005-06-20

United States Patent Application	20120047188
Kind Code	A9
Chron; Edward Gustav ; et al.	February 23, 2012

Method for bulk deletion through segmented files

Abstract

A mechanism is provided that aggregates data in a way that permits data to be deleted efficiently, while minimizing the overhead necessary to support bulk deletion of data. A request is received for automatic deletion of segments in a container and a waterline is determined for the container. A determination is made if at least one segment in the container falls below the waterline. Finally, in response to one segment falling below the waterline, the segment from the container is deleted. Each object has an associated creation time, initial retention value, and retention decay curve (also known as a retention curve). At any point, based on these values and the current time, the object's current retention value may be computed. The container system continually maintains a time-varying waterline: at any point, objects with a retention value below the waterline may be deleted.

Inventors:

Chron; Edward Gustav; (Sunnyvale, CA) ; Douglis; Frederick; (Basking Ridge, NJ) ; Morgan; Stephen Paul; (San Jose, CA)

Assignee:

International Business Machines Corporation
Armonk
NY

Prior Publication:

	Document Identifier	Publication Date
	US 20060288047 A1	December 21, 2006

Family ID:

37574633

Appl. No.:

11/156842

Filed:

June 20, 2005

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
10943397	Sep 17, 2004
11156842	Jun 20, 2005
10944597	Sep 17, 2004	7958093
11156842	Jun 20, 2005

Current U.S. Class:	707/813
Current CPC Class:	G06F 16/162 20190101
Class at Publication:	707/813
International Class:	G06F 17/30 20060101 G06F017/30

Claims

1. A method for bulk deletion through segmented files, the method comprising: receiving a request for automatic deletion of segments in a container; determining a waterline for the container; determining if at least one segment within a plurality of segments in the container falls below the waterline; and in response to the at least one segment falling below the waterline, deleting the at least one segment from the container.

2. The method of claim 1, wherein the waterline is set to a segment retention value, wherein the segment retention value is a function of information within the given segment and is a minimum value to retain the given segment.

3. The method of claim 2, wherein the minimum value is determined by at least one of the creation date of the given segment, the retention decay curve of the given segment, the initial retention value of the given segment, the current time or a date for deletion of the given segment.

4. The method of claim 1, wherein the waterline is a value determined by a function, wherein the function is determined by a retention decay curve of a given segment, and wherein determining if the at least one segment within the plurality of segments in the container falls below the waterline further comprises: identifying the at least one segment within the plurality of segments in the container whose value is below the waterline to form an identified segment; and deleting the identified segment from the container.

5. The method of claim 4, wherein segments that are not identified for deletion are not contiguous.

6. The method of claim 4, wherein segments that are not identified for deletion are contiguous.

7. The method of claim 1, wherein the waterline is a value determined by a function that converts a creation date of a given segment to the value and wherein determining if the at least one segment within the plurality of segments in the container falls below the waterline further comprises: scanning the plurality of segments in the container from a beginning of the container in ascending date order for the at least one segment whose value is above the waterline; and deleting the at least one segment from the beginning of the container up to the segment whose value is above the waterline.

8. The method of claim 1, wherein deleting the at least one segment from the container is performed by unmapping individual file blocks associated with the at least one segment.

9. A data processing system comprising: a bus system; a communications system connected to the bus system; a memory connected to the bus system, wherein the memory includes a set of instructions; and a processing unit connected to the bus system, wherein the processing unit executes the set of instructions to receive a request for automatic deletion of segments in a container; determine a waterline for the container; determine if at least one segment within a plurality of segments in the container falls below the waterline; and delete the at least one segment from the container in response to the at least one segment falling below the waterline.

10. The data processing system of claim 9, wherein the waterline is set to a segment retention value, wherein the segment retention value is a function of information within the given segment and is a minimum value to retain the given segment.

11. The data processing system of claim 10, wherein the minimum value is determined by at least one of the creation date of the given segment, the retention decay curve of the given segment, the initial retention value of the given segment, the current time or a date for deletion of the given segment.

12. The data processing system of claim 9, wherein the waterline is a value determined by a function, wherein the function is determined by a retention decay curve of a given segment, and wherein the set of instructions to determine if the at least one segment within the plurality of segments in the container falls below the waterline further comprises: a set of instructions to identify the at least one segment within the plurality of segments in the container whose value is below the waterline to form an identified segment; and delete the identified segment from the container.

13. The data processing system of claim 12, wherein segments that are not identified for deletion are not contiguous.

14. The data processing system of claim 12, wherein segments that are not identified for deletion are contiguous.

15. The data processing system of claim 9, wherein the waterline is a value determined by a function that converts a creation date of a given segment to the value and wherein the set of instructions to determine if the at least one segment within the plurality of segments in the container falls below the waterline further comprises: a set of instructions to scan the plurality of segments in the container from a beginning of the container in ascending date order for the at least one segment whose value is above the waterline; and delete the at least one segment from the beginning of the container up to the segment whose value is above the waterline.

16. The data processing system of claim 9, wherein the set of instructions to delete the at least one segment from the container is performed by a set of instructions to unmap individual file blocks associated with the at least one segment.

17. A computer program product comprising: a computer usable medium including computer usable program code for bulk deletion through segmented files, the computer program product including: computer usable program code for receiving a request for automatic deletion of segments in a container; computer usable program code for determining a waterline for the container; computer usable program code for determining if at least one segment within a plurality of segments in the container falls below the waterline; and computer usable program code for deleting the at least one segment from the container in response to the at least one segment falling below the waterline.

18. The computer program product of claim 17, wherein the waterline is set to a segment retention value, wherein the segment retention value is a function of information within the given segment and is a minimum value to retain the given segment.

19. The computer program product of claim 18, wherein the minimum value is determined by at least one of the creation date of the given segment, the retention decay curve of the given segment, the initial retention value of the given segment, the current time or a date for deletion of the given segment.

20. The computer program product of claim 17, wherein the waterline is a value determined by a function, wherein the function is determined by a retention decay curve of a given segment, and wherein the computer usable program code for determining if the at least one segment within the plurality of segments in the container falls below the waterline further comprises: computer usable program code for identifying the at least one segment within the plurality of segments in the container whose value is below the waterline to form an identified segment; and computer usable program code for deleting the identified segment from the container.

21. The computer program product of claim 20, wherein segments that are not identified for deletion are not contiguous.

22. The computer program product of claim 20, wherein segments that are not identified for deletion are contiguous.

23. The computer program product of claim 17, wherein the waterline is a value determined by a function that converts a creation date of a given segment to the value and wherein the computer usable program code for determining if the at least one segment within the plurality of segments in the container falls below the waterline further comprises: computer usable program code for scanning the plurality of segments in the container from a beginning of the container in ascending date order for the at least one segment whose value is above the waterline; and computer usable program code for deleting the at least one segment from the beginning of the container up to the segment whose value is above the waterline.

24. The computer program product of claim 17, wherein deleting the at least one segment from the container is performed by unmapping individual file blocks associated with the at least one segment.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present invention is related to the following applications entitled "System and Method for Optimizing a Storage System to Support Full Utilization of Storage Space," Ser. No. 10/943,397, filed on Sep. 17, 2004; and entitled "System and Method for Optimizing a Storage System to Support Short Data Lifetimes," Ser. No. 10/944,597, filed on Sep. 17, 2004. All of the above related applications are assigned to the same assignee, and incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates generally to an improved data processing system. More particularly, the present invention provides a mechanism for aggregating data in a way that permits data to be deleted efficiently, while minimizing the overhead necessary to support bulk deletion of data.

[0004] 2. Description of the Related Art

[0005] Early file systems were designed with the expectation that data would typically be read from disk many times before being deleted. Therefore, on-disk data structures were optimized for reading of data. However, as main memory sizes increased, more read requests could be satisfied from data cached in memory. This motivated file system designs that optimized write performance rather than read performance. However, the performance of such systems tends to suffer from overhead due to the need to garbage collect current, i.e. "live," data while making room for areas where new data can be written.

[0006] New types of systems are evolving in which, in addition to reading and writing of data, creation and deletion of data are important factors in the performance of the system. These systems tend to be systems in which data is quickly created, used and discarded. These systems also tend to be systems in which the available storage system resources are generally fully utilized. In such systems, the creation of data and deletion of this data is an important factor in the overall performance of the system.

[0007] However, known file systems, which are optimized for data reads or, alternatively, data writes, do not provide an adequate performance optimization for this new breed of systems. Previous file systems teach a method whereby a sequence of objects is stored in a set of storage segments. See "Position: Short Object Lifetimes Require a Delete-Optimized Storage System," by Douglis et al., 11th ACM SIGOPS European Workshop, September 2004, which is hereby incorporated by reference. Typically, such segments are fixed in size and pre-allocated. At any given time, a plurality of segments are available for storing newly written objects, with each segment holding objects with similar retention attributes, specifically a retention value and retention decay function. When an object is to be stored, a then-in-use segment is the first target of the store operation. Although the segment may be empty, typically, the segment already holds a plurality of other objects. Therefore, it would be advantageous to have a system and method for a mechanism that aggregates data in a way that permits data to be deleted efficiently, while minimizing the overhead necessary to support bulk deletion of data.

SUMMARY OF THE INVENTION

[0008] The present invention provides for a mechanism that aggregates data in a way that permits data to be deleted efficiently, while minimizing the overhead necessary to support bulk deletion of data. In the present invention, a request for automatic deletion of segments in a container is received and a waterline for the container is determined. A determination is made if at least one segment in the container falls below the waterline. Finally, in response to one segment falling below the waterline, the segment from the container is deleted. Each object has an associated creation time, initial retention value, and retention decay curve (also known as a retention curve). At any point, based on these values and the current time, the object's current retention value may be computed. The container system continually maintains a time-varying waterline: at any point, objects with a retention value below the waterline may be deleted.
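
By way of illustration only, the sequence summarized above (receive a request, determine a waterline, identify segments below it, delete them) might be sketched as follows. The names Segment, Container, and delete_below are hypothetical, and each segment's current retention value is assumed to have been computed already. The waterline is passed in as a parameter, since how it is determined is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    name: str
    retention_value: float  # assumed to be the segment's current retention value


@dataclass
class Container:
    segments: list = field(default_factory=list)

    def delete_below(self, waterline: float) -> list:
        """Delete every segment whose retention value is below the waterline.

        Returns the names of the deleted segments.
        """
        doomed = [s for s in self.segments if s.retention_value < waterline]
        self.segments = [s for s in self.segments if s.retention_value >= waterline]
        return [s.name for s in doomed]


container = Container([Segment("a", 0.2), Segment("b", 0.9), Segment("c", 0.4)])
deleted = container.delete_below(waterline=0.5)
remaining = [s.name for s in container.segments]
print(deleted)    # ['a', 'c']
print(remaining)  # ['b']
```

In this sketch the surviving segments need not be contiguous; a real container would release the underlying storage rather than drop list entries.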

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

[0010] FIG. 1 is an exemplary diagram of a distributed data processing system in which aspects of the present invention may be implemented;

[0011] FIG. 2 is an exemplary block diagram of a server computing device in which aspects of the present invention may be implemented;

[0012] FIG. 3 is an exemplary block diagram of a client computing device in which aspects of the present invention may be implemented;

[0013] FIG. 4 depicts an object header layout in accordance with an illustrative embodiment of the present invention;

[0014] FIG. 5 depicts an object trailer layout in accordance with an illustrative embodiment of the present invention;

[0015] FIG. 6 depicts an exemplary single-block object layout in accordance with an illustrative embodiment of the present invention;

[0016] FIG. 7 depicts a block header in accordance with an illustrative embodiment of the present invention;

[0017] FIG. 8 depicts a block trailer in accordance with an illustrative embodiment of the present invention;

[0018] FIG. 9 depicts an exemplary multi-block object layout in accordance with an illustrative embodiment of the present invention;

[0019] FIG. 10 depicts an exemplary data structure in accordance with an illustrative embodiment of the present invention;

[0020] FIG. 11 depicts an object header layout for sparse epochs in accordance with an illustrative embodiment of the present invention;

[0021] FIG. 12 depicts an epoch chain before storage unit deletion in accordance with an illustrative embodiment of the present invention;

[0022] FIG. 13 depicts an epoch chain after storage unit deletion in accordance with an illustrative embodiment of the present invention; and

[0023] FIG. 14 depicts a flow diagram illustrating an exemplary operation of aggregating data in a way that permits data to be deleted efficiently in bulk in accordance with an illustrative embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0024] The present invention provides for a mechanism for aggregating data in a way that permits data to be deleted efficiently, while minimizing the overhead necessary to support bulk deletion of data. FIGS. 1-3 are provided as exemplary diagrams of data processing environments in which embodiments of the present invention may be implemented. It should be appreciated that FIGS. 1-3 are only exemplary and are not intended to assert or imply any limitation with regard to the environments in which aspects or embodiments of the present invention may be implemented. Many modifications to the depicted environments may be made without departing from the spirit and scope of the present invention.

[0025] With reference now to the figures, FIG. 1 depicts a pictorial representation of a network of data processing systems in which aspects of the present invention may be implemented. Network data processing system 100 is a network of computers in which embodiments of the present invention may be implemented. Network data processing system 100 contains a network 102, which is the medium used to provide communications links between various devices and computers connected together within network data processing system 100. Network 102 may include connections, such as wire, wireless communication links, or fiber optic cables.

[0026] In the depicted example, server 104 connects to network 102 along with storage unit 106. In addition, clients 108, 110, and 112 connect to network 102. These clients 108, 110, and 112 may be, for example, personal computers or network computers. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 108-112. Clients 108, 110, and 112 are clients to server 104. Network data processing system 100 may include additional servers, clients, and other devices not shown.

[0027] In the depicted example, network data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, network data processing system 100 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN). FIG. 1 is intended as an example, and not as an architectural limitation for different embodiments of the present invention.

[0028] Referring to FIG. 2, a block diagram of a data processing system that may be implemented as a server, such as server 104 in FIG. 1, is depicted in accordance with an illustrative embodiment of the present invention. Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors 202 and 204 that connect to system bus 206. Alternatively, a single processor system may be employed. Also connected to system bus 206 is memory controller/cache 208, which provides an interface to local memory 209. I/O bus bridge 210 connects to system bus 206 and provides an interface to I/O bus 212. Memory controller/cache 208 and I/O bus bridge 210 may be integrated as depicted.

[0029] Peripheral component interconnect (PCI) bus bridge 214 connected to I/O bus 212 provides an interface to PCI local bus 216. A number of modems may be connected to PCI local bus 216. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to clients 108-112 in FIG. 1 may be provided through modem 218 and network adapter 220 connected to PCI local bus 216 through add-in connectors.

[0030] Additional PCI bus bridges 222 and 224 provide interfaces for additional PCI local buses 226 and 228, from which additional modems or network adapters may be supported. In this manner, data processing system 200 allows connections to multiple network computers. A memory-mapped graphics adapter 230 and hard disk 232 may also be connected to I/O bus 212 as depicted, either directly or indirectly.

[0031] Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 2 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.

[0032] The data processing system depicted in FIG. 2 may be, for example, an IBM eServer™ pSeries® computer system, running the Advanced Interactive Executive (AIX®) operating system or Linux operating system (eServer, pSeries and AIX are trademarks of International Business Machines Corporation in the United States, other countries, or both, while Linux is a trademark of Linus Torvalds in the United States, other countries, or both).

[0033] With reference now to FIG. 3, a block diagram of a data processing system is shown in which aspects of the present invention may be implemented. Data processing system 300 is an example of a computer, such as client 108 in FIG. 1, in which code or instructions implementing the processes for embodiments of the present invention may be located. In the depicted example, data processing system 300 employs a hub architecture including a north bridge and memory controller hub (MCH) 308 and a south bridge and input/output (I/O) controller hub (ICH) 310. Processor 302, main memory 304, and graphics processor 318 are connected to MCH 308. Graphics processor 318 may be connected to the MCH through an accelerated graphics port (AGP), for example.

[0034] In the depicted example, local area network (LAN) adapter 312, audio adapter 316, keyboard and mouse adapter 320, modem 322, read only memory (ROM) 324, hard disk drive (HDD) 326, CD-ROM drive 330, universal serial bus (USB) ports and other communications ports 332, and PCI/PCIe devices 334 connect to ICH 310. PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, PC cards for notebook computers, etc. PCI uses a card bus controller, while PCIe does not. ROM 324 may be, for example, a flash binary input/output system (BIOS). Hard disk drive 326 and CD-ROM drive 330 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. A super I/O (SIO) device 336 may be connected to ICH 310.

[0035] An operating system runs on processor 302 and coordinates and provides control of various components within data processing system 300 in FIG. 3. The operating system may be a commercially available operating system such as Microsoft® Windows® XP (Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both). An object oriented programming system, such as the Java™ programming system, may run in conjunction with the operating system and provides calls to the operating system from Java programs or applications executing on data processing system 300 (Java is a trademark of Sun Microsystems, Inc. in the United States, other countries, or both).

[0036] Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 326, and may be loaded into main memory 304 for execution by processor 302. The processes for embodiments of the present invention are performed by processor 302 using computer implemented instructions, which may be located in a memory such as, for example, main memory 304, memory 324, or in one or more peripheral devices 326 and 330. These processes may be executed by any processing unit, which may contain one or more processors.

[0037] Those of ordinary skill in the art will appreciate that the hardware in FIGS. 1-3 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIGS. 1-3. Also, the processes of the present invention may be applied to a multiprocessor data processing system.

[0038] As some illustrative examples, data processing system 300 may be a personal digital assistant (PDA), which is configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data.

[0039] A bus system may be comprised of one or more buses, such as system bus 206, I/O bus 212 and PCI buses 216, 226 and 228 as shown in FIG. 2. Of course, the bus system may be implemented using any type of communications fabric or architecture that provides for a transfer of data between different components or devices attached to the fabric or architecture. A communications unit may include one or more devices used to transmit and receive data, such as modem 218 or network adapter 220 of FIG. 2 or modem 322 or LAN adapter 312 of FIG. 3. A memory may be, for example, local memory 209 or cache such as found in memory controller/cache 208 of FIG. 2 or main memory 304 of FIG. 3. A processing unit may include one or more processors or CPUs, such as processor 202 or processor 204 of FIG. 2 or processor 302 of FIG. 3. The depicted examples in FIGS. 1-3 and above-described examples are not meant to imply architectural limitations. For example, data processing system 300 also may be a tablet computer, laptop computer, or telephone device in addition to taking the form of a PDA.

[0040] The present invention may be implemented in a distributed data processing environment or in a stand-alone computing system. For example, the present invention may be implemented in a server, such as server 104, or client computing device, such as clients 108-112. Moreover, aspects of the present invention may be implemented using storage device 106 in accordance with the present invention as described hereafter.

[0041] The configuration of the present invention is based upon a number of observations made of log-structured file systems. Therefore, a brief explanation of a log-structured file system will first be made. In its earliest incarnation, the log-structured file system was envisioned as a single contiguous log in which data was written at one end of a wrap-around log and free space was created at the other end by copying "live" files to the first end. This had the disadvantage that long-lived data would be continually garbage collected, resulting in high overhead. The problem of long-lived data was solved by segmenting the log into many fixed-size units, which were large enough to amortize the overhead of a disk seek relative to writing an entire unit contiguously. These units, called "segments," were cleaned in the background by copying live data from segments with low utilization (i.e., most of the segment already consists of deleted data) to new segments of entirely live data. See "The Design and Implementation of a Log-Structured File System," by Rosenblum and Ousterhout, ACM Transactions on Computer Systems, 1991, which is hereby incorporated by reference.

[0042] In an illustrative embodiment of the present invention, if sufficient space is available in an appropriate segment, an object is copied into the end of the segment; otherwise, the remaining space in the segment is marked as unused, the segment is marked as full, and a new unused segment becomes the target of the store. An object is a unit of data access. If an object exactly fills a segment, the segment is marked as full, and all space in the segment is marked as used. Unused space in a segment is known as fragmented storage. In the embodiment, an object larger than a single segment is stored as a special case of a single file that is created for the purpose of storing the object.
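
The store rule just described might be sketched as follows, purely as an illustration. SEGMENT_SIZE, StoreSegment, and SegmentWriter are hypothetical names, the 1,024-byte segment size is an arbitrary assumption chosen to keep the example small, and the special case of an object larger than a segment is rejected rather than handled.

```python
from dataclasses import dataclass, field

SEGMENT_SIZE = 1024  # hypothetical fixed segment size in bytes


@dataclass
class StoreSegment:
    size: int = SEGMENT_SIZE
    used: int = 0
    unused: int = 0      # space written off as fragmented when the segment is closed
    full: bool = False
    objects: list = field(default_factory=list)


class SegmentWriter:
    def __init__(self):
        self.segments = [StoreSegment()]

    def store(self, name: str, length: int) -> StoreSegment:
        if length > SEGMENT_SIZE:
            raise ValueError("objects larger than a segment are a special case")
        seg = self.segments[-1]
        if seg.full or seg.size - seg.used < length:
            # Not enough room: mark the remainder unused, mark the segment
            # full, and make a new unused segment the target of the store.
            seg.unused = seg.size - seg.used
            seg.full = True
            seg = StoreSegment()
            self.segments.append(seg)
        seg.objects.append(name)       # copy the object into the end of the segment
        seg.used += length
        if seg.used == seg.size:       # object exactly fills the segment
            seg.full = True
        return seg


writer = SegmentWriter()
first = writer.store("a", 600)
second = writer.store("b", 600)   # does not fit in the 424 bytes left in `first`
third = writer.store("c", 424)    # exactly fills the second segment
```

After these three stores, the first segment is full with 424 bytes of fragmented storage, and the second is full with none.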

[0043] Each object has an associated creation time, initial retention value, and retention decay curve (also known as a retention curve). At any point, based on these values and the current time, the object's current retention value may be computed. The container system continually maintains a time-varying waterline: at any point, objects with a retention value below the waterline may be deleted.
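
As a minimal sketch of this computation, a retention curve can be modeled as a function of the initial retention value and the object's age. The linear decay used below is an assumption made only for illustration, since no particular curve is prescribed here, and the names StoredObject, linear_decay, and deletable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# A retention curve maps (initial retention value, age in seconds) to a current value.
RetentionCurve = Callable[[float, float], float]


def linear_decay(rate_per_second: float) -> RetentionCurve:
    """Hypothetical curve: lose `rate_per_second` units of retention per second."""
    def curve(initial: float, age: float) -> float:
        return max(0.0, initial - rate_per_second * age)
    return curve


@dataclass
class StoredObject:
    creation_time: float
    initial_retention: float
    curve: RetentionCurve

    def retention_at(self, now: float) -> float:
        """Current retention value from creation time, initial value, and curve."""
        return self.curve(self.initial_retention, now - self.creation_time)


def deletable(obj: StoredObject, now: float, waterline: float) -> bool:
    """An object may be deleted once its retention value is below the waterline."""
    return obj.retention_at(now) < waterline


obj = StoredObject(creation_time=0.0, initial_retention=100.0, curve=linear_decay(1.0))
print(obj.retention_at(30.0))        # 70.0
print(deletable(obj, 30.0, 50.0))    # False
print(deletable(obj, 60.0, 50.0))    # True
```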

[0044] In an illustrative embodiment of the present invention, objects with the same initial retention value and retention curve are placed in segments identified to hold such objects exclusively, with the segment being assigned a segment creation time equal to the creation time of the object most recently stored in it. Objects in a segment may thereby be evaluated and deleted en masse. Changing an object's retention curve therefore involves moving the object from one segment to another. Moving an object from a source to a destination segment could involve renaming the object, in turn requiring directories, if any, that identify the object's source to be updated to identify the object's destination, or alternative and/or additional means and/or methods to be applied. Simply removing the object from the source could increase fragmentation, as the space formerly occupied by the object may not be readily reusable until the segment as a whole is re-usable, i.e., until all objects in the segment have been deleted.
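
The placement rule above might be sketched as follows, with segments keyed by the pair of initial retention value and retention curve. The names RetentionSegment and SegmentIndex are hypothetical, the curve is represented by an assumed string identifier, and segment capacity is ignored for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class RetentionSegment:
    initial_retention: float
    curve_id: str
    creation_time: float = 0.0
    objects: list = field(default_factory=list)


class SegmentIndex:
    def __init__(self):
        # (initial retention value, curve id) -> segment holding such objects exclusively
        self._open = {}

    def place(self, name, creation_time, initial_retention, curve_id):
        key = (initial_retention, curve_id)
        seg = self._open.get(key)
        if seg is None:
            seg = self._open[key] = RetentionSegment(initial_retention, curve_id)
        seg.objects.append(name)
        # The segment takes the creation time of the object most recently stored in it.
        seg.creation_time = creation_time
        return seg


index = SegmentIndex()
s1 = index.place("a", 10.0, 100.0, "linear")
s2 = index.place("b", 20.0, 100.0, "linear")   # same key: same segment
s3 = index.place("c", 15.0, 50.0, "linear")    # different initial value: new segment
```

One plausible reading, offered as an inference rather than a statement from the text, is that when the shared curve never increases with age, the most recently stored object has the highest retention value in the segment, so evaluating the segment at its creation time is enough to decide the fate of every object in it.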

[0045] The present invention realizes a container as a single, potentially large file. Modern file systems support files logically reaching sizes of up to 2^64 bytes. Even at the very substantial write rate of 2^30 bytes per second, it would take upwards of 500 years to fill a single container of 2^64 bytes. Presumably, file systems supporting yet larger file sizes, e.g., 2^128 bytes, will be available before file size becomes a limiting factor.
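
The figure of upwards of 500 years can be checked with a few lines of arithmetic. The only assumption added here is a year length of 365.25 days.

```python
CONTAINER_BYTES = 2**64
WRITE_RATE_BYTES_PER_SECOND = 2**30
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # assumed year length

seconds_to_fill = CONTAINER_BYTES // WRITE_RATE_BYTES_PER_SECOND   # 2**34 seconds
years_to_fill = seconds_to_fill / SECONDS_PER_YEAR

print(seconds_to_fill)        # 17179869184
print(round(years_to_fill))   # 544
```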

[0046] A container file comprises an ordered list of file blocks, each of a fixed size, starting at offset zero, aligned on block boundaries. Without loss of generality we assume hereinafter that file blocks are 4,096 bytes in length. This is in contrast to a file system storage allocation unit which may be considerably larger, e.g., 1 Mbyte.

[0047] A file block is a logical entity; at any point it may be mapped by being associated with an identically-sized disk block, or it may remain unmapped. It is a function of a file system to transparently maintain the mapping. We assume further, again without loss of generality, that a modern file system returns logical zeroes for data retrieved from an unmapped file block.
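
The zero-read assumption can be observed with an ordinary file, as in the sketch below: a file block that was never written reads back as zeroes. Whether the skipped blocks are actually left unmapped on disk depends on the underlying file system, which this example does not check.

```python
import os
import tempfile

BLOCK = 4096  # file block size assumed in the text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container")
    with open(path, "wb") as f:
        f.seek(2 * BLOCK)        # skip file blocks 0 and 1 without writing them
        f.write(b"x" * BLOCK)    # write file block 2 only
    with open(path, "rb") as f:
        hole = f.read(BLOCK)     # file block 0 was never written
        f.seek(2 * BLOCK)
        data = f.read(BLOCK)
    size = os.path.getsize(path)

print(hole == bytes(BLOCK))   # True: the unwritten block reads as zeroes
print(size)                   # 12288
```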

[0048] In an illustrative embodiment of the present invention, an object is stored in a container file, starting and ending on a file block boundary. Objects are allocated an integral number of file blocks. If nothing but zero-length objects were stored in a container, one file block would be used per object, and fragmentation would be relatively high. Objects typically are larger, often substantially so. Typically, only a small amount of space allocated to objects is fragmented. The actual amount of file block fragmentation is dependent upon the distribution of object sizes and the file block size and cannot in general be estimated a priori.

[0049] Objects in a container abut each other, i.e., the only gaps between objects are those needed to bring an object to a file-block boundary. Objects may be of practically unlimited size, up to the maximum size of the container.

[0050] Turning to FIG. 4, an object header layout is depicted in accordance with an illustrative embodiment of the present invention. Each object starts with an object header 400. Object header 400 comprises object header magic number 402, object length 404, certain object flags 406, object creation time 408, object retention curve 410, container generation 412, hash vector 414, epochal object offset 416, and sync-point object offset 418. Object header 400 also comprises a reserved area 420.

[0051] Object header magic number 402 appears in a valid object header. Object header magic number 402 is a means for the container system to check for certain types of errors. Object length 404 indicates the actual amount of data associated with the object, not including padding to bring the allocated space up to a multiple of file blocks in length. The number of blocks allocated to the object may be computed directly from this number.

[0052] Object flags 406 indicate various things about the object. The meanings of various flags are described where and as necessary. Container generation 412 will be described further with respect to object tokens. Hash vector 414 is the hash initialization vector used for tokens generated for the container. The use of hash vector 414 will be described further with respect to retrieving an object from a container. Object creation time 408 corresponds to the time that the object was created. Some convention must be followed when assigning time values. Object retention curve 410 is an identifier for a mathematical function. Given the current time, the object creation time, the initial retention value, and the object retention curve, the retention value of the present object may be computed.

[0053] Epochal object offset 416 refers to the last object in a previous epoch. Epochal object offset 416 will be described further with respect to epochs and their establishment. Sync-point object offset 418 refers to an object recently known by the container system to have been sync-pointed. An object has been sync-pointed if and only if every disk block associated with the object, and with every previously created object, has been written to disk.

[0054] FIG. 5 depicts an object trailer layout in accordance with an illustrative embodiment of the present invention. Each object ends with an object trailer 500. Object trailer 500 comprises object trailer magic number 502 and relative offset 504. Object trailer magic number 502 appears in a valid object trailer. Object trailer magic number 502 is a means for the container system to check for certain types of errors. Relative offset 504 may be used to determine the start of the object. Relative offset 504 is the offset in bytes from the start of object trailer 500 to the start of object header 400 of FIG. 4 describing the object with which the file block is associated. For a single-block object, the field contains the value -4088LL. The field also may be used to determine whether the file block was completely (i.e., atomically) written to disk.

[0055] Thus, a single-block object may have a layout such as that depicted in FIG. 6 in accordance with an illustrative embodiment of the present invention. In single-block object layout 600, object header 602 is at the beginning of a single-block object and object trailer 604 is at the end. Object header 602 and object trailer 604 are separated by object data 606.

[0056] Objects may be larger than a single block. These are stored in multiple adjacent blocks and may comprise, in addition to object headers and object trailers, block headers and block trailers. Other than the first block, every block includes a block header, residing at the beginning of the block. Other than the last block, every block includes a block trailer, residing at the end of the block. The block header and trailer serve two purposes. First, they indicate whether the block was completely (i.e., atomically) written to disk. Second, they identify the object with which the block is associated, and its relative offset within the object.

[0057] FIG. 7 depicts a block header in accordance with an illustrative embodiment of the present invention. Block header 700 comprises block header magic number 702 and relative offset 704. Block header magic number 702 appears in a valid block header. Block header magic number 702 is a means for the container system to check for certain types of errors. Relative offset 704 may be used to determine the start of the object. Relative offset 704 is the offset in bytes from the start of the block header to the start of the object header describing the object with which the file block is associated. The field also may be used to determine whether the file block was completely (i.e., atomically) written to disk. For the second block in a multi-block object, the relative offset is -4096LL.

[0058] FIG. 8 depicts a block trailer in accordance with an illustrative embodiment of the present invention. Block trailer 800 comprises block trailer magic number 802 and relative offset 804. Block trailer magic number 802 appears in a valid block trailer. Block trailer magic number 802 is a means for the container system to check for certain types of errors. Relative offset 804 may be used to determine the start of the object. Relative offset 804 is the offset in bytes from the start of the block trailer to the start of the object header describing the object with which the file block is associated. The field also may be used to determine whether the file block was completely (i.e., atomically) written to disk. For the second block in a multi-block object, the relative offset is -8184LL.

[0059] Thus, a multi-block object may have a layout such as that depicted in FIG. 9 in accordance with an illustrative embodiment of the present invention. In multi-block object layout 900, object A header 902 is at the beginning of a multi-block object and object A trailer 904 is at the end. In between object A header 902 and object A trailer 904 are three data blocks: object A/1 data block 906, object A/2 data block 908, and object A/3 data block 910. Each data block has a header and a trailer; however, in the case of a multi-block object, an in-between header is considered a block header such as block A/2 header 912 and block A/3 header 914. Additionally, an in-between trailer is considered a block trailer such as block A/1 trailer 916 and block A/2 trailer 918.
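The layout arithmetic described for FIGS. 4 through 9 can be sketched as follows. The 4,096-byte block and the 8-byte trailer follow from the values given in the text (-4088LL, -4096LL, -8184LL). The object header and block header sizes are not stated in the application, so the values used here are assumptions chosen only to make the example concrete.

```python
BLOCK_SIZE = 4096          # file block size assumed in the text
TRAILER_SIZE = 8           # implied by the -4088 offset of a single-block trailer
OBJECT_HEADER_SIZE = 64    # hypothetical: the text does not give this size
BLOCK_HEADER_SIZE = 8      # hypothetical: the text does not give this size

def header_relative_offset(block_index):
    """Offset from the header of block `block_index` (0-based) back to the object header."""
    return -(block_index * BLOCK_SIZE)

def trailer_relative_offset(block_index):
    """Offset from the trailer of block `block_index` (0-based) back to the object header."""
    return -((block_index + 1) * BLOCK_SIZE - TRAILER_SIZE)

def blocks_allocated(object_length):
    """Number of file blocks needed to hold `object_length` bytes of object data."""
    first_block_room = BLOCK_SIZE - OBJECT_HEADER_SIZE - TRAILER_SIZE
    if object_length <= first_block_room:
        return 1                      # even a zero-length object uses one block
    later_block_room = BLOCK_SIZE - BLOCK_HEADER_SIZE - TRAILER_SIZE
    overflow = object_length - first_block_room
    return 1 + -(-overflow // later_block_room)   # ceiling division
```

Under these assumptions, `trailer_relative_offset(0)` reproduces the -4088 value of a single-block object, and `header_relative_offset(1)` and `trailer_relative_offset(1)` reproduce the -4096 and -8184 values given for the second block.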

[0060] Objects are appended to a container in the same order as they are created. As objects are appended, the container's file blocks are modified. The file system may lazily write modified file blocks to one or more disks in an order convenient to the file system. Disk blocks are not necessarily written to disk in the same order as their associated file blocks appear in the file or were modified. That is, the disk block corresponding to the i-th file block may be written after the disk block corresponding to the j-th file block, where i<j.

[0061] Herein, it is assumed that a file block will be completely (i.e., atomically) written to disk or not at all; a file block cannot be partly written. It is further assumed that a file block that has been allocated but has not had its underlying disk block written, when read back, will comprise logical zeroes. Modern file systems generally can provide these features.

[0062] Once an object has been stored in a container, it subsequently may be retrieved via an object token, as depicted in FIG. 10 in accordance with an illustrative embodiment of the present invention. Token 1000 comprises container number 1004, object offset 1006, object length 1008, object creation time 1010, container generation 1012, and hash value 1014. Token 1000 also comprises a reserved area 1002.

[0063] Container number 1004 indicates the container with which token 1000 is associated. Object offset 1006 indicates the offset of the object within the container. Object length 1008 indicates the actual amount of data associated with the object, not including padding to bring the allocated space up to a multiple of file blocks in length. Object creation time 1010 indicates the time that the object was created. While this field has high resolution, its accuracy may be limited. Container generation 1012 is the reuse label associated with the container. Hash value 1014 is a secure hash of all of the preceding token fields, primed with a container hash initialization vector. Hash value 1014 allows the container system to detect any modification of token 1000 by an application.

[0064] Object offset 1006 may be reused if a container identifier is reused. Container generation 1012 differentiates between reuses of a container identifier. When creating an object, its container generation 1012 is set to the generation of the container. Container generation 1012 may be incremented on reuse, may be a randomly-chosen number, or may be chosen via some other means and method. The algorithm and the value chosen are not depended upon.

[0065] The size of token hash value 1014 was chosen so as to be large enough for various well-known algorithms, including Secure Hash Algorithm 1 (SHA-1) and Message Digest 5 (MD5).

[0066] A closed container may be clean or dirty. A clean container is one that does not need to be recovered: its contents are internally consistent. It may have been closed before the most recent system failure; alternatively, it may have been open yet not have been modified for some time. A container is clean if its last object refers to the immediately preceding object as a sync-point object; otherwise, it is dirty.

[0067] In normal operation, a producer application puts an object into a container. Upon successful completion, the container system returns a token for the object. As previously described, the token contains various fields including identifiers for the container and object, the object's length, and its creation time. The container system supports objects with no minimum and no (practical) maximum size. Multiple producers may put objects into the same container "simultaneously." The container system adds them to the container according to a serializable schedule. Objects are time-stamped by creation time; however, an object's time stamp may not be entirely accurate. For this reason, objects placed in a container in a certain order may have time stamps in a different order. More precisely, an object with time stamp i may appear in the container after an object with time stamp j, where i<j. However, the container system limits the degree to which objects may appear "out of order," i.e., object i may appear after object j only if j-i<limit.

[0068] A producer may transmit a token via some mechanism beyond the scope of the present discussion, to one or more consumer applications. A consumer may retrieve the object from the container--if the object still is available and valid--by presenting the token to the container system. Objects need not be retrieved from a container in the same order that they were put into it. In fact, an object need not be retrieved at all. Multiple consumers may retrieve objects from the store simultaneously; indeed, the same object may be retrieved by multiple consumers simultaneously. As a token is not made available until an object has been put into a container, a consumer cannot retrieve an object that is not yet (fully) in the container.

[0069] If the present invention is implemented on a cluster of computers supporting a cluster file system, e.g., IBM's General Parallel File System (GPFS.TM.), containers may be shared among producers and consumers running simultaneously on multiple computers in a single cluster.

[0070] In certain cases, it may be desirable for producers not to send tokens to consumers. The invention provides a means for a consumer, given a token for an object in a certain container, to retrieve the next object in the container. Complementing this means is a means to determine a container's first object. With these means, one or more producers may put a sequence of objects into a container, and a set of consumers may retrieve the objects, simply by sharing the identity of the container.

[0071] The invention manages storage in a manner similar to a delete-optimized store, at least at a high level. See "Position: Short Object Lifetimes Require a Delete-Optimized Storage System," by Douglis et al., 11th ACM SIGOPS European Workshop, September 2004, which is hereby incorporated by reference. As previously described, each object is evaluated according to its retention curve, its initial retention value, its creation time, and the current time. The resulting retention value is compared to a dynamically-computed waterline and, if below, the object is deleted. However, in other aspects, the method of the present invention differs substantially from that of the original proposal.
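The evaluation described above can be illustrated with a minimal sketch. The application identifies a retention curve only as "an identifier for a mathematical function," so the curve identifiers and formulas below are hypothetical examples, as are all function names.

```python
# Hypothetical curve registry: identifier -> f(initial_value, age_in_seconds).
# None of these curves is specified by the application; they are placeholders.
RETENTION_CURVES = {
    "constant": lambda v0, age: v0,
    "linear_1day": lambda v0, age: max(0.0, v0 * (1.0 - age / 86400.0)),
    "half_life_1h": lambda v0, age: v0 * 0.5 ** (age / 3600.0),
}

def retention_value(initial_value, creation_time, curve_id, now):
    """Current retention value from initial value, creation time, curve, and current time."""
    age = max(0.0, now - creation_time)
    return RETENTION_CURVES[curve_id](initial_value, age)

def may_delete(initial_value, creation_time, curve_id, now, waterline):
    """True if the current retention value has fallen below the waterline."""
    return retention_value(initial_value, creation_time, curve_id, now) < waterline
```

Because a segment holds only objects with the same initial retention value and retention curve, a single call using the segment creation time could stand in for evaluating every object in the segment.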

[0072] The invention supports immutable objects, i.e., objects that once created, are not changeable. There are several reasons for this choice. In one aspect, objects abut each other within a container. Extending an object in place could require moving one or more objects or storing an object in pieces. It would be problematic to move objects, as the object's token refers to the object's offset in its container. If the object were to move, a method to determine the object's "forwarding address" would need to be implemented.

[0073] One way to implement a forwarding address would be to add the address to the original object, e.g., in an expanded header or within the old data body, and then apply it during the object retrieval process. However, since the old data body will likely have been deleted, a "tombstone" directing to a new location is not practical. Another approach would be to create a look-aside table that would be checked for a forwarding address for the object before retrieving the object. Checking a look-aside table prior to each object access could add potentially substantial overhead to the cost of an access. Of course, the look-aside table could instead be checked only after a failed object retrieval. The main issue then would be maintenance of the look-aside table without depending upon synchronization with the container system. Still another approach is to provide an automated means to convert an object's address into a new location, such as a specific file name. Automating the forwarding address via filename lookup is simple but has the disadvantage of adding overhead to each lookup of an object that has been deleted rather than relocated.

[0074] In another aspect of the present invention, an immutable object would have a fixed size, whereas a mutable object might not. For reasons of applications programmability and performance, the object length is included in its token.

[0075] Given an object's length, the application can allocate a buffer of sufficient size to hold the object prior to retrieving it. Not knowing the object's length beforehand, the application would have to guess, allocating a buffer of the hoped-for size. Alternatively, the application might allocate a buffer sufficiently large to hold a very large object. Upon attempting to retrieve an object too large for the buffer, the application would be told the object's actual length, would allocate a buffer sufficient to hold the object, and would try again to retrieve it. However, as the object is mutable, it might have grown in the interim. In the worst case, the application might have to try repeatedly to retrieve the object.

[0076] Along the same lines, knowing from the token the length of the object and its offset within the container file prior to retrieving it, the container system may schedule a disk read for the entire object at once. Were the length of the object not stored in the token, the container system first would have to retrieve it from the object header; thus, two disk reads would have to be scheduled and executed. In the first read, the object's header would be retrieved. The object's length would be extracted from the header then a second disk read, for the object's body, would be scheduled and executed. Although in the end the same disk blocks would be read, doing so as two reads versus one may inhibit performance, e.g., by increasing latency.

[0077] In the case where performance is inhibited, various optimizations may be applied. For example, the first disk read might be expanded to include not just the block containing the object header, but additional disk blocks, e.g., totaling eight or 16, on the assumption that "most" objects would be smaller than that and, therefore, a second read "typically" would prove unnecessary.

[0078] Applying read-ahead as an optimization eliminates the possibility of another very desirable one: reading the object's data blocks directly into the application's buffer. Without the latter optimization, the disk blocks typically would be read into a container system buffer then moved to the application's buffer. This move would add computation and memory bus overhead, as well as complicating the management of container system buffers.

[0079] Applying the read-ahead optimization yet reading the object's data into the application's buffer could introduce security problems. If the object were in fact smaller than the number of disk blocks read, data from a subsequent object could end up in the application's buffer. To eliminate this issue, the container system might subsequently have to overwrite in the application's buffer certain bytes written "inadvertently" (or, more properly, insecurely) therein. To do so might be problematic in certain cases, e.g., if the container system could be interrupted after the disk read but before the bytes had been overwritten.

[0080] In general, it is unclear whether object read-ahead would even be an effective optimization. In many cases there will be a large variance in the length of objects within and among containers. Different default read-ahead lengths might be appropriate for different cases. For a first container, it might be optimal to read ahead by four disk blocks, but by sixteen for a second. The read-ahead parameter could be set manually as one of a container's attributes or it could be computed dynamically by the container system. Of course, read-ahead would be unnecessary if an object's token were to include its length.

[0081] Suppose an existing mutable object were to comprise multiple disk blocks. A subsequent write to the object might fail for any of several reasons, including a full or partial system crash. A failed write may result in some blocks being written but not others; that is, an incomplete write would result.

[0082] In the simplest implementation of mutable objects, wherein objects lack on-disk trailers, it would be impossible to determine that any given write was incomplete. Worse, parts of one object might show up in another. Clearly, this would be undesirable as far as applications go. It might also have potential security implications.

[0083] A slightly more sophisticated implementation would incorporate object trailers including matching generations. The generations would be compared upon object retrieval: Non-matching generations would indicate an incomplete write. However, matching generations would not necessarily indicate a complete write. A write might complete to the header and trailer but not to all intermediate blocks. This case cannot be detected by object generations, headers, and trailers.

[0084] There may be a performance impact of using generations as well as object headers and trailers. Unless the object, including its header and trailer, were read in a single operation, multiple I/Os would in general be required to determine whether the write completed, because the header and trailer would be read separately. Depending on the size of the object, separating the two I/Os might prove time-intensive (i.e., slow).

[0085] In a more sophisticated implementation, object signatures could be used to determine whether a write completed. At write time, the (entire) object would be signed and the signature stored in the object's trailer. The signature could be computed by hashing the full object or a portion of every block (on the assumption that block writes are atomic). To implement this technique would require that the object be scanned by the container system, both while being stored and retrieved. Potentially, to do so would have a substantial impact on performance: Object reads could otherwise be implemented without copying using direct I/O. For the container system to scan an object, each block of an object would have to be copied to a container system buffer, a portion of each block would have to flow through the computer's data cache, and a computation would have to be performed on the cached data. Of course, with the signing implementation, the same issue regarding reading the object in a single call vs. multiple calls would exist.

[0086] An aim of the present invention is to exploit the file system's features and functions, and to avoid wherever possible re-implementing similar functionality. In this section, we presume that the file system presents a modern interface based closely on the POSIX model.

[0087] When an object is put in a container, it is appended to the container file via a single, append-mode write( ) operation. As POSIX guarantees that append-mode writes are atomic and serialized with respect to each other, application- and system-level locking are unnecessary with respect to object producers. Internally, of course, the file system must coordinate concurrency among competing, "simultaneous" appending programs.

[0088] When writing the object, a header and a trailer are constructed in the container system's memory. The header takes the format previously depicted in FIG. 4. The header magic number is set from a container system constant. The object flags are cleared. The object generation is the generation for the container into which the object is to be put. The object length is the number of bytes of data associated with the object. The object creation time is the present time of day. The object retention curve is either passed as a parameter by the producer application or is inferred from the container's attributes.

[0089] During normal operation, the container system maintains for each container an imprecise epochal object, an imprecise sync-point object, and a first object. Except for the first object, the offsets of these are copied into the corresponding object header fields. The object trailer takes the format depicted in FIG. 5. The trailer magic number is set from a container system constant. The object relative offset is computed with respect to the object header.

[0090] An iovec structure next is constructed pointing to these items as well as to the buffer identified by the application as containing the object's data. Then, the iovec structure is passed into an append-mode write( ). The header, data, and trailer are appended in order to the file, atomically and serially, in a single, sequential disk write( ).
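The gather-write step can be sketched with Python's `os.writev`, which wraps the POSIX `writev` call. This is a minimal illustration limited to single-block objects. The header size, the magic constants, and the byte encodings of the header and trailer are all assumptions, since the application does not specify them; only the 4,096-byte block and the -4088 trailer offset come from the text.

```python
import os
import struct

BLOCK_SIZE = 4096
TRAILER_SIZE = 8              # implied by the -4088 relative offset in the text
HEADER_SIZE = 64              # hypothetical header size
HEADER_MAGIC = 0x4F424A48     # hypothetical magic constants
TRAILER_MAGIC = 0x4F424A54

def append_single_block_object(fd, data):
    """Append header, data, padding, and trailer with one gather write.

    `fd` is expected to have been opened with os.O_APPEND.
    Returns the number of bytes written.
    """
    room = BLOCK_SIZE - HEADER_SIZE - TRAILER_SIZE
    if len(data) > room:
        raise ValueError("this sketch handles single-block objects only")
    # Hypothetical header encoding: magic and data length, zero-filled to size.
    header = struct.pack("<IQ", HEADER_MAGIC, len(data)).ljust(HEADER_SIZE, b"\0")
    padding = bytes(room - len(data))
    # Hypothetical trailer encoding: magic plus the offset back to the header.
    trailer = struct.pack("<Ii", TRAILER_MAGIC, -(BLOCK_SIZE - TRAILER_SIZE))
    # The list plays the role of the iovec: buffers are written in order.
    return os.writev(fd, [header, data, padding, trailer])
```

The application's data buffer is passed as its own list entry, so it is not copied into a staging buffer before the write. A fuller implementation would also check for a short write.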

[0091] Upon the successful completion of the write( ), the object token is created. The container number is the identifier for the container into which the object is being stored. The object length, creation time, and generation are copied from the object header.

[0092] As append-mode write( ) was used, the object offset is only known a posteriori and must be determined. This is accomplished via a two-step computation. First, the file position is extracted from the FILE * data structure that the container system used to write( ) to the container file. The file position indicates the logical end of the file; it may differ from the actual file end as multiple producers may be placing objects into the same container (file) simultaneously. The FILE * structure contains a cached version of the file position as of the completion of the producer's most recent write( ). Second, the number of bytes appended for the object, i.e., its allocated length including header, trailer, and padding, is subtracted from the file position. The result is the object offset.
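The two-step offset computation can be sketched as follows. The example uses a raw file descriptor rather than a FILE * structure, and it subtracts the full number of bytes the write placed in the file; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
import os

def append_and_locate(fd, payload):
    """Append `payload` via an O_APPEND descriptor and return its starting offset."""
    written = os.write(fd, payload)
    # Step 1: this descriptor's position as of the completion of its own write.
    position = os.lseek(fd, 0, os.SEEK_CUR)
    # Step 2: subtract the bytes just written to recover where the object began.
    return position - written
```

The position belongs to the producer's own descriptor, so appends made concurrently through other descriptors do not disturb the result.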

[0093] Finally, the hash value is computed by applying a secure hash algorithm, primed with a container hash initialization vector, to the other token fields.

[0094] Once the token has been computed, it may be returned to the producer, which may in turn distribute the token freely. Possession of the token for an object is a requirement for the possessor to access the object, though access may be further restricted by additional security mechanisms.

[0095] If changed blocks of the container file were written to disk as soon as a producer put an object in the container, the container system typically would perform poorly. If the blocks were instead written asynchronously, the producer could perform other work while the blocks were being written. On the other hand, if the blocks may be written asynchronously, the complexity of recovering after a system crash is increased. The system of the present invention incorporates a method of lazy synchronization. Several optimizations, some of which will be described below, may be incorporated to balance performance and recovery time/object loss in case of a system failure.

[0096] Objects are added to a container sequentially; we expect that the objects may not be retrieved from the container for some time and even then, it is possible that only a small percentage of the objects added will be retrieved. Given these expectations, we expect further that a container appears to be a sequentially written file that is later accessed either sequentially in full or randomly in only a small part.

[0097] It is possible that a container will appear as a sequentially written, sequentially read file where the producers and consumers typically operate within a few objects of each other. However, in many environments, especially those involving clustered systems, such an arrangement might tend to perform poorly, as the producers and consumers might tend to compete for the same resources, and conflict for the same file system locks.

[0098] Modern file systems tend to detect and specially handle files being written sequentially. That is, they typically attempt to avoid "polluting" the cache of disk blocks being used for other purposes, with blocks that are being accessed only sequentially. Generally, disk blocks associated with files being written sequentially are scheduled for writing to disk as soon as possible after they have been modified by an application. File systems typically make very little, if any, effort to keep "dirty" (modified but unwritten) blocks of such files in cache. Some file systems are notably more aggressive than others in this regard; nevertheless, it is an important and widely-adopted optimization.

[0099] Many operating systems in addition periodically schedule long-lived, dirty blocks for writing to disk. For example, UNIX.TM. and similar operating systems periodically exercise a sync( ) routine that schedules for writing all dirty disk blocks. Often, such operating systems more and more aggressively handle dirty disk blocks that remain in the disk cache even after multiple sync( ) cycles.

[0100] In many and perhaps the vast majority of cases, the sequential file "trickling" to disk and periodic sync( ) calls will be sufficient for the degree of synchronization required to implement containers efficiently and with reasonable semantics. However, other steps may be taken, to "harden" the semantics. For example, the container system itself might periodically initiate a sync( ) call, to encourage dirty blocks to be written to disk in a timely manner. Another technique is to request asynchronous "call-backs" when disk blocks have been written to disk. For this and other reasons, as will be obvious to one skilled in the art, asynchronous I/O is a generally useful technique to apply to container implementation. Various means and methods for implementing asynchronous I/O and interfaces for the same are well known in the art and, as such, are not described herein. Yet another strategy is to write all changes to the container file synchronously. While the slowest in terms of performance, it may be the most desirable option in some circumstances.

[0101] In the present invention, a container file is written sequentially. The underlying file system must allocate space to sequential files in an intelligent manner. Virtually all modern file systems handle space allocation to sequentially-written files very efficiently. A common technique, when the file initially is small, is to start by allocating a relatively small amount of storage to the file. Then, as the file system detects that the file is being written sequentially, larger and larger amounts of storage are allocated at a time, up to a certain maximum size. When the file is eventually closed, allocated but unused storage is freed.

[0102] Retrieving an object from a container is a much simpler matter than putting one in it. The container system is passed a token and a location of a buffer into which the object's data is to be copied. The token identifies the container, the object (by its offset within the container), and its length. The container system presumes that the calling application allocated a buffer large enough to hold the object's data. If not, the consequences are up to the application and operating system.

[0103] An application retrieving an object from a container does not lock the container. The container system relies on the file system to lock its data structures to the extent necessary. Retrieving the object works as follows.

[0104] The token's container number and generation are extracted. If the container number is in use and the generation number extracted from the token matches that of the container, the container's hash initialization vector is located. A secure hash is computed for the token starting with the container's hash initialization vector. If the hash value computed matches that of the token, the token is valid. Next, the container file is opened, and an iovec structure is built, according to which the object's header, data, and trailers will be read. Its header and trailer will be copied into container system buffers and its data will be copied into an application buffer. The amount of data to be read is known by the application and the container system from the token's object length field.
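The validation steps can be sketched as below. SHA-1 is one of the algorithms the application names. The field encoding, the modeling of "priming" as prefixing the initialization vector to the hashed bytes, and all function and key names are assumptions made for illustration.

```python
import hashlib
import hmac
import struct

def token_hash(container, offset, length, ctime, generation, init_vector):
    """Hash the token fields, primed with the container's initialization vector."""
    # Assumed encoding: five little-endian unsigned 64-bit fields.
    fields = struct.pack("<QQQQQ", container, offset, length, ctime, generation)
    # Assumption: "priming" is modeled by hashing the vector ahead of the fields.
    return hashlib.sha1(init_vector + fields).digest()

def make_token(container, offset, length, ctime, generation, init_vector):
    """Build a token as a dict; the layout is hypothetical."""
    return {
        "container": container, "offset": offset, "length": length,
        "ctime": ctime, "generation": generation,
        "hash": token_hash(container, offset, length, ctime, generation, init_vector),
    }

def validate_token(token, containers):
    """`containers` maps a container number to its generation and init vector."""
    entry = containers.get(token["container"])
    if entry is None or entry["generation"] != token["generation"]:
        return False          # container not in use, or identifier was reused
    expected = token_hash(token["container"], token["offset"], token["length"],
                          token["ctime"], token["generation"], entry["iv"])
    return hmac.compare_digest(expected, token["hash"])
```

A token whose offset or length has been altered fails the final comparison, and a token issued for an earlier use of the container number fails the generation check.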

[0105] Reading the header, data, and trailer is carried out via a single read( ) operation; however, a number of iovec structure entries may be required to read the data associated with an object. Object (resp., block) headers and trailers are stored in each file block. These must be skipped over when reading. As the starting offset of an object is known, the location of each object offset can be computed and placed in the iovec prior to the read( ) operation. The object offsets actually encountered during the read( ) may be stored into an array by the read( ) operation, and subsequently checked to ensure that each block intended to be retrieved was in fact valid. Alternatively, if a validity check is not needed, the object offsets may be read into a "dummy" buffer and then discarded.
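The scatter read described above can be sketched with `os.readv`. The data lands directly in the application buffer, while headers, trailers, and padding go to separate scratch buffers. The object header and block header sizes are assumptions, as are the function names; only the block size and the 8-byte trailer follow from the text.

```python
import os

BLOCK_SIZE = 4096
OBJECT_HEADER_SIZE = 64    # hypothetical size
BLOCK_HEADER_SIZE = 8      # hypothetical size
TRAILER_SIZE = 8           # implied by the -4088 offset in the text

def build_read_plan(length):
    """List of (kind, nbytes) pieces covering the object's blocks in file order."""
    plan, remaining, first = [], length, True
    while True:
        head = OBJECT_HEADER_SIZE if first else BLOCK_HEADER_SIZE
        room = BLOCK_SIZE - head - TRAILER_SIZE
        take = min(remaining, room)
        plan.append(("meta", head))
        if take:
            plan.append(("data", take))
        if room - take:
            plan.append(("pad", room - take))
        plan.append(("meta", TRAILER_SIZE))
        remaining -= take
        first = False
        if remaining == 0:
            return plan

def read_object(fd, offset, length):
    """Read an object with one readv call; return (data, header/trailer buffers)."""
    app_buffer = bytearray(length)          # stands in for the application's buffer
    view = memoryview(app_buffer)
    buffers, metas, pos = [], [], 0
    for kind, nbytes in build_read_plan(length):
        if kind == "data":
            buffers.append(view[pos:pos + nbytes])   # data goes straight to the app buffer
            pos += nbytes
        else:
            scratch = bytearray(nbytes)              # container system or "dummy" buffer
            if kind == "meta":
                metas.append(scratch)
            buffers.append(scratch)
    os.lseek(fd, offset, os.SEEK_SET)
    os.readv(fd, buffers)
    return bytes(app_buffer), metas
```

The returned header and trailer buffers can then be checked for valid magic numbers and relative offsets. A fuller implementation would handle short reads and the system limit on the number of iovec entries.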

[0106] Multiple objects may be retrieved from the same container by different applications without blocking and, indeed, without concurrency control beyond that provided by the file system.

[0107] In some cases it may be desirable to retrieve the first object in a container. At all times, the offset of a container's first object is computable. As objects are deleted from a container, either programmatically or automatically, the container's first object changes. Occasionally, a computation that has just completed may be found in the next step to be invalid. In that case, the computation must be redone. The method of the present invention is forgiving of a stale computation of the first object.

[0108] Given a token, the next object in the container can be identified. There are two cases: either the current object exists or it doesn't. An object that no longer exists would have been deleted due to aging. (There is no means to explicitly delete an object.) It is simple to test whether an object exists or not. An object exists if and only if its offset is the same as or larger than that of the container's first object. If the current object exists, the next object is determined by computing from the current object the offset of the next object. The header of the next object is read. A token may be constructed for the next object, the object data may be returned, or both. If the current object does not exist, the first object is chosen as the next object.
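
The two cases above may be illustrated by the following minimal sketch. It assumes, for simplicity, that objects occupy whole blocks and are laid out contiguously; the function name and parameters are hypothetical, and the end-of-container case is not modeled.

```python
def next_object_offset(current_offset, current_blocks, first_offset, block_size):
    """Offset of the object that follows the object named by a token."""
    if current_offset < first_offset:
        # The current object no longer exists (it was aged out), so the
        # container's first object is chosen as the next object.
        return first_offset
    # The current object exists: the next object begins right after it.
    return current_offset + current_blocks * block_size
```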

[0109] As an important optimization, the container system may read ahead by one block whenever retrieving the next object from a container. The cost of reading one additional block typically will be negligible, yet the value will be high: The additional block will contain the header of the next object in the container. From the header, the next object's token may be constructed and returned to the application. The application then will be able to retrieve the corresponding object--having first allocated a buffer to hold it--and the yet-next object's token in a single disk read operation. Thus, retrieving a sequence of objects from a container can be highly efficient if done by container identifier rather than by a stream of tokens.

[0110] In addition to supporting uninterpreted data in the object body, the present invention also may support extended attributes, i.e., information about the data. The information can be of virtually any form, the specification of which is outside the scope of the present discussion. In general, the amount of extended attribute data associated with an object tends to be much smaller than the object body. In some preferred embodiments, it may be stored entirely in the first block of the object. There are certain advantages to relegating the extended attribute data to this location. In one aspect, its location is precisely known as is the location of the object body, so the two could be retrieved independently if so desired. In another, it may be the case that extended attribute data must be updatable.

[0111] Though it has previously been indicated that mutable objects are undesirable, limiting changes to an object to its first block, which not only is atomically-updatable but also contains the object header, presents certain key advantages. First, the attributes can be changed atomically, i.e., completely or not at all. Second, if the block containing the header somehow becomes corrupted, the object becomes irretrievable and the validity test is an easy one to perform.

[0112] In one preferred embodiment, to implement extended attributes being stored in a known location of an object, a multiple of file blocks would be allocated to hold the extended attributes. One of the reserved fields of the object header would be allocated to hold the extended attribute length, which would be the length in bytes of the extended attributes. The extended attribute length would be the actual amount of extended attribute data associated with the object, not including padding to bring the allocated space up to a multiple of file blocks (typically, one) in length. The object length field of the object header would be renamed the object body length field. The number of blocks allocated to the object may be computed directly from the combination of extended attribute length and object body length.
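
The block count computation may be sketched as follows, under the assumption (made here only for illustration) that each file block holds a fixed number of usable bytes, `block_payload`, after its per-block header and trailer are accounted for. The function name is hypothetical.

```python
def blocks_for_object(ea_length, body_length, block_payload):
    """Blocks allocated to an object with extended attributes.

    The extended attributes are padded up to a whole number of blocks,
    and the object body begins on the following block boundary.
    """
    ea_blocks = -(-ea_length // block_payload)      # ceiling division
    body_blocks = -(-body_length // block_payload)  # ceiling division
    return ea_blocks + body_blocks
```

For example, with 488 usable bytes per block, 100 bytes of extended attributes and a 1000-byte body would occupy one block plus three blocks.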

[0113] In many cases, it is desirable for a system that stores data to maintain secure access to the data. The container system of the present invention can be augmented to do so, as follows. First, the means and method disclosed in the present invention assume that the application, in association with the operating system, provides adequate information to the container system to identify the entities of interest. Second, the invention assumes that the container system may store with each object sufficient information for a security system, together with the application's identification information, to determine the access allowed to the application. Third, the invention assumes that a function is able, when passed the application's identification information and information stored with the object, to determine the access. The container system stores the object-specific security information in extended attributes that are not directly accessible to applications.

[0114] When an application passes a token into the container system, either it or the operating system also passes in identification information regarding the application. The container system retrieves the object's extended attributes and extracts from them the security-specific information. The container system then passes to the security checking function the application's identification information, the object-specific security information, and the type of access desired by the application. (With respect to the container system, the access desired would be to retrieve the object.) The security system would either allow or disallow the access, and the container system would act appropriately.

[0115] Unfortunately, if direct I/O is desired, the method would in general require two disk reads to securely retrieve the object. The first read would retrieve the object's extended attributes, and the second would retrieve its data. As performance might be substantially impacted by breaking the read in two, a method to reduce the impact is desirable.

[0116] In one preferred embodiment, the method of next object header read-ahead may be extended when reading the next object, to not only compute the token for the next object, but to cache the security-related extended attributes for the next object as well. Thus, when the next object is accessed, the next object's security information is available without having to perform two disk reads.

[0117] Unfortunately, the optimization does not in general provide any benefit for objects accessed entirely at random, as the security information for the next object will not typically be cached. In this case, it would seem that sequential scanning of objects in a container, by getting the first and then the next objects in succession, could well substantially outperform accessing individual objects by token, perhaps by a factor of two.

[0118] It is, however, possible that object retrieval patterns will not be entirely random; that is, they may follow a cyclical pattern wherein an object is selected more or less at random, then a series of sequential object retrievals is performed. In that case, the optimization would provide substantial benefit. As the run-time overhead would be negligible--an extra block read and a cache of one block between successive object retrievals--the value of the optimization may in many cases exceed its cost.

[0119] Yet a further valuable optimization may be to cache several of the most recent object headers that have been accessed, to handle the case where objects may be accessed out of order but with some locality.
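
One possible form of such a cache, offered only as a sketch, is a small least-recently-used table keyed by object offset. The class name `HeaderCache` and the choice of an LRU eviction policy are assumptions; the description requires only that several recently accessed headers be retained.

```python
from collections import OrderedDict


class HeaderCache:
    """Keeps the most recently accessed object headers, keyed by offset."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, offset):
        header = self._entries.get(offset)
        if header is not None:
            self._entries.move_to_end(offset)   # mark as recently used
        return header

    def put(self, offset, header):
        self._entries[offset] = header
        self._entries.move_to_end(offset)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict least recently used
```

Entries for objects that are later deleted would also have to be invalidated; that step is omitted from the sketch.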

[0120] The optimization of the previous section, i.e., caching the security information associated with the next object when reading a given object, may be extended to caching extended attributes in general. In that case, scanning through a container for objects with extended attributes matching certain criteria may be effected. If the objects to be scanned are relatively small compared to the cost ratio of sequential disk I/O to random disk I/O, and/or a large percentage of the objects scanned are retrieved, better performance may be achieved via this optimization than by reading only the extended attributes.

[0121] Objects may be deleted en masse in one of three ways. Deleting the container in which an object resides causes the object to be deleted. Short of deleting the container, objects may be deleted programmatically, i.e., under application control, by their creation date. Alternatively, objects may be deleted automatically, i.e., as a result of value-based storage management.

[0122] Within a container, objects older than an application-supplied time-stamp may be deleted en masse. Logically, the list of objects comprising the container is scanned from its tail forward for objects older than the time-stamp. If one is found, it is deleted and the next one is examined. The process will end when either an object is found with a time-stamp newer than that supplied by the application, or the end of the container is reached.

[0123] In practice, it would be inefficient to implement programmatic object deletion as logically described. In the first place, the container might contain an enormous number of objects, so scanning through them would be impractical. In the second place, deleting the objects one at a time would make storage management inefficient on most modern file systems. The present invention includes means and methods for implementing the process efficiently.

[0124] If epochs (and epochal objects) are chosen well, the number of epochs will be substantially smaller than the number of objects and/or the number of blocks comprising a container. In that case, scanning through an epoch list may be substantially faster than scanning through the container. In normal operation, if the list is reasonably small, a version of it may be cached in memory for an in-use container. In some cases, it may be possible to cache key information about epochal objects, especially their creation times, for efficient scanning of an epochal object list.

[0125] To identify objects to be programmatically deleted, an epochal object list may be scanned backward, from the head (more recent end) of the container toward the tail (less recent end), until an object is found that is older than the date supplied. Then, the list of objects within that epoch may be scanned forward, until an object with a creation date newer than the supplied date is found. All prior objects then are deleted. Note that variations on this theme may be employed. For example, in some preferred embodiments, the method steps of scanning within the last epoch may be skipped, so that entire epochs, rather than individual objects, are deleted. Note that in all cases, an epoch begins and ends on an object boundary.
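
The two-phase scan may be sketched as follows. The in-memory representation (a list of epochs, oldest first, each a list of (offset, creation time) pairs in creation order) and the function name are hypothetical. The sketch also assumes that an object whose creation time equals the supplied date is retained.

```python
def first_offset_to_keep(epochs, cutoff, end_offset):
    """Return the offset of the first object that survives deletion.

    Every object before the returned offset is older than `cutoff` and
    may be deleted en masse. `end_offset` is the end of the container.
    """
    # Phase 1: scan the epoch list backward, from head toward tail,
    # until an epochal object older than the cutoff is found.
    for i in range(len(epochs) - 1, -1, -1):
        if epochs[i][0][1] < cutoff:
            # Phase 2: scan forward within that epoch.
            for offset, created in epochs[i]:
                if created >= cutoff:
                    return offset
            # The whole epoch is old; keep from the next epoch onward.
            return epochs[i + 1][0][0] if i + 1 < len(epochs) else end_offset
    # No epoch begins before the cutoff, so nothing is deleted.
    return epochs[0][0][0] if epochs else end_offset
```

In the variation that deletes only entire epochs, the second phase would be skipped and the start of the epoch found in the first phase would be returned instead.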

[0126] Objects that have been identified for deletion are deleted en masse. Modern file systems typically provide a means for destroying the mapping between file blocks and disk blocks. For example, IBM's journaling file system for the AIX.TM. operating system provides an fclear( ) system call that logically zeroes a byte range of a file. Where possible, the call unmaps file blocks; otherwise, the blocks are zeroed. GPFS implements fclear( ) for clusters based on AIX.TM. on IBM.RTM. pSeries.TM. computers. File systems supporting the X/Open.TM. Data Storage Management facility provide the dm_punch_hole( ) function, which is similar to fclear( ). Where possible, disk blocks underlying a file region are unmapped.

[0127] By deleting objects en masse, optimizing epoch management, and relying on fclear( ), dm_punch_hole( ), or other file-to-disk unmapping facility, a modern file system is able to optimize storage allocation among containers without undue fragmentation.

[0128] In some cases, it may be desirable to delete individual objects in a container. In one preferred embodiment, an object may be deleted merely by setting the OBJECT_DELETED flag; however, there is a potential security issue associated with this embodiment. If the object is retrieved by token, the object body may be copied into the application's buffer before the object is known to have been deleted. In another preferred embodiment, the object to be deleted may be replaced by a null object; however, there is a security issue associated with this embodiment, as well, and the embodiment may be less efficient, even substantially so. In turning the existing object into a null object, the body of the object would be overwritten with ASCII NUL bytes. If the system were to fail while the object were only partially overwritten, or before all of the blocks of the fully-overwritten object had been written to disk, it should not be possible that data from non-overwritten blocks of the object would be copied into the application's buffer before the object is known to have been deleted. A full implementation of either embodiment would have to take these issues into account, for instance using write-ahead logging.

[0129] If the deletion of individual objects were supported, epoch fragmentation could become an issue. Suppose that a substantial fraction of the objects in an epoch have been deleted. In that case, it might be desirable to unmap the file blocks so that the theretofore associated disk blocks might be reused. Of course, storage allocation units would have to be taken into account when determining whether to unmap blocks, or file system fragmentation could occur. We assume without loss of generality that unmapping would occur on storage-unit aligned, storage-unit size sections of a container file. Thus, only if all of the objects in a storage unit had been deleted could the unit be unmapped.
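
The alignment rule may be illustrated by the sketch below, which determines which storage units are covered entirely by deleted objects. Deleted objects are represented, as an assumption of the sketch, by half-open byte ranges (start, end) within the container file; the function name is hypothetical.

```python
def unmappable_units(deleted_ranges, unit_size):
    """Indices of storage units lying wholly inside deleted byte ranges."""
    # Merge adjacent or overlapping ranges so that a unit spanned by
    # several neighboring deleted objects is still recognized.
    merged = []
    for start, end in sorted(deleted_ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    units = []
    for start, end in merged:
        first = -(-start // unit_size)  # first unit starting at or after start
        last = end // unit_size         # one past the last fully covered unit
        units.extend(range(first, last))
    return units
```

A unit that is only partly covered by deleted objects is left mapped, consistent with the rule stated above.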

[0130] FIG. 11 depicts an object header layout for sparse epochs in accordance with an illustrative embodiment of the present invention. In one preferred embodiment, a reserved field of the object header 1100, object gap 1102, would indicate the empty region (if any), in bytes, between one object and the next, and the container's epoch chain would be updated to reflect unmapped allocation units. Fields in section 1104 and reserved area 1106 were previously described in FIG. 4. However, reserved area 1106 changes in size when object gap 1102 is added to object header 1100.

[0131] FIGS. 12 and 13 depict an exemplary change in an epoch chain to accommodate the unmapping of a storage allocation unit in accordance with an illustrative embodiment of the present invention. FIG. 12 depicts an epoch chain before storage unit deletion in accordance with an illustrative embodiment of the present invention. FIG. 13 depicts an epoch chain after storage unit deletion in accordance with an illustrative embodiment of the present invention. For simplicity, the example from FIG. 12 to FIG. 13 depicts without loss of generality a case in which a storage allocation unit is exactly one block. Those skilled in the art will understand how to extend the example without undue experimentation to a functioning system. In FIG. 12, object 1202 comprises epoch object #0 1204, objects 1206, 1208, and 1210 comprise epoch object #1 1212, object 1214 comprises epoch object #4 1216, and object 1218 comprises epoch object #5 1220. FIG. 13 depicts the resulting storage allocation after objects 1206 and 1210 of FIG. 12 are deleted and the underlying storage allocation units unmapped.

[0132] Of course, the means and method of sparse epochs also may be used to support the modification of retention curves on a fine granularity basis, e.g., per-object, per set of objects, per-segment. If the retention curves of objects within an epoch differ, the epoch may be split and storage unmapped using the sparse epoch method. For example, an epoch may be split dynamically along storage allocation unit lines such that those objects in a first range of allocation units share a first retention curve, those in a second range share a second curve, and so forth. In one preferred embodiment, as the retention curves of objects are changed, epochs may be split and/or coalesced. In another preferred embodiment, epoch splitting and/or coalescing may be deferred until the container valet scans through the epoch chain. In another preferred embodiment, changes to the retention value or curve of an object are accomplished by copying the object to a new container or a separate file in the underlying file system.

[0133] Automatic object deletion is somewhat simpler than programmatic deletion. Periodically, a container system waterline is set by some entity beyond the scope of the present discussion, said waterline indicating the minimum "value" of objects that must be maintained within containers. An object has a retention function and a creation time; along with the current time, these allow the object's value to be determined. An object with a value below the waterline may be deleted; otherwise, it must be maintained.
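
A minimal sketch of the value computation follows. The form of the retention function is not prescribed by this description; the sketch assumes, for illustration only, that an object's value is its initial retention value scaled by a curve evaluated at the object's age, and it uses a hypothetical exponential curve with a configurable half-life.

```python
def exponential_curve(half_life):
    """Hypothetical retention curve: the value halves every `half_life`."""
    return lambda age: 0.5 ** (age / half_life)


def current_value(initial_value, creation_time, now, curve):
    """Value of an object at time `now` under the given retention curve."""
    return initial_value * curve(now - creation_time)


def may_delete(value, waterline):
    """An object strictly below the waterline may be deleted."""
    return value < waterline
```

Under these assumptions, an object created at time 0 with initial value 8.0 and a half-life of 100 has value 4.0 at time 100, and so may be deleted once the waterline rises above 4.0.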

[0134] As with programmatic object deletion, objects within the earliest epoch may be scanned, or more simply, just the epoch chain may be scanned. If only the epoch chain is scanned, the value of an epoch may be considered to be the same as that of the newest object it contains. Note: Here, the discussion has assumed that the objects in a container, or at least the objects in an epoch, have the same retention curve. If this is not the case, storage management becomes substantially more complex.

[0135] An important optimization, that may improve performance in file systems employing unbalanced trees to map file blocks to disk blocks, may be to truncate the container file to zero length in the event that the container becomes empty. This would tend to eliminate indirect, double-indirect, triple-indirect, and so forth blocks from the tree and thereby improve block lookup performance. The optimization presumably would be less valuable in extent-based file systems. One issue with the optimization would be the maintenance of the container generation. In the case where the generation is based on a random number choice or time-stamp, the problem is inconsequential. If it is based on container-based state, a null object may be added to the container immediately after the container has been truncated.

[0136] In the present invention, automatic storage management works as follows. Each container has associated with it exactly one valet. Periodically, the valet opens the container, determines its length via fstat( ), and locates the object at that address. Note that objects may be added after that point by one or more producers. The valet need only locate an object near the tail of the file. As container blocks begin and end on well-known boundaries, the identification of the object is straightforward.
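
Locating a block near the tail of the container file might be sketched as shown below. The sketch assumes fixed-size blocks beginning at offset zero; the function name is hypothetical. The object containing the located block would then be identified from that block's header, a step not shown here.

```python
import os


def tail_block_offset(fd, block_size):
    """Offset of the last block of the container file, per fstat()."""
    size = os.fstat(fd).st_size
    if size == 0:
        return None                      # empty container
    # Round the last byte's position down to a block boundary.
    return ((size - 1) // block_size) * block_size
```

Because producers may append objects after the fstat( ) call, the offset returned is only near the tail, which is all that the valet requires.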

[0137] From the header of the located object, the valet extracts the epochal object offset field, which identifies the next epoch in the epoch chain. The valet then scans backwards through the epoch chain, recording the chain as it proceeds, until the first epoch is reached. The valet knows that the first epoch has been reached when the epochal object offset field of an object header in the epoch chain indicates an offset before the first live object in the file. Then, scanning forward through the recorded chain, the valet computes the value of some object in each epoch. The valet might choose the epochal object for this purpose.

[0138] As every object in an epoch has the same retention curve, the value of one object in an epoch is approximately the same as every other one. Next, comparing the computed value for the epoch with that of the waterline, the valet decides whether to retain or delete the epoch. If the epoch is to be deleted, the file blocks are unmapped and the associated storage is freed.

[0139] In the case where retention curves cannot be changed, i.e., all of the objects in a container have the same curve, which is set at the container's creation, the valet may stop the then-current evaluation-deletion cycle once it detects any epoch that should not be deleted. This line of reasoning assumes that retention curves decrease monotonically with time. That is, suppose there is an epoch created at time t having the value x. Then every epoch (if any) created at any subsequent time t+e, where e is positive, has a value y where y ≥ x.

[0140] In the case where retention curves can be changed, it is possible that a subsequently created epoch may have a value y<x, in which case the valet cannot necessarily stop once it detects an epoch that should not be deleted. There may be subsequent epochs that could be deleted.
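
The difference between the two cases may be sketched as follows. Epoch values are assumed to have been computed already and listed oldest first; the function name and the `curves_fixed` flag are hypothetical.

```python
def epochs_to_delete(epoch_values, waterline, curves_fixed):
    """Indices (oldest first) of epochs whose value is below the waterline."""
    doomed = []
    for index, value in enumerate(epoch_values):
        if value < waterline:
            doomed.append(index)
        elif curves_fixed:
            # With one monotonically decreasing curve per container, every
            # later epoch is at least as valuable, so the scan may stop.
            break
    return doomed
```

With values 1, 2, 5, 3 and a waterline of 4, the scan stops after the third epoch when curves are fixed, whereas with changeable curves the fourth epoch is also examined and found deletable.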

[0141] The valet runs periodically whether the container is in active use or not; moreover, in a clustered system, the valet may run on any cluster node. Various additional optimizations may be applied in the scheduling of the valets, to minimize overhead. For example, the system could cache in memory the value of the least valuable epoch in each container, and then process the containers in increasing order of their least valuable data.
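
The scheduling example above might be sketched with a priority queue, as follows. The mapping from container name to cached least epoch value, and the function name, are assumptions of the sketch.

```python
import heapq


def valet_schedule(least_values):
    """Order containers by the value of their least valuable epoch.

    least_values: hypothetical in-memory cache mapping a container name
    to the value of that container's least valuable epoch.
    """
    heap = [(value, container) for container, value in least_values.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order
```

A heap is used rather than a one-time sort on the assumption that cached values would be updated and containers re-queued between valet runs.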

[0142] There are two cases to consider when determining whether automatic storage management might interact with other operations. One is whether it might interact with adding an object to a container; the other is whether it might interact with retrieving an object from a container.

[0143] With regard to putting an object into a container, again there are two cases to consider: the truncation case, in which the container is being truncated to zero length, and the fclear( ) or dm_punch_hole( ) case, in which a single epoch--but not the last--is being freed from the container.

[0144] In the former case, at issue is whether an object may be added to the container "while" the container is being truncated. If ftruncate( ) were to be used, problems could ensue. If ftruncate( ) were called at the "same" time as append-mode write( ), which is used to put objects into a container, it is possible that the write( ) might be lost, which would be undesirable.

[0145] One solution would be to lock out writes and truncates. This solution would be undesirable as locks would have to be acquired and released frequently--in the worst case every time an object were added to a container.

[0146] A way to avoid lock contention is to limit the valet from using fclear( ) or dm_punch_hole( ) to delete the last epoch in a file. In this case, the operations do not conflict as they address different parts of the file.

[0147] Automatic storage management could interact with retrieving an object from a container. Punching a hole in a file via dm_punch_hole( ) or fclear( ) is not atomic with respect to read( ). Thus, an application may be retrieving an object while the epoch containing it is being unmapped. In that case, the application may receive ASCII NUL (i.e., zero) bytes rather than the expected results. One solution would involve locking the retrieval of objects with respect to automatic storage management. This solution would be undesirable as locks would have to be acquired and released frequently--in the worst case every time an object were retrieved from a container.

[0148] FIG. 14 is a flow diagram 1400 illustrating an exemplary operation of aggregating data in a way that permits data to be deleted efficiently in bulk in accordance with an illustrative embodiment of the present invention. As the operation begins, a request is received for automatic deletion of segments in a container (step 1402). Then a determination of a waterline that is to be applied to the container is made (step 1404). The first segment in the container is then located (step 1406) and checked to determine if the segment falls below the waterline (step 1408). If the segment falls below the waterline (step 1410), it is deleted from the container (step 1412). Then a determination is made if there are more segments (step 1414). If there are more segments, then the next segment is located in the container (step 1416), and the operation continues with step 1408. If, at step 1414, there are no more segments in the container to check, the operation ends. Returning to step 1410, if the segment that is being checked does not fall below the waterline, the operation proceeds to step 1414 and continues as previously described.

[0149] The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In an illustrative embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

[0150] Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

[0151] The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk--read only memory (CD-ROM), compact disk--read/write (CD-R/W) and DVD.

[0152] A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

[0153] Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.

[0154] Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

[0155] The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

* * * * *