U.S. patent application number 13/352256, for an optimized b-tree, was filed with the patent office on 2012-01-17 and published on 2013-07-18.
This patent application is currently assigned to Apple Inc. The applicants listed for this patent are Owen Joseph Strain and Wenguang Wang. Invention is credited to Owen Joseph Strain and Wenguang Wang.
Application Number | 13/352256 |
Publication Number | 20130185271 |
Family ID | 48780709 |
Filed Date | 2012-01-17 |
Publication Date | 2013-07-18 |
United States Patent Application | 20130185271 |
Kind Code | A1 |
Strain; Owen Joseph; et al. | July 18, 2013 |
OPTIMIZED B-TREE
Abstract
The present technology includes an optimized b-tree. To improve
concurrent access, a read lock can be applied to traversed nodes of
a b-tree in a lock coupling. A read locked node can be promoted to
a write locked node upon a determination that the node is likely to
be modified, wherein the locked node first restricts access to
further functions and then applies a write lock to the node when
all existing functions accessing the node end. If one of the other
functions attempts to promote, the later function can be canceled
and removed from the tree. A node can be promoted if the node is
likely to be modified when considering multiple factors such as
type of function, whether it is a leaf node, the number of keys in
the node, or the number of keys in a child node.
Inventors: | Strain; Owen Joseph (San Francisco, CA); Wang; Wenguang (Santa Clara, CA) |
Applicant: |
Name | City | State | Country | Type |
Strain; Owen Joseph | San Francisco | CA | US | |
Wang; Wenguang | Santa Clara | CA | US | |
Assignee: | Apple Inc. | Cupertino | CA |
Family ID: | 48780709 |
Appl. No.: | 13/352256 |
Filed: | January 17, 2012 |
Current U.S. Class: | 707/704; 707/E17.007 |
Current CPC Class: | G06F 16/2246 20190101; G06F 16/2343 20190101; G06F 16/9027 20190101 |
Class at Publication: | 707/704; 707/E17.007 |
International Class: | G06F 17/30 20060101 G06F017/30 |
Claims
1. A processor-implemented method comprising: upon a request to
access a b-tree by at least a first function, initially instituting
a read lock on a traversed node in the b-tree; determining that the
node of the b-tree likely requires modification; and promoting the
read lock to a write lock upon the determination that the node of
the b-tree likely requires modification by the first function, but
not until any other functions beyond the first function have left
the node.
2. The processor-implemented method of claim 1, further comprising:
unlocking the node upon a determination that the node will not be
modified, but not until a child node of the node has been read
locked by the first function.
3. The processor-implemented method of claim 1, wherein the
determination that a node is likely to be modified is based on the
type of function requesting access to the desired node.
4. The processor-implemented method of claim 1, wherein each parent
node stores keys, the determination that a node is likely to be
modified is based on a node containing a predetermined maximum or
predetermined minimum number of keys.
5. The processor-implemented method of claim 1, wherein the
determination that a node is likely to be modified is based on a
determination that the desired node is a leaf node.
6. The processor-implemented method of claim 1, wherein promoting
further comprises: preventing further access to the node so that no
new functions can access the node while other functions currently
accessing the node remain.
7. The processor-implemented method of claim 6, wherein promoting
further comprises: upon a determination that the node is likely to
be modified by a second function that already had access to the
node when it was determined that the node of the b-tree likely
requires modification, returning an error to the second
function.
8. A system comprising: a processor; a first module configured to
control the processor to, upon a request to access a b-tree by at
least a first function, initially institute a read lock on a
traversed node in the b-tree; a second module configured to control
the processor to determine that the node of the b-tree likely
requires modification; and a third module configured to control the
processor to promote the read lock to a write lock upon the
determination that the node of the b-tree likely requires
modification by the first function, but not until any other
functions beyond the first function have left the node.
9. The system of claim 8, further comprising: a fourth module
configured to control the processor to unlock the node upon a
determination that the node will not be modified, but not until a
child node of the node has been read locked by the first function.
10. The system of claim 8, wherein the determination that a node is
likely to be modified is based on the type of function requesting
access to the desired node.
11. The system of claim 8, wherein each parent node stores keys,
the determination that a node is likely to be modified is based on
a node containing a predetermined maximum or predetermined minimum
number of keys.
12. The system of claim 8, wherein the determination that a node is
likely to be modified is based on a determination that the desired
node is a leaf node.
13. The system of claim 8, wherein the third module further
controls the processor to: prevent further access to the node so
that no new functions can access the node while other functions
currently accessing the node remain.
14. The system of claim 13, wherein the third module further
controls the processor to: return an error to a second function
that already had access to the node when it was determined that the
node of the b-tree likely requires modification by the first
function, upon a determination that the node is likely to require
modification by the second function.
15. A non-transitory computer-readable medium having instructions
stored thereon which, when executed by a computing device, cause
the computing device to: obtain, by a server, a read lock on nodes
of a b-tree traversed by a function, wherein a locked parent node
can be unlocked upon a child node of the parent node being locked;
and promote, upon a determination that a traversed node is likely
to be modified, the read lock on the traversed node to a write
lock.
16. The non-transitory computer-readable medium of claim 15,
further comprising: unlocking, upon the function completing, the
locks placed on the traversed nodes.
17. The non-transitory computer-readable medium of claim 15,
further comprising: unlocking a locked parent node upon a child
node of the parent node being locked and a determination that the
parent node is not likely to be modified.
18. The non-transitory computer-readable medium of claim 15,
wherein the determination that a node is likely to be modified is
based on the type of function.
19. The non-transitory computer-readable medium of claim 15,
wherein each parent node stores keys, the determination that a node
is likely to be modified is based on a node containing a
predetermined maximum or predetermined minimum number of keys.
20. The non-transitory computer-readable medium of claim 15,
wherein the determination that a node is likely to be modified is
based on a determination that the node is a leaf node.
21. The non-transitory computer-readable medium of claim 15,
wherein promoting comprises: locking the node so that no new
functions can access the node while other functions currently
accessing the node remain; and upon a determination that no other
functions are currently accessing the node, write locking the
node.
22. The non-transitory computer-readable medium of claim 21,
promoting further comprising: upon a determination that one of the
other functions currently accessing the node is attempting to
promote, returning an error to the other function.
Description
BACKGROUND
[0001] 1. Technical Field
[0002] The present disclosure relates to b-trees and more
specifically to concurrent access of b-trees.
[0003] 2. Introduction
[0004] Computers are relied upon to store and offer access to large
amounts of data. Accordingly, being able to access the data faster
and more efficiently is an ongoing goal of modern developers. To
this end, various computing data structures have been developed.
[0005] One data structure which has been commonly used to manage
data is a binary tree. Binary trees store data in nodes connected
in a tree structure. Each tree begins with a single root node that
stores a single data element and can have no more than two child
nodes. The child nodes are commonly referred to as the left child
and right child. Each child node can likewise store one data
element and have two child nodes. Data is stored in a binary tree
using the value in each node as a key. For example, if a binary
tree holds integers as values, the tree is organized such that each
integer is stored in a node to the left of a node containing a
larger integer, but to the right of a node containing a smaller
integer. This way the contents of each node of the tree can be used
as a key to quickly traverse the tree and find data.
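By way of illustration only, the ordering rule described above can be sketched as follows. The names BinaryNode, bst_insert and bst_contains and the sample integers are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BinaryNode:
    """One node of a binary search tree: a single key and up to two children."""
    key: int
    left: Optional["BinaryNode"] = None
    right: Optional["BinaryNode"] = None


def bst_insert(root, key):
    """Insert key below root and return the root of the resulting subtree."""
    if root is None:
        return BinaryNode(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)    # smaller keys go left
    elif key > root.key:
        root.right = bst_insert(root.right, key)  # larger keys go right
    return root


def bst_contains(root, key):
    """Follow left/right links, using the key in each node as the guide."""
    node = root
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


# Hypothetical sample data.
root = None
for value in (8, 3, 10, 1, 6):
    root = bst_insert(root, value)
```

Each comparison discards one subtree, which is why the stored values can serve as keys for a quick traversal.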
[0006] Many variations of the binary tree have been developed. One
variation that is commonly used when managing very large amounts of
data is a b-tree. B-trees allow for multiple keys and children per
node. Some b-trees can be configured to have hundreds of keys and
children per node, thus having millions of nodes in a tree with a
fairly short depth. These types of large b-trees are commonly used
by file-systems to represent files and directories.
[0007] One variation of a b-tree has been developed by Ohad Rodeh
and has been disclosed in his paper B-trees, Shadowing, and Clones
(ACM Transactions on Computational Logic, Vol. V, No. N, August
2007), which is incorporated by reference, herein, in its
entirety.
[0008] To increase the speed and efficiency of accessing data
within a b-tree, concurrent access to a tree can be granted to
multiple functions. Granting concurrent access, however, can lead
to errors if a node is modified while being accessed by another
function. To alleviate this problem, multiple locking schemes have been
used to limit errors while allowing as much concurrent access as
possible.
[0009] In the past, different types of locks, providing different
levels of security, have been used. For example, functions have
been differentiated between functions that modify the tree (write
function) versus those that only request data from the tree (read
function) and different types of locks have been configured to
restrict access from certain types of functions attempting to
access a node. For example, when a read function accesses a node, a
read lock can be placed on the node which allows only other read
functions to access the node concurrently and restricts access to
all write functions. When a write function accesses a node, a write
lock can be placed which restricts access to all other functions,
both read and write.
[0010] One common solution to allowing concurrent access to a
b-tree is to lock each node as it is traversed by the function. For
example, in some embodiments, for every node traversed by a read
function, a read lock is applied to the node. Conversely, for every
node traversed by a write function a write lock can be applied to
the node. This solution, although effective, is inefficient and
slow because each function must enter the tree by the root node and
so the root node must always be locked when the tree is accessed by
a function. In the case of a write function, the result is that no
other functions can access the tree until the write function has
completed and the lock is removed.
[0011] To remedy this problem, the b-tree disclosed by Ohad Rodeh
incorporates a lock coupling technique wherein individual nodes are
locked as they are traversed and can then be released after the
appropriate child node is locked if it is determined that the
parent node will not be modified. Similar to the method described
above, a read lock is used when the tree is traversed by a read
function while a write lock is used when the tree is traversed by a
write function.
[0012] The lock coupling method is beneficial because a node is not
locked unless it is being accessed, or it is likely that the node
will be modified. The root node, therefore, is often not locked
when the tree is accessed by a function and the tree can therefore
be accessed by both reader and writer functions concurrently.
[0013] Lock coupling does provide a more efficient system of
allowing concurrent access to a tree; however the restrictive
nature of a write lock still fails to allow sufficient concurrent
access and ultimately impedes performance. Accordingly, a need
exists for a less restrictive locking technique associated with
concurrent access to b-trees that still provides adequate
protection against errors.
SUMMARY
[0014] Additional features and advantages of the disclosure will be
set forth in the description which follows, and in part will be
obvious from the description, or can be learned by practice of the
herein disclosed principles. The features and advantages of the
disclosure can be realized and obtained by means of the instruments
and combinations particularly pointed out in the appended claims.
These and other features of the disclosure will become more fully
apparent from the following description and appended claims, or can
be learned by the practice of the principles set forth herein.
[0015] Disclosed are systems, methods, and non-transitory
computer-readable storage media for an optimized b-tree. To provide
faster and more efficient concurrent access to a b-tree, a read
lock can be applied to traversed nodes of a b-tree in a lock
coupling fashion regardless of whether a function is a read or
write function. A read locked node can be promoted to a write
locked node upon a determination that the node is likely to be
modified.
[0016] A promote function can be configured to lock a node to
restrict further functions from accessing the node while allowing
any functions currently accessing the node to remain. Once all
other functions accessing the node have left, the promote function
can be configured to apply a write lock to the node. If, while
waiting for the other functions accessing the node to leave, one
attempts to promote, the promote can be granted on a first come
first serve basis. The second function attempting to promote can
receive an error and retry its traversal from the root of the
tree.
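As a non-limiting sketch of the promote behavior summarized above, the following shows one way such a lock might be arranged. The class PromotableLock, the exception PromoteConflict, and all method names are assumptions introduced for illustration; the disclosure does not prescribe this structure.

```python
import threading


class PromoteConflict(Exception):
    """Returned to a function that tries to promote while a promotion is pending."""


class PromotableLock:
    """Read lock for one node that a single holder can promote to a write lock."""

    def __init__(self):
        self._cond = threading.Condition()
        self._holders = 0           # functions currently accessing the node
        self._promoting = False     # set as soon as a promotion is requested
        self._write_locked = False  # set once the promoter is alone in the node

    def acquire_read(self, timeout=None):
        """Enter the node; refused while a promotion or write lock is in force."""
        with self._cond:
            free = self._cond.wait_for(lambda: not self._promoting, timeout)
            if free:
                self._holders += 1
            return free

    def release_read(self):
        with self._cond:
            self._holders -= 1
            self._cond.notify_all()

    def promote(self):
        """Upgrade the caller's read lock; first come, first served."""
        with self._cond:
            if self._promoting:
                # A later promoter gets an error and is expected to release
                # its read lock and retry its traversal from the root.
                raise PromoteConflict("another function is already promoting")
            self._promoting = True                           # keep new functions out
            self._cond.wait_for(lambda: self._holders == 1)  # wait for the others to leave
            self._write_locked = True

    def release_write(self):
        with self._cond:
            self._holders -= 1
            self._promoting = False
            self._write_locked = False
            self._cond.notify_all()

    def state(self):
        """Snapshot of (holders, promoting, write_locked) for inspection."""
        with self._cond:
            return (self._holders, self._promoting, self._write_locked)
```

In this sketch a function that receives PromoteConflict still holds its read lock, so it would call release_read before retrying; that release is what lets the first promoter finish.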
[0017] A node can be promoted upon a determination that the node is
likely to be modified. A determination that a node is likely to be
modified can be based on numerous factors. For example, the
determination can be made based on the type of function or whether
the node is a leaf node. In some embodiments, the tree can be
configured to proactively merge or split nodes as they are
traversed based on the number of keys in the node. For example, all
nodes at a maximum capacity can be split or rebalanced and all
nodes at a minimum capacity can be merged or rebalanced with its
sibling nodes. The number of keys in the node can then be used to
determine that a node is likely to be split. This can also be
applied to a parent node. A child node that needs to be split or
rebalanced can require the parent node be modified, so the number
of keys in a child node can be used to determine that the parent
node is likely to be modified as well.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] In order to describe the manner in which the above-recited
and other advantages and features of the disclosure can be
obtained, a more particular description of the principles briefly
described above will be rendered by reference to specific
embodiments thereof which are illustrated in the appended drawings.
Understanding that these drawings depict only exemplary embodiments
of the disclosure and are not therefore to be considered to be
limiting of its scope, the principles herein are described and
explained with additional specificity and detail through the use of
the accompanying drawings in which:
[0019] FIG. 1 illustrates an example system embodiment;
[0020] FIGS. 2a and 2b illustrate an exemplary b-tree;
[0021] FIG. 3 illustrates an exemplary method embodiment of using a
promote function in a b-tree;
[0022] FIG. 4 illustrates an exemplary method embodiment of the
promote function; and
[0023] FIG. 5 illustrates an exemplary system embodiment in which
an optimized b-tree can be implemented.
DETAILED DESCRIPTION
[0024] Various embodiments of the disclosure are discussed in
detail below. While specific implementations are discussed, it
should be understood that this is done for illustration purposes
only. A person skilled in the relevant art will recognize that
other components and configurations may be used without parting
from the spirit and scope of the disclosure.
[0025] FIG. 1 illustrates an exemplary system 100 that includes a
general-purpose computing device 100, including a processing unit
(CPU or processor) 120 and a system bus 110 that couples various
system components including the system memory 130 such as read only
memory (ROM) 140 and random access memory (RAM) 150 to the
processor 120. The system 100 can include a cache 122 of high speed
memory connected directly with, in close proximity to, or
integrated as part of the processor 120. The system 100 copies data
from the memory 130 and/or the storage device 160 to the cache 122
for quick access by the processor 120. In this way, the cache 122
provides a performance boost that avoids processor 120 delays while
waiting for data. These and other modules can control or be
configured to control the processor 120 to perform various actions.
Other system memory 130 may be available for use as well. The
memory 130 can include multiple different types of memory with
different performance characteristics. It can be appreciated that
the disclosure may operate on a computing device 100 with more than
one processor 120 or on a group or cluster of computing devices
networked together to provide greater processing capability. The
processor 120 can include any general purpose processor and a
hardware module or software module, such as module 1 162, module 2
164, and module 3 166 stored in storage device 160, configured to
control the processor 120 as well as a special-purpose processor
where software instructions are incorporated into the actual
processor design. The processor 120 may essentially be a completely
self-contained computing system, containing multiple cores or
processors, a bus, memory controller, cache, etc. A multi-core
processor may be symmetric or asymmetric.
[0026] The system bus 110 may be any of several types of bus
structures including a memory bus or memory controller, a
peripheral bus, and a local bus using any of a variety of bus
architectures. A basic input/output system (BIOS) stored in ROM 140 or the
like, may provide the basic routine that helps to transfer
information between elements within the computing device 100, such
as during start-up. The computing device 100 further includes
storage devices 160 such as a hard disk drive, a magnetic disk
drive, an optical disk drive, tape drive, solid state drive or the
like. The storage device 160 can include software modules 162, 164,
166 for controlling the processor 120. Other hardware or software
modules are contemplated. The storage device 160 is connected to
the system bus 110 by a drive interface. The drives and the
associated computer readable storage media provide nonvolatile
storage of computer readable instructions, data structures, program
modules and other data for the computing device 100. In one aspect,
a hardware module that performs a particular function includes the
software component stored in a non-transitory computer-readable
medium in connection with the necessary hardware components, such
as the processor 120, bus 110, display 170, and so forth, to carry
out the function. The basic components are known to those of skill
in the art and appropriate variations are contemplated depending on
the type of device, such as whether the device 100 is a small,
handheld computing device, a desktop computer, or a computer
server.
[0027] Although the exemplary embodiment described herein employs
the hard disk 160, it should be appreciated by those skilled in the
art that other types of computer readable media which can store
data that are accessible by a computer, such as magnetic cassettes,
flash memory cards, digital versatile disks, cartridges, random
access memories (RAMs) 150, read only memory (ROM) 140, a cable or
wireless signal containing a bit stream and the like, may also be
used in the exemplary operating environment. Non-transitory
computer-readable storage media expressly exclude media such as
energy, carrier signals, electromagnetic waves, and signals per
se.
[0028] To enable user interaction with the computing device 100, an
input device 190 represents any number of input mechanisms, such as
a microphone for speech, a touch-sensitive screen for gesture or
graphical input, keyboard, mouse, motion input, speech and so
forth. An output device 170 can also be one or more of a number of
output mechanisms known to those of skill in the art. In some
instances, multimodal systems enable a user to provide multiple
types of input to communicate with the computing device 100. The
communications interface 180 generally governs and manages the user
input and system output. There is no restriction on operating on
any particular hardware arrangement and therefore the basic
features here may easily be substituted for improved hardware or
firmware arrangements as they are developed.
[0029] For clarity of explanation, the illustrative system
embodiment is presented as including individual functional blocks
including functional blocks labeled as a "processor" or processor
120. The functions these blocks represent may be provided through
the use of either shared or dedicated hardware, including, but not
limited to, hardware capable of executing software and hardware,
such as a processor 120, that is purpose-built to operate as an
equivalent to software executing on a general purpose processor.
For example, the functions of one or more processors presented in
FIG. 1 may be provided by a single shared processor or multiple
processors. (Use of the term "processor" should not be construed to
refer exclusively to hardware capable of executing software.)
Illustrative embodiments may include microprocessor and/or digital
signal processor (DSP) hardware, read-only memory (ROM) 140 for
storing software performing the operations discussed below, and
random access memory (RAM) 150 for storing results. Very large
scale integration (VLSI) hardware embodiments, as well as custom
VLSI circuitry in combination with a general purpose DSP circuit,
may also be provided.
[0030] The logical operations of the various embodiments are
implemented as: (1) a sequence of computer implemented steps,
operations, or procedures running on a programmable circuit within
a general use computer, (2) a sequence of computer implemented
steps, operations, or procedures running on a specific-use
programmable circuit; and/or (3) interconnected machine modules or
program engines within the programmable circuits. The system 100
shown in FIG. 1 can practice all or part of the recited methods,
can be a part of the recited systems, and/or can operate according
to instructions in the recited non-transitory computer-readable
storage media. Such logical operations can be implemented as
modules configured to control the processor 120 to perform
particular functions according to the programming of the module.
For example, FIG. 1 illustrates three modules Mod1 162, Mod2 164
and Mod3 166 which are modules configured to control the processor
120. These modules may be stored on the storage device 160 and
loaded into RAM 150 or memory 130 at runtime or may be stored as
would be known in the art in other computer-readable memory
locations.
[0031] As is commonly known in the art, a b-tree is a tree data
structure that keeps data sorted and allows searches, access,
insertions and deletions very quickly. The b-tree is a variation of
a binary search tree in which each node can contain multiple keys
and have more than two children. Although each node has a maximum
number of allowable keys and children, the number of keys and
children per node can be variable as long as they do not exceed the
maximum limit. The number of allowable keys and children per node
can be variable; however, they should consistently correspond to
each other. For example, in some embodiments the tree can be
configured so that if each node can contain X keys, the node can
have X+1 children. In some embodiments, the tree can be configured
so that if each node contains X keys, the node can have X
children.
[0032] When accessing a b-tree, all functions must enter from the
root node and traverse the tree accordingly. To quickly search the
tree, the keys within a node are used as a guide. The b-tree is
configured so that the value of all keys stored in a child node is
within the range of the key to its left and right in the parent
node.
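For illustration, a key-guided descent of this kind might look like the sketch below. The class BTreeNode, the helpers child_index and find_leaf, and the sample keys are hypothetical; sending a key that equals a parent key to the child on its right is one possible convention and is assumed here.

```python
import bisect


class BTreeNode:
    """A node holding sorted keys; an internal node has len(keys) + 1 children."""

    def __init__(self, keys, children=None):
        self.keys = list(keys)
        self.children = list(children or [])


def child_index(parent_keys, key):
    """Index of the child whose key range covers key.

    A key equal to a parent key is routed to the child on that key's right.
    """
    return bisect.bisect_right(parent_keys, key)


def find_leaf(root, key):
    """Walk from the root to the leaf that would hold key."""
    node = root
    while node.children:
        node = node.children[child_index(node.keys, key)]
    return node


# Hypothetical two-level tree.
root = BTreeNode([7, 16], [
    BTreeNode([1, 2, 4, 5]),
    BTreeNode([7, 9, 12]),
    BTreeNode([16, 20]),
])
```

Because the keys in each node are sorted, the child to follow can be located with a binary search inside the node rather than a linear scan.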
[0033] The key values are used as guides to traversing the tree,
but in some embodiments, the actual data stored in the tree is
stored in leaf nodes. The leaf nodes are the nodes at the lowest
level of the tree and do not have children.
[0034] A b-tree is kept balanced by requiring that all leaf nodes
are at the same depth, meaning that they are all an equal distance
from the root node. To maintain this balance, nodes are split and
merged as keys are added and removed from the tree. To perform
split and merge functions efficiently, the number of keys allowed
per internal node can be configured to be within a predetermined
range. For example, each node can be required to have between b and
2b+1 keys, where b>=2. These ranges allow nodes to be easily
split into two and merged into one. The root node does not need to
follow these guidelines and can have, for example, between 0 and
2b+1 keys.
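The arithmetic behind the example range can be checked with a short sketch. The helpers key_range and split_sizes are hypothetical names, and splitting a node down the middle is an assumption made only for this illustration.

```python
def key_range(b, is_root=False):
    """Allowed (minimum, maximum) number of keys for a node in the example range."""
    if b < 2:
        raise ValueError("b must be at least 2")
    return (0 if is_root else b, 2 * b + 1)


def split_sizes(num_keys):
    """Key counts of the two halves when a node is split down the middle."""
    left = num_keys // 2
    return left, num_keys - left


b = 2
low, high = key_range(b)
full_halves = split_sizes(high)  # halves produced by splitting a full node
merged = low + low               # keys after merging two minimally filled nodes
```

Under these assumptions a full node of 2b+1 keys splits into halves of b and b+1 keys, and two nodes of b keys merge into one node of 2b keys, so every result stays inside the allowed range.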
[0035] FIG. 2a illustrates an exemplary b-tree. As illustrated, the
b-tree 200 is configured so that each node can hold up to 4 keys
and have up to 5 children. The illustrated b-tree is configured to
hold integers, and the integers are the keys used to navigate the tree.
The data is sorted so that all integers stored in the tree are
positioned to the right of all smaller integers and to the left of
all larger integers. To accomplish this, the tree can be configured
so that keys stored in a node are always equal to or greater than the
value of the key to its left in the parent node and less than the
value of the key to its right in its parent node.
[0036] For example, as illustrated, the leftmost child node 205 of
the root node 210 is ordered smallest to largest, left to right,
and contains only integers smaller than 7, which is the key to its
right in the root node 210. The middle child node 215 contains only
integers equal to or larger than 7, which is the key to its left in
the root node 210, and smaller than 16, which is the key to its
right in the root node 210. Finally, the rightmost child node 220
only contains integers equal to or larger than 16, which is the key
to its left in the root node 210.
[0037] When adding data to the b-tree 200, the tree can be
configured to rearrange itself to conform to the rules of the tree.
For example, if the integer 3 is added to the illustrated tree, the
leftmost child 205 would exceed the maximum of 4 allowable keys. To
remedy this problem, the root node 210 can be modified to add an
extra key and the leftmost child node 205 can be split into two
nodes.
[0038] FIG. 2b illustrates the b-tree of FIG. 2a with the number 3
added. As illustrated, the root node 210 can be modified to include
the number 3 and the leftmost child 205 can be split into two
separate nodes, one placed to the left 225 of the key 3 in the root
node 210 and one placed between 230 the keys 3 and 7 in the root
node 210. The keys within the two modified child nodes 225 and 230
correspond to the rules of the tree that the keys in the child must
be equal to or greater than the value of the key to its left in its
parent node and less than the value of the key to its right in its
parent node.
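The split described for FIGS. 2a and 2b can be approximated by the following sketch. Only the root keys 7 and 16 and the inserted integer 3 come from the description; the remaining leaf contents, the function name insert_into_child, and the choice of split point are assumptions made for illustration.

```python
import bisect


def insert_into_child(parent_keys, children, child_pos, key, max_keys=4):
    """Insert key into the leaf children[child_pos], splitting it on overflow.

    parent_keys and children are modified in place. Returns True if a split
    took place.
    """
    leaf = children[child_pos]
    bisect.insort(leaf, key)
    if len(leaf) <= max_keys:
        return False  # the leaf still fits, so the parent is untouched
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    children[child_pos:child_pos + 1] = [left, right]
    # The smallest key of the right half becomes the new separator, so the
    # right half stays equal to or greater than the key to its left.
    parent_keys.insert(child_pos, right[0])
    return True


# Hypothetical contents for the leaves.
root_keys = [7, 16]
leaves = [[1, 2, 4, 5], [7, 9, 12], [16, 20]]
did_split = insert_into_child(root_keys, leaves, 0, 3)
```

With these sample values the root gains the key 3 and the former leftmost child becomes two nodes, one to the left of 3 and one between 3 and 7.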
[0039] To increase the speed at which users can access the data in
the tree, concurrent access to the tree can be granted to multiple
functions. Concurrent access, however, can lead to errors if, for
example, a node of the tree is modified while another function is
attempting to access the data.
[0040] To alleviate this problem, when a node is accessed by a
function, a lock can be placed on the node to restrict other
functions from accessing the node until the lock has been
removed.
[0041] In some embodiments, different types of locks can be used,
the different locks providing different levels of security. For
example, functions can be differentiated between functions that
modify the tree versus those that only request data from the tree.
A function that modifies the tree can be called a write function
because it alters the tree and thus writes to it in some way, for
example by adding or deleting data. A function that merely requests
data can be called a read function because data is only read and
the tree is not modified.
[0042] Different types of locks can be configured to restrict
access to a node from certain types of functions. For example,
allowing access to multiple read functions at the same time poses
no threat of error because a read function does not modify the
tree, whereas a write function poses a threat of error since at
least one node will be modified. Accordingly, when a read function
accesses a node, a lock can be placed to allow only other read
functions to access the node concurrently by restricting access to
all write functions. This type of lock can be called a read lock.
When a write function accesses a node, a lock can be placed on the
node which restricts access to all other functions, both read and
write, to protect against error. This type of lock can be called a
write lock.
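A minimal sketch of the two lock types is given below, assuming a condition variable guards a reader count and a writer flag. The class ReadWriteLock and its method names are hypothetical, and no fairness between readers and writers is attempted.

```python
import threading


class ReadWriteLock:
    """Allows many concurrent read functions or exactly one write function."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0     # number of read locks currently held
        self._writer = False  # True while a write lock is held

    def acquire_read(self, timeout=None):
        """Read lock: granted unless a write function holds the node."""
        with self._cond:
            if not self._cond.wait_for(lambda: not self._writer, timeout):
                return False
            self._readers += 1
            return True

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self, timeout=None):
        """Write lock: granted only when no other function holds the node."""
        with self._cond:
            alone = self._cond.wait_for(
                lambda: not self._writer and self._readers == 0, timeout)
            if alone:
                self._writer = True
            return alone

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Passing a timeout of 0 turns either acquire call into a non-blocking attempt, which makes it easy to observe that a second reader is admitted while a writer is refused.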
[0043] One common solution to allowing concurrent access to a
b-tree is to lock each node as it is traversed by the function. For
example, in some embodiments, for every node traversed by a read
function, a read lock is applied to the node. Conversely, for every
node traversed by a write function, a write lock can be applied to
the node. The locks applied can be removed upon the function
completing. Removing a lock applies only to the lock placed by that
function; locks placed on a node by a different function are not
affected.
[0044] This solution, although effective, can be inefficient
because each function must enter the tree by the root node and so
the root node must always be locked when the tree is accessed by a
function. In the case of a write function, the result is that no
other functions can access the tree until the write function has
completed and the lock is removed.
[0045] To remedy this problem, a lock coupling technique can be
used wherein individual nodes are locked as they are traversed and
can then be released after the appropriate child node is locked if
it is determined that it is not likely that the parent node will be
modified. Similar to the method described above, a read lock can be
used when the tree is traversed by a read function while a write
lock can be used when the tree is traversed by a write
function.
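The order of operations in lock coupling can be sketched as follows. For brevity a plain mutual-exclusion lock stands in for the read and write locks, the parent is always released (that is, it is assumed unlikely to be modified), and the names LockedNode and descend_with_lock_coupling and the sample keys are hypothetical.

```python
import bisect
import threading


class LockedNode:
    """A b-tree node carrying its own lock."""

    def __init__(self, keys, children=None):
        self.keys = list(keys)
        self.children = list(children or [])
        self.lock = threading.Lock()


def descend_with_lock_coupling(root, key, trace):
    """Descend to the leaf covering key, locking each child before unlocking its parent.

    Lock and unlock events are appended to trace. The leaf is returned still
    locked, and the caller is responsible for releasing it.
    """
    node = root
    node.lock.acquire()
    trace.append(("lock", tuple(node.keys)))
    while node.children:
        child = node.children[bisect.bisect_right(node.keys, key)]
        child.lock.acquire()   # lock the appropriate child first
        trace.append(("lock", tuple(child.keys)))
        node.lock.release()    # only then release the parent
        trace.append(("unlock", tuple(node.keys)))
        node = child
    return node


# Hypothetical two-level tree.
root = LockedNode([7, 16], [
    LockedNode([1, 2, 4, 5]),
    LockedNode([7, 9, 12]),
    LockedNode([16, 20]),
])
```

At most two nodes are locked at any moment in this sketch, and the root is free again as soon as the first child has been locked.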
[0046] The lock coupling method is beneficial because a node is not
locked unless it is being accessed by a read or write function, or
it is likely that the node will be modified. The root node,
therefore, is often not locked when the tree is accessed by a
function and the tree can therefore be accessed by both reader and
writer functions concurrently.
[0047] Lock coupling does provide a more efficient system of
allowing concurrent access to a tree; however the restrictive
nature of a write lock can still lead to inefficiencies. For
example, a write function ultimately does write to a tree, but the
function is only reading the keys of a tree while it is traversing
and only writes at the end of its search. Accordingly, a write
function is only a read function until it is determined that a node
is likely to be modified and thus can be treated as a read function
until that time. Therefore, there is no need for nodes to be write
locked when traversed by a write function unless there is a
determination that the node is likely to be modified and there is
no need to restrict a write function from a read locked node unless
the node is being modified. By only read locking a node as it is
traversed by a write function and treating a write function as a
read function until it is ready to modify the tree, efficiency can
be greatly increased. Using this system, all functions can have
access to a node unless the node is likely to be modified.
[0048] Many factors can go into determining whether a node is
likely to be modified. For example, the type of function can be a
factor. A read function only reads data from the tree and does not
modify it, so when being accessed by a read function, a node is
never likely to be modified. Another factor can be whether a node
is a leaf node. If a write function has traversed the tree until it
has reached a leaf node, the node is likely to be modified because
data is stored in the leaf nodes. Another factor can be the number
of keys in a node. For example, returning to FIG. 2a, the
leftmost node 205 contains 4 keys, which is the node's maximum
capacity. If a function is going to add a key to the node 205, it
can be determined that both the leftmost node 205 and the parent
node 210 are likely to be modified because the leftmost node is at
full capacity and thus requires a split or rebalancing.
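The factors above can be combined into a single test. The following
Python sketch is one possible reading for an insert, not the
disclosed implementation: the names Node, MAX_KEYS and
likely_to_be_modified are hypothetical, and the capacity of 4 keys is
borrowed from the FIG. 2a example.

```python
from dataclasses import dataclass

MAX_KEYS = 4  # assumed capacity, matching node 205 in the FIG. 2a example


@dataclass
class Node:
    """Hypothetical node carrying only what the test needs."""
    keys: list
    is_leaf: bool


def likely_to_be_modified(node, is_write, child=None):
    """Decide whether an insert passing through node is likely to change it."""
    if not is_write:
        return False  # a read function never modifies a node
    if node.is_leaf:
        return True   # data is stored in the leaf nodes
    if child is not None and len(child.keys) >= MAX_KEYS:
        return True   # a full child would split into this node
    return False
```

For example, an internal node visited by a write function is reported
as likely to be modified only when the child on the path is already
full, which corresponds to the parent node 210 of the full node 205.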
[0049] This same concept can be applied to b-trees configured to
perform a pro-active split or pro-active merge when inserting or
removing a key. For example, in some embodiments a b-tree can be
configured so that when a write function traverses a tree to insert
a key, each full node is split or rebalanced. Both a node and a
parent node can be determined to be likely to be modified if the
node is full. The same proactive policy can be implemented when a
write function wishes to remove a key from the tree. The function
can be configured to merge or rebalance nodes with a minimum number
of keys. For example, a tree configured to allow between b and 2b+1
keys per node will merge all nodes with b keys when performing a
remove function and split all nodes with 2b+1 keys when performing
an insert function. In either case, both the node and the parent
node can be determined to be likely to be modified.
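The thresholds in this example can be sketched as follows. The
function names proactive_action and split_full_node are hypothetical,
and the split shown assumes a classic b-tree split in which the median
key moves up to the parent; the disclosure does not specify how a
split is carried out.

```python
def proactive_action(num_keys, b, operation):
    """Return the restructuring a traversal would trigger, if any.

    Assumes nodes hold between b and 2b+1 keys, as in the example above.
    """
    min_keys, max_keys = b, 2 * b + 1
    if operation == "insert" and num_keys == max_keys:
        return "split"
    if operation == "remove" and num_keys == min_keys:
        return "merge"
    return None


def split_full_node(keys, b):
    """Split a full node of 2b+1 sorted keys around its median.

    Assumed classic split: the median goes to the parent, which is why
    both the node and its parent are modified.
    """
    if len(keys) != 2 * b + 1:
        raise ValueError("node is not full")
    return keys[:b], keys[b], keys[b + 1:]
```

With b = 2, a node holding 5 keys is split during an insert into two
nodes of 2 keys each plus one key for the parent, so both resulting
nodes still satisfy the minimum of b keys.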
[0050] Upon a determination that a node is likely to be modified, a
promote function can be utilized to promote a read locked node to a
write locked node. If multiple functions are accessing a node when
one function wishes to promote the lock on the node from read to
write, the promote function can be configured to first lock the
node from new functions wishing to access the node and then wait
until all existing functions accessing the node have left before
promoting the lock from a read lock to a write lock.
[0051] If one of the other functions also attempts to promote its
read lock to a write lock, priority can be given on a first-come,
first-served basis. For
example, in some embodiments, the first function to attempt to
promote will be given priority and granted the promote, while any
other functions attempting to promote can be canceled and removed
from the tree. The removed functions can then try to re-access the
tree. It is highly unlikely that two functions will attempt to
promote at the same time. This low probability is used
advantageously to increase concurrent access to the tree by
planning for a possible failure for one function. This represents a
major philosophical difference from previous methods in that a
chance of failure is accepted in order to increase overall efficiency.
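The cancel-and-retry behavior can be sketched as a small loop. This
is an illustration under assumptions, not the disclosed
implementation: the exception PromoteDenied, the helper
run_with_retry, the max_attempts limit and the simulated insert are
all hypothetical.

```python
class PromoteDenied(Exception):
    """Hypothetical signal that another function was granted the promote."""


def run_with_retry(operation, max_attempts=3):
    """Re-run operation from the root each time its promote is denied."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except PromoteDenied:
            pass  # the function was removed from the tree; start over
    raise RuntimeError("promote denied on every attempt")


# Simulated insert that loses the promote race once, then succeeds.
outcomes = iter([PromoteDenied("lost the race"), "inserted"])


def simulated_insert():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result, attempts = run_with_retry(simulated_insert)
```

The retry limit is a design choice of this sketch; because conflicting
promotes are expected to be rare, a denied function would normally
succeed on its next pass through the tree.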
[0052] FIG. 3 illustrates an exemplary method embodiment of using a
promote function in a b-tree. As illustrated, when a function
accesses the tree, the method first read locks the root node 305,
which is then assigned as the parent node in this method.
[0053] The method then determines whether the parent node is likely
to change 310. This determination can be made in any number of
ways. For example, it can be determined that the parent node is
likely to be changed if there is no appropriate child node for the
function to continue to, or the number of keys in the node is
outside of a set range. For example, in some embodiments, a b-tree
can be configured to proactively split or merge and so any node
with a number of keys outside of a predetermined range can be
configured to be modified appropriately.
[0054] If it is determined that the parent is likely to change 310,
the promote function (described in further detail in FIG. 4)
promotes 315 the read lock to a write lock. The method then
determines whether there is an appropriate child node 320. If there
is no child node, the function modifies 325 any node which is write
locked by the function and then releases 330 all locks placed by the
function.
[0055] If at 320 an appropriate child node is found, the method can
read lock 340 the appropriate child node. The method can then
determine whether the child node is likely to change 345. If the
child node is not likely to change, the lock on the parent node is
released 360 and the child node is then designated as the parent
node 365. The method then returns to step 320.
[0056] If at 345 it is determined that the child node is likely to
change, the method promotes 350 (FIG. 4) the read lock on the
child, parent or appropriate sibling nodes and modifies 355 the
write locked nodes accordingly. The method then continues to step
360, where the write locks on the nodes are released, and the
child node is designated as the parent node 365.
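The flow of FIG. 3 can be traced with a simplified, single threaded
sketch for an insert. Several assumptions are made for brevity: lock
operations are only recorded in a trace instead of being enforced,
the only node treated as likely to change is the leaf that receives
the key (splits of full nodes are omitted), and the names Node,
likely_to_change and insert are hypothetical. The numbers in the
comments refer to the steps of FIG. 3.

```python
import bisect


class Node:
    """Hypothetical node; keys of internal nodes act as separators."""

    def __init__(self, name, keys, children=None):
        self.name = name
        self.keys = keys
        self.children = children or []


def likely_to_change(node):
    # Assumed policy: only the leaf that receives the key changes.
    return not node.children


def insert(root, key):
    """Insert key and return the recorded sequence of lock operations."""
    trace = []
    parent = root
    trace.append(("read_lock", parent.name))           # 305
    if likely_to_change(parent):                       # 310
        trace.append(("promote", parent.name))         # 315
        bisect.insort(parent.keys, key)                # 325 (root is a leaf)
    while parent.children:                             # 320
        child = parent.children[bisect.bisect_right(parent.keys, key)]
        trace.append(("read_lock", child.name))        # 340
        if likely_to_change(child):                    # 345
            trace.append(("promote", child.name))      # 350
            bisect.insort(child.keys, key)             # 355
        trace.append(("release", parent.name))         # 360
        parent = child                                 # 365
    trace.append(("release", parent.name))             # 330
    return trace
```

In the trace produced by this sketch, every node on the path is read
locked, and only the leaf that receives the key is promoted, so the
root node remains available to other functions almost immediately.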
[0057] FIG. 4 illustrates an exemplary method embodiment of the
promote function. As illustrated, upon receiving the command to
promote 405, the method first applies a lock to the node 410. This
lock can be similar to a write lock in that it prohibits all read
and write functions from accessing the node. The lock, however, can
be configured to not affect the functions that were already
accessing the node at the time the promote function was
executed.
[0058] The method next determines whether any other functions are
accessing the node 415. If no other functions are accessing the
node, the lock on the node is promoted 420 from a read lock to a
write lock.
[0059] If other functions are accessing the node, the method
determines whether one of those functions is requesting to promote
425. If one of the other functions is requesting to promote, that
other function will be denied and its promote will return an error 430.
Upon receiving an error, the denied function can try to re-access
the tree from the root node.
[0060] If at 425 the method determines that a function is not
trying to promote, the method returns to step 415.
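One way the promote function of FIG. 4 might be realized with a
condition variable is sketched below. This is an assumption-laden
illustration, not the disclosed implementation: the names
PromotableLock and PromoteError are hypothetical, and a second
promote request is rejected when it arrives instead of being detected
by the first promoter, which yields the same outcome (the first
request wins and the later one receives an error).

```python
import threading


class PromoteError(Exception):
    """Raised when another function already requested a promote (430)."""


class PromotableLock:
    """Hypothetical read lock that its holder can promote to a write lock."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # functions currently holding a read lock
        self._promoting = False  # gate set at 410; blocks new arrivals
        self._writer = False     # True once the promote has completed

    def acquire_read(self):
        with self._cond:
            while self._promoting or self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def promote(self):
        """Promote the caller's read lock; the caller must hold one."""
        with self._cond:
            if self._promoting:        # 425: a promote is already pending
                raise PromoteError("another function is promoting")  # 430
            self._promoting = True     # 410: lock out new functions
            while self._readers > 1:   # 415: others are still accessing
                self._cond.wait()
            self._readers = 0          # 420: read lock becomes write lock
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._promoting = False
            self._cond.notify_all()
```

In this sketch, a function that receives PromoteError is expected to
call release_read and re-access the tree from the root node; until it
does, the function that was granted the promote keeps waiting at 415.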
[0061] FIG. 5 illustrates an exemplary system embodiment in which
an optimized b-tree can be implemented. As illustrated, servers 505,
510, user devices 515 and personal computers 520 can be configured
to communicate with each other directly or through use of a
communications network 525. Although certain types of computing
devices are illustrated, this is only for exemplary purposes; any
type of computing devices can be used. An optimized b-tree can be
implemented on any or all of the user devices.
[0062] An optimized b-tree can be used to store any type of data
for any purpose. For example, an optimized b-tree can be used as a
file system to represent files and directories. This type of
embodiment can be implemented on any of the devices and accessed
from any of the devices. For example, server 505 can implement an
optimized b-tree as a file system. Server 505 can be configured to
be in direct communication with server 510 so that a function
running on server 510 can access the optimized b-tree on server
505. Alternatively, user device 515 can access server 505 via the
communications network 525 and access the optimized b-tree by
accessing files stored on server 505. Personal computer 520 can
likewise access server 505 via the communications network to read
or write files stored in the optimized b-tree.
[0063] Embodiments within the scope of the present disclosure may
also include tangible and/or non-transitory computer-readable
storage media for carrying or having computer-executable
instructions or data structures stored thereon. Such non-transitory
computer-readable storage media can be any available media that can
be accessed by a general purpose or special purpose computer,
including the functional design of any special purpose processor as
discussed above. By way of example, and not limitation, such
non-transitory computer-readable media can include RAM, ROM,
EEPROM, CD-ROM or other optical disk storage, solid state drive,
magnetic disk storage or other magnetic storage devices, or any
other medium which can be used to carry or store desired program
code means in the form of computer-executable instructions, data
structures, or processor chip design. When information is
transferred or provided over a network or another communications
connection (either hardwired, wireless, or a combination thereof) to
a computer, the computer properly views the connection as a
computer-readable medium. Thus, any such connection is properly
termed a computer-readable medium. Combinations of the above should
also be included within the scope of the computer-readable
media.
[0064] Computer-executable instructions include, for example,
instructions and data which cause a general purpose computer,
special purpose computer, or special purpose processing device to
perform a certain function or group of functions.
Computer-executable instructions also include program modules that
are executed by computers in stand-alone or network environments.
Generally, program modules include routines, programs, components,
data structures, objects, and the functions inherent in the design
of special-purpose processors, etc. that perform particular tasks
or implement particular abstract data types. Computer-executable
instructions, associated data structures, and program modules
represent examples of the program code means for executing steps of
the methods disclosed herein. The particular sequence of such
executable instructions or associated data structures represents
examples of corresponding acts for implementing the functions
described in such steps.
[0065] Those of skill in the art will appreciate that other
embodiments of the disclosure may be practiced in network computing
environments with many types of computer system configurations,
including personal computers, hand-held devices, multi-processor
systems, microprocessor-based or programmable consumer electronics,
network PCs, minicomputers, mainframe computers, and the like.
Embodiments may also be practiced in distributed computing
environments where tasks are performed by local and remote
processing devices that are linked (either by hardwired links,
wireless links, or by a combination thereof) through a
communications network. In a distributed computing environment,
program modules may be located in both local and remote memory
storage devices.
[0066] The various embodiments described above are provided by way
of illustration only and should not be construed to limit the scope
of the disclosure. Those skilled in the art will readily recognize
various modifications and changes that may be made to the
principles described herein without following the example
embodiments and applications illustrated and described herein, and
without departing from the spirit and scope of the disclosure.
* * * * *