U.S. patent application number 14/836400 was filed with the patent office on 2015-12-17 for system and method for indexing streams containing unstructured text data.
The applicant listed for this patent is Red Lambda, Inc.. Invention is credited to Robert Bird, Adam Leko, Matthew Whitlock.
Application Number | 20150363446 14/836400 |
Document ID | / |
Family ID | 49995921 |
Filed Date | 2015-12-17 |
United States Patent
Application |
20150363446 |
Kind Code |
A1 |
Leko; Adam ; et al. |
December 17, 2015 |
System and Method for Indexing Streams Containing Unstructured Text
Data
Abstract
A system, method and computer readable medium for indexing
streaming data. Data may be received from distributed devices
connected via a network. Data elements may be stored and allocated
to data blocks and events of the block stores. Non-text data may be
converted into a text representation. The data may be split into
terms, and term frequencies of each term within each of the event
may be calculated. Block-level term frequency statics may be
calculated based on the term frequencies. Tree index structures,
such as the Y-tree index, may be generated based on the block-level
term frequency data. The Y-tree index structures may use the terms
as keys and pointers to the corresponding data blocks and
block-level term frequency data. A search query may be performed
over the tree index structures.
Inventors: |
Leko; Adam; (Longwood,
FL) ; Bird; Robert; (Longwood, FL) ; Whitlock;
Matthew; (Longwood, FL) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
Red Lambda, Inc. |
Longwood |
FL |
US |
|
|
Family ID: |
49995921 |
Appl. No.: |
14/836400 |
Filed: |
August 26, 2015 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
13832767 |
Mar 15, 2013 |
|
|
|
14836400 |
|
|
|
|
61677171 |
Jul 30, 2012 |
|
|
|
Current U.S.
Class: |
707/743 |
Current CPC
Class: |
G06F 16/24568 20190101;
G06F 16/322 20190101; G06F 16/2246 20190101 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1-29. (canceled)
30. A method for indexing data, comprising the steps of: allocating
stored data-elements to data-blocks of block-stores, wherein the
stored data-elements are stored in the block-stores, wherein the
stored data-elements are allocated via one or more processors to
the data-blocks; further allocating the block-allocated
data-elements to events of the data-blocks, wherein each of the
data-blocks comprise one or more events, wherein each of the events
comprise the block-allocated data-elements of the corresponding
data-block, wherein the block-allocated data-elements are allocated
via the one or more processors to the events; and, splitting the
event-allocated data-elements into terms, wherein the
event-allocated data-elements are split via the one or more
processors into the terms, wherein the terms are adapted to be used
in tree index structures as keys, wherein the tree index structures
are adapted to be generated for the event-allocated data-elements,
wherein the tree index structures are generated via the one or more
processors.
31. The method of claim 30, further comprising the step of:
receiving data-streams, wherein the data-streams comprise streamed
data-elements, wherein the streamed data-elements are stored via
one or more processors in block-stores, wherein the streamed
data-elements are said stored data-elements.
32. The method of claim 30, further comprising the step of:
calculating a term frequencies of each term in each of the events,
wherein the term frequencies are calculated via the one or more
processors; calculating block-level term frequency data for the
event-allocated data-elements stored in the corresponding
data-block based on the term frequencies, wherein the block-level
term frequency data is calculated via the one or more processors;
and, generating the tree index structures for the event-allocated
data-elements based on the block-level term frequency data, wherein
the terms are used in the tree index structures as keys, wherein
the tree index structures are calculated via the one or more
processors.
33. A system for indexing data, comprising: block-stores adapted to
store data-elements; data-blocks of the block-stores, the stored
data-elements being allocated via one or more processors to the
data-blocks; events of the data-blocks, the block-allocated
data-elements being further allocated via the one or more
processors to the events of the data-blocks, each of the
data-blocks comprising one or more events, each of the events
comprising the block-allocated data-elements of a corresponding
data-block; and, terms generated via the one or more processors by
splitting the event-allocated data-elements, wherein the terms are
adapted to be used in tree index structures as keys, wherein the
tree index structures are adapted to be generated for the
event-allocated data-elements, wherein the tree index structures
are generated via the one or more processors.
34. The system of claim 33, wherein the data-elements comprise
streamed data-elements, wherein the data-elements are received via
data-streams.
35. The system of claim 33, further comprising: term frequencies
calculated via the one or more processors based on the frequency of
each term in each of the event; block-level term frequency data
calculated via the one or more processors for the event-allocated
data-elements that are stored in a corresponding data-block, the
block-level term frequency data being based on the term
frequencies; and, the tree index structures generated via the one
or more processors for the event-allocated data-elements based on
the block-level term frequency data, the terms being used in the
tree index structures as keys.
36. A non-transitory computer readable medium having computer
readable instructions stored thereon for execution by a processor,
wherein the instructions on the non-transitory computer readable
medium are adapted to enable a computing device to: allocate stored
data-elements to data-blocks of block-stores; further allocate the
block-allocated data-elements to events of the data-blocks, wherein
each of the data-blocks comprise one or more events, wherein each
of the events comprise the block-allocated data-elements of the
corresponding data-block; and, split the event-allocated
data-elements into terms, wherein the terms are adapted to be used
in tree index structures as keys, wherein the tree index structures
are adapted to be generated for the event-allocated data-elements,
wherein the tree index structures are generated via the one or more
processors.
37. The non-transitory computer readable medium of claim 36,
wherein the instructions on the non-transitory computer readable
medium are further adapted to enable a computing device to: receive
data-streams, wherein the data-streams comprise streamed
data-elements, wherein the streamed data-elements are stored via
one or more processors in block-stores, wherein the streamed
data-elements are said stored data-elements.
38. The non-transitory computer readable medium of claim 36,
wherein the instructions on the non-transitory computer readable
medium are further adapted to enable a computing device to:
calculate a term frequencies of each term in each of the events,
wherein the term frequencies are calculated via the one or more
processors; calculate block-level term frequency data for the
event-allocated data-elements stored in the corresponding
data-block based on the term frequencies, wherein the block-level
term frequency data is calculated via the one or more processors;
and, generate the tree index structures for the event-allocated
data-elements based on the block-level term frequency data, wherein
the terms are used in the tree index structures as keys, wherein
the tree index structures are calculated via the one or more
processors.
Description
[0001] This non-provisional patent application claims priority to,
and incorporates herein by reference, U.S. Provisional Patent
Application No. 61/677,171 which was filed Jul. 30, 2012 and
further incorporates herein by reference, U.S. patent application
Ser. No. 13/600,853 which was filed Aug. 31, 2012.
[0002] This application includes material which is subject to
copyright protection. The copyright owner has no objection to the
facsimile reproduction by anyone of the patent disclosure, as it
appears in the Patent and Trademark Office files or records, but
otherwise reserves all copyright rights whatsoever.
FIELD OF THE INVENTION
[0003] The presently disclosed invention relate in general to the
field of indexing and retrieving data, and in particular to a
system and method for indexing streaming text data in distributed
systems.
BACKGROUND OF THE INVENTION
[0004] Systems for indexing text data are known in the art. Basic
data indexing and information retrieval techniques have been
described in a book entitled "Introduction to Information
Retrieval", ISBN 0521865719. Technology for applications that
require full-text searches is well known in the Apache community
for the Apache Lucene Core open-source software project, which is
supported over the Internet by the Apache Software Foundation. In
addition, the paper entitled "A Novel Index Supporting High Volume
Data Warehouse Insertions" which is authored by C. Jermaine et al.,
while failing to address text indexing, describes certain indexing
techniques and Y-tree index structures for processing fast
insertions of telephone Call Detail Records (CDR). All of these
publications are incorporated herein by reference. Such indexing
systems, however, are problematic for full-text indexing on large
volumes of streaming data. The presently disclosed invention
addresses such limitations, inter alias, by providing an improved
text indexing system and method with acceptable worst-case
insertion performance to enable real-time querying of streams of
data.
SUMMARY OF THE INVENTION
[0005] The presently disclosed invention may be embodied in various
forms, including a system, computer readable medium or a method for
indexing data.
[0006] In an embodiment of a data indexing system, the system may
comprise block-stores adapted to store data-elements of
data-streams. The system may comprise one or more data-blocks of
the block-stores. The stored data-elements may be allocated to the
one or more data-blocks. In addition, the system may comprise
events of the one or more data-blocks. The block-allocated
data-elements may be further allocated to the events of the
data-blocks. Each of the data-blocks may comprise one or more
events. Each of the events may comprise the block-allocated
data-elements of a corresponding data-block.
[0007] The system may further comprise terms generated by splitting
the event-allocated data-elements, and term frequencies calculated
based on the frequency of each term in each of the event. The
system may also comprise block-level term frequency data calculated
for the event-allocated data-elements that are stored in a
corresponding data-block. The block-level term frequency data may
be based on the term frequencies. Further, the system may comprise
tree index structures generated for the event-allocated
data-elements based on the block-level term frequency data. The
tree index structures may comprise Y-tree index structures. The
terms may be used in the Y-tree index structures as keys. In an
embodiment, the block-stores may be stored on a plurality of
distributed devices. In certain embodiments, the data-streams may
be received from a plurality of distributed devices. The plurality
of distributed devices may be connected via a network.
[0008] Further disclosed is an embodiment of a computer readable
medium for the presently disclosed invention comprising computer
readable instructions stored thereon for execution by a processor.
The instructions on the computer-usable medium may be adapted to
enable a computing device to receive data-streams, wherein the
data-streams may comprise data-elements, and store the
data-elements of the received data-streams, wherein the stored
data-elements may be stored in block-stores. In addition, the
instructions may enable a computing device to allocate the stored
data-elements to data-blocks of the block-stores and further
allocate the block-allocated data-elements to events of the
data-blocks. Each of the data-blocks may comprise one or more
events. Each of the events may comprise the block-allocated
data-elements of the corresponding data-block. Further, the
instructions may enable a computing device to split the
event-allocated data-elements into terms, calculate a frequency of
each term in each of the event, and calculate block-level term
frequency data for the event-allocated data-elements stored in the
corresponding data-block based on the calculated term frequencies.
The instructions may also enable a computing device to generate
tree index structures for the event-allocated data-elements based
on the block-level term frequency data. The tree index structures
may comprise Y-tree index structures. The terms may be used in the
Y-tree index structures as keys.
[0009] Similarly, an embodiment of a method for the presently
disclosed invention may include the step of receiving data-streams.
The data-streams may be received from a single distributed device
or from a plurality of distributed devices. The distributed devices
may be connected via a network. Each of the data-streams may
comprise data-elements. The method may include the step of storing
the data-elements of the received data-streams. The stored
data-elements may be stored in block-stores. The block-stores may
be stored on a single distributed device or across a plurality of
distributed devices. Such distributed devices may be connected via
a network.
[0010] Further, the method may include the step of allocating the
stored data-elements to data-blocks of the block-stores. Each of
the block-stores may comprise one or more data-blocks. In an
embodiment, each data-block may comprise the stored data-elements
of only one of the received data-streams. The data-blocks of a
single data-stream may be logically grouped together. Each of the
data-blocks may be read and written as a single unit. In addition,
the method may include allocating the block-allocated data-elements
to events of the data-blocks. Each of the data-blocks may comprise
one or more events. Each of the events may comprise the
block-allocated data-elements of the corresponding data-block.
[0011] In addition, the method may include the step of splitting
the event-allocated data-elements into terms. Further, the method
may include the step of calculating a frequency of each term within
each of the event, and the step of calculating block-level term
frequency statics or data for the event-allocated data-elements
that are stored in the corresponding data-block based on the
calculated term frequencies. The method may also include the step
of generating tree index structures for the event-allocated
data-elements based on the block-level term frequency data. The
tree index structures may comprise Y-tree index structures. The
Y-tree index structures may use the terms as keys.
[0012] In embodiments of the above-disclosed system, computer
readable medium, and method, pointers to the data-blocks may be
generated. Such pointers may comprise values stored in the Y-tree
index structures that identify, or point to, the corresponding
data-blocks and the corresponding block-level term frequency
data.
[0013] In some embodiments of the above-disclosed system, computer
readable medium, and method, event-allocated data-elements may
comprise text data. Event-allocated data-elements may also comprise
a text representation of non-text data, as data-streams may
comprise non-text data that is converted into a text
representation. Event-allocated data-elements may comprise
unstructured data, which may be split in accordance with processes
outlined in an Unicode Standard Annex #29 published the Unicode
Consortium. Multiple writers to the data-blocks may have an
independent tree structure.
[0014] In certain embodiments of the above-disclosed system,
computer readable medium, and method, a search query of the tree
index structures may be performed. The search query for the
data-elements may be performed over all of the tree index
structures. Term statistics may be extracted from query text of the
search query. In some embodiments, a list of candidate data-blocks
may be generated that satisfy the search query based on the Y-tree
index structure. The search query may be evaluated against each of
the data-blocks to generate a list of matching records.
[0015] In some embodiments of the above-disclosed system, computer
readable medium, and method, a term-proximity search query of the
tree index structures may be performed. In an embodiment, the
term-proximity search query may be a wildcard suffix matches search
query, wherein a minimum key in the Y-tree index structures
satisfies a pattern requirement. The keys may be iterated through
until a key is reached that is different from the pattern
requirement. In certain embodiments, the term-proximity search
query may be a fuzzy matches search. In an embodiment, the
term-proximity search query may be based on a Soundex algorithm. In
some embodiments, a list of the terms that are present in a master
Y-tree index structure may be maintained.
[0016] In an embodiment of the above-disclosed system, computer
readable medium, and method, individual pages within the Y-tree
index structure may be compressed. The individual pages within the
Y-tree index may be compressed utilizing compression algorithms.
The individual pages within the Y-tree index may be compressed by
storing data-block in the Y-tree index structures via
gap-compressed encodings, such as .gamma. or .delta. gap-compressed
encodings.
[0017] In an embodiment of the above-disclosed system, computer
readable medium, and method, search queries may be multicasted to a
set of dedicated search nodes that perform searches. Search queries
may be transmitted to a group of destination computing devices
simultaneously in a single transmission from the requesting
computing device, which may have minimal resources. In some
embodiments, query results that are gathered from the search nodes
may be combined. Separate Y-tree index structures may be utilized
per stream writers. In certain embodiments, an index data page
cache for the search nodes may be generated. In an embodiment, a
pre-determined amount of time that an index data page may reside in
the cache before being refreshed with a new page from the backing
storage may be adjusted.
[0018] In an embodiment of the above-disclosed system, computer
readable medium, and method, a block-identifier may be assigned to
each of the data-blocks. Such a block-identifier may be globally
unique. Each of the block-stores may comprise one or more
data-blocks. Each of the data-blocks may be read and written as a
single unit. The data-blocks of a single data-stream may be
logically grouped.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The foregoing and other objects, features, and advantages of
the invention will be apparent from the following more particular
description of embodiments as illustrated in the accompanying
drawings, in which reference characters refer to the same parts
throughout the various views. The drawings are not necessarily to
scale, emphasis instead being placed upon illustrating principles
of the invention.
[0020] FIG. 1 is a graph illustrating the results of a scale test
performed with a text-indexing system, in accordance with certain
embodiments of the invention.
[0021] FIG. 2 is a graph illustrating the results of a scale test
performed with a text-indexing system, in accordance with certain
embodiments of the invention.
[0022] FIG. 3 is a block diagram illustrating components of an
embodiment of a data indexing system, in accordance with certain
embodiments of the invention.
[0023] FIG. 4 is a flowchart illustrating steps of an embodiment of
a data indexing method, in accordance with certain embodiments of
the invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0024] Reference will now be made in detail to the embodiments of
the presently disclosed invention, examples of which are
illustrated in the accompanying drawings.
[0025] One of the objects of the present system and method may be
an application in which full-text indexing on large volumes of
streaming data 1 is performed with acceptable worst-case insertion.
The object for certain embodiments may concern such an application
which enables real-time querying of data 2 streaming over a network
3. Such data-elements 2 of streaming data 1 may be received and
stored in block-stores 4. These block-stores 4 may be stored on a
distributed device 5 or across a plurality of distributed devices
5. The block-stores 4 may comprise data-blocks 6, which may be
assigned with block-identifiers 7. The data-blocks 6 may comprise
events 8. The block-allocated data-elements 5 may be further
allocated to the events 8 of the data-blocks 6. Embodiments may not
be limited, however, to any one application, example or object. The
embodiments may be applicable in virtually any application in which
text data is indexed for later searching, retrieval, updating, and
deletion. The embodiments of the present system and method are well
suited for implementation in a distributed environment in which
streams 1 of text data are persisted across multiple data storage
nodes connected in a network 3.
[0026] Existing indexing solutions for text data fail to work with
unbounded streams of incoming data. Prior approaches rely on
periodically rebuilding index structures after the data set reaches
a certain threshold. This may work well for batch-processed index
updates, such as those used by web search engines. However, when
applied to streaming data, such techniques result in long periods
of time where the index is not available or where the index state
reflects stale data.
[0027] Text indexing approaches have generally fallen into two
categories: inverted indices and suffix arrays. Inverted indices
are generally more space efficient, but require preprocessing text
into individual word tokens and restricting queries to matches on
whole word tokens. Suffix arrays, and their variants, allow for
searching arbitrary substrings on large bodies of text, but require
more up-front computation to generate the index data structures and
generally have a much higher space penalty as compared to inverted
indices.
[0028] As practical approaches for efficient incremental updates to
suffix array-based text indices are lacking, recent applications
which utilize such indices may leverage advanced compression
techniques such as those used by the FM-index (developed by
Ferragina and Manzini) and the Wavelet tree (developed by Navarro
et al.). While these variants greatly reduce the space overhead
required for suffix arrays, they fail to support efficient
incremental updates at the rates required to handle incoming
streams of data. In contrast, an embodiment of the present
invention may provide efficient incremental updates that work in a
distributed streaming environment.
[0029] Prior solutions for inverted indices include performing
incremental merges of separate index segments using techniques such
as a log-structured merge or a multi-way merge, which is the
approach that is taken by the Apache Lucene Core project referenced
above. More recent solutions include processes based on
"cache-oblivious" data structures developed by researchers at
Massachusetts Institute of Technology (MIT). While such methods
have good amortized complexity for insertions, those approaches are
inefficient as they require expensive periodic reorganizations of
data within the index.
[0030] An object of an embodiment of the present invention may
include a data structure designed for efficient batch insertions
with the traditional inverted indices working on top of a
compression layer. A benefit of such an embodiment is that it may
work with block-based stream storage mechanisms and techniques,
such as those disclosed in U.S. patent application Ser. No.
13/600,853, entitled "System and Method for Storing Data Streams in
a Distributed Environment," which has been incorporated by
reference above.
[0031] An advantage of an indexing embodiment for the presently
disclosed invention which does not require utilization of
traditional storage mechanisms may be appreciated when comparing
the results illustrated in FIGS. 1 and 2. The results obtained from
an indexing implementation of an embodiment of the presently
disclosed invention are labeled "New Index," while the results
obtained from the Apache Lucene implementation are labeled
"Lucene." The Apache Lucene implementation is a full-text index
implementation that uses traditional techniques known in the art.
FIG. 1 compares the processing time results obtained by the Lucene
implementation versus the New Index implementation. FIG. 2 compares
the space overhead results obtained by the Lucene implementation
versus the New Index implementation. As illustrated in these
figures, the New Index implementation results in significantly
shorter processing time and lower space overhead than the Lucene
implementation as the amount of indexed data increases.
[0032] While FIGS. 1 and 2 do not illustrate background
input/output (I/O) operations performed by the Apache Lucene
implementation, the disadvantages in performance due to such
additional operations can still be realized. The amount of
background I/O operations that the Apache Lucene implementation
uses for merge operations grows as the amount of indexed data
grows. Once the index size is large enough, these periodic rebuild
operations dominate system resources. In contrast, the novel
indexing implementation of an embodiment of the presently disclosed
invention requires no background I/O operations, and thus
performance does not degrade as significantly as the amount of
indexed data increases.
[0033] An embodiment of the invention may be performed in two
phases: 1) front-end processing and 2) back-end storage. The
front-end processing phase may utilize unstructured data-elements 2
as input. If such data 2 is not in text form, the data 2 may be
first converted to a text representation using any appropriate
method known in the art. Once the data 2 is in text form, the text
data-elements 2 may be split into individual terms 9, such as
words. The text data-elements 2 may be split using the process
outlined in Unicode Standard Annex #29 such that language-specific
word boundaries are respected. The annex, entitled "Unicode Text
Segmentation" and published on the Internet by the Unicode
Consortium, describes guidelines for determining default boundaries
between certain significant text elements. After the input text
data-elements 2 have been split into individual terms 9, the
front-end processor calculates the frequency 10 of each term 9 in
each event 8.
[0034] If using a block-based storage mechanism such as the one
described in U.S. patent application Ser. No. 13/600,853, the term
frequencies 10 may be reduced down to a set of frequency statistics
11 for the records or data-elements 2 contained in a single
data-block 6. Once the data-block 6 is full and is flushed to disk,
the block-level term frequency data 11 may be flushed to the tree
index structure 12 described below. Parallelism in the front-end
may be achieved as in the aforementioned patent application. A pool
of data-blocks 6 may be made available during the insertion
process, and threads performing insertions may update the term
frequency data structures 10 for each pooled block after each
insertion. Other storage mechanisms may introduce parallelism in
the front-end phase as appropriate based on the underlying storage
for each event 8. Such an embodiment which utilizes such a
block-based storage mechanism is further described below.
[0035] Once frequency statistics/data 11 have been aggregated for
each group of inserted records/data-elements 2, pointers 16 to each
data-block 6, or to each individual record/data-element 2, may be
inserted into the back-end storage tree structure 12. This tree
structure 12 preferably uses a modified B.sup.+-tree variant known
as a Y-tree 12. The tree structure 12 may use other indexes that
support fast bulk insertions while still providing efficient query
performance, as accomplished by a B+-tree variant. The Y-tree
paper, entitled "A Novel Index Supporting High Volume Data
Warehouse Insertions," which has been incorporated by reference
above, provides sufficient information for a competent developer
familiar with B.sup.+-trees to implement a fully functional Y-tree
12.
[0036] Records/data-elements 2 may be represented within such a
Y-tree structure 12 by using individual terms 9 as keys 13 and
lists 14 of record entries as values 15. Each value entry 15 may
contain a pointer 16 to a data-block 6 along with overall
statistics 11 for the records/data-elements 2 contained within that
data-block 6. Parallelism in the back-end may be achieved by
allowing multiple writers to each have an independent tree
structure 12. In an embodiment, searches must be performed over all
tree instances. Read-write locks or atomic lock-free schemes that
work over tree structures 12 may also be used to allow parallel
updates to a master tree structure 12.
[0037] As with traditional B.sup.+-trees, explicitly caching
frequently used data pages within the Y-tree 12 is an effective way
to reduce the number of I/O operations performed when records are
inserted into the index structure 12.
[0038] To perform searches against the index structure 12, the
search query is processed in a manner similar to insertions. The
front-end is used to extract term statistics 10 from the query
text. For each referenced term, the Y-tree structure 12 is used to
generate a list of candidate data-blocks 6 that satisfy the search
query, and the search query is evaluated against each data-block 6
to generate a list of matching records. Efficient ranked retrieval
may be provided by using the aggregated block-level term statistics
11 contained in the master tree structure 12. Ranking functions
based on cosine similarity may be used with a straightforward
application of the procedure described in "Filtered Document
Retrieval with Frequency-Sorted Indexes" authored by M. Persin et
al. and published in the October, 1996 edition of the Journal of
the American Society for Information Science.
[0039] Extended query operations such as term proximity searches
may also be performed using such an embodiment, as described above.
Wildcard suffix matches may be performed by finding the minimum key
13 in the Y-tree structure 12 that satisfies the pattern
requirement and iterating through the keys 13 until reaching a key
13 that does not satisfy the pattern. Some types of searches--such
as fuzzy matches or searches using the Soundex algorithm--may
require a global term list. Soundex is a phonetic algorithm for
indexing words by sound, as pronounced in English. This may be
accomplished by iterating over the keys 13 in the Y-tree or, more
efficiently, by maintaining a separate smaller list containing only
terms 9 that are present in the Y-tree master index 12.
[0040] As with the aforementioned block-based storage system,
individual pages within the Y-tree 12 may also be compressed using
general-purpose, compression algorithms or by storing block or
record lists in the Y-tree 12 via .gamma. or .delta. gap-compressed
encodings.
[0041] If searches are to be performed by clients with minimal
resources, these clients may multicast their queries to a set of
dedicated nodes that perform searches on behalf of those clients.
If using separate Y-tree indices 12 per stream writer, results
gathered from each search node may be combined with a
straightforward application of techniques used in parallel
map-reduce systems. These search nodes may also benefit from the
addition of an index data page cache to ensure queries containing
popular terms 9 are quickly serviced. Additionally, the amount of
time that pages are allowed to reside in the cache before being
refreshed with newer pages from backing storage may be tuned or
adjusted to balance freshness of results versus the amount of I/O
operations performed in order to keep those results up-to-date.
[0042] An object of an embodiment of the present invention may be
to balance fast insertion speeds with ensuring that data is
available as soon as possible, all while efficiently mapping to
file system operations available in distributed file systems.
Selection of an indexing structure 12 that accomplishes these
objectives was an important part of the design process used to
generate the techniques embodied by the presently disclosed
invention. In addition to selecting such a data structure 12, an
embodiment may require augmenting inverted indices with data
structures that support efficient batch insertions. These
techniques, together with the extended Y-tree index structure 12
disclosed in this present application, solve the technical problems
associated with satisfying the requirements presented by the above
objectives.
[0043] A specific object for an embodiment may be to achieve a
sustained insertion rate of over 50,000 events 8 per second for an
input data set containing 5 billion events 8. Due to the paged
nature of certain approaches, searches against the generated index
12 may require only one block of index data in memory at a time.
Conventional search structures often require holding significant
portions of the index structure 12 in memory in order to perform
searches against the indexed data. Due to the compression mechanism
used by the index structure 12, the storage space used by the index
12 and the data stored by the index 12 may be approximately
one-quarter of the size of the raw data set in its uncompressed
form.
[0044] The term "data element" 2 shall mean a set of binary data
containing a unit of information. Examples of data-elements 2
include, without limitation, a packet of data flowing across a
network 3; a row returned from a database query; a line in a
digital file such as a text file, document file, or log file; an
email message; a message system message; a text message; a binary
large object; a digitally stored file; an object capable of storage
in an object-oriented database; and an image file, music file, or
video file. Data-elements 2 often, but do not always, represent
physical objects such as sections of a DNA molecule, a physical
document, or any other binary representation of a real world
object.
[0045] The term "instructions" shall mean a set of digital data
containing steps to be performed by a computing device. Examples of
"instructions" include, without limitation, a computer program,
macro, or remote procedure call that is executed when an event
occurs (such as detection of an input data-element 2 that has a
high probability of falling within a particular category). For the
purposes of this disclosure, "instructions" can include an
indication that no operation is to take place, which can be useful
when an event that is expected, and has a high likelihood of being
harmless, has been detected, as it indicates that such event can be
ignored. In certain preferred embodiments, "instructions" may
implement state machines.
[0046] The term "machine readable storage" shall mean a medium
containing random access or read-only memory that is adapted to be
read from and/or written to by a computing device having a
processor. Examples of machine readable storage shall include,
without limitation, random access memory in a computer; random
access memory or read only memory in a network device such as a
router switch, gateway, network storage device, network security
device, or other network device; a CD or DVD formatted to be
readable by a hardware device; a thumb drive or memory card
formatted to be readable by a hardware device; a computer hard
drive; a tape adapted to be readable by a computer tape drive; or
other media adapted to store data that can be read by a computer
having appropriate hardware and software.
[0047] The term "network" 3 or "computer network" shall mean an
electronic communications network adapted to enable one or more
computing devices to communicate by wired or wireless signals.
Examples of networks 3 include, but are not limited to, local area
networks (LANs), wide area networks (WANs) such as the Internet,
wired TCP and similar networks, wireless networks (including
without limitation wireless networks conforming to IEEE 802.11 and
the Bluetooth standards), and any other combination of hardware,
software, and communications capabilities adapted to allow digital
communication between computing devices.
[0048] The term "operably connected" shall mean connected either
directly or indirectly by one or more cable, wired network, or
wireless network connections in such a way that the operably
connected components are able to communicate digital data from one
to another.
[0049] The term "output" shall mean to render (or cause to be
rendered) to a human-readable display such as a computer or
handheld device screen, to write to (or cause to be written to) a
digital file or database, to print (or cause to be printed), or to
otherwise generate (or cause to be generated) a copy of information
in a non-transient form. The term "output" shall include creation
and storage of digital, visual and sound-based representations of
information.
[0050] The term "server" shall mean a computing device adapted to
be operably connected to a network 3 such that it can receive
and/or send data to other devices operably connected to the same
network, or service requests from such devices. A server has at
least one processor and at least one machine-readable storage
medium operably connected to that processor, such that the
processor can read data from that machine-readable storage.
[0051] The term "system" shall mean a plurality of components
adapted and arranged as indicated. The meanings and definitions of
other terms used herein shall be apparent to those of ordinary
skill in the art based upon the following disclosure.
[0052] FIG. 3 is a block diagram illustrating components of an
embodiment of a data indexing system, in accordance with certain
embodiments of the invention. As shown, such an embodiment may
comprise block-stores 4 stored on distributed devices 5. The
distributed devices 5 may be adapted to communicate via a network
3. A distributed device 5 may comprise a computing device, such as
a computer or a smart phone, which can share data with other
distributed devices 5 via a network 3. Data-streams 1 may be
transmitted and received from the distributed devices 5. Each of
the data-streams 1 may comprise data-elements 2, which may be
transmitted via the network 3. The block-stores 4 may store the
data-elements 2 of the received data-streams 1. Such stored
data-elements 2 may comprise digital copies of the transmitted
data-elements 2. The data-elements 2 may comprise unstructured
data.
[0053] The stored data-elements 2 may be allocated to data-blocks 6
of a block-store 1, as illustrated in FIG. 3. Such block-allocated
data-elements 2 may be logically grouped. Each of the block-stores
4 may comprise one or more data-blocks 6. In an embodiment, each
data-block 6 may comprise the stored data-elements 2 of only one of
the received data-streams 1. In addition, the data-blocks 6 of a
single data-stream 1 may be logically grouped. In an embodiment, a
block-identifier 7 may be assigned to each of the data-blocks 6.
Such a block-identifier 7 may be globally unique. The data-blocks 6
may comprise events 8. The block-allocated data-elements 2 may be
further allocated to the events 8 of the data-blocks 6. Such
event-allocated data-elements 2 may be logically grouped. Each of
the data-blocks 6 may comprise one or more events 8. Each of the
events 8 may comprise the block-allocated data-elements 2 of the
corresponding data-block 6. In an embodiment, these event-allocated
data-elements 2 may comprise the stored data-elements 2 of only one
of the received data-streams 1.
[0054] FIG. 4 is a flowchart illustrating steps of an embodiment of
a data indexing method, in accordance with certain embodiments of
the invention. As shown, such an embodiment may comprise the step
of receiving 401 data-streams 1. The data-streams 1 may be received
from a single distributed device 5 or from a plurality of
distributed devices 5. The distributed devices 5 may be connected
via a network 3. Each of the data-streams 1 may comprise
data-elements 2. The method may include the step of storing 402 the
data-elements 2 of the received data-streams 1. The stored
data-elements 2 may be stored in block-stores 4. The block-stores 4
may be stored on a single distributed device 5 or across a
plurality of distributed devices 5. Such distributed devices 5 may
be connected via a network 3.
[0055] Further, the method may include the step of allocating 403
the stored data-elements 2 to data-blocks 6 of the block-stores 4.
Each of the block-stores 4 may comprise one or more data-blocks 6.
In an embodiment, each data-block 6 may comprise the stored
data-elements 2 of only one of the received data-streams 1. The
data-blocks 6 of a single data-stream 1 may be logically grouped
together. Each of the data-blocks 6 may be read and written as a
single unit. In addition, the method may include allocating 404 the
block-allocated data-elements 2 to events 8 of the data-blocks 6.
Each of the data-blocks 6 may comprise one or more events 8. Each
of the events 8 may comprise the block-allocated data-elements 2 of
the corresponding data-block 6.
[0056] In addition, the method may include the step of splitting
405 the event-allocated data-elements 2 into terms 9. Further, the
method may include the step of calculating 406 a frequency 10 of
each term 9 within each of the event 8, and the step of calculating
407 block-level term frequency statics or data 11 for the
event-allocated data-elements 2 that are stored in the
corresponding data-block 6 based on the calculated term frequencies
10. The method may also include the step of generating 408 tree
index structures 12 for the event-allocated data-elements 2 based
on the block-level term frequency data 11. The tree index
structures 12 may comprise Y-tree index structures. The Y-tree
index structures 12 may use the terms 9 as keys 13. The method may
further include the step of performing 409 a search query for the
data-elements 2 over all of the tree index structures 12.
[0057] While the invention has been particularly shown and
described with reference to an embodiment thereof, it will be
understood by those skilled in the art that various changes in form
and details may be made therein without departing from the spirit
and scope of the invention.
* * * * *