U.S. patent application number 14/960650 was filed with the patent office on 2016-03-24 for method for in-loop human validation of disambiguated features.
This patent application is currently assigned to QBASE, LLC. The applicant listed for this patent is QBASE, LLC. The invention is credited to Sanjay BODDHU, Rakesh DAVE, and Scott LIGHTNER.
Application Number: 20160085760 (14/960650)
Document ID: /
Family ID: 53265489
Filed Date: 2016-03-24

United States Patent Application 20160085760
Kind Code: A1
LIGHTNER; Scott; et al.
March 24, 2016
METHOD FOR IN-LOOP HUMAN VALIDATION OF DISAMBIGUATED FEATURES
Abstract
Methods for providing in-loop validation of disambiguated
features are disclosed. The disclosed methods may include
disambiguating features in unstructured text using co-occurring
features derived from both the source document and a large document
corpus. The disambiguating systems may include multiple modules,
including a linking on-the-fly module for linking the derived
features from the source document to the co-occurring features of an
existing knowledge base. The system for disambiguating features may
allow identification of unique entities from a knowledge base that
includes entities with a unique set of co-occurring features, which
in turn may allow for increased precision in knowledge discovery and
search results by employing advanced analytical methods over a
massive corpus and a combination of entities, co-occurring entities,
topic IDs, and other derived features. The disclosed method may use
validation to provide input to the system for disambiguating
features.
Inventors: LIGHTNER; Scott; (Leesburg, VA); DAVE; Rakesh; (Dayton, OH); BODDHU; Sanjay; (Dayton, OH)

Applicant: QBASE, LLC; Reston, VA, US

Assignee: QBASE, LLC; Reston, VA

Family ID: 53265489
Appl. No.: 14/960650
Filed: December 7, 2015
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
14558237 | Dec 2, 2014 | 9223833
14960650 | |
61910802 | Dec 2, 2013 |
Current U.S. Class: 707/723; 707/722
Current CPC Class: G06F 40/284 20200101; G06F 16/245 20190101; G06F 16/24578 20190101; G06F 16/21 20190101; G06F 16/248 20190101; G06F 40/30 20200101; G06F 3/0482 20130101; G06F 40/14 20200101
International Class: G06F 17/30 20060101 G06F017/30; G06F 3/0482 20060101 G06F003/0482
Claims
1. A method comprising: receiving, by a first computer, a first
search query result from a search conductor, wherein the first
search query result is based on a search query and comprises a
record matching a field of the search query; sending, by the first
computer, the first search query result to a second computer such
that the second computer is able to disambiguate the first search
query result via a determination of a relatedness among an
individual record feature and a topic identification associated
with each record in the first search query result, wherein the
second computer comprises a main memory storing an in-memory
database, wherein the second computer is configured to link
disambiguation data, in real-time, as the disambiguation data is
requested by the first computer from the second computer;
receiving, by the first computer, a second search query result from
the second computer, wherein the second search query result has
been disambiguated via the second computer; sending, by the first
computer, the second search query result to a third computer such
that the third computer is able to receive an input on the second
search query result; generating, by the first computer, a new
feature occurrence record in a knowledge base database, wherein the
new feature occurrence record includes the input, wherein the
in-memory database comprises the knowledge base database; and
placing, by the first computer, a request that the new feature
occurrence record be stored in the knowledge base database such
that the second computer is able to adjust a parameter of a
disambiguation algorithm based on the input, wherein the
disambiguation algorithm involves linking via the second
computer.
2. The method of claim 1, wherein the second computer is configured
to assign a confidence score to the new feature occurrence record
based on the input.
3. The method of claim 1, wherein the second computer is configured
to adjust a weight associated with the individual record feature in
the first search query result based on the input.
4. The method of claim 1, wherein the third computer is configured
to provide a user interface comprising a first element, wherein the
first element enables a first user selection of a set of
disambiguation algorithms.
5. The method of claim 4, wherein the user interface comprises a
second element, wherein the second element enables a second user
selection of a threshold of acceptance for at least one of the
individual record feature or the topic identification in the first
search query result.
6. The method of claim 1, wherein the search query is constructed
via a markup language.
7. The method of claim 1, wherein the search query is constructed
via a binary format.
8. The method of claim 1, further comprising: validating, by the
first computer, the input before the placing.
9. The method of claim 1, further comprising: processing, by the
first computer, the search query via a process, wherein the process
comprises at least one of an address standardization technique, a
proximity boundary technique, or a nickname interpretation
technique.
10. The method of claim 1, wherein the first computer and the
second computer define a computing system, wherein the third
computer is external to the computing system.
11. A system comprising: a first computer configured to: receive a
first search query result from a search conductor, wherein the
first search query result is based on a search query and comprises
a record matching a field of the search query; send the first
search query result to a second computer such that the second
computer is able to disambiguate the first search query result via
a determination of a relatedness among an individual record feature
and a topic identification associated with each record in the first
search query result, wherein the second computer comprises a main
memory storing an in-memory database, wherein the second computer
is configured to link disambiguation data, in real-time, as the
disambiguation data is requested by the first computer from the
second computer; receive a second search query result from the
second computer, wherein the second search query result has been
disambiguated via the second computer; send the second search query
result to a third computer such that the third computer is able to
receive an input on the second search query result; generate a new
feature occurrence record in a knowledge base database, wherein the
new feature occurrence record includes the input, wherein the
in-memory database comprises the knowledge base database; place a
request that the new feature occurrence record be stored in the
knowledge base database such that the second computer is able to
adjust a parameter of a disambiguation algorithm based on the
input, wherein the disambiguation algorithm involves linking via
the second computer.
12. The system of claim 11, wherein the second computer is
configured to assign a confidence score to the new feature
occurrence record based on the input.
13. The system of claim 11, wherein the second computer is
configured to adjust a weight associated with the individual record
feature in the first search query result based on the input.
14. The system of claim 11, wherein the third computer is
configured to provide a user interface comprising a first element,
wherein the first element enables a first user selection of a set
of disambiguation algorithms.
15. The system of claim 14, wherein the user interface comprises a
second element, wherein the second element enables a second user
selection of a threshold of acceptance for at least one of the
individual record feature or the topic identification in the first
search query result.
16. The system of claim 11, wherein the search query is constructed
via a markup language.
17. The system of claim 11, wherein the search query is constructed
via a binary format.
18. The system of claim 11, wherein the first computer is
configured to validate the input before placing the request.
19. The system of claim 11, wherein the first computer is
configured to process the search query via a process, wherein the
process comprises at least one of an address standardization
technique, a proximity boundary technique, or a nickname
interpretation technique.
20. The system of claim 11, wherein the first computer and the
second computer define a computing entity, wherein the third
computer is external to the computing entity.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent
application Ser. No. 14/558,237, entitled, "Method For In-Loop
Human Validation Of Disambiguated Features," filed Dec. 2, 2014,
which is a non-provisional application that claims the benefit of
U.S. Provisional Application No. 61/910,802, filed Dec. 2, 2013,
entitled "Method For In-Loop Human Validation Of Disambiguated
Features," all of which are hereby incorporated by reference in
their entirety.
[0002] This application is related to U.S. patent application Ser.
No. 14/557,794, entitled "Method For Disambiguated Features In
Unstructured Text," filed Dec. 2, 2014, U.S. patent application
Ser. No. 14/558,254, entitled "Design And Implementation Of
Clustered In-Memory Database," filed Dec. 2, 2014, and U.S. patent
application Ser. No. 14/557,931 entitled "Method Of Discovering And
Exploring Feature Knowledge", filed Dec. 2, 2014; each of which are
hereby incorporated by reference in their entirety.
FIELD OF THE DISCLOSURE
[0003] The present disclosure relates in general to in-memory
databases and, more specifically, to computer based methods for
validating disambiguated features.
BACKGROUND
[0004] Searching for information about entities (i.e., people,
locations, organizations) across a large number of documents,
including sources such as a network, is often ambiguous, which may
lead to imprecise text processing functions, imprecise association
of features during knowledge extraction, and, thus, imprecise data
analysis.
[0005] State-of-the-art systems use linkage-based clustering and
ranking in several algorithms such as PageRank and the
hyperlink-induced topic search (HITS) algorithm. The basic idea
behind these and related approaches is that pre-existing links
typically exist between related pages or concepts. A limitation of
clustering-based techniques is that the contextual information
needed to disambiguate entities is sometimes not present in the
context, leading to incorrectly disambiguated results. Similarly,
documents about different entities in the same or superficially
similar contexts may be incorrectly clustered together.
[0006] Other systems attempt to disambiguate entities by reference
to one or more external dictionaries (or knowledge bases) of
entities. In such systems, an entity's context is compared to
possible matching entities in the dictionary and the closest match
is returned. A limitation of current dictionary-based techniques
stems from the fact that the number of entities grows constantly
and, therefore, no dictionary may include a representation of all
of the world's entities. Thus, if a document's context is matched
to an entity in the dictionary, the technique has identified only
the most similar entity in the dictionary, and not necessarily the
correct entity, which may be outside the dictionary.
[0007] Traditional search engines allow users to find only pieces
of information that are relevant to an entity; while millions or
billions of documents may describe that entity, the documents are
generally not linked together. In most cases it may not be viable
to try to discover a complete set of documents about a particular
feature. Additionally, methods that pre-link data are limited to a
single method of linking and are fed by many entity extraction
methods that are ambiguous and inaccurate. These systems may not be
able to use live feeds of data, and they may not perform these
processes on the fly. As a consequence, the latest information is
not used in the linking process.
[0008] One limitation of fully automated linking is that the
results provided by such systems are only as good as the data
coming into them. Therefore, if inaccurate data is provided to the
system, inaccurate results may be produced.
[0009] Therefore, there is still a need for accurate entity
disambiguation techniques that allow precise data analysis.
SUMMARY
[0010] A method for in-loop validation of disambiguated features is
disclosed. The method may include the generation of feature
occurrence records from an input, allowing the creation of new
linking models that may give high confidence to the generated
feature occurrence records.
[0011] One aspect of the current disclosure is the application of
in-memory analytics to records, where the analytic methods applied
to the records and the level of precision of the methods may be
dynamically selected by a user.
[0012] According to some embodiments, when a user starts a search,
the system may score records against the one or more queries, where
the system may score the match of one or more available fields of
the records and may then determine a score for the overall match of
the records. The system may determine whether the score is above a
predefined acceptance threshold, where the threshold may be defined
in the search query or may be a default value.
[0013] In further embodiments, fuzzy matching algorithms may
compare records temporarily stored in collections with the one or
more queries being generated by the system.
[0014] In some embodiments, numerous analytics modules may be
plugged into the in-memory database, and the user may be able to
modify the relevant analytical parameters of each analytics module
through a user interface.
[0015] Other aspects of the present disclosure include a new
linking on-the-fly module that treats input records with high
confidence, allowing validation of on-the-fly results of
disambiguated features.
[0016] A method for in-loop validation of disambiguated features
may include an actively learning system in which, as input is
added, results may be improved, thereby filling gaps in the
provided knowledge.
[0017] In one embodiment, a method is disclosed. The method
comprises receiving, by a search manager computer, a search query
from a user device and submitting the search query to a search
conductor computer module for processing, receiving, by the search
manager computer, search query results from the search conductor
computer, the search query results having one or more records
matching one or more fields of the search query, and sending, by
the search manager computer, the search query results to a
disambiguation analytic computer for disambiguating the search
query results by determining relatedness among individual record
features and topic identifications (topic IDs) associated with each
record in the search query results. The method further includes
receiving, by the search manager computer, disambiguated search
query results from the disambiguation analytic computer and
forwarding the disambiguated search query results to the user
device for providing input on the disambiguated search query
results, and, when the user device provides the input on the
disambiguated search query results, creating, by the search manager
computer, a new input occurrence record including the input and
storing the new input occurrence record in a knowledge base
database, and adjusting, by the disambiguation analytic computer,
one or more parameters of a disambiguation algorithm based on the
input from the user device.
[0018] In another embodiment, a system is disclosed. The system
comprises one or more server computers having one or more
processors executing computer readable instructions for a plurality
of computer modules including a search manager computer module
configured to receive a search query from a user device. The search
manager computer module being further configured to submit the
search query to a search conductor computer module for processing,
receive search query results from the search conductor computer
module, the search query results having one or more records
matching one or more fields of the search query, send the search
query results to a disambiguation analytic computer module for
disambiguating the search query results, the disambiguation
analytic computer module being configured to determine relatedness
among individual record features and topic identifications (topic
IDs) associated with each record in the search query results,
receive disambiguated search query results from the disambiguation
analytic computer module and forward the disambiguated search query
results to the user device for providing input on the disambiguated
search query results. In the disclosed system, when the user device
provides the input on the disambiguated search query results: the
search manager computer module is further configured to create a
new input occurrence record including the input and store the new
input occurrence record in a knowledge base database, and the
disambiguation analytic computer module is further configured to
adjust one or more parameters of a disambiguation algorithm based
on the input from the user device.
[0019] Numerous other aspects, features and benefits of the present
disclosure may be made apparent from the following detailed
description taken together with the drawing figures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] The present disclosure can be better understood by referring
to the following figures. The components in the figures are not
necessarily to scale, emphasis instead being placed upon
illustrating the principles of the disclosure. In the figures,
reference numerals designate corresponding parts throughout the
different views.
[0021] FIG. 1 is a flowchart of a computer based method for
providing in-loop validation of disambiguated features, according
to an embodiment.
[0022] FIG. 2 is a flowchart of a process executed by a link
on-the-fly module, according to an embodiment.
[0023] FIG. 3 is an illustrative diagram of a system employed for
implementing the method for disambiguating features, according to
an exemplary embodiment.
DEFINITIONS
[0024] As used herein, the following terms may have the following
definitions:
[0025] "Document" refers to a discrete electronic representation of
information having a start and end.
[0026] "Corpus" refers to a collection of one or more
documents.
[0027] "Feature" refers to any information which is at least
partially derived from a document.
[0028] "Feature attribute" refers to metadata associated with a
feature; for example, location of a feature in a document,
confidence score, among others.
[0029] "Facet" refers to clearly defined, mutually exclusive, and
collectively exhaustive aspects, properties or characteristics of a
class, specific subject, topic or feature.
[0030] "Knowledge base" refers to a computer database containing
disambiguated features or facets.
[0031] "Live corpus" refers to a corpus that is constantly fed as
new documents are uploaded into a network.
[0032] "Memory" refers to any hardware component suitable for
storing information and retrieving said information at a
sufficiently high speed.
[0033] "Module" refers to a computer software component suitable
for carrying out one or more defined tasks.
[0034] "Analytics Parameters" refers to parameters that describe
the operation that an analytic module may have to perform in order
to get specific results.
[0035] "Link on-the-fly module" refers to any linking module that
performs data linkage as data is requested from the system rather
than as data is added to the system.
[0036] "Node" refers to a computer hardware configuration suitable
for running one or more modules.
[0037] "Cluster" refers to a set of one or more nodes.
[0038] "Query" refers to a request to retrieve information from one
or more suitable databases.
[0039] "Record" refers to one or more pieces of information that
may be handled as a unit.
[0040] "Collection" refers to a discrete set of records.
[0041] "Partition" refers to an arbitrarily delimited portion of
records of a collection.
[0042] "Prefix" refers to a string of a given length that comprises
the longest string of key characters shared by all subtrees of the
node, together with a data record field for storing a reference to
a data record.
[0043] "Database" refers to any system including any combination of
clusters and modules suitable for storing one or more collections
and suitable to process one or more queries.
[0044] "Analytics Agent" or "Analytics Module" refers to a computer
module configured to at least receive one or more records, process
said one or more records, and return the resulting one or more
processed records.
[0045] "Search Manager" or "SM" refers to a computer module
configured to at least receive one or more queries and return one
or more search results.
[0046] "Search Conductor" or "SC" refers to a computer module
configured to at least run one or more search queries on a
partition and return the search results to one or more search
managers.
[0047] "Sentiment" refers to subjective assessments associated with
a document, part of a document, or feature.
[0048] "Topic" refers to a set of thematic information which is at
least partially derived from a corpus.
DETAILED DESCRIPTION
[0049] The present disclosure is here described in detail with
reference to embodiments illustrated in the drawings, which form a
part here. Other embodiments may be used and/or other changes may
be made without departing from the spirit or scope of the present
disclosure. The illustrative embodiments described in the detailed
description are not meant to be limiting of the subject matter
presented here.
[0050] Embodiments provide a computer-based disambiguation
framework that incorporates a dynamic validation feature. The
introduced framework extends traditional feature disambiguation
systems based on a static knowledge base by providing an input
mechanism to take verification of the disambiguated results and
update the knowledge base on the fly, thereby creating a feature
disambiguation system that adapts based on the received input.
Additionally, the generated input is used not only to add new
features to the knowledge base, but also to make the linking models
adapt their disambiguation process and logic accordingly. Further,
the framework introduces the concept of validating the verification
with evidence from the live corpus by monitoring the clusters of
extracted secondary features that support, or fail to support, the
disambiguation assertions in the received input.
[0051] The embodiments describe systems and methods for providing
in-loop validation of disambiguated features, which, in an
embodiment, are performed by a central computer server system
having one or more processors executing computer readable
instructions corresponding to a plurality of special purpose
computer modules described in FIGS. 1-2 below.
[0052] FIG. 1 is a flow chart describing a method 100 for providing
in-loop validation of disambiguated features, according to an
embodiment.
[0053] The process may start when a user device generates a search
query, step 102. One or more user devices may be able to generate
one or more search queries through one or more user interfaces. The
user interfaces on the user devices may provide users with the
option of selecting one or more of a set of analytic methods that
may be applied to the results of the search query. The users may
also be capable of selecting thresholds of acceptance of different
levels.
[0054] Then, the query may be received, in step 104, by a search
manager computer module. In this step, the one or more queries
generated by the interaction of one or more users with one or more
user interfaces may be received by one or more search manager
computer modules. In one or more embodiments, the queries may be
represented in a markup language, including XML and HTML. In one or
more other embodiments, the queries may be represented in a
structure, including embodiments where the queries are represented
in JSON. In some embodiments, a query may be represented in compact
or binary format.
[0055] Afterwards, the received queries may be parsed by the search
manager computer module, step 106. This process may allow the
system to determine if field processing is desired, step 108. In
one or more embodiments, the system may be capable of determining
if the process is required using information included in the query.
In one or more other embodiments, the one or more search managers
may automatically determine which one or more fields may undergo a
desired processing.
[0056] If the system determines that field processing for the one
or more fields is desired, the one or more search manager computer
modules may apply one or more suitable processing techniques to the
one or more desired fields, step 110.
suitable processing techniques may include address standardization,
proximity boundaries, and nickname interpretation, amongst others.
In some embodiments, suitable processing techniques may include the
extraction of prefixes from strings and the generation of
non-literal keys that may later be used to apply fuzzy matching
techniques.
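The field processing techniques named above can be illustrated with a short sketch. The abbreviation table, nickname table, and vowel-dropping key scheme below are illustrative assumptions for this sketch, not techniques specified by the disclosure:

```python
# Hypothetical abbreviation table for address standardization.
ADDRESS_ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}

# Hypothetical nickname table for nickname interpretation.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def standardize_address(field: str) -> str:
    """Normalize an address field using an abbreviation table."""
    words = field.lower().split()
    return " ".join(ADDRESS_ABBREVIATIONS.get(w, w) for w in words)

def interpret_nickname(name: str) -> str:
    """Map a common nickname to its canonical given name."""
    return NICKNAMES.get(name.lower(), name.lower())

def non_literal_key(value: str, prefix_len: int = 4) -> str:
    """Build a simple phonetic-style key: keep the first letter,
    drop vowels from the rest, and truncate to a fixed prefix.
    Such non-literal keys let later stages apply fuzzy matching."""
    value = value.lower()
    if not value:
        return ""
    rest = "".join(c for c in value[1:] if c not in "aeiou")
    return (value[0] + rest)[:prefix_len]
```

Under this scheme, variant spellings such as "Jonathan" and "Jonathon" reduce to the same non-literal key, which is what enables the fuzzy matching described later.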
[0057] Then, the SM computer module constructs a search query, step
112, in which one or more search managers may construct one or more
search queries associated with the one or more queries. In one or
more embodiments, the search queries may be constructed so as to be
processed as a stack-based search.
[0058] Subsequently, SM may send search query to SC computer
module, step 114. In some embodiments, one or more search manager
computer modules may send the one or more search queries to one or
more search conductor computer modules, where the one or more
search conductors may be associated with collections specified in
the one or more search queries.
[0059] The one or more search conductors may score records against
the one or more queries, where the search conductors may score the
match of one or more fields of the records and may then determine a
score for the overall match of the records. The system may
determine whether the score is above a predefined acceptance
threshold, where the threshold may be defined in the search query
or may be a default value. In one or more embodiments, the default
score thresholds may vary according to the one or more fields being
scored. If the search conductor determines that the scores are
above the desired threshold, the records may be added to a results
list. The search conductor may continue to score records until it
determines that a record is the last in the partition. If the
search conductor determines that the last record in a partition has
been processed, the search conductor may then sort the resulting
results list. The search conductor may then return the results list
to a search manager.
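The scoring loop described in this paragraph might be sketched as follows. The per-field scoring function and the simple averaging of field scores are placeholders for whatever scoring the search conductor actually implements, and are assumptions of this sketch:

```python
def score_partition(records, query, threshold=0.5):
    """Score each field of each record against the query, combine
    the field scores into an overall record score, keep records at
    or above the acceptance threshold, and return the surviving
    results sorted best-first."""
    def field_score(value, wanted):
        # Placeholder field match: exact hit, partial hit, or miss.
        if value == wanted:
            return 1.0
        if wanted.lower() in str(value).lower():
            return 0.5
        return 0.0

    results = []
    for record in records:
        scores = [field_score(record.get(f, ""), w) for f, w in query.items()]
        overall = sum(scores) / len(scores) if scores else 0.0
        if overall >= threshold:
            results.append((overall, record))
    # Sort the results list before returning it, as described above.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results
```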
[0060] When SM receives and collates results from SCs, step 116,
the one or more search conductors return the one or more search
results to the one or more search managers; where, in one or more
embodiments, said one or more search results may be returned
asynchronously. The one or more search managers may then compile
results from the one or more search conductors into one or more
results list.
[0061] The one or more search managers may automatically determine
which one or more fields may undergo one or more desired analytic
processes. Then, the one or more search managers may send the
search results to analytic computer modules, step 118. The one or
more results lists compiled by one or more search managers may be
sent to one or more analytics agents, where each analytics agent
may include one or more analytics modules corresponding to one or
more suitable processing techniques.
[0062] In some embodiments, analytics agents may include
disambiguation modules, linking modules, link on-the-fly modules,
or any other suitable modules and algorithms.
[0063] Within a suitable analytics agent a feature disambiguation
process may be performed by a disambiguation computer module. This
feature disambiguation process may include machine generated topic
IDs, which may be employed to classify features, documents, or
corpora. The relatedness of individual features and specific topic
IDs may be determined using disambiguating algorithms. In some
documents, the same feature may be related to one or more topic
IDs, depending on the context of the different occurrences of the
feature within the document.
[0064] The set of features (e.g., topics, events, entities, facts,
among others) extracted from one record may be compared with sets
of features from other documents, using disambiguating algorithms
to define with a certain level of accuracy if two or more features
across different documents are a single feature or if they are
distinct features. In some examples, co-occurrence of two or more
features across the collection of documents in the database may be
analyzed to improve the accuracy of feature disambiguation process.
In some embodiments, global scoring algorithms may be used to
determine the probability of features being the same. In this
process, different "extracted" secondary features are weighted
appropriately, based on the training procedure performed on a large
representative evaluation corpus.
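A global scoring step of the kind described above could, for instance, take a weighted Jaccard form over the co-occurring secondary features of two candidates. The weights here stand in for those produced by the training procedure and are assumptions of this sketch:

```python
def same_feature_probability(features_a, features_b, weights):
    """Estimate the probability that two extracted features refer to
    the same real-world entity by comparing their co-occurring
    secondary features, each weighted by a (hypothetically trained)
    weight: weighted shared features over weighted total features."""
    shared = set(features_a) & set(features_b)
    union = set(features_a) | set(features_b)
    if not union:
        return 0.0
    numerator = sum(weights.get(f, 1.0) for f in shared)
    denominator = sum(weights.get(f, 1.0) for f in union)
    return numerator / denominator
```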
[0065] In some embodiments, as part of the feature disambiguation
process, a knowledge base may be generated within an in-memory
database (MEMDB).
[0066] After processing, according to some embodiments, the one or
more analytics agents may return one or more processed results
lists, step 120, to the one or more search managers.
[0067] A search manager may return search results in step 122. In
some embodiments, the one or more search managers may decompress
the one or more results list and return them to the user interface
that initiated the query. According to some embodiments, the search
results may be temporarily stored in a knowledge base and returned
to a user interface.
[0068] The knowledge base may be used to temporarily store clusters
of relevant disambiguated features. When new documents are loaded
into the MEMDB, the new disambiguated set of features may be
compared with the existing knowledge base in order to determine the
relationship between features and to determine if there is a match
between the new features and previously extracted features. If the
compared features match, the knowledge base may be updated and the
ID of the matching features may be returned. If the compared
features do not match any of the already extracted features, a
unique ID is assigned to the disambiguated entity or feature, and
the ID is associated with the cluster of defining features and
stored within the knowledge base of the MEMDB.
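The knowledge-base matching logic of the paragraph above can be sketched as follows. The overlap measure and the 0.7 match threshold are illustrative assumptions, not values given in the disclosure:

```python
import itertools

class KnowledgeBase:
    """Minimal sketch of the knowledge-base update described above."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._clusters = {}  # feature ID -> set of defining features

    def _overlap(self, a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def resolve(self, new_features, threshold=0.7):
        """Compare a new disambiguated feature set against stored
        clusters; on a match, update the cluster and return its ID,
        otherwise assign and return a new unique ID."""
        new_features = set(new_features)
        for fid, cluster in self._clusters.items():
            if self._overlap(new_features, cluster) >= threshold:
                cluster |= new_features      # update the knowledge base
                return fid                   # return the matching ID
        fid = next(self._ids)                # assign a unique ID
        self._clusters[fid] = new_features   # store the defining cluster
        return fid
```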
[0069] When a user receives search results through a user
interface, the user may be able to analyze the clusters of
disambiguated features and their related features. After the
analysis, the user may provide input, in step 124, stating what
kind of error or misinterpretation of data the user may have found.
With this information, a new feature occurrence record, step 126,
may be created, which may be labeled as a generated input record.
Then the feature occurrence record may undergo a validation
process, in which analytics modules may be applied, step 128, to
find out if there is overwhelming evidence that disproves the
generated input. If such evidence is not found, the feature
occurrence record may be stored in the knowledge base in step
130.
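Steps 124 through 130 can be sketched as follows. Modeling each analytics module as a predicate that returns True when it finds overwhelming disproving evidence, and the record layout itself, are assumptions made only for illustration.

```python
def validate_input_record(user_input, analytics_modules):
    """Create a generated input record from user feedback (step 126) and
    apply analytics modules to it (step 128); the record survives only
    if no module reports overwhelming disproving evidence (step 130)."""
    record = {"payload": user_input, "label": "generated_input_record"}
    if any(module(record) for module in analytics_modules):
        return None    # overwhelming counter-evidence: discard the record
    return record      # no such evidence: caller stores it in the knowledge base
```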
[0070] Although the exemplary embodiment recites that input can be
received from a user selection on a user interface of a user
device, it is intended that the input can be received automatically
from a user device or other computing device without requiring any
user selection. The input may be provided based upon an established
set of rules or parameters.
[0071] New linking modules may be used when including generated
feature occurrence records in a set of aggregated results to be
processed. The new linking modules may be designed to assign higher
confidence scores to validated relationships, and also to perform
on-the-fly re-training of existing linking modules by adjusting the
weight of each individual secondary feature that contributes toward
the disambiguation process. Thus, the generated input is used both
to add new features to the knowledge base and to make the linking
models adapt their disambiguation process and logic accordingly.
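The weight adjustment described above might look like the following minimal sketch; the multiplicative boost and decay factors are hypothetical, chosen only to show validated secondary features gaining influence over unconfirmed ones.

```python
def retrain_linking_weights(weights, validated_features, boost=1.2, decay=0.95):
    """On-the-fly re-training sketch: secondary features confirmed by a
    validated relationship are boosted, while unconfirmed ones decay,
    so the linking model adapts its disambiguation logic over time."""
    return {
        feature: weight * (boost if feature in validated_features else decay)
        for feature, weight in weights.items()
    }
```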
[0072] Link On-the-Fly (OTF) Processing
[0073] FIG. 2 is a flowchart of a method 200 executed by a link OTF
computer module, which may be employed in the search method for
providing in-loop validation of disambiguated features described
above in connection with FIG. 1, according to an embodiment. Link
OTF process 200 may be capable of constantly evaluating, scoring,
linking, and clustering a feed of information. Link OTF process 200
may perform dynamic records linkage using multiple algorithms. In
step 202, search results may be constantly fed into the link OTF
computer module. In step 204, the input of data may be followed by
a match scoring algorithm application, where one or more match
scoring algorithms may be applied simultaneously in multiple search
nodes of the MEMDB while performing fuzzy key searches for
evaluating and scoring the relevant results, taking into account
multiple feature attributes, such as string edit distances,
phonetics, and sentiments, among others.
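One of the feature attributes named above, string edit distance, can be turned into a normalized match score as in this sketch; the normalization scheme is an assumption, since the specification does not fix one.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_score(query, candidate):
    """Normalize edit distance into a similarity score in [0, 1]."""
    longest = max(len(query), len(candidate)) or 1
    return 1.0 - edit_distance(query, candidate) / longest
```

In practice such a score would be combined with phonetic and sentiment attributes before ranking candidates.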
[0074] Afterwards, a linking algorithm application step 206 may be
executed to compare all candidate records, identified during match
scoring algorithm application in step 204, to each other. Linking
algorithm application step 206 may include the use of one or more
analytical linking algorithms capable of filtering and evaluating
the scored results of the fuzzy key searches performed inside the
multiple search nodes of the MEMDB. In some examples, co-occurrence
of two or more features across the collection of identified
candidate records in the MEMDB may be analyzed to improve the
accuracy of the process. Different weighted models and confidence
scores associated with different feature attributes may be taken
into account for linking algorithm application in step 206.
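A weighted co-occurrence comparison of the kind described might be sketched as follows, with candidate records modeled as sets of secondary features and per-feature weights standing in for the weighted models mentioned above; both modeling choices are assumptions.

```python
def link_score(record_a, record_b, weights):
    """Score two candidate records by the total weight of the secondary
    features they co-occur on, normalized by the weight of all features
    present in either record (unknown features default to weight 1.0)."""
    union = record_a | record_b
    total = sum(weights.get(f, 1.0) for f in union)
    if total == 0:
        return 0.0
    shared = record_a & record_b
    return sum(weights.get(f, 1.0) for f in shared) / total
```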
[0075] After linking algorithm application step 206, the linked
results may be arranged in clusters of related features and
returned, as part of return of linked records clusters in step
208.
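Arranging linked results into clusters of related features (step 208) can be sketched with a union-find pass over the pairwise links produced by the linking step; the union-find approach is an illustrative assumption, not a requirement of the method.

```python
def cluster_linked_records(linked_pairs, records):
    """Group records into clusters of related features using the
    pairwise links produced by the linking algorithm step."""
    parent = {record: record for record in records}

    def find(record):
        # Follow parent pointers to the cluster representative,
        # halving the path as we go.
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    for a, b in linked_pairs:
        parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for record in records:
        clusters.setdefault(find(record), []).append(record)
    return list(clusters.values())
```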
[0076] FIG. 3 is an illustrative diagram of an embodiment of a
system 300 for disambiguating features in unstructured text. The
system 300 hosts an in-memory database and comprises one or more
nodes.
[0077] According to an embodiment, the system 300 includes one or
more processors executing computer instructions for a plurality of
special-purpose computer modules 301, 302, 311, 312, and 314
(discussed below) to disambiguate features within one or more
documents. As shown in FIG. 3, the document input modules 301, 302
receive documents from internet-based sources and/or a live corpus
of documents. A large number of new documents may be uploaded by
the second into the document input module 302 through a network
connection 304. Therefore, the source may be constantly getting new
knowledge, updated by user workstations 306, where such new
knowledge is not pre-linked in a static way. Thus, the number of
documents to be evaluated may be infinitely increasing.
[0078] MEMDB 308 may facilitate a faster, on-the-fly disambiguation
process, which in turn may facilitate reception of the latest
information that is going to contribute to MEMDB 308. Various
methods for linking the features may be employed. These methods may
essentially use a weighted model to determine which entity types
are most important and carry more weight and, based on confidence
scores, to determine how confidently the extraction and
disambiguation of the correct features has been performed and
whether the correct feature may go into the resulting cluster of
features. As shown in FIG. 3, as more system nodes work in
parallel, the process may become more efficient.
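The weighted confidence model described in this paragraph might be sketched as follows; the per-entity-type weights, the linear combination, and the acceptance threshold are all illustrative assumptions.

```python
def accept_into_cluster(confidences, type_weights, threshold=0.7):
    """Combine per-entity-type confidence scores, weighted by how
    important each entity type is, and decide whether the extracted
    feature is confidently disambiguated enough to join the cluster."""
    total_weight = sum(type_weights.get(t, 1.0) for t in confidences)
    if total_weight == 0:
        return False
    combined = sum(score * type_weights.get(t, 1.0)
                   for t, score in confidences.items()) / total_weight
    return combined >= threshold
```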
[0079] According to various embodiments, when a new document
arrives into the system 300 via the document input module 301, 302
through a network connection 304, feature extraction is performed
via the extraction module 311 and, then, feature disambiguation may
be performed on the new document via the feature disambiguation
sub-module 314 of the MEMDB 308. In one embodiment, after feature
disambiguation of the new document is performed, the extracted new
features 310 may be included in the MEMDB to pass through the link
OTF sub-module 312, where the features may be compared and linked, and
a feature ID of disambiguated feature 310 may be returned to the
user as a result from a query. In addition to the feature ID, the
resulting feature cluster defining the disambiguated feature may
optionally be returned.
[0080] MEMDB computer 308 can be a database storing data in records
controlled by a database management system (DBMS) (not shown)
configured to store data records in a device's main memory, as
opposed to conventional databases and DBMS modules that store data
in "disk" memory. Conventional disk storage requires processors
(CPUs) to execute read and write commands to a device's hard disk,
thus requiring CPUs to execute instructions to locate (i.e., seek)
and retrieve the memory location for the data, before performing
some type of operation with the data at that memory location.
In-memory database systems access data that is placed into main
memory, and then addressed accordingly, thereby mitigating the
number of instructions performed by the CPUs and eliminating the
seek time associated with CPUs seeking data on hard disk.
[0081] System 300 comprises the search manager computer module and
disambiguation analytic computer module as machine-readable
instructions executed by a processor on one or more servers or
computing devices. These modules may be hosted on a single server
or may each function as a separate computer.
[0082] In-memory databases may be implemented in a distributed
computing architecture, which may be a computing system comprising
one or more nodes configured to aggregate the nodes' respective
resources (e.g., memory, disks, processors). As disclosed herein,
embodiments of a computing system hosting an in-memory database may
distribute and store data records of the database among one or more
nodes. In some embodiments, these nodes are formed into "clusters"
of nodes. In some embodiments, these clusters of nodes store
portions, or "collections," of database information.
[0083] Various embodiments provide a computer executed feature
disambiguation technique that employs an evolving and efficiently
linkable feature knowledge base that is configured to store
secondary features, such as co-occurring topics, key phrases,
proximity terms, events, facts and trending popularity index. The
disclosed embodiments may be performed via a wide variety of
linking algorithms that can vary from simple conceptual distance
measure to sophisticated graph clustering approaches based on the
dimensions of the involved secondary features that aid in resolving
a given extracted feature to a stored feature in the knowledge
base. Additionally, embodiments can introduce an approach that
evolves the existing feature knowledge base via a capability that
not only updates the secondary features of an existing feature
entry, but also expands the knowledge base by discovering new
features that can be appended to it.
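The two evolution paths just described, updating an existing entry's secondary features versus appending a newly discovered feature, can be sketched as follows, again assuming a hypothetical dict-of-sets knowledge-base layout.

```python
def evolve_knowledge_base(kb, feature_id, secondary_features):
    """Update the secondary features of an existing feature entry, or
    append a newly discovered feature entry to the knowledge base."""
    if feature_id in kb:
        kb[feature_id].update(secondary_features)   # evolve existing entry
    else:
        kb[feature_id] = set(secondary_features)    # append new discovery
    return kb
```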
[0084] In a first example, the exemplary method for providing
in-loop validation of disambiguated features is applied. In this
example, a user initiates a search with the name "John Steint", and
the results return six different disambiguated features with the
same name. The user decides that two of them are the same "John
Steint"; the input is processed and the feature occurrence record
is stored in the knowledge base. The stored feature occurrence
record relates the features associated with the two "John Steint"
features that were merged.
[0085] In a second example, the exemplary method for providing
in-loop validation of disambiguated features is applied. In this
example, a user initiates a search with the name "John Steint", and
the results return two different disambiguated features with the
same name. The user decides that one of them is actually two
different "John Steint" entities, and the related features are
separated into two groups; the input is processed and the feature
occurrence record is stored in the knowledge base. The stored
feature occurrence record states that there is no relationship
between the features associated with the two "John Steint" features
that were separated.
[0086] In a third example, the exemplary method for in-loop
validation of disambiguated features is applied where the feature
to be disambiguated may be an image.
[0087] The various illustrative logical blocks, modules, circuits,
and algorithm steps described in connection with the embodiments
disclosed here may be implemented as electronic hardware, computer
software, or combinations of both. To clearly illustrate this
interchangeability of hardware and software, various illustrative
components, blocks, modules, circuits, and steps have been
described above generally in terms of their functionality. Whether
such functionality is implemented as hardware or software depends
upon the particular application and design constraints imposed on
the overall system. Skilled artisans may implement the described
functionality in varying ways for each particular application, but
such implementation decisions should not be interpreted as causing
a departure from the scope of the present invention.
[0088] Embodiments implemented in computer software may be
implemented in software, firmware, middleware, microcode, hardware
description languages, or any combination thereof. A code segment
or machine-executable instructions may represent a procedure, a
function, a subprogram, a program, a routine, a subroutine, a
module, a software package, a class, or any combination of
instructions, data structures, or program statements. A code
segment may be coupled to another code segment or a hardware
circuit by passing and/or receiving information, data, arguments,
parameters, or memory contents. Information, arguments, parameters,
data, etc. may be passed, forwarded, or transmitted via any
suitable means including memory sharing, message passing, token
passing, network transmission, etc.
[0089] The actual software code or specialized control hardware
used to implement these systems and methods is not limiting of the
invention. Thus, the operation and behavior of the systems and
methods were described without reference to the specific software
code, it being understood that software and control hardware can be
designed to implement the systems and methods based on the
description here.
[0090] When implemented in software, the functions may be stored as
one or more instructions or code on a non-transitory
computer-readable or processor-readable storage medium. The steps
of a method or algorithm disclosed here may be embodied in a
processor-executable software module which may reside on a
computer-readable or processor-readable storage medium. A
non-transitory computer-readable or processor-readable medium
includes both computer storage media and tangible storage media
that facilitate transfer of a computer program from one place to
another. A non-transitory processor-readable storage medium may be
any available media that may be accessed by a computer. By way of
example, and not limitation, such non-transitory processor-readable
media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk
storage, magnetic disk storage or other magnetic storage devices,
or any other tangible storage medium that may be used to store
desired program code in the form of instructions or data structures
and that may be accessed by a computer or processor. Disk and disc,
as used here, include compact disc (CD), laser disc, optical disc,
digital versatile disc (DVD), floppy disk, and Blu-ray disc where
disks usually reproduce data magnetically, while discs reproduce
data optically with lasers. Combinations of the above should also
be included within the scope of computer-readable media.
Additionally, the operations of a method or algorithm may reside as
one or any combination or set of codes and/or instructions on a
non-transitory processor-readable medium and/or computer-readable
medium, which may be incorporated into a computer program
product.
[0091] The preceding description of the disclosed embodiments is
provided to enable any person skilled in the art to make or use the
present invention. Various modifications to these embodiments will
be readily apparent to those skilled in the art, and the generic
principles defined here may be applied to other embodiments without
departing from the spirit or scope of the invention. Thus, the
present invention is not intended to be limited to the embodiments
shown here but is to be accorded the widest scope consistent with
the following claims and the principles and novel features
disclosed here.
* * * * *