U.S. patent application number 15/281171 was filed with the patent office on 2018-04-05 for reducing processing for comparing large metadata sets.
The applicant listed for this patent is Microsoft Technology Licensing, LLC. Invention is credited to Victor Cazin.
Application Number | 20180096018 15/281171 |
Document ID | / |
Family ID | 61758719 |
Filed Date | 2018-04-05 |
United States Patent
Application |
20180096018 |
Kind Code |
A1 |
Cazin; Victor |
April 5, 2018 |
REDUCING PROCESSING FOR COMPARING LARGE METADATA SETS
Abstract
A computer system compares databases, where each data record
includes values for a plurality of fields, by applying database
operations on database tables of the data records to reduce
computations being performed. For each data record in a reference
set, candidate matches are identified from among the data records
in a target set by performing a join operation on the sets on one
or more selected, normalized fields from the sets. This
significantly reduced set of candidate matches can be further
analyzed to determine how similar they are to the data record from
the reference set.
Inventors: |
Cazin; Victor; (Paris,
FR) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
Microsoft Technology Licensing, LLC |
Redmond |
WA |
US |
|
|
Family ID: |
61758719 |
Appl. No.: |
15/281171 |
Filed: |
September 30, 2016 |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G06F 16/2282 20190101;
G06F 16/2456 20190101; G06F 16/2455 20190101; G06F 16/2365
20190101; G06F 16/273 20190101; G06F 16/258 20190101 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. A computer system comprising: computer storage, wherein a first
database and a second database are stored in the computer storage,
the first database comprising a first plurality of first data
records, each first data record comprising a first plurality of
first data fields storing metadata describing an item from among a
first plurality of items, the first plurality of data fields
comprising at least a first identifier field, a first index field,
and at least two more first metadata fields, each first identifier
field for each first data record for an item storing a value
representing an identifier for the item in the first database, each
first index field for each first data record for an item storing a
value representing a description of the item, the second database
comprising a second plurality of second data records, each second
data record comprising a second plurality of second data fields
storing metadata describing an item from among a second plurality
of items, the second plurality of data fields comprising at least a
second identifier field, a second index field, and at least two
more second metadata fields, each second identifier field for each
second data record for an item storing a value representing an
identifier for the item in the second database, each second index
field for each second data record for an item storing a value
representing a description of the item, wherein, for similar items
in the first database and the second database, the value
representing the identifier for the similar item in the first
identifier field may be different from the value representing the
identifier for the similar item in the second identifier field, a
processing system comprising a memory and a processor, the
processing system operable to access the computer storage and being
configured by computer program instructions to be operative to:
normalize the values in the first index field of the first data
records and the values in the second index field of the second data
records; select a candidate set based on the first database and the
second database using the normalized first index field and the
normalized second index field, to generate a table combining the
first metadata fields and the second metadata fields, wherein the
table comprises, for each data record in the first database having
a similar data record in the second database, and for each similar
data record, a row combining data fields for the data record from
the first database with data fields for the similar data record
from the second database, wherein the similar data record is a data
record that has a value in the normalized second index field which
is equal to the value in the normalized first index field for the
data record in the first database; for each row in the generated
table, compare values of first metadata fields in the row to values
of corresponding second metadata fields in the row to generate
scores, and storing the scores as data fields in the row; for each
row in the generated table, determine whether the data record from
the second database corresponding to the row is equivalent to the
data record from the first database corresponding to the row, based
on the generated scores stored in the row.
2. The computer system of claim 1, wherein selecting the candidate
set comprises performing a join operation of the first database and
the second database on the normalized first index field and the
normalized second index field to generate the table.
3. The computer system of claim 1, wherein selecting the candidate
set uses a plurality of normalized first index fields and a
plurality of normalized second index fields.
4. The computer system of claim 3, wherein selecting the candidate
set comprises performing a first join operation of the first
database and the second database on the normalized first index
field and the normalized second index field to generate a first
intermediate candidate table, performing a second join operation of
the first database and the second database on a second normalized
first index field and a second normalized second index field to
generating a second intermediate candidate table, and performing a
union of the first intermediate candidate table and the second
intermediate candidate table to generate the table.
5. The computer system of claim 1, wherein determining comprises
applying first criteria to the scores indicative of a match of a
first priority, and applying second criteria to the scores
indicative of a match of a second priority lower than the first
priority, and storing and indication of the match and the priority
of the match.
6. The computer system of claim 2, wherein comparing values of a
first metadata field to a corresponding second metadata field in a
row to generate a scores comprises applying an edit distance to
text values stored in the first and second metadata fields.
7. The computer system of claim 6, wherein comparing values of a
first metadata field to a corresponding second metadata field in a
row to generate a scores comprises determining if a text value
stored in the first metadata field is contained in a text value
stored in the second metadata field.
8. An article of manufacture comprising: a computer storage device,
computer program instructions stored on the computer storage which,
when processed by a computer, configures the computer to be
comprising: computer storage, wherein a first database and a second
database are stored in the computer storage, the first database
comprising a first plurality of first data records, each first data
record comprising a first plurality of first data fields storing
metadata describing an item from among a first plurality of items,
the first plurality of data fields comprising at least a first
identifier field, a first index field, and at least two more first
metadata fields, each first identifier field for each first data
record for an item storing a value representing an identifier for
the item in the first database, each first index field for each
first data record for an item storing a value representing a
description of the item, the second database comprising a second
plurality of second data records, each second data record
comprising a second plurality of second data fields storing
metadata describing an item from among a second plurality of items,
the second plurality of data fields comprising at least a second
identifier field, a second index field, and at least two more
second metadata fields, each second identifier field for each
second data record for an item storing a value representing an
identifier for the item in the second database, each second index
field for each second data record for an item storing a value
representing a description of the item, wherein, for similar items
in the first database and the second database, the value
representing the identifier for the similar item in the first
identifier field may be different from the value representing the
identifier for the similar item in the second identifier field, a
processing system comprising a memory and a processor, the
processing system operable to access the computer storage and being
configured by computer program instructions to be operative to:
normalize the values in the first index field of the first data
records and the values in the second index field of the second data
records; select a candidate set based on the first database and the
second database using the normalized first index field and the
normalized second index field, to generate a table combining the
first metadata fields and the second metadata fields, wherein the
table comprises, for each data record in the first database having
a similar data record in the second database, and for each similar
data record, a row combining data fields for the data record from
the first database with data fields for the similar data record
from the second database, wherein the similar data record is a data
record that has a value in the normalized second index field which
is equal to the value in the normalized first index field for the
data record in the first database; for each row in the generated
table, compare values of first metadata fields in the row to values
of corresponding second metadata fields in the row to generate
scores, and storing the scores as data fields in the row; for each
row in the generated table, determine whether the data record from
the second database corresponding to the row is equivalent to the
data record from the first database corresponding to the row, based
on the generated scores stored in the row.
9. The article of manufacture of claim 8, wherein selecting the
candidate set comprises performing a join operation of the first
database and the second database on the normalized first index
field and the normalized second index field to generate the
table.
10. The article of manufacture of claim 8, wherein selecting the
candidate set uses a plurality of normalized first index fields and
a plurality of normalized second index fields.
11. The article of manufacture of claim 10, wherein selecting the
candidate set comprises performing a first join operation of the
first database and the second database on the normalized first
index field and the normalized second index field to generate a
first intermediate candidate table, performing a second join
operation of the first database and the second database on a second
normalized first index field and a second normalized second index
field to generating a second intermediate candidate table, and
performing a union of the first intermediate candidate table and
the second intermediate candidate table to generate the table.
12. The article of manufacture of claim 8, wherein determining
comprises applying first criteria to the scores indicative of a
match of a first priority, and applying a second criteria to the
scores indicative of a match of a second priority lower than the
first priority, and storing and indication of the match and the
priority of the match.
13. The article of manufacture of claim 9, wherein comparing values
of a first metadata field to a corresponding second metadata field
in a row to generate a scores comprises applying an edit distance
to text values stored in the first and second metadata fields.
14. The article of manufacture of claim 9, wherein comparing values
of a first metadata field to a corresponding second metadata field
in a row to generate a scores comprises determining if a text value
stored in the first metadata field is contained in a text value
stored in the second metadata field.
15. A computer-implemented process performed by a computer program
executing on a computer, the computer including computer storage,
wherein a first database and a second database are stored in the
computer storage, the first database comprising a first plurality
of first data records, each first data record comprising a first
plurality of first data fields storing metadata describing an item
from among a first plurality of items, the first plurality of data
fields comprising at least a first identifier field, a first index
field, and at least two more first metadata fields, each first
identifier field for each first data record for an item storing a
value representing an identifier for the item in the first
database, each first index field for each first data record for an
item storing a value representing a description of the item, the
second database comprising a second plurality of second data
records, each second data record comprising a second plurality of
second data fields storing metadata describing an item from among a
second plurality of items, the second plurality of data fields
comprising at least a second identifier field, a second index
field, and at least two more second metadata fields, each second
identifier field for each second data record for an item storing a
value representing an identifier for the item in the second
database, each second index field for each second data record for
an item storing a value representing a description of the item,
wherein, for similar items in the first database and the second
database, the value representing the identifier for the similar
item in the first identifier field may be different from the value
representing the identifier for the similar item in the second
identifier field, the computer further including a processing
system comprising a memory and a processor, the processing system
operable to access the computer storage and being configured by
computer program instructions to perform the process comprising:
normalizing the values in the first index field of the first data
records and the values in the second index field of the second data
records; selecting a candidate set based on the first database and
the second database using the normalized first index field and the
normalized second index field, to generate a table combining the
first metadata fields and the second metadata fields, wherein the
table comprises, for each data record in the first database having
a similar data record in the second database, and for each similar
data record, a row combining data fields for the data record from
the first database with data fields for the similar data record
from the second database, wherein the similar data record is a data
record that has a value in the normalized second index field which
is equal to the value in the normalized first index field for the
data record in the first database; for each row in the generated
table, comparing values of first metadata fields in the row to
values of corresponding second metadata fields in the row to
generate scores, and storing the scores as data fields in the row;
for each row in the generated table, determining whether the data
record from the second database corresponding to the row is
equivalent to the data record from the first database corresponding
to the row, based on the generated scores stored in the row.
16. The computer-implemented process of claim 15, wherein selecting
the candidate set comprises performing a join operation of the
first database and the second database on the normalized first
index field and the normalized second index field to generate the
table.
17. The computer-implemented process of claim 15, wherein selecting
the candidate set uses a plurality of normalized first index fields
and a plurality of normalized second index fields.
18. The computer-implemented process of claim 17, wherein selecting
the candidate set comprises performing a first join operation of
the first database and the second database on the normalized first
index field and the normalized second index field to generate a
first intermediate candidate table, performing a second join
operation of the first database and the second database on a second
normalized first index field and a second normalized second index
field to generating a second intermediate candidate table, and
performing a union of the first intermediate candidate table and
the second intermediate candidate table to generate the table.
19. The computer-implemented process of claim 15, wherein
determining comprises applying first criteria to the scores
indicative of a match of a first priority, and applying second
criteria to the scores indicative of a match of a second priority
lower than the first priority, and storing and indication of the
match and the priority of the match.
20. The computer-implemented process of claim 16, wherein comparing
values of a first metadata field to a corresponding second metadata
field in a row to generate a scores comprises applying an edit
distance to text values stored in the first and second metadata
fields.
Description
BACKGROUND
[0001] A common problem with large databases is comparing contents
of one database to contents of another database to determine an
extent of overlap between the two databases. For example, a first
database of items for sale by a store can be compared to a second
database of items for sale by another store to determine which
items are available in both stores, or which items are available in
only one of the stores. As another example, a first database of
media items available in a first repository can be compared to a
second database of media items in another repository to determine
whether equivalent items are available in both repositories, or to
identify items which are available in only one of the repositories
and not the other. Given information about which items are believed
to be the same in two databases, information about those items also
can be compared. For example, given databases of media files, after
similar items are identified, information stored in the two
databases about those items can be compared, such as prices, user
ratings, file size, date of upload and so on.
[0002] Identifying similar items in databases can be performed
readily if globally unique identifiers for the items are available
in the databases; the items with equal identifiers are the same. If
such globally unique identifiers are not available, then metadata
for each item can used to determine if items are the same. Such
metadata for an item typically includes multiple fields of data,
with each field storing a value that partially describes the item.
Sets of metadata from two databases can be compared to determine
whether items are the same.
[0003] Comparison of metadata sets, such as for comparing two
databases of items, can be computationally intensive when the
databases are large. In particular, if one database has a first
number N of items, and a second database has a first number M of
items, the number of comparisons of items to be performed is the
product of the first and second numbers (N.times.M). Computation of
a similarity metric based on multiple data fields across a large
number of items in large databases consumes significant computing
resources. Where each item is represented by yet a third number X
of fields of metadata, the number of potential comparison
operations is the product of the first, second and third numbers
(i.e., N.times.M.times.X). Thus, for large data sets in some
applications, comparison of metadata sets can be computationally
expensive.
[0004] The complexity of comparing databases is further increased
when metadata is unstructured text, incomplete, inaccurate or
includes multiple variations. For example, some metadata fields may
include several words of text. Some metadata fields may be missing
values. Some metadata fields may have different, but equivalent,
values in different data sets. For example, a text field may have a
first set of words for one item in a database, and the text field
of a corresponding item in another database may have a second set
of words that is different from the first set of words, but the
sets of words may have the same meaning. For example, the words may
be different words in the same language which have the same
meaning, or words in different languages. Some fields of the
metadata may be incorrect. For example, a date field may have
incorrect data, or a text field may have a misspelled word. With
such data, metadata sets cannot be compared based on an exact match
or numerical similarity metrics.
[0005] In addition, with some kinds of databases, a result desired
from the comparison may be identifying, for each item in one
database, not a single item in another database that exactly
matches, but instead multiple items in the other database that may
be similar. For example, when comparing databases of movie titles,
the result of the comparison could be to identify multiple versions
of each movie. If multiple similar items are the desired result, a
simple exact match type of comparison cannot be used.
SUMMARY
[0006] This Summary is provided to introduce a selection of
concepts in a simplified form that are further described below in
the Detailed Description. This Summary is intended neither to
identify key or essential features, nor to limit the scope, of the
claimed subject matter.
[0007] A computer system compares databases, where each data record
includes values for a plurality of fields, by applying database
operations on database tables of normalized data from the databases
to reduce computations being performed. For each data record in a
reference set (a first database), candidate matches are identified
from among the data records in a target set (a second database) by
performing a join operation on the sets on one or more selected,
normalized fields from the sets. The effect of the join operation
is to compute a reduced list of candidate data records from the
second database for each data record in the first database. This
significantly reduced set of candidate matches can be further
analyzed to determine how similar they are to the data record from
the reference set.
[0008] In particular, a field in each database is identified as an
index field. The computer system normalizes the index field and
performs a join operation of the databases on the normalized index
field to generate a table. The table resulting from the join
operation includes, for each data record in the first database
having a similar data record in the second database, and for each
similar data record, a row combining data fields for the data
record from the first database with data fields for the similar
data record from the second database. A similar data record is
identified as a data record that has a value in the normalized
second index field which is equal to the value in the normalized
first index field for the data record in the first database. The
effect of the join operation is to compute a reduced list of
candidate data records from the second database for each data
record in the first database, and to associate the data from each
candidate data record in the second database with the data from the
corresponding data record in the first database into a single
record. These candidate data records can then be further
analyzed.
[0009] Multiple metadata fields are used in this further analysis
to compare a data record from the first database to each of its
candidate data records from the second database. For example, for
each row in the generated table, for each pair of a first metadata
field and corresponding second metadata field, the computer system
compares the value of the first metadata field to the value of the
corresponding second metadata field to generate a score for the
pair of corresponding metadata fields. The computer system can
store the scores computed for multiple pairs of corresponding
metadata fields as data fields in the row.
[0010] The computer system can include a matching module for each
type of metadata field. For example, the matching module can
determine how to compute a score for a pair of values of a
particular type of data. Different matching modules can be provided
for different types of data, such as dates, titles, names,
descriptive text, and so on. Matching modules can compute a score
using a function that has a parameter that allows the strictness of
the matching to be adjusted or tuned.
[0011] For each row in the generated table, the computer system
determines, based on the generated scores stored in the row,
whether the data record from the second database corresponding to
the row is sufficiently similar to the data record from the first
database corresponding to the row. In this computation, as an
example, data records may be identified as sufficiently similar if
they are different versions of the same item. Further processing of
the identified data records can be performed to determine whether
the data records represent the same item or different versions of
an item.
[0012] In the following description, reference is made to the
accompanying drawings which form a part hereof, and in which are
shown, by way of illustration, specific example implementations.
Other implementations may be made without departing from the scope
of the disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] FIG. 1 is a combined block diagram and data flow diagram of
an example computer system configured to compare databases.
[0014] FIG. 2 is a flow chart illustrating an example of operation
of the computer system of FIG. 1.
[0015] FIG. 3 is a diagram of an illustrative example of a first
database and a second database.
[0016] FIG. 4 is a diagram of an illustrative example of a table of
extracted and normalized data.
[0017] FIG. 5 is a diagram of an illustrative example of a table of
data records resulting from a join, and such a table with scores
generated by the matching modules.
[0018] FIG. 6 is diagram of an illustrative example of a table of
data records with scores generated by the matching modules, and
such a table with final scores generated by the decision
module.
[0019] FIG. 7 is a flow chart describing an example implementation
of a normalization application.
[0020] FIG. 8 is a flow chart describing an example implementation
of a candidate lookup application.
[0021] FIG. 9 is a flow chart describing an example implementation
of a matching application.
[0022] FIG. 10 is a flow chart describing an example implementation
of a decision module.
[0023] FIG. 11 is a block diagram of an example computer.
DETAILED DESCRIPTION
[0024] FIG. 1 is a block diagram of an example computer system 100
configured to support comparison of contents of two databases. This
example computer system 100 includes a first database 102 and a
second database 103. An example implementation of the databases is
described in more detail below in connection with FIG. 3. Generally
speaking, the first database comprises a first plurality of first
data records, and the second database comprises a second plurality
of second data records. The term "data record" as used herein means
any data structure used to store data describing an item in the
database.
[0025] The computer system 100 also includes one or more computers
110. A computer 110 includes a normalization application 108 that
extracts and normalizes data from the first and second databases
and stores the normalized data in first and second normalized
databases 104 and 106. An example implementation of such a
normalization application, with processes for normalizing data of
different data types, is described in more detail below in
connection with FIGS. 4 and 7. The normalized databases 104 and 106
can be stored as relational database tables in a relational
database management system on the computer 110.
[0026] The computer 110 further includes a candidate lookup
application 112 that, for each data record in the first normalized
database, identifies any data records in the second normalized
database that are candidates for further analysis. More
particularly, given an identification of normalized index fields in
the data records of the first and second normalized databases, a
join operation is performed on the normalized index fields. The
candidate lookup application 112 has a first input through which it
receives data from the first normalized database 104 and a second
input through which it receives data from the second normalized
database 106. The candidate lookup application 112 has an output
through which it provides a table 114 of candidate data records
resulting from the join operation.
[0027] The join operation of the normalized databases on the
normalized index fields generates the table 114 of candidate data
records. More particularly, the table resulting from the join
operation includes, for each data record in the first database
having a similar data record in the second database, and for each
similar data record, a row combining data fields for the data
record from the first database with data fields for the similar
data record from the second database. A similar data record is
identified as a data record that has a value in the normalized
second index field which is equal to the value in the normalized
first index field for the data record in the first database. The
effect of the join operation is to compute a reduced list of
candidate data records from the second database for each data
record in the first database, and to associate the data from each
candidate data record in the second database with the data from the
corresponding data record in the first database into a single
record. These candidate data record associations can then be
further analyzed. FIGS. 5 and 8 below are provided to describe an
example implementation of the candidate lookup operation in more
detail.
[0028] The computer 110 further includes a matching application 116
that, for each candidate data record in the table 114, determines
whether the data record from the first and second databases
represented by the candidate data record are sufficiently similar
to be identified as a match. The matching application 116 has an
input through which it receives data from the table 114 and an
output through which it provides a table 118, including an
indication for each candidate data record whether that candidate
data record represents a match. An example implementation of this
table is described below in connection with FIGS. 5 and 6.
[0029] For example, for each row in the generated table, for each
pair of a first metadata field and corresponding second metadata
field, the matching application 116 compares the value of the first
metadata field to the value of the corresponding second metadata
field to generate a score for the pair of corresponding metadata
fields. The matching application 116 stores the scores computed for
multiple pairs of corresponding metadata fields as data fields
added to the row from table 114, thus providing the updated table
118.
[0030] The matching application 116 can include, as an example
implementation, a matching module 120 (labeled as 120a, . . . ,
120e in FIG. 1) for each type of data in the candidate data record.
For example, each matching module can determine how to compute a
score for a pair of values of a particular type of data. Different
matching modules can be provided for different types of data, such
as dates, titles, names, descriptive text, and so on. Matching
modules can compute a score using a function that has a parameter
that allows the strictness of the matching to be adjusted or tuned.
These scores can be stored in an intermediate table 122, an example
implementation of which is described in FIGS. 5 and 6. Example
implementation of such a matching application is described in more
detail below in connection with FIG. 9.
[0031] The matching application 116 can include, as an example
implementation, a decision module 124, which determines, for each
row in the generated table and based on the generated scores stored
in the row, whether the data record from the second database
corresponding to the row is sufficiently similar to the data record
from the first database corresponding to the row. In this
computation, as an example, the decision module can identify data
records as sufficiently similar if they are different versions of
the same item. The decision module can further process the
identified data records to determine whether a data record
represents the same item or a different version of an item. Such a
determination can be reflected as a value in a data field of each
candidate data record in the output table 118. An example
implementation of a decision module is described in more detail
below in connection with FIG. 10.
[0032] The computer 110 can be implemented using a general purpose
computer such as described below in connection with FIG. 11. The
computer system can include a plurality of such computers 110. For
example, different computers can be used to execute the different
applications or modules in FIG. 1, or different computers can be
used to process different parts of the data, such as different
parts of the table 114 of candidate data records. In particular, by
implementing the applications on the computer 110 using database
operations such as join operations, multiple computers can be used
to process data in a distributed system. In particular, and for
example, join operations of the candidate lookup application can be
highly parallelized. The other applications also can be executed on
multiple computers with partial data sets for more efficiency.
Thus, in practice, the applications 108, 112, 116, are executed on
multiple computers at the same time.
[0033] Referring now to FIG. 2, a flowchart describing operation of
the computer system 100 will now be described. The computer 110
extracts (200) data records from the first database 102 and second
database 103. The extraction of data records from the first and
second databases can be performed in several ways, depending on how
the computer 110 has access to the first database 102 and to the
second database 103. For example, such extraction can be performed
by transmission of one or more queries or read requests from the
computer over a computer network to a remote computer to request
data from the first or second database where the first or second
database is stored on a remote computer. Alternatively, the
computer may periodically receive data from a remote computer. The
computer may read data from a data file, containing the first or
second database, local to the computer. The extracted data may be
normalized and stored in a data file or database, in memory or in
persistent storage such as in a data file. The extracted data can
be a set of data fields which is a smaller subset of the data
fields in the first database or the second database.
[0034] The normalization application 108 normalizes (202) the
extracted data and stores (204) the normalized data in first and
second normalized databases 104 and 106. The normalization process
can include creating multiple normalized data fields from a single
original data field.
[0035] The candidate lookup application performs (206) a join
operation on the first and second normalized databases on a
selected index field to generate the table 114 of candidate data
records. More than one field can be selected as an index field, and
multiple join operations can be performed, with the resulting
tables 114 combined together. In such an implementation, the
combined table can include a field indicating which index fields
matched for the candidate records.
[0036] The matching application 116 generates (208) scores for each
row of the table 114 of candidate data records. The matching
application outputs (210) one or more values for each row of the
table 114 indicating whether the candidate data records are
considered a match based on the generated scores. In some
implementations, such as one described in more detail below in, the
matching application can decide for each row, using the scores
generated for the row, whether the candidate is a match to the
reference and can keep only matches for additional processing. The
additional processing can include a version similarity computation,
which computes a distance between the product version of the
reference and the product version of the candidate. This similarity
metric, among multiple matches for a given reference, can be used
in selection of the best or n-best matches with the given
reference.
[0037] Referring now to FIGS. 3 to 6, example database tables will
now be described. FIG. 3 illustrates an example database for a
first database and a second database. FIG. 4 illustrates an example
database table for storing data extracted from the first database
or from the second database, and for storing the first and second
normalized databases. FIG. 5 illustrates an example database table
for the table of candidate data records resulting from a join
operation, and such a table combined with scores from matching
modules as an intermediate table. FIG. 6 illustrates an example
database table for an intermediate table including scores from
matching modules, and an output table that includes computed scores
and values indicating whether a record represents a match.
[0038] In FIG. 3, a first database 300 comprises a first plurality
of first data records 302. Each first data record comprises a first
plurality of first data fields 304 storing metadata describing an
item from among a first plurality of items. The first plurality of
data fields comprises at least a first identifier field 306, a
first index field 308, and at least two more first metadata fields
310, 312. The first database may store metadata in many more than
two metadata fields. Also, these metadata fields may be distributed
among many different data records in the first database, and likely
not in a single table.
[0039] Each first identifier field 306 for each first data record
for an item stores a value representing an identifier for the item
in the first database. This identifier typically is unique for the
item within the first database. Each first index field 308 for each
first data record for an item is a metadata field that stores a
value representing a description of the item. This first index
field is identified prior to the extraction process as the field on
which the join operation will be performed after normalization of
the data in the field. The field that is selected as the index
field depends on the implementation and the nature of the databases
being compared. In general the index field is a field which, after
normalization of values in both databases, has a high likelihood of
having values from both databases that are an exact match if the
items are similar. The metadata field used as a first index field
for this comparison may not be marked as such a field in the first
database.
[0040] Similarly, a second database 320 comprises a second
plurality of second data records 322. Each second data record
comprises a second plurality of second data fields 324 storing
metadata describing an item from among a second plurality of items.
The second plurality of data fields comprises at least a second
identifier field 326, a second index field 328, and at least two
more second metadata fields 330, 332. The second database may store
data in many more than two metadata fields. The second database may
store metadata in many more than two metadata fields. Also, these
metadata fields may be distributed among many different data
records in the second database, and likely not in a single table.
The structure of the second database, and the kinds of metadata
fields it stores, can be substantially different from the first
database. However, it should have at least two metadata fields,
e.g., 328, 330, 332, in which the stored data is similar in type to
at least two other metadata fields, e.g., 308, 310, 312, in the
first database.
[0041] Each second identifier field 326 for each second data record
for an item stores a value representing an identifier for the item
in the second database. This identifier typically is unique for the
item within the second database. For similar items in the first
database and the second database, the value stored as the
identifier for the similar item in the first identifier field is
assumed to be different from the value stored as the identifier for
the similar item in the second identifier field. Thus the values in
the identifier fields can be ignored in the comparison of the
databases. The identifier is useful for retrieving or otherwise
accessing data about an item from the database.
[0042] Each second index field 328 for each second data record for
an item is a metadata field that stores a value representing a
description of the item. This second index field is identified
prior to the extraction process as the field on which the join
operation will be performed after normalization of the data in the
field, and at is selected because it has data similar to the data
found in the corresponding first index field 308. The metadata
field used as a second index field for this comparison may not be
marked as such a field in the second database.
[0043] It is assumed, in the comparison operation described herein,
that values stored in the metadata fields in either the first
database or the second database may store unstructured text, may be
incomplete, may be inaccurate or may be imprecise. For example,
some metadata fields may be missing values. Some values in the
metadata fields may be incorrect. Corresponding metadata fields
from different data records for the same item may have different,
but similar, values.
[0044] Additionally, in the case where the database represents a
catalog of items representing media data files, the catalog may
have multiple versions of a work, such as multiple versions of a
movie. For example, for a given movie, there may be a theatrical
release, a director's cut, a version reformatted for television, a
sequel, a dubbed version in a different language from the original
language, and so on. Using the techniques described below, such
versions can still be identified as similar works even if they are
not identical. It is assumed that the media data files themselves
are not being accessed to perform a comparison of the media
data.
[0045] Turning now to FIG. 4, an example database table for storing
data extracted from the first database or from the second database,
and for storing the first and second normalized databases, will now
be described.
[0046] Similar to the first database 300 and the second database
320, normalized data extracted from one of these databases can be
stored in a table 400 comprising a plurality of data records 402.
Each data record comprises a plurality of data fields 404 storing
metadata describing an item from among a plurality of items. The
plurality of data fields comprises at least an identifier field
406, an index field 408, and at least two more metadata fields 410,
412. However, unlike the first and second database 300 and 320,
where the various data fields may be distributed among various
structures in the database, in table 400, the extracted identifier
field, index field and metadata fields for an item are combined in
a single row for the item. Queries on the first database and the
second database can be used to populate such a table 400. A first
such table 400 is created using data extracted from the first
database 300; a second such table 400 is created using data
extracted from the second database 320. The data in table 400 is
normalized as described below. Additional fields 414 can be stored
in the normalized data table for normalized values so as to provide
a table with both original values as read from one of the database,
and the corresponding normalized values. The normalization process
may convert an original field into multiple normalized fields as
well.
[0047] Given a normalized table of data extracted from the first
database and a normalized table of data extracted from the second
database, the objective is to identify, for each data record in a
reference set, e.g. from the first database, candidate matches from
among the data records in a target set, e.g., the second database.
This objective can be accomplished by performing a join operation
on two tables 400 on one or more selected normalized fields 414
from the sets. This significantly reduced set of candidate matches
can be further analyzed to determine how similar they are to the
data record from the reference set. The result of the join
operation is a table of candidate records, an example
implementation of which will now be described in connection with
FIG. 5.
[0048] This table 500 resulting from the join operation includes,
for each data record in the first database having a similar data
record in the second database, and for each similar data record, a
row 502 combining data fields for the data record from the first
database with data fields for the similar data record from the
second database.
[0049] A similar data record is identified as a data record that
has a value in the normalized second index field which is equal to
the value in the normalized first index field for the data record
in the first database. The join operation in effect performs this
identification. More particularly, an inner join operation is
performed. Further, the effect of the join operation is to compute
a reduced list of candidate data records from the second database
for each data record in the first database, and to associate the
data from each candidate data record in the second database with
the data from the corresponding data record in the first database
into a single record. These candidate data record associations can
then be further analyzed.
[0050] Thus, a row 502 in the table of candidate data records 114
includes a plurality of fields 504 including the first identifier
field 506 for the data record from the first database and the
second identifier field 508 for the data record from the second
database. The row further includes the first index field and first
metadata fields 510, and second index field and second metadata
fields 512. Further first additional normalized fields 514 and
second additional normalized fields 516 from the normalized data
tables 400 for the first and second database also are added.
[0051] If more than one field has been selected as an index field,
and multiple join operations have been performed, the resulting
tables are combined together. In such an implementation, the
combined table can include, for each row, a field (not shown)
indicating which index fields matched for the candidate data
records.
[0052] Given the table of candidate data records such as shown in
FIG. 5, for each row in the generated table, for each pair of a
first metadata field and corresponding second metadata field, the
matching application compares the value of the first metadata field
to the value of the corresponding second metadata field to generate
a score for the pair of corresponding metadata fields. It should be
understood that a comparison for a pair of corresponding metadata
fields can use data from a plurality of first metadata fields and a
plurality of second metadata fields to generate a score. The
computer system can store the scores computed for multiple pairs of
corresponding metadata fields for a candidate data record as data
fields in the row as indicated at 518, 520. A table with such
scores has rows of fields 522, and can implement the intermediate
table 122 of FIG. 1.
[0053] FIG. 6 illustrates an example database table 600 for an
intermediate table with fields 624, which includes scores computed
for each candidate, or for an output table with fields 604 which
further includes computed final decision values indicating whether
a record represents a match. The intermediate table and the output
table can incorporate the candidate table 500 and add fields to
that table. Thus, each row 602 includes a plurality of fields 604,
including the first identifier field 606, second identifier field
608, first index and metadata fields 610, second index and metadata
fields 612, first additional normalized fields 614 and second
additional normalized fields 616. In addition, each row includes a
first score 618 based on a first pair of metadata fields, and a
second score 620 based on a second pair of metadata fields, and so
on for multiple pairs of corresponding metadata fields. In the
final output table, these scores can be processed to generate one
or more decision values 622. In the example described below, there
are three such decision values.
[0054] More details of an example implementation of such a computer
system and its operation will be described below in connection with
FIGS. 7-10.
[0055] Normalization of the metadata fields in the first and second
databases is dependent on the type of the data stored in the
databases. Generally, normalization is performed to allow data to
be compared. For example, if dates are stored in different formats
in the different databases, date data can be transformed from an
original format in the first or second database into the same
format in the normalized data tables.
[0056] Text information can be normalized, in part, using
techniques to be applied generically to all text data. Such
techniques generally can be implemented in the form of rules which
specify an input text string and its corresponding output text
string. An order of application of such rules also can be
specified. In other words, if the input text string is identified
in an original data field, the output text string replaces that
data in the normalized data field. Such example rules which can be
generically applied are the following: [0057] a. replace capital
letters with lower case letters; [0058] b. convert any word
representing a number into the number, e.g., "two" becomes "2";
[0059] c. remove words, e.g., "the", based on a specified list,
which may be language specific; [0060] d. remove any character that
is not a number or a letter. As an example application of such
rules, consider the string "Iron Man: three". According to rule a,
this becomes "iron man: three". Rule b further transforms this to
"iron man: 3". Rule c is inapplicable, and rule d further
transforms the string to "ironman3".
[0061] Text information can be further normalized based on
information known about the purpose of the metadata field
containing the text information. For example, in a database of
media files such as movies or music, there may be fields that
include proper names, such as a director, actor, musician or the
like. Such a data field may include more than one name, or a name
may be missing. Such name fields can be normalized by first
identifying each name in the field, normalizing each name according
to the normalization rules, e.g., "Jane Doe" becomes "janedoe", and
then concatenating the normalized names together. The concatenation
can be performed using a predetermined ordering, such as
alphabetical ordering, and a predetermined delimiter between each
name, such as a special character such as "&".
[0062] As another example, in a database of media files such as
movies or music, a data field is typically present for storing text
data for a title. Before generic normalization is applied, the text
in the title field can be extracted into multiple data fields, each
of which can be separately normalized using the generic rules. Text
data for a title typically can include a main title, other data
indicative of a version, such as a subtitle and a year. Subtitles
can be extracted by identifying typical delimiters in text that
signify a subtitle, and extracting text after or between such
delimiters. Examples of such delimiters are parentheses or a colon.
A year can be extracted by identifying, for example, a sequence of
four concatenated numerals. Using such rules, an original title
field "Batman (2015 Remaster)" can be transformed into three data
fields: a title field "Batman", a subtitle field "Remaster", and a
year field "2015", which can be further normalized.
[0063] As another example, as a special case of a proper name, in a
database of media files such as movies or music a studio name or
other corporate source of the media file may be one of the metadata
fields. Such data might be in the form of "Copyright 1993 XYZ
Studios". Such a field can be normalized by extracting a name
and/or a year, and applying normalization rules to the extracted
data. In the case of the studio name, one additional normalization
step can be to replace the name with a substitute name given a
predetermined list of synonyms. Such synonyms can be based on
information about corporate acquisitions, for example.
[0064] A range of such normalization rules can be provided for each
data type and each metadata field. Normalization rules also can be
language specific, and can be specified applied in a
language-dependent order. As an example, English-specific rules can
be applied first, and then French-specific rules can be applied
next. Such normalization helps to transform unstructured text,
incomplete, inaccurate or otherwise variable data into a format
that can be more easily compared. Normalization rules can be
defined based on the data type of each metadata field, allowing the
system to be tailored to the data sets actually being compared.
[0065] Accordingly, as shown in FIG. 7, a flowchart illustrates an
example implementation of normalization of a table of data. Such
normalization can be performed while data is being extracted from
the first database 102 and second database 103 into their
corresponding normalized tables 104, 106. The data fields to be
extracted from the selected database, e.g., the first database 102,
are accessed 700. For each metadata field for which there is a
normalization process for that metadata field type, the
normalization process for that type of metadata field is applied
702 to the values in that metadata field to generate corresponding
normalized metadata fields. Other normalization rules are then
generically applied 704 to the remaining fields, including the
additional metadata fields generated through metadata-specific
normalization. These steps are repeated to generate the normalized
table of data 106 for the second database 103, as indicated
706.
[0066] An example implementation of the candidate selection
application will now be described in more detail in connection with
the flowchart shown in FIG. 8. In FIG. 8, a data field from the
first database and a data field from the second database are
selected 800 as the first index field and the second index field. A
join operation is performed 802 on the extracted, normalized data
tables from the first and second databases to generate a table of
candidate records. The join operation can be characterized as a way
of selecting, for each data record in the first database, each data
record in the second database for which the value in the second
index field matches the value in the first index field of the data
record from the first database. There can be multiple such data
records in the second database which can be a match for a given
data record from the first database.
[0067] Multiple join operations can be performed by repeating steps
800 and 802, as indicated at 804, and selecting different metadata
fields as the index field. If multiple such join operations are
performed, a union of the resulting tables is computed 808 to
provide a full list of candidates.
[0068] In a relational database management system (RDBMS), or other
relational database implementation, particularly those supporting a
form of Structured Query Language (SQL) or variant thereof, the
inner join operation is a particular efficient way to implement the
candidate lookup application to reduce the sets of data records
which are compared to each other. Other database structures that
can implement a join operation, such as a file-based database, such
as the Hadoop database with an implementation of a join operation,
can be used to similar effect. As a result of such a join
operation, for each data record in the first database, that data
record may be compared to a small number of candidates from the
second database instead of the entire second database.
[0069] As an example in comparing databases of media data files,
metadata fields that can be used as index fields include title
fields, description fields and/or source entity fields, such as an
artist's name or director's name fields. For movies, as an example,
a combination of title (without its subtitle, due to
normalization), description and director fields as normalized index
fields produces a satisfactory candidate set for further analysis
by a matching application.
[0070] Given the table of candidate data records resulting from the
process of FIG. 8, the matching application (116 in FIG. 1)
processes each data record to provide a set of scores which are
added to the data record. More particularly, it should be observed
that, in each row in the table of candidate records, there is
normalized metadata from a data record in the first database and
corresponding normalized metadata from a candidate match data
record in the second database. For each type of metadata, a
different matching module can be provided to compare normalized
metadata of that type. Several examples of such a matching module
follow. Implementation of the matching application using several
matching modules allows each matching module to be tailored to a
specific type of metadata present in the data records.
[0071] One kind of metadata field is a metadata field that stores
name data, such as a proper name of a person, or of an
organization. For media files, such names may be in metadata fields
for directors, actors, studios, artists, musicians, and so on. As
noted above, such metadata fields can be processed into a list of
zero or more normalized names as strings of text. For a pair of
such fields, one kind of score is a binary value (indicating true
or false) if one of the names is one of the fields is contained in
one of the names in the other of the fields. For example, this
"contained in" value is true if one field has "janedoe" and the
other field has "janedoesmith". Such a comparison can be
implemented using a simple string operation applied to the fields.
Another kind of score that can be computed is a similarity metric
applied to the two strings. Any of a variety of other similarity or
distance metrics can be used. For example, any edit distance metric
can be used. An example distance metric that can be used is a
Levenshtein distance metric. The value resulting from computing the
distance metric, or a binary value resulting from a comparison of
the computed distance to a threshold, or both, can be stored as a
score for the pair of fields. In one implementation, both the
"contained in" score and the distance metric score can be
stored.
[0072] Another kind of metadata field is a metadata field that
stores a year. The normalization process can result in several
metadata fields storing a year. For example, a year field may exist
in the data record, but also may be extracted from a studio name or
title or other data in the data record. To compare the data between
two databases, the matching module can compute a binary value (true
or false) which can be true if the value in any one of the year
fields from one database match the value in any one of the year
fields from the other database.
[0073] Another kind of metadata field is a metadata field that
stores a text description for an item. For media files, a database
may include a significantly large text description for an item,
which can contain some valuable information. There are a number of
different techniques that can be used to compute a measure of
similarity between two text descriptions, such as an edit distance
metric. The value resulting from computing the distance metric, or
a binary value resulting from a comparison of the computed distance
to a threshold, can be stored as a score for the pair of
fields.
[0074] As a particular example of such a similarity score
computation to be applied to a text description is the following.
If the two strings are equal, then the score is 1. Otherwise, the
strings are compared character-by-character, incrementing a counter
for each pair of matching characters, until a mismatch is
identified. After a mismatch is identified, the ratio of the
counter to the length of the description is computed. This ratio
can be compared to a threshold to provide a binary value (true or
false) indicative of a match.
[0075] For databases that are catalogs of movies, a matching module
can also be provided for producing scores particularly with respect
to actor information. In addition to, or as an alternative to
comparing names in pairs of data fields specifying actors' names,
the description field from one database can be searched for actor
names listed in actor fields from the other database to generate a
score. The score can be a simple count of actor names, as
normalized as described above, found in the normalized text
description. Such a computation is useful if one of the databases
does not have actor information specified, but the other database
does. In some cases, a database may include a director in an actor
name field. The director name, if known, can be excluded from the
count of names found in the description.
[0076] The flowchart of FIG. 9 describes an example implementation
of a matching application including a plurality of matching
modules. The matching application selects 900 a matching module
from among the plurality of matching modules and identifies 902 the
corresponding one or more metadata fields derived from the first
database, and the corresponding metadata fields derived from the
second database, to which the matching module is to be applied. The
matching module is applied 904 to each data record in the table of
candidate data records, with the corresponding scores added to the
data records. If there are more matching modules to be applied, as
determined at 906, the process repeats. After all of the matching
modules have been applied, the table of candidate records, as
updated with scores, can be processed 908 (e.g., by the decision
module 124 in FIG. 1) to provide a matching decision.
[0077] The decision module 124 can process the scores resulting
from the matching module in a number of ways. In general, one or
more rules can be applied to the scores in each data record to
generate a final score indicative of whether the two items in the
data record match. As one example, a set of tests can be applied to
the scores which provide both a final score indicative of a match
and a confidence score indicative of a confidence of the match. In
general, the number of matching modules that indicate a match, and
a relative priority of those matching modules, can be used to
generate both the criteria for whether the final score indicates a
match, and the confidence of that match.
[0078] As a particular example, in FIG. 10, for a given data
record, first criteria can be applied 1000 to the scores. The first
criteria can evaluate whether the results from selected matching
modules of high priority are true. If, according to the first
criteria, there is a match as indicated at 1002, then the final
score for the record is set 1004 to indicate a match, and a
priority score is set to a high value. Otherwise, second criteria
can be applied 1006. The second criteria can consider whether
values from a different selection of matching modules are true, but
for which the confidence of the match is lower. If, according to
the second criteria, there is a match as indicated at 1008, then
the final score for the record is set 1010 to indicate a match, and
a priority score is set to the second highest value. Otherwise,
further criteria can be considered, as indicated at 1012. The
number of further criteria can depend on the implementation.
[0079] The following example implementation uses four levels of
priority. As a specific example of such criteria, for databases
representing catalogs of movies, a high priority match can be
indicated if the titles are equal or if the descriptions match and
if any one of the following is true: the directors are equal, the
directors match and the year or description match, the studio
matches and the year or description match. A second priority match
can be indicated if the titles match, the year matches, and either
the directors or studios match. A third priority match can be
indicated if the titles are equal or the descriptions match and the
year matches. A fourth priority, or lowest priority, match can be
indicated if the titles are equal or the descriptions match or the
directors are equal and another condition is met. That other
condition can be based on the number of actors found in the
description and the years match. Another condition that can be
considered is the number of products related to the director in the
specified year. In the foregoing definitions of criteria, values
are equal if they are an exact match; values match when the output
score produced by the matching module is above a threshold.
[0080] After determining a final match value and a priority for
each data record, any data record for which the final match value
is not true can be removed from the table of candidate data
records. The remaining data records indicate those candidates from
the second database which have a high likelihood of being a good
match to the corresponding item from the first database. However,
these candidate matches may represent different versions of an item
and may not be exact matches. The matching application can further
apply additional computations and criteria to provide information
about versions.
[0081] Determination of version information is dependent on the
type of items represented by the database and information about
those items. For example, with media files such as movies or music,
subtitle information extracted in the normalization process can
indicate version information, such as "digitally remastered" or
"bonus features" and the like.
[0082] In one implementation, given the subtitle field for an item,
a vector of version information can be computed based on
identifying keywords in the subtitle text. A process of feature
extraction applied to the subtitle field can be used to generate
the version information vector. For example, each component of the
version information vector can represent whether a feature is found
in the subtitle field. Feature extraction generally applies a
function to the data in the subtitle field to obtain a value. For
example, feature extraction can include a string search for a
keyword or keywords in the subtitle field Version information
vectors for two items can be compared to determine an extent of
similarity of the versions of this item. For each record remaining
in the table of candidate data records, a vector of version
information is computed for each item in the data record, providing
a reference vector for one item and a candidate vector for another
item.
[0083] The vector of version information can be a set of binary
values (e.g., "1" representing true and "0" representing false)
with each binary value representing the presence of a corresponding
keyword in the subtitle. As an example, for music such a vector may
include a value indicating the presence of the keyword "live" and a
value indicating the presence of the keyword "remastered". For
movies, such a vector may include a value for the keyword "bonus"
and the keyword "director". The selection of the keywords related
to versions is dependent on the type of items represented by the
database. With such an implementation, this module can compute a
scalar distance between the vectors of version information.
[0084] In another implementation, given the subtitle fields for the
item, a form of edit distance calculation can be applied to compare
the subtitle texts. For example, a Levenshtein distance can be
used. The distance can be compared to a threshold to make a
determination of whether the subtitles are the same, and provide a
binary value indicative of such.
[0085] Such an edit distance calculation can be modified so as to
improve the precision of the calculation depending on the type of
item the subtitle is for. For example, with data about movies, an
example implementation is the following. First, version keywords
used to generate the vector of version information are removed from
the subtitles. An initial score is set to zero. If the reduced
subtitles are equal, then 100 is subtracted from the current score.
Otherwise, if the reduced subtitle for one item is contained in the
reduced subtitle for the other item, then 1 is subtracted from the
current score. The Levenshtein distance between the reduced
subtitles is computed and added to the current score to provide a
version distance value. Either the version distance value, or the
version vector distance value, or both can be combined, to provide
a final version distance for each data record.
[0086] As a result of the foregoing processing by the matching
application, for each candidate record remaining from the table of
candidate records, the record includes an identifier of an item
from the first database, an identifier of an item from the second
database that is a candidate match, and three data fields
indicating the quality of the match, including version information.
These three data fields are a field with Boolean value (e.g., 1 or
0) indicating whether the two records are a match, a field with a
match priority value, and a field with a subtitle distance value.
Another pass through this table can identify the most similar
product for each item from the first database, which is represented
by the data record which includes that item and which is indicated
as a match with its associated item from the second database, and
which has the highest priority and lowest subtitle distance value
from among multiple data records, if any, for that item. This pass
through the table can provide an output table where each row
includes an identifier of an item from the first database and an
identifier of its corresponding best match item from the second
database.
[0087] The information resulting from this process can be used in a
number of ways. For example, for any item that does not appear in
any data record in the final candidate table, such an item is in
one database, but not the other database. Such information can be
used to identify items to be added to one of the databases. For
items that are identified as similar, other comparisons can be made
between the information in the databases. For example, metadata
from one database can be used to update the other database.
[0088] Having now described an example implementation, FIG. 11
illustrates an example of a computer with which techniques
described herein can be implemented. This is only one example of a
computer and is not intended to suggest any limitation as to the
scope of use or functionality of such a computer.
[0089] The computer can be any of a variety of general purpose or
special purpose computing hardware configurations. Some examples of
types of computers that can be used include, but are not limited
to, personal computers, game consoles, set top boxes, hand-held or
laptop devices (for example, media players, notebook computers,
tablet computers, cellular phones including but not limited to
"smart" phones, personal data assistants, voice recorders), server
computers, multiprocessor systems, microprocessor-based systems,
programmable consumer electronics, networked personal computers,
minicomputers, mainframe computers, and distributed computing
environments that include any of the above types of computers or
devices, and the like.
[0090] With reference to FIG. 11, a computer 1100 includes a
processing system at least one processing unit 1102 and memory
1004. The computer can have multiple processing units 1002 and
multiple devices implementing the memory 1004. A processing unit
1002 comprises a processor, which is logic circuitry which responds
to and processes instructions to provide the functions of the
computer. A processing unit can include one or more processing
cores (not shown) that are processors within the same logic
circuitry that can operate independently of each other. Generally,
one of the processing units in the computer is designated as a
primary processing unit, typically called the central processing
unit (CPU). Additional co-processing units, such as a graphics
processing unit (GPU), also can be present in the computer. A
co-processing unit comprises a processor that performs operations
that supplement the central processing unit, such as but not
limited to graphics operations and signal processing
operations.
[0091] The memory 1004 may include volatile computer storage
devices (such as dynamic random access memory (DRAM) or other
random access memory device), and non-volatile computer storage
devices (such as a read-only memory, flash memory, and the like) or
some combination of the two. A nonvolatile computer storage device
is a computer storage device whose contents are not lost when power
is removed. Other computer storage devices, such as dedicated
memory or registers, also can be present in the one or more
processors. The computer 1000 can include additional computer
storage devices (whether removable or non-removable) such as, but
not limited to, magnetically-recorded or optically-recorded disks
or tape. Such additional computer storage devices are illustrated
in FIG. 11 by removable storage device 1008 and non-removable
storage device 1010. Such computer storage devices 1008 and 1010
typically are nonvolatile storage devices. The various components
in FIG. 11 are generally interconnected by an interconnection
mechanism, such as one or more buses 1030.
[0092] A computer storage device is any device in which data can be
stored in and retrieved from addressable physical storage locations
by the computer. A computer storage device thus can be a volatile
or nonvolatile memory, or a removable or non-removable storage
device. Memory 1004, removable storage 1008 and non-removable
storage 1010 are all examples of computer storage devices. Some
examples of computer storage devices are RAM, ROM, EEPROM, flash
memory or other memory technology, CD-ROM, digital versatile disks
(DVD) or other optically or magneto-optically recorded storage
device, magnetic cassettes, magnetic tape, magnetic disk storage or
other magnetic storage devices. Computer storage devices and
communication media are mutually exclusive categories of media, and
are distinct from the signals propagating over communication
media.
[0093] Computer 1000 may also include communications connection(s)
1012 that allow the computer to communicate with other devices over
a communication medium. Communication media typically transmit
computer program instructions, data structures, program modules or
other data over a wired or wireless substance by propagating a
modulated data signal such as a carrier wave or other transport
mechanism over the substance. The term "modulated data signal"
means a signal that has one or more of its characteristics set or
changed in such a manner as to encode information in the signal,
thereby changing the configuration or state of the receiving device
of the signal. By way of example, and not limitation, communication
media includes wired media, such as metal or other electrically
conductive wire that propagates electrical signals or optical
fibers that propagate optical signals, and wireless media, such as
any non-wired communication media that allows propagation of
signals, such as acoustic, electromagnetic, electrical, optical,
infrared, radio frequency and other signals. Communications
connections 1012 are devices, such as a wired network interface,
wireless network interface, radio frequency transceiver, e.g., WiFi
1070, cellular 1074, long term evolution (LTE) or Bluetooth 1072,
etc., transceivers, navigation transceivers, e.g., global
positioning system (GPS) or Global Navigation Satellite System
(GLONASS), etc., or network interface devices 1076, e.g., Ethernet,
etc., or other devices that interface with communication media to
transmit data over and receive data from the communication
media.
[0094] The computer 1000 may have various input device(s) 1014 such
as a pointer device, keyboard, touch-based input device, pen,
camera, microphone, sensors, such as accelerometers, thermometers,
light sensors and the like, and so on. The computer 1000 may have
various output device(s) 1016 such as a display, speakers, and so
on. Such devices are well known in the art and need not be
discussed at length here. Various input and output devices can
implement a natural user interface (NUI), which is any interface
technology that enables a user to interact with a device in a
"natural" manner, free from artificial constraints imposed by input
devices such as mice, keyboards, remote controls, and the like.
[0095] Examples of NUI methods include those relying on speech
recognition, touch and stylus recognition, gesture recognition both
on screen and adjacent to the screen, air gestures, head and eye
tracking, voice and speech, vision, touch, gestures, and machine
intelligence, and may include the use of touch sensitive displays,
voice and speech recognition, intention and goal understanding,
motion gesture detection using depth cameras (such as stereoscopic
camera systems, infrared camera systems, and other camera systems
and combinations of these), motion gesture detection using
accelerometers or gyroscopes, facial recognition, three dimensional
displays, head, eye, and gaze tracking, immersive augmented reality
and virtual reality systems, all of which provide a more natural
interface, as well as technologies for sensing brain activity using
electric field sensing electrodes (EEG and related methods).
[0096] The various computer storage devices 1008 and 1010,
communication connections 1012, output devices 1016 and input
devices 1014 can be integrated within a housing with the rest of
the computer, or can be connected through various input/output
interface devices on the computer, in which case the reference
numbers 1008, 1010, 1012, 1014 and 1016 can indicate either the
interface for connection to a device or the device itself as the
case may be.
[0097] A computer generally includes an operating system, which is
a computer program that manages access, by applications running on
the computer, to the various resources of the computer. There may
be multiple applications. The various resources include the memory,
storage, input devices and output devices, such as display devices
and input devices as shown in FIG. 11. To manage access to data
stored in nonvolatile computer storage devices, the computer also
generally includes a file system maintains files of data. A file is
a named logical construct which is defined and implemented by the
file system to map a name and a sequence of logical records of data
to the addressable physical locations on the computer storage
device. Thus, the file system hides the physical locations of data
from applications running on the computer, allowing applications
access data in a file using the name of the file and commands
defined by the file system. A file system provides basic file
operations such as creating a file, opening a file, writing a file,
reading a file and closing a file.
[0098] The various modules, tools, or applications, and data
structures and flowcharts of FIGS. 1 through 10, as well as any
operating system, file system and applications on a computer in
FIG. 11, can be implemented using one or more processing units of
one or more computers with one or more computer programs processed
by the one or more processing units. A computer program includes
computer-executable instructions and/or computer-interpreted
instructions, such as program modules, which instructions are
processed by one or more processing units in the computer.
Generally, such instructions define routines, programs, objects,
components, data structures, and so on, that, when processed by a
processing unit, instruct or configure the computer to perform
operations on data, or configure the computer to implement various
components, modules or data structures.
[0099] Alternatively, or in addition, the functionality of one or
more of the various components described herein can be performed,
at least in part, by one or more hardware logic components. For
example, and without limitation, illustrative types of hardware
logic components that can be used include Field-programmable Gate
Arrays (FPGAs), Program-specific Integrated Circuits (ASICs),
Program-specific Standard Products (ASSPs), System-on-a-chip
systems (SOCs), Complex Programmable Logic Devices (CPLDs),
etc.
[0100] Accordingly, in one aspect, a computer system compares
databases, where each data record includes values for a plurality
of fields, by applying database operations on database tables of
the data records to reduce computations being performed. For each
data record in a reference set, candidate matches are identified
from among the data records in a target set by performing a join
operation on the sets on one or more selected, normalized fields
from the sets. This significantly reduced set of candidate matches
can be further analyzed to determine how similar they are to the
data record from the reference set.
[0101] In another aspect, a computer system includes a means for
selecting a candidate set based on the first database and the
second database using the normalized first index field and the
normalized second index field, to generate a table combining the
first metadata fields and the second metadata fields, wherein the
table comprises, for each data record in the first database having
a similar data record in the second database, and for each similar
data record, a row combining data fields for the data record from
the first database with data fields for the similar data record
from the second database, wherein the similar data record is a data
record that has a value in the normalized second index field which
matches the value in the normalized first index field for the data
record in the first database.
[0102] In another aspect, a computer system includes a means for
computing a reduced list of candidate data records from a second
database for each data record in a first database, and to associate
data from each candidate data record in the second database with
data from a corresponding data record in the first database into a
single record. The computer system further includes a means for
computing scores based on data from an associated candidate data
record of the second database and the corresponding data record
from the first database, which scores are indicatively of
similarity.
[0103] In another aspect, a computer system comprises computer
storage, wherein a first database and a second database are stored
in the computer storage. The first database comprises a first
plurality of first data records. Each first data record comprises a
first plurality of first data fields storing metadata describing an
item from among a first plurality of items. The first plurality of
data fields comprises at least a first identifier field, a first
index field, and at least two more first metadata fields. Each
first identifier field for each first data record for an item
stores a value representing an identifier for the item in the first
database. Each first index field for each first data record for an
item stores a value representing a description of the item. The
second database comprises a second plurality of second data
records. Each second data record comprises a second plurality of
second data fields storing metadata describing an item from among a
second plurality of items. The second plurality of data fields
comprises at least a second identifier field, a second index field,
and at least two more second metadata fields. Each second
identifier field for each second data record for an item stores a
value representing an identifier for the item in the second
database. Each second index field for each second data record for
an item stores a value representing a description of the item. For
similar items in the first database and the second database, the
value representing the identifier for the similar item in the first
identifier field may be different from the value representing the
identifier for the similar item in the second identifier field.
[0104] In this computer system, in one aspect, a processing system
comprises a memory and a processor. The processing system is
operable to access the computer storage and is configured by
computer program instructions to be operative to normalize the
values in the first index field of the first data records and the
values in the second index field of the second data records. A
candidate set of data records is selected based on the first
database and the second database using the normalized first index
field and the normalized second index field, to generate a table
combining the first metadata fields and the second metadata fields.
This table comprises, for each data record in the first database
having a similar data record in the second database, and for each
similar data record, a row combining data fields for the data
record from the first database with data fields for the similar
data record from the second database. The similar data record is a
data record that has a value in the normalized second index field
which is equal to the value in the normalized first index field for
the data record in the first database. For each row in the
generated table, values of first metadata fields in the row are
compared to values of corresponding second metadata fields in the
row to generate scores. The scores can be stored as data fields in
the row. For each row in the generated table, whether the data
record from the second database corresponding to the row is
equivalent to the data record from the first database corresponding
to the row is determined based on the generated scores stored in
the row.
[0105] Using such a computer system, in another aspect, a computer
implemented process comprises normalizing the values in the first
index field of the first data records and the values in the second
index field of the second data records. The process further
includes selecting a candidate set based on the first database and
the second database using the normalized first index field and the
normalized second index field, to generate a table combining the
first metadata fields and the second metadata fields, wherein the
table comprises, for each data record in the first database having
a similar data record in the second database, and for each similar
data record, a row combining data fields for the data record from
the first database with data fields for the similar data record
from the second database, wherein the similar data record is a data
record that has a value in the normalized second index field which
is equal to the value in the normalized first index field for the
data record in the first database. The process further includes,
for each row in the generated table, comparing values of first
metadata fields in the row to values of corresponding second
metadata fields in the row to generate scores, and storing the
scores as data fields in the row. The process further includes, for
each row in the generated table, determining whether the data
record from the second database corresponding to the row is
equivalent to the data record from the first database corresponding
to the row, based on the generated scores stored in the row.
[0106] In any of the foregoing aspects, the computer system can
select the candidate set by performing a join operation of the
first database and the second database on the normalized first
index field and the normalized second index field to generate the
table.
[0107] In any of the foregoing aspects, selecting the candidate set
can use a plurality of normalized first index fields and a
plurality of normalized second index fields.
[0108] In any of the foregoing aspects, selecting the candidate set
can include performing a first join operation of the first database
and the second database on the normalized first index field and the
normalized second index field to generate a first intermediate
candidate table, performing a second join operation of the first
database and the second database on a second normalized first index
field and a second normalized second index field to generating a
second intermediate candidate table, and performing a union of the
first intermediate candidate table and the second intermediate
candidate table to generate the table.
[0109] In any of the foregoing aspects, determining can include
applying first criteria to the scores indicative of a match of a
first priority, and applying second criteria to the scores
indicative of a match of a second priority lower than the first
priority, and storing and indication of the match and the priority
of the match.
[0110] In any of the foregoing aspects, comparing values of a first
metadata field to a corresponding second metadata field in a row to
generate a scores can include applying an edit distance to text
values stored in the first and second metadata fields.
[0111] In any of the foregoing aspects, comparing values of a first
metadata field to a corresponding second metadata field in a row to
generate a scores can include determining if a text value stored in
the first metadata field is contained in a text value stored in the
second metadata field.
[0112] In another aspect, an article of manufacture includes at
least one computer storage medium, and computer program
instructions stored on the at least one computer storage medium.
The computer program instructions, when processed by a processing
system of a computer, the processing system comprising one or more
processing units and storage, configures the computer as set forth
in any of the foregoing aspects and/or performs a process as set
forth in any of the foregoing aspects.
[0113] Any of the foregoing aspects may be embodied as a computer
system, as any individual component of such a computer system, as a
process performed by such a computer system or any individual
component of such a computer system, or as an article of
manufacture including computer storage in which computer program
instructions are stored and which, when processed by one or more
computers, configure the one or more computers to provide such a
computer system or any individual component of such a computer
system.
[0114] It should be understood that the subject matter defined in
the appended claims is not necessarily limited to the specific
implementations described above. The specific implementations
described above are disclosed as examples only.
* * * * *