U.S. patent application number 11/745166 was filed with the patent office on 2007-05-07 and published on 2008-11-13 for searching document sets with differing metadata schemata.
Invention is credited to Richard Critchlow, Lijiang Fang, Prakash Sundra Krishnamoorthy, Lei Zhao.
Application Number | 20080281781 11/745166 |
Document ID | / |
Family ID | 39970438 |
Filed Date | 2007-05-07 |
United States Patent
Application |
20080281781 |
Kind Code |
A1 |
Zhao; Lei ; et al. |
November 13, 2008 |
SEARCHING DOCUMENT SETS WITH DIFFERING METADATA SCHEMATA
Abstract
Search and filtering of documents with different metadata
schemata is enabled using a single index that supports a single
schema through decorated namespaces. Each metadata schema submitted
to a system is assigned a unique identifier and property names
associated with the schema are prefixed with the unique identifier.
A single-valued, decorated property is used to indicate whether a
submitted document is part of a registered schema in the system.
The single-valued properties are converted to a search index table
that enables resource-optimized searching and filtering of
documents eliminating documents of other schemata by simply
checking the association property.
Inventors: |
Zhao; Lei; (Sammamish,
WA) ; Krishnamoorthy; Prakash Sundra; (Redmond,
WA) ; Critchlow; Richard; (Seattle, WA) ;
Fang; Lijiang; (Bellevue, WA) |
Correspondence
Address: |
Carl K. Turk;Merchant & Gould P.C.
P.O. Box 2903
Minneapolis
MN
55402-0903
US
|
Family ID: |
39970438 |
Appl. No.: |
11/745166 |
Filed: |
May 7, 2007 |
Current U.S.
Class: |
1/1 ;
707/999.003; 707/E17.095; 707/E17.108 |
Current CPC
Class: |
G06F 16/38 20190101 |
Class at
Publication: |
707/3 ;
707/E17.108 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. A method to be executed at least in part in a computing device
for searching documents with differing metadata schemata, the
method comprising: receiving a customer defined metadata schema;
assigning a unique identifier to the schema; generating an
association property that indicates whether a document is
associated with the schema; storing the association property in a
search index; in response to receiving a document associated with
the schema, modifying the association property to indicate that the
document is associated with the schema; and in response to
receiving a request for a search query, performing the search using
the search index by filtering the entries based on their
association properties.
2. The method of claim 1, further comprising: prefixing names of
properties associated with the schema with the unique
identifier.
3. The method of claim 2, wherein a name of the association
property is also prefixed with the unique identifier.
4. The method of claim 1, wherein the association property is a
single-value property.
5. The method of claim 4, wherein modifying the association
property includes setting a value of the association property to
"true" for the document associated with the schema.
6. The method of claim 1, further comprising: registering the
customer defined schema.
7. The method of claim 1, wherein the query request is received
from a user associated with the customer.
8. The method of claim 7, further comprising: filtering search
results based on a credential of the requesting user.
9. The method of claim 7, further comprising: filtering search
parameters in the query based on a credential of the requesting user
by eliminating schemas to which the user does not have sufficient
permission from the search list.
10. A system for searching documents with differing metadata
schemata, the system comprising: a memory; a processor coupled to
the memory, wherein the processor is configured to: receive a
customer defined metadata schema; assign a unique identifier to the
schema; in response to receiving a document associated with the
schema, generate an association property indicating that the
received document is associated with the schema; store the
association property in a search index; in response to receiving a
request for a search query from a user associated with the
customer, perform the search using the search index by filtering
the entries based on their association properties; and provide
the results of the search to the user.
11. The system of claim 10, wherein the search index further
includes core properties for each document.
12. The system of claim 10, wherein a name of the association
property includes the unique identifier.
13. The system of claim 12, wherein the processor is further
configured to modify names of all properties associated with the
received document to include the unique identifier.
14. The system of claim 10, wherein the processor is further
configured to enable the user to search based on schemas of other
customers based on the user's credentials.
15. The system of claim 10, wherein the association property is a
single-value property.
16. The system of claim 10, further comprising: a metadata store
for storing metadata schemas; and a document store for storing
documents received from customers.
17. A computer-readable storage medium with instructions encoded
thereon for searching documents with differing metadata schemata,
the instructions comprising: receiving a custom metadata schema;
assigning a unique identifier to the schema; in response to
receiving a document associated with the schema, generating a
single-value association property indicating that the received
document is associated with the schema, wherein a name of the
association property includes the unique identifier; storing the
association property in a search index along with core properties
for documents associated with the search index; in response to
receiving a request for a search, performing the search using the
search index by filtering the documents based on their association
properties.
18. The computer-readable storage medium of claim 17, wherein a
value of the association property is set to "true" for a document
associated with the schema and to "false" for a document not
associated with the schema.
19. The computer-readable storage medium of claim 18, wherein
filtering the documents includes eliminating documents with
association property values of "false" from the search.
20. The computer-readable storage medium of claim 17, wherein the
instructions further comprise: enabling a query for performing the
search to define a plurality of schemas to be included in the
search.
Description
BACKGROUND
[0001] Document search in digital libraries, the Internet, and
organizational intranets is best served by a combination of
metadata processing and content searching. Searchers often rely on
content if metadata is absent, erroneous, or incomplete.
Metadata-based searches have their own unique challenges. For
example, large legacy collections combined with budgets
insufficient to permit complete and consistent tagging may mean
that metadata associated with the documents of such collections is
often limited or non-existent. Furthermore, the wide variety of
document types and processing approaches results in non-standardized
ways of using metadata to assign properties to documents. Not only
may different content generators use different types of properties,
but they may use completely different properties (e.g. author,
expiration date, version, and so on). On the other extreme end of
the spectrum, some or all of the documents may be catalog records
consisting entirely of metadata (e.g. in museums, libraries, or
repositories).
[0002] Often for reasons of economy or practicality, a service
platform that provides customers with the service of searching sets
of documents that have been annotated with metadata properties may
not be able to dictate what schema of metadata the customer should
use. In order for the service platform to support multiple
customers with a reasonably sized physical implementation, it is
desired for the service to be able to combine documents from
different customers, and thus with different metadata schemata, in
a single search-engine index without loss of data.
SUMMARY
[0003] This summary is provided to introduce a selection of
concepts in a simplified form that are further described below in
the Detailed Description. This summary is not intended to identify
key features or essential features of the claimed subject matter,
nor is it intended as an aid in determining the scope of the
claimed subject matter.
[0004] Embodiments are directed to enabling search of documents
with different metadata schemata in a single index that supports a
single schema through use of namespaces. According to one
embodiment, the property names associated with a document may be
converted to a decorated version used in the index schema when the
documents are indexed. By using multiple single-valued properties
whose names indicate which schema the document belongs to, use of
a multi-valued property may be avoided in filters. A requested set of
document schemas may be converted into a filter over the properties
of joint tables containing documents of all schemas at query time.
Query semantics are defined based on the properties.
[0005] These and other features and advantages will be apparent
from a reading of the following detailed description and a review
of the associated drawings. It is to be understood that both the
foregoing general description and the following detailed
description are explanatory only and are not restrictive of aspects
as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] FIG. 1 illustrates an example document and its associated
metadata;
[0007] FIG. 2 illustrates an example service platform with search
capability for documents with differing metadata schemata according
to embodiments;
[0008] FIG. 3 illustrates an example metadata schema of a service
platform's core properties associated with differing metadata
schemata from individual customers before the properties are
converted to a single schema index;
[0009] FIG. 4 is an example networked environment, where
embodiments may be implemented;
[0010] FIG. 5 is a block diagram of an example computing operating
environment, where embodiments may be implemented; and
[0011] FIG. 6 illustrates a logic flow diagram of a document search
process based on documents with differing metadata schemata
according to embodiments.
DETAILED DESCRIPTION
[0012] As briefly described above, documents with different
metadata schemata may be searched in a single index that supports a
single schema through use of namespaces, and filtering over
metadata may be accomplished allowing searches to return documents
from one or more schemata. In the following detailed description,
references are made to the accompanying drawings that form a part
hereof, and in which are shown by way of illustrations specific
embodiments or examples. These aspects may be combined, other
aspects may be utilized, and structural changes may be made without
departing from the spirit or scope of the present disclosure. The
following detailed description is therefore not to be taken in a
limiting sense, and the scope of the present invention is defined
by the appended claims and their equivalents.
[0013] While the embodiments will be described in the general
context of program modules that execute in conjunction with an
application program that runs on an operating system on a personal
computer, those skilled in the art will recognize that aspects may
also be implemented in combination with other program modules.
[0014] Generally, program modules include routines, programs,
components, data structures, and other types of structures that
perform particular tasks or implement particular abstract data
types. Moreover, those skilled in the art will appreciate that
embodiments may be practiced with other computer system
configurations, including hand-held devices, multiprocessor
systems, microprocessor-based or programmable consumer electronics,
minicomputers, mainframe computers, and the like. Embodiments may
also be practiced in distributed computing environments where tasks
are performed by remote processing devices that are linked through
a communications network. In a distributed computing environment,
program modules may be located in both local and remote memory
storage devices.
[0015] Embodiments may be implemented as a computer process
(method), a computing system, or as an article of manufacture, such
as a computer program product or computer readable media. The
computer program product may be a computer storage medium readable
by a computer system and encoding a computer program of
instructions for executing a computer process. The computer program
product may also be a propagated signal on a carrier readable by a
computing system and encoding a computer program of instructions
for executing a computer process.
[0016] Referring to FIG. 1, an example document and its associated
metadata are illustrated in diagram 100. The simplest definition of
metadata is that it is data about data. An item of metadata may
describe an individual data item or a collection of data items.
Metadata is used to facilitate the understanding, use and
management of data and varies with the type of data and context of
use. For example, in the context of a library, where the data is
the content of the titles stocked, metadata about a title might
typically include a description of the content, the author, the
publication date and the physical location. Metadata about a
collection of data items, such as a computer file, might typically include
the name of the file, the type of file, and the name of the data
administrator.
[0017] When structured into a hierarchical arrangement, metadata is
more properly called an ontology or schema. Both terms describe
"what exists" for some purpose or to enable some action. For
example, the arrangement of subject headings in a library catalog
serves not only as a guide to finding books on a particular subject
in the stacks, but also as a guide to what subjects "exist" in the
library's own ontology and how more specialized topics are related
to or derived from the more general subject headings. Metadata is
frequently stored in a central location and used to help
organizations standardize their data. This information may be
stored in a metadata registry.
[0018] Usually it may be difficult to distinguish between (raw)
data and metadata because something can be data and metadata at the
same time (e.g. the headline of an article may be both its
title--metadata--and part of its text--data). Furthermore, data and
metadata may exchange their roles. A poem, as such, may be regarded
as data, but if there is a song that used the poem as lyrics, the
whole poem may be attached to an audio file of the song as
metadata. Thus, the labeling depends on the point of view.
[0019] Metadata has many different applications. For example,
metadata may be used to speed up and enrich searching for
resources. In general, search queries using metadata can save users
from performing more complex filter operations manually. It is now
common for web browsers, P2P applications and media management
software to automatically download and locally cache metadata, to
improve the speed at which files can be accessed and searched.
[0020] As shown in FIG. 1, a document 102 may include text, images,
and other embedded objects such as audio files, video files, and
the like. Metadata 104 associated with the document 102 may include
general properties associated with the entire document such as the
name of the author, an expiration date, a version of the document,
and the like. Metadata 104 may also include specific metadata
associated with sections of the document (sometimes called tags)
such as semantic labels associated with specific strings of text,
properties associated with embedded objects, and the like.
[0021] FIG. 2 illustrates an example service platform with search
capability for documents with differing metadata schemata according
to embodiments. Service platforms may take many forms and
configurations. Typically, a service platform is associated with
multiple customers, whose clients are served through the platform
based on the parameters and content provided by each customer. For
example, a product support service for a computer products provider
may provide support documents (and/or online help services) for a
variety of products and components that may be part of the systems
sold by the provider. These products and components may include
hardware and software from various vendors and may involve
licensing and similar permission issues. Thus, a service platform
designed to provide a uniform support experience to the users of
the product support service may receive documents from many sources
utilizing various types of metadata. The documents may include
metadata that conforms to a core schema, but each may also include
metadata that has its own custom schema. Therefore, the service
platform may have to deal with the differing metadata schemata of
the documents when performing a search and filtering results for a
user.
[0022] Example service platform 224 includes document store 218 and
metadata store 220 for storing documents and their metadata
submitted by customers (e.g. customer 212) through the submit
module 216. Search index 222 may be generated to perform efficient
searches on the stored documents and metadata employing filtering
techniques. Provisioning service 214 may manage provisioning of
schemata among various metadata types submitted by different
clients.
[0023] According to one embodiment, customers (e.g. customer 212)
define the metadata schema to be used by the documents before
submitting documents to service platform 224. When these schemata
are submitted through provisioning service 214, the service
platform may assign a unique identifier (sid) to each schema.
Documents may be submitted through submit service 216. The service
may then prefix this unique identifier to the name of each property
in the schema to create a new, namespace decorated name (sid.name)
which may be guaranteed to be unique across all schemata submitted
to service platform 224. A property with this decorated name may
then be created in the search index schema (222). This "decorated
property" may be used for filtering queries.
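By way of illustration only, the name decoration described above may be sketched as follows. The class name SchemaRegistry, the method register_schema, and the "sidN" identifier format are assumptions introduced for this sketch and are not specified by the disclosure.

```python
import itertools


class SchemaRegistry:
    """Hypothetical stand-in for the provisioning service (214)."""

    def __init__(self):
        self._counter = itertools.count(1)
        # Decorated property names known to the search index schema.
        self.index_properties = set()

    def register_schema(self, property_names):
        """Assign a unique identifier (sid) and decorate each property name."""
        sid = f"sid{next(self._counter)}"
        # Prefixing the sid yields names of the form sid.name, which cannot
        # collide across schemata because each sid is issued only once.
        decorated = {name: f"{sid}.{name}" for name in property_names}
        self.index_properties.update(decorated.values())
        return sid, decorated


registry = SchemaRegistry()
sid_a, names_a = registry.register_schema(["author", "version"])
sid_b, names_b = registry.register_schema(["author", "expiration_date"])
```

In this sketch, two customers that both define a property called "author" obtain the distinct index properties sid1.author and sid2.author.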
[0024] When documents are submitted to service platform 224, they
specify which metadata schema they use. Thus, the service platform
can convert the property names to the decorated version used in the
index schema when the documents are indexed. Documents may be
submitted multiple times, with different schemata. This enables the
same document to be shared by multiple customers. Service platform
224 may track which schemata a document is associated with using a
multi-valued metadata property to hold a list of schema names. At
query time, the customer may specify that they wish to search over
documents belonging to only one schema, or to documents belonging
to any of a set of schemata.
[0025] However, filtering over a multi-valued property is
resource-expensive. Therefore, service platform 224 may perform
filtering using a set of single-valued properties by first
automatically creating a single-valued decorated property in the
search index (e.g. with the name sid.IsPartOfSchema) for each
schema that is submitted to the service platform. Whenever a
document is submitted associated with that schema, the value of
that property may be set to "true". By using multiple single-valued
properties whose names indicate which schema the document belongs
to, service platform 224 can avoid using the multi-valued property
in filtering.
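As a minimal sketch of the submission step, assuming an in-memory dictionary stands in for search index 222, a document's property names may be decorated and its association property set when it is submitted. The function name index_document and the sample property names are hypothetical.

```python
IS_PART_OF_SCHEMA = "IsPartOfSchema"


def index_document(index, doc_id, sid, properties):
    """Add or update the index row for a document submitted under schema sid."""
    # One row per document; resubmitting under another schema extends the row.
    row = index.setdefault(doc_id, {})
    for name, value in properties.items():
        row[f"{sid}.{name}"] = value  # decorated property name
    # Single-valued association property for this schema.
    row[f"{sid}.{IS_PART_OF_SCHEMA}"] = True
    return row


index = {}
index_document(index, "doc-1", "sid1", {"author": "Ann"})
index_document(index, "doc-1", "sid2", {"owner": "Support"})  # shared document
index_document(index, "doc-2", "sid2", {"owner": "Sales"})
```

Here doc-1 carries both sid1.IsPartOfSchema and sid2.IsPartOfSchema, while doc-2 has no sid1.IsPartOfSchema entry at all, which this sketch treats as null.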
[0026] At query time, service platform 224 may convert the
requested set of document schemas into a filter over the properties
of joint tables containing documents of all schemas. For documents
that belong to other schema, the core or common properties may be
set to null, such that they do not surface. Similarly, to perform
property based filtering, when an asset is published to some
schemas, but not other schemas, the properties specific to the
schemas that this asset is not published to may be treated as null.
Thus, a single schema search index may be used for performing the
search and filtering on documents with varying schemata.
[0027] A system according to embodiments is not limited to the
example system and its components described above. Searching
documents with differing metadata schemata may be implemented with
fewer or additional components performing additional or fewer tasks
using the principles described herein.
[0028] FIG. 3 illustrates an example metadata schema of a service
platform's core properties associated with differing metadata
schemata from individual customers before the properties are
converted to a single schema index.
[0029] As described above, according to embodiments, documents
include a specification of which metadata schema they use when
submitted to a service platform. The service platform can then
convert the property names to the decorated version used in the
index schema when the documents are indexed.
[0030] At query time, the customer may specify that they wish to
search over documents belonging to only one schema, or to documents
belonging to any of a set of schemata. Documents may share some of
the properties conforming to the service platform's core properties
schema 332. The common properties may include an identifier (asset
identifier), a type (asset type), a name (asset name), and other
common properties.
[0031] In addition to the core properties, each document may
include custom properties according to the customer's own schema
(e.g. schema S1 334, S2 336, and so on). While some of these
schemata may be shared by a portion of the customers, they may also
each be unique to the individual customers. Any combination of
commonality between metadata schemata may be encountered.
[0032] A single-valued decorated property may be created in the
search index (sid.IsPartOfSchema) for each schema that is submitted
to the service platform. For each document submitted to the
platform that is associated with the particular schema, the value
of that property may be set to "true". This way, multiple
single-valued properties with names indicating which schema they
belong to may be employed in place of multi-valued properties for
filtering.
[0033] Following the same example, the service platform may convert
the requested set of document schemas (documents in s1 and s2) into
a filter over the properties sid1.IsPartOfSchema and
sid2.IsPartOfSchema of joint tables containing documents of all
schemas, such as (sid1.IsPartOfSchema=1 or sid2.IsPartOfSchema=1)
at query time. For a document associated with another schema, such
as s3, the values may be set as sid3.IsPartOfSchema=1 but
sid1.IsPartOfSchema=null and sid2.IsPartOfSchema=null. Thus, such a
document is filtered out automatically.
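The query-time conversion may be illustrated, under stated assumptions, with a single joint table in an in-memory SQLite database. The table name search_index, the column layout, and the sample assets are hypothetical; the disclosure does not name a particular storage engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One joint table holds documents of all schemata; each registered schema
# contributes one single-valued association column.
conn.execute(
    "CREATE TABLE search_index ("
    " asset_id TEXT PRIMARY KEY,"
    " asset_name TEXT,"
    ' "sid1.IsPartOfSchema" INTEGER,'
    ' "sid2.IsPartOfSchema" INTEGER,'
    ' "sid3.IsPartOfSchema" INTEGER)'
)
rows = [
    ("a1", "Install guide", 1, 1, None),      # submitted under s1 and s2
    ("a2", "Release notes", None, 1, None),   # submitted under s2 only
    ("a3", "Warranty terms", None, None, 1),  # submitted under s3 only
]
conn.executemany("INSERT INTO search_index VALUES (?, ?, ?, ?, ?)", rows)


def search(requested_sids):
    """Convert the requested schema set into an OR filter over the columns."""
    if not requested_sids:
        return []
    # The sids are assumed to be platform-generated, never raw user input.
    where = " OR ".join(f'"{sid}.IsPartOfSchema" = 1' for sid in requested_sids)
    sql = f"SELECT asset_id FROM search_index WHERE {where} ORDER BY asset_id"
    return [row[0] for row in conn.execute(sql)]
```

With this data, search(["sid1", "sid2"]) returns a1 and a2. Asset a3 is excluded because a comparison of null with 1 is never true in SQL, so no explicit check for the third schema is needed.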
[0034] For property based filtering, when an asset is published to
some schemas, but not to other schemas, the properties specific to
the schemas that this asset is not published to may be treated as
null. For example, asset A1 may be published to S1 and S2, but not
S4, and S4 may have a specific property S4.P1. Then, A1's S4.P1
property would be considered as null.
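The null treatment for property based filtering may be sketched as follows, assuming index rows are dictionaries in which an absent key stands for null. The function name filter_by_property, asset A2, and the property values are assumptions for illustration.

```python
def filter_by_property(index, decorated_name, expected):
    """Return ids of assets whose decorated property equals the expected value."""
    matches = []
    for asset_id, row in index.items():
        value = row.get(decorated_name)  # None when the asset lacks the property
        # A null value never satisfies the filter, mirroring null semantics.
        if value is not None and value == expected:
            matches.append(asset_id)
    return sorted(matches)


index = {
    # A1 is published to S1 and S2 but not S4, so it has no S4.P1 entry.
    "A1": {"S1.IsPartOfSchema": True, "S2.IsPartOfSchema": True, "S1.P1": "draft"},
    # A2 is published to S4 and carries the schema-specific property S4.P1.
    "A2": {"S4.IsPartOfSchema": True, "S4.P1": "draft"},
}

a1_value = index["A1"].get("S4.P1")  # treated as null
s4_matches = filter_by_property(index, "S4.P1", "draft")
```

In this sketch a filter on S4.P1 matches only A2, even though A1 has a same-valued property under a different schema.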
[0035] While specific property indexing and filtering techniques
are used and described, a system according to embodiments is not
limited to the definitions and examples described above. Performing
a document search in a service platform with non-uniform metadata
schemata may be provided using additional or fewer steps and
techniques.
[0036] FIG. 4 is an example networked environment, where
embodiments may be implemented. Document search systems may be
implemented locally on a single computing device or in a
distributed manner over a number of physical and virtual clients
and servers. They may also be implemented in un-clustered systems
or clustered systems employing a number of nodes communicating over
one or more networks (e.g. network(s) 450).
[0037] Such a system may comprise any topology of servers, clients,
Internet service providers, and communication media. Also, the
system may have a static or dynamic topology. The term "client" may
refer to a client application or a client device. While a networked
system implementing a document search system for document sets with
differing metadata schemata may involve many more components,
relevant ones are discussed in conjunction with this figure.
[0038] A document search engine capable of searching documents with
differing metadata schemata according to embodiments may be
implemented as part of a service platform in individual client
devices 441-443 or executed in server 452 and accessed from any one
of the client devices (or applications). Data stores associated
with searchable documents and their metadata may be embodied in a
single data store such as data store 456 or distributed over a
number of data stores associated with individual client devices,
servers, and the like. Dedicated database servers (e.g. database
server 454) may be used to coordinate data retrieval and storage in
one or more of such data stores.
[0039] Network(s) 450 may include a secure network such as an
enterprise network, an unsecure network such as a wireless open
network, or the Internet. Network(s) 450 provide communication
between the nodes described herein. By way of example, and not
limitation, network(s) 450 may include wired media such as a wired
network or direct-wired connection, and wireless media such as
acoustic, RF, infrared and other wireless media.
[0040] Many other configurations of computing devices,
applications, data sources, and data distribution systems may be
employed to implement document searching in an environment with
various metadata schemata. Furthermore, the networked environments
discussed in FIG. 4 are for illustration purposes only. Embodiments
are not limited to the example applications, modules, or
processes.
[0041] FIG. 5 and the associated discussion are intended to provide
a brief general description of a suitable computing environment in
which embodiments may be implemented. With reference to FIG. 5, a
block diagram of an example computing operating environment is
illustrated, such as computing device 560. In a basic
configuration, the computing device 560 may be a server providing
document search service and typically include at least one
processing unit 562 and system memory 564. Computing device 560 may
also include a plurality of processing units that cooperate in
executing programs. Depending on the exact configuration and type
of computing device, the system memory 564 may be volatile (such as
RAM), non-volatile (such as ROM, flash memory, etc.) or some
combination of the two. System memory 564 typically includes an
operating system 565 suitable for controlling the operation of a
networked personal computer, such as the WINDOWS.RTM. operating
systems from MICROSOFT CORPORATION of Redmond, Wash. The system
memory 564 may also include one or more software applications such
as program modules 566, service platform 582, and search engine
584.
[0042] Service platform 582 may be an individual application or a
cluster of interacting applications that provides a variety of
services to clients associated with computing device 560. Search
engine 586 may perform document searches and filtering on document
sets with differing metadata schemata, as described previously.
This basic configuration is illustrated in FIG. 5 by those
components within dashed line 568.
[0043] The computing device 560 may have additional features or
functionality. For example, the computing device 560 may also
include additional data storage devices (removable and/or
non-removable) such as, for example, magnetic disks, optical disks,
or tape. Such additional storage is illustrated in FIG. 5 by
removable storage 569 and non-removable storage 570. Computer
storage media may include volatile and nonvolatile, removable and
non-removable media implemented in any method or technology for
storage of information, such as computer readable instructions,
data structures, program modules, or other data. System memory 564,
removable storage 569 and non-removable storage 570 are all
examples of computer storage media. Computer storage media
includes, but is not limited to, RAM, ROM, EEPROM, flash memory or
other memory technology, CD-ROM, digital versatile disks (DVD) or
other optical storage, magnetic cassettes, magnetic tape, magnetic
disk storage or other magnetic storage devices, or any other medium
which can be used to store the desired information and which can be
accessed by computing device 560. Any such computer storage media
may be part of device 560. Computing device 560 may also have input
device(s) 572 such as keyboard, mouse, pen, voice input device,
touch input device, etc. Output device(s) 574 such as a display,
speakers, printer, etc. may also be included. These devices are
well known in the art and need not be discussed at length here.
[0044] The computing device 560 may also contain communication
connections 576 that allow the device to communicate with other
computing devices 578, such as over a wireless network in a
distributed computing environment, for example, an intranet or the
Internet. Other computing devices 578 may include server(s) that
provide access to document stores, user information, metadata, and
so on. Communication connection 576 is one example of communication
media. Communication media may typically be embodied by computer
readable instructions, data structures, program modules, or other
data in a modulated data signal, such as a carrier wave or other
transport mechanism, and includes any information delivery media.
The term "modulated data signal" means a signal that has one or
more of its characteristics set or changed in such a manner as to
encode information in the signal. By way of example, and not
limitation, communication media includes wired media such as a
wired network or direct-wired connection, and wireless media such
as acoustic, RF, infrared and other wireless media. The term
computer readable media as used herein includes both storage media
and communication media.
[0045] The claimed subject matter also includes methods. These
methods can be implemented in any number of ways, including the
structures described in this document. One such way is by machine
operations, of devices of the type described in this document.
[0046] Another optional way is for one or more of the individual
operations of the methods to be performed in conjunction with one
or more human operators performing some. These human operators need
not be collocated with each other, but each can be only with a
machine that performs a portion of the program.
[0047] FIG. 6 illustrates a logic flow diagram of a document search
process 600 based on documents with differing metadata schemata
according to embodiments. Process 600 may be implemented as part of
a document search service.
[0048] Process 600 begins with operation 602, where a customer
defined metadata schema is received. Processing advances from
operation 602 to operation 604.
[0049] At operation 604, a unique identifier is assigned to the
schema by the service platform. Processing continues to operation
606 from operation 604.
[0050] At operation 606, the unique identifier is prefixed to the
name of each property in the schema to create a new, namespace
decorated name such that each decorated name is ensured to be
unique across all schemata submitted to the service platform.
Processing moves to operation 608 from operation 606.
[0051] At operation 608, a single-valued decorated property is
created in the search index for each submitted schema that
indicates whether a document that includes this property is part of
the schema or not. In other embodiments, a multi-valued property
may also be utilized. Processing moves to operation 610 from
operation 608.
[0052] At operation 610, the single-valued decorated property is
set to "true" for each document submitted by the customer that is
associated with the schema submitted in operation 602. Operation
610 completes a first portion of the process of using single-valued
decorated properties for searching documents with differing
metadata schemata. A second portion of the process, which is loosely
coupled to the first portion as indicated by the dashed line in
FIG. 6, begins at operation 620 following operation 610.
[0053] At operation 620, a query request is received by the
platform that necessitates search of document sets previously
associated with submitted metadata schemata as described above.
Processing advances from operation 620 to operation 622.
[0054] At operation 622, the requested set of document schemata is
converted into a filter over the properties of joint tables
containing documents of all schemata. Because the decorated
single-valued property indicating document-schema association
indicates whether a document is using a particular schema or not,
documents of other schemata can easily be eliminated from the
search list. After operation 622, processing moves to a calling
process for further actions.
[0055] The operations included in process 600 are for illustration
purposes. Document search in a service platform on documents with
differing metadata schemata may be implemented by similar processes
with fewer or additional steps, as well as in different order of
operations using the principles described herein.
[0056] The above specification, examples and data provide a
complete description of the manufacture and use of the composition
of the embodiments. Although the subject matter has been described
in language specific to structural features and/or methodological
acts, it is to be understood that the subject matter defined in the
appended claims is not necessarily limited to the specific features
or acts described above. Rather, the specific features and acts
described above are disclosed as example forms of implementing the
claims and embodiments.
* * * * *