System for information storage, retrieval and voice based content search and methods thereof Suresh, Narasimha ; et al. [Bhide, Sudarshan]

Patent Application Summary

U.S. patent application number 10/108875, for a system for information storage, retrieval and voice based content search and methods thereof, was filed with the patent office on 2002-03-27 and published on 2003-09-11. Invention is credited to Bhide, Sudarshan, and Suresh, Narasimha.

Application Number	20030171926 10/108875
Family ID	29434395
Filed Date	2002-03-27

United States Patent Application	20030171926
Kind Code	A1
Suresh, Narasimha ; et al.	September 11, 2003

System for information storage, retrieval and voice based content search and methods thereof

Abstract

An information retrieval system for voice-based applications enabling voice based content search is provided. The system comprises a remote communication device for communication through a telecommunication network, a data storage server for storing data and an adaptive indexer interfacing with a speech recognition platform. Further the adaptive indexer is coupled to a content extractor. The adaptive indexer indexes the contents in configured manner and the local memory stores the link to the indexed contents. The speech recognition platform recognizes the voice input with the help of a dynamic grammar generator and the results thereof is encapsulated into a markup language document. Employing the speech recognition results a search is performed by a search engine using the indexed contents and the results is returned to the originator of the search input. Systems are provided to perform the methods.

Inventors:	Suresh, Narasimha; (Bangalore, IN) ; Bhide, Sudarshan; (Pune, IN)
Correspondence Address:	WELLS ST. JOHN ROBERTS GREGORY & MATKIN P.S. 601 W. FIRST AVENUE SUITE 1300 SPOKANE WA 99201-3828 US
Family ID:	29434395
Appl. No.:	10/108875
Filed:	March 27, 2002

Current U.S. Class:	704/270.1 ; 704/E15.045
Current CPC Class:	H04M 2201/40 20130101; H04M 3/4938 20130101; G10L 15/26 20130101; G10L 15/193 20130101
Class at Publication:	704/270.1
International Class:	G10L 021/00

Foreign Application Data

Date	Code	Application Number
Mar 7, 2002	IN	220/MUM/2002

Claims

What is claimed is:

1. A system comprising: a remote communications device configured to communicate through a telecommunication network; a base station in communication with the remote communications device, the base station having a data storage server for storing data, an information retrieval system having an adaptive indexer and a speech recognition platform interfacing with the adaptive indexer; the base station being configured to selectively communicate with the remote communications device, wherein the system is configured to perform voice based content search using the speech recognition platform and the information retrieval system.

2. A system according to claim 1 wherein the remote communications device comprises any device capable of communicating through a telecommunication network.

3. The system according to claim 1 wherein the remote communications device comprises a mobile phone.

4. The system according to claim 1 wherein the base station is configured to perform a search in response to a voice based search request from the remote communications device.

5. The system according to claim 1 wherein the base station is configured to provide voice based search results to the remote communications device.

6. A system for information retrieval and voice based content search, the system comprising: a remote communications device configured to communicate through a telecommunication network; a base station selectively in communication with the remote communications device, the base station having: an information retrieval system comprising a server storage configured to store contents; a content extractor configured to extract contents from the server storage; an adaptive indexer configured to adaptively index contents extracted by the content extractor; a core indexer configured to collect textual information from the extracted contents; an index configurator configured to configure the adaptive indexer using the extracted contents; a content cataloguer configured to catalogue the indexed contents; an index re-shuffler configured to periodically reshuffle the indexed contents; a local memory configured to store contents, the memory positioned proximally to the storage adapter; a storage adapter configured to provide access to the contents stored in the local memory; a dynamic grammar generator configured to generate speech recognition grammar; a voice information retrieval interface operatively interfacing with the dynamic grammar generator; a speech recognition platform interfacing with the voice information retrieval interface; a markup language generator/parser configured to create and interpret contents using voice mark up languages, and wherein the base station further comprising a search engine coupled to the voice information retrieval interface, the adaptive indexer operatively connected to the content extractor, the content extractor configured to perform indexing of contents extracted from the remote server storage; the core indexer extracts textual matter from the contents, the contents being catalogued by a content cataloguer, indexed contents being stored in the local memory, the storage adapter configured to provide access to the contents stored in the local memory, the dynamic grammar generator configured to generate speech recognition grammar, the markup language generator configured to wrap the grammar into a markup language document, the voice information retrieval interface configured to send the markup language document to the speech recognition platform, the speech recognition platform configured to use the document received from the information retrieval interface to recognizing the user input, the speech recognition platform returns the results thereof to the search engine, the search engine configured to perform search using the speech recognition results and the indexed contents and returns the results thereof as a markup language document to the speech recognition platform.

7. The system according to claim 6 wherein the local memory comprises a hard drive, a floppy diskette or a compact diskette.

8. The system according to claim 6 wherein the base station is configured to perform a search in response to a voice based search request from the remote communications device.

9. The system according to claim 6 wherein the base station is configured to provide voice based search results to the remote communications device.

10. The system according to claim 6 wherein the core indexer is configured to extract textual data from emails.

11. The system according to claim 6 wherein the core indexer is configured to extract textual data from scanned documents.

12. The system according to claim 6 wherein the core indexer is configured to extract textual data from any of the word processor documents.

13. The system according to claim 6 wherein the base station is configured to define algorithms to integrate with application development standards for Voice XML.

14. The system according to claim 6 wherein the system is configured to define algorithms to integrate with application development standards for SALT.

15. An adaptive indexing system configured to adapt indexing contents for use in an information retrieval system, the system comprising: an adaptive indexer configured to index contents; a core indexer configured to implement textual extraction from contents forwarded by the adaptive indexer; an index re-shuffler configured to at times reshuffle contents; an index configurator configured to index the contents received by the adaptive indexer employing a plurality of configuration parameters; an index cataloguer interfacing with the adaptive indexer configured to perform cataloguing of the contents and maintaining a per-user catalogue configured for a specific content type wherein the index cataloguer is configured to selectively load the indices upon receipt of a search request; a duplicate word remover configured to remove duplicate words from the indexed contents; a local memory configured to store contents, the memory positioned proximally to the storage adapter; a storage adapter configured to provide access to the contents stored in the local memory; an exclusion dictionary configured to exclude irrelevant words from the indexed contents; a dynamic grammar generator configured to generate speech recognition grammar and wherein the adaptive indexer coupled to the index configurator, the core indexer and the storage adaptor indexes the contents to define a user index and a common index, the grammar generator configured to process search requests to conduct searches using the user indexes and the common indexes and performs context sensitive selective loading of indices.

16. The system according to claim 15 wherein the user index being per user index maintained in the local memory.

17. The system according to claim 15 wherein the common index comprises words common to source messages.

18. The system according to claim 15 wherein the common index comprises per-catalogue common index and global common index.

19. The system according to claim 15 wherein a programming interface is configured to create a document template for any of the configured contents.

20. The system according to claim 15 wherein the adaptive indexer uses CPU's idle time thus enabling optimal utilization of resources.

21. The system according claim 15 wherein the index provides links to original documents stored on the remote server storage, the links contain access information for an identified document.

22. The system according to claim 15 wherein the index re-shuffler is a periodic processor that maintains a clean index.

23. The system according to claim 15 wherein the per-user index and the common index are used to create the speech recognition grammar.

24. The system according to claim 15 wherein the speech recognition grammar is generated by the dynamic grammar generator with platform interoperability.

25. The system according to claim 15 wherein the dynamic grammar generator uses the index catalogs for selective loading of index; selective loading being dependent on the user-context.

26. The system according to claim 15 wherein the base station is configured to define algorithms to integrate with application development standards for voice based markup languages.

27. The system according to claim 15 wherein an optical character recognizer is configured to extract text matter from a scanned document content source.

28. The system according to claim 15 wherein an exclusion dictionary is configured to exclude unidentified word contents for purposes of indexing.

29. The system according to claim 15 wherein the said core indexer for scanned documents is configured to perform thresholding for reducing the sampling depth of an image.

30. A method for voice based content search and information retrieval; the method comprising: sending a voice based search request by a device capable of communicating through a telecommunication network, receiving the voice based search input by a speech recognition platform, establishing a search session by the speech recognition platform conjointly with a voice information retrieval interface, generating a dynamic grammar in respect of the search input by a dynamic grammar generator, encapsulating the dynamic grammar into a voice markup language document by a markup language generator, sending the voice markup language document containing the dynamic grammar generator to the speech recognition platform, performing a speech recognition test by the speech recognition platform and returning the test results thereof to the voice information retrieval interface, conducting a search using the test results by a search engine at the local memory and employing the indexed content, providing the search results as a voice markup language documents to the speech recognition platform and returning the search results to the originator of the search input.

Description

TECHNICAL FIELD

[0001] This invention in general relates to communication systems including information storage and retrieval mechanisms. More particularly, the invention relates to voice recognition systems and methods and to information storage and retrieval systems and methods.

BACKGROUND OF THE INVENTION

[0002] Users of hand-held communication devices such as mobile telephones have been accessing searchable electronic databases with considerably increasing frequency. However, a number of factors limit the utility of a system that enables such users to access databases for retrieval of information. This is particularly so when the end user employs devices such as mobile telephones, internet-capable mobile phones, or Personal Digital Assistants with wireless capability to access a generic database catering to a variety of requirements. The limited system capabilities of these devices pose a major impediment to quick and easy access to the target data that the end user is looking for. The limiting factors of a hand-held device include limited rendering capabilities as compared to Personal Computers, form factor, the absence of a Graphical User Interface on a telephone, and limited processing power.

[0003] Conventional art employing telephonic devices for data access employs voice as the only medium for presenting information. A conventional system in which the user provides input and receives output through a telephone is an Interactive Voice Response (IVR) system, wherein the user is presented with a menu in the form of a voice file. The user responds to the menu by pressing a digit on the telephone instrument. This response is then processed by the system and the result is dispatched to the user, again in the form of a voice file. This system is suitable for applications having limited options to choose from (e.g. a telephone based banking service).

[0004] However, for applications that require more detailed inputs from the user, this system becomes cumbersome to use. This necessitates the use of voice recognition to accept input from the user. The user can speak out what he wants from the system and the system will respond accordingly. But the use of voice recognition alone does not resolve all technical problems associated with a data storage and retrieval system for telephony applications. For example, a further complexity stems from the generic nature of the data stored and the multiplicity of end users looking for speedy retrieval of targeted information. Thus there are issues associated with the system when a variety of content is generated and accessed. Factors such as performance, resource utilization (processing power and memory requirement) and voice recognition further reduce the likelihood of application providers offering such a system.

[0005] Existing solutions for voice-based search cater to specific search needs. They are built for specific applications and as such are well designed for those applications. However, this specialization limits the spectrum of content that can be searched using voice.

[0006] Current speech applications include Voice XML, the Voice Extensible Markup Language. Voice XML is designed for creating audio dialogs that feature synthesized speech, digitized audio, and recognition of spoken and Dual-Tone Multi-Frequency (DTMF, also known as Touch Tone) key input. DTMF is commonly used in remote control applications that use telephones. Examples of these applications are accessing your messages from an answering machine and retrieving your account balance information from your bank database. Voice XML also has applications for recording of spoken input, telephony, and mixed-initiative conversations. The Voice XML standard is described in detail at www.voicexmlreview.org. The World Wide Web Consortium [W3C] has brought out specifications of a revised speech recognition grammar format aimed at enhancing the interoperability of Voice XML browsers and Voice XML applications. This W3C speech recognition grammar format is described in detail at www.w3.org. The Voice XML 1.0 version employs Java Speech Grammar Format [JSGF]. Current versions of Voice XML employ mostly native grammar formats of the speech recognizer embodied in the browser. The Voice XML version 2.0 provides grammar interoperability [www.w3.org/TR/speech-grammar].

[0007] Speech Application Language Tags [SALT] is another speech interface markup language, which comprises a small set of XML elements. SALT can be used with Hyper Text Mark-up Language [HTML] and other standards to write speech interfaces for voice-only or multimodal applications. The SALT standard is described in detail at www.saltforum.org.

[0008] Advances in voice-recognition technologies have made it easier for end-users to access an increasing amount of data through voice, since the number of applications that are being voice-enabled is increasing. However, this means that users have to go through larger and larger volumes of data to reach the information they want. Given the limited rendering capabilities of the telephone, users must be able to search for the specific information they want.

SUMMARY OF THE INVENTION

[0009] The invention provides a system for information storage, retrieval and voice based content search. The system comprises a remote communications device configured to communicate through a telecommunication network; a base station in communication with the mobile device, the base station having a data storage server for storing data, an information retrieval system having an adaptive indexer and a speech recognition platform interfacing with the adaptive indexer; the base station, being remote from the communication device, selectively communicates with the communication device, wherein the system is configured to perform voice based content search using the speech recognition platform and the information retrieval system.

[0010] Another aspect of the invention provides a system for information retrieval and voice based content search, the system comprising a remote communications device configured to communicate through a telecommunication network, a base station in communication with the mobile device, the base station having an information retrieval system comprising a server storage for storing contents, a content extractor for extracting contents from the server storage, an adaptive indexer for adaptively indexing contents extracted by the content extractor, a core indexer for collecting textual information from the extracted contents, an index configurator for configuring the adaptive indexer using the extracted contents, a content cataloguer for cataloguing the indexed contents, an index re-shuffler for periodical reshuffling of the indexed contents, a local memory for storing contents, the memory positioned proximally to the storage adapter, a storage adapter configured to provide access to the contents stored in the local memory, a dynamic grammar generator configured to generate speech recognition grammar, a voice information retrieval interface operatively interfacing with the dynamic grammar generator, a speech recognition platform interfacing with the voice information retrieval interface, a markup language generator/parser configured to create and interpret contents using voice mark up languages, and wherein the base station further comprising a search engine coupled to the voice information retrieval interface, the adaptive indexer operatively connected to the content extractor, the content extractor configured to perform indexing of contents extracted from the remote server storage, the core indexer extracts textual matter from the contents, the contents being catalogued by a content cataloguer, indexed contents being stored in the local memory, the storage adapter configured to provide access to the contents stored in the local memory, the dynamic grammar generator configured to generate speech recognition grammar, the markup language generator configured to wrap the grammar into a markup language document, the voice information retrieval interface configured to send the markup language document to the speech recognition platform, the speech recognition platform configured to use the document received from the information retrieval interface to recognizing the user input, the speech recognition platform returns the results thereof to the search engine, the search engine configured to perform search using the speech recognition results and the indexed contents and returns the results thereof as a markup language document to the speech recognition platform.

[0011] In yet another aspect the invention provides an adaptive indexing system for adaptively indexing contents for use in an information retrieval system, the system comprising an adaptive indexer configured to index contents, a core indexer configured to implement textual extraction from contents forwarded by the adaptive indexer, an index re-shuffler configured to at times reshuffle contents, an index configurator for indexing the contents received by the adaptive indexer employing a plurality of configuration parameters, an index cataloguer interfacing with the adaptive indexer configured to perform cataloguing of the contents and maintaining a per-user catalogue configured for a specific content type wherein the index cataloguer is configured to selectively load the indices upon receipt of a search request, a duplicate word remover for removing duplicate words from the indexed contents, a local memory for storing contents, the memory positioned proximally to the storage adapter, a storage adapter configured to provide access to the contents stored in the local memory, an exclusion dictionary configured to exclude irrelevant words from the indexed contents, a dynamic grammar generator configured to generate speech recognition grammar and wherein the adaptive indexer coupled to the index configurator, the core indexer and the storage adaptor indexes the contents to define a user index and a common index, the grammar generator configured to process search requests to conduct searches using the user indexes and the common indexes and performs context sensitive selective loading of indices.

[0012] In still another aspect the invention provides for a method for voice based content search and information retrieval; the method comprising sending a voice based search request by a device capable of communicating through a telecommunication network, receiving the voice based search input by a speech recognition platform, establishing a search session by the speech recognition platform conjointly with a voice information retrieval interface, generating a dynamic grammar in respect of the search input by a dynamic grammar generator, encapsulating the dynamic grammar into a voice markup language document by a markup language generator, sending the voice markup language document containing the dynamic grammar generator to the speech recognition platform, performing a speech recognition test by the speech recognition platform and returning the test results thereof to the voice information retrieval interface, conducting a search using the test results by a search engine at the local memory and employing the indexed content, providing the search results as a voice markup language documents to the speech recognition platform and returning the search results to the originator of the search input.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] Preferred embodiments of the invention are described below with reference to the following accompanying drawings.

[0014] FIG. 1 is a block diagram illustrating a system embodying the invention.

[0015] FIG. 2 is a block diagram illustrating more details of some of the components included in the system of FIG. 1.

[0016] FIG. 3 is a diagram illustrating the base station as embodied in the system of FIG. 1.

[0017] FIG. 4 is a diagram illustrating the adaptive indexer configured with content sources.

[0018] FIG. 5 is a block diagram illustrating emails, scanned documents and word processor documents as source contents.

[0019] FIG. 6 is a diagram illustrating emails as the content source, as embodied in the system of FIG. 4.

[0020] FIG. 7 is a diagram illustrating a scanned page as the data source, as embodied in the system of FIG. 4.

[0021] FIG. 8 is a diagram illustrating a word processor document as the data source, as embodied in the system of FIG. 4.

[0022] FIG. 9 illustrates a conventional inverted indexing mechanism adapted to email indexing.

[0023] FIG. 10 illustrates a sample index generated for the sources: email, scanned pages, word processor documents.

[0024] FIGS. 11-A, 11-B and 11-C are flowcharts illustrating the method of operation of the systems shown in FIG. 1 and FIG. 2.

[0025] FIG. 12 illustrates the indexing process for generic content sources.

[0026] FIG. 13 illustrates the primary indexing process for generic content sources.

[0027] FIG. 14 illustrates the primary indexing process for email content sources.

[0028] FIGS. 15-A and 15-B illustrate the primary indexing process for scanned pages content sources.

[0029] FIG. 16 illustrates the indexing process for word processor document content sources.

[0030] FIG. 17 illustrates the secondary indexing process.

[0031] FIG. 18 illustrates the search process for email content sources.

DETAILED DESCRIPTION OF THE INVENTION

[0032] FIG. 1 illustrates the components and their major interactions in the system. The user 100 interfaces with the base station 110 through a communication network 120. The base station 110 comprises speech recognition platform 130, the adaptive indexer 140 and remote server storage 150.

[0033] FIG. 2 illustrates a more detailed interaction of the components of FIG. 1. The speech recognition platform 130 is operatively connected with the adaptive indexer 140, which in turn is operatively coupled to the remote server storage 150.

[0034] FIG. 3 shows the remote server storage 150. The server storage 150 comprises storage locations for content (e.g. email server, document management system, etc). The content extractor 160 extracts content from the remote storage 150 in various formats. The adaptive indexer 140 then indexes all the incoming documents by forwarding the content to the respective core indexers 170 for the content type, to extract the relevant textual information from the document. The index data is then catalogued by the content cataloguer 190 and stored in the local memory 210 by the storage adapter 200, along with the access information for the documents. The local memory 210 can be, for example, a hard drive, optical disk, random access memory, read only memory, flash memory, or any other appropriate type of memory. The speech recognition platform 130 establishes a search session with the system through its Voice Information Retrieval Interface [VIR Interface] 220. Upon a search request, the dynamic grammar generator 230 loads the user index and generates a grammar for the search request. This grammar is then encapsulated in a voice based markup language document by the Markup generator/parser 240. The VIR Interface 220 sends this voice based markup language document to the external speech recognition platform 130, which performs recognition and returns the user input. The search engine 250 uses this input and the user index to perform the search. Search hits are returned to the speech recognition platform 130 as a voice based markup language document.
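
By way of illustration only, the step in which the dynamic grammar generator 230 produces a grammar from the user index and the Markup generator/parser 240 wraps it in a voice based markup language document may be sketched as follows. The sketch is in Python and assumes a VoiceXML 2.0 style layout with an inline XML grammar; the function name `build_grammar_document`, the field name and the sample words are hypothetical, and the specification does not prescribe the exact markup that is generated.

```python
import xml.etree.ElementTree as ET


def build_grammar_document(index_words, field_name="query"):
    """Wrap a word list in a VoiceXML-style document with an inline grammar.

    The element layout is an assumption made for illustration; the source
    does not specify the markup it generates.
    """
    vxml = ET.Element("vxml", {"version": "2.0"})
    form = ET.SubElement(vxml, "form")
    field = ET.SubElement(form, "field", {"name": field_name})
    grammar = ET.SubElement(field, "grammar", {"root": "term", "version": "1.0"})
    rule = ET.SubElement(grammar, "rule", {"id": "term"})
    one_of = ET.SubElement(rule, "one-of")
    # Sort and de-duplicate so the generated grammar is stable between runs.
    for word in sorted(set(index_words)):
        ET.SubElement(one_of, "item").text = word
    return ET.tostring(vxml, encoding="unicode")


# Hypothetical words loaded from a user index.
document = build_grammar_document(["invoice", "meeting", "invoice", "budget"])
```

In this sketch the recognizer would be constrained to the words present in the loaded index, which reflects the stated aim of keeping the searchable vocabulary small.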

[0035] The index configurator 260 is employed to configure the indexer. The content extractor 160 is configured to extract textual data from content sources and data types. The index re-shuffler 180 is configured to optimize index storage. The Hyper-Text Transfer Protocol Server [HTTP Server] 270 is used by the VIR Interface 220 to accept requests from the speech recognition platform 130. Remote Server Storage 150 is the location where the message/content is physically stored. The present invention does not store the actual content in the local memory; instead, it maintains links to the exact location of a document on the remote storage. Examples of remote storage include a mail server, a document management system or a hard disk. The index configurator 260 is used for configuration of contents. Since content can come from any source, the exact details of the source need to be specified. Configuration parameters include content type, content source and access details. For instance, in the case of email content, details corresponding to standard email access protocols such as IMAP (Internet Message Access Protocol) and POP3 (Post Office Protocol Version 3) must be provided. A detailed description and specification of the IMAP protocol can be found at the Internet address: http://www.imap.org. A detailed description and specification of the POP3 protocol can be found at the Internet address: http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1939.html. Details to be given include server details, user-id and password. The content extractor 160 uses a polling mechanism for importing content.
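
As a non-limiting sketch, the configuration parameters named above (content type, access protocol, server details, user-id and password) and the polling behaviour of the content extractor 160 might be represented as shown below. The class `ContentSourceConfig`, the function `poll_new_items`, the server name and the message identifiers are all hypothetical; the specification does not define a data format or a polling interval.

```python
from dataclasses import dataclass, field


@dataclass
class ContentSourceConfig:
    content_type: str                    # e.g. "email"
    protocol: str                        # e.g. "IMAP" or "POP3"
    server: str
    user_id: str
    password: str = field(repr=False)    # keep credentials out of printed output


def poll_new_items(list_item_ids, seen_ids):
    """Return ids not seen on earlier polls and record them as seen."""
    new_ids = [item_id for item_id in list_item_ids() if item_id not in seen_ids]
    seen_ids.update(new_ids)
    return new_ids


# Hypothetical configuration for one user's mailbox.
config = ContentSourceConfig("email", "IMAP", "mail.example.com", "user-a", "secret")

seen = set()
# Each lambda stands in for a listing call against the remote storage.
first = poll_new_items(lambda: ["m1", "m2"], seen)
second = poll_new_items(lambda: ["m1", "m2", "m3"], seen)
```

Only the identifiers of newly arrived items are returned on each poll, so the indexer would be handed each document once.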

[0036] FIG. 4 illustrates the employment of the adaptive indexer 140 for a content source. The adaptive indexer 140 is employed to index content. The adaptive indexer 140 is responsible for indexing all the incoming documents coming in from Content Source for User 280, cataloguing the indices and storing these indices in the local memory, which can be, for example, a hard drive, floppy disk, optical disk, random access memory, read only memory, flash memory, or any other appropriate type of memory. For a voice-based content search system, the amount of searchable data should be kept to a minimum given the resource requirements for speech recognition. The present invention solves this problem through cataloguing of indices. The adaptive indexer 140 can be configured with the required types of content. Core indexing for each configured content type is implemented in a separate core indexer 170, which is referenced by the adaptive indexer 140. As a result, the adaptive indexer 140 consists of core indexers 170 and request delegating mechanisms for core indexing. Cataloguing updates the index in the per-user catalog for the content source 280 and the common index 300. These catalogs are stored in the local memory 210.

[0037] In FIG. 5, the adaptive indexer 140 is configured for email source 310, scanned pages source 320 and word processor documents source 330, as the content sources. The adaptive indexer 140 delegates indexing operations to the respective core indexers 170, i.e. email core indexer 340, scanned pages core indexer 350 and word processor document core indexer 360. Each of these core indexers generates an index for the respective content, and the index is updated in the respective catalog, i.e. email catalog 370, scanned pages catalog 380 and word processor documents catalog 390. Common index elements are updated in the common index 300.
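
The delegation from the adaptive indexer 140 to per-type core indexers 170 described above may be sketched, purely as an example, in the following way. The class and method names, the tokenizer and the link strings are hypothetical; the sketch stores links to documents rather than their content, and it omits the common index 300.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class AdaptiveIndexer:
    def __init__(self):
        # content type -> function that extracts text (the "core indexer")
        self.core_indexers = {}
        # (user, content type) -> word -> set of document links
        self.catalogs = defaultdict(lambda: defaultdict(set))

    def register(self, content_type, extract_text):
        self.core_indexers[content_type] = extract_text

    def index(self, user, content_type, link, content):
        # Raises KeyError if the content type has not been configured.
        extract_text = self.core_indexers[content_type]
        for word in tokenize(extract_text(content)):
            self.catalogs[(user, content_type)][word].add(link)


indexer = AdaptiveIndexer()
indexer.register("email", lambda msg: msg["subject"] + " " + msg["body"])
indexer.register("word_processor", lambda doc: doc)

# Hypothetical documents and links.
indexer.index("userA", "email", "imap://inbox/1",
              {"subject": "Budget", "body": "Q3 budget review"})
indexer.index("userA", "word_processor", "dms://docs/7", "Budget notes")
```

Keeping one catalog per user and content type is what would later allow a search to load a single catalog instead of the whole index.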

[0038] The embodiments embodying the indexing of emails, scanned pages and word processor documents have been illustrated in FIG. 5, FIG. 6, and FIG. 7 respectively.

[0039] In FIG. 5, Adaptive Indexer 140 receives email content from email Source 310. Adaptive Indexer 140 determines the content type and forwards it to the email core indexer 340, which performs core indexing and updates the email catalog 370 and common index 300. The email catalog 370 and common index 300 are then stored in the local memory 210.

[0040] In FIG. 6, Adaptive Indexer 140 receives a scanned page from scanned pages source 320. The content is forwarded to the scanned-pages core indexer 350, which performs thresholding 400 and Optical Character Recognition 410 operations on the image to extract text. Thresholding reduces the sampling depth of an image. This technique is used here to convert a color image into a bi-tonal form. The text is then indexed and catalogued in the per-user scanned pages catalog 380 and common index 300. The catalogs are then updated in the local memory 210.

[0041] In FIG. 7, Adaptive Indexer 140 receives a word processor document from Word Processor Document Source 330 and forwards it to the word processor document core indexer 360. The core indexer extracts text from the document, indexes it, and updates the per-user document catalog 390 and common index 300. The catalogs are then updated in the local memory.
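The delegation described for the three content sources can be pictured as a lookup from content type to core indexer. The following Python sketch is illustrative only; the class name `AdaptiveIndexer`, the methods `register` and `index`, the whitespace tokenizer and the document reference strings are all assumptions introduced for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# A core indexer (hypothetical signature) turns raw content into a list of words.
CoreIndexer = Callable[[str], List[str]]


class AdaptiveIndexer:
    """Illustrative only: routes content to the core indexer configured for its type."""

    def __init__(self) -> None:
        self._core_indexers: Dict[str, CoreIndexer] = {}
        # catalogs[user][content_type][word] -> set of document references
        self.catalogs: Dict[str, Dict[str, Dict[str, Set[str]]]] = defaultdict(
            lambda: defaultdict(lambda: defaultdict(set))
        )

    def register(self, content_type: str, indexer: CoreIndexer) -> None:
        """Configure the adaptive indexer with a core indexer for one content type."""
        self._core_indexers[content_type] = indexer

    def index(self, user: str, content_type: str, doc_ref: str, content: str) -> None:
        """Delegate to the matching core indexer and update the per-user catalog."""
        if content_type not in self._core_indexers:
            raise ValueError(f"no core indexer configured for {content_type!r}")
        words = self._core_indexers[content_type](content)
        catalog = self.catalogs[user][content_type]
        for word in words:
            catalog[word].add(doc_ref)


def simple_text_indexer(content: str) -> List[str]:
    # Assumed tokenizer: upper-cased whitespace split.
    return content.upper().split()


indexer = AdaptiveIndexer()
indexer.register("EMAIL", simple_text_indexer)
indexer.index("userA", "EMAIL", "email:17", "pedometer invoice")
```

Supporting a further content type in this sketch only requires registering another core indexer, which mirrors the configurability described above.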

[0042] The adaptive indexer 140 interfaces with index re-shuffler 180, referring to FIG. 3. Since documents may enter or leave the remote storage locations at any time, the behavior of the index should be highly dynamic in order to reflect the changes in remote server storage 150. The index re-shuffler 180 achieves this. It periodically cross-checks the index with the documents on the remote server storage 150 and updates the index accordingly. For instance, if an email message is deleted by the user, the index re-shuffler 180 removes the words contained exclusively in that email message from the email catalog of the user index. This maintains the index at an optimal level.
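A minimal Python sketch of the cross-check performed by the index re-shuffler is given below. It assumes a catalog is a mapping from each word to the set of document references containing it; the function names `remove_document` and `reshuffle` and the reference strings are hypothetical.

```python
from typing import Dict, Set


def remove_document(catalog: Dict[str, Set[str]], doc_ref: str) -> None:
    """Remove doc_ref everywhere; drop words that no longer reference any document."""
    for word in list(catalog):
        catalog[word].discard(doc_ref)
        if not catalog[word]:
            # The word occurred exclusively in the removed document.
            del catalog[word]


def reshuffle(catalog: Dict[str, Set[str]], existing_docs: Set[str]) -> None:
    """Cross-check the catalog against the documents still present on storage."""
    referenced = {ref for refs in catalog.values() for ref in refs}
    for stale_ref in referenced - existing_docs:
        remove_document(catalog, stale_ref)


email_catalog = {"DRAGON": {"email:1", "email:2"}, "LUNETTE": {"email:2"}}
reshuffle(email_catalog, existing_docs={"email:1"})
```

In this example the word LUNETTE disappears from the catalog because it occurred only in the deleted message, while DRAGON is retained with its remaining reference.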

[0043] Further the adaptive indexer 140 interfaces with the content cataloguer 190. The entire index for a user cannot be loaded upon a search request, due to resource requirements. In a large deployment setup with a huge user-base, this factor would affect performance significantly. Cataloguing of indices is done to solve this problem. The content cataloguer 190 interfaces with the adaptive indexer 140 and maintains per-user catalogs for each of the configured content types. In accordance with the present invention, catalogs for email, scanned pages and word processor documents are maintained. For instance, the index generated for word processor documents for user A is stored in word processor documents catalog for user A, the index generated for emails for user B is stored in email catalog for user B, etc. This process enables selective loading of indices when a search request arrives. For instance, if the user wants to retrieve a scanned document, only the scanned pages catalog for the user will be loaded, instead of loading the entire index for the user. It may be noted that there are a large number of words that are commonly used by various users in different contexts. This led to the conclusion that having a common word index across all the users would conserve resources. These words are maintained in the common index and updated by the cataloguing component periodically, after scanning through user indices.

[0044] FIG. 10 illustrates user catalogs for content sources 290, per-catalog common indices and the global common index 300. The generated index is composed of index elements, each index element further comprising a LINK-SET described in detail herein. A LINK-SET stores the access information for a document. The cataloguing component uses the following algorithm to update a per-user catalog:

[0045] 1. For each index element:

[0046] a. If the element is not present in the catalog:

[0047] i. Create a new entry in the catalog for the index element

[0048] ii. Copy the index element into the catalog along with all the LINK-SET elements

[0049] b. Else

[0050] i. Locate the index element in the catalog

[0051] ii. Append all the new LINK-SET elements to the index element with the new document access information
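The catalog update steps above can be sketched in Python as follows. The sketch assumes a catalog maps each index element to a list of LINK-SETs, each represented as an (access information, resource locator) pair; the function name `update_catalog` is hypothetical, and skipping a LINK-SET that is already present is an assumption beyond the steps listed.

```python
from typing import Dict, List, Tuple

LinkSet = Tuple[str, str]  # (access_information, resource_locator)


def update_catalog(
    catalog: Dict[str, List[LinkSet]],
    new_index: Dict[str, List[LinkSet]],
) -> None:
    """Merge freshly generated index elements into a per-user catalog."""
    for element, link_sets in new_index.items():
        if element not in catalog:
            # Step a: create a new entry and copy the element with all its LINK-SETs.
            catalog[element] = list(link_sets)
        else:
            # Step b: locate the element and append the new LINK-SETs.
            existing = catalog[element]
            for link_set in link_sets:
                if link_set not in existing:  # assumption: do not store duplicates
                    existing.append(link_set)


catalog = {"DRAGON": [("mail.example.com:IMAP:user42", "101")]}
update_catalog(
    catalog,
    {
        "DRAGON": [("mail.example.com:IMAP:user42", "102")],
        "PEDOMETER": [("mail.example.com:IMAP:user42", "102")],
    },
)
```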

[0052] Further the adaptive indexer 140 interfaces with the storage adapter 200. The storage adapter 200 is used to abstract the storage protocol from the system. Storage could be the native file system on the disk, a relational database, etc. In this embodiment, the storage adapter uses the native file system of the Operating System to store data. As a result it uses the file input-output operations supported by the operating systems to manipulate data.

[0053] Inverted indexing is used as the core indexing algorithm. U.S. Pat. No. 6,216,123 to Robertson, et al. describes a method for generating and searching a full-text index. The invention presented here makes use of this method for full-text indexing and search operations.

[0054] Referring to FIG. 10, the Indexer maintains two broad-level indices--the user index 290 and the common index 300. The common index 300 contains words that are common to most of the message sources as well as most users (e.g. common words like `APPLICATION FORM`, `MEMO`, `PHONE`, etc.). The cataloguing component of the Indexer scans user indices to look for common words and updates the common index.

[0055] The common index 300 is further categorized into two levels--per-catalog common index and global common index. Per-catalog common index is maintained for each catalog and contains elements common to most of the users in the particular catalog. In this embodiment, the email catalog, scanned pages catalog and word processor document catalog each have a common index. This technique reduces the size of the grammar presented to the speech recognition platform. For instance, if the user requests for email search, only the global common index and the email common index will be presented to him for recognition. If the user enters another context, the email common index will be unloaded for the user and the per-catalog index for the particular context will be loaded.
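The context-sensitive selection described above can be sketched in Python as follows. The sketch assumes each index is available as a set of words keyed by context name; the function name `recognition_words` and the context labels are hypothetical.

```python
from typing import Dict, List, Set


def recognition_words(
    context: str,
    user_catalogs: Dict[str, Set[str]],
    per_catalog_common: Dict[str, Set[str]],
    global_common: Set[str],
) -> List[str]:
    """Collect only the words needed for the user's current search context."""
    words = set(global_common)
    words |= per_catalog_common.get(context, set())
    words |= user_catalogs.get(context, set())
    return sorted(words)


user_catalogs = {"EMAIL": {"DRAGON", "PEDOMETER"}, "SCANNED PAGE": {"HATCHBACK"}}
per_catalog_common = {"EMAIL": {"MEMO"}, "SCANNED PAGE": {"APPLICATION FORM"}}
global_common = {"PHONE"}

email_words = recognition_words("EMAIL", user_catalogs, per_catalog_common, global_common)
```

Words from the scanned pages catalog are left out of `email_words`, which is what keeps the grammar small for an email search.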

[0056] Global common index is a system-wide common index and contains elements common to all the Per-catalog common indices. If an index element belongs to all the Per-catalog common indices, this element is removed from these indices and updated in the Global common index. While updating, all the document references for the element are updated as required.

[0057] The criterion for updating an element in the Per-Catalog Common Index is:

[0058] For each catalog:

[0059] For each element in the catalog:

[0060] If (element present in >=N % of user catalogs)

[0061] Update element in Per-Catalog Common Index

[0062] Here N is determined by the type of content being search-enabled. For instance, if the content type is scanned pages in a specific format (e.g. an insurance application form), the number of common elements (words in this case) is expected to be higher. As a result, N may be set to a relatively high value of 80%. However, if the content comprises data from diverse sources, the number of common elements is expected to be lower. In this case, N may be set to a relatively low value of 60%-70%. This system parameter is configurable.

[0063] The criterion for updating the Global Common Index is:

[0064] For each element in one Per-Catalog Common Index

[0065] If (element is present in all other Per-Catalog Common Indices)

[0066] Update element in Global Common Index
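The two criteria above can be sketched in Python as follows. The sketch assumes each user catalog is reduced to the set of words it contains; the function names `per_catalog_common`, `global_common` and `promote_to_global` are hypothetical. The last function also removes promoted elements from the per-catalog common indices, as described for the global common index.

```python
from typing import Dict, Set


def per_catalog_common(user_catalogs: Dict[str, Set[str]], n_percent: float) -> Set[str]:
    """Elements present in at least n_percent of the user catalogs of one content type."""
    total = len(user_catalogs)
    if total == 0:
        return set()
    counts: Dict[str, int] = {}
    for words in user_catalogs.values():
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return {word for word, count in counts.items() if count * 100 >= n_percent * total}


def global_common(per_catalog: Dict[str, Set[str]]) -> Set[str]:
    """Elements present in every per-catalog common index."""
    if not per_catalog:
        return set()
    return set.intersection(*per_catalog.values())


def promote_to_global(per_catalog: Dict[str, Set[str]]) -> Set[str]:
    """Move elements shared by all per-catalog common indices into the global index."""
    shared = global_common(per_catalog)
    for words in per_catalog.values():
        words.difference_update(shared)
    return shared


email_users = {"A": {"MEMO", "PHONE", "DRAGON"}, "B": {"MEMO", "PHONE"}, "C": {"MEMO"}}
common_at_60 = per_catalog_common(email_users, 60)
common_at_80 = per_catalog_common(email_users, 80)
```

With three users, PHONE appears in two catalogs (about 67%), so it qualifies at N = 60 but not at N = 80, whereas MEMO qualifies at both settings.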

[0067] The user index is a per-user index maintained in the local memory. This index is categorized and maintained as catalogs. In this embodiment, three content sources are configured: email, scanned pages and word processor documents. The Indexer creates three catalogs for these sources. The respective indices are updated in the corresponding catalogs. Indices are stored in compressed format in the local memory. The system decompresses the indices while loading. Huffman coding (The Data Compression Book, Mark Nelson, M&T Books) is used for compression/decompression of indices.

[0068] Each index element in the index comprises:

[0069] ELEMENT-ID

[0070] DATA-ELEMENT

[0071] DATA-TYPE

[0072] DATA-SIZE

[0073] SOURCE-TYPE

[0074] LINK-SET

[0075] Where, DATA-ELEMENT is the actual data of the index,

[0076] DATA-TYPE is the type of data. In the current embodiment, the value of DATA-TYPE is WORD. In another embodiment this value could be an image map, color information, etc, according to the source that was indexed.

[0077] DATA-SIZE is the size of DATA-ELEMENT in bytes.

[0078] SOURCE-TYPE is the type of source document. In this embodiment, this could be EMAIL, SCANNED PAGE or WORD DOC.

[0079] LINK-SET is the element which holds the access information for the document the index element has reference to.

[0080] Each index element in the inverted index holds a reference to the source document. The source document is stored on the remote storage location. Since the system allows any type of document to be indexed, it also provides access information for the document. In the current embodiment, the content types configured are: email, scanned pages and word processor documents. Assuming the corresponding sources as EMAIL SERVER, DOCUMENT MANAGEMENT SYSTEM and HARD DISK, the index stores the required information for each of these sources in the LINK-SET element.

[0081] The format of a LINK-SET is as follows:

[0082] ACCESS-INFORMATION

[0083] RESOURCE-LOCATOR

[0084] Where ACCESS-INFORMATION is the access information, if any, required for the document. For an email,

[0085] ACCESS-INFORMATION=hostname:protocol:userid

[0086] Where, hostname is the mail server name

[0087] protocol is the access protocol used: IMAP, POP3, etc

[0088] userid is the subscriber ID of the user

[0089] RESOURCE-LOCATOR is the path of the document.

[0090] For an email,

[0091] RESOURCE-LOCATOR=serial number of email

[0092] For a scanned page in a document management system,

[0093] RESOURCE-LOCATOR=fully qualified document name

[0094] For a personal word processor document,

[0095] RESOURCE-LOCATOR=complete path on the hard disk

[0096] In another embodiment wherein one of the content sources is a web-site,

[0097] RESOURCE-LOCATOR=Complete URL of HTML page

[0098] Given a LINK-SET, the system knows how and from where to access a particular document. The actual authentication mechanism for accessing a document is provided by the source program from which the document originated.
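The index element and LINK-SET layout described above can be sketched as Python data classes. This is an illustrative representation only; the class and field names follow the listed fields, while the helper `email_link`, the host name `mail.example.com` and the use of UTF-8 to compute DATA-SIZE are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LinkSet:
    access_information: str  # e.g. "hostname:protocol:userid" for an email; may be empty
    resource_locator: str    # e.g. email serial number, document name, path or URL


@dataclass
class IndexElement:
    element_id: int
    data_element: str            # the actual data of the index, here a word
    data_type: str = "WORD"
    source_type: str = "EMAIL"   # EMAIL, SCANNED PAGE or WORD DOC in this embodiment
    link_set: List[LinkSet] = field(default_factory=list)

    @property
    def data_size(self) -> int:
        # Size of DATA-ELEMENT in bytes; UTF-8 encoding is an assumption.
        return len(self.data_element.encode("utf-8"))


def email_link(hostname: str, protocol: str, userid: str, serial: int) -> LinkSet:
    """Build a LINK-SET for an email as hostname:protocol:userid plus serial number."""
    return LinkSet(f"{hostname}:{protocol}:{userid}", str(serial))


link = email_link("mail.example.com", "IMAP", "user42", 1057)
element = IndexElement(element_id=1, data_element="DRAGON", link_set=[link])
```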

[0099] Further the system includes an exclusion dictionary 430. In case of text index, in order to prevent the size of the index from growing exponentially, the adaptive indexer extracts only common nouns and proper nouns for indexing. All verbs, pronouns, adjectives, etc. are excluded from indexing. This is because the system is targeted for keyword search and the user is most likely to utter a noun during a voice-based search request. Also, indexing of verbs, adverbs, etc. would increase the size of the index significantly. A part-of-speech disambiguation mechanism is used to extract the required words. U.S. Pat. No. 6,182,028, by Karaali, et al. describes a part-of-speech disambiguation method using hybrid neural network, stochastic processing and lexicon. The invention presented here makes use of this method for word exclusion.

[0100] The dynamic grammar generator 230 in FIG. 3 generates speech recognition grammar for search requests. It uses the user index 290 and common index 300 shown in FIG. 10 and performs context-sensitive selective loading of indices.

[0101] The common grammar is generated from the common index 300 shown in FIG. 4. Since common index 300 is common for most of the users, this index is loaded only once into the system, and updated periodically. This saves loading and unloading time. The common grammar is generated in W3C format. The common grammar also contains defaults like dates, numbers, digits, day of the week, etc., which are common for all the users. The user grammar is created from the user index and is loaded only during the actual search request. Depending on the context, the dynamic grammar generator first loads the user index from a particular catalog, scans through the entire set of index elements, removes duplicate elements, if any, and creates a grammar in W3C format. Following is a simple user grammar for a user requesting email search:

<?xml version="1.0"?>
<grammar xml:lang="en-US" version="1.0" root="ROOT">
  <rule id="ROOT" scope="public">
    <one-of>
      <item>HOROSCOPE</item>
      <item>DRAGON</item>
      <item>FRANK DENNIS</item>
      <item>PEDOMETER</item>
      <item>LUNETTE</item>
      <item>WRIST-REMOTE-CONTROLLER</item>
      .....
    </one-of>
  </rule>
</grammar>

[0102] According to the grammar shown above, user can speak any of the words present in the grammar and the speech recognition platform would recognize these words for this particular search request, for this user. If the same user enters a different context, e.g. scanned pages search, this grammar would be unloaded first and a new grammar would be created:

<?xml version="1.0"?>
<grammar xml:lang="en-US" version="1.0" root="ROOT">
  <rule id="ROOT" scope="public">
    <one-of>
      <item>FAX</item>
      <item>SPRINGWARE</item>
      <item>HATCHBACK</item>
      <item>DRAWING</item>
      .....
    </one-of>
  </rule>
</grammar>
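A minimal Python sketch of generating such a grammar from a list of index words is given below. It removes duplicates and emits the same element structure as the sample grammars; the function name `build_grammar` and the upper-casing of words are assumptions made for this example.

```python
from typing import Iterable
from xml.sax.saxutils import escape


def build_grammar(words: Iterable[str], lang: str = "en-US") -> str:
    """Create a grammar document with one <item> per unique word."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(w.strip().upper() for w in words if w.strip()))
    items = "\n".join(f"      <item>{escape(w)}</item>" for w in unique)
    return (
        '<?xml version="1.0"?>\n'
        f'<grammar xml:lang="{lang}" version="1.0" root="ROOT">\n'
        '  <rule id="ROOT" scope="public">\n'
        '    <one-of>\n'
        f'{items}\n'
        '    </one-of>\n'
        '  </rule>\n'
        '</grammar>\n'
    )


grammar_text = build_grammar(["fax", "Hatchback", "FAX", "R&D"])
```

Special characters in a word are escaped so that the generated file remains well-formed XML.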

[0103] In FIG. 3, Markup generator/parser 240 is used to create and parse markup language voice based documents. The Markup generator/parser 240 uses a third-party core XML (Extensible Markup Language) parser, e.g. Xerces XML Parser provided by Apache (http://xml.apache.org), to parse VoiceXML documents.

[0104] Speech recognition grammar is presented to the speech recognition platform 130 as a VoiceXML document by the VIR Interface 220. The use of VoiceXML ensures interoperability with a variety of speech recognition systems. The system supports file-mode grammar with the VoiceXML standard. A temporary grammar file is created in the local memory and its reference is put in the VoiceXML. The speech recognition platform 130 can access this file and load the grammar. For this, the speech recognition platform 130 must support W3C grammar.

[0105] Following is a sample VoiceXML document for the speech recognition grammar:

<?xml version='1.0'?>
<vxml version="1.0">
  <var name="var1"/>
  <var name="var2"/>
  <form id="MAIN">
    <field name="search_input1">
      <grammar src="user1.grm"/>
      <prompt cond="TEXT">
        Please say your first search key word. Or say Done if you are finished.
      </prompt>
      <filled>
        <assign name="var1" expr="search_input1"/>
        <if cond="search_input1 == 'Done'">
          <goto next="#submit_search"/>
        </if>
      </filled>
    </field>
    <field name="search_input2">
      <grammar src="user1.grm"/>
      <prompt cond="TEXT">
        Please say your second search key word. Or say Done if you are finished.
      </prompt>
      <filled>
        <assign name="var2" expr="search_input2"/>
        <if cond="search_input2 == 'Done'">
          <goto next="#submit_search"/>
        </if>
      </filled>
    </field>
  </form>
  <form id="submit_search">
    <field name="confirm">
      <prompt cond="TEXT">
        The key words you said are <value expr="var1"/> and <value expr="var2"/>
        Say Yes to fetch result and Say No to re-enter.
      </prompt>
      <filled>
        <if cond="confirm == 'No'">
          <goto next="#MAIN"/>
        </if>
        <submit next="search_svc.jsp" namelist="var1 var2"/>
      </filled>
    </field>
  </form>
</vxml>

[0108] Grammar caching is adopted whereby every time a grammar is generated, the system creates a grammar file in a section of the local memory. This file is stored for a specific amount of time. The time for which it is stored depends on the frequency of the user entering the context for which the file was generated. For instance, if the user enters email search frequently, the system will store the grammar file for that user, for his email catalog. When the user enters email search the next time, only the incremental index would be added to the grammar file. The system "learns" about the access pattern for each user over a period of time and sets the grammar caching levels.
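One way to picture frequency-dependent grammar caching is sketched below in Python. The description does not specify how the retention time is derived from the access frequency; the rule used here (retention grows linearly with the number of cache hits), the class name `GrammarCache` and the parameter `base_ttl` are assumptions made purely for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class CacheEntry:
    grammar: str
    stored_at: float
    hits: int = 0


class GrammarCache:
    """Illustrative cache keyed by (user, context) with hit-dependent retention."""

    def __init__(self, base_ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: Dict[Tuple[str, str], CacheEntry] = {}
        self._base_ttl = base_ttl
        self._clock = clock

    def put(self, user: str, context: str, grammar: str) -> None:
        old = self._entries.get((user, context))
        hits = old.hits if old else 0  # keep the learned access count
        self._entries[(user, context)] = CacheEntry(grammar, self._clock(), hits)

    def get(self, user: str, context: str) -> Optional[str]:
        entry = self._entries.get((user, context))
        if entry is None:
            return None
        ttl = self._base_ttl * (1 + entry.hits)  # assumed rule: more hits, longer retention
        if self._clock() - entry.stored_at > ttl:
            del self._entries[(user, context)]
            return None
        entry.hits += 1
        return entry.grammar


now = [0.0]  # controllable clock for the example
cache = GrammarCache(base_ttl=10.0, clock=lambda: now[0])
cache.put("userA", "EMAIL", "<grammar/>")
```

The injectable clock is only there to make the example deterministic; a deployment would rely on the default monotonic clock.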

[0109] In FIG. 3, the Voice Information Retrieval (VIR) Interface 220 is exposed by the system in order to interface with the speech recognition platform 130. The VIR Interface 220 allows the speech recognition platform 130 to connect and transact with the system. When a user requests a search, the speech recognition platform 130 establishes a session with the present system through the VIR Interface 220, during which user information is passed to the system. After a connection is established, the speech recognition platform 130 can issue search requests to the system, receive search results and open the documents, based on user input. The VIR Interface 220 runs a Hypertext Transfer Protocol (HTTP) Server 270 to accept requests from the speech recognition platform 130. The VoiceXML sent by the system specifies the program to be called by the HTTP Server 270 to execute the request. Session information is mapped from this program to the VIR Interface 220. Following are the key operations the speech recognition platform 130 performs using the VIR Interface 220:

[0110] Connect to the system

[0111] Pass user information

[0112] Set search context

[0113] Issue search request

[0114] Receive search hits

[0115] Obtain access information to open the required document

[0116] Disconnect from the system

[0117] Search engine 250 is used for actual searching of data. It uses n-gram search for fast retrieval of data. The search engine 250 uses the per-user index and the catalogue created by the Indexer and retrieves data. Since the index is updated as and when new content comes in, it is immediately available for search. This enables the user to quickly access documents.

[0118] In FIG. 3, the adaptive indexer can be extended to support indexing of non-textual documents. For instance, it could be used to retrieve images based on image block information or tag notes. A user might want to retrieve an image which has a red-colored block in the upper left corner and a picture in the center. The adaptive indexer 140 would maintain a list of image blocks along with color information and position, and the search engine would use this information to retrieve the correct images. If images have tag notes attached, the user could search for tag notes and retrieve images. Indexing is performed in two stages: primary indexing and secondary indexing. Primary indexing involves the process of core indexing of the content after applying the document template. The output of this process is an inverted index with links to original documents. Secondary indexing involves optimizations like duplicate word removal, segregating of words into common index and user index, etc.

[0119] FIG. 4 illustrates the content source 280 as supplying content to core indexer. FIG. 6 illustrates the content source 310 as email content source supplying an email to the email core indexer 340. FIG. 7 illustrates the content source as scanned page 320 being supplied to the scanned page core indexer 350. Whereas FIG. 8 illustrates the content source as word processor content source 330 supplying word processor documents to word processor core indexer 360. Since content can be in any format, the exact format of the document needs to be specified. A document template is used for this purpose. A document template represents the skeleton of a document from the indexing point of view. All incoming documents are mapped to their respective document templates by the core indexers before performing indexing. Each core indexer 170 knows the internal representation of its data source through the document template. It uses this information to extract the data required for primary indexing. The template specifies parameters like document type, areas of indexing (also referred to as AOIs in this document), etc. For instance, a template for email documents may look like:

Document Type: EMAIL

Area of indexing    Field
AOI_1               "From"
AOI_2               "Subject"
AOI_3               "Date"
AOI_4               "Content"

[0120] Where, fields shown are different attributes of an email message. If indexing of the complete email message is required, AOIs need not be specified. For instance, the scanned pages core indexer 350 in FIG. 7 applies the document template to a scanned page. After extracting the AOIs from the page, it submits these AOIs as bi-tonal images to an Optical Character Recognition (OCR) 410 to extract text. Primary indexing is then performed on the extracted text.
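Applying a document template to an email can be sketched in Python as follows. The sketch assumes an email is available as a mapping from field name to text and that a template is a plain dictionary; the names `EMAIL_TEMPLATE`, `extract_areas` and the dictionary keys are hypothetical. When no areas of indexing are listed, the whole document is returned, consistent with the statement that the areas need not be specified for complete indexing.

```python
from typing import Dict, List, Optional

EMAIL_TEMPLATE = {
    "document_type": "EMAIL",
    "areas_of_indexing": ["From", "Subject", "Date", "Content"],
}


def extract_areas(document: Dict[str, str], template: Dict[str, object]) -> Dict[str, str]:
    """Return only the fields named as areas of indexing in the template."""
    areas: Optional[List[str]] = template.get("areas_of_indexing")  # type: ignore[assignment]
    if not areas:
        # No areas specified: index the complete document.
        return dict(document)
    return {name: document[name] for name in areas if name in document}


email = {
    "From": "frank.dennis@example.com",
    "To": "user42@example.com",
    "Subject": "Pedometer order",
    "Date": "2002-03-27",
    "Content": "The pedometer has shipped.",
}
selected = extract_areas(email, EMAIL_TEMPLATE)
```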

[0121] FIG. 9 illustrates a conventional inverted indexing mechanism adapted to email indexing. After applying document template for email and extracting required data, word list is first created for each incoming document for each user. After all documents are processed, all the word lists are processed to yield an output as shown. For each word, there's a link-set to the document that contains that word, which is the inverted index.
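The step from per-document word lists to an inverted index can be sketched in Python as follows. Document references are shown as plain strings standing in for LINK-SETs; the function name `build_inverted_index` and the sample data are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(word_lists: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each word to the documents that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for doc_ref, words in word_lists.items():
        # dict.fromkeys drops repeated words within one document, keeping order.
        for word in dict.fromkeys(words):
            index[word].append(doc_ref)
    return dict(index)


word_lists = {
    "email:1": ["DRAGON", "HOROSCOPE", "DRAGON"],
    "email:2": ["DRAGON", "PEDOMETER"],
}
inverted = build_inverted_index(word_lists)
```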

[0122] FIG. 10 illustrates a sample index generated for the source contents described in this invention. In accordance with the described content sources, each index element is a spoken "word" since text indexing is performed for all the sources. Per-catalog common index contains elements (words) common to most of the users per catalog. Global common index contains words common to all per-catalog common indices. The personal index is catalogued into categories referred to as user catalogs. Each word may belong to one or more categories. This technique enables selective loading of indices depending on the context. The per-catalog common index and the global common index have been illustrated.

[0123] FIGS. 11-A, 11-B and 11-C depict a flow chart illustrating the method of operation of the systems shown in FIG. 2 and FIG. 3.

[0124] FIG. 12 is a flowchart depicting the general indexing process for all content sources. The adaptive indexer 140 polls the various message sources for content 280. When content is available, primary indexing is performed on the data. The primary index is then fed to the secondary indexing process, which performs duplicate word removal and cataloguing. The catalogs are then updated in the local memory.

[0125] FIG. 13 depicts general primary indexing for all content sources. After polling for the content, the content is received, document template is applied and the data is extracted from Areas of Indexing. Indexing is performed on the extracted data and element exclusion is employed to remove unwanted index elements. A Primary Index is created and the LINK-SET elements are added appropriately. The index is then stored in the local memory.

[0126] FIG. 14 is a flowchart depicting the indexing process for email content sources. After fetching email data, email document template is applied to extract Areas of Indexing. Text is extracted from Areas of Indexing and indexing is performed. The full-text index generated is then subjected to a lexicon and part-of-speech disambiguation for removal of unwanted words. Primary index is generated and LINK-SETs are added. The index is then stored in the local memory.

[0127] FIGS. 15-A and 15-B illustrate primary indexing for scanned pages. The scanned page could be in any color format (e.g. 24-bit color, gray scale, bi-tonal, etc). Thresholding is first performed to reduce the image to bi-tonal. Scanned pages document template is applied to extract areas of indexing. The bi-tonal output is then fed to the Optical Character Recogniser to extract text. The text is then indexed and the full-text index is subjected to unwanted word removal. If tag-notes are present, full-text indexing of tag-notes is performed. The primary index thus generated is updated with LINK-SETs and stored in local memory.

[0128] FIG. 16 is a flowchart depicting primary indexing for word processor documents.

[0129] FIG. 17 is a flowchart depicting secondary indexing process. Primary index is first fetched. Duplicate element removal is then performed. User catalog for the content source is loaded and duplicate element removal is again performed with respect to the user catalog. Index elements are then extracted and the common index is updated. User catalog is updated and stored in local memory.

[0130] FIG. 18 shows the various steps performed for email search. When the user logs in and requests an email search, the system loads the user's email index from the email catalog 370 as well as the common index 300. A check is again performed for duplicate words in order to keep the word list to a minimum. The word list is used to create a W3C grammar, which is then encapsulated in a markup language voice based document, illustratively a VoiceXML document, which is passed to the speech recognition platform 130. The speech recognition platform 130 returns the user input, which is fed to the search engine along with the index. The search engine 250 returns the search results and the search hits are passed on to the user in a markup language document, illustratively a VoiceXML document.

* * * * *
