United States Patent Application 20120246133
Kind Code: A1
Hsu; Bo-June; et al.
September 27, 2012

U.S. patent application number 13/069526, for an online spelling correction/phrase completion system, was filed with the patent office on 2011-03-23 and published on 2012-09-27. The application is currently assigned to MICROSOFT CORPORATION. Invention is credited to Huizhong Duan, Bo-June Hsu, and Kuansan Wang.
ONLINE SPELLING CORRECTION/PHRASE COMPLETION SYSTEM
Abstract
Online spelling correction/phrase completion is described
herein. A computer-executable application receives a phrase prefix
from a user, wherein the phrase prefix includes a first character
sequence. A transformation probability is retrieved responsive to
receipt of the phrase prefix, wherein the transformation
probability indicates a probability that a second character
sequence has been transformed into the first character sequence. A
search is then executed over a trie to locate a most probable
phrase completion based at least in part upon the transformation
probability.
Inventors: Hsu; Bo-June (Woodinville, WA); Wang; Kuansan (Bellevue, WA); Duan; Huizhong (Urbana, IL)
Assignee: MICROSOFT CORPORATION (Redmond, WA)
Family ID: 46878179
Appl. No.: 13/069526
Filed: March 23, 2011
Current U.S. Class: 707/706; 707/802; 707/E17.079; 707/E17.108
Current CPC Class: G06F 16/3322 20190101; G06F 40/274 20200101; G06F 40/232 20200101; G06F 16/3338 20190101
Class at Publication: 707/706; 707/802; 707/E17.108; 707/E17.079
International Class: G06F 17/30 20060101 G06F 017/30
Claims
1. A computer-executable method that facilitates performing in-line
spelling correction, the method comprising: receiving a first
character sequence from a user, wherein the first character
sequence is a potentially misspelled portion of a phrase;
responsive to receiving the first character sequence, retrieving
transformation probability data from a first data structure in a
computer-readable data repository, wherein the transformation
probability data is indicative of a probability that a second
character sequence has been transformed into the first character sequence,
wherein the second character sequence is a properly spelled portion
of the phrase; subsequent to retrieving the transformation
probability data, searching over a second data structure in the
computer-readable data repository for a completion of the phrase
based at least in part upon the transformation probability data;
and providing at least one completion of the phrase to the user
subsequent to receiving the first character sequence but prior to
receiving additional characters from the user.
2. The method of claim 1, wherein the second data structure
comprises an n-gram language model.
3. The method of claim 1, wherein the second data structure
comprises a trie that maps phrases to probabilities.
4. The method of claim 3, wherein the trie comprises a plurality of
nodes and a plurality of paths, wherein each node is representative
of a character sequence and a path between two nodes extends the
character sequence, and wherein each node in the trie has a largest
probability among possible words or phrases that include a
respective character sequence stored in relation thereto.
5. The method of claim 4, wherein the searching is undertaken
across multiple paths in the trie to locate a threshold number of
most probable words or phrases in combination with the
transformation probability corresponding to the first character
sequence.
6. The method of claim 5, further comprising utilizing beam pruning
to limit a number of paths that is searched over during the act of
searching.
7. The method of claim 1 configured for execution by a search
engine, wherein the first character sequence is a portion of a
query.
8. The method of claim 1 configured for execution by a word
processing application.
9. The method of claim 1, wherein the completion of the phrase
comprises multiple words that have yet to be provided by the
user.
10. The method of claim 1, further comprising: computing a risk
that the completion of the phrase is not germane to informational
retrieval intent of the user; comparing the risk with a threshold
value; and providing the completion of the phrase to the user only
if the risk is below the threshold value.
11. The method of claim 1, wherein a number of characters in at
least one of the first character sequence or the second character
sequence of a transformation unit is greater than one.
12. A system comprising a plurality of components that are
executable by a processor, the components comprising: a receiver
component that receives a first character sequence from a user, wherein
the first character sequence is intended by the user to be a portion of a
particular word; a search component that: accesses a first data
structure in a data repository, wherein the first data structure
comprises a translation probability that indicates a probability
that a second character sequence is a translation of the first
character sequence; searches over a plurality of possible word or
phrase completions in a second data structure, wherein the possible
word or phrase completions have a probability assigned thereto;
retrieves at least a most probable word or phrase completion from
the plurality of possible word or phrase completions based at least
in part upon the translation probability, wherein the most probable
word or phrase completion comprises the particular word; and
outputs the most probable word or phrase completion to the user as
a suggested word or phrase correction/completion.
13. The system of claim 12 being comprised by a search engine.
14. The system of claim 12 being comprised by an operating
system.
15. The system of claim 12 being comprised by one of a word
processing application or a web browser.
16. The system of claim 12, wherein the second data structure is a
trie that comprises a plurality of nodes that are representative of
character sequences and a plurality of paths between nodes that are
representative of continuations of the character sequences, and
wherein leaf nodes in the trie represent the possible word or
phrase completions.
17. The system of claim 16, wherein each node in the trie has a
probability assigned thereto, wherein a probability assigned to a
node is a highest probability from amongst all leaf nodes that are
coupled to the node.
18. The system of claim 12, wherein the search component utilizes
an A* search algorithm to search over the plurality of possible
word or phrase completions in the second data structure.
19. The system of claim 12, wherein the search component is
configured to compute a risk value corresponding to the most
probable word or phrase and only outputs the most probable word or
phrase to the user if the risk value is below a threshold, wherein
the risk value is indicative of a risk that the most probable word
or phrase fails to correspond to intentions of the user.
20. A non-transitory computer-readable medium comprising
instructions that, when executed by a processor, cause the
processor to perform acts comprising: receiving a partial query
from a user, wherein the partial query comprises a first character
sequence; responsive to receiving the partial query, retrieving a
transformation probability from a first data structure that
indicates a probability that a second character sequence is a
transformation of the first character sequence; subsequent to
retrieving the transformation probability, executing an A* search
algorithm over a trie based at least in part upon the
transformation probability, wherein the trie comprises a plurality
of nodes and paths, wherein leaf nodes in the trie represent
possible query completions and internal nodes represent character
sequences that are portions of query completions, and wherein each
internal node in the trie has a probability assigned thereto that
is indicative of a most probable query completion given a character
sequence that corresponds to a respective internal node; and
outputting a query correction/completion based at least in part
upon the A* search.
Description
BACKGROUND
[0001] As data storage devices are becoming less expensive, an
increasing amount of data is retained, wherein such data can be
accessed through utilization of a search engine. Accordingly,
search engine technology is frequently updated to satisfy
information retrieval requests of a user. Moreover, as users
continue to interact with search engines, such users become
increasingly adept at crafting queries that are likely to cause
search results to be returned that satisfy informational requests
of the users.
[0002] Conventionally, however, search engines have difficulty
retrieving relevant results when a portion of a query includes a
misspelled word. An analysis of search engine query logs finds that
words in queries are often misspelled, and that there are various
types of misspellings. For instance, some misspellings may be
caused by "fat finger syndrome", when a user accidentally depresses
a key on a keyboard that is adjacent to a key that was intended to
be depressed by the user. In another example, an issuer of a query
may be unfamiliar with certain spelling rules, such as when to
place the letter "i" before the letter "e" and when to place the
letter "e" before the letter "i". Other misspellings can be caused
by the user typing too quickly, such as for instance, accidentally
depressing a same letter twice, accidentally transposing two
letters in a word, etc. Moreover, many users have difficulty in
spelling words that originated in different languages.
[0003] Some search engines have been adapted to attempt to correct
misspelled words in a query after an entirety of the query is
received (e.g., after the issuer of the query depresses a "search"
button). Furthermore, some search engines are configured to correct
misspelled words in a query after the query in its entirety has
been issued to a search engine, and then automatically undertake a
search over an index utilizing the corrected query. Additionally,
conventional search engines are configured with technology that
provides query completion suggestions as the user types a query.
These query completion suggestions often save the user time and
angst by assisting the user in crafting a complete query that is
based upon a query prefix that has been provided to the search
engine. If a portion of the query prefix, however, includes a
misspelled word, then the ability of conventional search engines to
provide helpful query suggestions greatly decreases.
SUMMARY
[0004] The following is a brief summary of subject matter that is
described in greater detail herein. This summary is not intended to
be limiting as to the scope of the claims.
[0005] Described herein are various technologies pertaining to
online spelling correction/phrase completion, wherein online
spelling correction refers to providing a spelling correction for a
word or phrase as the user provides a phrase prefix to a
computer-executable application. Pursuant to an example, online
spelling correction/phrase completion can be undertaken at a search
engine, wherein a query prefix (e.g., a portion of a query but not
an entirety of the query) includes a potentially misspelled word,
wherein such misspelled word can be identified and corrected as the
user enters characters into the search engine, and wherein query
completions (suggestions) that include a corrected word (properly
spelled word) can be provided to the user. In another example,
online spelling correction can be undertaken in a word processing
application, in a web browser, can be included as a portion of an
operating system, or may be included as a portion of another
computer-executable application.
[0006] In connection with undertaking online spelling
correction/phrase completion, a phrase prefix can be received from
a user of a computing apparatus, where the phrase prefix includes a
first character sequence that is potentially a misspelled portion
of a word. For example, the user may provide the phrase prefix "get
invl". This phrase prefix includes the potentially misspelled
character sequence "invl", wherein an entirety of the phrase may be
desired by the user to be "get involved with computers." Aspects
described herein pertain to identifying potential misspellings in
character sequences of a phrase prefix, correcting potential
misspellings, and thereafter providing a suggested complete phrase
to a user.
[0007] Continuing with the example, responsive to receipt of the
character sequence "vl", a transformation probability can be
retrieved from a first data structure in a computer readable data
repository. For example, this transformation probability can be
indicative of a probability that the character sequence "vol" has
been (unintentionally) transformed into the character sequence
proffered by the user ("vl"). While the character sequence "vl"
includes two characters, and the character sequence "vol" includes
three characters, it is to be understood that a character sequence
can be a single character, zero characters, or multiple characters.
Transformation probabilities can be computed in real-time (as
phrase prefixes are received from the user), or pre-computed and
retained in a data structure such as a hash table. Moreover, a
transformation probability can be dependent upon previous
transformation probabilities in a phrase. Therefore, for example,
the transformation probability that the character sequence "vol"
has been transformed into the character sequence "vl" by the user
can be based at least in part upon the transformation probability
that the character sequence "in" has been transformed into the
identical character sequence "in".
[0008] Subsequent to retrieving the transformation probability
data, a search can be undertaken over a second data structure to
locate at least one phrase completion, wherein the at least one
phrase completion is located based at least in part upon the
transformation probability data. Pursuant to an example, the second
data structure may be a trie. The trie can comprise a plurality of
nodes, wherein each node can represent a character or a null field
(e.g., representing the end of the phrase). Two nodes connected by
a path in the trie indicate a sequence of characters that are
represented by the nodes. For example, a first node may represent
the character "a", a second node may represent the character "b",
and a path directly between these nodes represents the sequence of
characters "ab". Additionally, each node can have a score
associated therewith that is indicative of a most probable phrase
completion that includes such node. The score can be computed based
at least in part upon, for instance, a number of occurrences of a
word or phrase that have been observed with respect to a particular
application. For example, the score can be indicative of a number
of times a query has been received by a search engine (over some
threshold window of time). Moreover, the search over the trie may
be undertaken through utilization of an A* search algorithm or a
modified A* search algorithm.
[0009] Based at least in part upon the search undertaken over the
second data structure, a most probable word or phrase completion or
plurality of most probable word or phrase completions can be
provided to the user, wherein such word or phrase completions
include corrections to potential misspellings included in the
phrase prefix that has been provided to the computer-executable
application. In the context of a search engine, through utilization
of such technology, the search engine can quickly provide the user
with query suggestions that include corrections to potential
misspellings in a query prefix that has been proffered to the
search engine by the user. The user may then choose one of the
query suggestions, and the search engine can perform a search
utilizing the query suggestion selected by the user.
[0010] Other aspects will be appreciated upon reading and
understanding the attached figures and description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] FIG. 1 is a functional block diagram of an exemplary system
that facilitates performing online spell correction/phrase
completion responsive to receipt of a phrase prefix from a
user.
[0012] FIG. 2 is an exemplary trie data structure.
[0013] FIG. 3 is a functional block diagram of an exemplary system
that facilitates estimating, pruning, and smoothing a
transformation model.
[0014] FIG. 4 is a functional block diagram of an exemplary system
that facilitates building a trie based at least in part upon data
from a query log.
[0015] FIG. 5 is an exemplary graphical user interface pertaining
to a search engine.
[0016] FIG. 6 illustrates an exemplary graphical user interface of
a word processing application.
[0017] FIG. 7 is a flow diagram that illustrates an exemplary
methodology for performing online spell correction/phrase
completion responsive to receipt of a phrase prefix from a
user.
[0018] FIG. 8 is a flow diagram that illustrates an exemplary
methodology for outputting a query suggestion/completion with
correction of potential misspellings received in a query prefix
from a user.
[0019] FIG. 9 is an exemplary computing system.
DETAILED DESCRIPTION
[0020] Various technologies pertaining to online correction of a
potentially misspelled word in a phrase prefix will now be
described with reference to the drawings, where like reference
numerals represent like elements throughout. In addition, several
functional block diagrams of exemplary systems are illustrated and
described herein for purposes of explanation; however, it is to be
understood that functionality that is described as being carried
out by certain system components may be performed by multiple
components. Similarly, for instance, a component may be configured
to perform functionality that is described as being carried out by
multiple components. Additionally, as used herein, the term
"exemplary" is intended to mean serving as an illustration or
example of something, and is not intended to indicate a
preference.
[0021] With reference now to FIG. 1, an exemplary online spell
correction/phrase completion system 100 is illustrated, wherein the
term "online spell correction//phrase completion" refers to
proffering a phrase completion with a correction to a potentially
misspelled word responsive to receipt of a phrase prefix from a
user but prior to the user entering an entirety of the phrase.
Pursuant to an example, the system 100 may be included in a
computer executable application. Such application may be resident
upon a server, such as a search engine, a word processing
application that is hosted on a server, or other suitable
server-side application. Moreover, the system 100 may be employed
in a word processing application that is configured to execute on a
client computing device, wherein the client computing device can
be, but is not limited to, a desktop computer, a laptop computer, a
portable computing device such as a tablet computer, a mobile
telephone, or the like. Additionally, the system 100 may be
utilized in connection with providing an online
correction/completion of a potentially misspelled word for a single
word, or may also be used in connection with providing an online
correction/completion of a potentially misspelled word for an
incomplete phrase. In addition, while the system 100 will be
described herein as being configured to perform spelling
corrections/phrase completions for phrases in a first language that
include potentially misspelled words, it is to be understood that
the technology described herein can be extended to assist the user
in spelling correction/phrase completion for phrase prefixes in a
first language that are desirably translated to a second language.
For example, a user may wish to generate a phrase that includes
Chinese characters. The user, however, may only have access to a
keyboard that includes English characters. The technology described
herein may be utilized to allow the user to type a phrase prefix
utilizing English characters to approximate pronunciation of a
particular Chinese word or phrase, and completed phrases in Chinese
characters can be provided to the user responsive to the phrase
prefix. Other applications will be readily comprehended by one
skilled in the art.
[0022] The online spell correction/phrase completion system 100
comprises a receiver component 102 that receives a first character
sequence from a user 104. For example, the first character sequence
may be a portion of a prefix of a word or phrase that is provided
by the user 104 to the computer executable application. For
purposes of explanation, such computer executable application will
be described herein as a search engine, but it is to be understood
that the system 100 may be utilized in a variety of different
applications. The first character sequence provided by the user 104
may be at least a portion of a potentially misspelled word.
Moreover, the first character sequence may be a phrase or portion
thereof that includes a potentially misspelled word, such as
"getting invlv". As will be described in greater detail herein, the
first character sequence received by the receiver component 102 may
be a single character, a null character, or multiple
characters.
[0023] The online spell correction/phrase completion system 100
further comprises a search component 106 that is in communication
with the receiver component 102. Responsive to the receiver
component 102 receiving the first character sequence from the user
104, the search component 106 can access a data repository 108. The
data repository 108 comprises a first data structure 110 and a
second data structure 112. As will be described below, the first
data structure 110 and the second data structure 112 can be
pre-computed to allow for the search component 106 to efficiently
search through such data structures 110 and 112. Alternatively, at
least the first data structure 110 may be a model that is decoded
in real-time (e.g., as characters in a phrase prefix proffered by
the user are received).
[0024] The first data structure 110 can comprise or be configured
to output a plurality of transformation probabilities that pertain
to a plurality of character sequences. More specifically, the first
data structure 110 includes a probability that a second character
sequence, which may or may not be different from the character
sequence received from the user 104, has been transformed (possibly
unintentionally) into the first character sequence by the user 104.
Thus, the first data structure 110 can include or output data that
indicates the probability that the user, either through mistake (fat
finger syndrome or typing too quickly) or ignorance (unfamiliarity with
spelling rules or with the native language of a word), intended to type
the second character sequence but instead typed the first character
sequence. Additional detail
pertaining to generating/learning the first data structure 110 is
provided below. The second data structure 112 can comprise data
indicative of a probability of a phrase, which can be determined
based upon observed phrases provided to a computer-executable
application, such as observed queries to a search engine. In an
example, the data indicative of probability of the phrase can be
based upon a particular phrase prefix. Therefore, for example, the
second data structure 112 can include data indicative of a
probability that the user 104 wishes to provide a computer
executable application with the word "involved". Pursuant to an
example, the second data structure 112 may be in the form of a
prefix tree or trie. Alternatively, the second data structure 112
may be in the form of an n-gram language model. In still yet
another example, the second data structure may be in the form of a
relational database, wherein probabilities of phrase completions
are indexed by phrase prefixes. Of course, other data structures
are contemplated by the inventors and are intended to fall under
the scope of the hereto-appended claims.
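For illustration only (a minimal Python sketch that is not part of the patent; the transfeme pairs and probabilities are invented placeholders), the first data structure 110 may be realized as a pre-computed hash table keyed by (intended, observed) character-sequence pairs:

# Hypothetical pre-computed lookup of transformation probabilities,
# keyed by (intended, observed) character-sequence (transfeme) pairs.
# The values are illustrative placeholders, not learned estimates.
transformation_probs = {
    ("vol", "vl"): 0.02,   # dropped vowel, as in the "invl" example above
    ("in", "in"): 0.95,    # identity transformation
    ("ney", "ny"): 0.01,   # phonetic contraction
}

def transformation_prob(intended: str, observed: str) -> float:
    # A small floor for unseen pairs keeps downstream products non-zero.
    return transformation_probs.get((intended, observed), 1e-9)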
[0025] The search component 106 can perform a search over the
second data structure 112, wherein the second data structure
comprises word or phrase completions, and wherein such word or
phrase completions have a probability assigned thereto. For
instance, the search component 106 may utilize an A* search or a
modified A* search algorithm in connection with searching over the
possible word or phrase completions in the second data structure
112. An exemplary modified A* search algorithm that can be employed
by the search component 106 is described below. The search
component 106 can retrieve at least one most probable word or
phrase completion from the plurality of possible word or phrase
completions in the second data structure 112 based at least in part
upon the translation probability between the first character
sequence and the second character sequence retrieved from the first
data structure 110. The search component 106 may then output at
least the most probable phrase completion to the user 104 as a
suggested phrase completion, wherein the suggested phrase
completion includes a correction to a potentially misspelled word.
Accordingly, if the phrase prefix provided by the user 104 includes
a potentially misspelled word, the most probable word/phrase
completion provided by the search component 106 will include a
correction of such potentially misspelled word, as well as a most
likely phrase completion that includes the correctly spelled
word.
[0026] With reference now to FIG. 2, an exemplary trie 200 that can
be searched over by the search component 106 in connection with
providing a threshold number of most probable word or phrase
completions with corrected spellings is illustrated. The trie 200
comprises a first intermediate node 202, which represents a first
character that may be proffered by a user when entering a query to
a search engine. The trie 200 further comprises a plurality of
other intermediate nodes 204, 206, 208, and 210, which are
representative of sequences of characters that begin with the
character represented by the first intermediate node 202. For
instance, the intermediate node 204 can represent the character
sequence "ab". The intermediate 206 represents the character
sequence "abc", and the intermediate node 208 represents the
character sequence "abcc". Similarly, the intermediate node 210
represents the character sequence "ac".
[0027] The trie further comprises a plurality of leaf nodes 212,
214, 216, 218 and 220. The leaf nodes 212-220 represent query
completions that have been observed or hypothesized. For example,
the leaf node 212 indicates that users have proffered the query
"a". The leaf node 214 indicates that users have proffered the
query "ab". Similarly, the leaf node 216 indicates that users have
set forth the query "abc", and the leaf node 218 indicates that
users have set forth a query "abcc". Finally, the leaf node 220
indicates that users have set forth the query "ac". For instance,
these queries can be observed in a query log of a search engine.
Each of the leaf nodes 212-220 may have a value assigned thereto
that indicates a number of occurrences of the query represented by
the leaf nodes 212-220 in a query log of a search engine.
Additionally or alternatively, the values assigned to the leaf
nodes 212-220 can be indicative of a probability of the phrase
completion from a particular intermediate node. Again, the trie 200
has been described with respect to query completions, but it is
understood that the trie 200 may represent words in a dictionary
utilized in a word processing application, or the like. Each of the
nodes 202-210 can have a value assigned thereto that is indicative
of a most probable path beneath such intermediate node. For
example, the node 202 may have a value of 20 assigned thereto,
since the leaf node 212 has a score of 20 assigned thereto, and
such value is higher than values assigned to other leaf nodes that
can be reached by way of the intermediate node 202. Similarly, the
intermediate node 204 can have a value of 15 assigned thereto,
since the value of the leaf node 216 is the highest value
assigned to leaf nodes that can be reached by way of the
intermediate node 204.
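To make the foregoing concrete, the following Python sketch (not the patent's implementation; the leaf-marking convention and the counts for "ab", "abcc", and "ac" are assumptions chosen to match the values 20 and 15 described for nodes 202 and 204) builds a trie whose nodes cache the best completion score reachable beneath them:

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode; the "" key marks a leaf
        self.query = None    # complete phrase, set on leaf nodes only
        self.score = 0.0     # best completion score reachable from here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase, score):
        node = self.root
        node.score = max(node.score, score)
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            node.score = max(node.score, score)   # propagate the best score
        leaf = node.children.setdefault("", TrieNode())
        leaf.query = phrase
        leaf.score = max(leaf.score, score)

trie = Trie()
for query, count in [("a", 20), ("ab", 10), ("abc", 15), ("abcc", 5), ("ac", 12)]:
    trie.insert(query, count)
print(trie.root.children["a"].score)   # 20: best completion under prefix "a"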
[0028] With reference now to FIG. 3, an exemplary system 300 that
facilitates building the first data structure 110 for utilization
in connection with performing online spell correction/phrase
completion is illustrated. In off-line spelling correction, wherein
an entirety of a query is received, it is desirable to find a
correctly spelled query c with the highest probability of yielding
the potentially misspelled input query q. By applying Bayes rule,
this task can be alternatively expressed as follows:
$$c = \arg\max_c p(c \mid q) = \arg\max_c p(q \mid c)\,p(c) \quad (1)$$
In this noisy channel model formulation, p(c) is a query language
model that describes the prior probability of c as the intended
user query. p(q|c) = p(c→q) is the transformation model that
represents the probability of observing the query q when the
original user intent is to enter the query c.
[0029] For online spelling correction, only a prefix q̄ of the
potentially misspelled input query q is received. Accordingly, the
objective of online spelling correction is to locate the correctly
spelled query c that maximizes the probability of yielding any query q
that extends the given partial query q̄. More formally, it is desirable
to locate the following:

$$c = \arg\max_{c,\,q:\,q=\bar{q}\ldots} p(c \mid q) = \arg\max_{c,\,q:\,q=\bar{q}\ldots} p(q \mid c)\,p(c) \quad (2)$$

where q = q̄… denotes that q̄ is a prefix of q. In such a
formulation, off-line spelling correction can be viewed as a
constrained special case of the more generic online spelling
correction.
[0030] The system 300 facilitates learning a transformation model
302 that is an estimate of the aforementioned generative model. The
transformation model 302 is similar to the joint sequence model for
grapheme to phoneme conversion in speech recognition, as described
in the following publication: M. Bisani and H. Ney, "Joint-Sequence
Models for Grapheme-to-Phoneme Conversion," Speech Communication,
Vol. 50, 2008, the entirety of which is incorporated herein by
reference.
[0031] The system 300 comprises a data repository 304 that includes
training data 306. For instance, the training data 306 may include
the following labeled data: word pairs, wherein a first word in a
word pair is a misspelling of a word and a second word in the word
pair is the properly spelled word, and labeled character sequences
in each word in the word pair, wherein such words are broken into
non-overlapping character sequences, and wherein character
sequences between words in the word pair are mapped to one another.
It can be ascertained, however, that obtaining such training data,
particularly on a large scale, may be costly. Therefore, in another
example, the training data 306 may include word pairs, wherein a
word pair includes a misspelled word and a corresponding properly
spelled word. This training data 306 can be acquired from a query
log of a search engine, wherein a user first proffers a misspelled
word as a portion of a query and thereafter corrects such word by
selecting a query suggested by the search engine. Thereafter, and
as will be described below, an expectation maximization algorithm
can be executed over the training data 306 to learn the
aforementioned character sequences between word pairs, and thus
learn the transformation model 302. Such an expectation
maximization algorithm is represented in FIG. 3 by an
expectation-maximization component 308. The
expectation-maximization component 308 can include a pruning
component 310 that can prune the transformation model 302, and can
further include a smoothing component 312 that can smooth such
model 302. Thereafter, the transformation model 302 may be provided
with previously observed query prefixes to generate the first data
structure 110. Alternatively, the pruned, smoothed transformation
model 302 may itself be the first data structure 110, and can be
operative to output, in real-time, transformation probabilities
pertaining to one or more character sequences in a query prefix set
forth by a user.
[0032] In more detail, the transformation model 302 can be defined
as follows: a transformation from an intended query c to the
observed query q can be decomposed as a sequence of substring
transformation units, which are referred to herein as transfemes or
character sequences. For example, the transformation "britney" to
"britny" can be segmented into the transfeme sequence
{br→br, i→i, t→t, ney→ny}, where only the last transfeme,
ney→ny, involves a correction. Given a sequence of transfemes
$s = t_1 t_2 \cdots t_{l_s}$, the
probability of the sequence can be expanded utilizing the chain
rule. As there are multiple manners to segment a transformation, in
general the transformation probability p(c.fwdarw.q) can be modeled
as a sum of all possible segmentations. This can be represented as
follows:
$$p(c \to q) = \sum_{s \in S(c \to q)} p(s) = \sum_{s \in S(c \to q)} \prod_{i \in [1, l_s]} p(t_i \mid t_1, \ldots, t_{i-1}), \quad (3)$$
where S(c→q) is the set of all possible joint segmentations
of c and q. Further, by applying the Markov assumption that a
transfeme only depends on the previous M-1 transfemes, similar to
an n-gram language model, the following can be obtained:

$$p(c \to q) = \sum_{s \in S(c \to q)} \prod_{i \in [1, l_s]} p(t_i \mid t_{i-M+1}, \ldots, t_{i-1}) \quad (4)$$
[0033] The length of a transfeme $t = c_t \to q_t$ can be defined as
follows:

$$|t| = \max\{|c_t|,\ |q_t|\} \quad (5)$$
In general, a transfeme can be arbitrarily long. To constrain the
complexity of the resulting transformation model 302, a maximum
length of a transfeme can be limited to L. With both n-gram
approximation and character sequence length constraint, a
transformation model 302 with parameters M and L can be
obtained:
$$p(c \to q) = \sum_{\substack{s \in S(c \to q):\\ \forall t \in s,\ |t| \le L}} \ \prod_{i \in [1, l_s]} p(t_i \mid t_{i-M+1}, \ldots, t_{i-1}) \quad (6)$$
[0034] In the special case of M=1 and L=1, the transformation model
302 degenerates to a model similar to weighted edit distance. With
M=1, it can be assumed that the transfemes are generated
independently of one another. As each transfeme may include
substrings of at most one character with L=1, the standard
Levenshtein edit operations can be modeled: insertions ε→α;
deletions α→ε; and substitutions α→β, where ε denotes an
empty string. Unlike many edit distance models, however, the
weights in the transformation model 302 represent normalized
probabilities estimated from data, not just arbitrary score
penalties. Accordingly, such transformation model 302 not only
captures the underlying patterns of spelling errors, but also
allows for comparison of the probabilities of different completion
suggestions in a mathematically principled manner.
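By way of illustration, the degenerate M=1, L=1 case can be sketched in Python as follows (not part of the patent; p1 is an assumed caller-supplied function returning the probability of a single-character insertion, deletion, substitution, or identity). The memoized recursion sums equation (3) over all single-character segmentations:

from functools import lru_cache

def transform_prob(c: str, q: str, p1) -> float:
    # p(c -> q) under M=1, L=1: every transfeme is an independently
    # generated single-character edit operation.
    @lru_cache(maxsize=None)
    def f(i: int, j: int) -> float:
        if i == 0 and j == 0:
            return 1.0
        total = 0.0
        if i > 0:               # deletion: c[i-1] -> epsilon
            total += f(i - 1, j) * p1(c[i - 1], "")
        if j > 0:               # insertion: epsilon -> q[j-1]
            total += f(i, j - 1) * p1("", q[j - 1])
        if i > 0 and j > 0:     # substitution or identity: c[i-1] -> q[j-1]
            total += f(i - 1, j - 1) * p1(c[i - 1], q[j - 1])
        return total
    return f(len(c), len(q))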
[0035] When L=1, transpositions are penalized twice even though a
transposition occurs as easily as other edit operations. Similarly,
phonetic spelling errors, such as ph→f, often involve
multiple characters. Modeling these character sequences as single
character edit operations not only over-penalizes the
transformation, but may also pollute the model as it increases the
probabilities of edit operations such as p→f that would
otherwise have very low probabilities. By increasing L, the
allowable length of the transfemes is increased. Accordingly, the
resultant transformation model 302 is able to capture more
meaningful transformation units and reduce probability
contamination that results from decomposing intuitively atomic
substring transformations.
[0036] Rather than increasing L or in addition to increasing L, the
modeling of errors spanning multiple characters can be improved by
increasing M, the number of transfemes on which the model
probabilities are conditioned. In an example, the character
sequence "ie" is often transposed as "ei". A unigram model of (M=1)
is not able to express such an error. A bigram model (M=2) captures
this pattern by assigning a higher probability to the character
sequence e.fwdarw.i when following i.fwdarw.e. A trigram model
(M=3) can further identify exceptions to this pattern, such as when
the characters "ie" or "ei" are preceded by the letter "c", as
"cei" is more common than "cie".
[0037] As mentioned previously, to learn patterns of spelling
errors, a parallel corpus of input and output word pairs is
desired. The input represents the intended word with corrected
spelling while the output corresponds to a potentially misspelled
transformation of the input. Additionally, such data may be
pre-segmented into the aforementioned transfemes, in which case the
transformation model 302 can be derived directly utilizing a
maximum likelihood estimation algorithm. As noted above, however,
such labeled training data may be too costly to obtain in a large
scale. Thus, the training data 306 may include input and output
word pairs that are labeled, but such word pairs are not segmented.
The expectation-maximization component 308 can be utilized to
estimate the parameters of the transformation model 302 from
partially observed data.
[0038] If the training data 306 comprises a set of observed
training pairs $O = \{O^k\}$, where $O^k = c^k \to q^k$,
the log likelihood of the training data 306 can be written as
follows:

$$\log L(\theta; O) = \sum_k \log p(c^k \to q^k \mid \theta) = \sum_k \log \sum_{s^k \in S(O^k)} p(s^k \mid \theta) \quad (7)$$
where $\theta = \{p(t \mid t_{-M+1}, \ldots, t_{-1})\}$ is a set of model
parameters, and $s^k = t_1^k t_2^k \cdots t_{l_{s^k}}^k$,
the joint segmentation of each training pair
$c^k \to q^k$ into a sequence of transfemes, is
the unobserved variable. By applying an expectation maximization
algorithm, the parameter set θ can be located that maximizes
the log likelihood.
[0039] For M=1 and L=1, where each transfeme of length up to 1 is
generated independently, the following update formulas can be
derived:

$$p(s; \theta) = \prod_{i \in [1, l_s]} p(t_i; \theta) \quad (8)$$

$$e(t; \theta) = \sum_k \sum_{s^k \in S(O^k)} \frac{p(s^k; \theta)}{\sum_{s' \in S(O^k)} p(s'; \theta)}\,\#(t, s^k) \quad (9)$$

$$p(t; \theta') = \frac{e(t; \theta)}{\sum_{t'} e(t'; \theta)} \quad (10)$$
where #(t, s) is the count of transfeme t in the segmentation
sequence s, e(t; θ) is the expected partial count of the
transfeme t with respect to the transformation model θ, and
θ' is the updated model. e(t; θ), also known as the
evidence for t, can be computed efficiently using a
forward-backward algorithm.
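A Python sketch of a single EM iteration for the M=1, L=1 case follows (illustrative only; initialization of theta and convergence testing are omitted, and the training pairs are assumed to be supplied by the caller). It accumulates the evidence e(t) of equation (9) with a forward-backward pass and renormalizes per equation (10):

from collections import defaultdict

def em_step(pairs, theta):
    # pairs: iterable of (intended word c, observed word q) training pairs.
    # theta: dict mapping (c_char, q_char) transfemes to probabilities.
    th = lambda a, b: theta.get((a, b), 0.0)
    evidence = defaultdict(float)
    for c, q in pairs:
        n, m = len(c), len(q)
        # Forward pass: alpha[i][j] = probability of generating c[:i] -> q[:j].
        alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
        alpha[0][0] = 1.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0:
                    alpha[i][j] += alpha[i - 1][j] * th(c[i - 1], "")
                if j > 0:
                    alpha[i][j] += alpha[i][j - 1] * th("", q[j - 1])
                if i > 0 and j > 0:
                    alpha[i][j] += alpha[i - 1][j - 1] * th(c[i - 1], q[j - 1])
        # Backward pass: beta[i][j] = probability of generating c[i:] -> q[j:].
        beta = [[0.0] * (m + 1) for _ in range(n + 1)]
        beta[n][m] = 1.0
        for i in range(n, -1, -1):
            for j in range(m, -1, -1):
                if i < n:
                    beta[i][j] += th(c[i], "") * beta[i + 1][j]
                if j < m:
                    beta[i][j] += th("", q[j]) * beta[i][j + 1]
                if i < n and j < m:
                    beta[i][j] += th(c[i], q[j]) * beta[i + 1][j + 1]
        z = alpha[n][m]   # total probability of the observed pair
        if z == 0.0:
            continue
        # E-step (equation (9)): posterior expected count of each transfeme.
        for i in range(n + 1):
            for j in range(m + 1):
                if i < n:
                    evidence[(c[i], "")] += alpha[i][j] * th(c[i], "") * beta[i + 1][j] / z
                if j < m:
                    evidence[("", q[j])] += alpha[i][j] * th("", q[j]) * beta[i][j + 1] / z
                if i < n and j < m:
                    evidence[(c[i], q[j])] += alpha[i][j] * th(c[i], q[j]) * beta[i + 1][j + 1] / z
    # M-step (equation (10)): renormalize evidence into updated probabilities.
    total = sum(evidence.values())
    return {t: e / total for t, e in evidence.items()} if total else theta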
[0040] The expectation maximization training algorithm represented
by the expectation-maximization component 308 can be extended to
higher order transformation models (M>1), where the probability
of each transfeme may depend on the previous M-1 transfemes. Other
than having to take into account the transfeme history context when
accumulating partial counts, the general expectation maximization
procedure is essentially the same. Specifically, the following can
be obtained:
$$p(s; \theta) = \prod_{i \in [1, l_s]} p(t_i \mid t_{i-M+1}^{i-1}; \theta) \quad (11)$$

$$e(t, h; \theta) = \sum_k \sum_{s^k \in S(O^k)} \frac{p(s^k; \theta)}{\sum_{s' \in S(O^k)} p(s'; \theta)}\,\#(t, h, s^k) \quad (12)$$

$$p(t \mid h; \theta') = \frac{e(t, h; \theta)}{\sum_{t'} e(t', h; \theta)} \quad (13)$$
where h is a transfeme sequence representing the history context,
and #(t, h, s) is the occurrence count of transfeme t following the
context h in the segmentation sequence s. Although more
complicated, e(t, h; θ), the evidence for t in the context of
h, can still be computed efficiently using the forward-backward
algorithm.
[0041] As the number of model parameters increases with M, the
model parameters can be initialized using the converged values
from the lower order model to achieve faster convergence.
Specifically, the following initialization can be employed:

$$p(t \mid h^M; \theta^M) \equiv p(t \mid h^{M-1}; \theta^{M-1}) \quad (14)$$
where h^M is a sequence of M−1 transfemes representing
the context, and h^{M−1} is h^M without the oldest context
transfeme. Extending the training procedure to L>1
further complicates the forward-backward computation, but the
general form of the expectation maximization algorithm can remain
the same.
[0042] When the model parameters M and L are increased in the
transformation model 302, the number of potential parameters in the
transformation model 302 increases exponentially. The pruning
component 310 may be utilized to prune some of such potential
parameters to reduce complexity of the transformation model 302.
For example, assuming an alphabet size of 50, an M=1, L=1 model
includes (50+1)² parameters, as each component in the transfeme
t = c_t → q_t can take on any of the 50 symbols or
ε. An M=3, L=2 model, however, may contain up to
(50² + 50 + 1)^{2·3} ≈ 2.8×10²⁰ parameters.
Although most parameters are not observed in the data, model
pruning techniques can be beneficial to reduce overall search space
during both training and decoding, and to reduce overfitting, as
infrequent transfeme n-grams are likely to be noise.
[0043] Two exemplary pruning strategies that can be utilized by the
pruning component 310 when pruning parameters of the transformation
model 302 are described herein. In a first example, the pruning
component 310 can remove transfeme n-grams with expected partial
counts below a threshold τ^e. Additionally, the pruning
component 310 can remove transfeme n-grams with conditional
probabilities below a threshold τ^p. The thresholds can be
tuned against a held-out development set. By filtering out
transfemes with low confidence, the number of active parameters in
the transformation model 302 can be significantly reduced, thereby
speeding up running time of training and decoding the
transformation model 302. While the pruning component 310 has been
described as utilizing the two aforementioned pruning strategies,
it is understood that a variety of other pruning techniques may be
utilized to prune parameters of the transformation model 302, and
such techniques are intended to fall within the scope of the
hereto-appended claims.
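For illustration (a Python sketch; the threshold values shown are placeholders, not tuned values), the two strategies may be applied jointly as follows:

def prune_model(evidence, probs, tau_e=1e-4, tau_p=1e-6):
    # Drop transfeme n-grams whose expected partial count falls below tau_e
    # or whose conditional probability falls below tau_p; both thresholds
    # would be tuned against a held-out development set.
    return {t: p for t, p in probs.items()
            if evidence.get(t, 0.0) >= tau_e and p >= tau_p}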
[0044] As with any maximum likelihood estimation techniques, the
expectation-maximization component 308 may overfit the training
data 306 when the number of model parameters is large, for example,
when M>1. The standard technique in n-gram language modeling to
address this problem is to apply smoothing when computing the
conditional probabilities. Accordingly, the smoothing component 312
can be utilized to smooth the transformation model 302, wherein the
smoothing component 312 can utilize, for instance, Jelinek-Mercer
(JM), absolute discounting (AD), or some other suitable technique
when performing model smoothing.
[0045] In JM smoothing, the probability of a character sequence is
given by the linear interpolation of its maximum likelihood
estimation at order M (using partial counts), and its smoothed
probability from a lower order distribution:
$$p^{JM}(t \mid h^M) = (1 - \alpha)\,\frac{e(t, h^M)}{\sum_{t'} e(t', h^M)} + \alpha\, p^{JM}(t \mid h^{M-1}) \quad (15)$$
where α ∈ (0,1) is the linear interpolation
parameter. It can be noted that p^{JM}(t | h^M) and
p^{JM}(t | h^{M-1}) are probabilities from different
distributions within the same model. That is, in computing the
M-gram model, the partial counts and probabilities for all lower
order m-grams can also be computed, where m ≤ M.
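The JM recursion may be sketched in Python as follows (illustrative; counts[m] is assumed to map (history, transfeme) pairs at order m to expected partial counts, and alpha would be tuned on held-out data):

def p_jm(t, history, counts, m, alpha=0.3):
    # Smoothed probability of transfeme t given the most recent m-1
    # transfemes, per equation (15).
    h = tuple(history[-(m - 1):]) if m > 1 else ()
    denom = sum(e for (hist, _), e in counts[m].items() if hist == h)
    ml = counts[m].get((h, t), 0.0) / denom if denom > 0 else 0.0
    if m == 1:
        return ml   # unigram base case: no lower order to back off to
    return (1 - alpha) * ml + alpha * p_jm(t, history, counts, m - 1, alpha)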
[0046] AD smoothing operates by discounting the partial counts of
the transfemes. The removed probability mass is then redistributed
to the lower order model:
$$p^{AD}(t \mid h^M) = \frac{\max(e(t, h^M) - d,\ 0)}{\sum_{t'} e(t', h^M)} + \alpha(h^M)\, p^{AD}(t \mid h^{M-1}) \quad (16)$$
where d is the discount and α(h^M) is computed such that
Σ_t p^{AD}(t | h^M) = 1. Since the partial count e(t, h^M)
can be arbitrarily small, it may not be possible to choose
a value of d such that e(t, h^M) will always be larger than d.
Consequently, the smoothing component 312 can trim the model if
e(t, h^M) ≤ d. For these pruning techniques, parameters can
be tuned on a held-out development set. While a few exemplary
techniques for smoothing the transformation model 302 have been
described, it is to be understood that various other techniques may
be employed to smooth such model 302, and these techniques are
contemplated by the inventors.
[0047] It is to be understood that when training the transformation
model 302 from the training data 306 that only includes word
correction pairs, the resulting transformation model 302 may be
likely to over-correct. Accordingly, the training data 306 may also
include word pairs wherein both the input and output word are
correctly spelled (e.g., the input and output word are the same).
The training data 306 can thus include a concatenation of
two different data sets: a first data set that includes word pairs
where the input is a correctly spelled word and the output is the
word incorrectly spelled, and a second data set that includes word
pairs where both the input and output are correctly spelled.
Another technique is to train two separate transformation models
from two different data sets. In other words, a first
transformation model can be trained utilizing correct/incorrect
word pairs while the second transformation model can be trained
utilizing correct word pairs. It can be ascertained that the model
trained from correctly spelled words will only assign non-zero
probabilities to transfemes with identical input and output, as all
the transformation pairs are identical. In an example, the two
models can be linearly interpolated to form the final transformation
model 302 as follows:

$$p(t) = (1 - \lambda)\,p(t; \theta^{\text{misspelled}}) + \lambda\,p(t; \theta^{\text{identical}}) \quad (17)$$
This approach can be referred to as model mixture, where each
transfeme can be viewed as being probabilistically generated from
one of the two distributions according to the interpolation factor
λ. As with other modeling parameters, λ can be tuned on
a held-out development set. While some exemplary approaches for
addressing the tendency of the transformation model 302 to
over-correct have been described above, other approaches for
addressing such tendency are also contemplated.
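The model mixture of equation (17) may be sketched as follows (illustrative Python; lam stands in for the interpolation factor λ and would be tuned on a held-out set):

def p_mixture(t, p_misspelled, p_identical, lam=0.5):
    # Linear interpolation of the misspelling-trained and identity-trained
    # transformation models, per equation (17).
    return (1 - lam) * p_misspelled(t) + lam * p_identical(t)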
[0048] Subsequent to the transformation model 302 being trained,
such transformation model 302 can be provided with queries
proffered by users in the query log 314 of a search engine. The
transformation model 302, for various queries in the query log 314,
can segment such queries into transfemes and compute transformation
probabilities for transfemes in the query to other transfemes. In
this case, the transformation model 302 is utilized to pre-compute
the first data structure 110, which can include transformation
probabilities corresponding to various transfemes. Alternatively,
the transformation model 302 itself may be the first data structure
110.
[0049] While the transformation model 302 has been described above
as being learned through utilization of queries in a query log, it
is to be understood that the transformation model 302 can be
trained for particular applications. For instance, soft keyboards
(e.g., keyboards on touch-sensitive devices such as tablet
computing devices and portable telephones) have become increasingly
popular. These keyboards, however, may have an unconventional
setup, due to lack of available space. This may cause spelling
errors to occur that are different from spelling errors that
commonly occur on a QWERTY keyboard. Thus, the transformation model
302 can be trained utilizing data pertaining to such soft keyboard.
In another example, portable telephones are often equipped with
specialized keyboards for texting, wherein "fat finger syndrome",
for example, may cause different types of spelling errors to occur.
Again, the transformation model 302 can be trained based upon the
specific keyboard layout. In addition, if sufficient data is
acquired, the transformation model 302 can be trained based upon
observed spelling of a particular user for a certain
keyboard/application. Moreover, such a trained transformation model
302 can be utilized to automatically select a key when the input of
what the user actually selected is "fuzzy". For instance, the user
input may be proximate to an intersection of four keys.
Transformation probabilities output by the transformation model 302
pertaining to the input and possible transformations can be
utilized to accurately estimate the intent of the user in
real-time.
[0050] Turning now to FIG. 4, an exemplary system 400 that
facilitates building the second data structure 112 is illustrated.
As mentioned previously, the second data structure 112 may be a
trie. The system 400 comprises a data repository 402 that includes
a query log 404. A tried builder component 406 can receive the
query log 404 and generate the second data structure 112 based at
least in part upon queries in the query log 404. For example, the
trie builder component 406 can, for queries that include correctly
spelled words, segment the query into individual characters. Nodes
can be built that represent individual characters in queries in the
query log 404, and paths can be generated between characters that
are sequentially arranged. As noted above, each intermediate node
can be assigned a value that is indicative of a most commonly
occurring or probable query sequence that extends from such
intermediate node.
[0051] Returning again to FIG. 1, additional detail pertaining to
operation of the search component 106 is provided. The receiver
component 102 can receive a first character sequence (transfeme)
from the user 104, and the search component 106 can access the
first data structure 110 and the second data structure 112
responsive to receiving the first character sequence. The search
component 106 can utilize a modified A* search algorithm to locate
at least one most probable word/phrase completion for the phrase
prefix q. Each intermediate search path can be represented as a
quadruplet <Pos, Node, Hist, Prob> corresponding to the
current position in the phrase prefix q, the current node in the
trie T, the transformation history Hist up to this point, and the
probability Prob of a particular search path, respectively. An
exemplary search algorithm that can be utilized by the search
component 106 is shown below.
TABLE-US-00001
Input: Query trie T, transformation model Θ, integer k, query prefix q̄
Output: Top k completion suggestions of q̄
A   List l = new List()
B   PriorityQueue pq = new PriorityQueue()
C   pq.Enqueue(new Path(0, T.Root, [ ], 1))
D   while (!pq.Empty())
E       Path π = pq.Dequeue()
F       if (π.Pos < |q̄|)                            // Transform input query
G           foreach (Transfeme t in GetTransformations(π, q̄, T, Θ))
H               int i = π.Pos + t.Output.Length
I               Node n = π.Node.FindDescendant(t.Input)
J               History h = π.Hist + t
K               Prob p = π.Prob × (n.Prob / π.Node.Prob) × P(t | π.Hist; Θ)
L               pq.Enqueue(new Path(i, n, h, p))
M       else                                        // Extend input query
N           if (π.Node.IsLeaf())
O               l.Add(π.Node.Query)
P               if (l.Count ≥ k)
Q                   return l
R           else
S               foreach (Transfeme t in GetExtensions(π, T, Θ))
T                   int i = π.Pos + t.Output.Length
U                   Node n = π.Node.FindDescendant(t.Input)
V                   History h = π.Hist + t
W                   Prob p = π.Prob × (n.Prob / π.Node.Prob)
X                   pq.Enqueue(new Path(i, n, h, p))
Y   return l
[0052] This exemplary algorithm works by maintaining a priority
queue of intermediate search paths ranked by decreasing
probabilities. The queue can be initialized with the initial path
<0, T.Root, [ ], 1> as shown in line C. While there is still
a path on the queue, such path can be de-queued and reviewed to
ascertain whether there are still characters unaccounted for in the
input phrase prefix q̄ (line F). If so, all transfeme expansions
that transform substrings starting from the current node in the
trie to substrings not yet accounted for in the phrase prefix q̄ can
be iterated over (line G). For each character sequence expansion, a
corresponding path can be added to the priority queue (line L). The
probability of the path can be updated to include adjustments to
the heuristic future score and the probability of the transfeme
given the previous history (line K).
[0053] As the search component 106 expands the search path, a point
will eventually be reached when all characters in the input phrase
prefix q̄ have been consumed. The first path in the search performed
by the search component 106 that meets this criterion represents a
partial correction to the partial input phrase q̄. At this point,
the search transitions from correcting potential errors in the
partial input to extending the partial correction to complete
phrases (queries). Accordingly, when this occurs (line M), if the
path is associated with a leaf node in the trie (line N),
indicating that the search component 106 has reached the end of a
complete phrase, the corresponding phrase can be added to the
suggestion list (line O) and returned if a sufficient number of
suggestions exist (line P). Otherwise, all transfemes that extend
from the current node (line S) are iterated over and are added to
the priority queue (line X). As the transformation score is not
affected by extensions to the partial query, the score is updated
to reflect alterations in the heuristic future score (line W). When
there are no further search paths to expand, the current list of
correction completions can be returned (line Y).
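For illustration, a runnable Python sketch of the search loop follows, under simplifying assumptions that are not part of the patent: an M=1 transformation model with transfemes of at most one character per side (supplied as a function p1), and the TrieNode/Trie sketch given earlier in connection with FIG. 2. Python's heapq is a min-heap, so negated scores emulate the decreasing-probability priority queue:

import heapq, itertools

def suggest(trie, prefix, p1, k=3):
    # Each heap entry is (-f, tiebreak, pos, node, f), where f multiplies
    # the transformation probability accumulated so far by the node's cached
    # best completion score (the admissible heuristic future score).
    cnt = itertools.count()
    root = trie.root
    heap = [(-root.score, next(cnt), 0, root, float(root.score))]
    out = []
    while heap:
        _, _, pos, node, f = heapq.heappop(heap)
        if pos < len(prefix):                        # transform input query
            g = f * p1("", prefix[pos])              # insertion: eps -> prefix[pos]
            if g > 0:
                heapq.heappush(heap, (-g, next(cnt), pos + 1, node, g))
            for ch, child in node.children.items():
                if ch == "":
                    continue
                ratio = child.score / node.score     # heuristic future-score update
                g = f * ratio * p1(ch, "")           # deletion: ch -> eps
                if g > 0:
                    heapq.heappush(heap, (-g, next(cnt), pos, child, g))
                g = f * ratio * p1(ch, prefix[pos])  # substitution or identity
                if g > 0:
                    heapq.heappush(heap, (-g, next(cnt), pos + 1, child, g))
        else:                                        # extend input query
            if node.query is not None and node.query not in out:
                out.append(node.query)               # reached a complete phrase
                if len(out) >= k:
                    return out
            for child in node.children.values():
                g = f * (child.score / node.score)
                heapq.heappush(heap, (-g, next(cnt), pos, child, g))
    return out

def p1(c, q):          # toy M=1, L=1 model with invented probabilities
    if c == q and c:
        return 0.9     # identity
    if c and q:
        return 0.01    # substitution
    return 0.005       # insertion or deletion

print(suggest(trie, "abx", p1))   # "abc" ranks first for the toy trie above

The membership test on out guards against the same completion being emitted twice via distinct correction histories, a detail the pseudocode above does not address.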
[0054] The heuristic future score utilized by the search component
106 in the modified A* algorithm, as applied in lines K and W, is
the probability value stored with each node in the trie. As this value
represents the largest probability among all phrases reachable from
this path, it is an admissible heuristic value that guarantees that
the algorithm will indeed find the top suggestions.
[0055] A problem with such a heuristic function is that it does not
penalize the untransformed part of the input phrase. Therefore,
another heuristic can be designed that takes into consideration the
upper bound of the transformation probability p(c → q). This
can be written formally as follows:

$$\mathrm{heuristic}^{*}(\pi) = \max_{c \in \pi.\mathrm{Node.Queries}} p(c) \times \max_{c'} p\big(c' \to \bar{q}_{[\pi.\mathrm{Pos},\,|\bar{q}|]} \mid \pi.\mathrm{Hist};\ \theta\big) \quad (18)$$

where q̄_{[π.Pos, |q̄|]} is the substring of q̄ from position
π.Pos to |q̄|. For each query, the second maximization in the
equation can be computed for all positions of q̄ using dynamic
programming, for instance.
[0056] The A* algorithm utilized by the search component 106 can
also be configured to perform exact match for off-line spelling
correction by substituting the probabilities in line W with line K.
Accordingly, transformations involving additional unmatched letters
can be penalized even after finding a prefix match.
[0057] It may be worth noting that a search path can theoretically
grow to infinite length, as ε is allowed to appear as
either the source or target of a character sequence. In practice,
this does not happen as the probability of such transformation
sequences will be very low and will not be further expanded in the
search algorithm utilized by the search component 106.
[0058] A transformation model with a larger L parameter significantly
increases the number of potential search paths. As all possible
character sequences with length less than or equal to L are
considered when expanding each path, transformation models with
larger L are less efficient.
[0059] Since the search component 106 is configured to return
possible spelling corrections and phrase completions as the user
104 provides input to the online spell correction/phrase completion
system 100, it may be desirable to limit the search space such that
the search component 106 does not consider unpromising paths. In
practice, beam pruning methods can be employed to achieve
significant improvement in efficiency without causing a significant
loss in accuracy. Two exemplary pruning techniques are absolute
pruning and relative pruning, although other pruning techniques may
also be employed.
[0060] In absolute pruning, the number of paths to be explored at
each position in the target query q is limited. As mentioned
previously, the complexity of the aforementioned search algorithm
is otherwise unbounded due to ε transfemes. By applying absolute
pruning, however, the complexity of the algorithm can be bounded by
O(|q|LK), where K is the number of paths allowed at each position
in q.
[0061] In relative pruning, only the paths that have probabilities
higher than a certain percentage of the maximum probability at each
position are explored by the search component 106. Such threshold
values can be carefully designed to achieve substantially optimal
efficiency without causing a significant drop in accuracy.
Furthermore, the search component 106 can make use of both absolute
pruning and relative pruning (as well as other pruning techniques)
to improve search efficiency and accuracy.
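Both pruning techniques can be sketched in a few lines; the function below applies them to the candidate paths competing at a single position of the target query q. The function name and the values K = 20 and rel_threshold = 0.01 are illustrative choices, as the patent does not prescribe specific thresholds.

def prune_paths(scored_paths, K=20, rel_threshold=0.01):
    """Beam-prune (probability, path) pairs competing at one position of q.
    Absolute pruning keeps at most K paths; relative pruning then drops
    any path whose probability falls below a fraction of the best one."""
    if not scored_paths:
        return scored_paths
    kept = sorted(scored_paths, key=lambda sp: sp[0], reverse=True)[:K]  # absolute
    cutoff = kept[0][0] * rel_threshold                                  # relative
    return [sp for sp in kept if sp[0] >= cutoff]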
[0062] In addition, while the search component 106 may be
configured to always provide a top threshold number of spell
correction/phrase completion suggestions to the user 104, in some
instances it may not be desirable to provide the user 104 with a
predefined number of suggestions for every query proffered by the
user 104. For instance, showing more suggestions to the user 104
incurs a cost, as the user 104 will spend more time looking through
suggestions instead of completing her task. Additionally,
displaying irrelevant suggestions may annoy the user 104.
Therefore, a binary decision can be made for each phrase
completion/suggestion on whether it should be shown to the user
104. For instance, the distance between the target query q and a
suggested correction c can be measured, wherein the larger the
distance, the greater the risk that providing the suggested
correction to the user 104 will be undesirable. An exemplary manner
to approximate the distance is to compute the log of the inverse
transformation probability, averaged over the number of characters
in the query. This can be shown as follows:
$$\mathrm{risk}(c,q)=\frac{1}{|q|}\log\frac{1}{p(c\to q)}\tag{19}$$
[0063] This risk function may not be effective in practice,
however, as the input query q may comprise several words, of which
only one is misspelled. It is not intuitive to average the
risk over all letters in the query. Instead, the query q can be
segmented into words and the risk can be measured at the word
level. For example, the risk of each word can be measured
separately using the above formula, and the final risk function can
be defined as a fraction of words in q having a risk value above a
given threshold. If the search component 106 determines that the
risk of providing a suggested correction/completion is too great,
then the search component 106 can refrain from providing such a
suggested correction/completion to the user.
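A sketch of this word-level risk test follows. Equation (19) is applied per aligned word pair, and the decision thresholds risk_threshold and max_risky_fraction are hypothetical values introduced for illustration; the specification leaves them as design parameters.

import math

def word_risk(q_word, p_transform):
    """Equation (19) applied to one word: log inverse transformation
    probability, averaged over the characters of the typed word."""
    if p_transform <= 0.0:
        return float("inf")  # an impossible transformation is maximally risky
    return math.log(1.0 / p_transform) / len(q_word)

def should_show(q_words, word_probs, risk_threshold=0.5, max_risky_fraction=0.34):
    """Show a suggestion only if the fraction of risky words is small.
    word_probs[i] is p(c_word_i -> q_word_i) for each aligned word pair."""
    risky = sum(1 for w, p in zip(q_words, word_probs)
                if word_risk(w, p) > risk_threshold)
    return risky / len(q_words) <= max_risky_fraction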
[0064] Turning now to FIG. 5, an exemplary graphical user interface
500 corresponding to a search engine is illustrated. The graphical
user interface 500 includes a text entry field 502, wherein the
user can proffer a query that is to be provided to the search
engine. A button 504 may be shown in graphical relation to the text
entry field 502, wherein depression of the button 504 causes the
query entered into the text entry field 502, as finalized by the
user, to be provided to the search engine. A query suggestion field 506
can be included, wherein the query suggestion field 506 includes
suggested queries based upon the query prefix that has been entered
by the user. As shown, the user has entered the query prefix
"invlv". This query prefix can be received by the online spell
correction/phrase completion system 100, which can correct the
spelling in the potentially misspelled phrase prefix and provide
most likely query completions to the user. The user may then
utilize a mouse to select one of the query suggestions/completions
for provision to the search engine. These query suggestions include
properly spelled words, which can improve performance of the search
engine.
[0065] Referring now to FIG. 6, another exemplary graphical user
interface 600 is illustrated. This graphical user interface 600 can
correspond to a word processing application, for instance. The
graphical user interface 600 includes a toolbar 602 that may
comprise a plurality of selectable buttons, pull down menus or the
like, wherein individual buttons or possible selections correspond
to certain word processing tasks such as font selection, text size,
formatting, and the like. The graphical user interface 600 further
comprises a text entry field 604, where the user can compose text
and images, etc. As shown, the text entry field 604
comprises text that was entered by the user. As a user types,
spelling corrections can be presented to the user through
utilization of the online spell correction/phrase completion system
100. For instance, the user has typed the letters "concie" into the
text entry field. In an example corresponding to the word
processing system, this word/phrase prefix can be provided to the
online spell correction/phrase completion system 100, which can
present the user 104 with a most probable corrected spelling
suggestion. The user may utilize a mouse pointer to select such a
suggestion, which can replace the text that was previously entered
by the user.
[0066] With reference now to FIGS. 7 and 8, various exemplary
methodologies are illustrated and described. While the
methodologies are described as being a series of acts that are
performed in a sequence, it is to be understood that the
methodologies are not limited by the order of the sequence. For
instance, some acts may occur in a different order than what is
described herein. In addition, an act may occur concurrently with
another act. Furthermore, in some instances, not all acts may be
required to implement a methodology described herein.
[0067] Moreover, the acts described herein may be
computer-executable instructions that can be implemented by one or
more processors and/or stored on a computer-readable medium or
media. The computer-executable instructions may include a routine,
a sub-routine, programs, a thread of execution, and/or the like.
Still further, results of acts of the methodologies may be stored
in a computer-readable medium, displayed on a display device,
and/or the like. The computer-readable medium may be a
non-transitory medium, such as memory, hard drive, CD, DVD, flash
drive, or the like.
[0068] With reference now to FIG. 7, an exemplary methodology 700
that facilitates performing online spelling correction/phrase
completion is illustrated. The methodology 700 starts at 702, and
at 704 a first character sequence is received from a user. Such
first character sequence may be a portion of a phrase prefix that
is provided to a computer-executable application. At 706,
transformation probability data is retrieved from a first data
structure in a computer-readable data repository. For example, the
first data structure may be a computer-executable transformation
model that is configured to receive the first character sequence
(as well as other character sequences in a phrase prefix that
includes the first character sequence) and to output a transformation
probability for the first character sequence. This transformation
probability indicates a probability that a second character
sequence has been transformed into the first character sequence.
For instance, the second character sequence may be a properly
spelled portion of a word, while the first character sequence is an
improperly spelled portion of such word that corresponds to the
properly spelled portion of the word.
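As a concrete (and entirely invented) illustration of such a first data structure, the transformation model could be as simple as a lookup table over transfeme pairs; a trained model would estimate these probabilities from data as described earlier.

# Toy transformation model: p(correct sequence -> typed sequence).
# The entries and probabilities below are made up for illustration only.
transfeme_prob = {
    ("c", "c"): 0.98,    # identity transfeme: the character was typed correctly
    ("ie", "ei"): 0.03,  # transposition, e.g. "concierge" typed as "conceirge"
    ("s", ""): 0.01,     # deletion: a source letter maps to epsilon (untyped)
}

def p_transform(c_seq, q_seq):
    """Probability that the intended c_seq was transformed into q_seq."""
    return transfeme_prob.get((c_seq, q_seq), 0.0)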
[0069] At 708, a second data structure in the computer-readable
data repository is searched for a completion of a word or phrase.
This search can be performed based at least in part upon the
transformation probability retrieved at 706. As mentioned
previously, the second data structure in the computer-readable data
repository may be a trie, an n-gram language model, or the
like.
[0070] At 710, a top threshold number of completions of the word or
phrase are provided to the user subsequent to receiving the first
character sequence, but prior to receiving additional characters
from the user. In other words, the top completions of the word or
phrase are provided to the user as online spelling
correction/phrase completion suggestions. The methodology 700
completes at 712.
[0071] With reference now to FIG. 8, another exemplary methodology
800 that facilitates performing a query spelling
correction/completion is illustrated. The methodology 800 starts at
802, and at 804 a query prefix is received from a user, wherein the
query prefix comprises a first character sequence.
[0072] At 806, responsive to receiving the query prefix,
transformation probability data is retrieved from a first data
structure, wherein the transformation probability data indicates a
probability that the first character sequence is a transformation
of a properly spelled second character sequence. At 808, subsequent
to retrieving the transformation probability data, an A* search
algorithm is executed over a trie based at least in part upon the
transformation probability data. As discussed above, the trie
comprises a plurality of nodes and paths, where leaf nodes in the
trie represent possible query completions and intermediate nodes
represent character sequences that are portions of query
completions. Each intermediate node in the trie has a value
assigned thereto that is indicative of a most probable query
completion given the character sequence that reaches that node.
[0073] At 810, a query suggestion/completion is output based at
least in part upon the A* search. This query suggestion/completion
can include a spelling correction of a misspelled word or a
partially misspelled word in a query proffered by the user. The
methodology 800 completes at 812.
[0074] Now referring to FIG. 9, a high-level illustration of an
exemplary computing device 900 that can be used in accordance with
the systems and methodologies disclosed herein is illustrated. For
instance, the computing device 900 may be used in a system that
supports performance of online spelling correction/phrase
completion. In another example, at least a portion of the computing
device 900 may be used in a system that supports building data
structures described above. The computing device 900 includes at
least one processor 902 that executes instructions that are stored
in a memory 904. The memory 904 may be or include RAM, ROM, EEPROM,
Flash memory, or other suitable memory. The instructions may be,
for instance, instructions for implementing functionality described
as being carried out by one or more components discussed above or
instructions for implementing one or more of the methods described
above. The processor 902 may access the memory 904 by way of a
system bus 906. In addition to storing executable instructions, the
memory 904 may also store a trie, an n-gram language model, a
transformation model, etc.
[0075] The computing device 900 additionally includes a data store
908 that is accessible by the processor 902 by way of the system
bus 906. The data store 908 may be or include any suitable
computer-readable storage, including a hard disk, memory, etc. The
data store 908 may include executable instructions, a trie, a
transformation model, etc. The computing device 900 also includes
an input interface 910 that allows external devices to communicate
with the computing device 900. For instance, the input interface
910 may be used to receive instructions from an external computer
device, from a user, etc. The computing device 900 also includes an
output interface 912 that interfaces the computing device 900 with
one or more external devices. For example, the computing device 900
may display text, images, etc. by way of the output interface
912.
[0076] Additionally, while illustrated as a single system, it is to
be understood that the computing device 900 may be a distributed
system. Thus, for instance, several devices may be in communication
by way of a network connection and may collectively perform tasks
described as being performed by the computing device 900.
[0077] As used herein, the terms "component" and "system" are
intended to encompass hardware, software, or a combination of
hardware and software. Thus, for example, a system or component may
be a process, a process executing on a processor, or a processor.
Additionally, a component or system may be localized on a single
device or distributed across several devices. Furthermore, a
component or system may refer to a portion of memory and/or a
series of transistors.
[0078] It is noted that several examples have been provided for
purposes of explanation. These examples are not to be construed as
limiting the hereto-appended claims. Additionally, it may be
recognized that the examples provided herein may be permuted
while still falling under the scope of the claims.
* * * * *