System and method for improved name matching using regularized name forms Patent Grant Biesenbach , et al. October 6, 2 [International Business Machines Corporation]

Patent Grant 7599921

U.S. patent number 7,599,921 [Application Number 11/681,333] was granted by the patent office on 2009-10-06 for system and method for improved name matching using regularized name forms. This patent grant is currently assigned to International Business Machines Corporation. Invention is credited to David Edward Biesenbach, Richard Theodore Gillam, Frankie Elizabeth Patman Maguire, Leonard Arthur Shaefer, Jr., Charles Kinston Williams.

United States Patent	7,599,921
Biesenbach , et al.	October 6, 2009

System and method for improved name matching using regularized name forms

Abstract

A system and method for improved name matching using regularized name forms is presented. A regularization rule engine uses culture-specific regularization rules to iteratively convert candidate names and query names to a canonical form, yielding regularized candidate names and regularized query names, respectively. The regularization rules are context-sensitive or context-free rules that pertain to a name's originating culture. Subsequently, a name search engine compares the regularized query name with the regularized candidate names and identifies the regularized candidate names that meet a particular regularization matching threshold. In turn, the name search engine selects the candidate names that correspond to the identified regularized candidate names and provides the selected candidate names to a user.

Inventors:	Biesenbach; David Edward (Alexandria, VA), Gillam; Richard Theodore (Chantilly, VA), Maguire; Frankie Elizabeth Patman (Washington, DC), Shaefer, Jr.; Leonard Arthur (Leesburg, VA), Williams; Charles Kinston (Fairfax, VA)
Assignee:	International Business Machines Corporation (Armonk, NY)
Family ID:	39733867
Appl. No.:	11/681,333
Filed:	March 2, 2007

Prior Publication Data


	Document Identifier	Publication Date
	US 20080215562 A1	Sep 4, 2008

Current U.S. Class:	1/1; 704/8; 704/7; 707/999.003
Current CPC Class:	G06F 16/3335 (20190101); G06F 16/3337 (20190101); Y10S 707/99933 (20130101)
Current International Class:	G06F 17/30 (20060101)
Field of Search:	;707/1,3,6 ;704/4,5,7,8,9,243,252 ;715/741

References Cited [Referenced By]

U.S. Patent Documents


5912949	June 1999	Chan et al.
6963871	November 2005	Hermansen et al.
2005/0119875	June 2005	Shaefer, Jr. et al.
2006/0031239	February 2006	Koenig
2008/0215621	September 2008	Ravindran et al.

Other References

Hermansen, "Automatic Name Searching in Large Data Bases of International Names," abstract of doctoral dissertation, Georgetown University, Department of Linguistics, 1985. cited by other .
Patman et al., "Names: A New Frontier in Text Mining," Symposium on Intelligence and Security Informatics, No. 1, Tucson, AZ, ETATS-UNIS Jun. 2003, 20031973, vol. 2665, pp. 27-38. cited by other .
Zawaydeh et al., "Orthographic Variations in Arabic Corpora," Basis Technology Corporation, 2006, http://www.basistech.com/knowledge-center/Arabic/orthographic-variations-- in-arrabic.pdf. cited by other.

Primary Examiner: Ehichioya; Fred I
Attorney, Agent or Firm: VanLeeuwen & VanLeeuwen Ming; Erin C.

Claims

What is claimed is:

1. A computer-implemented method comprising: retrieving, by a processor, a candidate name from memory; identifying, by the processor, a cultural classification that corresponds to the candidate name; retrieving, by the processor, one or more culture-specific regularization rules from the memory corresponding to the cultural classification; applying, by the processor, one or more of the culture-specific regularization rules to the candidate name, resulting in a regularized candidate name, wherein the applying further comprises: determining that a first regularization rule included in the one or more culture-specific regularization rules applies to the candidate name; generating a first iteration regularized candidate name by applying the first regularized rule to the candidate name; determining that a second regularization rule included in the one or more culture-specific regularization rules applies to the candidate name; and generating the regularized candidate name by applying the second regularized rule to the first iteration regularized candidate name; storing the regularized candidate name in the memory; comparing, by the processor, the regularized candidate name with a regularized query name; determining, by the processor, that the comparison meets a regularization matching threshold, which indicates a potential match between the regularized candidate name and the regularized query name; and in response to determining that comparison meets the regularization matching threshold, providing the candidate name to the user.

2. The method of claim 1 further comprising: receiving a query name from a user; detecting that the cultural classification corresponds to the query name; applying one or more of the regularization rules to the query name, resulting in the regularized query name; and storing the regularized query name in the memory.

3. The method of claim 2 further comprising: in response to determining that the comparison meets the regularization matching threshold, determining that the candidate name corresponds to the regularized candidate name; and in response to determining that the candidate name corresponds to the regularized candidate name, providing the candidate name to the user.

4. The method of claim 1 wherein each of the culture-specific regularization rules are a context-sensitive rule or a context-free rule, each of the applied culture specific regularization rules used to convert one or more letters included in the candidate name to one or more different letters.

5. The method of claim 1 further comprising: wherein the cultural classification corresponds to an originating culture of the candidate name; and wherein applying the culture-specific regularization rules does not result in the regularized candidate name corresponding to a different originating culture than the candidate name.

6. The method of claim 1 wherein the cultural classification corresponds to an originating culture that is selected from the group consisting of Afghan, Anglo, Arabic, Chinese, Farsi, French, German, Hispanic, Indian, Indonesian, Japanese, Korean, Pakistani, Russian, Thai, Vietnamese, and Yoruban.

7. A computer program product stored in computer memory, comprising functional descriptive material that, when executed by an information handling system, causes the information handling system to perform actions that include: retrieving a candidate name; identifying a cultural classification that corresponds to the candidate name; retrieving one or more culture-specific regularization rules from corresponding to the cultural classification; applying one or more of the culture-specific regularization rules to the candidate name, resulting in a regularized candidate name, wherein the applying further comprises: determining that a first regularization rule included in the one or more culture-specific regularization rules applies to the candidate name; generating a first iteration regularized candidate name by applying the first regularized rule to the candidate name; determining that a second regularization rule included in the one or more culture-specific regularization rules applies to the candidate name; and generating the regularized candidate name by applying the second regularized rule to the first iteration regularized candidate name; storing the regularized candidate name; comparing the regularized candidate name with a regularized query name; determining that the comparison meets a regularization matching threshold, which indicates a potential match between the regularized candidate name and the regularized query name; and in response to determining that comparison meets the regularization matching threshold, providing the candidate name to the user.

8. The computer program product of claim 7 wherein the information handling system further performs actions that include: receiving a query name from a user; detecting that the cultural classification corresponds to the query name; applying one or more of the regularization rules to the query name, resulting in the regularized query name; and storing the regularized query name.

9. The computer program product of claim 8 wherein the information handling system further performs actions that include: in response to determining that the comparison meets the regularization matching threshold, determining that the candidate name corresponds to the regularized candidate name; and in response to determining that the candidate name corresponds to the regularized candidate name, providing the candidate name to the user.

10. The computer program product of claim 7 wherein each of the culture-specific regularization rules are a context-sensitive rule or a context-free rule, each of the applied culture specific regularization rules used to convert one or more letters included in the candidate name to one or more different letters.

11. The computer program product of claim 7 wherein the information handling system further performs actions that include: wherein the cultural classification corresponds to an originating culture of the candidate name; and wherein applying the culture-specific regularization rules does not result in the regularized candidate name corresponding to a different originating culture than the candidate name.

12. The computer program product of claim 7 wherein the cultural classification corresponds to an originating culture that is selected from the group consisting of Afghan, Anglo, Arabic, Chinese, Farsi, French, German, Hispanic, Indian, Indonesian, Japanese, Korean, Pakistani, Russian, Thai, Vietnamese, and Yoruban.

13. An information handling system comprising: one or more processors; a memory accessible by the processors; one or more nonvolatile storage devices accessible by the processors; and a set of instructions stored in the memory, wherein one or more of the processors executes the set of instructions in order to perform actions of: retrieving a candidate name from one of the nonvolatile storage areas; identifying a cultural classification that corresponds to the candidate name; retrieving one or more culture-specific regularization rules corresponding to the cultural classification from one of the nonvolatile storage areas; applying one or more of the culture-specific regularization rules to the candidate name, resulting in a regularized candidate name, wherein the applying further comprises: determining that a first regularization rule included in the one or more culture-specific regularization rules applies to the candidate name; generating a first iteration regularized candidate name by applying the first regularized rule to the candidate name; determining that a second regularization rule included in the one or more culture-specific regularization rules applies to the candidate name; and generating the regularized candidate name by applying the second regularized rule to the first iteration regularized candidate name; and storing the regularized candidate name in one of the nonvolatile storage areas; comparing, by the processor, the regularized candidate name with a regularized query name; determining, by the processor, that the comparison meets a regularization matching threshold, which indicates a potential match between the regularized candidate name and the regularized query name; and in response to determining that comparison meets the regularization matching threshold, providing the candidate name to the user.

14. The information handling system of claim 13 further comprises an additional set of instructions in order to perform actions of: receiving a query name from a user; detecting that the cultural classification corresponds to the query name; applying one or more of the regularization rules to the query name, resulting in the regularized query name; and storing the regularized query name in one of the nonvolatile storage areas.

15. The information handling system of claim 14 further comprises an additional set of instructions in order to perform actions of: in response to determining that the comparison meets the regularization matching threshold, determining that the candidate name corresponds to the regularized candidate name; and in response to determining that the candidate name corresponds to the regularized candidate name, providing the candidate name to the user.

16. The information handling system of claim 13 wherein each of the culture-specific regularization rules are a context-sensitive rule or a context-free rule, each of the applied culture specific regularization rules used to convert one or more letters included in the candidate name to one or more different letters.

17. The information handling system of claim 13 wherein the cultural classification corresponds to an originating culture of the candidate name, and wherein applying the culture-specific regularization rules does not result in the regularized candidate name corresponding to a different originating culture than the candidate name.

Description

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates to a system and method for improved name matching using regularized name forms. More particularly, the present invention relates to a system and method for regularizing candidate names and query names based upon their particular culture origin, and identifying names whose corresponding regularized candidate names meet a matching threshold when compared against a regularized query name.

2. Description of the Related Art

A major difficulty in successfully matching personal names stored in a database with a user-provided name query arises when variant forms of the name are possible through either 1) spelling variation inherent to the language itself, or 2) spelling variation that arises when the names are transliterated into the Roman alphabet from other writing systems.

One approach relies on phonetically based rewrite rules that convert a name to a phonetic form approximating its pronunciation, along with the calculation of a phonetic distance value between the two name forms being compared. This approach, however, is valid only in cases in which the spelling variations of similar-sounding names are inherent to the language itself. Name variants that arise from different transliteration conventions may not show evidence of such similarity in pronunciation. Furthermore, generating phonetic variants and calculating their similarity is computationally very expensive, making it necessary to create a static, pre-processed database that may not be changed or updated in real time. When a new record is added or a rule is changed, the entire database must be regenerated, which renders such a system impractical for most users.

What is needed, therefore, is a system and method that effectively and efficiently improve name-matching capabilities for names with spelling variations and transliteration variations.

SUMMARY

It has been discovered that the aforementioned challenges are resolved using a system, method, and program product that retrieves a candidate name. The system, method, and program product then identify a cultural classification that corresponds to the candidate name. The system, method, and program product then retrieve one or more culture-specific regularization rules corresponding to the cultural classification. The system, method, and program product then apply one or more of the culture-specific regularization rules to the candidate name, which results in a regularized candidate name. The system, method, and program product then store the regularized candidate name in a storage area.

In one embodiment, the system, method, and program product receive a query name from a user. In this embodiment, the system, method, and program product detect that the cultural classification corresponds to the query name. The system, method, and program product then apply one or more of the regularization rules to the query name, which results in a regularized query name. The system, method, and program product then store the regularized query name in a storage area.

In one embodiment, the system, method, and program product compare the regularized candidate name with the regularized query name. In this embodiment, the system, method, and program product determine that the comparison meets a regularization matching threshold. The system, method, and program product then determine that the candidate name corresponds to the regularized candidate name. The system, method, and program product then provide the candidate name to the user.

In one embodiment, the system, method, and program product's culture-specific regularization rules are context-sensitive rules or context-free rules, which convert one or more letters included in the candidate name to one or more different letters.

In one embodiment, the system, method, and program product's cultural classification corresponds to an originating culture of the candidate name. In another embodiment, the system, method, and program product apply the culture-specific regularization rules such that the application does not result in the regularized candidate name corresponding to a different originating culture than the candidate name.

In one embodiment, the system, method, and program product determine that a first regularization rule included in the culture-specific regularization rules applies to the candidate name. In this embodiment, the system, method, and program product generate a first iteration regularized candidate name by applying the first regularized rule to the candidate name. The system, method, and program product then determine that a second regularization rule included in the culture-specific regularization rules applies to the candidate name. The system, method, and program product then generate the regularized candidate name by applying the second regularized rule to the first iteration regularized candidate name.

In one embodiment, the system, method, and program product's cultural classification corresponds to an originating culture that is selected from the group consisting of Afghan, Anglo, Arabic, Chinese, Farsi, French, German, Hispanic, Indian, Indonesian, Japanese, Korean, Pakistani, Russian, Thai, Vietnamese, and Yoruban.

The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.

FIG. 1 is a diagram showing a regularization engine regularizing candidate names and a query name, and a name search engine matching the regularized candidate names with the regularized query name;

FIG. 2 is a diagram showing culture-specific regularization rules;

FIG. 3 is a diagram showing a regularization engine iteratively converting candidate names with an English cultural classification to regularized candidate names;

FIG. 4 is a diagram showing a regularization engine iteratively converting candidate names with an Arabic cultural classification to regularized candidate names;

FIG. 5 is a flowchart showing steps taken in converting candidate names to regularized candidate names;

FIG. 6 is a flowchart showing steps taken in converting a query name to a regularized query name, and matching the regularized query name to one or more regularized candidate names;

FIG. 7 is a flowchart showing steps taken in iteratively converting a candidate name or a query name to a regularized candidate name or a regularized query name using one or more regularization rules; and

FIG. 8 is a block diagram of a computing device capable of implementing the present invention.

DETAILED DESCRIPTION

The following is intended to provide a detailed description of an example of the invention and should not be taken to be limiting of the invention itself. Rather, any number of variations may fall within the scope of the invention, which is defined in the claims following the description.

FIG. 1 is a diagram showing a regularization engine regularizing candidate names and a query name, and a name search engine matching the regularized candidate names with the regularized query name. Regularization rule engine 100 uses culture-specific regularization rules included in rules store 130 to regularize candidate names and query names into canonical form. Subsequently, name search engine 160 compares the regularized query name with the regularized candidate names and identifies the regularized candidate names that meet a particular regularization matching threshold. In turn, name search engine 160 selects the candidate names that correspond to the identified regularized candidate names and provides the selected candidate names to user 180. Rules store 130 may be stored on a nonvolatile storage area, such as a computer hard drive.

Regularization rule engine 100 retrieves candidate name 110 from candidate name store 120. Candidate name 110 includes cultural classification 115, which identifies candidate name 110's culture origin, such as Afghan, Anglo, Arabic, Chinese, Farsi, French, German, Hispanic, Indian, Indonesian, Japanese, Korean, Pakistani, Russian, Thai, Vietnamese, or Yoruban.

Regularization rule engine 100 uses cultural classification 115 to retrieve culture-specific regularization rules from rules store 130, such as a set of English regularization rules or a set of Arabic regularization rules. The regularization rules are used to convert candidate name 110 into a canonical form by converting letters based upon particular context-free or context-sensitive rules. For example, a context-free regularization rule "x>cks" converts any "x" into "cks," regardless of the letters that occur before or after the letter "x." In another example, a context-sensitive regularization rule "$break c} $vowel>k" converts a "c" that is at the beginning of a word and followed by a vowel into a "k," so that a word-initial "co" becomes "ko" (see FIG. 2 and corresponding text for further details).
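The difference between the two rule kinds can be illustrated with a short sketch. It does not parse the rule notation quoted above; instead it assumes each rule has been hand-translated into a Python regular expression, and the names CONTEXT_FREE, CONTEXT_SENSITIVE, and apply_rule are hypothetical.

```python
import re

# Hypothetical regex equivalents of the two example rules. The rule notation
# quoted in the text ("x>cks", "$break c} $vowel>k") is not parsed here.
CONTEXT_FREE = (re.compile(r"x"), "cks")                   # any "x" becomes "cks"
CONTEXT_SENSITIVE = (re.compile(r"\bc(?=[aeiou])"), "k")   # word-initial "c" before a vowel


def apply_rule(rule, name):
    """Apply one (pattern, replacement) rule to a lowercase name."""
    pattern, replacement = rule
    return pattern.sub(replacement, name)


print(apply_rule(CONTEXT_FREE, "cox"))          # cocks
print(apply_rule(CONTEXT_SENSITIVE, "cox"))     # kox
print(apply_rule(CONTEXT_SENSITIVE, "scott"))   # scott (the "c" is not word-initial)
```

The context-free rule fires wherever its letter occurs, while the context-sensitive rule leaves "scott" unchanged because its "c" does not sit at a word boundary.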

Regularization rule engine 100 iteratively converts candidate name 110 into regularized candidate name 140 based upon each applicable regularization rule. Once regularization rule engine 100 is finished with the iterative conversion process, regularization rule engine 100 stores regularized candidate name 140 in regularized name store 150. Regularized name store 150 may be stored on a nonvolatile storage area, such as a computer hard drive. Regularization rule engine 100 performs the above process for each candidate name included in candidate name store 120, which results in multiple regularized candidate names, which are each stored in regularized name store 150.

Name search engine 160 receives query name 170 from user 180, which includes a name that user 180 wishes to query. Name search engine 160 uses cultural identification engine 165 to identify a cultural classification that corresponds to query name 170. As those skilled in the art can appreciate, cultural identification engine 165 may be a standard off-the-shelf name classification system that uses statistical algorithms to identify a name's cultural origin.

Name search engine 160 sends query name 175, which includes query name 170 and its corresponding cultural classification, to regularization rule engine 100. In turn, regularization rule engine 100 retrieves culture-specific regularization rules from rules store 130 that correspond to the cultural classification included in query name 175. As such, regularization rule engine 100 iteratively converts query name 175 to regularized query name 180, which it sends back to name search engine 160.

Once name search engine 160 receives regularized query name 180, name search engine 160 compares regularized query name 180 with the regularized candidate names included in regularized name store 150. Name search engine 160 identifies regularized candidate names that meet a regularization matching threshold when compared with regularized query name 180. For example, name search engine 160 may base a potential match on bigram comparisons (i.e., overlap between combinations of two-character strings in the names). In this example, name search engine 160's matching threshold may be user-configurable and set at a 70% value. In turn, name search engine 160 identifies candidate names that correspond to matching regularized candidate names, and sends the identified candidate names as result 190 to user 180 (see FIG. 6 and corresponding text for further details).

In one embodiment, name search engine 160 performs a second comparison between original names in order to calculate an "unregularized" match score. Name search engine 160 performs the second comparison to account for situations in which the regularization rules are not applied to the original names because of, for example, typographical errors in the names. In this embodiment, name search engine 160 may identify names meeting a matching threshold from either the regularized or unregularized comparisons.
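The bigram comparison and the optional second, unregularized comparison can be sketched as follows. The text does not give a scoring formula, so this sketch assumes a Dice coefficient over bigram multisets; the function names and the default threshold argument of 0.70 (mirroring the 70% example) are hypothetical, and the regularized forms are taken as inputs rather than computed here.

```python
from collections import Counter


def bigrams(name):
    """Return the multiset of adjacent two-character strings in a name."""
    return Counter(name[i:i + 2] for i in range(len(name) - 1))


def bigram_score(a, b):
    """Dice coefficient over bigram multisets (an assumed scoring formula)."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Names shorter than two characters have no bigrams to compare.
        return 1.0 if a == b else 0.0
    shared = sum((ba & bb).values())
    return 2 * shared / total


def is_potential_match(query, candidate, reg_query, reg_candidate, threshold=0.70):
    """Accept a pair if either the regularized or the original forms meet the threshold."""
    regularized = bigram_score(reg_query, reg_candidate)
    unregularized = bigram_score(query.lower(), candidate.lower())
    return regularized >= threshold or unregularized >= threshold
```

Under this scoring, "thomsen" and "thompson" share only three bigrams and score about 0.46, so the pair is accepted only if their regularized forms score at or above the threshold.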

FIG. 2 is a diagram showing culture-specific regularization rules. Regularization rules 200 include two sets of culture-specific regularization rules, which are English rules 210-230 and Arabic rules 240-255.

When a regularization rule engine identifies a name with an "English" cultural classification, whether it is a candidate name or a query name, the regularization rule engine retrieves rules 210-230. Rule 210 instructs the regularization rule engine to convert any "x" into a "cks." Rule 215 instructs the regularization rule engine to convert a "c" at the beginning of a word, and also followed by a vowel, into a "k." Rule 220 instructs the regularization rule engine to delete a "p" when the "p" is between an "m" and an "s." Rule 225 instructs the regularization rule engine to convert an "e," when the e is after a consonant and before an "n" at the end of a word, to an "o." And, rule 230 instructs the regularization rule engine to delete an "h" when the "h" is after a "t."
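As a minimal sketch, English rules 210-230 can be rendered as regular expressions and applied one after another, each rule operating on the result of the previous one. The regex encodings, the single-pass application order, the treatment of "y" as a consonant, and the names ENGLISH_RULES and regularize are all assumptions made for illustration; the outputs come from this sketch and are not taken from FIG. 3.

```python
import re

VOWEL = "aeiou"
CONSONANT = "bcdfghjklmnpqrstvwxyz"   # treating "y" as a consonant is an assumption

# Assumed regex encodings of rules 210-230 as described in the text.
ENGLISH_RULES = [
    (re.compile(r"x"), "cks"),                            # rule 210: x becomes cks
    (re.compile(rf"\bc(?=[{VOWEL}])"), "k"),              # rule 215: word-initial c before a vowel
    (re.compile(r"(?<=m)p(?=s)"), ""),                    # rule 220: delete p between m and s
    (re.compile(rf"(?<=[{CONSONANT}])e(?=n\b)"), "o"),    # rule 225: e before word-final n
    (re.compile(r"(?<=t)h"), ""),                         # rule 230: delete h after t
]


def regularize(name, rules):
    """Apply each rule in turn to the running result of the previous rules."""
    current = name.lower()
    for pattern, replacement in rules:
        current = pattern.sub(replacement, current)
    return current


for name in ("Cox", "Thompson", "Thomsen", "Thomson"):
    print(name, "->", regularize(name, ENGLISH_RULES))
```

In this sketch the three "Thom..." spellings all reduce to the same form, which is the property the later comparison step relies on.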

When a regularization rule engine identifies a name with an "Arabic" cultural classification, whether it is a candidate name or a query name, the regularization rule engine retrieves rules 240-255. Rule 240 instructs the regularization rule engine to convert an "l," when it is part of "abdal," into an "s" when it is before an "s." Rule 245 instructs the regularization rule engine to convert "abdel," "abdil," "abdul," and "abdol" into "abdal." Rule 250 instructs the regularization rule engine to convert an "ll" into an "l." And, rule 255 instructs the regularization rule engine to convert an "ss" into an "s."
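Arabic rules 240-255 interact: rule 240 only fires once rule 245 has produced "abdal," and rule 255 then collapses the "ss" that rule 240 creates. One way to handle this, shown below as an assumption rather than as the engine's stated procedure, is to repeat passes over the rule set until the name stops changing. The regex encodings, the max_passes guard, and the name regularize_until_stable are hypothetical, and names are assumed to be written without spaces.

```python
import re

# Assumed regex encodings of rules 240-255 as described in the text.
ARABIC_RULES = [
    (re.compile(r"(?<=abda)l(?=s)"), "s"),       # rule 240: l of "abdal" becomes s before s
    (re.compile(r"abd(?:e|i|u|o)l"), "abdal"),   # rule 245: abdel/abdil/abdul/abdol become abdal
    (re.compile(r"ll"), "l"),                    # rule 250: ll becomes l
    (re.compile(r"ss"), "s"),                    # rule 255: ss becomes s
]


def regularize_until_stable(name, rules, max_passes=10):
    """Repeat full passes over the rules until a pass leaves the name unchanged."""
    current = name.lower()
    for _ in range(max_passes):
        previous = current
        for pattern, replacement in rules:
            current = pattern.sub(replacement, current)
        if current == previous:
            break
    return current


for name in ("Abdulsalam", "Abdelsalam", "Abdassalam", "Abdullah", "Abdallah"):
    print(name, "->", regularize_until_stable(name, ARABIC_RULES))
```

With repeated passes, the order in which the rules are listed no longer determines whether rule 240 sees the "abdal" form it depends on.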

As those skilled in the art can appreciate, other culture-specific rules may be used with the invention described herein than what is shown in FIG. 2, such as rules applicable to Afghan, Anglo, Chinese, Farsi, French, German, Hispanic, Indian, Indonesian, Japanese, Korean, Pakistani, Russian, Thai, Vietnamese, or Yoruban cultures.

FIG. 3 is a diagram showing a regularization engine iteratively converting candidate names with an English cultural classification to regularized candidate names. Table 300 includes candidate names in column 310 along with their corresponding cultural classification in column 320. Regularization rule engine 100 retrieves English culture-specific regularization rules from rules store 130 in order to iteratively convert the candidate names included in column 310 to regularized candidate names included in column 340. Regularization rule engine 100 and rules store 130 are the same as those shown in FIG. 1.

Column 330 shows iterative regularized names that result from regularization rule engine 100 applying regularization rules to the various candidate names. Regularization rule engine 100 iteratively applies each applicable regularization rule to the candidate names, which ultimately results in the regularized candidate names included in column 340.

FIG. 4 is a diagram showing a regularization engine iteratively converting candidate names with an Arabic cultural classification to regularized candidate names. FIG. 4 is similar to FIG. 3 with the exception that FIG. 4 includes candidate names that have an "Arabic" cultural classification. Table 400 includes candidate names in column 410 along with their corresponding cultural classification in column 420. Regularization rule engine 100 retrieves Arabic culture-specific regularization rules from rules store 130 in order to iteratively convert the candidate names included in column 410 to regularized candidate names included in column 440. Regularization rule engine 100 and rules store 130 are the same as those shown in FIG. 1.

Column 430 shows iterative regularized names that result from regularization rule engine 100 applying regularization rules to the various candidate names. Regularization rule engine 100 iteratively applies each applicable regularization rule to the candidate names, which ultimately results in the regularized candidate names included in column 440.

FIG. 5 is a flowchart showing steps taken in converting candidate names to regularized candidate names. The invention described herein iteratively converts a candidate name to a canonical form (regularized name) using one or more regularization rules that are culture-specific to the candidate name.

Processing commences at 500, whereupon processing retrieves a candidate name from candidate name store 120 (step 510). For example, the candidate name may be a name in a financial database. A determination is made as to whether the candidate name includes a cultural classification (decision 520). The cultural classification classifies the candidate name based upon the candidate name's culture origin, such as Afghan, Anglo, Arabic, Chinese, Farsi, French, German, Hispanic, Indian, Indonesian, Japanese, Korean, Pakistani, Russian, Thai, Vietnamese, or Yoruban. Candidate name store 120 is the same as that shown in FIG. 1.

If the candidate name does not include a cultural classification, decision 520 branches to "No" branch 522 whereupon processing culturally classifies the candidate name using existing methods known to those skilled in the art (step 530). On the other hand, if the candidate name already includes a cultural classification, decision 520 branches to "Yes" branch 528 bypassing cultural classification steps.

At step 540, processing retrieves regularization rules, which are culture-specific to the candidate name's cultural classification, from rules store 130. For example, the candidate name may be "Cox" and have an "English" cultural classification. In this example, processing retrieves English regularization rules from rules store 130. Rules store 130 is the same as that shown in FIG. 1.

Processing proceeds through a series of iterations to apply the culture-specific regularization rules to the candidate name in order to generate a regularized candidate name, which is stored in temporary store 560 (pre-defined process block 550, see FIG. 7 and corresponding text for further details). Temporary store 560 may be stored on a nonvolatile storage area, such as a computer hard drive. At step 570, processing stores the regularized candidate name in regularized name store 150. Processing subsequently compares the regularized names included in regularized name store 150 with regularized query names in order to identify matches to provide to a user (see FIG. 6 and corresponding text for further details).

A determination is made as to whether there are more candidate names to regularize (decision 580). If there are more candidate names to regularize, decision 580 branches to "Yes" branch 582, which loops back to retrieve and process another candidate name. This looping continues until there are no more candidate names to process, at which point decision 580 branches to "No" branch 588 whereupon processing ends at 590.
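As an aid to understanding, the FIG. 5 loop may be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: the names `RULES_STORE`, `classify_culture`, `apply_rules`, and `regularize_candidates` are hypothetical, the classifier is a placeholder for the existing classification methods the text refers to, and only the "x>cks" rule is taken from the description (the second rule is assumed for illustration). Names are lower-cased for simplicity.

```python
import re

# Hypothetical rules store keyed by cultural classification (stands in for rules store 130).
# Only "x>cks" comes from the description; the leading-c rule is an assumption.
RULES_STORE = {
    "English": [(r"x", "cks"), (r"^c", "k")],
}

def classify_culture(name):
    """Placeholder for step 530; a real system would use an existing classifier."""
    return "English"

def apply_rules(name, rules):
    """Stand-in for pre-defined process block 550: apply each rule in turn."""
    current = name.lower()
    for pattern, replacement in rules:
        current = re.sub(pattern, replacement, current)
    return current

def regularize_candidates(candidates):
    """Return a mapping from regularized name to the original candidate names."""
    regularized_store = {}  # stands in for regularized name store 150
    for record in candidates:                       # steps 510 and 580
        culture = record.get("culture")             # decision 520
        if culture is None:
            culture = classify_culture(record["name"])  # step 530
        rules = RULES_STORE.get(culture, [])        # step 540
        regularized = apply_rules(record["name"], rules)
        regularized_store.setdefault(regularized, []).append(record["name"])  # step 570
    return regularized_store

store = regularize_candidates([
    {"name": "Cox", "culture": "English"},
    {"name": "Kocks"},  # no classification supplied, so it is classified first
])
print(store)
```

Keeping the original names alongside each regularized form is one way to support the later lookup of original candidate names once a regularized match is found.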

FIG. 6 is a flowchart showing steps taken in converting a query name to a regularized query name, and matching the regularized query name to one or more regularized candidate names.

Processing commences at 600, whereupon processing receives a query name from user 170 at step 610. For example, user 170 may wish to know whether a particular name is included in a financial database. User 170 is the same as that shown in FIG. 1.

A determination is made as to whether the query name includes a cultural classification (decision 620). If the query name does not include a cultural classification, decision 620 branches to "No" branch 622 whereupon processing culturally classifies the query name using existing methods known to those skilled in the art (step 625). On the other hand, if the query name already includes a cultural classification, decision 620 branches to "Yes" branch 628 bypassing cultural classification steps.

At step 630, processing retrieves regularization rules that are culturally specific to the query name's cultural classification from rules store 130. For example, the query name may be "Cox" and have an "English" cultural classification. In this example, processing retrieves English regularization rules from rules store 130. Rules store 130 is the same as that shown in FIG. 1.

Processing proceeds through a series of iterations to apply the culture-specific regularization rules to the query name in order to generate a regularized query name, which is stored in temporary store 560 (pre-defined process block 640, see FIG. 7 and corresponding text for further details). Temporary store 560 is the same as that shown in FIG. 5.

At step 650, processing compares the regularized query name included in temporary store 560 with regularized candidate names included in regularized name store 150 in order to identify potential matches. A determination is made as to whether the comparison results in a match that meets a regularization matching threshold, such as 70% (decision 660).

If one of the regularized candidate names meets the regularization matching threshold, decision 660 branches to "Yes" branch 668 whereupon processing identifies the original candidate names that correspond to the matched regularized candidate names (step 670). For example, the regularized candidate name may be "Kocks," which corresponds to an original candidate name "Cox." Once identified, processing provides the identified original candidate names to user 170 at step 680. On the other hand, if no regularized candidate names meet the regularization matching threshold, decision 660 branches to "No" branch 662 whereupon processing notifies user 170 that no candidate names matched the query name (step 665). Processing ends at 690.
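The comparison of steps 650 through 680 may be illustrated with the following sketch. The description does not specify how the match score is computed, so the use of `difflib.SequenceMatcher` from the Python standard library is an assumption made purely for illustration, as are the function name `find_matches` and the shape of the store (a mapping from regularized name to original candidate names).

```python
from difflib import SequenceMatcher

def find_matches(regularized_query, regularized_store, threshold=0.70):
    """Return original candidate names whose regularized form meets the threshold.

    The similarity measure is an assumed stand-in; any scoring function
    that yields a value between 0 and 1 could be substituted.
    """
    matches = []
    for regularized_candidate, originals in regularized_store.items():
        # Step 650: compare the regularized query with a regularized candidate.
        score = SequenceMatcher(None, regularized_query, regularized_candidate).ratio()
        # Decision 660: does the score meet the regularization matching threshold?
        if score >= threshold:
            matches.extend(originals)  # step 670: map back to the original names
    return matches

# Hypothetical contents of the regularized name store.
example_store = {"kocks": ["Cox"], "smith": ["Smith"]}

hits = find_matches("kocks", example_store)
print(hits if hits else "No candidate names matched the query name")
```

An empty result corresponds to "No" branch 662, in which the user is notified that no candidate names matched.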

FIG. 7 is a flowchart showing steps taken in iteratively converting a candidate name or a query name to a regularized candidate name or a regularized query name using one or more regularization rules.

Processing commences at 700, whereupon processing selects a first culture-specific regularization rule, such as one of English rules 210-230 shown in FIG. 2 (step 710). At step 720, processing compares the selected rule with the name (candidate name or query name) to identify whether the rule applies to the name. For example, if the selected rule is "x>cks" (turn any x into cks) and the name is "Cox," the selected rule applies to the name because the name includes the letter "x."

A determination is made as to whether the selected rule applies to the name (decision 730). If the selected rule applies to the name, decision 730 branches to "Yes" branch 732 whereupon processing regularizes the name according to the selected rule and stores a "first iteration regularized candidate name" in temporary store 560 at step 740. Using the example discussed above, processing converts "Cox" to "Cocks" based upon the selected rule. Since processing may iteratively compare multiple regularization rules to a name, the regularized names temporarily stored are iterations of the final regularized name until the last regularization rule is compared with the name. Temporary store 560 is the same as that shown in FIG. 5.

A determination is made as to whether there are more culture-specific regularization rules to compare with the name (decision 750). If there are more culture-specific regularization rules, decision 750 branches to "Yes" branch 752 whereupon processing loops back and selects the next rule (step 760) and compares it with the regularized name iteration stored in temporary store 560 at step 720.

This looping continues until there are no more culture-specific regularization rules, at which point decision 750 branches to "No" branch 758 whereupon processing returns at 770.
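The iteration of FIG. 7 may be sketched as shown below, where each rule is compared against the output of the previous iteration rather than against the original name. The sketch is illustrative only: rules are represented as regular-expression pairs, the rule "x>cks" is taken from the example above, and the second, context-sensitive rule (a leading "c" becomes "k") is an assumption chosen so that the example arrives at the "Kocks" form mentioned earlier. The function name `regularize` and the list `iterations` are hypothetical, and names are lower-cased for simplicity.

```python
import re

# Ordered, culture-specific rules as (pattern, replacement) pairs.
ENGLISH_RULES = [
    (r"x", "cks"),  # context-free rule from the description: turn any x into cks
    (r"^c", "k"),   # assumed context-sensitive rule: a leading c becomes k
]

def regularize(name, rules):
    """Apply the rules in order and return (final name, intermediate iterations)."""
    current = name.lower()
    iterations = []  # stands in for the iterations held in temporary store 560
    for pattern, replacement in rules:           # steps 710 and 760
        if re.search(pattern, current):          # decision 730: does the rule apply?
            current = re.sub(pattern, replacement, current)  # step 740
            iterations.append(current)
    return current, iterations

final_name, steps = regularize("Cox", ENGLISH_RULES)
print(final_name, steps)
```

Because later rules operate on earlier results, rule order matters: with these two rules, "Cox" first becomes "cocks" and then "kocks," while a name that is already in canonical form passes through unchanged.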

FIG. 8 illustrates information handling system 801 which is a simplified example of a computer system capable of performing the computing operations described herein. Computer system 801 includes processor 800 which is coupled to host bus 802. A level two (L2) cache memory 804 is also coupled to host bus 802. Host-to-PCI bridge 806 is coupled to main memory 808, includes cache memory and main memory control functions, and provides bus control to handle transfers among PCI bus 810, processor 800, L2 cache 804, main memory 808, and host bus 802. Main memory 808 is coupled to Host-to-PCI bridge 806 as well as host bus 802. Devices used solely by host processor(s) 800, such as LAN card 830, are coupled to PCI bus 810. Service Processor Interface and ISA Access Pass-through 812 provides an interface between PCI bus 810 and PCI bus 814. In this manner, PCI bus 814 is insulated from PCI bus 810. Devices, such as flash memory 818, are coupled to PCI bus 814. In one implementation, flash memory 818 includes BIOS code that incorporates the necessary processor executable code for a variety of low-level system functions and system boot functions.

PCI bus 814 provides an interface for a variety of devices that are shared by host processor(s) 800 and Service Processor 816 including, for example, flash memory 818. PCI-to-ISA bridge 835 provides bus control to handle transfers between PCI bus 814 and ISA bus 840, universal serial bus (USB) functionality 845, power management functionality 855, and can include other functional elements not shown, such as a real-time clock (RTC), DMA control, interrupt support, and system management bus support. Nonvolatile RAM 820 is attached to ISA Bus 840. Service Processor 816 includes JTAG and I2C busses 822 for communication with processor(s) 800 during initialization steps. JTAG/I2C busses 822 are also coupled to L2 cache 804, Host-to-PCI bridge 806, and main memory 808 providing a communications path between the processor, the Service Processor, the L2 cache, the Host-to-PCI bridge, and the main memory. Service Processor 816 also has access to system power resources for powering down information handling device 801.

Peripheral devices and input/output (I/O) devices can be attached to various interfaces (e.g., parallel interface 862, serial interface 864, keyboard interface 868, and mouse interface 870) coupled to ISA bus 840. Alternatively, many I/O devices can be accommodated by a super I/O controller (not shown) attached to ISA bus 840.

In order to attach computer system 801 to another computer system to copy files over a network, LAN card 830 is coupled to PCI bus 810. Similarly, to connect computer system 801 to an ISP to connect to the Internet using a telephone line connection, modem 885 is connected to serial port 864 and PCI-to-ISA Bridge 835.

While FIG. 8 shows one information handling system that employs processor(s) 800, the information handling system may take many forms. For example, information handling system 801 may take the form of a desktop, server, portable, laptop, notebook, or other form factor computer or data processing system. Information handling system 801 may also take other form factors such as a personal digital assistant (PDA), a gaming device, ATM machine, a portable telephone device, a communication device or other devices that include a processor and memory.

One of the preferred implementations of the invention is a client application, namely, a set of instructions (program code) in a code module that may, for example, be resident in the random access memory of the computer. Until required by the computer, the set of instructions may be stored in another computer memory, for example, in a hard disk drive, or in a removable memory such as an optical disk (for eventual use in a CD ROM) or floppy disk (for eventual use in a floppy disk drive). Thus, the present invention may be implemented as a computer program product for use in a computer. In addition, although the various methods described are conveniently implemented in a general purpose computer selectively activated or reconfigured by software, one of ordinary skill in the art would also recognize that such methods may be carried out in hardware, in firmware, or in more specialized apparatus constructed to perform the required method steps.

While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, changes and modifications may be made without departing from this invention and its broader aspects. Therefore, the appended claims are to encompass within their scope all such changes and modifications as are within the true spirit and scope of this invention. Furthermore, it is to be understood that the invention is solely defined by the appended claims. It will be understood by those with skill in the art that if a specific number of an introduced claim element is intended, such intent will be explicitly recited in the claim, and in the absence of such recitation no such limitation is present. For non-limiting example, as an aid to understanding, the following appended claims contain usage of the introductory phrases "at least one" and "one or more" to introduce claim elements. However, the use of such phrases should not be construed to imply that the introduction of a claim element by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an"; the same holds true for the use in the claims of definite articles.

* * * * *
