Automatic Speaker Verification By Non-linear Time Alignment Of Acoustic Parameters Patent Grant Doddington , et al. October 24, 1 [Bell Telephone Laboratories, Incorporated]

Patent Grant 3700815

U.S. patent number 3,700,815 [Application Number 05/135,697] was granted by the patent office on 1972-10-24 for automatic speaker verification by non-linear time alignment of acoustic parameters. This patent grant is currently assigned to Bell Telephone Laboratories, Incorporated. Invention is credited to George Rowland Doddington, James Loton Flanagan, Robert Carl Lummis.

United States Patent	3,700,815
Doddington , et al.	October 24, 1972

AUTOMATIC SPEAKER VERIFICATION BY NON-LINEAR TIME ALIGNMENT OF ACOUSTIC PARAMETERS

Abstract

Speaker verification, as opposed to speaker identification, is carried out by matching a sample of a person's speech with a reference version of the same text derived from prerecorded samples of the same speaker. Acceptance or rejection of the person as the claimed individual is based on the concordance of a number of acoustic parameters, for example, formant frequencies, pitch period and speech energy. The degree of match is assessed by time aligning the sample and reference utterance. Time alignment is achieved by a nonlinear process which maximizes the similarity between the sample and reference through a piece-wise linear continuous transformation of the time scale. The extent of time transformation that is required to achieve maximum similarity also influences the decision to accept or reject the identity claim.

Inventors:	Doddington; George Rowland (Richardson, TX), Flanagan; James Loton (Somerset, NJ), Lummis; Robert Carl (Berkeley Heights, NJ)
Assignee:	Bell Telephone Laboratories, Incorporated (Murray Hill, NJ)
Family ID:	22469241
Appl. No.:	05/135,697
Filed:	April 20, 1971

Current U.S. Class:	704/246; 704/238; 704/241; 704/243; 704/E15.016
Current CPC Class:	G10L 15/12 (20130101); G07C 9/37 (20200101); G10L 15/00 (20130101)
Current International Class:	G10L 15/12 (20060101); G10L 15/00 (20060101); G07C 9/00 (20060101); G10l 001/02 ()
Field of Search:	;179/1SA,1SB,15.55R,15.55T ;340/148

References Cited

U.S. Patent Documents


3509280	April 1970	Jones
3525811	August 1970	Trice
3466394	September 1969	French

Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Leaheey; Jon Bradford

Claims

What is claimed is:

1. In an auditory verification system in which acoustic parameters of a test sample of an individual's speech are matched for identity to like parameters of a reference sample of his speech, that improvement which includes the steps of:

time adjusting said test sample parameters with said reference parameters according to a nonlinear registration schedule,

measuring internal dissimilarities and irregularities between said time adjusted parameters and said average reference parameters, and

verifying said individual's identity on the basis of said measures of dissimilarities and irregularities.

2. In an auditory verification system in which acoustic parameters of a test sample of an individual's speech are matched for identity to corresponding parameters of a reference sample of his speech, that improvement which comprises the steps of:

preparing said reference sample of an individual's speech from sets of parameters developed from a plurality of different utterances of a test phrase by said individual which have been mutually registered in time, and from a plurality of measures of variation between said different utterances,

measuring internal dissimilarities and irregularities between said test sample parameters and said reference parameters, and

verifying said individual's identity on the basis of said measures of dissimilarities and irregularities.

3. In an auditory verification system in which acoustic parameters of a test sample of an individual's speech are matched for identity to like parameters of a reference sample of his speech, that improvement which comprises the steps of:

developing said reference sample from time registered values of a plurality of different speech signal parameters,

developing a like plurality of different speech signal parameters from said test speech sample,

time adjusting said test sample parameters with said reference parameters according to a nonlinear registration schedule,

measuring internal dissimilarities and irregularities between said time adjusted parameters and said average reference parameters, and

verifying said individual's identity on the basis of said measures of dissimilarities and irregularities.

4. In a speech signal verification system wherein selected speech signal parameters derived from a test phrase spoken by an individual to produce a sample are compared to reference parameters derived from the same test phrase spoken by the same individual, and wherein verification or rejection of the identity of the individual is determined by the similarities of said sample and reference parameters,

means for bringing the time span of said sample parameters into temporal registration with the time span of said reference parameters, and

means for temporally adjusting the time distribution of parameters of said sample within said adjusted time span to maximize similarities between said sample parameters and said reference parameters.

5. The speech signal verification system as defined in claim 4, wherein said similarities between said sample parameters and said reference parameters are measured by the coefficient of correlation therebetween.

6. The speech signal verification system as defined in claim 5, wherein said temporal adjustment of parameters within said adjusted time span comprises,

means for iteratively incrementing the time locations of selected parameter features until said measure of correlation between said sample parameters and said reference parameters does not increase significantly for a selected number of consecutive iterations.

7. The speech signal verification system as defined in claim 4, wherein said means for temporally adjusting said parameters within said adjusted time span comprises,

means for temporally transforming said sample parameters, designated s(t), into a set of parameters, designated s(τ), in which τ(t) = a + bt + q(t), in which a and b are constants selected to align the end points of said time span of said sample parameters with the end points of said reference parameters, and in which q(t) is a nonlinear function which defines the distribution of parameter values within said time span.

8. The speech signal verification system as defined in claim 7, wherein said nonlinear function q(t) is a continuous piece-wise linear function described by N selected amplitude values q_i and time values t_i within said time span, wherein i = 0, 1, ..., N.

9. An auditory speech signal verification system, which comprises, in combination,

means for analyzing a plurality of individual utterances of a test phrase spoken by an individual to develop a prescribed set of acoustic parameter signals for each utterance,

means for developing from each of said sets of parameter signals a reference set of parameter signals and a set of signals which denotes variations between parameter signals used to develop said reference set of signals,

means for storing a set of reference parameter signals and a set of variation signals for each of a number of different individuals,

means for analyzing a sample utterance of said test phrase spoken by an individual purported to be one of said number of different individuals to develop a set of acoustic parameter signals,

means for adjusting selected parameter signals of said sample to bring the time scale of said utterance represented by said parameters into registry with the time scale of a designated one of said stored reference utterances represented by said reference parameters,

said means including means for adjusting selected values of said sample parameter signals to maximize similarities between said sample utterance and said reference utterance,

means responsive to said reference parameter signals, said adjusted sample parameter signals, and said variation signals for developing a plurality of signals representative of selected similarities between each of said sample parameters and each of said corresponding reference parameters,

means for developing signals representative of the extent of adjustment employed to register said time scales,

means responsive to said plurality of similarity signals and said signals representative of the extent of adjustment for developing a signal representative of the overall degree of similarity between said sample utterance and said designated reference utterance, and

threshold comparison means supplied with said overall similarity signal for matching the magnitude of said similarity signal to the magnitude of a stored threshold signal, and for issuing an "accept" signal for similarity signals above threshold, a "reject" signal for signals below threshold, and a "no decision" signal for similarity signals within a prescribed narrow range of signal magnitudes near said threshold magnitude.

10. An auditory speech signal verification system, as defined in claim 9, wherein,

said means for developing a plurality of signals representative of selected similarities includes,

means for measuring a plurality of different speech signal characteristics for similarity in each of a number of time subintervals within the interval of said designated time scale.

11. An auditory speech signal verification system, as defined in claim 10, wherein,

said different speech signal characteristics are based, respectively, on (1) the difference in average values between said reference speech signal parameters and said sample speech signal parameters, (2) the squared difference in linear components between the two, (3) the squared difference in quadratic components between the two, (4) the correlation between the two in each of said subintervals, and (5) the correlation between the two over said entire interval of said designated time scale.

12. An auditory speech signal verification system, as defined in claim 11, wherein,

each of said signals representative of selected similarities between each of said sample parameters and each of said corresponding reference parameters is scaled in accordance with the magnitudes of said variation signals.

Description

This invention relates to speech signal analysis and, more particularly, to a system for verifying the identity of a person on the basis of acoustic parameters unique to his speech.

BACKGROUND OF THE INVENTION

Many business transactions might be conducted by voice over a telephone if the identity of a caller could be verified. It might, for example, be convenient if a person could telephone his bank and ascertain the balance of his account. He might dial the bank and enter both his identification number and his request by keying the dial. A computer could (via synthetic speech) ask him to speak his verification phrase. If a verification of sufficiently high confidence was achieved, the machine would proceed to read out the requested balance. Other instances are apparent where verification by voice would prove useful.

From the practical point of view, the problem of verification appears both more important and more tractable than the problem of absolute identification. The former problem consists of the decision to accept or reject an identity claim made by an unknown voice. In identification, the problem is to decide which member of a reference set the unknown voice most resembles. In verification, the expected probability of error tends to remain constant regardless of the size of the user population, whereas in identification the expected probability of error tends to unity as the population becomes large. In the usual context of the verification problem one has a closed set of cooperative "customers," who wish to be verified and who are willing to pronounce prescribed code phrases (tailored to the individual voices if necessary). The machine may ask for repeats and might adjust its acceptance threshold in accordance with the importance of the transaction. Further, the machine may control the average mix of the two kinds of errors it can make, i.e., accepting a false speaker (miss) or rejecting a true speaker (false alarm).
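The contrast in error behavior can be illustrated with a simple model that is an assumption made here for exposition only: suppose each of the N - 1 other enrolled speakers has, independently, a small probability p of matching the unknown voice better than the true speaker does. The symbols N and p are hypothetical and do not appear in the patent.

```latex
% Illustrative model only; the independence assumption and the symbols N, p are not from the patent.
P_{\text{identification error}}(N) \;=\; 1 - (1 - p)^{N-1} \;\longrightarrow\; 1
\quad \text{as } N \to \infty .
% Example with p = 0.01:
%   N = 100  gives 1 - 0.99^{99}  \approx 0.63
%   N = 1000 gives 1 - 0.99^{999} \approx 0.9999
```

A verification trial, by contrast, compares the sample against the single claimed reference, so under this model its error rates do not depend on N.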

DESCRIPTION OF THE PRIOR ART

A number of recognition techniques have been proposed for identifying a speaker on the basis of prerecorded samples of his speech. For example, human observers have been trained to identify talkers on the basis of certain variables in their speech, and others have been trained to identify them on the basis of a visual display of acoustic features. These studies indicate that talkers can indeed be identified primarily on the basis of acoustic cues. Auditory verification (by a human listener) thus is a possibility, but it is generally inconvenient and it occupies talent that might be better applied otherwise. Also, present indications are that auditory verification is not as reliable as machine verification.

Accordingly, several proposals have been made for the automatic recognition of speech sounds based entirely on acoustic information. These have shown some degree of promise, provided that the sample words to be recognized or identified are limited in number. Most of these recognition techniques are based on individual words, with each word being compared to a corresponding word. Some work has been done on comparing selected parameters in a sample utterance, for example, peaks and valleys of pitch periods, against corresponding reference data.

SUMMARY OF THE INVENTION

It is, accordingly, an object of this invention to verify the identity of a human being on the basis of certain unique acoustic cues in his speech. In accordance with the invention, verification of a speaker is achieved by comparing the characteristic way in which he utters a test sentence with a previously prepared utterance of the same sentence. A number of different tests are made on the speech signals and a binary decision is then made; the identity claim of the talker is either rejected or accepted.

The problem may be defined as follows. A person asserts a certain identity and then makes a "sample" utterance of a special test phrase. Previously prepared information about the voice of the person whose identity is claimed, i.e., a "reference" utterance, embodies the typical way in which that person utters the test phrase, as well as measures of the variability to be expected in separate repetitions of the phrase by that person. The sample utterance is compared with the reference information and a decision is rendered as to the veracity of the identity claim. For the sake of exposition, it is convenient to divide the verification technique into three basic operations: time registration, construction of a reference, and measurement of the "distance" from the reference to a particular sample utterance.

Time registration is the process in which the time axis of a sample time function is warped so as to make the function most nearly similar to the unwarped version of a reference function. The warped time scale may be specified by any continuous transformation. One suitable function is a piece-wise linear continuous function of unwarped time. In this case warping is uniquely determined by two coordinates of each breakpoint in the piece-wise linear function. Typically, 10 break-points may be used for warping a two-second-long function, so the registration task amounts to the optimal assignment of values to 20 parameters.
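The piece-wise linear warping described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names `piecewise_linear` and `warp`, the uniform sampling interval `dt`, and the choice to produce the output on the same time grid as the input are all assumptions made here.

```python
from bisect import bisect_right

def piecewise_linear(x, xs, ys):
    """Evaluate the continuous piece-wise linear function through (xs[i], ys[i]) at x."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x) - 1               # segment with xs[k] <= x < xs[k + 1]
    frac = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + frac * (ys[k + 1] - ys[k])

def warp(sample, dt, break_t, break_tau):
    """Read `sample` (values at 0, dt, 2*dt, ...) along a warped time axis.

    break_t and break_tau hold the two coordinates of each breakpoint:
    unwarped time t and warped time tau(t). The output at time t is the
    sample evaluated (by linear interpolation) at tau(t).
    """
    times = [i * dt for i in range(len(sample))]
    return [piecewise_linear(piecewise_linear(t, break_t, break_tau), times, sample)
            for t in times]
```

With ten breakpoints, `break_t` and `break_tau` together hold the 20 values that the registration procedure has to assign.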

The coefficient of correlation between warped sample and unwarped reference may be used as one index of the similarity of the two functions. The 20 warping parameters are iteratively modified to maximize the correlation coefficient. One suitable technique is the method of steepest ascent. That is, in every iteration, each of the 20 parameters is incremented by an amount proportional to the partial derivative of the correlation coefficient with respect to that parameter.

Success of this procedure hinges on the avoidance of certain degenerate outcomes. Accordingly, several constraints on the steepest ascent iteration process are employed. In effect, these constraints prevent the original function from being distorted too severely, and prevent unreasonably large steps on any one iteration.
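A constrained steepest-ascent loop of this kind might look as follows. The sketch makes several simplifying assumptions that are not taken from the patent: the breakpoint times `bt` are held fixed and only the warp amplitudes `q` are adjusted, partial derivatives are estimated by finite differences, and the two constraints are implemented as simple clipping (`max_step` limits the change per iteration, `max_shift` limits the total distortion). All names and default values are hypothetical.

```python
import math

def interp(x, xs, ys):
    """Linear interpolation through (xs, ys), clamped outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = max(i for i in range(len(xs) - 1) if xs[i] <= x)
    f = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + f * (ys[k + 1] - ys[k])

def correlation(a, b):
    """Correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def warped(sample, times, bt, q):
    """Sample read at tau(t) = t + q(t), with q piece-wise linear through (bt, q)."""
    return [interp(t + interp(t, bt, q), times, sample) for t in times]

def register(sample, reference, times, bt, rate=0.05, max_step=0.02,
             max_shift=0.2, max_iters=100, patience=3, min_gain=1e-5, eps=1e-3):
    """Steepest ascent on the interior warp amplitudes; end values stay at zero.

    max_shift should stay below half the breakpoint spacing so that tau(t)
    remains monotonic (an assumption of this sketch).
    """
    q = [0.0] * len(bt)

    def score(qq):
        return correlation(warped(sample, times, bt, qq), reference)

    best_q, best, stalled = q[:], score(q), 0
    for _ in range(max_iters):
        base = score(q)
        grad = [0.0] * len(q)
        for i in range(1, len(q) - 1):            # finite-difference partial derivatives
            trial = q[:]
            trial[i] += eps
            grad[i] = (score(trial) - base) / eps
        for i in range(1, len(q) - 1):
            step = max(-max_step, min(max_step, rate * grad[i]))   # limit step size
            q[i] = max(-max_shift, min(max_shift, q[i] + step))    # limit distortion
        new = score(q)
        if new > best + min_gain:
            best_q, best, stalled = q[:], new, 0
        else:
            stalled += 1
            if stalled >= patience:               # no significant gain for several rounds
                break
    return best_q, best
```

The stopping rule mirrors the idea of halting once the correlation has not increased significantly for a selected number of consecutive iterations.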

A reference phrase is formed by collecting a number of independent utterances of the phrase by the same speaker. Each is referred to as a "specimen" utterance. A typical phrase which has been used in practice is "We were away a year ago." Each utterance is analyzed to yield, for an all voiced utterance such as this one, five "control functions" (so called because they can be used to control a formant synthesizer to generate a signal similar to the original voice signal). It has been found that gain, pitch period, and first, second, and third formant frequencies, are satisfactory as control functions. The gain function is scaled to have a particular peak value independent of the talking level.

The reference consists of a version of each of the five control functions chosen to represent a typical utterance by that speaker. By convention, the length of the reference is always the same; a value of 1.9 seconds may be used as the standard length. Any value may be used that is not grossly different from the natural length of the utterance.

The reference functions are constructed by averaging together the specimen functions after each has been time-warped to bring them all into mutual registration with each other. One way this mutual registration has been achieved is as follows. One of the five control functions is singled out to guide the registration. This control function is called the "guide" function. Either gain or second formant may be used for this purpose. The guide function from each specimen is linearly expanded (or contracted) to the desired reference length, and then all of the expanded guide functions are averaged together. This average is the first trial reference for the control function serving as guide. Each of the specimen guide functions is then registered to the trial reference by non-linear time-warping, and a new trial reference is generated by averaging the warped specimens. This process is continued iteratively, i.e., warp each specimen guide function for registration with the current trial reference, and then make a new trial reference by averaging the warped guide functions, until the reference does not change significantly. The other four control functions for each specimen utterance are then warped by the final guide warping function for that utterance, and then each control function is averaged across all specimens to form a reference. The reference control functions are stored for future use, along with computed variance values which indicate the reliability of the function as a standard in selected intervals of the utterance.
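The iterative construction of the reference can be outlined as below. This is a sketch under stated assumptions: `resample`, `mean_and_variance` and `build_reference` are hypothetical names, the non-linear registration step is passed in as a caller-supplied `register(specimen, trial)` function, and when no such function is given the code falls back to linear length normalization followed by averaging.

```python
def resample(values, n_out):
    """Linearly stretch or contract a sequence to n_out points."""
    n = len(values)
    out = []
    for j in range(n_out):
        pos = j * (n - 1) / (n_out - 1)
        k = min(int(pos), n - 2)
        out.append(values[k] + (pos - k) * (values[k + 1] - values[k]))
    return out

def mean_and_variance(rows):
    """Pointwise mean and variance across equal-length sequences."""
    m = len(rows)
    cols = list(zip(*rows))
    mean = [sum(col) / m for col in cols]
    var = [sum((v - mu) ** 2 for v in col) / m for col, mu in zip(cols, mean)]
    return mean, var

def build_reference(specimens, n_out, register=None, max_rounds=10, tol=1e-6):
    """Average specimen guide functions after bringing them into registration.

    register(specimen, trial) is assumed to return the specimen warped onto
    the current trial reference; it is a placeholder for the non-linear step.
    """
    linear = [resample(s, n_out) for s in specimens]    # first trial: linear scaling only
    aligned = linear
    trial, _ = mean_and_variance(aligned)
    if register is not None:
        for _ in range(max_rounds):
            aligned = [register(s, trial) for s in linear]
            new_trial, _ = mean_and_variance(aligned)
            change = max(abs(x - y) for x, y in zip(new_trial, trial))
            trial = new_trial
            if change < tol:                            # reference has stopped changing
                break
    return mean_and_variance(aligned)
```

The returned variance plays the role of the stored reliability measure: intervals in which the warped specimens agree closely have small variance.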

When a sample of the standard utterance is presented for verification, a "distance" value is computed that is a measure of the unlikelihood that that sample would have been generated by the person whose identity is claimed. Distances are always positive numbers; a distance value of zero means that the utterance is identical to the reference in every detail.

The sample is first analyzed to generate the five control functions in terms of which the reference is stored. The control functions are then brought into temporal registration with the reference. This is done by choosing one of the control functions (e.g., gain) to serve as the "guide." The guide function of the sample utterance is registered with its counterpart in the reference by non-linear warping, and other control functions are then warped in an identical way.

After registration of the control functions, a variety of distances between the sample and reference utterance are measured. Included are measures of the difference in local average, local linear variation, and local quadratic variation for all control functions; local and global correlation coefficients between sample and reference control functions; and measures that represent the difficulty of time registration. In forming these separate measures, various time segments of the utterance are weighted in proportion to the constancy of the given measure in that time segment across the set of warped specimens. These measures are then combined to form a single overall distance that represents the degree to which the sample utterance differs from the reference.
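One way to compute local average, linear and quadratic measures for a single time segment is to project each function onto constant, linear and quadratic terms and compare the coefficients. The sketch below does this with an orthogonal projection; the squaring of every difference, the division by a variance value, and the use of one minus the correlation as a distance are assumptions made for illustration, as are all function names.

```python
import math

def components(y):
    """Average, linear and quadratic components of a segment (needs >= 3 points)."""
    n = len(y)
    x = [2.0 * i / (n - 1) - 1.0 for i in range(n)]     # time mapped to [-1, 1]
    mean_x2 = sum(v * v for v in x) / n
    p2 = [v * v - mean_x2 for v in x]                   # quadratic term with mean removed
    c0 = sum(y) / n
    c1 = sum(a * b for a, b in zip(y, x)) / sum(v * v for v in x)
    c2 = sum(a * b for a, b in zip(y, p2)) / sum(v * v for v in p2)
    return c0, c1, c2

def correlation(a, b):
    """Correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def segment_distances(sample, reference, variance=1.0):
    """Non-negative distance measures for one segment of one control function."""
    s0, s1, s2 = components(sample)
    r0, r1, r2 = components(reference)
    return {
        "average": (s0 - r0) ** 2 / variance,
        "linear": (s1 - r1) ** 2 / variance,
        "quadratic": (s2 - r2) ** 2 / variance,
        "correlation": 1.0 - correlation(sample, reference),
    }
```

Dividing by the variance gives less weight to segments in which the speaker's own specimens were inconsistent, which is the intent of the weighting described above.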

The verification decision is based on the single overall distance. If it is less than a pre-determined criterion, the claimed identity is accepted ("verified"); if it is greater than the criterion, the identity claim is rejected. In addition, an "indeterminate" zone may be established around the criterion value within which neither definite decision would be rendered. In this event, additional information about the person is sought.
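The three-way decision rule can be written down directly. In this minimal sketch the name `decide` and the symmetric `margin` that defines the indeterminate zone are assumptions; the patent does not specify how wide the zone is or whether it is symmetric.

```python
def decide(distance, criterion, margin=0.0):
    """Accept, reject, or defer based on the overall distance.

    Distances below criterion - margin are accepted, distances above
    criterion + margin are rejected, and anything in between is left
    undecided so that further information can be requested.
    """
    if distance < criterion - margin:
        return "accept"
    if distance > criterion + margin:
        return "reject"
    return "no decision"
```

For a high-value transaction the caller could lower `criterion` or widen `margin`, trading more repeat requests for fewer false acceptances.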

BRIEF DESCRIPTION OF THE DRAWING

The invention will be fully apprehended from the following detailed description of a preferred illustrative embodiment thereof taken in connection with the appended drawings.

In the drawings:

FIG. 1 is a block schematic diagram of a speech verification system in accordance with the invention;

FIG. 2 illustrates an alternative analyzer arrangement;

FIG. 3 illustrates graphically the registration technique employed in the practice of the invention;

FIG. 4 is a chart which illustrates the dependence of two kinds of error ratios on the choice of threshold;

FIG. 5 is a block schematic diagram of a time adjustment configuration which may be employed for non-linearly warping parameter values;

FIG. 6 illustrates a criterion for maximizing similarity of acoustic parameters in accordance with the invention;

FIG. 7 illustrates a number of distance measures used in establishing an identity between two speech samples; and

FIGS. 8A, 8B and 8C are graphic illustrations of speech parameters of an unknown and a reference talker. FIG. 8A illustrates the time-normalized parameters before the nonlinear time-warping procedure. FIG. 8B illustrates parameters for a reference and specimen utterance which match after time registration using the second formant as the guide function. FIG. 8C illustrates a fully time-normalized set of parameters for an impostor, i.e., a no-match condition.

DETAILED DESCRIPTION

A system for verifying an individual's claimed identity is shown schematically in FIG. 1. A library of reference utterances is established to maintain a voice standard for each individual subscriber to the system. A later claim of identity is verified by reference to the appropriate stored reference utterance. Accordingly, an individual speaks a reference sentence, for example, by way of a microphone at a subscriber location, or over his telephone to a central location (indicated generally by the reference numeral 10). Although any reference phrase may be used, the phrase should be capable of representing a number of prosodic characteristics and variations of his speech. Since vowel or voiced sounds contain a considerable number of such features, the reference sentence, "We were away a year ago." has been used in practice. This phrase is effective, in part, because of its lack of nasal sounds and its totally voiced character. Moreover, it is long enough to require more than passing attention to temporal registration, and is short enough to afford economical analysis and storage.

Whatever the phrase spoken by the individual to establish a standard, it is delivered to speech analyzer 11, of any known construction, wherein a number of different acoustic parameters are derived to represent it. For example, individual formant frequencies, amplitudes, and pitch, at the Nyquist rate are satisfactory. These speech parameters are commonly used to synthesize speech in vocoder apparatus and the like. One entirely suitable speech signal analyzer is described in detail in a copending application of L. R. Rabiner and R. W. Schafer, Ser. No. 872,050, filed Oct. 29, 1969. In essence, analyzer 11 includes individual channels for identifying formant frequencies F1, F2, F3, pitch period P, and gain G control signals. In addition, fricative identifying signals may be derived if desired.

In order that variations in the manner in which the individual speaks the phrase may be taken into account, it is preferable to have him repeat the reference sentence a number of times in order that an average set of speech parameters may be prepared. It is convenient to analyze the utterance as it is spoken, and to adjust the duration of the utterance to a standard length T. Typically, a two-second sample is satisfactory. Each spoken reference sentence therefore is either stretched or contracted in apparatus 12 to adjust it to the standard duration. Each adjusted set of parameters is then stored either as analog, or after conversion, as digital signals, for example, in unit 12. When all of the test utterances have been analyzed and brought into mutual time registration, an average set of parameters is developed in averaging apparatus 13. The single resultant set of reference parameter values is then stored for future use in storage unit 15.

In addition, a set of variance signals is prepared and stored in unit 15. Variance values are developed, in the manner described hereinafter, for parameters in each of a number of time segments within the span of the reference utterance to indicate the extent of any difference in the manner in which the speaker utters that segment of the test phrase. Hence, variance values provide a measure of the reliability with which parameters in different segments may be used as a standard.

It is evident that a non-vocal identification of each individual is also stored, preferably in library store 15. The identification may be either in the form of a separate address or some other key to the storage location of the reference utterance for each individual. Any form of addressing well known to those in the art may be employed. Moreover, it is evident that store 15 may be in the form of a library of reference data for any number of subscribers to the verification service.

When it is desired to verify a claim of identity by an individual, for example, at the time a credit card transaction takes place at a retail store or the like, two separate entries are made in the system. First, the individual identifies himself, for example, by means of his name and address or his credit card number. This data is entered into reading unit 16, of any desired construction, in order that a request to verify may be initiated. Secondly, upon command, i.e., a ready light from unit 17, the individual speaks the reference sentence. These operations are indicated generally by block 18 in FIG. 1. The sample voice signal is delivered to analyzer 19 where it is broken down to develop parameter values equivalent to those previously stored for him. Analyzer 19, accordingly, should be of identical construction to analyzer 11, and preferably, is located physically at the central processing station. The resultant set of sample parameter values are thereupon delivered to unit 17 to initiate all subsequent operations.

Since it is unlikely that the sample utterance will be in time registration with the reference sample, it is necessary to adjust its time scale to bring it into temporal alignment with the reference. This operation is carried out in time adjustment apparatus 20. In essence, iterative processing is employed to maximize the similarity between the specimen parameters and the reference parameters. Similarity may be measured by the coefficient of correlation between the sample and reference. Sample parameters are initially adjusted to start and stop in registry with the reference. It is also in accordance with the invention to match the time spread of variables within the speech sample. Internal time registration is achieved by a nonlinear process which maximizes the similarity between the sample and the reference by way of a monotonic continuous transformation of time.

Accordingly, values of the sample signal parameters, s(t), alleged to be the same as reference signal r(t), are delivered to adjustment apparatus 20. They are remapped, i.e., converted, by a substitution process to values s(τ) where

τ(t) = a + bt + q(t). (1)

In the equation, coefficients a and b are determined so as to cause the end points of the sample to coincide with those of the reference when q(t) is zero. The function q(t) defines the character of the time scale transformation between the end points of the utterance. In practice, q(t) may be a continuous piece-wise linear function. The time adjustment operation is illustrated graphically in FIG. 3. A reference function r(t) extends through the period 0 to T. It is noted, however, that the sample s(t) is considerably shorter in duration. It is necessary, therefore, to stretch it to a duration T. This is done by means of the substitute function τ(t) shown in the third line of the illustration. A so-called gradient climbing procedure may be employed in which values q_i at times t_i are varied in order that values of q_i and t_i may be found that maximize the normalized correlation I between the reference speech and the sample speech, where

The symbols <> denote a time average value of the enclosed expression. By thus maximizing the correlation between the two, a close match between prominent features in the utterance, e.g., formants, pitch, and intensity values, is achieved.
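Two relations implied by this passage can be written out explicitly. Both are offered as assumptions rather than as the patent's own expressions: the first assumes the sample occupies the interval from t_1 to t_2 (hypothetical symbols) while the reference occupies 0 to T, and the second is the conventional form of a normalized correlation expressed with time averages.

```latex
% End-point alignment with q(t) = 0 (t_1, t_2 are hypothetical symbols for the sample's end points):
\tau(0) = a = t_1, \qquad \tau(T) = a + bT = t_2
\quad\Longrightarrow\quad a = t_1, \qquad b = \frac{t_2 - t_1}{T}.

% Assumed conventional form of a normalized correlation built from time averages:
I = \frac{\langle r(t)\, s(\tau) \rangle}
         {\sqrt{\langle r^{2}(t) \rangle \, \langle s^{2}(\tau) \rangle}} .
```

For example, a sample running from t_1 = 0.2 s to t_2 = 1.7 s, registered against a reference of length T = 2.0 s, would give a = 0.2 and b = 0.75 under these assumptions.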

Details of the time normalization process are described hereinafter with reference to FIG. 5. Suffice it to say at this point that the substitute values of the sample s(τ), together with values of r(t) and variance values σ², are delivered to measurement apparatus 25. Values of q_i and t_i, which reflect the amount of non-linear squeezing used to maximize I, are delivered to measurement apparatus 26.

Since the reference speech and the sample speech are now in time registry, it is possible to measure internal similarities between the two. Accordingly, a value is prepared in measurement apparatus 25 which denotes the internal dissimilarities between the two speech signals. Similarly, a measure is prepared in apparatus 26 which denotes the extent of warping required to bring the two into registry. If the dissimilarities are found to be small, it is likely that a match has been found. Yet, if the warping function value is extremely high, there is a likelihood that the match is a false one, resulting solely from extensive registration adjustment. The two measures of dissimilarity are combined in apparatus 27 and delivered to comparison unit 28 wherein a judgment is made in accordance with preestablished rules, i.e., threshold levels, balanced between similarity and inconsistencies. An "accept" or "reject" signal is thereupon developed. Ordinarily, this signal is returned to unit 16 to verify or reject the claim of identity made by the speaker.
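The combination of the two kinds of measure can be illustrated with a simple weighted sum. The patent does not state the combination rule at this point, so the sketch below is purely an assumption: `warp_extent` summarizes the warping as the mean absolute warp amplitude, and `warp_weight` is a hypothetical tuning constant.

```python
def warp_extent(q):
    """Mean absolute departure of the warp from the purely linear mapping (q = 0)."""
    return sum(abs(v) for v in q) / len(q)

def overall_distance(dissimilarities, q, warp_weight=1.0):
    """Combine internal dissimilarities with a penalty for heavy warping.

    A sample that matches only after extensive warping receives a larger
    overall distance than one that matches with little adjustment.
    """
    return sum(dissimilarities) + warp_weight * warp_extent(q)
```

The penalty term reflects the point made above: a small internal dissimilarity is less convincing when it was obtained only through extensive registration adjustment.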

It is evident that there is redundancy in the apparatus illustrated in FIG. 1. Thus, for example, analyzer 11 is used only to prepare reference samples. It may, of course, be switched as required to analyze identity claim samples. Such an arrangement is illustrated in FIG. 2. Reference and sample information is routed by way of switch 29a to analyzer 11 and delivered by way of switch 29b to the appropriate processing channel. Other redundancies within the apparatus may, of course, be minimized by judicious construction. Moreover, it is evident that all of the operations described may equally well be performed on a computer. All of the steps and all of the apparatus functions may be incorporated in a program for implementation on a general-purpose or dedicated computer. Indeed, in practice, a computer implementation has been found to be most effective. No unusual programming steps are required for carrying out the indicated operations.

FIG. 4 illustrates the manner in which acceptance or rejection of a sample is established. Since absolute discrimination between reference and sample values would require near perfect matching, it is evident that a compromise must be used. FIG. 4 indicates, therefore, the error rate of the verification procedure as a function of the value of the "dissimilarity" measure between the reference and sample, taken as a threshold value for acceptance or rejection. A compromise value is selected that then determines the number of true matches that are rejected, i.e., customers whose claim to identity is disallowed, versus the number of impostors whose claim to identity is accepted. Evidently, the crossover point may be adjusted in accordance with the particular identification application.
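The trade-off shown in FIG. 4 can be sketched numerically. The following is a minimal illustration, not the procedure of the patent: the function names and the sample scores are hypothetical, and it assumes that a claim is accepted when its dissimilarity value is at or below the threshold.

```python
from typing import List, Tuple


def error_rates(true_dists: List[float], impostor_dists: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false_reject_rate, false_accept_rate) for one threshold.

    Assumption: a claim is accepted when its dissimilarity <= threshold.
    """
    false_rejects = sum(1 for d in true_dists if d > threshold)
    false_accepts = sum(1 for d in impostor_dists if d <= threshold)
    return (false_rejects / len(true_dists),
            false_accepts / len(impostor_dists))


def crossover_threshold(true_dists: List[float],
                        impostor_dists: List[float]) -> float:
    """Pick the observed score at which the two error rates are closest."""
    def gap(th: float) -> float:
        fr, fa = error_rates(true_dists, impostor_dists, th)
        return abs(fr - fa)

    candidates = sorted(set(true_dists) | set(impostor_dists))
    return min(candidates, key=gap)


# Made-up dissimilarity values, for illustration only.
true_speaker = [0.8, 1.1, 1.3, 1.9, 2.4]
impostors = [1.7, 2.6, 3.0, 3.3, 4.1]
th = crossover_threshold(true_speaker, impostors)
print(th, error_rates(true_speaker, impostors, th))
```

Moving the threshold below the crossover value rejects more true claims and admits fewer impostors; moving it above does the reverse.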

FIG. 5 illustrates in block schematic form the operations employed in accordance with the invention for registering the time scales of a reference utterance, in parameter value form, with the sample utterance in like parameter value form. It is evident that the Figure illustrates a hardware implementation. The Figure also constitutes a working flow chart for an equivalent computer program. Indeed, FIG. 5 represents the flow chart of the program that has been used in the practice of the invention. As with the overall system, no unusual programming steps are required to implement the arrangement.

The system illustrated in FIG. 5 corresponds generally to the portion of FIG. 1 depicted in unit 20. Reference values of speech signal parameters r(t) from store 15 are read into store 51 as a set. Similarly, samples from analyzer 19 are stored in unit 52. In order to register the time scale of the samples with those of the reference, samples s(t) are converted into a new set of values s(.tau.) in transformation function generator 53. This operation is achieved by developing, in generator 54, values of s(.tau.) as discussed above in Equation (1). Coefficients a and b are determined to cause the end points of the sample utterance, as determined for example by speech detector 55, to coincide with those of the reference when q(t) is zero. Detector 55 issues a first marker signal at the onset of the sample and a second marker signal at the cessation of the utterance. These signals are delivered directly to generator 54. Values of q.sub.i and t.sub.i for the interval between the terminal points of the utterance are initially entered into the system as prescribed sets of constants q.sub.io and t.sub.io. These values are delivered to OR gates 56 and 57, respectively, and by way of adders 58 and 59 to the input of generator 54. Accordingly, with these initial values, a set of values .tau.(t) is developed in generator 54 in accordance with Equation (1). Values of the specimen s(t) are thereupon remapped in generator 53 according to the functions developed in generator 54 to produce a time-warped version of the sample, designated s(.tau.).
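The remapping performed by generators 53 and 54 may be sketched as follows. This is an illustration under stated assumptions, not the program of the patent. It assumes that the transformation has the form tau(t) = a + b*t + q(t), that q(t) is piecewise linear through the points (t_i, q_i) and zero at both ends of the reference interval, and that the sample is stored at a uniform spacing dt. The names `interp` and `warp` and all argument names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def interp(x: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Linear interpolation of the points (xs, ys) at x, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x) - 1
    frac = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + frac * (ys[k + 1] - ys[k])


def warp(sample: Sequence[float], dt: float, onset: float, end: float,
         T: float, t_knots: Sequence[float], q_knots: Sequence[float],
         n_out: int) -> List[float]:
    """Return s(tau(t)) on n_out (>= 2) uniform points t in [0, T].

    a and b map the reference interval [0, T] onto the detected
    sample interval [onset, end] when q(t) is zero.
    """
    a = onset
    b = (end - onset) / T
    knot_t = [0.0, *t_knots, T]     # q is pinned to zero at both ends
    knot_q = [0.0, *q_knots, 0.0]
    sample_t = [k * dt for k in range(len(sample))]
    out = []
    for k in range(n_out):
        t = T * k / (n_out - 1)
        tau = a + b * t + interp(t, knot_t, knot_q)
        out.append(interp(tau, sample_t, sample))
    return out


# A ramp s(t) = t sampled every 0.1 s; speech detected from 0.5 s to 2.5 s.
ramp = [0.1 * k for k in range(31)]
print(warp(ramp, 0.1, 0.5, 2.5, 2.0, [1.0], [0.2], 3))
```

Because the sample in the usage line is a ramp, each output value equals the warped time itself, so the effect of the single interior value q = 0.2 at t = 1.0 can be read directly from the middle output.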

Values of s(.tau.) are next compared with the reference samples to determine whether or not the transformed specimen values have been brought into satisfactory time alignment with the reference. The normalized correlation I, as defined above in Equation (2), is used for this comparison. Since I is developed on the basis of root mean square values of the sample functions, the necessary algebraic terms are prepared by developing a product in multiplier 60 and summing the resultant over the period T in accumulator 61. This summation establishes the numerator of Equation (2). Similarly, values of r(t) and s(t) are squared, integrated, and rooted, respectively, in units 62, 63, 64, and 65, 66, and 67. The two normalized values are delivered to multiplier 68 to form the denominator of Equation (2). Divider network 69 then delivers as its output a value of the normalized correlation function I in accordance with Equation (2). It indicates the similarity of the sample utterance to the reference utterance.
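A discrete counterpart of the computation carried out by units 60 through 69 can be sketched as follows. It assumes that both functions are available as equally spaced samples over the period T, so that the integrals reduce to sums; the function name is hypothetical.

```python
import math
from typing import Sequence


def normalized_correlation(r: Sequence[float], s: Sequence[float]) -> float:
    """Sum of products divided by the product of root-sum-square values."""
    numerator = sum(ri * si for ri, si in zip(r, s))           # units 60, 61
    root_r = math.sqrt(sum(ri * ri for ri in r))               # units 62-64
    root_s = math.sqrt(sum(si * si for si in s))               # units 65-67
    denominator = root_r * root_s                              # unit 68
    return numerator / denominator if denominator > 0 else 0.0  # unit 69


print(normalized_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The value does not depend on the overall amplitude of either function, so a sample that differs from the reference only in level still yields a value near one.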

The degree of sensitivity to the change produced by substituting the values s(.tau.) for s(t) in generator 54 is then measured in units 70 and 71. The sensitivity calculation is most conveniently carried out by evaluating the partial derivative of the correlation function I with respect to the indicated changes in the values of q and t for the previous values supplied to generator 54. Accordingly, the partial derivative values of I with respect to q and with respect to t are prepared and delivered to multipliers 72 and 73, respectively. These values are equalized by multiplying by constants Cq and Ct in order to enhance the subsequent evaluation. These products constitute incremental values of q and t. The mean squares of the sets of values q.sub.i and t.sub.i are thereupon compared in gates 74 and 75 to selected small constants U and V. Constants U and V are selected to indicate the required degree of correlation that will assure a low error rate in the ultimate decision. If either of the comparisons is unsatisfactory, incremental values of q.sub.i or t.sub.i, or both, are returned to adders 58 and 59. The previously developed values q.sub.i and t.sub.i are incremented thereby to provide a new set of values as inputs to function generator 54. Values of s(.tau.) are thereupon developed using the new data and the process is repeated. In essence, the values of q at intervals t as shown in FIG. 3 are individually altered to determine an appropriate set of values for maximizing the correlation between the reference utterance and the altered sample utterance.

FIG. 6 illustrates the correlating operation mathematically. The relationships are those used in a computer program used to implement the steps discussed above.

Thus, each q.sub.i and t.sub.i is adjusted until a further change in its value produces only a small change in correlation. When the change is sufficiently small, the last generated value is held, e.g., in generator 54. When the sensitivity measures are found to meet the above-discussed criterion, i.e., maximum normalized correlation, gates 74 and 75 deliver the last values of s(.tau.) by way of AND gate 78 to store 79. These values then are used as the time registered specimen samples and, in the apparatus of FIG. 1, are delivered to dissimilarity measuring apparatus 25. The values of q at time t from function generator 54 are similarly delivered, for example, by way of a gate (not shown in the Figure to avoid undue complexity) energized by the output of gate 78 to function measurement apparatus 26 of FIG. 1.
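The adjustment loop may be sketched as a simple gradient ascent. This is an illustration only and several simplifications are assumptions: only the values q_i are adjusted (the t_i are held fixed), time is normalized to the interval 0 to 1 with tau(t) = t + q(t), the partial derivatives are estimated by finite differences, and a step that fails to raise the correlation is retried with a halved constant, a safeguard that is not described in the specification. All function and parameter names are hypothetical.

```python
import math
from bisect import bisect_right


def interp(x, xs, ys):
    """Linear interpolation of the points (xs, ys) at x, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x) - 1
    return ys[k] + (x - xs[k]) / (xs[k + 1] - xs[k]) * (ys[k + 1] - ys[k])


def correlation(r, s):
    """Normalized correlation of two equally long sequences."""
    den = math.sqrt(sum(v * v for v in r)) * math.sqrt(sum(v * v for v in s))
    return sum(a * b for a, b in zip(r, s)) / den if den > 0 else 0.0


def warped(sample, knot_t, q, n):
    """s(tau) on n uniform points of [0, 1], with tau(t) = t + q(t)."""
    ts = [0.0, *knot_t, 1.0]
    qs = [0.0, *q, 0.0]
    src = [k / (len(sample) - 1) for k in range(len(sample))]
    grid = [k / (n - 1) for k in range(n)]
    return [interp(t + interp(t, ts, qs), src, sample) for t in grid]


def register(ref, sample, knot_t, c_q=0.05, u=1e-10, eps=1e-4, max_iter=200):
    """Adjust q_i until the mean square increment falls below u."""
    def score(qv):
        return correlation(ref, warped(sample, knot_t, qv, len(ref)))

    q = [0.0] * len(knot_t)
    best = score(q)
    for _ in range(max_iter):
        grads = []
        for i in range(len(q)):
            bumped = list(q)
            bumped[i] += eps
            grads.append((score(bumped) - best) / eps)   # dI/dq_i estimate
        steps = [c_q * g for g in grads]                 # increments of q_i
        if sum(d * d for d in steps) / len(steps) < u:
            break                                        # criterion met
        trial = [qi + d for qi, d in zip(q, steps)]
        trial_score = score(trial)
        if trial_score > best:
            q, best = trial, trial_score
        else:
            c_q *= 0.5                                   # assumed safeguard
    return q, best


def bump(t):
    return math.sin(math.pi * t) ** 2


reference = [bump(k / 40) for k in range(41)]
specimen = [bump(k / 40 + 0.05 * math.sin(math.pi * k / 40)) for k in range(41)]
q_final, i_final = register(reference, specimen, [0.25, 0.5, 0.75])
print(q_final, i_final)
```

The returned list plays the part of the final values q.sub.i held in generator 54, and the returned correlation is the value reached when the stopping criterion is met.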

With the sample speech utterance appropriately registered with the averaged reference speech utterance, it is then in accordance with the invention to assess the similarities between the two and to develop a single numerical evaluation of them. The numerical evaluation is used to accept or reject the claim of identity. For convenience, it has been found best to generate a measure of dissimilarity such that a numerical value of zero denotes a perfect match between the two, and progressively higher numerical values denote greater degrees of dissimilarity. Such a value is sometimes termed a "distance" value.

To provide a satisfactory measure of dissimilarity, the two registered utterances are examined in a variety of different ways and at a variety of different locations in time. The resultant measures are combined to form a single distance value. One convenient way of assessing dissimilarity comprises dividing the interval of the utterances, 0 to T, into N equal intervals. If T = 2 seconds, as discussed in the above example for a typical application, it is convenient to divide the interval into N = 20 equal parts. FIG. 7 illustrates such a subdivision. Each subdivision i is then treated individually and a number of measures of dissimilarity are developed. These are based on (1) differences in average values between the reference speech r(t) and the registered sample s(.tau.), (2) the differences between linear components of variation of the two functions, (3) differences between quadratic components of variations of the two functions, and (4) the correlation between the two functions. In addition, a correlation coefficient (5) over the entire interval is obtained. Five such evaluations are made for each of the speech signal parameters used in representing the utterances. Thus, in the example of practice discussed herein, five evaluations are made for each of the formants F.sub.1, F.sub.2 and F.sub.3, for the pitch P of the signal, and for its gain G. Accordingly, 25 individual signal values of dissimilarity are produced.

It has also been found that the reliability of these measures varies between individual segments of the utterances. That is to say, certain speakers appreciably vary the manner in which they deliver certain portions of an utterance but are relatively consistent in delivering other portions. It is preferable therefore to use the most reliable segments for matching purposes and to reduce the relative weight of, or eliminate entirely, the measures in those segments known to be unreliable. The degree of reliability in each segment is based on the variance between the reference speech signal in each segment for each of the several reference utterances used in preparing the average reference in unit 13 of FIG. 1. The average values are thus compared and a value .sigma..sup.2, representative of the variance, is developed and stored along with values r(t) in storage unit 15.
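The preparation of the averaged reference and of the per-segment variance may be sketched as follows. The sketch assumes that the several reference utterances have already been brought to a common length and time registry, that the variance for a segment is taken across utterances over the segment averages, and that the population form of the variance is used. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def reference_statistics(utterances: Sequence[Sequence[float]],
                         n_segments: int) -> Tuple[List[float], List[float]]:
    """Return (averaged reference track, per-segment variance)."""
    count = len(utterances)
    length = len(utterances[0])
    mean_track = [sum(u[k] for u in utterances) / count
                  for k in range(length)]
    seg_len = length // n_segments
    variances = []
    for i in range(n_segments):
        lo, hi = i * seg_len, (i + 1) * seg_len
        # Average of each reference utterance within segment i.
        seg_means = [sum(u[lo:hi]) / seg_len for u in utterances]
        centre = sum(seg_means) / count
        variances.append(sum((m - centre) ** 2 for m in seg_means) / count)
    return mean_track, variances


# Two made-up reference tracks that agree early and differ late.
tracks = [[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 4.0, 4.0]]
print(reference_statistics(tracks, 2))
```

A segment in which the reference utterances agree closely receives a small variance and hence a large weight in the later scaling; a segment in which they differ receives a large variance and a small weight.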

Dissimilarity measurement apparatus 25 thus is supplied with the functions r(t), s(.tau.), and .sigma..sup.2. It performs the necessary mathematical evaluation to divide the functions into N equal parts and to compute a measure of the squared difference in average values of the reference utterance and adjusted sample utterance, the squared difference in linear components between the two (also designated "slope"), the squared difference in quadratic components between the two (also designated "curvature"), and the correlation between the two. Each of the measures is scaled in accordance with the reliability factor as measured by the variance .sigma..sup.2, discussed above.

The equations which define these mathematical evaluations are set forth in FIG. 7. In the equations, the subscripts r and s refer, respectively, to the reference utterance and the warped sample utterance, and the functions x, y, and z are the coefficients of the first three terms of an orthogonal polynomial expression of the corresponding utterance value. The symbol .rho..sub.rs represents the correlation coefficient between the sample and reference functions computed over the full length of the sample. The function .rho..sub.rs,i represents the correlation coefficient between the sample and reference computed for the ith segment. Similarly, .sigma..sup.2 represents the variance of the reference parameters computed for the entire set of reference utterances used to produce the average. The numerical evaluation for each of these measures is combined to form a single number and a signal representative of the number is delivered to combining network 27.
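The five evaluations for one speech parameter may be sketched as follows. The equations of FIG. 7 are not reproduced in the text, so the particular forms used here are assumptions: the coefficients x, y, and z are taken from discrete orthogonal polynomials on each segment, each segment term is divided by the variance for that segment, the correlation measures are expressed as one minus the correlation coefficient, and the segment terms are summed. The variances are assumed to be positive and each segment is assumed to hold at least three points. All function names are hypothetical.

```python
import math


def poly_coeffs(seg):
    """Coefficients (x, y, z) of constant, linear and quadratic terms."""
    m = len(seg)
    centre = (m - 1) / 2
    p1 = [k - centre for k in range(m)]          # linear polynomial
    mean_sq = sum(v * v for v in p1) / m
    p2 = [v * v - mean_sq for v in p1]           # quadratic, orthogonal to 1, p1
    x = sum(seg) / m
    y = sum(f * p for f, p in zip(seg, p1)) / sum(p * p for p in p1)
    z = sum(f * p for f, p in zip(seg, p2)) / sum(p * p for p in p2)
    return x, y, z


def pearson(a, b):
    """Correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den if den > 0 else 0.0


def five_measures(ref, smp, variances, n_segments):
    """Return (average, slope, curvature, segment corr., overall corr.)."""
    seg_len = len(ref) // n_segments
    d_avg = d_slope = d_curv = d_corr = 0.0
    for i in range(n_segments):
        lo, hi = i * seg_len, (i + 1) * seg_len
        xr, yr, zr = poly_coeffs(ref[lo:hi])
        xs, ys, zs = poly_coeffs(smp[lo:hi])
        w = 1.0 / variances[i]                   # assumed reliability scaling
        d_avg += w * (xr - xs) ** 2
        d_slope += w * (yr - ys) ** 2
        d_curv += w * (zr - zs) ** 2
        d_corr += w * (1.0 - pearson(ref[lo:hi], smp[lo:hi]))
    d_full = 1.0 - pearson(ref, smp)
    return d_avg, d_slope, d_curv, d_corr, d_full


r_track = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
s_track = [v + 1.0 for v in r_track]             # constant offset only
print(five_measures(r_track, s_track, [1.0, 1.0], 2))
```

In the usage line the sample differs from the reference by a constant offset, so only the average-value measure departs appreciably from zero.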

Although the numerical value of dissimilarity thus prepared is sufficient to permit a reasonably reliable verification decision to be made, it is evident that the sample may have been adjusted severely to maximize the correlation between it and the reference. The degree of adjustment used constitutes another clue as to the likelihood of identity between the sample and the reference. If the warping values q.sub.i and t.sub.i were excessively large, it is less likely that the sample corresponds to the reference than if maximum correlation was achieved with less severe warping. Accordingly, the final values of q and t developed in generator 24 (FIG. 1) are delivered to measurement apparatus 26. Three measures of warping are thereupon prepared in apparatus 26.

For convenience an expression for the amount of warping employed is defined as

Typically, 10 values of .tau. are employed so that 10 values of A are produced. These values are averaged to get a single numerical value A.sub.avg = X. A value of X is developed for each of the reference speech utterances used to prepare the average. All values of X are next averaged over each of the N reference utterances to produce a mean value \bar{X}. A first measure of "distance" for warping is then evaluated as

D.sub.1 = (X - \bar{X}).sup.2. (4)

In similar fashion, a number Y representative of the linear component of variation in the values of A is prepared, and a quadratic component of variation is evaluated as Z. A second measure of distance is then evaluated as

D.sub.2 = Z.sup.2. (5)

Finally, a third measure of distance is developed as

where t.sub.m is the value of t at the midpoint of the utterance.

The three warping distance measures, D.sub.1, D.sub.2, and D.sub.3 from system 26 are then delivered together with 25 dissimilarity measures from system 25 to combining unit 27 wherein a single distance measure is developed. Preferably, each of the individual distance values is suitably weighted. If the weighting function is equal to one for each distance value, a simple summation is performed. Other weighting systems may be employed in accordance with experience, i.e., the error rate experienced in verifying claims of identity of those references accommodated by the system.
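The combining operation of unit 27 amounts to a weighted sum. A minimal sketch follows; the function name and the example values are hypothetical, and the choice of weights is left open, as in the text.

```python
from typing import Optional, Sequence


def combine(distances: Sequence[float],
            weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of the individual distance values.

    With no weights given, every weight is one and the result is a
    simple summation.
    """
    if weights is None:
        weights = [1.0] * len(distances)
    if len(weights) != len(distances):
        raise ValueError("one weight is required per distance value")
    return sum(w * d for w, d in zip(weights, distances))


# Three made-up distance values, first unweighted and then weighted.
print(combine([1.0, 2.0, 3.0]))
print(combine([1.0, 2.0, 3.0], [0.5, 1.0, 2.0]))
```

In the arrangement described, the input would hold 28 values: the 25 dissimilarity measures and the three warping measures.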

The warping function measurements are therefore delivered to combining network 27 where they are combined with the numerical values developed in apparatus 25. The composite distance measure is thereupon used in threshold comparison network 28 to determine whether the sample speech should be accepted or rejected as being identical with the reference, i.e., to verify or reject the claim of identity. Since the distance measure is in the form of a numerical value, it may be matched directly against a stored numerical value in apparatus 28. The stored threshold value is selected to distribute the error possibility between a rejection of true claims of identity versus the acceptance of false claims of identity as illustrated in FIG. 4, discussed above. It is also possible that the distance value is too close to the threshold limit to permit a positive decision to be made. In this case, i.e., in an intermediate zone between accept and reject, a "no decision" mark is issued. This may be used to request a repeat of the sample utterance. If a new test sample is delivered, the entire process is repeated. Alternatively, the no-decision signal may be used to suggest that additional information about the individual claiming identity is needed, e.g., in the form of other tangible identification.
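The three-way outcome of comparison network 28 may be sketched as follows. The width of the intermediate zone is not specified in the text; the `margin` parameter and the function name are assumptions made for illustration.

```python
def decide(distance: float, threshold: float, margin: float) -> str:
    """Return "accept", "reject", or "no decision".

    Values within `margin` of the threshold fall in the intermediate
    zone, in which no positive decision is made.
    """
    if distance < threshold - margin:
        return "accept"
    if distance > threshold + margin:
        return "reject"
    return "no decision"


for value in (1.2, 2.0, 3.1):
    print(value, decide(value, threshold=2.0, margin=0.25))
```

A "no decision" result would prompt either a repeat of the sample utterance or a request for other identification, as described above.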

FIGS. 8A, 8B and 8C illustrate the overall performance of the system of the invention based on data developed in practice. In FIG. 8A, waveforms of the sample sentence "We were away a year ago." are shown for the first three formants, for the pitch period, and for signal gain, both for a sample utterance and for an averaged reference utterance. It will be observed that the waveforms of the sample and reference are not in time registry. FIG. 8B illustrates the same parameters after time adjustment, i.e., after warping, for a sample utterance determined to be substantially identical to the reference. In this case, the dissimilarity measure is sufficiently low to yield an "accept" signal, thus to verify the claim of identity. In FIG. 8C, the sample and reference utterances of the test sentence have been registered; yet it is evident that severe disparities are present between the two. Hence, the resulting measure of dissimilarity is sufficiently high to yield a "reject" signal.

Since the basic features of the invention involve the computation of certain numerical values and certain comparison operations, it is evident that the invention may most conveniently be turned to account by way of a suitable program for a computer. Indeed, the block schematic diagrams of FIGS. 1 and 5, together with the mathematical relationships set forth in the specification and figures constitute in essence a flowchart diagram illustrative of the programming steps used in the practice of the invention.
