U.S. patent number 3,700,815 [Application Number 05/135,697] was granted by the patent office on 1972-10-24 for automatic speaker verification by non-linear time alignment of acoustic parameters.
This patent grant is currently assigned to Bell Telephone Laboratories, Incorporated. Invention is credited to George Rowland Doddington, James Loton Flanagan, Robert Carl Lummis.
United States Patent 3,700,815
Doddington, et al.
October 24, 1972
AUTOMATIC SPEAKER VERIFICATION BY NON-LINEAR TIME ALIGNMENT OF
ACOUSTIC PARAMETERS
Abstract
Speaker verification, as opposed to speaker identification, is
carried out by matching a sample of a person's speech with a
reference version of the same text derived from prerecorded samples
of the same speaker. Acceptance or rejection of the person as the
claimed individual is based on the concordance of a number of
acoustic parameters, for example, formant frequencies, pitch period
and speech energy. The degree of match is assessed by time aligning
the sample and reference utterance. Time alignment is achieved by a
nonlinear process which maximizes the similarity between the sample
and reference through a piece-wise linear continuous transformation
of the time scale. The extent of time transformation that is
required to achieve maximum similarity also influences the decision
to accept or reject the identity claim.
Inventors: Doddington; George Rowland (Richardson, TX), Flanagan; James Loton (Somerset, NJ), Lummis; Robert Carl (Berkeley Heights, NJ)
Assignee: Bell Telephone Laboratories, Incorporated (Murray Hill, NJ)
Family ID: 22469241
Appl. No.: 05/135,697
Filed: April 20, 1971
Current U.S. Class: 704/246; 704/238; 704/241; 704/243; 704/E15.016
Current CPC Class: G10L 15/12 (20130101); G07C 9/37 (20200101); G10L 15/00 (20130101)
Current International Class: G10L 15/12 (20060101); G10L 15/00 (20060101); G07C 9/00 (20060101); G10l 001/02 ()
Field of Search: 179/1SA, 1SB, 15.55R, 15.55T; 340/148
Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Leaheey; Jon Bradford
Claims
What is claimed is:
1. In an auditory verification system in which acoustic parameters
of a test sample of an individual's speech are matched for identity
to like parameters of a reference sample of his speech, that
improvement which includes the steps of:
time adjusting said test sample parameters with said reference
parameters according to a nonlinear registration schedule,
measuring internal dissimilarities and irregularities between said
time adjusted parameters and said average reference parameters,
and
verifying said individual's identity on the basis of said measures
of dissimilarities and irregularities.
2. In an auditory verification system in which acoustic parameters
of a test sample of an individual's speech are matched for identity
to corresponding parameters of a reference sample of his speech,
that improvement which comprises the steps of:
preparing said reference sample of an individual's speech from sets
of parameters developed from a plurality of different utterances of
a test phrase by said individual which have been mutually
registered in time, and from a plurality of measures of variation
between said different utterances,
measuring internal dissimilarities and irregularities between said
test sample parameters and said reference parameters, and
verifying said individual's identity on the basis of said measures
of dissimilarities and irregularities.
3. In an auditory verification system in which acoustic parameters
of a test sample of an individual's speech are matched for identity
to like parameters of a reference sample of his speech, that
improvement which comprises the steps of:
developing said reference sample from time registered values of a
plurality of different speech signal parameters,
developing a like plurality of different speech signal parameters
from said test speech sample,
time adjusting said test sample parameters with said reference
parameters according to a nonlinear registration schedule,
measuring internal dissimilarities and irregularities between said
time adjusted parameters and said average reference parameters,
and
verifying said individual's identity on the basis of said measures
of dissimilarities and irregularities.
4. In a speech signal verification system wherein selected speech
signal parameters derived from a test phrase spoken by an
individual to produce a sample are compared to reference parameters
derived from the same test phrase spoken by the same individual,
and wherein verification or rejection of the identity of the
individual is determined by the similarities of said sample and
reference parameters,
means for bringing the time span of said sample parameters into
temporal registration with the time span of said reference
parameters, and
means for temporally adjusting the time distribution of parameters
of said sample within said adjusted time span to maximize
similarities between said sample parameters and said reference
parameters.
5. The speech signal verification system as defined in claim 4,
wherein said similarities between said sample parameters and said
reference parameters are measured by the coefficient of correlation
therebetween.
6. The speech signal verification system as defined in claim 5,
wherein said temporal adjustment of parameters within said adjusted
time span comprises,
means for iteratively incrementing the time locations of selected
parameter features until said measure of correlation between said
sample parameters and said reference parameters does not increase
significantly for a selected number of consecutive iterations.
7. The speech signal verification system as defined in claim 4,
wherein said means for temporally adjusting said parameters within
said adjusted time span comprises,
means for temporally transforming said sample parameters,
designated s(t), into a set of parameters, designated s(τ), in
which τ(t) = a + bt + q(t), in which a and b are constants selected
to align the end points of said time span of said sample parameters
with the end points of said reference parameters, and in which q(t)
is a nonlinear function which defines the distribution of parameter
values within said time span.
8. The speech signal verification system as defined in claim 7,
wherein said nonlinear function q(t) is a continuous piece-wise
linear function described by N selected amplitude values q_i
and time values t_i within said time span, wherein i = 0, 1, ..., N.
9. An auditory speech signal verification system, which comprises,
in combination,
means for analyzing a plurality of individual utterances of a test
phrase spoken by an individual to develop a prescribed set of
acoustic parameter signals for each utterance,
means for developing from each of said sets of parameter signals a
reference set of parameter signals and a set of signals which
denotes variations between parameter signals used to develop said
reference set of signals,
means for storing a set of reference parameter signals and a set of
variation signals for each of a number of different
individuals,
means for analyzing a sample utterance of said test phrase spoken
by an individual purported to be one of said number of different
individuals to develop a set of acoustic parameter signals,
means for adjusting selected parameter signals of said sample to
bring the time scale of said utterance represented by said
parameters into registry with the time scale of a designated one of
said stored reference utterances represented by said reference
parameters,
said means including means for adjusting selected values of said
sample parameter signals to maximize similarities between said
sample utterance and said reference utterance,
means responsive to said reference parameter signals, said adjusted
sample parameter signals, and said variation signals for developing
a plurality of signals representative of selected similarities
between each of said sample parameters and each of said
corresponding reference parameters,
means for developing signals representative of the extent of
adjustment employed to register said time scales,
means responsive to said plurality of similarity signals and said
signals representative of the extent of adjustment for developing a
signal representative of the overall degree of similarity between
said sample utterance and said designated reference utterance,
and
threshold comparison means supplied with said overall similarity
signal for matching the magnitude of said similarity signal to the
magnitude of a stored threshold signal, and for issuing an "accept"
signal for similarity signals above threshold, a "reject" signal
for signals below threshold, and a "no decision" signal for
similarity signals within a prescribed narrow range of signal
magnitudes near said threshold magnitude.
10. An auditory speech signal verification system, as defined in
claim 9, wherein,
said means for developing a plurality of signals representative of
selected similarities includes,
means for measuring a plurality of different speech signal
characteristics for similarity in each of a number of time
subintervals within the interval of said designated time scale.
11. An auditory speech signal verification system, as defined in
claim 10, wherein,
said different speech signal characteristics are based,
respectively, on (1) the difference in average values between said
reference speech signal parameters and said sample speech signal
parameters, (2) the squared difference in linear components between
the two, (3) the squared difference in quadratic components between
the two, (4) the correlation between the two in each of said
subintervals, and (5) the correlation between the two over said
entire interval of said designated time scale.
12. An auditory speech signal verification system, as defined in
claim 11, wherein,
each of said signals representative of selected similarities
between each of said sample parameters and each of said
corresponding reference parameters is scaled in accordance with the
magnitudes of said variation signals.
Description
This invention relates to speech signal analysis and, more
particularly, to a system for verifying the identity of a person on
the basis of acoustic parameters unique to his speech.
BACKGROUND OF THE INVENTION
Many business transactions might be conducted by voice over a
telephone if the identity of a caller could be verified. It might,
for example, be convenient if a person could telephone his bank and
ascertain the balance of his account. He might dial the bank and
enter both his identification number and his request by keying the
dial. A computer could (via synthetic speech) ask him to speak his
verification phrase. If a verification of sufficiently high
confidence was achieved, the machine would proceed to read out the
requested balance. Other instances are apparent where verification
by voice would prove useful.
From the practical point of view, the problem of verification
appears both more important and more tractable than the problem of
absolute identification. The former problem consists of the
decision to accept or reject an identity claim made by an unknown
voice. In identification, the problem is to decide which member of a
reference set the unknown most resembles. In verification, the
expected probability of error tends to remain constant regardless
of the size of the user population, whereas in identification the
expected probability of error tends to unity as the population
becomes large. In the usual context of the verification problem one
has a closed set of cooperative "customers," who wish to be
verified and who are willing to pronounce prescribed code phrases
(tailored to the individual voices if necessary). The machine may
ask for repeats and might adjust its acceptance threshold in
accordance with the importance of the transaction. Further, the
machine may control the average mix of the two kinds of errors it
can make: i.e., accept a false speaker (miss), or reject a true
speaker (false alarm).
DESCRIPTION OF THE PRIOR ART
A number of recognition techniques have been proposed for
identifying a speaker on the basis of prerecorded samples of his
speech. For example, human observers have been trained to identify
talkers on the basis of certain variables in their speech, and
others have been trained to identify them on the basis of a visual
display of acoustic features. These studies indicate that talkers
can indeed be identified primarily on the basis of acoustic cues.
Auditory verification (by a human listener) thus is a possibility,
but it is generally inconvenient and it occupies talent that might
be better applied otherwise. Also, present indications are that
auditory verification is not as reliable as machine
verification.
Accordingly, several proposals have been made for the automatic
recognition of speech sounds based entirely on acoustic
information. These have shown some degree of promise, providing
that the sample words to be recognized or identified are limited in
number. Most of these recognition techniques are based on
individual words, with each word being compared to a corresponding
word. Some work has been done on comparing selected parameters in a
sample utterance, for example, peaks and valleys of pitch periods,
against corresponding reference data.
SUMMARY OF THE INVENTION
It is, accordingly, an object of this invention to verify the
identity of a human being on the basis of certain unique acoustic
cues in his speech. In accordance with the invention, verification
of a speaker is achieved by comparing the characteristic way in
which he utters a test sentence with a previously prepared
utterance of the same sentence. A number of different tests are
made on the speech signals and a binary decision is then made; the
identity claim of the talker is either rejected or accepted.
The problem may be defined as follows. A person asserts a certain
identity and then makes a "sample" utterance of a special test
phrase. Previously prepared information about the voice of the
person whose identity is claimed, i.e., a "reference" utterance,
embodies the typical way in which that person utters the test
phrase, as well as measures of the variability to be expected in
separate repetitions of the phrase by that person. The sample
utterance is compared with the reference information and a decision
is rendered as to the veracity of the identity claim. For the sake
of exposition, it is convenient to divide the verification
technique into three basic operations: time registration,
construction of a reference, and measurement of the "distance" from
the reference to a particular sample utterance.
Time registration is the process in which the time axis of a sample
time function is warped so as to make the function most nearly
similar to the unwarped version of a reference function. The warped
time scale may be specified by any continuous transformation. One
suitable function is a piece-wise linear continuous function of
unwarped time. In this case warping is uniquely determined by two
coordinates of each breakpoint in the piece-wise linear function.
Typically, 10 break-points may be used for warping a
two-second-long function, so the registration task amounts to the
optimal assignment of values to 20 parameters.
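The warping just described can be sketched in a few lines of Python. This is a minimal illustration, not the patented apparatus: the function names, the uniform sampling rate, and the use of linear interpolation between samples are all assumptions made here for the example.

```python
import numpy as np

def warp_time(t, break_t, break_tau):
    """Map unwarped time t to warped time through a continuous
    piece-wise linear function defined by its breakpoints.
    break_t must be increasing; each breakpoint contributes two
    coordinates (break_t[i], break_tau[i])."""
    return np.interp(t, break_t, break_tau)

def warp_signal(sample, sample_rate, break_t, break_tau, n_out):
    """Read the sample function along the warped time axis.
    Assumes `sample` is uniformly sampled at `sample_rate` (hypothetical)."""
    t = np.linspace(break_t[0], break_t[-1], n_out)   # unwarped time grid
    tau = warp_time(t, break_t, break_tau)            # warped time for each t
    sample_times = np.arange(len(sample)) / sample_rate
    return np.interp(tau, sample_times, sample)       # value of s at tau(t)
```

With 10 breakpoints over a two-second function, `break_t` and `break_tau` together hold the 20 values mentioned above. Setting `break_tau` equal to `break_t` gives the identity warp, which leaves the function unchanged.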
The coefficient of correlation between warped sample and unwarped
reference may be used as one index of the similarity of the two
functions. The 20 warping parameters are iteratively modified to
maximize the correlation coefficient. One suitable technique is the
method of steepest ascent. That is, in every iteration, each of the
20 parameters is incremented by an amount proportional to the
partial derivative of the correlation coefficient with respect to
that parameter.
Success of this procedure hinges on the avoidance of certain
degenerate outcomes. Accordingly, several constraints on the
steepest ascent iteration process are employed. In effect, these
constraints prevent the original function from being distorted too
severely, and prevent unreasonably large steps on any one
iteration.
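A constrained steepest-ascent loop of this kind might look as follows. The sketch is written under several assumptions that are not in the text: only the warped-time coordinate of each interior breakpoint is adjusted (so 8 of the 20 parameters move), partial derivatives are estimated by finite differences, the similarity index is the Pearson correlation coefficient, and the names `gain`, `max_step`, and `eps` are hypothetical tuning values. The sample is assumed to have been linearly stretched to the reference time axis beforehand.

```python
import numpy as np

def correlation(a, b):
    """Pearson correlation coefficient of two equal-length arrays."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def apply_warp(sample, t, break_t, break_tau):
    """Evaluate the sample at warped times tau(t)."""
    tau = np.interp(t, break_t, break_tau)
    return np.interp(tau, t, sample)

def register(sample, reference, t, n_break=10, gain=0.05,
             max_step=0.02, max_iter=200, eps=1e-3):
    break_t = np.linspace(t[0], t[-1], n_break)
    break_tau = break_t.copy()                 # start from the identity warp
    best = correlation(apply_warp(sample, t, break_t, break_tau), reference)
    for _ in range(max_iter):
        grad = np.zeros(n_break)
        for i in range(1, n_break - 1):        # end points stay fixed
            probe = break_tau.copy()
            probe[i] += eps
            score = correlation(apply_warp(sample, t, break_t, probe), reference)
            grad[i] = (score - best) / eps     # finite-difference partial derivative
        # Constraint: no unreasonably large step on any one iteration.
        step = np.clip(gain * grad, -max_step, max_step)
        trial = break_tau + step
        # Constraint: keep the warp monotonic so time never runs backward.
        if np.any(np.diff(trial) <= 0):
            break
        score = correlation(apply_warp(sample, t, break_t, trial), reference)
        if score <= best:                      # no further improvement
            break
        break_tau, best = trial, score
    return break_tau, best
```

Because a step is accepted only when it raises the correlation, the returned score is never lower than the score of the unwarped sample. A fuller version would also move the unwarped-time coordinate of each breakpoint.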
A reference phrase is formed by collecting a number of independent
utterances of the phrase by the same speaker. Each is referred to
as a "specimen" utterance. A typical phrase which has been used in
practice is "We were away a year ago." Each utterance is analyzed
to yield, for an all voiced utterance such as this one, five
"control functions" (so called because they can be used to control
a formant synthesizer to generate a signal similar to the original
voice signal). It has been found that gain, pitch period, and
first, second, and third formant frequencies, are satisfactory as
control functions. The gain function is scaled to have a particular
peak value independent of the talking level.
The reference consists of a version of each of the five control
functions chosen to represent a typical utterance by that speaker.
By convention, the length of the reference is always the same; a
value of 1.9 seconds may be used as the standard length. Any value
may be used that is not grossly different from the natural length
of the utterance.
The reference functions are constructed by averaging together the
specimen functions after each has been time-warped to bring them
all into mutual registration with each other. One way this mutual
registration has been achieved is as follows. One of the five
control functions is singled out to guide the registration. This
control function is called the "guide" function. Either gain or
second formant may be used for this purpose. The guide function
from each specimen is linearly expanded (or contracted) to the
desired reference length, and then all of the expanded guide
functions are averaged together. This average is the first trial
reference for the control function serving as guide. Each of the
specimen guide functions is then registered to the trial reference
by non-linear time-warping, and a new trial reference is generated
by averaging the warped specimens. This process is continued
iteratively, i.e., warp each specimen guide function for
registration with the current trial reference, and then make a new
trial reference by averaging the warped guide functions, until the
reference does not change significantly. The other four control
functions for each specimen utterance are then warped by the final
guide warping function for that utterance, and then each control
function is averaged across all specimens to form a reference. The
reference control functions are stored for future use, along with
computed variance values which indicate the reliability of the
function as a standard in selected intervals of the utterance.
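The iterative construction of a reference for the guide function can be outlined as below. This is a sketch under stated assumptions: `align` is a hypothetical callable standing in for the non-linear registration step (it takes a specimen and the current trial reference and returns the warped specimen), `tol` and `max_rounds` are illustrative stopping values, and the variance is computed point by point rather than over selected intervals.

```python
import numpy as np

def stretch_to_length(x, n_out):
    """Linearly expand or contract x to n_out samples."""
    src = np.linspace(0.0, 1.0, len(x))
    dst = np.linspace(0.0, 1.0, n_out)
    return np.interp(dst, src, x)

def build_reference(specimens, n_out, align=None, max_rounds=20, tol=1e-6):
    """Average specimen guide functions into a reference.
    Returns (reference, variance across specimens)."""
    stretched = [stretch_to_length(np.asarray(s, dtype=float), n_out)
                 for s in specimens]
    trial = np.mean(stretched, axis=0)         # first trial reference
    aligned = stretched
    if align is not None:
        for _ in range(max_rounds):
            # Warp every specimen toward the current trial reference.
            aligned = [align(s, trial) for s in stretched]
            new_trial = np.mean(aligned, axis=0)
            converged = np.max(np.abs(new_trial - trial)) < tol
            trial = new_trial
            if converged:                      # reference no longer changes
                break
    variance = np.var(aligned, axis=0)
    return trial, variance
```

With `align=None` the sketch reduces to linear length normalization followed by averaging. The remaining control functions would reuse the final warp found for each specimen's guide function, which this outline does not show.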
When a sample of the standard utterance is presented for
verification, a "distance" value is computed that is a measure of
the unlikelihood that that sample would have been generated by the
person whose identity is claimed. Distances are always positive
numbers; a distance value of zero means that the utterance is
identical to the reference in every detail.
The sample is first analyzed to generate the five control functions
in terms of which the reference is stored. The control functions
are then brought into temporal registration with the reference.
This is done by choosing one of the control functions (e.g., gain)
to serve as the "guide." The guide function of the sample utterance
is registered with its counterpart in the reference by non-linear
warping, and other control functions are then warped in an
identical way.
After registration of the control functions, a variety of distances
between the sample and reference utterance are measured. Included
are measures of the difference in local average, local linear
variation, and local quadratic variation for all control functions;
local and global correlation coefficients between sample and
reference control functions; and measures that represent the
difficulty of time registration. In forming these separate
measures, various time segments of the utterance are weighted in
proportion to the constancy of the given measure in that time
segment across the set of warped specimens. These measures are then
combined to form a single overall distance that represents the
degree to which the sample utterance differs from the
reference.
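One way to picture the local average, linear, and quadratic measures is to fit a low-order polynomial to each time segment and compare the fitted components. The sketch below makes several assumptions not found in the text: it uses a Legendre basis for the fit, takes squared differences for all three components, and scales each segment by the mean variance of the function in that segment as a stand-in for the constancy weighting described above. Correlation and registration-difficulty measures are omitted.

```python
import numpy as np
from numpy.polynomial import legendre

def segment_components(y):
    """Least-squares Legendre fit of one segment.
    Returns [average, linear, quadratic] components."""
    x = np.linspace(-1.0, 1.0, len(y))
    return legendre.legfit(x, y, 2)

def local_distances(sample, reference, variance, n_segments=10):
    """Variance-scaled squared differences per segment.
    Result has shape (n_segments, 3)."""
    rows = []
    parts = zip(np.array_split(sample, n_segments),
                np.array_split(reference, n_segments),
                np.array_split(variance, n_segments))
    for s, r, v in parts:
        diff = segment_components(s) - segment_components(r)
        scale = max(float(np.mean(v)), 1e-12)  # guard against zero variance
        rows.append(diff ** 2 / scale)
    return np.array(rows)

def combined_distance(sample, reference, variance, n_segments=10):
    """Single non-negative number; zero when sample equals reference."""
    return float(np.sum(local_distances(sample, reference, variance, n_segments)))
```

Each segment needs at least three samples for the quadratic fit to be determined.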
The verification decision is based on the single overall distance.
If it is less than a pre-determined criterion, the claimed identity
is accepted ("verified"); if it is greater than the criterion, the
identity claim is rejected. In addition, an "indeterminate" zone
may be established around the criterion value within which neither
definite decision would be rendered. In this event, additional
information about the person is sought.
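The decision rule, including the optional indeterminate zone, reduces to a pair of comparisons. In this sketch the parameter name `margin` (the half-width of the zone) is an assumption, and the handling of a distance exactly equal to a zone boundary is a choice made here, since the text does not specify it.

```python
def decide(distance, criterion, margin=0.0):
    """Return "accept", "reject", or "no decision" for an overall distance.
    Distances within `margin` of the criterion fall in the indeterminate zone."""
    if distance < criterion - margin:
        return "accept"        # claimed identity is verified
    if distance > criterion + margin:
        return "reject"        # identity claim is rejected
    return "no decision"       # seek additional information
```

Raising `criterion` admits more true speakers at the cost of admitting more impostors; widening `margin` defers more cases for further information.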
BRIEF DESCRIPTION OF THE DRAWING
The invention will be fully apprehended from the following detailed
description of a preferred illustrative embodiment thereof taken in
connection with the appended drawings.
In the drawings:
FIG. 1 is a block schematic diagram of a speech verification system
in accordance with the invention;
FIG. 2 illustrates an alternative analyzer arrangement;
FIG. 3 illustrates graphically the registration technique employed
in the practice of the invention;
FIG. 4 is a chart which illustrates the dependence of two kinds of
error ratios on the choice of threshold;
FIG. 5 is a block schematic diagram of a time adjustment
configuration which may be employed for non-linearly warping
parameter values;
FIG. 6 illustrates a criterion for maximizing similarity of
acoustic parameters in accordance with the invention;
FIG. 7 illustrates a number of distance measures used in
establishing an identity between two speech samples; and
FIGS. 8A, 8B and 8C are graphic illustrations of speech parameters
of an unknown and a reference talker. FIG. 8A illustrates the time
normalized parameters before the nonlinear time warping procedure.
FIG. 8B illustrates parameters for a reference and specimen
utterance which match after time registration using the second
formant as the guide function. FIG. 8C illustrates a fully time
normalized set of parameters for an impostor, i.e., a no-match
condition.
DETAILED DESCRIPTION
A system for verifying an individual's claimed identity is shown
schematically in FIG. 1. A library of reference utterances is
established to maintain a voice standard for each individual
subscriber to the system. A later claim of identity is verified by
reference to the appropriate stored reference utterance.
Accordingly, an individual speaks a reference sentence, for
example, by way of a microphone at a subscriber location, or over
his telephone to a central location (indicated generally by the
reference numeral 10). Although any reference phrase may be used,
the phrase should be capable of representing a number of prosodic
characteristics and variations of his speech. Since vowel or voiced
sounds contain a considerable number of such features, the
reference sentence, "We were away a year ago." has been used in
practice. This phrase is effective, in part, because of its lack of
nasal sounds and its totally voiced character. Moreover, it is long
enough to require more than passing attention to temporal
registration, and is short enough to afford economical analysis and
storage.
Whatever the phrase spoken by the individual to establish a
standard, it is delivered to speech analyzer 11, of any known
construction, wherein a number of different acoustic parameters are
derived to represent it. For example, individual formant
frequencies, amplitudes, and pitch, sampled at the Nyquist rate, are
satisfactory. These speech parameters are commonly used to
synthesize speech in vocoder apparatus and the like. One entirely
suitable speech signal analyzer is described in detail in a
copending application of L. R. Rabiner and R. W. Schafer, Ser. No.
872,050, filed Oct. 29, 1969. In essence, analyzer 11 includes
individual channels for identifying formant frequencies F1, F2, F3,
pitch period P, and gain G control signals. In addition, fricative
identifying signals may be derived if desired.
In order that variations in the manner in which the individual
speaks the phrase may be taken into account, it is preferable to
have him repeat the reference sentence a number of times in order
that an average set of speech parameters may be prepared. It is
convenient to analyze the utterance as it is spoken, and to adjust
the duration of the utterance to a standard length T. Typically, a
two-second sample is satisfactory. Each spoken reference sentence
therefore is either stretched or contracted in apparatus 12 to
adjust it to the standard duration. Each adjusted set of parameters
is then stored either as analog, or after conversion, as digital
signals, for example, in unit 12. When all of the test utterances
have been analyzed and brought into mutual time registration, an
average set of parameters is developed in averaging apparatus 13.
The single resultant set of reference parameter values is then
stored for future use in storage unit 15.
In addition, a set of variance signals is prepared and stored in
unit 15. Variance values are developed, in the manner described
hereinafter, for parameters in each of a number of time segments
within the span of the reference utterance to indicate the extent
of any difference in the manner in which the speaker utters that
segment of the test phrase. Hence, variance values provide a
measure of the reliability with which parameters in different
segments may be used as a standard.
It is evident that a non-vocal identification of each individual is
also stored, preferably in library store 15. The identification may
be either in the form of a separate address or some other key to
the storage location of the reference utterance for each
individual. Any form of addressing well known to those in the art
may be employed. Moreover, it is evident that store 15 may be in
the form of a library of reference data for any number of
subscribers to the verification service.
When it is desired to verify a claim of identity by an individual,
for example, at the time a credit card transaction takes place at a
retail store or the like, two separate entries are made in the
system. First, the individual identifies himself, for example, by
means of his name and address or his credit card number. This data
is entered into reading unit 16, of any desired construction, in
order that a request to verify may be initiated. Secondly, upon
command, i.e., a ready light from unit 17, the individual speaks
the reference sentence. These operations are indicated generally by
block 18 in FIG. 1. The sample voice signal is delivered to
analyzer 19 where it is broken down to develop parameter values
equivalent to those previously stored for him. Analyzer 19,
accordingly, should be of identical construction to analyzer 11,
and preferably, is located physically at the central processing
station. The resultant set of sample parameter values are thereupon
delivered to unit 17 to initiate all subsequent operations.
Since it is unlikely that the sample utterance will be in time
registration with the reference sample, it is necessary to adjust
its time scale to bring it into temporal alignment with the
reference. This operation is carried out in time adjustment
apparatus 20. In essence, iterative processing is employed to
maximize the similarity between the specimen parameters and the
reference parameters. Similarity may be measured by the coefficient
of correlation between the sample and reference. Sample parameters
are initially adjusted to start and stop in registry with the
reference. It is also in accordance with the invention to match the
time spread of variables within the speech sample. Internal time
registration is achieved by a nonlinear process which maximizes the
similarity between the sample and the reference by way of a
monotonic continuous transformation of time.
Accordingly, values of the sample signal parameters, s(t), alleged
to be the same as reference signal r(t), are delivered to
adjustment apparatus 20. They are remapped, i.e., converted, by a
substitution process to values s(τ) where
τ(t) = a + bt + q(t).     (1)
In the equation, coefficients a and b are determined so as to cause
the end points of the sample to coincide with those of the
reference when q(t) is zero. The function q(t) defines the
character of the time scale transformation between the end points
of the utterance. In practice, q(t) may be a continuous piece-wise
linear function. The time adjustment operation is illustrated
graphically in FIG. 3. A reference function r(t) extends through
the period 0 to T. It is noted, however, that the sample function
s(t) is considerably shorter in duration. It is necessary,
therefore, to stretch it to a duration T. This is done by means of
the substitute function τ(t) shown in the third line of the
illustration. A so-called gradient climbing procedure may be
employed in which values q_i at times t_i are varied in
order that values of q_i and t_i may be found that maximize
the normalized correlation I between the reference speech and the
sample speech, where
The symbols <> denote a time average value of the enclosed
expression. By thus maximizing the correlation between the two, a
close match between prominent features in the utterance, e.g.,
formants, pitch, and intensity values, is achieved.
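The expression for I does not appear in this copy of the text. A form consistent with the surrounding description, namely a normalized correlation built from time averages, would be the first line below. It is offered as a reconstruction under that assumption and not as a quotation of the patent's own equation. The second line works out the end-point conditions on a and b under the further assumption that the sample occupies the interval from t_1 to t_2; those two symbols are introduced here for illustration only.

```latex
% Assumed form of the normalized correlation (angle brackets are time averages)
I = \frac{\langle r(t)\, s(\tau) \rangle}
         {\sqrt{\langle r^{2}(t) \rangle \, \langle s^{2}(\tau) \rangle}}

% End-point alignment with q(t) = 0, sample assumed to span [t_1, t_2]
\tau(0) = a = t_1, \qquad
\tau(T) = a + bT = t_2
\;\Longrightarrow\;
b = \frac{t_2 - t_1}{T}
```

By the Cauchy-Schwarz inequality this I cannot exceed 1 in magnitude, and it equals 1 only when s(τ) is a positive multiple of r(t). If the mean of each function is removed first, it coincides with the ordinary coefficient of correlation.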
Details of the time normalization process are described hereinafter
with reference to FIG. 5. Suffice it to say at this point that the
substitute values of the sample s(τ) together with values of
r(t) and variance values σ² are delivered to measurement
apparatus 25. Values of q_i and t_i, which reflect the
amount of non-linear squeezing used to maximize I, are delivered to
measurement apparatus 26.
Since the reference speech and the sample speech are now in time
registry, it is possible to measure internal similarities between
the two. Accordingly, a value is prepared in measurement apparatus
25 which denotes the internal dissimilarities between the two
speech signals. Similarly, a measure is prepared in apparatus 26
which denotes the extent of warping required to bring the two into
registry. If the dissimilarities are found to be small, it is
likely that a match has been found. Yet, if the warping function
value is extremely high, there is a likelihood that the match is a
false one, resulting solely from extensive registration adjustment.
The two measures of dissimilarity are combined in apparatus 27 and
delivered to comparison unit 28 wherein a judgment is made in
accordance with preestablished rules, i.e., threshold levels,
balanced between similarity and inconsistencies. An "accept" or
"reject" signal is thereupon developed. Ordinarily, this signal is
returned to unit 16 to verify or reject the claim of identity made
by the speaker.
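The way the warping measure tempers the dissimilarity measure can be illustrated with a simple weighted sum. The text does not state how apparatus 27 combines the two, so the root-mean-square summary of the q_i values and the `warp_weight` parameter below are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def warp_extent(q_values):
    """Root-mean-square size of the non-linear warp component q_i.
    Zero means the alignment was purely linear."""
    q = np.asarray(q_values, dtype=float)
    return float(np.sqrt(np.mean(q ** 2)))

def overall_distance(internal_distance, q_values, warp_weight=1.0):
    """Penalize matches that were obtained only through heavy warping.
    warp_weight is a hypothetical tuning value."""
    return internal_distance + warp_weight * warp_extent(q_values)
```

Under this scheme a sample that matches the reference closely but only after extensive warping still receives a large overall distance, which is the safeguard against false matches described above.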
It is evident that there is redundancy in the apparatus illustrated
in FIG. 1. Thus, for example, analyzer 11 is used only to prepare
reference samples. It may, of course, be switched as required to
analyze identity claim samples. Such an arrangement is illustrated
in FIG. 2. Reference and sample information is routed by way of
switch 29a to analyzer 11 and delivered by way of switch 29b to the
appropriate processing channel. Other redundancies within the
apparatus may, of course, be minimized by judicious construction.
Moreover, it is evident that all of the operations described may
equally well be performed on a computer. All of the steps and all
of the apparatus functions may be incorporated in a program for
implementation on a general-purpose or dedicated computer. Indeed,
in practice, a computer implementation has been found to be most
effective. No unusual programming steps are required for carrying
out the indicated operations.
FIG. 4 illustrates the manner in which acceptance or rejection of a
sample is established. Since absolute discrimination between
reference and sample values would require near perfect matching, it
is evident that a compromise must be used. FIG. 4 indicates,
therefore, the error rate of the verification procedure as a
function of the value of the "dissimilarity" measure between the
reference and sample, taken as a threshold value for acceptance or
rejection. A compromise value is selected that then determines the
number of true matches that are rejected, i.e., customers whose
claim to identity is disallowed, versus the number of impostors
whose claim to identity is accepted. Evidently, the crossover point
may be adjusted in accordance with the particular identification
application.
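The trade-off described for FIG. 4 can be sketched numerically. The
following is a minimal illustration, not the procedure used in the
patent: the function names and the distance values are hypothetical,
and a claim is assumed to be accepted when its distance does not
exceed the threshold.

```python
import numpy as np

def error_rates(genuine, impostor, thresholds):
    """Return (false_reject, false_accept) rates for each threshold.

    Assumption: a claim is accepted when its distance is <= threshold.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # True speakers whose distance exceeds the threshold are rejected.
    false_reject = (genuine[None, :] > thresholds[:, None]).mean(axis=1)
    # Impostors whose distance is at or below the threshold are accepted.
    false_accept = (impostor[None, :] <= thresholds[:, None]).mean(axis=1)
    return false_reject, false_accept

def crossover_threshold(genuine, impostor, thresholds):
    """Threshold at which the two error rates are closest to equal."""
    fr, fa = error_rates(genuine, impostor, thresholds)
    return float(np.asarray(thresholds)[np.argmin(np.abs(fr - fa))])

# Made-up distance values, for illustration only.
genuine = [0.8, 1.1, 1.4, 2.0, 2.6]
impostor = [2.2, 3.0, 3.5, 4.1, 5.0]
thresholds = np.linspace(0.0, 6.0, 61)
fr, fa = error_rates(genuine, impostor, thresholds)
```

Moving the threshold away from the crossover trades one kind of
error for the other, which is the adjustment the text leaves to the
particular application.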
FIG. 5 illustrates in block schematic form the operations employed
in accordance with the invention for registering the time scales of
a reference utterance, in parameter value form, with the sample
utterance in like parameter value form. It is evident that the
Figure illustrates a hardware implementation. The Figure also
constitutes a working flow chart for an equivalent computer
program. Indeed, FIG. 5 represents the flow chart of the program
that has been used in the practice of the invention. As with the
overall system, no unusual programming steps are required to
implement the arrangement.
The system illustrated in FIG. 5 corresponds generally to the
portion of FIG. 1 depicted in unit 20. Reference values of speech
signal parameters r(t) from store 15 are read into store 51 as a
set. Similarly, samples from analyzer 19 are stored in unit 52. In
order to register the time scale of the samples with those of the
reference, samples s(t) are converted into a new set of values
s(.tau.) in transformation function generator 53. This operation is
achieved by developing, in generator 54, values of s(.tau.) as
discussed above in Equation (1). Coefficients a and b are
determined to cause the end points of the sample utterance, as
determined for example by speech detector 55, to coincide with
those of the reference when q(t) is zero. Detector 55 issues a
first marker signal at the onset of the sample and a second marker
signal at the cessation of the utterance. These signals are
delivered directly to generator 54. Values of q.sub.i and t.sub.i
for the interval between the terminal points of the utterance are
initially entered into the system as prescribed sets of constants
q.sub.io and t.sub.io. These values are delivered to OR gates 56
and 57, respectively, and by way of adders 58 and 59 to the input
of generator 54. Accordingly, with these initial values, a set of
values .tau.(t) is developed in generator 54 in accordance with
Equation (1). Values of the specimen s(t) are thereupon remapped in
generator 53 according to the functions developed in generator 54
to produce a time-warped version of the sample, designated
s(.tau.).
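The remapping step can be sketched as follows. Equation (1) is not
reproduced in this portion of the text, so the sketch assumes the
form tau(t) = a + b*t + q(t), with q(t) piecewise linear through the
points (t.sub.i, q.sub.i) and zero at both end points, and assumes
that tau maps reference time onto sample time. All function names
are hypothetical.

```python
import numpy as np

def endpoint_coefficients(s_start, s_end, r_start, r_end):
    """Choose a and b so that, with q(t) = 0, the reference end points
    map onto the detected end points of the sample."""
    b = (s_end - s_start) / (r_end - r_start)
    a = s_start - b * r_start
    return a, b

def warp_times(t, a, b, t_knots, q_knots, r_start, r_end):
    """Assumed form of Equation (1): tau(t) = a + b*t + q(t).

    q(t) is interpolated linearly between the knots and pinned to zero
    at both end points. t_knots must be increasing and lie strictly
    between r_start and r_end.
    """
    xs = np.concatenate(([r_start], t_knots, [r_end]))
    ys = np.concatenate(([0.0], q_knots, [0.0]))
    return a + b * t + np.interp(t, xs, ys)

def warp_sample(s, s_times, tau):
    """Read the sample track at the warped times to obtain s(tau)."""
    return np.interp(tau, s_times, s)

# Reference spans 0 to 2 s; the sample utterance was detected
# between 0.5 s and 2.0 s of its own recording.
t_ref = np.linspace(0.0, 2.0, 21)
a, b = endpoint_coefficients(0.5, 2.0, 0.0, 2.0)
tau = warp_times(t_ref, a, b, [1.0], [0.1], 0.0, 2.0)
```

With a single knot at t = 1.0 and q = 0.1, the midpoint of the
reference is matched to a point 0.1 s later in the sample than the
purely linear mapping would give, while both end points stay fixed.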
Values of s(.tau.) are next compared with the reference samples to
determine whether or not the transformed specimen values have been
brought into satisfactory time alignment with the reference. The
normalized correlation I, as defined above in Equation (2), is used
for this comparison. Since I is developed on the basis of root mean
square values of the sample functions, the necessary algebraic
terms are prepared by developing a product in multiplier 60 and
summing the resultant over the period T in accumulator 61. This
summation establishes the numerator of Equation (2). Similarly,
values of r(t) and s(t) are squared, integrated, and rooted,
respectively, in units 62, 63, 64, and 65, 66, and 67. The two
normalized values are delivered to multiplier 68 to form the
denominator of Equation (2). Divider network 69 then delivers as
its output a value of the normalized correlation function I in
accordance with Equation (2). It indicates the similarity of the
sample utterance to the reference utterance.
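The computation attributed to units 60 through 69 can be sketched for
sampled data as below. This is an illustration for a single
parameter track; the function name is hypothetical, and sums stand
in for the integrals over the period T (the sampling interval
cancels between numerator and denominator).

```python
import numpy as np

def normalized_correlation(r, s_warped):
    """Sum of products divided by the product of root-sum-squares."""
    r = np.asarray(r, dtype=float)
    s = np.asarray(s_warped, dtype=float)
    numerator = np.sum(r * s)                    # multiplier 60, accumulator 61
    root_r = np.sqrt(np.sum(r ** 2))             # square, integrate, root
    root_s = np.sqrt(np.sum(s ** 2))             # square, integrate, root
    return numerator / (root_r * root_s)         # multiplier 68, divider 69
```

The value is 1 when the two tracks are proportional and falls toward
0 as they become unrelated. No mean is subtracted here, following
the description in the text.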
The degree of sensitivity to the change produced by substituting
the values s(.tau.) for s(t) in generator 54 is then measured in
units 70 and 71. The sensitivity calculation is most conveniently
carried out by evaluating the partial derivatives of the
correlation function I with respect to q and t, taken at the values
previously supplied to generator 54.
Accordingly, the partial derivative values of I with respect to q
and with respect to t are prepared and delivered to multipliers 72
and 73, respectively. These values are equalized by multiplying by
constants Cq and Ct in order to enhance the subsequent evaluation.
These products constitute incremental values of q and t. The mean
squares of the sets of values q.sub.i and t.sub.i are thereupon
compared in gates 74 and 75 to selected small constants U and V.
Constants U and V are selected to indicate the required degree of
correlation that will assure a low error rate in the ultimate
decision. If either of the comparisons is unsatisfactory,
incremental values of q.sub.i or t.sub.i, or both, are returned to
adders 58 and 59. The previously developed values q.sub.i and
t.sub.i, are incremented thereby to provide a new set of values as
inputs to function generator 54. Values of s(.tau.) are thereupon
developed using the new data and the process is repeated. In
essence, the values of q at intervals t as shown in FIG. 3 are
individually altered to determine an appropriate set of values for
maximizing the correlation between the reference utterance and the
altered sample utterance.
FIG. 6 illustrates the correlating operation mathematically. The
relationships are those used in a computer program used to
implement the steps discussed above.
Thus, each q.sub.i and t.sub.i is adjusted until a further change
in its value produces only a small change in correlation. When the
change is sufficiently small, the last generated value is held,
e.g., in generator 54. When the sensitivity measures are found to
meet the above-discussed criterion, i.e., maximum normalized
correlation, gates 74 and 75 deliver the last values of s(.tau.) by
way of AND gate 78 to store 79. These values then are used as the
time registered specimen samples and, in the apparatus of FIG. 1,
are delivered to dissimilarity measuring apparatus 25. The values
of q at time t from function generator 54 are similarly delivered,
for example, by way of a gate (not shown in the Figure to avoid
undue complexity) energized by the output of gate 78 to function
measurement apparatus 26 of FIG. 1.
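The iterative adjustment can be sketched as a simple gradient ascent
on the normalized correlation. The sketch makes several assumptions
that are not taken from the patent: only the q.sub.i values are
adjusted (the t.sub.i are held fixed), the partial derivatives are
estimated by finite differences, the end points already coincide
(a = 0, b = 1), and a step is halved whenever it would lower the
correlation. The names c_q and u play the roles of constants Cq and
U; all function names are hypothetical.

```python
import numpy as np

def correlation_after_warp(q, t_knots, r, s, times):
    """Normalized correlation between r(t) and s(tau(t)) for offsets q."""
    xs = np.concatenate(([times[0]], t_knots, [times[-1]]))
    ys = np.concatenate(([0.0], q, [0.0]))      # q(t) is zero at both ends
    tau = times + np.interp(times, xs, ys)      # assumed a = 0, b = 1
    s_warped = np.interp(tau, times, s)
    return np.sum(r * s_warped) / np.sqrt(np.sum(r ** 2) * np.sum(s_warped ** 2))

def register(r, s, times, t_knots, c_q=0.01, u=1e-12, eps=1e-4, max_iter=200):
    """Adjust q until the mean square of its increments falls below u."""
    q = np.zeros(len(t_knots))
    best = correlation_after_warp(q, t_knots, r, s, times)
    for _ in range(max_iter):
        grad = np.empty_like(q)
        for i in range(len(q)):
            q_step = q.copy()
            q_step[i] += eps
            # Finite-difference estimate of the partial derivative dI/dq_i.
            grad[i] = (correlation_after_warp(q_step, t_knots, r, s, times) - best) / eps
        dq = c_q * grad                          # incremental values of q
        if np.mean(dq ** 2) < u:                 # sensitivity small enough: stop
            break
        trial = correlation_after_warp(q + dq, t_knots, r, s, times)
        if trial >= best:
            q, best = q + dq, trial              # keep the improved warp
        else:
            c_q *= 0.5                           # safeguard, not in the patent
    return q, best

# Synthetic example: the sample is the reference with a smooth time
# distortion that vanishes at both end points.
times = np.linspace(0.0, 2.0, 201)
r = np.sin(2 * np.pi * times)
s = np.sin(2 * np.pi * (times + 0.05 * np.sin(np.pi * times / 2)))
t_knots = np.linspace(0.0, 2.0, 6)[1:-1]
i_initial = correlation_after_warp(np.zeros(len(t_knots)), t_knots, r, s, times)
q_final, i_final = register(r, s, times, t_knots)
```

Because a step is accepted only when it does not reduce the
correlation, the returned value is never lower than the value
obtained with q = 0.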
With the sample speech utterance appropriately registered with the
averaged reference speech utterance, it is then in accordance with
the invention to assess the similarities between the two and to
develop a single numerical evaluation of them. The numerical
evaluation is used to accept or reject the claim of identity. For
convenience, it has been found best to generate a measure of
dissimilarity such that a numerical value of zero denotes a perfect
match between the two, and progressively higher numerical values
denote greater degrees of dissimilarity. Such a value is sometimes
termed a "distance" value.
To provide a satisfactory measure of dissimilarity, the two
registered utterances are examined in a variety of different ways
and at a variety of different locations in time. The resultant
measures are combined to form a single distance value. One
convenient way of assessing dissimilarity comprises dividing the
interval of the utterances, 0 to T, into N equal intervals. If T =
2 seconds, as discussed in the above example for a typical
application, it is convenient to divide the interval into N = 20
equal parts. FIG. 7 illustrates such a subdivision. Each
subdivision i is then treated individually and a number of measures
of dissimilarity are developed. These are based on (1) differences
in average values between the reference speech r(t) and the
registered sample s(.tau.), (2) the differences between linear
components of variation of the two functions, (3) differences
between quadratic components of variations of the two functions,
and (4) the correlation between the two functions. In addition, a
correlation coefficient (5) over the entire interval is obtained.
Five such evaluations are made for each of the speech signal
parameters used in representing the utterances. Thus, in the
example of practice discussed herein, five evaluations are made for
each of the formants F.sub.1, F.sub.2 and F.sub.3, for the pitch P
of the signal, and for its gain G. Accordingly, 25 individual
signal values of dissimilarity are produced.
It has also been found that the reliability of these measures
varies between individual segments of the utterances. That is to
say, certain speakers appreciably vary the manner in which they
deliver certain portions of an utterance but are relatively
consistent in delivering other portions. It is preferable therefore
to use the most reliable segments for matching purposes and to
reduce the relative weight of, or eliminate entirely, the measures
in those segments known to be unreliable. The degree of reliability
in each segment is based on the variance between the reference
speech signal in each segment for each of the several reference
utterances used in preparing the average reference in unit 13 of
FIG. 1. The average values are thus compared and a value
.sigma..sup.2, representative of the variance, is developed and
stored along with values r(t) in storage unit 15.
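One way to obtain a per-segment reliability value of this kind is
sketched below. It assumes that the reference utterances have
already been registered to a common time scale and sampled to equal
length, and it uses the population variance of the segment averages;
the patent does not specify either detail. The function names are
hypothetical.

```python
import numpy as np

def segment_means(track, n_segments):
    """Average value of one parameter track in each of n_segments parts."""
    parts = np.array_split(np.asarray(track, dtype=float), n_segments)
    return np.array([part.mean() for part in parts])

def reference_statistics(reference_tracks, n_segments=20):
    """Per-segment mean and variance across several reference utterances."""
    means = np.array([segment_means(tr, n_segments) for tr in reference_tracks])
    # Variance across utterances (axis 0); ddof=0 is an assumption.
    return means.mean(axis=0), means.var(axis=0)
```

Segments in which the speaker is inconsistent yield a large
variance, and measures taken in those segments can then be given
less weight.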
Dissimilarity measurement apparatus 25 thus is supplied with the
functions r(t), s(.tau.), and .sigma..sup.2. It performs the
necessary mathematical evaluation to divide the functions into N
equal parts and to compute a measure of the squared difference in
average values of the reference utterance and adjusted sample
utterance, the squared difference in linear components between the
two (also designated "slope"), the squared difference in quadratic
components between the two (also designated "curvature"), and the
correlation between the two. Each of the measures is scaled in
accordance with the reliability factor as measured by the variance
.sigma..sup.2, discussed above.
The equations which define these mathematical evaluations are set
forth in FIG. 7. In the equations, the subscripts r and s refer,
respectively, to the reference utterance and the warped sample
utterance, and the functions x, y, and z are the coefficients of
the first three terms of an orthogonal polynomial expression of
the corresponding utterance value. The symbol .rho..sub.rs
represents the correlation coefficient between the sample and
reference functions computed over the full length of the sample.
The function .rho..sub.rs,i represents the correlation coefficient
between the sample and reference computed for the ith segment.
Similarly, .sigma..sup.2 represents the variance of the reference
parameters computed for the entire set of reference utterances used
to produce the average. The numerical evaluation for each of these
measures is combined to form a single number and a signal
representative of the number is delivered to combining network
27.
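The five evaluations for one parameter can be sketched as below. The
equations of FIG. 7 are not reproduced in the text, so several
details here are assumptions: the orthogonal polynomials are built
on each segment's own sample grid by QR factorization, every
per-segment term is divided by the same variance value, and a
correlation is turned into a distance as 1 minus the coefficient.
The function names are hypothetical.

```python
import numpy as np

def orthogonal_coeffs(segment):
    """Coefficients of a segment on constant, linear and quadratic
    orthonormal polynomials defined on the segment's sample grid."""
    u = np.linspace(-1.0, 1.0, len(segment))
    basis, _ = np.linalg.qr(np.vander(u, 3, increasing=True))
    return basis.T @ segment

def five_measures(r, s_warped, sigma2, n_segments=20):
    """Return [average, slope, curvature, segment correlation,
    whole-interval correlation] distances for one parameter."""
    r = np.asarray(r, dtype=float)
    s = np.asarray(s_warped, dtype=float)
    totals = np.zeros(4)
    segments = zip(np.array_split(r, n_segments),
                   np.array_split(s, n_segments), sigma2)
    for r_seg, s_seg, var in segments:
        diff = orthogonal_coeffs(r_seg) - orthogonal_coeffs(s_seg)
        rho_i = np.corrcoef(r_seg, s_seg)[0, 1]   # undefined for a constant segment
        # Squared coefficient differences and 1 - rho, scaled by reliability.
        totals += np.array([diff[0] ** 2, diff[1] ** 2, diff[2] ** 2,
                            1.0 - rho_i]) / var
    rho_rs = np.corrcoef(r, s)[0, 1]              # over the full interval
    return np.append(totals, 1.0 - rho_rs)
```

Applying such a function to each of F.sub.1, F.sub.2, F.sub.3, P and
G would give the 25 values mentioned in the text. All five values
are zero for a perfect match and grow as the tracks diverge.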
Although the numerical value of dissimilarity thus prepared is
sufficient to permit a reasonably reliable verification decision to
be made, the sample may have been adjusted severely to
maximize the correlation between it and the reference. The degree
of adjustment used constitutes another clue as to the likelihood of
identity between the sample and the reference. If the warping
values q.sub.i and t.sub.i were excessively large, it is less
likely that the sample corresponds to the reference than if
maximum correlation was achieved with less severe warping.
Accordingly, the final values of q and t developed in generator 24
(FIG. 1) are delivered to measurement apparatus 26. Three measures
of warping are thereupon prepared in apparatus 26.
For convenience an expression for the amount of warping employed is
defined as
Typically, 10 values of .tau. are employed so that 10 values of A
are produced. These values are averaged to get a single numerical
value A.sub.avg = X. A value of X is developed for each of the
reference speech utterances used to prepare the average. All values
of X are next averaged over the N reference utterances to produce
a mean value \bar{X} (X with an overbar). A first measure of
"distance" for warping is then evaluated as
D_{1} = (X - \bar{X})^{2}. (4)
In similar fashion, a number Y representative of the linear
component of variation in the values of A is prepared, and a
quadratic component of variation is evaluated as Z. A second
measure of distance is then evaluated as
D.sub.2 = Z.sup.2. (5)
Finally, a third measure of distance is developed as
where t.sub.m is the value of the t at the midpoint of the
utterance.
The three warping distance measures, D.sub.1, D.sub.2, and D.sub.3
from system 26 are then delivered together with 25 dissimilarity
measures from system 25 to combining unit 27 wherein a single
distance measure is developed. Preferably, each of the individual
distance values is suitably weighted. If the weighting function is
equal to one for each distance value, a simple summation is
performed. Other weighting systems may be employed in accordance
with experience, i.e., the error rate experienced in verifying
claims of identity of those references accommodated by the
system.
The warping function measurements are therefore delivered to
combining network 27 where they are combined with the numerical
values developed in apparatus 25. The composite distance measure is
thereupon used in threshold comparison network 28 to determine
whether the sample speech should be accepted or rejected as being
identical with the reference, i.e., to verify or reject the claim
of identity. Since the distance measure is in the form of a
numerical value, it may be matched directly against a stored
numerical value in apparatus 28. The stored threshold value is
selected to distribute the error possibility between a rejection of
true claims of identity versus the acceptance of false claims of
identity as illustrated in FIG. 4, discussed above. It is also
possible that the distance value is too close to the threshold
limit to permit a positive decision to be made. In this case, i.e.,
in an intermediate zone between accept and reject, a "no decision"
mark is issued. This may be used to request a repeat of the sample
utterance. If a new test sample is delivered, the entire process is
repeated. Alternatively, the no-decision signal may be used to
suggest that additional information about the individual claiming
identity is needed, e.g., in the form of other tangible
identification.
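The combining and threshold operations of units 27 and 28 can be
sketched as below. The threshold, the width of the intermediate
zone, and the function names are hypothetical; the patent leaves
these values to be chosen from experience.

```python
import numpy as np

def combined_distance(dissimilarity, warping, weights=None):
    """Weighted sum of the dissimilarity and warping distance values."""
    values = np.concatenate((np.ravel(dissimilarity), np.ravel(warping)))
    if weights is None:
        weights = np.ones_like(values)   # unit weights give a simple summation
    return float(np.dot(weights, values))

def decide(distance, threshold, margin):
    """Accept, reject, or withhold a decision near the threshold."""
    if distance < threshold - margin:
        return "accept"
    if distance > threshold + margin:
        return "reject"
    return "no decision"                 # e.g., request a repeat utterance

# 25 dissimilarity values and 3 warping values, made up for illustration.
distance = combined_distance(np.full(25, 0.1), [0.2, 0.3, 0.5])
outcome = decide(distance, threshold=5.0, margin=0.5)
```

Widening the margin reduces both kinds of error at the cost of more
repeat requests.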
FIGS. 8A, 8B and 8C illustrate the overall performance of the
system of the invention based on data developed in practice. In
FIG. 8A, waveforms of the sample sentence "We were away a year
ago." are shown for the first three formants, for the pitch period,
and for signal gain, both for a sample utterance and for an
averaged reference utterance. It will be observed that the
waveforms of the sample and reference are not in time registry.
FIG. 8B illustrates the same parameters after time adjustment,
i.e., after warping, for a sample utterance determined to be
substantially identical to the reference. In this case, the
dissimilarity measure is sufficiently low to yield an "accept"
signal, thus to verify the claim of identity. In FIG. 8C, the
sample and reference utterances of the test sentence have been
registered; yet it is evident that severe disparities are present
between the two. Hence, the resulting measure of dissimilarity is
sufficiently high to yield a "reject" signal.
Since the basic features of the invention involve the computation
of certain numerical values and certain comparison operations, it
is evident that the invention may most conveniently be turned to
account by way of a suitable program for a computer. Indeed, the
block schematic diagrams of FIGS. 1 and 5, together with the
mathematical relationships set forth in the specification and
figures constitute in essence a flowchart diagram illustrative of
the programming steps used in the practice of the invention.
* * * * *