U.S. patent number 3,892,919 [Application Number 05/414,746] was granted by the patent office on 1975-07-01 for speech synthesis system.
This patent grant is currently assigned to Hitachi, Ltd. Invention is credited to Akira Ichikawa.
United States Patent 3,892,919
Ichikawa
July 1, 1975
Speech synthesis system
Abstract
In a system in which a plurality of previously recorded
waveforms corresponding to phonetic elements separately picked up
from natural voice and having a pitch length, are connected to form
any required speech, the degradation in the quality of the
synthesized speech due to the discontinuity in the waveform of the
synthesized speech is prevented by so controlling the period of
reading out each phonetic element as to change the period stepwise
at intervals of several phonetic elements (i.e., pitch
lengths).
Inventors: Ichikawa; Akira (Kokubunji, JA)
Assignee: Hitachi, Ltd. (JA)
Family ID: 14600773
Appl. No.: 05/414,746
Filed: November 12, 1973
Foreign Application Priority Data
Nov 13, 1972 [JA] 47-112995
Current U.S. Class: 704/267; 704/E13.01; 704/268
Current CPC Class: G10L 13/07 (20130101)
Current International Class: G10L 13/00 (20060101); G10L 13/06 (20060101); G10l 001/00 ()
Field of Search: 179/1SM; 340/148-152
References Cited
U.S. Patent Documents
Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Kemeny; E. S. Matt
Attorney, Agent or Firm: Craig & Antonelli
Claims
I claim:
1. A speech synthesis system comprising:
a speech segment memory which stores a plurality of speech
segments;
a synthesizing apparatus, coupled to said speech segment memory,
including
first means for selecting desired speech segments from said speech
segment memory,
second means, coupled to said first means, for controlling the
pitch period of each of said desired speech segments, so as to
change the pitch frequency of the synthesized speech step-wise,
and
third means, coupled to said first means and said second means, for
connecting the desired pitch controlled speech segments
together.
2. A speech synthesis system according to claim 1, wherein said
second means includes means for adjusting the intervals of the
stepwise change between a quarter of a syllable and a full
syllable.
3. A speech synthesis system comprising:
a. a data processing circuit to convert code signals representative
of the syllables of words to be synthesized into control signals
for speech synthesis;
b. a speech segment memory to store speech segments each having a
waveform of about a pitch length;
c. a speech synthesizing apparatus, coupled to said data processing
circuit and to said speech segment memory, and including
first means for selecting desired speech segments in said speech
segment memory
second means, coupled to said first means, for controlling the read
time of each of the selected speech segments, so as to change the
pitch period of the synthesized speech stepwise at intervals of a
quarter of a syllable to a full syllable and,
third means, coupled to said first means and to said second means,
for synthesizing speech sound waveform signals by connecting the
selected and pitch controlled speech segment signals together, in
response to the control signals from the data processing circuit;
and
d. an electro-acoustic converting device, coupled to said speech
segment memory and said speech synthesizing apparatus, for
converting the speech sound waveform signals into corresponding
speech sounds.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech synthesis system and more
particularly to a system in which a sound wave extracted from
natural voice and having about a pitch length is used as a phonetic
segment or speech segment and in which the phonetic segments
previously stored are selectively connected at controlled periods
due to control signals corresponding to a required word or a
sentence to be synthesized.
2. Description of the Prior Art
In recent years, information service systems which connect data
processing devices such as electronic computers with communication
lines such as telephone lines have been developed. In such a system, a
remote subscriber's question sent through a communication line is
received by a central signal processing device which stores a large
amount of information; the device prepares an answer to the question
and sends it back to the subscriber, the answer being in the form of
sound resembling the human voice.
In this system, the most important part is the speech synthesis part,
which produces the answer in the form of voice.
The requirements for the speech synthesis part are as follows: (1)
the synthesized speech must be as near the human voice as possible;
(2) the production cost must be low; and (3) the system incorporating
the part must permit multiple uses, that is, the part must be able to
generate a plurality of speech outputs at a time.
In a conventional speech synthesis system which is rather
satisfactory from the standpoint of the above mentioned
requirements, a plurality of sound waveforms each having a pitch
length are previously prepared so as to be used as speech sound
waveforms, i.e. speech segments, and the speech segments are
selectively connected due to control signals corresponding to words
or sentences to be synthesized.
This conventional system is rather cheap since any desired speech
can be synthesized by connecting speech segments each having a
waveform of a pitch length so that the number of the stored speech
segments is relatively small. The speech segments can be read out
rapidly, that is, the access time is very short, so that the
multiple synthesis of speech is possible.
Moreover, the read-out time of a speech segment, that is, the
length of the waveform of the speech segment can be controlled so
that the pitch of the synthesized speech can also be
controlled.
Although the conventional system has several merits as mentioned
above, it has also been revealed by the inventors' experiments that
the speech synthesized by the conventional system suffers from
hoarse noises and that the vocal quality thereof is very poor. The
cause of such a drawback is as follows. Namely, in this speech
synthesis system, connected speech is formed by connecting the
waveforms of speech segments and therefore a discontinuity, i.e.
rapid change in amplitude, is caused in the junction portion
between any two adjacent waveforms of speech segments and such
discontinuities appear every pitch period (equal to the fundamental
period of speech and having an audible range of frequencies) to
generate hoarse noises in synthesized speech.
SUMMARY OF THE INVENTION
One object of the present invention is to improve the quality of
the synthesized speech produced by a speech synthesis system in
which a plurality of speech sound waveforms, each having a pitch
length, to be used as speech segments are recorded and these speech
segments are selectively connected to form synthesized speech.
Another object of the present invention is to provide a speech
synthesis system in which a plurality of speech sound waveforms,
each having a pitch length, to be used as speech segments are
recorded and these speech segments are selectively connected to
form synthesized speech, and in which the pitch control of speech
sounds is simplified so that the system can be economically
fabricated without deterioration in the vocal quality of the
synthesized speech.
According to the present invention, which has been made to attain
the above objects, in a speech synthesizing system in which speech
segments, each having a pitch length, are selectively connected to
synthesize desired speech, the time of reading out each speech
segment, that is, the length of the waveform of each speech segment
of the synthesized speech, is changed stepwise at intervals of
several speech segments. Namely, the lengths of the waveforms of the
speech segments read out are changed at intervals of a quarter of a
syllable to a full syllable. Therefore, the system according to the
present invention can produce synthesized speech softer to the ear
than that produced by a conventional speech synthesis system in which
the length of the waveform of every speech segment is controlled
individually.
Other objects, features and advantages of the present invention
will be made apparent when one reads the following part of the
specification with the aid of the attached drawings.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is an oscillographic representation of a monosyllable speech
sound waveform.
FIG. 2 shows the modes of variation in the pitch frequency of
monosyllable speech sounds in various pronunciations.
FIG. 3 shows the variations in the pitch frequency of one word.
FIGS. 4A and 4B show waveforms illustrating the discontinuities
resulting from the connection of separate speech segments.
FIG. 5 shows the variation in pitch frequency of the synthesized
speech formed by the speech synthesizing system according to the
present invention.
FIG. 6 is a block diagram of a speech synthesis system embodying
the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In FIG. 1, the waveform of a monosyllable speech sound is shown in
a rectangular coordinate system in which the abscissa represents
the time base and the ordinate gives the amplitude of waveform. As
seen from FIG. 1, the waveform of the monosyllable speech sound
consists of an irregular portion C like that of a consonant and a
periodical portion V like that of a vowel. Especially, every
syllable of the Japanese speech is composed of a single consonant
followed by a single vowel or of a single vowel. And about one
hundred different syllables can make up all the speech sounds
covering the entire vocabulary of the Japanese language. Of the
portions of the waveform shown in FIG. 1, the more important is the
periodical portion V which occupies most part of the monosyllable
speech sound waveform and forms the factors of the pitch,
intonation and tone (indicating the kind of syllable) of the speech
sound.
Namely, the pitch or intonation of the speech sound depends mainly
on the repetition periods T.sub.1, T.sub.2, . . . , T.sub.n, i.e.
the pitch period, while the tone is determined by the frequency
characteristic of the periodical portion V. The pitch period is
usually 10 to 20 milliseconds.
FIG. 2 shows the variation in the pitch frequency (defined as the
reciprocal of the pitch period) with time of the monosyllable
speech sound shown in FIG. 1. In FIG. 2, the abscissa and the
ordinate respectively represent the time base and the pitch
frequency. When a monosyllable speech sound is individually
pronounced, it has a characteristic curve 1 convex up as shown in
FIG. 2. However, when the same speech sound is pronounced in a word
or sentence, it may assume characteristic curves 2, 3 and 4
corresponding respectively to level, rising and falling intonation,
depending upon the position it assumes in the word or sentence or
upon the kind of word or sentence.
Accordingly, in the case where the connected speech sounds
corresponding to a desired word or sentence are formed by
connecting together the prerecorded speech segments, i.e. speech
sound waveforms obtained by dividing the waveform of the
monosyllable speech sound as shown in FIG. 1, pronounced in a
manner corresponding to the curve 1 in FIG. 2, into units, each
having a pitch length, the discontinuities are formed in the
junction points between the unit waveforms, i.e. speech segment
waveforms, the discontinuities being the portions where the
amplitudes of the waveform rapidly change.
Such discontinuities will be described in further detail. FIG. 3
shows the variation in pitch frequency with time of the speech
sound corresponding to a word, in which the abscissa and the
ordinate respectively represent the time base and the pitch
frequency. In FIG. 3, curve 5 indicates the mode of the variation
in pitch frequency of natural speech sound corresponding to a word
to be synthesized, while curve 6 shows the mode of the variation in
the pitch frequency of the monosyllable speech sound corresponding
to the curve 1 in FIG. 2. The abscissa is divided into
pronunciation intervals t.sub.1, t.sub.2, . . . , t.sub.5 of the
monosyllable sounds. Accordingly, in order that the speech sound
having a pitch frequency characteristic corresponding to the curve
5 may be composed of speech segments obtained from the natural
voice having a pitch frequency characteristic corresponding to the
curve 6, the length of the waveform of each speech segment, i.e. the
pitch period, has to be controlled. Therefore, if the waveforms of
the speech segments having pitch periods T.sub.1, T.sub.2, . . . ,
T.sub.4 as in FIG. 1 are connected and synthesized into connected
speech having pitch periods longer or shorter than those periods
T.sub.1, T.sub.2, T.sub.3 and T.sub.4, then the discontinuities 7
are formed in the junction portion of the respective speech segment
waveforms as shown in FIGS. 4A and 4B. FIG. 4A corresponds to the
case where the synthesized speech has a pitch frequency higher than
that of the original natural voice from which the speech segment
waveforms are obtained and has a pitch period shorter than that of
the natural voice. FIG. 4B, on the other hand, corresponds to the
case where the synthesized speech has a pitch frequency lower than
that of the original natural voice and a pitch period longer than
that of the original natural voice. The discontinuities 7 thus
produced deteriorate the vocal quality of the synthesized speech
and also generate hoarse noises.
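The effect shown in FIG. 4A can be illustrated numerically. The following is a minimal sketch, not the patent's method: it assumes a hypothetical stored segment consisting of exactly one period of a sine wave, and the names `stored_segment` and `junction_jump` are invented for illustration. Reading the full segment gives a smooth junction with the next segment, whereas reading only part of it (a shorter pitch period) leaves an abrupt amplitude step.

```python
import math

def stored_segment(n_samples: int) -> list[float]:
    """Hypothetical stored segment: one full period of a sine wave."""
    return [math.sin(2 * math.pi * i / n_samples) for i in range(n_samples)]

def junction_jump(segment: list[float], read_length: int) -> float:
    """Amplitude step between the last sample read out and the first
    sample of the next (identical) segment connected after it."""
    last_sample = segment[read_length - 1]
    return abs(segment[0] - last_sample)

segment = stored_segment(100)
natural = junction_jump(segment, 100)    # full period read out
shortened = junction_jump(segment, 75)   # pitch period shortened, as in FIG. 4A
print(f"natural junction step:   {natural:.3f}")
print(f"shortened junction step: {shortened:.3f}")
```

Because such a step recurs once per pitch period, it repeats at an audible rate, which is consistent with the hoarse noise described above.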
In order to eliminate the influence of the discontinuities, a
special treatment of waveforms must be introduced. According to the
present invention, the degradation of the vocal quality due to the
discontinuities can be prevented since the way of the pitch control
in the speech control system is improved, and moreover a system can
be realized in which the pitch control is further simplified by
making the best use of the merits of the speech synthesizing system
in which speech segments are connected to form synthesized
speech.
Namely, as shown in FIG. 5, the pitch frequency or the pitch period
of the synthesized speech is changed stepwise at intervals of a
quarter of a syllable to a full syllable. It has been empirically
verified that synthesized speech having a pitch frequency
characteristic corresponding to the stair-step curve 8, indicated by
the dotted line in FIG. 5, has a vocal quality superior to that having
the pitch frequency characteristic indicated by the solid curve 5 in
FIG. 5. In this case, it is unnecessary to perform the pitch control
for every speech segment, and since the pitch periods of the
successive speech segments within a step are all the same, the pitch
control system of the speech synthesis system is simplified.
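The stepwise control can be sketched as follows. This is an illustrative sketch under stated assumptions: the text does not say how the level of each step is chosen, so the mean of the target contour over the step is used here, and the function name `staircase` is hypothetical.

```python
def staircase(contour: list[float], steps: int) -> list[float]:
    """Replace a per-segment pitch contour for one syllable with
    `steps` constant levels (1 = full syllable, 4 = quarter syllable).

    Assumption: each level is the mean of the contour over that step.
    """
    if not 1 <= steps <= 4:
        raise ValueError("steps must be between 1 and 4")
    n = len(contour)
    result: list[float] = []
    for k in range(steps):
        block = contour[k * n // steps:(k + 1) * n // steps]
        if block:
            level = sum(block) / len(block)
            result.extend([level] * len(block))
    return result

# Rising pitch contour (Hz) over six segments, quantized into three steps.
print(staircase([100, 110, 120, 130, 140, 150], 3))
```

Within each step every segment is read out with the same length, so the pitch value only has to be updated a few times per syllable rather than once per segment.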
In the following, the present invention will be described by way of
a preferred embodiment.
FIG. 6 is a block diagram of a concrete structure of a speech
synthesis system embodying the present invention.
First, a speech segment memory 32 is described for convenience'
sake. In the memory 32, the speech sound waveforms of all the
syllables necessary for the speech synthesis are stored in a high
speed memory device such as a core memory. Each syllable in the
memory consists of time-sequentially arranged speech segments
constituting a waveform as shown in FIG. 1 and the waveform of each
speech segment has an address allotted to indicate its location in
the memory. In a monosyllable, serial numbers are allotted to the
addresses of the speech segment waveforms arranged in
time-sequence. Therefore, the first address is used as a syllable
address to represent the syllable.
Each speech segment waveform is obtained by sampling the speech
sound waveform shown in FIG. 1 at 8 kHz, and each of the sampled
values is coded into an 8-bit signal. The period at which one
speech segment, i.e. wave portion within T.sub.1, T.sub.2, T.sub.3
or T.sub.4 in FIG. 1, is recorded is 10 to 20 msec. Namely, the
period is set equal to the maximum one of the pitch periods of
speech sounds to be synthesized.
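The storage implied by these figures follows from simple arithmetic. The sketch below uses only the sampling rate, word length and period range stated above; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the text
BITS_PER_SAMPLE = 8     # each sampled value is coded into 8 bits

def samples_for_period(period_ms: float) -> int:
    """Number of samples spanning a pitch period given in milliseconds."""
    return round(SAMPLE_RATE_HZ * period_ms / 1000)

def pitch_frequency_hz(period_ms: float) -> float:
    """Pitch frequency, defined as the reciprocal of the pitch period."""
    return 1000.0 / period_ms

for period_ms in (10, 20):
    n = samples_for_period(period_ms)
    n_bytes = n * BITS_PER_SAMPLE // 8
    print(f"{period_ms} msec: {n} samples, {n_bytes} bytes, "
          f"{pitch_frequency_hz(period_ms):.0f} Hz")
```

A segment recorded at the maximum period of 20 msec therefore occupies 160 eight-bit samples, and a shorter pitch period is obtained by reading out fewer of them.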
A series of code signals, each representing one syllable, to
constitute speech to be synthesized are received at a terminal 9
and fed through an input-output control circuit 10 to a data
processing circuit 11. For example, code signals corresponding to
the syllables YO, KO, HA and MA constituting the name of a famous
port city of Japan, are applied to the circuit 11. The device to
generate such code signals is not within the scope of this
invention and not shown in the figure, but the device is equivalent
to the conventional automatic response system, being designed to
form data for answers to preset questions and to connect the code
signals according to the arrangement of words corresponding to
those answers.
The data processing circuit 11 interprets the code signals
according to the predetermined program and generates signals
instructing and controlling the operations of the respective parts
of the speech synthesizing apparatus described later.
The operation of the circuit 11 will be described in further
detail. Judging from the series of code signals, the circuit 11
generates speech segment information, pitch information and
syllable time information according to a reference table.
The speech segment information is, for example, the address of the
first speech segment of a syllable stored in the speech segment
memory 32 described above; the pitch information is the information
indicated by dotted curve 8 in FIG. 5, that is, the number
indicating how many samples, counted from the first one, of the
speech segments stored in the memory 32 is to be read out; and the
syllable time information is the time information representing
t.sub.1 to t.sub.5 in FIG. 5, that is, the number of samples to be
read out within the time of one syllable.
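The three kinds of information can be pictured as one record per syllable. This is a hypothetical sketch: the table contents, the field names and the function `make_control` are assumptions made for illustration, and the syllable time is interpreted here as the number of samples per pitch step.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 8000  # sampling rate stated in the text

@dataclass
class SyllableControl:
    syllable_address: int    # address of the first speech segment of the syllable
    pitch_counts: list[int]  # samples to read per segment, one value per step
    step_samples: int        # samples during which the pitch stays constant

# Hypothetical reference table: syllable code -> first-segment address.
SYLLABLE_ADDRESS = {"YO": 0, "KO": 40, "HA": 80, "MA": 120}

def make_control(code: str, pitch_hz_steps: list[float],
                 syllable_ms: float) -> SyllableControl:
    """Build the control record for one syllable from a stepwise pitch
    contour (in Hz) and the syllable duration (in milliseconds)."""
    counts = [round(SAMPLE_RATE_HZ / f) for f in pitch_hz_steps]
    step_samples = round(SAMPLE_RATE_HZ * syllable_ms / 1000 / len(pitch_hz_steps))
    return SyllableControl(SYLLABLE_ADDRESS[code], counts, step_samples)

print(make_control("KO", [100, 125, 160], 150))
```

Expressing the pitch as a sample count rather than a frequency matches the description above, in which the pitch information is the number of samples to be read out of each stored segment.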
The data processing circuit to perform such processing as described
above may be designed especially for the present invention, but a
general purpose computer can be used as such a circuit, so that the
details thereof are omitted.
The three kinds of information are respectively stored as
time-sequential signals in a syllable address buffer memory 14, a
pitch time buffer memory 15 and a syllable time buffer memory 16 of
a speech synthesizing apparatus 13. The speech synthesizing
apparatus 13 consists of a part to select speech segments necessary
to synthesize connected speech according to the speech segment
information, a part to determine the pitch periods of the speech
segments according to the pitch information and a part to determine
the times allotted to syllables according to the syllable time
information.
Next, the operations of the respective components of the speech
synthesizing apparatus 13 will be described.
The address data of the syllable address buffer memory 14 are
transferred one by one to a segment address memory 17 in response to
an external signal, and simultaneously the data in the syllable
address buffer memory 14 are shifted forward so that the address of
the next syllable comes to the head position. Namely, the memory 14
and the memory 17 may be considered to form a shift register.
Likewise, the combination of the pitch time buffer memory 15 and a
pitch time memory 18, or of the syllable time buffer memory 16 and a
syllable time memory 19, may be considered to form a shift register.
With the circuit arrangement described above, the address signal
of the first speech segment of a syllable stored in the segment
address memory 17 is applied to a read out circuit 29 so that a series
of sampled values constituting the segment are sequentially read out
in synchronism with clock pulses from a clock signal generator 20.
The number of samples read out is detected by counting the clock
pulses with a pitch counter 22. When the content of the pitch counter
22 coincides with the pitch time data set in the pitch time memory 18,
a coincidence circuit 25 detects the instant of coincidence and
delivers a coincidence pulse. The coincidence pulse serves not only
to reset the pitch counter 22 but also to advance a segment address
counter 21 by one step. The output of the advanced segment address
counter 21 is applied to the segment address memory 17 to read out
the next speech segment from the speech segment memory 32 in the
same manner as described above. Thereafter, the same operation of
reading out the sampled values is repeated. The coincidence
pulse also resets the counter 23 at the same time.
On the other hand, the time counter 23 also counts the clock
pulses, and when the content of the time counter 23 coincides with
the syllable time data (that is, the number of sampling points
occurring within a time during which the pitch frequency in one
syllable remains the same, as described above) set in the syllable
time memory 19, a coincidence circuit 26 detects the instant of
coincidence and delivers a coincidence pulse at that instant.
The coincidence pulse serves not only to transfer the
foremost pitch time data of the pitch time buffer memory 15 to the
pitch time memory 18, but also to advance a syllable counter 24 by one
step. When the content of the syllable counter 24 coincides with
the step number recorded in a step number memory 28, a coincidence
circuit 27 detects the instant of coincidence and delivers a
coincidence pulse. The coincidence pulse resets the counter 24 and
is also applied to the syllable address buffer memory 14 and the
syllable time buffer memory 16 so that the control information for
the syllable to be synthesized next, i.e. the segment address and time
data for the syllable, is transferred respectively to the memory 17
and the memory 19. The step number stored in the step number memory 28
refers to the number of steps occurring within the time of one
syllable when the pitch frequency is changed stepwise as shown in
FIG. 5. In the case of FIG. 5, the number of steps is three. As has
been revealed by the inventors' experiments, the deterioration of the
vocal quality of the synthesized speech due to the waveform
discontinuities is reduced to the minimum when the number of steps is
three. However, the number of steps need not be limited to 3 but may
be from 4 to 1, that is, the pitch frequency of the synthesized speech
sounds may be varied at intervals of a quarter of a syllable to a full
syllable.
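The interplay of the counters and coincidence circuits can be followed in a small software model. This is a simplified sketch under several assumptions, not the circuit itself: the function name, the tuple layout and the test data are hypothetical; the time counter is modelled as running freely across segment boundaries within a step; the comparison uses "greater than or equal" so that a pitch time shortened in the middle of a segment still triggers; and the last stored segment is repeated if a syllable runs out of segments.

```python
def synthesize(memory, syllables, step_number=3):
    """Software model of the read-out loop.

    memory:    list of stored segments (sample lists of maximum pitch length)
    syllables: list of (first_address, n_segments, pitch_times, step_samples)
    """
    out = []
    for first_address, n_segments, pitch_times, step_samples in syllables:
        segment_offset = 0   # segment address counter 21
        pitch_count = 0      # pitch counter 22
        for step in range(step_number):        # syllable counter 24
            pitch_time = pitch_times[step]     # value held in pitch time memory 18
            for _ in range(step_samples):      # time counter 23 vs. syllable time data
                # Assumption: repeat the last segment if the syllable runs out.
                offset = min(segment_offset, n_segments - 1)
                out.append(memory[first_address + offset][pitch_count])
                pitch_count += 1
                if pitch_count >= pitch_time:  # coincidence circuit 25
                    pitch_count = 0
                    segment_offset += 1
    return out

# Three toy segments of four samples each; segment s holds 10*s .. 10*s+3.
memory = [[10 * s + i for i in range(4)] for s in range(3)]
print(synthesize(memory, [(0, 3, [4, 2, 3], 4)]))
```

In the toy run, the first step reads each segment in full, while the second and third steps read only two and three samples per segment, so the pitch period changes only at step boundaries.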
The output signal obtained from the read out circuit 29 as a result
of the operations described above is equivalent to a signal
obtained by subjecting the signal waveform shown in FIG. 4A or 4B
to pulse code modulation, since the speech synthesizing apparatus 13
consists of digital circuits. The signal is then converted to an
analog signal through a digital-to-analog converter 30 and the
analog signal is finally converted to a speech sound signal or
audible voice through an electro-acoustic transducer 31. In this
case, the digital-to-analog converter 30 and the electro-acoustic
transducer 31 are connected by such a transmission line as a
telephone which electrically connects a remote subscriber with the
central information service system.
The speech synthesis system shown in FIG. 6 has been described as
applied to the case where the speech sounds only for one channel
are synthesized. It is, however, a matter of course that since the
whole system is composed of digital signal treating circuits and
the speech segments are stored in such a memory as a core memory
capable of high speed access then the system can be easily
constructed in a multichannel arrangement as known in the field of
the art.
Namely, such an arrangement for multichannel purpose can be
realized if the input-output control circuit 10, the data
processing circuit 11 and the speech segment memory 32 are used
commonly and if the number of the speech synthesizing apparatuses
13 is increased according to the number of channels required.
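The multichannel arrangement can be sketched in software as several independent read-out processes sharing one segment memory. This is an illustrative sketch only: the names `channel` and `run_channels` and the schedule format of (address, read length) pairs are assumptions, and each generator stands in for one speech synthesizing apparatus.

```python
def channel(memory, schedule):
    """One synthesizing apparatus: yields samples for a list of
    (segment_address, read_length) pairs taken from the shared memory."""
    for address, read_length in schedule:
        yield from memory[address][:read_length]

def run_channels(memory, schedules):
    """Step all channels from a common clock, one sample per channel per
    tick, and return one output list per channel."""
    generators = [channel(memory, s) for s in schedules]
    outputs = [[] for _ in schedules]
    active = set(range(len(generators)))
    while active:
        for i in sorted(active):
            try:
                outputs[i].append(next(generators[i]))
            except StopIteration:
                active.discard(i)
    return outputs

# Shared, read-only segment memory used by two channels at once.
shared_memory = ((1, 2, 3), (4, 5, 6))
print(run_channels(shared_memory, [[(0, 2), (1, 3)], [(1, 1)]]))
```

Since the memory is only read, the channels do not interfere with one another; only the per-channel counters and buffers need to be duplicated.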
Moreover, the speech segments stored in the speech segment memory
may be obtained by directly extracting the components from the
natural human voice or by artificially treating the waveforms of
the human speech sounds.
* * * * *