U.S. patent application number 11/009966 was filed with the patent office on December 10, 2004, and published on June 15, 2006, as publication number 20060129400, for a method and system for converting text to lip-synchronized speech in real time. This patent application is currently assigned to Microsoft Corporation. Invention is credited to Brandon Cotton and Timothy V. Fields.
United States Patent Application 20060129400
Kind Code: A1
Fields; Timothy V.; et al.
June 15, 2006

Method and system for converting text to lip-synchronized speech in real time
Abstract
A method and system for presenting lip-synchronized speech corresponding to text received in real time is provided. A lip synchronization system provides an image of a character that is to be portrayed as speaking text received in real time. The lip synchronization system receives a sequence of text corresponding to the speech of the character. It may modify the received text in various ways before synchronizing the lips, and it may generate phonemes for the modified text that are adapted to certain idioms. It may also identify expressions from the received text. The lip synchronization system then generates the lip-synchronized images based on the phonemes generated from the modified text and based on the identified expressions.
Inventors: Fields; Timothy V. (Austin, TX); Cotton; Brandon (Austin, TX)
Correspondence Address:
PERKINS COIE LLP/MSFT
P.O. BOX 1247
SEATTLE, WA 98111-1247, US
Assignee: Microsoft Corporation, Redmond, WA
Family ID: 36585182
Appl. No.: 11/009966
Filed: December 10, 2004
Current U.S. Class: 704/260; 704/270; 704/E21.02
Current CPC Class: G10L 2021/105 (20130101)
Class at Publication: 704/260; 704/270
International Class: G10L 13/08 20060101 G10L013/08
Claims
1. A method for presenting information in real time, the method
comprising: providing an image of a character; receiving a sequence
of text; modifying the text of the received sequence; generating
speech corresponding to the modified text; generating a sequence of
images based on the provided image to represent the character
speaking the generated speech; and outputting the generated speech
and sequence of images to portray the character speaking the
text.
2. The method of claim 1 wherein the text is closed-captioned text
of a television broadcast.
3. The method of claim 1 wherein the text is entered via a keyboard
by a participant in a computer-based chat session.
4. The method of claim 1 wherein the text is modified by expanding
acronyms.
5. The method of claim 1 wherein the text is modified to reflect an
idiom.
6. The method of claim 5 wherein the idiom is associated with the
character.
7. The method of claim 1 wherein the generating of speech includes
identifying phonemes from the modified text.
8. The method of claim 7 wherein the phonemes are identified to
reflect an idiom.
9. The method of claim 1 including identifying expressions from the
sequence of text and wherein the generated sequence of images
reflects the identified expressions.
10. The method of claim 9 wherein different images of the character
are provided for different expressions.
11. The method of claim 1 wherein the generating of the sequence of
images represents the character lip-syncing the generated
speech.
12. The method of claim 1 including identifying expressions from
the sequence of text and wherein the generated sequence of images
reflects the identified expressions while the character lip-syncs
the generated speech.
13. A system for presenting a lip-syncing character, comprising: an
image store containing an image of a character; a modify text
component that receives text in real time and modifies the text;
and a lip synchronization component that inputs the modified text
and the image of the character and outputs speech corresponding to
the modified text and images of the character speaking the output
speech in real time as the text is received.
14. The system of claim 13 wherein the text is closed-captioned
text of a television broadcast.
15. The system of claim 13 wherein the text is entered via a
keyboard by a participant in a computer-based chat session.
16. The system of claim 13 wherein the text is modified by
expanding acronyms.
17. The system of claim 13 wherein the text is modified to reflect
an idiom.
18. The system of claim 13 wherein the generating of speech
includes identifying phonemes from the modified text.
19. The system of claim 18 wherein the phonemes are identified to
reflect an idiom.
20. The system of claim 13 including a component that identifies expressions from the received text so that the images of the character speaking the output speech can reflect the expressions.
21. A computer-readable medium containing instructions for
controlling a computer to present images of a character speaking,
by a method comprising: providing an image of a character;
receiving a sequence of text in real time; generating speech
corresponding to the received text; generating a sequence of images
based on the provided image to represent the character speaking the
generated speech; and outputting the generated speech and sequence
of images to portray the character speaking the text.
22. The computer-readable medium of claim 21 wherein the text is
closed-captioned text.
23. The computer-readable medium of claim 21 wherein the text is
entered by a participant in a computer-based chat session.
24. The computer-readable medium of claim 21 including modifying
the text before generating the speech by expanding acronyms.
25. The computer-readable medium of claim 21 including modifying
the text before generating the speech to reflect an idiom.
26. The computer-readable medium of claim 21 wherein the generating
of speech includes identifying phonemes from the text.
27. The computer-readable medium of claim 26 wherein the phonemes
are identified to reflect an idiom.
28. The computer-readable medium of claim 21 including identifying
expressions from the sequence of text and wherein the generated
sequence of images reflects the identified expressions.
29. The computer-readable medium of claim 28 wherein different
images of the character are provided for different expressions.
30. The computer-readable medium of claim 21 wherein the generating
of the sequence of images represents the character lip-syncing the
generated speech.
31. The computer-readable medium of claim 21 including identifying
expressions from the sequence of text and wherein the generated
sequence of images reflects the identified expressions while the
character lip-syncs the generated speech.
Description
TECHNICAL FIELD
[0001] The described technology relates to synchronizing lip
movement of a character with speech of the character.
BACKGROUND
[0002] Many types of lip synchronization software are currently
available. One type of lip synchronization software inputs an image
of a person and a sequence of phonemes and outputs a sequence of
images of the person with their lip movement synchronized to the
phonemes. When the audio of the phonemes (e.g., via an enunciator)
is output simultaneously with the sequence of images, the character
appears to be speaking the audio and is sometimes referred to as a
"talking head." Another type of lip synchronization software
additionally inputs expressions and adjusts the image of the
character to reflect those expressions. For example, the
expressions may be used to reflect sadness, happiness, worry,
surprise, fright, and so on. Lip synchronization software may use
morphing techniques to transition between phonemes and between the
different expressions. For example, a change in expression from sad
to happy may occur over a two-second interval, rather than from one
update of the image to the next.
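By way of illustration, a minimal sketch of such a morph is given below; the pose representation (a short list of facial control values), the frame rate, and the control values themselves are assumptions made for the example, not details of any particular lip synchronization product.

```python
def morph(start_pose, end_pose, duration_secs=2.0, fps=30):
    """Yield poses that blend start_pose into end_pose over the interval."""
    frames = int(duration_secs * fps)
    for i in range(frames + 1):
        t = i / frames  # blend factor running from 0.0 to 1.0
        yield [s + (e - s) * t for s, e in zip(start_pose, end_pose)]

sad = [0.9, 0.1, 0.2]    # hypothetical facial control values for "sad"
happy = [0.1, 0.9, 0.8]  # hypothetical facial control values for "happy"
for pose in morph(sad, happy):
    pass  # each intermediate pose would be rendered here
```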
[0003] Lip synchronization software has been used in many
applications including game and Internet communications. Game
applications may provide images of characters of the game along
with the voice of the characters. The voice of a character may be
augmented with lip movement instructions that indicate how the lips
are to move to correspond to the voice. When a character of the
game is to speak, the game provides the lip synchronization
software with the lip movement instructions (which may be
represented by phonemes) along with an image of the character. The
lip synchronization software then controls the display of the
character with lips synchronized to the voice. Internet
communication applications have used lip synchronization software
to display a talking head representing a person who is currently
speaking remotely. As a person speaks, corresponding lip movement
instructions may be transmitted along with the voice to the
computer systems of listeners. The lip movement instructions can be
created in various ways. The lip movement instructions can be
derived from analysis of the person's actual lip movement or can be
a sequence of phonemes derived from the voice. A listener's
computer system can display an image of the person (or caricature
of the person) with the lips synchronized to the voice based on the
lip movement instructions. The sending of lip movement instructions
requires significantly less bandwidth than the sending of a video
of the person. Thus, lip synchronization software can be used in
situations where sending of video is not practical.
[0004] Typical applications that use lip synchronization software
identify lip movement instructions either automatically as a person
speaks or manually as specified by a developer of the application.
Some applications may automatically generate lip movement
instructions and then allow for manual modification of the
instructions to achieve a desired effect.
[0005] It would be desirable to have a system that would
automatically generate a talking head based on text, rather than
voice, that is received in real time. There are many environments
in which text is generated in real time, such as closed-captioned
text of television broadcasts, text entered via a keyboard during
an Internet chat or instant messaging session, text generated by a
stenographer, and so on.
SUMMARY
[0006] A method and system for presenting lip-synchronized speech corresponding to text received in real time is provided. A lip synchronization system provides an image of a character that is to be portrayed as speaking text received in real time. The lip synchronization system receives a sequence of text corresponding to the speech of the character. It may modify the received text in various ways before synchronizing the lips, and it may generate phonemes for the modified text that are adapted to certain idioms. It may also identify expressions from the received text. The lip synchronization system then generates the lip-synchronized images based on the phonemes generated from the modified text and based on the identified expressions.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] FIG. 1 is a block diagram that illustrates components of the
lip synchronization system in one embodiment.
[0008] FIG. 2 is a flow diagram that illustrates the processing of
the text modifier component of the lip synchronization system in
one embodiment.
[0009] FIG. 3 is a flow diagram that illustrates the processing of
the phoneme generator component of the lip synchronization system
in one embodiment.
[0010] FIG. 4 is a flow diagram that illustrates the processing of
the expression identifier component of the lip synchronization
system in one embodiment.
DETAILED DESCRIPTION
[0011] A method and system for presenting lip-synchronized speech corresponding to text received in real time is provided. A lip
synchronization system provides an image of a character that is to
be portrayed as speaking text received in real time. The character
may be an actual or animated person, animal, or any other thing
that can appear to speak. The lip synchronization system receives a
sequence of text corresponding to the speech of the character. For
example, the received text may be the closed-captioned text of a television broadcast, text entered by a participant in a real-time communication session, and so on. The lip synchronization system
may modify the received text in various ways before synchronizing
the lips. For example, if the text is closed-captioned text, then
the lip synchronization system may add, remove, or replace words.
The lip synchronization system may replace certain acronyms with their corresponding words, such as replacing the acronym "BRB" used in a chat session with "I'll be right back." It may replace words with more or less complex equivalents to raise or lower the sophistication of the speech. The lip synchronization system may add text to effect various idioms. For example, the lip synchronization
system may add an "ummm," an "eh," or slang words to the text to
produce certain effects, such as making the speaker appear confused
or stumbling over words. The lip synchronization system may
generate phonemes for the modified text that are adapted to certain
idioms. For example, the lip synchronization system may select
phonemes to affect a certain accent. The lip synchronization system
may also identify expressions from the received text. For example,
the lip synchronization system may detect the words "[laughter]"
or "[crying]" in closed-captioned text and identify the expressions
of laughing or crying. The lip synchronization system then
generates the lip-synchronized images based on the phonemes
generated from the modified text and based on the identified
expressions. In this way, when the system outputs the images and
audio of the modified text, the character's lips are synchronized
with the audio.
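As a minimal sketch of the text modifications just described, assuming a hypothetical acronym table and a hesitation rule (neither is specified by the described system):

```python
# Hypothetical acronym table; a real rule store could be much larger.
ACRONYMS = {"BRB": "I'll be right back", "LOL": "laughing out loud"}

def modify_text(text, hesitate=False):
    """Expand known acronyms and optionally add a hesitation idiom."""
    words = []
    for word in text.split():
        key = word.strip(",.!?").upper()  # simplified punctuation handling
        words.append(ACRONYMS.get(key, word))
    if hesitate:
        words.insert(0, "ummm,")  # idiom effect: a confused-sounding opener
    return " ".join(words)

print(modify_text("BRB the phone is ringing", hesitate=True))
# -> ummm, I'll be right back the phone is ringing
```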
[0012] FIG. 1 is a block diagram that illustrates components of the
lip synchronization system in one embodiment. The lip
synchronization system includes a text modifier component 101, a
phoneme generator component 102, an expression identifier component
103, and a talking head component 104. The text modifier component
inputs text as it is received in real time and modifies the text
according to rules stored in a text rule store 105. The rules may
specify how to add, remove, and replace words within the text. The
text modifier component provides the modified text to the phoneme
generator component. The phoneme generator component converts the
modified text into a sequence of phonemes based on the mapping of
words to phonemes stored in a phoneme store 106. The phoneme store
may contain phonemes that reflect various idioms, such as accents.
The phoneme generator component then provides the sequence of
phonemes to the talking head component. The expression identifier
component receives the text in real time and identifies expressions
for the character from the text. The expression identifier
component may be customized to identify expressions in a way that
is unique to the character. For example, if an expression of
sadness would normally be identified, the expression identifier
component may identify happiness instead to portray the character's
disregard of a sad situation. The expression identifier component
then provides the expressions to the talking head component. The
expressions and phonemes may be mapped to the underlying text so
that the talking head component can synchronize the expressions and
the phonemes. The talking head component, which may be a
conventional component, displays an image of the character
corresponding to the current expression that is retrieved from an
expression store 107. The talking head component modifies the lips
of the character based on the sequence of phonemes so that the lips
are synchronized with the phonemes. The talking head component then outputs the sequence of images of the character and enunciates the sequence of phonemes to effect a talking head that speaks the text in real time as it is received.
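The wiring of these components might be sketched as follows; the class and method names are assumptions for illustration and do not appear in the system description:

```python
class LipSyncPipeline:
    """Minimal wiring of the FIG. 1 components: the raw text feeds both
    the text modifier and the expression identifier, and their outputs
    meet at the talking head."""

    def __init__(self, text_modifier, phoneme_generator,
                 expression_identifier, talking_head):
        self.text_modifier = text_modifier
        self.phoneme_generator = phoneme_generator
        self.expression_identifier = expression_identifier
        self.talking_head = talking_head

    def on_word(self, word):
        # Both the modifier and the expression identifier see the text
        # as it arrives in real time.
        modified = self.text_modifier.add_word(word)
        expressions = self.expression_identifier.add_word(word)
        if modified is not None:  # a buffer of modified text is ready
            phonemes = self.phoneme_generator.generate(modified)
            self.talking_head.render(phonemes, expressions)
```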
[0013] The computing device on which the lip synchronization system
is implemented may include a central processing unit, memory, input
devices (e.g., keyboard and pointing devices), output devices
(e.g., display devices), and storage devices (e.g., disk drives).
The memory and storage devices are computer-readable media that may
contain instructions that implement the lip synchronization system.
In addition, data structures and message structures may be stored
or transmitted via a data transmission medium, such as a signal on
a communications link. Various communications links may be used,
such as the Internet, a local area network, a wide area network, or
a point-to-point dial-up connection.
[0014] The lip synchronization system may be implemented in various
operating environments including personal computers, server
computers, hand-held or laptop devices, multiprocessor systems,
microprocessor-based systems, programmable consumer electronics,
network PCs, minicomputers, mainframe computers, distributed
computing environments that include any of the above systems or
devices, and the like.
[0015] The lip synchronization system may be described in the
general context of computer-executable instructions, such as
program modules, executed by one or more computers or other
devices. Generally, program modules include routines, programs,
objects, components, data structures, and so on that perform
particular tasks or implement particular abstract data types.
Typically, the functionality of the program modules may be combined
or distributed as desired in various embodiments.
[0016] FIG. 2 is a flow diagram that illustrates the processing of
the text modifier component of the lip synchronization system in
one embodiment. The component may be passed the next word of the
text that is received in real time. The component buffers the words, applies the modification rules to the buffered words, and then provides the modified text of the buffer to the phoneme generator component. Example rules may include removing certain verbs from sentences, adding "umm" after each phrase, and so on. In block 201, the component adds the passed word to the buffer. In
decision block 202, if the rules can be applied to the buffer of
words, then the component continues at block 203, else the
component completes. The rules can be applied to the buffer of
words, for example, if a certain number of words are buffered, a
sentence is buffered, a paragraph is buffered, and so on. In blocks
203-206, the component loops applying rules to the words in the
buffer. In block 203, the component selects the next rule. In
decision block 204, if all the rules have already been selected,
then the component continues at block 207, else the component
continues at block 205. In decision block 205, if the selected rule
applies to the buffer, then the component continues at block 206,
else the component loops to block 203 to select the next rule. In
block 206, the component applies the selected rule to the words in
the buffer and then loops to block 203 to select the next rule. In
block 207, the component sends the modified text of the buffer to
the phoneme generator component and then completes.
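A minimal sketch of this buffering loop, assuming a sentence-boundary flush condition and a rule interface of (applies, apply) pairs; both are illustrative choices rather than details of the described embodiment:

```python
class TextModifier:
    """Buffer incoming words and apply modification rules once a full
    sentence is buffered, mirroring blocks 201-207 of FIG. 2."""

    def __init__(self, rules, send_to_phoneme_generator):
        self.rules = rules  # list of (applies(words), apply(words)) pairs
        self.buffer = []
        self.send = send_to_phoneme_generator

    def add_word(self, word):
        self.buffer.append(word)                # block 201: buffer the word
        if not word.endswith((".", "!", "?")):  # block 202: flush condition
            return None
        words = self.buffer
        for applies, apply_rule in self.rules:  # blocks 203-206: apply rules
            if applies(words):
                words = apply_rule(words)
        self.buffer = []
        modified = " ".join(words)
        self.send(modified)                     # block 207: pass text along
        return modified

# Example rule from the text above: add "umm" after each sentence.
add_umm = (lambda words: True, lambda words: words + ["umm"])
modifier = TextModifier([add_umm], send_to_phoneme_generator=print)
for w in "I will be right back.".split():
    modifier.add_word(w)  # prints "I will be right back. umm"
```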
[0017] FIG. 3 is a flow diagram that illustrates the processing of
the phoneme generator component of the lip synchronization system
in one embodiment. The component may be passed a buffer of modified
text and generates the phonemes for that text. In block 301, the
component selects the next word of the passed buffer. In decision
block 302, if all the words have already been selected, then the
component completes, else the component continues at block 303. In
block 303, the component retrieves the phonemes for the selected
word (or a selected phrase). The component may retrieve the
phonemes from the phoneme store. The phoneme store may contain
phonemes that are appropriate for the particular idiom of the
character. For example, different sets of phonemes may be used to
affect the accents of characters from different countries, such as
Australia, Canada, the United Kingdom, and the United States. The
phoneme store may also contain phonemes that are particular to a
certain character. In block 304, the component may modify the
phonemes to produce certain effects. For example, the component may
replace certain phonemes with other phonemes to achieve regional
effects. In block 305, the component sends the phonemes to the
talking head component and then loops to block 301 to select the
next word of the buffer.
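A sketch of this loop follows; the phoneme inventory and the regional substitution table are illustrative assumptions, not contents of any actual phoneme store:

```python
# Hypothetical word-to-phoneme mapping; a real phoneme store would be
# far larger and could carry per-idiom or per-character variants.
PHONEME_STORE = {
    "right": ["r", "ay", "t"],
    "back": ["b", "ae", "k"],
}

# Hypothetical regional substitution (block 304): swap one phoneme for
# another to suggest an accent.
REGIONAL_SWAPS = {"ay": "oy"}

def generate_phonemes(words, send_to_talking_head):
    for word in words:                                           # blocks 301-302
        phonemes = PHONEME_STORE.get(word.lower(), [])           # block 303
        phonemes = [REGIONAL_SWAPS.get(p, p) for p in phonemes]  # block 304
        send_to_talking_head(word, phonemes)                     # block 305

generate_phonemes(["right", "back"], print)
# right ['r', 'oy', 't']
# back ['b', 'ae', 'k']
```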
[0018] FIG. 4 is a flow diagram that illustrates the processing of
the expression identifier component of the lip synchronization
system in one embodiment. The component is passed a word of the
text that is received in real time and identifies changes in
expressions indicated by the text. For example, the component may
identify that text received rapidly indicates that the speaker is
excited or that text received slowly indicates the speaker is
contemplative. In block 401, the component adds the passed word to
the buffer. In decision block 402, if it is time to process the
words of the buffer, then the component continues at block 403,
else the component completes. In blocks 403-407, the component
loops selecting each word and identifying whether the current
expression has changed. In block 403, the component selects the
next word of the buffer. In decision block 404, if all the words of
the buffer have already been selected, then the component
completes, else the component continues at block 405. In block 405,
the component identifies an expression based on the selected word.
For example, the component may compare previous words and following
words within the buffer to determine the current expression. In
decision block 406, if the current expression has changed from the
previous expression, then the component continues at block 407,
else the component loops to block 403 to select the next word of
the buffer. In block 407, the component tags the selected word with
the new expression and then loops to block 403 to select the next word.
Upon completion, the component provides the buffer with the tagged
words to the talking head component.
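A sketch of the tagging loop, assuming a simple keyword-to-expression table (the closed-caption markers come from the example above; the table itself is an illustrative assumption):

```python
# Hypothetical cues mapping marker words to expressions.
EXPRESSION_CUES = {"[laughter]": "laughing", "[crying]": "crying"}

def tag_expressions(words, current="neutral"):
    """Tag a word with an expression only when the expression changes,
    mirroring blocks 403-407 of FIG. 4."""
    tagged = []
    for word in words:                                           # blocks 403-404
        expression = EXPRESSION_CUES.get(word.lower(), current)  # block 405
        if expression != current:                                # block 406
            tagged.append((word, expression))                    # block 407
            current = expression
        else:
            tagged.append((word, None))  # expression unchanged; no tag
    return tagged

print(tag_expressions(["hello", "[laughter]", "very", "funny"]))
# [('hello', None), ('[laughter]', 'laughing'), ('very', None), ('funny', None)]
```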
[0019] One skilled in the art will appreciate that although
specific embodiments of the lip synchronization system have been
described herein for purposes of illustration, various
modifications may be made without deviating from the spirit and
scope of the invention. For example, the lip synchronization system
may be augmented to move the character's hands to effect the output
of the modified text in a sign language, such as American Sign
Language. Accordingly, the invention is not limited except by the
appended claims.
* * * * *