U.S. patent application number 15/908603 was filed with the patent office on 2018-02-28 and published on 2018-11-22 for voice effects based on facial expressions. This patent application is currently assigned to Apple Inc. The applicant listed for this patent is Apple Inc. The invention is credited to Carlos M. Avendano, Aram M. Lindahl, and Sean A. Ramprashad.
United States Patent Application 20180336716
Kind Code: A1
Ramprashad; Sean A.; et al.
Published: November 22, 2018
Application Number: 15/908603
Family ID: 64269597
Filed: February 28, 2018
VOICE EFFECTS BASED ON FACIAL EXPRESSIONS
Abstract
Embodiments of the present disclosure can provide systems,
methods, and computer-readable medium for adjusting audio and/or
video information of a video clip based at least in part on facial
feature and/or voice feature characteristics extracted from
hardware components. For example, in response to detecting a
request to generate an avatar video clip of a virtual avatar, a
video signal associated with a face in a field of view of a camera
and an audio signal may be captured. Voice feature characteristics
and facial feature characteristics may be extracted from the audio
signal and the video signal, respectively. In some examples, in
response to detecting a request to preview the avatar video clip,
an adjusted audio signal may be generated based at least in part on
the facial feature characteristics and the voice feature
characteristics, and a preview of the video clip of the virtual
avatar using the adjusted audio signal may be displayed.
Inventors: Ramprashad; Sean A. (Los Altos, CA); Avendano; Carlos M. (Campbell, CA); Lindahl; Aram M. (Menlo Park, CA)
Applicant: Apple Inc., Cupertino, CA, US
Assignee: Apple Inc., Cupertino, CA
Family ID: 64269597
Appl. No.: 15/908603
Filed: February 28, 2018
Related U.S. Patent Documents

Application Number   Filing Date
62507177             May 16, 2017
62556412             Sep 9, 2017
62557121             Sep 11, 2017
Current U.S. Class: 1/1

Current CPC Class: H04M 2250/52 (20130101); H04N 5/23219 (20130101); G06F 3/04842 (20130101); H04M 1/72555 (20130101); H04L 51/38 (20130101); G06F 3/012 (20130101); G06F 3/0484 (20130101); H04L 51/04 (20130101); H04M 1/72552 (20130101); G06K 9/00315 (20130101); G06F 3/04886 (20130101); H04N 5/23293 (20130101); G06F 3/0304 (20130101); G06K 9/00671 (20130101); H04L 51/10 (20130101)

International Class: G06T 13/80 (20060101); G06F 3/16 (20060101); G10L 15/02 (20060101); G06K 9/00 (20060101)
Claims
1. A method, comprising: at an electronic device having at least a
camera and a microphone: displaying a virtual avatar generation
interface; displaying first preview content of a virtual avatar in
the virtual avatar generation interface, the first preview content
of the virtual avatar corresponding to realtime preview video
frames of a user headshot in a field of view of the camera and
associated headshot changes in an appearance; while displaying the
first preview content of the virtual avatar, detecting an input in
the virtual avatar generation interface; in response to detecting
the input in the virtual avatar generation interface: capturing,
via the camera, a video signal associated with the user headshot
during a recording session; capturing, via the microphone, a user
audio signal during the recording session; extracting audio feature
characteristics from the captured user audio signal; and extracting
facial feature characteristics associated with the face from the
captured video signal; and in response to detecting expiration of
the recording session: generating an adjusted audio signal from the
captured audio signal based at least in part on the facial feature
characteristics and the audio feature characteristics; generating
second preview content of the virtual avatar in the virtual avatar
generation interface according to the facial feature
characteristics and the adjusted audio signal; and presenting the
second preview content in the virtual avatar generation
interface.
2. The method of claim 1, further comprising storing facial feature
metadata associated with the facial feature characteristics
extracted from the video signal and storing audio metadata
associated with the audio feature characteristics extracted from
the audio signal.
3. The method of claim 2, further comprising generating adjusted
facial feature metadata from the facial feature metadata based at
least in part on the facial feature characteristics and the audio
feature characteristics.
4. The method of claim 3, wherein the second preview of the virtual
avatar is displayed further according to the adjusted facial
metadata.
5. An electronic device, comprising: a camera; a microphone; and
one or more processors in communication with the camera and the
microphone, the one or more processors configured to: while
displaying a first preview of a virtual avatar, detecting an input
in a virtual avatar generation interface; in response to detecting
the input in the virtual avatar generation interface, initiating a
capture session including: capturing, via the camera, a video
signal associated with a face in a field of view of the camera;
capturing, via the microphone, an audio signal associated with the
captured video signal; extracting audio feature characteristics
from the captured audio signal; and extracting facial feature
characteristics associated with the face from the captured video
signal; and in response to detecting expiration of the capture
session: generating an adjusted audio signal based at least in part
on the audio feature characteristics and the facial feature
characteristics; and displaying a second preview of the virtual
avatar in the virtual avatar generation interface according to the
facial feature characteristics and the adjusted audio signal.
6. The electronic device of claim 5, wherein the audio signal is
further adjusted based at least in part on a type of the virtual
avatar.
7. The electronic device of claim 6, wherein the type of the
virtual avatar is received based at least in part on an avatar type
selection affordance presented in the virtual avatar generation
interface.
8. The electronic device of claim 6, wherein the type of the
virtual avatar includes an animal type, and wherein the adjusted
audio signal is generated based at least in part on a predetermined
sound associated with the animal type.
9. The electronic device of claim 5, wherein the one or more
processors are further configured to determine whether a portion of
the audio signal corresponds to the face in the field of view.
10. The electronic device of claim 9, wherein the one or more
processors are further configured to, in accordance with a
determination that the portion of the audio signal corresponds to
the face, store the portion of the audio signal for use in
generating the adjusted audio signal.
11. The electronic device of claim 9, wherein the one or more
processors are further configured to, in accordance with a
determination that the portion of the audio signal does not
correspond to the face, discard at least the portion of the audio
signal.
12. The electronic device of claim 5, wherein the audio feature
characteristics comprise features of a voice associated with the
face in the field of view.
13. The electronic device of claim 5, wherein the one or more
processors are further configured to store facial feature metadata
associated with the facial feature characteristics extracted from
the video signal.
14. The electronic device of claim 13, wherein the one or more
processors are further configured to generate adjusted facial
metadata based at least in part on the facial feature
characteristics and the audio feature characteristics.
15. The electronic device of claim 14, wherein the second preview
of the virtual avatar is generated according to the adjusted facial
metadata and the adjusted audio signal.
16. A computer-readable storage medium storing computer-executable
instructions that, when executed by one or more processors,
configure the one or more processors to perform operations
comprising: in response to detecting a request to generate an
avatar video clip of a virtual avatar: capturing, via a camera of
an electronic device, a video signal associated with a face in a
field of view of the camera; capturing, via a microphone of the
electronic device, an audio signal; extracting voice feature
characteristics from the captured audio signal; and extracting
facial feature characteristics associated with the face from the
captured video signal; and in response to detecting a request to
preview the avatar video clip: generating an adjusted audio signal
based at least in part on the facial feature characteristics and
the voice feature characteristics; and displaying a preview of the
video clip of the virtual avatar using the adjusted audio
signal.
17. The computer-readable storage medium of claim 16, wherein the
audio signal is adjusted based at least in part on a facial
expression identified in the facial feature characteristics
associated with the face.
18. The computer-readable storage medium of claim 16, wherein the
adjusted audio signal is further adjusted by inserting one or more
pre-stored audio samples.
19. The computer-readable storage medium of claim 16, wherein the
audio signal is adjusted based at least in part on a level, pitch,
duration, variable playback speed, speech spectral-formant
positions, speech spectral-formant levels, instantaneous playback
speed, or change in a voice associated with the face.
20. The computer-readable storage medium of claim 16, wherein the
one or more processors are further configured to perform the
operations comprising transmitting the video clip of the virtual
avatar to another electronic device.
Description
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional
Patent Application No. 62/507,177, entitled "Emoji Recording and
Sending," filed May 16, 2017, U.S. Provisional Patent Application
No. 62/556,412, entitled "Emoji Recording and Sending," filed Sep.
9, 2017, and U.S. Provisional Patent Application No. 62/557,121,
entitled "Emoji Recording and Sending," filed Sep. 11, 2017, the
entire disclosures of each being herein incorporated by reference
for all purposes.
BACKGROUND
[0002] Multimedia content, such as emojis, can be sent as part of
messaging communications. Emojis can represent a variety of
predefined people, objects, actions, and/or other things. Some
messaging applications allow users to select from a predefined
library of emojis, which can be sent as part of a message that can
contain other content (e.g., other multimedia and/or textual
content). Animojis are one type of this other multimedia content,
where a user can select an avatar (e.g., a puppet) to represent
themselves. The animoji can move and talk as if it were a video of
the user. Animojis enable users to create personalized versions of
emojis in a fun and creative way.
SUMMARY
[0003] Embodiments of the present disclosure can provide systems,
methods, and computer-readable medium for implementing avatar video
clip revision and playback techniques. In some examples, a
computing device can present a user interface (UI) for tracking a
user's face and presenting a virtual avatar representation (e.g., a
puppet or video character version of the user's face). Upon
identifying a request to record, the computing device can capture
audio and video information, extract and detect context as well as
facial feature characteristics and voice feature characteristics,
revise the audio and/or video information based at least in part on
the extracted/identified features, and present a video clip of the
avatar using the revised audio and/or video information.
[0004] In some embodiments, a computer-implemented method for
implementing various audio and video effects techniques may be
provided. The method may include displaying a virtual avatar
generation interface. The method may also include displaying first
preview content of a virtual avatar in the virtual avatar
generation interface, the first preview content of the virtual
avatar corresponding to realtime preview video frames of a user
headshot in a field of view of the camera and associated headshot
changes in an appearance. The method may also include detecting an
input in the virtual avatar generation interface while displaying
the first preview content of the virtual avatar. In some examples,
in response to detecting the input in the virtual avatar generation
interface, the method may also include: capturing, via the camera,
a video signal associated with the user headshot during a recording
session, capturing, via the microphone, a user audio signal during
the recording session, extracting audio feature characteristics
from the captured user audio signal, and extracting facial feature
characteristics associated with the face from the captured video
signal. Additionally, in response to detecting expiration of the
recording session, the method may also include: generating an
adjusted audio signal from the captured audio signal based at least
in part on the facial feature characteristics and the audio feature
characteristics, generating second preview content of the virtual
avatar in the virtual avatar generation interface according to the
facial feature characteristics and the adjusted audio signal, and
presenting the second preview content in the virtual avatar
generation interface.
[0005] In some embodiments, the method may also include storing
facial feature metadata associated with the facial feature
characteristics extracted from the video signal and generating
adjusted facial feature metadata from the facial feature metadata
based at least in part on the facial feature characteristics and
the audio feature characteristics. Additionally, the second preview
of the virtual avatar may be displayed further according to the
adjusted facial metadata. In some examples, the first preview of
the virtual avatar may be displayed according to preview facial
feature characteristics identified according to the changes in the
appearance of the face during a preview session.
[0006] In some embodiments, an electronic device for implementing
various audio and video effects techniques may be provided. The
device may include a camera, a microphone, a library of
pre-recorded/pre-determined audio, and one or more processors in
communication with the camera and the microphone. In some examples,
the processors may be configured to execute computer-executable
instructions to perform operations. The operations may include
detecting an input in a virtual avatar generation interface while
displaying a first preview of a virtual avatar. The operations may
also include initiating a capture session including in response to
detecting the input in the virtual avatar generation interface. The
capture session may include: capturing, via the camera, a video
signal associated with a face in a field of view of the camera,
capturing, via the microphone, an audio signal associated with the
captured video signal, extracting audio feature characteristics
from the captured audio signal, and extracting facial feature
characteristics associated with the face from the captured video
signal. In some examples, the operations may also include
generating an adjusted audio signal based at least in part on the
audio feature characteristics and the facial feature
characteristics and presenting the second preview content in the
virtual avatar generation interface, at least in response to
detecting expiration of the capture session.
[0007] In some instances, the audio signal may be further adjusted
based at least in part on a type of the virtual avatar.
Additionally, the type of the virtual avatar may be received based
at least in part on an avatar type selection affordance presented
in the virtual avatar generation interface. In some instances, the
type of the virtual avatar may include an animal type, and the
adjusted audio signal may be generated based at least in part on a
predetermined sound associated with the animal type. The use and
timing of predetermined sounds may be based on audio features from
the captured audio and/or facial features from the captured video.
This predetermined sound may also be itself modified based on audio
features from the captured audio and facial features from the
captured video. In some examples, the one or more processors may be
further configured to determine whether a portion of the audio
signal corresponds to the face in the field of view. Additionally,
in accordance with a determination that the portion of the audio
signal corresponds to the face, the portion of the audio signal may
be stored for use in generating the adjusted audio signal and/or in
accordance with a determination that the portion of the audio
signal does not correspond to the face, at least the portion of the
audio signal may be discarded and not considered for modification
and/or playback. Additionally, the audio feature characteristics
may comprise features of a voice associated with the face in the
field of view. In some examples, the one or more processors may be
further configured to store facial feature metadata associated with
the facial feature characteristics extracted from the video signal.
In some examples, the one or more processors may be further
configured to store audio feature metadata associated with the
audio feature characteristics extracted from the audio signal.
Further, the one or more processors may be further configured to
generate adjusted facial metadata based at least in part on the
facial feature characteristics and the audio feature
characteristics, and the second preview of the virtual avatar may
be generated according to the adjusted facial metadata and the
adjusted audio signal.
[0008] In some embodiments, a computer-readable medium may be
provided. The computer-readable medium may include
computer-executable instructions that, when executed by one or more
processors, cause the one or more processors to perform operations.
The operations may include performing the following actions in
response to detecting a request to generate an avatar video clip of
a virtual avatar: capturing, via a camera of an electronic device,
a video signal associated with a face in a field of view of the
camera, capturing, via a microphone of the electronic device, an
audio signal, extracting voice feature characteristics from the
captured audio signal, and extracting facial feature
characteristics associated with the face from the captured video
signal. The operations may also include performing the following
actions in response to detecting a request to preview the avatar
video clip: generating an adjusted audio signal based at least in
part on the facial feature characteristics and the voice feature
characteristics, and displaying a preview of the video clip of the
virtual avatar using the adjusted audio signal.
[0009] In some embodiments, the audio signal may be adjusted based
at least in part on a facial expression identified in the facial
feature characteristics associated with the face. In some
instances, the audio signal may be adjusted based at least in part
on a level, pitch, duration, formant, or change in a voice
characteristic associated with the face. Further, in some
embodiments, the one or more processors may be further configured
to perform the operations comprising transmitting the video clip of
the virtual avatar to another electronic device.
[0010] The following detailed description together with the
accompanying drawings will provide a better understanding of the
nature and advantages of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] FIG. 1 is a simplified block diagram illustrating example
flow for providing audio and/or video effects techniques as
described herein, according to at least one example.
[0012] FIG. 2 is another simplified block diagram illustrating
example flow for providing audio and/or video effects techniques as
described herein, according to at least one example.
[0013] FIG. 3 is another simplified block diagram illustrating
hardware and software components for providing audio and/or video
effects techniques as described herein, according to at least one
example.
[0014] FIG. 4 is a flow diagram to illustrate providing audio
and/or video effects techniques as described herein, according to
at least one example.
[0015] FIG. 5 is another flow diagram to illustrate providing audio
and/or video effects techniques as described herein, according to
at least one example.
[0016] FIG. 6 is a simplified block diagram illustrating a user
interface for providing audio and/or video effects techniques as
described herein, according to at least one example.
[0017] FIG. 7 is another flow diagram to illustrate providing audio
and/or video effects techniques as described herein, according to
at least one example.
[0018] FIG. 8 is another flow diagram to illustrate providing audio
and/or video effects techniques as described herein, according to
at least one example.
[0019] FIG. 9 is a simplified block diagram illustrating a
computer architecture for providing audio and/or video effects
techniques as described herein, according to at least one
example.
DETAILED DESCRIPTION
[0020] Certain embodiments of the present disclosure relate to
devices, computer-readable medium, and methods for implementing
various techniques for providing voice effects (e.g., revised
audio) based at least in part on facial expressions. Additionally,
in some cases, the various techniques may also provide video
effects based at least in part on audio characteristics of a
recording. Even further, the various techniques may also provide
voice effects and video effects (e.g., together) based at least in
part on one or both of facial expressions and audio characteristics
of a recording. In some examples, the voice effects and/or video
effects may be presented in a user interface (UI) configured to
display a cartoon representation of a user (e.g., an avatar or
digital puppet). Such an avatar that represents a user may be
considered an animoji, as it may look like an emoji character
familiar to most smart phone users; however, it can be animated to
mimic actual motions of the user.
[0021] For example, a user of a computing device may be presented
with a UI for generating an animoji video (e.g., a video clip). The
video clip can be limited to a predetermined amount of time (e.g.,
10 seconds, 30 seconds, or the like), or the video clip can be
unlimited. In the UI, a preview area may present the user with a
real-time representation of their face, using an avatar character.
Various avatar characters may be provided, and a user may even be
able to generate or import their own avatars. The preview area may
be configured to provide an initial preview of the avatar and a
preview of the recorded video clip. Additionally, the recorded
video clip may be previewed in its original form (e.g., without any
video or audio effects) or it may be previewed with audio and/or
video effects. In some cases, the user may select an avatar after
the initial video clip has been recorded. The video clip preview
may then change from one avatar to another, with the same or
different video effects applied to it, as appropriate. For example,
if the raw preview (e.g., original form, without effects) is being
viewed, and the user switches avatar characters, the UI may be
updated to display a rendering of the same video clip but with the
newly selected avatar. In other words, the facial features and
audio (e.g., the user's voice) that were captured during the
recording can be presented from any of the avatars (e.g., without
any effects). In the preview, it will appear as if the avatar
character is moving the same way the user moved during the
recording, and speaking what the user said during the
recording.
[0022] By way of example, a user may select a first avatar (e.g., a
unicorn head) via the UI, or a default avatar can be initially
provided. The UI will present the avatar (in this example, the head
of a cartoon unicorn if selected by the user or any other available
puppet by default) in the preview area, and the device will begin
capturing audio and/or video information (e.g., using one or more
microphones and/or one or more cameras). In some cases, only video
information is needed for the initial preview screen. The video
information can be analyzed, and facial features can be extracted.
These extracted facial features can then be mapped to the unicorn
face in real-time, such that the initial preview of the unicorn
head appears to mirror the user's. In some cases, the term
real-time is used to indicate that the results of the extraction,
mapping, rendering, and presentation are performed in response to
each motion of the user and can be presented substantially
immediately. To the user, it will appear as if they are looking in
the mirror, except the image of their face is replaced with an
avatar.
[0023] While the user's face is in the line of sight (e.g., the
view) of a camera of the device, the UI will continue to present
the initial preview. Upon selection of a record affordance (e.g., a
virtual button) on the UI, the device may begin to capture video
that has an audio component. In some examples, this includes a
camera capturing frames and a microphone capturing audio
information. A special camera may be utilized that is capable of
capturing 3-dimensional (3D) information as well. Additionally, in
some examples, any camera may be utilized that is capable of
capturing video. The video may be stored in its original form
and/or metadata associated with the video may be stored. As such,
capturing the video and/or audio information may be different from
storing the information. For example, capturing the information may
include sensing the information and at least caching it such that
it is available for processing. The processed data can also be cached
until it is determined whether to store or simply utilize the data.
For example, during the initial preview, while the user's face is
being presented as a puppet in real-time, the video data (e.g.,
metadata associated with the data) may be cached, while it is
mapped to the puppet and presented. However, this data may not be
stored permanently at all, such that the initial preview is not
reusable or recoverable.
[0024] Alternatively, in some examples, once the user selects the
record affordance of the UI, the video data and the audio data may
be stored more permanently. In this way, the audio and video (A/V)
data may be analyzed, processed, etc., in order to provide the audio
and video effects described herein. In some examples, the video
data may be processed to extract facial features (e.g., facial
feature characteristics) and those facial features may be stored as
metadata for the animoji video clip. The set of metadata may be
stored with an identifier (ID) that indicates the time, date, and
user associated with the video clip. Additionally, the audio data
may be stored with the same or other ID. Once stored (or, in some
examples, prior to storage), the system (e.g., processors of the
device) may extract audio feature characteristics from the audio
data and facial feature characteristics from the video file. This
information can be utilized to identify context, key words, intent,
and/or emotions of the user, and video and audio effects can be
introduced into audio and video data prior to rendering the puppet.
In some examples, the audio signal can be adjusted to include
different words, sounds, tones, pitches, timing, etc., based at
least in part on the extracted features. Additionally, in some
examples, the video data (e.g., the metadata) can also be adjusted.
In some examples, audio features are extracted in real-time during
the preview itself. These audio features may be avatar specific,
generated only if the associated avatar is being previewed, or avatar
agnostic, generated for all avatars.
The audio signal can also be adjusted in part based on these
real-time audio feature extractions, and with the pre-stored
extracted video features which are created during or after the
recording process, but before previewing.
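
By way of illustration only, the capture-extract-adjust-render flow of this paragraph might be sketched as follows. All structure and function names here are hypothetical; the disclosure does not prescribe an implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    """Captured A/V data plus feature metadata derived from it (hypothetical)."""
    audio: list[float]        # raw audio samples
    face_frames: list[dict]   # per-frame facial feature values
    voice_features: dict = field(default_factory=dict)
    facial_features: dict = field(default_factory=dict)

def preview(clip: Clip, avatar: str) -> tuple[list[float], list[dict]]:
    # 1. Extract characteristics from both signals (stand-in measures).
    clip.voice_features["mean_level"] = (
        sum(abs(s) for s in clip.audio) / max(len(clip.audio), 1))
    clip.facial_features["mouth_open"] = any(
        f.get("jaw_open", 0.0) > 0.5 for f in clip.face_frames)

    # 2. Adjust the audio using both feature sets; a real system would apply
    #    the richer effects described in the text (words, pitch, timing).
    gain = 1.5 if clip.facial_features["mouth_open"] else 1.0
    adjusted_audio = [s * gain for s in clip.audio]

    # 3. The avatar engine would map the facial metadata onto the selected
    #    puppet; here each frame is simply tagged with the avatar type.
    rendered_frames = [{**f, "avatar": avatar} for f in clip.face_frames]
    return adjusted_audio, rendered_frames
```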
[0025] Once the video and audio data have been adjusted based at
least in part on the extracted characteristics, a second preview of
the puppet can be rendered. This rendering may be performed for
each possible puppet, such that as the user scrolls through and selects
different puppets, the adjusted data is already rendered. Or the
rendering can be performed after selection of each puppet. In any
event, once the user selects a puppet, the second preview can be
presented. The second preview will replay the video clip that was
recorded by the user, but with the adjusted audio and/or video.
Using the example from above, if the user recorded themselves with
an angry tone (e.g., with a gruff voice and a furrowed brow), the
context or intent of anger may be detected, and the audio file may
be adjusted to include a growling sound. Thus, the second preview
would look like a unicorn saying the words that the user said;
however, the voice of the user may be adjusted to sound like a
growl, or to make the tone more baritone (e.g., lower). The user
could then save the second preview or select it for transmission to
another user (e.g., through a messaging application or the like).
In some examples, the animoji video clips described above and below can
be shared as .mov files. However, in other examples, the described
techniques can be used in real-time (e.g., with video messaging or
the like).
[0026] FIG. 1 is a simplified block diagram illustrating example
flow 100 for providing audio and/or video effects based at least in
part on audio and/or video features detected in a user's recording.
In example flow 100, there are two separate sessions: recording
session 102 and playback session 104. In recording session 102,
device 106 may capture video having an audio component of user 108
at block 110. In some examples, the video and audio may be captured
(e.g., collected) separately, using two different devices (e.g., a
microphone and a camera). The capturing of video and audio may be
triggered based at least in part on selection of a record
affordance by user 108. In some examples, user 108 may say the word
"hello" at block 112. Additionally, at block 112, device 106 may
continue to capture the video and/or audio components of the user's
actions. At block 114, device 106 can continue capturing the video
and audio components, and in this example, user 108 may say the
word "bark." At block 114, device 106 may also extract spoken words
from the audio information. However, in other examples, the spoken
word extraction (or any audio feature extraction) may actually take
place after recording session 102 is complete. In other examples,
the spoken word extraction (or any audio feature extraction) may
actually take place during the preview block 124 in real-time. It
is also possible for the extraction (e.g., analysis of the audio)
to be done in real-time while recording session 102 is still in
process. In any case, the avatar process being executed by
device 106 may identify through the extraction that the user said
the word "bark" and may employ some logic to determine what audio
effects to implement.
[0027] By way of example, recording session 102 may end when user
108 selects the record affordance again (e.g., indicating a desire
to end the recording), selects an end recording affordance (e.g.,
the record affordance may act as an end recording affordance while
recording), or based at least in part on expiration of a time
period (e.g., 10 seconds, 30 seconds, or the like). In some cases,
this time period may be automatically predetermined, while in
others, it may be user selected (e.g., selected from a list of
options or entered in free form through a text entry interface).
Once the recording has completed, user 108 may select a preview
affordance, indicating that user 108 wishes to watch a preview of
the recording. One option could be to play the original recording
without any visual or audio effects. However, another option could
be to play a revised version of the video clip. Based at least in
part on detection of the spoken word "bark," the avatar process may
have revised the audio and/or video of the video clip.
[0028] At block 116, device 106 may present avatar (also called a
puppet and/or animoji) 118 on a screen. Device 106 may also be
configured with speaker 120 that can play audio associated with the
video clip. In this example, block 116 corresponds to the same
point in time as block 110, where user 108 may have had his mouth
open, but was not yet speaking. As such, avatar 118 may be
presented with his mouth open; however, no audio is presented from
speaker 120 yet. At block 122, corresponding to block 112 where
user 108 said "hello," the avatar process can present avatar 118
with an avatar-specific voice. In other words, a predefined dog
voice may be used to say the word "hello" at block 122. The
dog-voice word "hello" can be presented by speaker 120. As will be
described in further detail below, there are a variety of different
animal (and other character) avatars available for selection by
user 108. In some examples, each avatar may be associated with a
particular pre-defined voice that best fits that avatar. For
example, a dog may have a dog voice, a cat may have a cat voice, a
pig may have a pig voice, and a robot may have a robotic voice.
These avatar-specific voices may be pre-recorded or may be
associated with particular frequency or audio transformations that
can happen by executing mathematical operations on the original
sound, such that any user's voice can be transformed to sound like
the dog voice. However, each user's dog voice may sound different
based at least in part on the particular audio transformation
performed.
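
A crude example of such a transformation is resampling, which shifts pitch at the cost of duration. The sketch below is illustrative only; the per-avatar ratios are invented, and a production system would more plausibly use a formant-preserving shifter.

```python
import numpy as np

# Hypothetical pitch ratios per avatar; > 1.0 raises pitch, < 1.0 lowers it.
AVATAR_PITCH = {"dog": 0.8, "cat": 1.3, "pig": 1.1, "robot": 1.0}

def avatar_voice(samples: np.ndarray, avatar: str) -> np.ndarray:
    """Pitch-shift a mono signal by naive resampling.

    Plain resampling also changes duration; PSOLA or a phase vocoder
    would preserve timing, but this shows the basic idea.
    """
    ratio = AVATAR_PITCH.get(avatar, 1.0)
    n_out = max(int(len(samples) / ratio), 1)
    # Read the signal faster (ratio > 1) or slower (ratio < 1).
    grid = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(grid, np.arange(len(samples)), samples)
```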
[0029] At block 124, the avatar process may replace the spoken word
(e.g., "bark") with an avatar-specific word. In this example, the
sound of a dog bark (e.g., a recorded or simulated dog bark) may be
inserted into the audio data (e.g., in place of the word "bark")
such that when it is played back during presentation of the video
clip, a "woof" is presented by speaker 120. In some examples,
different avatar-specific words will be presented at 124 based at
least in part on different avatar selections, and in other
examples, the same avatar-specific word may be presented regardless
of the avatar selections. For example, if user 108 said "bark," a
"woof" could be presented when the dog avatar is selected. However,
in this same case, if user 108 later selected the cat avatar for
the same flow, there are a couple of options for revising the
audio. In one example, the process could convert the "bark" into a
"woof" even though it wouldn't be appropriate for a cat to "woof."
In a different example, the process could convert "bark" into a
recorded or simulated "meow," based at least in part on the
selection of the cat avatar. And, in yet another example, the
process could ignore the "bark" for avatars other than the dog
avatar. As such, there may be a second level of audio feature
analysis performed even after the extraction at 114. Video and
audio features may also influence processing of the avatar-specific
utterances. For example, the level, pitch, and intonation with
which a user says "bark" may be detected as part of the audio
feature extraction, and this may direct the system to select a
specific "woof" sample or transform such a sample before and/or
during the preview process.
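
The sample selection and transformation described here might, in its simplest form, splice a pre-recorded sample over the spotted keyword and match its level to the replaced speech. A sketch under assumed inputs (word timings from a recognizer, mono numpy signals); none of this is from the disclosure itself:

```python
import numpy as np

def replace_keyword(audio: np.ndarray, rate: int, start_s: float,
                    end_s: float, sample: np.ndarray) -> np.ndarray:
    """Splice a pre-recorded sample (e.g., a "woof") over a spotted keyword,
    scaling the sample to match the level of the speech it replaces."""
    i0, i1 = int(start_s * rate), int(end_s * rate)
    slot = audio[i0:i1]                        # assumes end_s > start_s
    speech_rms = np.sqrt(np.mean(slot ** 2)) or 1e-9
    sample_rms = np.sqrt(np.mean(sample ** 2)) or 1e-9
    scaled = sample * (speech_rms / sample_rms)
    fitted = np.zeros_like(slot)               # truncate or zero-pad to fit
    n = min(len(slot), len(scaled))
    fitted[:n] = scaled[:n]
    out = audio.copy()
    out[i0:i1] = fitted
    return out
```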
[0030] FIG. 2 is another simplified block diagram illustrating
example flow 200 for providing audio and/or video effects based at
least in part on audio and/or video features detected in a user's
recording. In example flow 200, much like in example flow 100 of
FIG. 1, there are two separate sessions: recording session 202 and
playback session 204. In recording session 202, device 206 may
capture video having an audio component of user 208 at block 210.
The capturing of video and audio may be triggered based at least in
part on selection of a record affordance by user 208. In some
examples, user 208 may say the word "hello" at block 212.
Additionally, at block 212, device 206 may continue to capture the
video and/or audio components of the user's actions. At block 214,
device 206 can continue capturing the video and audio components,
and in this example, user 208 may hold his mouth open, but not say
anything. At block 214, device 206 may also extract facial
expressions from the video. However, in other examples, the facial
feature extraction (or any video feature extraction) may actually
take place after recording session 202 is complete. Still, it is
possible for the extraction (e.g., analysis of the video) to be
done in real-time while recording session 202 is still in process.
In either case, the avatar process being executed by device 206 may
identify through the extraction that the user opened his mouth
briefly (e.g., without saying anything) and may employ some logic
to determine what audio and/or video effects to implement. In some
examples, the determination that the user held their mouth open
without saying anything may require extraction and analysis of both
audio and video. For example, extraction of the facial feature
characteristics (e.g., open mouth) may not be enough, and the
process may also need to detect that user 208 did not say anything
during the same time period of the recording. Video and audio
features may also influence processing of the avatar-specific
utterances. For example, the duration of the opening of the mouth,
opening of eyes, etc. may direct the system to select a specific
"woof" sample or transform such a sample before and/or during the
preview process. One such transformation is changing the level
and/or duration of the woof to match the detected opening and
closing of the user's mouth.
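
Since this cue requires facial and audio features to coincide, a detector has to scan both feature streams over a shared timeline. A minimal sketch, assuming per-frame jaw-openness values and frame-aligned RMS levels (both hypothetical representations):

```python
def silent_open_mouth(jaw_open: list[float], rms: list[float], fps: float,
                      min_s: float = 0.5, jaw_thresh: float = 0.6,
                      rms_thresh: float = 0.01):
    """Yield (start, end) times where the mouth is open but no voice is heard."""
    min_frames = int(min_s * fps)
    run = None
    for i, (jaw, level) in enumerate(zip(jaw_open, rms)):
        if jaw > jaw_thresh and level < rms_thresh:
            run = i if run is None else run
        else:
            if run is not None and i - run >= min_frames:
                yield (run / fps, i / fps)
            run = None
    if run is not None and len(jaw_open) - run >= min_frames:
        yield (run / fps, len(jaw_open) / fps)
```

A caller could insert one "woof" per yielded span, scaling its level and duration to the span as the paragraph describes.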
[0031] By way of example, recording session 202 may end when user
208 selects the record affordance again (e.g., indicating a desire
to end the recording), selects an end recording affordance (e.g.,
the record affordance may act as an end recording affordance while
recording), or based at least in part on expiration of a time
period (e.g., 20 seconds, 30 seconds, or the like). Once the
recording has finished, user 208 may select a preview affordance,
indicating that user 208 wishes to watch a preview of the
recording. One option could be to play the original recording
without any visual or audio effects. However, another option could
be to play a revised version of the recording. Based at least in
part on detection of the facial expression (e.g., the open mouth),
the avatar process may have revised the audio and/or video of the
video clip.
[0032] At block 216, device 206 may present avatar (also called a
puppet and/or animoji) 218 on a screen of device 206. Device 206
may also be configured with speaker 220 that can play audio
associated with the video clip. In this example, block 216
corresponds to the same point in time as block 210, where user 208
may not have been speaking yet. As such, avatar 218 may be
presented with his mouth open; however, no audio is presented from
speaker 220 yet. At block 222, corresponding to block 212 where
user 208 said "hello," the avatar process can present avatar 218
with an avatar-specific voice (as described above).
[0033] At block 224, the avatar process may replace the silence
identified at block 214 with an avatar-specific word. In this
example, the sound of a dog bark (e.g., a recorded or simulated dog
bark) may be inserted into the audio data (e.g., in place of the
silence) such that when it is played back during presentation of
the video clip, a "woof" is presented by speaker 220. In some
examples, different avatar-specific words will be presented at 224
based at least in part on different avatar selections, and in other
examples, the same avatar-specific word may be presented regardless
of the avatar selections. For example, if user 208 held his mouth
open, a "woof" could be presented when the dog avatar is selected,
a "meow" sound could be presented for a cat avatar, etc. In some
cases, each avatar may have a predefined sound to be played when it
is detected that user 208 has held his mouth open for an amount of
time (e.g., a half second, a whole second, etc.) without speaking.
However, in some examples, the process could ignore the detection
of the open mouth for avatars that don't have a predefined effect
for that facial feature. Additionally, there may be a second level
of audio feature analysis performed even after the extraction at
214. For example, if the process determines that a "woof" is to be
inserted for a dog avatar (e.g., based on detection of the open
mouth), the process may also detect how many "woof" sounds to
insert (e.g., if the user held his mouth open for double the length
of time used to indicate a bark) or whether it's not possible to
insert the number of barks requested (e.g., in the scenario of FIG.
1, where the user would speak "bark" to indicate that a "woof" sound
should be inserted). Thus, based on the above two examples, it
should be evident that user 208 can control effects of the
playback (e.g., the recorded avatar message) with their facial and
voice expressions. Further, while not shown explicitly in either
FIG. 1 or FIG. 2, the user device can be configured with software
for executing the avatar process (e.g., capturing the A/V
information, extracting features, analyzing the data, implementing
the logic, revising the audio and/or video files, and rendering the
previews) as well as software for executing an application (e.g.,
an avatar application with its own UI) that enables the user to
build the avatar messages and subsequently send them to other user
devices.
[0034] FIG. 3 is a simplified block diagram 300 illustrating
components (e.g., software modules) utilized by the avatar process
described above and below. In some examples, more or fewer modules
can be utilized to implement the providing of audio and/or video
effects based at least in part on audio and/or video features
detected in a user's recording. In some examples, device 302 may be
configured with camera 304, microphone 306, and a display screen
for presenting a UI and the avatar previews (e.g., the initial
preview before recording as well as the preview of the recording
before sending). In some examples, the avatar process is configured
with avatar engine 308 and voice engine 310. Avatar engine 308 can
manage the list of avatars, process the video features (e.g.,
facial feature characteristics), revise the video information,
communicate with voice engine 310 when appropriate, and render
video of the avatar 312 when all processing is complete and effects
have been implemented (or discarded). Revising of the video
information can include adjusting or otherwise editing the metadata
associated with the video file. In this way, when the video
metadata (adjusted or not) is used to render the puppet, the facial
features can be mapped to the puppet. In some examples, voice
engine 310 can store the audio information, perform the logic for
determining what effects to implement, revise the audio
information, and provide modified audio 314 when all processing is
complete and effects have been implemented (or discarded).
[0035] In some examples, once the user selects to record a new
avatar video clip, video features 316 can be captured by camera 304
and audio features 318 can be captured by microphone 306. In some
cases there may be as many as (or more than) fifty facial features
to be detected within video features 316. Example video features
include, but are not limited to, duration of expressions, open
mouth, frowns, smiles, eyebrows up or furrowed, etc. Additionally,
video features 316 may include only metadata that identifies each
of the facial features (e.g., data points that indicate which
locations on the user's face moved or were in what position).
Further, video features 316 can be passed to avatar engine 308 and
voice engine 310. At avatar engine 308, the metadata associated
with video features 316 can be stored and analyzed. In some
examples, avatar engine 308 may perform the feature extraction from
the video file prior to storing the metadata. However, in other
examples, the feature extraction may be performed prior to video
features 316 being sent to avatar engine 308 (in which case, video
features 316 would be the metadata itself). At voice engine 310,
video features 316 may be compared with audio features 318 when it
is helpful to match up what audio features correspond to which
video features (e.g., to see if certain audio and video features
occur at the same time).
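
A per-frame metadata record consistent with this description (and with the array of floating-point values mentioned in paragraph [0045] below) might look like the following; the particular feature names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaceFrame:
    """One frame of facial feature metadata, each value normalized to 0..1.

    Storing a few dozen floats per frame instead of pixels is what enables
    the storage savings noted later in paragraph [0049].
    """
    timestamp: float          # seconds from the start of the recording
    jaw_open: float
    smile: float
    frown: float
    brow_raise: float
    brow_furrow: float

# A frame at t = 1.2 s with the mouth wide open and brows furrowed.
frame = FaceFrame(1.2, jaw_open=0.9, smile=0.0, frown=0.4,
                  brow_raise=0.0, brow_furrow=0.8)
```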
[0036] In some instances, audio features are also passed to voice
engine 310 for storage. Example audio features include, but are not
limited to, level, pitch, and dynamics (e.g., changes in level,
pitch, voicing, formants, duration, etc.). Raw audio 320
includes the unprocessed audio file as it's captured. Raw audio 320
can be passed to voice engine 310 for further processing and
potential (e.g., eventual) revision and it can also be stored
separately so that the original audio can be used if desired. Raw
audio 320 can also be passed to voice recognition module 322. Voice
recognition module 322 can be used to word spot and identify a
user's intent from their voice. For example, voice recognition
module 322 can determine when a user is angry, sad, happy, or the
like. Additionally, when a user says a key word (e.g., "bark" as
described above), voice recognition module 322 will detect this.
Information detected and/or collected by voice recognition module
322 can then be passed to voice engine 310 for further logic and/or
processing. As noted, in some examples, audio features are
extracted in real-time during the preview itself. These audio
features may be avatar specific, generated only if the associated
avatar is being previewed. The audio features may be avatar
agnostic, generated for all avatars. The audio signal can also be
adjusted in part based on these real-time audio feature
extractions, and with the pre-stored extracted video features which
are created during or after the recording process, but before
previewing. Additionally, some feature extraction may be performed
during rendering at 336 by voice engine 310. Some pre-stored sounds
338 may be used by voice engine 310, as appropriate, to fill in the
blanks or to replace other sounds that were extracted.
[0037] In some examples, voice engine 310 will make the
determination regarding what to do with the information extracted
from voice recognition module 322. In some examples, voice engine
310 can pass the information from voice recognition module 322 to
feature module 324 for determining which features correspond to the
data extracted by voice recognition module 322. For example,
feature module 324 may indicate (e.g., based on a set of rules
and/or logic) that a sad voice detected by voice recognition module
322 corresponds to a raising of the pitch of the voice, or the
slowing down of the speed or cadence of the voice. In other words,
feature module 324 can map the extracted audio features to
particular voice features. Then, effect type module 326 can map the
particular voice features to the desired effect. Voice engine 310
can also be responsible for storing each particular voice for each
possible avatar. For example, there may be standard or hardcoded
voices for each avatar. Without any other changes being made, if a
user selects a particular avatar, voice engine 310 can select the
appropriate standard voice for use with playback. In this case,
modified audio 314 may just be raw audio 320 transformed to the
appropriate avatar voice based on the selected avatar. As the user
scrolls through the avatars and selects different ones, voice
engine 310 can modify raw audio 320 on the fly to make it sound
like the newly selected avatar. Thus, avatar type 328 needs to be
provided to voice engine 310 to make this change. However, if an
effect is to be provided (e.g., the pitch, tone, or actual words
are to be changed within the audio file), voice engine 310 can
revise raw audio file 320 and provide modified audio 314. In some
examples, the user will be provided with an option to use the
original audio file at on/off 330. If the user selects "off" (e.g.,
effects off), then raw audio 320 can be combined with video of
avatar 312 (e.g., corresponding to the unchanged video) to make A/V
output 332. A/V output 332 can be provided to the avatar
application presented on the UI of device 302.
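
The two-stage mapping performed by feature module 324 and effect type module 326 might be approximated with lookup tables, as sketched below. The specific deltas (e.g., sad raises pitch and slows cadence) follow the example in this paragraph; everything else is hypothetical.

```python
# Feature module: detected intent -> voice-feature deltas.
INTENT_TO_VOICE = {
    "sad":   {"pitch_semitones": +2.0, "speed": 0.85},
    "angry": {"pitch_semitones": -3.0, "speed": 1.00, "overlay": "growl"},
    "happy": {"pitch_semitones": +1.0, "speed": 1.10},
}

def effect_for(intent: str, avatar: str) -> dict:
    """Combine intent-driven deltas with the avatar's standard voice."""
    effect = {"voice_preset": avatar}   # every avatar has a stock voice
    effect.update(INTENT_TO_VOICE.get(intent, {}))
    return effect

# effect_for("sad", "dog") ->
# {"voice_preset": "dog", "pitch_semitones": 2.0, "speed": 0.85}
```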
[0038] Avatar engine 308 can be responsible for providing the
initial avatar image based at least in part on the selection of
avatar type 328. Additionally, avatar engine 308 is responsible for
mapping video features 316 to the appropriate facial markers of
each avatar. For example, if video features 316 indicate that the
user is smiling, the metadata that indicates a smile can be mapped
to the mouth area of the selected avatar so that the avatar appears
to be smiling in video of avatar 312. Additionally, avatar engine
308 can receive timing changes 334 from voice engine 310, as
appropriate. For example, if voice engine 310 determines that the voice
effect is to make the audio more of a whispering voice (e.g.,
based on feature module 324 and/or effect type 326 and/or the
avatar type), and modifies the voice to be more of a whispered
voice, this effect change may include slowing down the voice
itself, in addition to a reduced level and other formant and pitch
changes. Accordingly, the voice engine may produce a modified audio
which is slower in playback speed relative to the original audio
file for the audio clip. In this scenario, voice engine 310 would
need to instruct avatar engine 308 via timing changes 334, so that
the video file can be slowed down appropriately; otherwise, the
video and audio would not be synchronized.
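
The whisper example implies a simple contract: whatever time-scale factor the voice engine applies must also be sent to the avatar engine. A sketch of that handshake, with invented names and a fixed slowdown factor:

```python
import numpy as np

def apply_whisper(audio: np.ndarray, face_times: np.ndarray,
                  slowdown: float = 1.25, level: float = 0.5):
    """Slow and quieten the audio, and stretch the facial-frame timestamps
    by the same factor (the "timing changes 334" signal) so that playback
    of audio and avatar video stays synchronized."""
    n_out = int(len(audio) * slowdown)
    grid = np.linspace(0, len(audio) - 1, n_out)
    slowed = np.interp(grid, np.arange(len(audio)), audio) * level
    stretched_times = face_times * slowdown   # consumed by the avatar engine
    return slowed, stretched_times
```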
[0039] As noted, a user may use the avatar application of device
302 to select different avatars. In some examples, the voice effect
can change based at least in part on this selection. However, in
other examples, the user may be given the opportunity to select a
different voice for a given avatar (e.g., the cat voice for the dog
avatar, etc.). This type of free-form voice effect change can be
executed by the user via selection on the UI or, in some cases,
with voice activation or face motion. For example, a certain facial
expression could trigger voice engine 310 to change the voice
effect for a given avatar. Further, in some examples, voice engine
310 may be configured to make children's voices sound more high
pitched or, alternatively, determine not to make a child's voice
more high pitched because it would sound inappropriate given that
raw audio 320 for a child's voice might already be high pitched.
Making this user specific determination of an effect could be
driven in part by the audio features extracted, and in this case
such features could include pitch values and ranges throughout the
recording.
[0040] In some examples, voice recognition module 322 may include a
recognition engine, a word spotter, a pitch analyzer, and/or a
formant analyzer. The analysis performed by voice recognition
module 322 will be able to identify if the user is upset, angry,
happy, etc. Additionally, voice recognition module 322 may be able
to identify context and/or intonation of the user's voice, as well
as change the intention of wording and/or determine a profile
(e.g., a virtual identity) of the user.
[0041] In some examples, the avatar process 300 can be configured
to package/render the video clip by combining video of avatar 312
and either modified audio 314 or raw audio 320 into A/V output 332.
In order to package the two, voice engine 310 just needs to know an
ID for the metadata associated with video of avatar 312 (e.g., it
does not actually need video of avatar 312, it just needs the ID of
the metadata). A message within a messaging application (e.g., the
avatar application) can be transmitted to other computing devices,
where the message includes A/V output 332. When a user selects a
"send" affordance in the UI, the last video clip to be previewed
can be sent. For example, if a user previews their video clip with
the dog avatar, and then switches to the cat avatar for preview,
the cat avatar video would be sent when the user selects "send."
Additionally, the state of the last preview can be stored and used
later. For example, if the last message (e.g., avatar video clip)
sent used a particular effect, the first preview of the next
message being generated can utilize that particular effect.
[0042] The logic implemented by voice engine 310 and/or avatar
engine 308 can check for certain cues and/or features, and then
revise the audio and/or video files to implement the desired
effect. One example feature/effect pair involves detecting that the
user has opened their mouth and paused for a moment. In this
example, both facial feature characteristics (e.g., mouth open) and
audio feature characteristics (e.g., silence) need to happen at the
same time in order for the desired effect to be implemented. For
this feature/effect pair, the desired effect is to revise the audio
and video so that the avatar appears to make an
avatar/animal-specific sound. For example, a dog will make a bark
sound, a cat will make a meow sound, a monkey, horse, unicorn,
etc., will make the appropriate sound for that character/animal.
Another example feature/effect pair lowers the audio pitch
and/or tone when a frown is detected. In this example, only the
video feature characteristics need to be detected. However, in some
examples, this effect could be implemented based at least in part
on voice recognition module 322 detecting sadness in the voice of
the user. In this case, video features 316 wouldn't be needed at
all. Yet another example feature/effect pair responds to whispering by
slowing the audio and video, toning them down, and/or reducing their
changes. In some cases, video changes can lead to
modifications of the audio while, in other cases, audio changes can
lead to modifications of the video.
[0043] As noted above, in some examples, avatar engine 308 may act
as the feature extractor, in which case video features 316 and
audio features 318 may not exist prior to being sent to avatar
engine 308. Instead, raw audio 320 and metadata associated with the
raw video may be passed into avatar engine 308, where avatar engine
308 may extract the audio feature characteristics and the video
(e.g., facial) feature characteristics. In other words, while not
drawn this way in FIG. 3, parts of avatar engine 308 may actually
exist within camera 304. Additionally, in some examples, metadata
associated with video features 316 can be stored in a secure
container, and when voice engine 310 is running, it can read the
metadata from the container.
[0044] In some instances, because the preview video clip of the
avatar is not displayed in real-time (e.g., it is rendered and
displayed after the video is recorded and sometimes only in
response to selection of a play affordance), the audio and video
information can be processed offline (e.g., not in real-time). As
such, avatar engine 308 and voice engine 310 can read ahead in the
audio and video information and make context decisions up front.
Then, voice engine 310 can revise the audio file accordingly. This
ability to read ahead and make decisions offline will greatly
increase the efficiency of the system, especially for longer
recordings. Additionally, this enables a second stage of analysis,
where additional logic can be processed. Thus, the entire audio
file can be analyzed before making any final decisions. For
example, if the user says "bark" two times in a row, but the words
"bark" were said too closely together, the actual "woof" sound that
was prerecorded might not be able to fit in the time it took the
user to say "bark, bark." In this case, voice engine 310 can take
the information from voice recognition 322 and determine to ignore
the second "bark," because it won't be possible to include both
"woof" sounds in the audio file.
[0045] As noted above, when the audio file and the video are
packaged together to make A/V output 332, voice engine 310 does not
actually need to access video of avatar 312. Instead, the video
file (e.g., a .mov format file, or the like) is created as the
video is being played by accessing an array of features (e.g.,
floating-point values) that were written to the metadata file.
However, all permutations/adjustments to the audio and video files
can be done in advance, and some can even be done in real-time as
the audio and video are extracted. Additionally, in some examples,
each modified video clip could be saved temporarily (e.g., cached),
such that if the user reselects an avatar that has already been previewed, the processing to generate/render that particular preview does not need to be duplicated. As opposed to re-rendering the revised video clip each time the same avatar is selected during the preview session, the above-noted caching of rendered video clips would enable large savings in processor power and instructions per second (IPS), especially for longer recordings and/or recordings with a large number of effects.
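A minimal sketch of such a cache, assuming rendered previews are keyed by an avatar identifier; the key choice and types are hypothetical.

    import Foundation

    struct RenderedClip { let url: URL }   // stands in for a rendered .mov

    final class PreviewCache {
        private var cache: [String: RenderedClip] = [:]  // keyed by avatar ID

        func preview(for avatarID: String,
                     render: () -> RenderedClip) -> RenderedClip {
            // Reselecting an already-previewed avatar skips the expensive
            // render and returns the cached clip.
            if let clip = cache[avatarID] { return clip }
            let clip = render()
            cache[avatarID] = clip
            return clip
        }
    }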
[0046] Additionally, in some examples, noise suppression algorithms
can be employed for handling cases where the sound captured by
microphone 306 includes sounds other than the user's voice, for example, when the user is in a windy area or a loud room (e.g., a restaurant or bar). In these examples, a noise suppression algorithm could lower the decibel output of certain parts of the
audio recording. Alternatively, or in addition, different voices
could be separated and/or only audio coming from certain angles of
view (e.g., the angle of the user's face) could be collected, and
other voices could be ignored or suppressed. In other cases, if the
avatar process 300 determines that the noise levels are too loud or
will be difficult to process, the process 300 could disable the
recording option.
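As one illustration of lowering the decibel output of certain parts of a recording, a crude noise gate might attenuate analysis frames whose level falls below a threshold, on the assumption that quiet frames are background rather than voice. This is a sketch of the general idea only, not the suppression algorithm used by the disclosure.

    // Attenuate frames whose RMS level falls below a threshold
    // (threshold and frame layout are assumptions for illustration).
    func gate(frames: [[Float]], threshold: Float,
              attenuation: Float = 0.1) -> [[Float]] {
        frames.map { (frame: [Float]) -> [Float] in
            guard !frame.isEmpty else { return frame }
            let meanSquare = frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count)
            let rms = meanSquare.squareRoot()
            return rms < threshold ? frame.map { $0 * attenuation } : frame
        }
    }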
[0047] FIG. 4 illustrates an example flow diagram showing process
400 for implementing various audio and/or video effects based at
least in part on audio and/or video features, according to at least
a few embodiments. In some examples, computing device 106 of FIG. 1
or other similar user device (e.g., utilizing at least avatar
process 300 of FIG. 3) may perform the process 400 of FIG. 4.
[0048] At block 402, computing device 106 may capture video having
an audio component. In some examples, the video and audio may be
captured by two different hardware components (e.g., a camera may
capture the video information while a microphone may capture the
audio information). However, in some instances, a single hardware
component may be configured to capture both audio and video. In any
event, the video and audio information may be associated with one
another (e.g., by sharing an ID, timestamp, or the like). As such,
the video may have an audio component (e.g., they are part of the
same file), or the video may be linked with an audio component
(e.g., two files that are associated together).
[0049] At block 404, computing device 106 may extract facial
features and audio features from the captured video and audio
information, respectively. In some cases, the facial feature
information may be extracted via avatar engine 308 and stored as
metadata. The metadata can be used to map each facial feature to a
particular puppet or to any animation or virtual face. Thus, the actual video file does not need to be stored, yielding significant memory savings. Regarding the audio
feature extraction, a voice recognition algorithm can be utilized
to extract different voice features; for example, words, phrases,
pitch, speed, etc.
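The following sketch illustrates the idea that per-frame facial features can be stored as plain floating-point metadata and replayed onto any puppet, so the raw video never needs to be kept. The coefficient layout and names are assumptions for illustration.

    // Per-frame facial features stored as plain numbers instead of video.
    struct FaceFrame {
        let coefficients: [Float]  // e.g., [jawOpen, browDown, smile, ...]
        let timestamp: Double      // seconds into the recording
    }

    // Any puppet can be driven from the same metadata.
    protocol Puppet {
        mutating func apply(_ frame: FaceFrame)
    }

    struct DogPuppet: Puppet {
        var jawOpen: Float = 0
        mutating func apply(_ frame: FaceFrame) {
            // Index 0 is assumed to carry the jaw-open coefficient here.
            jawOpen = frame.coefficients.first ?? 0
        }
    }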
[0050] At block 406, computing device 106 may detect context from
the extracted features. For example, context may include a user's
intent, mood, setting, location, background items, ideas, etc. The
context can be important when employing logic to determine what
effects to apply. In some cases, the context can be combined with
detected spoken words to determine whether and/or how to adjust the
audio file and/or the video file. In one example, a user may furrow
his eyebrows and speak slowly. The furrowing of the eyebrows is a
video feature that could have been extracted at block 404 and the
slow speech is an audio feature that could have been extracted at
block 404. Individually, those two features might mean something different; however, when combined, the avatar process can determine that the user is concerned about something. In this case,
the context of the message might be that a parent is speaking to a
child, or a friend is speaking to another friend about a serious or
concerning matter.
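A minimal sketch of combining two individually ambiguous features into a single context, using the furrowed-brows-plus-slow-speech example above; the enum cases and decision rule are invented for illustration.

    enum Context { case concerned, neutral }

    func detectContext(eyebrowsFurrowed: Bool, speechIsSlow: Bool) -> Context {
        switch (eyebrowsFurrowed, speechIsSlow) {
        case (true, true):  return .concerned  // only the combination signals concern
        case (true, false): return .neutral    // furrowed brows alone are ambiguous
        case (false, _):    return .neutral    // slow speech alone is ambiguous too
        }
    }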
[0051] At block 408, computing device 106 may determine effects for
rendering the audio and/or video files based at least in part on
the context. As noted above, one context might be concern. As such,
a particular video and/or audio feature may be employed for this
effect. For example, the voice file may be adjusted to sound more
somber, or to be slowed down. In other examples, the
avatar-specific voice might be replaced with a version of the
original (e.g., raw) audio to convey the seriousness of the
message. Various other effects can be employed for various other
contexts. In other examples, the context may be animal noises (e.g., based on the user saying "bark" or "meow" or the like). In
this case, the determined effect would be to replace the spoken
word "bark" with the sound of a dog barking.
[0052] At block 410, computing device 106 may perform additional
logic for additional effects. For example, if the user attempted to
effectuate the bark effect by saying bark twice in a row, the
additional logic may need to be utilized to determine whether the
additional bark is technically feasible. As an example, if the
audio clip of the bark that is used to replace the spoken word in
the raw audio information is 0.5 seconds long, but the user says
"bark" twice in a 0.7-second span, the additional logic can
determine that two bark sounds cannot fit in the 0.7 seconds
available. Thus, the audio and video file may need to be extended
in order to fit both bark sounds, the bark sound may need to be
shortened (e.g., by processing the stored bark sound), or the
second spoken word bark may need to be ignored.
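The feasibility check in this example reduces to whether the replacement sounds fit in the available window: two 0.5-second "woof" sounds need a full second, so both cannot fit in 0.7 seconds. A sketch of that arithmetic follows, showing only the drop-the-extra resolution; extending the clip or shortening the sound would be handled analogously.

    enum BarkResolution {
        case keepAll                  // every occurrence fits as spoken
        case dropLast(count: Int)     // ignore trailing occurrences
    }

    func resolve(occurrences: Int, window: Double,
                 soundDuration: Double) -> BarkResolution {
        let maxFit = Int(window / soundDuration)  // sounds the window can hold
        if occurrences <= maxFit { return .keepAll }
        // e.g., 2 barks in a 0.7 s window with a 0.5 s woof -> drop 1
        return .dropLast(count: occurrences - maxFit)
    }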
[0053] At block 412, computing device 106 may revise the audio
and/or video information based at least in part on the determined
effects and/or additional effects. In some examples, only one set
of effects may be used. However, in either case, the raw audio file
may be adjusted (e.g., revised) to form a new audio file with
additional sounds added and/or subtracted. For example, in the
"bark" use case, the spoken word "bark" will be removed from the
audio file and a new sound that represents an actual dog barking
will be inserted. The new file can be saved with a different ID, or
with an appended ID (e.g., the raw audio ID, with a .v2 identifier
to indicate that it is not the original). Additionally, the raw
audio file will be saved separately so that it can be reused for
additional avatars and/or if the user decides not to use the
determined effects.
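A tiny sketch of the ID scheme described here, assuming string identifiers; the .v2 suffix convention is taken directly from the example above, while the storage layer is out of scope.

    // The revised file gets an appended version suffix while the raw
    // audio keeps its original ID so it can be reused for other avatars.
    func revisedAudioID(rawID: String, version: Int = 2) -> String {
        "\(rawID).v\(version)"   // e.g., "clip-1234" -> "clip-1234.v2"
    }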
[0054] At block 414, computing device 106 may receive a selection
of an avatar from the user. The user may select one of a plurality
of different avatars through a UI of the avatar application being
executed by computing device 106. The avatars may be selected via a
scroll wheel, drop down menu, or icon menu (e.g., where each avatar
is visible on the screen in its own position).
[0055] At block 416, computing device 106 may present the revised
video with the revised audio based at least in part on the selected
avatar. In this example, each adjusted video clip (e.g., a final
clip for the avatar that has adjusted audio and/or adjusted video)
may be generated for each respective avatar prior to selection of
the avatar by the user. This way, the processing has already been
completed, and the adjusted video clip is ready to be presented
immediately upon selection of the avatar. While this might require
additional IPS prior to avatar selection, it will speed up the
presentation. Additionally, the processing of each adjusted video
clip can be performed while the user is reviewing the first preview
(e.g., the preview that corresponds to the first/default avatar
presented in the UI).
[0056] FIG. 5 illustrates an example flow diagram showing process
500 for implementing various audio and/or video effects based at
least in part on audio and/or video features, according to at least
a few embodiments. In some examples, computing device 106 of FIG. 1
or other similar user device (e.g., utilizing at least avatar
process 300 of FIG. 3) may perform the process 500 of FIG. 5.
[0057] At block 502, computing device 106 may capture video having
an audio component. Just like in block 402 of FIG. 4, the video and
audio may be captured by two different hardware components (e.g., a
camera may capture the video information while a microphone may
capture the audio information). As noted, the video may have an
audio component (e.g., they are part of the same file), or the
video may be linked with an audio component (e.g., two files that
are associated together).
[0058] At block 504, computing device 106 may extract facial
features and audio features from the captured video and audio
information, respectively. Just like above, the facial feature
information may be extracted via avatar engine 308 and stored as
metadata. The metadata can be used to map each facial feature to a
particular puppet or to any animation or virtual face. Thus, the actual video file does not need to be stored, yielding significant memory savings. Regarding the audio
feature extraction, a voice recognition algorithm can be utilized
to extract different voice features; for example, words, phrases,
pitch, speed, etc. Additionally, in some examples, avatar engine
308 and/or voice engine 310 may perform the audio feature
extraction.
[0059] At block 506, computing device 106 may detect context from
the extracted features. For example, context may include a user's
intent, mood, setting, location, ideas, identity, etc. The context
can be important when employing logic to determine what effects to
apply. In some cases, the context can be combined with spoken words
to determine whether and/or how to adjust the audio file and/or the
video file. In one example, a user's age may be detected as the
context (e.g., child, adult, etc.) based at least in part on facial
and/or voice features. For example, a child's face may have
particular features that can be identified (e.g., large eyes, a
small nose, and a relatively small head, etc.). As such, a child
context may be detected.
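A sketch of the kind of rule that might flag a child context from coarse facial measurements; the features and thresholds are invented for illustration, and real detection would be far more robust.

    // Hypothetical child-context heuristic from coarse facial geometry.
    struct FaceMetrics {
        let relativeEyeSize: Float   // eye size relative to the face, 0...1
        let relativeNoseSize: Float  // nose size relative to the face, 0...1
    }

    func isLikelyChild(_ metrics: FaceMetrics) -> Bool {
        // Children tend to have proportionally larger eyes and smaller noses.
        metrics.relativeEyeSize > 0.5 && metrics.relativeNoseSize < 0.3
    }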
[0060] At block 508, computing device 106 may receive a selection
of an avatar from the user. The user may select one of a plurality
of different avatars through a UI of the avatar application being
executed by computing device 106. The avatars may be selected via a
scroll wheel, drop down menu, or icon menu (e.g., where each avatar
is visible on the screen in its own position).
[0061] At block 510, computing device 106 may determine effects for
rendering the audio and/or video files based at least in part on
the context and the selected avatar. In this example, the effects
for each avatar may be generated upon selection of each avatar, as
opposed to all at once. In some instances, this can yield significant processor and memory savings, because only one set of effects and avatar rendering will be performed at a time. These savings are especially likely when the user does
not select multiple avatars to preview.
[0062] At block 512, computing device 106 may perform additional
logic for additional effects, similar to that described above with
respect to block 410 of FIG. 4. At block 514, computing device 106
may revise the audio and/or video information based at least in
part on the determined effects and/or additional effects for the
selected avatar, similar to that described above with respect to
block 412 of FIG. 4. At block 516, computing device 106 may present
the revised video with the revised audio based at least in part on
the selected avatar, similar to that described above with respect to
block 416 of FIG. 4.
[0063] In some examples, the avatar process 300 may determine
whether to perform flow 400 or flow 500 based at least in part on
historical information. For example, if the user generally uses the
same avatar every time, flow 500 will be more efficient. However,
if the user regularly switches between avatars, and previews
multiple different avatars per video clip, then following flow 400
may be more efficient.
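A sketch of one such heuristic, assuming the device tracks how many avatars the user typically previews per clip; the metric and threshold are illustrative.

    enum RenderStrategy {
        case preRenderAll       // flow 400: render every avatar up front
        case renderOnSelection  // flow 500: render only the selected avatar
    }

    func strategy(avgAvatarsPreviewedPerClip: Double) -> RenderStrategy {
        // Frequent switchers benefit from pre-rendering; everyone else
        // saves processor and memory by rendering on selection.
        avgAvatarsPreviewedPerClip > 1.5 ? .preRenderAll : .renderOnSelection
    }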
[0064] FIG. 6 illustrates an example UI 600 for enabling a user to
utilize the avatar application (e.g., corresponding to avatar
application affordance 602). In some examples, UI 600 may look
different (e.g., it may appear as a standard text (e.g., short
messaging service (SMS)) messaging application) until avatar
application affordance 602 is selected. As noted, the avatar
application can communicate with the avatar process (e.g., avatar
process 300 of FIG. 3) to make requests for capturing, processing
(e.g., extracting features, running logic, etc.), and adjusting
audio and/or video. For example, when the user selects a record
affordance (e.g., record/send video clip affordance 604), the
avatar application may make an application programming interface
(API) call to the avatar process to begin capturing video and audio
information using the appropriate hardware components. In some examples, record/send video clip affordance 604 may be represented
as a red circle (or a plain circle without the line shown in FIG.
6) prior to the recording session beginning. In this way, the
affordance will look more like a standard record button. During the recording session, the appearance of record/send video clip
affordance 604 may be changed to look like a clock countdown or
other representation of a timer (e.g., if the length of video clip
recordings is limited). However, in other examples, the record/send
video clip affordance 604 may merely change colors to indicate that
the avatar application is recording. If there is no timer, or limit
on the length of the recording, the user may need to select
record/send video clip affordance 604 again to terminate the
recording.
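The affordance's appearance can be summarized as a small state machine; the sketch below simply names the states described above.

    enum RecordAffordanceState {
        case idle                       // plain or red circle before recording
        case recording(remaining: Int)  // countdown shown when clip length is limited
        case recordingUntimed           // color change only; tap again to stop
    }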
[0065] In some examples, a user may use avatar selection affordance
606 to select an avatar. This can be done before recording of the
avatar video clip and/or after recording of the avatar video clip.
When selected before recording, the initial preview of the user's
motions and facial characteristics will be presented as the
selected avatar. Additionally, the recording will be performed
while presenting a live (e.g., real-time) preview of the recording,
with the user's face being represented by the selected avatar. Once
the recording is completed, a second preview (e.g., a replay of the
actual recording) will be presented, again using the selected
avatar. However, at this stage, the user can scroll through avatar
selection affordance 606 to select a new avatar to view the
recording preview. In some cases, upon selection of a new avatar,
the UI will begin to preview the recording using the selected
avatar. The new preview can be presented with the audio/video
effects or as originally recorded. As noted, the determination
regarding whether to present the effected version or the original
may be based at least in part on the last method of playback used.
For example, if the last playback used effects, the first playback
after a new avatar selection may use effects. However, if the last
playback did not use effects, the first playback after a new avatar
selection may not use effects. In some examples, the user can replay
the video clip with effects by selecting effects preview affordance
608 or without effects by selecting original preview affordance
610. Once satisfied with the video clip (e.g., the message), the
user can send the avatar video in a message to another computing
device using record/send video clip affordance 604. The video clip
will be sent using the format corresponding to the last preview
(e.g., with or without effects). At any time, if the user desires,
delete video clip affordance 612 may be selected to delete the
avatar video and either start over or exit the avatar and/or
messaging applications.
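A sketch of the "last method of playback wins" rule described above; the type and property names are hypothetical.

    enum PlaybackMode { case withEffects, original }

    struct PreviewState {
        var lastPlayback: PlaybackMode = .withEffects

        // The first replay after selecting a new avatar reuses the last mode.
        func modeForNewAvatar() -> PlaybackMode { lastPlayback }

        // Selecting affordance 608 or 610 updates the remembered mode.
        mutating func userChose(_ mode: PlaybackMode) { lastPlayback = mode }
    }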
[0066] FIG. 7 illustrates an example flow diagram showing process
(e.g., a computer-implemented method) 700 for implementing various
audio and/or video effects based at least in part on audio and/or
video features, according to at least a few embodiments. In some
examples, computing device 106 of FIG. 1 or other similar user
device (e.g., utilizing at least an avatar application similar to
that shown in FIG. 6 and avatar process 300 of FIG. 3) may perform
the process 700 of FIG. 7.
[0067] At block 702, computing device 106 may display a virtual
avatar generation interface. The virtual avatar generation
interface may look similar to the UI illustrated in FIG. 6.
However, any UI configured to enable the same features described
herein can be used.
[0068] At block 704, computing device 106 may display first preview
content of a virtual avatar. In some examples, the first preview
content may be a real-time representation of the user's face,
including movement and facial expressions. However, the first
preview would provide an avatar (e.g., cartoon character,
digital/virtual puppet) to represent the user's face instead of an
image of the user's face. This first preview may be video only, or
at least a rendering of the avatar without sound. In some examples,
this first preview is not recorded and can be utilized for as long
as the user desires, without limitation other than battery power or
memory space of computing device 106.
[0069] At block 706, computing device 106 may detect selection of
an input (e.g., record/send video clip affordance 604 of FIG. 6) in
the virtual avatar generation interface. This selection may be made
while the UI is displaying the first preview content.
[0070] At block 708, computing device 106 may begin capturing video
and audio signals based at least in part on the input detected at
block 706. As described, the video and audio signals may be
captured by appropriate hardware components and can be captured by
one or a combination of such components.
[0071] At block 710, computing device 106 may extract audio feature
characteristics and facial feature characteristics as described in
detail above. As noted, the extraction may be performed by
particular modules of avatar process 300 of FIG. 3 or by other
extraction and/or analysis components of the avatar application
and/or computing device 106.
[0072] At block 712, computing device 106 may generate an adjusted audio signal based at least in part on facial feature characteristics
and audio feature characteristics. For example, the audio file
captured at block 708 may be permanently (or temporarily) revised
(e.g., adjusted) to include new sounds, new words, etc., and/or to
have the original pitch, tone, volume, etc., adjusted. These
adjustments can be made based at least in part on the context
detected via analysis of the facial feature characteristics and audio
feature characteristics. Additionally, the adjustments can be made
based on the type of avatar selected and/or based on specific
motions, facial expressions, words, phrases, or actions performed
by the user (e.g., expressed by the user's face) during the
recording session.
[0073] At block 714, computing device 106 may generate second
preview content of the virtual avatar in the UI according to the
adjusted audio signal. The generated second preview content may be
based at least in part on the currently selected avatar or some
default avatar. Once the second preview content is generated,
computing device 106 can present the second preview content in the
UI at block 716.
[0074] FIG. 8 illustrates an example flow diagram showing process
(e.g., instructions stored on a computer-readable memory that can
be executed) 800 for implementing various audio and/or video
effects based at least in part on audio and/or video features,
according to at least a few embodiments. In some examples,
computing device 106 of FIG. 1 or other similar user device (e.g.,
utilizing at least an avatar application similar to that shown in
FIG. 6 and avatar process 300 of FIG. 3) may perform the process
800 of FIG. 8.
[0075] At block 802, computing device 106 may detect a request to
generate an avatar video clip of a virtual avatar. In some
examples, the request may be based at least in part on a user's
selection of send/record video clip affordance 604 of FIG. 6.
[0076] At block 804, computing device 106 may capture a video
signal associated with a face in the field of view of the camera.
At block 806, computing device 106 may capture an audio signal
corresponding to the video signal (e.g., coming from the face being
captured by the camera).
[0077] At block 808, computing device 106 may extract voice feature
characteristics from the audio signal and at block 810, computing
device 106 may extract facial feature characteristics from the video
signal.
[0078] At block 812, computing device 106 may detect a request to
preview the avatar video clip. This request may be based at least
in part on a user's selection of a new avatar via avatar selection
affordance 606 of FIG. 6 or based at least in part on a user's
selection of effects preview affordance 608 of FIG. 6.
[0079] At block 814, computing device 106 may generate an adjusted audio signal based at least in part on facial feature characteristics
and voice feature characteristics. For example, the audio file
captured at block 806 may be revised (e.g., adjusted) to include
new sounds, new words, etc., and/or to have the original pitch,
tone, volume, etc., adjusted. These adjustments can be made based
at least in part on the context detected via analysis of the facial
feature characteristics and voice feature characteristics.
Additionally, the adjustments can be made based on the type of
avatar selected and/or based on specific motions, facial
expressions, words, phrases, or actions performed by the user
(e.g., expressed by the user's face) during the recording
session.
[0080] At block 816, computing device 106 may generate a preview of
the virtual avatar in the UI according to the adjusted audio
signal. The generated preview may be based at least in part on the
currently selected avatar or some default avatar. Once the preview
is generated, computing device 106 can also present it in the UI at block 816.
[0081] FIG. 9 is a simplified block diagram illustrating example
architecture 900 for implementing the features described herein,
according to at least one embodiment. In some examples, computing
device 902 (e.g., computing device 106 of FIG. 1), having example
architecture 900, may be configured to present relevant UIs,
capture audio and video information, extract relevant data, perform
logic, revise the audio and video information, and present animoji
videos.
[0082] Computing device 902 may be configured to execute or
otherwise manage applications or instructions for performing the
described techniques such as, but not limited to, providing a user
interface (e.g., user interface 600 of FIG. 6) for recording,
previewing, and/or sending virtual avatar video clips. Computing
device 902 may receive inputs (e.g., utilizing I/O device(s) 904
such as a touch screen) from a user at the user interface, capture
information, process the information, and then present the video
clips as previews also utilizing I/O device(s) 904 (e.g., a speaker
of computing device 902). Computing device 902 may be configured to
revise audio and/or video files based at least in part on facial
features extracted from the captured video and/or voice features
extracted from the captured audio.
[0083] Computing device 902 may be any type of computing device
such as, but not limited to, a mobile phone (e.g., a smartphone), a
tablet computer, a personal digital assistant (PDA), a laptop
computer, a desktop computer, a thin-client device, a smart watch,
a wireless headset, or the like.
[0084] In one illustrative configuration, computing device 902 may
include at least one memory 914 and one or more processing units
(or processor(s)) 916. Processor(s) 916 may be implemented as
appropriate in hardware, computer-executable instructions, or
combinations thereof. Computer-executable instruction or firmware
implementations of processor(s) 916 may include computer-executable
or machine-executable instructions written in any suitable
programming language to perform the various functions
described.
[0085] Memory 914 may store program instructions that are loadable
and executable on processor(s) 916, as well as data generated
during the execution of these programs. Depending on the
configuration and type of computing device 902, memory 914 may be
volatile (such as random access memory (RAM)) and/or non-volatile
(such as read-only memory (ROM), flash memory, etc.). Computing
device 902 may also include additional removable storage and/or
non-removable storage 926 including, but not limited to, magnetic
storage, optical disks, and/or tape storage. The disk drives and
their associated non-transitory computer-readable media may provide
non-volatile storage of computer-readable instructions, data
structures, program modules, and other data for the computing
devices. In some implementations, memory 914 may include multiple
different types of memory, such as static random access memory
(SRAM), dynamic random access memory (DRAM), or ROM. While the
volatile memory described herein may be referred to as RAM, any
volatile memory that would not maintain data stored therein once
unplugged from a host and/or power would be appropriate.
[0086] Memory 914 and additional storage 926, both removable and
non-removable, are all examples of non-transitory computer-readable
storage media. For example, non-transitory computer readable
storage media may include volatile or non-volatile, removable or
non-removable media implemented in any method or technology for
storage of information such as computer-readable instructions, data
structures, program modules, or other data. Memory 914 and
additional storage 926 are both examples of non-transitory computer
storage media. Additional types of computer storage media that may
be present in computing device 902 may include, but are not limited
to, phase-change RAM (PRAM), SRAM, DRAM, RAM, ROM, electrically
erasable programmable read-only memory (EEPROM), flash memory or
other memory technology, compact disc read-only memory (CD-ROM),
digital video disc (DVD) or other optical storage, magnetic
cassettes, magnetic tape, magnetic disk storage or other magnetic
storage devices, or any other medium that can be used to store the
desired information and that can be accessed by computing device
902. Combinations of any of the above should also be included
within the scope of non-transitory computer-readable storage
media.
[0087] Alternatively, computer-readable communication media may
include computer-readable instructions, program modules, or other
data transmitted within a data signal, such as a carrier wave, or
other transmission. However, as used herein, computer-readable
storage media does not include computer-readable communication
media.
[0088] Computing device 902 may also contain communications
connection(s) 928 that allow computing device 902 to communicate
with a data store, another computing device or server, user
terminals and/or other devices via one or more networks. Such
networks may include any one or a combination of many different
types of networks, such as cable networks, the Internet, wireless
networks, cellular networks, satellite networks, other private
and/or public networks, or any combination thereof. Computing
device 902 may also include I/O device(s) 904, such as a touch
input device, a keyboard, a mouse, a pen, a voice input device, a
display, a speaker, a printer, etc.
[0089] Turning to the contents of memory 914 in more detail, memory
914 may include operating system 932 and/or one or more application
programs or services for implementing the features disclosed herein
including user interface module 934, avatar control module 936,
avatar application module 938, and messaging module 940. Memory 914
may also be configured to store one or more audio and video files
to be used to produce audio and video output. In this way,
computing device 902 can perform all of the operations described
herein.
[0090] In some examples, user interface module 934 may be
configured to manage the user interface of computing device 902.
For example, user interface module 934 may present any number of
various UIs requested by computing device 902. In particular, user
interface module 934 may be configured to present UI 600 of FIG. 6,
which enables implementation of the features described herein,
including communication with avatar process 300 of FIG. 3 which is
responsible for capturing video and audio information, extracting
appropriate facial feature and voice feature information, and
revising the video and audio information prior to presentation of
the generated avatar video clips as described above.
[0091] In some examples, avatar control module 936 is configured to
implement (e.g., execute instructions for implementing) avatar
process 300 while avatar application module 938 is configured to
implement the user facing application. As noted above, avatar
application module 938 may utilize one or more APIs for requesting
and/or providing information to avatar control module 936.
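One way to picture this split is a simple protocol standing in for the API boundary between the two modules; the method names are hypothetical, as the disclosure only states that one or more APIs are used.

    // Hypothetical API boundary between the two modules.
    protocol AvatarControlling {
        func beginCaptureSession()
        func endCaptureSession()
        func renderPreview(avatarID: String, withEffects: Bool)
    }

    // The user-facing module holds only a reference to the API,
    // never to the capture or processing internals.
    struct AvatarApplication {
        let control: AvatarControlling
        func userTappedRecord() { control.beginCaptureSession() }
    }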
[0092] In some embodiments, messaging module 940 may implement any
standalone or add-on messaging application that can communicate
with avatar control module 936 and/or avatar application module
938. In some examples, messaging module 940 may be fully integrated
with avatar application module 938 (e.g., as seen in UI 600 of FIG.
6), where the avatar application appears to be part of the
messaging application. However, in other examples, messaging module 940 may call avatar application module 938 when a user requests to generate an avatar video clip, and avatar application module 938 may open up a new application altogether that is integrated with messaging module 940.
[0093] Computing device 902 may also be equipped with a camera and
microphone, as shown in at least FIG. 3, and processors 916 may be
configured to execute instructions to display a first preview of a
virtual avatar. In some examples, while displaying the first
preview of a virtual avatar, an input may be detected via a virtual
avatar generation interface presented by user interface module 934.
In some instances, in response to detecting the input in the
virtual avatar generation interface, avatar control module 936 may
initiate a capture session including: capturing, via the camera, a
video signal associated with a face in a field of view of the
camera, capturing, via the microphone, an audio signal associated
with the captured video signal, extracting audio feature
characteristics from the captured audio signal, and extracting
facial feature characteristics associated with the face from the
captured video signal. Additionally, in response to detecting
expiration of the capture session, avatar control module 936 may
generate an adjusted audio signal based at least in part on the
audio feature characteristics and the facial feature
characteristics, and display a second preview of the virtual avatar
in the virtual avatar generation interface according to the facial
feature characteristics and the adjusted audio signal.
[0094] Illustrative methods, computer-readable medium, and systems
for providing various techniques for adjusting audio and/or video
content based at least in part on voice and/or facial feature
characteristics are described above. Some or all of these systems,
media, and methods may, but need not, be implemented at least
partially by architectures and flows such as those shown at least
in FIGS. 1-9 above. While many of the embodiments are described
above with reference to messaging applications, it should be
understood that any of the above techniques can be used within any
type of application including real-time video playback or real-time
video messaging applications. For purposes of explanation, specific
configurations and details are set forth in order to provide a
thorough understanding of the examples. However, it should also be
apparent to one skilled in the art that the examples may be
practiced without the specific details. Furthermore, well-known
features were sometimes omitted or simplified in order not to
obscure the example being described.
[0095] The various embodiments further can be implemented in a wide
variety of operating environments, which in some cases can include
one or more user computers, computing devices or processing devices
which can be used to operate any of a number of applications. User
or client devices can include any of a number of general purpose
personal computers, such as desktop or laptop computers running a
standard operating system, as well as cellular, wireless and
handheld devices running mobile software and capable of supporting
a number of networking and messaging protocols. Such a system also
can include a number of workstations running any of a variety of
commercially-available operating systems and other known
applications for purposes such as development and database
management. These devices also can include other electronic
devices, such as dummy terminals, thin-clients, gaming systems and
other devices capable of communicating via a network.
[0096] Most embodiments utilize at least one network that would be
familiar to those skilled in the art for supporting communications
using any of a variety of commercially-available protocols, such as
TCP/IP, OSI, FTP, UPnP, NFS, CIFS, and AppleTalk. The network can
be, for example, a local area network, a wide-area network, a
virtual private network, the Internet, an intranet, an extranet, a
public switched telephone network, an infrared network, a wireless
network, and any combination thereof.
[0097] In embodiments utilizing a network server, the network
server can run any of a variety of server or mid-tier applications,
including HTTP servers, FTP servers, CGI servers, data servers,
Java servers, and business application servers. The server(s) also
may be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more
applications that may be implemented as one or more scripts or
programs written in any programming language, such as Java.RTM., C,
C# or C++, or any scripting language, such as Perl, Python or TCL,
as well as combinations thereof. The server(s) may also include
database servers, including without limitation those commercially
available from Oracle.RTM., Microsoft.RTM., Sybase.RTM., and
IBM.RTM..
[0098] The environment can include a variety of data stores and
other memory and storage media as discussed above. These can reside
in a variety of locations, such as on a storage medium local to
(and/or resident in) one or more of the computers or remote from
any or all of the computers across the network. In a particular set
of embodiments, the information may reside in a storage-area
network (SAN) familiar to those skilled in the art. Similarly, any
necessary files for performing the functions attributed to the
computers, servers or other network devices may be stored locally
and/or remotely, as appropriate. Where a system includes
computerized devices, each such device can include hardware
elements that may be electrically coupled via a bus, the elements
including, for example, at least one central processing unit (CPU),
at least one input device (e.g., a mouse, keyboard, controller,
touch screen or keypad), and at least one output device (e.g., a
display device, printer or speaker). Such a system may also include
one or more storage devices, such as disk drives, optical storage
devices, and solid-state storage devices such as RAM or ROM, as
well as removable media devices, memory cards, flash cards,
etc.
[0099] Such devices also can include a computer-readable storage
media reader, a communications device (e.g., a modem, a network
card (wireless or wired), an infrared communication device, etc.),
and working memory as described above. The computer-readable
storage media reader can be connected with, or configured to
receive, a non-transitory computer-readable storage medium,
representing remote, local, fixed, and/or removable storage devices
as well as storage media for temporarily and/or more permanently
containing, storing, transmitting, and retrieving computer-readable
information. The system and various devices also typically will
include a number of software applications, modules, services or
other elements located within at least one working memory device,
including an operating system and application programs, such as a
client application or browser. It should be appreciated that
alternate embodiments may have numerous variations from that
described above. For example, customized hardware might also be
used and/or particular elements might be implemented in hardware,
software (including portable software, such as applets) or both.
Further, connection to other computing devices such as network
input/output devices may be employed.
[0100] Non-transitory storage media and computer-readable storage
media for containing code, or portions of code, can include any
appropriate media known or used in the art (except for transitory
media like carrier waves or the like) such as, but not limited to,
volatile and non-volatile, removable and non-removable media
implemented in any method or technology for storage of information
such as computer-readable instructions, data structures, program
modules or other data, including RAM, ROM, Electrically Erasable
Programmable Read-Only Memory (EEPROM), flash memory or other
memory technology, CD-ROM, DVD or other optical storage, magnetic
cassettes, magnetic tape, magnetic disk storage or other magnetic
storage devices or any other medium which can be used to store the
desired information and which can be accessed by a system device.
Based on the disclosure and teachings provided herein, a person of
ordinary skill in the art will appreciate other ways and/or methods
to implement the various embodiments. However, as noted above,
computer-readable storage media does not include transitory media
such as carrier waves or the like.
[0101] The specification and drawings are, accordingly, to be
regarded in an illustrative rather than a restrictive sense. It
will, however, be evident that various modifications and changes
may be made thereunto without departing from the broader spirit and
scope of the disclosure as set forth in the claims.
[0102] Other variations are within the spirit of the present
disclosure. Thus, while the disclosed techniques are susceptible to
various modifications and alternative constructions, certain
illustrated embodiments thereof are shown in the drawings and have
been described above in detail. It should be understood, however,
that there is no intention to limit the disclosure to the specific
form or forms disclosed, but on the contrary, the intention is to
cover all modifications, alternative constructions and equivalents
falling within the spirit and scope of the disclosure, as defined
in the appended claims.
[0103] The use of the terms "a," "an," and "the," and similar
referents in the context of describing the disclosed embodiments
(especially in the context of the following claims), are to be
construed to cover both the singular and the plural, unless
otherwise indicated herein or clearly contradicted by context. The
terms "comprising," "having," "including," and "containing" are to
be construed as open-ended terms (i.e., meaning "including, but not
limited to,") unless otherwise noted. The term "connected" is to be
construed as partly or wholly contained within, attached to, or
joined together, even if there is something intervening. The phrase
"based on" should be understood to be open-ended, and not limiting
in any way, and is intended to be interpreted or otherwise be read
as "based at least in part on," where appropriate. Recitation of
ranges of values herein are merely intended to serve as a shorthand
method of referring individually to each separate value falling
within the range, unless otherwise indicated herein, and each
separate value is incorporated into the specification as if it were
individually recited herein. All methods described herein can be
performed in any suitable order unless otherwise indicated herein
or otherwise clearly contradicted by context. The use of any and
all examples, or exemplary language (e.g., "such as") provided
herein, is intended merely to better illuminate embodiments of the
disclosure and does not pose a limitation on the scope of the
disclosure unless otherwise claimed. No language in the
specification should be construed as indicating any non-claimed
element as essential to the practice of the disclosure.
[0104] Disjunctive language such as the phrase "at least one of X,
Y, or Z," unless specifically stated otherwise, is otherwise
understood within the context as used in general to present that an
item, term, etc., may be either X, Y, or Z, or any combination
thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is
not generally intended to, and should not, imply that certain
embodiments require at least one of X, at least one of Y, or at
least one of Z to each be present. Additionally, conjunctive
language such as the phrase "at least one of X, Y, and Z," unless
specifically stated otherwise, should also be understood to mean X,
Y, Z, or any combination thereof, including "X, Y, and/or Z."
[0105] Preferred embodiments of this disclosure are described
herein, including the best mode known to the inventors for carrying
out the disclosure. Variations of those preferred embodiments may
become apparent to those of ordinary skill in the art upon reading
the foregoing description. The inventors expect skilled artisans to
employ such variations as appropriate, and the inventors intend for
the disclosure to be practiced otherwise than as specifically
described herein. Accordingly, this disclosure includes all
modifications and equivalents of the subject matter recited in the
claims appended hereto as permitted by applicable law. Moreover,
any combination of the above-described elements in all possible
variations thereof is encompassed by the disclosure unless
otherwise indicated herein or otherwise clearly contradicted by
context.
[0106] All references, including publications, patent applications,
and patents, cited herein are hereby incorporated by reference to
the same extent as if each reference were individually and
specifically indicated to be incorporated by reference and were set
forth in its entirety herein.
* * * * *