U.S. patent application number 15/053941 was filed with the patent office on 2016-02-25 and published on 2016-08-18 for dialogue detector and correction.
The applicant listed for this patent is Pixel Instruments Corporation. Invention is credited to J. Carl Cooper, Christopher Smith, Mirko Vojnovic.
Publication Number | 20160241887
Application Number | 15/053941
Family ID | 44146144
Filed Date | 2016-02-25
United States Patent Application | 20160241887
Kind Code | A1
Cooper; J. Carl; et al. | August 18, 2016
Dialogue Detector and Correction
Inventors: Cooper; J. Carl (Reno, NV); Vojnovic; Mirko (San Jose, CA); Smith; Christopher (Suffield, CT)
Applicant:
Name | City | State | Country | Type
Pixel Instruments Corporation | Suffield | CT | US |
Family ID: 44146144
Appl. No.: 15/053941
Filed: February 25, 2016
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
12962551 | Dec 7, 2010 | 9305550
15053941 | |
61267393 | Dec 7, 2009 |
Current U.S. Class: 1/1
Current CPC Class: G10L 2015/025 20130101; G10L 25/78 20130101; G10L 25/57 20130101; H04N 21/233 20130101; G10L 19/005 20130101; G10L 15/02 20130101; G10L 15/24 20130101
International Class: H04N 21/233 20060101 H04N021/233; G10L 25/57 20060101 G10L025/57; G10L 15/02 20060101 G10L015/02; G10L 19/005 20060101 G10L019/005
Claims
1. In an electronic audio portion of an electronic radio or audio
video system with an audio program carried by multiple audio
channel signals, an apparatus for monitoring the presence of
correct audio sounds in one or more channels of the audio program,
the apparatus including: a) an input circuit for receiving a
dialogue channel of an audio signal, the dialogue channel intended
to carry spoken words heard by a listener; b) a phoneme detection
circuit to provide dialogue data in response to the repeated
presence in the dialogue channel of known ones of a plurality of
spoken phonemes corresponding to spoken vowels and/or consonants
created by the corresponding particular shape and/or movement of a
speaker's lips; c) a channel phoneme logic circuit responsive to
the dialogue data of b) and to parameters including the number of
spoken phonemes found to be present in the dialogue channel in a
known time period and operating to provide a report and/or alarm in
response thereto.
2. The apparatus as claimed in claim 1 wherein the known spoken
phonemes of b) are a plurality of particular spoken phonemes chosen
from the set of A, E, O, M, P, B and S and the dialogue data
indicates which of the spoken phonemes is present.
3. The apparatus as claimed in claim 1 wherein the known spoken
phonemes of b) are a plurality of particular spoken phonemes chosen
from the set of A, E and O and the dialogue data indicates which of
the spoken phonemes is present.
4. The apparatus as claimed in claim 1 wherein the parameters of c)
are used to determine if the known spoken phonemes which are found
in b) are as expected for the particular dialogue channel of a)
which is received.
5. The apparatus as claimed in claim 1 wherein the dialogue channel
is from an audio only system.
6. The apparatus as claimed in claim 1 wherein the parameters of c)
are automatically adjusted in response to the audio signal of
a).
7. The apparatus as claimed in claim 1 wherein the audio signal of
a) is part of a system which includes displayable moving images and
corresponding sounds and wherein the spoken words of the dialogue
channel correspond to displayed moving images of a speaking
person.
8. The apparatus as claimed in claim 1 wherein the audio signal of
a) is part of a system which includes displayable moving images and
corresponding sounds and further includes an input circuit
receiving displayable moving images and wherein the known spoken
phonemes of the dialogue channel correspond to displayed moving
images of a speaking person and in b) the dialogue data is further
responsive to the shape and/or movement of the speaker's lips shown
in the moving images corresponding to one or more of the known
spoken phonemes.
9. The apparatus as claimed in claim 7 further including an input
circuit for receiving displayable moving images and a viseme
detection and location circuit and a phoneme viseme logic circuit
which operate to verify dialogue is carried in the proper audio
dialogue channel thereby leading the viewer of the speaking person
carried in the displayable moving images to perceive sounds as
properly corresponding to the location of the displayable speaking
person.
10. The apparatus as claimed in claim 7 further including an input
circuit for receiving displayable moving images, the apparatus
further responsive to the presence in the system of particular
images of the displayable moving images and corresponding sounds
which have a high probability of occurring together in time, and
which presence of particular images and corresponding sounds are
used to verify the corresponding sound is being carried in the
proper channel of the audio signal of a) thereby leading the viewer
of the displayable moving images to perceive sounds as properly
corresponding to the location of the displayed moving images.
11. In an electronic audio video system such as a television system
with a television program which is intended for display, the
television program including visually displayable motion images
which include talking human speakers and audibly displayable sounds
including the words spoken by the speaker and other sounds
corresponding to the motion images and which sounds are carried by
multiple audio channel signals, an apparatus for monitoring the
placement of correct audio sounds in one or more audio channel
signals, the apparatus including: a) an input circuit for receiving
a plurality of audio channels carrying sound corresponding to
motion images including left, center and right audio channels
intended to carry words spoken by a speaker carried in displayable
motion images with the spoken words perceived by the viewer as
coming from a direction of left, center or right, from the
perspective of the viewer as the viewer watches the displayed
speaker; b) a phoneme detection circuit responsive to a plurality
of the left, center and right audio channels and operating to
determine the presence of particular phonemes pertaining to spoken
vowels and/or consonants therein created by a corresponding
particular shape and/or movement of lips of a speaker carried in
the displayable motion images and in response providing dialogue
data indicative of the presence of the particular spoken vowels and
consonants; and c) a channel phoneme logic circuit responsive to
the dialogue data of b) and to parameters including the number of
the phonemes present in at least one of the left, center and right
audio channels in a time period to determine if dialogue sound is
being carried in one or more of the left, center and right channels
and provide a report and/or alarm in response thereto.
12. The apparatus of claim 11 further including: d) an input
circuit for receiving displayable motion images and a viseme
detection circuit responsive to the displayable motion images to
provide viseme activity data related to the presence therein of
visemes formed by the lips of speakers in the displayable motion
images, with the viseme activity data further indicating the
location of the visemes as being in the left, center or right
direction, relative to a viewer, of the displayable image; and e)
wherein in c) a tracking report is generated, and a tracking alarm
may be generated, both in response to: i) known parameters; ii) the
number of phonemes present in each or all the left, center and/or
right audio channels; and iii) the number of visemes present in
each or all of the left, center and/or right directions
respectively.
13. The apparatus of claim 12 wherein in response to e) the sound
signals in a plurality of audio channels of a) are swapped to
correct a tracking error.
14. The apparatus of claim 12 wherein the parameters of i) include
a known period of time wherein in ii) the number of phonemes
present in the left, center or right audio channels which occur
during the time period and in iii) the number of visemes present in
the left, center or right directions which occur within the time
period, are compared as part of generating the alarm.
15. The apparatus of claim 12 wherein the parameters of i) include
a known period of time wherein the degree of coincidence of ii) the
number of phonemes present in the left, center or right audio
channels which occur during the time period and iii) the number of
visemes present in the left, center or right directions which occur
within the time period, are used in generating the alarm.
16. The apparatus of claim 12 further including in e) the report is
generated in response to parameters of i) including the degree of
coincidence of ii) the number of phonemes present in each of the
left, center and right audio channels and iii) the number of
visemes present in each of the left, center and right directions
respectively.
17. The apparatus of claim 11 further including: d) an input
circuit for receiving displayable motion images and a viseme
detection circuit responsive to the displayable motion images to
provide viseme activity data related to the presence therein of
particular visemes formed by the lips of speakers in the
displayable motion images which correspond to particular phonemes
present in the audio signal of b), with the viseme activity data
further indicating the location of the visemes present in the
displayable motion images; and e) wherein in c) an alarm is
generated when, within a known period of time, the number of
phonemes in the left, center or right audio channels does not
match, within known parameters, the number of visemes in
corresponding locations in the displayable motion images.
18. The apparatus of claim 17 wherein in e) the known period of
time is automatically adjusted in response to one or more of the
plurality of audio channels of a).
19. The apparatus of claim 17 wherein in d) the location of the
visemes in the displayable motion images is responsive to the range
of horizontal addresses in the displayable image occupied by the
lips forming the visemes.
20. The apparatus of claim 17 wherein in d) the location of the
visemes in the displayable motion images is responsive to the range
of horizontal addresses in the displayable image occupied by the
lips forming the visemes and the location is related to the lips
being on the left or right of the displayable motion images.
21. The apparatus of claim 19 wherein the left or right location of
the visemes is compared to the left or right channel phonemes of b)
and in response thereto the alarm of e) is generated.
22. In an electronic audio video system such as a television system
with an audio video program intended for display which system
includes video carrying motion images and corresponding multiple
audio channel audio signals including one or more dialogue channels
and a plurality of surround effects channels, an apparatus for
monitoring the placement of correct audio sounds in one or more
audio channels, the apparatus including: a) an input for receiving
a plurality of audio channels corresponding to displayable motion
images including a plurality of dialogue channels the primary
intent of which is to convey spoken words and a plurality of
surround effects channels the primary intent of which is to
convey non-speech sounds, the plurality of audio channels allowing,
when a viewer is viewing the motion images, the viewer to hear and
perceive placement of onscreen sounds relative to the motion
images; b) a phoneme detection circuit responsive to the plurality
of dialogue channels of a) including left and right dialog channels to
provide dialogue phoneme activity data related to the presence
therein of phonemes pertaining to one or more spoken vowels and/or
consonants selected from the group A, E, O, M, P, B and S created
by the particular shape and/or movement of a speaker's lips; c) a
phoneme detection circuit, which may be that in b) or different,
responsive to the plurality of surround effects channels of a) to
provide surround phoneme activity data related to the presence
therein of the same phoneme(s) selected in b); d) an input circuit
for receiving the displayable motion images and a viseme detection
circuit responsive to the particular shape and/or movement of the
speaker's lips in the displayable motion images to provide viseme
activity data related to the presence of viseme(s) corresponding to
at least the selected phoneme(s) of b); e) a channel logic circuit
operative to generate a tracking alarm in response to known
parameters and at least one of: i) the dialogue phoneme activity
data of b); ii) the surround phoneme activity data of c); iii) the
viseme activity data of d).
23. The apparatus of claim 22 wherein in response to element e) a
plurality of audio channels of a) are swapped to correct the
tracking error.
24. The apparatus of claim 22 wherein in element e) the tracking
alarm is provided when an insufficient number of phonemes is found
in a dialogue channel during a first time period or an excessive
number of phonemes is found in a sound effects channel during a
second time period.
25. The apparatus of claim 22 wherein in element e) an alarm is
provided when an insufficient number of phonemes is found in a
dialogue channel during a time period or an excessive number of
phonemes is found in a sound effects channel during the same time
period.
26. The apparatus of claim 24 wherein the first or second time
period is automatically adjusted in response to one or more audio
channels of a).
27. The apparatus of claim 25 wherein the time period is
automatically adjusted in response to one or more audio channels of
a).
28. The apparatus of claim 22 wherein in b) and c) the presence of
phonemes pertains to both a plurality of spoken vowels and a
plurality of spoken consonants and in d) the presence of visemes
pertains to the same plurality of spoken vowels and a plurality of
spoken consonants as in b) and c).
29. The apparatus of claim 22 wherein in element e) the tracking
alarm is provided when during a time period an insufficient number
of the phonemes present in the dialogue channels of b) match the
visemes present in the displayable motion images of d).
30. The apparatus of claim 22 wherein in element e) the tracking
alarm is provided when during a time period an insufficient number
of the phonemes present in the dialogue channels of b) match the
visemes present in the displayable motion images of d) and an
insufficient number of the phonemes present in the surround
channels of c) match the visemes present in the displayable motion
images of d).
31. The apparatus of claim 22 wherein in element d) the viseme
activity data further relates to the location of the visemes as
being in the left or right of the displayable image and in element
e) the tracking alarm is provided when during a time period an
insufficient number of phonemes is present in each of the left and
right dialog channels of b) which correspond respectively to
visemes present in the left and right of the displayable image.
32. The apparatus of claim 22 wherein in element d) the viseme
activity data further relates to the location of the visemes as
being in the left or right of the displayable image and in element
e) the tracking alarm is provided when during a time period an
excessive number of phonemes present in the left dialog channel of
b) correspond to visemes present in the right of the displayable
image.
33. In a television system with a television program, the
television program including motion images carried by a video
signal and further including multiple audio signal channels
corresponding to the motion images, an apparatus for monitoring the
placement of correct audio sounds in one or more audio channels of
the audio signal, the apparatus including: a) a receiving
circuit for receiving a plurality of audio channels corresponding
to displayable motion images including at least left and right
dialogue channels the primary intent of which is to convey spoken
words heard by a viewer of the motion images, the spoken words
including words spoken by a displayable human speaker as well as
words spoken by a human speaker which is outside of the displayable
image, and further including at least left and right surround
channels the primary intent of which is to convey non-speech
sounds which are perceived by the viewer as coming from a direction
relative to both inside or outside of the displayed motion image
frame; b) in response to the audio carried by the dialogue
channels, providing dialogue activity data in response to spoken
vowels and/or consonants present therein which are created by the
particular shape and/or movement of a speaker's lips, the dialogue
activity data also including information indicating the location of
the spoken vowels in the left, center or right direction relative
to at least the left and/or right dialogue channels; c) in response
to the audio carried by the surround channels, providing surround
activity data in response to surround sounds present therein other
than spoken words, the surround activity data also including
information indicating the location of the surround sounds in the
left, center or right direction relative to at least the left
and/or right surround channels and further including providing
surround dialogue data in response to the presence of spoken words
in one or more surround channels; d) a channel logic circuit
responsive to the dialogue activity data of b) and the surround
dialogue activity data of c) to provide a report and/or alarm in
response thereto.
34. The apparatus of claim 33 wherein in response to element d) a
plurality of audio channels of a) are swapped to correct the
tracking error.
35. An apparatus as claimed in claim 33 where in d) alarms are
provided for an event chosen from the list of: i) insufficient
dialogue activity data during a first time period; ii) excessive
surround dialogue activity during a second time period; iii)
insufficient matching of left or right dialogue phonemes and left
or right dialogue visemes respectively during a third time period;
iv) excessive matching of left or right dialogue phonemes and left
or right dialogue visemes respectively during a fourth time period;
and wherein the first, second, third and/or fourth time periods may
be the same.
36. An apparatus as claimed in claim 33 where in d) alarms are
provided for a plurality of events chosen from the list of: i)
insufficient dialogue activity data during a first time period, ii)
excessive surround dialogue activity during a second time period,
iii) insufficient matching of left or right dialogue phonemes and
left or right dialogue visemes respectively during a third time
period, iv) excessive matching of left or right dialogue phonemes
and left or right dialogue visemes respectively during a fourth
time period, wherein the first, second, third and/or fourth time
periods may be the same.
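As an illustration only, the count-and-threshold behavior recited for the channel phoneme logic in claims 1, 24 and 25 can be sketched as follows. The parameter names, the default values and the representation of detections as timestamps are assumptions made for this sketch and are not specified in the claims; phoneme detection itself is assumed to happen elsewhere and to supply the timestamps.

```python
from dataclasses import dataclass


@dataclass
class WindowParams:
    # Hypothetical parameters: the claims only refer to "known parameters".
    window_s: float = 10.0          # length of the known time period, in seconds
    min_dialogue_phonemes: int = 5  # fewer than this in a dialogue channel is "insufficient"
    max_effects_phonemes: int = 2   # more than this in an effects channel is "excessive"


def count_in_window(timestamps, start, window_s):
    """Count phoneme detections with start <= t < start + window_s."""
    return sum(1 for t in timestamps if start <= t < start + window_s)


def channel_alarms(dialogue_times, effects_times, start, params):
    """Return the alarm messages raised for one time window.

    dialogue_times and effects_times are lists of detection times (seconds)
    produced by an assumed upstream phoneme detector.
    """
    alarms = []
    if count_in_window(dialogue_times, start, params.window_s) < params.min_dialogue_phonemes:
        alarms.append("insufficient phonemes in dialogue channel")
    if count_in_window(effects_times, start, params.window_s) > params.max_effects_phonemes:
        alarms.append("excessive phonemes in effects channel")
    return alarms
```

Using one window for both checks corresponds to the "same time period" variant of claim 25; separate windows, as in claim 24, would only require passing different start and length values for each channel.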
Description
RELATED APPLICATION
[0001] This application claims priority from U.S. patent
application Ser. No. 12/962,551 filed Dec. 7, 2010 which in turn
claims benefit of U.S. Provisional Application No. 61/267,393,
filed Dec. 7, 2009, the disclosure of which is incorporated by
reference herein.
BACKGROUND
[0002] In modern television systems the sound portion of television
programs is frequently conveyed with the video signal via multiple
channels. For example, a typical system could include a video
channel and left and right sound channels, as in a stereo
television system. The well-known intent of using left and right
sound channels is to provide a spatially located sound to the
viewer whereby sounds created by images at a given location on the
television screen are perceived by the viewer as coming from that
location.
[0003] The corresponding images and sounds are known as mutual
events or MUEVs. When the audio and image MUEVs as perceived by the
viewer do not properly correspond they are annoying as the sound is
perceived to come from a different location than the image making
the sound. This is especially true for dialogue (e.g. speech of a
person in a one way or two-way conversation with another) when the
speaker is seen in a different location than the sound comes from.
Consider for example a two-way conversation between two
newscasters, one on the right of the screen and one on the left. If
the left and right sound channels are reversed, the right speaker's
speech will appear to come from the left side of the screen and
vice versa.
[0004] In systems including images and sound, it is important that
mutual events or MUEVs in audio and video are perceived by the
viewer as being spatially aligned. MUEVs are those events in the
video and sound which have a high probability of occurring
together, for example the instant change of direction of a thrown
baseball and the crack of the bat hitting the ball. Other MUEVs
include the shape and/or movement of a person's lips and the sound
being created. The video lip shapes are referred to as visemes or
the visual MUEV and the sounds as phonemes or the sound MUEV. MUEVs
however are not just visemes and phonemes but encompass
simultaneously occurring events which have a probability of being
related, such as the above baseball direction and bat crack
example.
[0005] In other systems, both audio only (for example radio) and
audio video (for example television), it is desired to
convey dialogue in a particular channel or channels. Because sound
signals in modern audio only and audio video acquisition and
production systems are frequently recorded and carried by multiple
sound channels, there is a possibility of the dialogue being
misplaced, that is of the dialogue being carried by the wrong audio
channel. It is also possible for dialogue to be lost entirely, for
example when sound is acquired via a sound effects channel which is
subsequently discarded.
[0006] As used in this specification and claims, if a system sound
channel conveys the proper sound signal (e.g. dialogue in the
proper channel(s) and/or leading the viewer to perceive sounds as
properly corresponding to the image location), the sound channel or
signal is said to (properly) track and if it does not convey the
proper sound signal the channel or signal is said to mistrack. For
example, if the left and right sound channel signals are reversed,
that is the left channel carries the right sound signal and vice
versa (sometimes called swapping), the sound signals mistrack. If
the dialogue sound signal is missing from the dialogue sound
channel(s), the sound signal mistracks.
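The track and mistrack terminology defined above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each channel's intended signal and the signal actually found in it are available as labels; the function names and the label strings are hypothetical and do not come from the specification.

```python
def mistracking_channels(intended, observed):
    """Return the sorted names of channels that do not carry their intended signal."""
    return sorted(ch for ch in intended if observed.get(ch) != intended[ch])


def find_swaps(intended, observed):
    """Return pairs of channels whose signals appear to have been exchanged."""
    swaps = []
    chans = sorted(intended)
    for i, a in enumerate(chans):
        for b in chans[i + 1:]:
            if (intended[a] != intended[b]
                    and observed.get(a) == intended[b]
                    and observed.get(b) == intended[a]):
                swaps.append((a, b))
    return swaps


# Hypothetical example: the left and right signals are reversed.
intended = {"left": "left sound", "right": "right sound"}
observed = {"left": "right sound", "right": "left sound"}
print(mistracking_channels(intended, observed))  # both channels mistrack
print(find_swaps(intended, observed))            # one swapped pair
```

A channel whose dialogue is missing altogether would appear in the mistracking list without belonging to any swapped pair, which matches the two examples given in the paragraph above.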
[0007] As another example of multiple channel sound systems, the
sound of the performers in the television program is conveyed via
left and right sound dialogue channels whereas sound effects such
as music and other non-speech sounds are conveyed by left and right
sound effects channels. Another example is 5.1 channel sound,
sometimes referred to as 3-2 stereo, with a center dialogue
channel, front left and right dialogue channels, rear left and
right effects channels, and a low frequency effects channel.
[0008] Yet another example of a multiple channel sound system is
the Japan Broadcasting Corporation (NHK) experimental Super
Hi-Vision television having 22.2 sound channels. These channels are
grouped relative to the viewer as 9 above the ear, 10 ear level, 3
below the ear and 2 low frequency effects channels. The various
sound channels surround the viewer to provide a highly realistic
audio sensation where the sound can be perceived as coming from
anywhere within about 300 degrees vertically and 360 degrees
horizontally, depending on the location of the viewer relative to
the sound transducers (e.g. speakers).
[0009] Due to widespread audio processing, for example program
conversion between different sound systems, and other problems such
as poor microphone placement, incorrect wiring, equipment failures
and operator error, the sound signals often find their way into the
wrong sound channels. For example, having the dialogue carried in
the wrong channel can cause problems for the viewer ranging from
annoying sound to loss of dialogue audio.
[0010] For example if the left and right channels in a two channel
system are reversed the location of the sound does not match the
location of the image, such as when a person on the left of the
image frame is talking but the sound comes from the right sound
transducer (speaker). As another example consider the NHK system
where the sound which the viewer perceives is intended to come from
various directions around the viewer including from ear level,
higher and lower directions to correspond to the images which are
displayed to the viewer (or previously or about to be displayed to
the viewer). In this system if a sound signal is placed in the
wrong channel various annoying effects can occur, such as a
speaking person located to the viewer's lower right being heard
behind, above, to the left or in some other direction different
from where the viewer sees the image of the person speaking.
[0011] Also, it is important that sound that corresponds to images
not displayed to the viewer or not yet or previously displayed to
the viewer, be in the correct channel. For example consider a
television scene of a person in the middle of the frame carrying on
a conversation with an unseen person to the right side. If the
center dialogue channel and the right front channel are reversed
the conversation will appear unnatural.
[0012] As another example consider a television program which
conveys an airplane flying at low level from behind the viewer, to
above the viewer and on to be displayed in front of the viewer. The
sound will start from behind, progress to above and further
progress to in front of the viewer. In this instance the sound from
behind and from above will correspond to an image not yet seen by
the viewer. Of course the opposite will happen if the aircraft is
flying from the front of the viewer to behind the viewer. In this
instance the sound from above and behind the viewer corresponds to
an image previously displayed.
[0013] In all situations it is important to have the sound
perceived by the viewer as corresponding to the location of the
image creating the sound, i.e. tracking the image location. This is
true even when the image is not currently displayed, or is in a
location that is not being displayed at the instant, such as behind
the viewer.
[0014] It is of course possible that the image is never displayed
but nevertheless the sound signals need to track. As an example
similar to that above, consider a conversation between two people,
one located in front of the viewer and seen on the image frame, the
other located behind the viewer, walking from side to side, and
never seen. If the second person's sound signal mistracks, the
viewer could hear the sound from behind and to the viewer's right
whereas he would see the first person looking toward the viewer's
left. If the unseen person were walking about as he talked, the
viewer would see the first person following the unseen person but
if the unseen person's sound signals mistrack, the visual and
audio cues to the unseen person's location would
be inconsistent.
[0015] In television, film and other systems which provide images
to the viewer in more than one direction, such as wide screen (e.g.
16×9), specialized surround projection systems (e.g. IMAX),
or systems providing three dimensional or simulated three
dimensional images (e.g. 3D-TV), it is likewise important that the
sound matches the viewer's perceived image location. When the sound
is not present in the correct sound channel this perception is
negatively affected. Mistracking sound signals will cause
conflicting audio and visual cues which can be annoying to the
viewer.
[0016] As another example of problems with sound not being in the
proper channel, when the dialogue audio is carried in the wrong
channel or not carried in all the proper channels, a loss of
dialogue can occur, for example when the television program is
passed through equipment which is incapable of handling all of the
audio channels and those containing dialogue are discarded. Such is
the case when a television program having center, left and right
dialog channels and rear effects channels is passed through an
audio signal processing device that can only handle left and right
dialogue channels. If the sound is only located in the center
dialog channel and the audio signal processing device discards or
otherwise never utilizes the center channel, the dialogue that was
only in the center channel will be lost. Generally, whenever there
is a mistracking sound signal there is a risk of important sounds
being lost.
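The loss scenario described above, in which equipment that handles only some channels discards the one carrying dialogue, can be sketched as follows. The channel names, the content labels and both function names are hypothetical; the sketch assumes the content type of each channel is already known.

```python
def pass_through(program, supported):
    """Keep only the channels the device supports; all others are discarded."""
    return {name: content for name, content in program.items() if name in supported}


def dialogue_lost(program, supported):
    """True if the program carried dialogue before the device but not after it."""
    had = any("dialogue" in content for content in program.values())
    has = any("dialogue" in content for content in pass_through(program, supported).values())
    return had and not has


# Hypothetical program: dialogue only in the center channel.
program = {
    "center": {"dialogue"},
    "left": {"music"},
    "right": {"music"},
    "rear_left": {"effects"},
    "rear_right": {"effects"},
}
print(dialogue_lost(program, supported={"left", "right"}))  # True: center was discarded
```

If the same dialogue were also carried in the left or right channel, the function would return False, which is the "carried in all the proper channels" case mentioned in the paragraph.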
[0017] In the prior art it is known to detect the presence of audio
in one or more audio channels and sound an alarm if the channel is
silent for a predetermined period of time. One such system is
described by Basse in U.S. Pat. No. 7,424,160 wherein in FIG. 7 the
flow diagram of an audio silence detector is shown. Basse's system
does not distinguish the type of audio which is present, and
consequently it would not catch missing dialogue in a dialogue
channel which is carrying sound effects.
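The limitation can be made concrete with a small sketch contrasting a level-based silence check with a check on dialogue content. This is not Basse's implementation: the RMS threshold, the function names and the idea of passing in a phoneme count from a separate detector are all assumptions made for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of audio samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def silence_alarm(samples, threshold=0.01):
    """Alarm when the channel level is below the threshold (assumed value)."""
    return rms(samples) < threshold


def dialogue_alarm(phoneme_count, min_phonemes=1):
    """Alarm when too few phonemes were found by an assumed upstream detector."""
    return phoneme_count < min_phonemes


# A dialogue channel that is wrongly carrying loud sound effects and no speech:
effects_block = [0.5, -0.5, 0.5, -0.5]
print(silence_alarm(effects_block))     # False: the channel is not silent
print(dialogue_alarm(phoneme_count=0))  # True: no dialogue was detected
```

The silence check passes because sound is present, while the content-aware check flags the channel, which is the distinction the paragraph draws.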
[0018] Basse does mention that system operators desire to monitor
their systems to ensure quality audio and video reaches the viewers,
and describes prior systems such as cable TV systems where employees
monitored the quality. Basse also points out that hiring employees
to monitor every channel in a system can be
expensive and notes several problems with utilizing employees to
monitor modern systems consisting of as many as 800 TV
channels.
[0019] Generally, as Basse suggests, in television, film and other
systems using multiple channel sound it is desirable to have a
human operator monitor the sound to ensure that each sound signal
has been properly assigned to its corresponding channel. As the
number of sound channels increases the task of monitoring becomes
more difficult, and as the number of systems to be monitored grows, such
as in the aforementioned 800 TV channel systems, the number of
operators required for proper monitoring increases
dramatically.
[0020] Typically, due to the costs involved, proper dialogue
presence and spatial sound location monitoring is not performed in
modern systems. The monitoring task falls to a single operator who
performs occasional checking. The use of occasional checking leads
to errors not being discovered promptly and in some systems they
may not be discovered for an entire program.
[0021] What is needed is an automated system which can monitor
a particular sound type such as dialogue to ensure it is carried
properly and monitor a sound's spatial location to ensure it properly
matches the corresponding video location.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] FIG. 1 is a diagram of an embodiment of the invention
utilized for detecting the loss of proper tracking in a single or
multiple channel audio system.
[0023] FIG. 2 is a diagram of an embodiment of Channel Phoneme
Logic 4 of FIG. 1.
[0024] FIG. 3 is a diagram of a circuit to correct mistracking in a
multiple channel audio system.
[0025] FIG. 4 is a partial diagram of an embodiment of the
invention utilized for detecting the loss of proper tracking in a
system having multiple channel audio and corresponding video.
[0026] FIG. 5 is a partial diagram of an embodiment of the
invention utilized for detecting the loss of proper tracking in a
system having multiple channel audio and corresponding video.
[0027] FIG. 6 is a diagram of a circuit to correct mistracking in a
system having multiple channel audio.
[0028] FIG. 7 is a flow chart demonstrating one embodiment of the
present invention.
DETAILED DESCRIPTION
[0029] The inventive concepts disclosed herein provide an automated
system and method for monitoring, reporting and correcting the
above described single and multiple channel sound errors caused by
one or more instances of a wrong or no particular type of sound
signal (as distinguished from no sound at all) being conveyed in a
sound channel.
[0030] The embodiment of the present invention includes detection
of tracking or mistracking of one or more particular types of sound
signals in a single or multiple channel sound only or sound and
image system.
[0031] Of course, one of ordinary skill may practice the present
invention in apparatus and methods which may combine the inventive
concepts known from the description herein along with
indiscriminate detection of the presence of any sound in a
particular channel or channels.
[0032] The description of various embodiments of the present
invention also includes detection of tracking or mistracking of one
or more sound signals in a system including multiple channel sound
signals and a corresponding image signal. Corresponding sounds and
images, for example MUEVs, may be utilized to determine if a
particular sound signal is carried by the proper sound channel(s)
such that the sound MUEVs match the image MUEVs in viewer perceived
spatial location. It may also be determined if particular types of
sound MUEVs are being carried by the intended channel(s) in a
system or otherwise correctly carried and presented to the
viewer.
[0033] In one exemplary embodiment phonemes corresponding to
dialogue are detected in one or more sound signals conveyed by one
or more channels and the appropriateness of the detected phonemes
being present in the sound channel is determined. When an
inappropriate presence (or absence) of phonemes is determined an
action such as one or more of a report, an alarm or a correction is
performed.
[0034] The exemplary embodiment includes making the determination
of the appropriateness of the presence (or absence) of phonemes in
one or more sound channels in response to past, current and/or
future images such that the location of the sound may match the
location of the particular image creating the sound.
[0035] The exemplary embodiment of the invention includes the
ability to automatically correct errors by switching one or more
sound signals from one sound channel to another in response to the
determination of the appropriateness of a sound signal being
conveyed in a sound channel.
[0036] FIG. 1 shows a diagram of one exemplary embodiment of the
present invention as utilized for detection of the loss of proper
tracking in a single or multiple channel audio system. The audio
system may be audio only, audio and video or otherwise as will be
known from the present teachings. One or more audio signals of the
M audio channel(s) 1 is coupled to a phoneme/activity detector 2
which operates to identify activity and the occurrence of phonemes
in that audio signal. The detector 2 outputs phoneme signal(s) and
channel activity signal(s) (or alternatively information
corresponding to the detected phonemes and activity) for each of
the M audio channels to channel phoneme logic 4 via N signals 3.
Channel phoneme logic 4 operates to determine if the channel
activity and/or phonemes are proper for the particular channel(s)
and thus whether the corresponding input audio signal properly
tracks. It will be appreciated that the absence of a loss of proper
tracking is a confirmation of proper tracking.
[0037] The description of the exemplary embodiment given herein
will now generally be given with respect to multiple audio channels
as well as in respect to audio video systems. It will be understood
that the descriptions of the exemplary embodiment will also be
applicable to single channel as well as audio only
applications.
[0038] Upon determination of the tracking (or lack thereof) that
determination is reported for each desired channel and if desired
an alarm is generated, via O signals 5. It may be noted that while
it is preferred to detect both phonemes and activity in the audio
signal, a lack of activity may be inferred from a lack of phonemes
and vice versa; that inference may be relied on in lower
performance systems. In particular, the activity detection
may be omitted if desired to save costs. While phonemes and
activity are detected and utilized for dialogue tracking
correction, the current invention may detect audio characteristics
other than the above mentioned phonemes and activity. For example,
the audio MUEV discussed below may be used as audio characteristics
for the current invention, where the audio MUEV may include
non-phoneme sounds such as applause which is commonly found in
audio dialogue. The audio characteristics in this specification
include phonemes, activity, and non-phoneme sound.
[0039] The phoneme detection which is utilized in 2 may be that
disclosed in U.S. Patent Application 2007/0153089. Alternatively,
the mouth sound or audio MUEV detection disclosed in U.S. Pat. No.
7,499,104 may be utilized to detect particular mouth sounds or
MUEVs which are appropriate for a given audio channel. It is
preferred that the output 3 indicates MUEVs which are appropriate
for the expected type of audio signal corresponding to the audio
channel being monitored. For example, if the audio channel being
monitored is a center dialogue channel, then detection of audio
MUEVs such as phonemes corresponding to vowels A, E, and O is
performed since these should be frequently found in the center
dialogue channel. Other phonemes may be detected as well, for
example the sounds M, P, B, N and S. The detection of such phonemes
is described in U.S. Patent Application 2007/0153089. Note that
these phonemes are also MUEVs in that they will correspond to
visual images and in particular the shape of the lips of the
speaker, as will be discussed below with respect to FIGS. 4 and
5.
[0040] Activity may be detected by the presence of a known number
of one or more types of phonemes, or may be otherwise determined as
is known in the prior art. For example, an inspection of the audio
signal may be performed to determine if there is significant energy
in frequency bands corresponding to those normally carrying voice
sounds. Such a system is described for example in
U.S. Pat. No. 6,836,295 and in particular elements 21-23 of FIG. 2
of the '295 disclosure. The output of elements 23 corresponding to
the desired frequency bands will provide an indication of activity
for that band.
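By way of a non-limiting illustration only, the inspection of energy
in voice frequency bands described above may be sketched in Python as
follows. The function names, the 300-3400 Hz voice band and the 0.5
energy ratio are assumptions chosen for this sketch; they are not
taken from the '295 disclosure, and a direct DFT is used purely for
brevity.

```python
import math


def band_energy(samples, sample_rate, f_lo, f_hi):
    """Sum the squared DFT magnitudes of the bins lying in [f_lo, f_hi] Hz."""
    n = len(samples)
    if n == 0:
        return 0.0
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if not (f_lo <= freq <= f_hi):
            continue
        angle = 2.0 * math.pi * k / n
        re = sum(s * math.cos(angle * i) for i, s in enumerate(samples))
        im = sum(s * math.sin(angle * i) for i, s in enumerate(samples))
        total += re * re + im * im
    return total


def voice_activity(samples, sample_rate, f_lo=300.0, f_hi=3400.0, ratio=0.5):
    """Return True when the assumed voice band holds at least `ratio` of the energy."""
    total = band_energy(samples, sample_rate, 0.0, sample_rate / 2.0)
    if total == 0.0:
        return False  # silence: no activity
    return band_energy(samples, sample_rate, f_lo, f_hi) / total >= ratio


def tone(freq, sample_rate=8000, n=80):
    """Hypothetical test signal: n samples of a sine wave at freq Hz."""
    return [math.sin(2.0 * math.pi * freq * i / sample_rate) for i in range(n)]
```

In this sketch a 1 kHz tone falls inside the assumed voice band and is
reported as activity, whereas a 100 Hz tone or silence is not. As
noted elsewhere herein, such a band detector indicates activity only
and does not discriminate phonemes from other sounds in the band.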
[0041] U.S. Patent Application 2007/0153089 and U.S. Pat. No.
6,836,295 are incorporated herein by reference in respect to their
teachings of methods and apparatus suitable for use in practicing
components of the present invention.
[0042] As the phonemes (MUEVs) are detected in 2 they are output
via 3 to the channel phoneme logic 4. Alternatively, information of
the detection of the phonemes may be coupled to 4, for example the
number of phonemes detected may be reported every second along with
a reporting of the presence of audio in the speech frequency range
since no MUEVs are to be expected when there is silence or only
audio in frequencies where speech is not commonly found.
[0043] Channel phoneme logic 4 operates to analyze the presence of
activity and phonemes detected by 2 and report the results via 5 to
other circuitry or operations. If desired 4 can operate to set an
alarm when an insufficient number of phonemes is detected within a
given period of time. As an example, if no phonemes are detected by
2 within a 3-minute period when audio is present, an alarm can be
set. In the exemplary embodiment it is desired that the operator
can set parameters independently for reporting and setting alarms.
The exemplary parameters are the number of phonemes and the time
duration during which that number are determined. For example,
reporting parameters can be set to report if fewer than one hundred
phonemes are detected within a three-minute period and alarm
parameters can be set to create an alarm if fewer than ten phonemes
are detected within a five-minute period. Other parameters and
settings will be known to the person of ordinary skill from the
teachings herein. As one other example, parameters may be set for
individual phonemes, such as particular ones of vowels, or for
groups of phonemes such as vowels and consonants. MUEVs other than
phonemes may also be detected, for example applause which is
commonly found in audio dialogue channels. An exemplary embodiment
of 4 which operates in a somewhat different manner is described
with respect to FIG. 2.
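The independently settable reporting and alarm parameters described
above may be sketched, under stated assumptions, as a sliding time
window over phoneme detections. The class name PhonemeMonitor, the
use of timestamps in seconds and the audio_present flag are
hypothetical and serve only to illustrate the two parameters (number
of phonemes and time duration).

```python
from collections import deque


class PhonemeMonitor:
    """Flags a shortfall of phonemes within a sliding time window."""

    def __init__(self, window_s, min_count):
        self.window_s = window_s    # time duration parameter, in seconds
        self.min_count = min_count  # number-of-phonemes parameter
        self.times = deque()

    def record(self, t):
        """Note one phoneme detected at time t (seconds)."""
        self.times.append(t)

    def triggered(self, now, audio_present=True):
        """True if fewer than min_count phonemes fall in (now - window_s, now]."""
        cutoff = now - self.window_s
        while self.times and self.times[0] <= cutoff:
            self.times.popleft()
        # No decision during silence or before one full window has elapsed.
        if not audio_present or now < self.window_s:
            return False
        return len(self.times) < self.min_count


# Example parameters from the description: report below one hundred
# phonemes in three minutes, alarm below ten phonemes in five minutes.
report = PhonemeMonitor(window_s=180, min_count=100)
alarm = PhonemeMonitor(window_s=300, min_count=10)
for t in range(0, 300, 6):  # hypothetical input: one phoneme every 6 s
    report.record(t)
    alarm.record(t)
```

With the hypothetical input shown, the reporting condition is met at
the five minute mark while the alarm condition is not, which
illustrates why separate parameters for reporting and alarms are
useful.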
[0044] FIG. 2 shows one exemplary embodiment of channel phoneme
logic 4 when operated with a four channel sound system having left
front 7 and right front 8 and left surround 6 and right surround 9
channels. Normally dialogue is carried by the left and right front
channels 7 and 8 and only sound effects are carried by the left and
right surround channels 6 and 9. The circuit of FIG. 2 operates to
check that dialogue is present in the left and right front channels
and not in the left and right surround channels.
[0045] In FIG. 2 only phoneme information is provided from 2,
activity being inferred from the presence and amount of phonemes.
The phoneme input information for dialogue channels 7 and 8 is
added to provide combined dialogue information 10 and the phoneme
input information for the surround channels is added to provide
combined surround information 11. The two combined channels are
compared to determine if the dialogue channel phonemes are greater
than the surround channel phonemes as should be the case if the
four channels are properly tracking. A true output on 12 of the
comparison will indicate more phonemes in the left and right front
channels than in the surround channels. This true output on 12 in
turn indicates proper tracking. If for example the dialogue and
surround channels are reversed, then the output 12 of the
comparison will be false and will indicate the mistracking.
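The adding and comparing operations of FIG. 2 may be sketched as
follows. The function name and the dictionary keys are hypothetical;
the reference numerals in the comments are those of FIG. 2.

```python
def dialogue_tracks(counts):
    """Return the comparison output 12: True indicates proper tracking.

    `counts` maps each channel name to the number of phonemes detected
    in that channel over the same period.
    """
    dialogue = counts["left_front"] + counts["right_front"]        # combined 10
    surround = counts["left_surround"] + counts["right_surround"]  # combined 11
    return dialogue > surround


proper = {"left_surround": 2, "left_front": 60,
          "right_front": 55, "right_surround": 1}
reversed_channels = {"left_surround": 60, "left_front": 2,
                     "right_front": 1, "right_surround": 55}
```

Here dialogue_tracks(proper) is true and
dialogue_tracks(reversed_channels) is false, the latter corresponding
to reversed dialogue and surround channels.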
[0046] The output 12 is coupled to a filter 14 which operates to
reduce or prevent false reporting and alarms which might otherwise
happen, for example if there is silence or noise in the front
channels or in a momentary presence of dialogue in the surround
channels. It is preferred that filter 14 operate as a recursive
filter requiring the presence of mistracking signals from 12 for a
period of three minutes before reporting or for five minutes before
setting an alarm at 13. This recursive filtering in effect provides
a running average of the conditions. One skilled in the art will
know from the teachings herein to utilize different time periods as
well as different operations for 14 to suit particular applications
and desired performance tradeoffs.
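One possible form of the recursive filtering, given only as a sketch
under stated assumptions, is a first order recursive (exponential)
running average of the mistracking indication. The class name, the
one update per second rate and the 0.5 threshold are assumptions; the
coefficient is chosen so that a continuous mistracking indication
reaches the threshold after approximately the hold time.

```python
class RecursiveFilter:
    """First order recursive running average of a true/false indication."""

    def __init__(self, hold_s, threshold=0.5):
        # With one update per second and a constant mistracking input,
        # the average after n updates is 1 - (1 - threshold) ** (n / hold_s),
        # so the threshold is reached at about n = hold_s.
        self.alpha = 1.0 - (1.0 - threshold) ** (1.0 / hold_s)
        self.threshold = threshold
        self.average = 0.0

    def update(self, mistracking):
        """Feed one indication from 12; return True once the average is high."""
        x = 1.0 if mistracking else 0.0
        self.average += self.alpha * (x - self.average)
        return self.average >= self.threshold


report_filter = RecursiveFilter(hold_s=180)  # three minutes before reporting
alarm_filter = RecursiveFilter(hold_s=300)   # five minutes before an alarm
```

Because the average decays as well as rises, momentary dialogue in
the surround channels or brief silence in the front channels does not
by itself cause a report or alarm in this sketch.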
[0047] Filter 14 also operates to inspect the number of phonemes in
10 and 11 and to infer activity from those numbers. If the phoneme
numbers are very low in 10 or approximately the same in both 10 and
11 it is likely that 12 may not accurately indicate tracking and
reporting and alarms are to be inhibited until the number of
phonemes present on 10 or 11 rises above a known amount and remains
so for a known period of time. In one exemplary embodiment it is
shown that the known amount be ten phonemes within a period of one
minute (without recursive filtering). One skilled in the art will
know from the teachings herein to utilize different time periods as
well as different operations for 14 to suit particular applications
and desired performance tradeoffs.
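The inhibiting of reporting and alarms may be sketched as follows.
The function name and the similarity fraction of 0.2 are assumptions;
the known amount of ten phonemes per one minute period is the example
value given above. The sketch follows the reading that the count on
either 10 or 11 must rise above the known amount.

```python
def inhibit(dialogue_count, surround_count, known_amount=10, similarity=0.2):
    """Return True when comparison output 12 should not be trusted.

    Counts are phonemes on combined information 10 and 11 over one minute.
    """
    busiest = max(dialogue_count, surround_count)
    if busiest <= known_amount:
        return True  # too few phonemes to infer activity
    # Approximately equal counts: the comparison is not meaningful.
    return abs(dialogue_count - surround_count) <= similarity * busiest
```

For example, counts of 3 and 1, or of 50 and 48, would inhibit
reporting in this sketch, whereas counts of 50 and 5 (in either
order) would not.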
[0048] The exemplary embodiment description above is given by way
of a simplified example and one of ordinary skill in the art will
recognize that there are some fault modes which will not be
detected, for example if left surround and left front are swapped
it may give roughly equal numbers of MUEVs in 10 and 11. If it is
desired to detect such faults each channel should be analyzed
individually.
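The fault mode just described, and the individual channel analysis
which detects it, may be illustrated by the following sketch. The
table of expected channel contents, the function name and the
threshold of ten phonemes are assumptions made for illustration.

```python
# Hypothetical table: which channels are expected to carry dialogue.
EXPECTS_DIALOGUE = {
    "left_surround": False,
    "left_front": True,
    "right_front": True,
    "right_surround": False,
}


def mistracked_channels(counts, known_amount=10):
    """List channels whose dialogue presence differs from what is expected."""
    faults = []
    for channel, expects in EXPECTS_DIALOGUE.items():
        has_dialogue = counts[channel] > known_amount
        if has_dialogue != expects:
            faults.append(channel)
    return sorted(faults)


# Left surround and left front swapped: the combined counts are
# 1 + 55 = 56 (front) and 60 + 2 = 62 (surround), roughly equal.
swapped_left = {"left_surround": 60, "left_front": 1,
                "right_front": 55, "right_surround": 2}
```

Applied to swapped_left, the per channel check names left_front and
left_surround even though the combined comparison is inconclusive. In
practice such a check would be combined with activity information so
that silence in all channels is not reported as a fault.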
[0049] The previous description of the exemplary embodiment
operation of 2 and 4 is given by way of example for teaching the
inventive concepts of the present invention to the person of
ordinary skill in the art. It will be understood that it is desirable
to include reporting and alarm logic within 14 which will operate
as a missing signal detector in order to respond to the absence of
audio on 10 or 11. It will also be desirable to inspect each of the
input channels individually (without combining) for the presence of
sound and phonemes as well as using both phoneme and activity
information provided by 2. Such operations will require more
complexity but will achieve better detection and reliability, with
the implementation of that added complexity being within the skill
of one of ordinary skill from the teachings herein.
[0050] One of ordinary skill in the art will know to utilize other
types and methods of channel phoneme logic for 4 in order to meet
particular reporting and alarm requirements of a given application
of the invention as will be apparent from the teachings herein. In
particular, various of the operations described for 2 and 4 are
well suited to implementation with memory such as random access,
read only and programmable read only types or in programmable array
logic such as that provided by Xilinx and Altera, as well as
implementation in a general purpose computer or microprocessor
running particular software to convert the general purpose device
to a specific device. It is also possible to combine various
memory, programmable array logic and software controlled circuitry
to practice the invention described herein as will be known to the
person of ordinary skill in the art.
[0051] The channel phoneme logic 4 of FIG. 2 operates to detect a
swapping of dialogue and surround channels and to report and set an
alarm in the event of mistracking. In that event it is desirable to
correct the mistracking by redirecting the audio signals into the
proper audio channel.
[0052] FIG. 3 shows a circuit which performs correction of
mistracking when dialogue and surround channels are swapped. The
left surround 15, left front 16, right front 17 and right surround
18 signals are coupled to a four pole double throw switch 24 which
is responsive to the alarm 23 from 4. In the switch normal position
shown the input signals 15-18 are output as the same signals left
surround 19, left front 20, right front 21 and right surround 22.
When a mistracking is detected by 4 an alarm 23 is set and the
switch 24 is caused to move to the other position thus returning
the mistracked audio signals to their proper channels.
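The action of switch 24 may be modeled in software as in the
following sketch. The function name is hypothetical, and the pairing
of each surround signal with the front signal on the same side is an
assumption, since the description does not state which front and
surround signals are exchanged.

```python
def switch_24(signals, alarm):
    """Route inputs 15-18 to outputs 19-22.

    `signals` is (left_surround, left_front, right_front, right_surround).
    With no alarm the signals pass straight through; with alarm 23 set,
    front and surround are exchanged on each side.
    """
    left_surround, left_front, right_front, right_surround = signals
    if alarm:
        return (left_front, left_surround, right_surround, right_front)
    return (left_surround, left_front, right_front, right_surround)
```

Applying the switched position twice restores the original routing,
as would be expected of a double throw switch.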
[0053] As described above there are systems which provide multiple
channel audio where it is desirable to inspect the audio in respect
to the corresponding images to ensure that the sound properly
tracks the images. For example, when an actor is located on the
left of the image frame the sound of that actor should be carried
by the left sound channel.
[0054] FIGS. 4 and 5 show an exemplary embodiment of the invention
where image information is used in conjunction with sound
information to ensure proper spatial relationships. The system
illustrates a scenario of operating with a multiple channel audio
input 1. A phoneme/activity detection 2, such as that of FIG. 1 is
utilized to provide phoneme and activity information 3 for each
input channel. For purposes of illustration the left and right
channels will be considered, although the previously described
dialogue and surround channel operation example of FIGS. 2 and 3
will be understood to be incorporated as well. The present
description however will be limited to the left and right channel
operation.
[0055] FIG. 4 also shows a video input 25 which is coupled to a
viseme detection 26 and a frame spatial location element 30. Viseme
detection 26 may operate for example as described in U.S. Patent
Application 2007/0153089. Alternatively, the mouth shape or image
MUEV detection disclosed in U.S. Pat. No. 7,499,104 may be utilized
to detect particular mouth shapes or MUEVs which are appropriate
for a given video channel.
[0056] Viseme detection 26 outputs visemes via 27 which are coupled
to a viseme location operation 28. Frame spatial location 30
operates to output frame left right information via 31. The frame
left right signal indicates whether each viseme on 27 is located in
the left or right side of the image frame. Viseme location
operation 28 receives both the viseme and location information and
in response thereto outputs left and right visemes via 29.
[0057] Although shown in FIG. 4 as a separate operation 30, the
frame left right signal on 31 is a byproduct of the viseme
detection operation 26 since when a viseme is found as disclosed in
U.S. Patent Application 2007/0153089 it is known where in the image
frame it is located. For example, the mouth shapes are found by
first locating the face in the frame, then locating the lips in the
face and then determining the shape of the lips. It is merely
required to convert the precise location of the lips within the
frame to a left/right signal by comparing the horizontal address of
the viseme to the center point address of the frame.
[0058] As a simplified example, if the frame is 1920 pixels wide,
the horizontal address of the viseme (e.g. the lips) will range
from 0-1919. By comparing that viseme address to 960 to determine
which is the larger, the output of that comparison will indicate
left (less than 960) or right (960 or greater). Of course in
practice the horizontal address of the viseme will be a range of
addresses corresponding to the size of the viseme. It is desired to
utilize the middle or average address for the comparison. For
example, if the viseme is 12 pixels wide the address might range
from 30 to 42 and the middle address 36 would be compared to
960.
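The left/right comparison of this simplified example may be sketched
as follows. The function name and its arguments are hypothetical; the
frame width of 1920 pixels and the center address of 960 are those of
the example above.

```python
def viseme_side(x_start, x_end, frame_width=1920):
    """Classify a viseme as left or right of the frame center.

    x_start and x_end are the first and last horizontal addresses
    occupied by the viseme; the middle address is compared to the
    center point address of the frame.
    """
    middle = (x_start + x_end) // 2
    center = frame_width // 2
    return "left" if middle < center else "right"
```

For the viseme of the example, viseme_side(30, 42) compares the
middle address 36 to 960 and indicates left.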
[0059] FIG. 5 receives the left and right channel phoneme and
activity information via 3 from 2 and the left and right viseme
information via 29 from 28. A left and right
phoneme/activity/viseme logic operation 32 operates to inspect the
location of the visemes on the image frame in comparison to the
corresponding phoneme. For example, a left side viseme would be
expected to have consistently corresponding phonemes present in the
left front audio signal. If however the corresponding phonemes are
present in one of the other three channels the audio mistracks.
[0060] In FIG. 5, logic 32 is illustrated as operating in response to
each viseme to identify and spatially locate its corresponding
phoneme. It inspects the corresponding audio signal (e.g. if the
viseme is in the left of the frame the corresponding audio signal
is the left front audio) and if the corresponding phoneme is found
it then inspects the remaining channels to see if a corresponding
phoneme is found in one or more of them. The outcome of the
inspections and the resulting conditions are noted. For example, a
corresponding phoneme is in the right channel and none of the
others, or in the right channel and one or more of the others, or
not in the right channel but in one or more of the others, or is
not found in any of the channels. Inspections are performed for
other visemes and the conditions are noted.
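The four conditions noted by logic 32 may be sketched as a simple
classification. The function name and the condition labels are
hypothetical and are chosen only to mirror the four cases listed
above.

```python
def classify_phoneme_location(expected_channel, channels_with_phoneme):
    """Name the condition found for one viseme.

    expected_channel is the channel matching the viseme's location in
    the frame; channels_with_phoneme lists every channel in which the
    corresponding phoneme was found.
    """
    found = set(channels_with_phoneme)
    in_expected = expected_channel in found
    elsewhere = bool(found - {expected_channel})
    if in_expected and not elsewhere:
        return "expected_only"
    if in_expected and elsewhere:
        return "expected_and_others"
    if elsewhere:
        return "others_only"
    return "not_found"
```

In this sketch a result of others_only for a viseme at the far side
of the frame is the finding that, if repeated consistently, suggests
mistracking.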
[0061] Similar to the description of the filter 14 of FIG. 2, the
results of the inspections to find corresponding phonemes are also
filtered to ensure that there is sufficient activity in the video
and audio channels and also recursively filtered to provide a
running average of the results. Reporting is provided and alarms
may be set in response to the activities and finding of
corresponding phonemes. In particular, it is desired to set a left
right alarm via 33 indicating that left and right channels have
been swapped.
[0062] It will be recognized from the simplified example that many
normal locations of visemes and phonemes can be expected. As one
example, a viseme which is located in or near the center of the
frame will likely have corresponding phonemes in both left and
right audio channels. Accordingly, this finding would not indicate
any problem. If however, the viseme is located in the far left side
of the frame and the only corresponding phoneme is found in the
right audio there might be a problem. Repeated and consistent
problems of this type should lead to the setting of an alarm. Such
normal and problem indicating findings are somewhat particular to
the type of audio and video systems the invention is practiced with
as one of ordinary skill in the art will understand from the
teachings herein.
[0063] FIG. 6 shows a double pole double throw switch 34, similar
to 24 of FIG. 3. Switch 34 however responds to the left right alarm
33 from 32 and operates to pass the input audio signals 15-18 to
the output channels 19-22 if the alarm is not present or to swap
left and right channels if the alarm is present.
[0064] It will be recognized that the invention may be utilized
with a range of audio and video channels, from only one to many
audio channels and with one or multiple video channels. For
example, the invention may be utilized with multiple video channel
systems such as 3D and surround video. The invention may be used
with only one or with multiple audio channels such as the NHK Super
Hi-Vision system with 22.2 audio channels.
[0065] In particular the invention may be utilized with surround
sound systems where sound may be perceived by the viewer as coming
from multiple directions. Visemes are detected and their location
relative to the viewer or some other reference(s) determined. The
location to the viewer may include both locations of visible image
and locations which are not visible such as the previously
described scenario where an airplane is visually located behind the
viewer and not visible.
[0066] Phonemes which correspond to the visemes are then located by
searching one or more audio signals. The located phonemes are then
identified by their spatial location relative to the viewer (or
other reference(s)) in response to the audio channel(s) they are
found in. The spatial location of the viseme is compared to the
spatial location of the phoneme to determine if they match.
[0067] Reporting is performed and/or alarms set in response to the
matching or mismatching of the spatial locations of corresponding
visemes and phonemes. In instances where the spatial locations of
corresponding visemes and phonemes do not match, the audio signals
may be coupled to different audio channels in order to provide
proper tracking. It is also possible to improve matching of audio
and video by use of audio signal processing. For example, dialogue
may be electronically removed from an audio signal, leaving other
sounds in that signal, and that removed dialogue added to one or
more other audio signals.
[0068] FIG. 7 is a flow chart demonstrating one embodiment of the
invention. One or more video signals is inspected to determine
activity indicating the likely presence of video MUEVs as shown in
block 710, in this example visemes, and also to spatially locate
visemes relative to a reference point, as shown in block 712. As
described above, activity may be determined in response to the
visemes. A plurality of audio signals is inspected to determine
activity indicating the likely presence of audio MUEVs as shown in
block 716, in this example phonemes, and to spatially locate
phonemes relative to a reference point, as shown in block 714.
Again, activity may be determined in response to phonemes. The
phonemes and visemes are searched to locate corresponding pairs of
MUEVs, as shown in block 720. The spatial locations of the
corresponding MUEVs are identified, as shown in block 730 and those
spatial locations are compared to determine their degree of spatial
coincidence, as shown in block 740.
[0069] Parameters are set, by operator, as shown in block 760 or
otherwise, for filtering the results of the determination of the
degree of spatial coincidence of the corresponding MUEVs.
Parameters may be set during manufacture, for example by
programming them into software, storing them in memory, or hard
wiring in circuitry. Parameters may also be set and adjusted
automatically in response to the audio and/or video signals, for
example in response to the average audio level or average audio
frequency content.
[0070] The results of the determination of the degree of spatial
coincidence are filtered in response to the parameters and audio
and video activity to determine the average spatial coincidence, as
shown in block 750 and a report of the average spatial coincidence
is made.
[0071] Parameters are set, as shown in block 760 by operator or
otherwise, for comparison to the average spatial coincidence to
determine excessive values indicating a loss of spatial
coincidence. As above the parameters may be set during manufacture
or set and adjusted automatically.
[0072] The average spatial coincidence is compared to the
parameters, as shown in block 780, to determine if the average
spatial coincidence exceeds the parameters and if so an alarm is
set, as shown in block 790.
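The comparing, averaging and alarm steps of the flow chart may be
sketched as follows. Several assumptions are made for illustration:
spatial locations are represented as horizontal positions from -1.0
(far left) to 1.0 (far right), the degree of spatial coincidence is
expressed as an offset so that larger values indicate a loss of
coincidence, a plain average stands in for the filtering, and the
function names, the limit of 0.5 and the minimum of three pairs are
hypothetical parameters.

```python
def average_offset(pairs):
    """Average absolute offset between viseme and phoneme locations.

    Each pair is (viseme_location, phoneme_location) for one
    corresponding pair of MUEVs.
    """
    offsets = [abs(viseme - phoneme) for viseme, phoneme in pairs]
    return sum(offsets) / len(offsets)


def coincidence_alarm(pairs, limit=0.5, min_pairs=3):
    """Set an alarm when the average offset exceeds the limit parameter."""
    if len(pairs) < min_pairs:
        return False  # insufficient audio and video activity to decide
    return average_offset(pairs) > limit


swapped = [(-0.8, 0.8), (-0.6, 0.7), (-0.9, 0.9)]   # sound on the wrong side
matched = [(-0.8, -0.7), (0.5, 0.5), (0.1, 0.0)]    # sound follows the image
```

With these hypothetical inputs the alarm is set for swapped and not
for matched, and a single isolated pair does not set an alarm
regardless of its offset.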
[0073] One of ordinary skill in the art will recognize from the
teachings of the various embodiments of the Figures which are given
by way of example to illustrate the inventive concepts that various
changes and enhancements may be resorted to in order to practice
the invention in a particular system or with particular
equipment.
[0074] As generally used in the art, the word audio often, but not
always, pertains to sounds likely to include dialogue and the signals
that carry them. This is a holdover from original radio and TV
systems which only had one channel which carried all sounds. Sound
is often used more generally to mean audible sounds and the signals
that carry them. It is noted however that when used in multiple
channel sound systems, audio is now often used to denote any of the
sound channels or signals, including those which are intended to
carry only sound effects without dialogue. Audio is also used to
mean all of the sound channels and signals in a particular program.
The embodiment of the invention is described herein in respect to
audio, and audio in television systems. The use of terminology
including audio and sound in the description of the exemplary
embodiment is that commonly used in the art. One of ordinary skill
in the art will know the particular meaning intended from the
context of the wording, however those not having skill in the art
may not be able to know the intended meaning without some study.
The embodiment is given by way of example and is not intended to be
limiting of the scope of the invention as claimed, and in
particular it is not intended that audio or sound be limited to
only that which contains dialogue.
[0075] Generally, each sound channel conveys an electronic signal
representation of the sound. These electronic signals may be analog
or digital, and may be conveyed by wire, fiber optic, optical,
wireless or any other known method. Several other television and
audio systems which utilize multiple sound channels are known in
the art and it is expected that other multiple sound channel
systems and equipment will become known in the future. The present
invention will find application to many of these multiple sound
channel systems as well as the associated equipment, transmission
systems and methods as will be known to one of ordinary skill in
the art from the teachings herein.
[0076] It will be appreciated that the word channel is used herein
in a communications theory sense; that is, a channel is the path whereby
signals, information or data are stored, conveyed or transmitted
utilizing any of various technologies known in the art.
[0077] It will be appreciated that while the embodiment is
described with respect to phonemes and dialogue as a desired type
of audio signals, the inventive concepts will apply as well to
other types of signals. For example, the invention may be practiced
with audio effects, low frequency audio, electronically generated
audio, laugh tracks, applause tracks and any other type of audio
signal which is desired to be present and/or carried on one or more
particular channel(s). The invention may be practiced by inspection
of the signal for expected characteristics of the particular
signal.
[0078] When speaking of the absence or presence of phonemes, MUEVs
or other types of information in a channel, it will be understood
that it is the absence or presence of the particular information in
the signal (of whatever type) which conveys that information via
the channel. Detecting the information in a channel may be
performed directly by inspecting the signal carried in the channel.
It is also possible to detect the presence of information in that
channel indirectly by inspecting the sound from the transducer
(e.g. speaker) which converts the signal to sound, such as by using
another transducer such as a microphone.
[0079] The teachings of the inventive concepts described herein
will also be understood by those of ordinary skill in the art to be
applicable to non-audio signals being carried as intended for
particular systems and methods. Examples include systems and
methods which utilize one or more channels of metadata, subsonic,
ultrasonic, infrared, ultraviolet or electromagnetic (e.g. X-Ray,
Radar, MRI) information.
[0080] As used herein, dialogue pertains to spoken words such as by
humans, cartoon characters and the like. While in the normal sense
dialogue pertains to a two-way conversation, as used herein it will
encompass a one-way conversation such as a radio or television
announcer broadcasting to a listener.
[0081] As will be known from context, frame and image frame refer
to the entire frame of images viewed by a viewer whereas image
refers to the particular image or images of interest within (or not
within but relative to) the frame. Most commonly image refers to
the image which corresponds to the sound being discussed, for
example the face of a speaker which is analyzed for visemes and for
which the sound is being analyzed for phonemes. That face may be in
the frame or not, such as when a person in the frame is looking at
the speaker which is talking but cannot be seen by the viewer.
[0082] When describing actions and activities such as detection
and/or response to phonemes, MUEVs, visemes and the like, it is
meant that there is a specific detection or response which more
readily results from the occurrence of that particular event than
other events. For example, while an activity detector responding to
a particular frequency band would inherently respond to phonemes
falling within that band, it has no particular discrimination of or
affinity to the phoneme as compared to other sounds falling within
the band. Thus, as used herein such an activity detector would not
be considered to detect or respond to the phoneme.
* * * * *