U.S. patent application number 13/629292 was filed with the patent office on 2012-09-27 and published on 2013-12-19 for system and methods for encoding live multimedia content with synchronized resampled audio data.
This patent application is currently assigned to DIVX, LLC. The applicant listed for this patent is DIVX, LLC. Invention is credited to Kirill Erofeev, Galina Petrova, Dmitry Sahno.
Application Number | 13/629292 |
Publication Number | 20130336379 |
Family ID | 49755885 |
Filed Date | 2012-09-27 |
United States Patent Application | 20130336379 |
Kind Code | A1 |
Erofeev; Kirill ; et al. | December 19, 2013 |
System and Methods for Encoding Live Multimedia Content with
Synchronized Resampled Audio Data
Abstract
Systems and methods for encoding live multimedia content with
audio data synchronized with other streams of data within the
multimedia content, including video data, in accordance with
embodiments of the invention are disclosed. In one embodiment of
the invention, an encoding system includes live multimedia content
storage configured to store live multimedia content including audio
data and video data, a processor, and a multimedia encoder, wherein
the multimedia encoder configures the processor to receive live
multimedia content, generate a timeline using the video data,
compute a first time window, align the audio data to the video data
using the audio samples and the timeline, measure a synchronization
value of the aligned audio data to the video data, resample at
least one audio sample in the aligned audio data when the
synchronization value exceeds a threshold value, and multiplex the
audio data and video data into a container file.
Inventors: | Erofeev; Kirill (Tomsk, RU); Petrova; Galina (Tomsk, RU); Sahno; Dmitry (Tomsk, RU) |
Applicant: |
Name | City | State | Country | Type |
DIVX, LLC | Santa Clara | CA | US | |
Assignee: | DIVX, LLC (Santa Clara, CA) |
Family ID: | 49755885 |
Appl. No.: | 13/629292 |
Filed: | September 27, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61659111 | Jun 13, 2012 | |
Current U.S. Class: | 375/240.01; 375/E7.026 |
Current CPC Class: | H04N 21/242 20130101; G11B 27/031 20130101; H04N 21/2187 20130101; G11B 27/10 20130101 |
Class at Publication: | 375/240.01; 375/E07.026 |
International Class: | H04N 7/26 20060101 H04N007/26 |
Claims
1. An encoding system, comprising: live multimedia content storage
configured to store live multimedia content, where the live
multimedia content comprises audio data and video data, where the
audio data comprises a plurality of audio samples having an audio
sample duration and the video data comprises a plurality of video
frames; a processor; and a multimedia encoder; wherein the
multimedia encoder configures the processor to: receive live
multimedia content; generate a timeline using the video data, where
the timeline contains a plurality of timestamps, where at least one
timestamp in the plurality of timestamps is determined using at
least one video frame in the plurality of video frames; compute a
first time window, where the first time window comprises a first
time window duration corresponding to the difference in time
between a first timestamp in the timeline and a second timestamp in
the timeline; align the audio data to the video data using the
audio samples and the timeline by assigning at least one audio
sample to the first time window based upon the number of audio
sample durations that occur within the first time window duration;
measure a synchronization value of the aligned audio data to the
video data using the timeline; resample at least one audio sample
in the aligned audio data when the synchronization value exceeds a
threshold value; and multiplex the audio data and video data into a
container file.
2. The encoding system of claim 1, wherein the audio sample
duration is a fixed duration.
3. The encoding system of claim 1, wherein the audio sample
duration is a variable duration.
4. The encoding system of claim 1, wherein the synchronization
value is measured by subtracting the duration of at least one audio
sample from the first time window duration.
5. The encoding system of claim 1, wherein the threshold value is
pre-determined.
6. The encoding system of claim 1, wherein the threshold value is
determined dynamically.
7. The encoding system of claim 1, wherein the at least one audio
sample comprises a sampling rate and the audio sample is resampled
by increasing the sampling rate.
8. The encoding system of claim 1, wherein the at least one audio
sample comprises a sampling rate and the audio sample is resampled
by decreasing the sampling rate.
9. The encoding system of claim 1, wherein the multimedia encoder
further configures the processor to perform pitch compensation of
the resampled audio sample.
10. The encoding system of claim 1, wherein: at least one video
frame in the plurality of video frames comprises a video frame
timestamp; and at least one timestamp in the plurality of
timestamps is determined using the video frame timestamp.
11. The encoding system of claim 1, wherein: at least one video
frame in the plurality of video frames comprises a video frame
duration; and at least one timestamp in the plurality of timestamps
is determined using the video frame duration.
12. A method for encoding live multimedia content, comprising:
receiving live multimedia content using an encoding system;
generating a timeline using the video data and the encoding system,
where the timeline contains a plurality of timestamps, where at
least one timestamp in the plurality of timestamps is determined
using at least one video frame in the plurality of video frames;
computing a first time window using the encoding system, where the
first time window comprises a first time window duration
corresponding to the difference in time between a first timestamp
in the timeline and a second timestamp in the timeline; aligning
the audio data to the video data using the audio samples and the
timeline by assigning at least one audio sample to the first time
window based upon the number of audio sample durations that occur
within the first time window duration using the encoding system;
measuring a synchronization value of the aligned audio data to the
video data using the timeline and the encoding system; resampling
at least one audio sample in the aligned audio data when the
synchronization value exceeds a threshold value using the encoding
system; and multiplexing the audio data and video data into a
container file using the encoding system.
13. The method of claim 12, wherein the audio sample duration is a
fixed duration.
14. The method of claim 12, wherein the audio sample duration is a
variable duration.
15. The method of claim 12, wherein measuring the synchronization
value comprises subtracting the duration of at least one audio
sample from the first time window duration using the encoding
system.
16. The method of claim 12, wherein the threshold value is
pre-determined.
17. The method of claim 12, wherein the threshold value is
determined dynamically.
18. The method of claim 12, wherein the at least one audio sample
comprises a sampling rate and resampling an audio sample comprises
increasing the sampling rate using the encoding system.
19. The method of claim 12, wherein the at least one audio sample
comprises a sampling rate and resampling an audio sample comprises
decreasing the sampling rate using the encoding system.
20. The method of claim 12, further comprising performing pitch
compensation of at least one resampled audio sample using the
encoding system.
21. The method of claim 12, wherein: at least one video frame in
the plurality of video frames comprises a video frame timestamp;
and determining at least one timestamp in the plurality of
timestamps utilizes the video frame timestamp and the encoding
system.
22. The method of claim 12, wherein: at least one video frame in
the plurality of video frames comprises a video frame duration; and
determining at least one timestamp in the plurality of timestamps
utilizes the video frame duration and the encoding system.
23. A machine readable medium containing processor instructions,
where execution of the instructions by a processor causes the
processor to perform a process comprising: receiving live
multimedia content; generating a timeline using the video data,
where the timeline contains a plurality of timestamps, where at
least one timestamp in the plurality of timestamps is determined
using at least one video frame in the plurality of video frames;
computing a first time window, where the first time window
comprises a first time window duration corresponding to the
difference in time between a first timestamp in the timeline and a
second timestamp in the timeline; aligning the audio data to the
video data using the audio samples and the timeline by assigning at
least one audio sample to the first time window based upon the
number of audio sample durations that occur within the first time
window duration; measuring a synchronization value of the aligned
audio data to the video data using the timeline; resampling at
least one audio sample in the aligned audio data when the
synchronization value exceeds a threshold value; and multiplexing
the audio data and video data into a container file.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional
Patent Application No. 61/659,111, filed on Jun. 13, 2012, the
disclosure of which is hereby incorporated by reference in its
entirety.
FIELD OF THE INVENTION
[0002] The present invention is directed, in general, to systems
and methods for encoding multimedia content and more specifically
to systems and methods for encoding live multimedia content with
synchronized resampled audio data.
BACKGROUND
[0003] Streaming video over the Internet has become a phenomenon in
modern times. Many popular websites, such as YouTube, a service of
Google, Inc. of Mountain View, Calif., and WatchESPN, a service of
ESPN of Bristol, Conn., utilize streaming video in order to provide
video and television programming to consumers via the Internet.
[0004] Video data is often compressed to facilitate the storage and
transmission of video data, particularly over networks such as the
Internet. A number of video compression standards (codecs) exist,
including MPEG2 by the Moving Picture Experts Group (MPEG) of the
International Organization for Standardization (ISO) of Geneva,
Switzerland, with the International Electrotechnical Commission
(IEC) of Geneva, Switzerland, MPEG4 by the ISO/IEC MPEG, and
H.264/MPEG4 AVC by the International Telecommunication Union
Telecommunication Standardization Sector of Geneva, Switzerland.
Video data is compressed, also known as encoded, using an encoder.
Encoded video data is decompressed using a decoder corresponding to
the encoder used to encode the video data.
[0005] Scalable Video Coding (SVC) is an extension of the
H.264/MPEG-4 AVC video compression standard, which is specified by
the ITU-T H.264 standard by the International Telecommunication
Union Telecommunication Standardization Sector of Geneva,
Switzerland. SVC enables the encoding of a video bitstream that
additionally contains one or more sub-bitstreams. The
sub-bitstreams are derived from the video bitstream by dropping
packets of data from the video bitstream, resulting in a
sub-bitstream of lower quality and lower bandwidth than the
original video bitstream. SVC supports three forms of scaling a
video bitstream into sub-bitstreams: temporal scaling, spatial
scaling, and quality scaling. Each of these scaling techniques can
be used individually or combined depending on the specific video
system.
[0006] Pulse Code Modulation (PCM) is a method used to create a
digital representation of analog signals, including analog audio
data. A PCM stream is a digital representation of an analog signal
where the magnitude of the analog signal is sampled at uniform
intervals, at a frequency known as the sample rate, and quantized to a value
within a range of digital steps. PCM streams are commonly created
using analog to digital converters, and are decoded using digital
to analog converters. Systems and methods for performing pulse code
modulation of analog signals are described in U.S. Pat. No.
2,801,281, entitled "Communication System Employing Pulse Code
Modulation" to Oliver et al., dated Jul. 30, 1957, the entirety of
which is incorporated by reference.
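As a rough illustration of the sampling and quantization described above, the following Python sketch samples a signal at uniform intervals and quantizes each value to a signed integer step. The function name `pcm_encode`, the 16-bit depth, and the 1 kHz test tone are assumptions chosen for illustration and are not taken from this application.

```python
import math

def pcm_encode(signal, sample_rate_hz, duration_s, bits=16):
    """Sample `signal` (a function of time in seconds) at uniform intervals
    and quantize each value in [-1.0, 1.0] to a signed integer step."""
    max_step = 2 ** (bits - 1) - 1          # 32767 for 16-bit PCM
    count = int(round(sample_rate_hz * duration_s))
    samples = []
    for n in range(count):
        t = n / sample_rate_hz              # uniform sampling instants
        value = max(-1.0, min(1.0, signal(t)))   # clip to the valid range
        samples.append(int(round(value * max_step)))
    return samples

# Hypothetical 1 kHz test tone sampled at 8 kHz for 1 ms (8 samples).
tone = lambda t: math.sin(2 * math.pi * 1000 * t)
pcm = pcm_encode(tone, sample_rate_hz=8000, duration_s=0.001)
```

At 8 kHz, one cycle of the 1 kHz tone spans eight samples, so the third sample lands on the positive peak and quantizes to the largest 16-bit step.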
[0007] A variety of multimedia containers may be used to store
encoded multimedia content, including the Matroska container. The
Matroska container is a media container developed as an open
standard project by the Matroska non-profit organization of
Aussonne, France. The Matroska container is based upon Extensible
Binary Meta Language (EBML), which is a binary derivative of the
Extensible Markup Language (XML). Decoding of the Matroska
container is supported by many consumer electronics (CE) devices.
The DivX Plus file format developed by DivX, LLC of San Diego,
Calif. utilizes an extension of the Matroska container format,
including elements that are not specified within the Matroska
format.
[0008] In adaptive streaming systems, multimedia content is
typically stored on a media server as a top level index file
pointing to a number of alternate streams that contain the actual
video and audio data. Each stream is typically stored in one or
more container files. A variety of container files, including the
Matroska container, may be utilized in adaptive streaming
systems.
SUMMARY OF THE INVENTION
[0009] Systems and methods for encoding live multimedia content
with audio data synchronized with other streams of data within the
multimedia content, including video data, in accordance with
embodiments of the invention are disclosed. In one embodiment of
the invention, an encoding system includes live multimedia content
storage configured to store live multimedia content, where the live
multimedia content includes audio data and video data, where the
audio data includes a plurality of audio samples having an audio
sample duration and the video data includes a plurality of video
frames, a processor, and a multimedia encoder, wherein the
multimedia encoder configures the processor to receive live
multimedia content, generate a timeline using the video data, where
the timeline contains a plurality of timestamps, where at least one
timestamp in the plurality of timestamps is determined using at
least one video frame in the plurality of video frames, compute a
first time window, where the first time window includes a first
time window duration corresponding to the difference in time
between a first timestamp in the timeline and a second timestamp in
the timeline, align the audio data to the video data using the
audio samples and the timeline by assigning at least one audio
sample to the first time window based upon the number of audio
sample durations that occur within the first time window duration,
measure a synchronization value of the aligned audio data to the
video data using the timeline, resample at least one audio sample
in the aligned audio data when the synchronization value exceeds a
threshold value, and multiplex the audio data and video data into a
container file.
[0010] In another embodiment of the invention, the audio sample
duration is a fixed duration.
[0011] In an additional embodiment of the invention, the audio
sample duration is a variable duration.
[0012] In yet another additional embodiment of the invention, the
synchronization value is measured by subtracting the duration of at
least one audio sample from the first time window duration.
[0013] In still another additional embodiment of the invention, the
threshold value is pre-determined.
[0014] In yet still another additional embodiment of the invention,
the threshold value is determined dynamically.
[0015] In yet another embodiment of the invention, the at least one
audio sample includes a sampling rate and the audio sample is
resampled by increasing the sampling rate.
[0016] In still another embodiment of the invention, the at least
one audio sample includes a sampling rate and the audio sample is
resampled by decreasing the sampling rate.
[0017] In yet still another embodiment of the invention, the
multimedia encoder further configures the processor to perform
pitch compensation of the resampled audio sample.
[0018] In yet another additional embodiment of the invention, at
least one video frame in the plurality of video frames includes a
video frame timestamp and at least one timestamp in the plurality
of timestamps is determined using the video frame timestamp.
[0019] In still another additional embodiment of the invention, at
least one video frame in the plurality of video frames includes a
video frame duration and at least one timestamp in the plurality of
timestamps is determined using the video frame duration.
[0020] Still another embodiment of the invention includes a method
for encoding live multimedia content that includes receiving live
multimedia content using an encoding system, generating a timeline
using the video data and the encoding system, where the timeline
contains a plurality of timestamps, where at least one timestamp in
the plurality of timestamps is determined using at least one video
frame in the plurality of video frames, computing a first time
window using the encoding system, where the first time window
includes a first time window duration corresponding to the
difference in time between a first timestamp in the timeline and a
second timestamp in the timeline, aligning the audio data to the
video data using the audio samples and the timeline by assigning at
least one audio sample to the first time window based upon the
number of audio sample durations that occur within the first time
window duration using the encoding system, measuring a
synchronization value of the aligned audio data to the video data
using the timeline and the encoding system, resampling at least one
audio sample in the aligned audio data when the synchronization
value exceeds a threshold value using the encoding system, and
multiplexing the audio data and video data into a container file
using the encoding system.
[0021] In yet another additional embodiment of the invention, the
audio sample duration is a fixed duration.
[0022] In still another additional embodiment of the invention, the
audio sample duration is a variable duration.
[0023] In yet still another additional embodiment of the invention,
measuring the synchronization value includes subtracting the
duration of at least one audio sample from the first time window
duration using the encoding system.
[0024] In yet another embodiment of the invention, the threshold
value is pre-determined.
[0025] In still another embodiment of the invention, the threshold
value is determined dynamically.
[0026] In yet still another embodiment of the invention, the at
least one audio sample includes a sampling rate and resampling an
audio sample includes increasing the sampling rate using the
encoding system.
[0027] In yet another additional embodiment of the invention, the
at least one audio sample includes a sampling rate and resampling
an audio sample includes decreasing the sampling rate using the
encoding system.
[0028] In still another additional embodiment of the invention,
encoding live multimedia content includes performing pitch
compensation of at least one resampled audio sample using the
encoding system.
[0029] In yet still another additional embodiment of the invention,
at least one video frame in the plurality of video frames includes
a video frame timestamp and determining at least one timestamp in
the plurality of timestamps utilizes the video frame timestamp and
the encoding system.
[0030] In yet another embodiment of the invention, at least one
video frame in the plurality of video frames includes a video frame
duration and determining at least one timestamp in the plurality of
timestamps utilizes the video frame duration and the encoding
system.
[0031] Yet another embodiment of the invention includes a machine
readable medium containing processor instructions, where execution
of the instructions by a processor causes the processor to perform
a process including receiving live multimedia content, generating a
timeline using the video data, where the timeline contains a
plurality of timestamps, where at least one timestamp in the
plurality of timestamps is determined using at least one video
frame in the plurality of video frames, computing a first time
window, where the first time window includes a first time window
duration corresponding to the difference in time between a first
timestamp in the timeline and a second timestamp in the timeline,
aligning the audio data to the video data using the audio samples
and the timeline by assigning at least one audio sample to the
first time window based upon the number of audio sample durations
that occur within the first time window duration, measuring a
synchronization value of the aligned audio data to the video data
using the timeline, resampling at least one audio sample in the
aligned audio data when the synchronization value exceeds a
threshold value, and multiplexing the audio data and video data
into a container file.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] FIG. 1 is a system diagram of a system for encoding and
delivering live multimedia content in accordance with an embodiment
of the invention.
[0033] FIG. 2 conceptually illustrates a media server configured to
encode live video data with synchronized resampled audio data in
accordance with an embodiment of the invention.
[0034] FIG. 3 is a flow chart illustrating a process for encoding
live multimedia content with audio data synchronized with video
data in accordance with an embodiment of the invention.
[0035] FIG. 4 is a flow chart illustrating a process for encoding
live multimedia content with resampled audio data synchronized with
video data in accordance with an embodiment of the invention.
DETAILED DESCRIPTION
[0036] Turning now to the drawings, systems and methods for
encoding live multimedia content with synchronized resampled audio
data in accordance with embodiments of the invention are disclosed.
Multimedia content typically includes audio data and video data. In
many embodiments, video data is encoded using one of a variety of
video compression schemes and the audio data is encoded using pulse
code modulation (PCM). The audio data can then be multiplexed with
the frames of encoded video data and stored in a container file.
When encoding live multimedia content, such as for use in live
streaming over the Internet, the audio data is synchronized quickly
in order to facilitate the encoding and delivery of the live
multimedia content.
[0037] Container files are composed of blocks of content (e.g.
fragments, elements, or chunks), where each block of content
includes audio data and/or video data. A number of container files
have restrictions as to how timestamps are applied to the data
stored in the container file and/or the container files may have
blocks of content of a fixed size. Furthermore, the generation of
timestamps for live sources of audio and video data may contain
errors or other issues. For example, the difference between
adjacent timestamps present in the audio data may not be equal to
the actual duration of the audio data contained between the
adjacent timestamps. These restrictions and errors can cause the
audio and video data to become desynchronized when the audio and
video data is multiplexed in a container file.
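The timestamp error described above can be made concrete with a small Python sketch that compares the gap between two adjacent audio timestamps with the duration of the audio actually delivered between them. The helper name `timestamp_discrepancy` and the capture figures are hypothetical; exact rational arithmetic is used only to keep the example free of rounding noise.

```python
from fractions import Fraction

def timestamp_discrepancy(ts_start, ts_end, sample_count, sample_rate_hz):
    """Return (timestamp gap) - (actual audio duration), in seconds.

    A non-zero result means the capture timestamps disagree with the
    amount of audio actually contained between them.
    """
    gap = Fraction(ts_end) - Fraction(ts_start)
    actual = Fraction(sample_count, sample_rate_hz)
    return gap - actual

# Hypothetical capture: the timestamps claim 1 s elapsed, but only 47,952
# samples of 48 kHz audio (0.999 s of audio) arrived in between.
drift = timestamp_discrepancy(0, 1, 47952, 48000)
```

In this example the timestamps overstate the audio duration by one millisecond per second. An error of that size, if left uncorrected, would grow with the length of the live recording.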
[0038] In many embodiments, the likelihood of desynchronization is
reduced by constructing a timeline using the timestamps of the
encoded video data and synchronizing the audio data to the video
data based upon the sampling rate of the audio data. In a number of
embodiments, the audio data is synchronized to the timeline by
adjusting the number of PCM samples assigned to a particular time
interval. In several embodiments, the container file format permits
a specific number of audio samples in each frame interval and the
audio data is synchronized to the timeline by resampling the audio
data to obtain an appropriate number of samples. The audio data and
video data can then be multiplexed into a container file. Systems
and methods for encoding live multimedia content with synchronized
resampled audio data in accordance with embodiments of the
invention are discussed further below.
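One way to picture the adjustment of the number of PCM samples assigned to each time interval is the following Python sketch. It is only an illustration under stated assumptions: the function name `samples_per_window` and the cumulative-rounding scheme (carrying the fractional remainder from one window to the next) are not taken from this application.

```python
from fractions import Fraction

def samples_per_window(timestamps, sample_rate_hz):
    """Yield the number of PCM samples to assign to each window between
    adjacent timeline timestamps, carrying the fractional remainder
    forward so that rounding error does not accumulate."""
    assigned = 0
    origin = Fraction(timestamps[0])
    for end in timestamps[1:]:
        # Total whole samples that fit between the origin and `end`.
        target = int((Fraction(end) - origin) * sample_rate_hz)
        yield target - assigned
        assigned = target

# Hypothetical timeline: six frame timestamps at 30000/1001 (about 29.97)
# frames per second, with 48 kHz audio.
timeline = [Fraction(1001 * i, 30000) for i in range(6)]
counts = list(samples_per_window(timeline, 48000))
```

Each window here spans 1601.6 sample durations, so the sketch assigns 1601, 1602, 1601, 1602, and 1602 samples. The total of 8008 matches the duration of the five windows exactly.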
System Overview
[0039] Media servers in accordance with embodiments of the
invention are configured to encode live multimedia content to be
stored and/or streamed to network clients. A media streaming
network including a media server configured to encode live
multimedia content in accordance with an embodiment of the
invention is illustrated in FIG. 1. The illustrated media streaming
network 10 includes a media source 100 configured to encode
multimedia content in real time. The media source 100 is configured
to capture and/or receive streams of live audio data from an audio
source and streams of live video data from a video source. In
accordance with embodiments of the invention, the audio source and
video source generate and apply a timestamp to their respective
captured data. However, the timestamps applied by the audio source
and video source are likely to not be synchronized with each other
and cannot be relied upon by the media source 100 to synchronize
the encoded audio data and the encoded video data. In a number of
embodiments of the invention, the media source 100 contains
pre-encoded multimedia content. The media source 100 is connected
to a network renderer 102. In accordance with embodiments of the
invention, the network renderer 102 synchronizes the encoded audio
data with the encoded video data by using the encoded video data to
create a timeline and synchronizing the encoded audio data to the
timeline based upon the sampling rate of the encoded audio data. In
several embodiments, the initial synchronization of the video data
and the audio data may be obtained using an initial synchronization
sequence. In many embodiments, the media source 100 and the network
renderer 102 are implemented using a media server. In accordance
with embodiments of the invention, the network renderer 102 is
connected to a plurality of network clients 104 utilizing a network
108.
[0040] In many embodiments of the invention, the network renderer
102 is implemented using a single machine. In several embodiments
of the invention, the network renderer 102 is implemented using a
plurality of machines. In many embodiments, the network 108 is the
Internet. In several embodiments, the network 108 is any IP
network.
[0041] The network clients 104 contain a media decoder 106. In
several embodiments of the invention, the network client 104 is
configured to receive and decode received multimedia content using
the media decoder 106.
[0042] In many embodiments of the invention, a media server, where
the media server includes a media source 100 and a network renderer
102, is implemented using a machine capable of receiving live
multimedia content and multiplexing the received live multimedia
content into a container file. In accordance with embodiments of
the invention, the media server is also capable of encoding the
received live multimedia content. The basic architecture of a media
server in accordance with an embodiment of the invention is
illustrated in FIG. 2. The media server 200 includes a processor
210 in communication with non-volatile memory 230, volatile memory
220, and a network interface 240. In the illustrated embodiment,
the non-volatile memory includes a media encoder 232 that
configures the processor to encode live multimedia content by
creating a timeline using the video data in the live multimedia
content, synchronizing samples of the audio data with the video
data using the timeline, and multiplexing the audio data and video
data in a container file. In many embodiments, the container file
contains blocks of content with a fixed number of audio samples; in
accordance with embodiments of the invention, the audio data is
resampled to obtain the appropriate number of audio samples prior
to multiplexing the audio data and the video data in a container
file. In several embodiments, the network interface 240 may be in
communication with the processor 210, the volatile memory 220,
and/or the non-volatile memory 230. Although a specific media
server architecture is illustrated in FIG. 2, any of a variety of
architectures including architectures where the media encoder 232
is located on disk or some other form of storage and is loaded into
volatile memory 220 at runtime can be utilized to implement media
servers in accordance with embodiments of the invention.
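The resampling of audio data to obtain a fixed number of samples per block can be sketched as follows. This minimal Python example uses linear interpolation, which is an assumption made for brevity; the application does not specify a resampling algorithm, and a production encoder would more likely use a band-limited resampling filter. The function name `resample_linear` is hypothetical.

```python
def resample_linear(samples, target_count):
    """Resample `samples` to exactly `target_count` values using linear
    interpolation between neighbouring input samples."""
    if target_count <= 0 or not samples:
        return []
    if len(samples) == 1 or target_count == 1:
        return [samples[0]] * target_count
    # Map output index k onto the input index range [0, len(samples) - 1].
    step = (len(samples) - 1) / (target_count - 1)
    out = []
    for k in range(target_count):
        pos = k * step
        i = min(int(pos), len(samples) - 2)   # left neighbour index
        frac = pos - i                        # distance past the left neighbour
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out

# Stretch four samples to the seven samples a hypothetical block requires.
stretched = resample_linear([0, 10, 20, 30], 7)
```

Producing more output samples than input samples for the same block corresponds to increasing the sampling rate of that audio, and producing fewer corresponds to decreasing it.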
[0043] Although specific architectures for a media streaming
network and a media server configured to encode live multimedia
content are described with respect to FIGS. 1 and 2, other
implementations appropriate to a specific application can be
utilized in accordance with embodiments of the invention. Methods
for encoding live multimedia content with synchronized audio data
in accordance with embodiments of the invention are discussed
below.
Encoding Live Multimedia Content with Synchronized Audio Data
[0044] Live multimedia content can be encoded for a variety of
purposes including, but not limited to, streaming the live
multimedia content to a number of network clients. In order to
provide a high quality viewing experience, the video data and the
audio data should be closely synchronized so that the encoded
multimedia content provides an experience similar to viewing the
content live. In this way, the audio will correspond with relevant
visual cues such as (but not limited to) lip motion associated with
a person talking or singing. Traditionally, this synchronization is
performed using the timestamps associated with the audio data and
the video data; these timestamps are created by the hardware and/or
software capturing the audio data and video data. However, the
timestamps generated during the capture of the audio data and the
video data may not be aligned and/or differences in the hardware
and/or software may result in timestamps generated by the hardware
recording the audio data and the hardware recording the video data
being inconsistent with each other over the course of the recording
of the live multimedia content. Further compounding the problem,
the timestamps generated when recording the live audio data and
video data may not accurately represent the real world elapsed time
between the recorded timestamps. Moreover, direct multiplexing of
the audio data and the video data may result in the loss of
timestamps captured by the recording hardware, potentially causing
synchronization problems as additional live multimedia content is
encoded. However, these issues may be minimized, or even avoided,
by constructing a new timeline using the video data and
synchronizing the audio data using the timeline based upon the
sampling rate of the audio data in accordance with embodiments of
the invention.
[0045] A process for encoding live multimedia content in which
audio data is synchronized with encoded video data is illustrated
in FIG. 3. The process 300 includes receiving (310) live multimedia
content, where the multimedia content includes (but is not limited
to) audio data and video data. A timeline containing one or more
timestamps is generated (312) using the timestamps associated with
the frames of video data and/or the known frame rate of the video
data. The audio data is aligned (314) to the video data using the
timeline. The synchronization of the audio data is measured (316)
using a synchronization threshold. If the audio data is
de-synchronized (318) beyond the synchronization threshold, the
audio data is adjusted (320). The synchronized audio data and video
data are multiplexed (322) into a container file.
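
By way of illustration only, the timeline generation (312) step can
be sketched in Python for the case where the timeline is derived
from a known, constant frame rate. The function name build_timeline
and its parameters are hypothetical and do not appear in the
figures. Exact rational arithmetic is an assumption, chosen so that
frame rates such as 30000/1001 frames per second do not accumulate
rounding error.

```python
from fractions import Fraction


def build_timeline(frame_count, frame_rate):
    """Return frame_count + 1 timestamps, in seconds, one frame apart.

    Hypothetical helper: assumes a constant frame rate and a timeline
    that starts at zero.
    """
    frame_duration = 1 / Fraction(frame_rate)
    return [i * frame_duration for i in range(frame_count + 1)]


# Three frames of NTSC-rate video (30000/1001 frames per second).
timeline = build_timeline(3, Fraction(30000, 1001))
for stamp in timeline:
    print(stamp, float(stamp))
```

Each pair of adjacent timestamps in the returned list bounds one
frame interval to which audio data can subsequently be aligned.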
[0046] In several embodiments, the video data is encoded in
accordance with video encoding standards including (but not limited
to) MPEG2, MPEG4, H.264, or Scalable Video Coding. In a number of
embodiments, the audio data is encoded using PCM. In many
embodiments, the audio data is aligned (314) to the video data
using the timeline by assigning fragments of PCM samples to the
timeline without any gaps or overlaps. In a number of embodiments,
PCM samples have a fixed duration and the audio data is aligned
(314) to the timestamps by assigning enough PCM samples so that
their combined duration fills the difference between the current
timestamp and the adjacent timestamp in the timeline. In many
embodiments, the interval between the current timestamp and the
adjacent timestamp in the timeline defines a time window whose
duration is the difference between the two timestamps. In several
embodiments, the synchronization of the PCM samples is measured
(316) at each timestamp in the timeline using a variety of methods,
including, but not limited to, subtracting the total length of the
PCM samples from the difference in time between the timestamp and
the adjacent timestamp. In many embodiments, the audio data and
video data are multiplexed (322) into a container file that
includes a block of content (e.g. a fragment, element, or chunk)
that stores one or more frames of video data and the audio data
played back during the display of the one or more frames of video
data. These blocks of content, also referred to as predetermined
portions of the container file, can have a variable size depending
on the data stored in each block. In accordance
with embodiments of the invention, a timestamp can be associated
with the predetermined portions of the container file containing
video frames and associated audio data and the timestamps can be
utilized by decoders to time the display of individual frames of
video and the playback of the accompanying audio.
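
The alignment (314) and measurement (316) steps described above can
be sketched as follows. This is a minimal sketch under stated
assumptions, not a definitive implementation: the function name
align_audio is hypothetical, the timeline is assumed to start at
zero, and each time window is assumed to receive every PCM sample
that ends at or before the closing timestamp of the window and has
not already been assigned. The synchronization value is computed as
the window duration minus the total duration of the assigned PCM
samples.

```python
import math
from fractions import Fraction


def align_audio(timeline, sample_rate):
    """Assign whole PCM samples to each time window of the timeline.

    Returns one (sample_count, sync_value) pair per window, where
    sync_value is the window duration minus the duration of the
    samples assigned to it, in seconds. Hypothetical helper.
    """
    windows = []
    assigned = 0  # samples handed out to earlier windows
    for start, end in zip(timeline, timeline[1:]):
        # Samples that finish by the closing timestamp of this window.
        total = math.floor(end * sample_rate)
        count = total - assigned
        assigned = total
        sync_value = (end - start) - Fraction(count, sample_rate)
        windows.append((count, sync_value))
    return windows


# Five frame intervals at 30000/1001 frames per second, 48 kHz audio.
timeline = [i * Fraction(1001, 30000) for i in range(6)]
windows = align_audio(timeline, 48000)
for count, sync_value in windows:
    print(count, float(sync_value))
```

Because every sample is assigned exactly once and in order, the
assignment leaves no gaps or overlaps. In this example each window
nominally spans 1601.6 samples, so the per-window count alternates
between 1601 and 1602.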
[0047] In many embodiments, if the measured (316) synchronization
of the audio data exceeds (318) the synchronization threshold, the
audio data is adjusted (320) by moving PCM samples from one
timestamp to the next. For example, if the measured (316)
synchronization indicates that the audio data is falling behind the
video data at timestamp X, the audio data is adjusted (320) by
pulling PCM samples of audio data from the block of content in the
container file associated with adjacent timestamp X+1. Likewise, if
the audio data is ahead of the video data, the audio data is
adjusted (320) by pushing PCM samples of audio data from the
portion of the container file associated with timestamp X to the
block of content associated with adjacent timestamp X+1. Other
adjustments may be utilized in accordance with embodiments of the
invention.
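
The adjustment (320) described above can be sketched as moving PCM
samples across the boundary between two adjacent blocks of content.
This sketch rests on several assumptions: the name adjust_block is
hypothetical, each block of content is modeled as a Python list of
samples, and de-synchronization is expressed as a difference in
sample counts rather than in units of time.

```python
def adjust_block(blocks, index, target, threshold=0):
    """Move samples between blocks[index] and blocks[index + 1].

    After the call blocks[index] holds `target` samples, unless the
    difference was already within `threshold`. Hypothetical helper;
    sample order is preserved.
    """
    current, following = blocks[index], blocks[index + 1]
    shortfall = target - len(current)
    if abs(shortfall) <= threshold:
        return  # within the synchronization threshold
    if shortfall > 0:
        # Audio is behind: pull the earliest samples of the next block.
        current.extend(following[:shortfall])
        del following[:shortfall]
    else:
        # Audio is ahead: push the surplus onto the next block's front.
        following[:0] = current[target:]
        del current[target:]


blocks = [[0, 1, 2], [3, 4, 5, 6]]
adjust_block(blocks, 0, 4)  # pull one sample from the next block
print(blocks)
adjust_block(blocks, 0, 2)  # push two samples to the next block
print(blocks)
```

Moving samples only across the shared boundary keeps the audio
contiguous, so the adjustment changes which block of content carries
a sample without reordering or discarding any audio data.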
[0048] A specific process for encoding live media with audio data
synchronized to the video data in accordance with embodiments of
the invention is described above with respect to FIG. 3; however, a
variety of processes may be utilized in accordance with embodiments
of the invention. Methods for encoding live media with synchronized
resampled audio using containers with fixed size blocks of content
in accordance with embodiments of the invention are discussed
below.
Encoding Live Multimedia Content with Resampled Audio Data
[0049] As noted above, creating a timeline using video data and
using that timeline to synchronize audio data to the video data
enables the encoding of live multimedia content which will provide
a high quality viewing experience. However, a number of container
file formats utilized in accordance with embodiments of the
invention fix the size of the predetermined portions that contain
one or more frames of video data and the audio data that
accompanies the one or more frames of video. Fixing the size of
each predetermined portion utilized to store video and audio data
typically means that each of the predetermined portions contains
the same number of audio samples. Depending upon the sampling rate
of the audio relative to the frame rate of the video, a different
number of samples may fall within each frame interval. In a number
of embodiments of the invention, the audio samples are resampled to
obtain the appropriate number of samples. In several embodiments,
filters and/or other appropriate adjustments can be applied to the
samples to minimize the audio distortion resulting from playback of
the resampled audio.
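
The dependence of the per-frame sample count on the two rates can be
checked with a short calculation. The sketch below is illustrative
only: the function name samples_per_frame is hypothetical, and the
rate combinations are common examples chosen for this illustration
rather than values taken from the embodiments.

```python
from fractions import Fraction


def samples_per_frame(sample_rate, frame_rate):
    """Exact number of audio samples spanning one frame interval."""
    return Fraction(sample_rate) / Fraction(frame_rate)


cases = [
    (48000, 25),
    (44100, 30),
    (48000, Fraction(30000, 1001)),
    (44100, Fraction(30000, 1001)),
]
for sample_rate, frame_rate in cases:
    exact = samples_per_frame(sample_rate, frame_rate)
    whole = exact.denominator == 1
    print(sample_rate, float(frame_rate), float(exact), whole)
```

When the result is not a whole number, the number of samples that
fall within successive frame intervals cannot be the same for every
interval, which is the situation in which a container file with
fixed size portions calls for resampling.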
[0050] A process for encoding live multimedia content in which
audio data is synchronized with video data and the audio and video
data are multiplexed into a container file having a fixed number of
audio samples per frame interval is illustrated in FIG. 4. The
process 400 includes receiving (410) live multimedia content, where
the multimedia content includes, but is not limited to, audio data
and video data. A timeline containing one or more timestamps is
generated (412) using the video data. The audio data is aligned
(414) to the video data using the timeline. The synchronization of
the audio data is measured (416) and compared against a
synchronization threshold. If the audio data is de-synchronized
(418) beyond the
synchronization threshold, the audio data is resampled (420), and,
if necessary, corrections are applied (422) to the resampled audio
data. The audio data and video data are multiplexed (424) into the
container file.
[0051] In accordance with embodiments of the invention, a process
similar to the one described above with respect to FIG. 3 may be
utilized for building a timeline and detecting audio
de-synchronization (410)-(418). In several embodiments, the
synchronization is measured (416) by counting the number
of samples of audio data assigned to a timestamp in the timeline
and comparing that number to the number of audio samples allowed
for a block of content in the container file. If the number of
samples of audio data differs from the number of audio samples
allowed for the block of content, the audio data is resampled
(420). In a number of embodiments, resampling (420) the PCM samples
contained in the audio data results in the resampled audio data
having a sample rate that is higher or lower than the original
sample rate. In several embodiments, the difference between the
resampled sample rate and the original sample rate is kept within a
threshold value, such as (but not limited to) 500 Hz. In many
embodiments, the resampled audio data is corrected (422) using one
or more of a variety of techniques, including, but not limited to,
pitch compensation, in order to mask changes in the sound resulting
from the resampling process.
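
The resampling (420) step and the sample rate limit described above
can be sketched as follows. Linear interpolation is used here purely
as an assumption for illustration; the embodiments do not specify a
resampling method, and a practical encoder would more likely use a
band-limited filter. The names resample_linear and rate_shift_hz are
hypothetical, and the 500 Hz limit is the example value given above.

```python
def resample_linear(samples, target_count):
    """Resample a block of PCM samples to exactly target_count samples.

    Hypothetical helper using linear interpolation between neighbors.
    """
    if target_count <= 0 or not samples:
        return []
    if len(samples) == 1 or target_count == 1:
        return [samples[0]] * target_count
    step = (len(samples) - 1) / (target_count - 1)
    out = []
    for i in range(target_count):
        pos = i * step
        left = min(int(pos), len(samples) - 2)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out


def rate_shift_hz(original_rate, source_count, target_count):
    """Change in effective sample rate implied by the resampling."""
    return original_rate * target_count / source_count - original_rate


block = [0.0, 1.0, 2.0]
print(resample_linear(block, 5))

# Stretching a 1601-sample block to the 1602 samples a block allows.
shift = rate_shift_hz(48000, 1601, 1602)
print(shift, abs(shift) <= 500)
```

In this example the effective sample rate changes by roughly 30 Hz,
which is well within the 500 Hz example limit.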
[0052] A specific process for encoding live media with synchronized
audio using containers with fixed block sizes within the multimedia
content is described above with respect to FIG. 4; however, a
variety of processes may be utilized in accordance with embodiments
of the invention.
[0053] Although the present invention has been described in certain
specific aspects, many additional modifications and variations
would be apparent to those skilled in the art. It is therefore to
be understood that the present invention may be practiced otherwise
than specifically described without departing from the scope and
spirit of the present invention. Thus, embodiments of the present
invention should be considered in all respects as illustrative and
not restrictive. Accordingly, the scope of the invention should be
determined not by the embodiments illustrated, but by the appended
claims and their equivalents.
* * * * *