U.S. patent application number 12/765815, for an efficient video skimmer, was filed with the patent office on 2010-04-22 and published on 2010-10-28.
This patent application is currently assigned to DELTA VIDYO, INC. Invention is credited to Mehmet Reha Civanlar, Tal Shalom, Ofer Shapiro.
Application Number: 12/765815
Publication Number: 20100272187
Family ID: 42992120
Publication Date: 2010-10-28
United States Patent Application 20100272187, Kind Code A1
Civanlar; Mehmet Reha; et al.
October 28, 2010
EFFICIENT VIDEO SKIMMER
Abstract
Disclosed are a system, method, apparatus, and computer readable
media containing instructions for displaying video files for rapid
searching. Two types of exemplary embodiments are disclosed: a
standalone video skimming system, and a video skimming system that
includes a server and a client system. In both, the video file may
be locally or remotely stored, or can be obtained from a live feed.
The system displays many small windows simultaneously, each showing
a different part of the video chosen by the user, which shortens
the skimming time. The video file is encoded using layered encoding
so that smaller versions can be displayed using lower layers,
without any processing being needed to generate smaller versions of
the video from the original full screen version. A video extractor
is described for extracting the necessary bitstreams from a local
video database containing layered encoded video files according to
user specified window sizes, and distributing the signals over the
electronic communications network channel. The system also includes
skimming control logic, which can receive control commands from
clients and invoke the video extractor to extract appropriate
audio-visual signals therefrom for each command.
Inventors: Civanlar; Mehmet Reha; (Istanbul, TR); Shapiro; Ofer; (Fair Lawn, NJ); Shalom; Tal; (Fair Lawn, NJ)
Correspondence Address: BAKER BOTTS L.L.P., 30 ROCKEFELLER PLAZA, 44TH FLOOR, NEW YORK, NY 10112-4498, US
Assignee: DELTA VIDYO, INC., Hackensack, NJ
Appl. No.: 12/765815
Filed: April 22, 2010
Related U.S. Patent Documents: Application Number 61172355, Filing Date Apr 24, 2009
Current U.S. Class: 375/240.25; 348/563; 348/571; 348/E5.062; 375/E7.027
Current CPC Class: H04N 21/234381 20130101; H04N 21/431 20130101;
G11B 27/105 20130101; H04N 21/4621 20130101; H04N 5/85 20130101;
H04N 9/8042 20130101; H04N 5/783 20130101; H04N 21/440227 20130101;
H04N 19/162 20141101; H04N 21/234363 20130101; H04N 5/772 20130101;
H04N 19/40 20141101; H04N 5/775 20130101; H04N 21/234327 20130101;
H04N 19/61 20141101; H04N 21/8549 20130101; H04N 9/8205 20130101;
H04N 21/4383 20130101; H04N 5/765 20130101; H04N 21/4314 20130101;
H04N 21/64322 20130101; H04N 19/31 20141101; H04N 21/4384 20130101;
H04N 19/33 20141101
Class at Publication: 375/240.25; 348/571; 348/563; 348/E05.062; 375/E07.027
International Class: H04N 5/14 20060101 H04N005/14; H04N 7/26 20060101 H04N007/26
Claims
1. A system for simultaneous display of a plurality of chapters of a
full length video file, wherein the full length video file
comprises video in a layered coding format.
2. The system of claim 1, wherein the full length video file
comprises: a first layer representing a first spatial resolution;
and at least one enhancement layer based on the first layer
representing at least one second spatial resolution.
3. The system of claim 2, wherein the first layer comprises a base
layer.
4. The system of claim 1, wherein the full length video file
comprises: a first layer representing a first temporal resolution;
and at least one enhancement layer based on the first layer
representing at least one second temporal resolution.
5. The system of claim 4, wherein the first layer comprises a base
layer.
6. The system of claim 1, wherein the full length video file
comprises one or more video files located in a local database.
7. The system of claim 1, wherein the full length video file
comprises content received from a live feed.
8. The system of claim 1, wherein the full length video file
comprises content received from a digital video storage
interface.
9. The system of claim 1, wherein the full length video file
comprises content received from a remote video database.
10. The system of claim 1, further comprising an input, coupled to
the display system, adapted for permitting a user to control the
simultaneous display of a plurality of chapters through at least
one user preference.
11. The system of claim 10, wherein the at least one user
preference comprises a preference to specify a number of
windows.
12. The system of claim 10, wherein the at least one user
preference comprises a preference to specify a duration for a video
chapter.
13. The system of claim 10, wherein the at least one user
preference comprises a preference to specify a start time for a
video chapter.
14. The system of claim 10, wherein the at least one user
preference comprises a preference to assign at least one chapter to
at least one window.
15. The system of claim 1, further comprising one or more skimming
control logic modules configured to control at least one aspect of
a display system.
16. The system of claim 10, further comprising one or more skimming
control logic modules configured to control at least one aspect of
a display system.
17. The system of claim 16, wherein the skimming control logic
further comprises logic adapted to translate one or more user
preferences into video extraction commands.
18. A method of video skimming, comprising: a. extracting a
plurality of chapters of a full length video file, wherein the
chapters are coded in a layered bitstream format, b. decoding the
plurality of chapters, and c. simultaneously displaying each of the
decoded plurality of chapters.
19. The method of claim 18 wherein the layered bitstream format
comprises a first layer representing a first spatial resolution and
at least one enhancement layer based on the first layer
representing at least one second spatial resolution, the decoding
further comprises: a. decoding the first layer of the plurality of
chapters; and b. decoding at least one of the enhancement layers of
the plurality of chapters.
20. The method of claim 18, further comprising extracting the full
length video file from a local database.
21. The method of claim 18, further comprising receiving a full
length video file content from a live feed.
22. The method of claim 18, further comprising receiving the full
length video file content from a digital video storage
interface.
23. The method of claim 18, further comprising receiving the
full-length video file from a remote video database.
24. The method of claim 18, wherein the method further comprises
receiving a user input to control the simultaneous display of a
plurality of chapters through at least one user preference.
25. The method of claim 24, wherein the at least one user
preference comprises a preference to specify a number of
windows.
26. The method of claim 24, wherein the at least one user
preference comprises a preference to specify a duration for a video
chapter.
27. The method of claim 24, wherein the at least one user
preference comprises a preference to specify a start time for a
video chapter.
28. The method of claim 24, wherein the at least one user
preference comprises a preference to assign at least one chapter to
at least one window.
29. The method of claim 24, further comprising translating the at
least one user preference into at least one video extraction
command.
30. A computer readable media having computer executable
instructions included thereon for performing a method of video
skimming, comprising: a. extracting a plurality of chapters of a
full length video file, wherein the chapters are coded in a layered
bitstream format, b. decoding the plurality of chapters, and c.
simultaneously displaying each of the decoded plurality of
chapters.
31. The computer readable media of claim 30, wherein the layered
bitstream format comprises a first layer representing a first
spatial resolution and at least one enhancement layer based on the
first layer representing at least one second spatial resolution,
the decoding further comprises: a. decoding the first layer of the
plurality of chapters; and b. decoding at least one of the
enhancement layers of the plurality of chapters.
32. The computer readable media of claim 30, wherein the method
further comprises extracting the full length video file from a
local database.
33. The computer readable media of claim 30, wherein the method
further comprises receiving a full length video file content from a
live feed.
34. The computer readable media of claim 30, wherein the method
further comprises receiving the full length video file content from
a digital video storage interface.
35. The computer readable media of claim 30, wherein the method
further comprises receiving the full-length video file from a
remote video database.
36. The computer readable media of claim 30, wherein the method
further comprises receiving a user input to control the
simultaneous display of a plurality of chapters through at least
one user preference.
37. The computer readable media of claim 36, wherein the at least
one user preference comprises a preference to specify a number of
windows.
38. The computer readable media of claim 36, wherein the at least
one user preference comprises a preference to specify a duration
for a video chapter.
39. The computer readable media of claim 36, wherein the at least
one user preference comprises a preference to specify a start time
for a video chapter.
40. The computer readable media of claim 36, wherein the at least
one user preference comprises a preference to assign at least one
chapter to at least one window.
41. The computer readable media of claim 36, wherein the method
further comprises translating the at least one user preference into
at least one video extraction command.
42. A video skimming server for preparing and distributing a
plurality of chapters of a full length video file, wherein the full
length video file comprises video coded in a layered bitstream
format, comprising: (a) a video extractor for extracting a
plurality of encoded audio-visual signals from a full length video
file; (b) a streaming server for distributing the extracted
audio-visual signals; and (c) skimming control logic server,
coupled to the video extractor, adapted for receiving at least one
control message from a video skimmer client and instructing the
video extractor to extract the audio-visual signals.
43. The video skimming server of claim 42, wherein the full length
video file comprises one or more video files located in a local
video database.
44. The video skimming server of claim 42, wherein the full length
video file comprises content received from a live feed.
45. The video skimming server of claim 42, wherein the full length
video file comprises content received from a digital storage
interface.
46. The video skimming server of claim 42, wherein the full length
video file comprises content received from a remote database.
47. The video skimming server of claim 46, further comprising a
transcoder, coupled to the video extractor and to the remote
database, for transcoding the content received from the remote
database.
48. The video skimming server of claim 42, wherein the video
extractor comprises at least a portion of a distributed server.
49. The video skimming server of claim 42, wherein the skimming
control logic server comprises at least a portion of a distributed
server.
50. A video skimmer client for preparing a customized display of a
plurality of chapters of a full length video file, wherein the full
length video file is coded in a layered bitstream format,
comprising: (a) at least one streaming client module configured to
receive a plurality of chapters, wherein the chapters are in a
layered bitstream format; (b) one or more decoders configured to
decode the chapters; (c) a graphical user interface for receiving
user input, wherein the graphical user interface is accessed
through a video display; and (d) a skimmer control logic client for
sending at least one control message to a video skimmer server.
51. The video skimmer client of claim 50, wherein said video
display comprises a television.
52. The video skimmer client of claim 50, wherein said video
display comprises a computer monitor.
53. The video skimmer client of claim 50, wherein said video
skimmer client comprises at least a portion of a television.
54. The video skimmer client of claim 50, wherein said video skimmer client
comprises at least a portion of a general purpose computer.
55. The video skimmer client of claim 50, wherein said video
skimmer client comprises at least a portion of a set-top box.
56. The video skimmer client of claim 50, wherein said video
skimmer client comprises at least a portion of a gaming console.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority to U.S.
Provisional Application Ser. No. 61/172,355, filed Apr. 24, 2009,
which is hereby incorporated by reference herein in its
entirety.
BACKGROUND
[0002] 1. Technical Field
[0003] The disclosed invention relates to techniques for searching
for content in a compressed digital video file accessed from local
storage or over a network such as the Internet. In particular, it
relates to the use of layered video coding technology in connection
with content searching for retrieving and displaying selected video
segments.
[0004] 2. Background Art
[0005] Subject matter related to the present application can be
found in co-pending U.S. patent application Ser. Nos. 12/015,956,
filed Jan. 17, 2008 and entitled "System And Method For Scalable
And Low-Delay Videoconferencing Using Scalable Video Coding,"
11/608,776, filed Dec. 8, 2006 and entitled "Systems And Methods
For Error Resilience And Random Access In Video Communication
Systems," and 11/682,263, filed Mar. 5, 2007 and entitled "System
And Method For Providing Error Resilience, Random Access And Rate
Control In Scalable Video Communications," and U.S. Pat. No.
7,593,032, filed Jan. 17, 2008 and entitled "System And Method For
A Conference Server Architecture For Low Delay And Distributed
Conferencing Applications," each of which is hereby incorporated by
reference herein in its entirety.
[0006] With increasing computing power and electronic storage
capacity, and the ubiquity of network bandwidth, the number of large
digital video libraries accessible through the Internet is growing
rapidly. On the low end of the performance spectrum, databases of
popular websites such as YouTube (www.youtube.com) and Facebook
(www.facebook.com) are becoming sources for millions of user
videos. Currently, these videos are often of small size and low
resolution. Considering the continuous increase in the resolutions
of consumer video equipment, however, such databases are likely to
contain higher resolution videos in the foreseeable future.
[0007] On the other end of the performance spectrum, high
resolution video presentation technologies such as High-Definition
TV (HDTV) are frequently used in entertainment and news video
files. Even after applying digital video compression techniques,
high resolution video results in large file sizes. Movies and
broadcast TV content in standard-definition TV (SDTV) resolution are still of
considerable size. Two other important applications where higher
resolution video may be used are surveillance video and video
segments recorded for scientific experiments.
[0008] With the amount of digital video content growing, and that
content being spread around the Internet, a technology is needed to
search for content in these databases effectively and rapidly.
Searching for content in video files has significant applications.
Users of such video content searching technology can, for example,
include: police officers looking for a specific scene in
surveillance videos (content which can be distributed over many
sites on the Internet, and can be many hours or days long);
students and teachers looking for a specific presentation module;
biologists looking for the instance when a particular biological
event is triggered; or consumers of movies or news looking for a
specific movie scene or news segment. Many users may be interested
only in a small portion--perhaps only a few seconds of content--of a
full-length video. A noteworthy aspect of content searching in
video files is that it normally requires a user's full attention
and cannot be performed as a background task while
multi-tasking.
[0009] The following terms are used throughout the disclosure. A
"full length video" can be any video footage that is meaningful as
a unit, for example, a movie, a TV show, and similar video. A full
length video may be divided into segments, which are called "video
chapters", or in short, "chapters". Therefore, a full length video
includes one or more concatenated chapters. Meta information
related to a chapter is called "video index", and the process by
which video indices are generated is called "video indexing". A
video index may include information about the starting and ending
times of the chapter, textual information about the chapter's
content, and one or more images derived from the chapter that may
represent its content. Video chapters may be indexed. One simple
form of automatic indexing is to sub-divide a full length video
into video chapters of a length that may be configured in the
system or specified by the user.
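As an illustration of the simplest form of automatic indexing described above, the following Python sketch sub-divides a full length video of known duration into fixed-length chapters. The `Chapter` record and the `split_into_chapters` function are hypothetical names introduced here; a real video index would typically also carry textual information and representative images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chapter:
    index: int      # position of the chapter within the full length video
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds


def split_into_chapters(duration_s: float, chapter_len_s: float) -> list[Chapter]:
    """Sub-divide a full length video into consecutive chapters of
    chapter_len_s seconds; the last chapter may be shorter."""
    if duration_s <= 0 or chapter_len_s <= 0:
        raise ValueError("durations must be positive")
    chapters: list[Chapter] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chapter_len_s, duration_s)
        chapters.append(Chapter(len(chapters), start, end))
        start = end
    return chapters


for ch in split_into_chapters(100.0, 30.0):
    print(ch.index, ch.start_s, ch.end_s)
# 0 0.0 30.0
# 1 30.0 60.0
# 2 60.0 90.0
# 3 90.0 100.0
```

The chapter length plays the role of the value that "may be configured in the system or specified by the user".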
[0010] A "raw video" represents uncompressed digitized video. After
processing by a video encoder/compressor, raw video becomes
"compressed video". The term "transcoding" refers to the process of
converting a compressed video into a different type of compressed
video. Transcoding can involve, for example, the transforming of
compressed video into raw video, and compressing this raw video
into a different type of compressed video.
[0011] "Skimming" video, alternatively known as browsing, has been
a technical challenge for a long time. There are some techniques
commonly used to make skimming a full length video a more efficient
process:
[0012] a. Fast forwarding: fast forwarding (also known as
increasing the video playback speed) shortens the video viewing
time. However, speeding up the video rate distorts the video
information and may cause elimination of short events. This method
has been the most popular browsing technique so far. Fast
forwarding is discussed in more detail below.
[0013] b. Text Based Queries: This refers to a querying of metadata
associated with the full length video or video chapters for
specific textual information. For example, a text based query may
be in the form of "scene with George falling off the bridge". Text
based queries today require the video to be annotated, mostly a
manual process, before the video can be queried. Although
text-based video query has been in existence for a long time, only
a few applications can afford the intense human effort
needed to intelligently categorize and annotate the videos. One
example of video content that contains metadata which enables text
based queries is medical records used in some systems.
[0014] c. Automatic Indexing: In the academic literature [for
example, Cees G. M. Snoek and Marcel Worring, "Multimodal Video
Indexing: A Review of the State-of-the-art," Multimedia Tools and
Applications, Volume 25, Number 1/January, 2005, Springer],
techniques have been proposed to automatically index video for
browsing representations based on information within the video.
These indexing systems can use, for example, any of the following
information aspects to generate video chapters:
[0015] Motion of the video;
[0016] Scene changes;
[0017] Image statistics--such as color and shape;
[0018] Audio information; and/or
[0019] Specific object types in the video.
[0020] The prior art, for example, Michael A. Smith, "Video
Skimming and Characterization through the Combination of Image and
Language Understanding Techniques," p.775, 1997 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR
'97), 1997, discloses techniques based on one or more of the above
aspects that generate a so-called "video skim", a short synopsis
version (or a small understandable portion) of the original indexed
video. By viewing the video skim, a user may obtain an overview of
the content of the full length video in a comparatively short
period of time. A crude form of a video skim can, for example,
consist of images or short scenes representing video chapters of a
fixed duration. More complex skims may be generated using
previously available text based annotation or automatic indexing
techniques as mentioned above.
[0021] If a user becomes interested in one part of the skim, he/she
can opt to view the associated video chapter in its relevant parts
or in its entirety, and possibly at normal playback speed, without
having to view the rest of the full length video. The critical
aspect of creating a good video skim is context understanding,
which is the key to choosing the significant images and words that
should be included in the video skim. Today, when using any form of
automated video skim generation, it is unfortunately quite frequent
that a certain scene, in which a user may be interested, stays
unidentified by the skimming process. Nevertheless, with advances
in automated indexing techniques, video skims are becoming a useful
reality today, and will likely be of even higher interest in the
foreseeable future.
[0022] In summary, the automated context-sensitive generation of
video skims, despite the significant research conducted over the
past decade, has remained difficult, requiring high
computational complexity and involving human interaction such as
filtering and processing.
[0023] Once a video skim is generated, it often needs to be
presented to the user.
[0024] In its crudest form, video skim generation and presentation
can be performed simultaneously based on a full length video file
that is available on the user premises. A software application
running on the user's computer (or special purpose workstation), or
dedicated hardware-based processing, can use any of the techniques
mentioned below.
[0025] A) Video Skimming Using Raw Video: If the full length video
on the local storage is in a raw format, then the skimming can be
performed without the dedicated creation of a video skim, by simply
"fast forwarding" until the desired segment is found. The fast
forwarding process can be defined as "temporal sub-sampling", or
time domain sub-sampling of the frames of the video (e.g., skip
every other frame for a playback speed-up of two (i.e., a 50%
reduction in playback time), or show every fifth frame for a
speed-up of five (i.e., an 80% reduction in playback time),
etc.). For example, referring to FIG. 1, a full-length video (101)
contains even numbered (102, 104) and odd numbered (103, 105)
frames. Using temporal sub-sampling, a temporal sub-sampled sequence
(106) is created, in which only the even-numbered frames of the
full length video sequence are present (107, 108). This temporal
sub-sampling results in a 50% reduction in the playback time. The
speed-up can be varied by using other sub-sampling intervals. When
only every fifth frame (110, 111) of the original full length video
sequence is retained in the temporal sub-sampled sequence (109), the
reduction in playback time is 80%.
[0026] Other linear or non-linear sub-sampling factors can also be
used.
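The temporal sub-sampling of FIG. 1 and the resulting reduction in playback time can be sketched in Python as follows. The function names are hypothetical, and frames are modelled as list elements rather than decoded pictures. The sketch assumes the sub-sampled sequence is played back at the original frame rate, so keeping one of every `step` frames shortens playback by a fraction of 1 - 1/step.

```python
def temporal_subsample(frames: list, step: int) -> list:
    """Keep every step-th frame, starting with the first one
    (step=2 keeps frames 0, 2, 4, ..., as in sequence 106 of FIG. 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return frames[::step]


def playback_time_reduction(step: int) -> float:
    """Fraction by which playback time shrinks when only one of every
    `step` frames is shown at the original frame rate."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return 1.0 - 1.0 / step


full_length = list(range(20))                  # stand-in frame indices
print(temporal_subsample(full_length, 2)[:5])  # [0, 2, 4, 6, 8]
print(playback_time_reduction(2))              # 0.5
print(round(playback_time_reduction(5), 2))    # 0.8
```

Because discarded frames are simply never shown, the sketch also makes disadvantage (2) below concrete: an event lasting fewer than `step` frames can fall entirely between retained frames.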
[0027] Fast forwarding has at least three disadvantages:
[0028] (1) The search may still take a long time depending on where
the specific video segment of interest is located within the full
length video sequence (particularly if it is located towards the
end).
[0029] (2) The video segment of interest may be made unnoticeable
or totally lost during sub-sampling as it may fall on the deleted
frames (especially when large sub-sampling intervals are in
use).
[0030] (3) The associated audio information, if any, often cannot
be meaningfully presented.
[0031] Another method for skimming raw video is to subdivide the
full length raw video into video chapters (which may be of
preconfigured lengths or configured in real-time by the user), and
allow the user to view more than one video chapter in parallel
using separate small windows for each chapter. This process may
require "spatial sub-sampling" to reduce the resolution of the
original video to fit into smaller windows, because of display size
limitations as illustrated in FIG. 2. The full length video (201)
may include pictures of a certain spatial resolution, illustrated
here by the spatial size of one of the pictures in this sequence
(202). In one very simple form, spatial sub-sampling may be, for
example, averaging the brightness and/or color component values of
four spatially adjacent pixels (203) so as to create a single
sub-sampled pixel (204). Doing so for all pixels of all pictures of
the full-length video (201) yields a series of sub-sampled pictures
(205) at the same frame rate as the full-length video, but at half
of the spatial resolution in each dimension.
[0032] This approach has at least two specific disadvantages:
[0033] (1) Performing spatial sub-sampling in real-time to generate
smaller versions of the full-length video is computationally
intensive and time consuming. Depending on how many windows are
generated and the size of each window, the sub-sampling may require
significant computing resources.
[0034] (2) Information may be lost during sub-sampling due to
side effects of spatial sub-sampling such as filtering or
aliasing.
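The simple averaging of four spatially adjacent pixels described in [0031] can be sketched in pure Python as below. For brevity, a picture is assumed to be a list of rows of integer samples for a single colour component, and the function names are hypothetical. The nested loops visit every sample of every picture, which illustrates why doing this in real time for several windows is computationally intensive (disadvantage (1)); a plain 2x2 average is also only a crude low-pass filter, which relates to the filtering and aliasing effects of disadvantage (2).

```python
def downsample_2x2(picture: list[list[int]]) -> list[list[int]]:
    """Halve the resolution of one colour component in each dimension by
    averaging each block of four adjacent samples (rounded to nearest).
    Assumes an even width and height."""
    height = len(picture)
    width = len(picture[0]) if height else 0
    if height % 2 or width % 2:
        raise ValueError("width and height must be even")
    out = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            total = (picture[y][x] + picture[y][x + 1]
                     + picture[y + 1][x] + picture[y + 1][x + 1])
            row.append((total + 2) // 4)
        out.append(row)
    return out


def downsample_video(pictures: list[list[list[int]]]) -> list[list[list[int]]]:
    """Apply the 2x2 averaging to every picture; the frame rate is unchanged."""
    return [downsample_2x2(p) for p in pictures]


luma = [[0, 4, 8, 8],
        [4, 8, 8, 8]]
print(downsample_2x2(luma))  # [[4, 8]]
```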
[0035] B) Video Skimming Using Compressed Video: If the full length
video is in a compressed format (for example, the full length video
is compliant with video compression standards such as ITU-T Rec.
H.264 or other video compression standards), then additional
factors come into play. Using a compressed format eliminates the
need for the very large uncompressed video files and is, therefore,
in most cases, advantageous. However, the compressed video file
cannot be temporally sub-sampled arbitrarily, as compressed frames
may depend on other frames due to inter-picture prediction. Unless
the entire video is decoded, only independently decodable
instantaneous decoding refresh (IDR) frames can be used in fast
forwarding. If there are no IDR frames or if their frequency is low,
then fast forwarding will not be feasible without decoding a large
percentage of the coded pictures of the full length video sequence.
One way of remedying this may be to transcode the compressed video
with more frequent IDR frames. The disadvantages of this approach
are as follows:
[0036] (1) It may be time consuming and/or computationally
expensive.
[0037] (2) With an increase of the number of IDR frames, the
compression ratio decreases. The transcoded full length sequence
with a higher number of IDR frames may be significantly larger than
the original compressed full length sequence.
[0038] (3) The disadvantages of fast forwarding with raw files
still remain.
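The constraint that, without full decoding, only IDR frames can be shown during fast forwarding can be illustrated with the Python sketch below. The coded sequence is modelled as a string with one character per frame ('I' for an IDR frame, any other character for a predicted frame); this encoding and the function name are hypothetical. For each target position of a nominal fast-forward step, the most recent IDR frame is shown, so when IDR frames are sparse, many targets collapse onto the same picture.

```python
def idr_only_fast_forward(frame_types: str, step: int) -> list[int]:
    """Return the indices of the frames that can be shown when fast
    forwarding by `step` without decoding predicted frames: for every
    step-th target frame, the most recent IDR frame at or before it.
    Consecutive repeats are dropped."""
    if step < 1:
        raise ValueError("step must be >= 1")
    shown: list[int] = []
    last_idr = None
    for i, frame_type in enumerate(frame_types):
        if frame_type == "I":
            last_idr = i
        is_target = i % step == 0
        if is_target and last_idr is not None and (not shown or shown[-1] != last_idr):
            shown.append(last_idr)
    return shown


# IDR every 5 frames: all four targets of a 5x fast forward are reachable.
print(idr_only_fast_forward("IPPPP" * 4, 5))       # [0, 5, 10, 15]
# IDR every 10 frames: only two distinct pictures can be shown.
print(idr_only_fast_forward("IPPPPPPPPP" * 2, 5))  # [0, 10]
```

Making the second case behave like the first requires more frequent IDR frames, which is exactly the transcoding trade-off listed in (1) and (2) above.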
[0039] Another method for skimming compressed video is to subdivide
the compressed video into video chapters and view each chapter in
parallel in separate small windows (as described for skimming raw
video). The process of sub-dividing a compressed file suffers from
similar disadvantages as temporal sub-sampling. Further, using
traditional video compression technologies, spatial sub-sampling is
not possible in the compressed domain. In order to generate the
required spatially smaller video sequences, a transcoding step may
be necessary, with the spatial sub-sampling being performed after
the decompression of the original video data and before the
re-compression. Moreover, although the use of a compressed video
file eliminates the disadvantage of storing a large file, the need
to decode the file several times in real-time introduces
significant additional cost and processing complexity to spatial
re-sampling.
[0040] If the full length video file, be it in raw or compressed
format, is not available locally, the problem of video skimming
according to the described techniques is further exacerbated by the
need to retrieve it in real-time to a local computer over a network
like the public Internet. Particularly, if the file is in raw
format, then the bandwidth requirements are impractically large
(e.g., 45 Mbps for a reasonable speed download of an SDTV
resolution sequence). Accordingly, given the issues of using raw
and compressed video, and using temporal and spatial sub-sampling,
there has not been an acceptable implementation of a practical
real-time video skimmer in the market place.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] FIG. 1 illustrates temporal sub-sampling as known in the
prior art.
[0042] FIG. 2 illustrates spatial sub-sampling as known in the prior
art.
[0043] FIG. 3 is an exemplary video display screen of the video
skimmer in accordance with the present invention.
[0044] FIG. 4 is a block diagram illustrating an exemplary system
of a standalone video skimmer in accordance with the present
invention.
[0045] FIG. 5 is a block diagram illustrating an exemplary system
with a client-server video skimmer architecture in accordance with
the present invention.
[0046] FIG. 6 is a block diagram illustrating an exemplary
standalone video skimmer in accordance with the present
invention.
[0047] FIG. 7 is a block diagram illustrating an exemplary
client-server video skimmer in accordance with the present
invention.
[0048] FIG. 8 is an exemplary method flow chart for controlling mini
browsing window (MBW) configurations in accordance with the present
invention.
[0049] FIG. 9 is an exemplary method flow chart for messaging
between skimming control logic client and server in accordance with
the present invention.
[0050] Throughout the drawings, the same reference numerals and
characters, unless otherwise stated, are used to denote like
features, elements, components or portions of the illustrated
embodiments. Moreover, while the disclosed invention will now be
described in detail with reference to the figures, it is done so in
connection with the illustrative embodiments.
DESCRIPTION
[0051] A video skimmer according to the present invention is a
system that simultaneously displays multiple chapters of a full
length video file, which may appear to the user as if they were
spatially and/or temporally sub-sampled, by using a video file that
has been compressed using a layered (also known as scalable)
encoder. The video file may be indexed or un-indexed. According to
the invention, the skimming process may require neither transcoding
nor sub-sampling in the temporal or spatial dimension. The system
works efficiently when the full length video file is available
locally, or remotely and accessible only over a network, for
example the Internet.
[0052] According to the invention, the video may be compressed
using a layered codec, such as the one disclosed in ITU-T
Recommendation H.264 Annex G (also known as SVC). In order to take
full advantage of the invention, the scalable video bitstream that
is stored, among other things, in the full length video file,
should contain at least one low resolution version of the video
content, advantageously as a base layer. The low resolution version
can also be stored in the form of a base layer and one or more
enhancement layers; in that case, the combination of base and
enhancement layers, after decoding, still yields the low
resolution. The resolution can be chosen such that it is suitable,
after decoding, for display in a mini browsing window (MBW) of
the video skimmer display. An MBW can be smaller in spatial size
than a full window, which can be optimized for viewing the full
resolution video. Full resolution video may be obtained by decoding
the base layer plus at least one more enhancement layer than is
required for the lower resolution. The sizes of the full window and any MBWs
can be chosen by the user according to his/her user preferences.
The system can include a user interface that can display many MBWs,
and each MBW can display a specific video chapter of the full
length video. The user interface can also allow the user to set
his/her user preferences, for example, number of MBWs, size of each
MBW, start time or duration of each video chapter, assignment of
chapters to MBWs, and so forth.
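The choice of which layers to decode for a given MBW can be sketched as picking, among the decodable layer combinations ("operating points") of the file, the highest resolution whose decoded picture still fits inside the window. This is a minimal sketch under stated assumptions: the `OperatingPoint` descriptor (a tuple of layer IDs plus the decoded width and height) is hypothetical, and a real H.264/SVC bitstream signals its layer structure in its own syntax, which is not parsed here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One decodable subset of a layered bitstream (hypothetical descriptor)."""
    layer_ids: tuple[int, ...]  # layers to decode, base layer first
    width: int                  # decoded picture width in pixels
    height: int                 # decoded picture height in pixels


def select_operating_point(points: list[OperatingPoint],
                           mbw_width: int, mbw_height: int) -> OperatingPoint:
    """Return the highest-resolution operating point that fits the window.

    If no operating point fits, fall back to the lowest resolution one so the
    skimmer still has something to show (it would then have to scale down).
    """
    ordered = sorted(points, key=lambda p: p.width * p.height)
    fitting = [p for p in ordered
               if p.width <= mbw_width and p.height <= mbw_height]
    return fitting[-1] if fitting else ordered[0]
```

With a base layer at 320x180 and enhancement layers reaching 640x360 and 1280x720, a 400x300 MBW would decode only the base layer, while a full screen window would decode all three layers; no spatial sub-sampling is needed in either case.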
[0053] The term "codec" is equally used herein to describe
techniques for encoding and decoding, and for implementations of
these techniques. An encoder converts input media data into a
bitstream or a packet stream, and a decoder converts an input
bitstream or packet stream into a media representation suitable for
presentation to a user, for example digital or analog video ready
for presentation through a monitor, or digital or analog audio
ready for presentation through loudspeakers. A transcoder converts
an input bitstream or packet stream compressed using one compression
technique into its original media representation suitable for
presentation to a user, and then re-encodes that representation into
an output bitstream or packet stream using another compression
technique. Encoders and decoders can be dedicated hardware devices
or building blocks of a software-based implementation running on a
general purpose CPU.
[0054] Set-top-boxes and personal computers (PCs) can be built such
that many encoders or decoders may run in parallel or
quasi-parallel. For hardware encoders or decoders, one way to
support multiple encoders/decoders is to integrate multiple
instances of them in the set-top-box or PC. For software
implementations, similar mechanisms can be employed.
[0055] Traditional video codecs used in video distribution systems
provide only a single bitstream at a given bitrate, with no layers.
As explained above, when a lower temporal or spatial resolution is
required from a full length video file (such as for fast forwarding
or for display at a smaller spatial size in an MBW), the full
resolution file must first be decoded to regenerate the raw
(uncompressed) video, which then needs to be sub-sampled in the
temporal and/or spatial dimension, as the case may be, to produce a
lower spatio-temporal resolution appropriate for the MBW. This
process wastes time and computational resources, and also
significant bandwidth if the full length video file is in a remote
location and needs to be transported over a network. Native support
for lower resolutions, as provided by layered coding, is therefore
beneficial in the video skimmer: it enables the simultaneous display
of many video chapters without consuming processing time and power
to generate the lower resolution versions. The network bandwidth
required to transport video for many MBWs may also be advantageously
minimized.
[0056] In one embodiment, a skimmer may support "spatial skimming".
A full length video file available in a layered encoded format may
readily carry a low resolution version of the actual video content,
which may fit into MBWs of the video skimmer system without further
spatial sub-sampling after decoding. The skimmer may simultaneously
display more than one MBW, each showing a different chapter. Once
the user identifies the scene of interest in an MBW, he/she may
enlarge the video of that chapter by clicking on the MBW. As a
result, the skimmer can request and receive information that enables
the skimmer to present to the user a high resolution version of the
video content, as disclosed in the co-pending U.S. patent
application entitled "Systems, Methods and Computer Readable Media
for Instant Multi-Channel Video Content Browsing in digital Video
Distribution Systems", concurrently filed herewith.
[0057] In the same or another embodiment, a video skimmer can
support temporal skimming. A full length video file available in a
layered encoded format may readily carry a temporally sub-sampled
lower layer. The skimmer may disregard the timing information in
the lower layer and present the video as fast forward video. For
example, if the full length video were originally available at 30
fps, and the temporally sub-sampled lower layer is available at 10
fps, the skimmer may display the 10 fps lower layer at 30 fps,
thereby speeding up playback by a factor of 3. Once the user clicks
on the MBW presenting the fast forward video, the skimmer may
display the MBW's content at original speed (in the example, by
slowing the playback of the lower layer down to 10 fps). It may
further request and receive temporal scalable enhancement layers
that enable full temporal resolution of the MBW's content.
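The timing arithmetic of temporal skimming can be sketched as follows: each frame of the sub-sampled layer keeps its content time (frame index divided by the layer's frame rate), but is presented at the display frame rate, so the apparent speed-up is the ratio of the two rates. This is a minimal sketch assuming idealized, evenly spaced frames; a real implementation would take content times from the bitstream or container timestamps instead of computing them from the frame index.

```python
from fractions import Fraction


def speedup(layer_fps: int, display_fps: int) -> Fraction:
    """Playback speed-up when a layer captured at layer_fps is shown at display_fps."""
    return Fraction(display_fps, layer_fps)


def fast_forward_schedule(num_frames: int, layer_fps: int,
                          display_fps: int) -> list[tuple[Fraction, Fraction]]:
    """Map each frame of a sub-sampled layer to (content time, display time) in seconds.

    Content time follows the layer's own frame rate; display time ignores it
    and shows the frames back-to-back at display_fps.
    """
    return [(Fraction(i, layer_fps), Fraction(i, display_fps))
            for i in range(num_frames)]
```

In the example above, `speedup(10, 30)` is 3: after one second of display time, frame 30 of the 10 fps layer is on screen, which corresponds to 3 seconds of content. Returning to original speed amounts to setting the display rate equal to the layer rate, giving a speed-up of 1.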
[0058] Video skimming advantageously uses a lower resolution
(spatial and/or temporal) version of the video content of several
video chapters, so that these chapters fit into more than one MBW.
The user may view several MBWs simultaneously, and may assign
specific video chapters to these MBWs. The video chapters may be
generated by any of the options discussed before.
[0059] In the same or another embodiment, video chapters may be the
result of subdividing the full length video into video chapters of
a given length. For example, the first 40 minutes of the full
length video may be divided sequentially into four 10-minute video
chapters, which may be displayed in 4 MBWs.
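Fixed-length chapter assignment, including moving on to the next block of chapters as described in the following paragraph, can be sketched as a simple paging computation: page 0 covers the first `num_mbws * chapter_len_s` seconds, page 1 the next span, and so on. This is a minimal sketch; the function name and the `page` parameter are hypothetical and not part of the described system.

```python
from typing import List, Optional, Tuple


def assign_chapters(num_mbws: int, chapter_len_s: int, page: int = 0,
                    video_len_s: Optional[int] = None) -> List[Tuple[int, int, int]]:
    """Assign consecutive fixed-length chapters to MBWs.

    Returns (mbw_index, start_s, end_s) triples. If the video length is known,
    the last chapter is clipped to it and MBWs past the end are left empty.
    """
    span = num_mbws * chapter_len_s
    result = []
    for mbw in range(num_mbws):
        start = page * span + mbw * chapter_len_s
        end = start + chapter_len_s
        if video_len_s is not None:
            if start >= video_len_s:
                break  # no content left for the remaining MBWs
            end = min(end, video_len_s)
        result.append((mbw, start, end))
    return result
```

With 4 MBWs and 600-second chapters, page 0 yields minutes 0-10, 10-20, 20-30 and 30-40; page 1 yields minutes 40-80 in the same arrangement.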
[0060] In the same or another embodiment, the user may decide to
switch the assignment of video chapters to MBWs during the skimming
process (e.g., to assign each 10-minute interval of the next 40
minutes of the video sequentially to the 4 MBWs). An exemplary user
interface with 4 MBWs skimming a 40-minute full length video in
only 10 minutes is shown in FIG. 3. The first MBW (301) on the
screen (305) displays the first 10 minutes of the full length
video. The second and third MBWs (302, 303) on the left side of
the screen display minutes 10-20 and 20-30 of the full length
video, respectively. As disclosed in the co-pending U.S. patent
application entitled "Systems, Methods and Computer Readable Media
for Instant Multi-Channel Video Content Browsing in digital Video
Distribution Systems", concurrently filed herewith, MBWs can be of
different shapes and sizes. Accordingly, the MBW on the right, which
displays minutes 30-40 of the full length video, is twice the size
of the other three MBWs. A person skilled in the art can easily
construct other screen layouts with more or fewer MBWs of different
sizes, covering different parts of the full length video.
DETAILED DESCRIPTION OF THE INVENTION
[0061] FIG. 4 shows a standalone video skimmer (401) with an
attached display (402). The video skimmer can receive video content
from a variety of sources: live feed video content, for example
from a camera (403) connected to the video skimmer through
interface (404); video content from a DVD (405) attached to the
video skimmer through interface (406); or even in the form of a
full length video file from an external and/or remote video
database (407). The external remote video database can be located
on the Internet (409) or another suitable network and accessed
through network interfaces (410, 411). The MBWs are presented on
display (402) (as depicted in FIG. 3). The video skimmer logic is
part of the video skimmer (401). The video skimmer can be
implemented based on a general purpose computer, e.g., a PC or a
standalone computer, or on some other type of hardware, such as a
set-top box in an IPTV environment, where the set-top box may be
attached to a suitable network (409) such as the Internet.
[0062] If a video file is retrieved from a remote database, a
well-known file transfer protocol such as FTP (RFC 959, available
at http://www.faqs.org/rfcs/rfc959.html) may be used to transfer the
video over the Internet (409) (or any other suitable network) over
links (410, 411).
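An FTP download of a full length video file can be sketched with Python's standard `ftplib` module, whose `FTP.retrbinary` method issues a `RETR` command and passes each received block to a callback. This is a minimal sketch: the function name `download_video` is hypothetical, and connection setup, authentication, and error handling are left to the caller.

```python
import ftplib


def download_video(ftp: ftplib.FTP, remote_name: str, local_path: str) -> int:
    """Fetch a full length video file over an open FTP session in binary mode.

    Returns the number of bytes written to local_path.
    """
    written = 0
    with open(local_path, "wb") as out:
        def on_block(block: bytes) -> None:
            nonlocal written
            out.write(block)
            written += len(block)

        # RETR transfers the file in binary (image) mode, block by block.
        ftp.retrbinary(f"RETR {remote_name}", on_block)
    return written
```

A caller might open the session with `ftplib.FTP(host)` used as a context manager and call `login()` before passing it in; the host name would depend on where the remote video database (407) resides.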
[0063] A set-top box can serve as the hardware for the video
skimmer (401). A TV connected to the set-top-box can be used as the
display (402). The set-top-box translates the data received from the
network (409) into a signal format the TV understands;
traditionally, a combination of analog audio and video signals is
used, but more recently all-digital interfaces (such as HDMI) have
become common. The set-top-box (on the TV side) therefore typically
includes analog or digital audio/video outputs and interfaces.
[0064] Internally, a set-top-box can have a hardware architecture
similar to a general purpose computer: a central processing unit
(CPU) executes instructions stored in Random Access Memory (RAM) or
read-only-memory (ROM), and utilizes interface hardware to connect
to the network interface and to the audio/video output interface,
as well as an interface to a form of user control (often in the
form of a TV remote control, computer mouse, keyboard, or other
input device), all under the control of the CPU. A set-top-box may
also include one or more accelerator units (for example dedicated
Digital Signal Processors, DSP) that may help the CPU with
computationally complex tasks of video decoding and video
processing, among others. Those units are typically present for
reasons of cost efficiency, rather than for technical
necessity.
[0065] General purpose computers can often be configured to act
like a set-top-box. In some cases, additional hardware needs to be
added to the general purpose computer to provide the interfaces a
typical set-top-box contains, and/or additional accelerator
hardware to augment the CPU for video decoding and processing.
[0066] The programmable parts of set-top-boxes, PCs, and other
devices suitable as the basis of a video skimmer may require
instructions, which may be supplied by a computer readable medium
(408).
[0067] The set-top-box or general purpose computer may run under an
operating system such as Windows. The video skimmer advantageously
uses an operating system that allows the simultaneous display of
more than one motion video in on-screen windows.
[0068] Referring to FIG. 5, the internal architecture of a
standalone video skimmer is now described. There are several
options for video inputs to the video skimmer, as follows.
[0069] (a) Live video content may be fed from the camera (501),
which attaches to a live video interface (502) through connection
(503). The video interface may connect to a frame capture unit
(504), which may generate video frames that are fed into a layered
encoder (505).
[0070] (b) Alternatively, the video may be obtained as a file
download from a remote video database (506) attached to a network,
such as the Internet (507), in which case the file download may be
received through a network connection and a file download interface
(508). If the video database (506) delivers the video in an
appropriate layered format then it can be processed directly (509).
Otherwise the video may be transcoded into a layered encoded format
by a transcoder (510).
[0071] (c) In the same or another embodiment, video content (in
uncompressed format, or in a compressed format that may not be a
layered format) may be received from a DVD or a similar storage
medium such as a memory stick (511), in which case it is received
on a digital video storage interface (512). If the received video
is not in layered encoded format, it may first be transcoded in
transcoder (513) into layered encoded format; otherwise, it may be
processed directly (514).
[0072] Regardless of the input mechanism described above, the
video may be stored as a full length video file in layered encoded
format in the local video database (515).
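The three input options (a) to (c) converge on the same local layered database, and the routing decision can be sketched as below. This is a minimal sketch: the `Source` enumeration and the step labels are hypothetical names for the blocks of FIG. 5, not an interface defined by the description.

```python
from __future__ import annotations

from enum import Enum, auto


class Source(Enum):
    LIVE_CAMERA = auto()      # option (a): camera (501)
    REMOTE_DOWNLOAD = auto()  # option (b): remote video database (506)
    STORAGE_MEDIUM = auto()   # option (c): DVD or memory stick (511)


def ingest_path(source: Source, is_layered: bool) -> list[str]:
    """Processing steps before the content reaches the local video database (515).

    For a live camera, is_layered is ignored: raw frames are always encoded.
    """
    if source is Source.LIVE_CAMERA:
        steps = ["frame_capture", "layered_encoder"]
    elif is_layered:
        steps = []  # already in layered format: processed directly
    else:
        steps = ["transcoder"]  # convert to layered encoded format first
    return steps + ["local_video_database"]
```

The point of the structure is that everything downstream (extractor, decoder, MBW display) only ever sees layered encoded files, whatever the original source.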
[0073] A video extractor (516) may be responsible for retrieving a
video bitstream from the video database (515) according to control
commands it may receive from the skimming control logic (517). The
skimming control logic (517) may, for example, indicate to the video
extractor the MBW size and the beginning and ending time markers of
the video for that MBW; in response, the video extractor may find
the corresponding layered encoded bitstream in the
database.
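The extractor's job can be sketched as pure filtering: keep the coded units whose time lies between the begin and end markers and whose layer is needed for the requested MBW size, with no decoding or re-encoding. This is a minimal sketch under simplifying assumptions: the `Unit` record is hypothetical, layers are modelled as a single linear hierarchy (real SVC distinguishes dependency, quality and temporal identifiers), and a real extractor would also have to start at a random access point so that decoding can begin.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One coded unit of a layered bitstream (hypothetical, simplified)."""
    time_s: float   # presentation time of the picture this unit belongs to
    layer_id: int   # 0 = base layer, higher values = enhancement layers
    payload: bytes


def extract(units: list[Unit], max_layer: int,
            begin_s: float, end_s: float) -> list[Unit]:
    """Keep units inside [begin_s, end_s) whose layer is at most max_layer."""
    return [u for u in units
            if begin_s <= u.time_s < end_s and u.layer_id <= max_layer]
```

Here `max_layer` would come from the MBW size indicated by the skimming control logic, and `begin_s`/`end_s` from the chapter's time markers.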
[0074] The extracted video may be displayed in the MBW on display
(519) after decoding by a layered decoder (518). A user interface
(520) may send appropriate display commands to properly display the
video. User input commands may be received through a user interface
device (521) (e.g., a keyboard, mouse or remote control device) and
translated by the user interface (520) into proper display commands,
whose effect may be shown on the display (519). The user interface
(520) may also send the commands to the skimming control logic (517)
for further processing. Such user commands can, for example,
include: (a) selection of the size and display location of an MBW,
(b) a click or double click on an MBW that may result in a request
to receive the corresponding audio or full resolution video, and (c)
entry of user-desired skimming parameters such as index markers for
video chapters.
[0075] The skimming control logic (517) may receive user commands
and may translate them into appropriate actions for the video
extractor (516), thereby enabling the video extractor (516) to
extract only those video bits required for proper display.
[0076] An alternative implementation of a video skimmer may follow
a client-server architecture as illustrated in FIG. 6. The video
skimmer may be divided into two components: the video skimmer
client (601) and the video skimming server (602). The video skimmer
client (601) and display (603) are located at the user's premises.
The video skimmer client (601) may be connected to the video
skimming server (602) over a suitable network, such as the Internet
(604). The video skimming server (602) may reside at any suitable
location in the network and not necessarily in the user's premises.
A local video database (605) may advantageously be co-located with
the video skimming server (602).
[0077] The video skimming server may also serve non-co-located but
network-attached databases, such as a remote video database (606),
via a suitable network such as the Internet (604). However, if the
video file is located in a remote video database, that video file
may advantageously be downloaded to local video database (605) in
its entirety before starting the skimming process, because the
video extraction logic for skimming resides within the video
skimming server (602).
[0078] A single video skimming server may serve many remote video
databases and many video skimmer clients simultaneously. With the
separation of the client and server, the video skimming service
(server) may become a business for a service provider which can
offer it to many subscribers, each subscriber deploying the client
component for skimming.
[0079] Although, for simplicity, only a single client and a single
server are shown in FIG. 6, the present invention also envisions
distributed architectures of the skimming server. Similarly, one or
more video skimming clients can run simultaneously on a single
client computer.
[0080] The video skimmer server may be responsible for extracting
the user-requested video file from a local video database, and may
send layered encoded video chapters according to a user's requests
across the Internet to the client. Note that the video chapters
displayed in the MBWs may be sent using only those layers required
for proper decoding and display in an MBW without spatial or
temporal sub-sampling, thereby significantly reducing the network
bandwidth compared to the transmission of all layers required for
decoding and display of the video in a main window (at full
resolution).
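The bandwidth saving can be illustrated with simple arithmetic: the bitrate of a layered stream is the sum of the bitrates of the layers actually sent. The per-layer bitrates below are illustrative numbers chosen for the example, not measurements of any real encoder.

```python
def stream_bitrate_kbps(layer_rates_kbps: list[int], layers_sent) -> int:
    """Total bitrate of a layered stream when only the given layers are sent."""
    return sum(layer_rates_kbps[i] for i in layers_sent)


# Illustrative assumption: a base layer sized for an MBW plus two enhancement
# layers that are needed only for full resolution display in the main window.
LAYER_RATES_KBPS = [300, 700, 1500]
NUM_MBWS = 4

mbw_only = NUM_MBWS * stream_bitrate_kbps(LAYER_RATES_KBPS, [0])
all_layers = NUM_MBWS * stream_bitrate_kbps(LAYER_RATES_KBPS, range(3))
print(f"MBW layers only: {mbw_only} kbps, all layers: {all_layers} kbps")
```

Under these assumed rates, four MBWs need 1200 kbps instead of 10000 kbps; the actual ratio depends entirely on how the layers of a given file were encoded.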
[0081] In FIG. 7, the detailed architecture of the video skimming
server (701) is shown. The video skimming server (701) can include,
for example, a skimming control logic server (SCLS) (702) which can
communicate with the corresponding skimming control logic client
(SCLC) (704) in the video skimming client (703). The SCLC can, for
example, specify the MBW layout (e.g., each MBW location and size,
or number of MBWs) at the user's endpoint, and video chapter
definitions when the video chapters are not indexed (e.g., video
skimming start time, and/or video chapter lengths, or any other
information that allows for suitable indexing). If the full length
video is already indexed, then the server can have the capability
to send indexed video chapters, and can also send the metadata
associated with each indexed video chapter. If a skimmed version of
the full length video is available at the server, then the server
can have the capability to send the skims and the metadata
associated with each skim.
[0082] The SCLS (702) can serve, and can be controlled by, many
SCLCs (704) simultaneously.
[0083] One purpose of the SCLC (704) is to translate user input
into a protocol format that the SCLS (702) understands. For
example, when the user clicks on an MBW to view its content as
high-resolution video in the main window, the SCLC (704) can send a
request to the SCLS (702) to start sending, for the video chapter
associated with the clicked MBW, all enhancement layers needed for
decoding and display in the main window in high resolution, in
addition to the layers for decoding and display in MBWs (MBW
layers). Meanwhile, the server can continue to send the MBW layers
of the video chapters being decoded and displayed in the other
MBWs.
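The effect of such a click on what the server sends can be sketched as a per-chapter layer set: the clicked chapter receives every layer, while all other chapters keep receiving only the MBW layers. This is a minimal sketch; the function name and the representation of layers as sets of integer IDs are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def layers_to_send(chapters: list[str], clicked: Optional[str],
                   mbw_layers: set[int], all_layers: set[int]) -> dict[str, set[int]]:
    """Layer set the server streams for each displayed chapter.

    The chapter whose MBW was clicked gets all layers (MBW layers plus the
    enhancement layers for the main window); the others keep the MBW layers.
    """
    return {c: set(all_layers) if c == clicked else set(mbw_layers)
            for c in chapters}
```

The request the SCLC sends on a click then only has to name the chapter and the additional layers, i.e. `all_layers - mbw_layers`.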
[0084] The video to be skimmed may advantageously reside in a local
video database (705). If the video file is in a remote database
(706), then the file can be retrieved and placed in the local video
database (705). If the file in the remote database (706) is in a
suitable layered encoded format, it can be deposited directly in the
local video database (705). If the video file is encoded using
another type of compression technology (including, for example,
possibly loss-less compression) or is in an uncompressed format, it
can first be transcoded in transcoder (707) and then deposited in
the local video database (705).
[0085] The video skimming server may reside in one or more
computers. The video skimming client may reside in a PC or a
standalone general purpose computer, or it may be an IPTV set-top
box.
[0086] If the user endpoint is a set-top-box, it can use, for
example, a TV as the display (708) to display the MBWs.
[0087] User commands for video skimming may be received through a
user input device (709) (e.g., mouse, keyboard or remote control)
and translated by the user interface (710) into information
displayed on display (708). When a user selects a video file for
skimming and the skimming configuration (e.g., number of MBWs,
assignment of video chapters to MBWs), the user interface (710) can
send these requests to the SCLC (704), which, in turn, may send
appropriate skimming messages to the SCLS (702). The SCLS (702) can
instruct the video extractor (711) to extract the appropriate layers
of the encoded video stored in the local video database (705).
The video extractor (711) can extract the bitstreams and send them
to the streaming server (712). The streaming server can send
streamed bitstreams using protocols such as RTP (RFC 3550,
available from http://www.faqs.org/rfcs/rfc3550.html). On the
client side, a streaming client (713) can receive the RTP packets,
extract the bitstreams, and send the bitstreams to the layered
decoder (714), which can decode the bitstreams into raw format ready
for display. The user interface (710) can collaborate with the SCLC
(704) to assign the bitstreams decoded by the layered decoder (714)
to the appropriate MBWs on the user's display (708).
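On the client side, the first thing the streaming client (713) does with each RTP packet is strip the fixed header defined in RFC 3550, Section 5.1 (version, padding and extension flags, CSRC count, marker bit, payload type, sequence number, timestamp, SSRC). The sketch below parses that header and returns the payload. It is a minimal sketch, not a complete RTP stack: header extensions are skipped rather than interpreted, and the payload would still have to be depacketized according to the RTP payload format of the layered codec, which is not shown.

```python
import struct
from dataclasses import dataclass


@dataclass
class RtpPacket:
    version: int
    padding: bool
    extension: bool
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: list
    payload: bytes


def parse_rtp(packet: bytes) -> RtpPacket:
    """Parse the RTP fixed header (RFC 3550, Section 5.1) and return the payload."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte RTP fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6
    if version != 2:
        raise ValueError(f"unsupported RTP version {version}")
    padding = bool(b0 & 0x20)
    extension = bool(b0 & 0x10)
    csrc_count = b0 & 0x0F
    offset = 12 + 4 * csrc_count
    if len(packet) < offset:
        raise ValueError("truncated CSRC list")
    csrcs = list(struct.unpack(f"!{csrc_count}I", packet[12:offset]))
    if extension:
        if len(packet) < offset + 4:
            raise ValueError("truncated header extension")
        # 16 profile-defined bits, then the extension length in 32-bit words.
        _, ext_words = struct.unpack("!HH", packet[offset:offset + 4])
        offset += 4 + 4 * ext_words
    end = len(packet)
    if padding:
        # The last octet counts the padding octets, including itself.
        end -= packet[-1]
    if end < offset:
        raise ValueError("inconsistent extension or padding length")
    return RtpPacket(version, padding, extension, bool(b1 & 0x80), b1 & 0x7F,
                     seq, ts, ssrc, csrcs, packet[offset:end])
```

The sequence number lets the streaming client detect loss and reordering, and the timestamp lets the layered decoder (714) and user interface (710) present pictures of each video chapter at the right time in its MBW.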
[0088] The SCLC (704) can communicate with its server component
SCLS (702) to specify:
[0089] a. set or change MBW configuration (e.g., number or
alignment of MBWs, size of MBWs, location of MBWs on the
display)
[0090] b. select video chapters (e.g., un-indexed, indexed, or
skimmed)
[0091] c. configure video chapters (e.g., video chapter start
times, lengths, chapter location in video file, mapping to MBWs,
etc.)
[0092] d. MBW video controls (e.g., dump content to main window in
higher resolution; receive audio, pause/restart/stop video).
[0093] In summary, the video skimming client (703) can include, for
example, the following functionalities, although others could be
added:
[0094] a. Receive and process streamed media;
[0095] b. Decode streams of video chapters for display;
[0096] c. Display video chapters in MBWs;
[0097] d. Control skimming logic (e.g., user selection of MBWs,
video chapters, start-stop type video controls, and other
controls); and/or
[0098] e. Receive user commands from user interface devices.
[0099] In summary, the video skimming server (701) can include, for
example, the following functionalities, although others could be
added:
[0100] a. Encode and transcode remote video for storage in the local
video database;
[0101] b. Extract appropriate bitstreams from the video database
for display in an MBW;
[0102] c. Receive, process and react to commands from the skimming
control logic client; and/or
[0103] d. Stream and send video towards client applications.
[0104] As shown in FIG. 8, in the same or another embodiment,
certain user activities, recognized by the user interface, can lead
to certain actions of the skimming control logic (either the local
SCLC or the logic that is distributed between the SCLS and the
SCLC). The user activities can be listed in the form of a menu
structure. The top level menu can be invoked by pressing a "Menu"
button (821) or by a similar user input activity.
[0105] On its top level menu, the user interface can offer
different pull-down menus to input various user preferences, for
example, to: save user settings (801), restore user settings (802),
select all MBWs (803), select one MBW (804), select skimming mode
(805), and/or cancel (806). By selecting to cancel (806), the user
closes the top level menu and any open sub-menus without otherwise
changing the state of the system. If the user selects any of the
other items, he/she is presented with a sub-menu as follows:
[0106] If the user selects to save user settings (801), he/she is
presented with related sub-menu choices, for example, to: save MBW
configuration (808), save default skimming configuration (807),
and/or cancel (809). Electing to cancel (809) closes the sub-menu
without any change in state and returns to the main menu. Selecting
to save default skimming configuration (807) saves, possibly after a
confirmation, the current skimming configuration as a default. The
skimming configuration can include aspects such as the length of
each chapter (e.g., a uniformly assigned length such as 10 minutes,
a length determined by hints or context, for example based on
metadata included in the full length video indicating different
scenes, or a length determined by any other suitable method).
Selecting to save MBW configuration (808) saves the current MBW
configuration as a default. The MBW configuration is determined
through the two main menu items discussed next.
[0107] Selecting the main menu item to restore user settings (802)
restores the user settings previously saved using the menu item for
saving user settings (801).
[0108] If the user chooses to select all MBWs (803), he/she can set
properties related to all MBWs currently being displayed. Sub-menu
items can include, for example, to: close all MBWs (810), arrange
all MBWs on screen (811) (which distributes the MBWs evenly over
the available screen area), resize all MBWs (812) (which allows
the user to set the size of all MBWs), and cancel (813).
[0109] If the user chooses to select one MBW (804), he/she first
selects the MBW to which the subsequent changes apply.
Alternatively, or in addition, this menu item can also
advantageously be implemented as a context sensitive menu. For
example, right-clicking on an MBW triggers this submenu without
requiring the user to go through the main menu. The sub-menus can
offer different related actions, for example, to:
[0110] Map content to the main screen (814) (which closes the
skimming user interface and shows the chapter assigned to the MBW
in full screen resolution),
[0111] Change MBW size (815) (which allows the user to set the size
of the MBW without changing the size of other MBWs),
[0112] Assign chapter (816) (which allows the user to assign a
chapter of the full length video to the selected MBW),
[0113] Move MBW (817) (which allows the user to select the MBW's
spatial position on the screen without affecting the positions of
other MBWs),
[0114] Close MBW (818), pause/start/stop video in MBW (819),
and/or
[0115] Cancel (820).
[0116] The main menu item to select skimming mode (805) provides
related options for skimming mode selection, which may include, for
example, fixed interval (822), scene detection (823), hint track
(824), and/or cancel (825).
[0117] If the user selects the fixed interval sub-menu option
(822), the user can select the length of the chapters by providing
either an interval (in, for example, seconds, minutes, and/or
hours), or by selecting the number of equally long chapters the
full length video shall be divided into by the video skimmer.
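The two ways of specifying fixed-interval chapters (by interval length, or by the number of equally long chapters) can be sketched as below. This is a minimal sketch; the function names are hypothetical, and times are in seconds.

```python
import math


def chapters_by_interval(video_len_s: float, interval_s: float) -> list:
    """Chapters of a fixed interval; the final chapter may be shorter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    count = math.ceil(video_len_s / interval_s)
    return [(i * interval_s, min((i + 1) * interval_s, video_len_s))
            for i in range(count)]


def equal_chapters(video_len_s: float, num_chapters: int) -> list:
    """Divide the full length video into num_chapters equally long chapters."""
    if num_chapters < 1:
        raise ValueError("need at least one chapter")
    step = video_len_s / num_chapters
    # Use the exact video length as the last bound to avoid rounding drift.
    bounds = [i * step for i in range(num_chapters)] + [video_len_s]
    return list(zip(bounds[:-1], bounds[1:]))
```

For a 40-minute video, an interval of 600 seconds and a count of 4 chapters give the same four chapters; for a video of 2500 seconds, the interval form produces a fifth, shorter chapter of 100 seconds.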
[0118] If the user selects the scene detection sub-menu option
(823) the video skimmer is instructed to assign each scene, as
determined by a scene detection algorithm, to an MBW, up to a
user-selectable maximum number of MBWs.
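The description does not specify the scene detection algorithm. As one common and simple choice, substituted here for illustration only, a scene cut can be declared wherever the intensity histograms of consecutive frames differ strongly. The sketch below assumes grayscale frames given as flat lists of 0-255 values; the 16-bin histogram and the threshold of 0.5 are arbitrary assumptions, not tuned values.

```python
def histogram(frame: list, bins: int = 16) -> list:
    """Normalized intensity histogram of a grayscale frame (values 0-255)."""
    counts = [0] * bins
    for v in frame:
        counts[v * bins // 256] += 1
    return [c / len(frame) for c in counts]


def detect_scene_starts(frames: list, threshold: float = 0.5, bins: int = 16) -> list:
    """Indices of frames that start a new scene.

    A cut is declared when the L1 distance between the normalized histograms
    of consecutive frames (range 0 to 2) exceeds threshold.
    """
    if not frames:
        return []
    starts = [0]
    prev = histogram(frames[0], bins)
    for i in range(1, len(frames)):
        cur = histogram(frames[i], bins)
        if sum(abs(a - b) for a, b in zip(prev, cur)) > threshold:
            starts.append(i)
        prev = cur
    return starts


def scenes_to_mbws(scene_starts: list, num_frames: int, max_mbws: int) -> list:
    """One (first_frame, end_frame) chapter per scene, capped at max_mbws MBWs."""
    bounds = scene_starts + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))[:max_mbws]
```

Histogram differences miss gradual transitions such as fades and dissolves; a production skimmer would likely use a more robust detector or the hint track described in the next paragraph.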
[0119] The hint track sub-menu option (824) determines the
association of chapters to MBWs using a hint track that may be
present in the full-length video; if the hint track is not present,
this option may be grayed out.
[0120] The cancel sub-menu option (825) leaves the sub-menu without
a state change.
[0121] FIG. 9 shows an exemplary message flow between the skimming
control logic client (SCLC) (901) and the skimming control logic
server (SCLS) (902), using a protocol such as HTTP, or another
standards-based or proprietary protocol. The SCLC can request
a specific video chapter mapping to an MBW for an already selected
MBW configuration through a video chapter assignment request
message (903). This message can contain information about the user
(e.g., Client ID), the video file (e.g., File ID or file title),
the MBW (e.g., dimensions), and/or the begin and end time markers
of the requested video chapter.
[0122] If the request is valid (904), then the SCLS (902) can
return a video chapter assignment response message (905) to the
SCLC (901) indicating that the action is accepted, in which case
the SCLS instructs (906) the video extractor to fetch the defined
bitstream from the local database. If the request cannot be
implemented, then the SCLS returns a video chapter assignment
response message (907) indicating that the action is not valid.
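The exchange of FIG. 9 can be sketched with JSON message bodies, leaving the HTTP transport aside. This is a minimal sketch: the field names, the JSON encoding, the validity rules (known file, positive MBW size, chapter inside the video) and the `schedule_extraction` callback standing in for the video extractor are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ChapterAssignmentRequest:
    client_id: str
    file_id: str
    mbw_width: int
    mbw_height: int
    begin_s: float
    end_s: float


def make_request_body(req: ChapterAssignmentRequest) -> str:
    """SCLC side: serialize a video chapter assignment request (903)."""
    return json.dumps(asdict(req))


def handle_request(body: str, known_files: dict, schedule_extraction) -> str:
    """SCLS side: validate (904), answer (905 or 907), and trigger extraction (906).

    known_files maps file_id to the video duration in seconds.
    schedule_extraction is a callback standing in for the video extractor.
    """
    try:
        req = ChapterAssignmentRequest(**json.loads(body))
        duration = known_files.get(req.file_id)
        valid = (duration is not None
                 and req.mbw_width > 0 and req.mbw_height > 0
                 and 0 <= req.begin_s < req.end_s <= duration)
    except (TypeError, ValueError):  # malformed JSON or wrong fields/types
        valid = False
    if not valid:
        return json.dumps({"status": "invalid"})
    schedule_extraction(req)
    return json.dumps({"status": "accepted"})
```

Keeping validation on the server side means a client cannot cause the extractor to run for a file or time range that does not exist in the local video database.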