U.S. patent application number 14/540960 was published by the patent office on 2016-05-19 as publication number 20160142643, for parallax tolerant video stitching with spatial-temporal localized warping and seam finding.
The applicant listed for this patent is Futurewei Technologies Inc. The invention is credited to Jinwei Gu and Wei Jiang.
United States Patent Application 20160142643
Kind Code: A1
Jiang; Wei; et al.
May 19, 2016
PARALLAX TOLERANT VIDEO STITCHING WITH SPATIAL-TEMPORAL LOCALIZED
WARPING AND SEAM FINDING
Abstract
An apparatus is configured to perform a method of parallax
tolerant video stitching. The method includes determining a
plurality of video sequences to be stitched together; performing a
spatial-temporal localized warping computation process on the video
sequences to determine a plurality of target warping maps; warping
a plurality of frames among the video sequences into a plurality of
target virtual frames using the target warping maps; performing a
spatial-temporal content-based seam finding process on the target
virtual frames to determine a plurality of target seam maps; and
stitching the video sequences together using the target seam
maps.
Inventors: Jiang; Wei (Bridgewater, NJ); Gu; Jinwei (Bridgewater, NJ)
Applicant: Futurewei Technologies Inc., Plano, TX, US
Family ID: 55953746
Appl. No.: 14/540960
Filed: November 13, 2014
Current U.S. Class: 348/598
Current CPC Class: G06T 5/006 20130101; G06T 3/4038 20130101; H04N 5/265 20130101; G06T 11/60 20130101; H04N 5/262 20130101
International Class: H04N 5/265 20060101 H04N005/265; G06T 11/60 20060101 G06T011/60; G06T 5/00 20060101 G06T005/00
Claims
1. A method of parallax tolerant video stitching, the method
comprising: determining a plurality of video sequences to be
stitched together; performing a spatial-temporal localized warping
computation process on the video sequences to determine a plurality
of target warping maps, wherein the spatial-temporal localized
warping computation process comprises: determining a plurality of
spatial global homographies and a plurality of temporal global
homographies using a plurality of visual keypoints associated with
the video sequences; performing a pre-warping process, using the
spatial global homographies and the temporal global homographies,
to obtain a plurality of pre-warped temporal matching pairs and a
plurality of pre-warped spatial matching pairs; and determining a
plurality of target vertices using the pre-warped temporal matching
pairs and the pre-warped spatial matching pairs; warping a
plurality of frames among the video sequences into a plurality of
target virtual frames using the target warping maps; performing a
spatial-temporal content-based seam finding process on the target
virtual frames to determine a plurality of target seam maps; and
stitching the video sequences together using the target seam
maps.
2. (canceled)
3. The method of claim 1, further comprising: determining the
plurality of visual keypoints from the video sequences.
4. The method of claim 1, further comprising: determining the
plurality of target warping maps using the target vertices.
5. The method of claim 1, wherein the plurality of target vertices
are determined by minimizing a cost function.
6. The method of claim 5, wherein the cost function E is given by
the following equation:
E = E_ds + φE_dt + αE_gs + βE_gt + ΦE_ss + θE_st, where E_ds is a spatial local alignment parameter, E_dt is a temporal local alignment parameter, E_gs is a spatial global alignment parameter, E_gt is a temporal global alignment parameter, E_ss is a spatial smoothness parameter, E_st is a temporal smoothness parameter, and φ, α, β, Φ, θ are weight coefficients.
7. The method of claim 1, wherein the spatial-temporal
content-based seam finding process comprises: performing a
spatial-temporal objectness computation using the target virtual
frames to determine a plurality of spatial-temporal objectness
values; determining a graph comprising a plurality of pixels,
spatial edges, and temporal edges; labeling each of the pixels as
either source or sink; and determining the target seam maps using
the labeled pixels.
8. An apparatus for parallax tolerant video stitching, the
apparatus comprising: at least one memory; and at least one
processor coupled to the at least one memory, the at least one
processor configured to: determine a plurality of video sequences
to be stitched together; perform a spatial-temporal localized
warping computation process on the video sequences to determine a
plurality of target warping maps, wherein to perform the
spatial-temporal localized warping computation process, the at
least one processor is configured to: determine a plurality of
spatial global homographies and a plurality of temporal global
homographies using a plurality of visual keypoints associated with
the video sequences; perform a pre-warping process, using the
spatial global homographies and the temporal global homographies,
to obtain a plurality of pre-warped temporal matching pairs and a
plurality of pre-warped spatial matching pairs; and determine a
plurality of target vertices using the pre-warped temporal matching
pairs and the pre-warped spatial matching pairs; warp a plurality
of frames among the video sequences into a plurality of target
virtual frames using the target warping maps; perform a
spatial-temporal content-based seam finding process on the target
virtual frames to determine a plurality of target seam maps; and
stitch the video sequences together using the target seam maps.
9. (canceled)
10. The apparatus of claim 8, wherein the at least one processor is
further configured to: determine the plurality of visual keypoints
from the video sequences.
11. The apparatus of claim 8, wherein the at least one processor is
further configured to: determine the plurality of target warping
maps using the target vertices.
12. The apparatus of claim 8, wherein the plurality of target
vertices are determined by minimizing a cost function.
13. The apparatus of claim 12, wherein the cost function E is given
by the following equation:
E = E_ds + φE_dt + αE_gs + βE_gt + ΦE_ss + θE_st, where E_ds is a spatial local alignment parameter, E_dt is a temporal local alignment parameter, E_gs is a spatial global alignment parameter, E_gt is a temporal global alignment parameter, E_ss is a spatial smoothness parameter, E_st is a temporal smoothness parameter, and φ, α, β, Φ, θ are weight coefficients.
14. The apparatus of claim 8, wherein to perform the
spatial-temporal content-based seam finding process, the at least
one processor is configured to: perform a spatial-temporal
objectness computation using the target virtual frames to determine
a plurality of spatial-temporal objectness values; determine a
graph comprising a plurality of pixels, spatial edges, and temporal
edges; label each of the pixels as either source or sink; and
determine the target seam maps using the labeled pixels.
15. A non-transitory computer readable medium embodying a computer
program, the computer program comprising computer readable program
code for: determining a plurality of video sequences to be stitched
together; performing a spatial-temporal localized warping
computation process on the video sequences to determine a plurality
of target warping maps, wherein the computer readable program code
for performing the spatial-temporal localized warping computation
process comprises computer readable program code for: determining a
plurality of spatial global homographies and a plurality of
temporal global homographies using a plurality of visual keypoints
associated with the video sequences; performing a pre-warping
process, using the spatial global homographies and the temporal
global homographies, to obtain a plurality of pre-warped temporal
matching pairs and a plurality of pre-warped spatial matching
pairs; and determining a plurality of target vertices using the
pre-warped temporal matching pairs and the pre-warped spatial
matching pairs; warping a plurality of frames among the video
sequences into a plurality of target virtual frames using the
target warping maps; performing a spatial-temporal content-based
seam finding process on the target virtual frames to determine a
plurality of target seam maps; and stitching the video sequences
together using the target seam maps.
16. (canceled)
17. The non-transitory computer readable medium of claim 15,
further comprising computer readable program code for: determining
the plurality of target warping maps using the target vertices.
18. The non-transitory computer readable medium of claim 15,
wherein the plurality of target vertices are determined by
minimizing a cost function.
19. The non-transitory computer readable medium of claim 18,
wherein the cost function E is given by the following equation:
E = E_ds + φE_dt + αE_gs + βE_gt + ΦE_ss + θE_st, where E_ds is a spatial local alignment parameter, E_dt is a temporal local alignment parameter, E_gs is a spatial global alignment parameter, E_gt is a temporal global alignment parameter, E_ss is a spatial smoothness parameter, E_st is a temporal smoothness parameter, and φ, α, β, Φ, θ are weight coefficients.
20. The non-transitory computer readable medium of claim 15,
wherein the computer readable program code for performing the
spatial-temporal content-based seam finding process comprises
computer readable program code for: performing a spatial-temporal
objectness computation using the target virtual frames to determine
a plurality of spatial-temporal objectness values; determining a
graph comprising a plurality of pixels, spatial edges, and temporal
edges; labeling each of the pixels as either source or sink; and
determining the target seam maps using the labeled pixels.
Description
TECHNICAL FIELD
[0001] The present disclosure relates generally to video
processing, and more particularly, to a system and method for
parallax tolerant video stitching with spatial-temporal localized
warping and seam finding.
BACKGROUND
[0002] Increasingly powerful computation, large storage capacities, and
expanding transmission bandwidths have enabled a wide variety of
applications in the market that provide modern users with all kinds
of visual experiences. For instance, with the advent of
high-definition displaying devices such as very large screens and
Ultra-HD TVs, there has been a strong interest in generating
high-quality videos with an ultra-large Field-of-View (FoV) that
can give users immersive media experiences. A variety of devices
and methods have been developed to construct large FoV images. Very
expensive, high-end camera systems are used by professional
agencies for this purpose, such as the AWARE-2 camera used in the
defense industry, which is a monocentric, multi-scale camera that
includes a spherically symmetric objective lens surrounded by an
array of secondary microcameras. For groups with a smaller budget
(e.g., independent photographers or even amateur consumers), a
camera system that can obtain reasonable quality but with much less
expense is desired.
SUMMARY
[0003] According to one embodiment, there is provided a method of
parallax tolerant video stitching. The method includes determining
a plurality of video sequences to be stitched together; performing
a spatial-temporal localized warping computation process on the
video sequences to determine a plurality of target warping maps;
warping a plurality of frames among the video sequences into a
plurality of target virtual frames using the target warping maps;
performing a spatial-temporal content-based seam finding process on
the target virtual frames to determine a plurality of target seam
maps; and stitching the video sequences together using the target
seam maps.
[0004] According to another embodiment, there is provided an
apparatus for parallax tolerant video stitching. The apparatus
includes at least one memory and at least one processor coupled to
the at least one memory. The at least one processor is configured
to determine a plurality of video sequences to be stitched
together, perform a spatial-temporal localized warping computation
process on the video sequences to determine a plurality of target
warping maps, warp a plurality of frames among the video sequences
into a plurality of target virtual frames using the target warping
maps, perform a spatial-temporal content-based seam finding process
on the target virtual frames to determine a plurality of target
seam maps, and stitch the video sequences together using the target
seam maps.
[0005] According to yet another embodiment, there is provided a
non-transitory computer readable medium embodying a computer
program. The computer program includes computer readable program
code for determining a plurality of video sequences to be stitched
together; performing a spatial-temporal localized warping
computation process on the video sequences to determine a plurality
of target warping maps; warping a plurality of frames among the
video sequences into a plurality of target virtual frames using the
target warping maps; performing a spatial-temporal content-based
seam finding process on the target virtual frames to determine a
plurality of target seam maps; and stitching the video sequences
together using the target seam maps.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] For a more complete understanding of the present disclosure,
and the advantages thereof, reference is now made to the following
descriptions taken in conjunction with the accompanying drawings,
wherein like numbers designate like objects, and in which:
[0007] FIG. 1A illustrates an example of parallax artifacts caused
by using a single global homography;
[0008] FIG. 1B illustrates parallax artifacts that are corrected by
using a homography mesh;
[0009] FIGS. 2A and 2B illustrate an example of parallax artifacts
caused by directly applying two-dimensional (2D) stitching
technique in a video;
[0010] FIG. 3 illustrates an overall workflow of video stitching
according to this disclosure;
[0011] FIG. 4 illustrates a detailed view of a spatial-temporal
localized warping framework that implements the functions of a
spatial-temporal localized warping computation block described in
FIG. 3, according to this disclosure;
[0012] FIG. 5 illustrates a detailed view of a spatial-temporal
content-based seam finding framework that implements the functions
of a spatial-temporal content-based seam finding block described in
FIG. 3, according to this disclosure;
[0013] FIG. 6 illustrates an example of a graph that can be
constructed using a spatial-temporal graph construction process
according to this disclosure;
[0014] FIG. 7 illustrates an example method for video stitching
according to this disclosure; and
[0015] FIG. 8 illustrates an example of a computing device for
performing a video stitching workflow according to this
disclosure.
DETAILED DESCRIPTION
[0016] FIGS. 1A through 8, discussed below, and the various
embodiments used to describe the principles of the present
invention in this patent document are by way of illustration only
and should not be construed in any way to limit the scope of the
invention. Those skilled in the art will understand that the
principles of the invention may be implemented in any type of
suitably arranged device or system.
[0017] The following documents are hereby incorporated into the
present disclosure as if fully set forth herein: (i) Brady et al.,
"Multiscale gigapixel photography," Nature 486:386-389, 2012
(hereinafter "REF1"); (ii) F. Zhang and F. Liu, "Parallax-tolerant
image stitching," IEEE CVPR, 2014 (hereinafter "REF2"); and (iii)
Szeliski, "Image alignment and stitching: A tutorial," Foundations
and Trends in Computer Graphics and Computer Vision, 2006
(hereinafter "REF3").
[0018] Using affordable cameras, such as non DSLR (digital
single-lens reflex) or mobile cameras, many methods have been
developed for generating large Field-of-View (FoV) 2D photo
panoramas. These methods do not require recovering the geometric
and photometric scene models, but they require that the captured
scene be planar or distant, or that the camera viewpoints be
closely located, in which cases each image can be stitched to a
reference image by using a single global homography. When such
requirements are not met perfectly, i.e., when a single global
homography is not enough to stitch an image to the reference image
(which is generally the case in real applications), the resulting
stitched panorama usually presents different levels of parallax
artifacts such as ghosting and distortion, as illustrated in FIG.
1A.
[0019] To alleviate the problem of parallax artifacts, one or more
localized content-preserving warping algorithms have been
developed. By using a homography mesh instead of a single global
homography, each image can be stitched to the reference image with
localized homographies to significantly reduce the parallax
artifacts, as illustrated in FIG. 1B. However, it is quite
difficult to generalize the previous 2D panorama methods to
construct video panoramas if the video contains non-negligible
medium- to large-size moving objects. Directly applying the
previous 2D panorama methods to stitch individual video frames will
result in severe artifacts, not only around the object region due
to the object movement, but also for the overall stitched video due
to the inconsistency in 2D warps and/or stitching seams, as
illustrated in FIG. 2B (compared to an earlier frame in FIG. 2A,
which shows little or no artifacts).
[0020] To resolve these issues, embodiments of this disclosure
provide a video stitching system and method that include a
spatial-temporal localized warping framework and a spatial-temporal
content-based seam finding framework. The spatial-temporal
localized warping framework addresses the artifacts caused by
moving objects in video stitching. The framework includes a
spatial-temporal cost function to determine the optimal localized
warping maps to align videos to a reference video by preserving the
spatial-temporal local alignment, preserving the spatial-temporal
global alignment, and maintaining the spatial-temporal
smoothness.
[0021] The spatial-temporal content-based seam finding framework
addresses issues caused by both the inconsistent stitching seams
and the undesired seams cutting through salient foreground objects.
The framework includes a spatial-temporal content-based graph-cut
seam finding mechanism. A spatial-temporal graph is constructed;
the graph contains both spatial and temporal edges, and takes into
account the objectness of the pixels. The optimal flow seam that is
found based on the graph can stitch the videos together more
consistently as well as avoid cutting through salient foreground
objects.
[0022] FIG. 3 illustrates an overall workflow of video stitching
according to this disclosure. The workflow 300 shown in FIG. 3 is
for illustration only. Other embodiments of the workflow 300 may be
used without departing from the scope of this disclosure.
[0023] To better explain the video stitching workflow 300, it is
assumed that there are n video sequences 301a-301n that are to be
stitched together. A reference video sequence is defined, which can
be any one of the n video sequences 301a-301n. A primary objective
of video stitching is to generate a larger video sequence by
stitching the corresponding frames of the n video sequences
301a-301n to the reference video sequence. Let Ĩ.sub.t denote the
frame in the reference video sequence at time t, and let I.sub.i,t
denote the frame in the i-th video sequence at time t. Using video
stitching, a virtual frame I'.sub.t is generated by stitching
I.sub.i,t, i=1, . . . , n to Ĩ.sub.t at different times t=1, . . . ,
m.
[0024] The video stitching workflow 300 includes two function
blocks that enable parallax-tolerant video stitching: the
spatial-temporal localized warping computation block 310 and the
spatial-temporal content-based seam finding block 320. The
spatial-temporal localized warping computation block 310 uses the
video sequences 301a-301n to determine a set of target warping maps
M.sub.i,t 302. Each frame I.sub.i,t is warped into a target virtual
frame Î.sub.i,t 303 using a corresponding target warping map
M.sub.i,t 302. The spatial-temporal content-based seam finding
block 320 uses the target virtual frames 303 to determine a set of
target seam maps 304. The function blocks 310, 320 will now be
described in greater detail.
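The two-stage structure of workflow 300 can be summarized as the following Python sketch. The sketch is a structural outline only; the helper callables (compute_warping_maps, warp_frame, find_seam_maps, composite) are hypothetical stand-ins for the processing of blocks 310 and 320 and are not defined in this disclosure.

    def stitch_videos(sequences, compute_warping_maps, warp_frame,
                      find_seam_maps, composite, ref_index=0):
        """Structural sketch of workflow 300 (illustrative only).

        sequences: list of n videos, each a list of m frames (numpy arrays).
        The four callables implement blocks 310/320 and the compositing step.
        Returns the stitched video as a list of m panorama frames.
        """
        # Block 310: spatial-temporal localized warping computation,
        # producing one target warping map M[i][t] per input frame.
        warping_maps = compute_warping_maps(sequences, ref_index)

        # Warp every frame I[i][t] into a target virtual frame that is
        # aligned with the reference frame at time t.
        virtual_frames = [
            [warp_frame(frame, warping_maps[i][t]) for t, frame in enumerate(seq)]
            for i, seq in enumerate(sequences)
        ]

        # Block 320: spatial-temporal content-based seam finding,
        # producing one target seam map Z[t] per time index.
        seam_maps = find_seam_maps(virtual_frames)

        # Compose the final stitched frame I'[t] from the virtual frames
        # according to the seam map for time t.
        return [composite(virtual_frames, seam_maps[t], t)
                for t in range(len(sequences[ref_index]))]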
[0025] FIG. 4 illustrates a detailed view of a spatial-temporal
localized warping framework that implements the functions of the
spatial-temporal localized warping computation block 310 according
to this disclosure. The spatial-temporal localized warping
framework 400 may be used in connection with the video stitching
workflow 300 in FIG. 3. The spatial-temporal localized warping
framework 400 shown in FIG. 4 is for illustration only. Other
embodiments of the framework 400 may be used without departing from
the scope of this disclosure.
[0026] As shown in FIG. 4, the spatial-temporal localized warping
framework 400 takes a set of video sequences I.sub.i,t, i=1, . . .
, n, t=1, . . . , m (represented in FIG. 4 by the video sequences
301a-301n), and determines a set of target warping maps M.sub.i,t,
i=1, . . . , n, t=1, . . . , m (represented in FIG. 4 by the target
warping maps 302). Each target warping map M.sub.i,t includes
information for transforming (or warping) the original frame
I.sub.i,t to a target virtual frame Î.sub.i,t, where Î.sub.i,t aligns
with the reference frame Ĩ.sub.t.
[0027] The first step of the spatial-temporal localized warping
framework 400 is to take the set of video sequences I.sub.i,t, i=1,
. . . , n, t=1, . . . , m (video sequences 301a-301n), and extract
a set of visual keypoints (P.sub.i,t,k,d.sub.i,t,k), k=1, . . . ,
K.sub.i (keypoints 401a-401n) from each video sequence, where
d.sub.i,t,k is a visual descriptor. Each visual keypoint
(P.sub.i,t,k, d.sub.i,t,k) records the spatial-temporal location of
the keypoint in the corresponding video sequence.
[0028] Together, the parameter
P.sub.i,t,k=(x.sub.i,t,k,y.sub.i,t,k) and the visual descriptor
d.sub.i,t,k describe the local visual characteristics around the
visual keypoint in the corresponding video sequence. Various
keypoint extraction techniques can be used to extract the visual
keypoints, such as the 2D or 3D Harris corner detectors. Various
descriptors can be used for d.sub.i,t,k, such as the SIFT (Scale
Invariant Feature Transform), SURF (Speeded Up Robust Features), or
FAST (Features from Accelerated Segment Test) descriptors.
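As one possible realization of this step, the sketch below uses OpenCV's SIFT detector to obtain the keypoint locations P.sub.i,t,k and descriptors d.sub.i,t,k for every frame of one video sequence; the choice of SIFT here is an assumption for illustration, not a requirement of the disclosure.

    import cv2

    def extract_keypoints(frames):
        """Detect keypoints and descriptors for every frame of one sequence.

        frames: iterable of BGR images (numpy arrays).
        Returns a list of (keypoints, descriptors) tuples, one per frame.
        """
        sift = cv2.SIFT_create()  # any detector/descriptor pair could be substituted
        results = []
        for frame in frames:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # keypoints[k].pt is the (x, y) location P_{i,t,k};
            # descriptors[k] is the corresponding visual descriptor d_{i,t,k}.
            keypoints, descriptors = sift.detectAndCompute(gray, None)
            results.append((keypoints, descriptors))
        return results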
[0029] Using the set of visual keypoints (P.sub.i,t,k,d.sub.i,t,k),
k=1, . . . , K.sub.i (keypoints 401a-401n), the spatial-temporal
localized warping framework 400 determines a set of spatial global
homographies {tilde over (S)}.sub.i, i=1, . . . , n (spatial global
homographies 402), and a set of temporal global homographies
T.sub.i,t, i=1, . . . , n, t=1, . . . , m (temporal global
homographies 403). Each spatial global homography {tilde over
(S)}.sub.i is a 3 by 3 transformation matrix that transforms each
frame I.sub.i,t to align with the reference frame Ĩ.sub.t.
Similarly, each temporal global homography T.sub.i,t is a 3 by 3
transformation matrix that transforms each frame I.sub.i,t to align
with a temporal reference frame Ĩ.sub.i,t for the i-th video
sequence.
[0030] In a preferred embodiment, the temporal reference frame
Ĩ.sub.i,t can be determined in two steps. First, an averaged
temporal global homography A.sub.i can be calculated as
A.sub.i=avg.sub.t A.sub.i(t,t+1), where A.sub.i(t,t+1) is a 3
by 3 transformation matrix that transforms frame I.sub.i,t+1 to align
with frame I.sub.i,t. Then, in the second step, the temporal
reference frame Ĩ.sub.i,t can be calculated as
Ĩ.sub.i,t=A.sub.i.sup.(t-1)I.sub.i,1. By transforming each frame
I.sub.i,t using the temporal global homography T.sub.i,t to align
with the temporal reference frame Ĩ.sub.i,t, the benefit of
stabilizing the original video frames I.sub.i,t can be
automatically realized by using a static global camera path defined
by A.sub.i. This is beneficial to the final stitching result when a
small amount of camera shakiness exists during the video capture.
Such shakiness can occur when the camera system is not completely
physically stabilized, for example, when the camera system is used
outdoors in strong wind.
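A minimal sketch of this two-step computation is given below, assuming the frame-to-frame homographies A.sub.i(t,t+1) have already been estimated and that the averaging operator is an element-wise mean of the 3 by 3 matrices; both assumptions are illustrative.

    import numpy as np

    def temporal_reference_path(A_pairs, num_frames):
        """Sketch of the two-step temporal reference computation.

        A_pairs: list of 3x3 homographies A_i(t, t+1), t = 1..m-1, each
                 mapping frame t+1 onto frame t of one video sequence.
        Returns the averaged homography A_i and the list of matrices
        A_i**(t-1), t = 1..m, that generate the temporal reference frames
        from the first frame of the sequence.
        """
        A_i = np.mean(np.stack(A_pairs), axis=0)   # averaged temporal global homography
        powers = [np.linalg.matrix_power(A_i, t) for t in range(num_frames)]
        # The temporal reference frame at time t could then be rendered as,
        # e.g., cv2.warpPerspective(frames[0], powers[t - 1], (width, height)).
        return A_i, powers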
[0031] In a preferred embodiment, temporal matching pairs
(P.sub.i,t,l,P.sub.i,t+1,l), l=1, . . . , L.sub.i,t can be
determined based on the similarity between keypoints
(P.sub.i,t,k,d.sub.i,t,k) and (P.sub.i,t+1,k,d.sub.i,t+1,k), and
A.sub.i(t,t+1) can be determined based on the temporal matching
pairs (P.sub.i,t,l,P.sub.i,t+1,l), l=1, . . . , L.sub.i,t using
Ransac and outlier rejection. Using the averaged temporal global
homography A.sub.i, the first item in the temporal matching pairs
P.sub.i,t,l, l=1, . . . , L.sub.i,t can be transformed to new
locations P'.sub.i,t,l, l=1, . . . , L.sub.i,t, and the temporal
global homography T.sub.i,t is determined based on the matching
pairs (P'.sub.i,t,l,P.sub.i,t+1,l), l=1, . . . , L.sub.i,t using
Ransac and outlier rejection. At the same time, spatial matching
pairs (P.sub.i,t,l,{tilde over (P)}.sub.t,l), l=1, . . . , L.sub.i
can be found based on the similarity between keypoints
(P.sub.i,t,k,d.sub.i,t,k) and ({tilde over (P)}.sub.t,k,{tilde over
(d)}.sub.t,k), and the spatial global homography {tilde over
(S)}.sub.i can be determined using all the spatial matching pairs
at different times (P.sub.i,t,l,P.sub.j,t,l), l=1, . . . ,
L.sub.i,j, t=1, . . . , m using Ransac and outlier rejection, where
({tilde over (P)}.sub.t,k,{tilde over (d)}.sub.t,k), k=1, . . . ,
{tilde over (K)} are the keypoints extracted over the reference
video sequence, which can be any one of the n input video sequences
301a-301n.
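One conventional way to obtain the matching pairs and the robust homography fit described above is a descriptor matcher with a ratio test followed by RANSAC, as sketched below; the ratio-test threshold of 0.75 and the reprojection tolerance of 3.0 pixels are illustrative values, not values taken from the disclosure.

    import cv2
    import numpy as np

    def match_and_fit_homography(kp_a, des_a, kp_b, des_b):
        """Match keypoints between two frames and fit a global homography.

        kp_a, kp_b: lists of cv2.KeyPoint; des_a, des_b: descriptor arrays.
        Returns (H, inlier point pairs), where H maps frame-a points to frame b.
        """
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        good = []
        for pair in matcher.knnMatch(des_a, des_b, k=2):
            # Lowe-style ratio test keeps only distinctive matches.
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
                good.append(pair[0])

        src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

        # RANSAC rejects outlier correspondences while estimating H.
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        if H is None:
            return None, []
        inliers = [(s[0], d[0]) for s, d, keep in zip(src, dst, mask.ravel()) if keep]
        return H, inliers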
[0032] A pre-warping process 404 uses the spatial global
homographies {tilde over (S)}.sub.i, i=1, . . . , n (spatial global
homographies 402) and the temporal global homographies T.sub.i,t,
i=1, . . . , n, t=1, . . . , m (temporal global homographies 403).
In the pre-warping process 404, each input video frame I.sub.i,t is
transformed to a pre-warped video frame Ī.sub.i,t according to the
equation:

Ī_{i,t} = S̃_i T_{i,t} I_{i,t},

the temporal matching pairs (P.sub.i,t,l, P.sub.i,t+1,l), l=1, . . .
, L.sub.i,t are transformed to a set of pre-warped temporal
matching pairs (P̄.sub.i,t,l, P̄.sub.i,t+1,l), l=1, . . . , L.sub.i,t
(pre-warped temporal matching pairs 405) according to the
equations:

P̄_{i,t,l} = S̃_i T_{i,t} P_{i,t,l},   P̄_{i,t+1,l} = S̃_i T_{i,t+1} P_{i,t+1,l},

and the spatial matching pairs (P.sub.i,t,l, {tilde over
(P)}.sub.t,l), l=1, . . . , L.sub.i, t=1, . . . , m are transformed
to a set of pre-warped spatial matching pairs (P̄.sub.i,t,l, {tilde
over (P)}.sub.t,l), l=1, . . . , L.sub.i (pre-warped spatial
matching pairs 406) according to the equation:

P̄_{i,t,l} = S̃_i T_{i,t} P_{i,t,l}.
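A sketch of the pre-warping step is shown below; it assumes that the output canvas has the same size as the input frame, which is a simplification for illustration.

    import cv2
    import numpy as np

    def pre_warp(frame, points, S_tilde, T):
        """Apply the composite homography S̃_i · T_{i,t} (process 404).

        frame:  input frame I_{i,t}.
        points: Nx2 array of keypoint locations belonging to matching pairs.
        Returns the pre-warped frame and the pre-warped point locations.
        """
        G = S_tilde @ T                                  # composite pre-warp homography
        h, w = frame.shape[:2]
        warped_frame = cv2.warpPerspective(frame, G, (w, h))
        pts = np.float32(points).reshape(-1, 1, 2)
        warped_points = cv2.perspectiveTransform(pts, G).reshape(-1, 2)
        return warped_frame, warped_points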
[0033] An x.sub.n.times.y.sub.n uniform grid is defined that divides
each image into x.sub.n.times.y.sub.n uniform cells. Let
V.sub.i,t,k, k=1, . . . , (x.sub.n+1)(y.sub.n+1) and V̄.sub.i,t,k,
k=1, . . . , (x.sub.n+1)(y.sub.n+1) denote the vertices of the grid
mesh in image I.sub.i,t and the pre-warped image Ī.sub.i,t,
respectively. In the spatial-temporal localized warping computation
process, a set of target vertices {circumflex over (V)}.sub.i,t,k,
k=1, . . . , (x.sub.n+1)(y.sub.n+1) (target vertices 407) is
determined based on the input vertices V.sub.i,t,k, k=1, . . . ,
(x.sub.n+1)(y.sub.n+1) and V̄.sub.i,t,k, k=1, . . . ,
(x.sub.n+1)(y.sub.n+1), the input pre-warped spatial matching
pairs (P̄.sub.i,t,l, {tilde over (P)}.sub.t,l), l=1, . . . , L.sub.i,
and the pre-warped temporal matching pairs (P̄.sub.i,t,l, P̄.sub.i,t+1,l),
l=1, . . . , L.sub.i,t. For each mesh cell C.sub.j, its four
vertices {circumflex over (V)}.sub.i,t,j(1), {circumflex over
(V)}.sub.i,t,j(2), {circumflex over (V)}.sub.i,t,j(3), {circumflex
over (V)}.sub.i,t,j(4) and V.sub.i,t,j(1), V.sub.i,t,j(2),
V.sub.i,t,j(3), V.sub.i,t,j(4) define a perspective transformation
H.sub.i,t,j to transform the pixels of image I.sub.i,t in the mesh
cell C.sub.j to align with the corresponding mesh cell {tilde over
(C)}.sub.j in the reference image Ĩ.sub.t. In a preferred
embodiment, {circumflex over (V)}.sub.i,t,k, k=1, . . . ,
(x.sub.n+1)(y.sub.n+1), i=1, . . . , n, t=1, . . . , m is
determined by minimizing the following cost function:
E = E_ds + φE_dt + αE_gs + βE_gt + ΦE_ss + θE_st

E_ds = Σ_{t=1}^{m} Σ_{i=1}^{n} Σ_{l=1}^{L_i} || Σ_{k=1}^{4} λ_{i,t,l}(k) V̂_{i,t,l}(k) − P̃_{t,l} ||^2

E_dt = Σ_{t=1}^{m-1} Σ_{i=1}^{n} Σ_{l=1}^{L_{i,t}} || Σ_{k=1}^{4} λ_{i,t,l}(k) V̂_{i,t,l}(k) − P̄_{i,t+1,l} ||^2

E_gs = Σ_{t=1}^{m} Σ_{i=1}^{n} Σ_{l=1}^{(x_n+1)(y_n+1)} τ_{i,t,l} || V̂_{i,t,l} − V̄_{i,t,l} ||^2

E_gt = Σ_{t=1}^{m} Σ_{i=1}^{n} Σ_{l=1}^{(x_n+1)(y_n+1)} Σ_{r∈Ω_t} σ_{i,t,l} || V̂_{i,t,l} − V̂_{i,r,l} ||^2

E_ss = Σ_{t=1}^{m} Σ_{i=1}^{n} Σ_{l∈Δ} ω_s || V̂_{i,t,l}(1) − ( V̂_{i,t,l}(2) + u_{i,t,l}( V̂_{i,t,l}(3) − V̂_{i,t,l}(2) ) + v_{i,t,l} R( V̂_{i,t,l}(3) − V̂_{i,t,l}(2) ) ) ||^2

E_st = Σ_{t=1}^{m-1} Σ_{i=1}^{n} Σ_{l∈Δ} ω_t || V̂_{i,t+1,l}(1) − ( V̂_{i,t+1,l}(2) + u_{i,t,l}( V̂_{i,t+1,l}(3) − V̂_{i,t+1,l}(2) ) + v_{i,t,l} R( V̂_{i,t+1,l}(3) − V̂_{i,t+1,l}(2) ) ) ||^2     (1)
[0034] The parameter E.sub.ds measures spatial local alignment,
where (P̄.sub.i,t,l, {tilde over (P)}.sub.t,l) is a pre-warped
spatial matching pair 406, and P̄.sub.i,t,l is represented by a
linear combination of the four vertices V̄.sub.i,t,l(k), k=1, . . .
, 4 of the cell that contains P̄.sub.i,t,l, with coefficients
.lamda..sub.i,t,l(k), k=1, . . . , 4. The coefficients can be
determined using any of a number of different methods, such as the
inverse bilinear interpolation method described in REF2. Therefore,
minimizing E.sub.ds encourages the final target vertices to
transform each original frame I.sub.i,t to align with the reference
image Ĩ.sub.t by matching their corresponding keypoints.
[0035] The parameter E.sub.dt measures temporal local alignment,
where (P̄.sub.i,t,l, P̄.sub.i,t+1,l) is a pre-warped temporal matching
pair 405, and P̄.sub.i,t,l is represented by a linear combination of
the four vertices V̄.sub.i,t,l(k), k=1, . . . , 4 of the cell that
contains P̄.sub.i,t,l, with coefficients .lamda..sub.i,t,l(k), k=1, . . . , 4.
The coefficients can be determined using the same method as in the
preceding paragraph. Therefore, minimizing E.sub.dt encourages the
final target vertices to transform each original frame I.sub.i,t to
align with the reference image Ĩ.sub.t while maintaining the
temporal correspondence alignment.
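For illustration, when the enclosing cell is treated as an axis-aligned rectangle, the coefficients λ reduce to ordinary bilinear weights, as in the sketch below; the general case of an arbitrary warped quadrilateral uses inverse bilinear interpolation as described in REF2.

    import numpy as np

    def bilinear_coefficients(p, x0, y0, x1, y1):
        """Simplified λ weights of point p inside an axis-aligned cell.

        The cell corners are taken in the order (x0, y0), (x1, y0),
        (x1, y1), (x0, y1). The returned weights sum to 1 and reproduce p
        as a linear combination of the four corners.
        """
        u = (p[0] - x0) / (x1 - x0)    # horizontal fraction within the cell
        v = (p[1] - y0) / (y1 - y0)    # vertical fraction within the cell
        return np.array([(1 - u) * (1 - v), u * (1 - v), u * v, (1 - u) * v])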
[0036] The parameter E.sub.gs measures spatial global alignment.
When there are no pre-warped spatial matching pairs in the spatial
neighborhood of the pre-warped vertex V̄.sub.i,t,l, the
corresponding target vertex {circumflex over (V)}.sub.i,t,l is
encouraged to be the same as the pre-warped vertex V̄.sub.i,t,l, and
therefore .tau..sub.i,t,l=1. Otherwise, .tau..sub.i,t,l=0.
[0037] The parameter E.sub.gt measures temporal global alignment.
Let r.epsilon..OMEGA..sub.t denote a temporal neighborhood of time
frame t. When there are no pre-warped temporal matching pairs in the
spatial neighborhood of the pre-warped vertex V̄.sub.i,t,l, the
corresponding target vertex {circumflex over (V)}.sub.i,t,l is encouraged
to remain the same through time (i.e., remain unchanged within the
temporal neighborhood .OMEGA..sub.t), and therefore
.sigma..sub.i,t,l=1. When there exist pre-warped temporal matching
pairs in the spatial neighborhood of the pre-warped vertex
V̄.sub.i,t,l, the weight value .sigma..sub.i,t,l is determined by
the scale of pixel movement in the spatial neighborhood of the
pre-warped vertex V.sub.i,t,l. That is, if the scene remains static
in the spatial neighborhood of the pre-warped vertex V.sub.i,t,l,
the corresponding vertex {circumflex over (V)}.sub.i,t,l is
encouraged to remain the same through time, i.e., .sigma..sub.i,t,l
should take a large value close to 1. When there exists substantial
scene movement in the spatial neighborhood of the pre-warped vertex
V.sub.i,t,l, .sigma..sub.i,t,l should take a small value close to
0. In a preferred embodiment, the scale of movement determined
using the pre-warped temporal matching pairs in the spatial
neighborhood of the pre-warped vertex V.sub.i,t,l is used to
determine the weight value .sigma..sub.i,t,l. In other embodiments,
other motion measurements such as the ones based on optical flow
can also be used to determine .sigma..sub.i,t,l.
[0038] The parameter E.sub.ss measures spatial smoothness. Let
.DELTA. denote a set of triplets, where each triplet in .DELTA.
contains three vertices V.sub.i,t,l(1), V.sub.i,t,l(2),
V.sub.i,t,l(3), which define a triangle. The vertex V.sub.i,t,l(1)
can be represented by the other vertices V.sub.i,t,l(2),
V.sub.i,t,l(3) according to the following:
V̂_{i,t,l}(1) = V̂_{i,t,l}(2) + u_{i,t,l}( V̂_{i,t,l}(3) − V̂_{i,t,l}(2) ) + v_{i,t,l} R( V̂_{i,t,l}(3) − V̂_{i,t,l}(2) ),   R = [0, 1; −1, 0].
[0039] If the triangle undergoes a similarity transformation, its
coordinates in the local coordinate system will remain the same.
Therefore, minimizing E.sub.ss encourages the mesh cells to undergo
a similarity transformation spatially, which helps to reduce the
local distortion during optimization. The value .omega..sub.s is a
weight assigned to each triangle, which is determined by the
spatial edge saliency in the triangle and helps to distribute more
distortion to less salient regions.
[0040] The parameter E.sub.st measures temporal smoothness. Again,
let .DELTA. denote a set of triplets, where each triplet in .DELTA.
contains three vertices V.sub.i,t,l(1), V.sub.i,t,l(2),
V.sub.i,t,l(3), which define a triangle. The vertex V.sub.i,t,l(1)
can be represented by the other vertices V.sub.i,t,l(2),
V.sub.i,t,l(3) as:
V̂_{i,t,l}(1) = V̂_{i,t,l}(2) + u_{i,t,l}( V̂_{i,t,l}(3) − V̂_{i,t,l}(2) ) + v_{i,t,l} R( V̂_{i,t,l}(3) − V̂_{i,t,l}(2) ),   R = [0, 1; −1, 0].
[0041] If the triangle undergoes a similarity transformation, its
coordinates in the local coordinate system will remain the same.
Therefore, minimizing E.sub.st encourages the mesh cells to undergo
a similarity transformation temporally, which helps to reduce the
local distortion during optimization. The value .omega..sub.t is a
weight assigned to each triangle, which is determined by the
temporal edge saliency in the triangle and helps to distribute more
distortion to less salient regions.
[0042] The weights .phi..gtoreq.0, .alpha..gtoreq.0,
.beta..gtoreq.0, .phi..gtoreq.0, .theta..gtoreq.0 are assigned to
each term in the cost function in Equation (1) to balance the
importance of different terms in optimization. When .phi.=0,
.phi.=0, .theta.=0, the cost function in Equation (1) reduces to
the content-preserving warping method for static image stitching
developed in REF2.
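Because every term in Equation (1) is quadratic in the unknown target vertices, the minimization can be assembled as a sparse linear least-squares problem. The sketch below illustrates the idea for only two of the terms (E_ds and E_gs) on a single frame; it is a schematic reduction under that assumption, not the full spatial-temporal solver.

    import numpy as np
    from scipy.sparse import lil_matrix
    from scipy.sparse.linalg import lsqr

    def solve_single_frame(V_bar, pairs, tau, alpha):
        """Schematic least-squares solve for one frame's target vertices.

        V_bar: (K, 2) array of pre-warped vertex positions.
        pairs: list of (vertex_indices, lambdas, p_tilde) tuples, one E_ds
               row per pre-warped spatial matching pair (4 indices/weights).
        tau:   (K,) array of 0/1 indicator weights for the E_gs term.
        alpha: scalar weight of the E_gs term.
        Returns a (K, 2) array of target vertex positions.
        """
        K = V_bar.shape[0]
        rows = len(pairs) + K
        A = lil_matrix((rows, K))
        b = np.zeros((rows, 2))

        r = 0
        for idx, lam, p_tilde in pairs:           # E_ds rows: sum_k λ_k V̂_k ≈ P̃
            for k in range(4):
                A[r, idx[k]] = lam[k]
            b[r] = p_tilde
            r += 1
        for k in range(K):                        # E_gs rows: V̂_k ≈ V̄_k where τ_k = 1
            w = np.sqrt(alpha) * tau[k]
            A[r, k] = w
            b[r] = w * V_bar[k]
            r += 1

        A = A.tocsr()
        x = lsqr(A, b[:, 0])[0]                   # x and y solved independently
        y = lsqr(A, b[:, 1])[0]
        return np.stack([x, y], axis=1)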
[0043] After obtaining the target vertices {circumflex over
(V)}.sub.i,t,k, k=1, . . . , (x.sub.n+1)(y.sub.n+1), i=1, . . . ,
n, t=1, . . . , m, the set of target warping maps M.sub.i,t, i=1, .
. . , n, t=1, . . . , m can be determined based on the original
vertices, V.sub.i,t,k, k=1, . . . , (x.sub.n+1)(y.sub.n+1), i=1, .
. . , n, t=1, . . . , m and the target vertices {circumflex over
(V)}.sub.i,t,k, k=1, . . . , (x.sub.n+1)(y.sub.n+1), i=1, . . . ,
n, t=1, . . . , m. There are several ways to determine the target
warping maps. In a preferred embodiment, for each mesh cell
C.sub.j, its four vertices {circumflex over (V)}.sub.i,t,j(1),
{circumflex over (V)}.sub.i,t,j(2), {circumflex over
(V)}.sub.i,t,j(3), {circumflex over (V)}.sub.i,t,j(4) and
V.sub.i,t,j(1), V.sub.i,t,j(2), V.sub.i,t,j(3), V.sub.i,t,j(4)
define a perspective transformation H.sub.i,t,j to transform the
pixels of image I.sub.i,t in the mesh cell C.sub.j to align with
the corresponding mesh cell {tilde over (C)}.sub.j in the reference
image Ĩ.sub.t. The target warping map M.sub.i,t is simply formed as
the set of H.sub.i,t,j, j=1, . . . , x.sub.n y.sub.n, and the whole
image I.sub.i,t can be warped by M.sub.i,t cell by cell into the
target virtual frame Î.sub.i,t (target virtual frame 303).
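One way to realize the per-cell warp is OpenCV's four-point perspective fit; the sketch below warps a single mesh cell into an output canvas whose size is assumed to be known.

    import cv2
    import numpy as np

    def warp_cell(image, src_quad, dst_quad, canvas):
        """Warp the pixels of one mesh cell into the output canvas.

        src_quad: 4x2 original vertices V of cell C_j in image I_{i,t}.
        dst_quad: 4x2 target vertices V̂ of the same cell.
        canvas:   output image that accumulates the warped cells.
        """
        H_cell = cv2.getPerspectiveTransform(np.float32(src_quad),
                                             np.float32(dst_quad))
        h, w = canvas.shape[:2]
        warped = cv2.warpPerspective(image, H_cell, (w, h))

        # Restrict the paste to the destination cell so neighboring cells
        # are not overwritten by pixels that fall outside this cell.
        mask = np.zeros((h, w), np.uint8)
        cv2.fillConvexPoly(mask, np.int32(dst_quad), 255)
        canvas[mask > 0] = warped[mask > 0]
        return canvas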
[0044] FIG. 5 illustrates a detailed view of a spatial-temporal
content-based seam finding framework that implements the functions
of the spatial-temporal content-based seam finding block 320
according to this disclosure. The spatial-temporal content-based
seam finding framework 500 may be used in connection with the video
stitching workflow 300 in FIG. 3. The spatial-temporal
content-based seam finding framework 500 shown in FIG. 5 is for
illustration only. Other embodiments of the framework 500 may be
used without departing from the scope of this disclosure.
[0045] As shown in FIG. 5, the spatial-temporal content-based seam
finding framework 500 takes the set of target virtual frames
Î.sub.i,t, i=1, . . . , n, t=1, . . . , m (represented in FIG. 5 by
the target virtual frames 303), and determines a set of target seam
maps Z.sub.t, t=1, . . . , m (represented in FIG. 5 by the target
seam maps 304). Each seam map Z.sub.t includes information for
composing the final stitched virtual frame I'.sub.t from the warped
target virtual frames Î.sub.i,t, i=1, . . . , n.
[0046] The first step of the spatial-temporal content-based seam
finding framework 500 is the spatial-temporal objectness
computation process 501. Given a pair of target virtual frame
sequences Î.sub.i,t, t=1, . . . , m and Î.sub.j,t, t=1, . . . , m,
this process assigns an objectness value o.sub.i,j,t,k.epsilon.[0,1]
to each overlapping pixel p.sub.i,j,t,k between Î.sub.i,t and
Î.sub.j,t. The objectness value o.sub.i,j,t,k measures the level of
object saliency of the pixel p.sub.i,j,t,k. The more salient the
pixel p.sub.i,j,t,k is, the larger the value o.sub.i,j,t,k is, and
the less desirable it is for the target seam to cut through the
pixel p.sub.i,j,t,k. There are a number of different methods of
determining the objectness value o.sub.i,j,t,k. For example, if the
pixel is on a human face, the target seam is not encouraged to cut
through the human face to avoid artifacts. As another example, if
the pixel is on a fast moving object and is close to strong
structural edges, the target seam is not encouraged to cut through
the pixel to avoid artifacts. In a preferred embodiment, the
computation process 501 takes into account the above factors in
computing the objectness value, where
o.sub.i,j,t,k=a*f.sub.i,j,t,k+b*e.sub.i,j,t,k. The value
f.sub.i,j,t,k is the distance from the pixel p.sub.i,j,t,k to an
automatically detected human face, and e.sub.i,j,t,k is the
distance from the pixel p.sub.i,j,t,k to a close-by strong moving
edge. The values a, b are the weights to balance these two
terms.
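A possible realization of this objectness computation is sketched below, assuming a Haar-cascade face detector and a frame-difference based moving-edge map; the distance transforms provide f and e for every pixel, the weights a and b are left free as in the text, and any normalization of o into [0,1] is omitted.

    import cv2
    import numpy as np

    def objectness_map(frame, prev_frame, overlap_mask, a, b):
        """Sketch of process 501: o = a*f + b*e for each overlapping pixel."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # f: distance from each pixel to the nearest detected face region.
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        face_mask = np.full(gray.shape, 255, np.uint8)
        for (x, y, w, h) in cascade.detectMultiScale(gray):
            face_mask[y:y + h, x:x + w] = 0
        f = cv2.distanceTransform(face_mask, cv2.DIST_L2, 3)

        # e: distance to the nearest strong edge lying on a moving region.
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        moving = cv2.absdiff(gray, prev_gray) > 15
        edges = cv2.Canny(gray, 50, 150) > 0
        moving_edge_mask = np.where(moving & edges, 0, 255).astype(np.uint8)
        e = cv2.distanceTransform(moving_edge_mask, cv2.DIST_L2, 3)

        o = a * f + b * e
        return np.where(overlap_mask > 0, o, 0.0)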
[0047] After that, a spatial-temporal graph can be constructed
using the spatial-temporal graph construction process 502. FIG. 6
illustrates an example of such a graph constructed according to
this disclosure. As shown in FIG. 6, the graph 600 includes a
plurality of graph nodes 601. Each graph node 601 is an overlapping
pixel p.sub.i,j,t,k. There are two types of edges between each pair
of graph nodes: a spatial edge (represented by the spatial edge
602) and a temporal edge (represented by the temporal edge 603).
The spatial edge is the edge between two graph nodes that
corresponds to pixels at the same time index but different spatial
locations. The temporal edge is the edge between two graph nodes
that corresponds to pixels at the same spatial location but
different time indices. Specifically, the spatial edge 602 between
pixel p.sub.i,j,t,k and p.sub.i,j,t,l is defined as
E.sup.s.sub.i,j,t(k,l) according to the following:
E^s_{i,j,t}(k,l) = o_{i,j,t,k} D(Î_{i,t}(k), Î_{j,t}(k)) + o_{i,j,t,l} D(Î_{i,t}(l), Î_{j,t}(l)),

where D(Î_{i,t}(k), Î_{j,t}(k)) is the distance measurement between
pixel value Î_{i,t}(k) and pixel value Î_{j,t}(k), and Î_{i,t}(k) is
the value of the k-th pixel in frame Î_{i,t}. Various distance
measurements can be used to determine D(Î_{i,t}(k), Î_{j,t}(k)). For
example, in one embodiment:

E^s_{i,j,t}(k,l) = o_{i,j,t,k} ||Î_{i,t}(k) − Î_{j,t}(k)|| + o_{i,j,t,l} ||Î_{i,t}(l) − Î_{j,t}(l)||.
[0048] The temporal edge 603 between pixel p.sub.i,j,t,k and
p.sub.i,j,t+1,k is defined as E.sup.t.sub.i,j,k(t,t+1) according to
the following:
E^t_{i,j,k}(t,t+1) = (o_{i,j,t,k} + o_{i,j,t+1,k}) ( D(Î_{i,t}(k), Î_{i,t+1}(k)) + D(Î_{j,t}(k), Î_{j,t+1}(k)) ) / 2,

where D(Î_{i,t}(k), Î_{i,t+1}(k)) is the distance measurement
between pixel value Î_{i,t}(k) and pixel value Î_{i,t+1}(k). Various
distance measurements can be used to determine D(Î_{i,t}(k),
Î_{i,t+1}(k)). For example, in one embodiment:

E^t_{i,j,k}(t,t+1) = (o_{i,j,t,k} + o_{i,j,t+1,k}) ( ||Î_{i,t}(k) − Î_{i,t+1}(k)|| + ||Î_{j,t}(k) − Î_{j,t+1}(k)|| ) / 2.
[0049] Without loss of generality, assume that image Î.sub.i,t is
the source and Î.sub.j,t is the sink. Each overlapping pixel that is
on the boundary of the overlapped region between Î.sub.i,t and
Î.sub.j,t is given an edge to its closest image (either source or
sink), with infinite edge weight.
[0050] Then, returning to FIG. 5, after the graph is constructed
using the spatial-temporal graph construction process 502, the
max-flow seam computation process 503 is performed to find the
optimal labeling .eta..sub.i,j,t,k of every overlapping pixel
p.sub.i,j,t,k. The labeling .eta..sub.i,j,t,k is either source or
sink, and is determined by finding a minimal-edge-cost cut of the
graph. If .eta..sub.i,j,t,k is source, the corresponding pixel in
the final stitched image takes its value from Î.sub.i,t, and if
.eta..sub.i,j,t,k is sink, the corresponding pixel in the final
stitched image takes its value from Î.sub.j,t.
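A compact sketch of the graph construction and min-cut labeling for one pair of overlapping virtual frames is given below, assuming the third-party PyMaxflow package; temporal edges and the iterative multi-frame procedure are omitted for brevity.

    import numpy as np
    import maxflow  # PyMaxflow (assumed third-party dependency)

    def seam_labels(frame_i, frame_j, objectness, boundary_i, boundary_j):
        """Label each overlap pixel as source (take frame_i) or sink (frame_j).

        frame_i, frame_j: HxWx3 overlapping regions of two virtual frames.
        objectness:       HxW objectness values o.
        boundary_i/j:     HxW boolean masks of pixels that must come from
                          frame_i or frame_j respectively.
        """
        h, w = objectness.shape
        g = maxflow.Graph[float]()
        ids = g.add_grid_nodes((h, w))

        diff = np.linalg.norm(frame_i.astype(float) - frame_j.astype(float), axis=2)
        for y in range(h):
            for x in range(w):
                # Spatial edges to the right and bottom neighbors, weighted
                # by objectness and color disagreement (E^s in the text).
                if x + 1 < w:
                    cost = objectness[y, x] * diff[y, x] + objectness[y, x + 1] * diff[y, x + 1]
                    g.add_edge(ids[y, x], ids[y, x + 1], cost, cost)
                if y + 1 < h:
                    cost = objectness[y, x] * diff[y, x] + objectness[y + 1, x] * diff[y + 1, x]
                    g.add_edge(ids[y, x], ids[y + 1, x], cost, cost)

        # Boundary pixels are tied to their closest image with a very large weight.
        INF = 1e9
        for y, x in zip(*np.nonzero(boundary_i)):
            g.add_tedge(ids[y, x], INF, 0)
        for y, x in zip(*np.nonzero(boundary_j)):
            g.add_tedge(ids[y, x], 0, INF)

        g.maxflow()
        # True marks pixels assigned to the sink (frame_j); False marks source (frame_i).
        return g.get_grid_segments(ids)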
[0051] To determine the final target seam map Z.sub.t, the above
process is conducted iteratively by adding frames to the stitched
result one by one. That is, frames Î.sub.1,t and Î.sub.2,t are first
stitched together, then frame Î.sub.3,t is added and stitched with
the result of stitching Î.sub.1,t and Î.sub.2,t, and so on.
[0052] Once the set of target seam maps Z.sub.t, t=1, . . . , m
(target seam maps 304) is obtained, various color correction, gain
compensation, and blending techniques can be used to visually
enhance the stitched result.
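As one simple example of such post-processing, the binary seam mask for a pair of frames can be feathered and used for linear blending near the seam, as sketched below; gain compensation and multi-band blending are alternatives.

    import cv2
    import numpy as np

    def feather_blend(frame_a, frame_b, seam_mask, radius=15):
        """Blend two warped frames across a feathered seam mask.

        seam_mask: HxW array, 1 where the seam map selects frame_a and 0
        where it selects frame_b (one binary layer of a target seam map Z_t).
        """
        alpha = cv2.GaussianBlur(seam_mask.astype(np.float32), (0, 0), radius)
        alpha = np.clip(alpha, 0.0, 1.0)[..., None]          # HxWx1 blending weight
        blended = (alpha * frame_a.astype(np.float32)
                   + (1.0 - alpha) * frame_b.astype(np.float32))
        return np.clip(blended, 0, 255).astype(np.uint8)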
[0053] FIG. 7 illustrates an example method for video stitching
according to this disclosure. For ease of explanation, the method
700 is described as being used with a computing device capable of
video processing, such as the computing device 800 of FIG. 8
(described below). However, the method 700 could be used by any
suitable device and in any suitable system.
[0054] At step 701, a plurality of video sequences are determined
to be stitched together. In some embodiments, this may include a
computing device determining the video sequences 301a-301n in FIG.
3. At step 703, a spatial-temporal localized warping computation
process is performed on the video sequences to determine a
plurality of target warping maps. In some embodiments, this may
include the spatial-temporal localized warping framework 400
performing the functions of the spatial-temporal localized warping
computation block 310 in FIG. 3.
[0055] At step 705, a plurality of frames among the video sequences
are warped into a plurality of target virtual frames using the
target warping maps determined in step 703. At step 707, a
spatial-temporal content-based seam finding process is performed on
the target virtual frames to determine a plurality of target seam
maps. In some embodiments, this may include the spatial-temporal
content-based seam finding framework 500 performing the functions
of the spatial-temporal content-based seam finding block 320 in
FIG. 3. Then, at step 709, the video sequences are stitched
together using the target seam maps.
[0056] Although FIG. 7 illustrates one example of a method 700 for
video stitching, various changes may be made to FIG. 7. For
example, while shown as a series of steps, various steps in FIG. 7
could overlap, occur in parallel, occur in a different order, or
occur any number of times.
[0057] FIG. 8 illustrates an example of a computing device 800 for
performing the video stitching workflow 300 of FIG. 3 or the video
stitching method 700 of FIG. 7. As shown in FIG. 8, the computing
device 800 includes a computing block 803 with a processing block
805 and a system memory 807. The processing block 805 may be any
type of programmable electronic device for executing software
instructions, but will conventionally be one or more
microprocessors. The system memory 807 may include both a read-only
memory (ROM) 809 and a random access memory (RAM) 811. As will be
appreciated by those of skill in the art, both the read-only memory
809 and the random access memory 811 may store software
instructions for execution by the processing block 805.
[0058] The processing block 805 and the system memory 807 are
connected, either directly or indirectly, through a bus 813 or
alternate communication structure, to one or more peripheral
devices. For example, the processing block 805 or the system memory
807 may be directly or indirectly connected to one or more
additional memory storage devices 815. The memory storage devices
815 may include, for example, a "hard" magnetic disk drive, a solid
state disk drive, an optical disk drive, and a removable disk
drive. The processing block 805 and the system memory 807 also may
be directly or indirectly connected to one or more input devices
817 and one or more output devices 819. The input devices 817 may
include, for example, a keyboard, a pointing device (such as a
mouse, touchpad, stylus, trackball, or joystick), a touch screen, a
scanner, a camera, and a microphone. The output devices 819 may
include, for example, a display device, a printer and speakers.
Such a display device may be configured to display video images.
With various examples of the computing device 800, one or more of
the peripheral devices 815-819 may be internally housed with the
computing block 803. Alternately, one or more of the peripheral
devices 815-819 may be external to the housing for the computing
block 803 and connected to the bus 813 through, for example, a
Universal Serial Bus (USB) connection or a digital visual interface
(DVI) connection.
[0059] With some implementations, the computing block 803 may also
be directly or indirectly connected to one or more network
interfaces cards (NIC) 821, for communicating with other devices
making up a network. The network interface cards 821 translate data
and control signals from the computing block 803 into network
messages according to one or more communication protocols, such as
the transmission control protocol (TCP) and the Internet protocol
(IP). Also, the network interface cards 821 may employ any suitable
connection agent (or combination of agents) for connecting to a
network, including, for example, a wireless transceiver, a modem,
or an Ethernet connection.
[0060] It should be appreciated that the computing device 800 is
illustrated as an example only and is not intended to be limiting.
Various embodiments of this disclosure may be implemented using one
or more computing devices that include the components of the
computing device 800 illustrated in FIG. 8, or which include an
alternate combination of components, including components that are
not shown in FIG. 8. For example, various embodiments of the
invention may be implemented using a multi-processor computer, a
plurality of single and/or multiprocessor computers arranged into a
network, or some combination of both.
[0061] The embodiments described herein provide a solution for
parallax tolerant video stitching. By jointly minimizing the
spatial-temporal cost function in the spatial-temporal localized
warping framework, the computed localized warping maps are able to
align frames from multiple videos by optimally preserving the
spatial and temporal data alignment and the spatial-temporal
smoothness. As a result, the resulting warped frames are spatially
well aligned with localized warping, and are temporally
consistent.
[0062] By finding the optimal spatial-temporal seams that take into
account the objectness of the pixels in the spatial-temporal
content-based seam finding framework, the resulting seams can be
used to stitch frames from multiple videos together with good
temporal consistency while avoiding cutting through salient
foreground objects to avoid artifacts.
[0063] In some embodiments, some or all of the functions or
processes of the one or more of the devices are implemented or
supported by a computer program that is formed from computer
readable program code and that is embodied in a computer readable
medium. The phrase "computer readable program code" includes any
type of computer code, including source code, object code, and
executable code. The phrase "computer readable medium" includes any
type of medium capable of being accessed by a computer, such as
read only memory (ROM), random access memory (RAM), a hard disk
drive, a compact disc (CD), a digital video disc (DVD), or any
other type of memory.
[0064] It may be advantageous to set forth definitions of certain
words and phrases used throughout this patent document. The terms
"include" and "comprise," as well as derivatives thereof, mean
inclusion without limitation. The term "or" is inclusive, meaning
and/or. The phrases "associated with" and "associated therewith,"
as well as derivatives thereof, mean to include, be included
within, interconnect with, contain, be contained within, connect to
or with, couple to or with, be communicable with, cooperate with,
interleave, juxtapose, be proximate to, be bound to or with, have,
have a property of, or the like.
[0065] While this disclosure has described certain embodiments and
generally associated methods, alterations and permutations of these
embodiments and methods will be apparent to those skilled in the
art. Accordingly, the above description of example embodiments does
not define or constrain this disclosure. Other changes,
substitutions, and alterations are also possible without departing
from the spirit and scope of this disclosure, as defined by the
following claims.
* * * * *