U.S. patent application number 13/291951 was filed with the patent office on 2011-11-08 and published on 2012-05-10 as publication number 20120113145 for AUGMENTED REALITY SURVEILLANCE AND RESCUE SYSTEM. The invention is credited to Suranjit Adhikari, Ted Dunn, and Eric Hsiao.

United States Patent Application 20120113145
Kind Code: A1
Adhikari, Suranjit; et al.
May 10, 2012
AUGMENTED REALITY SURVEILLANCE AND RESCUE SYSTEM
Abstract
A system, method, and computer program product for automatically
combining computer-generated imagery with real-world imagery in a
portable electronic device by retrieving, manipulating, and sharing
relevant stored videos, preferably in real time. A video is
captured with a hand-held device and stored. Metadata including the
camera's physical location and orientation is appended to a data
stream, along with user input. A linked server analyzes the data stream
and further annotates the metadata, producing a searchable library
of videos and metadata. Later, when a camera user generates a new
data stream, the linked server analyzes it, identifies relevant
material from the library, retrieves the material and tagged
information, adjusts it for proper orientation, then renders and
superimposes it onto the current camera view so the user views an
augmented reality.
Inventors: Adhikari, Suranjit (San Diego, CA); Dunn, Ted (Carlsbad, CA); Hsiao, Eric (San Diego, CA)
Family ID: 46019214
Appl. No.: 13/291951
Filed: November 8, 2011
Related U.S. Patent Documents

Application Number   Filing Date   Patent Number
61411053             Nov 8, 2010
Current U.S. Class: 345/633
Current CPC Class: G09G 2340/14 20130101; G09G 5/00 20130101; G09G 2354/00 20130101; G09G 2340/125 20130101; G06T 19/006 20130101; G06Q 30/0623 20130101; G09G 2340/12 20130101
Class at Publication: 345/633
International Class: G09G 5/377 20060101 G09G005/377
Claims
1. A computer-implemented method for providing an augmented reality
interface, comprising: acquiring an image of a real-world scene and
metadata with a camera; storing the image and metadata; retrieving
at least one stored image with metadata having selected features;
manipulating the retrieved image; and combining the manipulated
image with a currently observed real-world scene viewed with a
portable electronic device, wherein the metadata includes
descriptive data relating to at least one of surveillance and
rescue.
2. The method of claim 1, wherein the metadata includes at least
one of the position and orientation of an item of police
evidence.
3. The method of claim 1, wherein the metadata includes information
relating to at least one of a lost child, an invalid, an elderly
person, and a medical emergency.
4. The method of claim 1, wherein the image is at least one of a
still photograph, at least one video frame, analog, digital,
recorded, live, and communicated in a data stream.
5. The method of claim 1, wherein the metadata describes the
physical location and orientation of the camera during the
acquiring, and is provided by at least one of a GPS system, a
gyroscope, and an accelerometer.
6. The method of claim 1, wherein the metadata is provided by the
camera.
7. The method of claim 1, wherein at least one of the currently
observed scene, images, and metadata are stored on at least one of
a server and the portable electronic device.
8. The method of claim 1, wherein the selected features include the
stored physical location and orientation best matching one of (a) a
current physical location and orientation of the portable
electronic device and (b) a predicted physical location and
orientation of the portable electronic device.
9. The method of claim 1, wherein the server searches for the
selected features.
10. The method of claim 1, wherein the retrieved image is in a
second data stream.
11. The method of claim 1, wherein the portable electronic device
is at least one of a smartphone, a hand-held device, the camera, a
second camera, a PDA, and a tablet computer.
12. The method of claim 1, wherein the manipulating includes
adjusting image orientation.
13. The method of claim 1, wherein the combining includes
superimposing the manipulated image on the currently observed
scene.
14. The method of claim 1, wherein the combining includes merging
the data stream with the second data stream.
15. The method of claim 1, wherein the combining includes
displaying the manipulated image with the portable electronic
device in one of a display and a viewfinder.
16. The method of claim 1, wherein the method operates continuously
and substantially in real time.
17. The method of claim 1, wherein the method operates as the
currently observed scene changes as the portable electronic device
is moved, such motion including at least one of translating,
tilting, panning, and zooming.
18. A system for providing an augmented reality interface,
comprising: a processor; and a memory containing instructions that,
when executed by the processor, cause the processor to: acquire a
video of a real-world scene and metadata with a camera; store the
video and metadata; retrieve at least one stored video with
metadata having selected features; manipulate the retrieved video;
and combine the manipulated video with a currently observed
real-world scene viewed with a portable electronic device, wherein
the metadata includes descriptive data relating to at least one of
surveillance and rescue.
19. A computer program product for providing an augmented reality
interface, comprising a computer readable medium tangibly embodying
non-transitory computer-executable program instructions thereon
that, when executed, cause a computing device to: acquire a video
of a real-world scene and metadata with a camera; store the video
and metadata; retrieve at least one stored video with metadata
having selected features; manipulate the retrieved video; and
combine the manipulated video with a currently observed real-world
scene viewed with a portable electronic device, wherein the
metadata includes descriptive data relating to at least one of
surveillance and rescue.
20. A system for providing an augmented reality interface,
comprising: means for acquiring a video of a real-world scene and
metadata with a camera; means for storing the video and metadata;
means for retrieving at least one stored video with metadata having
selected features; means for manipulating the retrieved video; and
means for combining the manipulated video with a currently observed
real-world scene viewed with a portable electronic device, wherein
the metadata includes descriptive data relating to at least one of
surveillance and rescue.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit under 35 U.S.C. 119 of
provisional application 61/411,053 filed on Nov. 8, 2010 entitled
"An Augmented Reality Interface for Video Tagging and Sharing"
which is hereby incorporated by reference in its entirety, and is
related to seven other simultaneously-filed applications, including
Attorney Docket No. S1162.1102US1 entitled "Augmented Reality
Interface for Video", Attorney Docket No. S1162.1102US2 entitled
"Augmented Reality Interface for Video Tagging and Sharing",
Attorney Docket No. S1162.1102US3 entitled "Augmented Reality
System for Communicating Tagged Video and Data on a Network",
Attorney Docket No. S1162.1102US4 entitled "Augmented Reality
System for Position Identification", Attorney Docket No.
S1162.1102US5 entitled "Augmented Reality System for Supplementing
and Blending Data", Attorney Docket No. S1162.1102US6 entitled
"Augmented Reality Virtual Guide System", Attorney Docket No.
S1162.1102US7 entitled "Augmented Reality System for Product
Identification and Promotion", each of which is hereby incorporated
by reference in its entirety.
FIELD OF THE INVENTION
[0002] The present patent document relates in general to augmented
reality systems, more specifically to relating stored images and
videos to those currently obtained by an observer's portable
electronic device.
BACKGROUND OF THE INVENTION
[0003] Modern portable electronic devices are becoming increasingly
powerful and sophisticated. Not only are devices running faster CPUs, they are also equipped with sensors that make them more versatile than traditional personal computers. The use of GPS, gyroscopes, and accelerometers has made these devices location-aware, and has opened up a world of applications that did not seem possible before.
[0004] The standard definition of augmented reality is live direct
or indirect viewing of a physical real-world environment whose
elements are augmented by virtual computer-generated imagery.
Traditionally, augmented reality applications have been limited to expensive custom setups used in universities and academia, but with the advent of modern smartphones and powerful embedded processors,
many of the algorithms that were once confined to the personal
computer world are becoming a part of the mobile world. Layar and
AroundMe are examples of two such applications that are
increasingly popular and have been ported to many smartphones
(Layar is a product of the company Layar, of the Netherlands, and
AroundMe is a product of the company Tweakersoft). Both the Layar
and AroundMe applications use location data obtained from GPS
sensors to overlay additional information such as direction and
distance of nearby landmarks.
[0005] Typically, augmented reality implementations have relied on
three elemental technologies:
[0006] (1) Sensing technologies to identify locations or sites in
real space using markers, image recognition algorithms, and
sensors.
[0007] (2) Information retrieval and overlay technologies to create
virtual information and to overlay it on top of live images
captured by the camera.
[0008] (3) Display technologies capable of integrating real and virtual information, including mobile phone displays, projectors, and augmented reality glasses.
[0009] In addition, mobile augmented reality techniques are roughly
classified into two types based on the type of sensing technology
used.
[0010] A. Location Based Augmented Reality
[0011] Location based augmented reality techniques determine the
location or orientation of a device using GPS or other sensor, then
overlay the camera display with information relevant to the place
or direction. The four common sensor platforms used are described
below:
[0012] GPS: The Global Positioning System provides worldwide
coverage and measures the user's 3D position, typically within 30
meters for regular GPS, and about 3 meters for differential GPS. It
does not measure orientation. One of the major drawbacks of using
GPS based systems is that they require direct line-of-sight views
to the satellites and are commonly blocked in urban areas, canyons,
etc. This limits their usability severely.
[0013] Inertial, geomagnetic, and dead reckoning: Inertial sensors
are sourceless and relatively immune to environmental disturbances.
Their main drawback however is that they accumulate drift over a
period of time. The key to using inertial sensors therefore lies in developing efficient filtering and correction algorithms that can compensate for this drift error.
[0014] Active sources: For indoor virtual environments, a common
approach is the use of active transmitters and receivers (using
magnetic, optical, or ultrasonic technologies). The obvious
disadvantage of these systems is that modifying the environment in
this manner outdoors is usually not practical and restricts the
user to the location of the active sources.
[0015] Passive optical: This method relies on using video or
optical sensors to track the sun, stars, or surrounding
environment, to determine a frame of reference. However, most augmented reality applications refrain from using these algorithms since they are computationally intensive.
[0016] B. Vision Based Augmented Reality
[0017] Vision based augmented reality techniques attempt to model
precise descriptions of the shape and location of the real objects
in the environment using image processing techniques or predefined
markers, and use the information obtained to align the virtual
graphical overlay. These techniques may be subdivided into two main
categories.
[0018] Marker Based Augmented Reality: Marker based augmented
reality systems involve recognition of a particular marker called
an augmented reality marker with a camera, and then overlaying
information on the display that matches the marker. These markers
are usually simple monochrome markers and may be detected fairly
easily using less complex image processing algorithms.
[0019] Markerless augmented reality: Markerless augmented reality systems recognize a location or an object not by augmented reality markers but by image feature analysis, then combine information with the live image captured by the camera. Well-known examples of this image tracking approach are Parallel Tracking and Mapping (PTAM), developed at Oxford University, and Speeded Up Robust Features (SURF), which has recently been used by Nokia Research.
[0020] Even though these techniques have been deployed and used
extensively in the mobile space, there are still several technical
challenges that need to be addressed for a robust, usable augmented
reality system.
[0021] There are three main challenges discussed hereafter:
[0022] I. Existing Mobile Rendering APIs are Not Optimal
[0023] Existing Mobile 3D solutions are cumbersome and impose
limitations on seamless integration with live camera imagery. For
complete integration between live camera and overlaid information,
the graphics overlay needs to be transformed and rendered in
real-time based on the user's position, orientation, and heading.
The accuracy of the rendering is important since augmented reality
applications offer a rich user experience by precisely registering and orienting overlaid information with elements in the user's surroundings. Precise overlay of graphical information over a
camera image creates a more intuitive presentation. User experience
therefore degrades quickly when accuracy is lost. There have been
several implementations that have achieved fast rendering by using
OpenGL, or by remote rendering the information and streaming the
video to mobile embedded devices. Most modern smartphones have
graphics libraries such as OpenGL that use the inbuilt GPU to
offload the more computationally expensive rendering operations so
that other CPU intensive tasks such as the loading of points of
interest are not blocked. However the use of OpenGL on smartphone
platforms introduces other challenges. One of the biggest
disadvantages of using OpenGL is that once perspective-rendered
content is displayed onscreen, it is hard to perform hit testing
because OpenGL ES 1.1 does not provide APIs for "picking mode" or
"selection" used to determine the geometry at particular screen
coordinates. When controls are rendered in a perspective view, it
is hard to determine whether touch events lie within the control
bounds. Therefore, even though OpenGL supports perspective 3D rendering under the processing constraints typical of modern mobile smartphones, it is not optimal.
[0024] II. Real-Time Marker/Markerless Systems are Too Complex
[0025] Real-time detection and registration of a frame reference is
computationally expensive, especially for markerless techniques.
Mapping a virtual environment onto the real-world coordinate space
requires complex algorithms. To create a compelling experience, the
virtual viewport must update quickly to reflect changes in the
camera's orientation, heading, and perspective as the user moves
the camera. This makes it essential to gather information about the
device's physical position in the environment in real-time.
Traditional techniques for frame of reference estimation depend on
identifiable markers embedded in the environment or
computationally-intensive image processing algorithms to extract
registration features. Most of these image processing techniques
need to be optimized extensively to fit within the hardware
constraints imposed by mobile devices. For closed environments
where markers may be placed beforehand, the use of identifiable
markers for detection and frame of reference estimation is usually
the best viable option. This approach, however, is less suitable
for augmented reality applications in outdoor environments since
setting up the environment with markers prior to the application's
use is unlikely. Attempts to perform real-time natural feature detection and tracking on modern mobile devices have been largely intractable since they use large amounts of cached data and significant processing power.
[0026] III. Sensor Data for Location Based Systems is
Inaccurate
[0027] For location based augmented reality systems, especially GPS
based systems, sensor noise makes orientation estimation difficult.
Modern mobile smartphones contain a number of sensors that are applicable for augmented reality applications. For example, cameras are ubiquitous, and accelerometers and geomagnetic sensors are available in most smartphones. Geomagnetic and gyroscope sensors provide information about the user's heading and angular rate, which may be combined with GPS data to estimate field of view and location.
However, these sensors present unique problems, as they do not
provide highly accurate readings and are sensitive to noise. To map
the virtual augmented reality environment into a real-world
coordinate space, sensor data must be accurate and free of noise
that may cause jittering in rendered overlays. The reduction of
noise thus represents a significant challenge confronting augmented
reality software.
[0028] This patent application provides viable approaches to solve these challenges and presents a practical implementation of those techniques on a mobile phone. A new methodology for localizing,
tagging, and viewing video augmented with existing camera systems
is presented. A smartphone implementation is termed "Looking
Glass".
SUMMARY OF THE EMBODIMENTS
[0029] A system, method, and computer program product for an
augmented reality interface are disclosed and claimed herein.
Exemplary embodiments may comprise acquiring an image of a
real-world scene and metadata with a camera, storing the image and
metadata, retrieving at least one stored image with metadata having
selected features, manipulating the retrieved image, and combining
the manipulated image with a currently observed real-world scene
viewed with a portable electronic device. The image may be a still photograph or at least one video frame, up to a full video. The
image may be in analog or digital format, and may be recorded or
live. The image may be communicated in a data stream. The metadata
may describe the physical location and orientation of the camera
during the acquiring, and may be provided by a GPS system, a
gyroscope, and/or an accelerometer. The metadata may be provided by
the camera.
[0030] The currently observed scene, images, and/or metadata may be
stored on a server and/or the portable electronic device. The
selected features may include the stored physical location and
orientation best matching a current physical location and
orientation of the portable electronic device. Alternately, the
selected features may include the stored physical location and
orientation best matching at least one predicted physical location
and orientation of the portable electronic device. The server may
search for the selected features, and the retrieved image may be in
a second data stream. The portable electronic device may include a
smartphone, a hand-held device, the camera, a second camera, a PDA,
and/or a tablet computer. The embodiment may manipulate the
retrieved image by adjusting image orientation.
[0031] The embodiment may superimpose the manipulated image on the
currently observed scene, which may involve merging the data stream
with the second data stream. The embodiment may combine manipulated
imagery by displaying the manipulated image with the portable
electronic device in a display or a viewfinder. The method
preferably operates continuously and substantially in real time.
The method may operate as the currently observed scene changes as
the portable electronic device is moved, including translating,
tilting, panning, and zooming.
[0032] A system embodiment may comprise a processor and a memory
containing instructions that, when executed by the processor, cause
the processor to acquire a video of a real-world scene and metadata
with a camera, store the video and metadata, retrieve at least one
stored video with metadata having selected features, manipulate the
retrieved video, and combine the manipulated video with a currently
observed real-world scene viewed with a portable electronic
device.
[0033] A computer program product embodiment may comprise a
computer readable medium tangibly embodying non-transitory
computer-executable program instructions thereon that, when
executed, cause a computing device to acquire a video of a
real-world scene and metadata with a camera, store the video and
metadata, retrieve at least one stored video with metadata having
selected features, manipulate the retrieved video, and combine the
manipulated video with a currently observed real-world scene viewed
with a portable electronic device.
[0034] In a second embodiment, the metadata may include annotations
by a server or a user acquiring the video. The annotations may
include details of a person, an object, or a location being
photographed. The annotations may help users share their
experiences and/or recommended locations. The acquiring and
retrieving of imagery may be performed by different persons,
including friends or clients for example.
[0035] In a third embodiment, the video and metadata may be
communicated on at least one network. The retrieving may include
pushing the data stream to a network, or pulling the data from a
network in response to a request. The network may include a private
network or the internet.
[0036] In a fourth embodiment, the retrieved video may be compared
with the currently observed real-world scene to enable navigation.
The embodiment may visually verify a real-world path or a
real-world destination for a portable electronic device user.
[0037] In a fifth embodiment, the manipulated video may be combined
with at least one historical image and a currently observed
real-world scene viewed with a portable electronic device. This
embodiment thus may place the user in a historically-based reality, for example to assist in educating the user about historical events.
[0038] In a sixth embodiment, guide information related to the
selected features is provided. The guide information may include
historical information and/or current information. The guide
information may include a virtual tour with commentary regarding
identified landmarks, museum exhibits, real properties for sale,
and/or rental properties. Access to the guide information may be
provided as a fee-based service.
[0039] In a seventh embodiment, commercial information regarding
the selected features is provided. The selected features may
include goods or services available commercially. The commercial
information may include a recommendation, a review, a promotion, an
advertisement, a price, an online vendor, a local vendor, a
descriptive differentiation presentation, or a UPC.
[0040] In an eighth embodiment, the metadata may include
descriptive data relating to at least one of surveillance and
rescue. For example, the metadata may include at least one of the
position and orientation of an item of police evidence. The
metadata may also include information relating to a lost child, an
invalid, an elderly person, or a medical emergency.
[0041] As described more fully below, the apparatus and processes
of the embodiments disclosed provide an augmented reality
interface. Further aspects, objects, desirable features, and
advantages of the apparatus and methods disclosed herein will be
better understood and apparent to one skilled in the relevant art
in view of the detailed description and drawings that follow, in
which various embodiments are illustrated by way of example. It is
to be expressly understood, however, that the drawings are for the
purpose of illustration only and are not intended as a definition
of the limits of the claimed invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0042] FIG. 1 depicts a position confidence ellipse using dead reckoning;
[0043] FIG. 2 depicts the basic algorithm for filtering a compass
heading according to an embodiment;
[0044] FIG. 3 depicts the results of the filtering algorithm on raw
sensor data within an iPhone implementation according to an
embodiment;
[0045] FIG. 4 depicts grid based location querying to retrieve and
upload virtual content according to an embodiment;
[0046] FIG. 5 depicts a scene that a user wants to tag and upload
to a server according to an embodiment;
[0047] FIG. 6 depicts an interface for recording, tagging, and
uploading a video of a scene according to an embodiment;
[0048] FIG. 7 depicts that both video data and additional location metadata are uploaded from a device to a server according to an embodiment;
[0049] FIG. 8 depicts how a live camera image is augmented with
user video which may be either streamed or pre-downloaded based on
user position and orientation according to an embodiment.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
[0050] The challenges mentioned above are now addressed, and
implementations of the present invention tackle each of the three
challenges specifically. Existing mobile rendering APIs are not
optimal; they impose certain intractable limitations on the
interaction between the live and augmented view. To mitigate these
issues, the implementations of the present invention rely on simple
scene graphs based on a nested view approach to render the content
overlay. Each view has a 4×4 visual transformation matrix,
which supports basic perspective rendering. The transformation
matrix is applied to graphics output when each view draws its
respective content, and is also applied to user interaction events
as they are passed into the view stack. The created transformation
matrix approximates the perspective distortion caused by the camera
movement, and applies the transformation to all views within the
nested tree. This enables easy rendering of interactive buttons on
the screen, and precludes the need to use other graphics libraries,
such as OpenGL. It also enables user interaction with rendered
content, which is important for mobile augmented reality
applications. Most mobile APIs provide view/widget nesting
mechanisms as well as custom APIs for manipulating transform
matrices. This technique therefore provides the most flexibility
for most augmented reality applications since at any given time
there are not many transformations that need to be handled.
However, it must be noted that as the complexity of the rendering
increases, there will be a marked decrease in performance since all
the transformations are being done in software.
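By way of illustration, the nested-view rendering and hit-testing described above may be sketched as follows. This is a minimal Python sketch, not the actual implementation; the View class, its method names, and the use of NumPy for the 4×4 matrices are assumptions for clarity.

```python
import numpy as np

class View:
    """One node of the nested-view scene graph, carrying a 4x4 transform."""
    def __init__(self, transform=None):
        self.transform = np.eye(4) if transform is None else transform
        self.children = []

    def draw(self, parent_transform=np.eye(4)):
        # Compose this view's matrix with its ancestors' before drawing, so
        # the approximated perspective distortion propagates down the tree.
        world = parent_transform @ self.transform
        self.draw_content(world)
        for child in self.children:
            child.draw(world)

    def hit_test(self, screen_point, parent_transform=np.eye(4)):
        # The same composed matrix is applied to interaction events: invert
        # it to map a screen point into this view's local coordinates.
        world = parent_transform @ self.transform
        local = np.linalg.inv(world) @ np.array([*screen_point, 0.0, 1.0])
        local /= local[3]
        if self.contains(local[:2]):
            for child in reversed(self.children):
                hit = child.hit_test(screen_point, world)
                if hit is not None:
                    return hit
            return self
        return None

    def draw_content(self, world_transform):
        pass  # platform-specific rendering goes here

    def contains(self, local_point):
        return False  # subclasses define their own bounds
```

Because a single composed matrix serves both drawing and event mapping, perspective-rendered buttons remain hit-testable without the OpenGL-style "picking mode" discussed earlier.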
[0051] To test this approach, this nested view transformation was
implemented on the iPhone 4 (iPhone is a registered trademark of
Apple Computer, Inc.). Tests showed that up to 23 separate views may be shown on the screen without any performance degradation. As a result of this investigation, it was determined that in most mobile APIs, such as those for Android (Android is a trademark of Google, Inc.) and, more recently, iPhone SDK 4.1, the video data may be exposed and nested in views using the same technique. This allows the implementation of an augmented reality application which augments the live camera imagery not only with graphics or text, but with another live or recorded video.
[0052] Another of the challenges discussed earlier was the computational complexity involved in identifying frames of
reference and correspondence. This is one of the most crucial
aspects of augmented reality technologies. Using markers certainly
solves the frame of reference issue. However, it is impractical for
most mobile augmented reality applications since it requires
customized markers to be placed. Markerless approaches attempt to
solve these issues by using CPU intensive image recognition
algorithms to identify features which may be used to determine a
frame of reference, location and position of the virtual overlay
with respect to the live camera image. These techniques, however, are impractical on most mobile devices since they have limited CPUs. On the other hand, using GPS sensors to locate position works in most cases, and most modern smartphones are equipped with GPS as well as digital compass sensors. The drawback of these sensors is that they are susceptible to noise, and GPS cannot be used indoors, which severely limits their use for indoor applications.
[0053] It is clear that none of the techniques on their own may be
used to create a complete augmented reality system that works in
all scenarios. Therefore, these limitations were addressed by using
a hybrid approach. Embodiments of the present invention use a combination of GPS, digital compass, and gyroscope information, as well as a modified markerless feature tracking algorithm, to achieve real-time image registration and location estimation that may be used in any scenario. These techniques were implemented as
an iPhone 4 application, since it provides the best combination of
sensors that are required for this approach.
[0054] The iPhone 4 contains the AGD 1, a 3-axis gyroscope/accelerometer, as well as a magnetic sensor which provides directional information. It also contains a GPS chip. Recent studies using the iPhone 4 SDK have shown that the background location notification for the GPS has an accuracy of approximately 500
meters and an active accuracy of around 30 meters when there is a
full signal lock. This is a rather large range; therefore, to get more refined and consistent location information, embodiments of the present invention combine information from the digital compass and the gyroscope to determine whether the user is moving, and use the directional and movement data to approximate location within a 500×500 meter grid. The use
of 3-axis gyros to determine location is not new and is used in
most inertial navigation systems. This technique is usually
referred to as dead reckoning.
[0055] Dead reckoning is the process of estimating present position
by projecting heading and speed from a known past position. The
heading and speed are combined into a movement vector representing
the change of position from a known position, P0, to an estimated
position, P1. The accuracy of this estimation may be quoted as a
confidence ellipse whose population mean is in the ellipse 95% of
the time. The axes of the ellipse are determined by the accuracies
of the heading detection and speed measurement. This is illustrated
in FIG. 1, which depicts a position confidence ellipse 100 using
dead reckoning.
[0056] A user moving from point P0 to point P1 may be described as
being within the 95% confidence ellipse 100 centered on P1 with
axes ab, determined by the heading sensor accuracy, and cd,
determined by the speed sensor accuracy. While the uncertainty of a
single reading may be described this way, the uncertainty of
multiple readings is calculated as the cumulative sum of the
uncertainty on all readings since the last precisely known
position. This is simply expressed in the equation
$$P_n = P_0 + \sum_{i=0}^{n} (v_i + v_e)$$

where n is the number of dead reckoning calculations since P_0, P_n is the current position, v_i is the movement vector for each calculation, and v_e is the error vector for each calculation.
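As a worked illustration of this equation, the sketch below projects heading and speed step by step and grows the worst-case confidence-ellipse axes linearly with the number of readings. The step format and the per-reading error terms are assumptions for clarity, not the filter actually used.

```python
import math

def dead_reckon(p0, steps, heading_err_m, speed_err_m):
    """steps: list of (heading_deg, speed_m_s, dt_s) readings.
    Returns the estimated position P_n and the worst-case
    confidence-ellipse axes (n*ab, n*cd) after n readings."""
    x, y = p0
    for heading_deg, speed, dt in steps:
        d = speed * dt  # length of the movement vector v_i
        x += d * math.sin(math.radians(heading_deg))  # east component
        y += d * math.cos(math.radians(heading_deg))  # north component
    n = len(steps)
    return (x, y), (n * heading_err_m, n * speed_err_m)

# Ten 1-second readings walking east at 1.4 m/s:
pos, (ab, cd) = dead_reckon((0.0, 0.0), [(90.0, 1.4, 1.0)] * 10, 0.5, 0.3)
```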
[0057] Assuming a straight path, the resultant confidence ellipse after n iterations has axes of dimension n×ab and n×cd, or, more simply, in the worst case these ellipses grow linearly with
travel distance. Clearly the accuracy of the sensors is critical to
the confidence that may be placed in position estimation using dead
reckoning. Unfortunately, the sensors on most mobile smartphones are inaccurate and severely impacted by noise. As a result, a number of noise filtering algorithms were investigated, including Kalman filter based dead reckoning and the Savitzky-Golay smoothing filter; however, none of these seemed suitable for real-time performance on mobile phone systems. It was finally decided to implement a finite impulse response (FIR) filter, a method proposed by J. Benjamin Gotow et al., who recently showed that an adapted FIR filter may be used successfully on iPhone as well as Android phones with acceptable accuracy. In addition, the more advanced Savitzky-Golay smoothing filter may be applied offline by uploading the raw sensor data to a backend server, which may process the data and then provide corrections to the algorithm periodically. FIG. 2 outlines the basic algorithm for filtering the compass heading.
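A minimal sketch of such an FIR filter applied to compass headings follows. The uniform 8-tap moving average is an assumed placeholder; the adapted filter of Gotow et al. uses its own coefficients, which are not reproduced here. Because headings wrap at 360 degrees, the samples are unwrapped before averaging.

```python
def filter_heading(samples, taps=None):
    """Smooth a list of compass headings (degrees) with an FIR filter."""
    taps = taps or [1.0 / 8] * 8          # assumed uniform 8-tap window
    window = samples[-len(taps):]
    # Unwrap so that e.g. [359, 1] averages near 0 rather than 180.
    unwrapped = [window[0]]
    for h in window[1:]:
        delta = ((h - unwrapped[-1] + 180) % 360) - 180
        unwrapped.append(unwrapped[-1] + delta)
    return sum(c * h for c, h in zip(taps, unwrapped)) % 360

print(filter_heading([358, 359, 0, 1, 2, 1, 0, 359]))  # -> 0.0, not 135
```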
[0058] FIG. 3 shows the results of the filtering algorithm on raw
sensor data within an iPhone implementation. In this accelerometer
filter implementation, different colors (not shown) may be used to
represent accelerations in different orthogonal axes.
[0059] In the preferred embodiment, this technique allows users to
record video and tag it with its current location. This tag
contains additional metadata that is uploaded to a server and is associated with the video file. The format of the metadata not only
contains longitude, latitude, and heading data but also grid
coordinates that are calculated based on the location estimation
obtained once the GPS coordinates match and the dead reckoning
algorithm kicks in. This grid based approach to data storage and
point of interest retrieval has several benefits. In areas where
there are a large number of points of interest, such as cities,
retrieving and caching a large number of geotagged points becomes
difficult. As the user moves, the system has to continuously query
its backend server to update the nearest points of interest.
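For illustration, the tag described above might be represented as follows; the field names and structure are hypothetical, since the actual binary metadata format is not specified here.

```python
# Hypothetical per-video metadata tag (illustrative field names).
video_tag = {
    "video_id": "clip_0001",
    "latitude": 37.4219,                 # degrees, from GPS
    "longitude": -122.0840,
    "heading": 112.5,                    # degrees, filtered compass heading
    "grid": {
        "lat_block": 7484, "lon_block": -24417,  # longitude/latitude square
        "sub_row": 17, "sub_col": 42,            # cell in the sub-grid
    },
    "user_tags": ["storefront", "demo"],
}
```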
[0060] Unfortunately, there are several problems with this straightforward approach. First of all, such a system is not scalable: as the number of users increases, constantly querying the database severely degrades performance. A different approach is needed to avoid the execution of expensive database queries.
Requesting and retrieving data on a mobile smartphone is also
problematic as continuous network connectivity quickly depletes the
battery, and constantly uploading to and retrieving data from
servers may adversely affect the frame rate of the application. One
way to solve this issue is to cache the data based on approximate
geolocations which are divided and stored as indexed grid
coordinates in the database.
[0061] FIG. 4 depicts grid based location querying to retrieve and
upload virtual content. This grid based approach provides a scalable mechanism for information retrieval and caching on mobile devices. It progressively loads content from a server based on GPS coordinates. A hash function places each point, denoted by its latitude/longitude and sub-grid location based on accelerometer data, into an indexed two-dimensional grid.
[0062] Each longitude/latitude square in the grid contains all
points within a specific geographical area, and may be loaded by
querying the database for the indexed coordinate values. Each square is further subdivided into a 50×50 grid, each cell of which indexes a location of roughly 10 square meters. This finer grid is indexed based on approximate location within a single longitudinal/latitudinal square, which is derived from information obtained from the filtering of the gyroscope data. Indexing the
contents of the database using discretized latitude and longitude
values obviates the need for numeric comparison and queries bounded
by latitude and longitude values. Queries may specify an exact
block index and retrieve a group of points within a predefined
geographic area.
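A minimal sketch of this discretization is given below. The block size of 0.005 degrees is an assumption, chosen so that one block spans roughly 500 meters of latitude and each cell of the 50×50 sub-grid therefore spans roughly 10 meters; the actual grid dimensions may differ.

```python
BLOCK_DEG = 0.005  # assumed: ~500 m of latitude per block, ~10 m per sub-cell

def grid_index(lat, lon):
    """Return the four index numbers identifying a point: two for its
    longitude/latitude square and two for its cell in the 50x50 sub-grid."""
    lat_block = int(lat // BLOCK_DEG)
    lon_block = int(lon // BLOCK_DEG)
    sub_row = min(49, int((lat % BLOCK_DEG) / BLOCK_DEG * 50))
    sub_col = min(49, int((lon % BLOCK_DEG) / BLOCK_DEG * 50))
    return lat_block, lon_block, sub_row, sub_col

# A retrieval query can then match an exact block index instead of
# comparing raw latitude/longitude ranges numerically, e.g.:
#   SELECT * FROM points WHERE lat_block = ? AND lon_block = ?
```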
[0063] There are several advantages to dividing content into a grid and retrieving it on a block-by-block basis. Information may be
retrieved and cached using just indexes. Each content item may be
uniquely identified with 4 index numbers, two specifying its
longitude/latitude square and two specifying its sub-grid position.
This alleviates the need for complex retrieval queries on a central
server. Caching retrieved data is also straightforward since data
may be stored and retrieved on the device based on the block index.
Purging cached data based on its distance from the user's current
location does not require iterating through each cached point.
Instead, entire blocks may be quickly deleted from the cache by
using the discrete grid indexes. In addition, filtering blocks of
points is much more efficient than processing each point and also
requires constant evaluation time, regardless of the number of
points present in the area.
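The block-level caching and purging described above might be sketched as follows; the cache layout and the distance threshold are illustrative.

```python
# Cache keyed by (lat_block, lon_block): purging distant data deletes whole
# blocks by index, without iterating over the individual cached points.
def purge_far_blocks(cache, current_block, max_block_distance=3):
    cur_lat, cur_lon = current_block
    for block in list(cache):
        if max(abs(block[0] - cur_lat),
               abs(block[1] - cur_lon)) > max_block_distance:
            del cache[block]  # constant time per block, any number of points
```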
[0064] In addition to using accurate location information,
embodiments of the present invention enhance the accuracy of the
frame of reference by analyzing the individual camera frame for
natural features. There has been considerable research in
markerless augmented reality algorithms; techniques such as PTAM,
SURF, and SIFT have all been proven to be efficient descriptors for
augmented reality applications in mobile devices. However all of
these techniques are usually used on their own and therefore are
not suitable for hybrid techniques such as those needed for
implementations of the present invention which needs to calculate
and filter location data, as well as extract image features all at
the same times without decreasing the real time performance of the
system. Therefore a simpler image descriptor is required, which may
be calculated efficiently on a mobile device.
[0065] Recently, Edward Rosten et al. presented a fast, efficient corner detection algorithm called FAST, which stands for Features from Accelerated Segment Test. The feature detector considers pixels in a Bresenham circle of radius r around the candidate point. If n contiguous pixels are all brighter than the nucleus by at least a given threshold value t, or all darker than the nucleus by at least t, then the pixel under the nucleus is considered to be a feature. Although r can in principle take any value, only a value of 3 is used (corresponding to a circle 16 pixels in circumference), and tests show that the best value of n is 9. This value of n is the lowest at which edges are not detected. The resulting detector produces very stable features.
Additionally, FAST uses the ID3 algorithm to optimize the order in
which pixels are tested, resulting in the most computationally
efficient feature detector available. ID3 stands for Iterative
Dichotomiser 3, an algorithm used to generate a heuristic decision
tree. It is an approximation algorithm that relies on Occam's razor
rule to form the decision tree.
[0066] The ID3 algorithm may be summarized as follows:
[0067] 1. Take all unused attributes and count their entropy
concerning test samples
[0068] 2. Choose attributes for which entropy is minimum (or,
equivalently, information gain is maximum)
[0069] 3. Make a node containing that attribute
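A direct sketch of the FAST segment test (r = 3, n = 9) follows; it applies the plain definition above and omits the ID3-derived decision tree that reorders the pixel tests for speed.

```python
# Offsets of the 16-pixel Bresenham circle of radius 3 around the nucleus.
CIRCLE = [(0,-3),(1,-3),(2,-2),(3,-1),(3,0),(3,1),(2,2),(1,3),
          (0,3),(-1,3),(-2,2),(-3,1),(-3,0),(-3,-1),(-2,-2),(-1,-3)]

def is_fast_corner(img, x, y, t=20, n=9):
    """img is a 2D grayscale array; (x, y) must be at least 3 px from edges."""
    nucleus = int(img[y][x])
    ring = [int(img[y + dy][x + dx]) for dx, dy in CIRCLE]
    brighter = [p > nucleus + t for p in ring]
    darker = [p < nucleus - t for p in ring]
    for flags in (brighter, darker):
        run = best = 0
        for f in flags + flags:   # doubled list handles wraparound runs
            run = run + 1 if f else 0
            best = max(best, run)
        if best >= n:
            return True
    return False
```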
[0070] In embodiments of the present invention, uploaded video on the server is analyzed for corner features. The entropy in this
case is defined as the likelihood that the current pixel being
analyzed is part of a corner. This likelihood is calculated based
on the intensity of the current pixel with respect to its
neighboring pixels. Fast corner features are also extracted for
each camera image at every frame and matched against those
retrieved from the database. A signed distance metric is used to
correct frame orientation and position to best align the virtual
view with live camera imagery. Thus, by comparing the retrieved
imagery with the currently observed real-world scene, navigation is
enabled. A targeted real-world path and/or real-world destination
may be visually verified for a user of a portable electronic
device.
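As an illustration of this alignment step, the sketch below derives a signed translation correction from matched corner pairs. It shows only the translation term, with rotation correction and the feature matching itself assumed to be done elsewhere.

```python
def overlay_correction(matches):
    """matches: list of ((x_db, y_db), (x_live, y_live)) matched corners.
    Returns the signed (dx, dy) offset to apply to the rendered overlay."""
    if not matches:
        return 0.0, 0.0
    dx = sum(lx - sx for (sx, _), (lx, _) in matches) / len(matches)
    dy = sum(ly - sy for (_, sy), (_, ly) in matches) / len(matches)
    return dx, dy
```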
[0071] The implementation of the hybrid augmented reality algorithm
detailed in the previous section is now presented. "Looking Glass"
is an augmented reality based video tagging and sharing
application. As mentioned before, the choice of platform was the iPhone 4, as it contains a 3-axis gyro and a stable SDK, which made the implementation easier. However, it should be noted that these same techniques may be easily ported to Android or any other CE platform, as long as it has a hardware profile similar to that of the iPhone 4.
[0072] The application may be divided into three distinct
stages:
[0073] In the first stage, the user may record and tag any video
taken from an iPhone 4 with location, orientation and gyroscope
data obtained from the GPS coordinates and the gyroscope filtering.
This additional information is stored in a special binary file and
associated with each video. Users may record video within the application itself and tag it with descriptions or comments. When
the user is finished, the application collates the location and
gyroscope information along with the tag information and sends it
to the backend server. FIGS. 5 and 6 depict a scene that a user
wants to tag and upload to a server, and the iPhone application
interface for recording, tagging, and uploading a video of the
scene, respectively. The user-tagged videos may include annotations providing details on a given person, an object of interest, or a location being photographed. These annotations may help users share their experiences and recommend locations to visit. Multiple users, such as friends or clients, may acquire and retrieve each other's videos.
[0074] In the second stage, the tagged videos are uploaded either the next time the device is connected to a personal computer or when it connects to a Wi-Fi network. Both the video and the metadata file are sent to the server. The server annotates the
metadata file with additional information that is obtained by
analyzing the video frames. Each video snippet may be sampled at 10
second intervals and from those samples FAST (Features from
Accelerated Segment Test) features are obtained; these features may
be used later to provide image registration information to assist
overlay. FIG. 7 depicts how both user video data and additional location metadata are uploaded from the phone to a server. The data stream may be pushed to a network, such
as a private network or the internet, or may be pulled from a
network in response to a request.
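A server-side annotation pass of this kind might be sketched with OpenCV as follows; the patent does not name a library, so the cv2 calls are an illustrative choice.

```python
import cv2

def annotate(video_path, metadata, interval_s=10):
    """Sample frames every interval_s seconds and record FAST keypoints."""
    cap = cv2.VideoCapture(video_path)
    detector = cv2.FastFeatureDetector_create(threshold=20)
    t = 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if not ok:
            break  # past the end of the video
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        keypoints = detector.detect(gray)
        metadata.setdefault("features", []).append(
            {"time_s": t, "points": [kp.pt for kp in keypoints]})
        t += interval_s
    cap.release()
    return metadata
```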
[0075] FIG. 8 depicts how a live camera image is augmented with
user video, which may be either streamed or pre-downloaded. The
third stage of the methodology involves buffering the video
snippets from the server to the user interface based on location
and orientation information. Given the current location of the
device, the server may determine the videos that will be within the
device's view and preload the smaller video snippets. As the user pans the camera through the physical space, the identified video snippets are overlaid in the location and direction at which they were originally tagged. Once the user stops panning, the FAST corner features of the current frame are matched with those of the tagged video snippet, and the video overlay is adjusted to match the view; the position of the overlay is then updated as the device moves in physical space.
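For illustration, placing a snippet on screen from headings alone reduces to mapping its bearing relative to the camera heading onto the horizontal field of view. The 60-degree field of view and 640-pixel screen width below are assumed values.

```python
def overlay_x(tag_heading, camera_heading, fov_deg=60.0, screen_w=640):
    """Horizontal screen position for a snippet tagged at tag_heading."""
    rel = ((tag_heading - camera_heading + 180) % 360) - 180  # -180..180
    if abs(rel) > fov_deg / 2:
        return None  # the tagged snippet is outside the current view
    return screen_w / 2 + (rel / fov_deg) * screen_w
```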
[0076] This patent application describes the various approaches by
which augmented reality systems are implemented and a hybrid
mechanism to build a viable, practical augmented reality system
which can run efficiently on a modern high-end mobile device. The
challenges in implementing a robust, scalable system are
identified, and applicable solutions to overcome those issues are
presented. The current work being done in hybrid techniques is
extended by using a combination of markerless image processing
techniques and location based information.
[0077] The techniques were tested by implementing a novel augmented
reality application on the iPhone 4 which allows users to record, share, and view user-generated videos using an augmented reality
interface. The popularity of websites such as YouTube and Facebook
has made the creation and sharing of user-generated videos mainstream. However, the viewing and sharing of these videos has still been limited to the grids and lists of the traditional personal computer user interface. The "Looking Glass" tool presents an interface in which the physical world around us is tagged with videos, and allows users to view them simply by focusing on a tagged location.
[0078] Further, the embodiments of the present invention enable the
user to augment the physical real world environment with user
generated videos. The augmented reality interface described makes
video available based on location, enabling sharing and viewing
videos across the physical space. By implementing an efficient algorithm on a mobile device, such an application could easily be embedded not only in mobile phones but also in other CE devices, such as still and video cameras and tablet devices. Such a system may
provide value added features along with the photos, videos, and
even live streams that may be tagged.
[0079] For example, the system may blend historical and current
visual data to produce an educational system, or electronic
trainer. Videos may be created for specific places of historical
significance that are then visited by users. Users may thus
experience in-situ descriptive videos that educate them about
historical events that occurred there. Blended current and
historical videos may "place" the user in a historically-based
reality, for example.
[0080] More generally, the embodiments of the present invention
enable a virtual guide that provides detailed information regarding
the selected features used to retrieve stored video and metadata.
The guide may provide historical and current information about a location, including a virtual tour with commentary regarding identified landmarks and museum exhibits. Real estate for sale or rent may, for example, be described by a virtual real estate agent, by taking a personal electronic device to a property of interest and playing the retrieved videos related to the property. Access to
this guide information may be provided as a fee-based service.
[0081] In another embodiment, the augmented reality interface may
enable product identification and promotion. Retrieved videos may
provide commercial information regarding the selected features
identified in current imagery or in user tags. For example, related
goods or services may be the subject of a user query when viewing
an object or catalog listing with the personal electronic device.
The system may responsively provide advertisements, recommendations
for example from other system users, reviews, a specific promotion
from local and/or online vendors, and price information. A current
local advertisement may be provided, as well as a UPC or other
barcode, and even a descriptive differentiation presentation to
help a user with comparison shopping.
[0082] In another embodiment, the metadata may include descriptive
data related to surveillance or rescue. For example, an observer
may record date, time, and location information when following a
person or vehicle of interest. Similarly, police officers and
others may record and retrieve the position and orientation of an
item of police evidence. Paramedics or other persons may similarly
record and retrieve imagery and/or metadata relating to a lost
child, an invalid, an elderly person, or other medical
emergency.
[0083] As used herein, the terms "a" or "an" shall mean one or more
than one. The term "plurality" shall mean two or more than two. The
term "another" is defined as a second or more. The terms
"including" and/or "having" are open ended (e.g., comprising).
Reference throughout this document to "one embodiment", "certain
embodiments", "an embodiment" or similar term means that a
particular feature, structure, or characteristic described in
connection with the embodiment is included in at least one
embodiment. Thus, the appearances of such phrases in various places
throughout this specification are not necessarily all referring to
the same embodiment. Furthermore, the particular features,
structures, or characteristics may be combined in any suitable
manner on one or more embodiments without limitation. The term "or"
as used herein is to be interpreted as inclusive or meaning any one
or any combination. Therefore, "A, B or C" means "any of the
following: A; B; C; A and B; A and C; B and C; A, B and C". An
exception to this definition will occur only when a combination of
elements, functions, steps or acts are in some way inherently
mutually exclusive.
[0084] In accordance with the practices of persons skilled in the
art of computer programming, embodiments are described below with
reference to operations that are performed by a computer system or
a like electronic system. Such operations are sometimes referred to
as being computer-executed. It will be appreciated that operations
that are symbolically represented include the manipulation by a
processor, such as a central processing unit, of electrical signals
representing data bits and the maintenance of data bits at memory
locations, such as in system memory, as well as other processing of
signals. The memory locations where data bits are maintained are
physical locations that have particular electrical, magnetic,
optical, or organic properties corresponding to the data bits.
[0085] When implemented in software, the elements of the
embodiments are essentially the code segments to perform the
necessary tasks. The non-transitory code segments may be stored in
a processor readable medium or computer readable medium, which may
include any medium that may store or transfer information. Examples
of such media include an electronic circuit, a semiconductor memory
device, a read-only memory (ROM), a flash memory or other
non-volatile memory, a floppy diskette, a CD-ROM, an optical disk,
a hard disk, a fiber optic medium, etc. User input may include any
combination of a keyboard, mouse, touch screen, voice command
input, etc. User input may similarly be used to direct a browser
application executing on a user's computing device to one or more
network resources, such as web pages, from which computing
resources may be accessed.
[0086] While the invention has been described in connection with
specific examples and various embodiments, it should be readily
understood by those skilled in the art that many modifications and
adaptations of the augmented reality interface described herein are
possible without departure from the spirit and scope of the
invention as claimed hereinafter. Thus, it is to be clearly
understood that this application is made only by way of example and
not as a limitation on the scope of the invention claimed below.
The description is intended to cover any variations, uses or
adaptation of the invention following, in general, the principles
of the invention, and including such departures from the present
disclosure as come within the known and customary practice within
the art to which the invention pertains.
* * * * *