U.S. patent application number 11/188385 was filed with the patent office on 2005-07-25 and published on 2005-12-01 for method and system for interacting with a user in an experiential environment.
Invention is credited to Barton, Christopher Jacques Penrose, Inghelbrecht, Philip, Mukherjee, Dheeraj Shankar, Wang, Avery Li-Chun.
Application Number | 20050267817 11/188385 |
Family ID | 24953713 |
Filed Date | 2005-07-25 |
United States Patent
Application |
20050267817 |
Kind Code |
A1 |
Barton, Christopher Jacques Penrose
; et al. |
December 1, 2005 |
Method and system for interacting with a user in an experiential
environment
Abstract
A method and system provides a user with an ability to
capture a sample of an experiential environment and deliver that
sample to an interactive service to trigger one or more
predetermined events. In exemplary embodiments of the invention
such triggered events include the delivery of information and
services to the user, the execution of tasks and instructions by
the service on the user's behalf, communication events,
surveillance events and other control-oriented events that are
responsive to the user's wishes. In other exemplary embodiments of
the invention, the triggered events include transaction-oriented
events, entertainment events, and events associated with
enhancements to human ability or function.
Inventors: |
Barton, Christopher Jacques
Penrose; (San Diego, CA) ; Inghelbrecht, Philip;
(Berkeley, CA) ; Mukherjee, Dheeraj Shankar;
(London, GB) ; Wang, Avery Li-Chun; (Palo Alto,
CA) |
Correspondence
Address: |
MAYER, FORTKORT & WILLIAMS, PC
251 NORTH AVENUE WEST
2ND FLOOR
WESTFIELD
NJ
07090
US
|
Family ID: |
24953713 |
Appl. No.: |
11/188385 |
Filed: |
July 25, 2005 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11188385 | Jul 25, 2005 | |
09734949 | Dec 12, 2000 | |
Current U.S.
Class: |
705/26.1 |
Current CPC
Class: |
G10H 7/02 20130101; G06Q
30/0277 20130101; G06Q 30/0251 20130101; G06Q 30/0601 20130101;
G06Q 30/0207 20130101; G06Q 30/0625 20130101; G06Q 30/02 20130101;
G06Q 30/0633 20130101; G06Q 99/00 20130101; G06Q 30/0641 20130101;
G06Q 30/0217 20130101 |
Class at
Publication: |
705/026 |
International
Class: |
G10H 007/00 |
Claims
What is claimed is:
1. A method for interacting with a user comprising: receiving from
the user a captured sample of an experiential environment; and
triggering a predetermined event in response to the captured
sample.
2. The method according to claim 1, further comprising determining
from the signal a characteristic of the captured sample and
triggering a predetermined event in response to the determined
characteristic.
3. The method according to claim 1, wherein the predetermined event
includes a transaction.
4. The method according to claim 3, wherein the transaction
includes sales and purchase of merchandise.
5. The method according to claim 3, wherein the transaction
includes an offer for sale of merchandise.
6. The method according to claim 5, wherein the offer for sale of
merchandise includes an offer to sell recordings of music.
7. The method according to claim 6, wherein the recording is
related to a characteristic of the captured sample.
8. The method according to claim 3, wherein the transaction
includes furnishing and receiving information.
9. The method according to claim 3, wherein the transaction
includes delivery of advertising or promotional offers.
10. The method according to claim 9, wherein the promotional offers
include trial offers.
11. The method according to claim 9, wherein the promotional offers
include offers to sell merchandise or services at discounted
prices.
12. The method according to claim 3, wherein the transaction
includes an exchange of information between a sales source and the
user attendant to a sale of merchandise or services to the
user.
13. The method according to claim 5, wherein the offer is selected
in response to a profile of the user.
14. The method according to claim 5, wherein the offer is selected
in response to history of transactions completed with the user.
15. The method according to claim 1, wherein the predetermined
event includes a communication event.
16. The method according to claim 1, wherein the predetermined
event includes an entertainment event.
17. The method according to claim 1, wherein the predetermined
event includes a surveillance or monitoring event.
18. The method according to claim 1, wherein the predetermined
event includes a human ability enhancement event.
19. The method according to claim 1, wherein the predetermined
event includes an execution of tasks or instructions event.
20. The method according to claim 1, wherein the predetermined
event includes a delivery of information event.
21. The method according to claim 1, wherein the predetermined
event includes a delivery of services event.
22. The method according to claim 1, wherein the predetermined
event includes a control event.
23. The method according to claim 1, wherein the predetermined
event includes a combination of communication events and control
events.
24. The method according to claim 5, wherein the predetermined
event includes an entertainment event.
25. A method for interacting with a user, comprising: providing a
user with an ability to capture a sample of an experiential
environment; and delivering said sample to an interactive service
to trigger one or more predetermined events.
26. The method according to claim 25, wherein the predetermined
event includes a communication event.
27. The method according to claim 25, wherein the predetermined
event includes a surveillance event.
28. A method for interacting with a user, comprising: capturing a
sample of an experiential environment; and triggering one or more
predetermined events at an interactive service upon receipt of said
sample.
29. The method according to claim 28, wherein the one or more
predetermined events includes at least a communication event.
30. The method according to claim 28, wherein the one or more
predetermined events includes at least a surveillance event.
31. A method for receiving instructions from a user in an
interactive service, comprising: receiving a sample of an
experiential environment from the user; and triggering one or more
predetermined events in the interactive service upon receipt of
said sample.
32. The method according to claim 31, wherein the one or more
predetermined events includes at least a communication event.
33. The method according to claim 31, wherein the one or more
predetermined events includes at least a surveillance event.
34. A method for interacting with an interactive service
comprising: sampling an experiential environment; and forwarding
the sample to an interactive service to trigger one or more
predetermined events.
35. The method according to claim 34, wherein the one or more
predetermined events includes at least a communication event.
36. The method according to claim 34, wherein the one or more
predetermined events includes at least a surveillance event.
37. The method according to claim 34, further comprising deriving
information from the sample.
38. The method according to claim 37, further comprising combining
information derived from the sample with information known about
the user to trigger the one or more predetermined events.
39. The method according to claim 38, wherein the one or more
predetermined events includes one or more of the following:
delivery of information and/or services to the user, execution of
tasks and/or instructions by the service on the user's behalf, a
communication event, a surveillance event and one or more
control-oriented events that are responsive to the user's
wishes.
40. The method according to claim 39, wherein the triggered events
also include one or more of the following: transaction-oriented
events, entertainment events, and events associated with
enhancements to human ability or function.
41. A method for enhancing human abilities of a user comprising:
capturing a sample of an experiential environment of the user;
forwarding the captured sample to an interactive service; and
triggering an enhancement event based on the captured sample.
42. A method for interacting with a user comprising: receiving a
captured sample of an experiential event from a user; triggering a
transactional event with the user based on a characteristic of the
captured sample.
43. The method according to claim 42, wherein the transactional
event includes storing the captured sample.
44. The method according to claim 43, wherein the transactional
event includes sorting the captured sample based on a dominant
characteristic inherent in the captured sample.
Description
STATEMENT OF RELATED APPLICATIONS
[0001] This application is a continuation of co-pending U.S. patent
application Ser. No. 09/734,949, filed Dec. 12, 2000, entitled
"Method And System For Interacting With A User In An Experiential
Environment," which is incorporated by reference herein in its
entirety.
BACKGROUND OF THE INVENTION
[0002] The present invention relates generally to methods and
apparatuses for interacting with a user, and more particularly to a
method and apparatus for interacting with a user in an experiential
environment.
[0003] Methods and systems for interacting with a user exist.
Examples include computer graphical user interfaces and computer
pointing and clicking devices, sometimes referred to as "mice," to
name just a few. These devices fail to take advantage of the
experiential environment in which a user often finds himself or herself and
therefore lack the ability to provide rich information to a
computer as to the user's desires.
[0004] A wide variety of media is available through many delivery
channels to a large audience. Rapidly evolving communication
technologies such as telephony, television, pre-recorded media
including CDs and DVDs, radio, and the Internet in combination with
traditional media such as print and visual arts operate to create a
vast experiential environment for many people. Such environments
are further enriched with other sensory experiences such as those
coming from live performances, speeches and lectures, and routine
day-to-day interactions with others. Ambient conditions like the
sights and sounds of a bustling city street or the buzz of insects
and chirping of birds in a more idyllic rural setting provide
additional dimensions to the experiential environment.
[0005] In the midst of these experiential environments, users are
often desirous of communicating with a computer or other automated
processing device. Yet these devices cannot take advantage of the
state of the user, which otherwise provides valuable information to
one attempting to ascertain the needs of the user.
[0006] The present invention is therefore directed to the problem
of developing a method and system for interacting with a user in a
way that takes advantage of the rich and stimulating experiential
environment to which the user is exposed to simplify the processing
of user requests.
SUMMARY OF THE INVENTION
[0007] The present invention solves these and other problems by
providing a method and apparatus for interacting with a user that
employs a captured sample of an experiential environment in which
the user exists as a command to trigger subsequent events.
[0008] A method and system for interacting with the user includes
capturing a sample of an experiential environment and delivering
that sample to an interactive service to trigger one or more
predetermined events. In exemplary embodiments of the invention
such triggered events include the delivery of information and
services to the user, the execution of tasks and instructions by
the service on the user's behalf, communication events,
surveillance events and other control-oriented events that are
responsive to the user's wishes. In other exemplary embodiments of
the invention, the triggered events include transaction-oriented
events, entertainment events, and events associated with
enhancements to human ability or function.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 depicts a block diagram of an arrangement in
accordance with the invention including an experiential
environment, a recognition engine and related structures, and an
event trigger.
[0010] FIG. 2 depicts a block diagram of the relationship of
elements of an exemplary embodiment of the invention among a
service provider, service user, media provider, and experiential
environment, among other elements.
DETAILED DESCRIPTION
[0011] At this point, it is worthy to note that any reference
herein to "one embodiment" or "an embodiment" means that a
particular feature, structure, or characteristic described in
connection with the embodiment is included in at least one
embodiment of the invention. The appearances of the phrase "in one
embodiment" in various places herein are not necessarily all
referring to the same embodiment.
[0012] The present invention includes a method and system for
interacting with a user that includes providing a user with an
ability to capture a sample of an experiential environment and
delivering that sample to an interactive service to trigger one or
more predetermined events. It is noted that the terms "service" and
"service provider" and "system" as used herein include a service,
service provider, and system, respectively, which employ the
various principles of the present invention unless otherwise
indicated. Some examples of events that can be triggered include
communication events or surveillance events.
[0013] The experiential environment can be of any form where a
message or information (e.g., content), or other sensory stimuli is
available for the user to experience, and may come from many
sources. Radio and television, pre-recorded audio and video (e.g.,
tapes, CDs, DVDs, MP3, etc.), signals on internet and computer-based
systems, telephones, live demonstrations or performances, and
sensory stimuli from ambient surroundings are several examples of
elements which may form, alone or in combination, a typical
experiential environment. Using a sampling device, such as an
ordinary mobile (or cellular) phone, video recorder, or camera, or
a personal digital assistant (PDA) or palm pad, the user captures a
sample of the experiential environment and transmits the sample to
a service provider employing the present invention.
[0014] Alternatively, a personal digital assistant or computer
could be specially enabled to act as the interactive service itself
by storing the database in its memory and performing the database
query and processing without externally accessing the service. In
those embodiments wherein the database includes song records, this
can easily be incorporated in current capacity disk drives even for
millions of songs. Thus, the personal digital assistant would
provide complete end-to-end capability for a user and would not
require a separate communication channel to the interactive
service. In this embodiment, transmission to the service provider
would only be needed for additional interaction and potentially for
updates of the music database, such as periodically (e.g.,
weekly).
[0015] Another embodiment consists of a hybrid form of the
above-mentioned capturing devices and communication channel. In this
embodiment, a Palm handheld device, for example, captures the
experiential environment and pre-processes the characteristics of
this environment before sending it on to the service provider for
recognition. This reduces the bandwidth required for the
communication channel and thereby improves the response time, yet
does not require the very high-speed, high-capacity handheld
device required by the above embodiments.
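The bandwidth saving from on-device pre-processing can be illustrated with a minimal sketch. The text does not say which characteristics are extracted, so the sketch assumes, purely for illustration, that the device reduces raw 16-bit audio samples to one coarse energy value per frame. The names `preprocess` and `FRAME_SIZE`, the frame length, and the one-byte quantization are all hypothetical.

```python
import math
import struct

FRAME_SIZE = 256  # hypothetical number of samples summarized per feature value


def preprocess(samples, frame_size=FRAME_SIZE):
    """Reduce raw 16-bit samples to one quantized energy byte per frame.

    This stands in for whatever characteristics a real device would extract;
    the source text does not name a specific feature.
    """
    features = bytearray()
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        # Scale the RMS value (0..32767) into a single byte (0..255).
        features.append(min(255, int(rms / 32768 * 256)))
    return bytes(features)


# One second of a synthetic 440 Hz tone sampled at 8 kHz (an assumed, telephone-grade rate).
raw = [int(10000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(8000)]
raw_bytes = struct.pack(f"<{len(raw)}h", *raw)
feature_bytes = preprocess(raw)

print(len(raw_bytes), "bytes raw ->", len(feature_bytes), "bytes after pre-processing")
```

Under these assumptions the device transmits one byte per 256 samples instead of two bytes per sample, which is the kind of reduction that would shorten transmission time without requiring the handheld to run the full recognition itself.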
[0016] The service provider may employ the sample by itself, may
derive information from the sample, may use data known about the
user (e.g., the user's identity and/or user profile), may accept
input from the user, or may employ a combination of all such
inputs, to trigger a predetermined event that is responsive to the
user's needs.
[0017] As described in more detail in the various embodiments of
the invention below, and as shown in FIG. 1, such events include
the delivery of information and/or services to the user, the
execution of tasks and/or instructions by the service on the user's
behalf, and other control-oriented events that are responsive to
the user's wishes. The triggered events also include
transaction-oriented events, entertainment events, and events
associated with enhancements to human ability or function. It is
noted that combinations of triggered events, as shown in FIG. 1 and
described in greater detail below, may also be advantageously used
in certain applications of the invention.
[0018] A brief overview of one exemplary embodiment of the
invention is now provided to briefly highlight the principles of
the present invention. In this example (which is described in
greater detail below), a method and system is provided for
triggering an information delivery event to a user. A user
listening to a commercial radio broadcast of popular music at home
is interested in singing along with a particular song being
broadcast, and thus desires to obtain the song's lyrics. Using a
conventional telephone, the user dials a predetermined telephone
number to access a service utilizing the principles of the present
invention, and then holds the telephone handset within audible
distance of the source of the audio signal (i.e., the speaker of
the user's home stereo), to capture a sample of the environment. In
this example, the experiential environment includes the music
playing on the user's home stereo.
[0019] A system employed by the service provider at the other end
of the telephone automatically answers the phone, and using a
recognition engine, compares the captured sample to music
recordings in a database to identify the song and thereby trigger a
desired event for the user. In this example, the triggered event is
the delivery of the lyric information to the user that corresponds
to the song recognized from the captured sample.
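The recognize-then-trigger flow of this example can be sketched as follows. The text does not describe the matching algorithm at this point, so the sketch swaps in a deliberately simple stand-in: each recording is represented by a set of feature tokens, and the sample is matched to the recording with the largest token overlap. `SONG_DATABASE`, `LYRICS`, `recognize`, `trigger_lyric_delivery`, and the `min_overlap` threshold are hypothetical names and values.

```python
# Hypothetical database: song title -> set of feature tokens derived from the recording.
SONG_DATABASE = {
    "Song A": {"f1", "f2", "f3", "f4", "f5"},
    "Song B": {"f4", "f5", "f6", "f7", "f8"},
}

# Hypothetical lyric store keyed by song title.
LYRICS = {
    "Song A": "First line of Song A...",
    "Song B": "First line of Song B...",
}


def recognize(sample_tokens, database, min_overlap=3):
    """Return the best-matching song title, or None if nothing matches well enough."""
    best_title, best_score = None, 0
    for title, tokens in database.items():
        score = len(tokens & sample_tokens)  # toy similarity: count of shared tokens
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= min_overlap else None


def trigger_lyric_delivery(sample_tokens):
    """Recognize the sample and, on success, return the event to be fulfilled."""
    title = recognize(sample_tokens, SONG_DATABASE)
    if title is None:
        return {"event": "no_match"}
    return {"event": "deliver_lyrics", "song": title, "lyrics": LYRICS[title]}


# A noisy sample: three tokens from Song B plus one spurious token.
result = trigger_lyric_delivery({"f6", "f7", "f8", "noise"})
print(result["event"], result.get("song"))
```

The threshold lets the service decline to trigger an event when the sample is too ambiguous, rather than delivering lyrics for the wrong song.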
[0020] Event fulfillment, the delivery of the desired song lyrics
to the user, may be accomplished in a number of different ways in
both real-time and delayed modes. In real-time mode, lyric data is
delivered so that the user can read the lyrics to sing along
with the song, in time, as it is being broadcast. Accordingly, data
for the song corresponding to the captured sample is read out of a
database and transmitted by the service to a user device such as a
computer, TTD device, SMS telephone or device, pager, or other
personal communication device capable of displaying text.
[0021] Depending on the display device and user preference, the
real-time delivery of lyrics may be transmitted in a batch
arrangement where relatively large portions or the entire lyric
data is downloaded at a time, or using a discrete arrangement where
lyric data is automatically synchronized to the song and
transmitted in pieces to the user as the song progresses. In the
former case, the user may choose to synchronize the display of the
lyrics to the song being played (so that the lyrics are displayed
at the appropriate time interval in advance of the lyrics being
sung or spoken by the song performer on the broadcast), or in some
applications of the invention, the user device may be configured to
perform synchronization of the batch transmission. In the latter
case, the service transmits the lyric data to automatically
synchronize to the broadcast song without any action required by
the user or user device. Such automatic synchronization is
implemented through a derivation of the temporal position of the
captured sample within the recognized song.
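A minimal sketch of how the temporal position of the sample could drive automatic synchronization is given below. It assumes that the lyric data carries a start time for each line and that the service knows how much time has passed since the sample was captured; neither detail is stated in the text. The function name `lyrics_to_send` and the parameter `lead_time_s` are hypothetical.

```python
def lyrics_to_send(timed_lyrics, sample_offset_s, elapsed_since_capture_s, lead_time_s=1.0):
    """Return (delay_s, line) pairs for lyric lines still ahead in the song.

    timed_lyrics: list of (start_time_s, text), sorted by start time within the song.
    sample_offset_s: derived temporal position of the captured sample in the song.
    elapsed_since_capture_s: time spent on transmission and recognition.
    lead_time_s: how far in advance of being sung each line should be displayed.
    """
    # Current playback position of the broadcast, estimated from the sample position.
    now_in_song = sample_offset_s + elapsed_since_capture_s
    schedule = []
    for start_s, text in timed_lyrics:
        display_at = start_s - lead_time_s
        if display_at >= now_in_song:
            schedule.append((display_at - now_in_song, text))
    return schedule


timed_lyrics = [(5.0, "line one"), (12.0, "line two"), (20.0, "line three"), (27.5, "line four")]
schedule = lyrics_to_send(timed_lyrics, sample_offset_s=8.0, elapsed_since_capture_s=5.0)
for delay, text in schedule:
    print(f"send '{text}' in {delay:.1f} s")
```

Lines whose display time has already passed are skipped, so the user only receives lyrics that can still be sung in time with the broadcast.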
[0022] The delayed mode of event fulfillment includes the delivery
of lyrics corresponding to the sampled song sometime after the song
broadcast. The same delivery methods as with real-time mode may be
used along with delivery methods that are characterized as being
more time-consuming such as mail, courier, and facsimile
transmission.
[0023] In this exemplary embodiment of the invention, additional
user input may be collected via voice or touch-tone (i.e., DTMF
tones) to further control the lyric delivery event or trigger
additional events such as transaction events. For example, by
interacting with the user through the capture device or the
delivery device (the telephone and the text-displaying device,
respectively), the service may provide purchase options to the user
to obtain the record album containing the broadcast and identified
song for which the lyrics were sought.
[0024] Other purchase options may also be provided to the user by
the service provider. In addition to being able to purchase the
identified music, the user may purchase other merchandise (which
could be related to the music or not). For example, after
purchasing the album containing the identified song, the user may
choose to purchase tickets to an upcoming performance by the song's
artist. Or, the user may inform friends or potential users of the
service about the identification and/or album purchase. Such
communication could be implemented by the service provider sending
an email or SMS message on the user's behalf.
[0025] Additional details of the invention will now be described.
Referring again to FIG. 1, a block diagram of an exemplary
arrangement of the invention is depicted. A sample captured from an
experiential environment 101 is represented by block 105. The
experiential environment 101 that may be used by the present
invention includes virtually any environment where sensory stimuli
are present for a user to sense or experience. Typically, elements
of such experiential environment 101 include, singularly or in
combination, media or other information content (each singularly or
in combination referred to as "media" here) delivered in a
channelized stream and media delivered in unchannelized streams
that are generally sensed via sight or sound. However, it is
emphasized that sight and sound are not necessarily the only sensed
experiences that may be used in applications of the invention.
[0026] Channelized media as used here means media that is delivered
to a user by discrete and identifiable means from conventional
sources such as television, radio, pre-recorded audio
and video (e.g., tapes, CDs, DVDs, MP3, streaming IP video and
audio), computers, telephones (both wired and wireless), and
messaging devices such as SMS devices and pagers. Thus, while
channelized media encompasses both unidirectional broadcasts, and
bi-directional communications, it also includes print media such as
newspapers, books, and magazines and the visual arts.
[0027] Unchannelized media means media that is delivered to a user
through means that may be indistinct, less readily identified, or
involve multiple sources of media. Examples of unchannelized media
include the performing arts, speeches and lectures and the sensory
experiences derived from interactions with other persons, or the
surrounding environment. Unchannelized media further includes
sights, sounds, and other sensory stimuli that result from a user's
ambient surroundings.
[0028] In accordance with the invention, a user (not shown in FIG.
1) captures a sample of the experiential environment 101 via line
103 using a capture device 102. The capture device 102 captures a
sample of experiential environment 105 and provides it in a format
suitable for processing by the recognition engine 110 on line 117.
Illustrative embodiments of capture devices include, but are not
limited to, microphone, telephone, mobile telephone, PDA, tape
recorder, digital voice recorder, answering machine, radio
receiver, walkie-talkie, internet streaming socket, voice-over-IP
transceiver, or other such audio sampling devices. Typically then,
the signal capture device is incorporated into a device that the
user employs in a location remote from the service. Conventional
devices like mobile and regular telephones, PCs, radios, and other
recording and communication devices that users already own or use
every day for other purposes may conveniently be used, without
modification, to sample both channelized and unchannelized
media.
[0029] Video or image capture devices may also be used as
capture device 102, in accordance with the invention, to provide a
sample of experiential environment 105 to recognition engine 110
via line 107, as depicted. For example, camcorders, video
conferencing equipment, motion and still video cameras,
surveillance devices, cameras, digital cameras, image grabbers,
picture telephones and other devices that can capture an image
sample of an experiential environment may be conveniently used to
provide a sample to recognition engine 110.
[0030] User input/identification block 120 may identify the user to
the recognition engine 110 as indicated by line 122, and may
optionally be operationally coupled to the event trigger block 130
via line 124. Examples of devices that generate the appropriate
identification for use with user input/identification block 120 may
include caller ID on a POTS (Plain Old Telephone Service) line or a
mobile telephone, internet IP address of a terminal sending in the
captured signal, or a cookie file stored on an internet browser on
the user's terminal.
[0031] In another illustrative example of user input/identification
block 120, the user may be required to enter an account code, for
example by keying it in on a touchtone pad on a telephone or saying
a pass phrase while signing on to a service incorporating the
principles of the present invention if dialing in. Alternatively,
the user may be identified by inserting an object carrying
identification codes into a terminal. Examples of this include a
credit card, ATM card, or Dallas Semiconductor Java Ring. The user
may also be identified by a biometric device to scan fingerprints,
retinas, palm print, or other such physical characteristics of the
user. A speaker identification system to identify the user by vocal
characteristics is another alternative method. User
input/identification block 120 is an optional component of the
present arrangement, which is employed if billing and tracking of
user activity is desired.
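The identification options described for block 120 can be sketched as a simple lookup that tries each available identifier in turn. The lookup tables, the sample values, and in particular the order of preference are assumptions made for illustration; the text lists the identifier types but does not rank them.

```python
# Hypothetical lookup tables mapping each identifier type to an account.
ACCOUNTS_BY_CALLER_ID = {"+15551230000": "alice"}
ACCOUNTS_BY_ACCOUNT_CODE = {"4821": "bob"}
ACCOUNTS_BY_COOKIE = {"cookie-xyz": "carol"}

# Order of preference is an assumption, not taken from the text.
IDENTIFIER_SOURCES = [
    ("account_code", ACCOUNTS_BY_ACCOUNT_CODE),
    ("caller_id", ACCOUNTS_BY_CALLER_ID),
    ("cookie", ACCOUNTS_BY_COOKIE),
]


def identify_user(request):
    """Return (method, account) for the first identifier that resolves, else None."""
    for method, table in IDENTIFIER_SOURCES:
        value = request.get(method)
        if value is not None and value in table:
            return method, table[value]
    return None  # identification is optional; the service may proceed anonymously


print(identify_user({"caller_id": "+15551230000"}))
print(identify_user({"caller_id": "+15550000000"}))
```

Returning `None` rather than raising an error reflects the statement that block 120 is optional and is needed only when billing or tracking is desired.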
[0032] As shown in FIG. 2, the elements shown and described in FIG.
1 are typically associated with entities that are independent of
one another. Experiential environment 101 is typically associated
with a media operator or content provider such as radio or
television broadcasters, CATV provider, internet service providers,
entertainment organizations, private network or LAN operators, and
the like. However, it is important to emphasize that the present
invention contemplates that the experiential environment may
comprise a live demonstration or performance, or other
unchannelized media, taking place for example, at a nightclub, bar,
or discotheque. Moreover, as described above, the experiential
environment may include sights, sounds, and other sensory stimuli
from the ambient surroundings.
[0033] Experiential environment capture device 102 is generally
associated with users, however, such association may be merely
temporary, as public access devices (e.g., public telephones and
internet access facilities) may be readily used, without
modification, in order to realize the benefits provided by the
present invention. Experiential environment sample 105 represents
features and functionalities that, for example, are implemented by
the microphone and associated transceiver circuits in a user's
mobile telephone or computer. Such features and functionalities
could also be implemented by a charge coupled device ("CCD") array
that is integrated in a camera associated with a picture telephone
or computer. User input/ID block 120 is also associated with the
user.
[0034] The remaining elements of FIG. 1 are collectively associated
as indicated in FIG. 2 with a service provider. Recognition engine
110, reporting and transaction block 137, database 115, and user
profile/transaction history 135 represent features and
functionalities of an integrated system that form key elements of
an interactive service arrangement that may be particularly useful
in some applications of the invention. In such interactive service
arrangements, these collected elements are typically implemented in
a system formed by one or more CPUs. The interactive service
arrangement is identified by reference numeral 250 in FIG. 2.
[0035] As depicted in FIG. 2, a media provider 210 utilizes signal
source 201 which transmits signal 203 via media distribution
network 220 which may be arranged from network and transmission
elements or other channelized distribution arrangements (as for
example, with copper or fiber networks for data or telephony
services) or a free space/transmitter infrastructure array (as for
example with radio and television broadcasts, satellite systems,
and cellular/PCS/GSM wireless telephone services or networks which
operate according to short-range wireless protocols such as the
Bluetooth.TM. wireless standard).
[0036] Receiver/monitor 240 is employed by user 230 to receive
signal 203 and transform the signal into a format that facilitates
signal 203 to be monitored by the user. Receiver/monitor 240 may be
a radio, television, PC, Hi-fi (i.e., stereo) with speakers or any
other device that may be used to create a media experience
(including audio and video) that may be monitored by the user. User
230 using, for example, a mobile telephone, obtains a sample of
signal 203 played on receiver/monitor 240, where the sample
includes media content of interest selected by the user, such as a
portion of an image or sound.
[0037] Thus, as shown in FIG. 2, receiver/monitor 240 both outputs
a rendition of signal 203 to 105 and allows the user 230 to monitor
signal 203. However, it is noted that signal capture device 102 may
capture a sample of signal 203 via a direct connection to media
distribution network 220 (i.e., not relying on receiver/monitor 240
or similar device for signal input). In such instances, the user
monitoring of signal 203 is accomplished through other means or
user monitoring is not performed.
[0038] In the case of unchannelized media, capture device 102 captures
a sample directly from experiential environment 101 via line 222
without relying on a media distribution network 220. In such cases,
a signal source transmits directly into the environment, as
indicated by line 221, or the experiential environment is itself
being captured by capture device 102. For example, as described
above, the arrangement of the present invention may be
advantageously used to trigger a variety of predetermined events
from a sample of live performances, interaction, or the sights and
sounds of the ambient surroundings.
[0039] The captured sample of signal 203 is relayed by the user 230
to the interactive service arrangement 250 via
communication network 260 on line 207, as shown. Communication
network 260 may have a similar arrangement as distribution network
220, or may be unitary with distribution network 220. It should be
noted, however, that in certain applications of the invention
distribution network 220 may typically be characterized by
unidirectional signal propagation (as in the case with broadcast
radio and television or typical CATV head end systems) while
communication network 260 may typically be characterized by
bi-directional signal propagation (as in the case with the public
switched telephone network and wireless or wired voice, data, and
internet systems). Such bi-directional nature of communication
network 260 is indicated by signal flow lines 207 and 208 as
depicted in FIG. 2.
[0040] In accordance with the invention, the interactive service
arrangement 250 derives information or characteristics of the sample of
signal 203. Such derived information may be returned to the user
230 from the interactive service arrangement 250 using the same
communication network 260 or other networks. The signal return path
is indicated with lines 208 in FIG. 2. And, as described above, the
interactive service arrangement 250 may interact with the user and
other entities. For illustrative purposes, such interaction
pathways are depicted in FIG. 2 as lines 289 and 282, input and
output, respectively, via alternate network 280. Alternate network
280 may be a network of any type; however, in some applications of
the invention it may be advantageous to employ private networks,
dedicated lines, or other high-capacity transmission methods should
high-bandwidth interactions be desired. Such bandwidth intensive
interactions could occur, for example, between the interactive
service arrangement 250 and fulfillment partners such as record
distributors or merchandise retailers. This communication path is
shown in FIG. 2 where fulfillment center 286 interacts with
interactive service arrangement 250 using alternate network 280 and
fulfills user requests as indicated by line 288. Fulfillment center
286 may also interact with interactive service arrangement 250
using communication network 260 over interaction pathways 292 and
294 as indicated.
[0041] Thus, FIG. 2 may serve to additionally highlight the
principles applied in the exemplary embodiment of the invention
that was briefly overviewed at the beginning of the Detailed
Description. A radio station (media provider 210) employing signal
source 201 broadcasts a song (signal 203) over the air (media
distribution network 220), which is received on a radio
(receiver/monitor 240) of user 230. User 230 wishes to sing along
with the song. User 230 places a mobile telephone call over a
wireless network (communication network 260) to interactive service
arrangement 250. User 230 positions the microphone of his wireless
telephone (capture device 102) to capture a sample of the music
being played on the radio. The interactive service arrangement 250
receives the sample over the wireless network and derives the
identity of the song.
[0042] Optionally, the identity of the user may be derived from the
user's mobile telephone number that is sent from the communication
network 260 to the interactive service arrangement 250 typically
during call set up. The song lyrics are returned to the
user's mobile telephone over the same network. Further interactions
between the user 230 and the interactive service arrangement 250
may occur, and if the user chooses to purchase related merchandise
(such as a CD recording of the album containing the broadcast song,
or hardcopy printout of the desired lyrics), the interactive
service arrangement 250 can send the purchase information to the
service's distribution facility (fulfillment center 286) via data
or voice communication using its local area PC network (alternate
network 280) or via data or voice communication over communication
network 260 as discussed above.
[0043] The distribution center sends the merchandise to the user
via overnight courier service (line 288). Similarly, alternate
network 280 may be utilized to send an email or facsimile
containing the song lyrics to the user's email account.
[0044] In some embodiments of the invention, the service relies
upon a profile for the user that is created in advance of the call
(but may be updated during a call as described below). A "user
profile" may typically include general information about the user
such as name, address, preferred method of payment (i.e., credit
card pre-authorization), and set dollar limits on purchases. In
addition, service-specific information regarding the user may also
be included in the profile, such as demographic and user-identified
preference information, to facilitate the service tailoring the
transaction to fit a particular user. Moreover, with automatic
telephone number identification, i.e., "caller ID", profiles can be
built without prior user registration.
[0045] Age, education, residence, gender, occupation, and personal
interests, likes and dislikes, among other criteria, may be
employed to most effectively match transaction offers to users'
interests and purchase habits. For example, one particular customer
of the service may have a user profile that indicates that the user
is a member of a demographic group that is music-savvy and aware of
music trends. After offering to sell a recording of the song
selected by the user, the service could offer to sell a recording
by an artist that is "moving up the charts" in popularity. Thus, by
employing a user profile in some applications of the invention, a
higher transaction closing rate may be realized as offers are more
accurately targeted to users who may be predisposed to view the
offer favorably.
[0046] It is noted that user profiles are often dynamic and that
the present invention includes an ability to update and change a
user profile in response to changes in usage patterns (as described
below), market and industry trends, user input, and other
factors.
[0047] A feature of the invention related to user profile is usage
tracking. By tracking the frequency and time of access, and
purchase history, for example, of a user, the service can gain
additional insight into factors which may influence a user.
Patterns of usage may be derived which may allow predictive
modeling to be utilized, for example, to enhance and refine service
offerings. The system of the present invention can thus
differentiate between repeat users (who heavily access the service)
and casual users (who occasionally or infrequently use the service)
and adjust the options, offers, and interactive scripting (as
described below) so that interest and enthusiasm is maintained
among the service's customers. It is contemplated that the user
profile and usage tracking/pattern features described here may be
used in other embodiments and applications of the invention as
well.
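The distinction between repeat and casual users may be illustrated with a minimal sketch. The function name, the 30-day window, and the threshold of five accesses are assumptions chosen for illustration only and are not specified by the invention.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical threshold: accesses inside the window needed to count as a "repeat" user.
REPEAT_THRESHOLD = 5

def classify_users(access_log, now, window_days=30, threshold=REPEAT_THRESHOLD):
    """Label each user 'repeat' or 'casual' from a list of (user_id, datetime) accesses."""
    cutoff = now - timedelta(days=window_days)
    # Count only the accesses that fall inside the recent window.
    recent = Counter(user for user, when in access_log if when >= cutoff)
    all_users = {user for user, _ in access_log}
    return {
        user: ("repeat" if recent[user] >= threshold else "casual")
        for user in all_users
    }
```

A service could consult such a label when choosing which options, offers, or interactive scripts to present to a given user.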
[0048] User profile and usage tracking are examples of a priori
information (i.e., information known by the service provider in
advance of a specific interaction with a user) and external
information (i.e., information known by the service provider that
is determined other than by analyzing the captured sample itself).
It will be recognized that other sources of a priori and/or
external information may also be advantageously used in some
applications of the invention. For example, the location of the
user, caller ID signal, dialed number identification service ("DNIS")
and the capture device employed by the user may, alone or in
combination, be used by the service provider to select or otherwise
tailor the triggered events to meet user expectations or needs.
[0049] Referring again to FIG. 1, the experiential environment
sample is received by recognition engine 110 on line 117.
Recognition engine 110 derives characteristics of the received
sample by using data stored in database 115. Recognition engine 110 and
database 115 are operationally coupled via line 119, as shown in
FIG. 1. A variety of derivation methods may be used. In the case of
audio samples, the techniques described in Appendix 1 may be used.
However, it is noted that the derivation methods that may be used
in this invention are not limited to such techniques. The
particular derivation method chosen is only required to be able to
derive sufficient characteristics from the experiential environment
sample to enable a predetermined event to be triggered. Thus, the
strength or resolution of the derivation is coupled, in large
measure, to the level of discrimination required in selecting an
event to be triggered. As the number of potentially triggered
events increases, the necessity to resolve ambiguity in the sample
also increases.
[0050] For example, in the case of the exemplary embodiment where
song lyrics corresponding to a broadcast song are sought by a user,
a relatively large number of characteristics about the sample may
be derived and compared against stored data to be able to identify
the particular song from the many such songs that may be stored.
That is, as more songs are potentially identified, more lyric
delivery events are potentially triggered. By comparison, in
service offerings where a relatively small number of events are
potentially triggered, fewer sample characteristics need typically
be derived in order to resolve ambiguity as to which event to
trigger. Such service offerings may include those where a binary
"Yes" or "No" event may be triggered as may be the case for
customer surveys and voting/polling type services.
[0051] The derived sample characteristics are provided to event
trigger 130 on line 128, as indicated in FIG. 1. Event trigger 130
may trigger one or more predetermined events according to the
derived sample characteristics. The triggering criteria used by
event trigger 130 may be provided by the characteristics themselves
or be externally provided by database 115. Alternatively, in the
case of binary type events, the mere receipt of any input signal at
the event trigger may be sufficient to trigger a particular event.
As used here, triggering criteria means those factors, data, or
inputs, which are used by the event trigger 130 to select the
particular one (or group) of actions to take among the many that
may be available.
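One way to picture the selection performed by event trigger 130 is as a rule table matched against the derived characteristics. The following is a sketch only; the rule format, event names, and characteristic keys are hypothetical and are not taken from the invention.

```python
def select_events(characteristics, rules):
    """Return the names of all events whose criteria are met by the characteristics.

    characteristics: dict of derived sample characteristics.
    rules: list of (event_name, criteria) pairs, where criteria is a dict of
           required key/value pairs. An empty criteria dict models a binary-type
           event that fires on the mere receipt of any input.
    """
    triggered = []
    for event_name, criteria in rules:
        if all(characteristics.get(key) == value for key, value in criteria.items()):
            triggered.append(event_name)
    return triggered

# Hypothetical rule table for illustration.
RULES = [
    ("deliver_lyrics", {"media_type": "audio", "song_id": "song-42"}),
    ("record_vote", {}),  # binary-type event: any received sample triggers it
]
```

In this sketch the criteria could equally be loaded from a database, which corresponds to the case where triggering criteria are externally provided.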
[0052] While triggering criteria are provided by recognition engine
110 as a result of the derivation of characteristics from the
captured experiential environment sample, additional triggering
criteria may also be used in combination with the derived
characteristics. As depicted in FIG. 1, such additional triggering
criteria may come from the user on line 124 as user input in the
form of voice, audio or visual commands. For example, a user may
provide input to the event trigger by using the touch tone keypads
on a telephone, providing keyboard input on a computer, or
providing captioning information over a video feed using a video
camera. In addition, user profile and transaction history
information, as described above, may be used as additional
triggering criteria. As shown in FIG. 1, user profile/transaction
history block 135 is operably coupled to event trigger 130 via line
131 to provide such additional triggering criteria. However, in
some applications of the invention, user profile/transaction
history block 135 may provide user and/or transaction data to
recognition engine 110 (not shown in FIG. 1) that may be used to
assist in the characteristic derivation process. For example, in a
music identification embodiment of the invention, recognition
engine 110 may, in certain applications, rely on the particular
user's usage history which indicates a strong preference for a
specific music genre. This usage history may help eliminate some
potential identification candidates that may have some sonic
correlations to stored data in database 115, but are very
dissimilar to the types of songs sought to be identified by the
user in the past and stored in user profile/transaction history
135.
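The use of usage history to favor or disfavor identification candidates may be sketched as a simple re-ranking step. This is an illustrative assumption rather than the method of the invention: the match scores, the additive boost, and the genre labels are all hypothetical.

```python
def rerank(candidates, genre_history, boost=0.2):
    """Re-order candidates by match score plus a bonus for the user's preferred genres.

    candidates: list of (song_id, genre, match_score) tuples.
    genre_history: dict mapping genre -> number of past identifications by this user.
    boost: hypothetical weight applied to the user's share of past requests in a genre.
    """
    total = sum(genre_history.values())

    def adjusted(candidate):
        _, genre, score = candidate
        share = genre_history.get(genre, 0) / total if total else 0.0
        return score + boost * share

    return sorted(candidates, key=adjusted, reverse=True)
```

With no history the ordering falls back to the raw match score, so the profile only influences results for users about whom something is known.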
[0053] Reporting/transaction block 137 is operably coupled to the
event trigger 130 via line 139 as shown so that event usage may be
tracked by the service provider for purposes of creating reports or
billing transactions to the user.
[0054] As shown in FIG. 1, event trigger 130 can trigger a variety
of events indicated collectively by reference numeral 180. It is
noted that the depicted events are merely illustrative and are
shown to highlight various categories or types of events that may
be triggered using the principles of the invention.
[0055] Transaction events 182 are one category of events that may
be triggered in accordance with the invention. These typically
include the purchase and sale of merchandise. In an exemplary
embodiment of the invention, a user reading the newspaper may see
an advertisement for a raincoat that stimulates a buying interest
in the user. Using a video camera on the user's PC, the user holds
the print advertisement near the camera to capture a still image of
the advertisement. By using a predetermined web address of the
World Wide Web, the user reaches a service provider and uploads
the image to the service provider's recognition engine. The
advertisement could be included in a billboard, in which case the
user could use a still camera or video camera to capture it.
[0056] The recognition engine, using conventional text recognition
algorithms, pulls the name of the advertiser and advertising copy
for the raincoat from the still image. Employing an interactive
script on the web-page, the user is prompted for size and color
preferences. The user's credit card and shipping information may
already be on file with the service provider using user
profile/transaction history 135, or such information may be
collected from the user on-line through the web-interface or using
off-line communications such as telephone, email, or facsimile.
[0057] The derived characteristics, namely the advertiser's name
and clothing information, are combined with information about the
advertiser stored on database 115 and used by event trigger 130 to
trigger a purchase by the user of the desired raincoat. The
transaction particulars are forwarded to an event fulfillment
center, depicted in FIG. 1 by reference numeral 190, for actual
physical fulfillment of the raincoat to the event recipient 195, in
this case, the user. In this exemplary embodiment of the invention,
the service provider may merely function as a deal broker between
the advertiser and user where the order information is provided to
the advertiser and the service provider collects a service fee from
the advertiser. Alternatively, the service provider may derive
revenue from charging the user a subscription fee to use the
service, which facilitates the transaction as described above. In
some applications of the invention, a combination of these revenue
models may be advantageously used.
[0058] Entertainment events 184 may also be triggered in accordance
with the invention. In addition to the exemplary embodiment
involving the delivery of lyrics, as described above, a wide
variety of entertainment-type services may be advantageously
provided by the practice of the invention. Entertainment is defined
here to include services where interaction with the service
directly provides the value to the user. This compares with some
typical transaction events where the service provider furnishes a
tangential broker or facilitation service, which is attendant to
the purchase and sale transaction between the user and a third
party.
[0059] An illustrative example of an entertainment event 184
provided by the present invention includes game-type events. A user
listening to music captures a sample of a song on a mobile
telephone, which has been dialed in to a service provider. The
recognition engine 110 compares the sample to known songs in
database 115 and derives the identity of the song. A game type
entertainment event is then triggered by the service to
automatically send a small excerpt of the originally recorded song
(i.e., not the captured sample of the song) to a pre-determined
group of the user's friends via, for example, a WAV file or MP3
attachment to an email message sent by the service provider, or
through a call placed by the service to each friend's wired or
wireless telephone. The identification data for members of the
group may be stored in a user profile, for example.
[0060] The friends are prompted in the message or call to try to
"Name that Tune" by identifying the song's title or artist from the
small excerpt. The friends' guesses may be collected by the service
provider using a variety of methods, including for example, an
interactive web-site, telephone call center, email, or conventional
mail. If no one correctly identifies the song, the song excerpt may
be increased in length, or a different excerpt may be furnished, in
an iterative manner, until a "winner" is determined.
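The iterative game described above may be sketched as a loop that lengthens the excerpt after each round without a correct guess. The function name, the excerpt lengths, and the case-insensitive title comparison are assumptions made for illustration.

```python
def run_name_that_tune(title, guesses_by_round, start_seconds=2, step_seconds=2,
                       max_seconds=10):
    """Play rounds until a friend names the song.

    title: the correct song title.
    guesses_by_round: list of {friend: guess} dicts, one dict per round.
    Returns (winner, excerpt_seconds) for the winning round, or
    (None, excerpt_seconds) if the supplied rounds end without a winner.
    """
    length = start_seconds
    target = title.strip().lower()
    for guesses in guesses_by_round:
        for friend, guess in guesses.items():
            if guess.strip().lower() == target:
                return friend, length
        # Nobody was correct: lengthen the excerpt for the next round.
        length = min(length + step_seconds, max_seconds)
    return None, length
```

Furnishing a different excerpt instead of a longer one would be an equally valid variation under the description above.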
[0061] Human ability enhancement events 186 are also depicted in
FIG. 1. These events are similarly triggered by captured samples of
an experiential environment. In embodiments of the invention using
these events, enhancements to ability and function are realized by
the ability of capture devices 102 to sense the environment in
manners that transcend human ability or in manners that replace
ordinary senses that are undeveloped or lost through illness or
accident in particular individuals.
[0062] For example, in the case of sensory enhancement, capture
devices such as telephones may quite easily be arranged to be more
sensitive to sound or to particular frequencies of sound than the
human ear through the use of amplifiers and filters. Similarly,
visual capture devices may be arranged to provide magnification of
far away scenes or make small scenes appear large. Thus, an ability
to extend one's senses through the use of technology, in accordance
with the invention, provides great opportunity to trigger desired
events that could otherwise not occur with unaided senses. In an
exemplary embodiment of this aspect of the invention, a bird
watching user may wish to keep a catalog of sightings in the
neighborhood. Using a digital camera with a zoom lens, the user
takes a digital picture of a distant bird and the captured image is
later uploaded to a predetermined service provider's computer. The
recognition engine 110 could be configured to recognize the bird
based on characteristics such as colors and patterns; however, in
other applications, recognition engine 110 may more simply be
configured to derive a smaller constellation of key
characteristics. For example, the recognition engine may derive
that the captured experiential environment sample is a
color digital photograph and trigger an image archiving event for
the user. Or, dominant colors may be derived to allow an
image-sorting event to be triggered. In both of these events, the
service provider uses the derived characteristics of the captured
sample (the digital photographs) to automatically archive and
arrange the photos in a manner that is consistent with a service
offering in which the user participates.
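The derivation of dominant colors for an image-sorting event may be sketched as follows. This is a minimal illustration under stated assumptions: images are represented as plain lists of (r, g, b) tuples, and the coarse four-level quantization is an arbitrary choice, not a technique prescribed by the invention.

```python
from collections import Counter

def dominant_color(pixels, levels=4):
    """Return the most common quantized color bucket of a non-empty pixel list.

    pixels: iterable of (r, g, b) tuples with channel values 0-255.
    levels: number of buckets per channel (hypothetical default of 4).
    """
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    return counts.most_common(1)[0][0]

def sort_by_dominant_color(images):
    """Group image names by dominant color bucket. images: {name: pixel list}."""
    groups = {}
    for name, pixels in images.items():
        groups.setdefault(dominant_color(pixels), []).append(name)
    return groups
```

Only a small constellation of characteristics is derived here, which is sufficient for archiving and arranging photographs but not for recognizing the species of bird.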
[0063] Another exemplary embodiment of sensory enhancement is the
ability for a user to implement a virtual sense in a remote
location. For example, a mother may dial a predetermined number to
access a service provider using a conventional telephone located in
a sleeping child's room. The telephone is left off hook to
constantly sample the ambient conditions in the room. The service
processes the sampled environment to determine whether the child has
awoken by listening for the frequency and amplitude characteristics
associated with crying or yelling and, if so, places a call or page to
the mother to let her know of the change in circumstances. In such
cases, the database 115 and event trigger 130 are arranged to
trigger the call or page based on the sound characteristics of
crying, but other events could be triggered, such as a call to the
fire department, using other sound characteristics such as the timbre of
a smoke or carbon monoxide detector.
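Listening for frequency and amplitude characteristics may be sketched with two simple measurements: root-mean-square amplitude and a zero-crossing estimate of frequency. This is a deliberately crude illustration. The 300 to 600 Hz band and the amplitude threshold are hypothetical placeholder values, and a practical detector would likely require a more robust analysis.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_frequency(samples, sample_rate):
    """Estimate the dominant frequency in Hz from the rate of sign changes."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    # A periodic tone crosses zero twice per cycle.
    return crossings * sample_rate / (2 * len(samples))

def sounds_like_crying(samples, sample_rate, min_rms=0.1, band=(300.0, 600.0)):
    """True when the sample is both loud enough and inside the assumed pitch band."""
    freq = zero_crossing_frequency(samples, sample_rate)
    return rms(samples) >= min_rms and band[0] <= freq <= band[1]
```

Substituting a different band and threshold would allow the same structure to respond to other sounds, such as the tone of a smoke detector.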
[0064] In the case of sensory replacement, capture devices replace
senses lost or undeveloped in an individual user, or augment
existing senses to make up for a lost sense. In these cases,
capture devices may be arranged to capture a sample of an
experiential environment to trigger events that can provide
assistance that more typical sensory aids cannot. For example, a
visually impaired user may configure video equipment to monitor the
entrance to the home. A service receiving the video feed may sample
the scene at the remote end to determine when sufficient changes
have occurred in the captured scene to determine that a person is
at the door. If desired, the user may provide data to the service
to allow particular individuals to be recognized and announced to
the user through a telephone call, page, SMS message, email, and
the like.
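Determining when sufficient changes have occurred in a captured scene may be sketched with simple frame differencing. The representation of a frame as a flat list of grayscale values and both thresholds are assumptions for illustration. The sketch detects change only and does not recognize particular individuals.

```python
def changed_fraction(prev_frame, frame, pixel_delta=30):
    """Fraction of pixels whose grayscale value changed by more than pixel_delta.

    Frames are equal-length, non-empty sequences of values in the range 0-255.
    """
    changed = sum(1 for a, b in zip(prev_frame, frame) if abs(a - b) > pixel_delta)
    return changed / len(frame)

def scene_changed(prev_frame, frame, min_fraction=0.2):
    """True when enough of the scene differs to suggest someone is at the door."""
    return changed_fraction(prev_frame, frame) >= min_fraction
```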
[0065] Block 188 indicates that the execution of task and
instruction events are provided by the invention. In embodiments of
the invention using this feature, a variety of predetermined events
are triggered by the receipt of a particular sample that involve a
service performing actions on behalf of the user. Tasks such as
bill or invoice payment may be readily automated by uploading a
scanned image of the bill to a service provider. The service
provider, upon receipt of the captured image, and optionally in
combination with user profile data, can derive the identity of the
payee and automatically pay the bill on behalf of the user and
debit the user's bank account or credit card.
[0066] Block 170 in FIG. 1 depicts that events for the delivery of
information can be triggered in response to the receipt of
particular captured samples. In embodiments of the invention
utilizing this feature, a service provider triggers the delivery of
information to the user or third parties. For example, a service
provider may receive a set of sound samples from a user's
microphone attached to a computer that is monitoring financial news
broadcasts. Deriving the names of companies from the sampled news
that the user has indicated an interest in monitoring, the service
automatically downloads stock quotes and related news information
to the user's computer.
[0067] In a similar manner as with the delivery of information,
delivery of services events, as shown in block 172 may also be
triggered in accordance with the invention. Continuing with the
financial example above, a service provider may provide financial
documentation services to the user such as the automated provision
of public securities and investment analysis documents based on the
companies of interest that are discussed in the monitored news
broadcast. In such embodiments of the invention, the user may
provide other event triggering criteria such as the minimum number
of utterances of a particular company name in a particular time
period before the automated provision of documents is
triggered.
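A user-supplied criterion of a minimum number of utterances within a time period may be sketched as a sliding-window counter. The class name, the use of numeric timestamps in seconds, and the example company name are hypothetical.

```python
from collections import deque

class UtteranceTrigger:
    """Fires when a company name is heard min_count times within window_seconds."""

    def __init__(self, min_count, window_seconds):
        self.min_count = min_count
        self.window_seconds = window_seconds
        self._times = {}  # company name -> deque of recent mention timestamps

    def observe(self, company, timestamp):
        """Record one mention and report whether the threshold is now met."""
        times = self._times.setdefault(company, deque())
        times.append(timestamp)
        # Discard mentions that have fallen out of the time window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) >= self.min_count
```

A return value of True would correspond to the point at which the automated provision of documents is triggered.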
[0068] Block 174 in FIG. 1 shows that control events may also be
triggered in response to a sampled experiential environment in
accordance with the invention. Control events are those that
provide the user with an ability to control or otherwise manipulate
information and data, services, or other events in a predetermined
manner according to the captured sample received by a service
provider. For example, a human resources recruiter may organize a
data archive of job candidates and associated demographic data by
engaging a service provider that automatically manipulates the data
according to web-site images of potential hiring companies that are
captured in a frame grabber running on the user's computer and
uploaded to the service provider. In such cases, the candidate
database can be sorted according to the captured web-document by
deriving preselected characteristics such as industry
type, key-words in the text elements of the page, and other
characteristics.
[0069] Communication events may be triggered in accordance with the
invention as depicted by block 175 in FIG. 1. Communication events
include, for example, communicative interactions among users,
between users and the service provider, or such interactions
between users, the service provider, and third parties.
[0070] An exemplary embodiment of the invention involving a
triggered communication event follows. In this embodiment, a user
listening to music dials into a service and captures a sample of a
song on his mobile telephone. The recognition engine 110 compares
the sample to known songs in database 115 and derives the identity
of the song. In accordance with the invention, the service then
triggers a communication event to notify a second user that the
first user engaged the service to identify the song. The
notification may be implemented, for example, by the service
sending an email or SMS message to the second user. An excerpt of
the original recording of the identified song may optionally be
attached in the message for the second user to hear. This
embodiment of the invention may be of particular relevance to
persons interested in music trends who want to communicate, share,
or be kept abreast of the latest developments in music.
[0071] The particular user or group of users chosen to be notified
by the user may be selected in advance, for example, by employing a
user profile that stores the email address or SMS number of the
users to be notified. Or, the user may interact with the service
via a conventional interactive voice response unit to indicate a
choice relatively contemporaneously with the sample capture
process.
[0072] In some applications of the invention, the dial-in number to
the service may be used to indicate a choice by the user. For
example, the user may program his mobile phone (or the phone may be
preprogrammed by the service provider or partner) so that a
plurality of numbers are stored, each indicating a particular user
selection to the service. If, for example, the mobile phone stores
three dial-in numbers, say *211, *212, and *213, each number may be
used to reach the service while indicating a particular user
choice. While all the numbers dial into the same service provider,
the fact that a particular dialed number was employed by the user
is given to the service provider by communication network 260
(FIG. 2) via a dialed number identification service ("DNIS"). In
the music trend embodiment described above, *211 could be used to
indicate a user desire to trigger a particular communication event,
according to a pre-programmed user profile, for example, to notify
user A that a song identification was sought by the user.
Similarly, *212 could indicate communications with users A and B,
and *213 could indicate the user desire to notify and provide a
song excerpt to a pre-defined third party.
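The use of the dialed number to indicate a user choice may be sketched as a lookup table keyed by the DNIS value. The table contents follow the *211, *212, and *213 example above, while the field names and the function are assumptions for illustration.

```python
# Hypothetical mapping from dialed number (as reported via DNIS) to a user choice.
DNIS_ACTIONS = {
    "*211": {"notify": ["A"], "attach_excerpt": False},
    "*212": {"notify": ["A", "B"], "attach_excerpt": False},
    "*213": {"notify": ["third_party"], "attach_excerpt": True},
}

def plan_notifications(dialed_number, song_id, actions=DNIS_ACTIONS):
    """Return one notification record per recipient selected by the dialed number."""
    action = actions.get(dialed_number)
    if action is None:
        return []  # unknown number: no communication event is planned
    return [
        {"to": recipient, "song_id": song_id, "attach_excerpt": action["attach_excerpt"]}
        for recipient in action["notify"]
    ]
```

In practice the table would be drawn from the pre-programmed user profile rather than being fixed in the service.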
[0073] In other applications of the invention, the DNIS may be
advantageously utilized to allow user selection of a triggerable
event from a plurality of events that may be hosted by a service
provider. If, for example, a single service provider hosts both
shopping services (a transaction event) and game services (an
entertainment event), then the DNIS may be used by the service
provider (perhaps among other factors used) to disambiguate among
the offered events and trigger the particular event desired by the
user in response to a triggering characteristic derived from the
experiential environment sample received from the user.
[0074] An embodiment of the invention similar to that above also
involves communication events. In this embodiment, the user
captures a sample of a song to trigger a communication event, which
is embodied in the form of a musical greeting card. Such musical
greeting cards include an excerpt of the original recording of the
identified song with a text or voice annotation from the user that
may be personalized to a specific occasion or event. Thus, as the
two communication event embodiments described above demonstrate,
the principles of the present invention may be advantageously
utilized to trigger user-to-user (also referred to as
"peer-to-peer") services in response to a received sample of a
captured experiential environment.
[0075] Surveillance and/or monitoring events are also contemplated
by the present invention, as depicted by block 176 in FIG. 1. In
such events, a sample or series of samples of an experiential
environment is captured by the user and relayed to the service
provider in order to trigger events that may facilitate the
detection of the presence or absence of one or more
characteristics, elements, or features in the environment, or a
change in such environment.
[0076] In a first embodiment of the invention involving a
triggerable surveillance or monitoring event, a chemical detection
device or "sniffer" is employed by a user to sample the air in the
environment. Such sniffers are known and may employ various sensory
techniques, such as gas chromatography, to determine the presence
and concentration of chemical elements or compounds in the air
sample. A signal indicative of the elemental constituents of the
sample is forwarded to the service provider whereby analysis of the
signal is performed to derive characteristics of the sample used to
trigger a desired event. For example, an agricultural user may
desire to monitor the conditions of crops. By monitoring the
emission of certain gases associated with the decomposition of
fertilizer and organic matter, and other gases produced by the
growing crops, the user may optimize the application of water,
fertilizers and pesticides to produce the highest yield. The
service provider can trigger an alert to the user, via facsimile,
email, mail, phone call, SMS message, or a variety of other
communication methods, when a particular threshold condition has
been met or exceeded.
[0077] Advantageously, the present invention separates the
experiential environment sensing apparatus (i.e., the sniffer) from
the analysis and event triggering apparatus (i.e., the recognition
engine 110 and event trigger 130 in FIG. 1). The sensing apparatus
is associated with the user who can best control its deployment in
the experiential environment, while the analysis and event
triggering apparatus is detached from the user. Such detachment
allows the service provider the opportunity to host a wide variety
of surveillance and monitoring events that would be impractical to
implement in a portable local device. For example, in accordance
with the invention, a single sniffer may be used to monitor both
crop condition, as described above, as well as methane levels in
effluents from a milk parlor of a dairy farm. The required analysis
of the gas samples, and the determination of whether an alarm
condition exists would be based on criteria that vary widely
between these two applications, yet the service provider could
readily trigger the different monitoring events due to the flexibility
associated with the remote analysis/trigger arrangement.
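The benefit of detaching analysis from sensing may be sketched by applying different alarm criteria to readings from the same sniffer. The gas names and threshold values below are placeholders invented for illustration and carry no agronomic or safety meaning.

```python
# Hypothetical per-application alarm thresholds, held by the service provider.
PROFILES = {
    "crop": {"ammonia_ppm": 25.0, "ethylene_ppm": 1.0},
    "dairy": {"methane_ppm": 1000.0},
}

def check_alerts(application, reading, profiles=PROFILES):
    """Return the sorted names of gases that meet or exceed the application's thresholds.

    application: key into profiles selecting which criteria to apply.
    reading: dict of gas name -> measured concentration from the sensing device.
    """
    thresholds = profiles[application]
    return sorted(
        gas for gas, limit in thresholds.items() if reading.get(gas, 0.0) >= limit
    )
```

Because the criteria reside with the service provider, supporting a new monitoring application in this sketch requires only a new profile entry and no change to the sensing device.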
[0078] A variation of this first embodiment includes an arrangement
where a scaled down sniffer is carried by a user. While such a
device would be limited in the range of elements it could detect,
the flexibility of the remote analysis and triggering arrangement
of the present invention still may provide many advantages. That
is, a less sensitive sensing device, in accordance with the
invention, may be coupled with a remotely located, flexible and
powerful analysis tool to trigger useful and desirable events. For
example, a user with a sensitivity to a particular environmental
condition, such as an allergy to a specific food, may carry a small
device that has sensitivity to the chemical signature of that food.
The device samples the environment periodically and sends a
representative signal via a wireless transceiver to the service
provider for analysis. Upon derivation of characteristics
indicating an alert event, the service provider can communicate
back to the user, via the same sensing device or using alternative
communication pathways such as telephone, pager or fax. Optionally,
the alert signal may be sent by the service provider to third
parties including, for example, the user's doctor, emergency
contact or others as conditions warrant.
[0079] In a second embodiment of the invention involving a
triggerable surveillance or monitoring event, a spectrum analyzer
is configured to sample the strength of constituent wavelengths of
light in an experiential environment. A particular spectral sample
of the environment can be compared by the service provider against
stored signatures to trigger a variety of surveillance or
monitoring events. For example, the level of ultraviolet radiation
in a warehouse may be compared against stored data to trigger
remote control of vents, skylights, or other HVAC devices in the
building by the service provider.
[0080] Block 178 in FIG. 1 indicates that various combinations of
the events shown and described in blocks 182-188, and 170-176 are
also contemplated as being advantageously used in some applications
of the invention. That is, some applications may benefit by
triggering multiple events, in sequence or in parallel, that may be
of the same or differing types.
[0081] Other embodiments of the invention may be implemented in
accordance with the claims that follow.
APPENDIX 1
[0082] Abstract
[0083] We disclose a method and apparatus for recognizing sound,
music, and other similar signals. The disclosed invention is
capable of recognizing an exogenous sound signal that is a
rendition of a known recording indexed in a database. The exogenous
sound signal may be subjected to distortion and interference,
including background noise, talking voices, compression artifacts,
band-limited filtering, transmission dropouts, time warping, and
other linear and nonlinear corruptions of the original signal. The
algorithm is capable of identifying the corresponding original
recording from a large database of recordings in time proportional
to the logarithm of the number of entries in the database. Given
sufficient computational power the system can perform the
identification in nearly real-time, i.e. as the sound is being
sampled, with a small lag.
[0084] Database Construction
[0085] The sound database may consist of any collection of
recordings, such as speech, music, advertisements, or sonar
signatures.
[0086] Indexing
[0087] In order to index the sound database, each recording in the
library is subjected to landmarking and fingerprinting analysis to
generate an index set for each item. Each recording in the database
has a unique index, sound_ID.
[0088] Landmarking
[0089] Each sound recording is landmarked using methods to find
distinctive and reproducible locations within the sound recording.
The ideal landmarking algorithm will be able to mark the same
points within a sound recording despite the presence of noise and
other linear and nonlinear distortion. The landmarking method is
conceptually independent of the fingerprinting process, but may be
chosen to optimize performance of the latter. Landmarking results
in a list of timepoints {landmark.sub.k} within the sound recording
at which fingerprints should be calculated. A good landmarking
scheme marks about 5-10 landmarks per second of sound recording, of
course depending on the amount of activity within the sound
recording.
[0090] Power Norms
[0091] A simple landmarking technique is to calculate the
instantaneous power at every timepoint and to select local maxima.
One way of doing this is to calculate the envelope by rectifying
and filtering the waveform directly. Another way is to calculate
the Hilbert transform (quadrature) of the signal and use the sum of
the magnitudes squared of the Hilbert transform and the original
signal.
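The envelope approach can be sketched in Python as follows. This is
an illustrative sketch only, not the implementation described here:
the name power_landmarks, the one-pole smoothing filter, and the
alpha parameter are assumptions, and squaring stands in for
rectification.

```python
def power_landmarks(samples, alpha=0.5):
    """Return sample indices where the smoothed power envelope peaks."""
    envelope = []
    previous = 0.0
    for s in samples:
        # Square the sample (instantaneous power), then smooth it with a
        # one-pole low-pass filter; alpha is an assumed smoothing factor.
        previous = alpha * (s * s) + (1.0 - alpha) * previous
        envelope.append(previous)
    # A landmark is a timepoint strictly greater than both neighbors.
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] > envelope[i + 1]]
```

Two isolated transients in an otherwise silent signal produce one
landmark each, at the sample where the transient occurs.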
[0092] Spectral Lp Norms
[0093] The power norm method of landmarking is especially good for
finding transients in the sound signal. The power norm is actually
a special case of the more general Spectral Lp Norm, where p=2. The
general Spectral Lp Norm is calculated at each time along the sound
signal by calculating the spectrum, for example via a
Hanning-windowed Fast Fourier Transform (FFT). The Lp norm for that
time slice is then calculated as the sum of the p-th power of the
absolute values of the spectral components, optionally followed by
taking the p-th root. As before, the landmarks are chosen as the
local maxima of the resulting values over time.
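A minimal Python sketch of this calculation is given below. The
frame length, hop size, and function names are assumptions, and a
direct discrete Fourier transform is used in place of an FFT so that
the sketch needs only the standard library; the two produce the same
spectral values.

```python
import cmath
import math

def hann_window(n):
    # Symmetric Hann (Hanning) window of length n, for n >= 2.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    # Direct DFT of the windowed frame, bins 0 through n/2.
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann_window(n))]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_lp_norm(frame, p=2.0, take_root=True):
    # Sum of p-th powers of the spectral magnitudes, optionally rooted.
    total = sum(m ** p for m in magnitude_spectrum(frame))
    return total ** (1.0 / p) if take_root else total

def lp_landmarks(samples, frame_len=16, hop=8, p=2.0):
    # Landmarks are the start times of frames whose norm is a local maximum.
    starts = range(0, len(samples) - frame_len + 1, hop)
    norms = [spectral_lp_norm(samples[s:s + frame_len], p) for s in starts]
    return [starts[i] for i in range(1, len(norms) - 1)
            if norms[i - 1] < norms[i] > norms[i + 1]]
```

For a short tone burst surrounded by silence, the only landmark
returned is the start of the frame that fully contains the burst.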
[0094] Multislice landmarks
[0095] Multi-slice landmarks may be calculated by taking the sum of
p-th powers of absolute values of spectral components over multiple
timeslices instead of a single slice. Finding the local maxima of
this extended sum allows optimization of placement of the
multislice fingerprints, described below.
[0096] Fingerprinting
[0097] The algorithm computes a fingerprint at each landmark
timepoint in the recording. The fingerprint is generally a value or
set of values that summarizes a set of features in the recording
near the timepoint. In our implementation the fingerprint is a
single numerical value that is a hashed function of multiple
features.
[0098] The following are a few possible fingerprint categories.
[0099] Salient Spectral Fingerprints
[0100] In the neighborhood of each landmark timepoint a frequency
analysis is performed to extract the top several spectral peaks. A
simple such fingerprint value is just the single frequency value of
the strongest spectral peak. The use of such a simple peak resulted
in surprisingly good recognition in the presence of noise, but
resulted in many false positive matches due to the non-uniqueness
of such a simple scheme. Using fingerprints consisting of the two
or three strongest spectral peaks resulted in fewer false
positives, but in some cases created a susceptibility to noise if
the second-strongest spectral peak was not strong enough to
distinguish it from its competitors in the presence of noise; in
that case the calculated fingerprint value would not be sufficiently
stable. Despite this, the performance of this case was also
good.
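One possible way to form such a fingerprint value is sketched below
in Python. The sketch is hypothetical: it ranks frequency bins by
magnitude rather than locating true spectral peaks, and the function
names and the 8-bit field per peak are assumptions made for
illustration.

```python
import cmath
import math

def magnitude_spectrum(frame):
    # Direct DFT magnitudes for bins 0 through n/2 (an FFT is equivalent).
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def salient_fingerprint(frame, num_peaks=2, bits_per_peak=8):
    mags = magnitude_spectrum(frame)
    # Frequency bin numbers ordered from strongest to weakest.
    ranked = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)
    mask = (1 << bits_per_peak) - 1
    value = 0
    for k in ranked[:num_peaks]:
        # Pack each bin number into its own bit field, strongest first.
        value = (value << bits_per_peak) | (k & mask)
    return value
```

With num_peaks=1 the value is simply the bin number of the strongest
component, which corresponds to the simplest scheme described above.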
[0101] Multislice Fingerprints
[0102] In order to take advantage of the time-evolution of many
sounds, a set of timeslices is determined by adding a set of offsets
to a landmark timepoint. At each resulting timeslice a Salient
Spectral Fingerprint is calculated. The resulting set of
fingerprint information is then combined to form one multitone
fingerprint. Each such fingerprint is far more distinctive than the
single-time salient spectral fingerprint since it tracks temporal
evolution, resulting in fewer false matches. Our experiments
indicate that using two or three timeslices along with the single
strongest spectral peak in each timeslice results in very good
performance, even in the presence of significant noise.
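The combination of timeslices can be sketched in Python as shown
below, using the single strongest spectral component in each slice.
The offsets, the frame length, the 8-bit field per slice, and the
function names are assumptions for illustration and are not taken
from the experiments described above.

```python
import cmath
import math

def strongest_bin(frame):
    # Index of the largest DFT magnitude among bins 0 through n/2.
    n = len(frame)
    mags = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]
    return max(range(len(mags)), key=lambda k: mags[k])

def multislice_fingerprint(samples, landmark, offsets=(0, 32),
                           frame_len=32, bits_per_slice=8):
    mask = (1 << bits_per_slice) - 1
    value = 0
    for offset in offsets:
        start = landmark + offset
        frame = samples[start:start + frame_len]
        if len(frame) < frame_len:
            raise ValueError("timeslice runs past the end of the signal")
        # Append this slice's strongest bin to the combined value.
        value = (value << bits_per_slice) | (strongest_bin(frame) & mask)
    return value
```

Because the bit fields are ordered by offset, two sounds containing
the same tones in a different order yield different fingerprints.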
[0103] LPC Coefficients
[0104] In addition to finding the strongest spectral components,
there are other spectral features that can be extracted and used as
fingerprints. LPC analysis extracts the linearly predictable
features of a signal, such as spectral peaks, as well as spectral
shape. LPC coefficients of waveform slices anchored at landmark
positions can be used as fingerprints by hashing the quantized LPC
coefficients into an index value. LPC is well-known in the art of
digital signal processing.
[0105] Cepstral Coefficients
[0106] Cepstral coefficients are useful as a measure of periodicity
and may be used to characterize signals that are harmonic, such as
voices or many musical instruments. A number of cepstral
coefficients may be hashed together into an index and used as a
fingerprint. Cepstral analysis is well-known in the art of digital
signal processing.
[0107] Index Set
[0108] The resulting index set for a given sound recording is a
list of pairs (fingerprint, landmark) of analyzed values. Since the
index set is composed simply of pairs of values, it is possible to
use multiple landmarking and fingerprinting schemes simultaneously.
For example, one landmarking/fingerprinting scheme may be good at
detecting unique tonal patterns, but poor at identifying
percussion, whereas a different algorithm may have the opposite
attributes. Use of multiple landmarking/fingerprinting strategies
results in a more robust and richer range of recognition
performance. Different fingerprinting techniques may be used
together by reserving certain ranges of fingerprint values for
certain kinds of fingerprints. For example, in a 32-bit fingerprint
value, the first 3 bits may be used to specify which of 8
fingerprinting schemes the following 29 bits are encoding.
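The 32-bit layout in this example can be sketched in Python as
follows. The names pack_fingerprint and unpack_fingerprint are
hypothetical, and placing the scheme selector in the most
significant bits is an assumption about what "first 3 bits" means.

```python
SCHEME_BITS = 3    # selects one of 8 fingerprinting schemes
PAYLOAD_BITS = 29  # scheme-specific fingerprint value
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1

def pack_fingerprint(scheme, payload):
    if not 0 <= scheme < (1 << SCHEME_BITS):
        raise ValueError("scheme must be in the range 0 to 7")
    # The scheme occupies the top 3 bits, the payload the lower 29 bits.
    return (scheme << PAYLOAD_BITS) | (payload & PAYLOAD_MASK)

def unpack_fingerprint(value):
    # Returns the (scheme, payload) pair stored in a 32-bit value.
    return value >> PAYLOAD_BITS, value & PAYLOAD_MASK
```

Values produced by different schemes then occupy disjoint ranges, so
fingerprints from several schemes can share one sorted index.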
[0109] Searchable Database
[0110] Once the index sets have been processed for each sound
recording in the database, a searchable database is constructed in
such a way as to allow fast (log-time) searching. This is
accomplished by constructing a list of triplets (fingerprint,
landmark, sound_ID), obtained by appending the corresponding
sound_ID to each doublet from each index set. All such triplets for
all sound recordings are collected into a large index list. In
order to optimize the search process, the list of triplets is then
sorted according to the fingerprint. Fast sorting algorithms are
well-known in the art and extensively discussed in D. E. Knuth,
"The Art of Computer Programming, Volume 3: Sorting and Searching,"
hereby incorporated by reference. High-performance sorting
algorithms can sort the list in N log(N) time, where N is the
number of entries in the list. Once this list is sorted it is
further processed by segmenting it such that each unique
fingerprint in the list is collected into a new master index list.
Each entry in this master index list contains a fingerprint value
and a pointer to a list of (landmark, sound_ID) pairs. Rearranging
the index list in this way is optional, but saves memory since each
fingerprint value only appears once. It also speeds up the database
search since the effective number of entries in the list is greatly
reduced to a list of unique values.
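The construction just described can be sketched in Python as
follows. The function name build_master_index and the use of a
dictionary keyed by sound_ID as input are assumptions; the master
index list is represented as a sorted list of
(fingerprint, [(landmark, sound_ID), ...]) entries.

```python
from itertools import groupby

def build_master_index(index_sets):
    """index_sets maps each sound_ID to its list of (fingerprint, landmark)."""
    # Append the sound_ID to every doublet to form triplets.
    triplets = [(fingerprint, landmark, sound_id)
                for sound_id, pairs in index_sets.items()
                for fingerprint, landmark in pairs]
    # Sort by fingerprint; Python's sort runs in N log(N) time.
    triplets.sort(key=lambda t: t[0])
    # Segment the sorted list so each unique fingerprint appears once.
    return [(fingerprint, [(lm, sid) for _, lm, sid in group])
            for fingerprint, group in groupby(triplets, key=lambda t: t[0])]
```

Each entry of the result pairs one fingerprint value with the list
of (landmark, sound_ID) pairs at which that fingerprint occurs.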
[0111] Alternatively, the master index list could also be
constructed by inserting each triplet into a B-tree with non-unique
fingerprints hanging off a linked list. Other possibilities exist
for constructing the master index list. The master index list is
preferably held in system memory, such as DRAM, for fast
access.
[0112] Recognition System
[0113] Once the master index list has been built it is possible to
perform sound recognition over the database.
[0114] Sound Source
[0115] Exogenous sound is provided from any number of analog or
digital sources, such as a stereo system, television, Compact Disc
player, radio broadcast, telephone, mobile phone, internet stream,
or computer file. The sounds may be real-time or offline. They may
be from any kind of environment, such as a disco, pub, submarine,
answering machine, sound file, stereo, radio broadcast, or tape
recorder. Noise may be present in the sound signal, for example in
the form of background noise, talking voices, etc.
[0116] Input to the Recognition System
[0117] The sound stream is then captured into the recognition
system either in real-time or presented offline, as with a sound
file. Real-time sounds may be sampled digitally and sent to the
system by a sampling device such as a microphone, or be stored in a
storage device such as an answering machine, computer file, tape
recorder, telephone, mobile phone, radio, etc. The sound signal may
be subjected to further degradation due to limitations of the
channel or sound capture device. Sounds may also be sent to the
recognition system via an internet stream, FTP, or as a file
attachment to email.
[0118] Preprocessing
[0119] Once the sound signal has been converted into digital form
it is processed for recognition. As with the construction of the
master index list, landmarks and fingerprints are calculated. In
fact, it is advisable to use the very same code that was used for
processing the sound recording library to do the landmarking and
fingerprinting of the exogenous sound input. The resulting index
set for the exogenous sound sample is also a list of pairs
(fingerprint, landmark) of analyzed values.
[0120] Searching
[0121] Searching is carried out as follows: each
fingerprint/landmark pair (fingerprint.sub.k, landmark.sub.k) in
the resulting input sound's index set is processed by searching for
fingerprint.sub.k in the master index list. Fast searching
algorithms on an ordered list are well-known in the art and
extensively discussed in Knuth, Volume 3 (ibid), incorporated by
reference. If fingerprint.sub.k is found then the corresponding
list of matching (landmark*.sub.j, sound_ID.sub.j) pairs having the
same fingerprint is copied and augmented with landmark.sub.k to form
a set of triplets of the form (landmark.sub.k, landmark*.sub.j,
sound_ID.sub.j). This process is repeated for all k ranging over
the input sound's index set, with all the resulting triplets being
collected into a large candidate list.
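The search step can be sketched in Python using binary search over
a sorted master index list. The function name find_candidates and
the list-of-tuples representation of the master index list are
assumptions; in practice the list of keys would be built once rather
than on every call.

```python
from bisect import bisect_left

def find_candidates(sample_index_set, master_index):
    """master_index: sorted list of (fingerprint, [(landmark*, sound_ID)])."""
    keys = [fingerprint for fingerprint, _ in master_index]
    candidates = []
    for fingerprint, landmark in sample_index_set:
        # Binary search locates the fingerprint in logarithmic time.
        pos = bisect_left(keys, fingerprint)
        if pos < len(keys) and keys[pos] == fingerprint:
            # Augment each matching pair with the sample's landmark.
            for library_landmark, sound_id in master_index[pos][1]:
                candidates.append((landmark, library_landmark, sound_id))
    return candidates
```

Fingerprints that do not occur in the master index list contribute
nothing to the candidate list.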
[0122] After the candidate list is compiled it is further processed
by segmenting according to sound_ID. A convenient way of doing this
is to sort the candidate list according to sound_ID, or by
insertion into a B-tree. The result of this is a list of candidate
sound_IDs, each of which has a scatter list of pairs of landmark
timepoints, (landmark.sub.k, landmark*.sub.j), with the sound_ID
stripped off.
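As a sketch, the segmentation can also be done with a hash map
instead of a sort or a B-tree; this substitution, and the name
segment_by_sound_id, are assumptions made to keep the example short.

```python
from collections import defaultdict

def segment_by_sound_id(candidates):
    """candidates: list of (landmark, landmark*, sound_ID) triplets."""
    scatter_lists = defaultdict(list)
    for sample_landmark, library_landmark, sound_id in candidates:
        # Strip off the sound_ID and keep only the pair of timepoints.
        scatter_lists[sound_id].append((sample_landmark, library_landmark))
    return dict(scatter_lists)
```

The result maps each candidate sound_ID to its scatter list.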
[0123] Scanning
[0124] The scatter list for each sound_ID is analyzed to determine
whether it is a likely match.
[0125] Thresholding
[0126] One way to eliminate a large number of candidates is to toss
out those having a small scatter list. Clearly, those having only 1
entry in their scatter lists cannot be matched.
[0127] Alignment
[0128] A key insight into the matching process is that the time
evolution in matching sounds must follow a linear correspondence,
assuming that the timebases on both sides are steady. This is
almost always true unless the sound on one side has been
nonlinearly warped intentionally or subjected to defective playback
equipment such as a tape deck with a warbling speed problem. Thus,
the matching fingerprints yielding correct landmark pairs
(landmark.sub.n, landmark*.sub.n) in the scatter list of a given
sound_ID must have a linear correspondence of the form
landmark*.sub.n = m * landmark.sub.n + offset
[0129] where m is the slope, and should be near 1, landmark.sub.n
is the corresponding timepoint within the exogenous sound signal,
landmark*.sub.n is the corresponding timepoint within the library
sound recording indexed by sound_ID, and offset is the time offset
into the library sound recording corresponding to the beginning of
the exogenous sound signal.
[0130] This relationship ties together the true
landmark/fingerprint correspondences between the exogenous sound
signal and the correct library sound recording with high
probability, and excludes outlier landmark pairs. Thus, the problem
of determining whether there is a match is reduced to finding a
diagonal line with slope near 1 within the scatterplot of the
points in the scatter list.
[0131] There are many ways of finding the diagonal line. A
preferred method starts by subtracting m*landmark.sub.n from both
sides of the above equation.
(landmark*.sub.n - m * landmark.sub.n) = offset
[0132] Assuming that m is approximately 1, we arrive at
(landmark*.sub.n - landmark.sub.n) = offset
[0133] The diagonal-finding problem is then reduced to finding
multiple landmark pairs that cluster near the same offset value.
This is accomplished easily by calculating a histogram of the
resulting offset values and searching for the offset bin with the
highest number of points. Since the offset must be non-negative if
the exogenous sound signal is fully contained within the correct
library sound recording, landmark pairs that result in a negative
offset are excluded.
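The histogram step can be sketched in Python as follows. The name
best_offset and the bin_width parameter are assumptions, and
landmarks are assumed to be expressed in a common time unit such as
frames or milliseconds.

```python
from collections import Counter

def best_offset(scatter_list, bin_width=1):
    """scatter_list: list of (landmark, landmark*) timepoint pairs."""
    histogram = Counter()
    for sample_landmark, library_landmark in scatter_list:
        offset = library_landmark - sample_landmark
        if offset < 0:
            continue  # negative offsets are excluded
        histogram[offset // bin_width] += 1
    if not histogram:
        return None, 0
    # The winning bin is the one holding the most landmark pairs.
    bin_index, count = max(histogram.items(), key=lambda item: item[1])
    return bin_index * bin_width, count
```

The function returns the lower edge of the winning offset bin along
with the number of points that fell into it.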
[0134] The winning offset bin of the histogram is noted for each
qualifying sound_ID, and the corresponding score is the number of
points in the winning bin. The sound recording in the candidate
list with the highest score is chosen as the winner. The winning
sound_ID is provided to an output means to signal the success of
the identification.
[0135] To prevent false identification, a minimum threshold score
may be used to gate the success of the identification process. If
no library sound recording meets the minimum threshold then there
is no identification.
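Scoring, selection of the winner, and the minimum threshold can be
sketched together in Python as shown below. The function names, the
default min_score of 3, and the unit-width offset bins are
assumptions; the check on scatter list size reflects the
thresholding step described earlier.

```python
from collections import Counter

def score_candidate(scatter_list):
    # Score is the count in the fullest non-negative offset bin.
    offsets = Counter(lib - smp for smp, lib in scatter_list if lib - smp >= 0)
    return max(offsets.values(), default=0)

def identify(scatter_lists, min_score=3):
    """scatter_lists maps sound_ID to a list of (landmark, landmark*)."""
    best_id, best_score = None, 0
    for sound_id, scatter_list in scatter_lists.items():
        if len(scatter_list) < 2:
            continue  # a single entry cannot establish a match
        score = score_candidate(scatter_list)
        if score > best_score:
            best_id, best_score = sound_id, score
    if best_score < min_score:
        return None, best_score  # no identification
    return best_id, best_score
```

Raising min_score trades missed identifications for fewer false
identifications.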
[0136] Pipelined Recognition
[0137] In a real-time system the sound is provided to the
recognition system incrementally over time. In this case it is
possible to process the data in chunks and to update the index set
incrementally. In each update period, the newly augmented index set is
used as above to retrieve candidate library sound recordings using
the searching and scanning steps above. The advantage of this
approach is that if sufficient data has been collected to identify
the sound recording unambiguously then the data acquisition may be
terminated and the result may be announced.
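The incremental loop can be sketched in Python as follows. The
callables extract_index_set and identify are hypothetical
placeholders for the landmarking/fingerprinting and
searching/scanning steps, and extract_index_set is assumed to return
landmarks measured from the start of the whole sound stream.

```python
def pipelined_recognition(chunks, extract_index_set, identify, min_score=5):
    """Process sound chunks until one recording is identified confidently."""
    index_set = []
    chunks_used = 0
    for chunk in chunks:
        chunks_used += 1
        # Augment the index set with pairs from the newest chunk.
        index_set.extend(extract_index_set(chunk))
        sound_id, score = identify(index_set)
        if sound_id is not None and score >= min_score:
            # Enough data has been collected; stop acquiring sound.
            return sound_id, chunks_used
    return None, chunks_used
```

Because chunks may be any iterable, including a generator fed by a
live capture device, returning early also ends data acquisition.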
[0138] Reporting the Result
[0139] Once the correct sound has been identified, the result is
reported. The result may be reported by means such as a computer
printout, email, SMS text messaging to a mobile phone,
computer-generated voice annotation over a telephone, or posting of
the result to an internet account which the user can access
later.
* * * * *