U.S. patent application number 15/702,229 was published by the patent office on 2018-01-04 for systems and methods for assessing viewer engagement. The applicant listed for this patent is TVision Insights, Inc. Invention is credited to Yun Fu, Yanfeng Liu, and Inderbir Sidhu.

United States Patent Application 20180007431
Kind Code: A1
Sidhu, Inderbir; et al.
Publication Date: January 4, 2018
Application Number: 15/702229
Family ID: 59274356
Systems and Methods for Assessing Viewer Engagement
Abstract
A system for quantifying viewer engagement with a video playing
on a display includes at least one camera to acquire image data of
a viewing area in front of the display. A microphone acquires audio
data emitted by a speaker coupled to the display. The system also
includes a memory to store processor-executable instructions and a
processor. Upon execution of the processor-executable instructions,
the processor receives the image data and the audio data and
determines an identity of the video displayed on the display based
on the audio data. The processor also estimates a first number of
people present in the viewing area and a second number of people
engaged with the video. The processor further quantifies the viewer
engagement of the video based on the first number of people and the
second number of people.
Inventors: Sidhu, Inderbir (Lexington, MA); Liu, Yanfeng (Long Island City, NY); Fu, Yun (Brookline, MA)

Applicant: TVision Insights, Inc., Boston, MA, US

Family ID: 59274356

Appl. No.: 15/702229

Filed: September 12, 2017
Related U.S. Patent Documents

Application Number      Filing Date     Patent Number
PCT/US2017/012531       Jan 6, 2017
15702229 (the present application)
62275699                Jan 6, 2016
Current U.S. Class: 1/1

Current CPC Class: G06K 9/00369 20130101; H04N 21/4394 20130101; H04N 5/332 20130101; G06K 9/2018 20130101; H04N 21/44008 20130101; H04N 21/44222 20130101; H04N 21/42203 20130101; H04N 5/247 20130101; G06K 9/00228 20130101; G06K 9/00597 20130101; H04N 5/33 20130101; G06K 2009/00322 20130101; G06K 9/00302 20130101; G06K 9/00281 20130101; H04H 60/58 20130101; H04N 21/6582 20130101; H04H 60/33 20130101; G06K 9/00288 20130101; H04N 21/437 20130101; H04N 21/44218 20130101; H04H 60/45 20130101; H04N 21/25883 20130101; H04N 21/4223 20130101

International Class: H04N 21/442 20110101 H04N021/442; H04N 21/422 20110101 H04N021/422; H04N 21/4223 20110101 H04N021/4223; H04N 21/439 20110101 H04N021/439; H04N 21/437 20110101 H04N021/437
Claims
1. A method of quantifying viewer engagement with a video shown on
a display, the method comprising: acquiring, with at least one
camera, images of a viewing area in front of the display while the
video is being shown on the display; acquiring, with a microphone,
audio data representing a soundtrack of the video emitted by a
speaker coupled to the display; determining, with a processor
operably coupled to the at least one camera and the microphone, an
identity of the video based at least in part on the audio data;
estimating, with the processor and based at least in part on the
image data, a first number of people present in the viewing area
while the video is being shown on the display and a second number
of people engaged with the video in the viewing area; and
transmitting, by the processor, the identity of the video, the
first number of people, and the second number of people to a remote
server.
2. The method of claim 1, wherein acquiring the images comprises
acquiring a first image of the viewing area using a visible camera
and acquiring a second image of the viewing area using an infrared
(IR) camera.
3. The method of claim 2, wherein estimating the first number of
people in the viewing area comprises: estimating a first raw number
of people from the first image and a second raw number of people from the second image; and comparing the first raw
number with the second raw number to detect possible error in at
least one of the first raw number or the second raw number.
4. The method of claim 1, wherein acquiring the image data
comprises acquiring images of the viewing area at a frame rate
substantially equal to or greater than 20 frames per second.
5. The method of claim 1, wherein acquiring the audio data
comprises acquiring the audio data at an acquisition rate of about
0.1 Hz.
6. The method of claim 1, wherein determining the identity of the
video is based on audio signal fingerprinting.
7. The method of claim 1, wherein estimating the first number of
people present in the viewing area is based on body skeleton
detection.
8. The method of claim 1, wherein estimating the second number of
people engaged with the video is based on eye
tracking.
9. The method of claim 1, further comprising: quantifying the
viewer engagement of the video based at least in part on the first
number of people and the second number of people at each household in the plurality of households.
10. The method of claim 9, wherein quantifying the viewer
engagement comprises: estimating an attention rate for the video,
the attention rate representing a ratio of the second number of
people engaged with the video to the first number of people in the
viewing area; and for each unique video in the plurality of videos,
determining an attention index based on the attention rates of the
videos in the plurality of videos.
11. The method of claim 10, wherein the video is a unique video in
a plurality of videos and the method further comprises: estimating
a viewer count and a positive duration ratio based on the image
data and on demographic information about each household in the
plurality of households, the viewer count representing the second
number of people engaged with each unique video and the positive
duration ratio representing a ratio of total time spent by people
in the plurality of households watching the unique video to a
duration of the unique video.
12. The method of claim 9, further comprising: determining an
identity of each person present in the viewing area based at least
in part on the image data, wherein quantifying the viewer
engagement of the video comprises quantifying the viewer engagement
for each identified person.
13. The method of claim 9, further comprising: transmitting the
first number of people and the second number of people to a remote
server, wherein quantifying the viewer engagement is carried out at
the remote server.
14. The method of claim 9, further comprising: determining whether
a predetermined video in the plurality of videos is being displayed
on the display based at least in part on the audio data, wherein
quantifying the viewer engagement is based at least in part on
whether the predetermined video is being displayed.
15. The method of claim 1, further comprising: storing the first
number of people and the second number of people in a memory
operably coupled to the processor; and erasing and/or overwriting
the image data.
16. The method of claim 1, further comprising: estimating an
emotion of each person present in the viewing area.
17. The method of claim 1, further comprising: estimating
demographic information for each person in the viewing area from
the image data.
18. The method of claim 17, wherein estimating the demographic
information comprises estimating age, gender, ethnicity group, and
facial expression.
19. A method of quantifying viewer engagement for unique videos in
a plurality of videos, the method comprising: at each household in
a plurality of households, acquiring image data of a viewing area
in front of a display; determining if the display is showing a
video in the plurality of videos; for each unique video in the
plurality of videos, estimating (i) a viewing rate and (ii) a
watching rate based on the image data and on demographic
information about each household in the plurality of households,
the viewing rate representing a ratio of a total number of people
in the viewing areas to a total number of displays showing videos
and the watching rate representing a ratio of a total number of
people in households with displays showing videos to a total number
of people in the plurality of households; and for each unique video
in the plurality of videos, determining a viewability index based
on the viewing rate and the watching rate.
20. The method of claim 19, further comprising: for each unique
video in the plurality of videos, estimating (iii) a viewer count
and (iv) a positive duration ratio based on the image data and on
demographic information about each household in the plurality of
households, the viewer count representing a total number of people
engaged with each unique video and the positive duration ratio
representing a ratio of total time spent by people in the plurality
of households watching the unique video to a duration of the unique
video; and weighting the viewability index based on the viewer
count and the positive duration ratio.
21. The method of claim 20, further comprising: normalizing the
viewability index across the unique videos in the plurality of
videos.
22. The method of claim 19, wherein acquiring the image data
comprises acquiring a first image of the viewing area using an
optical camera and acquiring a second image of the viewing area
using an infrared (IR) camera.
23. The method of claim 19, wherein determining if the display is
showing the video is based at least in part on audio data of the
viewing area via a signal fingerprinting technique.
24. The method of claim 19, further comprising: transmitting the
viewing rate and the watching rate to a remote server, wherein the
viewability index is estimated by the remote server.
25. A system for quantifying viewer engagement with a video playing
on a display, the system comprising: at least one camera, disposed
to image a viewing area in front of the display, to acquire image
data of the viewing area; a microphone, disposed in proximity to
the display, to acquire audio data representing a soundtrack of the
video emitted by a speaker coupled to the display; a memory,
operably coupled to the at least one camera and the microphone, to
store processor-executable instructions; and a processor, operably
coupled to the at least one camera, the microphone, and the memory,
wherein upon execution of the processor-executable instructions,
the processor: determines an identity of the video based at least
in part on the audio data; estimates, based at least in part on the
image data, a first number of people present in the viewing area
while the video is being shown on the display and a second number
of people engaged with the video in the viewing area; and transmits
the identity of the video, the first number of people, and the
second number of people to a remote server.
26. The system of claim 25, wherein the video comprises a
television program provided via a set-top box and the processor is
not connected to the set-top box.
27. The system of claim 25, wherein the at least one camera
comprises a visible camera and an infrared camera and the image
data comprises a first image acquired by the visible camera and a
second image acquired by the infrared camera.
28. The system of claim 27, wherein upon execution of the
processor-executable instructions, the processor further: estimates
a first raw number of people from the first image and a second raw
number of people from the second image; and compares the first raw
number with the second raw number to detect possible error in at
least one of the first raw number or the second raw number.
29. The system of claim 25, wherein upon execution of the
processor-executable instructions, the processor: stores the first
number of people and the second number of people in the memory; and
erases and/or overwrites the image data.
30. The system of claim 25, further comprising: a network
interface, operably coupled to the processor, to transmit the first
number of people and the second number of people to a remote
server.
Description
CROSS-REFERENCE TO RELATED APPLICATION(S)
[0001] This application is a bypass continuation of PCT Application
No. PCT/US2017/012531, filed Jan. 6, 2017, entitled "SYSTEMS AND
METHODS FOR ASSESSING VIEWER ENGAGEMENT," which is hereby
incorporated herein by reference in its entirety and which claims
priority to U.S. Application No. 62/275,699, filed Jan. 6, 2016,
entitled "SYSTEMS AND METHODS FOR ASSESSING VIEWER ENGAGEMENT,"
which is also hereby incorporated herein by reference in its
entirety.
BACKGROUND
[0002] Conventional methods of TV audience measurements include
using people meters and diaries to collect data from the audience.
These methods typically try to recognize humans (potential audience
members) in a room where a TV set is placed. The methods may also
involve capturing a series of images (e.g., TV programs or
commercial advertisements) playing on the TV. Then for each image,
the number of people in the room at the time when a particular
image is displayed can be estimated.
[0003] These methods have several flaws. First, the data collected
by these methods normally only include the number of people in the
room where the TV is placed. The data typically gives no indication
of how often the viewer is actually watching the TV (the
measurement takes place when the TV is on). Second, the collected
data may indicate how often people are tuning to specific channels.
However, it does not gauge their reaction to the programs or
advertisements and therefore provides no indication of the
effectiveness of the programs or advertisements. Third, TV ratings
are not given for specific demographics in the household or in the
community.
SUMMARY
[0004] Embodiments of the present invention include apparatus,
systems, and methods of assessing viewer engagement of a TV
audience. In one example, a system for quantifying viewer
engagement with a video playing on a display includes at least one
camera, disposed to image a viewing area in front of the display,
to acquire image data of the viewing area. A microphone is disposed
in proximity to the display to acquire audio data emitted by a
speaker coupled to the display. The system also includes a memory,
operably coupled to the camera and the microphone, to store
processor-executable instructions and a processor, operably coupled
to the camera, the microphone, and the memory. Upon execution of
the processor-executable instructions, the processor receives the
image data from the camera and the audio data from the microphone
and determines an identity of the video displayed on the display
based at least in part on the audio data. The processor also
estimates, based at least in part on the image data, a first number
of people present in the viewing area and a second number of people
engaged with the video in the viewing area. The processor further
quantifies the viewer engagement of the video based at least in
part on the first number of people and the second number of
people.
[0005] In another example, a method of quantifying viewer
engagement with a video shown on a display includes acquiring, with
at least one camera, images of a viewing area in front of the
display while the video is being shown on the display. The method
also includes acquiring, with a microphone, audio data representing
a soundtrack of the video emitted by a speaker coupled to the
display. The method further includes determining, with a processor
operably coupled to the camera and the microphone, an identity of
the video based at least in part on the audio data and estimating,
with the processor and based at least in part on the image data, a
first number of people present in the viewing area while the video
is being shown on the display and a second number of people engaged
with the video in the viewing area. The method also includes
transmitting, by the processor, the identity of the video, the
first number of people, and the second number of people to a remote
server.
[0006] In yet another example, a system for assessing viewer
engagement with a video playing on a display is disclosed. The
display is coupled to a speaker emitting a soundtrack of the video.
The system includes a visible camera to acquire visible images of a
viewing area in front of the display at a first sample rate while
the video is playing on the display. An infrared camera is included
in the system to acquire infrared images of the viewing area in
front of the display while the video is playing on the display at
the first sample rate. A microphone is disposed in proximity to the
display to acquire samples of the soundtrack emitted by the speaker
while the video is playing on the display at a second sample rate
lower than the first sample rate. The system also includes a
processor, operably coupled to the visible camera, the infrared
camera, and the microphone, to: (i) identify the video based on the
samples of the soundtrack, (ii) estimate, based on the visible
images and the infrared images, a number of people in the viewing
area while the video is playing on the display and a number of
people engaged with the video, and (iii) overwrite, erase, and/or
discard the samples of the soundtrack, the visible images, and the
infrared images. The system also includes a memory, operably
coupled to the processor, to store representations of an identity
of the video, the number of people in the viewing area while the
video is playing on the display, and the number of people engaged
with the video. The system further includes a network interface,
operably coupled to the processor, to transmit the representations
to a server.
[0007] In yet another example, a method of quantifying viewer
engagement for unique videos in a plurality of videos includes at
each household in a plurality of households, acquiring image data
of a viewing area in front of a display and determining if the
display is showing a video in the plurality of videos. The method
also includes, for each unique video in the plurality of videos,
estimating (i) a viewing rate and (ii) a watching rate based on the
image data and on demographic information about each household in
the plurality of households. The viewing rate represents a ratio of
a total number of people in the viewing areas to a total number of
displays showing videos, and the watching rate represents a ratio of a total number of people in households with displays showing
videos to a total number of people in the plurality of households.
The method also includes, for each unique video in the plurality of
videos, determining a viewability index based on the viewing rate
and the watching rate.
[0008] It should be appreciated that all combinations of the
foregoing concepts and additional concepts discussed in greater
detail below (provided such concepts are not mutually inconsistent)
are contemplated as being part of the inventive subject matter
disclosed herein. In particular, all combinations of claimed
subject matter appearing at the end of this disclosure are
contemplated as being part of the inventive subject matter
disclosed herein. It should also be appreciated that terminology
explicitly employed herein that also may appear in any disclosure
incorporated by reference should be accorded a meaning most
consistent with the particular concepts disclosed herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The skilled artisan will understand that the drawings
primarily are for illustrative purposes and are not intended to
limit the scope of the inventive subject matter described herein.
The drawings are not necessarily to scale; in some instances,
various aspects of the inventive subject matter disclosed herein
may be shown exaggerated or enlarged in the drawings to facilitate
an understanding of different features. In the drawings, like
reference characters generally refer to like features (e.g.,
functionally similar and/or structurally similar elements).
[0010] FIG. 1 shows a schematic of a system for assessing viewer
engagement of TV audiences.
[0011] FIG. 2A illustrates a method of quantifying user engagement
using the system shown in FIG. 1.
[0012] FIG. 2B illustrates a method of training a computer vision
model for quantifying user engagement.
[0013] FIG. 3A illustrates methods of viewer engagement including
facial and eyeball tracking, facial recognition, and sentiment analysis.
[0014] FIG. 3B illustrates the concepts of viewability index and
attention index.
[0015] FIG. 4A illustrates a process for assessing viewer
engagement including estimating viewability index.
[0016] FIG. 4B illustrates a process for assessing viewer
engagement including estimating attention index.
[0017] FIG. 5 illustrates a process for assessing viewer engagement
including determining the orientation of the face of each person in
a viewing area.
[0018] FIG. 6 illustrates a process for detecting skeleton, face,
identification, emotion, and engagement.
[0019] FIG. 7 shows a schematic view of a data acquisition
architecture in exemplary methods of viewer engagement
assessment.
[0020] FIGS. 8A-8G show commercial message (CM) curves acquired
using the architecture shown in FIG. 7.
[0021] FIG. 9 shows the ratios of the CM curves for each of the
sampled TV stations.
[0022] FIG. 10 shows a classification model through a decision tree
with the determination results of the decision tree shown in TABLE
5.
[0023] FIG. 11 illustrates the viewability rate with respect to the
length of the CM.
[0024] FIG. 12 shows the correlation between elapsed time since the
start of the program and the viewability rate.
[0025] FIG. 13 illustrates communication of viewer engagement data
acquired using the technology illustrated in FIGS. 1-12.
[0026] FIG. 14 illustrates dissemination and use of viewer
engagement data acquired using the technology illustrated in FIGS.
1-12.
[0027] FIG. 15 illustrates big data analysis and visualization of
viewer engagement data acquired using the technology illustrated in
FIGS. 1-12.
[0028] FIG. 16 shows a model for acquiring additional data to
complement viewer engagement data acquired using the technology
illustrated in FIGS. 1-12.
DETAILED DESCRIPTION
[0029] To address shortcomings in conventional methods of TV
audience measurements, systems and methods disclosed herein acquire
image data of a viewing area in front of a display (e.g., a TV,
computer, or tablet) that is playing a video (e.g., a TV show,
movie, web show, advertisement, or other content). From the image data, an example system determines how many people are in the viewing area and which of those people are actually watching the video. The system also samples the soundtrack of the video with a microphone and identifies the video using the samples of the soundtrack. The system stores (and/or persists) information about
the video, the number of people in the viewing area, and the number
of people watching the video in a local memory and transmits the
information to a remote server via an internet or other network
connection.
[0030] Unlike previous systems for measuring viewer engagement with
videos, which identify videos based on digital watermarks embedded
in the videos themselves, examples of the inventive system identify
videos based on the videos' soundtracks. As a result, the inventive
systems do not have to be connected to the display, the set-top
box, or the cable connection at the viewer's premises. This makes
them easier to install and remove (and thus more likely to be
adopted). It also makes them less likely to malfunction or to
record "false positive" impressions caused by leaving the set-top
box on while the display is off.
[0031] An inventive system also processes image data locally, i.e.,
on the viewer's premises, to determine the numbers of people in the
viewing area and engaged with the video. It can also process audio
data locally to identify the video being displayed while someone is
in the viewing area. It stores this data locally, i.e., in a memory
in or coupled to a local device on the viewer's premises. The
processed image and audio data consumes far less memory than the
raw image and audio data, so this local memory can store
information covering longer time periods. In other words, an
inventive device uses memory more efficiently because it stores
processed data instead of raw data.
[0032] The local device processes the raw image data, which may
include both visual and depth information, acquired from the
viewing area to assess viewer engagement. The local device can use
artificial intelligence (AI) technology and machine learning
techniques to analyze a viewer's body gestures, movements, and
facial orientation. The local device can also recognize individual
faces in the video audience and determine each viewer's emotions
from the image data. In this processing, the individual's images
are not transmitted outside of the individual's premises. The
recognition can be performed on the local device on premises. Each
individual in the household can receive a unique identifier during
the on-boarding process for that household. When a match is made
during the recognition process, this identifier is assigned to the
match, and this identifier can then be transmitted to remote
servers. In addition, the processing is carried out over the
streaming video or audio data (including images). In other words,
the video or audio data is not persisted in local memory.
[0033] The local device processes the raw audio data by matching or
comparing the raw audio data with samples in an audio database to
identify the specific video (e.g., TV channel, program, or
advertisement) that is being viewed. Alternatively, or
additionally, the local device can submit a query based on the
audio data to a third-party application programming interface
(API), which identifies and returns an identification of the
content to which the audio belongs. In some cases, the database or
API may return multiple possible matches, and the remote server can
select the best match using information about the TV schedule,
subsequent audio samples, or data collected from other sources,
including, but not limited to, the set-top box, cable/internet
connection, or the content provider itself.
[0034] In some implementations, the local device does not store the
raw image or audio data for later retrieval. Instead, the local
device writes the raw image and audio data to one or more buffers
that store the raw image and audio data for processing, then
overwrites or erases the buffers after the raw image and audio data
has been processed. Put differently, the local device holds the raw
image and audio data merely transiently during processing. As used
herein, "holding" of raw images and audio data in local devices
refers to temporary storing of these data for a short time duration
(e.g., less than 100 milliseconds, less than 80 milliseconds, less
than 60 milliseconds, less than 50 milliseconds, or less than 40
milliseconds, including any values and sub ranges in between).
Overwriting or erasing the raw image and audio data offers a number
of advantages, including reducing the amount of memory required by
the local device. It also enables easier compliance with data
privacy laws by eliminating image or audio data that could be used
to identify people, including children, in the viewing area or in
range of the microphone.
[0035] Processing and storing image and audio data locally offers
another technical advantage--it reduces the bandwidth required to
convey information about viewing habits from the local device to
the remote server. Compared to raw image and audio data, processed
image and audio data consumes less memory and therefore requires
less bandwidth for transmission. The processed image and audio data
also fills a given memory more slowly than raw image and audio data
and therefore can be transmitted to the remote server less
frequently. A local device may take advantage of this flexibility
by scheduling burst transmissions during times when network
bandwidth usage is relatively low, e.g., late night or early
morning. Transmitting processed image and audio data, which doesn't
necessarily include information identifying people, including
children, in the viewing area or in range of the microphone, also
ensures or increases the ease of compliance with data privacy
laws.
[0036] The remote server collects processed image and audio data
from local devices in different households. It processes this data
to assess viewer engagement across an entire community by
statistically analyzing the viewer engagement information collected
from the different households in the community. For example, the
server can quantify the ratio of the viewer engagement from the
highly granular data collected from each household to the total
length of the programming that was detected.
[0037] The statistical analysis can further take into account
demographic information (e.g., age, gender, household income,
ethnicity, etc.) of the people watching the videos and/or the
people in the household. Based on all this information, the server
may calculate various indices, such as a viewability index and an
attention index (both defined below), to quantify viewer
engagement. These viewer engagement indices may be based on any and all information provided by the local devices, including information about the viewers' body gesture(s), movement(s), and facial orientation(s), as well as the video information. These quantitative indices can indicate, among other things, (i) who is really watching the display, (ii) how often an audience member looks at the display, and (iii) the audience's reaction to the programs and advertisements on the display.
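As a rough illustration of how such metrics might be computed on the remote server, the following Python sketch aggregates per-household observations into a per-video attention rate and normalizes the rates into an attention index. The record format, the field names, and the normalization against the mean rate are illustrative assumptions, not the exact formulas used by the system.

```python
from collections import defaultdict

def attention_metrics(records):
    """Compute a per-video attention rate and a normalized attention index.

    `records` is an iterable of dicts such as
    {"video_id": "ad-123", "present": 4, "engaged": 1}, one per household
    observation. The field names and the mean-based normalization are
    illustrative assumptions.
    """
    present = defaultdict(int)
    engaged = defaultdict(int)
    for r in records:
        present[r["video_id"]] += r["present"]
        engaged[r["video_id"]] += r["engaged"]

    # Attention rate: engaged viewers divided by people in the viewing area.
    rates = {v: engaged[v] / present[v] for v in present if present[v]}

    # Attention index: each video's rate relative to the average rate, so
    # that 1.0 means "average attention" across the measured videos.
    mean_rate = sum(rates.values()) / len(rates) if rates else 0.0
    index = {v: (r / mean_rate if mean_rate else 0.0) for v, r in rates.items()}
    return rates, index


if __name__ == "__main__":
    sample = [
        {"video_id": "show-A", "present": 3, "engaged": 2},
        {"video_id": "show-A", "present": 2, "engaged": 1},
        {"video_id": "ad-B", "present": 4, "engaged": 1},
    ]
    print(attention_metrics(sample))
```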
[0038] The quantitative indices can then be transferred by the
remote server to a central storage (e.g., a cloud-based database)
where third parties, including but not limited to TV advertising
agencies and TV networks, can access the indices and possibly other
data as well. Alternatively, the raw data collected by the sensors
can be transferred to a central storage on the cloud where it is
analyzed by methods described herein and made available to
interested third parties. A third party may optionally access the
raw data through the system. The raw data in this example includes
data collected after processing of the video and audio streams
(instead of the video and audio streams themselves). Generally speaking, the raw data can include unique identifiers of the
viewers, the attentiveness of the viewer(s), and the programming
being viewed by the viewer(s), on a sub second basis (e.g., every
half second or less). More quantitative indices (see more details
below) can be computed on the remote server using this raw
data.
[0039] This acquired and analyzed data can allow a collection
entity, such as a content provider or advertising agency, to
accurately evaluate the impact of videos, including unprecedented
measurements of individual demographics, which can be valuable to
the advertisers. For example, advertising agencies can use the data
to determine which commercial slots would be a best fit for their
targeted audience. With demographic information, the data can be
matched to the type of audience and can effectively lead to
purchasing behavior, thereby increasing return on investment (ROI)
in programming. TV networks can also benefit from the data as they
can glean more accurate ratings of their TV programs, audience
type, reactions, and predictive ad slot value. This further allows
them to improve their programs to better fit the type of audience
and eliminate less popular shows, in addition to determining which
ad slots may have the highest value for a particular target
demographic.
[0040] The acquired and analyzed data also allows various business
models. For example, a collection entity can provide
performance-based TV ratings data and raw data for analysis, which
is collected from a motion-sensing device put into selected-user
households that represent a national and/or local demographic, to
TV networks, advertising agencies, and other interested third
parties and indirectly to advertisers who obtain the data from
advertising agencies.
[0041] Systems of Assessing Viewer Engagement
[0042] FIG. 1 illustrates a schematic view of a system 100 for
assessing viewer engagement in a household, a sports bar, or other
space with a display. The system 100 includes a local device 105
disposed in each household to collect viewer engagement data and a
remote server 170, such as a cloud storage and computing device
that includes a memory to store data and a processor (also called a
remote processor) to analyze data. The local device 105 is
communicatively coupled to the remote server 170 via a network
connection 172, such as an internet connection. For instance, the
local device 105 may include a network interface 165, such as a
WiFi antenna or Ethernet port, for connecting to a household local
area network (LAN). This LAN is in turn connected to a wide area
network (WAN), e.g., via a cable or fiber optic connection provided
by an Internet Service Provider (ISP).
[0043] The local device 105 in FIG. 1 includes an infrared (IR)
emitter 110 to illuminate a viewing area 101 in front of a display
11, such as a television (TV), computer screen, tablet, or other
device, with IR light. This IR light may be structured or modulated
to produce an illumination pattern that scatters or reflects off
objects (including the human audience) in the viewing area 101. The
local device 105 also includes an IR sensor 120 that detects the IR
light reflected or scattered by these objects. A processor 150
(also called a local processor 150) coupled to the IR emitter 110
and IR sensor 120 uses information about the illumination pattern
and the detected IR light to produce one or more IR depth images or
IR depth maps of the viewing area 101. More specifically, the
processor 150 converts information derived from the reflected beams
into depth information measuring the distance between a viewer and
the sensor 120. The processor 150 uses these IR depth images to
determine how many people are in the viewing area and which of
those people are watching the display. The processor 150 may also
derive information from the IR depth images about the identities of
the people watching the display, possibly by recognizing their
faces or gestures or determining their demographics (e.g., age,
gender, etc.).
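The following Python sketch suggests one naive way a depth map could yield a people count; the system described here relies on trained skeleton and face detection rather than this kind of blob counting, and the distance and size thresholds below are made-up values.

```python
import numpy as np
from scipy import ndimage

def count_people_in_depth_map(depth_m, min_dist=0.5, max_dist=4.0,
                              min_pixels=2000):
    """Very rough person count from an IR depth map (meters per pixel).

    Segments foreground blobs within a plausible viewing distance and
    counts those large enough to be a person. All thresholds are
    illustrative assumptions.
    """
    foreground = (depth_m > min_dist) & (depth_m < max_dist)
    labels, n_blobs = ndimage.label(foreground)
    sizes = ndimage.sum(foreground, labels, range(1, n_blobs + 1))
    return int(np.sum(np.asarray(sizes) >= min_pixels))
```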
[0044] The local device 105 further includes an RGB sensor 130
(also referred to as a visible camera) that captures color images
of the viewing area 101. The processor 150 is also coupled to the
RGB sensor and may use the color images, alone or in combination
with the IR depth images, to estimate the number of people in
the viewing area, the number of people engaged with the display,
and information about the people in the viewing area. The color
images can also be used for facial recognition. In some cases, the
processor 150 uses both the color images and the IR depth images to
improve the fidelity of the estimates of the numbers of people in
the viewing area and engaged with the video.
[0045] The local device 105 also includes one or more microphones
140 positioned to detect sound emitted by a speaker 13 coupled to
the display 11. In operation, the speaker 13 plays the soundtrack
of the video shown on the display 11. And the microphone 140
captures audio samples of the soundtrack played by the speaker 13.
The processor 150, which is coupled to the microphone 140, uses
these audio samples to create an audio fingerprint of the video
(soundtrack), which it compares with other audio fingerprints in a
proprietary or third-party database to identify the video being
shown on the display 11.
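A minimal Python sketch of this match-against-a-database idea follows; the toy fingerprint (dominant spectral band per one-second window) and the scoring rule are illustrative stand-ins for the far more robust fingerprinting a production system would use.

```python
import numpy as np

def fingerprint(samples, rate, n_bands=32):
    """Toy audio fingerprint: the dominant FFT band per one-second window."""
    window = rate
    hashes = []
    for start in range(0, len(samples) - window, window):
        spectrum = np.abs(np.fft.rfft(samples[start:start + window]))
        bands = np.array_split(spectrum, n_bands)
        hashes.append(int(np.argmax([b.sum() for b in bands])))
    return tuple(hashes)

def identify(samples, rate, database):
    """Return the video ID whose stored fingerprint best matches the sample.

    `database` maps video IDs to fingerprints computed the same way.
    """
    fp = fingerprint(samples, rate)

    def score(stored):
        return sum(a == b for a, b in zip(fp, stored))

    return max(database, key=lambda vid: score(database[vid]), default=None)
```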
[0046] The system 100 can further include a Bluetooth receiver 180
matched with a Bluetooth transmitter 185. In some cases, the
Bluetooth transmitter 185 can be included in a wristband or a
wristwatch worn by the viewer. In operation, the Bluetooth
transmitter 185 transmits a low power Bluetooth beacon, which is
received by the Bluetooth receiver 180. The processor 150 can then
gauge the viewer's distance from the display 11 based on the
received Bluetooth beacon. In addition, each Bluetooth transmitter
185 can have a unique ID that can be recognized by the processor
150. The transmitter ID can be further associated with a unique
viewer (e.g., each viewer in the household has his or her own
transmitter). In this manner, the identity of the viewer can also
be determined.
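The description does not specify how distance is derived from the beacon, but a common approach is the log-distance path-loss model, sketched below in Python with illustrative calibration constants (the measured power at one meter and the path-loss exponent are assumptions).

```python
def distance_from_rssi(rssi_dbm, tx_power_dbm=-59.0, path_loss_exponent=2.0):
    """Estimate viewer distance (meters) from a Bluetooth beacon's RSSI.

    Uses the common log-distance path-loss model; the calibration
    constants here are illustrative, not values from the disclosure.
    """
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exponent))

# Example: a beacon heard at -71 dBm with these constants is roughly 4 m away.
print(round(distance_from_rssi(-71.0), 1))
```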
[0047] In some cases, the system 100 can include more than one
Bluetooth receiver. These receivers can be disposed at different
locations such that each receiver can receive different Bluetooth
signal strength from the transmitter 185. This configuration can
allow the processor 150 to estimate not only the distance of the
viewer from the display 11 but also the relative location of the
viewer (e.g., to the left or right of the display 11).
[0048] The system 100 may include other motion-sensing devices,
such as a 3-axis accelerometer to detect position and motion. The
motion-sensing device can be connected, for example, via a USB
cable with a data-analyzing and processing device such as a desktop
machine.
[0049] FIG. 1 shows the data collection components--here, the IR
emitter 110, IR sensor 120, RGB sensor 130, and microphone 140--as
part of the local device 105 (e.g., within the same housing). In
other embodiments, one or more of these components may be
implemented as separate devices that are coupled to the processor
150 by one or more wired connections, such as USB connections, RS-232 connections, Ethernet connections, fiber connections, or one or
more wireless connections, such as WiFi connections, Bluetooth
connections, other RF connections, or infrared connections. For
instance, the IR emitter 110 and IR sensor 120 may be (in) a
commercially available device, such as a Microsoft Kinect, that is
connected to the processor 150. Likewise, the microphone 140 may be
implemented as an array of microphones that are placed around the
viewing area or close to the speaker 13. A microphone array may be
better able to extract voice input from ambient noises. The local
device 105 may include or be coupled to other sensors as well.
[0050] The processor 150 in the system 100 is employed to process
the raw data acquired by the sensors, including the IR emitter 110,
the IR sensor 120, the RGB sensor 130, and the microphone 140. The
processing can be carried out upon execution of
processor-executable instructions that are stored in a memory 160
coupled to the processor 150. In one example, a user can manually
store the instructions in the memory 160 by downloading the
instructions from the remote server 170. In another example, the
local device 105 can be configured to (routinely) check whether
there are updated instructions available for downloading from the
remote server 170. If so, the local device 105 can automatically
download the update via the network connection 172 and the network
interface 165. In yet another example, the remote server 170 can be
configured to send a notification to the local device 105 when an
update or a set of new instructions is ready for downloading. Upon
receiving the notification, a user can decide whether to download
and/or install the update. In yet another example, the remote
server 170 can be configured to send update notification to another
user device, such as a smartphone. Upon receiving the notification,
the user can decide whether to download and/or install the
update.
[0051] The memory 160 in the local device 105 also stores the
processed data (e.g., the estimate of the number of people in the
viewing area, the estimate of the number of people engaged with the
display, and the identification of the video, as well as any
demographic information or indices derived from the raw image and
audio data). Once the memory 160 has accumulated enough processed
data, the processor 150 transmits the processed data to the remote
server 170 via the network interface 165 and the network connection
172 for aggregation, further processing, and reporting. The local
memory 160 also temporarily holds the image and audio data during
the local processing. In some cases, this processing is completed
in less than a quarter of a second.
[0052] Collecting and Processing Image and Audio Data with a Local
Device
[0053] FIG. 2A illustrates a process 200 for collecting and
processing image and audio data acquired with a system like the
system 100 shown in FIG. 1. As described above, the system can
include a visible sensor, an IR sensor, or both to acquire images of the
viewing area in front of the display (202). In one example, the RGB
sensor 130 and the IR sensor 120 operate independently from each
other; the sensors acquire images in an unsynchronized fashion. In
another example, the image acquisition by the RGB sensor 130 and
the IR sensor 120 is substantially synchronized. Each time the RGB
sensor 130 acquires a visible image, the IR sensor 120 acquires an
IR image, e.g., at the same time or in an interleaved fashion.
[0054] A local processor (e.g., processor 150) detects the number
of people in the images of the viewing area (204) and also
determines which of those people are engaged with the display
(206). For instance, the local processor may use the techniques
described below, including skeleton detection techniques, facial
recognition techniques, and eye tracking techniques as known in the
art of computer vision/image processing. In some cases, the local
processor 150 can determine additional indices related to the
duration of each viewer's presence in the viewing area, the
duration of each viewer's engagement with the display, and the
identity of the video being displayed (208), which can be derived
from audio data as described below (222).
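A schematic Python sketch of this per-frame flow is shown below; detect_people and is_engaged are hypothetical stand-ins for the skeleton-detection and eye-tracking steps, and the frame format is assumed.

```python
from dataclasses import dataclass

@dataclass
class FrameResult:
    timestamp: float
    people_present: int
    people_engaged: int

def detect_people(frame):
    """Stand-in for a skeleton/face detector; returns detected faces."""
    return frame.get("faces", [])          # hypothetical frame format

def is_engaged(face):
    """Stand-in for an eye-tracking / head-pose engagement check."""
    return face.get("gaze_on_screen", False)

def process_frame(frame, timestamp):
    """One step of the local per-frame pipeline: count people in the
    viewing area (204) and count those engaged with the display (206)."""
    faces = detect_people(frame)
    engaged = sum(1 for f in faces if is_engaged(f))
    return FrameResult(timestamp, len(faces), engaged)
```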
[0055] The local processor can further identify each person
detected in the viewing area 101, on a demographic level (e.g., man
aged 25-30, girl aged 12-15) (210). If the local processor 150 has
access to information about the household where the local device
105 is placed, e.g., via the local memory 160 or the remote server
170, it may use this demographic information to provide more
confident demographic information estimates of each person detected
in the viewing area 101. The local processor may even identify the
particular people in the household who are in the viewing area.
[0056] The local processor 150 can also estimate the mood or
emotion of each person detected in the viewing area 101 (212). The
emotions that can be determined by the processor 150 can include,
for example, happy, sad, or neutral. The classification of a
viewer's emotion, when watching a video on the display 11, can be
used to gauge the viewer's reaction to the video, thereby
facilitating targeted delivery of advertisement.
[0057] To estimate the mood or emotion of each person, the local
processor 150 can capture the visual information (e.g., from the
images of the viewing area 101) in real-time from both RGB and IR
channels. The visual information can be further processed to
extract patterns and features that can be signatures of different
mood or emotion states. The features extracted from both channels
can be fused as a unified feature. A classifier can be trained to
take such features as input. An estimate of emotion/mood can then be made based on the classifier's response to certain patterns at each point in time.
[0058] In some cases, the estimation of mood or emotion can be
achieved by the following method. The method includes collecting
training images with people displaying various emotions, such as smiling and frowning, among others. Features representative of each
emotion are extracted (e.g., by a processor) from these training
images. The features and the images are then used to train a
classifier to correlate each feature to a corresponding emotion. In
this manner, the classifier can assign these features to the
various emotions. The method also includes deploying the classifier
on the local device so as to recognize the viewers' emotions in real
time.
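The following Python sketch illustrates this train-then-deploy pattern with scikit-learn; the random feature vectors stand in for the fused RGB/IR features, and the classifier choice is an assumption rather than the one used by the system.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: assume fused RGB+IR features were already extracted per
# face. Labels follow the emotions named in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))                 # hypothetical fused features
y = rng.choice(["happy", "sad", "neutral"], size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Deployed on the local device, the trained classifier labels each detected
# face in real time:
print(clf.predict(X_test[:3]))
```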
[0059] In cases where the system collects visible and IR images in
a synchronized fashion, the visible and IR cameras can collect
images for training a computer vision model used by the processor
to detect people (204), count engaged viewers (206), identify
viewers demographically (210), and estimate mood (212). The
training can be employed to establish a "ground truth." Having
collected image data from both IR and RGB sensors almost in
parallel, a human can annotate the people detected in each image.
This manual data can be fed to a training algorithm, giving rise to
two separate models, one trained on visible RGB spectrum, and the
other on the IR spectrum. The detection rate of each model against
the "ground truth" is then compared to select the model that
performs better. More details of this training are described below
with reference to FIG. 2B.
[0060] Synchronization of the two cameras (e.g., sensors 120 and
130 in FIG. 1) can also allow the local processor to double-check
the image processing. For example, the processor 150 can compare
the number of people identified in each image or remove errors
visible in one image and less visible or invisible in the other
image. If the results are in agreement with each other, the
processor 150 can record the results. If not, the processor 150 can
then detect possible errors in at least one of the images.
Alternatively, the processor 150 can generate an alert for a human
to intervene. The processor 150 can also generate a flag associated
with the data estimated from these two images, indicating that this data might be less reliable. In subsequent analysis, this data may not be used at all if images taken shortly before or after the pair of images at issue can provide reliable people recognition.
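A minimal Python sketch of such a cross-check might look like the following; the tolerance and the conservative fallback count are assumptions.

```python
def reconcile_counts(rgb_count, ir_count, tolerance=0):
    """Cross-check people counts from synchronized RGB and IR frames.

    Returns (count, flagged): the agreed count when the two sensors match,
    otherwise a conservative estimate flagged as possibly unreliable so a
    later stage can discard it or fall back to neighboring frames.
    """
    if abs(rgb_count - ir_count) <= tolerance:
        return rgb_count, False
    return min(rgb_count, ir_count), True
```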
[0061] In one example, the local device 105 uses the IR and visible sensors 120 and 130 all the time to take image data. In another
example, the local device 105 can use only one of the sensors 120
or 130 to take image data. In yet another example, the local device
105 can use one sensor as a default sensor and use the other sensor
as a backup sensor. For example, the local device 105 can use the
RGB sensor 130 most of the time for image taking. However, if the
processor 150 has trouble satisfactorily analyzing the visible
images (e.g., the analysis is not as reliable as desired), the
processor 150 can turn on the IR sensor 120 as backup (or vice
versa). This may occur, for example, when the ambient light level
in the viewing area is low.
[0062] The local processor may also adjust the image acquisition
rate for the visible sensor, the IR sensor, or both based on the
number of people in viewing area, their positions in the viewing
area, and the identity of the video on the display (214).
Generally, the image acquisition rate for either or both sensors can be
substantially equal to or greater than about 15 frames per second
(fps) (e.g., about 15 fps, about 20 fps, about 30 fps, about 50 fps
or even greater, including any values and sub ranges in between).
At this image acquisition rate, the sensor can detect eye movements
well enough for the local processor to assess viewer engagement
(206).
[0063] The local processor may increase or decrease the image
acquisition rate based on the number of people in the viewing area
101. For example, if the processor determines that nobody is in the
viewing area 101, it may reduce the image acquisition rate to
reduce power and memory consumption. Likewise, if the processor
determines that the viewer(s) are not engaged with the video (e.g.,
because they appear to be sleeping), it may reduce the image
acquisition rate to conserve power, memory, or both. Conversely,
the processor may increase the image acquisition rate (e.g., to
greater than 15 fps) if the viewers appear to be shifting their
attention rapidly, if they are watching a fast-paced video (e.g., a
football game or action movie), if they are changing channels
rapidly (e.g., channel surfing), or if the content is changing
relatively rapidly (e.g., during a series of advertisements).
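These heuristics could be reduced to a simple rate-selection rule, sketched below in Python; the specific frame-rate values are illustrative, and only the direction of each adjustment comes from the description.

```python
def choose_frame_rate(people_present, people_engaged, fast_paced_content,
                      base_fps=20, idle_fps=5, max_fps=30):
    """Pick an image acquisition rate from the heuristics in the text.

    Drop the rate when the room is empty or viewers are disengaged, and
    raise it for fast-changing content; the numeric values are assumptions.
    """
    if people_present == 0 or people_engaged == 0:
        return idle_fps
    if fast_paced_content:
        return max_fps
    return base_fps
```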
[0064] If the system includes both IR and visible image sensors,
the local processor may also vary the image acquisition based on
the lighting conditions or relative image quality. For instance, in
low light conditions, the local processor may acquire IR images at
a higher rate than visible images. Similarly, if the local
processor gets better results processing visible images than IR
images, it may acquire visible images at a higher rate than IR
images (or vice versa if the opposite is true).
[0065] The system also records samples of the video's soundtrack
with the microphone 140 (220). Generally, the audio data
acquisition rate or audio sampling rate is lower than the image acquisition rate. For instance, the microphone may acquire audio samples at a rate of once every 30 seconds. In each acquisition,
the microphone 140 records an audio sample having a finite duration
so as to allow identification of the video associated with the
audio sample. The duration of the audio sample can be substantially
equal to or greater than 5 seconds (e.g., about 5 seconds, about 6
seconds, about 8 seconds, about 10 seconds, about 20 seconds, or
about 30 seconds, including any values and sub ranges in
between).
[0066] The local processor uses the audio samples recorded by the
microphone 140 to identify the video being played on the display
(222). For example, the processor 150 can create a fingerprint of
the audio data and use the fingerprint to query a third-party
application programming interface (API), which responds to the
query with an identification of the video associated with the audio
data. In another example, the processor 150 can compare the
fingerprint against a local table or memory to determine the
identity of the video.
[0067] As mentioned above, using samples of the video soundtrack to
identify the video offers several advantages over the digital
watermarks used by conventional TV survey devices to identify
videos. It does not require inserting digital watermarks into the
video, which eliminates the need to coordinate with content
producers and providers. This simplifies content production and
distribution and makes it possible to identify and assess a wider
range of video content, including producers and distributors who
cannot or will not provide digital watermarks. And it eliminates
the need to connect the local device to the cable or set-top
box.
[0068] In addition, using audio data instead of digital watermarks
reduces the risk of "false positives," or instances where the
system detects people in the viewing area and identifies a video
that is not actually being watched even when the TV is off. This
can happen with a conventional system hooked to a set-top box if the
household members leave their set-top box on even when their TV is
off.
[0069] In some examples, the local processor adjusts the audio
sampling rate (224), e.g., based on the identity of the video, the
number of people in the viewing area, the number of people engaged
with the video, etc. For instance, if the local processor cannot
identify the video from a single fingerprint (e.g., because the
video soundtrack includes a popular song that appears in many
different video soundtracks), the local processor and microphone
may acquire samples at a higher rate or of longer duration to
resolve any ambiguity. The processor may also
decrease the audio sampling rate if nobody is in the viewing area
101 or the viewer(s) are not engaged with the video (e.g., because
they appear to be sleeping) to conserve power, memory, or both.
Conversely, the processor may increase the audio sampling rate if
the viewers are changing channels rapidly (e.g., channel surfing)
or if the content is changing relatively rapidly (e.g., during a
series of advertisements).
[0070] Depending on the implementation, the microphone may record
audio samples at regular intervals (i.e., periodically) or at
irregular intervals (e.g., aperiodically or with a time-varying
period). For instance, the microphone may acquire audio data
throughout the day at a constant rate (e.g., about two samples per
minute). In other cases, the microphone may operate at one sampling
rate when the TV is on or likely to be on (e.g., early evening) and
at another, lower sampling rate when the TV is off or likely to be
off (e.g., early morning, mid-day). If the local processor detects
that the TV has been turned on (off) from the audio samples, it may
increase (decrease) the sample rate accordingly. It may also
trigger the image sensors to start (stop) imaging the viewing area
in response to detecting that the TV has been turned on (off) from
the audio samples.
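A simple Python sketch of such a schedule follows; the two intervals and the "likely on" evening window are assumptions.

```python
def choose_audio_interval(tv_detected_on, hour_of_day,
                          active_interval_s=30, idle_interval_s=300):
    """Pick the gap between audio samples per the scheduling heuristics.

    Sample more often when the TV is on or likely to be on (e.g., early
    evening) and less often otherwise; the exact intervals and the
    18:00-23:00 window are illustrative.
    """
    likely_on = 18 <= hour_of_day <= 23
    if tv_detected_on or likely_on:
        return active_interval_s
    return idle_interval_s
```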
[0071] As or once the raw image and audio data has been processed,
the local processor overwrites the raw image and audio data or
erases the raw image and audio data from memory (230). In other
words, each image is held in the memory 160, while the processor
150 detects and identifies humans and gauges their engagement and
expressions. The detection, identification, and engagement data is
collected per frame, and this information is persisted and
eventually uploaded to the backend server 170. Similarly, the audio
data is also held in the memory 160, while the third-party API is
processing the audio fingerprint and returning the identity of the
associated video. The identity is stored and/or uploaded to the
backend server 170 as described below.
[0072] By overwriting or erasing (or otherwise discarding) the raw
image and audio data, the local processor reduces demands on the
memory and reduces or eliminates the ability to identify the
individuals in the viewing area. This maintains the individuals'
privacy by exposing less information to potential attempts to hack
the system. It also eliminates the possibility of transmitting
images of the individuals to third parties. This is especially
beneficial for preserving the privacy of children in the viewing
area in accordance with the Children's Online Privacy Protection
Act.
[0073] In some cases, the local processor actively erases the raw
image and audio data from the memory. In other cases, the local
processor stores the raw image and audio data in one or more buffers in the memory that are sized not to store more than a predetermined amount of raw image and audio data (e.g., one image or one audio sample). The local processor analyzes the raw image and audio data in the time period between samples so that the next image or audio sample
overwrites the buffer.
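The buffer-overwrite behavior might be sketched in Python as follows; the class and method names are hypothetical.

```python
class TransientBuffer:
    """Holds at most one raw sample; writing a new sample discards the old.

    Mirrors the behavior described above: raw image/audio data exists only
    between acquisition and processing and is never persisted.
    """

    def __init__(self):
        self._sample = None

    def write(self, sample):
        self._sample = sample          # overwrites whatever was there

    def process(self, fn):
        if self._sample is None:
            return None
        result = fn(self._sample)      # derive compact, processed data
        self._sample = None            # actively discard the raw sample
        return result
```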
[0074] The local processor 150 also stores the processed data into
the memory 160. The processed data may be stored in a relatively
compact format, such as comma-separated values (CSV), to reduce
memory requirements. The data included in the CSV or other file may
indicate, for example, whether anyone is present in each image, the
number of people in the viewing area 101 in each image, the number
of people who are actually watching the display 11 in the viewing
area 101, the classification of each viewer's emotion, and the
identity of each viewer. The processed data may also include
indications about the local device's operational state, including
the IR image acquisition rate, visible image acquisition rate,
audio sampling rate, current software/firmware update, etc.
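A Python sketch of appending one such record to a local CSV file follows; the column names are hypothetical, chosen only to mirror the fields listed above.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column set; the description only says the file records
# presence, counts, engagement, emotion, viewer identity, and device state.
FIELDS = ["timestamp", "anyone_present", "people_present", "people_engaged",
          "viewer_ids", "emotions", "video_id", "ir_fps", "rgb_fps",
          "audio_interval_s"]

def append_record(path, record):
    """Append one processed-data row to the local CSV file."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(record)

append_record("engagement.csv", {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "anyone_present": True, "people_present": 2, "people_engaged": 1,
    "viewer_ids": "7;12", "emotions": "neutral;happy",
    "video_id": "unknown", "ir_fps": 20, "rgb_fps": 20,
    "audio_interval_s": 30,
})
```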
[0075] The local processor transmits the processed data to the
remote server (e.g., via a network interface) for storage or for
further processing (236). Because the processed data is in a
relatively compact format, the upload bandwidth is much lower than
it would be for raw image and audio data. And because the
transmitted data does not include images of the viewing area or
audio samples that could include the viewers' voices, there is less
risk of compromising the viewers' privacy. In addition, the audio
and image portions of the processed data are more likely to be and
remain synchronized because they are processed locally than if the
raw image and audio data were transmitted to and processed by a
remote server.
[0076] In some cases, the local processor may transmit the
processed data to the remote server as it is processed. In other cases,
the local processor may identify transmission windows (234), e.g.,
based on the available upstream bandwidth, the amount of data, etc.
These transmission windows may be predetermined (e.g., 2 am ET),
set by a household member during local device installation, set by
the remote server (e.g., via a software or firmware update), or
determined by the local processor based on bandwidth
measurements.
[0077] FIG. 2B illustrates a method of training a computer vision
model for quantifying viewer engagement. At 241, both the RGB and
IR sensors acquire video data, which undergoes two types of
processing. At 242a, the video data is manually annotated to
identify faces in each frame. At 242b, a current model (e.g., a
default model or a model from previous use) is used to
automatically detect faces in each frame. At 243b, a processor is
used to compute accuracy of the automatic detection at 242b against
the annotated videos acquired at 242a. At 244, if the accuracy is
acceptable, the method 240 proceeds to 245, where the current model
is set as the production model for facial recognition (e.g., used
in the method 200). If the accuracy is not acceptable, the method
240 proceeds to 243a, where the videos are split into a training
set of videos (246a) and a test set of videos (246b). For example,
the RGB videos can be selected as the training videos 246a and the
IR videos can be selected as the test videos 246b (or vice
versa).
[0078] The training videos 246a are sent to train a new model at
247a, while the test videos (246b) are sent to step 247b for
testing the new model. At 247b, the training videos 246a and the
test videos 246b are collected together so as to compute accuracy
of the new model at 247c. At 249, the processor again computes the
accuracy of the new model. If the accuracy is acceptable, the new
model is set as the production model (245). If not, the method 240
proceeds to 248, where parameters of the new model are tuned.
Alternatively, another new model can be built at 248. In any event,
parameters of the new model are sent back to 247a, where the
training videos 246a are used to train the new model. In this
manner, a new model can be iteratively built to have an acceptable
accuracy.
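The accept-or-retune loop of FIG. 2B can be sketched as follows, with
a generic scikit-learn classifier and synthetic features standing in
for the face-detection model and the annotated RGB/IR frames; the
model type, features, and acceptance threshold are assumptions made
only for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic stand-ins for annotated face/no-face frames.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 16))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    X_train, y_train = X[:400], y[:400]   # e.g., frames chosen as the training set (246a)
    X_test, y_test = X[400:], y[400:]     # e.g., frames chosen as the test set (246b)

    TARGET_ACCURACY = 0.90                # assumed acceptance threshold
    params = {"n_estimators": 20, "max_depth": 3}

    for iteration in range(10):
        model = RandomForestClassifier(random_state=0, **params)
        model.fit(X_train, y_train)                           # 247a: train the new model
        acc = accuracy_score(y_test, model.predict(X_test))   # 247b/247c: test and score it
        print(f"iteration {iteration}: accuracy = {acc:.3f}")
        if acc >= TARGET_ACCURACY:                            # 244/249: accept as production model
            break
        params["n_estimators"] += 20                          # 248: tune parameters and retrain
        params["max_depth"] += 1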
[0079] Remote Server Operation
[0080] In operation, the remote server 170 collects data
transmitted from different local devices 105 disposed in different
households. The remote server 170 can read the incoming data on a
regular basis. The remote server 170 can also parse the received
data and join the video recognition data with the audio recognition
data using the timestamps of when each was saved.
[0081] The remote server 170 can also correct mislabeled data. For
example, the remote server 170 can fix blips when a viewer is not
identified or is misidentified using data from preceding and
following timestamps. If a person is identified in an image
preceding the image at issue and also in an image following the
image at issue, the remote server 170 can determine that this
person also appears in the image at issue.
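One simple way to implement this kind of blip correction is sketched
below, assuming a per-frame boolean presence track for a single
viewer; the single-frame window is an illustrative choice.

    def fix_blips(track):
        """Fill single-frame identification gaps in a per-viewer presence track.

        `track` holds one boolean per timestamped frame. If the viewer is
        identified in the frames immediately before and after a frame where
        identification failed, that frame is relabeled as present.
        """
        fixed = list(track)
        for i in range(1, len(track) - 1):
            if not track[i] and track[i - 1] and track[i + 1]:
                fixed[i] = True
        return fixed

    print(fix_blips([True, False, True, True, False, False, True]))
    # -> [True, True, True, True, False, False, True]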
[0082] The remote server 170 can also load data received from local
devices 105 and/or data processed by the remote server 170 into a
query-able database. In one example, the remote server 170 can also
provide access to users, who can then use the stored data for
analysis. In another example, the stored data in the query-able
database can also facilitate further analysis performed by the
remote server 170. For example, the remote server 170 can calculate
the attention index and the viewability index using the database.
[0083] Assessing Viewer Engagement
[0084] FIGS. 3A-6 illustrate methods of quantifying viewer
engagement with videos using measures such as viewability index and
attention index. The following definitions may be helpful in
understanding the inventive methods and apparatus for quantifying
viewer engagement with videos:
[0085] Program Duration is defined as the total duration of a
unique program, e.g., in seconds, minutes, or hours. The actual
unit (seconds, minutes, or hours) used is immaterial as long as the
durations of different programs can be compared.
[0086] Commercial Duration is defined as the total duration (e.g.,
in seconds or minutes) of a unique commercial.
[0087] Watching Duration (Seconds) is defined as the total number of
seconds of a unique program or commercial that is watched per home.
Alternatively, Watching Seconds can be defined as the total duration
of the program in seconds minus the total time (in seconds) during
which no home watches the program.
[0088] Aggregated Watching Duration (Seconds) is defined as the total
number of seconds of a unique program or commercial that is watched
across all homes.
[0089] Positive Duration Ratio is defined as the percentage (%) of
a program or commercial advertisement that has been watched. More
specifically, the Positive Duration Ratio of a program or
advertisement can be calculated as the ratio of the Aggregated
Watching Duration over total duration of the program or
advertisement times the number of households.
[0090] Viewer Count (VC) is defined as the total number of viewers
in the viewing area across all homes with positive Watching Seconds
for a given program or commercial advertisement.
[0091] Watching Rate (WR) is defined as the ratio of the total
number of people across all homes where the TV is on over the total
number of people in all households. For example, suppose the methods
take into account 100 households having a total of 300 people. If 30
households having 100 people have their TV sets on, the watching rate
is 33.3% (i.e., 100/300). However, if the same 30 households have 150
people, then the watching rate is 50% (i.e., 150/300).
[0092] Viewing Rate (VR) is defined as the ratio of the total
number of people in the viewing area across all homes over the
total number of TV sets that are on. For example, if 100 people are
in the viewing areas defined by 40 different TV sets (each TV set
defines one viewing area), then the viewing rate is 2.5 (i.e.,
100/40).
[0093] Attention Rate (AR) is defined as the ratio of the total
number of people attentive to the TV across all homes over the
total number of people in the viewing area across all homes. For
example, if 100 people are in the viewing areas across all
individuals taken into account by the methods, but only 60 people
are actually watching TV (the other 40 people may just leave the TV
on while doing other things), then the attention rate is 0.6 or
60%.
[0094] Viewability Index (VI) is defined as the average of Viewing
Rates (VRs) for each program and commercial.
[0095] Attention Index is defined as the average of Attention Rates
(ARs) for each program and commercial.
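The rates defined above can be computed directly from per-household
counts. The sketch below is a minimal illustration using hypothetical
observations; the viewability index and attention index are then
averages of the viewing rate and attention rate, respectively, over
the observations of each program or commercial.

    from dataclasses import dataclass

    @dataclass
    class HouseholdSample:
        """One per-household observation for a given program or commercial."""
        household_size: int          # people in the household
        tv_on: bool                  # display showing the video
        people_in_viewing_area: int
        people_attentive: int        # people facing/watching the display

    def rates(samples):
        total_people = sum(s.household_size for s in samples)
        people_where_tv_on = sum(s.household_size for s in samples if s.tv_on)
        tvs_on = sum(1 for s in samples if s.tv_on)
        in_area = sum(s.people_in_viewing_area for s in samples)
        attentive = sum(s.people_attentive for s in samples)
        watching_rate = people_where_tv_on / total_people if total_people else 0.0
        viewing_rate = in_area / tvs_on if tvs_on else 0.0
        attention_rate = attentive / in_area if in_area else 0.0
        return watching_rate, viewing_rate, attention_rate

    samples = [HouseholdSample(4, True, 3, 2), HouseholdSample(2, False, 0, 0),
               HouseholdSample(3, True, 1, 1)]
    wr, vr, ar = rates(samples)
    print(f"WR={wr:.2f}  VR={vr:.2f}  AR={ar:.2f}")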
[0096] FIG. 3A illustrates a method 300 of assessing viewer
engagement (e.g., box 206 in the method 200 of FIG. 2A) including
facial and eyeball tracking 310, facial recognition 320, and
sentimental analysis 330. A processor (e.g., the local processor
150 shown in FIG. 1) can be used to implement the method 300. The
input data in method 300 can be the data acquired by the local
device 105 shown in FIG. 1, such as the image data, audio data, or
depth data of the viewing area. Face and eyeball tracking 310 is
employed to identify characteristic data points to track the face
as it moves and to determine whether the user is watching the screen. Facial
recognition 320 is employed to determine a viewer's identity using,
for example, artificial intelligence. Sentimental analysis 330 is
employed to determine a viewer's emotion using, for example,
artificial intelligence to analyze facial features, gestures, and
heart rate, among others.
[0097] The acquired information, including whether a viewer is in
fact watching the screen, the identity of the viewer, and the
emotion of the viewer, is used to determine various video ratings
340. In one example, the acquired information is used to estimate
individual video rating for each household. In another example, the
acquired information is used to estimate individual video rating
for each demographic region. In yet another example, the acquired
information is used to estimate overall video rating for a group of
videos. In yet another example, the acquired information is used to
estimate audience reactions to specific videos (e.g., programs and
advertisements). The acquired information can also be used to
determine quantitative measures of viewer engagement, such as
viewability index and attention index as described below.
[0098] Steps 310, 320, and 330 in the method 300 can be achieved
using pattern recognition techniques. These techniques can
determine whether any viewer is present in the viewing area by, for
example, recognizing one or more human faces. If there is indeed a
face recognized, these techniques can further determine who the
viewer is by, for example, comparing the recognized face with a
database including the facial data of the household where the video
is playing. Alternatively, these techniques may use an extended
database that includes facial data of more people (e.g., the entire
community if possible) in case the viewer is not from the
household. These techniques can also trace the movement of the face
and analyze the orientation of the face so as to determine, for
example, whether the viewer is watching the videos.
[0099] Artificial intelligence, machine learning, and trained
neural network learning techniques can also be used to analyze the
emotion of the viewer. To this end, these techniques analyze the
body gestures (static gestures at certain time), body movements
(change of gestures), facial orientations,
direction/movement/positioning of faces, and heart rate, among
others.
[0100] In another example, the method 300 can first recognize a
face from image data acquired by, for example, the RGB sensor 130
and IR sensor 120 shown in FIG. 1. The method 300 can also detect
the position of the face, identify characteristic points on the
face (e.g., boundary points of the eyes and mouth as shown in FIG.
2A), and track the face as it moves. Using eyeball tracking
techniques, the method 300 can determine whether the viewer is
actually watching the videos (or instead just sitting in the
viewing area but doing something else). Then, using techniques of
trained neural network learning, the method 300 can match the
viewer with a known person in the household by comparing facial
features from the database in a similar position. Once the viewer
has been identified, the method 300 can continually track the
viewer for notable facial configurations to determine the user's
mood and/or emotion.
[0101] The method 300 can also compare the audio data (e.g.,
acquired by the microphone 140 shown in FIG. 1) with an audio
database of videos (e.g., TV shows) and other audio so as to
determine which video is being played at a specific point in time.
In one example, the video matching can determine which TV station
is being viewed by the viewer(s) identified by the method 300. In
another example, the video matching can determine which TV program
is being viewed by the viewer. In yet another example, the video
matching can determine which commercial advertisement is being
viewed. Alternatively, or additionally, the TV channel, program, or
advertisement that is being viewed can be determined from data
collected from other sources, including, but not limited to, a
cable or satellite set top box or other programming provider's
hardware or broadcast signal.
[0102] FIG. 3B illustrates the concepts of viewability index and
attention index that can be estimated via techniques described
herein to quantify viewer engagement. In general, viewability index
quantifies the propensity of what is on screen to bring people into
the room. Attention index quantifies the propensity of what is on
screen to engage a viewing audience. In other words, the
viewability index can be regarded as the probability that a video (or
other displayed content) attracts a viewer in the first place,
while the attention index can be regarded as the probability that a
video keeps a viewer in front of the display after the viewer is
already in the viewing area. As illustrated in FIG. 3B, the
viewability index is dependent on the number of people present in
the viewing area, while the attention index is dependent on the
number of people who are actually watching the display.
[0103] Assessing Viewer Engagement with a Viewability Index and an
Attention Index
[0104] FIG. 4A illustrates a method 401 of quantifying viewer
engagement using viewability index. The method 401 can be
implemented by a processor. The method 401 starts at step 411, in
which image data is acquired by the processor at each household in
a plurality of households, which participate in the method via, for
example, installing or using the local device 105 in the system
shown in FIG. 1. The image data includes images of a viewing area
in front of a display which can play videos (e.g., TV programs,
advertisement, user-request video, or any other video). In
addition, the processor also determines if the display is showing a
video at step 411. At step 421, the processor estimates the viewing
rate and watching rate for each video that is played by the
display. The viewing rate represents a ratio of a total number of
people in the viewing areas to a total number of displays showing
videos, as defined above. Similarly, the watching rate represents a
ratio of the total number of people in households with displays showing
videos to a total number of people in the plurality of households,
as defined above.
[0105] The estimation of the viewing rates and the watching rates
is based on the image data acquired at step 411 and on demographic
information about each household in the plurality of households.
The demographic information can be stored in a memory operably
coupled to the processor such that the processor can readily
retrieve the demographic information. In another example, the
processor can acquire the demographic information from another
server. At step 431, the processor determines a viewability index
based on the viewing rate and the watching rate, for each unique
video in the plurality of videos. The viewability index is defined
above as an average of the viewing rates for each video, such as a
program and a commercial.
[0106] The method 401 can further include estimating the viewer
count and the positive duration ratio of each video played by the
display. The estimation is based on the image data and on
demographic information about each household in the plurality of
households. As defined above, the viewer count represents a total
number of people engaged with each unique video, and the positive
duration ratio represents a ratio of total time spent by people in
the plurality of households watching the unique video to a duration
of the unique video.
[0107] Based on the viewer count and the positive duration ratio, a
balanced viewability index can be determined. In one example, the
balanced viewability index can be calculated as the weighted
average of viewability index (VI) by factoring in the viewer count
and positive duration ratio for each given program and commercial.
In another example, the balanced viewability index can be
calculated by normalizing the viewability index across the unique
videos in the plurality of videos.
[0108] The method 401 can further include averaging the viewability
index across all programs and commercials for a finite period of
time so as to produce an average viewability index. The viewability
index of each program and commercial can be divided by the average
viewability index (e.g., computed on a daily, weekly, or monthly
basis) so as to produce a final viewability index (dimensionless
quantity) for users, such as advertising agencies, TV stations, or
other content providers. In one example, the finite period of time
is about two weeks. In another example, the finite period of time
is about one month. In yet another example, the finite period of
time is about three months.
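As a minimal sketch of these two steps (not a prescribed formula), the
snippet below divides each video's viewability index by an average
over the period; the optional weights illustrate one way viewer count
and positive duration ratio could be factored into a balanced variant,
and the exact weighting scheme is an assumption.

    def final_viewability_indexes(vi_by_video, weights=None):
        """Normalize each video's viewability index by an average over the period.

        `vi_by_video` maps a program/commercial identifier to its viewability
        index. `weights` optionally maps the same identifiers to weights (e.g.,
        viewer count times positive duration ratio) for a "balanced" average.
        """
        if weights:
            total_w = sum(weights.values())
            average = sum(vi * weights[k] for k, vi in vi_by_video.items()) / total_w
        else:
            average = sum(vi_by_video.values()) / len(vi_by_video)
        return {k: vi / average for k, vi in vi_by_video.items()}

    print(final_viewability_indexes({"program_a": 2.4, "commercial_b": 1.2, "program_c": 1.8}))
    # unweighted average is 1.8, so the final indexes are about 1.33, 0.67, and 1.0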
[0109] The image data can be acquired at various acquisition rates.
In one example, the image data can be taken 50 times per second (50
Hz). In one example, the image data can be taken 30 times per
second (30 Hz). In yet another example, the image data can be taken
every second (1 Hz). In yet another example, the image data can be
taken every 2 seconds (0.5 Hz). In yet another example, the image
data can be taken every 5 seconds (0.2 Hz). In addition, the method
300 can take and categorize image data for each viewer in the
viewing area so as to derive viewer engagement information taking
into account demographic information of the household.
[0110] FIG. 4B illustrates a method 402 of quantifying user
engagement with videos using attention index. The method 402
includes step 412, at which image data of a viewing area in front
of a display is taken for each household participating in the
viewer engagement assessment. At step 412, a processor determines
whether the display is showing any video when the image data is
taken (e.g., via audio data acquired by the microphone 140 in the
local device 105 shown in FIG. 1). At step 422, for each video
played by the display, the processor estimates an attention rate
based on the image data and on demographic information about the
household. As defined above, the attention rate represents a ratio
of a total number of people engaged with the video to a total
number of people in the viewing areas. Based on the attention rates
of videos, an attention index is determined at step 432 to indicate
the effectiveness of the video.
[0111] The method 402 can further include estimating viewer count
and positive duration ratio of the video(s) played by the display.
Similar to the method 401, the method 402 can determine the viewer
count and positive duration ratio based on the image data and on
demographic information about each household. Using the viewer
count and positive duration ratio, the processor can then
determine a balanced attention index. The method 402 can include
producing a normalized attention index by normalizing the attention
index across the unique videos in the plurality of videos over a
given period of time (e.g., one week, or one month).
[0112] The method 402 can further include averaging attention index
across all programs and commercials for a finite period of time so
as to produce an average attention index. The attention index of
each program and commercial can be divided by the average attention
index so as to produce a final attention index (dimensionless
quantity) for customers, such as advertising agencies, TV stations,
or other content providers.
[0113] Assessing Viewer Engagement Using Facial Recognition
Techniques
[0114] FIG. 5 illustrates a method of assessing viewer engagement
with videos using facial recognition techniques and other
artificial intelligence techniques. The method 500 starts at step
510 where images of a viewing area in front of a display are
captured (e.g., using the system shown in FIG. 1). For each
acquired image, the number of people in the viewing area is
estimated at step 520. In one example, the estimation can be
performed using facial recognition techniques. In
another example, the estimation can be performed based on body
skeleton detection.
[0115] At step 530, with respect to the display, the orientation of
the face of each person in the viewing area is determined. For
example, the orientation of the face can be toward the display,
indicating that the viewer is actually watching the videos on the
display. Alternatively, the orientation of the face can be away
from the display, indicating that the viewer is not watching the
video, although he or she is within the viewing area of the
display. Therefore, based on the orientation of the viewers' faces,
a processor can assess whether each person in the viewing area is
actually engaged with the video, at step 540. By distinguishing
people actually watching the videos from those who are not
watching, the processor can make a more accurate determination of
the effectiveness of the video. The effectiveness of the video can
be quantified by, for example, how long the video can keep the
viewer in an engaged state.
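For illustration, a decision of this kind can be reduced to a
threshold on estimated head-pose angles, as in the sketch below. The
angle thresholds and the assumption that yaw and pitch have already
been estimated from facial landmarks are not specified by this
description.

    def is_engaged(yaw_deg: float, pitch_deg: float,
                   max_yaw: float = 70.0, max_pitch: float = 45.0) -> bool:
        """Classify a viewer as engaged if the face is oriented toward the display.

        `yaw_deg` and `pitch_deg` are head-pose angles (0 degrees = facing the
        display) assumed to be estimated elsewhere, e.g., from facial landmarks.
        The thresholds are illustrative defaults, not mandated values.
        """
        return abs(yaw_deg) <= max_yaw and abs(pitch_deg) <= max_pitch

    people = [(-15.0, 5.0), (85.0, 0.0), (40.0, -10.0)]   # (yaw, pitch) per detected face
    engaged = sum(is_engaged(y, p) for y, p in people)
    print(f"{engaged} of {len(people)} people in the viewing area are engaged")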
[0116] Detecting Skeleton, Face, Identification, Emotion, and
Engagement
[0117] FIG. 6 is a flowchart illustrating a method 600 to detect
skeleton, face, identification, emotion, and engagement, which in
turn can be used for viewer engagement assessment described above.
The method 600 can be implemented by a processor (e.g., the
processor 150 or the processor in the remote server 170). The
method 600 starts at step 610, where image data of a viewing area
in front of a display is provided (e.g., by a memory or directly
from the image taking device, such as the RGB sensor 130 shown in
FIG. 1). At step 620, the processor acquires a skeleton frame
(i.e., an image frame including an image of at least one possible
viewer, see, e.g., 230 in FIG. 2A) from the image data. At step
630, a processing loop is initiated, where the processor uses six
individual skeleton data points/sets for each skeleton frame for
further processing, including facial recognition, emotion analysis,
and engagement determination. Once the skeleton data has been
processed, the method 600 returns to skeleton frame acquisition at
step 620 via a refreshing step 625.
[0118] Step 635 in the method 600 is a decision step, at which the
processor determines whether any skeleton is detected in the
selected skeleton data in the skeleton frame. If not, the method
600 returns to step 630, where new skeleton data is picked up for
processing. If at least one skeleton is detected, the method 600
proceeds to step 640, where a bounding box is generated to identify
the head area of viewers in the image data. The bounding box can be
generated based on, for example, the skeleton information, e.g., by
identifying the head from the overall skeleton.
[0119] Step 645 again is a decision step, where the processor
determines whether a bounding box is generated (i.e., whether a
head area is detected). It is possible that an image includes an
overall skeleton of a viewer but the head part of the viewer is
obstructed and therefore is absent from the image. In this case,
the method 600 again returns to step 630, where the processor
picks up new skeleton data. If a bounding box is detected, the
method 600 goes to step 650, where the processor carries out a
second level facial recognition (also referred to as face
detection). At this step, the processor attempts to detect a human
face within the bounding box generated at step 640. The face
detection can be performed using, for example, Haar Feature-based
Cascade Classifier in OpenCV. More information can be found in U.S.
Pat. No. 8,447,139 B2, which is incorporated herein by reference in
its entirety.
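A minimal sketch of this face-detection step using the stock OpenCV
Haar cascade is shown below; the cascade file and detection parameters
are the standard OpenCV defaults, not values taken from this
disclosure.

    import cv2

    # Stock frontal-face Haar cascade shipped with the opencv-python package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_face_in_box(frame_bgr, box):
        """Return face rectangles found inside `box` = (x, y, w, h) of `frame_bgr`."""
        x, y, w, h = box
        roi = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5,
                                         minSize=(30, 30))
        # Shift detections back into full-frame coordinates.
        return [(x + fx, y + fy, fw, fh) for (fx, fy, fw, fh) in faces]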
[0120] At step 655, the processor determines whether a face is
detected at step 650. If not, a first level facial recognition is
performed at step 660. This first level facial recognition step can
be substantially similar to the second level facial recognition
performed at step 650. Performing another round of face detection
may reduce the possibility of accidental failure of the facial
recognition techniques. Step 665 is a decision step similar to step
655, where the processor determines whether a face is detected.
[0121] If a face is detected at either first level facial
recognition or second level facial recognition, the method 600
proceeds to step 670 to perform facial landmark detection, also
referred to as facial feature detection or facial key points
detection. The step 670 is employed to determine locations of
different facial features (e.g. corners of the eyes, eyebrows, and
the mouth, the tip of the nose, etc.). More information about facial
landmark detection can be found in U.S. Patent Publication No.
2014/0050358 A1 and U.S. Pat. No. 7,751,599 B2, which are
incorporated herein by reference in their entireties.
[0122] At step 672, the processor determines whether any facial
landmark is detected at step 670. If not, the method 600 returns to
step 630 to select another skeleton data for further processing. If
at least one facial landmark is detected, the processor further
determines, at a decision step 674, whether any face is detected at
the second level facial recognition in step 650. If yes, the method
600 proceeds to step 690, where the detected face is identified
(i.e., determining who the viewer is), after which the method goes
to step 680, where the emotion of the face is predicted based on the
facial landmarks. If, at step 674, the processor finds that no face was
detected at step 650, the method 600 directly proceeds to step 680
for the processor to estimate emotion of the viewer. Emotion
analysis can be performed using, for example, a Support Vector
Machine (SVM) in OpenCV. More information can be found in U.S.
Pat. No. 8,488,023, which is incorporated herein by reference in its
entirety.
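The classification step can be sketched as follows. The description
mentions an SVM (e.g., in OpenCV); this sketch substitutes
scikit-learn's SVC and synthetic landmark-derived features purely for
illustration, and the emotion labels are hypothetical.

    import numpy as np
    from sklearn.svm import SVC

    EMOTIONS = ["neutral", "happy", "surprised"]          # hypothetical label set
    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(90, 10))                   # stand-in landmark features
    y_train = rng.integers(0, len(EMOTIONS), size=90)     # stand-in annotated labels

    # Real features would be distances/angles between facial landmarks
    # (eye corners, mouth corners, eyebrows, etc.).
    classifier = SVC(kernel="rbf", C=1.0, gamma="scale")
    classifier.fit(X_train, y_train)

    features_for_current_face = rng.normal(size=(1, 10))
    predicted = EMOTIONS[int(classifier.predict(features_for_current_face)[0])]
    print("predicted emotion:", predicted)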
[0123] In one example, the methods illustrated in FIGS. 3-6 analyze
all available videos (including TV programs and advertisement)
regardless of the duration of the video or viewer count of the
video. In another example, the methods illustrated in FIGS. 3-6
perform preliminary filtering to exclude videos that are either too
short or have too small a viewer count before performing the
quantitative analysis of viewer engagement. In this way, the
quantitative analysis can result in more statistically reliable
results. For example, videos that are watched for less than a
finite amount of time (e.g., less than 30 seconds, less than 20
seconds, or less than 10 seconds) can be excluded. In addition,
videos that are watched by fewer than a certain number of people
(e.g., less than 20 people, less than 15 people, or less than 10
people) over a finite period (e.g., 1 month, two weeks, or one
week) can also be excluded.
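A simple filter of this kind might look like the sketch below; the
thresholds shown are examples drawn from the ranges above, and the
per-video statistics are assumed to have been aggregated elsewhere.

    def filter_videos(stats, min_watch_seconds=30, min_viewers=10):
        """Drop videos watched too briefly or by too few people.

        `stats` maps a video identifier to a tuple of (aggregated watching
        seconds, viewer count over the analysis period).
        """
        return {vid: s for vid, s in stats.items()
                if s[0] >= min_watch_seconds and s[1] >= min_viewers}

    stats = {"ad_1": (12, 40), "ad_2": (95, 8), "show_1": (3600, 55)}
    print(filter_videos(stats))   # only "show_1" passes both thresholds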
[0124] In one example, the methods illustrated in FIGS. 3-6 are
performed over live TV programs. In another example, the methods
illustrated in FIGS. 3-6 are performed over recorded TV programs.
If it is recognized that the timing of a program is greater than 10
minutes shifted from its original "fingerprint creation timestamp"
(e.g., from database of TV stations), the program is determined as
recorded watching. Otherwise, the program is determined as live
watching.
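That rule reduces to a simple timestamp comparison, sketched below
with the 10-minute threshold from the text; the timestamps themselves
(fingerprint creation time and viewing time) are assumed inputs.

    from datetime import datetime, timedelta

    def classify_viewing(fingerprint_created: datetime, viewed_at: datetime,
                         threshold: timedelta = timedelta(minutes=10)) -> str:
        """Label a viewing event as live or recorded based on the timestamp shift."""
        return "recorded" if abs(viewed_at - fingerprint_created) > threshold else "live"

    print(classify_viewing(datetime(2017, 1, 6, 20, 0), datetime(2017, 1, 6, 20, 4)))   # live
    print(classify_viewing(datetime(2017, 1, 6, 20, 0), datetime(2017, 1, 6, 22, 30)))  # recorded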
[0125] Experimental Assessment of the Commercial Message (CM)
Effect
[0126] This section describes accurate viewing data collection and
analysis to examine commercial message (CM) effect management. An
index termed "viewability" indicates when a person is "in front of
the TV". The viewability index is created for this description and
the survey that generates the data. The survey was conducted for two
weeks with a sample of 84 people from 30 households. CM curves are
defined as patterns that show the time series curves of viewability
rates between two scenes. Although the personal viewing rate of CM
between scenes can be constant, the viewability rate may change.
The findings show that there are 7 patterns of the CM curve. The
variables of the length of CM and viewability rate can
significantly contribute to the shape of the CM curve. In addition,
a multinomial logit model can help in determining the CM
curve.
[0127] This experiment investigated the relationship between
commercial messages (CM), programs, and human viewing attitudes.
The experiment also characterized the systems and methods described
above. The correlation between program information, such as
broadcast timing and TV stations, and viewing attitudes using
statistical methods were analyzed. Currently, the personal audience
rating survey used in Japan registers people through a colored
button on the TV remote control and records when they press the
colored button at the start and end of TV viewing. Further, the
People Meter (PM) indicator records what the TV audience watched
and who watched the programs (Video Research Ltd. (2014): "TV
rating handbook", available at the VIDEOR.COM website in PDF
format, incorporated herein by reference). However, this audience
rating survey usually does not allow one to distinguish between
focused and casual viewing even if the audience rating is
accurately captured.
[0128] Hiraki and Ito (Hiraki, A. & Ito, K. (2000): Cognitive
attitudes to television commercials based on eye tracking analysis
combined with scenario, Japanese Journal of Human Engineering, Vol.
36, pp. 239-253, incorporated herein by reference) proposed a
method for analyzing the impact of CM on image recognition using
visual information based on eye movement analysis. They conducted
CM viewing experiments with real CM in an environment of recreated
viewing situations. According to them, auditory and visual
information may interfere with commodity understanding.
[0129] In this experiment, besides personal audience ratings, an
indicator of physical presence captured by the system was used to
measure viewing attitudes. For example, during CM, people may leave
their seats and turn their attention to one another without sitting
in front of the TV. Thus, viewing attitudes during CM were
statistically analyzed using two indexes: personal audience ratings
and physical presence. The latter index is referred to herein as
"viewability."
[0130] The viewing attitude survey experiment of 84 individuals
from 30 households was conducted from mid-November to the end of
November in 2014. Data was obtained 24 hours per day over 14
days.
[0131] FIG. 7 shows a schematic view of a data acquisition system
700 that measures engagement of viewers in a viewing area 701 with
a program or advertisement shown on a TV 702 or other display. The
system 700 includes an image sensor 710 that captures images of the
viewing area 701 while the TV 702 is on. The system 700 also
includes a computing device 750 that stores and processes image
data from the image sensor 710 and communicates the raw and/or
processed image data to a server (not shown) via a communication
network.
[0132] In some cases, the computing device 750 and/or the server
measures viewability in addition to personal audience ratings.
Viewability indicates "being in front of the TV," and this term is
defined as the audience within a distance of about 0.5 m to about 4
m from the TV with the face towards the front of the TV within 70
degrees to the left and the right. In one example, viewability is
captured at a rate of once per second, and it denotes the number of
samples for one second divided by all the samples (84 in this
case).
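The per-second viewability criterion can be expressed as a simple
predicate, sketched below under the assumption that each panel
member's distance from the TV and face angle have already been
estimated; the panel size of 84 follows the survey described above.

    def is_viewable(distance_m: float, yaw_deg: float) -> bool:
        """Within about 0.5 m to 4 m of the TV and facing it within 70 degrees."""
        return 0.5 <= distance_m <= 4.0 and abs(yaw_deg) <= 70.0

    PANEL_SIZE = 84
    samples = [(1.8, 10.0), (3.2, -65.0), (4.5, 0.0), (2.0, 80.0)]   # (distance, face angle)
    viewability = sum(is_viewable(d, y) for d, y in samples) / PANEL_SIZE
    print(f"viewability for this second: {viewability:.3f}")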
[0133] FIGS. 8A-8G show seven different shapes of CM curves, which
denote the transition in the value of viewability divided by the
personal audience rating. This value can indicate the percentage of
people who are actually watching the TV.
[0134] To explain the differences in the shape of CM curves,
classification and modeling of the data can be performed. The
methods of analysis employed in this experiment are discussed
below. First, the multinomial logit model (see, e.g., Agresti, A.
Categorical data analysis. John Wiley & Sons (2013),
incorporated herein by reference) can be employed for data
modeling. Then, non-hierarchical clustering can be performed using
the K-means method, at least because the sample size (1,065) is
large. Next, a decision tree can be constructed. Explanatory
variables are used and all samples are classified using stepwise
grouping. In general, the decision tree is a classification model
that expresses the plurality of classification rules in a tree
structure. The Gini coefficient was used as the impurity
function.
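The three analysis techniques can be sketched with standard
scikit-learn estimators, as below; the synthetic feature matrix and
labels merely stand in for the explanatory variables and the seven
CM-curve shapes, and the hyperparameters are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(1065, 11))      # stand-in explanatory variables
    y = rng.integers(0, 7, size=1065)    # stand-in CM-curve shape labels (A-G)

    # Multinomial logit model: with several classes and the default lbfgs solver,
    # LogisticRegression fits a multinomial (softmax) model.
    logit = LogisticRegression(max_iter=1000).fit(X, y)

    # Non-hierarchical clustering with K-means, one cluster per shape.
    clusters = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

    # Decision tree using the Gini impurity as the splitting criterion.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0).fit(X, y)

    print("logit training accuracy:", round(logit.score(X, y), 3))
    print("tree training accuracy:", round(tree.score(X, y), 3))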
[0135] When determining the shape of the CM curve using these
methods, the analysis also considers approaches or variables that
are closely related to determining the shape of the CM curve. Thus,
any variables that are observed substantially simultaneously with
the CM broadcast can also be included.
[0136] Data from a high viewability time range of the day is used,
which, in this experiment, is six hours, from 18:00 to 24:00. The
viewing attitudes towards CM from five TV stations are analyzed.
The ratios of the CM curves for every TV station are shown in FIG.
9.
[0137] In the analysis, the shape of the CM curve is the dependent
variable, and it is categorized from A to G, as shown in FIGS.
8A-8G. The explanatory variables are length of CM, television
station, genre, elapsed time since the start of the program,
average personal audience rating for the CM, average viewability
rate of the CM, average personal audience rating for the previous
scene, average viewability of the previous scene, viewability rate
of the current scene divided by the personal audience rating,
viewability rate of the previous scene divided by the personal
audience rating, and date and day of the week. The previous scene
refers to the scene between the CM and the previous CM.
[0138] The discrimination results based on the multinomial logit
model are shown in TABLE 1. The discrimination rate in the
multinomial logit model is 20% higher than the discrimination rate
at random. The discrimination rate is particularly high when the
shape of the CM curve is B or G.
[0139] In this model, seven explanatory variables are used: length
of CM, TV stations, elapsed time since the start of the program,
average personal audience rating for the CM, viewability rate,
viewability rate of the CM divided by the personal audience rating,
and viewability rate of the previous scene divided by the personal
audience rating. Of the seven variables, length of CM and TV
station contribute the most to the discrimination rate.
TABLE 1. Result of the multinomial logit model
True/Prediction          A      B      C      D      E      F      G    Sum
A                       34     14     13     13     48     13      1    136
B                       11    114      2     44     31     15      2    219
C                       14     11     21     12     17      4      4     83
D                        8     57      7     86     38      7      0    203
E                       17     30     10     43    110     18      0    228
F                       17     42      0     17     36     37      3    152
G                        0     16      1      4      7      8      8     44
Sum                    101    284     54    219    287    102     18   1065
Discrimination Rate  33.66  40.14  38.89  39.27  38.33  36.27  44.44  38.50
[0140] The explained variables of the seven shapes can also be
stratified. Although several different kinds of stratifications can
be considered, for efficient examination, the following two kinds
of stratifications were compared.
[0141] Stratification 1: Monotonic shape types (C/D/E) and
non-monotonic shape types (A/B/F/G). First, monotonic shape types
that do not have extreme values and non-monotonic shape types that
do have extreme values were stratified. The multinomial logit model
is applied to each group, and then the discrimination rate for each
group can be calculated. The discrimination results of
stratification 1 are shown in TABLE 2. The discrimination rate of
the monotonic shape type is 59.34%, while that of the non-monotonic
shape type is 51.72%, and the overall discrimination rate is
53.62%.
[0142] After stratifying the monotonic and non-monotonic shape
types, the overall discrimination rate is 15% higher than that in
the multinomial logit model without stratification. Compared to the
multinomial logit model without stratification, the difference in
the discrimination rates between the shapes of the CM curve could
be determined correctly (D/E/G) and incorrectly (C).
[0143] The selected explanatory variables are as follows. In the
monotonic shape types, six variables are selected, namely, TV
station, elapsed time since the start of the program, average
personal audience rating for the CM, viewability of the CM,
viewability of the previous scene, and viewability of the previous
scene divided by the personal audience rating. In the non-monotonic
shape types, the six variables selected are length of CM, TV
stations, elapsed time since the start of the program, average
personal audience rating for the CM, viewability rate of the CM,
and viewability rate of the previous scene. Length of CM, which
contributes to the multinomial logit model without stratification,
is not selected in the monotonic shape types.
TABLE 2. Discrimination results of stratification 1
True/Prediction          A      B      C      D      E      F      G    Sum
A                       67     42      0      0      0     26      1    136
B                       26    169      0      0      0     24      0    219
C                        0      0     15     25     43      0      0     83
D                        0      0     10    139     54      0      0    203
E                        0      0     14     63    151      0      0    228
F                       30     75      0      0      0     26      4    152
G                        4     22      0      0      0     14      4     44
Sum                    127    308     39    227    248     90      6   1065
Discrimination Rate  52.76  54.87  38.46  61.23  60.89  28.89  66.67  53.62
[0144] Stratification 2: Simple shape types (A/B/C/D/E) and
complicated shape types (F/G). Second, the shapes can be stratified
into simple shape types, which have at most one extreme value, and
complicated shape types, which have more than one extreme value. The
discrimination results of stratification 2 are shown in TABLE 3.
The discrimination rate of the simple shape type is 46.50%, while
that of the complicated shape type is 77.55%, and the overall
discrimination rate is 52.21%.
[0145] For the simple shape types, nine variables are
selected: length of CM, TV station, elapsed time since the start of
the program, average personal audience rating for the CM,
viewability rate of the CM, average personal audience rating of the
previous scene, viewability rate divided by the personal audience
rating of the CM, viewability of the previous scene divided by the
average personal audience rating, and date. Further, for the
complicated shape types, only one variable is selected: TV stations.
As this model has only one variable, all samples are classified
under F. For the simple shape types, the selected variables are
similar to those of the multinomial logit model without
stratification.
TABLE 3. Discrimination results of stratification 2
True/Prediction          A      B      C      D      E      F      G    Sum
A                       39     19     13     20     45      0      0    136
B                       15    121      2     46     35      0      0    219
C                       12     15     23     12     21      0      0     83
D                       11     50      7    103     32      0      0    203
E                       22     38     10     40    118      0      0    228
F                        0      0      0      0      0    152      0    152
G                        0      0      0      0      0     44      0     44
Sum                     99    243     55    221    251    196      0   1065
Discrimination Rate  39.39  49.79  41.82  46.61  47.01  77.55   0.00  52.21
[0146] Cluster analysis using the explanatory variables can be
performed. The discrimination results of the cluster analysis are
shown in TABLE 4. The discrimination rate is 15.77%, and there is
no difference in the discrimination rate between cluster analysis
and random selection. In other words, in the nonhierarchical
cluster analysis, the CM curve could not be classified.
TABLE 4. Discrimination results of cluster analysis
True/Prediction          A      B      C      D      E      F      G    Sum
A                       10     21     10     14     58     14      9    136
B                       22     25     19     11    116     16     10    219
C                        6     10      4     11     38     10      4     83
D                       17     28      6     10    110     25      7    203
E                       32     29     10     13    109     28      7    228
F                       11     29      7     16     76      9      4    152
G                        4      7      2      3     26      1      1     44
Sum                    102    149     58     78    533    103     42   1065
Discrimination Rate   9.80  16.78   6.90  12.82  20.45   8.74   2.38  15.77
[0147] FIG. 10 shows a classification model through a decision
tree. The determination results of the decision tree are shown in
TABLE 5. The discrimination rate of the decision tree is 40%. From
TABLE 5, one can see that the discrimination rate of G is 0%, but
that of D, at about 73%, is higher than that of the other CM curves.
The discrimination rate of the decision tree is slightly higher
than that of the multinomial logit model without
stratification.
[0148] From FIG. 10, the characteristics of each shape of the CM
curve can be identified. Shape A occurs when the viewability rate
is high. Shape B occurs when the viewability rate is low and the
length of CM is long. Shape C occurs when the viewability rate of a
scene is not very different from that of the previous scene. Shape
D occurs when the viewability rate is low and the length of CM is
short. Shape E occurs when the viewability rate of the previous
scene is low and the length of CM is short. Shape F occurs when the
viewability rate of a scene is low while the viewability rate of
the previous scene is high.
TABLE 5. Discriminant results of the decision tree
True/Prediction          A      B      C      D      E      F      G    Sum
A                       17     10     14      0     32     63      0    136
B                        4    121      5      8     46     35      0    219
C                        5      3     31      1     34      9      0     83
D                        6     70      4     30     68     25      0    203
E                        5     17      8      2    128     68      0    228
F                        6     29      2      0     16     99      0    152
G                        2     11      2      0      7     22      0     44
Sum                     45    261     66     41    331    321      0   1065
Discrimination Rate  37.78  46.36  46.97  73.17  38.67  30.84   0.00  40
[0149] Comparison and consideration. The discrimination rate by
each method is summarized in TABLE 6. The method of stratification
1 has the highest rate among all methods. However, since the
explained variables were stratified, it is impossible to verify the
entire connection.
TABLE 6. Summary of discrimination rates
Multinomial Logit  Stratification 1  Stratification 2  Cluster Analysis  Decision Tree
38.5               53.62             52.21             15.77             40
[0150] The discrimination rate of the multinomial logit model
without stratification is almost the same as the rate of the
decision tree. Because the decision tree is determined by whether
or not the viewability rate is higher than a fixed value, it is
difficult to understand intuitively, and the fixed value is not
replicable. Therefore, the most suitable method to determine the CM
curve is the multinomial logit model without stratification.
[0151] In all the methods, the variables of length of CM and
viewability rate contribute the most to determining the CM curve.
Therefore, TV viewing attitudes do not depend on the genre and
broadcast time of the program, but on the length of CM and the
viewability rate of the current and previous scenes.
[0152] In these five methods, the variables of length of CM and
viewability rate greatly contribute to determining the CM curve. In
this regard, two points are considered: 1) the relationship between
the length of CM and viewability rate, and 2) in what kinds of
situations the viewability rate is high.
[0153] The relationship between the length of CM and viewability
rate is illustrated in FIG. 11. In general, the shorter the length
of CM, the higher the viewability rate is. The longer the CM, the
lower the viewability rate, because people will become uninterested
and stop looking at the TV.
[0154] Further, what kinds of situations lead to a high viewability
rate was investigated. When little time elapses after the program
begins (depending on the genre), the viewability rate is high. As
TABLE 7 shows, there are noticeable differences between the average
viewability rates of each genre. The viewability rate of news
programs is low, whereas that of movies and music is high. FIG. 12
shows the correlation between elapsed time since the start of the
program and the viewability rate. From FIG. 12, one can see that
the viewability rate is higher when less time has elapsed since
the start of the program.
TABLE 7. Average viewability rate by genre
Genre                  Viewability
Animation/Tokusatsu    0.706
Sports                 0.668
Documentary            0.907
Drama                  0.807
News                   0.814
Variety shows          0.988
Film                   1.252
Music                  1.359
Hobby/Education        0.816
Tabloid shows          0.776
All                    0.939
[0155] This experimental study elucidates the relationship between
CM, programs, and human viewing attitudes using an exemplary
embodiment of the hardware and software components of the present
invention. The most suitable method to determine the CM curve is
the multinomial logit model.
[0156] The variables that can be observed during CM are analyzed to
examine the relationship between the CM curve and these variables.
In all the methods employed, the variables of length of CM and
viewability rate contribute the most to determining the CM curve.
Since the discrimination rate of the monotonic shape type is high,
discrimination is easier, whether unchanged or changed. In other
words, the shape of the CM curve is not relevant to program
characteristics such as genre and date. This indicates that when
the CM broadcast time is longer, the audience gets tired of
watching. Moreover, if the previous scene of the program is
uninteresting to the audience, then they do not watch the next
CM.
[0157] Applications of Viewer Engagement Data
[0158] FIG. 13 illustrates a system of communication of data
acquired using the methods and systems described herein. The system
1300 stores and processes raw data 1310 captured from TV audience
panels through the motion-sensing devices, which is transferred to
the computing device 1320 such as, but not limited to, a desktop
machine. Then, methods of assessing viewer engagement can be
performed on, for example, desktop machines to analyze and process
the data. The methods transform the analyzed data
into performance-based TV ratings data that can be used to
determine (1) who is really watching TV (who is in the audience),
(2) how often the audience members look at the TV, and (3) the
audience's reaction towards the TV programs and advertisements.
This processed and/or summarized data is then transferred to a
central storage location 1330, such as a server, on the cloud where
third parties, including but not limited to TV advertising agencies
1340, TV networks 1350, and any other potential clients 1360 that
might find the data useful, can conveniently access the data
anytime, through the collection entity's software, an application
programming interface, or a web portal, specifically developed for
the collection entity's clients. Alternatively, the raw data 1310
collected by the sensors of the hardware component is transferred
to a central storage 1330 on the cloud directly or indirectly
through an Internet connection where it is analyzed by the software
component and made available to interested third parties 1340-1360.
A third party may optionally access the raw data through the
system.
[0159] FIG. 14 illustrates basic elements of an example system 1400
that can utilize the data acquired and analyzed by the systems and
methods described herein. The collection entity 1430 (e.g., TVision
Insights) may compensate panel members 1410 (e.g., household
members) who, in exchange for compensation or volunteering, allow
the hardware components depicted in FIG. 1 to be placed atop
televisions in their households for the purpose of TV
viewership data collection. Panel members may be asked to provide
additional information 1420, including but not limited to, credit
card transaction data, demographic and socio-economic information,
social media account logins, and data from tablets, smartphones,
and other devices. This data is collected, video and IR images are
recorded using the system depicted in FIG. 1, and the video can be
analyzed by the methods described in FIGS. 2-6. Once analyzed, data
describing the video may be transmitted to the collection entity
1430, which may then sell or otherwise provide the data to
advertisers 1440, TV stations 1460, TV agencies 1450, and other
interested third parties. Optionally, the collection entity 1430
may provide access to raw collected data for separate analysis. As
part of the disclosed business model, the collection entity 1430
can motivate advertisers 1440 to encourage their TV agencies 1450
to purchase this data.
[0160] FIG. 15 illustrates big data analysis and visualization
based on data acquired in methods of assessing viewer engagement.
In these models 1500, the collection entity 1520 (e.g., TVision
INSIGHTS shown in FIG. 15) can collect data from households 1510
having TV sets. In return, the participating households 1510 can
receive monetary compensation (or other benefit) from collection
entity 1520. The collection entity 1520 then analyzes the data
collected from the participating households using big data analysis
1530a and visualization techniques 1530b to derive information such
as the effectiveness of certain TV program or advertisement. This
data can be then provided to advertisers, advertising agencies, TV
stations, or other content providers or promoters (collectively
referred to as customers 1540) to help them improve the
effectiveness of their programs. In one example, the customers 1540
can subscribe to this data service from the collection entity 1520 on a
monthly basis for a monthly fee. In another example, the customers
1540 can buy data relating to a particular video (e.g., campaign
video, special advertisement during sports events, etc.) from the
collection entity 1520.
[0161] FIG. 16 illustrates examples of collection of additional
information 1600 from individuals and households (TV audiences)
participating in viewer engagement data collection. The TV
audiences can represent national and/or local demographics useful
to interested third parties. The collection entity can collect
video data 1610 and the demographic information and, packaged with
data gathered by the system and analyzed by the methods regarding
TV viewership, provide this information to customers for
compensation. Examples of information that may be collected from TV
audiences include any and all information that can be obtained
through social media profiles 1620 such as, but not limited to,
TWITTER, Instagram, FACEBOOK, among others. The information can
further include video data and audio data 1640 obtained from the
systems (including both television audio and audio such as
conversation originating from individuals in the household),
multi-screen data 1630 including smartphone and tablet search
habits, internet search history, email account information, and
credit card transaction data 1650. This list is not exhaustive, and
should not be interpreted as limiting.
[0162] The collected information and data enables a collection
entity to accurately evaluate the impact of TV
advertisements, including unprecedented measurements of individual
demographics, which are valuable to the advertisers. The
advertisers can use the data to determine which ad slots would be a
best fit for their targeted audience. The message can also be more
pertinent to the type of audience and can effectively lead to
purchasing behavior, increasing return on investment (ROI) for the
advertisers.
[0163] TV networks can also benefit from the disclosed invention as
they will be able to glean more accurate ratings of their TV
programs, audience type, reactions, and predictive ad slot value.
This will allow them to improve their programs to better fit the
type of audience and eliminate non-popular ones, in addition to
determining which ad slots will have the highest value for a
particular target demographic. The data can also be used to compare
programs across multiple channels at the same or different time
slots for a comparative evaluation of programs and advertising.
Similarly, TV audience data and behavior can be collected and
compared to streaming content for any given programming time slot.
TV pilot programs can also be evaluated using the system before
ordering episodes.
CONCLUSION
[0164] While various inventive embodiments have been described and
illustrated herein, those of ordinary skill in the art will readily
envision a variety of other means and/or structures for performing
the function and/or obtaining the results and/or one or more of the
advantages described herein, and each of such variations and/or
modifications is deemed to be within the scope of the inventive
embodiments described herein. More generally, those skilled in the
art will readily appreciate that all parameters, dimensions,
materials, and configurations described herein are meant to be
exemplary and that the actual parameters, dimensions, materials,
and/or configurations will depend upon the specific application or
applications for which the inventive teachings is/are used. Those
skilled in the art will recognize, or be able to ascertain using no
more than routine experimentation, many equivalents to the specific
inventive embodiments described herein. It is, therefore, to be
understood that the foregoing embodiments are presented by way of
example only and that, within the scope of the appended claims and
equivalents thereto, inventive embodiments may be practiced
otherwise than as specifically described and claimed. Inventive
embodiments of the present disclosure are directed to each
individual feature, system, article, material, kit, and/or method
described herein. In addition, any combination of two or more such
features, systems, articles, materials, kits, and/or methods, if
such features, systems, articles, materials, kits, and/or methods
are not mutually inconsistent, is included within the inventive
scope of the present disclosure.
[0165] The above-described embodiments can be implemented in any of
numerous ways. For example, embodiments of designing and making the
technology disclosed herein may be implemented using hardware,
software or a combination thereof. When implemented in software,
the software code can be executed on any suitable processor or
collection of processors, whether provided in a single computer or
distributed among multiple computers.
[0166] Further, it should be appreciated that a computer may be
embodied in any of a number of forms, such as a rack-mounted
computer, a desktop computer, a laptop computer, or a tablet
computer. Additionally, a computer may be embedded in a device not
generally regarded as a computer but with suitable processing
capabilities, including a Personal Digital Assistant (PDA), a smart
phone or any other suitable portable or fixed electronic
device.
[0167] Also, a computer may have one or more input and output
devices. These devices can be used, among other things, to present
a user interface. Examples of output devices that can be used to
provide a user interface include printers or display screens for
visual presentation of output and speakers or other sound
generating devices for audible presentation of output. Examples of
input devices that can be used for a user interface include
keyboards, and pointing devices, such as mice, touch pads, and
digitizing tablets. As another example, a computer may receive
input information through speech recognition or in other audible
format.
[0168] Such computers may be interconnected by one or more networks
in any suitable form, including a local area network or a wide area
network, such as an enterprise network, and intelligent network
(IN) or the Internet. Such networks may be based on any suitable
technology and may operate according to any suitable protocol and
may include wireless networks, wired networks or fiber optic
networks.
[0169] The various methods or processes outlined herein may be
coded as software that is executable on one or more processors that
employ any one of a variety of operating systems or platforms.
Additionally, such software may be written using any of a number of
suitable programming languages and/or programming or scripting
tools, and also may be compiled as executable machine language code
or intermediate code that is executed on a framework or virtual
machine.
[0170] In this respect, various inventive concepts may be embodied
as a computer readable storage medium (or multiple computer
readable storage media) (e.g., a computer memory, one or more
floppy discs, compact discs, optical discs, magnetic tapes, flash
memories, circuit configurations in Field Programmable Gate Arrays
or other semiconductor devices, or other non-transitory medium or
tangible computer storage medium) encoded with one or more programs
that, when executed on one or more computers or other processors,
perform methods that implement the various embodiments of the
invention discussed above. The computer readable medium or media
can be transportable, such that the program or programs stored
thereon can be loaded onto one or more different computers or other
processors to implement various aspects of the present invention as
discussed above.
[0171] The terms "program" or "software" are used herein in a
generic sense to refer to any type of computer code or set of
computer-executable instructions that can be employed to program a
computer or other processor to implement various aspects of
embodiments as discussed above. Additionally, it should be
appreciated that according to one aspect, one or more computer
programs that when executed perform methods of the present
invention need not reside on a single computer or processor, but
may be distributed in a modular fashion amongst a number of
different computers or processors to implement various aspects of
the present invention.
[0172] Computer-executable instructions may be in many forms, such
as program modules, executed by one or more computers or other
devices. Generally, program modules include routines, programs,
objects, components, data structures, etc., that perform particular
tasks or implement particular abstract data types. Typically, the
functionality of the program modules may be combined or distributed
as desired in various embodiments.
[0173] Also, data structures may be stored in computer-readable
media in any suitable form. For simplicity of illustration, data
structures may be shown to have fields that are related through
location in the data structure. Such relationships may likewise be
achieved by assigning storage for the fields with locations in a
computer-readable medium that convey relationship between the
fields. However, any suitable mechanism may be used to establish a
relationship between information in fields of a data structure,
including through the use of pointers, tags or other mechanisms
that establish relationship between data elements.
[0174] Also, various inventive concepts may be embodied as one or
more methods, of which an example has been provided. The acts
performed as part of the method may be ordered in any suitable way.
Accordingly, embodiments may be constructed in which acts are
performed in an order different than illustrated, which may include
performing some acts simultaneously, even though shown as
sequential acts in illustrative embodiments.
[0175] All definitions, as defined and used herein, should be
understood to control over dictionary definitions, definitions in
documents incorporated by reference, and/or ordinary meanings of
the defined terms.
[0176] The indefinite articles "a" and "an," as used herein in the
specification and in the claims, unless clearly indicated to the
contrary, should be understood to mean "at least one."
[0177] The phrase "and/or," as used herein in the specification and
in the claims, should be understood to mean "either or both" of the
elements so conjoined, i.e., elements that are conjunctively
present in some cases and disjunctively present in other cases.
Multiple elements listed with "and/or" should be construed in the
same fashion, i.e., "one or more" of the elements so conjoined.
Other elements may optionally be present other than the elements
specifically identified by the "and/or" clause, whether related or
unrelated to those elements specifically identified. Thus, as a
non-limiting example, a reference to "A and/or B," when used in
conjunction with open-ended language such as "comprising," can
refer, in one embodiment, to A only (optionally including elements
other than B); in another embodiment, to B only (optionally
including elements other than A); in yet another embodiment, to
both A and B (optionally including other elements); etc.
[0178] As used herein in the specification and in the claims, "or"
should be understood to have the same meaning as "and/or" as
defined above. For example, when separating items in a list, "or"
or "and/or" shall be interpreted as being inclusive, i.e., the
inclusion of at least one, but also including more than one, of a
number or list of elements, and, optionally, additional unlisted
items. Only terms clearly indicated to the contrary, such as "only
one of" or "exactly one of," or, when used in the claims,
"consisting of," will refer to the inclusion of exactly one element
of a number or list of elements. In general, the term "or" as used
herein shall only be interpreted as indicating exclusive
alternatives (i.e., "one or the other but not both") when preceded
by terms of exclusivity, such as "either," "one of," "only one of,"
or "exactly one of" "Consisting essentially of," when used in the
claims, shall have its ordinary meaning as used in the field of
patent law.
[0179] As used herein in the specification and in the claims, the
phrase "at least one," in reference to a list of one or more
elements, should be understood to mean at least one element
selected from any one or more of the elements in the list of
elements, but not necessarily including at least one of each and
every element specifically listed within the list of elements and
not excluding any combinations of elements in the list of elements.
This definition also allows that elements may optionally be present
other than the elements specifically identified within the list of
elements to which the phrase "at least one" refers, whether related
or unrelated to those elements specifically identified. Thus, as a
non-limiting example, "at least one of A and B" (or, equivalently,
"at least one of A or B," or, equivalently "at least one of A
and/or B") can refer, in one embodiment, to at least one,
optionally including more than one, A, with no B present (and
optionally including elements other than B); in another embodiment,
to at least one, optionally including more than one, B, with no A
present (and optionally including elements other than A); in yet
another embodiment, to at least one, optionally including more than
one, A, and at least one, optionally including more than one, B
(and optionally including other elements); etc.
[0180] In the claims, as well as in the specification above, all
transitional phrases such as "comprising," "including," "carrying,"
"having," "containing," "involving," "holding," "composed of," and
the like are to be understood to be open-ended, i.e., to mean
including but not limited to. Only the transitional phrases
"consisting of" and "consisting essentially of" shall be closed or
semi-closed transitional phrases, respectively, as set forth in the
United States Patent Office Manual of Patent Examining Procedures,
Section 2111.03.
* * * * *