U.S. patent application number 16/731025 was filed with the patent office on 2020-07-02 for video-based fall risk assessment system.
This patent application is currently assigned to AltumView Systems Inc. The applicant listed for this patent is AltumView Systems Inc. The invention is credited to Jie Liang, Chao Shen, Dong Zhang, and Jiannan Zheng.
Application Number: 20200205697 (16/731025)
Document ID: /
Family ID: 71123693
Filed Date: 2020-07-02
(Patent drawing sheets D00000 through D00010 omitted.)
United States Patent Application: 20200205697
Kind Code: A1
Inventors: Zheng; Jiannan; et al.
Publication Date: July 2, 2020
VIDEO-BASED FALL RISK ASSESSMENT SYSTEM
Abstract
Various embodiments of a video-based fall risk assessment system
are disclosed. During operation, this fall risk assessment system
can receive a sequence of video frames including a person being
monitored for fall risk assessment. The system next generates a
sequence of action labels for the sequence of video frames by, for
each video frame in the sequence of video frames: estimating a pose
of the person within the video frame; and classifying the estimated
pose as a given action among a set of predetermined actions. Next,
the system identifies a subset of action labels within the sequence
of action labels. The system next extracts a set of gait features
for the person from a subset of video frames within the sequence of
video frames corresponding to the subset of action labels.
Subsequently, the system analyzes the set of extracted gait
features to generate a fall risk assessment for the person. In some
embodiments, the sequence of video frames is captured during a
predetermined time period, such as an hour, a day, or a week.
Inventors: Zheng; Jiannan (Delta, CA); Shen; Chao (Richmond, CA); Zhang; Dong (Port Coquitlam, CA); Liang; Jie (Coquitlam, CA)
Applicant: AltumView Systems Inc., Port Moody, CA
Assignee: AltumView Systems Inc., Port Moody, CA
Family ID: 71123693
Appl. No.: 16/731025
Filed: December 30, 2019
Related U.S. Patent Documents

| Application Number | Filing Date | Patent Number |
| --- | --- | --- |
| 62786541 | Dec 30, 2018 | |
Current U.S. Class: 1/1
Current CPC Class: G06N 3/08 20130101; G06K 2009/00738 20130101; G08B 5/222 20130101; G08B 21/043 20130101; G06K 9/00342 20130101; G06K 9/46 20130101; G06T 3/0093 20130101; G06T 2210/22 20130101; G16H 50/30 20180101; A61B 5/1117 20130101; G16H 30/40 20180101; A61B 5/112 20130101; G08B 21/0476 20130101; G06T 11/00 20130101; G06K 9/00718 20130101; A61B 5/7275 20130101
International Class: A61B 5/11 20060101 A61B005/11; A61B 5/00 20060101 A61B005/00; G16H 50/30 20060101 G16H050/30; G08B 21/04 20060101 G08B021/04
Claims
1. A method of performing video-based fall risk assessment,
comprising: receiving a sequence of video frames including a person
being monitored for fall risk assessment; generating a sequence of
action labels for the sequence of video frames by, for each video
frame in the sequence of video frames: estimating a pose of the
person within the video frame; and classifying the estimated pose
as a given action among a set of predetermined actions; identifying
a subset of action labels within the sequence of action labels;
extracting a set of gait features for the person from a subset of
video frames within the sequence of video frames corresponding to
the subset of action labels; and analyzing the set of extracted
gait features to generate a fall risk assessment for the
person.
2. The method of claim 1, wherein the sequence of video frames is
captured during a predetermined time period.
3. The method of claim 2, wherein the predetermined time period is
an hour, a day, or a week.
4. The method of claim 1, wherein prior to estimating a pose of the
person within the video frame, the method further comprises
detecting the person within the video frame.
5. The method of claim 1, wherein the set of predetermined actions
includes a standing action, a sitting action, a walking action, and
all other actions.
6. The method of claim 5, wherein identifying the subset of action
labels within the sequence of action labels includes identifying
all action labels classified as the walking action.
7. The method of claim 1, wherein the set of gait features includes
one or more of: step count, average step duration, variance of step
duration for one foot or both feet, speed, cadence, step balance,
and body sway factor.
8. The method of claim 2, wherein analyzing the set of extracted
gait features to generate a fall risk assessment for the person
includes analyzing the sequence of video frames captured during the
predetermined time period.
9. The method of claim 1, wherein analyzing the set of extracted
gait features to generate a fall risk assessment includes performing
one or more statistical analyses on a given extracted gait feature
in the set of extracted gait features.
10. The method of claim 1, wherein the method further comprises
triggering a high-fall-risk warning to be sent to the caregivers
when analyzing the set of extracted gait features generates a
high-fall-risk assessment for the person.
11. A video-based fall risk assessment system, comprising: one or
more processors; a memory coupled to the one or more processors,
wherein the memory stores instructions that, when executed by the
one or more processors, cause the system to: receive a sequence of
video frames including a person being monitored for fall risk
assessment; generate a sequence of action labels for the sequence
of video frames by, for each video frame in the sequence of video
frames: estimating a pose of the person within the video frame; and
classifying the estimated pose as a given action among a set of
predetermined actions; identify a subset of action labels within
the sequence of action labels; extract a set of gait features for
the person from a subset of video frames within the sequence of
video frames corresponding to the subset of action labels; and
analyze the set of extracted gait features to generate a fall risk
assessment for the person.
12. The system of claim 11, wherein the sequence of video frames is
captured during a predetermined time period.
13. The system of claim 12, wherein the predetermined time period
is an hour, a day, or a week.
14. The system of claim 11, wherein the memory further stores
instructions that, when executed by the one or more processors,
cause the system to detect the person within the video frame prior
to estimating a pose of the person within the video frame.
15. The system of claim 11, wherein the set of predetermined
actions includes a standing action, a sitting action, a walking
action, and all other actions.
16. The system of claim 15, wherein identifying the subset of
action labels within the sequence of action labels includes
identifying all action labels classified as the walking action.
17. The system of claim 11, wherein the set of gait features
includes one or more of: step count, average step duration,
variance of step duration for one foot or both feet, speed,
cadence, step balance, and body sway factor.
18. The system of claim 12, wherein analyzing the set of extracted
gait features to generate a fall risk assessment includes performing
one or more statistical analyses on a given extracted gait feature
in the set of extracted gait features.
19. The system of claim 11, wherein the memory further stores
instructions that, when executed by the one or more processors,
cause the system to trigger a high-fall-risk warning to be sent to
the caregivers when analyzing the set of extracted gait features
generates a high-fall-risk assessment for the person.
20. An embedded system, comprising: one or more cameras configured
to capture a sequence of video frames including a person; one or
more processors; a memory coupled to the one or more processors and
storing instructions that, when executed by the one or more
processors, cause the system to: receive a sequence of video frames
including a person being monitored for fall risk assessment;
generate a sequence of action labels for the sequence of video
frames by, for each video frame in the sequence of video frames:
estimating a pose of the person within the video frame; and
classifying the estimated pose as a given action among a set of
predetermined actions; identify a subset of action labels within
the sequence of action labels; extract a set of gait features for
the person from a subset of video frames within the sequence of
video frames corresponding to the subset of action labels; and
analyze the set of extracted gait features to generate a fall risk
assessment for the person.
Description
PRIORITY CLAIM AND RELATED PATENT APPLICATIONS
[0001] This patent document claims benefit of priority under 35
U.S.C. 119(e) to U.S. Provisional Patent Application No. 62/786,541
entitled "METHOD AND SYSTEM FOR PRIVACY-PRESERVING FALL DETECTION,"
by inventors Him Wai Ng, Xing Wang, Jiannan Zheng, Andrew Tsun-Hong
Au, Chi Chung Chan, Kuan Huan Lin, Dong Zhang, Eric Honsch,
Kwun-Keat Chan, Adrian Kee-Ley Auk, Karen Ly-Ma, Jianbing Wu, and
Ye Lu, and filed on Dec. 30, 2018 (Attorney Docket No.
AVS010.PRV01). The disclosure of the above application is
incorporated by reference in its entirety as a part of this
document.
[0002] This patent application is also related to a pending U.S.
patent application entitled "METHOD AND SYSTEM FOR
PRIVACY-PRESERVING FALL DETECTION," by inventors Him Wai Ng, et
al., having patent application Ser. No. 16/672,432, and filed on 2
Nov. 2019 (Attorney Docket No. AVS010.US01).
TECHNICAL FIELD
[0003] The present disclosure generally relates to the field of
medical and health monitoring, and more specifically to systems,
devices, and techniques for performing highly reliable and
privacy-preserving fall detection on humans.
BACKGROUND
[0004] As life expectancy worldwide continues to rise, a rapidly
aging population has become a serious social problem faced by many
countries. An aging population is generally composed of people over
65 years old. As the number of people in this age group grows
rapidly, the ever-increasing demand for quality healthcare services
imposes significant challenges on healthcare providers and society.
Of the various medical and health problems associated with an aging
population, falls are among the most common and most serious
problems faced by elderly people. Elderly people have a
significantly higher risk of falling, which continues to increase
with age, and a fall often leads to serious and irreversible medical
consequences. If a fall does occur, the ability to generate an
alert/alarm signal in the first moments after the fall, so that
medical help can be rendered immediately, is of vital importance.
Nowadays, such fall alarms can be generated by various
fall-detection devices that monitor and detect falls for people with
a high risk of falling.
[0005] Various types of fall-detection devices have been developed.
For example, these fall-detection devices include wearable
fall-detection devices, which typically rely on accelerometers or
gyroscopes to detect a fall. However, wearable fall-detection
devices need to be worn by the people being monitored most of the
time and recharged frequently, making them cumbersome and
inconvenient to use. Moreover, many people tend to forget to wear
them, and some even refuse to wear them. Some existing wearable
fall-detection devices are based on acoustic/vibration sensors.
However, these fall-detection devices tend to have relatively low
accuracy and are generally only useful for detecting heavy impacts.
[0006] Another type of fall-detection device uses various
vision-based fall-detection techniques, e.g., based on captured
videos of a high-risk individual. For example, one existing
technique uses a depth camera to detect falls. However, the accuracy
of depth cameras is often inadequate for monitoring large areas. In
another existing technique, the field of view of a captured video is
partitioned into an upper region and a lower region, and a motion
event corresponding to a person in the lower region is detected
based on the magnitude and the area of the motion. In still another
existing technique, fall detection is performed by using the height
and aspect ratio of the person detected in a captured video.
However, in the above techniques, the decision rules for identifying
a fall are quite naive, and the performance of these systems cannot
meet desired accuracy requirements.
[0007] In another video-based fall-detection system, gradient-based
feature vectors are calculated from the video images and used to
represent human objects. These feature vectors are subsequently
sent to a simple three-layer Elman recurrent neural network (RNN)
for fall detection. However, the generally low complexity of this
simple RNN architecture also limits the accuracy of the associated
fall-detection outcomes.
[0008] Recently, convolutional neural network (CNN)-based
techniques have been applied to fall detection. These CNN-based
techniques are generally more accurate and robust than the
above-described techniques that use simple rules or parameters to
make fall predictions. For example, one such technique uses
CNN-based architectures to identify human actions captured in an
image. However, the existing CNN-based fall-detection techniques
require a significant amount of computational resources and are
therefore not suitable for embedded system implementations.
[0009] In addition to the need to generate immediate alerts/alarms
of fall events so that medical assistance can be rendered in the
first moments of a fall, effective fall risk assessment prior to
making fall predictions can potentially prevent fall events from
happening. Currently, in most hospitals and elderly care facilities,
questionnaires are adopted as the primary fall-risk-assessment tool.
More specifically, to evaluate a subject's potential fall risk,
questionnaires are completed by the subject or a family member of
the subject, sometimes under the supervision of doctors. The
questionnaires present questions on topics such as the subject's age
and gender, history of previous fall events, bowel and urine
elimination, current medications and medication history, patient
care equipment (e.g., chest tube, etc.), mobility, and cognition.
Single or multiple selections are available for each question, and
each selection can be assigned a certain number of points. After a
questionnaire is completed, all points associated with all of the
selections are summed and used as a fall risk score for the subject.
Based on the fall risk score, a particular level of fall risk (e.g.,
high risk, medium risk, low risk, etc.) is assigned to the subject,
and corresponding fall-risk-intervention measures can then be
applied to the subject. Although questionnaires provide a simple way
to assess the subject's fall risk, the associated results are often
inaccurate and can be highly subjective, depending on the medical
knowledge of the subject or their family members.
[0010] Recently, several in-clinic fall risk tests performed under
controlled environments have been introduced to provide a more
accurate and objective assessment of a subject's fall risk. For
example, a 30-second sit-and-stand test can be used to evaluate the
subject's lower-limb strength and mobility. In this test, potential
fall risk is determined by the number of sit-stand actions that are
successfully performed by the subject: generally, the more sit-stand
actions the subject can complete, the lower the fall risk associated
with the subject. Moreover, a balancing test can be used to evaluate
the subject's ability to balance, which can be an effective
indicator of fall risk. During such a test, the subject is asked to
perform a series of balancing acts, including a single-foot stance.
Failure to perform one or more of the acts is considered to indicate
a higher fall risk. A standing-and-three-meter walking test can also
evaluate the subject's mobility. At the beginning of this test, the
subject sits in a chair. After the starting signal of the test, the
subject must stand, walk three meters forward, turn around, and sit
back down on the chair. The time to complete the test is measured
and used as an indicator of fall risk: the more time the subject
takes to complete the test, the higher the predicted fall risk.
Although the above-described in-clinic tests can provide more
objective and reliable fall risk evaluations, the tests are usually
carried out in clinics under controlled environments and evaluated
by doctors or trained personnel. As a result, they can be quite
troublesome to perform and thus difficult to use for monitoring and
evaluating the subject's fall risk on a daily basis.
[0011] More recently, researchers have found that for many
subjects, fall risk is a progressive issue. Consequently,
continuously monitoring a subject's fall risk in the daily living
environment can be crucial for effectively and accurately evaluating
the fall risk and offering subsequent intervention procedures.
Unfortunately, existing in-home gait-analysis techniques rely
heavily on wearable sensors, which need to be worn by the subject
most of the time and recharged frequently, making them cumbersome
and inconvenient to use.
SUMMARY
[0012] In this patent disclosure, various embodiments of a
privacy-preserving embedded fall-detection vision system (which is
also referred to as the "embedded fall-detection system" or simply
the "embedded vision system" in this patent disclosure) including
various software and/or hardware modules for implementing various
vision-based and privacy-preserving fall-detection functionalities
are disclosed. Specifically, this embedded fall-detection system is
a standalone system that can include hardware modules such as one
or more cameras for capturing video images of one or more persons
being monitored for potential falls and one or more processors for
processing the captured video images. Moreover, this embedded
fall-detection system can include various software modules for
processing the captured video images and subsequently generating
fall-detection output including fall alarms/notifications based on
the captured video images. The disclosed embedded fall-detection
system can be implemented as a single-unit embedded fall-detection
vision sensor. For various fall detection applications, this
single-unit embedded fall-detection vision sensor can be installed
at a single fixed location for monitoring persons/individuals with
high falling risks, such as seniors, people with disabilities, or
people with certain illnesses.
[0013] Also in this patent disclosure, various embodiments of a
distributed privacy-preserving fall-detection system including: one
or multiple standalone embedded fall-detection vision sensors
implemented based on the disclosed embedded fall-detection system;
a server; and an associated mobile application (or "mobile app"),
all of which are coupled together through a network, are disclosed. In
some embodiments, this distributed fall-detection system can be
implemented as a multi-vision-sensor fall-detection system which is
composed of multiple standalone embedded fall-detection vision
sensors. The multiple standalone embedded fall-detection vision
sensors can be installed at multiple fixed locations different from
one another, wherein each of the multiple embedded fall-detection
vision sensors can include at least one camera for capturing video
images and various software and hardware modules for processing the
captured video images and generating corresponding fall-detection
output including fall alarms/notifications based on the captured
video images.
[0014] In various embodiments, the server in the disclosed
fall-detection system can be configured to collect and process
multiple sources of fall detection outputs generated by the
multiple standalone embedded fall-detection vision sensors, select
one source of fall-detection output among the multiple sources of
outputs, and subsequently transmit the selected source of
fall-detection output to the associated fall-detection mobile app
installed on one or more mobile devices. In various embodiments,
the server can be a cloud-based server or a local server. In
various embodiments, the server and the mobile app can also be used
to add and remove profiles within the multiple standalone embedded
fall-detection vision sensors for people to be monitored or being
monitored by the distributed fall-detection system. In such
embodiments, the server can be used to distribute information to
the multiple standalone embedded fall-detection vision sensors. In
some embodiments, the disclosed distributed fall-detection system
is composed of a single embedded fall-detection vision sensor
(instead of multiple embedded fall-detection vision sensors), the
server, and the mobile app.
[0015] In various embodiments, to preserve the privacy of the people
being monitored or captured by either the disclosed embedded
fall-detection system or the disclosed distributed fall-detection
system, all fall-detection-related computations on captured video
images are performed in-situ inside the embedded fall-detection
system or each of the standalone embedded fall-detection vision
sensors within the distributed fall-detection system. In some
embodiments, after processing the captured video images in-situ,
each embedded fall-detection vision sensor of the disclosed
distributed fall-detection system only transmits sanitized video
images and/or video clips (e.g., by transmitting only the
keypoints/skeleton/stick figure representations of each detected
person instead of the actual images of the detected person) to the
server of the distributed fall-detection system along with fall
alarms/notifications. This privacy-preserving feature of the
disclosed embedded fall-detection system can be enabled by the
recent developments of various powerful artificial intelligence
(AI) integrated circuit (IC) chips which can be easily integrated
with the disclosed embedded fall-detection system.
[0016] Also in this patent disclosure, various embodiments of a
video-based fall risk assessment system based on gait analysis, for
both clinical and in-home fall risk assessment, are disclosed. The
disclosed fall risk assessment system can include various software
modules for processing videos of a subject captured by cameras or
other forms of image/video sensors, and for subsequently generating
fall-risk-assessment results, including fall risk
warnings/notifications, based on the captured videos. The disclosed
fall risk assessment system can be integrated into the disclosed
embedded fall-detection system as a function module, both to make
independent fall risk assessments and to assist other modules within
the disclosed embedded fall-detection system in making
fall-detection decisions. Alternatively, the disclosed fall risk
assessment system can be implemented as a stand-alone
fall-risk-assessment system by including one or more cameras for
capturing videos of a monitored person, one or more processors for
processing the captured videos, and one or more Human Computer
Interaction (or "HCI") devices. The disclosed video-based fall risk
assessment system can be used to capture and analyze a given
subject's in-home daily gait activities, and also to assist the
subject or the caregiver in easily carrying out fall risk tests
under a controlled environment.
[0017] In another aspect, a video-based fall risk assessment system
is disclosed. During operation, this fall risk assessment system
can receive a sequence of video frames including a person being
monitored for fall risk assessment. The system next generates a
sequence of action labels for the sequence of video frames by, for
each video frame in the sequence of video frames: estimating a pose
of the person within the video frame; and classifying the estimated
pose as a given action among a set of predetermined actions. Next,
the system identifies a subset of action labels within the sequence
of action labels. The system next extracts a set of gait features
for the person from a subset of video frames within the sequence of
video frames corresponding to the subset of action labels.
Subsequently, the system analyzes the set of extracted gait
features to generate a fall risk assessment for the person. In some
embodiments, the sequence of video frames is captured during a
predetermined time period, such as an hour, a day, or a week.
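For illustration only, the Python sketch below outlines this processing flow. The callable parameters stand in for the pose-estimation and action-recognition modules described in this disclosure; the "walking" label, the single gait statistic, and the 0.05 threshold are hypothetical placeholders for exposition, not part of the claimed system.

```python
from typing import Callable, List, Sequence

def assess_fall_risk(
    frames: Sequence,             # decoded video frames
    estimate_pose: Callable,      # frame -> keypoints (assumed module)
    classify_action: Callable,    # keypoints -> action label (assumed module)
) -> str:
    # Step 1: generate one action label per video frame.
    labels: List[str] = [classify_action(estimate_pose(f)) for f in frames]

    # Step 2: identify the subset of frames labeled "walking".
    walking = [f for f, lbl in zip(frames, labels) if lbl == "walking"]

    # Step 3: extract gait features from the walking frames. A real
    # system would compute step count, cadence, step-duration variance,
    # body sway, etc.; here we use a single toy statistic.
    walking_ratio = len(walking) / max(len(frames), 1)

    # Step 4: analyze the extracted gait features to produce a fall
    # risk assessment (the threshold is purely illustrative).
    return "high-fall-risk" if walking_ratio < 0.05 else "normal"
```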
[0018] Other features and advantages of the present inventive
concept should be apparent from the following description, which
illustrates, by way of example, aspects of the present inventive
concept.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The structure and operation of the present disclosure will
be understood from a review of the following detailed description
and the accompanying drawings in which like reference numerals
refer to like parts and in which:
[0020] FIG. 1 illustrates a block diagram of the disclosed embedded
fall-detection system in accordance with some embodiments described
herein.
[0021] FIG. 2 illustrates a block diagram of the disclosed
distributed fall-detection system including one or multiple
embedded fall-detection vision sensors based on the embedded
fall-detection system of FIG. 1 in accordance with some embodiments
described herein.
[0022] FIG. 3 shows an exemplary skeleton diagram of a detected
person in a video image obtained by connecting 18 neighboring
keypoints with straight lines in accordance with some embodiments
described herein.
[0023] FIG. 4 shows a block diagram illustrating an exemplary
two-level action-recognition module for classifying actions based
on cropped images of a detected person in accordance with some
embodiments described herein.
[0024] FIG. 5 shows a fall-detection state transition diagram of the
disclosed state machine for predicting falls based on a set of
consecutive action labels of a detected person in accordance with
some embodiments described herein.
[0025] FIG. 6 presents a flowchart illustrating an exemplary
process for performing image-based fall detection in accordance
with some embodiments described herein.
[0026] FIG. 7 presents a flowchart illustrating an exemplary
process for adding a new profile for a person into the disclosed
fall-detection system in accordance with some embodiments described
herein.
[0027] FIG. 8 presents a flowchart illustrating an exemplary
process for removing an existing profile of a person from the
disclosed fall-detection system in accordance with some embodiments
described herein.
[0028] FIG. 9 presents a flowchart illustrating an exemplary
process for identifying a detected person by the disclosed embedded
fall-detection system in accordance with some embodiments described
herein.
[0029] FIG. 10 illustrates an exemplary hardware environment for
the disclosed embedded fall-detection system in accordance with
some embodiments described herein.
[0030] FIG. 11 shows an exemplary task scheduler for executing the
various fall-detection functionalities of the disclosed embedded
fall-detection system in accordance with some embodiments described
herein.
[0031] FIG. 12 illustrates an exemplary processing pipeline
comprising two task scheduler nodes based on the disclosed task
scheduler coupled in series in accordance with some embodiments
described herein.
[0032] FIG. 13 illustrates a block diagram of the disclosed fall
risk assessment system in accordance with some embodiments
described herein.
[0033] FIG. 14 presents a flowchart illustrating an exemplary
process for performing a video-based fall risk assessment in
accordance with some embodiments described herein.
DETAILED DESCRIPTION
[0034] The following description is presented to enable any person
skilled in the art to make and use the embodiments, and is provided
in the context of a particular application and its requirements.
Various modifications to the disclosed embodiments will be readily
apparent to those skilled in the art, and the general principles
defined herein may be applied to other embodiments and applications
without departing from the spirit and scope of the present
disclosure. Thus, the present invention is not limited to the
embodiments shown, but is to be accorded the widest scope
consistent with the principles and features disclosed herein.
Terminology
[0035] Throughout this patent disclosure, the terms "embedded
fall-detection vision system," "embedded fall-detection system,"
and "embedded vision system" are used interchangeably to refer to
the embedded fall-detection system 100 described in conjunction
with FIG. 1. The terms "embedded fall-detection vision sensor" and
"embedded vision sensor" are used interchangeably to refer to a
standalone fall-detection device/unit which integrates embedded
fall-detection system 100 inside a hardware environment. Moreover,
the term "distributed fall-detection system" refers to an overall
fall-detection system described in conjunction with FIG. 2 which
includes: one or more "embedded fall-detection vision sensors"
implemented based on the "embedded fall-detection system," a
server, and a mobile application.
Proposed Fall-Detection System Overview
[0036] An aging population is a problem faced by many countries.
Elderly people have a higher risk of falling, and a fall often leads
to serious medical consequences. Hence, it is desirable to provide
fall-detection systems and techniques to monitor and detect falls
for people with a high risk of falling. Furthermore, it is also
desirable to preserve the privacy of the people being
monitored.
[0037] In this patent disclosure, various embodiments of an
embedded privacy-preserving fall-detection vision system including
various software and/or hardware modules for implementing various
image-based and privacy-preserving fall-detection functionalities
are disclosed. In the discussions below, this embedded
fall-detection vision system is also referred to as the "embedded
fall-detection system" or the "embedded vision system." Note that
this embedded fall-detection system can operate as a standalone
fall-detection system to monitor and detect falls. Specifically,
this embedded fall-detection system can include hardware modules
such as one or more cameras for capturing video images of one or
more persons being monitored for potential falls and one or more
processors for processing the captured video images. Moreover, this
embedded fall-detection system can include various software modules
for processing the captured video images and subsequently
generating fall-detection output including fall
alarms/notifications based on the captured video images. The
disclosed embedded fall-detection system can be implemented as a
single-unit embedded fall-detection vision sensor. For various fall
detection applications, this single-unit embedded fall-detection
vision sensor can be installed at a single fixed location for
monitoring persons/individuals with high falling risks, such as
seniors, people with disabilities, or people with certain
illnesses. Moreover, in the discussions below, the term
"fall-detection engine" will be introduced to refer to the portion
of the embedded fall-detection system that only includes the
various computer software modules for implementing one or more
disclosed fall-detection techniques, but does not include any
hardware module such as a processor or a camera.
[0038] Also in this patent disclosure, various embodiments of a
distributed privacy-preserving fall-detection system including: one
or multiple standalone embedded fall-detection vision sensors
implemented based on the disclosed embedded fall-detection system;
a server; and an associated mobile application (or "mobile app"),
all of which are coupled together through a network, are disclosed. In
some embodiments, this distributed fall-detection system can be
implemented as a multi-vision-sensor fall-detection system which is
composed of multiple standalone embedded fall-detection vision
sensors. These multiple standalone embedded fall-detection vision
sensors can be installed at multiple fixed locations different from
one another, wherein each of the multiple embedded fall-detection
vision sensors can include at least one camera for capturing video
images and various software and hardware modules for processing the
captured video images and generating corresponding fall-detection
output including fall alarms/notifications based on the captured
video images.
[0039] In various embodiments, the server in the disclosed
distributed fall-detection system can be configured to collect and
process multiple sources of fall detection outputs generated by the
multiple standalone embedded fall-detection vision sensors, select
one source of fall-detection output among the multiple sources of
outputs, and subsequently transmit the selected source of
fall-detection output to the associated fall-detection mobile app
installed on one or more mobile devices. In various embodiments,
the server can be a cloud-based server or a local server. In
various embodiments, the server and the mobile app can also be used
to add and remove profiles within the multiple standalone embedded
fall-detection vision sensors for people to be monitored or being
monitored by the distributed fall-detection system. In such
embodiments, the server can be used to distribute information to
the multiple standalone embedded fall-detection vision sensors. In
some embodiments, the disclosed distributed fall-detection system
is composed of a single embedded fall-detection vision sensor
(instead of multiple embedded fall-detection vision sensors), the
server, and the mobile app.
[0040] In various embodiments, to preserve the privacy of the people
being monitored or captured by either the disclosed embedded
fall-detection system or the disclosed distributed fall-detection
system, all fall-detection-related computations on captured video
images are performed in-situ inside the embedded fall-detection
system or each of the standalone embedded fall-detection vision
sensors within the distributed fall-detection system. In some
embodiments, after processing the captured video images in-situ,
each embedded fall-detection vision sensor of the disclosed
distributed fall-detection system only transmits sanitized video
images and/or video clips (e.g., by transmitting only the
keypoints/skeleton/stick figure representations of each detected
person instead of the actual images of the detected person) to the
server of the distributed fall-detection system along with fall
alarms/notifications. This privacy-preserving feature of the
disclosed embedded fall-detection system can be enabled by the
recent developments of various powerful artificial intelligence
(AI) integrated circuit (IC) chips which can be easily integrated
with the disclosed embedded fall-detection system. One example of
such AI chips is the HiSilicon Hi3559A System on Chip (SoC), which
includes 2 ARM Cortex A73 CPUs, 3 ARM Cortex A53 CPUs, a dual-core
ARM Mali G71 GPU, a dual-core Neural Network Inference Acceleration
Engine (NNIE), and a quad-core DSP module. Note that this
particular SoC also includes built-in security, signature
verification, and tamper-proofing functionalities.
[0041] Note that various embodiments of the disclosed embedded
fall-detection system are based on implementing various
deep-learning-based fast neural networks while combining various
optimization techniques, such as network pruning, quantization, and
depth-wise convolution. As a result, the disclosed embedded
fall-detection system can perform a multitude of
deep-learning-based functionalities such as real-time
deep-learning-based pose estimation, action recognition, fall
detection, face detection, and face recognition. FIG. 1 illustrates
a block diagram of the disclosed embedded fall-detection system 100
in accordance with some embodiments described herein.
[0042] As can be seen in FIG. 1, embedded fall-detection system 100
includes a fall-detection engine 101 and a camera 102.
Fall-detection engine 101 further includes various fall-monitoring
and fall-detection functional modules including: a pose-estimation
module 106, an action-recognition module 108, a fall-detection
module 110, a scene-segmentation module 112, a face-detection
module 116, and a face-recognition module 118. However, other
embodiments of the disclosed embedded fall-detection system can
include additional functional modules or omit one or more of the
functional modules shown in embedded fall-detection system 100
without departing from the scope of the present disclosure.
Exemplary implementations of the various functional modules of
embedded fall-detection system 100 are described further below.
[0043] Embedded fall-detection system 100 can use camera 102 to
monitor human activities within a given space such as a room, a
house, a lobby, or a hallway, and to capture video images and/or
still images which can be used for fall analysis and prediction. In
some embodiments, when embedded fall-detection system 100 is
active, camera 102 generates and outputs video images 104 which can
includes video images of one or multiple persons present in the
monitored space. Fall-detection engine 101 receives video images
104 as input and subsequently processes input video images 104 and
makes fall/non-fall predictions/decisions based on the processed
video images 104. Embedded fall-detection system 100 can generate
fall-detection output 140 including fall alarms/notifications 140-1
and sanitized video clips 140-2 when human falls are detected.
However, embedded fall-detection system 100 can also output
activities of daily living (ADLs) statistics for a monitored person
even when no fall is detected. Note that camera 102 does not have
to be a part of embedded fall-detection system 100 but rather a
part of an overall embedded fall-detection device referred to as
the "embedded fall-detection vision sensor" below. When embedded
fall-detection system 100 only includes fall-detection engine 101
without any additional hardware component, embedded fall-detection
system 100 can be implemented entirely in computer software.
[0044] In some embodiments, embedded fall-detection system 100 of
FIG. 1 can be implemented as an embedded fall-detection vision
sensor (also referred to as an "embedded vision sensor"
hereinafter). In these embodiments, various functional modules of
the fall-detection engine 101 (i.e., pose-estimation module 106,
action-recognition module 108, fall-detection module 110, scene
segmentation module 112, face-detection module 116, and
face-recognition module 118) are integrated into the embedded
fall-detection vision sensor. This embedded fall-detection vision
sensor can use one or more cameras, such as camera 102, to monitor a
space such as a room, a house, a lobby, or a hallway to detect
falls, and use fall-detection engine 101 to process captured video
images and to generate fall-detection output 140 including both
fall alarms/notifications 140-1 and sanitized video clips 140-2.
More specifically, this embedded fall-detection vision sensor can
include one or more memories for storing instructions for
implementing fall-detection engine 101, one or more processors
including CPUs and/or neural processing units (NPUs) for executing
the instructions from the one or more memories to implement the
various functional modules of fall-detection engine 101. Moreover,
this embedded fall-detection vision sensor can also include one or
more cameras, one or more sensors, and a network interface, among
others. When implemented as a single-unit fall-detection and
monitoring device, this embedded fall-detection vision sensor will
also include a housing/enclosure, one or more attachment
mechanisms, and possibly a stand/base. More detailed
implementations of an embedded fall-detection vision sensor are
described below in conjunction with FIG. 10.
[0045] FIG. 2 illustrates a block diagram of a disclosed
distributed fall-detection system 200 including one or multiple
embedded fall-detection vision sensors based on embedded
fall-detection system 100 of FIG. 1 in accordance with some
embodiments described herein. More specifically, each of the one or
multiple embedded fall-detection vision sensors 202-1, 202-2, . . .
, and 202-N is a standalone fall-detection unit implemented based
on the above-described embedded fall-detection system 100 of FIG.
1. In other words, each embedded fall-detection vision sensor 202
within distributed fall-detection system 200 includes embedded
fall-detection system 100 or otherwise integrates embedded
fall-detection system 100 in its entirety. Note that each embedded
fall-detection vision sensor 202 can be configured to perform
independent fall-monitoring and fall-detection functionalities. In
some embodiments, distributed fall-detection system 200 includes
only one embedded fall-detection vision sensor 202-1 (i.e., N=1).
In these embodiments, distributed fall-detection system 200 can
include just one camera for capturing video images of one or more
persons being monitored and just one fall detection engine 101 for
processing the captured video images to detect falls for the one or
more persons.
[0046] In some other embodiments, distributed fall-detection system
200 includes more than one embedded fall-detection vision sensor
(i.e., N>1). Note that because a single camera can have an
associated blind zone, it can be difficult to use such a
single-camera embedded fall-detection system to monitor certain
large areas. Hence, for fall-monitoring and fall-detection in a
large area, distributed fall-detection system 200 including
multiple embedded fall-detection vision sensors 202 installed at
multiple locations within the large area can be used to eliminate
such blind zones, thereby improving the robustness of the overall
fall-detection performance. As mentioned above, each of the
multiple embedded fall-detection vision sensors 202-1, 202-2, . . .
, and 202-N (N>1) is a standalone fall-detection unit
implemented based on embedded fall-detection system 100 of FIG.
1.
[0047] Note that each of the multiple embedded vision sensors 202
is coupled to server 204 through network 220. In various
embodiments, server 204 can be a cloud-based server or a local
server. Server 204 itself is further coupled to a number of mobile
devices 206, 208, and 210, which can be monitored by caregivers and/or
medical personnel, via network 220. Server 204 can be
communicatively coupled to a client application, such as a
fall-detection mobile app 212 (or simply "mobile app 212")
installed on each of the mobile devices 206, 208, and 210. In some
embodiments, mobile app 212 on a given mobile device is configured
to receive, from server 204, fall alarms/notifications along with
sanitized video clips outputted by the multiple embedded vision
sensors 202-1, 202-2, . . . , and 202-N, via network 220. In some
embodiments, server 204 can also host a multi-camera management
application which is configured to divide each monitored area into
a set of zones, and assign one or more embedded vision sensors
202-1, 202-2, . . . , and 202-N to monitor each zone in the set of
zones.
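As a sketch of the zone-management idea above, the server's multi-camera management application might maintain a simple mapping from zones to the embedded vision sensors assigned to cover them. All zone names and sensor identifiers below are invented for illustration.

```python
# Hypothetical zone-to-sensor assignment maintained by server 204's
# multi-camera management application; names are illustrative only.
zone_assignments = {
    "lobby":   ["sensor-202-1", "sensor-202-2"],  # overlapping coverage
    "hallway": ["sensor-202-2", "sensor-202-3"],
    "room-a":  ["sensor-202-3"],
}

def sensors_for_zone(zone_id: str) -> list:
    """Return the embedded vision sensors assigned to monitor a zone."""
    return zone_assignments.get(zone_id, [])
```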
[0048] As mentioned above, server 204 can be configured to divide a
large monitored area into a set of zones, wherein each zone in the
set of zones can be covered by two or more embedded vision sensors
202-1, 202-2, . . . , and 202-N (N>1). Moreover, for each zone
in the set of zones, server 204 can be configured to "fuse" or
otherwise combine fall-detection outputs from two or more embedded
vision sensors 202 covering the given zone. For example, if a
monitored person's identity cannot be identified or determined
based on fall-detection output from a first embedded vision sensor
positioned at a bad angle, that person's identity may be identified
or determined based on fall-detection output from a second embedded
fall-detection vision sensor positioned at a good angle. Generally
speaking, server 204 can combine two or more sources of
fall-detection outputs from two or more embedded vision sensors
202-1, 202-2, . . . , and 202-N and make a collective
fall-detection decision on a given person based on the two or more
sources of fall-detection outputs.
[0049] More specifically, if a given person's fall in a monitored
area is detected by two or more embedded vision sensors 202, each
of the two or more embedded vision sensors can send a respective
fall alarm/notification 140-1 and a sanitized video clip 140-2
(e.g., using a skeleton/stick-figure representation of the detected
person instead of the actual image of the detected person)
depicting the falling process to server 204. In some embodiments,
the sanitized video clip includes video images buffered for a
predetermined amount of time (e.g., 10-15 seconds) immediately
before the fall is detected. Hence, the video clip can include a
sequence of video images depicting the entire process of
falling.
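A minimal sketch of this pre-fall buffering is shown below; the frame rate and the 15-second window are assumptions chosen to match the 10-15 second example in the text.

```python
import collections

FPS = 10             # assumed capture rate (illustrative)
BUFFER_SECONDS = 15  # matches the 10-15 second example above

# Ring buffer: old frames are evicted automatically as new ones arrive.
frame_buffer = collections.deque(maxlen=FPS * BUFFER_SECONDS)

def on_new_frame(sanitized_frame):
    """Buffer every sanitized frame as it is produced."""
    frame_buffer.append(sanitized_frame)

def on_fall_detected():
    """Snapshot the buffered frames: the clip sent with the fall alarm
    already contains the lead-up to (and moment of) the fall."""
    return list(frame_buffer)
```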
[0050] Note that when server 204 receives multiple sources of fall
detection outputs from the two or more embedded vision sensors 202,
server 204 is configured to determine if the multiple sources of
fall detection outputs belong to the same person. If so, server 204
can then select one source of fall-detection output among the
multiple sources of fall-detection outputs having the highest
confidence level/score. In some embodiments, this confidence score can
be embedded in each source of the fall-detection output. As will be
described further below, both pose-estimation module 106 and
action-recognition module 108 in embedded fall-detection system 100
can generate probabilities for the estimated poses and the
classified actions for each detected person. As such, a confidence
score of a generated fall alarm can be determined based on these
probability values. Hence, server 204 can select the source of data
among the multiple sources associated with the highest confidence
score and subsequently transmit the selected source of
fall-detection output including the associated fall
alarm/notification and associated sanitized video clip to
fall-detection mobile app 212 installed on mobile devices 206-210.
However, when server 204 receives only one source of fall detection
output from a single vision sensor among the two or more embedded
vision sensors 202, server 204 can directly transmit the received
single source of fall-detection output to fall-detection mobile app
212 installed on mobile devices 206-210.
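The selection step can be sketched as follows; the dictionary layout and field names are assumptions for illustration, with the confidence score embedded in each source of output as described above.

```python
def select_best_output(outputs):
    """Given fall-detection outputs from different vision sensors that
    were determined to describe the same person, keep the one with the
    highest embedded confidence score."""
    return max(outputs, key=lambda out: out["confidence"])

# Example: three sensors report the same fall with different confidence.
reports = [
    {"sensor": "202-1", "confidence": 0.62, "clip": b"..."},
    {"sensor": "202-2", "confidence": 0.91, "clip": b"..."},
    {"sensor": "202-3", "confidence": 0.47, "clip": b"..."},
]
best = select_best_output(reports)  # the sensor 202-2 report is forwarded
```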
[0051] In some embodiments, after receiving the fall-detection
output from server 204, mobile app 212 can play the received
sanitized video clip on one or more mobile devices 206-210 of one
or more caregivers. The disclosed mobile app 212 can also be
configured to assist in adding or removing profiles of persons to be
tracked by the disclosed distributed fall-detection system 200. In
some embodiments, a profile of a person can include the person's
identity such as person's name, as well as profile photos of the
person. In some embodiments, prior to performing fall detection on
a person, a profile of the person can be constructed and stored
both on server 204 and on each embedded fall-detection vision
sensor 202. For example, mobile app 212 can be used to construct a
new profile of the person by combining the identity of the person
with one or multiple profile photos of the person. In some
embodiments, mobile app 212 can be used to take the one or multiple
profile photos of the person. Mobile app 212 can then send the
profile of the person including the one or multiple profile photos
and the person's identity, such as the name of the person to server
204.
[0052] Next, at server 204, a profile-management program can be
used to generate and assign a unique person-ID for the person
(e.g., based on the unique identity of the person) and associate
the person-ID with the one or multiple profile photos. In some
embodiments, the person-ID of the person generated by server 204
can be a unique numerical value (e.g., an integer value) without
any identity information of the person. Hence, the disclosed
person-ID can facilitate protecting the privacy of the person.
Server 204 can then send the newly generated person-ID of the
person along with the profile photos of the person to embedded
fall-detection system 100, which maintains a person-ID dictionary.
Next, embedded fall-detection system 100 can generate a new entry
for the person based on the received person-ID and the profile
photos, and add this new entry in the person-ID dictionary.
[0053] In some embodiments, server 204 can be a single computing
device such as a computer server. In other embodiments, server 204
can represent more than one computing device working together to
perform the actions of a server computer, e.g., as a cloud server.
Server 204 can include one or more processors and a data storage
device. These one or more processors can execute computer
instructions stored in the data storage device to perform the
various disclosed functions of server 204. Network 220 can include,
for example, any one or more of a personal area network (PAN), a
local area network (LAN), a campus area network (CAN), a
metropolitan area network (MAN), a wide area network (WAN), a
broadband network (BBN), the Internet, and the like. Furthermore,
network 220 can include, but is not limited to, any one or more of
the following network topologies, including a bus network, a star
network, a ring network, a mesh network, a star-bus network, tree
or hierarchical network, and the like.
[0054] Referring back to FIG. 1 in conjunction with FIG. 2,
note that when a person's fall is detected by embedded
fall-detection system 100, the embedded fall-detection system can
send fall alarm/notification 140-1 along with sanitized video clip
140-2 depicting the falling action to a server, such as server 204
in FIG. 2. Specifically, this sanitized video clip can use a
keypoints/skeleton/stick-figure representation of the detected
person to replace the actual image of the detected person in each
video image. In some embodiments, the sanitized video clip 140-2
can include video images buffered for a predetermined amount of
time (e.g., 10-15 seconds) immediately before the fall is detected.
Hence, sanitized video clip 140-2 can include a sequence of video
images depicting the entire process of falling.
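One way to realize the sanitized clip, sketched below under an assumed message layout, is to transmit only the per-frame keypoint coordinates, from which a stick figure can be rendered at the receiving end; none of the person's actual pixels leave the sensor.

```python
import json

def sanitize_clip(keypoints_per_frame):
    """keypoints_per_frame: one {keypoint_name: (x, y)} dict per buffered
    frame. Only these coordinates are transmitted with the fall alarm;
    the original video pixels never leave the vision sensor."""
    return json.dumps({
        "type": "sanitized_clip",
        "frames": [
            {name: list(xy) for name, xy in frame_kp.items()}
            for frame_kp in keypoints_per_frame
        ],
    })
```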
[0055] In some embodiments, embedded fall-detection system 100 can
track the detected person through the sequence of video images
using face-detection module 116 and face-recognition module 118.
To facilitate tracking each unique person through a sequence of
video frames, embedded fall-detection system 100 can identify and
subsequently associate each detected person with a corresponding
person-ID stored in the above-described person-ID dictionary
(described in more detail below). Embedded fall-detection system
100 can then transmit the identified person-ID along with other
fall-detection data associated with the detected person to the
server. After receiving the fall-detection output of the detected
person including the fall alarm/notification 140-1, the associated
sanitized video clip 140-2, and the associated person-ID 136 (if
the person is identified), the server, such as server 204 can
transmit the above fall-detection data to an associated
fall-detection mobile app (e.g., mobile app 212) installed on one
or more mobile devices (e.g., mobile device 206-210).
[0056] Note that embedded fall-detection system 100 can perform
fall detection on a person with or without an associated person-ID.
In other words, once a person is detected in the input video images
104, embedded fall-detection system 100 will perform fall detection
on the detected person and generate fall alarms/notifications when
necessary, even if the detected person does not have an established
person-ID or the system fails to identify the detected person. As
mentioned above and will be described in more detail below,
embedded fall-detection system 100 can include a person-ID
dictionary which stores a set of established person-IDs of a group
of people that can be tracked by embedded fall-detection system
100. For example, this person-ID dictionary (i.e., person-ID
dictionary 150) can be integrated with face-recognition module
118.
[0057] In some embodiments, if the detected person does not match
any stored person-ID in person-ID dictionary 150, then embedded
fall-detection system 100 can generate and output the fall
alarm/notification 140-1 along with an "unknown person" tag.
However, if embedded fall-detection system 100 can successfully
match the detected person to an established person-ID in person-ID
dictionary 150, then embedded fall-detection system 100 can
generate and transmit fall alarm/notification 140-1 along with the
identified person-ID 136 of the detected person to the server, such
as server 204. After receiving fall alarm/notification 140-1 with
the associated person-ID, server 204 can translate the person-ID to
an actual identity of the detected person, such as the name of the
person, and associate the fall alarm/notification with the actual
identity of the detected person. Server 204 can then transmit the
selected fall alarm/notification and the identity of the detected
person to mobile app 212.
[0058] We now describe each of the functional modules of
fall-detection engine 101 within the disclosed embedded
fall-detection system 100 in more detail below.
Pose-Estimation Module
[0059] In some embodiments, embedded fall-detection system 100
monitors human motions or actions and predicts falls by first
estimating the pose of each person captured in a given video
image/frame using pose-estimation module 106 in FIG. 1. As can be
seen in FIG. 1, pose-estimation module 106 can receive and process
input video images/frames 104 prior to action-recognition module 108
and fall-detection module 110. Pose-estimation module 106 next
identifies humans captured in the video images 104. For each
detected person, pose-estimation module 106 subsequently determines
a pose for the detected person. In some embodiments,
pose-estimation module 106 can first identify a set of human
keypoints 122 (or simply "human keypoints 122" or "keypoints 122")
for the detected person within an input video image 104, and then
represent a pose of the detected person using the configuration
and/or localization of the set of keypoints, wherein the set of
keypoints 122 can include, but is not limited to: the eyes, the
nose, the ears, the chest, the shoulders, the elbows, the wrists,
the knees, the hip joints, and the ankles of the person. In some
embodiments, instead of using a full set of keypoints, a simplified
set of keypoints 122 can include just the head, the shoulders, the
arms, and the legs of the detected person. A person of ordinary
skill in the art can easily appreciate that a different pose of the
detected person can be represented by a different geometric
configuration of the set of keypoints 122.
[0060] To implement the above-described functions of
pose-estimation module 106 in FIG. 1, various CNN-based techniques
for performing human pose estimation can be used. In some
embodiments, "bottom-up"-based pose-estimation techniques, such as
"OpenPose" (described in "Realtime Multi-Person 2D Pose Estimation
Using Part Affinity Fields," by Cao et al., CVPR 2017) can be used.
These pose-estimation techniques first use a strong CNN-based
feature extractor to extract visual features from an input image,
and then use a two-branch multi-stage CNN to detect various human
keypoints within the input image. Next, the pose-estimation
techniques perform a set of bipartite matching operations to
"assemble" or connect the detected keypoints into full-body poses
for some or all people detected in the image. This type of
bottom-up pose-estimation technique can have both high performance
and low complexity, and can also estimate a "probability" of each
detected keypoint. Here the probability of a detected keypoint
represents a confidence score assigned to the detected keypoint by
the pose-estimation model. Typically, under more difficult
detection conditions such as poor lighting, confusing background,
or obstacles in front of a detected person, the confidence score or
the probability of each detected keypoint will be relatively low.
For example, if a person wears clothing having very similar color
to the background (e.g., white shirt against a white wall), it
would be more difficult for the pose-detection algorithm to
identify the correct keypoints and their associated locations. In
this scenario, the pose-detection algorithm will generate lower
probabilities for the uncertain keypoint detections.
[0061] In some embodiments, a skeleton diagram of a detected person
in an input video image 104 can be obtained by connecting
neighboring keypoints representing the detected person with
straight lines. FIG. 3 shows an exemplary skeleton diagram 300 of a
detected person in a video image obtained by connecting 18
keypoints with straight lines in accordance with some embodiments
described herein. As can be seen in FIG. 3, skeleton diagram 300
comprises 18 keypoints corresponding to the two eyes 302 and 304,
two ears 306 and 308, nose 310, neck 312, two shoulders 314 and
316, two elbows 318 and 320, two wrists 322 and 324, two hips 326
and 328, two knees 330 and 332, and two ankles 334 and 336 of the
detected person, and the resulting skeleton diagram 300 includes 17
line segments connecting these keypoints.
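For illustration only, the following minimal Python sketch shows one way to assemble such a skeleton diagram. The keypoint ordering and the 17 segment pairs below follow the common 18-keypoint OpenPose convention and are assumptions for this sketch, not necessarily the exact indexing used by pose-estimation module 106.

```python
# A minimal sketch of drawing a skeleton diagram from 18 detected
# keypoints; the ordering and segment pairs are illustrative assumptions.
import cv2

# 17 line segments connecting neighboring keypoints (index pairs),
# assuming the common OpenPose ordering: 0 nose, 1 neck, 2-4 right arm,
# 5-7 left arm, 8-10 right leg, 11-13 left leg, 14-15 eyes, 16-17 ears.
SKELETON_SEGMENTS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
    (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
    (0, 14), (0, 15), (14, 16), (15, 17),
]

def draw_skeleton(image, keypoints, confidences, min_conf=0.3):
    """keypoints: (18, 2) array of (x, y) pixels; confidences: (18,)
    per-keypoint probabilities from the pose-estimation model."""
    for i, j in SKELETON_SEGMENTS:
        # Skip a segment if either endpoint is missing or uncertain.
        if confidences[i] < min_conf or confidences[j] < min_conf:
            continue
        p1 = (int(keypoints[i][0]), int(keypoints[i][1]))
        p2 = (int(keypoints[j][0]), int(keypoints[j][1]))
        cv2.line(image, p1, p2, color=(0, 255, 0), thickness=2)
    return image
```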
[0062] In some embodiments, to allow bottom-up pose-estimation
models to run in real-time with optimized performance on embedded
systems/devices such as embedded fall-detection system 100, the
proposed pose-estimation module 106 implements a bottom-up
pose-estimation framework with a number of improvements to the
existing framework. Some of these modifications/improvements
include: [0063] Replacing the commonly used complex VGG16 network
(described in "Very Deep Convolutional Networks for Large-Scale
Image Recognition," Simonyan et al., arXiv:1409.1556) with a faster
VGG16.times.4 network (described in "Channel Pruning for
Accelerating Very Deep Neural Networks," He et al., ICCV 2017 and
"AMC: AutoML for Model Compression and Acceleration on Mobile
Devices," He et al., ECCV 2018) as the backbone/feature extractor,
which has an inference speed 4× faster than the VGG16
network. Note that the term "backbone" herein refers to the neural
network which receives an input image and extracts image features
for use in subsequent deep-learning tasks such as classification,
regression, and segmentation. This speed-up is largely due to
performing channel pruning, i.e., reducing the width of the feature
map, which in turn shrinks the network into a thinner one; [0064]
Reducing the number of stages in the two-branch multi-stage CNN;
[0065] Reducing the filter size of each convolution layer to 3×3 in
the multiple stages. Although the existing network and the modified
network have substantially the same receptive field size, the
modified network can be executed much more efficiently; [0066]
Quantizing the network parameters and running the network inference in
8-bit integer precision instead of the typical 32-bit
floating-point precision. This modification not only reduces the
memory usage and the frequency of memory access, it also
significantly speeds up the arithmetic computations, making it
particularly useful and desirable for resource-limited embedded
system applications; and [0067] During the network training,
applying data augmentation to improve the pose-estimation
performance for different image-capturing angles. Note that as a
person falls onto the floor, the position of the person's body,
which can be represented by a line connecting the person's head and
the torso, can take on any angle between 0 and 360 degrees within a
video frame that captures the person's body. In some embodiments,
to train pose-estimation module 106 so that the trained
pose-estimation module 106 can recognize different scenarios/poses
of a person's fall corresponding to the different possible angles
of the person being captured in a video frame, a training image set
can be prepared to include images of falls that simulate various
capturing angles between 0 and 360 degrees. The training image set
can then be used to train pose-estimation module 106 to improve the
pose-estimation performance for different image-capturing angles.
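As one possible realization of this augmentation, the following Python sketch rotates a training image and its keypoint annotations together by a random angle between 0 and 360 degrees; the function names and the OpenCV-based implementation are illustrative assumptions.

```python
# A minimal sketch of 0-360 degree rotation augmentation for simulating
# arbitrary capture angles of a fallen body.
import numpy as np
import cv2

def rotate_sample(image, keypoints, angle_deg):
    """Rotate a training image and its (N, 2) keypoint array together."""
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h))
    ones = np.ones((keypoints.shape[0], 1))
    pts = np.hstack([keypoints, ones])   # (N, 3) homogeneous coordinates
    return rotated, pts @ M.T            # keypoints mapped by same affine

def augment_fall_pose(image, keypoints, rng=None):
    """Simulate an arbitrary 0-360 degree capture angle of a fallen body."""
    rng = rng or np.random.default_rng()
    return rotate_sample(image, keypoints, rng.uniform(0.0, 360.0))
```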
[0068] After making the above modifications/improvements to the
existing bottom-up pose-estimation technique and implementing the
modified network in pose-estimation module 106, it is observed that
the inference time of the proposed pose-estimation technique
implemented on a Hi3559A-based embedded platform can be reduced from
~550 ms to ~86 ms when processing an input image size of 656×368
pixels.
[0069] In some embodiments, after locating human keypoints 122 of a
detected person in an input video image 104, the full image of the
detected person can be cropped out from input video image 104 by
forming a bounding box around the set of keypoints 122 and the
associated skeleton representation of the detected person.
[0070] A person of ordinary skill in the art will appreciate that,
for a sequence of video frames of a captured video received by
pose-estimation module 106, pose-estimation module 106 is
configured to extract and subsequently output a corresponding
sequence of estimated poses for a detected person (assuming the
detected person remains in the captured video through the sequence
of video frames), wherein each estimated pose in the sequence of
estimated poses corresponding to a given video frame in the
sequence of video frames can be represented by a corresponding set
of estimated keypoints 122. Note that for various applications
using pose-estimation module 106, including both fall-detection
applications and later-described fall-risk assessment applications,
it is generally desirable to maintain pose-estimation consistency
between consecutive video frames of a captured video. However, as
the detected person moves (e.g., walks) through the sequence of
video frames, viewing-angle changes, illumination variations, and
occlusions arising from the human motion can cause pose-estimation
errors and inconsistencies between consecutive video frames, which
can further lead to unstable poses that appear to vibrate visually
from frame to frame.
[0071] In some embodiments, after extracting poses in a number of
video frames, to better capture the effect of body movements and
minimize pose-estimation errors and noise, additional "filtering"
of the extracted poses over two or more consecutive frames can be
applied. In some embodiments, the Kalman filtering technique
(described in "A New Approach to Linear Filtering and Prediction
Problems," Kalman, Journal of Basic Engineering, vol. 82, no. 1,
pp. 35-45, doi:10.1115/1.3662552) can be used. Generally speaking,
to apply the Kalman filtering technique, a system model needs to be
established. For the above-described keypoints technique, we can
assume that the keypoints in the set of keypoints 122 are
independent of one another across a sequence of video frames. Next,
for each keypoint in the set of keypoints 122, a system model can be
constructed for the keypoint based on Newton's laws of motion. Next,
for a given
video frame, the system model of each keypoint can use a series of
position and velocity measurements observed over previous video
frames to make a prediction of the current location of the
keypoint. The predicted location of the keypoint can then be used
to adjust the estimated current location generated by the CNN-based
technique and output the filtered and updated keypoint location. In
this manner, a "filtered pose" of the detected person for a given
video frame can be generated as the ensemble of the set of filtered
keypoint locations for the set of keypoints 122. Note that the
filtered poses are generally more stable and statistically more
accurate, which can improve the accuracies and reliabilities in the
subsequent data processing. Note that the above-described
pose-filtering technique can be implemented on and integrated with
pose-estimation module 106. As mentioned above, for each detected
person in a sequence of video frames/video clip, pose-estimation
module 106 can generate a sequence of estimated poses, wherein each
estimated pose in the sequence represents the body configuration and
location of the detected person in a corresponding video frame.
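The following minimal Python sketch illustrates such per-keypoint Kalman filtering under a constant-velocity (Newtonian) motion model; the noise parameters are illustrative assumptions rather than tuned values.

```python
# A minimal sketch of per-keypoint Kalman filtering for pose smoothing.
import numpy as np

class KeypointKalmanFilter:
    """Constant-velocity Kalman filter for one keypoint's (x, y) track."""

    def __init__(self, process_noise=1e-2, measurement_noise=1.0):
        self.F = np.eye(4)                    # state: [x, y, vx, vy]
        self.F[0, 2] = self.F[1, 3] = 1.0     # x += vx, y += vy per frame
        self.H = np.eye(2, 4)                 # we only observe (x, y)
        self.Q = process_noise * np.eye(4)
        self.R = measurement_noise * np.eye(2)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def update(self, measured_xy):
        # Predict the keypoint's current location from previous frames.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct the CNN's noisy estimate with the prediction.
        z = np.asarray(measured_xy, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                                  # filtered (x, y)

# One independent filter per keypoint yields the "filtered pose".
filters = [KeypointKalmanFilter() for _ in range(18)]
```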
Action-Recognition Module
[0072] Referring back to FIG. 1, note that pose-estimation module
106 is coupled to action-recognition module 108, which is
configured to receive the outputs from the pose estimation module.
In some embodiments, the outputs from pose-estimation module 106
can include detected human keypoints 122, the associated skeleton
diagram (also referred to as the "stick figure diagram"
throughout), and a two-dimensional (2-D) image 132 of the detected
person cropped out from original video image 104 based on the
detected keypoints 122 (also referred to as "cropped image 132" of
the detected person). Action-recognition module 108 is further
configured to predict, based on the outputs from pose-estimation
module 106, what type of action or activity the detected person is
associated with. For example, action-recognition module 108 can
include an action classifier 128 configured to classify each
detected person as being in one of a set of pre-defined actions,
referred to as action label/classification 124 for the detected
person. In some embodiments, action classifier 128 can be
configured to use only cropped image 132 of the detected person to
classify the action for the detected person. In some other
embodiments, action classifier 128 can be configured to use only
the human keypoints 122 of the detected person to classify the
action for the detected person. Note that using cropped image 132
to classify the action for the detected person typically can
achieve more accurate results than using only human keypoints 122
to classify the action for the detected person. In still other
embodiments, action classifier 128 can be configured to use the
combined data of cropped image 132 and human keypoints 122 of the
detected person to classify the action for the detected person.
[0073] More specifically, cropped image 132 of the detected person
and/or the set of human keypoints 122 of the detected person can be
fed into action classifier 128 configured to predict the
probability of the detected person being in a given action among a
set of pre-defined actions related to the person's state of daily
living, and subsequently classify the detected person to one of
these pre-defined actions based on the set of probabilities
corresponding to the set of pre-defined actions. For example, for
fall-monitoring and fall-detection applications, an exemplary set
of pre-defined actions of interest can include the following five
actions: (1) standing; (2) sitting; (3) bending; (4) struggling;
and (5) lying down. In some embodiments, a CNN-based architecture
can be used to construct such an action classifier. Note that among
these five pre-defined actions, the first three actions are
generally considered as normal actions, whereas the last two
actions are generally considered as dangerous actions indicative of
a fall. In some embodiments, to perform this action classification
in action-recognition module 108, 5 classes of data are collected
based on the above-described 5 types of actions, which can then be
used to train a neural network to classify the 5 types of
actions.
[0074] In some embodiments, to improve prediction accuracy, action
classifier 128 can be configured to implement a two-level action
recognition technique based on using CNN architectures. FIG. 4
shows a block diagram illustrating an exemplary two-level
action-recognition module 400 for classifying actions based on
cropped images of the detected person in accordance with some
embodiments described herein. However, as mentioned above, other
embodiments of the disclosed action-recognition module can also use
the human keypoints 122 instead of cropped image 132, or the
combination of cropped image 132 and human keypoints 122 as inputs
to the action classifiers.
[0075] As can be seen in FIG. 4, in the first level of action
recognition, a first CNN module 404 receives a cropped image 132
and uses a binary classifier (not shown) to generate a "fall"
prediction 406 and a "normal" (i.e., non-fall) prediction 408 for
the detected person in input image 132. Note that each of the fall
prediction 406 and normal prediction 408 is associated with a
category of different actions. Next, in the second level of
action-recognition module 400, two more CNNs 410 and 412 are
employed and configured to further characterize each of the binary
predictions 406 and 408 into a more specific action in the
associated category of actions.
[0076] More specifically, CNN 410 can further classify a fall
prediction 406 into a set of actions related to a fall. In the
embodiment shown in FIG. 4, these fall actions can include a
"lying" action 414 and a "struggling" action 416. However, other
embodiments of action-recognition module 400 can include additional
actions or a different set of fall actions as the possible outputs
of CNN 410. Separately, CNN 412 can further classify a normal
prediction 408 into a set of actions related to a non-fall
condition. In the embodiment shown in FIG. 4, these normal actions
can include a "standing" action 418, a "sitting in chair" action
420, a "sitting on floor" action 422, a "bending" action 424, and a
"squatting" action 426. However, other embodiments of
action-recognition module 400 can include additional actions or a
different set of non-fall actions as the possible outputs of CNN
412.
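The two-level routing of FIG. 4 can be sketched in Python as follows. The tiny placeholder CNNs and the 96×96 crop size are assumptions standing in for the actual networks (the SqueezeNet-class classifiers are discussed below), and class index 0 of the first level is assumed to be the "fall" category.

```python
# A minimal sketch of two-level action recognition with placeholder CNNs.
import torch
import torch.nn as nn

def tiny_cnn(num_classes):
    """Placeholder classifier over a 3x96x96 cropped person image."""
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

level1 = tiny_cnn(2)      # first level: "fall" vs. "normal" category
fall_net = tiny_cnn(2)    # second level: lying, struggling
normal_net = tiny_cnn(5)  # second level: five normal actions

FALL_LABELS = ["lying", "struggling"]
NORMAL_LABELS = ["standing", "sitting in chair", "sitting on floor",
                 "bending", "squatting"]

def classify(cropped):
    """cropped: (1, 3, 96, 96) tensor holding one detected person."""
    p1 = torch.softmax(level1(cropped), dim=1)[0]
    if p1[0] > p1[1]:                                  # "fall" category
        p2 = torch.softmax(fall_net(cropped), dim=1)[0]
        return FALL_LABELS[int(p2.argmax())], float(p1[0])
    p2 = torch.softmax(normal_net(cropped), dim=1)[0]  # "normal" category
    return NORMAL_LABELS[int(p2.argmax())], float(p1[1])
```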
[0077] Note that in either the disclosed single-level
action-recognition technique or the two-level action-recognition
technique of FIG. 4, various fast CNN architectures can be used to
classify the actions of people detected by the embedded vision
system. In one embodiment, a SqueezeNet architecture (described in
"SqueezeNet: AlexNet-level accuracy with 50.times. fewer parameters
and <0.5 MB model size," Iandola, arXiv:1602.07360, 2016) can be
used. In some embodiments, to implement the SqueezeNet architecture
in the disclosed action-recognition module 108, one can modify the
number of output classes in the existing CNN networks based on the
number of pre-defined actions/activities to be detected while
retaining the configurations of the neural networks.
[0078] For example, in the above-described single-level
action-recognition technique including 5 classes of actions, the
number of output classes in the SqueezeNet network can be reduced
to 5 while retaining the same neural network configuration.
However, to implement the disclosed action-recognition techniques
for detecting greater or fewer numbers of actions of interest, one
can easily modify the SqueezeNet network with more or fewer output
classes.
[0079] Note that the disclosed action-recognition techniques
implemented on action-recognition module 108 are generally applied
to individual video frames to generate an action classification for
each detected person in each processed video frame. Meanwhile, the
disclosed action-recognition techniques can be continuously applied
to a sequence of video frames on a frame-by-frame basis, and can
continue to generate updated action classifications for each
detected person based on the newly processed frames. Hence, in some
embodiments, the disclosed action recognition techniques may be
referred to as frame-level action-recognition techniques, while
action-recognition module 108 may be referred to as frame-level
action-recognition module 108.
Scene-Segmentation Module
[0080] In some embodiments, to robustly and reliably detect a fall
action, especially falling from a bed or a sofa, the disclosed
embedded fall-detection system 100 is configured to distinguish
different types of lying and struggling actions of a detected
person. For example, lying in bed or on a sofa would generally be
classified as normal human actions (i.e., non-fall actions),
whereas lying or struggling on the floor would be classified as
dangerous actions (i.e., fall actions). In some embodiments, the
ability to distinguish different types of lying and struggling
actions of a detected person can be achieved by scene-segmentation
module 112, which is configured to process input video images 104
and extract room layout information 126.
[0081] More specifically, room layout information 126 can include
locations of dangerous regions/objects such as a floor and a
carpet. In some embodiments, if an identified lying action of the
detected person is determined to be within an identified dangerous
region, such as a floor region, it is reasonable to classify the
identified lying action as a dangerous action (e.g., falling on the
floor). Moreover, if the identified lying action was previously
classified as a dangerous action by action-recognition module 108,
such classification can be further confirmed by the room layout
information 126, e.g., by increasing the probability/confidence
score of the classification. Room layout information 126 can also
include locations of normal regions/objects such as a bed and a
sofa. In some embodiments, if an identified lying action of the
detected person is determined to be within an identified normal
region, such as a bed, it is reasonable to classify the identified
lying action as a normal action (e.g., sleeping on the bed).
Moreover, if the identified lying action was previously classified
as a dangerous action, the action needs to be reclassified as a
normal action based on room layout information 126. Note that
because room layout information 126 is relatively static,
scene-segmentation module 112 does not have to extract room layout
information 126 from every input video frame 104. In some
embodiments, scene-segmentation module 112 only extracts room
layout information 126 periodically, e.g., for every N input video
frames 104 (wherein N is determined based on a predefined time
period). In some embodiments, room layout information 126 can also
be extracted during the setup/installation/initialization of
distributed fall-detection system 200, or when requested by the
user of the distributed fall-detection system 200 through a button
within mobile app 212 from a mobile device.
[0082] In some embodiments, scene-segmentation module 112 can be
implemented by various fast CNN-based semantic segmentation models.
In one embodiment, scene-segmentation module 112 can be implemented
based on a DeepLabV3+ model (described in "Encoder-Decoder with
Atrous Separable Convolution for Semantic Image Segmentation,"
arXiv:1802.02611, Chen et al., August 2018), which can achieve good
scene segmentation performance by combining the advantages of both
a spatial pyramid pooling technique and an encoder-decoder
structure. In some embodiments, scene-segmentation module 112 can
be implemented based on the DeepLabV3+ model by making some or all
of the following modifications/improvements to the original
DeepLabV3+ model: [0083] Modifying the original DeepLabV3+ network
output to segment the indoor scenes into three regions/categories:
(1) the dangerous region which can contain the floor and a carpet;
(2) the safe region which can contain objects where one can lie
down, such as a bed and a sofa; and (3) the background region such
as walls and furniture other than the bed and sofa; [0084] Modifying
the original DeepLabV3+ model by using a fast MobileNetV2 network
(described in "MobileNetV2: Inverted Residuals and Linear
Bottlenecks," Sandler et al., arXiv:1801.04381) as the
backbone/feature extractor the modified DeepLabV3+ model to speed
up and simplify the original DeepLabV3+ model. Note that the
MobileNetV2 network is based on depth-wise convolution, wherein a
high-dimensional tensor is approximated by the product of
low-dimensional tensors. However, other networks similar to
MobileNetV2 network can be used in place of MobileNetV2 network as
the backbone in the above-described modification to the original
DeepLabV3+ network; [0085] Quantizing the network parameters and
running the network inference in 8-bit integer precision instead of
the existing 32-bit floating-point precision to reduce the memory
usage and the frequency of memory access, and to speed up the
arithmetic computations, thereby making the modification
particularly useful and desirable in resource-limited embedded
system applications; and [0086] Removing some preprocessing
functions embedded in the original DeepLabV3+ model and
implementing these functions on a CPU.
[0087] The above-described network modifications/improvements can
significantly speed up the execution of the disclosed
scene-segmentation model. For example, the runtime of the disclosed
scene-segmentation model on the Hi3559A CPU can be reduced from about
43 seconds to ~2 seconds when the above modifications are
implemented. In some embodiments, the disclosed scene-segmentation
module 112 is only executed during the booting-up phase of embedded
fall-detection system 100 or distributed fall-detection system 200
when the system is being calibrated, or when there is no motion in
the input video images 104 for some time. As a result, the
execution speed of the disclosed scene-segmentation module 112 is
sufficiently fast to allow room layout information 126 to be
generated for an input image before the generation of action labels
124 for that input image.
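As a rough illustration, the following Python sketch reduces a per-pixel semantic-segmentation map to the three regions described above; the class names and their region assignments are assumptions about the modified DeepLabV3+ output.

```python
# A minimal sketch of mapping segmentation classes to the three regions.
import numpy as np

DANGEROUS, SAFE, BACKGROUND = 0, 1, 2

# Hypothetical mapping from raw segmentation classes to the 3 regions;
# anything unlisted falls into the background region.
CLASS_TO_REGION = {
    "floor": DANGEROUS, "carpet": DANGEROUS,
    "bed": SAFE, "sofa": SAFE,
}

def room_layout_mask(seg_map, class_names):
    """seg_map: (H, W) int array of per-pixel class indices;
    class_names: list mapping each class index to its name."""
    layout = np.full(seg_map.shape, BACKGROUND, dtype=np.uint8)
    for idx, name in enumerate(class_names):
        layout[seg_map == idx] = CLASS_TO_REGION.get(name, BACKGROUND)
    return layout  # per-pixel region map used as room layout information
```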
Fall-Detection Module
[0088] Referring back to FIG. 1, note that action-recognition
module 108 is followed by fall-detection module 110, which receives
the outputs from both pose-estimation module 106 (i.e., human
keypoints 122) and action-recognition module 108 (i.e., the action
labels/classifications 124). As described above, embedded
fall-detection system 100 uses pose-estimation module 106 to
identify human keypoints 122 of each detected person, estimate the
locations of human keypoints 122 in the corresponding video frame
104, and output cropped image 132 of each detected person based on
the human keypoints 122. Action-recognition module 108 can then use
cropped image 132 and/or keypoints 122 of a detected person to
generate frame-by-frame action labels/classifications 124 for the
detected person. Subsequently, fall-detection module 110 can use at
least the action labels/classifications 124 from action-recognition
module 108 to distinguish dangerous actions from normal actions,
and subsequently generate fall-detection output 140 including both
a fall alarm 140-1 and a corresponding sanitized video clip 140-2
if a fall of the detected person can be confirmed.
[0089] However, to generate more reliable fall-detection output
140, a room layout and temporal information of a sequence of video
frames need to be considered. As described above,
scene-segmentation module 112 is configured to provide the room
layout information 126 relevant to the fall detection. As shown in
FIG. 1, scene-segmentation module 112 can receive raw video images
104 and process video images 104 in parallel to the processing of
video images 104 by pose-estimation module 106 and
action-recognition module 108. Hence, scene-segmentation module 112
can identify certain room layout information from each video image
104, which can include, but is not limited to, the locations of the
floor, the bed, and the sofa in the input video frame. Note that
fall-detection module
110 can receive room layout information 126 from scene-segmentation
module 112 and combine this information with received human
keypoints 122 from pose-estimation module 106 and action labels 124
from action-recognition module 108 when making fall-detection
decisions.
[0090] As can be seen in FIG. 1, fall-detection module 110 can
additionally include a state machine 120 and an invalid pose filter
138. By combining room layout information 126 from
scene-segmentation module 112 with the functionalities of the later
described state machine 120 and invalid pose filter 138,
fall-detection module 110 can generate highly-reliable
fall-detection output 140. We now describe scene-segmentation
module 112, state machine 120, and invalid pose filter 138 in more
detail below.
[0091] Fall-Detection State Machine
[0092] Note that if fall-detection module 110 generates fall
alarms/notifications 140-1 directly based on frame-by-frame action
labels/classifications 124 generated by action-recognition module
108, then fall alarms/notifications 140-1 can include false alarms
because such fall decisions generally do not take into account
correlations among consecutive video frames and the continuous
nature of a given human action. In some embodiments, to reduce
false alarms caused by the more naive frame-by-frame action
recognition/fall-detection technique, a state machine 120 can be
developed which incorporates temporal information from consecutive
video frames into fall-detection decisions by fall-detection module
110. An exemplary implementation of state machine 120 is shown in
FIG. 5. By combining the outputs from action-recognition module 108
and the temporal correlations between consecutive video frames
using the disclosed state machine 120, the fall/non-fall decisions
generated by fall-detection module 110 become more robust and
reliable, and the fall alarms generated by fall-detection module 110
can include significantly fewer false alarms.
[0093] FIG. 5 shows a fall-detection state transition diagram 500
of the disclosed state machine 120 for predicting falls based on a
set of consecutive action labels of a detected person in accordance
with some embodiments described herein. As can be seen in FIG. 5,
the disclosed state transition diagram 500 can include four states
representing different levels of fall possibility: "green" state
502, "yellow" state 504, "orange" state 506 and "red" state 508.
More specifically, green state 502 represents the normal state
associated with normal actions/activities of the detected person,
yellow and orange states 504-506 represent the warning states
associated with potentially risky actions/activities of the
detected person, and red state 508 represents the alarm state
associated with dangerous actions/activities of the detected person
indicative of a fall.
[0094] In some embodiments, each of the states 502-508 in state
transition diagram 500 is associated with a state score, and a
pre-specified upper bound and a pre-specified lower bound
associated with the state score. Hence, each time the state score
of the current state of the state machine is updated, the updated
state score can be compared to the pre-specified upper/lower
bounds. If the updated state score goes above the upper bound or
below the lower bound of the current state, the state of state
transition diagram 500 will transition to a more or less dangerous
state, respectively, in the set of states 502-508, as indicated by
the arrows between these states in state transition diagram 500.
Moreover, a fall
alarm 510 (and hence a fall alarm 140-1 in FIG. 1) can be generated
when the alarm state (i.e., red state 508) is reached, which
indicates that a fall has occurred.
[0095] In some embodiments, each state in state transition diagram
500 can have a maximum state score of 100 (i.e., the upper bound)
and a minimum state score of 0 (i.e., the lower bound). The
recognized dangerous actions by action-recognition module 108
(e.g., struggling and lying on the floor) can be used to increase
the state score associated with a current state, whereas the
detected normal actions (e.g., standing, sitting, bending, and
squatting) can be used to decrease the state score associated with
a current state. Consequently, for a sequence of video frames
depicting a continuous human action of a detected person, the state
score of the current state can be continuously increased or
decreased. Note that, as long as the current state score is bounded
between the associated upper bound and the lower bound, the current
state in the fall-detection state transition diagram 500 does not
transition to another state.
[0096] However, when the current state score exceeds the associated
upper bound, the current state will transition to a more dangerous
state in state transition diagram 500, for example, from orange
state 506 to red state 508, thereby triggering a fall alarm 510. On
the other hand, when the current state score goes below the
associated lower bound, the current state will transition to a less
dangerous state, e.g., from yellow state 504 to green state 502.
Note that while different color-coded states in state transition
diagram 500 represent different severities of the current state of a
detected person in terms of the risk of falling, these states
generally do not correspond to specific actions of the person, such
as standing, sitting, bending, or lying. Note that while the
embodiment of state transition diagram 500 includes four states,
other embodiments of state machine 120 can include a greater or
fewer number of states. For example, one embodiment of state
machine 120 can include only three states with just one warning
state instead of the two warning states as shown in FIG. 5.
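A minimal Python sketch of such a four-state machine is shown below; the score-reset behavior on each transition is an assumption for illustration, since the text specifies only the per-state bounds and the transition conditions.

```python
# A minimal sketch of the four-state fall-detection state machine.
STATES = ["green", "yellow", "orange", "red"]

class FallStateMachine:
    def __init__(self, upper=100.0, lower=0.0):
        self.state = 0            # index into STATES, start at "green"
        self.score = lower        # state score of the current state
        self.upper, self.lower = upper, lower

    def step(self, delta):
        """delta: signed score increment from the current video frame."""
        self.score += delta
        if self.score > self.upper and self.state < len(STATES) - 1:
            self.state += 1                 # more dangerous state
            self.score = self.lower         # reset score (an assumption)
        elif self.score < self.lower and self.state > 0:
            self.state -= 1                 # less dangerous state
            self.score = self.upper         # reset score (an assumption)
        else:
            # Within bounds (or at an end state): clamp, no transition.
            self.score = min(max(self.score, self.lower), self.upper)
        return STATES[self.state] == "red"  # True => raise fall alarm 510
```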
[0097] We now describe an exemplary technique for determining the
state score for the current state of state transition diagram 500.
Recall that human keypoints 122 generated by pose-estimation module
106 are part of the inputs to fall-detection module 110. As described
above, when generating human keypoints 122 for a detected person,
pose-estimation module 106 can also generate a probability for each
keypoint 122. Hence, for the detected person, we can first calculate
two types of weighted scores w_fall and w_normal for the person from
the set of detected keypoints 122 of that person, wherein w_fall is
calculated for fall actions and w_normal is calculated for normal
actions. For example, the weighted scores w_fall and w_normal can be
defined as:

w_fall = W_k·(P_k ∘ W_floor),  (1)

w_normal = −W_k·P_k,  (2)

In Eqn. (1) above, "∘" denotes the element-wise product of two
vectors, and "·" denotes the dot product of two vectors. Assuming
that the detected person is in the dangerous region (i.e., the floor
region), w_fall will have a positive value, while w_normal will have
a negative value. For example, if the detected person is lying on
the floor, which is considered to be a dangerous region, both
W_floor and w_fall will be positive, which will also cause the state
score described in Eqn. (3) below to increase. However, when the
detected person is in the normal/safe region, w_fall will have a
negative value because the elements in W_floor will be set to all
negative values, while w_normal will also have a negative value. For
example, if the detected person is lying in bed, which is considered
to be a normal region, both w_fall and W_floor will be negative,
which will cause the state score described in Eqn. (3) below to
decrease. Note that regardless of whether the detected person is in
a dangerous region or a normal region, w_normal remains negative
because it is always associated with normal situations.
[0098] For the exemplary skeleton diagram/representation of a
detected person shown in FIG. 3, P_k can be an 18×1 keypoint
probability vector formed by the probabilities of the 18 keypoints
of the estimated pose, and W_k is an 18×1 keypoint weight vector
formed by 18 weight values associated with the 18 keypoints of the
estimated pose. In some embodiments, to facilitate detecting falls,
larger weight values in W_k can be assigned to lower-limb keypoints
whereas smaller weight values in W_k can be assigned to upper-body
keypoints. Moreover, because a fall action is strongly correlated to
whether the detected person is in a dangerous region (e.g., the
floor area), we can integrate the floor information extracted as
part of room layout information 126 into the first type of weighted
score w_fall through the vector W_floor. For example, in the same
18-keypoint example of FIG. 3, W_floor can be configured as an 18×1
mask vector. In some embodiments, when a keypoint of the detected
person is determined to be in the dangerous region (e.g., on or near
the floor or carpet), the corresponding weight element in W_floor
can be set to 1 so that this keypoint will have a positive
contribution to the fall action, and subsequently a positive
contribution to the state score described below. Otherwise (i.e.,
when the keypoint is not in the dangerous region), the value of the
corresponding weight element in W_floor is set to −1 so that this
keypoint will have a negative contribution to the fall action, and
subsequently a negative contribution to the state score described
below. Generally speaking, w_normal is designed to be a negative
value which has little or no correlation to the floor information.
Consequently, when a normal action is detected, a corresponding
w_normal can be computed based on Eqn. (2), which will have a
negative contribution to the state score described below.
[0099] As mentioned above, each state in the state transition
diagram 500 can maintain a state score. In some embodiments, the
state score s for the current state in the state transition diagram
500 can be updated based on the following equation:

s = s' + w_s·(W_a ∘ P_a),  (3)

wherein s and s' are the state scores in the current and previous
video frames, respectively, and w_s = [w_fall, w_normal]^T is the
vector form of the above-described weighted scores w_fall and
w_normal of the detected person in the current video frame.
Moreover, P_a is a 2×1 vector including the two probabilities
associated with the "fall action" and "normal action" predictions
from the first-level output of action-recognition module 108, W_a is
a 2×1 positive weighting vector including two weight values
associated with the two categories of actions (i.e., fall actions
and normal actions), respectively, and w_s·(W_a ∘ P_a) denotes the
dot product of w_s with the element-wise product of these two
vectors. Assuming that the detected person is in the dangerous
region (i.e., the floor region), w_fall will have a positive value,
while w_normal will have a negative value. Subsequently, each
identified dangerous action of the detected person will cause the
current state score s to increase toward the upper bound of the
current state, whereas each identified normal action of the detected
person will cause the current state score s to decrease toward the
lower bound of the current state. By way of example, a typical P_a
associated with a possible fall action can be P_a = [0.9, 0.1]^T. In
this case, based on Eqns. (1)-(3), a positive value will be added to
s', which will cause the current state score s to increase. On the
other hand, a typical P_a associated with a possible normal action
can be P_a = [0.1, 0.9]^T. In this case, based on Eqns. (1)-(3), a
negative value will be added to s', which will cause the current
state score s to decrease.
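Putting Eqns. (1)-(3) together, the per-frame score increment can be written as the following minimal Python sketch; vector shapes follow the 18-keypoint example of FIG. 3.

```python
# A minimal sketch of the per-frame state-score increment of Eqns. (1)-(3).
import numpy as np

def frame_score_delta(P_k, W_k, W_floor, P_a, W_a):
    """Signed state-score increment for one video frame.

    P_k: (18,) keypoint probabilities; W_k: (18,) keypoint weights;
    W_floor: (18,) mask with +1 (keypoint in dangerous region) / -1 entries;
    P_a: (2,) [fall, normal] probabilities from the first level of
    action-recognition module 108; W_a: (2,) positive category weights.
    """
    w_fall = W_k @ (P_k * W_floor)        # Eqn. (1)
    w_normal = -(W_k @ P_k)               # Eqn. (2)
    w_s = np.array([w_fall, w_normal])
    return w_s @ (W_a * P_a)              # the increment added in Eqn. (3)
```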
[0100] Generally speaking, by tuning the values of the two elements
in W_a, one can modify the sensitivity and robustness of the
disclosed state machine. More specifically, the two elements of W_a
correspond to the fall and normal actions, respectively, wherein one
of the two elements (e.g., the first element) of W_a can be used to
control how long it will take for a fall action to trigger an alarm,
and the other element (e.g., the second element) of W_a can be used
to control how long it will take for a normal action to recover from
a fall alarm back to green state 502. Hence, by properly setting the
value of the element in W_a associated with fall actions, it is
possible to tune the disclosed state machine to be more or less
sensitive to fall actions. By way of example, to avoid certain false
alarms in fall detection, we can set W_a = [10, 30]^T so that a
normal action, controlled by the second element, can have a stronger
effect on the state score s. Using this setup, if 50% of the input
video frames within a predetermined period of time are classified as
being associated with fall actions, the fall alarm would not be
triggered. Instead, it may require approximately 75% of the input
frames within the predetermined period of time to be classified as
fall actions to trigger the fall alarm. Based on this setup,
embedded fall-detection system 100 can have an increased confidence
level in fall-detection output 140. In this manner, the disclosed
W_a can control the confidence level in fall detections by tuning
the sensitivity to fall actions.
[0101] In some embodiments, when a person is first detected by
embedded fall-detection system 100 in an input video image 104, an
initial state score s_0 can be assigned to this person. In some
embodiments, it can be assumed that the detected person is initially
in a perfectly normal condition, so that the initial state of the
person can be set to the normal state in the state transition
diagram, which is the green state 502 in the exemplary state
transition diagram 500, and the initial state score s_0 can be set
to the lower bound of the normal state. However, in other
embodiments, the initial state score s_0 can be set to a value
midway between the upper bound and the lower bound of the normal
state.
[0102] Invalid Pose Filter
[0103] Note that when a person is standing too close to camera 102
of embedded fall-detection system 100, the lower limbs of the
person may be cut off by the field of view of the camera, and
action-recognition module 108 is likely to misclassify the standing
action as a struggling or lying action. In some embodiments, to
filter out these false alarms, fall-detection module 110 can
additionally include an invalid pose filter 138 which can be used
to check for invalid pose locations, and the associated keypoints
and skeleton segments. More specifically, we can define a set of
binary flags corresponding to a set of invalid poses. For example,
the set of binary flags can include three flags f_c, f_pt^i (i = 1
to 18), and f_l^j (j = 1 to 17) defined as follows: [0104] Invalid
pose flag: f_c is set to 1 if the center of the detected pose in an
input video image is below a certain threshold (e.g., when the
center of the pose is too low in the video image). Otherwise, f_c
can be set to 0; [0105] Invalid keypoints flag: f_pt^i is set to 1
if the i-th keypoint in the detected pose in an input video image is
missing, e.g., when the i-th keypoint is out of the field of view.
Otherwise, f_pt^i can be set to 0; [0106] Invalid skeleton segments
flag: f_l^j is set to 1 if the length of the j-th skeleton segment
in the detected pose in an input video image exceeds a predetermined
threshold value. Otherwise, f_l^j can be set to 0. For example, when
a person is standing too close to camera 102, the lengths of certain
skeleton segments, such as the eye-ear segment, the eye-nose
segment, and/or the nose-chest segment, can be significantly larger
than normal values and can also exceed the corresponding threshold
values. The above-defined flags can then be fused/combined into a
weighted invalidity score s_inv as follows:

s_inv = w_c·f_c + w_pt·Σ_{i=1}^{18} f_pt^i + w_l·Σ_{j=1}^{17} f_l^j,  (4)

wherein w_c, w_pt, and w_l are the weights assigned to the center of
the pose, the keypoints, and the skeleton segments, respectively.
In some embodiments, if the computed invalidity score s_inv is
larger than a predetermined threshold, the detected pose can be
marked as invalid and ignored by embedded fall-detection system 100.
As a specific example of using this filter, we can assign a larger
value to w_l to more effectively filter out false alarms caused by
standing skeleton representations of people positioned too close to
the camera.
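The invalidity score of Eqn. (4) can be sketched in Python as follows; the weights, thresholds, and the image-coordinate convention (y grows downward, so a "low" pose center has a large y value) are illustrative assumptions.

```python
# A minimal sketch of the weighted invalidity score of Eqn. (4).
import numpy as np

def invalidity_score(center_y, keypoints, segment_lengths,
                     y_thresh, len_thresh, w_c=1.0, w_pt=0.5, w_l=2.0):
    """keypoints: list of 18 (x, y) tuples, or None where missing;
    segment_lengths: iterable of the 17 skeleton-segment lengths."""
    f_c = 1 if center_y > y_thresh else 0        # pose center too low
    n_missing = sum(1 for kp in keypoints if kp is None)  # sum of f_pt^i
    n_too_long = int(np.sum(np.asarray(segment_lengths) > len_thresh))
    return w_c * f_c + w_pt * n_missing + w_l * n_too_long  # Eqn. (4)

def is_invalid_pose(s_inv, threshold=3.0):
    """Poses whose invalidity score exceeds the threshold are ignored."""
    return s_inv > threshold
```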
[0107] Note that when the disclosed embedded fall-detection vision
sensors are installed in hallways, the cameras are usually mounted
higher than in the rooms in order to cover larger areas. For these
hallway applications, a rectangular invalid zone can be set up at
the bottom of the screen/field-of-view to filter out skeleton
representations of people detected in the rectangular invalid zone,
i.e., at the bottom of the screen. In some embodiments, multiple
embedded fall-detection vision sensors 202-1, 202-2, . . . , and
202-N can be set up in such a way that the invalid zone of each
standalone embedded vision sensor 202-i (i = 1 to N) can be covered
by one or more of the neighboring embedded vision sensors 202. In
some embodiments, the size of the invalid zone of an installed
embedded vision sensor 202-i can be determined based on the height
of the embedded vision sensor 202-i from the floor.
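For illustration, such a bottom invalid zone could be checked as in the following Python sketch; the zone fraction is a hypothetical parameter that would, per the above, be derived from the sensor's mounting height.

```python
# A minimal sketch of the bottom rectangular invalid zone check.
def in_invalid_zone(keypoints, frame_height, zone_fraction=0.15):
    """True if every visible keypoint of a detected skeleton lies inside
    the bottom rectangular invalid zone of the frame."""
    zone_top = frame_height * (1.0 - zone_fraction)
    visible = [kp for kp in keypoints if kp is not None]
    return bool(visible) and all(y >= zone_top for _, y in visible)
```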
[0108] FIG. 6 presents a flowchart illustrating an exemplary
process 600 for performing image-based fall detection in accordance
with some embodiments described herein. In one or more embodiments,
one or more of the steps in FIG. 6 may be omitted, repeated, and/or
performed in a different order. Accordingly, the specific
arrangement of steps shown in FIG. 6 should not be construed as
limiting the scope of the technique.
[0109] Process 600 may begin by receiving a sequence of video
images capturing one or more persons being monitored for potential
falls (step 602). For example, the video images may be captured by
a fall-detection camera installed at an assisted living facility or
a nursing care home, and the one or more persons being monitored
can be elderly people living in the assisted living facility or the
nursing care home. In the captured images, the one or more persons
can be performing any activities of daily living (ADLs), such as
sleeping, sitting, walking, and other types of ADLs. Next, for a
given video image in the sequence of video images, process 600
detects each person in the video image, and subsequently estimates
a pose for each detected person and generates a cropped image for
the detected person (step 604). For example, process 600 can first
identify a set of human keypoints for each detected person and then
generate a skeleton diagram/stick figure of the detected person by
connecting neighboring keypoints with straight lines. In various
embodiments, step 604 can be performed by the disclosed
pose-estimation module 106 of embedded fall-detection system
100.
[0110] Next, for each detected person, process 600 classifies the
cropped image of the detected person as a particular action within
a set of pre-defined actions, such as (1) standing; (2) sitting;
(3) bending; (4) struggling; and (5) lying down (step 606). In some
embodiments, process 600 can employ the aforementioned two-level
action-recognition technique described in conjunction with FIG. 4
to classify the action in the cropped image by: (1) classifying the
action as either a general "fall" action or a general
"non-fall/normal" action; and (2) further classifying the
classified general action into a specific action within a category
of actions associated with the classified general action. In
various embodiments, step 606 can be performed by the disclosed
action-recognition module 108 of embedded fall-detection system
100.
[0111] Next, for each detected person, process 600 combines
multiple action labels/classifications generated for multiple
consecutive video images within the sequence of video images to
generate a fall/non-fall decision (step 608). As mentioned above,
by combining the action classifications generated for the multiple
consecutive video images, process 600 takes into account the
correlations among the consecutive video frames including the
temporal correlations, and subsequently makes fall/non-fall
decisions with higher reliability by reducing or eliminating false
alarms typically associated with frame-by-frame based
fall-detection decisions. In some embodiments, step 608 can be
performed by the state machine 120 of fall-detection module 110
within embedded fall-detection system 100. Note that, to further
increase the reliability of the fall/non-fall decisions, room
layout information such as the locations of the floor, the bed, and
the sofa can be extracted from the multiple consecutive video
images and combined with other inputs to action classifiers of
fall-detection module 110 to further distinguish different types of
lying and struggling actions of each detected person. In various
embodiments, such room layout information can be generated by
scene-segmentation module 112 of embedded fall-detection system
100.
[0112] Process 600 next determines if a fall has been detected
based on the fall/non-fall decision (step 610). For example, using
state transition diagram 500, step 610 determines, after processing
the multiple consecutive video images, whether the current state of
the system is in red state 508 of state transition diagram 500 or
not. If so, process 600 generates a fall
alarm/notification (step 612). Otherwise, process 600 can return to
step 608 to use the most recent action labels/classifications to
update the fall/non-fall decision and continue the fall
monitoring/detection process.
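Tying the steps together, process 600 can be sketched as the following Python frame loop; the callable parameters are illustrative stand-ins for the modules sketched earlier, not the system's actual interfaces.

```python
# A minimal sketch of process 600 as a per-frame detection loop.
def run_fall_detection(frames, detect_poses, classify_action,
                       score_delta, state_machine):
    """detect_poses(frame) -> cropped person images (step 604);
    classify_action(crop) -> (action_label, probability) (step 606);
    score_delta(label, prob) -> signed state-score increment (step 608)."""
    for frame in frames:                                     # step 602
        for crop in detect_poses(frame):
            label, prob = classify_action(crop)
            fell = state_machine.step(score_delta(label, prob))  # step 610
            if fell:
                yield ("fall_alarm", label)                  # step 612
```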
Infrared Image-Based Detection of Falling from Bed
[0113] In some embodiments, the embedded fall-detection system 100
can also be configured to detect a falling-off-bed event/action,
e.g., when a monitored person lying on the bed is experiencing a
serious medical condition that would result in a fall from the bed
to the floor. In particular, to detect such falls in a dark
environment, e.g., at night, a visual sensor such as a camera with
a night vision mode/function can be used. Specifically, when the
lighting condition within a monitored area is poor, e.g., when the
level of illumination is determined to be below a detection
threshold, embedded fall-detection system 100 can automatically
turn on an infrared (IR) lighting/light source and, if necessary,
also turn off the IR filter to begin capturing infrared
video/images. The captured infrared images can then be transformed
into grayscale images, which can then be used as inputs to
pose-estimation module 106, action-recognition module 108,
fall-detection module 110, and scene-segmentation module 112 for
fall detections.
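The illumination-based switch to night vision can be sketched in Python as follows; the brightness threshold is an illustrative assumption.

```python
# A minimal sketch of the night-vision (IR) mode switch.
import cv2
import numpy as np

def needs_ir_mode(bgr_frame, brightness_threshold=40.0):
    """True when the scene is too dark for daylight RGB processing,
    i.e., when the IR light source should be turned on."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    return float(np.mean(gray)) < brightness_threshold

def ir_to_grayscale(ir_frame):
    """IR frames are converted to grayscale before being fed to the
    pose-estimation, action-recognition, and scene-segmentation modules."""
    return (cv2.cvtColor(ir_frame, cv2.COLOR_BGR2GRAY)
            if ir_frame.ndim == 3 else ir_frame)
```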
[0114] In some embodiments, embedded fall-detection system 100 can
be configured to process both daylight RGB input images and
night-vision infrared input images. Moreover, embedded
fall-detection system 100 can also be configured to handle special
requirements for falling-off-bed detection. For example, even when
a person being monitored is covered by a blanket or a comforter,
pose-estimation module 106 can still detect the head and shoulder
keypoints of the person which generally remain visible, and
subsequently estimate the positions of the upper body and limb
keypoints for the person. Action-recognition module 108 can then be
used to generate the proper action labels for the detected person
based on the cropped images of the person and/or the skeleton
representations of the person, and subsequently trigger the
fall-detection state machine 120 within fall-detection module 110
to transition accordingly.
Statistics of Activities of Daily Living (ADLs)
[0115] In some embodiments, embedded fall-detection system 100 can
also be used to recognize and generate statistics of a person's
activities of daily living, e.g., how much time is spent on
sleeping, sitting, and moving. More specifically, outputs of
scene-segmentation module 112 and outputs of action-recognition
module 108 based on analyzing consecutive video frames can be
combined to recognize various activities of daily living (ADLs),
such as sleeping and walking. Based on this ADL information, useful
statistics can be generated for a monitored person, such as how
much of the person's time is spent on sleeping, sitting, walking,
and other types of ADLs.
system 100 can periodically output the generated ADL statistics of
a monitored person, e.g., as a part of fall-detection output 140.
By merging such ADL statistics from multiple embedded
fall-detection vision sensors installed within a healthcare
facility or a house, the disclosed distributed fall-detection
system 200 can obtain the ADL summary of each person being
monitored, and such summary can be used by caregivers to analyze
the person's health condition. In some embodiments, embedded
fall-detection system 100 can include a dedicated ADL statistics
module (not shown) for computing the above ADL statistics.
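A minimal Python sketch of such ADL statistics follows, assuming one frame-level action label per processed frame and a known processing frame rate.

```python
# A minimal sketch of accumulating per-activity time from frame labels.
from collections import Counter

def adl_statistics(action_labels, fps=10.0):
    """action_labels: one frame-level label per processed frame for one
    monitored person; returns seconds spent in each activity."""
    return {label: n / fps for label, n in Counter(action_labels).items()}

# Example: 10 minutes of labels processed at 10 frames per second.
stats = adl_statistics(["sitting"] * 4500 + ["standing"] * 1500)
# -> {"sitting": 450.0, "standing": 150.0}
```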
Face Detection and Face Recognition Modules
[0116] Referring back to FIG. 1, note that embedded fall-detection
system 100 also includes face-detection module 116 configured to
perform face detection functions. Specifically, face-detection
module 116 can directly receive raw video images 104 and process
video images 104 in parallel to the processing of video images 104
by pose-estimation module 106, action-recognition module 108, and
scene-segmentation module 112. Face-detection module 116
subsequently outputs detected face 130 within video images 104.
[0117] There are many fast face-detection models which can be used
to implement face-detection module 116 in embedded fall-detection
system 100. In one embodiment, a S3FD model (described in "S3FD:
Single Shot Scale-invariant Face Detector," Zhang et al., ICCV
2017) can be used to implement face-detection module 116. The S3FD
model has shown to have good performances in handling faces of
different scales. In some embodiments, to run a S3FD-based face
detect model in real-time on embedded fall-detection system 100,
the following modifications/improvements can be made to the
original S3FD model: [0118] Replacing the complex VGG16 network in
the original S3FD model with a lightweight MobileNetV2 (described
in "MobileNetV2: Inverted Residuals and Linear Bottlenecks,"
Sandler et al., arXiv:1801.04381) as the backbone/feature
extractor; [0119] Incorporating the feature pyramid network (FPN)
(described in "Feature Pyramid Networks for Object Detection," Lin
et al., arXiv:1612.03144, 2016) into the modified S3FD framework to
improve the detection performance for small faces; [0120] Reducing the
training and inference data size in the original S3FD model design
from 640×640 to 320×320 to further reduce the inference
time; [0121] Adding a landmark-detection CNN module which is
configured to receive face detection outputs from the modified S3FD
network and output accurate facial landmarks for the detected faces
for use in subsequent face recognition operation. In some
embodiments, the landmark-detection CNN module and the S3FD-based
face detection model can be jointly trained; and [0122] Quantizing
the network parameters and running the network inference in 8-bit
integer precision instead of in the existing 32-bit floating-point
precision to reduce the memory usage and the frequency of memory
access, and to speed up the arithmetic computations, thereby making
the modification particularly useful and desirable in
resource-limited embedded system applications. Based on the
above-described modifications, the disclosed S3FD-based face
detection model can reduce the face-detection inference time from
~1.2 s to ~100 ms on an ARM v8 CPU. Note that this
performance improvement can be achieved without using any neural
network acceleration engine.
[0123] Further referring to FIG. 1, note that the disclosed
embedded fall-detection system 100 also includes face-recognition
module 118 configured to perform face recognition functions based
on the detected faces 130 from face-detection module 116. There are
many good face recognition models which can be used to implement
face-recognition module 118 in embedded fall-detection system 100.
In one embodiment, an ArcFace face recognition model (described in
"ArcFace: Additive Angular Margin Loss for Deep Face Recognition,"
Deng et al., arXiv:1801.07698, 2018) can be used to implement
face-recognition module 118. In a particular implementation of
face-recognition module 118, a number of modifications have been
made to the original ArcFace model to tailor to the needs of
embedded fall-detection system 100. First, the proposed
face-recognition model can train a lightweight ResNet-18 network
(described in "Deep Residual Learning for Image Recognition," He et
al., CVPR 2016) on the MS1M-refine-v2 dataset. Second, the proposed
face-recognition model can be configured to quantize the neural
network model and run the inference using 8-bit integer precision
instead of 32-bit floating-point precision as in the original
ArcFace model. With these modifications, the inference time of the
proposed face-recognition model can be reduced to about 12 ms on the
Hi3559A NNIE engine. Note that
using the above proposed implementation of the face-recognition
module 118, it is also possible to detect other useful properties
of people in the captured video images 104, such as facial
expressions.
Person-ID Library and Profile Database
[0124] In some embodiments, during fall detection, face-recognition
module 118 can generate a facial feature vector (which can be a 1-D
facial feature vector, a 2-D facial feature vector, or a 3-D facial
feature vector) for each detected face within an input video image
104. Next, the generated facial feature vector can be compared
against a person-ID dictionary, such as person-ID dictionary 150
stored in a memory of embedded fall-detection system 100. In some
embodiments, the person-ID dictionary can include a set of entries
associated with a set of existing/established person-IDs of a group
of people that can be tracked by embedded fall-detection system
100, wherein each entry in the person-ID dictionary can include
both one or multiple facial feature vectors (e.g., generated based
on one or multiple profile photos, which can be 1-D facial feature
vectors, 2-D facial feature vectors, or 3-D facial feature vectors)
and a corresponding person-ID.
[0125] For each facial feature vector generated by face-recognition
module 118 during the fall-detection process, if the facial feature
vector matches a stored facial feature vector within an entry in
the person-ID dictionary, it means that the detected person has an
established profile at the server. Face-recognition module 118 will
then output the person-ID within the matched entry as a person-ID
136 indicating that the detected person has been identified. In the
same manner, face-recognition module 118 can output all person-IDs
136 for all of the detected persons that can be identified by
face-recognition module 118 based on their corresponding facial
feature vectors. Next, embedded fall-detection system 100 can
output fall alarms 140-1 along with person-IDs 136 to the server,
such as server 204. The server can then use a received person-ID
136 to locate the corresponding person's identity (e.g., the
person's name) which has been previously established and stored on
the server, and subsequently send a fall notification to the mobile
app, such as mobile app 212 including the identity of the
corresponding person which is determined to have fallen.
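For illustration, the following Python sketch matches a generated facial feature vector against the person-ID dictionary using cosine similarity; the dictionary layout and the similarity threshold are assumptions, as the matching metric is not specified above.

```python
# A minimal sketch of person-ID lookup by facial feature similarity.
import numpy as np

def match_person_id(feature, person_id_dict, threshold=0.6):
    """person_id_dict: {person_id: [stored feature vectors, ...]}.
    Returns the matched person-ID 136, or the "unknown person" tag."""
    f = feature / (np.linalg.norm(feature) + 1e-9)
    best_id, best_sim = None, threshold
    for person_id, stored_vectors in person_id_dict.items():
        for v in stored_vectors:
            sim = float(f @ (v / (np.linalg.norm(v) + 1e-9)))
            if sim > best_sim:              # keep the closest match
                best_id, best_sim = person_id, sim
    return best_id if best_id is not None else "unknown person"
```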
[0126] In some embodiments, the disclosed person-ID dictionary can
be updated based on the following steps within distributed
fall-detection system 200, which involve interactions among the one
or multiple embedded vision sensors 202-1, 202-2, . . . , and
202-N, server 204, and mobile app 212: [0127] Each user of
distributed fall-detection system 200 can add or remove a person
that is to be tracked by distributed fall-detection system 200
using mobile app 212. More specifically, for each person to be
tracked by distributed fall-detection system 200, mobile app 212
can be used to construct a new profile of the person by combining
the identity of the person with one or multiple profile photos of
the person. For example, mobile app 212 can be used to take one or
multiple profile photos of the person. Mobile app 212 can then send
the profile of the person including the one or multiple profile
photos and the person's identity, such as the name of the person to
server 204; [0128] Based on a received profile of a given person,
server 204 can generate a unique person-ID (e.g., a unique integer
value) for the given person, e.g., based on the identity of the
person. Server 204 can then associate the unique person-ID with the
received one or multiple profile photos of the given person. Server
204 then sends the unique person-ID along with the profile photos
of the given person to the one or multiple embedded vision sensors
202-1, 202-2, . . . , and 202-N. [0129] On each embedded vision
sensor 202, the one or multiple profile photos of the given person
can be used to extract one or multiple facial feature vectors for
the person using the above-described face-detection module 116 and
face-recognition module 118. Next, the person-ID dictionary, such
as person-ID dictionary 150, can be updated by adding a new entry
for the given person, wherein the new entry can store both the one
or multiple newly generated facial feature vectors and the
associated unique person-ID of the given person.
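For illustration purposes only, the following minimal Python sketch
shows one possible in-memory layout for such a person-ID dictionary
and its update operation. The class and method names
(PersonIDDictionary, add_entry, remove_entry) are hypothetical; only
the entry structure (one or multiple facial feature vectors plus a
unique person-ID) follows the description above.

```python
import numpy as np

class PersonIDDictionary:
    """Hypothetical in-memory person-ID dictionary: each entry maps a
    unique person-ID to one or more facial feature vectors extracted
    from that person's profile photos."""

    def __init__(self):
        self.entries = {}  # person_id -> list of unit-norm feature vectors

    def add_entry(self, person_id, feature_vectors):
        # Store unit-norm copies so that later comparisons reduce to a
        # dot product (cosine similarity).
        self.entries[person_id] = [
            np.asarray(v, dtype=float) / np.linalg.norm(v)
            for v in feature_vectors
        ]

    def remove_entry(self, person_id):
        # Invoked when the server propagates a profile removal request.
        self.entries.pop(person_id, None)
```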
[0130] Next, during a fall-detection process, the person-ID
dictionary can be used for person identification and tracking
purposes on each embedded vision sensor 202. More specifically,
face-recognition module 118 within each embedded vision sensor 202
can generate a facial feature vector for each detected person in an
input image 104. Face-recognition module 118 can then search the
generated facial feature vector of each detected person in the
person-ID dictionary stored in a memory of each embedded vision
sensor 202, and specifically compare the facial feature vector
against the stored facial feature vectors in each entry of the
person-ID dictionary. Recall that each entry in the person-ID
dictionary stores a profile of a known person, which can include
one or multiple facial feature vectors, and a corresponding
person-ID of the person. Based on the outcome of the search,
face-recognition module 118 determines if the detected person has a
corresponding entry (i.e., a matching facial feature vector) in the
person-ID dictionary. If so, the detected person is identified, and
face-recognition module 118 can output the stored person-ID
associated with the matched facial feature vector as person-ID 136
of the detected person. If an embedded vision sensor 202 determines
that the detected person is involved in a fall, the embedded vision
sensor 202 can generate fall-detection output that includes the
identified person-ID 136 of the detected person. However, if the
facial feature vector of the detected person doesn't match any
stored facial feature vector in the person-ID dictionary,
face-recognition module 118 can generate an "unknown person" tag
for the detected person.
[0131] Note that the above-described distributed fall-detection
system design ensures that each embedded fall-detection vision
sensor 202 does not transmit any detected face image of any
detected person from a captured video image. Instead, all face
detection and recognition operations are performed within each
embedded fall-detection vision sensor 202, and each embedded
fall-detection vision sensor 202 is configured to only transmit an
encoded person-ID and sanitized video images to server 204, without
including any actual identity of the detected person. This
distributed fall-detection system design allows for preserving the
privacy of each monitored person by each embedded fall-detection
vision sensor to the maximum extent. This distributed
fall-detection system design can also minimize the amount of data
transmitted over the network and the amount of computation performed
on the server (e.g., on a cloud server), thereby minimizing the
daily operating cost of the disclosed distributed fall-detection
system 200.
[0132] FIG. 7 presents a flowchart illustrating an exemplary
process 700 for adding a new profile for a person into the
disclosed distributed fall-detection system 200 in accordance with
some embodiments described herein. In one or more embodiments, one
or more of the steps in FIG. 7 may be omitted, repeated, and/or
performed in a different order. Accordingly, the specific
arrangement of steps shown in FIG. 7 should not be construed as
limiting the scope of the technique. Note that process 700 can be
understood in conjunction with embedded fall-detection system 100
of FIG. 1 and distributed fall-detection system 200 of FIG. 2.
[0133] Process 700 may begin when the server (e.g., server 204 in
system 200) receives a new profile request along with a profile of
a person to be added in the distributed fall-detection system (step
702). As mentioned above, the server can receive the new profile
request from the mobile app (e.g., mobile app 212 installed on
mobile device 206 in system 200). More specifically, the mobile app
can be used to generate the new profile, which includes the
identity of the person, and one or more profile photos of the
person, and then transmit the new profile request along with the
new profile to the server. Next, at the server, process 700
generates a unique person-ID (e.g., a unique integer value) for the
person based on the received profile of the person (step 704). For
example, the unique person-ID may be created based on the identity
of the person (e.g., the name) in the received profile. Process 700
next creates a new entry in a profile database stored on the server
for the person, wherein the entry can include the identity, the
unique person-ID and the one or multiple profile photos of the
person (step 706). Process 700 subsequently transmits the unique
person-ID along with the one or more profile photos from the server
to the one or multiple embedded fall-detection vision sensors
(e.g., embedded vision sensors 202-1 to 202-N) (step 708).
[0134] Next, on each embedded vision sensor, process 700 extracts
one or more facial feature vectors of the person based on the
received one or more profile photos (step 710). For example,
process 700 can use the above-described face-recognition module in
conjunction with the face-detection module to generate the facial
feature vectors. Process 700 next updates a respective person-ID
dictionary stored on each embedded vision sensor by adding a new
entry for the person in the person-ID dictionary, wherein the new
entry includes both the generated facial feature vectors and the
received person-ID of the person (step 712). As mentioned above,
after a profile entry is established for the person in the
person-ID dictionary, each embedded fall-detection vision sensor
can identify and subsequently track the person if that person is
detected during a fall-detection process.
[0135] Note that in some embodiments, process 700 can be reversed
to remove an established entry/profile of a person from the
person-ID dictionary. FIG. 8 presents a flowchart illustrating an
exemplary process 800 for removing an existing profile of a person
from the disclosed distributed fall-detection system 200 in
accordance with some embodiments described herein. In one or more
embodiments, one or more of the steps in FIG. 8 may be omitted,
repeated, and/or performed in a different order. Accordingly, the
specific arrangement of steps shown in FIG. 8 should not be
construed as limiting the scope of the technique. Note that process
800 can be understood in conjunction with embedded fall-detection
system 100 of FIG. 1 and distributed fall-detection system 200 of
FIG. 2.
[0136] For example, process 800 may begin when the server (e.g.,
server 204 in system 200) receives a profile removal request to
remove the profile of a given person from the distributed
fall-detection system (step 802). In some embodiments, the profile
removal request can be made using the mobile app, and the server
can receive the profile removal request from the mobile app. Note
that the profile removal request should include the identity of the
person to be removed. When the profile removal request is received
at the server, process 800 next searches a profile database storing
established profiles of a group of people based on the identity of
the person included in the removal request (step 804). As described above, the
stored profiles of the group of people include the established
person-IDs of the group of people. Once the profile of the person
is located in the profile database, process 800 then sends the
associated person-ID of the person along with the profile removal
request to the one or multiple embedded fall-detection vision
sensors (e.g., embedded vision sensors 202-1 to 202-N) (step
806).
[0137] Next, on each embedded vision sensor, process 800 identifies
an entry of the person within a respective person-ID dictionary
based on the received person-ID of the person (step 808). Process
800 subsequently removes the identified entry of the person from
the respective person-ID dictionary (step 810). Next, process 800
may send an acknowledgement to the server indicating that the
profile of the person has been successfully removed from the
embedded vision sensor. After receiving the acknowledgements from
the one or multiple embedded vision sensors at the server, process
800 can remove the profile of the person including the identity,
the person-ID and the one or multiple profile photos of the person
from the profile database (step 812).
[0138] FIG. 9 presents a flowchart illustrating an exemplary
process 900 for identifying a detected person with the disclosed
embedded fall-detection system 100 in accordance with some
embodiments described herein. In one or more embodiments, one or
more of the steps in FIG. 9 may be omitted, repeated, and/or
performed in a different order. Accordingly, the specific
arrangement of steps shown in FIG. 9 should not be construed as
limiting the scope of the technique. Note that process 900 can be
understood in conjunction with embedded fall-detection system 100
of FIG. 1 and in particular face-recognition module 118 within
embedded fall-detection system 100. In some embodiments, process
900 can be fully implemented on face-recognition module 118.
[0139] Process 900 may begin when face-recognition module 118
receives a detected face of a detected person within an input video
image 104 from face-detection module 116 (step 902). Process 900
next generates a facial feature vector based on the detected face
using a facial feature extraction submodule within face-recognition
module 118 (step 904). In various embodiments, this facial feature
vector can be a 1-D facial feature vector, a 2-D facial feature
vector, or a 3-D facial feature vector. Next, process 900 searches
the generated facial feature vector in a person-ID dictionary, such
as person-ID dictionary 150, by comparing the facial feature vector
against the stored facial feature vectors in each entry of the
person-ID dictionary (step 906). In some embodiments, the person-ID
dictionary is stored in a memory within embedded fall-detection
system 100. Next, process 900 determines if the detected face has a
corresponding entry in the person-ID dictionary based on whether a
matched facial feature vector can be found (step 908). If so, the
detected face/person is identified, and process 900 can output the
stored person-ID associated with the matched facial feature vector
in the person-ID dictionary as the person-ID of the detected
face/person (step 910). Subsequently, if the embedded
fall-detection system determines that the detected person is
involved in a fall, the embedded fall-detection system can output
the fall alarm along with the identified person-ID of the detected
person. However, if the facial feature vector of the detected
face/person doesn't match any stored facial feature vector in the
person-ID dictionary, process 900 can output an "unknown person"
tag for the detected face/person (step 912).
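A minimal sketch of the lookup in steps 906-912 is given below,
reusing the hypothetical PersonIDDictionary sketched earlier. The
cosine-similarity metric and the 0.6 acceptance threshold are
assumptions; the disclosure does not fix a particular matching
criterion.

```python
import numpy as np

def identify_person(query_vector, person_id_dictionary, threshold=0.6):
    """Search a facial feature vector in the person-ID dictionary and
    return the matched person-ID, or an "unknown person" tag when no
    stored vector is similar enough (steps 906-912)."""
    query = np.asarray(query_vector, dtype=float)
    query = query / np.linalg.norm(query)
    best_id, best_score = None, -1.0
    for person_id, stored_vectors in person_id_dictionary.entries.items():
        for stored in stored_vectors:  # stored vectors are unit-norm
            score = float(np.dot(query, stored))  # cosine similarity
            if score > best_score:
                best_id, best_score = person_id, score
    return best_id if best_score >= threshold else "unknown person"
```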
Privacy-Preserving Design
[0140] The disclosed embedded fall-detection system 100 and
distributed fall-detection system 200 are designed to preserve the
privacy of each person/user captured by each embedded
fall-detection vision sensor 202 in the disclosed distributed
fall-detection system 200. In some embodiments, the
privacy-preserving nature of the disclosed embedded fall-detection
system 100 and distributed fall-detection system 200 is achieved by
performing some or all of the above-described
fall-detection-related operations on input video images 104 in-situ
inside each standalone embedded vision sensor 202. Moreover, after
processing the captured video images in-situ, each embedded vision
sensor 202 can only transmit sanitized video images along with fall
alarms to server 204 (e.g., by transmitting only the
keypoints/skeleton/stick figure representations of each detected
person instead of the actual cropped images of the detected
person).
[0141] In some embodiments, various features extracted from a
sequence of most recent video frames can be stored in a memory
buffer of each embedded vision sensor 202. These stored features
can include human keypoints, skeleton diagrams/stick figures, and
face recognition results including person-IDs 136 from each
processed video frame. In some embodiments, these stored features
can be used to reconstruct a sanitized video clip of the most
recent N seconds (e.g., N=5 to 15) of the captured video frames.
Hence, once a fall is detected by the associated embedded
fall-detection system 100, the given embedded vision sensor 202 can
send a fall alarm/notification 140-1 along with the reconstructed
sanitized video clip 140-2 of the most recent N seconds (e.g., 10
seconds) of the captured video frames to server 204.
[0142] In some embodiments, reconstructing a sanitized video clip
can include first identifying a common background image for the
sequence of original video frames, wherein the common background
image is a static image that does not include the detected person.
For example, the common background image can be extracted from a
static video image before the detected person enters the camera
view. Next, the sequence of sanitized video frames can be generated
by directly superimposing the sequence of skeleton diagrams of the
detected person corresponding to the sequence of original video
frames onto the common background image. For example, to generate a
sanitized video frame i in the sanitized video clip, we can
superimpose the skeleton diagram i generated from frame i in the
sequence of original video frames directly onto the common
background image. Note that this sanitized video reconstruction
technique can have lower computational and storage costs than
directly processing/modifying the original video frames.
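As a rough sketch of this reconstruction technique, the snippet below
superimposes per-frame skeleton diagrams onto the common background
image using OpenCV. The keypoint indices in SKELETON_EDGES are
illustrative and do not necessarily match the keypoint set of FIG. 3.

```python
import cv2
import numpy as np

# Illustrative skeleton topology: pairs of keypoint indices joined by lines.
SKELETON_EDGES = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5),
                  (1, 6), (6, 7), (6, 8), (7, 9), (8, 10)]

def render_sanitized_frame(background, keypoints):
    """Superimpose one skeleton diagram (stick figure) onto the common
    static background image to produce one sanitized frame. `keypoints`
    is an (N, 2) array of pixel coordinates; NaN marks a keypoint that
    was not detected in this frame."""
    frame = background.copy()
    for a, b in SKELETON_EDGES:
        pa, pb = keypoints[a], keypoints[b]
        if np.isnan(pa).any() or np.isnan(pb).any():
            continue  # skip limbs with a missing endpoint
        cv2.line(frame, (int(pa[0]), int(pa[1])), (int(pb[0]), int(pb[1])),
                 color=(0, 255, 0), thickness=2)
    return frame

def reconstruct_sanitized_clip(background, keypoints_per_frame):
    # One sanitized frame per original frame; every frame shares the same
    # background, so the original images of the person are never used.
    return [render_sanitized_frame(background, kps)
            for kps in keypoints_per_frame]
```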
[0143] Similarly, to preserve the privacy of a person when a live
streaming is requested for the person, the disclosed embedded
vision sensors 202 do not transmit the original live video images
to server 204 or to mobile devices 206. Instead, each embedded
fall-detection vision sensor 202 is configured to send sanitized
live video images (e.g., the keypoints or the skeleton
representations of the person). In some embodiments, the amount of
information that can be included in the sanitized video images can
be tailored based on the specific privacy needs of a given
user.
[0144] For example, in a highly-restrictive privacy-preserving
mode, embedded fall-detection system 100 can be configured to only
include the skeleton representations/stick figures of the people
detected in each video frame, which is sufficient to show how a
person takes a fall, but will not include any human identity
information and background information in the transmitted video
frame. Alternatively, in a less restrictive privacy-preserving
mode, in addition to transmitting skeleton representations/stick
figures of the detected people to the server, embedded
fall-detection system 100 can be configured to also transmit some
segmented background masks (e.g., generated by scene-segmentation
module 112) of the captured scene/video frames. For example, the
segmented background masks can include labeled regions
corresponding to non-human objects detected in the scene, such as
beds and sofas, to help interpret the scene or the detected fall
relative to the person. However, these segmented
background masks do not show the original images of these
identified objects.
[0145] In another exemplary privacy-preserving mode, a transmitted
video can include the original background images in the video.
However, by sending the human keypoints or the associated skeleton
representations instead of the original video images of the
detected persons, the disclosed fall-detection systems 100 and 200
can effectively preserve each detected person's privacy, making it
suitable for people monitoring in bedrooms and bathrooms. In some
embodiments, however, when proof of human identity is required,
e.g., for legal purposes, embedded fall-detection system 100 can
also be configured to transmit a region in the video images
corresponding to the head and the face of a given person, but the
body portion of the person can still be represented by the
associated skeleton representation in the transmitted video
images.
Embedded Vision System--Hardware Environment
[0146] FIG. 10 illustrates an exemplary hardware environment 1000
for the disclosed embedded fall-detection system 100 of FIG. 1 in
accordance with some embodiments described herein. Note that
hardware environment 1000 can be used to implement each of the one
or multiple embedded fall-detection vision sensors 202-1, 202-2, .
. . , and 202-N within distributed fall-detection system 200. As
can be seen in FIG. 10, hardware environment 1000 can include a bus
1002, one or more processors 1004, a memory 1006, a storage device
1008, a camera system 1010, sensors 1011, one or more neural
network accelerators 1012, one or more input devices 1013, one or
more output devices 1014, and a network interface 1016.
[0147] Bus 1002 collectively represents all system, peripheral, and
chipset buses that communicatively couple the various components of
hardware environment 1000. For instance, bus 1002 communicatively
couples processors 1004 with memory 1006, storage device 1008,
camera system 1010, sensors 1011, neural network accelerators 1012,
input devices 1013, output devices 1014, and network interface
1016.
[0148] From memory 1006, processors 1004 retrieve instructions to
execute and data to process in order to control various components
of hardware environment 1000, and to execute various
functionalities described in this patent disclosure including the
various disclosed functions of the various functional modules in
the disclosed embedded fall-detection system 100, including but not
limited to: pose-estimation module 106, action-recognition module
108, fall-detection module 110 including state machine 120 and
invalid pose filter 138, scene-segmentation module 112,
face-detection module 116, face-recognition module 118, and the ADL
statistics module (not shown). Processors 1004 can include any type
of processor, including, but not limited to, one or more central
processing units (CPUs), one or more microprocessors, one or more
graphic processing units (GPUs), one or more tensor processing
units (TPUs), one or more digital signal processors (DSPs), one or
more field-programmable gate arrays (FPGAs), one or more
application-specific integrated circuits (ASICs), a personal
organizer, a device controller and a computational engine within an
appliance, and any other processor now known or later developed.
Furthermore, a given processor 1004 can include one or more cores.
Moreover, a given processor 1004 itself can include a cache that
stores code and data for execution by the given processor 1004.
[0149] Memory 1006 can include any type of memory that can store
code and data for execution by processors 1004, neural network
accelerators 1012, and some other processing modules of hardware
environment 1000. This includes, but is not limited to, dynamic random
access memory (DRAM), static random access memory (SRAM), flash
memory, read only memory (ROM), and any other type of memory now
known or later developed.
[0150] Storage device 1008 can include any type of non-volatile
storage device that can be integrated with hardware environment
1000. This includes, but is not limited to, magnetic, optical, and
magneto-optical storage devices, as well as storage devices based
on flash memory and/or battery-backed up memory. In some
implementations, various programs for implementing the various
disclosed functions of the various disclosed modules in the
disclosed embedded fall-detection system 100, including
pose-estimation module 106, action-recognition module 108,
fall-detection module 110 including state machine 120 and invalid
pose filter 138, scene-segmentation module 112, face-detection
module 116, face-recognition module 118, and the ADL statistics
module (not shown), are stored in memory 1006 and storage device
1008.
[0151] Bus 1002 is also coupled to camera system 1010. Camera
system 1010 is configured to capture a sequence of video images at
predetermined resolutions and couple the captured video images to
various components within hardware environment 1000 via bus 1002,
such as to memory 1006 for buffering and to processors 1004 and
neural network accelerators 1012 for various deep-learning and
neural network-based operations. Camera system 1010 can include one
or more digital cameras. In some embodiments, camera system 1010
includes one or more digital cameras equipped with wide-angle
lenses. The images captured by camera system 1010 can have
different resolutions, including high resolutions such as
1280×720 (720p), 1920×1080 (1080p), or other high resolutions.
[0152] In some embodiments, neural network accelerators 1012 can
include any type of microprocessor designed as hardware
acceleration for executing AI-based and deep-learning-based
programs and models, and in particular various deep learning neural
networks such as various CNN and RNN frameworks mentioned in this
disclosure. Neural network accelerators 1012 can perform the
intended functions of each of the described deep-learning-based
modules within the disclosed embedded fall-detection system 100,
i.e., pose-estimation module 106, action-recognition module 108,
fall-detection module 110, scene-segmentation module 112,
face-detection module 116, face-recognition module 118, and the ADL
statistics module. Examples of neural network accelerators 1012 can
include but are not limited to: the dual-core ARM Mali-G71 GPU,
dual-core Neural Network Inference Acceleration Engine (NNIE), and
the quad-core DSP module in the HiSilicon Hi3559A SoC.
[0153] Bus 1002 also connects to input devices 1013 and output
devices 1014. Input devices 1013 enable the user to communicate
information and select commands to hardware environment 1000. Input
devices 1013 can include, for example, a microphone, alphanumeric
keyboards and pointing devices (also called "cursor control
devices").
[0154] Hardware environment 1000 also includes a set of sensors
1011 coupled to bus 1002 for collecting environment data to
assist various fall-detection functionalities of the disclosed
embedded fall-detection system 100. Sensors 1011 can include a
motion sensor, an ambient light sensor, and an infrared sensor such
as a passive infrared (PIR) sensor. To enable the
functionality of a PIR sensor, hardware environment 1000 can also
include an array of IR emitters.
[0155] Output devices 1014, which are also coupled to bus 1002,
enable, for example, the display of the results generated by
processors 1004 and neural network accelerators 1012. Output
devices 1014 include, for example, display devices, such as cathode
ray tube displays (CRT), light-emitting diode displays (LED),
liquid crystal displays (LCD), organic light-emitting diode
displays (OLED), plasma displays, or electronic paper. Output
devices 1014 can also include audio output devices such as a
speaker. Output devices 1014 can additionally include one or more
LED indicators.
[0156] Finally, as shown in FIG. 10, bus 1002 also couples hardware
environment 1000 to a network (not shown) through a network
interface 1016. In this manner, hardware environment 1000 can be a
part of a network, such as a local area network ("LAN"), a Wi-Fi
network, a wide area network ("WAN"), or an Intranet, or a network
of networks, such as the Internet. Hence, network interface 1016
can include a Wi-Fi network interface. Network interface 1016 can
also include a Bluetooth interface. Any or all components of
hardware environment 1000 can be used in conjunction with the
subject disclosure.
[0157] In a particular embodiment of hardware environment 1000,
hardware environment 1000 is implemented as an embedded
fall-detection vision sensor which includes at least the following
components: one or more cameras; multiple CPUs; multiple GPUs;
multiple neural network accelerators (e.g., NNIE accelerators);
multiple DSPs; multiple memory modules; a storage device; a WiFi
module; a Bluetooth module; a microphone; a speaker; a display
interface; multiple sensors including a motion sensor, an ambient
light sensor, and an IR sensor; and finally multiple LED
indicators.
Task Scheduling and Low-Level Optimizations
[0158] In some embodiments, to take full advantage of the available
processing power of hardware environment 1000, a customized task
scheduler can be designed to utilize multiple hardware resources
such as the ARM CPU and the NNIE accelerator in parallel to achieve
maximum processing throughput. FIG. 11 shows an exemplary task
scheduler 1100 for executing the various disclosed fall-detection
functionalities of embedded fall-detection system 100 in accordance
with some embodiments described herein.
[0159] As can be seen in FIG. 11, task scheduler 1100 can include
an input scheduler 1102 and an output scheduler 1104. Each task
scheduler 1100 can instantiate an arbitrary number of workers to
complete the same task in parallel, e.g., three CPU workers:
CPU_Worker0, CPU_Worker1, and CPU_Worker2, and two NNIE workers:
NNIE_Worker0 and NNIE_Worker1. Furthermore, each worker in task
scheduler 1100 can use a different hardware resource (i.e., either
the CPU or the NNIE accelerator) offered by hardware environment
1000. In some embodiments, input scheduler 1102 can be configured
to receive raw video images as input 1106, and schedule the set of
workers to perform the following two streams of tasks on the input
video images: (1) the pose-estimation tasks followed by the
action-recognition tasks and the fall-detection tasks, and
subsequently generating fall detection output including fall
alarms, sanitized video clips and/or ADLs as output 1108; and (2)
face-detection tasks followed by face-recognition tasks, and
subsequently generating person-IDs as output 1108. Moreover, input
scheduler 1102 and the output scheduler 1104 of task scheduler 1100
can be configured to ensure that the order in output 1108 (e.g.,
the fall-detection alarms, sanitized video clips, and ADLs) matches
the order of the raw video images in input 1106.
[0160] Note that multiple instances of task scheduler 1100 can be
chained/coupled in series to form a processing pipeline, with each
node (i.e., each instance of task scheduler 1100) of the processing
pipeline performing a specific task. For example, FIG. 12
illustrates an exemplary processing pipeline 1200 comprising two
task scheduler nodes based on the above-described task scheduler
coupled in series in accordance with some embodiments described
herein. As shown in FIG. 12, the first scheduler node (i.e., Node
0) includes two NNIE workers (NNIE0 and NNIE1) configured to
perform the above-described pose estimation tasks, whereas the
second scheduler node (i.e., Node 1) employs three CPU cores (CPU0,
CPU1, and CPU2) in parallel to perform the above-described face
detection and recognition tasks. Node 0/Scheduler 0 can receive raw
video images as input 1202, whereas Node 1/Scheduler 1 can generate
certain fall detection output such as person-IDs as output
1204.
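The following minimal sketch uses Python threads and queues as
stand-ins for the CPU/NNIE workers to illustrate the scheduling
pattern of FIGS. 11 and 12: an input scheduler tags each frame with a
sequence number, a worker pool processes frames in parallel, and an
output scheduler restores the original frame order. The class name
and the worker functions are illustrative, not part of the
disclosure.

```python
import heapq
import queue
import threading

class TaskScheduler:
    """Sketch of one scheduler node: parallel workers plus in-order output."""

    def __init__(self, task_fn, num_workers):
        self.task_fn = task_fn
        self.in_q, self.out_q = queue.Queue(), queue.Queue()
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            seq, item = self.in_q.get()
            self.out_q.put((seq, self.task_fn(item)))

    def run(self, items):
        for seq, item in enumerate(items):  # input scheduler
            self.in_q.put((seq, item))
        pending, next_seq, results = [], 0, []
        while next_seq < len(items):  # output scheduler: restore input order
            heapq.heappush(pending, self.out_q.get())
            while pending and pending[0][0] == next_seq:
                results.append(heapq.heappop(pending)[1])
                next_seq += 1
        return results

# Chaining two nodes in series, as in FIG. 12 (functions are placeholders):
# poses = TaskScheduler(estimate_pose, num_workers=2).run(raw_frames)
# ids = TaskScheduler(recognize_face, num_workers=3).run(poses)
```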
[0161] In some embodiments, to speed up the various neural network
modules used by the disclosed embedded fall-detection system,
certain computationally-intensive layers within a given neural
network module can be redesigned using ARM NEON instructions.
[0162] Note that while the various techniques for modifying and
optimizing existing models and frameworks to implement the
disclosed embedded fall-detection system 100 and the various task
scheduling techniques are described in the scope of fall-detection
systems, the concepts of the disclosed modification, optimization,
and task-scheduling techniques can be applied to other
similar embedded systems, not just fall-detection systems.
Proposed Fall Risk Assessment System
[0163] This patent disclosure also discloses various embodiments of
a video-based fall risk assessment system (or simply "fall risk
assessment system" hereinafter) including various software modules
for implementing various video-based fall risk assessment
functionalities. The disclosed fall risk assessment system can
include various software modules for processing videos captured by
cameras or other forms of image/video sensors of a subject and
subsequently generating fall-risk-assessment results including fall
risk warnings/notifications based on the captured videos for the
subject. The disclosed fall risk assessment system can also be
integrated into embedded fall-detection system 100 as a function
module to make independent fall risk assessment decisions as well
as to assist other modules within the disclosed embedded
fall-detection system to make fall detection decisions. However,
the disclosed fall risk assessment system can also be implemented
as a stand-alone fall-risk-assessment system by including one or
more cameras for capturing videos of a monitored person, one or
more processors for processing the captured videos, and one or more
Human Computer Interaction (or "HCI") devices. In various
embodiments, the HCI devices can include, but are not limited to,
mobile devices, computer monitors, speakers, keyboards, and
computer mice.
[0164] FIG. 13 illustrates a block diagram of the disclosed fall
risk assessment system 1300 in accordance with some embodiments
described herein. As can be seen in FIG. 13, fall risk assessment
system 1300 includes: a pose-estimation module 1306, an
action-recognition module 1308, a gait-feature extraction module
1310, a gait analysis module 1312, and a controlled fall-risk test
module 1314. However, other embodiments of the disclosed fall risk
assessment system can include additional function modules or omit
one or more of the function modules shown in fall risk assessment
system 1300 without departing from the scope of the present
disclosure.
[0165] Pose-estimation module 1306 in fall risk assessment system
1300 can be implemented based on the above-described
pose-estimation module 106 of embedded fall-detection system 100.
In some embodiments, pose-estimation module 1306 is identical to
pose-estimation module 106. Note that pose-estimation module 1306
can receive a video 1302 which includes a sequence of video frames
as input and generate cropped images 1332 and human keypoints 1322
of a detected person corresponding to the sequence of video frames
as outputs.
[0166] Action-recognition module 1308 in fall risk assessment
system 1300 can be implemented based on the above-described
action-recognition module 108 of embedded fall-detection system
100. In some embodiments, action-recognition module 1308 is
substantially identical to action-recognition module 108. Note that
action-recognition module 1308 can include an action classifier
1328 configured to classify each detected person as being in one of
a set of pre-defined actions, referred to as the action
label/classification for the detected person. In some embodiments,
action classifier 1328 can be configured to use only cropped image
1332 of the detected person to classify the action for the detected
person. In some other embodiments, action classifier 1328 can be
configured to use only the human keypoints 1322 of the detected
person to classify the action for the detected person. In still
other embodiments, action classifier 1328 can be configured to use
the combined data of cropped image 1332 and human keypoints 1322 of
the detected person to classify the action for the detected
person.
[0167] However, for fall risk assessment applications, action
classifier 1328 in action-recognition module 1308 can be designed
differently from action classifier 128 in action-recognition module
108. For example, action classifier 1328 can be configured to
classify each detected person as being in one of a set of
pre-defined actions of interests that is different from the set of
pre-defined actions of interests associated with action classifier
128. An exemplary set of pre-defined actions associated with action
classifier 1328 can include the following four types of actions:
(1) standing; (2) sitting; (3) walking; and (4) other actions.
Similarly to action classifier 128, a CNN-based architecture can be
used to construct action classifier 1328. In some embodiments, to
perform the above-described action classification in
action-recognition module 1308, 4 classes of data can be collected
based on the above-described 4 types of actions, which can then be
used to train a neural network, e.g., a CNN, to classify the 4 types
of actions. For each detected person in video 1302,
action-recognition module 1308 can generate a sequence of action
labels 1324, wherein each label in the sequence of action labels
1324 represents the action of the detected person in the
corresponding video frame.
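As a toy stand-in for the CNN-based classifier described above, the
following sketch maps a single frame's human keypoints to one of the
four action classes using a small fully connected network in PyTorch.
The 18-keypoint input size and the layer widths are assumptions; a
real deployment would train a network of this kind on the four
classes of collected data.

```python
import torch
import torch.nn as nn

ACTIONS = ["standing", "sitting", "walking", "other"]

class ActionClassifier(nn.Module):
    """Illustrative 4-class action classifier over flattened keypoints."""

    def __init__(self, num_keypoints=18, num_actions=len(ACTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, x):  # x: (batch, num_keypoints * 2) of (x, y) pairs
        return self.net(x)

def classify_actions(model, keypoint_batch):
    """Map a batch of flattened keypoint vectors to action labels."""
    model.eval()
    with torch.no_grad():
        logits = model(keypoint_batch)
    return [ACTIONS[int(i)] for i in logits.argmax(dim=1)]
```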
[0168] As can be seen in FIG. 13, gait-feature extraction module
1310 in fall risk assessment system 1300 is configured to receive
the outputs from both pose-estimation module 1306 (such as human
keypoints 1322) and action-recognition module 1308 (such as action
labels 1324), and subsequently extract useful gait features 1330 of
the detected person based on these output data for further
analysis. In some embodiments, for a sequence of estimated-poses of
a detected person generated by pose-estimation module 1306, various
gait features of the detected person can be extracted from a subset
of video frames classified with "walking" action labels 1324. For
example, gait-feature extraction module 1310 can be configured to
extract certain basic walking-step (or "step") statistics from this
subset of video frames, which can include, but are not limited to, a
step count, an average step duration, a variance of step duration, a
walking speed, and a cadence. Note that these basic step
statistics can be extracted for one foot or both feet of the
detected person. Moreover, gait-feature extraction module 1310 can
also be configured to determine a "step balance" feature of walking
by comparing the differences between the corresponding basic
statistics extracted for the two feet of the detected person. As
another example, gait-feature extraction module 1310 can also be
configured to extract a "body sway" factor by measuring the offset
of the chest keypoint (e.g., referring to chest keypoint 312 in
FIG. 3) in the horizontal direction relative to the center of the two
hip keypoints (e.g., referring to hip keypoints 326 and 328 in FIG.
3). A person skilled in the art can readily appreciate that
gait-feature extraction module 1310 can be configured to extract
and output a wide range of gait-related features for the detected
person based on analyzing the outputs from pose-estimation module
1306 and action-recognition module 1308, and hence is not limited
to the few examples described above.
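For concreteness, a minimal sketch of one such feature, the per-frame
body sway factor, is shown below. Normalizing the horizontal chest
offset by the chest-to-hip-center distance (the upper-body size) is
an assumption chosen to mirror the percentage-based thresholds used
later in this disclosure.

```python
import numpy as np

def body_sway_factor(chest, left_hip, right_hip):
    """Horizontal offset of the chest keypoint relative to the center of
    the two hip keypoints, expressed as a fraction of the upper-body size
    (the chest-to-hip-center distance). Keypoints are (x, y) pixel
    coordinates."""
    chest = np.asarray(chest, dtype=float)
    hip_center = (np.asarray(left_hip, dtype=float) +
                  np.asarray(right_hip, dtype=float)) / 2.0
    upper_body_size = np.linalg.norm(chest - hip_center)
    if upper_body_size == 0.0:
        return 0.0  # degenerate pose; avoid division by zero
    return abs(chest[0] - hip_center[0]) / upper_body_size
```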
[0169] Further referring to FIG. 13, gait analysis module 1312 in
fall risk assessment system 1300 is configured to receive the
extracted gait features 1330 from gait-feature extraction module
1310, and subsequently analyze the gait features 1330 collected
over a period of time to generate a fall risk assessment. In some
embodiments, the disclosed fall risk assessment system 1300 can
continuously capture videos of a monitored person's daily activity
and continuously analyze the captured videos. Based on the captured
videos, gait analysis module 1312 is configured to accumulate the
extracted gait features 1330, including, but not limited to,
step count, average step duration, variance of step duration for
one foot or both feet, speed, cadence, step balance, and body sway
factor for a predetermined period of time (e.g., an hour, a day, or a
week).
[0170] Gait analysis module 1312 is further configured to analyze
each extracted gait feature over the predetermined time period
(e.g., hourly, daily, or weekly) to estimate a fall risk of the
monitored person based on the analyses. In some embodiments, gait
analysis module 1312 can perform one or more statistical analyses
on a given extracted gait feature 1330 using the data collected
over the predetermined time period. For example, temporal
variations of the extracted gait features 1330 over time can be
determined. Based on the deviations of these values from their mean
values, or from the values measured for the same gait features over
previous time periods, an abnormal behavior can be identified, e.g.,
by comparing the determined deviations with predefined threshold
values, or by using Kalman-filter-based anomaly detection. For
example, if the step count of a monitored person during a day is
determined to have dropped to or below a predetermined threshold
value, e.g., 100 steps, this indicates a lack of mobility, which is
considered to be linked to a high fall risk. As another
example, if the computed median value of the body sway factor of a
monitored person exceeds a predetermined percentage (e.g., 15%) of
the upper body size (e.g., based on a distance from chest keypoint
312 to the center of hip keypoints 326 and 328), the balance of the
monitored person is considered poor, which is linked to a high fall
risk. In some embodiments, instead of comparing statistical values
to the predetermined thresholds to detect high fall risks,
Kalman-filter-based anomaly detection can be applied to the
statistical values to detect high fall risks. In some embodiments,
gait analysis module 1312 is configured to generate a
high-fall-risk warning 1340 as the output of fall risk assessment
system 1300, which can be sent to a caregiver (e.g., through the
associated mobile app) when an anomalous behavior is detected.
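A minimal sketch of the threshold-based variant of this analysis is
shown below, using the illustrative values from the examples above
(100 steps per day and a 15% median body sway); the
Kalman-filter-based alternative is not shown.

```python
import numpy as np

MIN_DAILY_STEP_COUNT = 100   # illustrative threshold from the example above
MAX_MEDIAN_BODY_SWAY = 0.15  # 15% of the upper-body size

def assess_fall_risk(daily_step_count, body_sway_factors):
    """Return high-fall-risk warnings for one predetermined time period
    (e.g., one day) based on simple threshold checks over the accumulated
    gait features."""
    warnings = []
    if daily_step_count <= MIN_DAILY_STEP_COUNT:
        warnings.append("lack of mobility: daily step count at or below threshold")
    if body_sway_factors and np.median(body_sway_factors) > MAX_MEDIAN_BODY_SWAY:
        warnings.append("poor balance: median body sway exceeds threshold")
    return warnings
```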
[0171] In some embodiments, the disclosed fall risk assessment
system 1300 can be used to perform certain fall risk tests in a
controlled environment. In particular, fall risk assessment system
1300 includes a controlled fall-risk test module 1314 (or
"fall-risk test module 1314" hereinafter) configured to control the
tests. These tests can be initiated either by the subject, i.e.,
the person being tested, or by caregivers of the subject. When fall
risk assessment system 1300 is integrated with embedded
fall-detection system 100, the visual and voice instructions of the
fall risk tests can be given through mobile app 212 of
fall-detection system 200. However, if fall risk assessment system
1300 is implemented as a stand-alone system or integrated into
other fall-detection systems, other HCI devices (e.g., monitors,
speakers etc.) can be used to provide visual and voice instructions
of the fall risk tests. We now describe examples of the fall risk
tests and how to use the disclosed fall risk assessment system 1300
in these tests.
[0172] A standing-and-three-meter walking test is a standard test
to measure the subject's mobility. Before the test, a chair can be
placed as the starting position, and a marker can be placed three
(3) meters in front of the chair. At the beginning of the test, the
subject will be sitting in the chair. Next, the subject himself/herself
or the caregiver will start the test via the associated mobile-app,
or some other HCI device. When configured to control the test,
controlled fall-risk test module 1314 will trigger the
standing-and-three-meter-walking test sequence by sending a
starting signal to the subject via the mobile-app or other HCI
devices. After receiving the starting signal of the test, the
subject needs to stand up, walk 3 meters forward, turn around, walk
back to the chair and sit back in the chair again. Fall-risk test
module 1314 is configured to measure the time for completing the
test and use the measured time as an indicator of fall risk, e.g.,
the more time the subject takes to complete the test, the higher the
predicted fall risk. In some embodiments, if the measured time
exceeds a predetermined threshold time, fall-risk test module 1314
is configured to generate a high-fall-risk warning 1340 as the
output of fall risk assessment system 1300.
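One way to derive the completion time from per-frame action labels is
sketched below: the test is taken to run from the first
sitting-to-standing transition until the subject sits back down.
Timing the test from action-label transitions is an assumption; the
disclosure states only that fall-risk test module 1314 measures the
completion time.

```python
def walking_test_seconds(action_labels, fps):
    """Estimate the completion time of the standing-and-three-meter-walking
    test from per-frame action labels produced at `fps` frames per second.
    Returns None if the test cannot be segmented from the labels."""
    start = None
    for i in range(1, len(action_labels)):
        prev, cur = action_labels[i - 1], action_labels[i]
        if start is None and prev == "sitting" and cur != "sitting":
            start = i  # subject stood up: the test begins
        elif start is not None and prev != "sitting" and cur == "sitting":
            return (i - start) / fps  # subject sat back down: the test ends
    return None
```

The returned time can then be compared against a predetermined
threshold (e.g., the 12-second value mentioned below) to decide
whether to generate high-fall-risk warning 1340.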
[0173] Note that fall-risk test module 1314 is coupled to
gait-feature extraction module 1310, and configured to receive
extracted gait features 1330, such as the step count, the average and
variance of step duration, and the body sway factor. Similar to gait
analysis module 1312, fall-risk test module 1314 can also be
configured to analyze each extracted gait feature over a
predetermined time period (e.g., hourly, daily, or weekly) to
estimate a fall risk of the monitored person based on the analyses,
and subsequently generate a high-fall-risk warning 1340 as the
output of fall risk assessment system 1300 if one or more particular
gait features exceed predefined thresholds. For example, if the
total walking test time exceeds 12 seconds, the subject is deemed
to have low mobility and a high fall risk. As another example, if
the step count during the walking test exceeds 14 steps, or the
median value of the body sway factor of a subject exceeds 15% of the
upper body size (e.g., based on the distance from chest keypoint
312 to the center of hip keypoints 326 and 328), the subject's
balance is deemed poor and therefore the subject is considered to
have a high fall risk.
[0174] A 30-second-sit-and-stand fall risk test can be used to
estimate the subject's lower limb strength and mobility. In this
test, the potential fall risk can be determined based on the number
of sit-stand actions successfully performed by the subject.
Generally, the higher the number of sit-stand actions completed by
the subject, the lower the fall risk associated with the subject.
Usually, a chair is used in this test. At the
beginning of the test, the subject will be sitting in the chair.
Next, the subject or the caregiver will start the test via the
associated mobile-app, or other HCI devices. When configured to
control the test, controlled fall-risk test module 1314 will
trigger the 30-second-sit-and-stand test sequence by sending a
starting signal to the subject via the mobile-app or other HCI
devices. After receiving the starting signal of the test, the
subject needs to continuously perform stand-up and sit-down
actions. At the end of the 30-second period, fall-risk test module
1314 is configured to send out an ending signal to the subject via
the mobile-app or other HCI devices. Note that fall-risk test
module 1314 is also coupled to action-recognition module 1308 to
receive action labels 1324. Because the subject can be monitored by
fall risk assessment system 1300 during the test, fall-risk test
module 1314 can be configured to determine the number of stand-up
and sit-down actions during the 30-second period based on counting
a number of "standing"-action-label to "sitting"-action-label
transitions generated by action-recognition module 1308. In some
embodiments, fall-risk test module 1314 is configured to generate a
high-fall-risk warning 1340 as the output of fall risk assessment
system 1300 if the determined number of stand-up and sit-down
actions is lower than a predetermined threshold value (e.g., 10),
because such a low number indicates low limb strength and poor
mobility, which are linked to a high fall risk.
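The transition counting described above can be sketched as follows;
because only completed "standing"-to-"sitting" transitions are
counted, the result corresponds to the number of full sit-stand
cycles within the 30-second window.

```python
def count_sit_stand_actions(action_labels):
    """Count the "standing"-to-"sitting" action-label transitions produced
    by the action-recognition module during the 30-second test window."""
    return sum(1 for prev, cur in zip(action_labels, action_labels[1:])
               if prev == "standing" and cur == "sitting")
```

For example, a returned count below the threshold of 10 mentioned
above would trigger high-fall-risk warning 1340.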
[0175] A balancing test can be used to test the subject's ability
to balance, which can be an effective indicator of fall risk.
During such a test, the subject will be asked to perform a series
of standing postures including, but not limited to: (1)
standing with two feet in normal standing posture; (2) placing the
instep of one foot so it is touching the big toe of the other foot;
(3) placing one foot in front of the other, heel of the front foot
touching the toe of the back foot; and (4) standing on just one
foot. Typically, at the beginning of each stage of the tests, the
subject will be standing. The subject or the caregiver will start
the test via the associated mobile-app, or other HCI devices. When
configured to control the test, controlled fall-risk test module
1314 will trigger the balancing test sequence by sending a voice
instruction of the specific standing posture and a starting signal
to the subject via the associated mobile-app or other HCI devices.
After receiving the starting signal of the test, the subject needs
to stand in the instructed posture for a predetermined period of
time, e.g., 10 seconds. At the end of the time period, controlled
fall-risk test module 1314 is configured to send out an ending
signal to the subject via the associated mobile-app or other HCI
devices. Because the subject can be monitored by fall risk
assessment system 1300 during the balancing test, fall-risk test
module 1314 can receive the extracted gait features from
gait-feature extraction module 1310 based on the movement of the
subject during the test. Fall-risk test module 1314 can be
configured to analyze the movement of the subject's feet and body
sway factor during the balancing test. If fall-risk test module
1314 detects any foot movement, or determines that the body sway
factor exceeds a predetermined threshold, the balancing test can be
considered failed and fall-risk test module 1314 is configured to
generate a high-fall-risk warning 1340 as the output of fall risk
assessment system 1300. For example, if the maximum value of the
body sway factor of the subject exceeds 25% of the upper body size
(e.g., based on the distance from chest keypoint 312 to the center
of hip keypoints 326 and 328), the subject's balance is considered
poor and the subject is determined to be of high fall risk.
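A minimal sketch of the pass/fail decision for the balancing test is
given below. The pixel tolerance used to decide whether a foot has
moved is an assumed parameter; the 25% maximum-sway threshold follows
the example above.

```python
import numpy as np

def balancing_test_failed(ankle_positions, body_sway_factors,
                          movement_tolerance=5.0, max_sway=0.25):
    """Fail the test if either foot moves more than a small pixel tolerance
    from its initial position, or if the maximum body sway factor exceeds
    25% of the upper-body size. `ankle_positions` has shape
    (num_frames, 2 feet, 2 coordinates)."""
    feet = np.asarray(ankle_positions, dtype=float)
    foot_movement = np.linalg.norm(feet - feet[0], axis=-1).max()
    return foot_movement > movement_tolerance or max(body_sway_factors) > max_sway
```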
[0176] In some embodiments, one or more of the disclosed gait-based
analysis modules 1310, 1312, and 1314 of the disclosed fall risk
assessment system 1300 can also be used to detect and predict
certain diseases of a subject, such as Parkinson's disease.
Moreover, gait-feature extraction module 1310 by itself or in
combination with gait-analysis module 1312 can also be integrated
into embedded fall-detection system 100 to improve fall-detection
accuracy and reliability in collaboration with other modules within
embedded fall-detection system 100. In such
embodiments, the extracted gait features and gait-features analysis
results can be used as auxiliary information in making
fall/non-fall decisions. For example, the fall detection decisions
made by fall detection module 110 for a given person can be
verified or reinforced by a high-fall-risk warning generated by
gait-analysis module 1312 for the same person. In some embodiments,
gait-analysis module 1312 can also be configured to generate
independent fall detection decisions based on the received gait
features 1330 from gait-feature extraction module 1310.
[0177] FIG. 14 presents a flowchart illustrating an exemplary
process 1400 for performing a video-based fall risk assessment in
accordance with some embodiments described herein. In one or more
embodiments, one or more of the steps in FIG. 14 may be omitted,
repeated, and/or performed in a different order. Accordingly, the
specific arrangement of steps shown in FIG. 14 should not be
construed as limiting the scope of the technique.
[0178] Process 1400 may begin by receiving a sequence of video
images/frames captured during a predetermined time period (e.g., an
hour, a day, or a week) including a person being monitored for
fall risk assessment (step 1402). For example, the video
images/frames may be captured by a camera installed at the home of
the person or at a clinic. Next, for a given video frame in the
sequence of video frames, process 1400 detects the person in the
video frame, and subsequently estimates a pose for the detected
person (step 1404). For example, process 1400 can first identify a
set of human keypoints for the detected person and then generate a
skeleton diagram/stick figure of the detected person by connecting
neighboring keypoints with straight lines. In various embodiments,
step 1404 can be performed by the disclosed pose-estimation module
1306 of fall risk assessment system 1300. As a result, step 1404
generates a sequence of estimated poses corresponding to the
sequence of video frames.
[0179] Next, for the sequence of estimated poses, process 1400
classifies each of the estimated poses of the detected person as a
particular action within a set of pre-defined actions, such as (1)
standing; (2) sitting; (3) walking; and (4) other actions (step
1406). In some embodiments, before performing step 1406, 4 classes
of data can be collected based on the above-described 4 types of
actions, which can then be used to train a neural network, e.g., a
CNN, to classify the 4 types of actions. In various embodiments,
step 1406 can be performed by the disclosed action-recognition
module 1308 of fall risk assessment system 1300. As a result, step
1406 generates a sequence of action labels based on the sequence of
estimated poses corresponding to the sequence of video frames.
[0180] Next, process 1400 identifies a subset of action labels
classified as "walking" actions within the sequence of action
labels (step 1408). Process 1400 then extracts a set of gait
features for the detected person from a subset of video frames
within the sequence of video frames corresponding to the subset of
action labels based on the estimated poses associated with the
subset of video frames (step 1410). In some embodiments, these gait
features can include, but are not limited to, step count, average
step duration, variance of step duration for one foot or both feet,
speed, cadence, step balance, and body sway factor of the detected
person. Process 1400 subsequently analyzes each of the extracted
gait features collected over the predetermined time period to
generate a fall risk assessment for the detected person (step
1412). In some embodiments, process 1400 can perform one or more
statistical analyses on a given extracted gait feature using the
data collected over the predetermined time period. In some
embodiments, if process 1400 generates a high-fall-risk assessment,
process 1400 is also configured to trigger a high-fall-risk warning
to be sent to the caregivers. Note that process 1400 can
continuously receive and process new sequences of video frames
corresponding to the same predetermined time periods and
continuously assess the fall risk for the person based on the
new sequences of video frames.
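Tying the steps together, the following sketch wires the hypothetical
helpers from the earlier snippets into the overall shape of process
1400. The keypoint indices are assumptions, and the daily step count
is assumed to come from a separate step-detection routine that is not
shown here.

```python
CHEST, LEFT_HIP, RIGHT_HIP = 1, 8, 11  # assumed keypoint indices

def video_fall_risk_assessment(video_frames, pose_estimator,
                               action_classifier, daily_step_count):
    """End-to-end shape of process 1400 using the helpers sketched earlier."""
    keypoints = [pose_estimator(f) for f in video_frames]                 # step 1404
    labels = [action_classifier(kp) for kp in keypoints]                  # step 1406
    walking = [kp for kp, a in zip(keypoints, labels) if a == "walking"]  # step 1408
    sway = [body_sway_factor(kp[CHEST], kp[LEFT_HIP], kp[RIGHT_HIP])
            for kp in walking]                                            # step 1410
    return assess_fall_risk(daily_step_count, sway)                       # step 1412
```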
[0181] The various illustrative logical blocks, modules, circuits,
and algorithm steps described in connection with the embodiments
disclosed herein may be implemented as electronic hardware,
computer software, or combinations of both. To clearly illustrate
this interchangeability of hardware and software, various
illustrative components, blocks, modules, circuits, and steps have
been described above generally in terms of their functionality.
Whether such functionality is implemented as hardware or software
depends upon the particular application and design constraints
imposed on the overall system. Skilled artisans may implement the
described functionality in varying ways for each particular
application, but such implementation decisions should not be
interpreted as causing a departure from the scope of the present
invention.
[0182] The hardware used to implement the various illustrative
logics, logical blocks, modules, and circuits described in
connection with the aspects disclosed herein may be implemented or
performed with a general purpose processor, a digital signal
processor (DSP), an application specific integrated circuit (ASIC),
a field programmable gate array (FPGA) or other programmable logic
device, discrete gate or transistor logic, discrete hardware
components, or any combination thereof designed to perform the
functions described herein. A general-purpose processor may be a
microprocessor, but, in the alternative, the processor may be any
conventional processor, controller, microcontroller, or state
machine. A processor may also be implemented as a combination of
computing devices, e.g., a combination of a DSP and a
microprocessor, a plurality of microprocessors, one or more
microprocessors in conjunction with a DSP core, or any other such
configuration. Alternatively, some steps or methods may be
performed by circuitry that is specific to a given function.
[0183] In one or more exemplary aspects, the functions described
may be implemented in hardware, software, firmware, or any
combination thereof. If implemented in software, the functions may
be stored as one or more instructions or code on a non-transitory
computer-readable storage medium or non-transitory
processor-readable storage medium. The steps of a method or
algorithm disclosed herein may be embodied in processor-executable
instructions that may reside on a non-transitory computer-readable
or processor-readable storage medium. Non-transitory
computer-readable or processor-readable storage media may be any
storage media that may be accessed by a computer or a processor. By
way of example but not limitation, such non-transitory
computer-readable or processor-readable storage media may include
RAM, ROM, EEPROM, FLASH memory, CD-ROM or other optical disk
storage, magnetic disk storage or other magnetic storage devices,
or any other medium that may be used to store desired program code
in the form of instructions or data structures and that may be
accessed by a computer. Disk and disc, as used herein, includes
compact disc (CD), laser disc, optical disc, digital versatile disc
(DVD), floppy disk, and Blu-ray disc where disks usually reproduce
data magnetically, while discs reproduce data optically with
lasers. Combinations of the above are also included within the
scope of non-transitory computer-readable and processor-readable
media. Additionally, the operations of a method or algorithm may
reside as one or any combination or set of codes and/or
instructions on a non-transitory processor-readable storage medium
and/or computer-readable storage medium, which may be incorporated
into a computer program product.
[0184] While this patent document contains many specifics, these
should not be construed as limitations on the scope of any
invention or of what may be claimed, but rather as descriptions of
features that may be specific to particular embodiments of
particular inventions. Certain features that are described in this
patent document and attached appendix in the context of separate
embodiments can also be implemented in combination in a single
embodiment. Conversely, various features that are described in the
context of a single embodiment can also be implemented in multiple
embodiments separately or in any suitable subcombination. Moreover,
although features may be described above as acting in certain
combinations and even initially claimed as such, one or more
features from a claimed combination can in some cases be excised
from the combination, and the claimed combination may be directed
to a subcombination or variation of a subcombination.
[0185] Similarly, while operations are depicted in the drawings in
a particular order, this should not be understood as requiring that
such operations be performed in the particular order shown or in
sequential order, or that all illustrated operations be performed,
to achieve desirable results. Moreover, the separation of various
system components in the embodiments described in this patent
document and attached appendix should not be understood as
requiring such separation in all embodiments.
[0186] Only a few implementations and examples are described and
other implementations, enhancements and variations can be made
based on what is described and illustrated in this patent document
and attached appendix.
* * * * *