U.S. patent application number 14/582743 was filed with the patent office on December 24, 2014 and published on 2016-02-25 as publication number 20160054779 for managing power performance of distributed computing systems. The applicants listed for this patent are Devadatta BODAS, Andy HOFFMAN, Sunil MAHAWAR, Muralidhar RAJAPPA, Joseph A. SCHAEFER, and Justin SONG, to whom the invention is also credited.

Publication Number: 20160054779
Application Number: 14/582743
Family ID: 55348281
Publication Date: 2016-02-25

United States Patent Application 20160054779
Kind Code: A1
BODAS; Devadatta; et al.
February 25, 2016
MANAGING POWER PERFORMANCE OF DISTRIBUTED COMPUTING SYSTEMS
Abstract
A method of managing power and performance of a High-performance
computing (HPC) system, including: determining a power budget for
a HPC system, wherein the HPC system includes a plurality of
interconnected HPC nodes operable to execute a job; determining a
power and cooling capacity of the HPC system; allocating the power
budget to the job to maintain a power consumption of the HPC system
within the power budget and the power and cooling capacity of the
HPC system; and executing the job on selected HPC nodes, is
shown.
Inventors: BODAS; Devadatta (Federal Way, WA); RAJAPPA; Muralidhar
(Chandler, AZ); SONG; Justin (Olympia, WA); HOFFMAN; Andy (Olympia,
WA); SCHAEFER; Joseph A. (Beaverton, OR); MAHAWAR; Sunil (Portland, OR)
Applicant:

Name                 City         State  Country
BODAS; Devadatta     Federal Way  WA     US
RAJAPPA; Muralidhar  Chandler     AZ     US
SONG; Justin         Olympia      WA     US
HOFFMAN; Andy        Olympia      WA     US
SCHAEFER; Joseph A.  Beaverton    OR     US
MAHAWAR; Sunil       Portland     OR     US
Family ID: 55348281
Appl. No.: 14/582743
Filed: December 24, 2014
Related U.S. Patent Documents

Application Number  Filing Date   Patent Number
62040576            Aug 22, 2014
Current U.S. Class: 700/291; 700/295

Current CPC Class: G06F 9/5094 20130101; G06F 1/30 20130101; H04L
43/08 20130101; G06F 1/3203 20130101; G06Q 50/06 20130101; G06F
1/3228 20130101; G05B 15/02 20130101; G06F 1/3234 20130101; G06F
1/3209 20130101; H04L 47/821 20130101; G06F 1/324 20130101; G06F
9/4881 20130101; H04L 47/783 20130101; G06F 1/329 20130101; H04L
41/0833 20130101; G06F 9/4893 20130101; G06F 1/3296 20130101; Y02D
10/00 20180101; Y04S 40/00 20130101

International Class: G06F 1/32 20060101 G06F001/32; G05B 15/02
20060101 G05B015/02; G06F 1/30 20060101 G06F001/30; G06Q 50/06
20060101 G06Q050/06
Claims
1. A method of managing power and performance of a High-performance
computing (HPC) system, comprising: determining a power budget for
a HPC system, wherein the HPC system includes a plurality of
interconnected HPC nodes operable to execute a job; determining a
power and cooling capacity of the HPC system; allocating the power
budget to the job to maintain a power consumption of the HPC system
within the power budget and the power and cooling capacity of the
HPC system; and executing the job on selected HPC nodes.
2. The method of claim 1, wherein the selected HPC nodes are
selected based on power characteristics of the nodes.
3. The method of claim 2, wherein the power characteristics of the
HPC nodes are determined based on running of sample workloads.
4. The method of claim 1, wherein the allocating the power budget
to the job is based on an estimate of power required to execute the
job.
5. The method of claim 4, wherein the estimate of the required
power to execute the job is based on at least one of a monitored
power, an estimated power, and a calibrated power.
6. The method of claim 1, wherein determining the power budget for
the HPC system is performed by communicating to a utility provider
through a demand/response interface.
7. The method of claim 1, wherein determining the power and
cooling capacity of the HPC system includes monitoring and
reporting failures of power delivery and cooling
infrastructures.
8. The method of claim 7, further comprising adjusting the power
consumption of the HPC system in response to the failure of the
power and cooling infrastructures.
9. The method of claim 1, wherein the allocating the power budget
to the job and executing the job on selected HPC nodes are governed
by power performance policies.
10. A method of managing power and performance of a
High-performance computing (HPC) system, comprising: defining a
hard power limit based on a thermal and power delivery capacity of
a HPC facility, wherein the HPC facility includes a plurality of HPC
systems, and the HPC system includes a plurality of interconnected
HPC nodes operable to execute a job; defining a soft power limit
based on a power budget allocated to the HPC facility; allocating
the power budget to the job to maintain an average power
consumption of the HPC facility below the soft power limit;
executing the job on nodes while maintaining the soft power limit
at or below the hard power limit; and allocating the power budget
to the job and executing the job on the nodes according to power
performance policies.
11. The method of claim 10, wherein the hard power limit decreases
in response to failures of the power and cooling infrastructures of
the HPC facility.
12. The method of claim 10, wherein allocating the power budget to
the job is based on an estimate of a required power to execute the
job.
13. The method of claim 10, wherein the power performance policies
are based on at least one of a HPC facility policy, a utility
provider policy, a HPC administrative policy, and a user
policy.
14. A computer readable medium having stored thereon sequences of
instruction which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising: determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job; determining a power and cooling capacity
of the HPC system; allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system; and executing
the job on selected HPC nodes.
15. The computer readable medium of claim 14, wherein the selected
HPC nodes to execute the job are selected based in part on power
characteristics of the nodes.
16. The computer readable medium of claim 15, wherein the power
characteristics of the HPC nodes are determined upon running a
sample workload.
17. The computer readable medium of claim 14, wherein the
allocating the power budget to the job is based in part on an
estimate of a required power to execute the job.
18. The computer readable medium of claim 17, wherein the estimate
of the required power to execute the job is in part based upon at
least one of a monitored power, an estimated power, and a
calibrated power.
19. The computer readable medium of claim 14, wherein determining
the power budget for the HPC system is performed in part by
communicating to a utility provider through a demand/response
interface.
20. The computer readable medium of claim 14, wherein the
allocating the power budget to the job and executing the job on
selected HPC nodes are governed by power performance policies.
21. A system for managing power and performance of a
High-performance computing (HPC) system, comprising: a HPC Facility
Power Manager to determine a power budget for the HPC system,
wherein the HPC system includes a plurality of interconnected HPC
nodes operable to execute a job; an out of band mechanism to
monitor and report a cooling and power capacity of the HPC system
to a HPC System Power Manager; the HPC System Power Manager to
allocate the power budget to the job within limitations of the
cooling and power capacity of the HPC system; and a job manager to
execute the job on selected nodes.
22. The system of claim 21, wherein the HPC System Power Manager
selects the selected HPC nodes to execute the job based in part on
power characteristics of the nodes.
23. The system of claim 21, wherein the HPC System Power Manager
allocates power to the job based in part on an estimated power
required to run the job.
24. The system of claim 21, wherein the out of band mechanism
monitors and reports failures of power delivery and cooling
infrastructures of the HPC system to the HPC System Power
Manager.
25. The system of claim 21, wherein the HPC Facility Power Manager
communicates capacity and requirements of the HPC system to a
utility provider through a demand/response interface.
Description
[0001] The present application claims the benefit of prior U.S.
Provisional Patent Application No. 62/040,576, entitled "SIMPLE
POWER-AWARE SCHEDULER TO LIMIT POWER CONSUMPTION BY HPC SYSTEM
WITHIN A BUDGET" filed on Aug. 22, 2014, which is hereby
incorporated by reference in its entirety.
[0002] The present application is related to the U.S. patent
application Ser. No. ______ (Attorney Docket No. 42P73498) entitled
______ filed ______; the U.S. patent application Ser. No. ______
(Attorney Docket No. 42P74562) entitled ______ filed ______; the
U.S. patent application Ser. No. ______ (Attorney Docket No.
42P74563) entitled ______ filed ______; the U.S. patent application
Ser. No. ______ (Attorney Docket No. 42P74564) entitled ______
filed ______; the U.S. patent application Ser. No. ______ (Attorney
Docket No. 42P74565) entitled ______ filed ______; the U.S. patent
application Ser. No. ______ (Attorney Docket No. 42P74566) entitled
______ filed ______; the U.S. patent application Ser. No. ______
(Attorney Docket No. 42P74568) entitled ______ filed ______; and
the U.S. patent application Ser. No. ______ (Attorney Docket No.
42P74569) entitled "A POWER AWARE JOB SCHEDULER AND MANAGER FOR A
DATA PROCESSING SYSTEM", filed ______.
FIELD
[0003] Embodiments of the invention relate to the field of computer
systems; and more specifically, to the methods and systems of power
management and monitoring of high performance computing
systems.
BACKGROUND
[0004] A High Performance Computing (HPC) system performs parallel
computing by simultaneous use of multiple nodes to execute a
computational assignment referred to as a job. Each node typically
includes processors, memory, operating system, and I/O components.
The nodes communicate with each other through a high speed network
fabric and may use shared file systems or storage. The job is
divided in thousands of parallel tasks distributed over thousands
of nodes. These tasks synchronize with each other hundreds of times
a second. Usually a HPC system can consume megawatts of power.
[0005] Growing usage of HPC systems in recent years has made
power management a concern in the industry. Future systems are
expected to deliver higher performance while operating under a
power constrained environment. However, current methods used to
manage power and cooling in traditional servers cause a degradation
of performance.
[0006] The most commonly used power management systems use an out
of band mechanism to enforce both power allocation and system
capacity limits. Commonly used approaches to limit power usage of an
HPC system, such as Running Average Power Limit (RAPL), Node Manager
(NM), and Datacenter Manager (DCM), use a power capping methodology.
These power management systems define and enforce a power cap for
each layer of HPC systems (e.g., Datacenter, Processors, Racks,
Nodes, etc.) based on the limits. However, the power allocation in
this methodology is not tailored to increase performance. For
example, Node Managers allocate equal power to the nodes within
their power budget. However, if nodes under the same power
conditions operate at different performance levels, such a
variation in performance of the nodes results in degradation of the
overall performance of the HPC system.
[0007] Furthermore, today's HPC facilities communicate their demand
for power to utility companies months in advance. Lacking a proper
monitoring mechanism to forecast power consumption, such demands
are usually made equal to or greater than the maximum power for a
worst case workload a facility can use. However, the actual power
consumption is usually expected to be lower and so the unused power
is wasted.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The invention may best be understood by referring to the
following description and accompanying drawings that are used to
illustrate embodiments of the invention. In the drawings:
[0009] FIG. 1 illustrates an exemplary block diagram of an overall
architecture of a power management and monitoring system in
accordance with one embodiment.
[0010] FIG. 2 illustrates an exemplary block diagram of overall
interaction architecture of HPC Power-Performance Manager in
accordance with one embodiment.
[0011] FIG. 3 illustrates an exemplary block diagram showing an
interaction between the HPC facility power manager and other
components of the HPC facility.
[0012] FIG. 4 illustrates an exemplary block diagram showing an
interaction between the HPC System Power Manager, a Rack
Manager, and a Node Manager.
[0013] FIG. 5 illustrates the HPPM response mechanism at the node
level in case of power delivery or cooling failures.
[0014] FIG. 6 illustrates an exemplary block diagram of a HPC
system receiving various policy instructions.
[0015] FIG. 7 illustrates an exemplary block diagram showing the
interaction between the HPC Resource Manager and other components
of the HPC System.
[0016] FIG. 8 illustrates an exemplary block diagram of the
interaction of the Job Manager with Power Aware Job Launcher
according to power performance policies.
[0017] FIG. 9 illustrates one embodiment of a process for power
management and monitoring of high performance computing
systems.
[0018] FIG. 10 illustrates another embodiment of a process for
power management and monitoring of high performance computing
systems.
DESCRIPTION OF EMBODIMENTS
[0019] The following description describes methods and apparatuses
for power management and monitoring of high performance computing
systems. In the following description, numerous specific details
such as specific power policies, particular power management
devices, etc. are set forth in order to provide a more thorough
understanding of the present invention. It will be appreciated,
however, by one skilled in the art that the invention may be
practiced without such specific details.
[0020] References in the specification to "one embodiment," "an
embodiment," "an example embodiment," etc., indicate that the
embodiment described may include a particular feature, structure,
or characteristic, but every embodiment may not necessarily include
the particular feature, structure, or characteristic. Moreover,
such phrases are not necessarily referring to the same embodiment.
Further, when a particular feature, structure, or characteristic is
described in connection with an embodiment, it is submitted that it
is within the knowledge of one skilled in the art to effect such a
feature, structure, or characteristic in connection with other
embodiments whether or not explicitly described.
[0021] As discussed above, embodiments described herein relate to
the power management and monitoring for high performance computing
systems. According to various embodiments of the invention, a
framework for workload-aware, hierarchical, and holistic management
and monitoring of power and performance is disclosed.
[0022] FIG. 1 illustrates an example of a power management and
monitoring system for HPC systems according to one embodiment. The
system is referred to herein as an HPC Power-Performance Manager
(HPPM). In this example, HPC System 400 includes multiple
components including Resource Manager 410, Job Manager 420,
Datacenter Manager 310, Rack Manager 430, Node Manager 431, and
Thermal Control 432. In one embodiment, HPPM receives numerous
power performance policy inputs at different stages of
management. In one embodiment, power performance policies include a
facility policy, a utility provider policy, a facility
administrative policy, and a user policy.
[0023] HPC System Power Manager 300 communicates the capacity and
requirements of HPC System 400 to HPC Facility Power Manager 200.
HPC Facility Power Manager 200 then communicates the power
allocated by the utility provider back to HPC System Power Manager
300. In one embodiment, HPC System Power Manager 300 also receives
administrative policies from HPC System Administrator 202.
[0024] In order to properly allocate power to HPC System 400, in
one embodiment HPC System Power Manager 300 receives the power and
thermal capacity of HPC System 400 and maintains the average power
consumption of HPC System 400 at or below the allocation. In one
embodiment, a soft limit is defined in part by the power available
for the allocation. In one embodiment, the soft limit includes the
power allocated to each HPC system within HPPM and the power
allocated to each job. In one embodiment, the job manager 420
enforces the soft limit to each job based on the power consumption
of each node.
[0025] Furthermore, the power consumption of HPC System 400 never
exceeds the power and thermal capacity of the cooling and power
delivery infrastructures. In one embodiment, a hard limit is
defined by the power and thermal capacity of the cooling and power
delivery infrastructures. In one embodiment, the hard limit defines
power and cooling capability available for the nodes, racks,
systems and datacenters within a HPC facility. The cooling and
power infrastructures may or may not be shared by different
elements of the HPC facility. In one embodiment, the hard limit
fluctuates in response to failures in cooling and power delivery
infrastructures, while the soft limit remains at or below the hard
limit at any time.
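The invariant between the two limits can be made concrete with a short sketch. The following Python fragment is purely illustrative; the PowerDomain class and its method names are hypothetical and do not appear in the application:

```python
class PowerDomain:
    """Illustrative model of a power domain (node, rack, system, or
    datacenter). The hard limit tracks physical power and cooling
    capacity; the soft limit tracks the allocated budget and, per the
    text above, must stay at or below the hard limit at any time."""

    def __init__(self, hard_limit_w: float, soft_limit_w: float):
        self.hard_limit_w = hard_limit_w
        self.soft_limit_w = min(soft_limit_w, hard_limit_w)

    def set_hard_limit(self, watts: float) -> None:
        """Hard limit fluctuates with infrastructure failures/recovery."""
        self.hard_limit_w = watts
        self.soft_limit_w = min(self.soft_limit_w, watts)

    def allocate(self, watts: float) -> None:
        """Soft limit follows the budget but never exceeds capacity."""
        self.soft_limit_w = min(watts, self.hard_limit_w)


rack = PowerDomain(hard_limit_w=40_000, soft_limit_w=35_000)
rack.set_hard_limit(30_000)  # e.g., a cooling failure reduces capacity
assert rack.soft_limit_w <= rack.hard_limit_w
```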
[0026] HPC System Power Manager 300 uses Out of Band mechanism 301
(e.g., Node Manager 431, Thermal Control 432, Rack Manager 430 and
Datacenter Manager 310) to monitor and manage the hard limit for
each component. In one embodiment, the Out of Band mechanism 301,
unlike In Band mechanism 302, uses an independent embedded
controller outside the system with an independent networking
capability to perform its operation.
[0027] To maintain the power consumption of HPC System 400 within
the limits (both the hard limit and the soft limit) and to increase
energy efficiency, HPC System Power Manager 300 allocates power to
the jobs. In one embodiment, the allocation of power to the jobs is
based on the dynamic monitoring and power-aware management of
Resource Manager 410 and Job Manager 420 further described below.
In one embodiment, Resource Manager 410 and Job Manager 420 are
operated by In-Band mechanism 302. In one embodiment, In Band
mechanism 302 uses the system network and software for monitoring,
communication, and execution.
[0028] An advantage of embodiments described herein is that the
power consumption is managed by allocating power to the jobs. As
such, power is allocated in a way that causes a
significant reduction in the performance variations of the nodes
and, subsequently, an improvement in job completion time. In other
words, the power allocated to a particular job is distributed among
the nodes dedicated to running the job in such a way as to achieve
increased performance.
[0029] FIG. 2 illustrates an example of interactions between
different components of HPC Power-Performance Manager 100. It is
pointed out that those elements of FIG. 2 having the same reference
numbers (or names) as the elements of any other figure can operate
or function in any manner similar to that described, but are not
limited to such. The lines connecting the blocks represent
communication between different components of a HPPM.
[0030] In one embodiment, these communications include
communicating, for example, the soft and hard limits for each
component of the HPPM 100, reporting the power and thermal status
of the components, reporting failures of power and thermal
infrastructures, and communicating the available power for the
components, etc. In one embodiment, HPPM 100 includes multiple
components divided between multiple datacenters within a HPC
facility. HPPM 100 also includes power and cooling resources shared
by the components. In one embodiment, each datacenter includes a
plurality of server racks, and each server rack includes a plurality
of nodes.
[0031] In one embodiment, HPPM 100 manages power and performance of
the system by forming a dynamic hierarchical management and
monitoring structure. The power and thermal status of each layer is
regularly monitored by a managing component and reported to a
higher layer. The managing component of the higher layer aggregates
the power and thermal conditions of its lower components and
reports it to its higher layer. Conversely, the higher managing
component ensures the allocation of power to its lower layers is
based upon the current power and thermal capacity of their
components.
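As a rough sketch of this hierarchy, the following Python fragment shows one way such bottom-up aggregation could look; the Manager class and its field names are hypothetical, not from the application:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Manager:
    """Hypothetical managing component (node, rack, or datacenter
    manager). Each layer aggregates the power draw and capacity of its
    lower components and reports the totals to the layer above."""
    name: str
    local_power_w: float = 0.0
    local_capacity_w: float = 0.0
    children: List["Manager"] = field(default_factory=list)

    def aggregate(self) -> Tuple[float, float]:
        power, capacity = self.local_power_w, self.local_capacity_w
        for child in self.children:
            p, c = child.aggregate()  # lower layers report upward
            power, capacity = power + p, capacity + c
        return power, capacity


node = Manager("node-0", local_power_w=350.0, local_capacity_w=500.0)
rack = Manager("rack-0", children=[node])
datacenter = Manager("datacenter-0", children=[rack])
print(datacenter.aggregate())  # (350.0, 500.0), summed up the hierarchy
```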
[0032] For example, in one embodiment, HPC Facility Power Manager
200 distributes power to multiple datacenters and resources shared
within the HPC facility. HPC Facility Power Manager 200 receives
the aggregated report of the power and thermal conditions of the
HPC facility from Datacenter Manager 210. In one embodiment,
Datacenter Manager 210 is the highest managing component of HPPM
100. Datacenter Manager 210 is the higher managing component of
a plurality of datacenters. Each datacenter is managed by a
datacenter manager, such as for example, Datacenter Manager 310.
Datacenter Manager 310 is the higher managing component of a
plurality of server racks. Each server rack includes a plurality of
nodes. In one embodiment, Datacenter Manager 310 is a managing
component for the nodes of an entire or part of a server rack while
in other embodiments Datacenter Manager 310 is a managing component
for nodes of multiple racks. Each node is managed by a node
manager. For example, each of Nodes 500 is managed by Node Manager
431. Node Manager 431 monitors and manages power consumption and
thermal status of its associated node.
[0033] Datacenter Manager 310 is also a higher managing component
for the power and cooling resources shared by a plurality of the
nodes. Each shared power and cooling resource is managed by a rack
manager, for example the Rack Manager 430. In one embodiment,
a plurality of nodes shares multiple power and cooling resources, each
managed by a rack manager. In one embodiment, HPC Facility Power
Manager 200 sends the capacity and requirements of the HPC facility
to a utility provider. HPC Facility Power Manager 200 distributes
the power budget to the HPC System Power Manager associated with each
HPC System (e.g., the HPC System Power Manager 300). HPC System
Power Manager 300 determines how much power to allocate to each
job. Job Manager 420 manages power performance of a job within the
budget allocated by the HPC System Power Manager 300. Job Manager
420 manages a job throughout its life cycle by controlling the
power allocation and frequencies of Nodes 500.
[0034] In one embodiment, if a power or thermal failure occurs on
any lower layers of Datacenter Manager 310, Datacenter Manager 310
immediately warns HPC System Power Manager 300 of the change in
power or thermal capacity. Subsequently, HPC System Power Manager
300 adjusts the power consumption of the HPC system by changing the
power allocation to the jobs.
[0035] FIG. 3 demonstrates the role of HPC Facility Power Manager
200 in more detail. It is pointed out that those elements of FIG.
3 having the same reference numbers (or names) as the elements of
any other figure can operate or function in any manner similar to
that described, but are not limited to such. In one embodiment, HPC
Facility 101 includes HPC Facility Power Manager 200, Power
Generator and Storage 210, Power Convertor 220, Cooling System 230
that may include storage of a cooling medium, and several HPC
systems including the HPC System 400. Each HPC system is managed by
a HPC System Power Manager (e.g., HPC System Power Manager 300
manages HPC System 400).
[0036] In one embodiment, HPC Facility Power Manager 200 manages
the power consumption of HPC Facility 101. HPC Facility Power
Manager 200 receives facility level policies from the Facility
Administrator 102. In one embodiment, the facility level policies
relate to selecting a local source of power, environmental
considerations, and the overall operation policy of the facility.
HPC Facility Power Manager 200 also communicates with Utility
Provider 103. In one embodiment, HPC Facility Power Manager 200
communicates its forecasted capacity and requirements of HPC
Facility 101 in advance to the Utility Provider 103. In one
embodiment, HPC Facility 101 uses a Demand/Response interface to
communicate with Utility Provider 103.
[0037] In one embodiment, the Demand/Response interface provides a
non-proprietary interface that allows the Utility Provider 103 to
send signals about electricity price and system grid reliability
directly to customers, e.g., HPC Facility 101. The dynamic
monitoring allows HPC Facility Power Manager 200 to more
accurately estimate the required power and communicate its capacity
and requirements automatically to Utility Provider 103. This method
allows costs to be reduced based on real-time prices and
reduces the disparity between the power allocated by the Utility
Provider 103 and the power actually used by the Facility 101.
[0038] In one embodiment, HPPM determines a power budget at a given
time based upon the available power from Utility Provider 103, the
cost of the power from Utility Provider 103, the available power in
the local Power Generator and Storage 210, and actual demand by the
HPC systems. In one embodiment, HPPM substitutes the energy from
the utility provider with the energy from local storage or
electricity generators. In one embodiment, HPPM receives the
current price of electricity and makes the electricity produced by
Power Generator and Storage 210 available for sale on the
market.
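A minimal sketch of such a budget decision, assuming only the inputs named in this paragraph (utility availability and price, local generation and storage, actual demand), might look as follows; the function and the fixed example figures are hypothetical:

```python
def determine_power_budget(utility_available_w: float, utility_price: float,
                           local_available_w: float, local_price: float,
                           demand_w: float) -> dict:
    """Illustrative budget decision: cover actual demand, never exceed
    total availability, and draw from the cheaper source first."""
    budget_w = min(demand_w, utility_available_w + local_available_w)
    if local_price < utility_price:
        from_local_w = min(budget_w, local_available_w)
        from_utility_w = budget_w - from_local_w
    else:
        from_utility_w = min(budget_w, utility_available_w)
        from_local_w = budget_w - from_utility_w
    # Unused local generation could be offered for sale on the market.
    sellable_w = local_available_w - from_local_w
    return {"budget_w": budget_w, "from_utility_w": from_utility_w,
            "from_local_w": from_local_w, "sellable_w": sellable_w}


# Example: 1.8 MW demand; the cheaper local generation is consumed first.
print(determine_power_budget(2_000_000, 0.12, 500_000, 0.08, 1_800_000))
```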
[0039] FIG. 4 illustrates how HPC System Power Manager 300 manages
shared power supply among nodes using a combination of Rack Manager
430 and Node Manager 440. It is pointed out that those elements of
FIG. 4 having the same reference numbers (or names) as the elements
of any other figure can operate or function in any manner similar
to that described, but are not limited to such.
[0040] In one embodiment, Rack Manager 430 reports the status of
the shared resources and receives power limits from Datacenter
Manager 310. Node Manager 440 reports node power consumption and
receives node power limits from Datacenter Manager 310. Similarly,
Datacenter Manager 310 reports system power consumption to HPC
System Power Manager 300. The communication between HPC System
Power Manager 300 and Datacenter Manager 310 facilitates monitoring
of the cooling and power delivery infrastructure in order to
maintain the power consumption within the hard limit. In one
embodiment, HPC System Power Manager 300 maintains the power
consumption of the nodes or processors by adjusting the power
allocated to them.
[0041] In one embodiment, in case a failure of the power supply or
cooling systems results in a sudden reduction of available power,
the hard limit is reduced automatically by either or both of Rack
Manager 430 and Node Manager 440 to a lower limit to avoid a
complete failure of the power supply. Subsequently, the sudden
reduction of available power is reported to HPC System Power Manager
300 through Datacenter Manager 310 by either or both of Rack
Manager 430 and Node Manager 440, so that HPC System Power Manager
300 can readjust the power allocation accordingly.
[0042] FIG. 5 illustrates the HPPM response mechanism at the node
level in case of power delivery or cooling failures. It is pointed out
that those elements of FIG. 5 having the same reference numbers (or
names) as the elements of any other figure can operate or function
in any manner similar to that described, but are not limited to
such.
[0043] In one embodiment, a cooling and power delivery failure does
not impact all nodes equally. Once Node Manager 431 identifies the
impacted nodes, for example Nodes 500, it will adjust the
associated hard limit for Nodes 500. This hard limit is then
communicated to Job Manager 420. Job Manager 420 adjusts the soft
limit associated with Nodes 500 to maintain both the soft limit and the
power consumption of Nodes 500 at or below the hard limit. In one
embodiment, the frequency of the communication between Node Manager
431 and Job Manager 420 is in milliseconds.
[0044] In one embodiment, a faster response is required to avoid
further power failure of the system. As such, Node Manager 431
directly alerts Nodes 500. The alert imposes a restriction on Nodes
500 and causes an immediate reduction of power consumption by Nodes
500. In one embodiment, such a reduction could be more than
necessary to avoid further power failures. Subsequently Node
Manager 431 communicates the new hard limit to Job Manager 420. Job
Manager 420 adjusts the soft limits of Nodes 500 to maintain the
power consumption of Nodes 500 at or below the hard limit. Job
Manager 420 then enforces the new hard limit and removes the alert
asserted by Node Manager 431.
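The two response paths in paragraphs [0043] and [0044], a fast direct alert followed by a slower soft-limit readjustment, can be sketched as below; the Node class and function names are hypothetical stand-ins for the Node Manager and Job Manager roles:

```python
from typing import Optional


class Node:
    """Hypothetical node with an emergency throttle."""
    def __init__(self) -> None:
        self.power_cap_w: Optional[float] = None
        self.alerted = False

    def alert_throttle(self, watts: float) -> None:
        self.power_cap_w, self.alerted = watts, True

    def clear_alert(self) -> None:
        self.alerted = False


def on_cooling_failure(node: Node, soft_limit_w: float,
                       reduced_hard_w: float) -> float:
    # Fast path: the Node Manager alerts the node directly, forcing an
    # immediate (possibly over-conservative) power reduction.
    node.alert_throttle(reduced_hard_w)
    # Slower path (milliseconds later): the Job Manager learns the new
    # hard limit, pulls the soft limit down to or below it, enforces
    # it, and removes the alert asserted by the Node Manager.
    new_soft_w = min(soft_limit_w, reduced_hard_w)
    node.power_cap_w = new_soft_w
    node.clear_alert()
    return new_soft_w


node = Node()
print(on_cooling_failure(node, soft_limit_w=400.0, reduced_hard_w=300.0))
```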
[0045] Referring to FIG. 6, an exemplary block diagram of a HPC
system receiving various inputs is illustrated. It is pointed out
that those elements of FIG. 6 having the same reference numbers (or
names) as the elements of any other figure can operate or function
in any manner similar to that described, but are not limited to
such. In one embodiment described herein, HPC system 400 includes
one or more operating system (OS) nodes 501, one or more compute
nodes 502, one or more input/output (I/O) nodes 503 and a storage
system 504. The high-speed fabric 505 communicatively connects the
OS nodes 501, compute nodes 502, I/O nodes 503, and storage
system 504. The high-speed fabric may be a network topology of nodes
interconnected via one or more switches. In one embodiment, as
illustrated in FIG. 6, I/O nodes 503 are communicatively connected
to storage 504. In one embodiment, storage 504 is non-persistent
storage such as volatile memory (e.g., any type of random access
memory "RAM"); persistent storage such as non-volatile memory
(e.g., read-only memory "ROM", power-backed RAM, flash memory,
phase-change memory, etc.), a solid-state drive, hard disk drive,
an optical disc drive, or a portable memory device.
[0046] The OS nodes 501 provide a gateway to accessing the compute
nodes 502. For example, prior to submitting a job for processing on
the compute nodes 502, a user may be required to log in to HPC
system 400, which may be through OS nodes 501. In embodiments
described herein, OS nodes 501 accept jobs submitted by users and
assist in the launching and managing of jobs being processed by
compute nodes 502.
[0047] In one embodiment, compute nodes 502 provide the bulk of the
processing and computational power. I/O nodes 503 provide an
interface between compute nodes 502 and external devices (e.g.,
separate computers) that provide input to HPC system 400 or
receive output from HPC system 400.
[0048] The limited power allocated to HPC system 400 is used by HPC
system 400 to run one or more of jobs 520. Jobs 520 comprise one or
more jobs requested to be run on HPC system 400 by one or more
users, for example User 201. Each job includes a power policy,
which will be discussed in-depth below. The power policy will
assist the HPC System Power Manager in allocating power for the job
and aid in the management of the one or more jobs 520 being run by
HPC system 400.
[0049] In addition, HPC System Administrator 202 provides
administrative policies to guide the management of running jobs 520
by providing an over-arching policy that defines the operation of
HPC system 400. In one embodiment, examples of policies in the
administrative policies include, but are not limited or restricted
to, (1) a policy to increase utilization of all hardware and
software resources (e.g., instead of running fewer jobs at high
power and leaving resources unused, run as many jobs as possible to
use as much of the resources as possible); (2) a job with no power
limit is given the highest priority among all running jobs; and/or
(3) suspended jobs are at higher priority for resumption. Such
administrative policies govern the way the HPC System Power Manager
schedules, launches, suspends and re-launches one or more jobs.
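The three example administrative policies above can be read as a priority ordering. A minimal Python sketch follows; the Job fields and the key function are hypothetical illustrations, not the application's scheduler:

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    power_limited: bool   # False => "no power limit", highest priority
    suspended: bool       # suspended jobs resume before new jobs start
    submit_order: int


def scheduling_key(job: Job) -> tuple:
    """Illustrative sort key for the example policies: unlimited-power
    jobs first, then suspended jobs awaiting resumption, then the rest
    in submission order."""
    return (job.power_limited,  # False sorts before True
            not job.suspended,  # suspended jobs come next
            job.submit_order)


queue = [Job("a", True, False, 1), Job("b", True, True, 2),
         Job("c", False, False, 3)]
print([j.name for j in sorted(queue, key=scheduling_key)])  # c, b, a
```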
[0050] User 201 policy can be specific to a particular job. User
201 can instruct HPC System 400 to run a particular job with no
power limit or according to a customized policy. Additionally, User
201 can set the energy policy of a particular job, for example,
most efficient or highest performance.
[0051] As shown in FIG. 1, HPC System Administrator 202 and User
201 communicate their policies to the HPC System Power Manager 300
and Resource Manager 410. In one embodiment, Resource Manager 410
receives these policies and formulates them into "modes" under
which Job Manager 420 instructs OS Nodes 501, CPU Nodes 502, and IO
Node 503 to operate.
[0052] FIG. 7 shows the flow of information between Resource
Manager 410 (including Power Aware Job scheduler 411, and Power
Aware Job launcher 412) and other elements of the HPPM (HPC System
Power Manager 300, Estimator 413, Calibrator 414, and Job Manager
420). In one embodiment, the purpose of these communications is to
allocate sufficient hardware resources (e.g., nodes, processors,
memories, network bandwidth, etc.) and schedule execution of
appropriate jobs. In one embodiment, power is allocated to the jobs
in such a way to maintain HPC System 400 power within the limits,
increase energy efficiency, and control HPC system 400 rate of
power consumption change.
[0053] Referring to FIG. 7, to determine the amount of power to
allocate to each job, HPC System Power Manager 300 communicates with
Resource Manager 410. Power Aware Job scheduler 411 considers the
policies and priorities of Facility Administrator 102, Utility
Provider 103, User 201, and HPC System Administrator 202 and
determines accordingly what hardware resources of HPC System 400
are needed to run a particular job. Additionally, Power Aware Job
scheduler 411 receives power-performance characteristics of the job
at different operating points from Estimator 413 and Calibrator
414. Resource Manager 410 forecasts how much power a particular job
needs and takes corrective action when the actual power differs from
the estimate.
[0054] Estimator 413 provides Resource Manager 410 with estimates
of power consumption for each job enabling Resource Manager 410 to
efficiently schedule and monitor each job requested by one or more
job owners (e.g., users). Estimator 413 provides a power
consumption estimate based on, for example, maximum and average
power values stored in a calibration database, wherein the
calibration database is populated by the processing of Calibrator
414. In addition, the minimum power required for each job is
considered. Other factors that are used by Estimator 413 to create a
power consumption estimate include, but are not limited or
restricted to, whether the owner of the job permits the job to be
subject to a power limit, the job power policy limiting the power
supplied to the job (e.g., a predetermined fixed frequency at which
the job will run, a minimum power required for the job, or varying
frequencies and/or power supplied determined by Resource Manager
410), the startup power for the job, the frequency at which the job
will run, the available power to HPC System 400 and/or the
allocated power to HPC System 400.
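A simplified sketch of such an estimate, assuming a calibration database keyed by node and operating frequency as described in the next paragraph, could look as follows; the function name, structure, and figures are illustrative only:

```python
def estimate_job_power(nodes, calibration_db, frequency_ghz,
                       power_limited=True) -> dict:
    """Illustrative estimate: sum per-node calibrated values at the
    requested frequency. `calibration_db` maps (node, frequency) to a
    dict of calibrated average/maximum/minimum watts."""
    total_avg_w = total_max_w = total_min_w = 0.0
    for node in nodes:
        cal = calibration_db[(node, frequency_ghz)]
        total_avg_w += cal["avg_w"]
        total_max_w += cal["max_w"]
        total_min_w += cal["min_w"]
    # A job that may not be power-limited is budgeted at maximum power.
    estimate_w = total_avg_w if power_limited else total_max_w
    return {"estimate_w": estimate_w, "min_w": total_min_w,
            "max_w": total_max_w}


db = {("n0", 2.4): {"avg_w": 320.0, "max_w": 410.0, "min_w": 180.0},
      ("n1", 2.4): {"avg_w": 335.0, "max_w": 425.0, "min_w": 190.0}}
print(estimate_job_power(["n0", "n1"], db, 2.4))
```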
[0055] Calibrator 414 calibrates the power, thermal dissipation and
performance of each node within HPC System 400. Calibrator 414
provides a plurality of methods for calibrating the nodes within
HPC system 400. In one embodiment, Calibrator 414 provides a first
method of calibration in which every node within HPC system 400
runs sample workloads (e.g., a mini-application and/or a test
script) so Calibrator 414 may sample various parameters (e.g.,
power consumed) at predetermined time intervals in order to
determine, inter alia, (1) the average power, (2) the maximum
power, and (3) the minimum power for each node. In addition, the
sample workload is run on each node at every operating frequency of
the node. In another embodiment, Calibrator 414 provides a second
method of calibration in which calibration of one or more nodes
occurs during the run-time of a job. In such a situation,
Calibrator 414 samples the one or more nodes on which a job is
running (e.g., processing). In the second method, Calibrator 414
obtains power measurements of each node during actual run-time.
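The first calibration method lends itself to a simple loop: run the sample workload at every operating frequency and record average, maximum, and minimum power per (node, frequency) pair. In the hypothetical sketch below, a random value stands in for a real out-of-band power reading taken while the sample workload runs:

```python
import random


def read_node_power_w(node: str) -> float:
    """Placeholder for an out-of-band power sample taken while the
    sample workload runs; a real reading would come from telemetry."""
    return random.uniform(150.0, 450.0)


def calibrate_node(node: str, frequencies, samples: int = 100) -> dict:
    """Illustrative first-method calibration: sample power at
    predetermined intervals at every operating frequency of the node."""
    results = {}
    for freq in frequencies:
        readings = [read_node_power_w(node) for _ in range(samples)]
        results[(node, freq)] = {
            "avg_w": sum(readings) / len(readings),
            "max_w": max(readings),
            "min_w": min(readings),
        }
    return results


print(calibrate_node("n0", frequencies=[1.2, 1.8, 2.4]))
```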
[0056] In one embodiment, Power Aware Job Scheduler 411 is
configured to receive a selection of a mode for a job, to determine
an available power for the job based on the mode and to allocate a
power for the job based on the available power. In one embodiment,
Power Aware Job Scheduler 411 is configured to determine a uniform
frequency for the job based on the available power. In one
embodiment, the power aware job scheduler is configured to
determine the available power for the job based on at least one of
a monitored power, an estimated power, and a calibrated power.
[0057] Generally, a user submits a program to be executed ("job")
to a queue. The job queue refers to a data structure containing
jobs to run. In one embodiment, Power Aware Job Scheduler 411
examines the job queue at appropriate times (periodically or at
certain events, e.g., termination of previously running jobs) and
determines if resources, including the power needed to run the job,
can be allocated. In some cases, such resources can be allocated
only at a future time, and in such cases the job is scheduled to
run at a designated time in the future. Power Aware Job Launcher 412
selects a job among the jobs in the queue, based on available
resources and priority, and schedules it to be launched. In one
embodiment, in case the available power is limited, Power Aware Job
Launcher 412 will look at the operating points to select the one
which results in the highest frequency while maintaining the power
consumption below the limit.
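The selection rule at the end of this paragraph reduces to: among the job's operating points, pick the highest frequency whose estimated power stays below the limit. A hypothetical sketch, with all names and figures assumed for illustration:

```python
def pick_operating_point(operating_points, power_limit_w):
    """Illustrative selection: operating_points is a list of
    (frequency_ghz, estimated_power_w) pairs; return the highest
    feasible frequency, or None if the job must wait."""
    feasible = [(f, p) for f, p in operating_points if p < power_limit_w]
    if not feasible:
        return None  # schedule the job at a designated future time
    return max(feasible)  # tuples compare by frequency first


points = [(1.2, 220.0), (1.8, 300.0), (2.4, 410.0)]
print(pick_operating_point(points, power_limit_w=350.0))  # (1.8, 300.0)
```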
[0058] FIG. 8 illustrates the interaction of Job Manager 420 with
Power Aware Job Launcher 412 according to Power Performance
Policies 440. Once a job is launched, it is assigned a job manager,
for example Job Manager 420. Job Manager 420 manages power
performance of the job throughout its life cycle. In one
embodiment, Job Manager 420 is responsible for operating the job
within the constraints of one or more power policies and various
power limits after the job has been launched. In one embodiment,
for example, a user may designate "special" jobs that are not power
limited. Power Aware Job scheduler 411 will need to estimate the
maximum power the job could consume, and only start the job when
the power is available. HPC System Power Manager 300 redistributes
power among the normal jobs in order to reduce stranded power and
increase efficiency. But even if the allocated power for HPC System
400 falls, the workload manager ensures that these "special" jobs'
power allocations remain intact. In another example, a user may
specify the frequency for a particular job. In one embodiment, user
selection may be based upon a table that indicates degradation in
performance and reduction in power for each frequency.
[0059] Alternatively, the frequency selection for the jobs can be
automated based upon available power. In one embodiment, with
dynamic power monitoring, Job Manager 420 will adjust the frequency
periodically based upon power headroom. An advantage of embodiments
described herein is that a job will be allowed to operate at all
available frequencies. Job Manager 420 will determine the best mode
to run the job based upon the policies and priorities communicated
by Facility Administrator 102, Utility Provider 103, User 201, and
HPC System Administrator 202.
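The periodic headroom-based adjustment described above might, in very simplified form, look like the following; the step logic and the fixed headroom threshold are assumptions for illustration, not the application's algorithm:

```python
def adjust_frequency(current_ghz: float, available_freqs, power_w: float,
                     soft_limit_w: float, headroom_w: float = 25.0) -> float:
    """Illustrative periodic adjustment: step the frequency down when
    consumption nears the soft limit, up when headroom is ample."""
    freqs = sorted(available_freqs)
    i = freqs.index(current_ghz)
    if power_w > soft_limit_w - headroom_w and i > 0:
        return freqs[i - 1]  # back off toward the limit
    if power_w < soft_limit_w - 2 * headroom_w and i < len(freqs) - 1:
        return freqs[i + 1]  # spend the available headroom
    return current_ghz


print(adjust_frequency(1.8, [1.2, 1.8, 2.4], power_w=240.0,
                       soft_limit_w=320.0))  # steps up to 2.4
```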
[0060] FIG. 9 is a flow diagram of one embodiment of a process for
managing power and performance of HPC systems. The process is
performed by processing logic that may comprise hardware
(circuitry, dedicated logic, etc.), software (such as is run on a
general purpose computer system or a dedicated machine), firmware
or a combination of the three.
[0061] Referring to FIG. 9, at block 901, HPPM communicates
capacity and requirements of the HPC system to a utility provider.
In one embodiment, the capacity of the HPC system is determined
based on the cooling and power delivery capacity of the HPC system.
In one embodiment, the HPPM communicates its capacity and
requirements to the utility provider through a demand/response
interface. In one embodiment, the demand/response interface reduces
a cost for the power budget based on the capacity and requirements
of the HPC system and input from the utility provider. In one
embodiment, the demand/response interface communicates the capacity
and requirements of the HPC system through an automated
mechanism.
[0062] At block 902, HPPM determines a power budget for the HPC
system. In one embodiment, the power budget is determined based on
the cooling and power delivery capacity of the HPC system. In one
embodiment, the power budget is determined based on the power
performance policies. In one embodiment, the power performance
policies are based on at least one of a facility policy, a utility
provider policy, a facility administrative policy, and a user
policy.
[0063] At block 903, HPPM determines a power and cooling capacity
of the HPC system. In one embodiment, determining the power and
cooling capacity of the HPC system includes monitoring and
reporting failures of power delivery and cooling infrastructures.
In one embodiment, in case of a failure the power consumption is
adjusted accordingly. In one embodiment, determining the power and
cooling capacity of the HPC system is performed by an out of band
mechanism.
[0064] At block 904, HPPM allocates the power budget to the job to
maintain a power consumption of the HPC system within the power
budget and the power and cooling capacity of the HPC system. In one
embodiment, allocating the power budget to the job is based on
power performance policies. In one embodiment, allocating the power
budget to the job is based on an estimate of power required to
execute the job. In one embodiment, the estimate of the required
power to execute the job is based on at least one of a monitored
power, an estimated power, and a calibrated power.
[0065] At block 905, HPPM executes the job on selected HPC nodes.
In one embodiment, the selected HPC nodes are selected based on
power performance policies. In one embodiment, the selected HPC
nodes are selected based on power characteristics of the nodes. In
one embodiment, the power characteristics of the HPC nodes are
determined based on running of a sample workload. In one
embodiment, the power characteristics of the HPC nodes are
determined during runtime. In one embodiment, the job is
executed on the selected HPC nodes based on power performance
policies.
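Tying blocks 901-905 together, the overall flow can be sketched end to end; the StubHPPM object and every method name below are hypothetical stand-ins for the managers described above, used only to make the sketch runnable:

```python
class StubHPPM:
    """Minimal hypothetical stand-in for the HPPM components."""
    def communicate_capacity_to_utility(self) -> None:  # block 901
        pass
    def determine_power_budget(self) -> float:          # block 902
        return 1_000_000.0
    def determine_power_and_cooling(self) -> float:     # block 903
        return 900_000.0
    def select_nodes(self, job, watts):                 # node selection
        return ["n0", "n1"]
    def execute(self, job, nodes, watts) -> None:       # block 905
        print(job, nodes, watts)


def manage_power_and_performance(hppm: StubHPPM, job: str) -> None:
    hppm.communicate_capacity_to_utility()           # block 901
    budget_w = hppm.determine_power_budget()         # block 902
    capacity_w = hppm.determine_power_and_cooling()  # block 903
    allocation_w = min(budget_w, capacity_w)         # block 904: stay within
    nodes = hppm.select_nodes(job, allocation_w)     # budget and capacity
    hppm.execute(job, nodes, allocation_w)           # block 905


manage_power_and_performance(StubHPPM(), "job-42")
```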
[0066] FIG. 10 is a flow diagram of one embodiment of a process for
managing power and performance of HPC systems. The process is
performed by processing logic that may comprise hardware
(circuitry, dedicated logic, etc.), software (such as is run on a
general purpose computer system or a dedicated machine), firmware
or a combination of the three. Referring to FIG. 10, at block 1001,
HPPM defines a hard power limit based on a thermal and power
delivery capacity of a HPC facility. In one embodiment, the hard
power limit is managed and monitored by an out of band mechanism.
In one embodiment, the hard power limit decreases in response to
failures of the power and cooling infrastructures of the HPC
facility.
[0067] At block 1002, HPPM defines a soft power limit based on a
power budget allocated to the HPC facility. In one embodiment, the
power budget for the HPC facility is provided by a utility provider
through a demand/response interface. In one embodiment, the
demand/response interface reduces a cost for the power budget based
on the capacity and requirements of the HPC system and input from
the utility provider. In one embodiment, the demand/response
interface communicates the capacity and requirements of the HPC
system through an automated mechanism.
[0068] At block 1003, HPPM allocates the power budget to the job to
maintain an average power consumption of the HPC facility below the
soft power limit. In one embodiment, allocating the power budget to
the job is based on power performance policies. In one embodiment,
allocating the power budget to the job is based on an estimate of
power required to execute the job. In one embodiment, the estimate
of the required power to execute the job is based on at least one
of a monitored power, an estimated power, and a calibrated
power.
[0069] At block 1004, HPPM executes the job on nodes while
maintaining the soft power limit at or below the hard power limit.
In one embodiment, allocating the power budget to the job and
executing the job on the nodes is according to power performance
policies. In one embodiment, the power performance policies are
based on at least one of a HPC facility policy, a utility provider
policy, a HPC administrative policy, and a user policy.
[0070] Some portions of the preceding detailed descriptions have
been presented in terms of algorithms and symbolic representations
of transactions on data bits within a computer memory. These
algorithmic descriptions and representations are the ways used by
those skilled in the data processing arts to most effectively
convey the substance of their work to others skilled in the art. An
algorithm is here, and generally, conceived to be a self-consistent
sequence of transactions leading to a desired result. The
transactions are those requiring physical manipulations of physical
quantities. Usually, though not necessarily, these quantities take
the form of electrical or magnetic signals capable of being stored,
transferred, combined, compared, and otherwise manipulated. It has
proven convenient at times, principally for reasons of common
usage, to refer to these signals as bits, values, elements,
symbols, characters, terms, numbers, or the like.
[0071] It should be borne in mind, however, that all of these and
similar terms are to be associated with the appropriate physical
quantities and are merely convenient labels applied to these
quantities. Unless specifically stated otherwise as apparent from
the above discussion, it is appreciated that throughout the
description, discussions utilizing terms such as "processing" or
"computing" or "calculating" or "determining" or "displaying" or
the like, refer to the action and processes of a computer system,
or similar electronic computing device, that manipulates and
transforms data represented as physical (electronic) quantities
within the computer system's registers and memories into other data
similarly represented as physical quantities within the computer
system memories or registers or other such information storage,
transmission or display devices.
[0072] The algorithms and displays presented herein are not
inherently related to any particular computer or other apparatus.
Various general-purpose systems may be used with programs in
accordance with the teachings herein, or it may prove convenient to
construct more specialized apparatus to perform the required method
transactions. The required structure for a variety of these systems
will appear from the description above. In addition, embodiments of
the present invention are not described with reference to any
particular programming language. It will be appreciated that a
variety of programming languages may be used to implement the
teachings of embodiments of the invention as described herein.
[0073] In the foregoing specification, embodiments of the invention
have been described with reference to specific exemplary
embodiments thereof. It will be evident that various modifications
may be made thereto without departing from the broader spirit and
scope of the invention as set forth in the following claims. The
specification and drawings are, accordingly, to be regarded in an
illustrative sense rather than a restrictive sense.
[0074] Throughout the description, embodiments of the present
invention have been presented through flow diagrams. It will be
appreciated that the order of transactions and transactions
described in these flow diagrams are only intended for illustrative
purposes and not intended as a limitation of the present invention.
One having ordinary skill in the art would recognize that
variations can be made to the flow diagrams without departing from
the broader spirit and scope of the invention as set forth in the
following claims.
[0075] The following examples pertain to further embodiments:
[0076] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes.
[0077] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein the selected HPC nodes are selected
based on power performance policies.
[0078] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein the selected HPC nodes are selected
based on power characteristics of the nodes.
[0079] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein the selected HPC nodes are selected
based on power characteristics of the nodes determined based on
running of a sample workload.
[0080] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein the selected HPC nodes are selected
based on power characteristics of the nodes determined during
runtime.
[0081] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein the job is executed on the selected HPC
nodes based on power performance policies.
[0082] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein allocating the power budget to the job
is based on power performance policies.
[0083] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein the allocating the power budget to the
job is based on an estimate of power required to execute the
job.
[0084] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein the allocating the power budget to the
job is based on an estimate of power required to execute the job
determined based on at least one of a monitored power, an estimated
power, and a calibrated power.
[0085] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein determining the power budget for the
HPC system is based on the power and cooling capacity of the HPC
system.
[0086] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein determining the power budget for the
HPC system is performed by communicating to a utility provider
through a demand/response interface. In one embodiment, the
demand/response interface reduces a cost for the power budget based
on the capacity and requirements of the HPC system and inputs from
the utility provider. In one embodiment, the demand/response
interface communicates the capacity and requirements of the HPC
system through an automated mechanism.
[0087] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein determining the power and cooling
capacity of the HPC system includes monitoring and reporting
failures of power delivery and cooling infrastructures. In one
embodiment, the method further comprises adjusting the power
consumption of the HPC system in response to the failure of the
power and cooling infrastructures.
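One simple way to picture such an adjustment is a proportional back-off, sketched below in Python. The proportional policy and the feed and cooling-unit counts are assumptions of this sketch.

    def adjusted_budget_kw(current_budget_kw: float,
                           healthy_feeds: int, total_feeds: int,
                           healthy_coolers: int, total_coolers: int) -> float:
        """Scale the budget to the surviving infrastructure, taking the
        more constrained of power delivery and cooling."""
        power_fraction = healthy_feeds / total_feeds
        cooling_fraction = healthy_coolers / total_coolers
        return current_budget_kw * min(power_fraction, cooling_fraction)

    # One of four cooling units has failed: the budget drops to 75%.
    print(adjusted_budget_kw(1000.0, 4, 4, 3, 4))  # 750.0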
[0088] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, determining a
power budget for a HPC system, wherein the HPC system includes a
plurality of interconnected HPC nodes operable to execute a job,
determining a power and cooling capacity of the HPC system,
allocating the power budget to the job to maintain a power
consumption of the HPC system within the power budget and the power
and cooling capacity of the HPC system, and executing the job on
selected HPC nodes, wherein determining a power and cooling
capacity of the HPC system is performed by an out of band
mechanism.
[0089] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, defining a
hard power limit based on a thermal and power delivery capacity of
a HPC facility, wherein the HPC facility includes a plurality of HPC
systems, and the HPC system includes a plurality of interconnected
HPC nodes operable to execute a job, defining a soft power limit
based on a power budget allocated to the HPC facility, allocating
the power budget to the job to maintain an average power
consumption of the HPC facility below the soft power limit,
executing the job on nodes while maintaining the soft power limit
at or below the hard power limit, and allocating the power budget
to the job and executing the job on the nodes according to power
performance policies.
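The relationship between the two limits can be illustrated with a short Python sketch; the guard band and the numeric values are assumptions of this sketch.

    def soft_limit_kw(power_budget_kw: float, hard_limit_kw: float,
                      guard_band: float = 0.95) -> float:
        """Derive a soft limit from the allocated budget, clamped so it
        never exceeds the hard limit set by thermal and power delivery
        capacity."""
        return min(power_budget_kw, hard_limit_kw * guard_band)

    hard = 2000.0                          # from facility capacity
    soft = soft_limit_kw(1900.0, hard)     # 1900.0, at or below the hard limit
    print(soft, 1850.0 < soft)             # average power 1850 kW is compliant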
[0090] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, defining a
hard power limit based on a thermal and power delivery capacity of
a HPC facility, wherein the HPC facility includes a plurality of HPC
systems, and the HPC system includes a plurality of interconnected
HPC nodes operable to execute a job, defining a soft power limit
based on a power budget allocated to the HPC facility, allocating
the power budget to the job to maintain an average power
consumption of the HPC facility below the soft power limit,
executing the job on nodes while maintaining the soft power limit
at or below the hard power limit, and allocating the power budget
to the job and executing the job on the nodes according to power
performance policies, wherein the hard power limit decreases in
response to failures of the power and cooling infrastructures of
the HPC facility.
[0091] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, defining a
hard power limit based on a thermal and power delivery capacity of
a HPC facility, wherein the HPC facility includes a plurality of HPC
systems, and the HPC system includes a plurality of interconnected
HPC nodes operable to execute a job, defining a soft power limit
based on a power budget allocated to the HPC facility, allocating
the power budget to the job to maintain an average power
consumption of the HPC facility below the soft power limit,
executing the job on nodes while maintaining the soft power limit
at or below the hard power limit, and allocating the power budget
to the job and executing the job on the nodes according to power
performance policies, wherein allocating the power budget to the
job is based on an estimate of a required power to execute the
job.
[0092] A method of managing power and performance of a
High-performance computing (HPC) system, comprising, defining a
hard power limit based on a thermal and power delivery capacity of
a HPC facility, wherein the HPC facility includes a plurality of HPC
systems, and the HPC system includes a plurality of interconnected
HPC nodes operable to execute a job, defining a soft power limit
based on a power budget allocated to the HPC facility, allocating
the power budget to the job to maintain an average power
consumption of the HPC facility below the soft power limit,
executing the job on nodes while maintaining the soft power limit
at or below the hard power limit, and allocating the power budget
to the job and executing the job on the nodes according to power
performance policies, wherein the hard power limit is managed by an
out of band mechanism. In one embodiment, the power performance
policies are based on at least one of a HPC facility policy, a
utility provider policy, a HPC administrative policy, and a user
policy.
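The interaction of the four policy sources can be sketched as a layered merge. The precedence order (facility over utility over administrative over user) and the key names are assumptions of this sketch, offered only for illustration.

    def resolve_policy(*layers: dict) -> dict:
        """Merge policy layers; earlier arguments take precedence."""
        merged: dict = {}
        for layer in reversed(layers):   # apply lowest-priority layers first
            merged.update(layer)
        return merged

    facility = {"max_node_watts": 350}
    utility = {"peak_hours_cap_kw": 1200}
    admin = {"max_node_watts": 400, "min_frequency_ghz": 1.2}
    user = {"preferred_frequency_ghz": 2.4}

    # The facility's 350 W node cap overrides the administrative 400 W cap.
    print(resolve_policy(facility, utility, admin, user))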
[0093] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes.
[0094] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein the selected HPC nodes to
execute the job are selected based in part on power performance
policies.
[0095] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein the selected HPC nodes to
execute the job are selected based in part on the power
characteristics of the nodes. In one embodiment, the power
characteristics of the HPC nodes are determined upon running a
sample workload. In another embodiment, the power characteristics
of the HPC nodes are determined during an actual runtime.
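A sample-workload calibration could be sketched as follows; the telemetry function is simulated, and every name and value here is an assumption of this sketch rather than part of the disclosure.

    import random  # stands in for real power telemetry in this sketch

    def read_node_power(node_id: str) -> float:
        """Simulated reading; a real system would query the node's sensors."""
        return 250.0 + random.uniform(-20.0, 40.0)

    def calibrate_node(node_id: str, samples: int = 5) -> dict:
        """Run a sample workload and record the node's power characteristics."""
        readings = [read_node_power(node_id) for _ in range(samples)]
        return {"node": node_id,
                "min_watts": min(readings),
                "max_watts": max(readings),
                "avg_watts": sum(readings) / samples}

    # Nodes with lower calibrated power can be preferred under a tight budget.
    profiles = sorted((calibrate_node(f"node{i}") for i in range(4)),
                      key=lambda p: p["avg_watts"])
    print(profiles[0]["node"], "draws the least average power")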
[0096] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein the job is executed on the
selected HPC nodes based in part upon power performance
policies.
[0097] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein allocating the power budget
to the job is based in part upon power performance policies.
[0098] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein the allocating the power
budget to the job is based in part on an estimate of a required
power to execute the job. In one embodiment, the estimate of the
required power to execute the job is in part based upon at least
one of a monitored power, an estimated power, and a calibrated
power.
[0099] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein determining the power budget
for the HPC system is in part based upon the power and cooling
capacity of the HPC system.
[0100] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein determining the power budget
for the HPC system is performed in part by communicating to a
utility provider through a demand/response interface. In one
embodiment, the demand/response interface reduces a cost for the
power budget based on the capacity and requirements of the HPC
system and inputs from the utility provider. In one embodiment, the
demand/response interface communicates the capacity and
requirements of the HPC system through an automated mechanism.
[0101] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein determining the power and
cooling capacity of the HPC system includes monitoring and
reporting failures of power delivery and cooling infrastructures.
In one embodiment, the method further comprises adjusting the power
consumption of the HPC system in response to the failure of the
power and cooling infrastructures.
[0102] A computer readable medium having stored thereon sequences
of instructions which are executable by a system, and which, when
executed by the system, cause the system to perform a method,
comprising, determining a power budget for a HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, determining a power and cooling capacity
of the HPC system, allocating the power budget to the job such that
a power consumption of the HPC system stays within the power budget
and the power and cooling capacity of the HPC system, and executing
the job on selected HPC nodes, wherein determining a power and
cooling capacity of the HPC system is performed by an out of band
system.
[0103] A system for managing power and performance of a
High-performance computing (HPC) system, comprising, a HPC facility
manager to determine a power budget for the HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, an out of band mechanism to monitor and
report a cooling and power capacity of the HPC system to a HPC
system manager, the HPC system manager to allocate the power budget
to the job within limitations of the cooling and power capacity of
the HPC system, and a job manager to execute the job on selected
nodes.
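The cooperation of the four components named above might be pictured with the Python sketch below; every class and method name is illustrative, not taken from the disclosure.

    class OutOfBandMonitor:
        """Stand-in for out-of-band monitoring of cooling and power capacity."""
        def capacity_kw(self) -> float:
            return 1500.0                 # would come from sensors in practice

    class FacilityManager:
        """Stand-in for facility-level power budgeting."""
        def power_budget_kw(self) -> float:
            return 1200.0

    class SystemManager:
        def __init__(self, facility: FacilityManager, oob: OutOfBandMonitor):
            self.facility, self.oob = facility, oob

        def allocate(self, job_estimate_kw: float) -> float:
            """Allocate within both the budget and the reported capacity."""
            ceiling = min(self.facility.power_budget_kw(),
                          self.oob.capacity_kw())
            return min(job_estimate_kw, ceiling)

    class JobManager:
        def run(self, job: str, allocation_kw: float) -> None:
            print(f"running {job} under a {allocation_kw} kW allocation")

    system = SystemManager(FacilityManager(), OutOfBandMonitor())
    JobManager().run("job-42", system.allocate(900.0))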
[0104] A system for managing power and performance of a
High-performance computing (HPC) system, comprising, a HPC facility
manager to determine a power budget for the HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, an out of band mechanism to monitor and
report a cooling and power capacity of the HPC system to a HPC
system manager, the HPC system manager to allocate the power budget
to the job within limitations of the cooling and power capacity of
the HPC system, and a job manager to execute the job on selected
nodes, wherein the HPC facility manager, the HPC system manager,
and the job manager are governed by power performance policies. In
one embodiment, the power performance policies are in part based
upon at least one of a HPC facility policy, a utility provider
policy, a HPC administrative policy, and a user policy.
[0105] A system for managing power and performance of a
High-performance computing (HPC) system, comprising, a HPC facility
manager to determine a power budget for the HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, an out of band mechanism to monitor and
report a cooling and power capacity of the HPC system to a HPC
system manager, the HPC system manager to allocate the power budget
to the job within limitations of the cooling and power capacity of
the HPC system, and a job manager to execute the job on selected
nodes, wherein the HPC system manager selects the selected HPC
nodes to execute the job based in part on the power characteristics
of the nodes. In one embodiment, a calibrator runs a sample
workload on the HPC nodes and reports the power characteristics of
the HPC nodes to the HPC system manager. In another embodiment, a
calibrator determines the power characteristics of the HPC nodes
during an actual runtime and reports them to the HPC system
manager.
[0106] A system for managing power and performance of a
High-performance computing (HPC) system, comprising, a HPC facility
manager to determine a power budget for the HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, an out of band mechanism to monitor and
report a cooling and power capacity of the HPC system to a HPC
system manager, the HPC system manager to allocate the power budget
to the job within limitations of the cooling and power capacity of
the HPC system, and a job manager to execute the job on selected
nodes, wherein the HPC system manager allocates power to the job
based in part on an estimated power required to run the job. In one
embodiment, an estimator calculates the estimated power required to
run the job in part based upon at least one of a monitored power,
an estimated power, and a calibrated power.
[0107] A system for managing power and performance of a
High-performance computing (HPC) system, comprising, a HPC facility
manager to determine a power budget for the HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, an out of band mechanism to monitor and
report a cooling and power capacity of the HPC system to a HPC
system manager, the HPC system manager to allocate the power budget
to the job within limitations of the cooling and power capacity of
the HPC system, and a job manager to execute the job on selected
nodes, wherein the out of band mechanism monitors and reports
failures of power delivery and cooling infrastructures of a HPC
facility to the HPC facility manager.
[0108] A system for managing power and performance of a
High-performance computing (HPC) system, comprising, a HPC facility
manager to determine a power budget for the HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, an out of band mechanism to monitor and
report a cooling and power capacity of the HPC system to a HPC
system manager, the HPC system manager to allocate the power budget
to the job within limitations of the cooling and power capacity of
the HPC system, and a job manager to execute the job on selected
nodes, wherein the out of band mechanism monitors and reports
failures of power delivery and cooling infrastructures of the HPC
system to the HPC system manager.
[0109] A system for managing power and performance of a
High-performance computing (HPC) system, comprising, a HPC facility
manager to determine a power budget for the HPC system, wherein the
HPC system includes a plurality of interconnected HPC nodes
operable to execute a job, an out of band mechanism to monitor and
report a cooling and power capacity of the HPC system to a HPC
system manager, the HPC system manager to allocate the power budget
to the job within limitations of the cooling and power capacity of
the HPC system, and a job manager to execute the job on selected
nodes, wherein the HPC facility manager communicates capacity and
requirements of the HPC system to a utility provider through a
demand/response interface. In one embodiment, the demand/response
interface reduces a cost for the power budget based on the capacity
and requirements of the HPC system and inputs from the utility
provider. In one embodiment, the demand/response interface
communicates the capacity and requirements of the HPC system
through an automated mechanism.
[0110] In the foregoing specification, methods and apparatuses have
been described with reference to specific exemplary embodiments
thereof. It will be evident that various modifications may be made
thereto without departing from the broader spirit and scope of
embodiments as set forth in the following claims. The specification
and drawings are, accordingly, to be regarded in an illustrative
sense rather than a restrictive sense.
* * * * *