U.S. patent application number 11/581975, for a method and system for deferred command issuing in a computer system, was published by the patent office on 2007-04-19. This patent application is currently assigned to VIA Technologies, Inc. The invention is credited to Guofeng Zhang.
Application Number: 11/581975
Publication Number: 20070088856
Family ID: 38214146
Publication Date: 2007-04-19
United States Patent Application 20070088856
Kind Code: A1
Inventor: Zhang; Guofeng
Publication Date: April 19, 2007
Method and system for deferred command issuing in a computer system
Abstract
A method and system are disclosed for employing deferred command
issuing in a computer system with multiple peripheral processors
operating with a peripheral device driver embedded in a
multi-threaded central processor. After issuing a first command
with a first event tag by the peripheral device driver, a second
command is generated for a first peripheral processor by the
peripheral device driver following the issuing of the first
command. The second command is stored while awaiting the return of the first event tag, and the second command is issued when the first
event tag is returned if the first and second commands need to be
synchronized.
Inventors: Zhang; Guofeng (Shanghai, CN)
Correspondence Address: L. Howard Chen, Esq.; Kirkpatrick & Lockhart Preston Gates Ellis LLP, 55 Second Street, Suite 1700, San Francisco, CA 94105, US
Assignee: VIA Technologies, Inc.
Family ID: 38214146
Appl. No.: 11/581975
Filed: October 17, 2006
Related U.S. Patent Documents
Application Number: 60/727,668; Filing Date: Oct 18, 2005; Patent Number: (none)
Current U.S. Class: 710/6
Current CPC Class: G06F 13/102 (2013.01); G06T 15/005 (2013.01)
Class at Publication: 710/006
International Class: G06F 3/00 (2006.01)
Claims
1. A method for deferred command issuing in a computer system with one or multiple special purpose processors operating with a peripheral device driver running on one or multiple central processors, the method comprising: issuing a first command with a first event tag by the peripheral device driver; generating a second command for a first peripheral processor by the peripheral device driver following the issuing of the first command; storing the second command while awaiting the return of the first event tag; and issuing the second command when the first event tag is returned.
2. The method of claim 1, wherein storing the second command
further includes storing the second command in a buffer associated
with the first processor.
3. The method of claim 2, further comprising: generating a third
command for a second processor; and storing the third command in a
buffer associated therewith.
4. The method of claim 3, wherein the buffers associated with the
first and second processors are different.
5. The method of claim 1, further comprising checking whether the
generated second command relates to the first command requiring the
first event tag to return before the second command is issued.
6. The method of claim 5, wherein checking further includes:
checking whether the first event tag has returned; and checking
whether the first event tag is outstanding if it is not yet
returned and if it relates to the second command.
7. The method of claim 6, wherein checking whether the first event
tag has returned is performed after the generating the second
command.
8. The method of claim 6, wherein checking whether the first event
tag has returned is performed prior to the generating the second
command.
9. A method for deferred command issuing in a computer system with
multiple graphics processors operating with a graphics driver
embedded in a multi-threaded central processor, the method
comprising: issuing a first command with a first event tag by the
graphics driver; generating a second command to a first processor
of the computer system by the graphics driver following the issuing
of the first command; storing the second command while awaiting the return of the first event tag; and issuing the second command when
the first event tag is returned.
10. The method of claim 9, wherein storing the second command
further includes storing the second command in a buffer associated
with the first processor.
11. The method of claim 10, further comprising: generating a third
command to a second processor; and storing the third command in a
buffer associated therewith.
12. The method of claim 11, wherein the buffers associated with the
first and second processors are different.
13. The method of claim 9, further comprising checking whether the
generated second command needs to wait for the first event tag to
return.
14. The method of claim 13, wherein the checking further includes:
checking whether the first event tag has returned; and checking
whether the first event tag is outstanding if it has not
returned.
15. The method of claim 14, wherein checking whether the first
event tag has returned is performed after the generating the second
command.
16. The method of claim 14, wherein checking whether the first
event tag has returned is performed prior to the generating the
second command.
17. A system for supporting deferred command issuing in an advanced
computer system, the system comprising: a multi-threaded central
processing unit (CPU); a graphics subsystem with multiple graphics
processing units; at least one command buffer for storing commands
and associated event-tags; and a graphics driver embedded in the
CPU for generating commands to be stored in the command buffers,
assigning event-tags when synchronizations are needed, controlling
command issuing and monitoring event-tag returns.
Description
PRIORITY DATA
[0001] This application claims the benefit of U.S. Patent
Application Ser. No. 60/727,668, which was filed on Oct. 18, 2005
and entitled "Smart CPU Sync Technology for MultiGPU Solution."
CROSS REFERENCE
[0002] This application also relates to U.S. patent application
entitled "TRANSPARENT MULTI-BUFFERING IN MULTI-GPU GRAPHICS
SUBSYSTEM", U.S. patent application entitled "EVENT MEMORY ASSISTED
SYNCHRONIZATION IN MULTI-GPU GRAPHICS SUBSYSTEM" and U.S. patent
application entitled "METHOD AND SYSTEM FOR SYNCHRONIZING PARALLEL
ENGINES IN A GRAPHICS PROCESSING UNIT", all of which are commonly
filed on the same day, and which are incorporated by reference in
their entirety.
BACKGROUND
[0003] The present invention relates generally to the
synchronization between a computer's central processing units
(CPUs) and peripheral processing units, and, more particularly, to
the timing of command issuing.
[0004] In a modern computer system, each peripheral functional
module, such as audio or video, has its own dedicated processing
subsystem, and the operations of these subsystems typically require
direct control by computer's central processing unit (CPU).
In addition, communication and synchronization among components of the
subsystems are typically achieved through hardware connections. In
an advanced graphics processing subsystem with two or more graphics
processing units (GPUs), for instance, a CPU has to frequently
evaluate the state of GPUs, and a next rendering command can only
be issued when a previous or current command is finished. In other cases, when the CPU is calculating something for the GPUs using multi-threading, the GPUs may have to wait for the CPU to complete the calculation before executing commands that need the result. When one GPU requests data from another GPU,
the transfer must be made through a direct hardware link or the
bus, and controlled by the CPU, which then has to wait for the data
transfer to complete before executing subsequent commands. Whether the CPU waits for a GPU or vice versa, the wait time is wasted and lowers the computer's overall performance.
[0005] It is therefore desirable for a computer system to remove hard waits from the CPU's operations as much as possible.
SUMMARY
[0006] In view of the foregoing, this invention provides a method
and system to remove some of the wait time by the CPU, as well as
some idle time in peripheral processing units. In other words, it
increases parallelism between processors.
[0007] A method and system are disclosed for employing deferred
command issuing in a computer system with multiple peripheral
processors operating with a peripheral device driver embedded in
one or more central processor(s). After issuing a first command
with a first event tag by the peripheral device driver, a second
command is generated for a first peripheral processor by the
peripheral device driver following the issuing of the first
command. The second command is stored while awaiting the return of the first event tag, and the second command is issued when the first
event tag is returned if the first and second commands need to be
synchronized.
[0008] The construction and method of operation of the invention,
however, together with additional objects and advantages thereof
will be best understood from the following description of specific
embodiments when read in connection with the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 is a block diagram of a part of a traditional
computer system.
[0010] FIG. 2 is a block diagram of a part of a computer system
according to one embodiment of the invention.
[0011] FIG. 3 illustrates commands and event-tag flowing according
to one embodiment of the invention.
[0012] FIG. 4A is a flow chart showing a command block generating
and a synchronization mechanism according to one embodiment of the
present invention.
[0013] FIGS. 4B and 4C are flow charts illustrating two different
driver subroutines within each command block and execution
according to one embodiment of the present invention.
[0014] FIGS. 5A and 5B are command timing diagrams for showing time
saving effects of deferred-command-issuing according to one
embodiment of the present invention.
DESCRIPTION
[0015] Detailed information with regard to the operation of the GPU
in the computer system is further described in U.S. patent
application entitled "TRANSPARENT MULTI-BUFFERING IN MULTI-GPU
GRAPHICS SUBSYSTEM", U.S. patent application entitled "EVENT MEMORY
ASSISTED SYNCHRONIZATION IN MULTI-GPU GRAPHICS SUBSYSTEM" and U.S.
patent application entitled "METHOD AND SYSTEM FOR SYNCHRONIZING
PARALLEL ENGINES IN A GRAPHICS PROCESSING UNIT", all of which are
commonly filed on the same day, and which are incorporated by
reference in their entirety.
[0016] FIG. 1 illustrates a part of a traditional computer system
100. In such a system, a peripheral device driver 110 is just a
program, functioning essentially like an instruction manual that
provides the operating system with information on how to control
and communicate with special processors 120 and 130 of a peripheral
subsystem 140. The driver 110 does not have any control function,
which is instead carried out by one or more central processor(s)
(CPU) 150. Communications between the special processors 120 and
130 take place through hardware connection 160 or through the bus
170.
[0017] As an embodiment of the present invention, FIG. 2 illustrates a part of a multi-processor computer system 200 with a driver 210 embedded in one or more central processor(s) 220. Here, `embedded` means that the driver actually runs on the CPU and uses some of the CPU's processing capability, so that the driver can generate commands to be stored in the buffer, assign event-tags when synchronization with other commands is needed, issue the commands, and monitor the return of the event-tags, all without the CPU hard-waiting. Such a driver implementation does not require extensive hardware support, so it is also cost effective.
[0018] The computer system 200 also employs a command buffer 230,
which stores immediate commands sent by the driver 210. The command
buffer 230 can be just a memory space in a main memory 290 or
another memory located anywhere, and can be dynamically allocated
by the driver 210. Using the processing power of the central processor(s) 220, the driver 210 directs the buffering of commands into, and their subsequent issue from, the command buffer 230, as well as synchronization among the special processors 240 and 250 and the central processor(s) 220. The special processors can be processors
dedicated for graphics operations, known as graphics processing
units (GPUs).
[0019] FIG. 3 is a diagram showing the flow of commands among the
CPU, the buffers and the special processors according to one
embodiment of the present invention. For illustration purposes, it
provides more details on command buffering. An embedded driver 320 generates commands along with event-tags, and then sends them selectively to command buffers 330 and 340. Commands and event-tags for special processor1 350 are sent to command buffer1 330, and commands and event-tags for special processor2 360 are sent to command buffer2 340, so that the commands for different special processors can be issued independently and simultaneously. When a
current command needs to synchronize with another command
execution, the driver 320 generates an event-tag alongside the
current command. Processors, either peripheral special processors
350 and 360, or central processor(s) 300, execute their
corresponding commands and return event-tags, if present, upon
completion of the execution. Various control mechanisms are established among these components through such communications; for example, the central processor(s) 300 can control both buffers during its operation.
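As an illustration of the buffering arrangement described above, the following minimal sketch (in Python, using hypothetical names such as Command and CommandBuffer that are not part of the disclosure) shows one way commands and event-tags could be held in a separate buffer per special processor until any awaited tag returns.

    from collections import deque
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Command:
        payload: str                      # the work to be executed by a special processor
        event_tag: Optional[int] = None   # tag the processor returns on completion, if needed
        waits_for: Optional[int] = None   # tag that must return before this command may issue

    @dataclass
    class CommandBuffer:
        """Dynamically allocated buffer of deferred commands for one special processor."""
        pending: deque = field(default_factory=deque)

        def store(self, cmd: Command) -> None:
            self.pending.append(cmd)

        def pop_ready(self, returned_tags: set) -> list:
            """Remove and return every buffered command whose awaited tag has returned."""
            ready, waiting = [], deque()
            for cmd in self.pending:
                (ready if cmd.waits_for is None or cmd.waits_for in returned_tags
                 else waiting).append(cmd)
            self.pending = waiting
            return ready

    # One buffer per special processor, as with command buffer1 330 and command buffer2 340.
    buffers = {"processor1": CommandBuffer(), "processor2": CommandBuffer()}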
[0020] FIG. 4A presents a flow chart detailing how the graphics
driver 320 synchronizes command issuing with the GPUs and the CPU. Here,
the driver 320 generates command blocks in steps 410A through 470A
continuously without any delay on the CPU side. Some of these
commands are to be stored in command buffers before being issued to
GPUs for execution. For example, command block[n-1] 410A has a
command to a first GPU with a request to return an event-tag[i].
The first GPU will return the event-tag[i] upon the completion of
command block[n-1] 410A. Upon detecting the event-tag[i], another
command that needs to synchronize with the command block[n-1] 410A,
can then be issued from the command buffer by driver 320. In this
way, the CPU's hard wait for a synchronization event is eliminated. The
term "deferred command issuing" refers generally to this command
buffering process.
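A rough sketch of this deferral, continuing the CommandBuffer example above (the issue function and the tag value are illustrative assumptions, not the actual driver interface):

    returned_tags: set = set()

    def issue(cmd: Command, processor: str) -> None:
        # Stand-in for handing a command to a GPU; the GPU later returns cmd.event_tag.
        print(f"issue {cmd.payload} to {processor}")

    # Command block[n-1] is issued with a request to return an event-tag (here, tag 1).
    issue(Command("block[n-1]", event_tag=1), "processor1")

    # A later command that must synchronize with block[n-1] is buffered rather than issued,
    # so the CPU can keep generating further command blocks without a hard wait.
    buffers["processor1"].store(Command("dependent block", waits_for=1))

    # When the first GPU reports completion, the driver records the returned tag
    # and issues whatever deferred commands have become ready.
    returned_tags.add(1)
    for cmd in buffers["processor1"].pop_ready(returned_tags):
        issue(cmd, "processor1")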
[0021] FIG. 4A also shows a command block[n+m] 440A that needs to
synchronize with another CPU thread, as well as a command
block[n+m+k] 470A that needs to synchronize with a second GPU. In
both cases, the driver 320's operations of storing commands,
checking event-tags and issuing commands are the same as in the
above first GPU case.
[0022] Within each command block, driver 320 executes certain
subroutines, such as generating a new command and an associated
event-tag if needed, checking on returned event-tags, buffering the
new command and issuing a buffered command or directly issuing the
new command if there is no outstanding event-tag. These subroutines
can be executed in various sequences. FIGS. 4B and 4C are two
examples for conducting such subroutines.
[0023] Referring to FIG. 3 and FIG. 4B, driver 320 first generates
a current command in step 410B, and then checks for any returned
event-tag in step 420B. If there is a returned event-tag, and if a
related command is in a buffer, then the driver 320 issues the
buffered command along with its own event-tag if present, as shown
in steps 430B and 440B. Here, `related` means that the buffered command must synchronize with the previous command that returned the event-tag to the buffer. If the related command is not in the buffer, then driver 320 checks whether the current command is related to the returned event-tag in step 450B. If so,
it issues the current command (step 470B), and if not, it buffers
the current command (step 480B).
[0024] On the other hand, if there is no returned event-tag in the
buffer, driver 320 then checks for any outstanding event-tag in
step 460B. If there is an outstanding event-tag on which issuing the current command depends or to which it is related, driver 320 buffers the current command (step 480B). If there is no related outstanding event-tag, driver 320 issues the current command directly. Note that in all cases of command buffering or issuing, the associated event-tag, if present, is also buffered or issued along with the command.
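One way to read the FIG. 4B ordering, generate first and then check, is sketched below; the driver methods (poll_returned_tag, find_buffered_command_related_to, and so on) are assumed names used only for illustration.

    def fig_4b_pass(driver, current):
        """Sketch of one FIG. 4B pass after the current command is generated (step 410B)."""
        tag = driver.poll_returned_tag()                        # step 420B
        if tag is not None:
            buffered = driver.find_buffered_command_related_to(tag)
            if buffered is not None:                            # step 430B
                driver.issue(buffered)                          # step 440B (with its tag, if any)
            elif driver.is_related(current, tag):               # step 450B
                driver.issue(current)                           # step 470B
            else:
                driver.buffer(current)                          # step 480B
        elif driver.has_outstanding_tag_related_to(current):    # step 460B
            driver.buffer(current)                              # step 480B
        else:
            driver.issue(current)                               # direct issue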
[0025] FIG. 4C shows another subroutine according to another
embodiment of the present invention where the driver 320 first
checks for any returned event-tag in step 410C. If there is a
returned event-tag, and if a related command is in the buffer, then
driver 320 issues the buffered command (step 430C). If there is no
returned event-tag (step 410C) or there is no related command in
buffer (step 420C), then driver 320 generates a current command
(step 445C). If the current command is related to any returned
event-tag (step 450C), then the driver issues the current command
(step 480C). If the current command is not related to any returned
event-tag, then it is also checked to see whether there is any
outstanding event-tag that relates to the current command (step
460C). If there is an outstanding related event-tag, driver 320 buffers the current command along with its event-tag, if present (step 470C); otherwise, driver 320 issues the current command (step 480C) along with its event-tag, if present. The aforementioned
event-tag checking process can be limited to only those processors
to which commands with event-tags have been sent previously.
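The FIG. 4C variant differs mainly in checking for a returned event-tag before the current command is generated; a correspondingly reordered sketch, using the same assumed helper names as above, is:

    def fig_4c_pass(driver, generate_command):
        """Sketch of one FIG. 4C pass: check returned tags first, then generate."""
        tag = driver.poll_returned_tag()                        # step 410C
        if tag is not None:
            buffered = driver.find_buffered_command_related_to(tag)
            if buffered is not None:                            # step 420C
                driver.issue(buffered)                          # step 430C
                return
        current = generate_command()                            # step 445C
        if tag is not None and driver.is_related(current, tag): # step 450C
            driver.issue(current)                               # step 480C
        elif driver.has_outstanding_tag_related_to(current):    # step 460C
            driver.buffer(current)                              # step 470C
        else:
            driver.issue(current)                               # step 480C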
[0026] In both cases shown in FIGS. 4B and 4C, and as an alternative, a current command is always buffered if there is any outstanding event-tag. If driver 320 checks only the event-tag buffer, then before an outstanding event-tag returns, driver 320 has no way to know whether it is related to a newly generated current command, so the current command has to be buffered whenever there is any outstanding event-tag.
[0027] FIGS. 5A and 5B are timing diagrams illustrating
deferred-command-issuing that reduces CPU wait-time and GPU
idle-time. FIG. 5A represents the situation in which a
deferred-command-issuing process is not employed. In this case, the CPU generates commands in time slots 500A, 510A and 520A, which a first GPU (GPU1) executes in time slots 550A, 560A and 570A, respectively. Commands generated in time slots 502A, 512A and 522A are executed by a second GPU (GPU2) in time slots 552A, 562A and 572A, respectively. Since no command buffering process is employed,
a subsequent command can only be generated and issued when a
current GPU operation is completed. For example, the time slot 510A
can only be initiated after the time slot 552A, and similarly, the
time slot 520A is after time slot 562A. The CPU has to wait while a
previously issued command is executing. As shown, the time
intervals between two adjacent time slots are either the CPU wait
time or the GPU idle time. For instance, the interval between 510A
and 502A is CPU's wait time, and the interval between 560A and 550A
is GPU1's idle time.
[0028] In contrast to FIG. 5A, FIG. 5B illustrates the timing
relationships in the situation where the deferred-command-issuing
process is employed to allow the CPU to generate commands
continuously to command buffers without waiting for any GPU to
complete a command execution. In this case, the CPU generates commands in time slots 500B, 510B and 520B, which GPU1 executes in time slots 550B, 560B and 570B, respectively. Commands generated in time slots 502B, 512B and 522B are executed by GPU2 in time slots 552B, 562B and 572B, respectively. As shown, the CPU command generating time slot 510B is moved up to follow the completion of the time slot 502B, which is prior to the end of the GPU2 time slot 552B. But the CPU's fifth
command at time slot 520B still waits for the time slot 552B to
end, because this particular command must synchronize with the GPU2 execution; the same applies to the command at 530B and the GPU2 execution at 562B. In such a command processing system, especially a graphics system employing the deferred-command-issuing process, a subsequent command is already generated and waiting in the command buffer for execution by the GPU. On the other hand, the GPUs do
not have to wait for the CPU to generate commands, and can execute
a subsequent command right after a current one finishes. This is
further illustrated in the case for GPU2 at time slot 562B and
572B, where GPU2 has no idle time. For the same reason, the GPU1
idle time between time slots 570B and 560B is also reduced.
[0029] To quantify the time saved by the deferred-command-issuing process, assume that the CPU command generating time is `t` and that the execution times of GPU1 and GPU2 are T1 and T2, respectively (with T1<T2 to simplify the evaluation). As shown in FIG. 5A, a system without deferred command issuing takes (3*T2+3*t) to complete three command cycles. The system with the deferred command issuing of FIG. 5B shortens the three-cycle time to (3*T2+t), so the time saving over three cycles is 2*t. In general, the saving is (n-1)*t for n command cycles.
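This estimate can be checked with a short calculation; the numeric values below are arbitrary and only assume T1<T2 as in the text.

    def cycle_times(n: int, t: float, T2: float) -> tuple:
        """Total time of n command cycles without and with deferred command issuing."""
        without_deferral = n * T2 + n * t   # FIG. 5A: every command generation sits on the critical path
        with_deferral = n * T2 + t          # FIG. 5B: only the first generation does
        return without_deferral, with_deferral

    t, T1, T2 = 1.0, 3.0, 5.0               # arbitrary example with T1 < T2
    plain, deferred = cycle_times(3, t, T2)
    print(plain - deferred)                 # prints 2.0, i.e. 2*t, matching (n-1)*t for n = 3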
[0030] A comparison between FIGS. 5A and 5B also shows a saving in
GPU idle time. In FIG. 5A, GPU1 idle time between time slots 560A
and 570A is T2-T1+t, and GPU2 idle time is t. In FIG. 5B, GPU1 idle
time between the corresponding time slots becomes T2-T1, or a
saving of t. GPU2 idle time is completely eliminated, also a saving
of t.
[0031] This invention provides many different embodiments, or
examples, for implementing different features of the invention.
Specific examples of components and methods are described to help
clarify the disclosure. These are, of course, merely examples and
are not intended to limit the disclosure from that described in the
claims.
* * * * *