U.S. patent application number 14/585495 was filed with the patent office on 2014-12-30, and published on 2015-11-12, for efficient inter-processor communication in ray tracing.
The applicant listed for this patent is Reuven BAKALASH. Invention is credited to Reuven BAKALASH.

Application Number: 20150325033 / 14/585495
Family ID: 54368309
Publication Date: 2015-11-12

United States Patent Application 20150325033
Kind Code: A1
BAKALASH; Reuven
November 12, 2015

Efficient Inter-processor Communication in Ray Tracing
Abstract
A novel method and system for distributed-database ray tracing is
presented, based on modular mapping of scene data among processors.
An inherent property of this mapping is that geographical proximity
in the scene matches communication proximity between processors.
Inventors: BAKALASH; Reuven (Shdema, IL)

Applicant:
Name: BAKALASH; Reuven
City: Shdema
Country: IL

Family ID: 54368309
Appl. No.: 14/585495
Filed: December 30, 2014
Related U.S. Patent Documents

Application Number   Filing Date    Patent Number
13726763             Dec 26, 2012   8957902
14585495
Current U.S. Class: 345/503
Current CPC Class: G06F 2209/502 20130101; G06F 2212/2542 20130101; G06F 9/5061 20130101; G06T 2210/52 20130101; G06T 15/06 20130101; G06F 12/0875 20130101; G06F 12/0806 20130101; G06T 15/005 20130101
International Class: G06T 15/00 20060101 G06T015/00; G06T 15/06 20060101 G06T015/06
Claims
1. A multiprocessing computing system for ray tracing having
reduced inter-processor communication, comprising: an array of at
least three general purpose processing elements interconnected in a
ring topology; graphics data representing a three-dimensional scene
of multiple objects; a plurality of cells forming a grid; and a
modular mapping scheme; wherein the graphics data are mapped onto
said grid of cells according to said modular mapping scheme; there
is a need to autonomously exchange elements of said data between a
cell and any of its 6-connected neighboring cells; wherein said
autonomous exchange is done in a random way, without an agreed-upon
timing of the transfer, and not in a centrally controlled way; and
said cells are designated to processing elements according to said
modular mapping scheme; thereby, when a cell is processed by its
designated processing element, all of the cell's communication needs
with its 6-connected neighboring cells are mapped to at most one hop
along the ring, reducing the inter-processor communication.
2. The computing system of claim 1 wherein the autonomous exchange
of data between two geographically neighboring cells translates to
a communication neighborhood of the corresponding cell-designated
general purpose processing elements on the ring, allowing reduced
inter-processor communication.
3. (canceled)
4. The computing system of claim 1 wherein each general purpose
processing element assigned to a cell is at a communication
distance of at most one ring hop from a general purpose processing
element assigned to any of the cell's six-connected neighboring
cells.
5. The computing system of claim 1 wherein a general purpose
processing element is allocated all cells of a given module,
handling one cell at a time, independently of other general purpose
processing elements.
6. (canceled)
7. The computing system of claim 1 wherein said exchange of data
from a sender-cell to a receiver-cell is performed by the general
purpose processing element assigned to the sender-cell.
8. A method of reducing inter-processor communication in an array
having at least three general purpose processing elements, when the
processed graphics data represent a three-dimensional scene of
multiple objects, comprising the steps of: connecting said
processing elements by a ring interconnection; subdividing the
three-dimensional data onto a grid of cells; mapping said cells
according to a modular mapping scheme; designating processing
elements to said cells, according to the modular mapping scheme;
thereby, due to said modular mapping, when a cell is processed
by its designated processing element, all of the cell's communication
needs are mapped to at most one hop along the ring, reducing the
inter-processor communication.
9. The method of claim 8 wherein a geographical neighborhood
between two data exchanging cells translates to a communication
neighborhood between corresponding cell-designated general purpose
processing elements, for reduced data communication.
10. (canceled)
11. The method of claim 8 wherein a general purpose processing
element is allocated all cells of a given module, handling one cell
at a time, independently of other general purpose processing
elements.
12. The method of claim 8 wherein said exchange of data from a
sender-cell to a receiver-cell is communicated by the general
purpose processing element designated to the sender-cell.
Description
CROSS-REFERENCE TO RELATED CASES
[0001] The present application is a Continuation in Part of U.S.
patent application Ser. No. 13/726,763 filed on Dec. 26, 2012
entitled "Method and Apparatus for Interprocessor Communication
Employing Modular Space Division", which is a Continuation of U.S.
Pat. No. 8,373,699 filed on Apr. 16, 2012 entitled "Method and
Apparatus for Parallel Ray-Tracing Employing Modular Space
Division", which is a Divisional of U.S. Pat. No. 8,189,001 filed
on Dec. 24, 2010 entitled "Method and Apparatus for Parallel
Ray-Tracing Employing Modular Space Division", which claims
priority to the U.S. Provisional Application No. 61/292,152 filed
Jan. 4, 2010 entitled "Method and Apparatus for Parallel
Ray-Tracing Employing Modular Space Division"; and also claims
priority to U.S. Provisional Application No. 61/282,241 filed Jan.
6, 2010 entitled "Method and Apparatus for Parallel Ray-Tracing
Employing Modular Space Division"; and U.S. Provisional Application
No. 61/347,380 filed May 22, 2010 entitled "Method and Apparatus
for Parallel Ray-Tracing Employing Modular Data Structure"; and
U.S. Provisional Application No. 61/422,100 filed Dec. 10, 2010
entitled "Method and System for Dynamic Load Balancing of Data
Parallel Processing," all of which are hereby incorporated by
reference.
FIELD OF THE INVENTION
[0002] The present invention relates generally to data-parallel
processing and, more particularly, to data-parallel ray tracing
technology enabling real-time applications and highly
photo-realistic images.
BACKGROUND OF THE INVENTION
[0003] Ray tracing is a technique for generating images by
simulating the behavior of light within a three-dimensional scene,
typically by tracing light rays from the camera into the scene, as
depicted in FIG. 1A. In general, two types of rays are used. Primary
rays are traced from a particular point on the camera image plane
(a pixel) into the scene, until they hit a surface, at a so-called
hit or intersection point. Shadow and secondary rays are traced
from a hit point to determine how it is lit. Finally, to determine
how the surface material appears, texture lookups and shading
computations are performed at or near the hit point. FIG. 1B shows
a scene having three objects and a single light source. Three ray
generations are created when the primary ray spawns other rays (N'
surface normal, R' reflected ray, L' shadow ray, T' transmitted
(refracted) ray).
[0004] Ray tracing is a highly computationally expensive algorithm.
Fortunately, ray tracing is quite easy to parallelize: the
contribution of each ray to the final image can be computed
independently of the other rays. For this reason, a lot of effort
has been put into finding the best parallel decomposition for ray
tracing.
[0005] There are two main approaches to parallel ray tracing in
prior art: (i) ray-parallel, in which rays are distributed among
parallel processors, each processor tracing a ray all the way;
and (ii) data-parallel, in which the scene is distributed among
multiple processors, and a ray is handled by multiple processors
in sequence.
[0006] The ray-parallel implementation of ray tracing simply
replicates all the data on each processor and subdivides the screen
into a number of disjoint regions. Each processor then renders a
number of regions using the unaltered sequential version of the ray
tracing algorithm, until the whole image is completed. Whenever a
processor finishes a region, it asks the master processor for a new
task. This is also called the demand-driven approach, or image-space
subdivision. Load balancing is achieved dynamically by
sending new tasks to processors that have just become idle.
However, if very large models need to be rendered, the scene data
have to be distributed over the memories, because the local memory
of each processor is not large enough to hold the entire scene.
The demand-driven approach then suffers from massive copying and
duplication of geometric data.
[0007] Data-parallel is a different approach, suited to rendering
scenes that do not fit into a single processor's memory. Here, the
object data is distributed over the processors. Each processor owns
only a subset of the database, and it traces rays only when they
pass through its own subspace. Its high data locality avoids massive
movement of data, answering the needs of very large models. However,
the rendering cost per ray and the number of rays passing through
each subset of the database are likely to vary (e.g. hot spots are
caused by viewpoints and light sources), leading to severe load
imbalances, a problem which is difficult to solve with either static
or dynamic load balancing schemes. Efficiency thus tends to be low
in such systems.
[0008] In order to exploit locality between data accesses as much
as possible, usually some spatial subdivision is used to decide
which parts of the scene are stored with which processor. In its
simplest form, the data is distributed according to a uniform
distribution. Each processor will hold one or more equal sized
cells. Having just one cell per processor allows the data
decomposition to be nicely mapped onto a 3D grid topology. However,
since the number of objects may vary dramatically from cell to
cell, the cost of tracing a ray through each of these cells will
vary and therefore this approach may lead to severe load
imbalances.
[0009] A second, and more difficult problem to address, is the fact
that the number of rays passing through each cell is likely to
vary. Certain parts of the scene attract more rays than other
parts. This has mainly to do with the view point and the location
of the light sources. Both the variations in cost per ray and the
number of rays passing through each cell indicate that having
multiple cells per processor is a good option, as it is likely to
balance the workload, albeit at the cost of extra
communication.
[0010] The way the data is distributed over processors has a strong
impact on how well the system performs. The more even the workload
associated with a particular data distribution, the less idle time
is to be expected. Three main criteria need to be observed for such
distributions to lead to efficient execution of the parallel
algorithm (Salmon and Goldsmith):
[0011] The memory overhead for each processor should be as equal as
possible
[0012] Communication requirements during rendering need to be
minimized
[0013] Processing time for each processor needs to be equalized
[0014] Generating data distributions which adhere to all three
criteria is a difficult problem, which currently remains unsolved
in prior art. Most data distributions are limited to equalizing the
memory overhead for each processor. This is a relatively simple
exercise, because generating an adaptive spatial subdivision, such
as an octree or KD-tree, gives sufficient clues as to which regions
of space contain how many objects.
[0015] Another problem in ray tracing is the high processing cost
of acceleration structures. For each frame, a rendering system must
find the intersection points between many rays and many polygons.
The cost of testing each ray against each polygon is prohibitive,
so such systems typically use accelerating structures (such as
Uniform grid, Octree, KD-tree, other binary trees, bounding boxes,
etc.) to reduce the number of ray/polygon intersection tests that
must be performed. As the data is sorted over space with the
acceleration structure, the data distribution over the processors
is based on this structure as well. The spatial subdivision is also
used to establish which data needs to be fetched from other
processors. Moreover, construction of optimized structures is
expensive and does not allow rebuilding the acceleration structure
every frame to support interactive ray tracing of large dynamic
scenes. The construction times for larger scenes are very high and
do not allow dynamic changes.
[0016] There has been an attempt in prior art to lower the cost and
complexity of acceleration structures by using its simplest form,
where the data is distributed uniformly. Each processor will hold
one or more equal sized cells. Having just one cell per processor
allows the data decomposition to be nicely mapped onto a 3D grid
topology. However, since the number of objects may vary
dramatically from cell to cell, the cost of tracing a ray through
each of these cells will vary and therefore this approach leads to
severe load imbalances, and consequently the uniform distribution
has been abandoned.
[0017] Today, the most popular data structure in prior art is the
KD-tree. Ray traversal in a KD-tree is less efficient than in a
uniform grid; however, it is good for scenes with a non-uniform
distribution of objects. The massive traversal of a KD-tree based
acceleration structure typically consumes a major chunk of the
frame time (e.g. 63% on average, Venkatraman et al.). Thus, there
is a great need in the art to devise a method of improved load
balancing based on a simple acceleration structure, such as a
uniform grid.
SUMMARY AND OBJECT OF THE PRESENT INVENTION
[0018] The present invention is a novel method and system for
distributed-database ray tracing, based on a data-parallel approach
and a new modular mapping of the scene data onto a uniform grid. Due
to the modular distribution of the scene data among processors, the
geographical neighborhood between cells translates to a communication
neighborhood of the cell-designated processors on the ring, allowing
task migration between neighboring processors for dynamic
load balancing and efficient data exchange. In the preferred
embodiment of the system, ray and scene data transfers are resolved
by neighbor-based data exchange using mutual cache-memory access,
instead of inter-processor communication, greatly improving the
system's utilization. No global communication is needed.
[0019] The method can be implemented as well on commodity
architectures, based on off-the-shelf elements, such as multicore
chips, and others. It is understood that other embodiments of the
present invention will become readily apparent to those skilled in
the art from the following detailed description, wherein are shown
and described only embodiments of the invention by way of
illustration. As will be realized, the invention is capable of
other and different embodiments and its several details are capable
of modification in various other respects, all without departing
from the scope of the present invention as defined by the claims.
Accordingly, the drawings and detailed description are to be
regarded as illustrative in nature and not as restrictive.
[0020] The above summary is not an exhaustive list of all aspects
of the present invention. Indeed, the inventor contemplates that
his invention includes all systems and methods that can be
practiced from all suitable combinations and derivatives of the
various aspects summarized above, as well as those disclosed in the
detailed description below and particularly pointed out in the
claims filed with the application. Such combinations have
particular advantages not specifically recited in the above
summary.
[0021] Accordingly, a primary object of the present invention is to
provide a new and improved method of and apparatus for practicing
data-parallel ray tracing processes in multiprocessing based
computer systems, while avoiding the shortcomings and drawbacks
associated with prior art apparatus and methodologies.
[0022] Another object of the present invention is to provide a new
and improved method of and apparatus for practicing data-parallel
ray tracing processes with increased efficiency on very large
models.
[0023] Another object of the present invention is to provide a new
and efficient method of ray tracing based on a simple acceleration
structure, such as a uniform grid.
[0024] Another object of the present invention is to provide an
improved method of ray tracing based on uniform grid acceleration
structure, to avoid expensive traversals of complex data
structures.
[0025] Another object of the present invention is to provide a new
method of load balancing in conjunction with a simple acceleration
structure, such as a uniform grid.
[0026] Another object of the present invention is to provide
modular mapping (MM) of scene data onto a three-dimensional uniform
grid of cells, for parallel ray tracing.
[0027] Another object of the present invention is to provide
modular distribution of scene data for parallel ray tracing, onto a
three-dimensional uniform grid of cells according to the method of
modular mapping (MM), while the scene can be of any shape used in
prior-art ray tracing.
[0028] Another object of the present invention is to provide an
allocation of cells of scene data to processing elements (PEs)
according to MM skewed alignment, having multiple cells per
processor uniformly scattered across the scene.
[0029] Another object of the present invention is to provide an
allocation of cells of scene data to processing elements (PEs)
according to MM skewed alignment, having equivalence between
geographical neighborhoods of cells of data and computational
neighborhoods of PEs.
[0030] Another object of the present invention is to provide an
allocation of cells of scene data to processing elements (PEs)
according to MM skewed alignment, such that the shortest
geographical distance between cells is translated into shortest
inter-processor distance.
[0031] Another object of the present invention is to provide an
improved architecture for data-parallel ray tracing in the form of
a system based on multiple processing elements (PEs) connected by a
ring network topology, wherein every two neighboring PEs share a
cache memory to accommodate fast ray transfer between neighboring
cells of data, with no need for communication.
[0032] Another object of the present invention is to provide such
apparatus for practicing data-parallel ray tracing in the form of a
multiple processing elements (PEs) based system, being potentially
implemented on different commodity multi-core architectures.
[0033] Another object of the present invention is to provide
apparatus for practicing data-parallel ray tracing in the form of a
multiple processing elements (PEs) based system having reduced
communication needs due to modular mapped (MM) data structure of
scene, avoiding any form of global communication.
[0034] Another object of the present invention is to provide
apparatus for practicing data-parallel ray tracing in the form of a
multiple-PE based system, minimizing massive ray migration among
processors due to the modular mapped (MM) data structure of the
scene.
[0035] These and other objects of the present invention will become
apparent hereinafter and in the claims to invention.
BRIEF DESCRIPTION OF DRAWINGS
[0036] The invention is herein described, by way of non-limiting
examples only, with reference to the accompanying figures and
drawings, wherein like designations denote like elements.
Understanding that these drawings only provide information
concerning typical embodiments of the invention and are not
therefore to be considered limiting in scope:
[0037] FIG. 1A. Prior art. The figure illustrates a setup of a
ray-traced scene including view point, image and scene object.
Reflection, refraction, and shadow rays are spawned from a point of
intersection between primary ray and scene object.
[0038] FIG. 1B. Prior art. Another setup of a ray traveling across
the scene is shown, having three objects and single light source.
Three ray generations are created when the primary ray spawns other
rays (N' surface normal, R' reflected ray, L' shadow ray, T'
transmitted (refracted) ray).
[0039] FIG. 2A depicts an exemplary scene universe of N=4 according
to the modular mapping scheme: the modular mapping (MM) of scene
data into a three-dimensional 4.times.4.times.4 uniform grid of
cells. Enumeration of cells is shown.
[0040] FIG. 2B depicts the affiliation of cells to modules, and the
alignment of those modules.
[0041] FIG. 2C illustrates that each one of the 4 processors in a
given example is assigned a module of 4.times.4 cells.
[0042] FIG. 2D exemplifies the skewed pattern of the modular set of
cells k=3, scattered over the space.
[0043] FIG. 3A shows the neighborhood according to the modular
mapping scheme. Cell 111, belonging to module k=3, is surrounded by
6-connected adjacent cells.
[0044] FIG. 3B shows the neighborhood modular rule of the modular
mapping scheme. All the 6-connected neighbors belong to one of the
two modules, k-1(module 2) and k+1 (module 0).
[0045] FIG. 3C depicts the computational neighborhood of processing
elements assigned to the cells. The k-th PE assigned to the central
cell has direct access to the data of all its 6-connected
surrounding cells, all residing in the two adjacent memories of k-1
and k+1 PEs.
[0046] FIG. 4A shows the second degree of neighborhood according to
the modular mapping scheme. Cell 111, on top of the above
6-connected cells, is second degree surrounded by 12-connected
cells.
[0047] FIG. 4B shows that all second degree 12-connected cells
belong to modules k, k-2 and k+2.
[0048] FIG. 4C depicts the associated processors. All the data of
12-connected cells is stored in the distance of k.+-.2 from the
central processor.
[0049] FIG. 5A depicts an extremely imbalanced scene of
"Teapot-in-a-stadium." For a traditional uniform grid subdivision
the tracing result would be very inefficient.
[0050] FIG. 5B depicts the teapot-in-a-stadium scene resolved in
the present invention by efficient `borrowing` mechanism. The only
object falls in a "central" cell belonging to module k, creating an
extreme load imbalance.
[0051] FIG. 5C depicts load balancing of the teapot-in-a-stadium
scene by "borrowing" two adjacent processors in a distance of
single hop across the ring communication.
[0052] FIG. 5D depicts extended load balancing of the
teapot-in-a-stadium scene by "borrowing" four processors in a
distance of two hops across the ring communication.
[0053] FIG. 6 shows a simple path of a traced ray within the
uniform universe.
[0054] FIG. 7A illustrates a tracing example of family of rays.
[0055] FIG. 7B summarizes the relation between involved cells and
PEs in the tracing example.
[0056] FIG. 7C depicts the consequent inter-PEs communication while
tracing the ray family. The entire data transfer occurs solely
between PE4 and its two neighbors, PE3 and PE5.
[0057] FIG. 8A depicts the PE-memory resident data structure
associated with each cell.
[0058] FIG. 8B depicts the PE cache resident PE data.
[0059] FIG. 8C depicts cell's data structure.
[0060] FIG. 9A depicts an elaborated example of ray tracing in the
modularly mapped scene.
[0061] FIG. 9B depicts detailed ray tracing process at specific PE
and specific cell, of the elaborated example.
[0062] FIG. 9C shows the folding stage of the exemplary rays, when
the light values are aggregated.
[0063] FIG. 10A depicts a high level flowchart of ray tracing
process of present invention.
[0064] FIG. 10B depicts a flowchart of ray tracing process in a
PE.
[0065] FIG. 10C depicts a flowchart of the folding stage of ray
tracing process.
[0066] FIG. 10D depicts a flowchart of processing a ray.
[0067] FIG. 11A depicts the preferred embodiment of present
invention comprising a multi-core system on chip (SOC).
[0068] FIG. 11B depicts one PE with access mechanism to neighboring
PEs and to communication ring.
[0069] FIG. 12 depicts an embodiment which is based on commodity
device; Larrabee multicore by Intel.
[0070] FIG. 13 depicts an embodiment which is based on commodity
device; Cell multicore by IBM.
DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS OF THE PRESENT
INVENTION
[0071] Unless specifically stated otherwise, as apparent from the
following discussions, it is appreciated that throughout the
specification discussions, utilizing terms such as "processing",
"computing", "calculating", "generating", "creating" or the like,
refer to the action and/or processes of a computer or computing
system, or processor or similar electronic computing device, that
manipulate and/or transform data represented as physical, such as
electronic, quantities within the computing system's registers
and/or memories into other data, similarly represented as physical
quantities within the computing system's memories, registers or
other such information storage, transmission or display
devices.
[0072] Embodiments of the present invention may use terms such as
processor, computer, apparatus, system, sub-system, module,
processing element (PE), multicore, FPGA, GPU and device (in single
or plural form) for performing the operations herein. This may be
specially constructed for the desired purposes, or it may contain a
general purpose computer selectively activated or reconfigured by a
computer program stored in the computer. Several technical terms
which are specifically associated with our disclosure are herein
defined.
[0073] Object--a scene is made up of objects. An object can stand
for a 2D or 3D primitive (polygon, triangle, solid, etc.), or a
complex object made up of primitives. `Pixel` as a 1D entity, is
not an object.
[0074] Cell--a sub-scene element capable of containing objects. It
is different from a `voxel`, which contains a single volumetric
value.
[0075] Empty cell--is a cell without objects, as opposed to a data
filled cell or polygon populated cell.
[0076] Hit point--a point where a ray intersects an object.
[0077] Visible object--is an object which is visible, at least in
part, from the point of view. It is not fully hidden by other
objects.
[0078] Load balancing--distributing workload across multiple
processors to achieve optimal resource utilization, maximize
throughput, minimize response time, and avoid overload.
[0079] Static load balance--all information is available to the
scheduling algorithm, which runs before the run-time computation
starts.
[0080] Shared memory system--parallel computing system having
memory shared between all processing elements in a single address
space.
[0081] Distributed memory system--parallel computing system in
which each processing element has its own local address space.
[0082] Private memory--owned by a processing element and accessible
by the owner.
[0083] Local objects--objects residing wholly or partly in a
cell.
[0084] Hit point (HIP)--an intersection point between a traced ray
and an object within the scene.
[0085] Shadow rays--rays that are traced from a hit point to light
source in order to find out whether the point is lit or
shadowed.
[0086] Bouncing rays--traced rays that bounce from the surface of
an object (called also secondary rays).
[0087] Geographical neighborhood of cells--In a 3D grid of uniform
cells, each cell has 6 cells (6-connected) in a direct geographical
neighborhood.
[0088] Communication neighborhood--in a ring communication topology
each processing element has two direct neighbors on the ring.
[0089] Asynchronous exchange of data--exchange of data between
sender and receiver (e.g. two neighboring cells) in a random way,
without an agreed-upon timing of the transfer, and not in a
centrally controlled way.
[0090] Autonomous exchange of data--exchange of data from a
sender-cell to a receiver-cell, performed by the processing element
assigned to sender-cell, without needing to synchronize with other
processing elements, or with a central control system.
[0091] The principles and operation of an apparatus according to
the present invention may be understood with reference to the
figures and the accompanying description wherein similar components
appearing in different figures are denoted by identical reference
numerals. The drawings and descriptions are conceptual only. In
actual practice, a single component can implement one or more
functions; alternatively, each function can be implemented by a
plurality of components and devices. In the figures and
descriptions, identical reference numerals indicate those
components that are common to different embodiments or
configurations. Identical numerical references (even in the case of
using different suffix, such as 5, 5a, 5b and 5c) refer to
functions or actual devices that are either identical,
substantially similar or having similar functionality. It will be
readily understood that the components of the present invention, as
generally described and illustrated in the figures herein, could be
arranged and designed in a wide variety of different
configurations. Thus, the following more detailed description of
the embodiments of the apparatus, system, and method of the present
invention, as represented in the figures herein, is not intended to
limit the scope of the invention, as claimed, but is merely
representative of embodiments of the invention.
[0092] The processes/devices (or counterpart terms specified above)
and displays presented herein are not inherently related to any
particular computer or other apparatus, unless specifically stated
otherwise. Various general purpose or commodity computing systems
may be used with programs in accordance with the teachings herein,
or it may prove convenient to construct a more specialized
apparatus to perform the desired method. The desired structure for
a variety of these systems will appear in the description below. In
addition, embodiments of the present invention are not described
with reference to any particular programming language. It will be
appreciated that a variety of programming languages may be used to
implement the teachings of the inventions as described herein.
Skewed Modular Mapping
[0093] In one aspect of the invention, the space is subdivided into
a uniform grid of cells according to the Modular Mapping (MM)
scheme. In general, any spatial subdivision in parallel ray tracing
must have a threefold use: (i) data distribution over processors,
(ii) acceleration structure, and (iii) a means of knowing what data
is to be fetched from other processors. In addition, the spatial
subdivision must address the load imbalance problem. It will become
apparent how the Modular Mapping subdivision of the present
invention addresses all those requirements, despite its simplicity
and low cost.
[0094] We divide the whole scene into Nx, Ny, and Nz intervals
along the x, y, and z axes, respectively, so that the
three-dimensional scene is partitioned into Nx.times.Ny.times.Nz
axis-aligned grid cells. To make the analysis of time and space
complexity easier, we will assume Nx=Ny=Nz=N. The universal space is
then partitioned into N^3 axis-parallel cuboidal cells.
[0095] The cells are arranged in N modular subsets (or modules),
such that each module comprises exactly N^2 cells. N is also the
number of processors, in a way that each processor, or cluster of
processors, is allocated a module of N^2 cells. The k-th module is
the geometric locus of all cells such that:
k = (x + y + z) mod N
[0096] An example is shown in FIG. 2A. A scene universe is divided
into 4.times.4.times.4 equal cells. Each cell is enumerated
according to its coordinates X, Y, Z. FIG. 2B depicts the same
cells sorted by their modular affiliation. Notice the skewed
alignment of modules. A cell belongs to the kth module by its
location, according to the modular formula:
k=(x+y+z)mod 4
[0097] e.g. a cell at x=2, y=1, z=0 is assigned to module 3.
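The mapping of paragraphs [0095]-[0097] can be sketched in a few lines of Python (the helper name is ours, not the patent's):

```python
# Skewed Modular Mapping: cell (x, y, z) belongs to module k = (x+y+z) mod N,
# where N is the number of processors (or processor clusters).
from collections import Counter

def module_of(x, y, z, n):
    """Return the module (and hence processor) owning cell (x, y, z)."""
    return (x + y + z) % n

# The 4x4x4 example of FIG. 2A-2B:
N = 4
assert module_of(2, 1, 0, N) == 3      # the cell cited in paragraph [0097]

# Each module receives exactly N^2 of the N^3 cells:
counts = Counter(module_of(x, y, z, N)
                 for x in range(N) for y in range(N) for z in range(N))
assert all(c == N * N for c in counts.values())
```

The even N.sup.2-cells-per-module split falls out of the formula itself, which is what makes the scheme so cheap compared to adaptive subdivisions.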
[0098] Data distribution over processors. All cells are distributed
over at least N processors, such that each module is dominated by
one processor, or cluster of processors; for simplicity, the
following discussion refers to the case of a single processor per
module. By way of example, the table below shows cell distribution
over four processors, for the case of N=4.
TABLE-US-00001
X+Y+Z: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
PE:    0 1 2 3 0 1 2 3 0 1  2  3  0  1  2  3
[0099] Because of this simply formulated correspondence between
data and processors, the distribution is balanced by construction:
FIG. 2C shows schematically that each PE is assigned exactly 16
cells. The scattering pattern of modular cells is apparent in FIG.
2D, which shows module k=3 out of the 4 modules. Its fairly uniform
distribution across the space is a critical factor for effective
load balancing.
[0100] In general, acceleration structures assist in reducing the
number of ray/object intersection tests while looking for the hit
point. In complex structures, such as a KD-tree or Octree, this is
done by traversing the data tree, which is a very expensive process.
In contrast, ray tracing on the uniform grid of MM is based on the
well-known incremental algorithm for line drawing on a 3D raster
grid (3D DDA). While tracing, the algorithm skips empty cells by
only examining the farthest intersection point of the ray with the
next non-empty cell.
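The empty-cell skipping described above can be sketched with the standard 3D DDA voxel traversal; the function name and the set-based emptiness test below are our illustrative choices, not the patent's:

```python
# Simplified 3D DDA grid walk: advance cell-by-cell along the ray,
# returning the first non-empty cell pierced (empty cells are skipped).
def first_nonempty(origin, direction, n, non_empty):
    """Walk the ray through an n x n x n grid; `non_empty` is a set of
    (x, y, z) cell indices. Returns the first non-empty cell, or None
    if the ray leaves the scene."""
    cell = [int(c) for c in origin]
    step, t_max, t_delta = [], [], []
    for o, d, c in zip(origin, direction, cell):
        if d > 0:
            step.append(1); t_max.append((c + 1 - o) / d); t_delta.append(1 / d)
        elif d < 0:
            step.append(-1); t_max.append((c - o) / d); t_delta.append(-1 / d)
        else:
            step.append(0); t_max.append(float("inf")); t_delta.append(float("inf"))
    while all(0 <= c < n for c in cell):
        if tuple(cell) in non_empty:
            return tuple(cell)
        axis = t_max.index(min(t_max))   # next cell boundary crossed by the ray
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return None                          # the ray left the scene space

# Ray along +X from cell (0,0,0); cells (1,0,0) and (2,0,0) are empty.
assert first_nonempty((0.5, 0.5, 0.5), (1.0, 0.0, 0.0), 4, {(3, 0, 0)}) == (3, 0, 0)
assert first_nonempty((0.5, 0.5, 0.5), (1.0, 0.0, 0.0), 4, set()) is None
```

No tree traversal is involved: the cost of skipping an empty cell is a single comparison and an increment, which is the point of using the uniform MM grid.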
[0101] Data fetched from other processors. In the regular course of
ray tracing according to the present invention, for small scenes
the entire scene can be stored in a shared memory, while for large
databases it has to be distributed among the processors' local
memories. In the latter case, the processor holding the data of a
required x, y, z site can be easily located by any other processor
by simply calculating the k value from the MM formula.
[0102] One load balancing method of the present invention exploits
the natural dispersion of modular cells across the space, as shown
in FIG. 2D. This is intra-modular, or local, load balancing: the
processor (or cluster of processors) to which all cells of a given
module are allocated prioritizes its attention according to each
cell's load. Moreover, a processor can handle multiple cells at a
time by multi-threading. The other method is inter-modular load
balancing, having a more global effect, carried out by a processor
"borrowing" mechanism. It is based on the MM feature of correlation
between geographical and computational neighborhoods, and is used
to offload an overloaded processor onto its closest neighboring
processors. This method will become apparent in the following
discussion.
Geographical Vs. Computational Neighborhood
[0103] An important feature of the Modular Mapping of the present
invention is the similarity between geographical and computational
neighborhoods. First, the neighborhood rule of MM must be
understood. We analyze two circles of proximity to a cell: the
first degree comprises the 6 closest wrapping cells, and the second
degree counts 18 cells (including the 6). In FIG. 3A cell 1, 1, 1
is shown surrounded by its 6-connected geographical neighboring
cells. The mapping of those cells onto modules is indicated in FIG.
3B. The central cell, belonging to module k, is wrapped by cells
assigned to two modules only: k+1 and k-1. Because of the
one-to-one mapping of module per processor, this means that the
central processor PEk can access all the geographically surrounding
data of every one of its modular cells by fetching from only two
processors. And vice versa, the processors PEk-1 and PEk+1 can
access all the data of PEk, not just for the central cell, but for
all its modular cells. Moreover, in the simplest ring topology, as
illustrated in FIG. 3C, the distance is one hop only. The
imperative fact is that the computational proximity matches the
geographical one; thus the shortest geographical distance between
cells is translated to the shortest inter-processor communication
distance.
[0104] The second circle of proximity includes 18-connected cells.
12 of them, excluding the 6-connected, are shown in FIG. 4A. The
way they are mapped onto modules is indicated in FIG. 4B. Two
modules have joined the first ones: k-2 and k+2. All four modules,
assigned to processors, are shown in FIG. 4C. The central processor
PEk can access all the 18-connected geographical neighbors of any
of its cells from four processors only, and vice versa, the
processors PEk-1, PEk-2, PEk+1 and PEk+2 can access all the data of
PEk, in just one or two ring hops.
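The neighborhood rule of paragraphs [0103]-[0104] can be verified mechanically. The sketch below (helper names are ours) enumerates the face-connected and edge-connected neighbor offsets and checks which modules, relative to the cell's own k, they fall into; note that six of the twelve edge-connected neighbors in fact remain in module k itself and so need no communication at all:

```python
# Check which modules the first- and second-degree neighbors of a cell occupy.
def module_of(x, y, z, n):
    return (x + y + z) % n

def relative_modules(cell, offsets, n):
    """Modules of the given neighbors, expressed relative to the cell's k."""
    x, y, z = cell
    k = module_of(x, y, z, n)
    return {(module_of(x + dx, y + dy, z + dz, n) - k) % n
            for dx, dy, dz in offsets}

# 6 face-connected offsets (first degree) and 12 edge-connected (second degree).
face = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
edge = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        for dz in (-1, 0, 1) if abs(dx) + abs(dy) + abs(dz) == 2]

N = 8
assert relative_modules((1, 1, 1), face, N) == {1, N - 1}     # only k+1, k-1
assert relative_modules((1, 1, 1), edge, N) == {0, 2, N - 2}  # k, k+2, k-2
```

A face neighbor changes exactly one coordinate by +-1, so its coordinate sum, and hence its module, changes by +-1; an edge neighbor changes two coordinates, so the sum changes by -2, 0 or +2.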
[0105] It is noteworthy that in the bouncing stage of ray tracing,
the majority of rays entering or leaving a cell communicate with
"next door" cells, which due to the modular scheme is always done
across a minimal communication distance between two neighboring
processors on the ring. There is no requirement to exchange data
between any two non-adjacent processors on the ring. As will become
apparent, in the preferred embodiment of the present invention this
minimal computational distance is cheaply bridged by shared
cache-memory access between adjacent PEs.
Inter-Modular Load Balancing
[0106] Spatial subdivision suffers from clustering of objects in a
few regions for most scenes. Clustering at the modular scale can be
overcome by the intra-modular load balancing mechanism. However, for
global scale clustering an inter-modular mechanism is vital, such
that the load distribution between modules can be adjusted. Global
clustering is addressed by the "borrowing" mechanism, based on the
inexpensive data transfer between adjacent PEs. For an overloaded
module, part of the load can be taken by an unemployed neighboring
PE, which is borrowed to process some cell subset of the busy
neighbor. The data associated with those cells must be delivered to
the borrowed PE. This is readily accomplished by data transfer
between adjacent PEs.
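The borrowing step can be sketched as a one-pass rebalancing over the ring; the threshold, the even split and all names below are our illustrative choices, not the patent's:

```python
# Inter-modular "borrowing" sketch: each overloaded module sheds its excess
# to its ring neighbors k-1 and k+1, each reachable in a single hop.
def borrow(loads, threshold):
    """loads: work units per module; returns the rebalanced loads."""
    n = len(loads)
    out = list(loads)
    for k, load in enumerate(loads):
        excess = load - threshold
        if excess > 0:
            half = excess // 2
            out[k] -= 2 * half            # keep any odd remainder locally
            out[(k - 1) % n] += half      # one hop left on the ring
            out[(k + 1) % n] += half      # one hop right on the ring
    return out

# Teapot-in-a-stadium style imbalance: all the work sits in module 2 of 4.
assert borrow([0, 0, 12, 0], threshold=4) == [0, 4, 4, 4]
```

The point is not the split policy but the transfer pattern: every move in this sketch crosses exactly one ring hop, matching the adjacent-PE data paths described above.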
[0107] The borrowing mechanism is demonstrated on a test case well
known in the art: the "teapot-in-a-stadium" scene. This test scene
has a very small object (the teapot) inside a very large empty
space (the stadium). If a traditional uniform grid subdivision is
used for this scene, the result is very inefficient due to the
extremely imbalanced load, as shown in FIG. 5A.
[0108] FIG. 5B depicts the scene in the present invention's MM data
domain. For clarity, only a 2D section is shown. The only object
(the teapot) falls in a "central" cell belonging to module k,
creating an extreme load imbalance. The borrowing mechanism of the
present invention solves such imbalance by migrating parts of the
cell's load to the neighboring processors, PEk-1 and PEk+1,
offloading the central PEk. The source data at cell k must then be
shared with the destination processors. Due to the Modular Mapping,
the minimal geographical distance to surrounding cells correlates
with the smallest physical distance between the memories of PEk and
those of PEk-1 and PEk+1. This feature significantly lowers the
cost of data moves in the case of borrowing. FIG. 5C shows the
solution in the processor domain. Processors k-1 and k+1 are
"borrowed" to offload the central processor k. The relevant data
residing in the k-th PE's memory must be delivered to the borrowed
processors. Due to the MM scheme this data is transferred across
the ring between adjacent processors. No global communication is
required.
[0109] An extended load balancing of the teapot-in-a-stadium scene
can be done by mobilizing the second degree neighborhood as well.
As shown in FIG. 5D, processors k-2, k-1, k+1 and k+2 are
"borrowed" to offload the central processor k. The relevant data
residing in the k-th PE's memory is transferred in at most 2 hops
across the ring. Again, no global communication is required.
Tracing a Ray
[0110] FIG. 6 shows a very basic example of tracing a pathway from
a particular point on the camera image plane (a pixel 601) into the
scene, until it hits a surface at intersection point 602. The ray
is created by a PE at the front cells (facing the screen) of its
module. The ray progression is based on the incremental 3D DDA
algorithm for line drawing on a 3D raster grid. Each module keeps a
list of its empty cells. While tracing, the algorithm skips empty
cells by only examining the farthest intersection point of the ray
with the next non-empty cell. The intersection at each cell is
tested against the accelerating structure of a local KD-tree. At
the intersection point a shadow ray is created toward the light
source 603, testing whether the hit point is shadowed by any
occluding object, and the primary ray splits into a reflective
(secondary) ray that hits another object 605. Assuming the maximal
recursion depth of a ray is limited to 2, the reflection ray
terminates the trail. On its way back the secondary ray brings back
the light and texture values of its hit point 605, which are summed
up at the cell of intersection 602, and the final value is returned
to the screen's pixel 601.
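The control flow of this walk-through, local shading plus one recursive reflection cut off at depth 2, can be caricatured as follows; the scene model, the 0.5 reflectance and all names are invented purely for illustration:

```python
# Toy depth-limited fold-back: each bounce contributes its surface light
# (unless the shadow-ray test fails) plus an attenuated reflection term.
def trace(bounces, depth, max_depth=2):
    """bounces: per-hit (surface_light, shadowed) pairs along the ray path.
    Returns the light value folded back toward the pixel."""
    if depth > max_depth or not bounces:
        return 0.0                                   # trail terminated
    light, shadowed = bounces[0]
    local = 0.0 if shadowed else light               # shadow-ray test
    # one reflective (secondary) ray continues the trail, attenuated by 0.5
    return local + 0.5 * trace(bounces[1:], depth + 1, max_depth)

# Pixel 601 -> hit 602 (lit) -> reflected hit 605 (lit); the depth limit
# of 2 terminates the trail before a third hit could contribute.
assert trace([(1.0, False), (0.5, False), (0.9, False)], depth=1) == 1.25
assert trace([(1.0, True)], depth=1) == 0.0          # hit point fully shadowed
```

In the actual system this fold-back is distributed: each summation happens at the cell owning the node, and only the aggregated value travels back toward the pixel.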
[0111] Another example demonstrates the dramatic cutback of
inter-PE communication, an important aspect of the present
invention, using the more elaborate case depicted in FIG. 7A. The
scene is partitioned into a regular grid of 8.times.8.times.8
cells, controlled by a parallel array of 8 PEs, 64 cells per PE. A
cell is enumerated by its X, Y, Z coordinates, e.g. cell 642 is
positioned at X=6, Y=4, Z=2, and allocated to PE4 by its modular
value. The path of the ray starts at a given pixel (a) 701, and
intersects an object in cell 642 (b) 702, where a generation of
three additional rays is created: a shadow ray that is sent to the
light source (f) 705, a reflection ray directed to (c) 703 in cube
652, and a transmission ray (d) 704 reaching cube 542. A further
generation is shown when the ray at 652 is reflected to cube 742
(e) 706. Additionally, from each point of intersection a shadow ray
is sent to the light source (f) 705. The table of FIG. 7B
summarizes the relations between the involved cells and PEs. The
consequent inter-PE communication while tracing the ray family is
shown in FIG. 7C. The entire data transfer is limited to PE4 and
its two neighbors, PE3 and PE5. Another data transfer, between
cells 652 and 742 (c to e), needs no communication, since both
cells are allocated to the same processor PE5.
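The allocations in this example follow directly from the MM formula applied to the cells' XYZ digits; a quick check (helper name ours):

```python
# Recomputing the PE assignments of the FIG. 7A example (8x8x8 grid, 8 PEs).
def pe_of(cell, n=8):
    """Cell named by its XYZ digits, e.g. 642 -> X=6, Y=4, Z=2."""
    x, y, z = (int(d) for d in str(cell))
    return (x + y + z) % n

assert pe_of(642) == 4   # intersection (b): owned by PE4
assert pe_of(652) == 5   # reflection target (c)
assert pe_of(542) == 3   # transmission target (d)
assert pe_of(742) == 5   # second-generation target (e): same PE as (c),
                         # so the c-to-e transfer needs no communication
```

All four cells thus resolve to PE4 and its ring neighbors PE3 and PE5, confirming that the whole ray family is traced without any long-distance transfer.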
[0112] The above example demonstrates how the inter-processor
communication in a data-parallel ray tracing system is reduced. Due
to the Modular Mapping of the present invention, the data exchange
always occurs between two neighboring PEs on the ring-based
interconnect, which is the shortest data exchange path; a longer
exchange path, i.e. between two distant PEs along the ring, is
never required. Such reduced communication ensures higher
efficiency and improved performance.
Ray and Scene Data Structures
[0113] The following detailed description of the ray and data
structures, as represented in the figures herein, is not intended
to limit the scope of the invention, as claimed, but is merely
representative of embodiments of the invention.
[0114] In the present invention, taking the data-parallel approach
to ray tracing, the object data is distributed over the memories,
gaining high locality as each processor relates to a subset of the
database and traces the segments of rays that pass through its
subspace. Moreover, the exchange of data from a sender-cell to a
receiver-cell, by the processing element assigned to the
sender-cell, is performed locally, i.e. with no need to synchronize
with other processing elements or with any central control system.
This is an autonomous exchange of data.
[0115] Three main criteria need to be observed for such
distributions to lead to efficient execution of the parallel
algorithm:
Communication requirements during rendering need to be minimized;
The memory overhead for each processor should be as equal as
possible;
Processing time for each processor needs to be equalized.
[0116] Communication requirements are minimized by high data
locality, excluding massive moves of data; moreover, the Modular
Mapping of the present invention transforms the massive, global
communication needed for balancing the load into neighbor-wise,
regularly patterned communication.
[0117] The amount of scene memory varies among processors because
of the schematic data distribution, with no attempt to distribute
the scene objects equally. However, the locality of scene data
minimizes its movement. Moreover, the use of cache memories in the
present invention is intended to nullify cache misses and memory
fetches during cell processing, once the cell data has been
downloaded from memory to cache. All this keeps the memory overhead
low and fairly equal. The other data in the system is that of rays;
it is non-massive and moves among processors in a neighbor-wise
fashion, accommodated by the shared-cache transfers of the
preferred embodiment, or by a single ring hop in the commodity-based
embodiment, still minimizing memory overhead. The exception is when
empty cells must be skipped, in which case a ray may be transferred
between non-adjacent processors.
[0118] Equalization of processing time in cases of global
clustering is addressed by the "borrowing" mechanism, as described
hereinabove. The data associated with borrowing is delivered by
massive data transfers among adjacent PEs, by means of the
shared-cache transfers of the preferred embodiment, or by a single
ring hop in the commodity-based embodiment.
[0119] The scene and ray data are structured to address the above
three criteria. A processor's subspace consists of the entirety of
its module's cells, while the PE is capable of processing only one
cell at a time (or a few more by applying multithreading). Therefore
a cell can be in an active state, handled by the processor, or an
inactive state, waiting for its turn, and a mechanism is needed to
let all the inactive cells keep receiving rays from all over in
real time. This and other special data handling needs of the
present invention are accomplished by differentiating between the
duties of the processor's memory and its cache. FIGS. 8A-B depict
the data structures of both. As shown in FIG. 8A, the processor's
memory holds scene 812 and ray data 813 for all non-empty cells of
the module, and a list of empty cells 811. There are N cells per
module, M of which are non-empty. The cell's local scene 812 is
KD-tree structured for intersection acceleration. The cache memory
in FIG. 8B holds the entire data set of the active cell k 822. In
addition, the cache stores the input buffers of all the processor's
cells 824, enabling a continuous ray reception mode in real time
for all inactive cells (excluding the empty cells). The received
rays will be taken care of when the cell turns active. Another
important partition in the cache is the Transfer-buffer 821, which
accommodates communicating data between the processor and its
nearest neighbor.
[0120] A ray's life cycle has two distinct phases: (i) building the
ray family and associated nodes, and (ii) folding back the ray
structure while aggregating the light value toward the final screen
pixel. The build phase includes ray commencement at a screen pixel,
seeking an intersection with an object, constructing a node, and
generating children rays of shadow, transmission and reflection. At
this phase rays do not carry any light values. The second phase of
folding back starts upon termination of rays, due to depth
limitation or space limitation. All the family's rays are returned
to their source nodes carrying light values, which are summed up
with other rays in the node. A node is removed from the Node list
right after all its children have been aggregated into the parent
ray. The parent, in turn, is sent back to its source node, etc.,
until the entire family's light value finalizes at the screen
pixel.
[0121] FIG. 8C details the structure of the ray data for each
non-empty cell. The data associated with the life cycle of a ray is
divided into static and dynamic elements. Static data resides in
memory: Input-buffer (incoming rays), Output buffer (outgoing
rays), Termination buffer (terminating rays), and Node Record
defining the intersection point. The dynamic Ray Record, defining
ray attributes, source and destination, is conveyed among the
cells. Each of the classes is further broken down and explained in
FIG. 8C. Their exact functions will become clear in the example
hereinafter.
[0122] The static data comprises:
Input-Buffer
[0123] Incoming rays [0124] Local rays (created in course of local
processing) [0125] Folding rays (incoming ray on its way back)
Output-Buffer
[0125] [0126] immediate send (to be sent to target cell) [0127]
Return to source (to be returned to the cell it came from)
Termination Buffer
[0127] [0128] Parent rays (waiting for aggregation in Folding
cycle) [0129] Terminated rays (waiting for aggregation in Folding
cycle)
Node Record
[0129] [0130] local node ID; [0131] Pxl_ID; [0132] Topology [parent
(source_cell_ID), shadow (target_cell_ID), out_ray1 (target_cell);
out_ray2 (target_cell)]; [0133] Status: build(0)/fold(1); [0134]
Light value
The dynamic data comprises: Ray record: [0135] Pxl_ID;
[0136] Source (cell; node); [0137] Ray class: regular(0)/shadow(1);
[0138] Status: Build(0)/Fold(1); [0139] Ray's normal; [0140] Light
value
[0141] An understanding of the detailed structure of the Ray and
Node records is required for the elaborated example hereinafter:
Ray Record
[0142] Pixel ID; source cell; source node (0=pixel, any
number=node); regular (0) or shadow (1) ray; build (0) or fold (1)
state; ray's normal; light value
Node Record:
[0143] Local node ID; Pixel ID; father (source cell); shadow
(target cell or terminate(0)); child1 (target cell); child2 (target
cell or terminate(0)); state (build, fold); light value
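The two records of [0142]-[0143] translate naturally into the following sketch; the field names are ours (the patent specifies only the field order), and cell IDs are written as plain integers, dropping the leading zero of digit triples like 032:

```python
# Sketch of the dynamic Ray Record and static Node Record of [0142]-[0143].
from dataclasses import dataclass

@dataclass
class RayRecord:
    pixel_id: int
    source_cell: int      # cell the ray came from (cells named by XYZ digits)
    source_node: int      # 0 = came straight from the screen pixel
    shadow: int           # ray class: regular(0) / shadow(1)
    fold: int             # state: build(0) / fold(1)
    normal: tuple         # ray's direction
    light: float = 0.0

@dataclass
class NodeRecord:
    node_id: int
    pixel_id: int
    parent_cell: int      # father (source cell)
    shadow_cell: int      # shadow target cell, 0 = terminated
    child1_cell: int
    child2_cell: int      # 0 = terminated
    fold: int = 0         # state: build(0) / fold(1)
    light: float = 0.0

# The primary ray of [0146]: "1000000; 130; 0; 0; 0; normal value; 0"
r1 = RayRecord(1000000, 130, 0, 0, 0, (0.0, 0.0, 1.0))
# Node 1 of [0148]: "1; 1000000; 130; 032; 133; 213; 0; 0"
n1 = NodeRecord(1, 1000000, 130, 32, 133, 213)
assert (r1.shadow, r1.fold, r1.light) == (0, 0, 0.0)
assert (n1.fold, n1.light) == (0, 0.0)
```

Keeping the Ray Record small is what makes neighbor-wise ray traffic cheap: only this record, never scene geometry, travels between cells during the build and fold phases.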
[0144] A detailed example of the ray handling mechanism is
described in FIGS. 9A-9C. FIG. 9A shows the course of a primary
ray, secondary rays and shadow rays. This is a simplified test case
where the space is partitioned into 4.times.4.times.4 cells,
processed by a parallel array of 4 PEs, 16 cells per PE. The
exemplified cell is 133, processed by PE=3. The cell receives ray
R1, which during the Build phase intersects with object 1 and
splits into three children: reflected ray R'1, which leaves the
cell; transmitted ray T'1, which encounters another intersection
inside the cell; and shadow ray L'1, which leaves the cell for the
light source. The first point of intersection creates node 1. A
second node within the cell is created where T'1 hits object 1 and
splits into: R'2, which leaves the cell to an adjacent one; T'2,
which escapes out of the scene space; and shadow ray L'2, which
leaves toward the light source. The Folding phase occurs when rays
start returning with light values, which are summed up at the
nodes. R'2, T'2 and L'2 aggregate into T'1 and node 2 is removed
from the records. T'1, R'1 and L'1 aggregate into R1, which is sent
back to the pixel, and finally node 1 is removed.
[0145] Now let's repeat the above scenario step by step, the way it
is performed by PE=3. First, the Build phase. [0146] 1. At some
prior time, while cell 133 was inactive, a ray arrived in its
Received Rays queue with the following record: 1000000; 130; 0; 0;
0; normal value; 0. The ray's ID is 1000000, named after the source
pixel; the source cell is 130; it comes directly from the pixel
rather than from some node; it is a regular ray (not a shadow ray);
it is in the build phase; its direction is given by the normal
value; its light value is null. The ray waits in the queue for the
cell to become active. [0147] 2. When the cell becomes active, its
specific data is downloaded from memory into the PE's cache: (i)
the Cell.sub.--133 KD-tree, (ii) the static ray data comprising
buffers and the node list. [0148] 3. Node 1. The ray is taken from
the queue for an intersection test against the Cell.sub.--133
KD-tree. An intersection is found and registered in a newly created
Node 1: 1; 1000000; 130; 032; 133; 213; 0; 0. The ray is stored in
the Termination Buffer as a Parent ray, waiting for the Folding
phase. [0149] 4. Node 1. A reflection ray is created and tested for
a possible intersection with the Cell.sub.--133 KD-tree. Since no
internal intersection is found, the next cell on the ray's way is
pierced by 3D DDA. The ray record is assembled: 1000000; 133; 1; 0;
0; normal value; 0, and sent to target cell 213. [0150] 5. Node 1.
A transmission ray is created and tested for a possible
intersection with the Cell.sub.--133 KD-tree. An intersection is
found and registered in a newly created Node 2: 2; 1000000; 133;
032; 033; 0; 0; 0. The ray is stored in the Termination Buffer as a
Parent ray, waiting for the Folding phase. [0151] 6. Node 1. A
shadow ray is created; its direction toward the light source is
calculated, and it is tested for a possible intersection with the
Cell.sub.--133 KD-tree. Since no internal intersection is found,
the next cell on the ray's way is pierced by 3D DDA. The ray record
is assembled: 1000000; 133; 1; 1; 0; normal value; 0, and sent to
target cell 032. [0152] 7. Node 2. A reflection ray is created and
tested for a possible intersection with the Cell.sub.--133 KD-tree.
Since no internal intersection is found, the next cell on the ray's
way is pierced by 3D DDA. The ray record is assembled: 1000000;
133; 2; 0; 0; normal value; 0, and sent to target cell 033. [0153]
8. Node 2. A transmission ray is created and tested for a possible
intersection with the Cell.sub.--133 KD-tree. Since no internal
intersection is found, 3D DDA finds that the ray leaves the scene.
It is therefore given a background light value (or texture) and
stored as a Terminated Ray: 1000000; 133; 2; 0; 0; normal value; 0.
It is now ready for the Folding phase, to be summed up into its
parent ray at Node 2. [0154] 9. Node 2. A shadow ray is created,
its direction toward the light source is calculated, and it is
tested for a possible intersection with the Cell.sub.--133 KD-tree.
Since no internal intersection is found, the next cell on the ray's
way is pierced by 3D DDA. The ray record is then assembled:
1000000; 133; 2; 1; 0; normal value; 0, and sent to target cell
032.
[0155] The Folding phase is shown in FIG. 9C. [0156] 1. Rays are
waiting for folding when the cell becomes active again: (i) in the
cell's Input-Buffer, which permanently resides in the PE's cache
and continuously receives returning rays from previous target
cells, and (ii) Parent rays in the Termination Buffer. [0157] 2.
Node 2. The returning ray 1000000; 133; 2; 0; 1; normal value;
light value is summed up with the locally terminated ray 1000000;
133; 2; 0; 0; normal value; 0. The intermediate result is stored in
Node 2. Upon arrival of the shadow ray 1000000; 133; 2; 1; 1;
normal value; light value, the aggregation is completed and stored
in the Light field of the parent ray 1000000; 133; 1; 0; 0; normal
value; light value. The parent is stored in the Input Buffer for
the next folding cycle, and Node 2 is removed from the list. [0158]
3. Node 1. Upon arrival of the rays 1000000; 133; 1; 0; 1; normal
value; light value and 1000000; 133; 1; 1; 1; normal value; light
value, they are aggregated along with the awaiting ray 1000000;
133; 1; 0; 0; normal value; light value into the parent ray
1000000; 130; 0; 0; 1; normal value; light value, which is sent
back to cell 130. The record of Node 1 is removed from the list.
Data Locality
[0159] Assuming that the partition of the universe was properly
designed, such that the cache size is adequate for all of a cell's
data, then during the cell's activity period all its associated
data resides in the cache: the KD-tree, all its buffers and rays.
This data entirely describes everything occurring in the cell; no
additional data from external memory is needed. This locality of
data eliminates cache misses during the cell's processing time. The
only cache miss is associated with assigning the processor to the
successive cell, and even this can be avoided if the cache is big
enough to preload the successive cell's data.
Flowchart
[0160] The flowchart of FIG. 10A describes the high level ray
tracing process of the present invention. A parallel array of any
number N of PEs is assumed, while one of the PEs monitors the
process at a high level. It can be just one of the processors (e.g.
PE=0), performing the extra tasks 1100 of calculating and
broadcasting partitioning parameters to the PEs, and merging the
partial results of the PEs 1107-1108.
[0161] In the first step of the initialization phase, upon
receiving the partitioning parameters, each PE downloads its
sub-partitioned scene data from central storage to the PE's memory
1101. The data is analyzed to find the empty cells 1102 and to
construct a KD-tree for the non-empty cells 1103. Then a set of
buffers is initialized for every (non-empty) cell in the PE's
memory 1104. The Input buffers of all cells are copied to cache
memory, so as to stay in reception mode at all times 1105. Then
each processor starts the ray tracing phase 1106 (this stage is
further detailed in FIG. 10B).
[0162] Upon finalizing all partial pixel images at the PEs, the
monitoring PE merges them into the final image 1107, and sends the
image to the display 1108.
[0163] The flowchart of FIG. 10B describes the ray tracing at each
PE. First all the front cells, facing the screen image, are
processed to generate rays. It is possible to process a bundle of
rays, relying on their coherence; however, in the ray tracing
process described herein only a single ray is taken at a time, to
simplify the description. A ray, after being generated 1011, starts
the tracing process in the current front cell. This ray processing
task 1012 is described in depth elsewhere (FIG. 10C). If an
intersection is met in the front cell, it is registered in the
`local rays` section of the Input buffer for later processing.
After rays have been generated for all pixels of the current front
cell 1013, and all front cells are exhausted 1014, the most loaded
cell is chosen for processing 1019. This cell's KD-tree structured
data and ray structures are loaded into the PE's cache 1020-1021,
and the next ray in the `Input List` (`received`) 1015 is taken for
processing 1012. After all these rays are done, the locally
generated rays in the `Input Buffer` (`local`) 1016 are processed.
When this is done, all the nodes waiting in the `Termination
Buffer` of the current cell are folded 1017. Upon terminating the
cell, the next heaviest cell is chosen 1019, etc. When the `Input
Buffers` of all cells are done, the folding cycle 1018 terminates
the tracing.
[0164] The detailed flowchart of ray processing is depicted in FIG.
10C. It starts with the ray's intersection test against the local
KD-tree 1050. If no intersection is detected, the next cell along
the ray is calculated 1065 by means of 3D DDA, and the ray is
stored in the `Output Buffer` (`Immediate send`). If an
intersection has been detected, and the intersecting ray is not a
shadow ray, a new node is registered 1051, then shadow rays are
created and stored 1052-1053, and the secondary rays of reflection
1054 and refraction 1055 are calculated and stored for later
processing. If it is a shadow ray, the proper handling is done
1056-1059.
[0165] The folding cycle is described in FIG. 10D. The next ray
from the `folding` queue is taken and matched with a node 1031, its
light value is aggregated 1032, and nodes are folded back 1034
until the pixel of origin is reached 1036.
Implementation
PREFERRED EMBODIMENT OF PRESENT INVENTION
[0166] In the present invention, by taking the data-parallel
approach to parallel ray tracing, the inter-processor communication
is greatly reduced. Due to the Modular Mapping there is no massive
global communication, and most of the data exchange occurs between
neighboring PEs. Because of the high importance of neighbor-wise
communication, the preferred embodiment comprises shared memory
access between adjacent processors, replacing wired communication,
for higher efficiency and improved performance. The preferred
embodiment, depicted in FIG. 11A, is a distributed memory parallel
ray tracing system, where the scene data is distributed over the
memories. There are N processors; each processor 1101 includes a
CPU, cache and I/O, and is connected to a private memory 1102. All
processors are connected via a ring 1105. In some cases the ring
can be replaced by a different topology, e.g. a router; however,
this will deteriorate the efficiency and cost/performance. The
neighbor-wise communication, which happens to be the most
significant communication in the present invention, is implemented
by shared-memory means, via caches. Each processor 1101 is
connected to one neighbor through its own cache 1103 and by a
direct connection to the second neighbor's cache 1104. All
processors have access to the central storage 1106 via the ring.
The massive data moves, such as for inter-processor load balancing
("borrowing"), and the majority of ray data moving between
neighboring cells, are accomplished via cache-based communication.
Other ray communication, much less frequent, such as skipping over
empty cells, is done via the ring.
[0167] Another preferred embodiment, depicted in FIG. 11B, is a
shared memory parallel ray tracing system, with no private
memories. This kind of system is implementable with SOC (system on
chip) technology (also termed `multicore`), where the entire system
resides in a single chip, and silicon constraints preclude
constructing private memories on chip. The central memory 1110 is
external to the SOC.
Implementation on Commercial Multicore Architectures
[0168] The efficient algorithm of the present invention can be
implemented on many commercial parallel systems, which offer the
advantages of lower cost and a mature programming environment, such
as the many-core Larrabee architecture by Intel, Cell by IBM, and
others. In this class of architectures each core PE has its own
cache memory element, and the multiple processing cores are
connected via a ring bus. Each PE depends entirely on the ring to
transfer data to and from the other caches. In these transfers the
number of steps involved in sending a packet typically has very
little impact on transfer latency: the clock speed driving the
steps is very fast relative to other considerations. However,
longer communication distances are detrimental to the overall
performance of the ring bus, as they reduce the available
concurrency. Such reduced concurrency is an inhibiting factor for
prior art space division ray tracing. In contrast, the modular
space division of the present invention is an enabler of space
division ray tracing, replacing the majority of long communication
distances, or global communication, with short data exchanges
between neighboring PEs, drastically raising the concurrency of
ring communication.
[0169] Larrabee by Intel is an example of a many-core architecture
with coherent caching, designed around a fast ring connecting
X86-based CPUs with local caches, as shown in the block diagram of
FIG. 12.
Due to coherency, all local caches can be seen as one global 2nd
level (L2) cache, divided into separate local subsets, one per CPU
core. Each CPU has a fast direct access path to its own local
subset of the L2 cache. Data read by a CPU core is stored in its L2
cache subset and can be accessed quickly, in parallel with other
CPUs accessing their own local L2 cache subsets. Data written by a
CPU core is stored in its own L2 cache subset and is flushed to
other subsets, if necessary. The ring network ensures coherency for
shared data.
[0170] Implementation of the present invention's method on Larrabee
is done by dividing the scene space among the core CPUs, each CPU
controlling N.sup.2 cells. The monitoring is done either by one of
the cores functioning as a privileged CPU, or by an external CPU.
The local cache holds the cell data of each assigned cell, in turn.
Moving a ray record to another CPU is done by writing to its local
L2 cache subset via the ring bus. Data flows stepwise around the
ring. Data exchange between two physically neighboring CPUs can be
done concurrently with that of any other pair of CPUs. However,
longer communication distances are harmful to the overall
performance of the ring, as they reduce the available concurrency.
Therefore it is of great advantage to reduce the communication to
short neighboring segments along the ring, by applying the modular
mapping of the present invention. The four arrows in FIG. 12
indicate the communication of the example in FIG. 9A, implemented
on Larrabee.
[0171] Another commercial multi-core architecture is Cell by IBM
(see FIG. 13). Each Cell consists of a PowerPC core (PPE) and eight
"synergistic co-processor elements" (SPEs). Each SPE has a 256 KB
local store, which is incoherent with the rest of the memory
system, a memory flow controller (MFC) and a synergistic processing
unit (SPU). The EIB bus is responsible for transferring data
between SPEs. It is implemented as a circular ring comprising four
unidirectional channels which counter-rotate in pairs. Each SPE has
one read port and one write port, and a dedicated DMA management
queue capable of transactions to various endpoints without
interfering with the ongoing computations. Each SPE depends on DMA
to transfer data to and from the main memory and other SPEs' local
memories. Data flows on an EIB channel stepwise around the
ring.
[0172] Implementation of the present invention on Cell is rather
straightforward: e.g. the PPE monitors the entire process while the
scene space is partitioned among and controlled by the SPEs. The
local stores hold the data associated with the cells. Each SPE has
access to the local stores of other SPEs via the EIB bus.
[0173] Although several exemplary embodiments have been herein
shown and described, those of skill in the art will recognize that
many modifications and variations are possible without departing
from the spirit and scope of the invention, and it is intended to
measure the invention only by the appended claims.
* * * * *