Linköping Electronic Articles in
Computer and Information Science
Vol. 2(1997): nr 3
This work has been submitted for publication elsewhere.
Copyright may then be transferred,
and the present version of the article may be superseded by a revised one.
The WWW page at the URL stated below will contain up-to-date information
about the current version and copyright status of this article. Additional
copyright information is found on the next page of this document.
Linköping University Electronic Press
Linköping, Sweden
http://www.ep.liu.se/ea/cis/1997/003/
Parallel Algorithms for
Batched Range Searching on
Coarse-Grained
Multicomputers
Per-Olof Fjällström
Department of Computer and Information Science
Linköping University
Linköping, Sweden
Published on April 1, 1997 by
Linköping University Electronic Press
581 83 Linköping, Sweden
Linköping Electronic Articles in
Computer and Information Science
ISSN 1401-9841
Series editor: Erik Sandewall
© 1997 Per-Olof Fjällström
Typeset by the author using LaTeX
Formatted using etendu style
Recommended citation:
<Author>. <Title>. Linköping Electronic Articles in
Computer and Information Science, Vol. 2(1997): nr 3.
http://www.ep.liu.se/ea/cis/1997/003/. April 1, 1997.
This URL will also contain a link to the author's home page.
The publishers will keep this article on-line on the Internet
(or its possible replacement network in the future)
for a period of 25 years from the date of publication,
barring exceptional circumstances as described separately.
The on-line availability of the article implies
a permanent permission for anyone to read the article on-line,
and to print out single copies of it for personal use.
This permission cannot be revoked by subsequent
transfers of copyright. All other uses of the article,
including for making copies for classroom use,
are conditional on the consent of the copyright owner.
The publication of the article on the date stated above
included also the production of a limited number of copies
on paper, which were archived in Swedish university libraries
like all other written works published in Sweden.
The publisher has taken technical and administrative measures
to assure that the on-line version of the article will be
permanently accessible using the URL stated above,
unchanged, and permanently equal to the archived printed copies
at least until the expiration of the publication period.
For additional information about the Linköping University
Electronic Press and its procedures for publication and for
assurance of document integrity, please refer to
its WWW home page, http://www.ep.liu.se/,
or write by conventional mail to the address stated above.
Abstract
We define the batched range-searching problem as follows: given a set S of n points and a set Q of m hyperrectangles, report for each hyperrectangle which points it contains. This problem has applications in, for example, computer-aided design and engineering. We present several parallel algorithms for this problem on coarse-grained multicomputers. Our algorithms are based on well-known average- and worst-case efficient sequential algorithms. One of our algorithms solves the d-dimensional batched range-searching problem in O(T_s(n log^{d−1} p, p) + T_s(m log^{d−1} p, p) + ((m + n) log^{d−1}(n/p) + m log^{d−1} p log(n/p) + k)/p) time on a p-processor coarse-grained multicomputer. (T_s(n, p) denotes the time to globally sort n numbers on a p-processor multicomputer, and k is the total number of reported points.)
Keywords Parallel algorithms, coarse-grained multicomputers,
range searching.
The work presented here is funded by CENIIT (the Center for
Industrial Information Technology) at Linköping University.
1 Introduction
In many applications, such as geographic information systems, com-
puter-aided design and engineering, statistics, etc., we need to answer
the following range-searching query: given a set S of n points, which
points lie within a given hyperrectangle? (A hyperrectangle is the
Cartesian product of intervals on distinct coordinate axes.) Usually,
we need to answer many such queries for the same set of points.
In some situations, we know the set of queries in advance. That
is, we want to solve the following batched range-searching problem:
given a set S of n points and a set Q of m hyperrectangles, report
for each hyperrectangle which points it contains. For example, this
is an important subproblem in computer simulation of deformation
processes, such as vehicle collisions and mechanical forming processes.
In such simulations, finding all contacts between components of finite-element models of physical objects is necessary. This can be simplified by approximating surface segments with hyperrectangles, and then determining which vertices these hyperrectangles contain [1, 2].
In this paper, we present parallel algorithms for batched range
searching on coarse-grained multicomputers. A coarse-grained mul-
ticomputer consists of several processors connected by an intercon-
nection network. Each processor is fairly powerful, i.e., it delivers
workstation-class performance. Since off-the-shelf hardware can be
used, coarse-grained multicomputers are relatively inexpensive. Most
commercially available parallel computers are of this type.
Most of the research on parallel algorithms for geometric problems has focused on fine-grain parallel models of computation [3, 4, 5]. It is only during the last couple of years that researchers have designed parallel geometric algorithms for coarse-grained multicomputers [6, 7, 8, 9, 10, 11, 12, 13, 14]. In this model of computation we can
assume that the size of each local memory is large. For example, it
is common to assume that the size of each local memory is larger
than the number of processors. This property allows the algorithm
designer to balance communication latency with local computation
time.
Our parallel algorithms for batched range searching are based on well-known worst- and average-case efficient sequential algorithms. One of our algorithms is based on the range-tree method, and solves the d-dimensional batched range-searching problem in O(T_s(n log^{d−1} p, p) + T_s(m log^{d−1} p, p) + ((m + n) log^{d−1}(n/p) + m log^{d−1} p log(n/p) + k)/p) time on a p-processor coarse-grained multicomputer. (T_s(n, p) denotes the time to globally sort n numbers on a p-processor multicomputer, and k is the total number of reported points.) We also give algorithms based on the cell method. This method has poor worst-case performance, but since it can be very efficient in practice, we believe that developing parallel algorithms based on this approach is important.
Other researchers have developed parallel algorithms for range searching on coarse-grained multicomputers. Devillers and Fabri [7] give an algorithm for the one-dimensional case. Recently, Ferreira et al. [14] presented algorithms for the d-dimensional case. They construct a distributed range tree in time O(s/p + T_s(s, p)), where s = n log^{d−1} n. They can then answer a set of m = O(n) range queries in time O((s log n + k)/p + T_s(s, p)).
We organize the rest of the paper as follows. In Section 2, we give
additional information about coarse-grained multicomputers, and de-
scribe some basic operations used by our algorithms. In Sections 3
and 4, we present parallel range-searching algorithms based on the
range-tree and cell methods, respectively.
2 Model of Computation
Coarse-grained multicomputers consist of a set of processors con-
nected through an interconnection network. The number of proces-
sors usually varies between 16 and 256. The memory is physically
distributed over the processors, and interaction between processors
is through message passing. Each processor can execute a different program independently of the other processors. However, it is common to let each processor execute the same program asynchronously. That is, except for a few global communication steps, processors execute the same program independently of each other. Common interconnection networks are 2D meshes (Paragon XP/S), 3D meshes (Cray T3E), hypercubes (nCUBE 2), and fat trees (CM-5).
Our algorithms use a few basic and extensively studied communication operations. We next describe these operations, and give their time complexities for a square 2D mesh with p processors, which are assumed to be indexed from 1 through p. For a detailed description and analysis of the operations, see Kumar et al. [15].
Monotone routing: Each processor P(i) sends at most one m-word message. The destination address, d(i), of the message sent by P(i) is such that if both P(i) and P(i′), i < i′, send messages, then d(i) ≤ d(i′). The time complexity, T_mr(m, p, r_max), is O((r_max + m)√p), where r_max is the maximum number of words received by any processor.
Segmented broadcast: Processors with indexes i_1 < i_2 < ... < i_q are selected; each processor P(i_j) sends the same m-word message to all processors P(i_j + 1) through P(i_{j+1} − 1). The time complexity, T_sb(m, p), is O(m√p).
Multinode broadcast: Every processor sends the same m-word message to every other processor. The time complexity, T_mb(m, p), is O(mp).
Total exchange: Every processor sends a distinct m-word message to every other processor. The time complexity, T_x(m, p), is O(mp√p).
Prefix sums and reduction: Let a_1, a_2, ..., a_n be a list of numbers evenly distributed over the processors and let ⊕ be an associative operator. The prefix sums operation computes s_i = a_1 ⊕ ... ⊕ a_i, and stores s_i in the same processor as a_i. The time complexity, T_p(n, p), is O(n/p + √p). The reduction operation computes s = a_1 ⊕ ... ⊕ a_n, and stores s in each processor. The time complexity, T_r(n, p), is O(n/p + √p). In the segmented versions of these operations, we apply them to sublists of a_1, a_2, ..., a_n. The time complexity is the same as for the ordinary operations.
Global sort: Given a list a_1, a_2, ..., a_n of numbers evenly distributed over the processors, the global sort operation sorts the list, and returns it evenly distributed over the processors. The time complexity, T_s(n, p), is O(n(log(n/p) + √p)/p).
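To make the semantics of the scan operations concrete, the following is a minimal sequential sketch of the prefix-sums operation and its segmented variant. The code and its names are ours, not from the paper; a real implementation would of course distribute the list over the p processors.

```python
from itertools import accumulate
from operator import add

def prefix_sums(a, op=add):
    """Compute s_i = a_1 (op) ... (op) a_i for an associative operator op."""
    return list(accumulate(a, op))

def segmented_prefix_sums(a, seg_ids, op=add):
    """Segmented variant: restart the scan at every segment boundary."""
    out = []
    for i, x in enumerate(a):
        if i > 0 and seg_ids[i] == seg_ids[i - 1]:
            out.append(op(out[-1], x))   # continue the running sum
        else:
            out.append(x)                # new segment: restart
    return out
```

For example, `prefix_sums([1, 2, 3, 4])` yields `[1, 3, 6, 10]`, while `segmented_prefix_sums([1, 2, 3, 4], [0, 0, 1, 1])` restarts at the third element and yields `[1, 3, 3, 7]`.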
We end this section by showing how some of the above operations
can be used to solve a simple data-copying problem. This is an im-
portant subproblem in the algorithms to be presented in this paper.
The data-copying problem is as follows. A set R of n equal-sized data
records is evenly distributed over the processors of a p-processor mul-
ticomputer. With each record r is associated a nonnegative integer
n(r). The task is to create n(r) additional copies of each record r
such that the work of creating the records is uniformly distributed
over the processors. We do this as follows.
1. Let R′ = {r ∈ R : n(r) > 0}, and let w = Σ_{r∈R′} n(r). Decompose R′ into subsets R′(i), i = 1, 2, ..., p, such that Σ_{r∈R′(i)} n(r) = ⌊w/p⌋ for i ≤ p⌈w/p⌉ − w, and Σ_{r∈R′(i)} n(r) = ⌈w/p⌉ otherwise.
2. For i = 1, 2, ..., p, copy R′(i) to the processor P(i). Create the copies of the records in R′(i) in the processor P(i).
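As an illustration, the balanced decomposition of Step 1 can be sketched sequentially with the rule l_k = ⌊s_k/(w/p)⌋ used in the proof of Lemma 1, splitting any record that straddles a group boundary. This sketch is ours; it assumes, as the proof does, that w is a multiple of p, that all weights are positive, and it uses 0-indexed groups.

```python
def decompose(records, p):
    """Split (record, weight) pairs with positive weights into p groups of
    total weight exactly w/p each, splitting records across boundaries."""
    w = sum(n for _, n in records)
    assert w % p == 0, "sketch assumes w is a multiple of p"
    chunk = w // p
    groups = [[] for _ in range(p)]
    s_prev = 0                           # prefix sum s_{k-1}
    for r, n in records:
        s = s_prev + n                   # prefix sum s_k
        l_prev, l = s_prev // chunk, s // chunk
        if l_prev == l:                  # record fits inside one group
            groups[l].append((r, n))
        else:                            # split r across groups l_prev .. l
            first = (l_prev + 1) * chunk - s_prev
            if first > 0:
                groups[l_prev].append((r, first))
            for j in range(l_prev + 1, l):
                groups[j].append((r, chunk))
            last = s - l * chunk
            if last > 0:
                groups[l].append((r, last))
        s_prev = s
    return groups
```

With `records = [("a", 3), ("b", 4), ("c", 5), ("d", 8)]` and `p = 4`, every group receives weight exactly 5, and records "b", "c" and "d" are split across group boundaries.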
Lemma 1 We can solve the data-copying problem in O(T_mr(n/p, p, (n + w)/p) + (n + w)/p) time, where w is the total number of copies and n ≥ p².
Proof. Regard R′ as an ordered set {r_1, r_2, ..., r_m}. We begin Step 1 by computing the prefix sums s_1, s_2, ..., s_m, where s_k = Σ_{j=1}^{k} n(r_j). To simplify the description of how to decompose R′ into subsets, we assume that w is an integer multiple of p. Extending our description to the general case is easy. Let l_k = ⌊s_k/(w/p)⌋. For each record r_k, if l_{k−1} = l_k, then we assign r_k to the subset R′(l_{k−1} + 1). Otherwise, let d_k = l_k − l_{k−1}. Next, we create new records r_{k,j}, j = 0, 1, ..., d_k, such that we (1) assign r_{k,0} to the subset R′(l_{k−1} + 1) and set n(r_{k,0}) = (l_{k−1} + 1)w/p − s_{k−1}, (2) assign r_{k,d_k} to the subset R′(l_k + 1) and set n(r_{k,d_k}) = s_k − l_k w/p, and (3) assign r_{k,j}, 0 < j < d_k, to subset R′(l_{k−1} + 1 + j) and set n(r_{k,j}) = w/p. Observe that no subset contains more than ⌈w/p⌉ elements. In Step 2, we first identify nonlocal subsets, i.e., subsets whose elements lie in several processors. To do this each processor sends the indexes of the lowest- and highest-indexed subset that it contains to every other processor. Using monotone routing, we then copy the nonlocal subsets directly to their final destinations. Some processors may completely contain one or more subsets. We handle this by copying all such subsets to the final destination of the lowest-indexed subset in the processor. If a processor completely contains more than one subset, we then use segmented broadcast to transfer the subsets to their correct destinations. □
3 A Worst-Case Efficient Algorithm for Range Searching
Let us again state the problem in which we are interested. The input
consists of a set S of n points and a set Q of m hyperrectangles. The
task is to report, for each hyperrectangle, which points it contains. In
our development of parallel algorithms for this problem, we assume that initially each processor stores n/p points and m/p hyperrectangles, and that m and n are both greater than or equal to p². The output consists of hyperrectangle-point pairs, that is, for each hyperrectangle q and point p such that p is contained in q, the pair (q, p) is created.
In this section we present a parallel algorithm inspired by the sequential range-tree method [16]. This is a worst-case efficient method. We can use it to solve the d-dimensional, d ≥ 2, batched range-searching problem in time O(n log^{d−1} n + m log^d n + k), where k is the total number of reported points. For a set S of points in the plane, the corresponding range tree consists of a binary search tree on the x-coordinates of the points. That is, every node v represents an interval I(v) such that a leaf node represents the interval between two consecutive x-coordinates, and an interior node represents the union of the intervals of its children. (We call these intervals standard intervals.) With every node v is associated a y-sorted list S_y(v) of the points with x-coordinate within I(v). To determine which points are contained in a hyperrectangle q, partition the x-range of q into standard intervals. More specifically, interval I(v) is part of the partition if the x-range of q contains I(v) but not I(p(v)), where p(v) is the parent of node v. Then, for every interval I(v) in the partition, decide by a binary search which points in S_y(v) lie within the y-range of q.
We can thus decompose a two-dimensional range-searching problem
into a collection of one-dimensional range-searching problems.
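The sequential two-dimensional scheme just described can be sketched as follows. This code is our own illustration, not the parallel algorithm: it builds a static tree over the points sorted by x, each node keeping a y-sorted list, and a query visits the O(log n) canonical (standard-interval) nodes and binary-searches each y-list. For brevity it reports only y-values.

```python
import bisect

class Node:
    """Range-tree node over a nonempty list of points sorted by x."""
    def __init__(self, pts):
        self.ys = sorted(p[1] for p in pts)       # y-sorted list S_y(v)
        self.xmin, self.xmax = pts[0][0], pts[-1][0]
        if len(pts) > 1:
            mid = len(pts) // 2
            self.left, self.right = Node(pts[:mid]), Node(pts[mid:])
        else:
            self.left = self.right = None         # leaf

def query(node, x1, x2, y1, y2, out):
    """Append the y-values of points inside [x1,x2] x [y1,y2] to out."""
    if node is None or node.xmax < x1 or node.xmin > x2:
        return                                    # disjoint from x-range
    if x1 <= node.xmin and node.xmax <= x2:
        # canonical node: binary search its y-sorted list
        lo = bisect.bisect_left(node.ys, y1)
        hi = bisect.bisect_right(node.ys, y2)
        out.extend(node.ys[lo:hi])
        return
    query(node.left, x1, x2, y1, y2, out)
    query(node.right, x1, x2, y1, y2, out)
```

For instance, with the 16 points (i, i mod 5), i = 0, ..., 15, the query box [3, 12] × [1, 3] reports six y-values.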
We give rst a parallel algorithm for the one-dimensional case. We
then show how we can extend this algorithm to higher dimensions.
The algorithm consists of three parts and the details are as follows.
Part I:
1. Globally sort S into nondecreasing order by x-coordinate. Divide the sorted list into equal-sized sublists, S(i), i = 1, 2, ..., p/2.
Figure 1: The tree T_p for p = 8. T_p has p/2 leaf nodes and p − 1 nodes in total. The given x-range is partitioned into the standard intervals corresponding to the circled nodes. It contains the intervals of leaves 2, 3 and 4, and intersects the interval of leaf 1. We index nodes from left to right, beginning with the leaves.
(We assume that p is an integer power of two.)
2. For each sublist S(i), find l(i), the smallest x-coordinate in the sublist (for sublist S(p/2) find also l(p/2 + 1), the largest x-coordinate in S(p/2)). Broadcast the l-values to all processors.
3. In every processor build a binary search tree T_p on the l-values. Identify each node in T_p by a unique index in the range 1 through p − 1. See Figure 1.
Part II:
1. For each hyperrectangle q and leaf node i, if q's x-range intersects but does not contain I(i), create the node-hyperrectangle pair (i, q).
2. For each leaf node i, determine e(i), the number of node-hyperrectangle pairs with node index i. Compute e = Σ_{i=1}^{p/2} e(i). If e = 0, continue to Part III.
3. Globally sort the node-hyperrectangle pairs by node index.
4. For each leaf node i, compute p(i) = ⌈e(i)/⌈2e/p⌉⌉ and f(i) = Σ_{j=1}^{i−1} p(j). If p(i) > 0, continue as follows.
(a) Copy S(i) to the processors P(f(i) + 1) through P(f(i) + p(i)).
(b) Divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i, j), j = 1, 2, ..., p(i). Move Q(i, j) to the processor P(f(i) + j).
(c) Find k(i, j), the output size of the range-searching problem with input S(i) and Q(i, j). Compute k = Σ_{i=1}^{p/2} Σ_{j=1}^{p(i)} k(i, j). If k = 0, continue to Part III.
5. For each subset Q(i, j), compute p(i, j) = ⌊k(i, j)/(k′/p)⌋, where k′ = max(k, n), and f(i, j) = Σ_{k=1}^{i−1} Σ_{l=1}^{p(k)} p(k, l) + Σ_{l=1}^{j−1} p(i, l). If p(i, j) = 0, solve the range-searching problem with input S(i) and Q(i, j) in the processor P(f(i) + j). Otherwise, continue as follows.
(a) Copy S(i) and Q(i, j) to the processors P(f(i, j) + 1) through P(f(i, j) + p(i, j)).
(b) Divide Q(i, j) into the subsets Q′(i, j, l), l = 1, 2, ..., p(i, j), such that Σ_{(i,q)∈Q′(i,j,l)} k(i, q) is O(k′/p).
(c) Solve the range-searching problem with input S(i) and Q′(i, j, l) in the processor P(f(i, j) + l).
Part III:
1. For each leaf node i, determine d(i), the number of hyperrectangles whose x-range contains the interval I(i). Compute d = Σ_{i=1}^{p/2} d(i). If d = 0, end the execution.
2. For each leaf node i, compute p(i) = ⌈d(i)/⌈2d/p⌉⌉ and f(i) = Σ_{j=1}^{i−1} p(j). Copy S(i) to the processors P(f(i) + 1) through P(f(i) + p(i)).
3. If d > m, then do as follows.
(a) Copy the hyperrectangles in each processor to every other processor.
(b) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i, q).
(c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i, j), j = 1, 2, ..., p(i).
4. If d ≤ m, then do as follows.
(a) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i, q).
(b) Globally sort the pairs by node index.
(c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i, j), j = 1, 2, ..., p(i). Move Q(i, j) to the processor P(f(i) + j).
5. For each pair (i, q) ∈ Q(i, j) and each point p ∈ S(i), create the pair (q, p).
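Abstracting away the data movement, the reporting logic of the three parts can be simulated sequentially. The sketch below is ours: it splits the sorted points into p/2 sublists, and for each query either binary-searches a sublist whose interval the query's x-range only partially overlaps (the Part II case) or copies the whole sublist when the x-range contains its interval (the Part III case). It assumes the number of points is a multiple of p/2.

```python
import bisect

def batched_1d(points, queries, p):
    """Report all (query index, point) pairs for 1-D range queries [a, b]."""
    xs = sorted(points)
    assert len(xs) % (p // 2) == 0, "sketch assumes even sublists"
    size = len(xs) // (p // 2)
    subs = [xs[i * size:(i + 1) * size] for i in range(p // 2)]
    out = []
    for q, (a, b) in enumerate(queries):
        for S in subs:
            lo, hi = S[0], S[-1]                 # sublist interval I(i)
            if b < lo or a > hi:
                continue                          # disjoint
            if a <= lo and hi <= b:               # Part III: contained
                out.extend((q, x) for x in S)
            else:                                 # Part II: binary search
                i = bisect.bisect_left(S, a)
                j = bisect.bisect_right(S, b)
                out.extend((q, x) for x in S[i:j])
    return out
```

For example, with the points 0, ..., 15, p = 8 and the single query [2, 9], the query's x-range contains the interval of the sublist {4, ..., 7} and partially overlaps two others, producing the eight pairs (0, 2), ..., (0, 9).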
Theorem 1 We can solve the one-dimensional range-searching problem in time O(T_s(n, p) + T_s(m, p) + (m log(n/p) + k)/p).
Proof. In Step 1 of Part II, we use binary search to decide which pairs to create. Since each hyperrectangle intersects at most two intervals, we create the corresponding pairs locally. In Step 2, each processor first determines how many of its node-hyperrectangle pairs have node index i, i = 1, 2, ..., p/2. After a total exchange operation, the processor P(i) computes e(i). P(i) then broadcasts e(i) to every processor. Step 4(a) is done by monotone routing followed by segmented broadcasting. In Step 4(b), a segmented prefix sums operation determines the rank of each pair (i, q) among the pairs with node index i. The rank decides to which subset Q(i, j) the pair (i, q) belongs. We then move Q(i, j) to the processor P(f(i) + j) using techniques similar to those used in Step 2 of the algorithm for the data-copying problem (Section 2). In Steps 4(c) and 5(c), we use binary search. In total, Parts I and II take O(T_s(n, p) + T_s(m, p) + (m log(n/p) + k)/p) time.
In Step 1 of Part III, each processor first determines, for each leaf node i, how many of its hyperrectangles contain the interval I(i). This is done as follows. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)) (where p(i) is the parent of node i), increment a counter associated with node i. To compute, for each leaf node i, how many local hyperrectangles contain I(i), sum the counters associated with nodes along the path from i to the root of T_p. Then continue as in Step 2 of Part II. In Step 3(b), each processor that has received a copy of S(i) decides which hyperrectangles contain the interval I(i), and creates the corresponding pairs. In Step 4(a), we create the pairs using a modified version of the algorithm for the data-copying problem (Section 2). Since we assume in Step 4 that d ≤ m, the total time complexity of Step 4 is O(T_s(m, p)). Step 5, finally, takes O(dn/p²) time. Since we assume that n ≥ p², if d > m, the time complexity of Step 5 asymptotically exceeds the time complexities of Steps 1 and 3. The total time for Part III is thus O(T_sb(n/p, p) + T_s(m, p) + dn/p²). □
Giving an algorithm for the two-dimensional case is now easy. It too consists of three parts, where Parts I and II are essentially the same as above. In Part II, we use the batched range-searching algorithm of Edelsbrunner and Overmars [17]. They give a divide-and-conquer algorithm for batched range searching that runs in O((m + n) log^{d−1} n + m log m + k) time and uses O(m + n) space. It is only Part III that deviates significantly from the one-dimensional case. The details of Part III are now as follows.
Part III:
1. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)), create the node-hyperrectangle pair (i, q).
2. For each node i, determine c(i), the number of node-hyperrectangle pairs with node index i. Compute c = Σ_{i=1}^{p−1} c(i). If c = 0, end the execution.
3. For each point p and each node i such that p's x-coordinate is contained in I(i) and c(i) > 0, create the node-point pair (i, p).
4. Solve the one-dimensional range-searching problem with input consisting of the node-point pairs and the node-hyperrectangle pairs. That is, for each node-hyperrectangle pair (i, q), find the node-point pairs (i, p) such that p is contained in q's y-range.
Theorem 2 We can solve the two-dimensional range-searching problem in time O(T_s(n log p, p) + T_s(m log p, p) + (m log p log(n/p) + k)/p).
Proof. Part I is exactly as in the one-dimensional case. Part II is the same as in the one-dimensional case, except Steps 4(c) and 5(c), which now use the batched range-searching algorithm of Edelsbrunner and Overmars. (In Step 4(c), we modify this algorithm to compute just how many points are contained in each hyperrectangle.) Parts I and II together take O(T_s(n, p) + T_s(m, p) + ((m + n) log(n/p) + k)/p) time.
In Step 1 of Part III, it takes O(log p) time for each hyperrectangle q to find all nodes i in T_p such that q's x-range contains I(i) but not I(p(i)). Step 2 is similar to Step 2 of Part II. Step 3 takes O(n log p / p) time. In Step 4, we apply our one-dimensional range-searching algorithm to the node-hyperrectangle and node-point pairs created in Steps 1 and 3. In our one-dimensional algorithm we assume that the input is evenly distributed over the processors, and that the number of points and the number of hyperrectangles are both greater than or equal to p². These assumptions are not necessarily satisfied by the node-hyperrectangle and node-point pairs. We can easily remedy this by adding dummy input as follows. Each processor counts how many node-point pairs it stores. By a reduction operation, we then find n_max, the maximum number of such pairs contained in any processor. Finally, each processor adds dummy pairs until it has exactly max(n_max, p) pairs. The same approach is used for the node-hyperrectangle pairs. Since no processor stores more than O(n log p / p) node-point pairs and O(m log p / p) node-hyperrectangle pairs, it takes O(T_r(p, p) + (m + n) log p / p) time to add the dummy input. □
Generalizing the above approach to higher dimensions is straight-
forward. We can easily derive the following result.
Theorem 3 We can solve the d-dimensional range-searching problem in time O(T_s(n log^{d−1} p, p) + T_s(m log^{d−1} p, p) + ((m + n) log^{d−1}(n/p) + m log^{d−1} p log(n/p) + k)/p).
4 Average-Case Efficient Algorithms for Range Searching
In this section, we present parallel algorithms for range searching that are based on the cell method [18]. In its simplest version, this method is as follows. First, find the smallest hyperrectangle B that contains the set S. Divide B into equal-sized hyperrectangular cells, and record for each cell which points it contains. We call the resulting data structure a cell directory. To decide which points a hyperrectangle q contains, do as follows. For each cell intersected by q, access the corresponding entry in the cell directory and test, for each point contained in the cell, if it is included within q.
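In code, the simple version just described might look as follows. This is our own two-dimensional sketch with a hypothetical `cells_per_axis` parameter; it stores the directory as a hash map rather than a multidimensional array, and assumes the points span a nondegenerate bounding box.

```python
from collections import defaultdict

def build_cell_directory(points, cells_per_axis):
    """Bucket each point into a grid cell over the bounding box B."""
    xmin = min(x for x, _ in points); xmax = max(x for x, _ in points)
    ymin = min(y for _, y in points); ymax = max(y for _, y in points)
    k = cells_per_axis
    def cell_of(x, y):
        # clamp so boundary points and query corners map into the grid
        cx = min(max(int((x - xmin) / (xmax - xmin) * k), 0), k - 1)
        cy = min(max(int((y - ymin) / (ymax - ymin) * k), 0), k - 1)
        return cx, cy
    directory = defaultdict(list)
    for pt in points:
        directory[cell_of(*pt)].append(pt)
    return directory, cell_of

def range_query(directory, cell_of, rect):
    """Report the points of the directory that lie inside rect."""
    x1, y1, x2, y2 = rect
    (cx1, cy1), (cx2, cy2) = cell_of(x1, y1), cell_of(x2, y2)
    hits = []
    for cx in range(cx1, cx2 + 1):            # cells intersected by rect
        for cy in range(cy1, cy2 + 1):
            for x, y in directory.get((cx, cy), []):
                if x1 <= x <= x2 and y1 <= y <= y2:  # inclusion test
                    hits.append((x, y))
    return hits
```

On a 4 × 4 integer grid of points, the query box [1, 2] × [1, 2] touches four cells and reports the four enclosed points.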
It is common to divide B into O(n) cells, in which case we can
build the cell directory (e.g., a multidimensional array of pointers)
in O(n) time. The total cost of solving the batched range-searching
problem is then O(m + n + s + t) time, where s and t denote the
total number of cell accesses and point inclusion tests, respectively.
(The time complexity increases linearly with the dimension d of the
problem. In this paper, we assume that d is a small constant.)
As already mentioned, the worst-case performance of this method is poor. We can easily create an input such that s + t is Ω(mn), although the output size k is zero. However, in many applications the cell method may outperform more sophisticated methods. For example, in an experimental evaluation of methods for range searching, we [2] found it to be much faster than the range-tree method. This is due to its relative simplicity (small constants of proportionality). Moreover, one can show that if the points are evenly distributed in space and the shape of the query hyperrectangles is similar to the shape of the cells, then s + t is O(k). Finding an efficient parallelization of the cell method as described above is therefore important.
Algorithm I: Our first algorithm is based on the assumption that storing a copy of S and Q in each processor is possible. The first step of the algorithm achieves this by multinode broadcasting. Then, each processor executes the sequential cell method. To load-balance the computations, we divide Q into subsets for which the total number of cell accesses and point inclusion tests is about the same.
1. Copy the points and hyperrectangles in each processor to every
other processor.
2. Locally build a cell directory for S. That is, compute B, the smallest hyperrectangle containing S. Divide B into O(n) equal-sized hyperrectangular cells, and record, for each cell, which points it contains.
3. For each hyperrectangle q, find s(q), the number of cells it intersects. Compute s = Σ_{q∈Q} s(q).
4. For each hyperrectangle q, find t(q), the total number of points contained in the cells intersected by q. Compute t = Σ_{q∈Q} t(q).
5. For each hyperrectangle q, let r(q) = s(q) + t(q). Divide Q into the subsets Q(j), j = 1, 2, ..., p, such that Σ_{q∈Q(j)} r(q) = O(max(n, (s + t)/p)).
10
2 3 4 4
3
4
5
5
16
0
0
7
9
7
10
12
8
12
0
0
0 0 0 0
0
Figure 2: Example with sixteen points and one hyperrectangle. The number beside each grid vertex is the number of points dominated by the vertex. The shaded region is the block of cells intersected by the hyperrectangle. The number of point inclusion tests is 12 + 0 − 4 − 0 = 8.
6. Solve the range-searching problem with input S and Q(j) in
the processor P(j).
Theorem 4 Algorithm I solves the batched range-searching problem in time O(T_mb((m + n)/p, p) + m + n + (s + t)/p), where m, n ≥ p.
Proof. Steps 1 through 3 take O(T_mb((m + n)/p, p) + m + n) time. In Step 4, we first compute for each grid vertex v how many points in S it dominates, that is, how many points lie in v's southwest quadrant. This can be done in O(n) time. Let d(v) denote the number of points dominated by the grid vertex v. Then, t(q) = d(v_NE) + d(v_SW) − d(v_NW) − d(v_SE), where v_NE, v_SW, v_NW and v_SE denote the northeast, southwest, northwest and southeast vertices, respectively, of the block of cells intersected by the hyperrectangle q. See Figure 2. Thus, Step 4 takes O(n + m) time in total. In Step 5, the partitioning of Q into subsets can easily be done in O(m) time. Finally, Step 6 takes O(max(n, (s + t)/p)) time. □
If we can have a copy of S and Q on each processor, and if s + t is large compared with m + n, then this algorithm can be quite efficient. However, in many applications storing a copy of the input on each processor would be impossible. This suggests that we should investigate alternative parallelizations of the cell method.
Algorithm II: Briefly, this algorithm is as follows. For each nonempty cell, we create a list of the points it contains. For each intersected cell, we create a list of (copies of) the hyperrectangles that intersect it. For each cell that is both nonempty and intersected, we then combine the two lists, that is, we do the corresponding point inclusion tests.
1. Compute B, the smallest hyperrectangle containing S. Divide
B into O(n) equal-sized hyperrectangular cells. Identify each
cell by a unique index.
2. Decide for each point p in which cell i it is contained. Create the cell-point pair (i, p).
3. For each hyperrectangle q and cell i such that q intersects cell i, create the cell-hyperrectangle pair (i, q). Compute s, the total number of cell-hyperrectangle pairs.
4. Globally sort the cell-point and cell-hyperrectangle pairs with
respect to cell indexes. When comparing a cell-point pair and
cell-hyperrectangle pair with the same index, let the latter pair
win.
If cell i is both nonempty and intersected, there is now a list
of cell-point pairs with index i (denoted pl(i)), followed by a
list of cell-hyperrectangle pairs with index i (denoted hl(i)).
It remains to test each point in pl(i) for inclusion within each
hyperrectangle in hl(i).
5. Let n(i) and m(i) denote the lengths of pl(i) and hl(i), respectively, and let t(i) = n(i)m(i). Let A be the set of nonempty and intersected cells. For each cell i ∈ A, compute n(i), m(i) and t(i). Compute t = Σ_{i∈A} t(i).
6. Let I = {i ∈ A : t(i) < ⌈t′/p⌉}, where t′ = max(t, n + s).
(a) For each cell i ∈ I, gather pl(i) and hl(i) into the lowest-indexed processor that contains elements of pl(i).
(b) Let I(j) = {i ∈ I : pl(i) and hl(i) are in the processor P(j)}. For j = 1, 2, ..., p, compute t̂(j) = Σ_{i∈I(j)} t(i), p(j) = ⌊t̂(j)/⌈t′/p⌉⌋ and f(j) = Σ_{k=1}^{j−1} p(k).
(c) For j = 1, 2, ..., p, if p(j) = 0, then do the point inclusion tests corresponding to I(j) in the processor P(j). Otherwise, copy pl(i) and hl(i), i ∈ I(j), to the processors P(f(j) + 1) through P(f(j) + p(j)). Decompose I(j) into subsets I(j, k), k = 1, 2, ..., p(j), such that Σ_{i∈I(j,k)} t(i) is O(t′/p). Do the point inclusion tests corresponding to I(j, k) in the processor P(f(j) + k).
7. Let E = A \ I. For each cell i ∈ E, do as follows.
(a) Compute p(i) = ⌊t(i)/⌈t′/p⌉⌋ and f(i) = Σ_{k∈E, k<i} p(k).
(b) Divide the longest list of pl(i) and hl(i) into equal-sized sublists, ll(i, j), j = 1, 2, ..., p(i). Move ll(i, j) to the processor P(f(i) + j).
(c) Create a copy of the shortest list of pl(i) and hl(i) in each processor P(f(i) + 1) through P(f(i) + p(i)). Do the corresponding point inclusion tests.
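Sequentially, the pairing-and-sorting structure of Algorithm II can be sketched like this. The code is our own, on an integer grid with hypothetical cell side `cs`; the sort key mirrors Step 4's rule that, within a cell, cell-point pairs precede cell-hyperrectangle pairs, so each rectangle sees the complete point list of its cell.

```python
from itertools import product

def batched_cell_search(points, rects, cs):
    """Report (rect index, point) pairs: build (cell, point) and
    (cell, rectangle) pairs, sort by cell (points first), then combine
    the two lists of every cell that is both nonempty and intersected."""
    pairs = []                                    # (cell, kind, payload)
    for x, y in points:
        pairs.append(((x // cs, y // cs), 0, (x, y)))
    for qi, (x1, y1, x2, y2) in enumerate(rects):
        for c in product(range(x1 // cs, x2 // cs + 1),
                         range(y1 // cs, y2 // cs + 1)):
            pairs.append((c, 1, qi))              # rect qi intersects cell c
    pairs.sort(key=lambda t: (t[0], t[1]))        # kind 0 (points) first
    out, pl, cur = [], [], None
    for cell, kind, payload in pairs:
        if cell != cur:
            cur, pl = cell, []                    # start a new cell
        if kind == 0:
            pl.append(payload)                    # extend pl(i)
        else:                                     # point inclusion tests
            x1, y1, x2, y2 = rects[payload]
            out.extend((payload, p) for p in pl
                       if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
    return out
```

With points (1, 1), (5, 5), (9, 9), one query box [0, 6] × [0, 6] and cell side 4, the box intersects four cells but only two of them are nonempty, yielding the pairs (0, (1, 1)) and (0, (5, 5)).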
Theorem 5 Algorithm II solves the batched range-searching problem in O(T_s(n + s, p) + T_sb(m/p, p) + T_sb(t/p, p) + (m + n + s + t)/p) time, where m, n ≥ p².
Proof. Steps 1 and 2 take together O(Tr(n;p)+n=p) time. In Step
3, we can create the cell-hyperrectangle pairs by slightly modifying
the algorithm for the data-copying problem (Section 2). By Lemma 1,
this takes O(Tmr(m=p;p;(m+ s)=p) + (m + s)=p) time. The global
sort in Step 4 takes O(Ts(n + s;p)) time. In Step 5, we compute
n(i) and m(i) by segmented reduction in O(Tr(n + s;p)) time. We
can identify all cells that are both nonempty and intersected in time
O((n+s)=p+Tmr(1;p;1)). We then compute t(i) and t in O(Tr(n+
s;p)) time. Thus, in total Step 5 takes O(Tr(n + s;p)) time.
Step 6(a) takes O(Tmr((n+s)=p;p;t0=p)) time. This follows from
the fact that, if cell i 2 I, then m(i) + n(i) dt0=pe. Thus, no
processor receives more than dt0=pe pairs. In Step 6(b), we compute
and broadcast ^t(j) to every processor in O((n + s)=p + Tmb(1;p))
time. In Step 6(c), to decompose I(j) into subsets of cost O(t0=p) can
easily be done in O(t0=p) time. Step 6(c) takes O(Tmr(t0=p;p;t0=p)+
Tsb(t0=p;p)+ t0=p) time.
Step 7(a) is similar to Step 6(b). To describe Step 7(b), we assume that list pl(i) is longer than hl(i). We divide pl(i) into p(i) sublists such that p(i)⌈n(i)/p(i)⌉ − n(i) sublists have length ⌊n(i)/p(i)⌋, whereas the remaining sublists have length ⌈n(i)/p(i)⌉. We use a segmented prefix sums computation to decide, for each list element, to which sublist it belongs. To move each sublist to its selected processor, we use the same techniques as in Step 2 of the algorithm for the data-copying problem (Section 2). Since no sublist has more than O(t′/p) elements, the total time for Step 7(b) is O(Tp(n+s, p) + Tmb(1, p) + Tmr((n+s)/p, p, t′/p) + Tmr((n+s)/p, p, (n+s)/p) + Tsb((n+s)/p, p)). In Step 7(c), we copy the shortest list to the selected processors by monotone routing followed by segmented broadcasting. The length of a shortest list cannot exceed √t′. Thus, the total time for Step 7(c) is O(Tmr((n+s)/p, p, √t′) + Tsb(√t′, p) + t′/p). Since we assume that n ≥ p², it follows that √t′ ≤ t′/p. □
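
The sublist lengths used in Step 7(b) can be computed directly from n(i) and p(i). The sketch below (a hypothetical helper, not from the paper) produces p(i)⌈n(i)/p(i)⌉ − n(i) sublists of length ⌊n(i)/p(i)⌋ and the rest of length ⌈n(i)/p(i)⌉, so the lengths sum to n(i) and differ by at most one:

```python
import math

def sublist_lengths(n, p):
    """Lengths of the p sublists from Step 7(b): the first
    p*ceil(n/p) - n sublists get floor(n/p) elements, and the
    remaining sublists get ceil(n/p) elements."""
    long_len = math.ceil(n / p)
    num_short = p * long_len - n
    return [n // p] * num_short + [long_len] * (p - num_short)
```

For example, sublist_lengths(10, 4) yields [2, 2, 3, 3]; the segmented prefix sums over the list elements then assign each element to its sublist.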
References
[1] Z-H. Zhong. Finite Element Procedures for Contact-Impact Problems. Oxford University Press, 1993.
[2] P-O. Fjallstrom, J. Petersson, L. Nilsson, and Z-H. Zhong. Evaluation of range searching methods for contact searching in mechanical engineering. To appear in International Journal of Computational Geometry & Applications.
[3] A. Aggarwal, B. Chazelle, L. Guibas, and C. O'Dunlaing. Parallel computational geometry. Algorithmica, 3:293-327, 1988.
[4] M.J. Atallah. Parallel techniques for computational geometry. Proc. IEEE, 80(9):1435-1448, 1992.
[5] S.G. Akl and K.A. Lyons. Parallel Computational Geometry. Prentice-Hall, 1993.
[6] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. 9th Annual ACM Symposium on Computational Geometry, pages 298-307, 1993.
[7] O. Devillers and A. Fabri. Scalable algorithms for bichromatic line segment intersection problems on coarse grained multicomputers. In Algorithms and Data Structures. Third Workshop, WADS'93, pages 277-288, 1993.
[8] X. Deng. A convex hull algorithm on coarse grained multiprocessors. In Proc. 5th Annual International Symposium on Algorithms and Computation (ISAAC 94), pages 634-642, 1994.
[9] F. Dehne, C. Kenyon, and A. Fabri. Scalable and architecture independent parallel geometric algorithms with high probability optimal time. In Proc. 6th IEEE Symposium on Parallel and Distributed Processing (SPDP), pages 586-593, 1994.
[10] F. Dehne, X. Deng, P. Dymond, A. Fabri, and A.A. Khokhar. A randomized parallel 3D convex hull algorithm for coarse grained multicomputers. In Proc. 7th ACM Symposium on Parallel Algorithms and Architectures, pages 27-33, 1995.
[11] I. Al-furaih, S. Aluru, S. Goil, and S. Ranka. Parallel construction of multidimensional binary search trees. In Proc. International Conference on Supercomputing (ICS'96), 1996.
[12] P-O. Fjallstrom. Parallel algorithms for geometric problems on coarse grained multicomputers. Technical Report LiTH-IDA-R-96-38, Dep. of Computer and Information Science, Linkoping University, 1996.
[13] P-O. Fjallstrom. Parallel interval-cover algorithms for coarse grained multicomputers. Technical Report LiTH-IDA-R-96-39, Dep. of Computer and Information Science, Linkoping University, 1996.
[14] A. Ferreira, C. Kenyon, A. Rau-Chaplin, and S. Ubeda. d-Dimensional range search on multicomputers. Technical Report 96-23, Laboratoire de l'Informatique du Parallelisme, Ecole Normale Superieure de Lyon, 1996.
[15] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., 1994.
[16] J.L. Bentley. Decomposable searching problems. Information Processing Letters, 8(5):244-251, 1979.
[17] H. Edelsbrunner and M.H. Overmars. Batched dynamic solutions to decomposable searching problems. Journal of Algorithms, 6:515-542, 1985.
[18] J.L. Bentley and J.H. Friedman. Data structures for range searching. Computing Surveys, 11:397-409, 1979.

More Related Content

PPT
Chapter 4 pc
Hanif Durad
 
PPT
All-Reduce and Prefix-Sum Operations
Syed Zaid Irshad
 
PPT
Collective Communications in MPI
Hanif Durad
 
ODP
Chapter - 04 Basic Communication Operation
Nifras Ismail
 
PPTX
Communication costs in parallel machines
Syed Zaid Irshad
 
PPT
Chap4 slides
Jothish DL
 
PPT
Chapter 5 pc
Hanif Durad
 
PDF
Colfax-Winograd-Summary _final (1)
Sangamesh Ragate
 
Chapter 4 pc
Hanif Durad
 
All-Reduce and Prefix-Sum Operations
Syed Zaid Irshad
 
Collective Communications in MPI
Hanif Durad
 
Chapter - 04 Basic Communication Operation
Nifras Ismail
 
Communication costs in parallel machines
Syed Zaid Irshad
 
Chap4 slides
Jothish DL
 
Chapter 5 pc
Hanif Durad
 
Colfax-Winograd-Summary _final (1)
Sangamesh Ragate
 

What's hot (20)

PPT
Chapter 3 pc
Hanif Durad
 
PPTX
Broadcast in Hypercube
Sujith Jay Nair
 
PPTX
Lecturre 07 - Chapter 05 - Basic Communications Operations
National College of Business Administration & Economics ( NCBA&E)
 
PPT
Chapter 6 pc
Hanif Durad
 
PPT
FEC & File Multicast
Yoss Cohen
 
PDF
00b7d51ed81834e4d7000000
Rahul Jain
 
PDF
Performance comparision 1307.4129
Pratik Joshi
 
PPTX
Nsl seminar(2)
Thomhert Siadari
 
DOC
The Most Important Algorithms
wensheng wei
 
PDF
Available network bandwidth schema to improve performance in tcp protocols
IJCNCJournal
 
PPT
Network coding
Lishi He
 
PDF
Report on High Performance Computing
Prateek Sarangi
 
PDF
A novel technique for speech encryption based on k-means clustering and quant...
journalBEEI
 
PPT
Tutorial on Parallel Computing and Message Passing Model - C4
Marcirio Chaves
 
PDF
FINAL PROJECT REPORT
Dhrumil Shah
 
PPT
Distributed Hash Table
ravindra.devagiri
 
PDF
The reasons why 64-bit programs require more stack memory
PVS-Studio
 
PDF
Model checking
Richard Ashworth
 
PDF
A046020112
IJERA Editor
 
PDF
C0431320
IOSR Journals
 
Chapter 3 pc
Hanif Durad
 
Broadcast in Hypercube
Sujith Jay Nair
 
Lecturre 07 - Chapter 05 - Basic Communications Operations
National College of Business Administration & Economics ( NCBA&E)
 
Chapter 6 pc
Hanif Durad
 
FEC & File Multicast
Yoss Cohen
 
00b7d51ed81834e4d7000000
Rahul Jain
 
Performance comparision 1307.4129
Pratik Joshi
 
Nsl seminar(2)
Thomhert Siadari
 
The Most Important Algorithms
wensheng wei
 
Available network bandwidth schema to improve performance in tcp protocols
IJCNCJournal
 
Network coding
Lishi He
 
Report on High Performance Computing
Prateek Sarangi
 
A novel technique for speech encryption based on k-means clustering and quant...
journalBEEI
 
Tutorial on Parallel Computing and Message Passing Model - C4
Marcirio Chaves
 
FINAL PROJECT REPORT
Dhrumil Shah
 
Distributed Hash Table
ravindra.devagiri
 
The reasons why 64-bit programs require more stack memory
PVS-Studio
 
Model checking
Richard Ashworth
 
A046020112
IJERA Editor
 
C0431320
IOSR Journals
 
Ad

Viewers also liked (16)

PPTX
Indian conquistadors
Roccaheather
 
PPTX
Indian conquistadors
Roccaheather
 
PDF
Survey and Evaluation of Methods for Tissue Classification
perfj
 
PDF
Enfoque basado en procesos
ZELEY VELEZ
 
PPTX
The jesuit relations
Roccaheather
 
PDF
cis98010
perfj
 
PPTX
Italy slides for history
Roccaheather
 
DOCX
3 strategie di web marketing per acquisire clienti online
Enrico Venti
 
PPTX
Midterm history
Roccaheather
 
PPTX
Harness The Full Potential Of Mobile Through Paid Search
dmothes
 
PDF
cis98006
perfj
 
PDF
Assessing the compactness and isolation of individual clusters
perfj
 
PPTX
Earthquakes
Roccaheather
 
PDF
cis97007
perfj
 
PDF
Data Backup, Archiving &amp; Disaster Recovery October 2011
zaheer756
 
PPTX
Dead star
Maria Vanessa Tabuada
 
Indian conquistadors
Roccaheather
 
Indian conquistadors
Roccaheather
 
Survey and Evaluation of Methods for Tissue Classification
perfj
 
Enfoque basado en procesos
ZELEY VELEZ
 
The jesuit relations
Roccaheather
 
cis98010
perfj
 
Italy slides for history
Roccaheather
 
3 strategie di web marketing per acquisire clienti online
Enrico Venti
 
Midterm history
Roccaheather
 
Harness The Full Potential Of Mobile Through Paid Search
dmothes
 
cis98006
perfj
 
Assessing the compactness and isolation of individual clusters
perfj
 
Earthquakes
Roccaheather
 
cis97007
perfj
 
Data Backup, Archiving &amp; Disaster Recovery October 2011
zaheer756
 
Ad

Similar to cis97003 (20)

PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
PDF
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Derryck Lamptey, MPhil, CISSP
 
PDF
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
IRJET Journal
 
PDF
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
RSIS International
 
PDF
Problems in Task Scheduling in Multiprocessor System
ijtsrd
 
PDF
Gk3611601162
IJERA Editor
 
PDF
Parallelization of Graceful Labeling Using Open MP
IJSRED
 
PDF
Solution(1)
Gopi Saiteja
 
PDF
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
ijtsrd
 
PDF
Rapport_Cemracs2012
Jussara F.M.
 
PDF
Ecc cipher processor based on knapsack algorithm
Alexander Decker
 
PDF
Bh36352357
IJERA Editor
 
PPTX
Complier design
shreeuva
 
PDF
A comparison of efficient algorithms for scheduling parallel data redistribution
IJCNCJournal
 
PDF
Implementing Map Reduce Based Edmonds-Karp Algorithm to Determine Maximum Flo...
paperpublications3
 
PDF
International Journal of Soft Computing, Mathematics and Control (IJSCMC)
ijscmcj1
 
PPT
Query optimization for_sensor_networks
Harshavardhan Achrekar
 
PPT
Computing with Directed Labeled Graphs
Marko Rodriguez
 
PDF
A new RSA public key encryption scheme with chaotic maps
IJECEIAES
 
PDF
The Quality of the New Generator Sequence Improvent to Spread the Color Syste...
TELKOMNIKA JOURNAL
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Derryck Lamptey, MPhil, CISSP
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
IRJET Journal
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
RSIS International
 
Problems in Task Scheduling in Multiprocessor System
ijtsrd
 
Gk3611601162
IJERA Editor
 
Parallelization of Graceful Labeling Using Open MP
IJSRED
 
Solution(1)
Gopi Saiteja
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
ijtsrd
 
Rapport_Cemracs2012
Jussara F.M.
 
Ecc cipher processor based on knapsack algorithm
Alexander Decker
 
Bh36352357
IJERA Editor
 
Complier design
shreeuva
 
A comparison of efficient algorithms for scheduling parallel data redistribution
IJCNCJournal
 
Implementing Map Reduce Based Edmonds-Karp Algorithm to Determine Maximum Flo...
paperpublications3
 
International Journal of Soft Computing, Mathematics and Control (IJSCMC)
ijscmcj1
 
Query optimization for_sensor_networks
Harshavardhan Achrekar
 
Computing with Directed Labeled Graphs
Marko Rodriguez
 
A new RSA public key encryption scheme with chaotic maps
IJECEIAES
 
The Quality of the New Generator Sequence Improvent to Spread the Color Syste...
TELKOMNIKA JOURNAL
 

cis97003

  • 1. Linkoping Electronic Articles in Computer and Information Science Vol. 2(1997): nr 3 This work has been submitted for publication elsewhere. Copyright may then be transferred, and the present version of the article may be superseded by a revised one. The WWW page at the URL stated below will contain up-to-date information about the current version and copyright status of this article. Additional copyright information is found on the next page of this document. Linkoping University Electronic Press Linkoping, Sweden http://www.ep.liu.se/ea/cis/1997/003/ Parallel Algorithms for Batched Range Searching on Coarse-Grained Multicomputers Per-Olof Fjallstrom Department of Computer and Information Science Linkoping University Linkoping, Sweden
  • 2. Published on April 1, 1997 by Linkoping University Electronic Press 581 83 Linkoping, Sweden Linkoping Electronic Articles in Computer and Information Science ISSN 1401-9841 Series editor: Erik Sandewall c 1997 Per-Olof Fjallstrom Typeset by the author using LaTEX Formatted using etendu style Recommended citation: <Author>. <Title>. Linkoping Electronic Articles in Computer and Information Science, Vol. 2(1997): nr 3. http://www.ep.liu.se/ea/cis/1997/003/. April 1, 1997. This URL will also contain a link to the author's home page. The publishers will keep this article on-line on the Internet (or its possible replacement network in the future) for a period of 25 years from the date of publication, barring exceptional circumstances as described separately. The on-line availability of the article implies a permanent permission for anyone to read the article on-line, and to print out single copies of it for personal use. This permission can not be revoked by subsequent transfers of copyright. All other uses of the article, including for making copies for classroom use, are conditional on the consent of the copyright owner. The publication of the article on the date stated above included also the production of a limited number of copies on paper, which were archived in Swedish university libraries like all other written works published in Sweden. The publisher has taken technical and administrative measures to assure that the on-line version of the article will be permanently accessible using the URL stated above, unchanged, and permanently equal to the archived printed copies at least until the expiration of the publication period. 
For additional information about the Linkoping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/ or by conventional mail to the address stated above.
  • 3. Abstract We de ne the batched range-searching problem as follows: given a set S of n points and a set Q of m hyperrectangles, report for each hyperrectangle which points it contains. This problem has applications in, for example, computer-aided design and engi- neering. We present several parallel algorithms for this problem on coarse-grained multicomputers. Our algorithms are based on well-known average- and worst-case e cient sequential algo- rithms. One of our algorithms solves the d-dimensional batched range-searching problem in O(Ts(nlogd 1 p;p)+Ts(mlogd 1 p;p)+ ((m + n)logd 1(n=p) + mlogd 1p log(n=p) + k)=p) time on a p- processor coarse-grained multicomputer. (Ts(n;p) denotes the time globally to sort n numbers on a p-processor multicomputer, and k is the total number of reported points.) Keywords Parallel algorithms, coarse-grained multicomputers, range searching. The work presented here is funded by CENIIT (the Center for Industrial Information Technology) at Linkoping University.
  • 4. 1 1 Introduction In many applications, such as geographic information systems, com- puter-aided design and engineering, statistics, etc., we need to answer the following range-searching query: given a set S of n points, which points lie within a given hyperrectangle? (A hyperrectangle is the Cartesian product of intervals on distinct coordinate axes.) Usually, we need to answer many such queries for the same set of points. In some situations, we know the set of queries in advance. That is, we want to solve the following batched range-searching problem: given a set S of n points and a set Q of m hyperrectangles, report for each hyperrectangle which points it contains. For example, this is an important subproblem in computer simulation of deformation processes, such as vehicle collisions and mechanical forming processes. In such simulations nding all contacts between components of nite- element models of physical objectsis necessary. This can be simpli ed by approximating surface segments with hyperrectangles, and then determining which vertices these hyperrectangles contain 1, 2]. In this paper, we present parallel algorithms for batched range searching on coarse-grained multicomputers. A coarse-grained mul- ticomputer consists of several processors connected by an intercon- nection network. Each processor is fairly powerful, i.e., it delivers workstation-class performance. Since o -the-shelf hardware can be used, coarse-grained multicomputers are relatively inexpensive. Most commercially available parallel computers are of this type. Most of the research on parallel algorithms for geometric problems has focused on ne-grain parallel models of computation 3, 4, 5]. It is only during the last couple of years that researchers have designed parallel geometric algorithms for coarse-grained multicomputers 6, 7, 8, 9, 10, 11, 12, 13, 14]. In this model of computation we can assume that the size of each local memory is large. 
For example, it is common to assume that the size of each local memory is larger than the number of processors. This property allows the algorithm designer to balance communication latency with local computation time. Our parallel algorithms for batched range searching are based on well-known worst- and average-case e cient sequential algorithms. One of our algorithms is based the range-tree method, and solves the d-dimensional batched range-searching problem in O(Ts(nlogd 1 p;p)+ Ts(mlogd 1 p;p)+((m+n)logd 1 (n=p)+mlogd 1 plog(n=p)+k)=p) time on a p-processor coarse-grained multicomputer. (Ts(n;p) de- notes the time globally to sort n numbers on a p-processor multi- computer, and k is the total number of reported points.) We also give algorithms based on the cell method. This method has poor worst-case performance, but since it can be very e cient in practice, we believe that developing parallel algorithms based on this approach is important. Other researchers have developed parallel algorithms for range
  • 5. 2 searching on coarse-grained multicomputers. Devillers and Fabri 7] give an algorithm for the one-dimensional case. Recently, Fer- reira et al 14] present algorithms for the d-dimensional case. They construct a distributed range tree in time O(s=p + Ts(s;p)), where s = nlogd 1 n. They can then answer a set of m = O(n) range queries in time ((slogn + k)=p+ Ts(s;p)). We organize the rest of the paper as follows. In Section 2, we give additional information about coarse-grained multicomputers, and de- scribe some basic operations used by our algorithms. In Sections 3 and 4, we present parallel range-searching algorithms based on the range-tree and cell methods, respectively. 2 Model of Computation Coarse-grained multicomputers consist of a set of processors con- nected through an interconnection network. The number of proces- sors usually varies between 16 and 256. The memory is physically distributed over the processors, and interaction between processors is through message passing. Each processor can execute a di erent program independent of the other processors. However, it is com- mon to let each processor execute the same program asynchronously. That is, except a few global communication steps, processors execute the same program independently of each other. Common intercon- nection networks are 2D meshes (Paragon XP/S), 3D meshes (Cray T3E), hypercubes (nCUBE 2), and fat trees (CM-5). Our algorithms use a few basic and extensively studied communi- cation operations. We next describe these operations, and give their time complexities for a square 2D mesh with p processors, which are assumed to be indexed from 1 through p. For a detailed description and analysis of the operations, see Kumar et al 15]. Monotone routing: Each processor P(i) sends at most one m-word message. The destination address, d(i), of the message sent by P(i) is such that if both P(i) and P(i0), i < i0, send messages, then d(i) d(i0). 
The time complexity, Tmr(m;p;rmax), is O((rmax + m)pp), where rmax is the maximum number of words received by any pro- cessor. Segmented broadcast: Processors with indexes i1 < i2 ::: < iq, are selected; each processor P(ij) sends the same m-word message to all processors P(ij + 1) through P(ij+1 1). The time complexity, Tsb(m;p), is O(mpp). Multinode broadcast: Every processor sends the same m-word mes- sage to every other processor. The time complexity, Tmb(m;p), is O(mp). Total exchange: Every processor sends a distinct m-word message to every other processor. The time complexity, Tx(m;p), is O(mppp). Pre x sums and reduction: Let a1;a2;:::;an be a list of numbers
  • 6. 3 evenly distributed over the processors and let be an associative operator. The pre x sums operation computes si = a1 ai, and stores si in the same processor as ai. The time complexity, Tp(n;p), is O(n=p+pp). The reduction operation computes s = a1 an, and stores s in each processor. The time complexity, Tr(n;p), is O(n=p + pp). In the segmented versions of these operations, we apply them to sublists of a1;a2;:::;an. The time complexity is the same as for the ordinary operations. Global sort: Given a list a1;a2;:::;an of numbers evenly distributed over the processors, the global sort operations sorts the list, and re- turns it evenly distributed over the processors. The time complexity, Ts(n;p), is O(n(log(n=p)+ pp)=p). We end this section by showing how some of the above operations can be used to solve a simple data-copying problem. This is an im- portant subproblem in the algorithms to be presented in this paper. The data-copying problem is as follows. A set R of n equal-sized data records is evenly distributed over the processors of a p-processor mul- ticomputer. With each record r is associated a nonnegative integer n(r). The task is to create n(r) additional copies of each record r such that the work of creating the records is uniformly distributed over the processors. We do this as follows. 1. Let R0 = fr 2 R : n(r) > 0g, and let w = P r2R0 n(r). Decompose R0 into subsets R0(i), i = 1;2;:::;p, such thatPr2R0 (i) n(r) = bw=pc for i pdw=pe w, and Pr2R0 (i) n(r) = dw=pe otherwise. 2. For i = 1;2;:::;p, copy R0(i) to the processor P(i). Create the copies of the records in R0(i) in the processor P(i). Lemma 1 We can solve the data-copying problem in O(Tmr(n=p;p; (n + w)=p)+ (n + w)=p), time where w is the total number of copies and n p2. Proof. Regard R0 as an ordered set fr1;r2;:::;rmg. We begin Step 1 by computing the pre x sums s1;s2;:::;sm, where sk = Pk j=1 n(rj). 
To simplify the description of how to decompose R0 into subsets, we assume that w is an integer multiple of p. Extending our descrip- tion to the general case is easy. Let lk = bsk=(w=p)c. For each record rk, if lk 1 = lk, then we assign rk to the subset R0(lk 1 + 1). Otherwise, let dk = lk lk 1. Next, we create new records rk;j, j = 0;1;:::;dk, such that we (1) assign rk;0 to the subset R0(lk 1 +1) and set n(rk;0) = (lk 1 + 1)w=p sk 1, (2) assign rk;dk to the sub- set R0(lk + 1) and set n(rk;dk ) = sk lkw=p, and (3) assign rk;j, 0 < j < dk, to subset R0(lk 1 + 1 + j) and set n(rk;j) = w=p. Ob- serve that no subset contains more than dw=pe elements. In Step 2, we rst identify nonlocal subsets, i.e., subsets whose elements lie in several processors. To do this each processor sends the indexes
  • 7. 4 of the lowest- and highest-indexed subset that it contains to every other processor. Using monotone routing, we then copy the nonlo- cal subsets directly to their nal destinations. Some processors may completely contain one or more subsets. We handle this by copying all such subsets to the nal destination of the lowest-indexed subset in the processor. If a processor completely contains more than one subset, we then use segmented broadcast to transfer the subsets to their correct destinations. 2 3 A Worst-Case E cientAlgorithmfor Range Searching Let us again state the problem in which we are interested. The input consists of a set S of n points and a set Q of m hyperrectangles. The task is to report, for each hyperrectangle, which points it contains. In our development of parallel algorithms for this problem, we assume that initially each processor stores n=p points and m=p hyperrectan- gles, and that m and n are both greater than or equal to p2. The output consists of hyperrectangle-point pairs, that is, for each hyper- rectangle q and point p such that p is contained in q, the pair (q;p) is created. In this section we present a parallel algorithm inspired by the se- quential range-tree method 16]. This is a worst-casee cient method. We can use it to solve the d-dimensional, d 2, batched range- searching problem in time O(nlogd 1n+mlogdn+k), where k is the total number of reported points. For a set S of points in the plane, the corresponding range tree consists of a binary search tree on the x-coordinates of the points. That is, every node v represents an in- terval I(v) such that a leaf node represents the interval between two consecutive x-coordinates, and an interior node represents the union of the intervals of its children. (We call these intervals standard in- tervals.) With every node v is associated a y-sorted list Sy(v) of the points with x-coordinate within I(v). 
To determine which points are contained in a hyperrectangle q, partition the x-range of q into stan- dard intervals. More speci cally, interval I(v) is part of the partition if the x-range of q contains I(v) but not I(p(v)), where p(v) is the parent of node v. Then, for every interval I(v) in the partition, decide by a binary search which points in Sy(v) lie within the y-range of q. We can thus decompose a two-dimensional range-searching problem into a collection of one-dimensional range-searching problems. We give rst a parallel algorithm for the one-dimensional case. We then show how we can extend this algorithm to higher dimensions. The algorithm consists of three parts and the details are as follows. Part I: 1. Globally sort S into nondecreasing order by x-coordinate. Di- vide the sorted list into equal-sized sublists, S(i), i = 1;2;:::;p=2.
  • 8. 5 1 2 4 5 6 7 3 l(1) l(2) l(3) l(4) l(5) x Figure 1: The tree Tp for p = 8. Tp has p=2 leaf nodes and p 1 nodes in total. The given x-range is partitioned into the standard intervals corresponding to the circled nodes. It contains the intervals of leaves 2, 3 and 4, and intersects the interval of leaf 1. We index nodes from left to right, beginning with the leaves. (We assume that p is an integer power of two.) 2. For each sublist S(i), nd l(i), the smallest x-coordinate in the sublist (for sublist S(p=2) nd also l(p=2 + 1), the largest x- coordinate in S(p=2)). Broadcast the l-values to all processors. 3. In every processor build a binary search tree Tp on the l-values. Identify each node in Tp by a unique index in the range 1 through p 1. See Figure 1. Part II: 1. For each hyperrectangle q and leaf node i, if q's x-range inter- sects but does not contain I(i), create the node-hyperrectangle pair (i;q). 2. For each leaf node i, determine e(i), the number of node-hyper- rectangle pairs with node index i. Compute e = Pp=2 i=1 e(i). If e = 0, continue to Part III. 3. Globally sort the node-hyperrectangle pairs by node index. 4. For each leaf node i, compute p(i) = de(i)=d2e=pee and f(i) =Pi 1 j=1 p(j). If p(i) > 0, continue as follows. (a) Copy S(i) to the processors P(f(i)+1) through P(f(i)+ p(i)). (b) Divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i;j), j = 1;2;:::;p(i). Move Q(i;j) to the processor P(f(i)+ j). (c) Find k(i;j),the output size of the range-searching problem with input S(i)and Q(i;j). Compute k = Pp=2 i=1 Pp(i) j=1 k(i;j). If k = 0, continue to Part III.
  • 9. 6 5. Foreach subset Q(i;j),compute p(i;j) = bk(i;j)=(k0=p)c, where k0 = max(k;n), and f(i;j) = Pi 1 k=1 Pp(k) l=1 p(k;l)+ Pj 1 l=1 p(i;l). If p(i;j) = 0, solve the range-searching problem with input S(i) and Q(i;j) in the processor P(f(i) + j). Otherwise, continue as follows. (a) Copy S(i) and Q(i;j) to the processors P(f(i;j) + 1) through P(f(i;j)+ p(i;j)). (b) Divide Q(i;j)into the subsets Q0(i;j;l),l = 1;2;:::;p(i;j), such that P (i;q)2Q0 (i;j;l) k(i;q) is O(k0=p). (c) Solve the range-searching problem with input S(i) and Q0(i;j;l) in the processor P(f(i;j)+ l). Part III: 1. For each leaf node i, determine d(i), the number of hyper- rectangles whose x-range contains the interval I(i). Compute d = Pp=2 i=1 d(i). If d = 0, end the execution. 2. For each leaf node i, compute p(i) = dd(i)=d2d=pee and f(i) =Pi 1 j=1 p(j). Copy S(i) to the processors P(f(i) + 1) through P(f(i)+ p(i)). 3. If d > m, then do as follows. (a) Copy the hyperrectangles in each processor to every other processor. (b) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i;q). (c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i;j), j = 1;2;:::;p(i). 4. If d m, then do as follows. (a) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i;q). (b) Globally sort the pairs by node index. (c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i;j), j = 1;2;:::;p(i). Move Q(i;j) to the processor P(f(i)+ j). 5. For each pair (i;q) 2 Q(i;j) and each point p 2 S(i), create the pair (q;p). Theorem 1 We can solve the one-dimensional range-searching prob- lem in time O(Ts(n;p)+ Ts(m;p)+ (mlog(n=p)+ k)=p).
  • 10. 7 Proof. In Step 1 of PartII, we use binary search to decide which pairs to create. Since each hyperrectangle intersects at most two intervals, we create the corresponding pairs locally. In Step 2, each proces- sor rst determines how many of its node-hyperrectangle pairs have node index i, i = 1;2;:::;p=2. After a total exchange operation, the processor P(i) computes e(i). P(i) then broadcasts e(i) to every pro- cessor. Step 4(a) is done by monotone routing followed by segmented broadcasting. In Step 4(b), a segmented pre x sums operation deter- mines the rank of each pair (i;q) among the pairs with node index i. The rank decides to which subset Q(i;j) that (i;q) belongs. We then move Q(i;j) to the processor P(f(i)+ j) using techniques similar to those used in Step 2 of the algorithm for the data-copying problem (Section 2). In Steps 4(c) and 5(c), we use binary search. In total, Parts I and II take O(Ts(n;p)+Ts(m;p)+(mlog(n=p)+k)=p) time. In Step 1 of Part III, each processor rst determines, for each leaf node i, how many of its hyperrectangles contain the interval I(i). This is done as follows. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)) (where p(i) is the parent of node i), increment a counter associated with node i. To compute, for each leaf node i, how many local hyperrectangles contain I(i), sum the counters associated with nodes along the path from i to the root of Tp. Then continue as in Step 2 of Part II. In Step 3(b), each processor that has received a copy of S(i)decides which hyperrectangles contain the interval I(i), and creates the corresponding pairs. In Step 4(a), we create the pairs using a modi ed version of the algorithm for the data-copying problem (Section 2). Since we assume in Step 4 that d m, the total time complexity of Step 4 is O(Ts(m;p)). Step 5, nally, takes O(dn=p2) time. 
Since we assume that n p2, if d > m, the time complexity of Step 5 asymptotically exceeds the time complexities of Steps 1 and 3. The total time for Part III is thus O(Tsb(n=p;p)+ Ts(m;p)+ dn=p2). 2 Giving an algorithm for the two-dimensional case is now easy. It too consists of three parts, where Parts I and II are essentially the same as above. In Part II, we use the batched range-searching algo- rithm of Edelsbrunner and Overmars 17]. They give a divide-and- conquer algorithm for batched range searching that runs in O((m+ n)logd 1 n + mlogm+ k) time and uses O(m+ n) space. It is only Part III that deviates signi cantly from the one-dimensional case. The details of Part III are now as follows. Part III: 1. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)), create the node-hyperrectangle pair (i;q). 2. Foreach node i, determine c(i),the number of node-hyperrectangle pairs with node index i. Compute c = Pp 1 i=1 c(i). If c = 0, end the execution.
3. For each point p and each node i such that p's x-coordinate is contained in I(i) and c(i) > 0, create the node-point pair (i, p).

4. Solve the one-dimensional range-searching problem with input consisting of the node-point pairs and the node-hyperrectangle pairs. That is, for each node-hyperrectangle pair (i, q), find the node-point pairs (i, p) such that p is contained in q's y-range.

Theorem 2. We can solve the two-dimensional range-searching problem in time O(T_s(n log p, p) + T_s(m log p, p) + (m log p log(n/p) + k)/p).

Proof. Part I is exactly as in the one-dimensional case. Part II is the same as in the one-dimensional case, except for Steps 4(c) and 5(c), which now use the batched range-searching algorithm of Edelsbrunner and Overmars. (In Step 4(c), we modify this algorithm to compute only how many points are contained in each hyperrectangle.) Together, Parts I and II take O(T_s(n, p) + T_s(m, p) + ((m + n) log(n/p) + k)/p) time.

In Step 1 of Part III, it takes O(log p) time for each hyperrectangle q to find all nodes i in T_p such that q's x-range contains I(i) but not I(p(i)). Step 2 is similar to Step 2 of Part II. Step 3 takes O(n log p / p) time. In Step 4, we apply our one-dimensional range-searching algorithm to the node-hyperrectangle and node-point pairs created in Steps 1 and 3. Our one-dimensional algorithm assumes that the input is evenly distributed over the processors, and that the number of points and the number of hyperrectangles are both at least p². These assumptions are not necessarily satisfied by the node-hyperrectangle and node-point pairs. We can easily remedy this by adding dummy input as follows. Each processor counts how many node-point pairs it stores. By a reduction operation, we then find n_max, the maximum number of such pairs contained in any processor. Finally, each processor adds dummy pairs until it has exactly max(n_max, p) pairs. The same approach is used for the node-hyperrectangle pairs.
Since no processor stores more than O(n log p / p) node-point pairs and O(m log p / p) node-hyperrectangle pairs, it takes O(T_r(p, p) + (m + n) log p / p) time to add the dummy input. □

Generalizing the above approach to higher dimensions is straightforward, and we can easily derive the following result.

Theorem 3. We can solve the d-dimensional range-searching problem in time O(T_s(n log^{d-1} p, p) + T_s(m log^{d-1} p, p) + ((m + n) log^{d-1}(n/p) + m log^{d-1} p log(n/p) + k)/p).

4 Average-Case Efficient Algorithms for Range Searching

In this section, we present parallel algorithms for range searching that are based on the cell method [18]. In its simplest version, this
method is as follows. First, find the smallest hyperrectangle B that contains the set S. Divide B into equal-sized hyperrectangular cells, and record for each cell which points it contains. We call the resulting data structure a cell directory. To decide which points a hyperrectangle q contains, do as follows: for each cell intersected by q, access the corresponding entry in the cell directory and test, for each point contained in the cell, whether it is included within q. It is common to divide B into O(n) cells, in which case we can build the cell directory (e.g., a multidimensional array of pointers) in O(n) time. The total cost of solving the batched range-searching problem is then O(m + n + s + t), where s and t denote the total number of cell accesses and point inclusion tests, respectively. (The time complexity increases linearly with the dimension d of the problem. In this paper, we assume that d is a small constant.)

As already mentioned, the worst-case performance of this method is poor: we can easily create an input for which s + t is Ω(mn), although the output size k is zero. However, in many applications the cell method may outperform more sophisticated methods. For example, in an experimental evaluation of methods for range searching [2], we found it to be much faster than the range-tree method. This is due to its relative simplicity (small constants of proportionality). Moreover, one can show that if the points are evenly distributed in space and the shape of the query hyperrectangles is similar to the shape of the cells, then s + t is O(k). Finding an efficient parallelization of the cell method as described above is therefore important.

Algorithm I: Our first algorithm is based on the assumption that it is possible to store a copy of S and Q in each processor. The first step of the algorithm achieves this by multinode broadcasting. Then, each processor executes the sequential cell method.
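To make the sequential building block concrete, the cell method described above can be sketched as follows in two dimensions. This is a minimal sketch: the function names, the choice of a g × g grid with g ≈ √n, and the returned (s, t) counters are illustrative assumptions, not details given in the paper.

```python
# Minimal 2D sketch of the sequential cell method (illustrative names).
import math
from collections import defaultdict

def build_cell_directory(points):
    """Divide the bounding box B of the points into ~n equal-sized
    rectangular cells and record which points fall into each cell."""
    n = len(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    g = max(1, int(math.sqrt(n)))           # g x g grid, i.e. O(n) cells
    dx = (xmax - xmin) / g or 1.0           # avoid zero cell width
    dy = (ymax - ymin) / g or 1.0
    cells = defaultdict(list)
    for (x, y) in points:
        i = min(g - 1, int((x - xmin) / dx))
        j = min(g - 1, int((y - ymin) / dy))
        cells[(i, j)].append((x, y))
    return cells, (xmin, ymin, dx, dy, g)

def range_search(directory, query):
    """Report the points inside the axis-aligned rectangle
    query = (qx1, qy1, qx2, qy2); also count the cell accesses s
    and point inclusion tests t performed."""
    cells, (xmin, ymin, dx, dy, g) = directory
    qx1, qy1, qx2, qy2 = query
    i1 = max(0, min(g - 1, int((qx1 - xmin) / dx)))
    i2 = max(0, min(g - 1, int((qx2 - xmin) / dx)))
    j1 = max(0, min(g - 1, int((qy1 - ymin) / dy)))
    j2 = max(0, min(g - 1, int((qy2 - ymin) / dy)))
    out, s, t = [], 0, 0
    for i in range(i1, i2 + 1):
        for j in range(j1, j2 + 1):
            s += 1                          # one cell access
            for (x, y) in cells.get((i, j), []):
                t += 1                      # one point inclusion test
                if qx1 <= x <= qx2 and qy1 <= y <= qy2:
                    out.append((x, y))
    return out, s, t
```

Besides the answer, the sketch returns the per-query counts of cell accesses and inclusion tests; summing s(q) + t(q) over the queries gives exactly the load measure r(q) that the load-balancing step of Algorithm I below partitions Q by.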
To load-balance the computations, we divide Q into subsets for which the total number of cell accesses and point inclusion tests is about the same.

1. Copy the points and hyperrectangles in each processor to every other processor.

2. Locally build a cell directory for S. That is, compute B, the smallest hyperrectangle containing S. Divide B into O(n) equal-sized hyperrectangular cells, and record, for each cell, which points it contains.

3. For each hyperrectangle q, find s(q), the number of cells it intersects. Compute s = Σ_{q∈Q} s(q).

4. For each hyperrectangle q, find t(q), the total number of points contained in the cells intersected by q. Compute t = Σ_{q∈Q} t(q).

5. For each hyperrectangle q, let r(q) = s(q) + t(q). Divide Q into subsets Q(j), j = 1, 2, ..., p, such that Σ_{q∈Q(j)} r(q) = O(max(n, (s + t)/p)).
[Figure 2: Example with sixteen points and one hyperrectangle. The number beside each grid vertex is the number of points dominated by the vertex. The shaded region is the block of cells intersected by the hyperrectangle. The number of point inclusion tests is 12 + 0 − 4 − 0 = 8.]

6. Solve the range-searching problem with input S and Q(j) in the processor P(j).

Theorem 4. Algorithm I solves the batched range-searching problem in time O(T_mb((m + n)/p, p) + m + n + (s + t)/p), where m, n ≥ p.

Proof. Steps 1 through 3 take O(T_mb((m + n)/p, p) + m + n) time. In Step 4, we first compute, for each grid vertex v, how many points in S it dominates, that is, how many points lie in v's southwest quadrant. This can be done in O(n) time. Let d(v) denote the number of points dominated by the grid vertex v. Then t(q) = d(v_NE) + d(v_SW) − d(v_NW) − d(v_SE), where v_NE, v_SW, v_NW and v_SE denote the northeast, southwest, northwest and southeast vertices, respectively, of the block of cells intersected by the hyperrectangle q; see Figure 2. Thus, Step 4 takes O(n + m) time in total. In Step 5, the partitioning of Q into subsets can easily be done in O(m) time. Finally, Step 6 takes O(max(n, (s + t)/p)) time. □

If we can store a copy of S and Q on each processor, and if s + t is large compared with m + n, then this algorithm can be quite efficient. However, in many applications storing a copy of the input on each processor would be impossible. This suggests that we should investigate alternative parallelizations of the cell method.

Algorithm II: Briefly, this algorithm is as follows. For each nonempty cell, we create a list of the points it contains. For each intersected cell, we create a list of (copies of) the hyperrectangles that intersect
it. For each cell that is both nonempty and intersected, we then combine the two lists, that is, we do the corresponding point inclusion tests.

1. Compute B, the smallest hyperrectangle containing S. Divide B into O(n) equal-sized hyperrectangular cells. Identify each cell by a unique index.

2. Decide for each point p in which cell i it is contained. Create the cell-point pair (i, p).

3. For each hyperrectangle q and cell i such that q intersects cell i, create the cell-hyperrectangle pair (i, q). Compute s, the total number of cell-hyperrectangle pairs.

4. Globally sort the cell-point and cell-hyperrectangle pairs with respect to cell indexes. When comparing a cell-point pair and a cell-hyperrectangle pair with the same index, let the latter pair win. If cell i is both nonempty and intersected, there is now a list of cell-point pairs with index i (denoted pl(i)), followed by a list of cell-hyperrectangle pairs with index i (denoted hl(i)). It remains to test each point in pl(i) for inclusion within each hyperrectangle in hl(i).

5. Let n(i) and m(i) denote the lengths of pl(i) and hl(i), respectively, and let t(i) = n(i)m(i). Let A be the set of nonempty and intersected cells. For each cell i ∈ A, compute n(i), m(i) and t(i). Compute t = Σ_{i∈A} t(i).

6. Let I = {i ∈ A : t(i) < ⌈t_0/p⌉}, where t_0 = max(t, n + s).

   (a) For each cell i ∈ I, gather pl(i) and hl(i) into the lowest-indexed processor that contains elements of pl(i).

   (b) Let I(j) = {i ∈ I : pl(i) and hl(i) are in the processor P(j)}. For j = 1, 2, ..., p, compute t̂(j) = Σ_{i∈I(j)} t(i), p(j) = ⌊t̂(j)/⌈t_0/p⌉⌋ and f(j) = Σ_{k=1}^{j-1} p(k).

   (c) For j = 1, 2, ..., p, if p(j) = 0, then do the point inclusion tests corresponding to I(j) in the processor P(j). Otherwise, copy pl(i) and hl(i), i ∈ I(j), to the processors P(f(j)+1) through P(f(j)+p(j)). Decompose I(j) into subsets I(j, k), k = 1, 2, ..., p(j), such that Σ_{i∈I(j,k)} t(i) is O(t_0/p).
Do the point inclusion tests corresponding to I(j, k) in the processor P(f(j) + k).

7. Let E = A \ I. For each cell i ∈ E, do as follows.

   (a) Compute p(i) = ⌊t(i)/⌈t_0/p⌉⌋ and f(i) = Σ_{k∈E, k<i} p(k).
   (b) Divide the longer of pl(i) and hl(i) into equal-sized sublists ll(i, j), j = 1, 2, ..., p(i). Move ll(i, j) to the processor P(f(i) + j).

   (c) Create a copy of the shorter of pl(i) and hl(i) in each of the processors P(f(i)+1) through P(f(i)+p(i)). Do the corresponding point inclusion tests.

Theorem 5. Algorithm II solves the batched range-searching problem in O(T_s(n + s, p) + T_sb(m/p, p) + T_sb(t/p, p) + (m + n + s + t)/p) time, where m, n ≥ p².

Proof. Steps 1 and 2 together take O(T_r(n, p) + n/p) time. In Step 3, we can create the cell-hyperrectangle pairs by slightly modifying the algorithm for the data-copying problem (Section 2). By Lemma 1, this takes O(T_mr(m/p, p, (m + s)/p) + (m + s)/p) time. The global sort in Step 4 takes O(T_s(n + s, p)) time. In Step 5, we compute n(i) and m(i) by segmented reduction in O(T_r(n + s, p)) time. We can identify all cells that are both nonempty and intersected in O((n + s)/p + T_mr(1, p, 1)) time. We then compute t(i) and t in O(T_r(n + s, p)) time. Thus, in total, Step 5 takes O(T_r(n + s, p)) time.

Step 6(a) takes O(T_mr((n + s)/p, p, t_0/p)) time. This follows from the fact that, if cell i ∈ I, then m(i) + n(i) ≤ ⌈t_0/p⌉. Thus, no processor receives more than ⌈t_0/p⌉ pairs. In Step 6(b), we compute and broadcast t̂(j) to every processor in O((n + s)/p + T_mb(1, p)) time. In Step 6(c), decomposing I(j) into subsets of cost O(t_0/p) can easily be done in O(t_0/p) time. Step 6(c) takes O(T_mr(t_0/p, p, t_0/p) + T_sb(t_0/p, p) + t_0/p) time.

Step 7(a) is similar to Step 6(b). To describe Step 7(b), we assume that the list pl(i) is longer than hl(i). We divide pl(i) into p(i) sublists such that p(i)⌈n(i)/p(i)⌉ − n(i) sublists have length ⌊n(i)/p(i)⌋, whereas the remaining sublists have length ⌈n(i)/p(i)⌉. We use a segmented prefix sums computation to decide, for each list element, to which sublist it belongs. To move each sublist to its selected processor, we use the same techniques as in Step 2 of the algorithm for the data-copying problem (Section 2).
Since no sublist has more than O(t_0/p) elements, the total time for Step 7(b) is O(T_p(n + s, p) + T_mb(1, p) + T_mr((n + s)/p, p, t_0/p) + T_mr((n + s)/p, p, (n + s)/p) + T_sb((n + s)/p, p)). In Step 7(c), we copy the shorter list to the selected processors by monotone routing followed by segmented broadcasting. The length of a shortest list cannot exceed √t_0. Thus, the total time for Step 7(c) is O(T_mr((n + s)/p, p, √t_0) + T_sb(√t_0, p) + t_0/p). Since we assume that n ≥ p², it follows that √t_0 ≤ t_0/p. □

References

[1] Z-H. Zhong. Finite Element Procedures for Contact-Impact Problems. Oxford University Press, 1993.
[2] P-O. Fjallstrom, J. Petersson, L. Nilsson, and Z-H. Zhong. Evaluation of range searching methods for contact searching in mechanical engineering. To appear in International Journal of Computational Geometry & Applications.

[3] A. Aggarwal, B. Chazelle, L. Guibas, and C. O'Dunlaing. Parallel computational geometry. Algorithmica, 3:293–327, 1988.

[4] M.J. Atallah. Parallel techniques for computational geometry. Proc. IEEE, 80(9):1435–1448, 1992.

[5] S.G. Akl and K.A. Lyons. Parallel Computational Geometry. Prentice-Hall, 1993.

[6] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. 9th Annual ACM Symposium on Computational Geometry, pages 298–307, 1993.

[7] O. Devillers and A. Fabri. Scalable algorithms for bichromatic line segment intersection problems on coarse grained multicomputers. In Algorithms and Data Structures. Third Workshop, WADS'93, pages 277–288, 1993.

[8] X. Deng. A convex hull algorithm on coarse grained multiprocessors. In Proc. 5th Annual International Symposium on Algorithms and Computation (ISAAC 94), pages 634–642, 1994.

[9] F. Dehne, C. Kenyon, and A. Fabri. Scalable and architecture independent parallel geometric algorithms with high probability optimal time. In Proc. 6th IEEE Symposium on Parallel and Distributed Processing (SPDP), pages 586–593, 1994.

[10] F. Dehne, X. Deng, P. Dymond, A. Fabri, and A.A. Khokhar. A randomized parallel 3D convex hull algorithm for coarse grained multicomputers. In Proc. 7th ACM Symposium on Parallel Algorithms and Architectures, pages 27–33, 1995.

[11] I. Al-furaih, S. Aluru, S. Goil, and S. Ranka. Parallel construction of multidimensional binary search trees. In Proc. International Conference on Supercomputing (ICS'96), 1996.

[12] P-O. Fjallstrom. Parallel algorithms for geometric problems on coarse grained multicomputers. Technical Report LiTH-IDA-R-96-38, Dep. of Computer and Information Science, Linkoping University, 1996.

[13] P-O. Fjallstrom. Parallel interval-cover algorithms for coarse grained multicomputers. Technical Report LiTH-IDA-R-96-39, Dep. of Computer and Information Science, Linkoping University, 1996.
[14] A. Ferreira, C. Kenyon, A. Rau-Chaplin, and S. Ubeda. d-Dimensional range search on multicomputers. Technical Report 96-23, Laboratoire de l'Informatique du Parallelisme, Ecole Normale Superieure de Lyon, 1996.

[15] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., 1994.

[16] J.L. Bentley. Decomposable searching problems. Information Processing Letters, 8(5):244–251, 1979.

[17] H. Edelsbrunner and M.H. Overmars. Batched dynamic solutions to decomposable searching problems. Journal of Algorithms, 6:515–542, 1985.

[18] J.L. Bentley and J.H. Friedman. Data structures for range searching. Computing Surveys, 11:397–409, 1979.