Linköping Electronic Articles in
Computer and Information Science
Vol. 2(1997): nr 3
This work has been submitted for publication elsewhere.
Copyright may then be transferred,
and the present version of the article may be superseded by a revised one.
The WWW page at the URL stated below will contain up-to-date information
about the current version and copyright status of this article. Additional
copyright information is found on the next page of this document.
Linköping University Electronic Press
Linköping, Sweden
http://www.ep.liu.se/ea/cis/1997/003/
Parallel Algorithms for
Batched Range Searching on
Coarse-Grained
Multicomputers
Per-Olof Fjällström
Department of Computer and Information Science
Linköping University
Linköping, Sweden
Published on April 1, 1997 by
Linköping University Electronic Press
581 83 Linköping, Sweden
Linköping Electronic Articles in
Computer and Information Science
ISSN 1401-9841
Series editor: Erik Sandewall
© 1997 Per-Olof Fjällström
Typeset by the author using LaTeX
Formatted using etendu style
Recommended citation:
<Author>. <Title>. Linköping Electronic Articles in
Computer and Information Science, Vol. 2(1997): nr 3.
http://www.ep.liu.se/ea/cis/1997/003/. April 1, 1997.
This URL will also contain a link to the author's home page.
The publishers will keep this article on-line on the Internet
(or its possible replacement network in the future)
for a period of 25 years from the date of publication,
barring exceptional circumstances as described separately.
The on-line availability of the article implies
a permanent permission for anyone to read the article on-line,
and to print out single copies of it for personal use.
This permission cannot be revoked by subsequent
transfers of copyright. All other uses of the article,
including for making copies for classroom use,
are conditional on the consent of the copyright owner.
The publication of the article on the date stated above
included also the production of a limited number of copies
on paper, which were archived in Swedish university libraries
like all other written works published in Sweden.
The publisher has taken technical and administrative measures
to assure that the on-line version of the article will be
permanently accessible using the URL stated above,
unchanged, and permanently equal to the archived printed copies
at least until the expiration of the publication period.
For additional information about the Linköping University
Electronic Press and its procedures for publication and for
assurance of document integrity, please refer to
its WWW home page, http://www.ep.liu.se/,
or write by conventional mail to the address stated above.
Abstract
We define the batched range-searching problem as follows: given a set S of n points and a set Q of m hyperrectangles, report for each hyperrectangle which points it contains. This problem has applications in, for example, computer-aided design and engineering. We present several parallel algorithms for this problem on coarse-grained multicomputers. Our algorithms are based on well-known average- and worst-case efficient sequential algorithms. One of our algorithms solves the d-dimensional batched range-searching problem in O(T_s(n log^{d−1} p, p) + T_s(m log^{d−1} p, p) + ((m + n) log^{d−1}(n/p) + m log^{d−1} p log(n/p) + k)/p) time on a p-processor coarse-grained multicomputer. (T_s(n, p) denotes the time to globally sort n numbers on a p-processor multicomputer, and k is the total number of reported points.)
Keywords Parallel algorithms, coarse-grained multicomputers,
range searching.
The work presented here is funded by CENIIT (the Center for
Industrial Information Technology) at Linköping University.
1 Introduction
In many applications, such as geographic information systems, com-
puter-aided design and engineering, statistics, etc., we need to answer
the following range-searching query: given a set S of n points, which
points lie within a given hyperrectangle? (A hyperrectangle is the
Cartesian product of intervals on distinct coordinate axes.) Usually,
we need to answer many such queries for the same set of points.
In some situations, we know the set of queries in advance. That
is, we want to solve the following batched range-searching problem:
given a set S of n points and a set Q of m hyperrectangles, report
for each hyperrectangle which points it contains. For example, this
is an important subproblem in computer simulation of deformation
processes, such as vehicle collisions and mechanical forming processes.
In such simulations, finding all contacts between components of finite-element models of physical objects is necessary. This can be simplified by approximating surface segments with hyperrectangles, and then determining which vertices these hyperrectangles contain [1, 2].
In this paper, we present parallel algorithms for batched range
searching on coarse-grained multicomputers. A coarse-grained mul-
ticomputer consists of several processors connected by an intercon-
nection network. Each processor is fairly powerful, i.e., it delivers
workstation-class performance. Since off-the-shelf hardware can be
used, coarse-grained multicomputers are relatively inexpensive. Most
commercially available parallel computers are of this type.
Most of the research on parallel algorithms for geometric problems has focused on fine-grain parallel models of computation [3, 4, 5]. It is only during the last couple of years that researchers have designed parallel geometric algorithms for coarse-grained multicomputers [6, 7, 8, 9, 10, 11, 12, 13, 14]. In this model of computation we can
assume that the size of each local memory is large. For example, it
is common to assume that the size of each local memory is larger
than the number of processors. This property allows the algorithm
designer to balance communication latency with local computation
time.
Our parallel algorithms for batched range searching are based on well-known worst- and average-case efficient sequential algorithms. One of our algorithms is based on the range-tree method, and solves the d-dimensional batched range-searching problem in O(T_s(n log^{d−1} p, p) + T_s(m log^{d−1} p, p) + ((m + n) log^{d−1}(n/p) + m log^{d−1} p log(n/p) + k)/p) time on a p-processor coarse-grained multicomputer. (T_s(n, p) denotes the time to globally sort n numbers on a p-processor multicomputer, and k is the total number of reported points.) We also give algorithms based on the cell method. This method has poor worst-case performance, but since it can be very efficient in practice, we believe that developing parallel algorithms based on this approach is important.
Other researchers have developed parallel algorithms for range searching on coarse-grained multicomputers. Devillers and Fabri [7] give an algorithm for the one-dimensional case. Recently, Ferreira et al. [14] presented algorithms for the d-dimensional case. They construct a distributed range tree in time O(s/p + T_s(s, p)), where s = n log^{d−1} n. They can then answer a set of m = O(n) range queries in time O((s log n + k)/p + T_s(s, p)).
We organize the rest of the paper as follows. In Section 2, we give
additional information about coarse-grained multicomputers, and de-
scribe some basic operations used by our algorithms. In Sections 3
and 4, we present parallel range-searching algorithms based on the
range-tree and cell methods, respectively.
2 Model of Computation
Coarse-grained multicomputers consist of a set of processors con-
nected through an interconnection network. The number of proces-
sors usually varies between 16 and 256. The memory is physically
distributed over the processors, and interaction between processors
is through message passing. Each processor can execute a different program independently of the other processors. However, it is common to let each processor execute the same program asynchronously. That is, except for a few global communication steps, processors execute the same program independently of each other. Common interconnection networks are 2D meshes (Paragon XP/S), 3D meshes (Cray T3E), hypercubes (nCUBE 2), and fat trees (CM-5).
Our algorithms use a few basic and extensively studied communication operations. We next describe these operations, and give their time complexities for a square 2D mesh with p processors, which are assumed to be indexed from 1 through p. For a detailed description and analysis of the operations, see Kumar et al. [15].
Monotone routing: Each processor P(i) sends at most one m-word message. The destination address, d(i), of the message sent by P(i) is such that if both P(i) and P(i′), i < i′, send messages, then d(i) ≤ d(i′). The time complexity, T_mr(m, p, r_max), is O((r_max + m)√p), where r_max is the maximum number of words received by any processor.
Segmented broadcast: Processors with indexes i_1 < i_2 < ... < i_q are selected; each processor P(i_j) sends the same m-word message to all processors P(i_j + 1) through P(i_{j+1} − 1). The time complexity, T_sb(m, p), is O(m√p).
Multinode broadcast: Every processor sends the same m-word message to every other processor. The time complexity, T_mb(m, p), is O(mp).
Total exchange: Every processor sends a distinct m-word message to every other processor. The time complexity, T_x(m, p), is O(mp√p).
Prefix sums and reduction: Let a_1, a_2, ..., a_n be a list of numbers evenly distributed over the processors and let ⊕ be an associative operator. The prefix sums operation computes s_i = a_1 ⊕ ... ⊕ a_i, and stores s_i in the same processor as a_i. The time complexity, T_p(n, p), is O(n/p + √p). The reduction operation computes s = a_1 ⊕ ... ⊕ a_n, and stores s in each processor. The time complexity, T_r(n, p), is O(n/p + √p). In the segmented versions of these operations, we apply them to sublists of a_1, a_2, ..., a_n. The time complexity is the same as for the ordinary operations.
Global sort: Given a list a_1, a_2, ..., a_n of numbers evenly distributed over the processors, the global sort operation sorts the list, and returns it evenly distributed over the processors. The time complexity, T_s(n, p), is O(n(log(n/p) + √p)/p).
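To make the semantics of the scan operations concrete, the following is a minimal sequential sketch of the prefix-sums operation and its segmented variant. The code and its names are ours, not from the paper; a real implementation would of course distribute the list over the p processors.

```python
from itertools import accumulate
from operator import add

def prefix_sums(a, op=add):
    """Compute s_i = a_1 (op) ... (op) a_i for an associative operator op."""
    return list(accumulate(a, op))

def segmented_prefix_sums(a, seg_ids, op=add):
    """Segmented variant: restart the scan at every segment boundary."""
    out = []
    for i, x in enumerate(a):
        if i > 0 and seg_ids[i] == seg_ids[i - 1]:
            out.append(op(out[-1], x))   # continue the running sum
        else:
            out.append(x)                # new segment: restart
    return out
```

For example, `prefix_sums([1, 2, 3, 4])` yields `[1, 3, 6, 10]`, while `segmented_prefix_sums([1, 2, 3, 4], [0, 0, 1, 1])` restarts at the third element and yields `[1, 3, 3, 7]`.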
We end this section by showing how some of the above operations
can be used to solve a simple data-copying problem. This is an im-
portant subproblem in the algorithms to be presented in this paper.
The data-copying problem is as follows. A set R of n equal-sized data
records is evenly distributed over the processors of a p-processor mul-
ticomputer. With each record r is associated a nonnegative integer
n(r). The task is to create n(r) additional copies of each record r
such that the work of creating the records is uniformly distributed
over the processors. We do this as follows.
1. Let R′ = {r ∈ R : n(r) > 0}, and let w = Σ_{r∈R′} n(r). Decompose R′ into subsets R′(i), i = 1, 2, ..., p, such that Σ_{r∈R′(i)} n(r) = ⌊w/p⌋ for i ≤ p⌈w/p⌉ − w, and Σ_{r∈R′(i)} n(r) = ⌈w/p⌉ otherwise.
2. For i = 1, 2, ..., p, copy R′(i) to the processor P(i). Create the copies of the records in R′(i) in the processor P(i).
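As an illustration, the balanced decomposition of Step 1 can be sketched sequentially with the rule l_k = ⌊s_k/(w/p)⌋ used in the proof of Lemma 1, splitting any record that straddles a group boundary. This sketch is ours; it assumes, as the proof does, that w is a multiple of p, that all weights are positive, and it uses 0-indexed groups.

```python
def decompose(records, p):
    """Split (record, weight) pairs with positive weights into p groups of
    total weight exactly w/p each, splitting records across boundaries."""
    w = sum(n for _, n in records)
    assert w % p == 0, "sketch assumes w is a multiple of p"
    chunk = w // p
    groups = [[] for _ in range(p)]
    s_prev = 0                           # prefix sum s_{k-1}
    for r, n in records:
        s = s_prev + n                   # prefix sum s_k
        l_prev, l = s_prev // chunk, s // chunk
        if l_prev == l:                  # record fits inside one group
            groups[l].append((r, n))
        else:                            # split r across groups l_prev .. l
            first = (l_prev + 1) * chunk - s_prev
            if first > 0:
                groups[l_prev].append((r, first))
            for j in range(l_prev + 1, l):
                groups[j].append((r, chunk))
            last = s - l * chunk
            if last > 0:
                groups[l].append((r, last))
        s_prev = s
    return groups
```

With `records = [("a", 3), ("b", 4), ("c", 5), ("d", 8)]` and `p = 4`, every group receives weight exactly 5, and records "b", "c" and "d" are split across group boundaries.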
Lemma 1 We can solve the data-copying problem in O(T_mr(n/p, p, (n + w)/p) + (n + w)/p) time, where w is the total number of copies and n ≥ p².
Proof. Regard R′ as an ordered set {r_1, r_2, ..., r_m}. We begin Step 1 by computing the prefix sums s_1, s_2, ..., s_m, where s_k = Σ_{j=1}^{k} n(r_j). To simplify the description of how to decompose R′ into subsets, we assume that w is an integer multiple of p. Extending our description to the general case is easy. Let l_k = ⌊s_k/(w/p)⌋. For each record r_k, if l_{k−1} = l_k, then we assign r_k to the subset R′(l_{k−1} + 1). Otherwise, let d_k = l_k − l_{k−1}. Next, we create new records r_{k,j}, j = 0, 1, ..., d_k, such that we (1) assign r_{k,0} to the subset R′(l_{k−1} + 1) and set n(r_{k,0}) = (l_{k−1} + 1)w/p − s_{k−1}, (2) assign r_{k,d_k} to the subset R′(l_k + 1) and set n(r_{k,d_k}) = s_k − l_k w/p, and (3) assign r_{k,j}, 0 < j < d_k, to subset R′(l_{k−1} + 1 + j) and set n(r_{k,j}) = w/p. Observe that no subset contains more than ⌈w/p⌉ elements. In Step 2, we first identify nonlocal subsets, i.e., subsets whose elements lie in several processors. To do this each processor sends the indexes of the lowest- and highest-indexed subset that it contains to every other processor. Using monotone routing, we then copy the nonlocal subsets directly to their final destinations. Some processors may completely contain one or more subsets. We handle this by copying all such subsets to the final destination of the lowest-indexed subset in the processor. If a processor completely contains more than one subset, we then use segmented broadcast to transfer the subsets to their correct destinations. □
3 A Worst-Case Efficient Algorithm for Range Searching
Let us again state the problem in which we are interested. The input
consists of a set S of n points and a set Q of m hyperrectangles. The
task is to report, for each hyperrectangle, which points it contains. In
our development of parallel algorithms for this problem, we assume that initially each processor stores n/p points and m/p hyperrectangles, and that m and n are both greater than or equal to p². The output consists of hyperrectangle-point pairs, that is, for each hyperrectangle q and point p such that p is contained in q, the pair (q, p) is created.
In this section we present a parallel algorithm inspired by the sequential range-tree method [16]. This is a worst-case efficient method. We can use it to solve the d-dimensional, d ≥ 2, batched range-searching problem in time O(n log^{d−1} n + m log^d n + k), where k is the total number of reported points. For a set S of points in the plane, the corresponding range tree consists of a binary search tree on the x-coordinates of the points. That is, every node v represents an interval I(v) such that a leaf node represents the interval between two consecutive x-coordinates, and an interior node represents the union of the intervals of its children. (We call these intervals standard intervals.) With every node v is associated a y-sorted list S_y(v) of the points with x-coordinate within I(v). To determine which points are contained in a hyperrectangle q, partition the x-range of q into standard intervals. More specifically, interval I(v) is part of the partition if the x-range of q contains I(v) but not I(p(v)), where p(v) is the parent of node v. Then, for every interval I(v) in the partition, decide by a binary search which points in S_y(v) lie within the y-range of q.
We can thus decompose a two-dimensional range-searching problem
into a collection of one-dimensional range-searching problems.
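The sequential two-dimensional scheme just described can be sketched as follows. This code is our own illustration, not the parallel algorithm: it builds a static tree over the points sorted by x, each node keeping a y-sorted list, and a query visits the O(log n) canonical (standard-interval) nodes and binary-searches each y-list. For brevity it reports only y-values.

```python
import bisect

class Node:
    """Range-tree node over a nonempty list of points sorted by x."""
    def __init__(self, pts):
        self.ys = sorted(p[1] for p in pts)       # y-sorted list S_y(v)
        self.xmin, self.xmax = pts[0][0], pts[-1][0]
        if len(pts) > 1:
            mid = len(pts) // 2
            self.left, self.right = Node(pts[:mid]), Node(pts[mid:])
        else:
            self.left = self.right = None         # leaf

def query(node, x1, x2, y1, y2, out):
    """Append the y-values of points inside [x1,x2] x [y1,y2] to out."""
    if node is None or node.xmax < x1 or node.xmin > x2:
        return                                    # disjoint from x-range
    if x1 <= node.xmin and node.xmax <= x2:
        # canonical node: binary search its y-sorted list
        lo = bisect.bisect_left(node.ys, y1)
        hi = bisect.bisect_right(node.ys, y2)
        out.extend(node.ys[lo:hi])
        return
    query(node.left, x1, x2, y1, y2, out)
    query(node.right, x1, x2, y1, y2, out)
```

For instance, with the 16 points (i, i mod 5), i = 0, ..., 15, the query box [3, 12] × [1, 3] reports six y-values.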
We give rst a parallel algorithm for the one-dimensional case. We
then show how we can extend this algorithm to higher dimensions.
The algorithm consists of three parts and the details are as follows.
Part I:
1. Globally sort S into nondecreasing order by x-coordinate. Divide the sorted list into equal-sized sublists, S(i), i = 1, 2, ..., p/2.
Figure 1: The tree T_p for p = 8. T_p has p/2 leaf nodes and p − 1 nodes in total. The given x-range is partitioned into the standard intervals corresponding to the circled nodes. It contains the intervals of leaves 2, 3 and 4, and intersects the interval of leaf 1. We index nodes from left to right, beginning with the leaves.
(We assume that p is an integer power of two.)
2. For each sublist S(i), find l(i), the smallest x-coordinate in the sublist (for sublist S(p/2) find also l(p/2 + 1), the largest x-coordinate in S(p/2)). Broadcast the l-values to all processors.
3. In every processor build a binary search tree T_p on the l-values. Identify each node in T_p by a unique index in the range 1 through p − 1. See Figure 1.
Part II:
1. For each hyperrectangle q and leaf node i, if q's x-range intersects but does not contain I(i), create the node-hyperrectangle pair (i, q).
2. For each leaf node i, determine e(i), the number of node-hyperrectangle pairs with node index i. Compute e = Σ_{i=1}^{p/2} e(i). If e = 0, continue to Part III.
3. Globally sort the node-hyperrectangle pairs by node index.
4. For each leaf node i, compute p(i) = ⌈e(i)/⌈2e/p⌉⌉ and f(i) = Σ_{j=1}^{i−1} p(j). If p(i) > 0, continue as follows.
(a) Copy S(i) to the processors P(f(i) + 1) through P(f(i) + p(i)).
(b) Divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i, j), j = 1, 2, ..., p(i). Move Q(i, j) to the processor P(f(i) + j).
(c) Find k(i, j), the output size of the range-searching problem with input S(i) and Q(i, j). Compute k = Σ_{i=1}^{p/2} Σ_{j=1}^{p(i)} k(i, j). If k = 0, continue to Part III.
5. For each subset Q(i, j), compute p(i, j) = ⌊k(i, j)/(k′/p)⌋, where k′ = max(k, n), and f(i, j) = Σ_{k=1}^{i−1} Σ_{l=1}^{p(k)} p(k, l) + Σ_{l=1}^{j−1} p(i, l). If p(i, j) = 0, solve the range-searching problem with input S(i) and Q(i, j) in the processor P(f(i) + j). Otherwise, continue as follows.
(a) Copy S(i) and Q(i, j) to the processors P(f(i, j) + 1) through P(f(i, j) + p(i, j)).
(b) Divide Q(i, j) into the subsets Q′(i, j, l), l = 1, 2, ..., p(i, j), such that Σ_{(i,q)∈Q′(i,j,l)} k(i, q) is O(k′/p).
(c) Solve the range-searching problem with input S(i) and Q′(i, j, l) in the processor P(f(i, j) + l).
Part III:
1. For each leaf node i, determine d(i), the number of hyperrectangles whose x-range contains the interval I(i). Compute d = Σ_{i=1}^{p/2} d(i). If d = 0, end the execution.
2. For each leaf node i, compute p(i) = ⌈d(i)/⌈2d/p⌉⌉ and f(i) = Σ_{j=1}^{i−1} p(j). Copy S(i) to the processors P(f(i) + 1) through P(f(i) + p(i)).
3. If d > m, then do as follows.
(a) Copy the hyperrectangles in each processor to every other processor.
(b) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i, q).
(c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i, j), j = 1, 2, ..., p(i).
4. If d ≤ m, then do as follows.
(a) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i, q).
(b) Globally sort the pairs by node index.
(c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i, j), j = 1, 2, ..., p(i). Move Q(i, j) to the processor P(f(i) + j).
5. For each pair (i, q) ∈ Q(i, j) and each point p ∈ S(i), create the pair (q, p).
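Abstracting away the data movement, the reporting logic of the three parts can be simulated sequentially. The sketch below is ours: it splits the sorted points into p/2 sublists, and for each query either binary-searches a sublist whose interval the query's x-range only partially overlaps (the Part II case) or copies the whole sublist when the x-range contains its interval (the Part III case). It assumes the number of points is a multiple of p/2.

```python
import bisect

def batched_1d(points, queries, p):
    """Report all (query index, point) pairs for 1-D range queries [a, b]."""
    xs = sorted(points)
    assert len(xs) % (p // 2) == 0, "sketch assumes even sublists"
    size = len(xs) // (p // 2)
    subs = [xs[i * size:(i + 1) * size] for i in range(p // 2)]
    out = []
    for q, (a, b) in enumerate(queries):
        for S in subs:
            lo, hi = S[0], S[-1]                 # sublist interval I(i)
            if b < lo or a > hi:
                continue                          # disjoint
            if a <= lo and hi <= b:               # Part III: contained
                out.extend((q, x) for x in S)
            else:                                 # Part II: binary search
                i = bisect.bisect_left(S, a)
                j = bisect.bisect_right(S, b)
                out.extend((q, x) for x in S[i:j])
    return out
```

For example, with the points 0, ..., 15, p = 8 and the single query [2, 9], the query's x-range contains the interval of the sublist {4, ..., 7} and partially overlaps two others, producing the eight pairs (0, 2), ..., (0, 9).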
Theorem 1 We can solve the one-dimensional range-searching problem in time O(T_s(n, p) + T_s(m, p) + (m log(n/p) + k)/p).
Proof. In Step 1 of Part II, we use binary search to decide which pairs to create. Since each hyperrectangle intersects at most two intervals, we create the corresponding pairs locally. In Step 2, each processor first determines how many of its node-hyperrectangle pairs have node index i, i = 1, 2, ..., p/2. After a total exchange operation, the processor P(i) computes e(i). P(i) then broadcasts e(i) to every processor. Step 4(a) is done by monotone routing followed by segmented broadcasting. In Step 4(b), a segmented prefix sums operation determines the rank of each pair (i, q) among the pairs with node index i. The rank decides to which subset Q(i, j) the pair (i, q) belongs. We then move Q(i, j) to the processor P(f(i) + j) using techniques similar to those used in Step 2 of the algorithm for the data-copying problem (Section 2). In Steps 4(c) and 5(c), we use binary search. In total, Parts I and II take O(T_s(n, p) + T_s(m, p) + (m log(n/p) + k)/p) time.
In Step 1 of Part III, each processor first determines, for each leaf node i, how many of its hyperrectangles contain the interval I(i). This is done as follows. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)) (where p(i) is the parent of node i), increment a counter associated with node i. To compute, for each leaf node i, how many local hyperrectangles contain I(i), sum the counters associated with nodes along the path from i to the root of T_p. Then continue as in Step 2 of Part II. In Step 3(b), each processor that has received a copy of S(i) decides which hyperrectangles contain the interval I(i), and creates the corresponding pairs. In Step 4(a), we create the pairs using a modified version of the algorithm for the data-copying problem (Section 2). Since we assume in Step 4 that d ≤ m, the total time complexity of Step 4 is O(T_s(m, p)). Step 5, finally, takes O(dn/p²) time. Since we assume that n ≥ p², if d > m, the time complexity of Step 5 asymptotically exceeds the time complexities of Steps 1 and 3. The total time for Part III is thus O(T_sb(n/p, p) + T_s(m, p) + dn/p²). □
Giving an algorithm for the two-dimensional case is now easy. It too consists of three parts, where Parts I and II are essentially the same as above. In Part II, we use the batched range-searching algorithm of Edelsbrunner and Overmars [17]. They give a divide-and-conquer algorithm for batched range searching that runs in O((m + n) log^{d−1} n + m log m + k) time and uses O(m + n) space. It is only Part III that deviates significantly from the one-dimensional case. The details of Part III are now as follows.
Part III:
1. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)), create the node-hyperrectangle pair (i, q).
2. For each node i, determine c(i), the number of node-hyperrectangle pairs with node index i. Compute c = Σ_{i=1}^{p−1} c(i). If c = 0, end the execution.
3. For each point p and each node i such that p's x-coordinate is contained in I(i) and c(i) > 0, create the node-point pair (i, p).
4. Solve the one-dimensional range-searching problem with input consisting of the node-point pairs and the node-hyperrectangle pairs. That is, for each node-hyperrectangle pair (i, q), find the node-point pairs (i, p) such that p is contained in q's y-range.
Theorem 2 We can solve the two-dimensional range-searching problem in time O(T_s(n log p, p) + T_s(m log p, p) + (m log p log(n/p) + k)/p).
Proof. Part I is exactly as in the one-dimensional case. Part II is the same as in the one-dimensional case, except Steps 4(c) and 5(c), which now use the batched range-searching algorithm of Edelsbrunner and Overmars. (In Step 4(c), we modify this algorithm to compute just how many points are contained in each hyperrectangle.) Parts I and II together take O(T_s(n, p) + T_s(m, p) + ((m + n) log(n/p) + k)/p) time.
In Step 1 of Part III, it takes O(log p) time for each hyperrectangle q to find all nodes i in T_p such that q's x-range contains I(i) but not I(p(i)). Step 2 is similar to Step 2 of Part II. Step 3 takes O(n log p / p) time. In Step 4, we apply our one-dimensional range-searching algorithm to the node-hyperrectangle and node-point pairs created in Steps 1 and 3. In our one-dimensional algorithm we assume that the input is evenly distributed over the processors, and that the number of points and the number of hyperrectangles are both greater than or equal to p². These assumptions are not necessarily satisfied by the node-hyperrectangle and node-point pairs. We can easily remedy this by adding dummy input as follows. Each processor counts how many node-point pairs it stores. By a reduction operation, we then find n_max, the maximum number of such pairs contained in any processor. Finally, each processor adds dummy pairs until it has exactly max(n_max, p) pairs. The same approach is used for the node-hyperrectangle pairs. Since no processor stores more than O(n log p / p) node-point pairs and O(m log p / p) node-hyperrectangle pairs, it takes O(T_r(p, p) + (m + n) log p / p) time to add the dummy input. □
Generalizing the above approach to higher dimensions is straight-
forward. We can easily derive the following result.
Theorem 3 We can solve the d-dimensional range-searching problem in time O(T_s(n log^{d−1} p, p) + T_s(m log^{d−1} p, p) + ((m + n) log^{d−1}(n/p) + m log^{d−1} p log(n/p) + k)/p).
4 Average-Case Efficient Algorithms for Range Searching
In this section, we present parallel algorithms for range searching that are based on the cell method [18]. In its simplest version, this method is as follows. First, find the smallest hyperrectangle B that contains the set S. Divide B into equal-sized hyperrectangular cells, and record for each cell which points it contains. We call the resulting data structure a cell directory. To decide which points a hyperrectangle q contains, do as follows. For each cell intersected by q, access the corresponding entry in the cell directory and test, for each point contained in the cell, if it is included within q.
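In code, the simple version just described might look as follows. This is our own two-dimensional sketch with a hypothetical `cells_per_axis` parameter; it stores the directory as a hash map rather than a multidimensional array, and assumes the points span a nondegenerate bounding box.

```python
from collections import defaultdict

def build_cell_directory(points, cells_per_axis):
    """Bucket each point into a grid cell over the bounding box B."""
    xmin = min(x for x, _ in points); xmax = max(x for x, _ in points)
    ymin = min(y for _, y in points); ymax = max(y for _, y in points)
    k = cells_per_axis
    def cell_of(x, y):
        # clamp so boundary points and query corners map into the grid
        cx = min(max(int((x - xmin) / (xmax - xmin) * k), 0), k - 1)
        cy = min(max(int((y - ymin) / (ymax - ymin) * k), 0), k - 1)
        return cx, cy
    directory = defaultdict(list)
    for pt in points:
        directory[cell_of(*pt)].append(pt)
    return directory, cell_of

def range_query(directory, cell_of, rect):
    """Report the points of the directory that lie inside rect."""
    x1, y1, x2, y2 = rect
    (cx1, cy1), (cx2, cy2) = cell_of(x1, y1), cell_of(x2, y2)
    hits = []
    for cx in range(cx1, cx2 + 1):            # cells intersected by rect
        for cy in range(cy1, cy2 + 1):
            for x, y in directory.get((cx, cy), []):
                if x1 <= x <= x2 and y1 <= y <= y2:  # inclusion test
                    hits.append((x, y))
    return hits
```

On a 4 × 4 integer grid of points, the query box [1, 2] × [1, 2] touches four cells and reports the four enclosed points.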
It is common to divide B into O(n) cells, in which case we can
build the cell directory (e.g., a multidimensional array of pointers)
in O(n) time. The total cost of solving the batched range-searching
problem is then O(m + n + s + t) time, where s and t denote the
total number of cell accesses and point inclusion tests, respectively.
(The time complexity increases linearly with the dimension d of the
problem. In this paper, we assume that d is a small constant.)
As already mentioned, the worst-case performance of this method is poor. We can easily create an input such that s + t is Ω(mn), although the output size k is zero. However, in many applications the cell method may outperform more sophisticated methods. For example, in an experimental evaluation of methods for range searching, we [2] found it to be much faster than the range-tree method. This is due to its relative simplicity (small constants of proportionality). Moreover, one can show that if the points are evenly distributed in space and the shape of the query hyperrectangles is similar to the shape of the cells, then s + t is O(k). Finding an efficient parallelization of the cell method as described above is therefore important.
Algorithm I: Our first algorithm is based on the assumption that storing a copy of S and Q in each processor is possible. The first step of the algorithm achieves this by multinode broadcasting. Then, each processor executes the sequential cell method. To load-balance the computations, we divide Q into subsets for which the total number of cell accesses and point inclusion tests is about the same.
1. Copy the points and hyperrectangles in each processor to every
other processor.
2. Locally build a cell directory for S. That is, compute B, the smallest hyperrectangle containing S. Divide B into O(n) equal-sized hyperrectangular cells, and record, for each cell, which points it contains.
3. For each hyperrectangle q, find s(q), the number of cells it intersects. Compute s = Σ_{q∈Q} s(q).
4. For each hyperrectangle q, find t(q), the total number of points contained in the cells intersected by q. Compute t = Σ_{q∈Q} t(q).
5. For each hyperrectangle q, let r(q) = s(q) + t(q). Divide Q into the subsets Q(j), j = 1, 2, ..., p, such that Σ_{q∈Q(j)} r(q) = O(max(n, (s + t)/p)).
10
2 3 4 4
3
4
5
5
16
0
0
7
9
7
10
12
8
12
0
0
0 0 0 0
0
Figure 2: Example with sixteen points and one hyperrectangle. The number beside each grid vertex is the number of points dominated by the vertex. The shaded region is the block of cells intersected by the hyperrectangle. The number of point inclusion tests is 12 + 0 − 4 − 0 = 8.
6. Solve the range-searching problem with input S and Q(j) in
the processor P(j).
Theorem 4 Algorithm I solves the batched range-searching problem in time O(T_mb((m + n)/p, p) + m + n + (s + t)/p), where m, n ≥ p.
Proof. Steps 1 through 3 take O(T_mb((m + n)/p, p) + m + n) time. In Step 4, we first compute for each grid vertex v how many points in S it dominates, that is, how many points lie in v's southwest quadrant. This can be done in O(n) time. Let d(v) denote the number of points dominated by the grid vertex v. Then, t(q) = d(v_NE) + d(v_SW) − d(v_NW) − d(v_SE), where v_NE, v_SW, v_NW and v_SE denote the northeast, southwest, northwest and southeast vertices, respectively, of the block of cells intersected by the hyperrectangle q. See Figure 2. Thus, Step 4 takes O(n + m) time in total. In Step 5, the partitioning of Q into subsets can easily be done in O(m) time. Finally, Step 6 takes O(max(n, (s + t)/p)) time. □
If we can have a copy of S and Q on each processor, and if s + t is large compared with m + n, then this algorithm can be quite efficient. However, in many applications storing a copy of the input on each processor would be impossible. This suggests that we should investigate alternative parallelizations of the cell method.
Algorithm II: Briefly, this algorithm is as follows. For each nonempty cell, we create a list of the points it contains. For each intersected cell, we create a list of (copies of) the hyperrectangles that intersect it. For each cell that is both nonempty and intersected, we then combine the two lists, that is, we do the corresponding point inclusion tests.
1. Compute B, the smallest hyperrectangle containing S. Divide
B into O(n) equal-sized hyperrectangular cells. Identify each
cell by a unique index.
2. Decide for each point p in which cell i it is contained. Create the cell-point pair (i, p).
3. For each hyperrectangle q and cell i such that q intersects cell i, create the cell-hyperrectangle pair (i, q). Compute s, the total number of cell-hyperrectangle pairs.
4. Globally sort the cell-point and cell-hyperrectangle pairs with
respect to cell indexes. When comparing a cell-point pair and
cell-hyperrectangle pair with the same index, let the latter pair
win.
If cell i is both nonempty and intersected, there is now a list
of cell-point pairs with index i (denoted pl(i)), followed by a
list of cell-hyperrectangle pairs with index i (denoted hl(i)).
It remains to test each point in pl(i) for inclusion within each
hyperrectangle in hl(i).
5. Let n(i) and m(i) denote the lengths of pl(i) and hl(i), respectively, and let t(i) = n(i)m(i). Let A be the set of nonempty and intersected cells. For each cell i ∈ A, compute n(i), m(i) and t(i). Compute t = Σ_{i∈A} t(i).
6. Let I = {i ∈ A : t(i) < ⌈t′/p⌉}, where t′ = max(t, n + s).
(a) For each cell i ∈ I, gather pl(i) and hl(i) into the lowest-indexed processor that contains elements of pl(i).
(b) Let I(j) = {i ∈ I : pl(i) and hl(i) are in the processor P(j)}. For j = 1, 2, ..., p, compute t̂(j) = Σ_{i∈I(j)} t(i), p(j) = ⌊t̂(j)/⌈t′/p⌉⌋ and f(j) = Σ_{k=1}^{j−1} p(k).
(c) For j = 1, 2, ..., p, if p(j) = 0, then do the point inclusion tests corresponding to I(j) in the processor P(j). Otherwise, copy pl(i) and hl(i), i ∈ I(j), to the processors P(f(j) + 1) through P(f(j) + p(j)). Decompose I(j) into subsets I(j, k), k = 1, 2, ..., p(j), such that Σ_{i∈I(j,k)} t(i) is O(t′/p). Do the point inclusion tests corresponding to I(j, k) in the processor P(f(j) + k).
7. Let E = A \ I. For each cell i ∈ E, do as follows.
(a) Compute p(i) = ⌊t(i)/⌈t′/p⌉⌋ and f(i) = Σ_{k∈E, k<i} p(k).
(b) Divide the longest list of pl(i) and hl(i) into equal-sized sublists, ll(i, j), j = 1, 2, ..., p(i). Move ll(i, j) to the processor P(f(i) + j).
(c) Create a copy of the shortest list of pl(i) and hl(i) in each processor P(f(i) + 1) through P(f(i) + p(i)). Do the corresponding point inclusion tests.
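Sequentially, the pairing-and-sorting structure of Algorithm II can be sketched like this. The code is our own, on an integer grid with hypothetical cell side `cs`; the sort key mirrors Step 4's rule that, within a cell, cell-point pairs precede cell-hyperrectangle pairs, so each rectangle sees the complete point list of its cell.

```python
from itertools import product

def batched_cell_search(points, rects, cs):
    """Report (rect index, point) pairs: build (cell, point) and
    (cell, rectangle) pairs, sort by cell (points first), then combine
    the two lists of every cell that is both nonempty and intersected."""
    pairs = []                                    # (cell, kind, payload)
    for x, y in points:
        pairs.append(((x // cs, y // cs), 0, (x, y)))
    for qi, (x1, y1, x2, y2) in enumerate(rects):
        for c in product(range(x1 // cs, x2 // cs + 1),
                         range(y1 // cs, y2 // cs + 1)):
            pairs.append((c, 1, qi))              # rect qi intersects cell c
    pairs.sort(key=lambda t: (t[0], t[1]))        # kind 0 (points) first
    out, pl, cur = [], [], None
    for cell, kind, payload in pairs:
        if cell != cur:
            cur, pl = cell, []                    # start a new cell
        if kind == 0:
            pl.append(payload)                    # extend pl(i)
        else:                                     # point inclusion tests
            x1, y1, x2, y2 = rects[payload]
            out.extend((payload, p) for p in pl
                       if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
    return out
```

With points (1, 1), (5, 5), (9, 9), one query box [0, 6] × [0, 6] and cell side 4, the box intersects four cells but only two of them are nonempty, yielding the pairs (0, (1, 1)) and (0, (5, 5)).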
Theorem 5 Algorithm II solves the batched range-searching problem in O(T_s(n + s, p) + T_sb(m/p, p) + T_sb(t/p, p) + (m + n + s + t)/p) time, where m, n ≥ p².
Proof. Steps 1 and 2 take together O(Tr(n;p)+n=p) time. In Step
3, we can create the cell-hyperrectangle pairs by slightly modifying
the algorithm for the data-copying problem (Section 2). By Lemma 1,
this takes O(Tmr(m=p;p;(m+ s)=p) + (m + s)=p) time. The global
sort in Step 4 takes O(Ts(n + s;p)) time. In Step 5, we compute
n(i) and m(i) by segmented reduction in O(Tr(n + s;p)) time. We
can identify all cells that are both nonempty and intersected in time
O((n+s)=p+Tmr(1;p;1)). We then compute t(i) and t in O(Tr(n+
s;p)) time. Thus, in total Step 5 takes O(Tr(n + s;p)) time.
Step 6(a) takes O(Tmr((n+s)=p;p;t0=p)) time. This follows from
the fact that, if cell i 2 I, then m(i) + n(i) dt0=pe. Thus, no
processor receives more than dt0=pe pairs. In Step 6(b), we compute
and broadcast ^t(j) to every processor in O((n + s)=p + Tmb(1;p))
time. In Step 6(c), to decompose I(j) into subsets of cost O(t0=p) can
easily be done in O(t0=p) time. Step 6(c) takes O(Tmr(t0=p;p;t0=p)+
Tsb(t0=p;p)+ t0=p) time.
Step 7(a) is similar to Step 6(b). To describe Step 7(b), we assume that list pl(i) is longer than hl(i). We divide pl(i) into p(i) sublists such that p(i)⌈n(i)/p(i)⌉ − n(i) sublists have length ⌊n(i)/p(i)⌋, whereas the remaining sublists have length ⌈n(i)/p(i)⌉. We use a segmented prefix sums computation to decide, for each list element, to which sublist it belongs. To move each sublist to its selected processor, we use the same techniques as in Step 2 of the algorithm for the data-copying problem (Section 2). Since no sublist has more than O(t′/p) elements, the total time for Step 7(b) is O(Tp(n+s, p) + Tmb(1, p) + Tmr((n+s)/p, p, t′/p) + Tmr((n+s)/p, p, (n+s)/p) + Tsb((n+s)/p, p)). In Step 7(c), we copy the shortest list to the selected processors by monotone routing followed by segmented broadcasting. The length of a shortest list cannot exceed √t′. Thus, the total time for Step 7(c) is O(Tmr((n+s)/p, p, √t′) + Tsb(√t′, p) + t′/p). Since we assume that n ≥ p², it follows that √t′ ≤ t′/p. □
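
The sublist lengths used in Step 7(b) can be computed directly from n(i) and p(i). The sketch below (a hypothetical helper, not from the paper) produces p(i)⌈n(i)/p(i)⌉ − n(i) sublists of length ⌊n(i)/p(i)⌋ and the rest of length ⌈n(i)/p(i)⌉, so the lengths sum to n(i) and differ by at most one:

```python
import math

def sublist_lengths(n, p):
    """Lengths of the p sublists from Step 7(b): the first
    p*ceil(n/p) - n sublists get floor(n/p) elements, and the
    remaining sublists get ceil(n/p) elements."""
    long_len = math.ceil(n / p)
    num_short = p * long_len - n
    return [n // p] * num_short + [long_len] * (p - num_short)
```

For example, sublist_lengths(10, 4) yields [2, 2, 3, 3]; the segmented prefix sums over the list elements then assign each element to its sublist.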
References
[1] Z-H. Zhong. Finite Element Procedures for Contact-Impact Problems. Oxford University Press, 1993.
[2] P-O. Fjallstrom, J. Petersson, L. Nilsson, and Z-H. Zhong. Evaluation of range searching methods for contact searching in mechanical engineering. To appear in International Journal of Computational Geometry & Applications.
[3] A. Aggarwal, B. Chazelle, L. Guibas, and C. O'Dunlaing. Parallel computational geometry. Algorithmica, 3:293-327, 1988.
[4] M.J. Atallah. Parallel techniques for computational geometry. Proc. IEEE, 80(9):1435-1448, 1992.
[5] S.G. Akl and K.A. Lyons. Parallel Computational Geometry. Prentice-Hall, 1993.
[6] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. 9th Annual ACM Symposium on Computational Geometry, pages 298-307, 1993.
[7] O. Devillers and A. Fabri. Scalable algorithms for bichromatic line segment intersection problems on coarse grained multicomputers. In Algorithms and Data Structures. Third Workshop, WADS'93, pages 277-288, 1993.
[8] X. Deng. A convex hull algorithm on coarse grained multiprocessors. In Proc. 5th Annual International Symposium on Algorithms and Computation (ISAAC 94), pages 634-642, 1994.
[9] F. Dehne, C. Kenyon, and A. Fabri. Scalable and architecture independent parallel geometric algorithms with high probability optimal time. In Proc. 6th IEEE Symposium on Parallel and Distributed Processing (SPDP), pages 586-593, 1994.
[10] F. Dehne, X. Deng, P. Dymond, A. Fabri, and A.A. Khokhar. A randomized parallel 3D convex hull algorithm for coarse grained multicomputers. In Proc. 7th ACM Symposium on Parallel Algorithms and Architectures, pages 27-33, 1995.
[11] I. Al-furaih, S. Aluru, S. Goil, and S. Ranka. Parallel construction of multidimensional binary search trees. In Proc. International Conference on Supercomputing (ICS'96), 1996.
[12] P-O. Fjallstrom. Parallel algorithms for geometric problems on coarse grained multicomputers. Technical Report LiTH-IDA-R-96-38, Dep. of Computer and Information Science, Linkoping University, 1996.
[13] P-O. Fjallstrom. Parallel interval-cover algorithms for coarse grained multicomputers. Technical Report LiTH-IDA-R-96-39, Dep. of Computer and Information Science, Linkoping University, 1996.
[14] A. Ferreira, C. Kenyon, A. Rau-Chaplin, and S. Ubeda. d-Dimensional range search on multicomputers. Technical Report 96-23, Laboratoire de l'Informatique du Parallelisme, Ecole Normale Superieure de Lyon, 1996.
[15] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., 1994.
[16] J.L. Bentley. Decomposable searching problems. Information Processing Letters, 8(5):244-251, 1979.
[17] H. Edelsbrunner and M.H. Overmars. Batched dynamic solutions to decomposable searching problems. Journal of Algorithms, 6:515-542, 1985.
[18] J.L. Bentley and J.H. Friedman. Data structures for range searching. Computing Surveys, 11:397-409, 1979.

More Related Content

PPT
Chapter 4 pc
Hanif Durad
 
PPT
All-Reduce and Prefix-Sum Operations
Syed Zaid Irshad
 
PPT
Collective Communications in MPI
Hanif Durad
 
ODP
Chapter - 04 Basic Communication Operation
Nifras Ismail
 
PPTX
Communication costs in parallel machines
Syed Zaid Irshad
 
PPT
Chap4 slides
Jothish DL
 
PPT
Chapter 5 pc
Hanif Durad
 
PDF
Colfax-Winograd-Summary _final (1)
Sangamesh Ragate
 
Chapter 4 pc
Hanif Durad
 
All-Reduce and Prefix-Sum Operations
Syed Zaid Irshad
 
Collective Communications in MPI
Hanif Durad
 
Chapter - 04 Basic Communication Operation
Nifras Ismail
 
Communication costs in parallel machines
Syed Zaid Irshad
 
Chap4 slides
Jothish DL
 
Chapter 5 pc
Hanif Durad
 
Colfax-Winograd-Summary _final (1)
Sangamesh Ragate
 

What's hot (20)

PPT
Chapter 3 pc
Hanif Durad
 
PPTX
Broadcast in Hypercube
Sujith Jay Nair
 
PPTX
Lecturre 07 - Chapter 05 - Basic Communications Operations
National College of Business Administration & Economics ( NCBA&E)
 
PPT
Chapter 6 pc
Hanif Durad
 
PPT
FEC & File Multicast
Yoss Cohen
 
PDF
00b7d51ed81834e4d7000000
Rahul Jain
 
PDF
Performance comparision 1307.4129
Pratik Joshi
 
PPTX
Nsl seminar(2)
Thomhert Siadari
 
DOC
The Most Important Algorithms
wensheng wei
 
PDF
Available network bandwidth schema to improve performance in tcp protocols
IJCNCJournal
 
PPT
Network coding
Lishi He
 
PDF
Report on High Performance Computing
Prateek Sarangi
 
PDF
A novel technique for speech encryption based on k-means clustering and quant...
journalBEEI
 
PPT
Tutorial on Parallel Computing and Message Passing Model - C4
Marcirio Chaves
 
PDF
FINAL PROJECT REPORT
Dhrumil Shah
 
PPT
Distributed Hash Table
ravindra.devagiri
 
PDF
The reasons why 64-bit programs require more stack memory
PVS-Studio
 
PDF
Model checking
Richard Ashworth
 
PDF
A046020112
IJERA Editor
 
PDF
C0431320
IOSR Journals
 
Chapter 3 pc
Hanif Durad
 
Broadcast in Hypercube
Sujith Jay Nair
 
Lecturre 07 - Chapter 05 - Basic Communications Operations
National College of Business Administration & Economics ( NCBA&E)
 
Chapter 6 pc
Hanif Durad
 
FEC & File Multicast
Yoss Cohen
 
00b7d51ed81834e4d7000000
Rahul Jain
 
Performance comparision 1307.4129
Pratik Joshi
 
Nsl seminar(2)
Thomhert Siadari
 
The Most Important Algorithms
wensheng wei
 
Available network bandwidth schema to improve performance in tcp protocols
IJCNCJournal
 
Network coding
Lishi He
 
Report on High Performance Computing
Prateek Sarangi
 
A novel technique for speech encryption based on k-means clustering and quant...
journalBEEI
 
Tutorial on Parallel Computing and Message Passing Model - C4
Marcirio Chaves
 
FINAL PROJECT REPORT
Dhrumil Shah
 
Distributed Hash Table
ravindra.devagiri
 
The reasons why 64-bit programs require more stack memory
PVS-Studio
 
Model checking
Richard Ashworth
 
A046020112
IJERA Editor
 
C0431320
IOSR Journals
 
Ad

Viewers also liked (16)

PPTX
Indian conquistadors
Roccaheather
 
PPTX
Indian conquistadors
Roccaheather
 
PDF
Survey and Evaluation of Methods for Tissue Classification
perfj
 
PDF
Enfoque basado en procesos
ZELEY VELEZ
 
PPTX
The jesuit relations
Roccaheather
 
PDF
cis98010
perfj
 
PPTX
Italy slides for history
Roccaheather
 
DOCX
3 strategie di web marketing per acquisire clienti online
Enrico Venti
 
PPTX
Midterm history
Roccaheather
 
PPTX
Harness The Full Potential Of Mobile Through Paid Search
dmothes
 
PDF
cis98006
perfj
 
PDF
Assessing the compactness and isolation of individual clusters
perfj
 
PPTX
Earthquakes
Roccaheather
 
PDF
cis97007
perfj
 
PDF
Data Backup, Archiving &amp; Disaster Recovery October 2011
zaheer756
 
PPTX
Dead star
Maria Vanessa Tabuada
 
Indian conquistadors
Roccaheather
 
Indian conquistadors
Roccaheather
 
Survey and Evaluation of Methods for Tissue Classification
perfj
 
Enfoque basado en procesos
ZELEY VELEZ
 
The jesuit relations
Roccaheather
 
cis98010
perfj
 
Italy slides for history
Roccaheather
 
3 strategie di web marketing per acquisire clienti online
Enrico Venti
 
Midterm history
Roccaheather
 
Harness The Full Potential Of Mobile Through Paid Search
dmothes
 
cis98006
perfj
 
Assessing the compactness and isolation of individual clusters
perfj
 
Earthquakes
Roccaheather
 
cis97007
perfj
 
Data Backup, Archiving &amp; Disaster Recovery October 2011
zaheer756
 
Ad

Similar to cis97003 (20)

PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
PDF
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Derryck Lamptey, MPhil, CISSP
 
PDF
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
IRJET Journal
 
PDF
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
RSIS International
 
PDF
Problems in Task Scheduling in Multiprocessor System
ijtsrd
 
PDF
Gk3611601162
IJERA Editor
 
PDF
Parallelization of Graceful Labeling Using Open MP
IJSRED
 
PDF
Solution(1)
Gopi Saiteja
 
PDF
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
ijtsrd
 
PDF
Rapport_Cemracs2012
Jussara F.M.
 
PDF
Ecc cipher processor based on knapsack algorithm
Alexander Decker
 
PDF
Bh36352357
IJERA Editor
 
PPTX
Complier design
shreeuva
 
PDF
A comparison of efficient algorithms for scheduling parallel data redistribution
IJCNCJournal
 
PDF
Implementing Map Reduce Based Edmonds-Karp Algorithm to Determine Maximum Flo...
paperpublications3
 
PDF
International Journal of Soft Computing, Mathematics and Control (IJSCMC)
ijscmcj1
 
PPT
Query optimization for_sensor_networks
Harshavardhan Achrekar
 
PPT
Computing with Directed Labeled Graphs
Marko Rodriguez
 
PDF
A new RSA public key encryption scheme with chaotic maps
IJECEIAES
 
PDF
The Quality of the New Generator Sequence Improvent to Spread the Color Syste...
TELKOMNIKA JOURNAL
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Derryck Lamptey, MPhil, CISSP
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
IRJET Journal
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
RSIS International
 
Problems in Task Scheduling in Multiprocessor System
ijtsrd
 
Gk3611601162
IJERA Editor
 
Parallelization of Graceful Labeling Using Open MP
IJSRED
 
Solution(1)
Gopi Saiteja
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
ijtsrd
 
Rapport_Cemracs2012
Jussara F.M.
 
Ecc cipher processor based on knapsack algorithm
Alexander Decker
 
Bh36352357
IJERA Editor
 
Complier design
shreeuva
 
A comparison of efficient algorithms for scheduling parallel data redistribution
IJCNCJournal
 
Implementing Map Reduce Based Edmonds-Karp Algorithm to Determine Maximum Flo...
paperpublications3
 
International Journal of Soft Computing, Mathematics and Control (IJSCMC)
ijscmcj1
 
Query optimization for_sensor_networks
Harshavardhan Achrekar
 
Computing with Directed Labeled Graphs
Marko Rodriguez
 
A new RSA public key encryption scheme with chaotic maps
IJECEIAES
 
The Quality of the New Generator Sequence Improvent to Spread the Color Syste...
TELKOMNIKA JOURNAL
 

cis97003

  • 1. Linkoping Electronic Articles in Computer and Information Science Vol. 2(1997): nr 3 This work has been submitted for publication elsewhere. Copyright may then be transferred, and the present version of the article may be superseded by a revised one. The WWW page at the URL stated below will contain up-to-date information about the current version and copyright status of this article. Additional copyright information is found on the next page of this document. Linkoping University Electronic Press Linkoping, Sweden http://www.ep.liu.se/ea/cis/1997/003/ Parallel Algorithms for Batched Range Searching on Coarse-Grained Multicomputers Per-Olof Fjallstrom Department of Computer and Information Science Linkoping University Linkoping, Sweden
  • 2. Published on April 1, 1997 by Linkoping University Electronic Press 581 83 Linkoping, Sweden Linkoping Electronic Articles in Computer and Information Science ISSN 1401-9841 Series editor: Erik Sandewall c 1997 Per-Olof Fjallstrom Typeset by the author using LaTEX Formatted using etendu style Recommended citation: <Author>. <Title>. Linkoping Electronic Articles in Computer and Information Science, Vol. 2(1997): nr 3. http://www.ep.liu.se/ea/cis/1997/003/. April 1, 1997. This URL will also contain a link to the author's home page. The publishers will keep this article on-line on the Internet (or its possible replacement network in the future) for a period of 25 years from the date of publication, barring exceptional circumstances as described separately. The on-line availability of the article implies a permanent permission for anyone to read the article on-line, and to print out single copies of it for personal use. This permission can not be revoked by subsequent transfers of copyright. All other uses of the article, including for making copies for classroom use, are conditional on the consent of the copyright owner. The publication of the article on the date stated above included also the production of a limited number of copies on paper, which were archived in Swedish university libraries like all other written works published in Sweden. The publisher has taken technical and administrative measures to assure that the on-line version of the article will be permanently accessible using the URL stated above, unchanged, and permanently equal to the archived printed copies at least until the expiration of the publication period. 
For additional information about the Linkoping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/ or by conventional mail to the address stated above.
  • 3. Abstract We de ne the batched range-searching problem as follows: given a set S of n points and a set Q of m hyperrectangles, report for each hyperrectangle which points it contains. This problem has applications in, for example, computer-aided design and engi- neering. We present several parallel algorithms for this problem on coarse-grained multicomputers. Our algorithms are based on well-known average- and worst-case e cient sequential algo- rithms. One of our algorithms solves the d-dimensional batched range-searching problem in O(Ts(nlogd 1 p;p)+Ts(mlogd 1 p;p)+ ((m + n)logd 1(n=p) + mlogd 1p log(n=p) + k)=p) time on a p- processor coarse-grained multicomputer. (Ts(n;p) denotes the time globally to sort n numbers on a p-processor multicomputer, and k is the total number of reported points.) Keywords Parallel algorithms, coarse-grained multicomputers, range searching. The work presented here is funded by CENIIT (the Center for Industrial Information Technology) at Linkoping University.
  • 4. 1 1 Introduction In many applications, such as geographic information systems, com- puter-aided design and engineering, statistics, etc., we need to answer the following range-searching query: given a set S of n points, which points lie within a given hyperrectangle? (A hyperrectangle is the Cartesian product of intervals on distinct coordinate axes.) Usually, we need to answer many such queries for the same set of points. In some situations, we know the set of queries in advance. That is, we want to solve the following batched range-searching problem: given a set S of n points and a set Q of m hyperrectangles, report for each hyperrectangle which points it contains. For example, this is an important subproblem in computer simulation of deformation processes, such as vehicle collisions and mechanical forming processes. In such simulations nding all contacts between components of nite- element models of physical objectsis necessary. This can be simpli ed by approximating surface segments with hyperrectangles, and then determining which vertices these hyperrectangles contain 1, 2]. In this paper, we present parallel algorithms for batched range searching on coarse-grained multicomputers. A coarse-grained mul- ticomputer consists of several processors connected by an intercon- nection network. Each processor is fairly powerful, i.e., it delivers workstation-class performance. Since o -the-shelf hardware can be used, coarse-grained multicomputers are relatively inexpensive. Most commercially available parallel computers are of this type. Most of the research on parallel algorithms for geometric problems has focused on ne-grain parallel models of computation 3, 4, 5]. It is only during the last couple of years that researchers have designed parallel geometric algorithms for coarse-grained multicomputers 6, 7, 8, 9, 10, 11, 12, 13, 14]. In this model of computation we can assume that the size of each local memory is large. 
For example, it is common to assume that the size of each local memory is larger than the number of processors. This property allows the algorithm designer to balance communication latency with local computation time. Our parallel algorithms for batched range searching are based on well-known worst- and average-case e cient sequential algorithms. One of our algorithms is based the range-tree method, and solves the d-dimensional batched range-searching problem in O(Ts(nlogd 1 p;p)+ Ts(mlogd 1 p;p)+((m+n)logd 1 (n=p)+mlogd 1 plog(n=p)+k)=p) time on a p-processor coarse-grained multicomputer. (Ts(n;p) de- notes the time globally to sort n numbers on a p-processor multi- computer, and k is the total number of reported points.) We also give algorithms based on the cell method. This method has poor worst-case performance, but since it can be very e cient in practice, we believe that developing parallel algorithms based on this approach is important. Other researchers have developed parallel algorithms for range
  • 5. 2 searching on coarse-grained multicomputers. Devillers and Fabri 7] give an algorithm for the one-dimensional case. Recently, Fer- reira et al 14] present algorithms for the d-dimensional case. They construct a distributed range tree in time O(s=p + Ts(s;p)), where s = nlogd 1 n. They can then answer a set of m = O(n) range queries in time ((slogn + k)=p+ Ts(s;p)). We organize the rest of the paper as follows. In Section 2, we give additional information about coarse-grained multicomputers, and de- scribe some basic operations used by our algorithms. In Sections 3 and 4, we present parallel range-searching algorithms based on the range-tree and cell methods, respectively. 2 Model of Computation Coarse-grained multicomputers consist of a set of processors con- nected through an interconnection network. The number of proces- sors usually varies between 16 and 256. The memory is physically distributed over the processors, and interaction between processors is through message passing. Each processor can execute a di erent program independent of the other processors. However, it is com- mon to let each processor execute the same program asynchronously. That is, except a few global communication steps, processors execute the same program independently of each other. Common intercon- nection networks are 2D meshes (Paragon XP/S), 3D meshes (Cray T3E), hypercubes (nCUBE 2), and fat trees (CM-5). Our algorithms use a few basic and extensively studied communi- cation operations. We next describe these operations, and give their time complexities for a square 2D mesh with p processors, which are assumed to be indexed from 1 through p. For a detailed description and analysis of the operations, see Kumar et al 15]. Monotone routing: Each processor P(i) sends at most one m-word message. The destination address, d(i), of the message sent by P(i) is such that if both P(i) and P(i0), i < i0, send messages, then d(i) d(i0). 
The time complexity, Tmr(m;p;rmax), is O((rmax + m)pp), where rmax is the maximum number of words received by any pro- cessor. Segmented broadcast: Processors with indexes i1 < i2 ::: < iq, are selected; each processor P(ij) sends the same m-word message to all processors P(ij + 1) through P(ij+1 1). The time complexity, Tsb(m;p), is O(mpp). Multinode broadcast: Every processor sends the same m-word mes- sage to every other processor. The time complexity, Tmb(m;p), is O(mp). Total exchange: Every processor sends a distinct m-word message to every other processor. The time complexity, Tx(m;p), is O(mppp). Pre x sums and reduction: Let a1;a2;:::;an be a list of numbers
  • 6. 3 evenly distributed over the processors and let be an associative operator. The pre x sums operation computes si = a1 ai, and stores si in the same processor as ai. The time complexity, Tp(n;p), is O(n=p+pp). The reduction operation computes s = a1 an, and stores s in each processor. The time complexity, Tr(n;p), is O(n=p + pp). In the segmented versions of these operations, we apply them to sublists of a1;a2;:::;an. The time complexity is the same as for the ordinary operations. Global sort: Given a list a1;a2;:::;an of numbers evenly distributed over the processors, the global sort operations sorts the list, and re- turns it evenly distributed over the processors. The time complexity, Ts(n;p), is O(n(log(n=p)+ pp)=p). We end this section by showing how some of the above operations can be used to solve a simple data-copying problem. This is an im- portant subproblem in the algorithms to be presented in this paper. The data-copying problem is as follows. A set R of n equal-sized data records is evenly distributed over the processors of a p-processor mul- ticomputer. With each record r is associated a nonnegative integer n(r). The task is to create n(r) additional copies of each record r such that the work of creating the records is uniformly distributed over the processors. We do this as follows. 1. Let R0 = fr 2 R : n(r) > 0g, and let w = P r2R0 n(r). Decompose R0 into subsets R0(i), i = 1;2;:::;p, such thatPr2R0 (i) n(r) = bw=pc for i pdw=pe w, and Pr2R0 (i) n(r) = dw=pe otherwise. 2. For i = 1;2;:::;p, copy R0(i) to the processor P(i). Create the copies of the records in R0(i) in the processor P(i). Lemma 1 We can solve the data-copying problem in O(Tmr(n=p;p; (n + w)=p)+ (n + w)=p), time where w is the total number of copies and n p2. Proof. Regard R0 as an ordered set fr1;r2;:::;rmg. We begin Step 1 by computing the pre x sums s1;s2;:::;sm, where sk = Pk j=1 n(rj). 
To simplify the description of how to decompose R0 into subsets, we assume that w is an integer multiple of p. Extending our descrip- tion to the general case is easy. Let lk = bsk=(w=p)c. For each record rk, if lk 1 = lk, then we assign rk to the subset R0(lk 1 + 1). Otherwise, let dk = lk lk 1. Next, we create new records rk;j, j = 0;1;:::;dk, such that we (1) assign rk;0 to the subset R0(lk 1 +1) and set n(rk;0) = (lk 1 + 1)w=p sk 1, (2) assign rk;dk to the sub- set R0(lk + 1) and set n(rk;dk ) = sk lkw=p, and (3) assign rk;j, 0 < j < dk, to subset R0(lk 1 + 1 + j) and set n(rk;j) = w=p. Ob- serve that no subset contains more than dw=pe elements. In Step 2, we rst identify nonlocal subsets, i.e., subsets whose elements lie in several processors. To do this each processor sends the indexes
  • 7. 4 of the lowest- and highest-indexed subset that it contains to every other processor. Using monotone routing, we then copy the nonlo- cal subsets directly to their nal destinations. Some processors may completely contain one or more subsets. We handle this by copying all such subsets to the nal destination of the lowest-indexed subset in the processor. If a processor completely contains more than one subset, we then use segmented broadcast to transfer the subsets to their correct destinations. 2 3 A Worst-Case E cientAlgorithmfor Range Searching Let us again state the problem in which we are interested. The input consists of a set S of n points and a set Q of m hyperrectangles. The task is to report, for each hyperrectangle, which points it contains. In our development of parallel algorithms for this problem, we assume that initially each processor stores n=p points and m=p hyperrectan- gles, and that m and n are both greater than or equal to p2. The output consists of hyperrectangle-point pairs, that is, for each hyper- rectangle q and point p such that p is contained in q, the pair (q;p) is created. In this section we present a parallel algorithm inspired by the se- quential range-tree method 16]. This is a worst-casee cient method. We can use it to solve the d-dimensional, d 2, batched range- searching problem in time O(nlogd 1n+mlogdn+k), where k is the total number of reported points. For a set S of points in the plane, the corresponding range tree consists of a binary search tree on the x-coordinates of the points. That is, every node v represents an in- terval I(v) such that a leaf node represents the interval between two consecutive x-coordinates, and an interior node represents the union of the intervals of its children. (We call these intervals standard in- tervals.) With every node v is associated a y-sorted list Sy(v) of the points with x-coordinate within I(v). 
To determine which points are contained in a hyperrectangle q, partition the x-range of q into stan- dard intervals. More speci cally, interval I(v) is part of the partition if the x-range of q contains I(v) but not I(p(v)), where p(v) is the parent of node v. Then, for every interval I(v) in the partition, decide by a binary search which points in Sy(v) lie within the y-range of q. We can thus decompose a two-dimensional range-searching problem into a collection of one-dimensional range-searching problems. We give rst a parallel algorithm for the one-dimensional case. We then show how we can extend this algorithm to higher dimensions. The algorithm consists of three parts and the details are as follows. Part I: 1. Globally sort S into nondecreasing order by x-coordinate. Di- vide the sorted list into equal-sized sublists, S(i), i = 1;2;:::;p=2.
  • 8. 5 1 2 4 5 6 7 3 l(1) l(2) l(3) l(4) l(5) x Figure 1: The tree Tp for p = 8. Tp has p=2 leaf nodes and p 1 nodes in total. The given x-range is partitioned into the standard intervals corresponding to the circled nodes. It contains the intervals of leaves 2, 3 and 4, and intersects the interval of leaf 1. We index nodes from left to right, beginning with the leaves. (We assume that p is an integer power of two.) 2. For each sublist S(i), nd l(i), the smallest x-coordinate in the sublist (for sublist S(p=2) nd also l(p=2 + 1), the largest x- coordinate in S(p=2)). Broadcast the l-values to all processors. 3. In every processor build a binary search tree Tp on the l-values. Identify each node in Tp by a unique index in the range 1 through p 1. See Figure 1. Part II: 1. For each hyperrectangle q and leaf node i, if q's x-range inter- sects but does not contain I(i), create the node-hyperrectangle pair (i;q). 2. For each leaf node i, determine e(i), the number of node-hyper- rectangle pairs with node index i. Compute e = Pp=2 i=1 e(i). If e = 0, continue to Part III. 3. Globally sort the node-hyperrectangle pairs by node index. 4. For each leaf node i, compute p(i) = de(i)=d2e=pee and f(i) =Pi 1 j=1 p(j). If p(i) > 0, continue as follows. (a) Copy S(i) to the processors P(f(i)+1) through P(f(i)+ p(i)). (b) Divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i;j), j = 1;2;:::;p(i). Move Q(i;j) to the processor P(f(i)+ j). (c) Find k(i;j),the output size of the range-searching problem with input S(i)and Q(i;j). Compute k = Pp=2 i=1 Pp(i) j=1 k(i;j). If k = 0, continue to Part III.
  • 9. 6 5. Foreach subset Q(i;j),compute p(i;j) = bk(i;j)=(k0=p)c, where k0 = max(k;n), and f(i;j) = Pi 1 k=1 Pp(k) l=1 p(k;l)+ Pj 1 l=1 p(i;l). If p(i;j) = 0, solve the range-searching problem with input S(i) and Q(i;j) in the processor P(f(i) + j). Otherwise, continue as follows. (a) Copy S(i) and Q(i;j) to the processors P(f(i;j) + 1) through P(f(i;j)+ p(i;j)). (b) Divide Q(i;j)into the subsets Q0(i;j;l),l = 1;2;:::;p(i;j), such that P (i;q)2Q0 (i;j;l) k(i;q) is O(k0=p). (c) Solve the range-searching problem with input S(i) and Q0(i;j;l) in the processor P(f(i;j)+ l). Part III: 1. For each leaf node i, determine d(i), the number of hyper- rectangles whose x-range contains the interval I(i). Compute d = Pp=2 i=1 d(i). If d = 0, end the execution. 2. For each leaf node i, compute p(i) = dd(i)=d2d=pee and f(i) =Pi 1 j=1 p(j). Copy S(i) to the processors P(f(i) + 1) through P(f(i)+ p(i)). 3. If d > m, then do as follows. (a) Copy the hyperrectangles in each processor to every other processor. (b) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i;q). (c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i;j), j = 1;2;:::;p(i). 4. If d m, then do as follows. (a) For each hyperrectangle q and leaf node i, if q's x-range contains I(i), create the node-hyperrectangle pair (i;q). (b) Globally sort the pairs by node index. (c) For each leaf node i, divide the node-hyperrectangle pairs with node index i into equal-sized subsets Q(i;j), j = 1;2;:::;p(i). Move Q(i;j) to the processor P(f(i)+ j). 5. For each pair (i;q) 2 Q(i;j) and each point p 2 S(i), create the pair (q;p). Theorem 1 We can solve the one-dimensional range-searching prob- lem in time O(Ts(n;p)+ Ts(m;p)+ (mlog(n=p)+ k)=p).
  • 10. 7 Proof. In Step 1 of PartII, we use binary search to decide which pairs to create. Since each hyperrectangle intersects at most two intervals, we create the corresponding pairs locally. In Step 2, each proces- sor rst determines how many of its node-hyperrectangle pairs have node index i, i = 1;2;:::;p=2. After a total exchange operation, the processor P(i) computes e(i). P(i) then broadcasts e(i) to every pro- cessor. Step 4(a) is done by monotone routing followed by segmented broadcasting. In Step 4(b), a segmented pre x sums operation deter- mines the rank of each pair (i;q) among the pairs with node index i. The rank decides to which subset Q(i;j) that (i;q) belongs. We then move Q(i;j) to the processor P(f(i)+ j) using techniques similar to those used in Step 2 of the algorithm for the data-copying problem (Section 2). In Steps 4(c) and 5(c), we use binary search. In total, Parts I and II take O(Ts(n;p)+Ts(m;p)+(mlog(n=p)+k)=p) time. In Step 1 of Part III, each processor rst determines, for each leaf node i, how many of its hyperrectangles contain the interval I(i). This is done as follows. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)) (where p(i) is the parent of node i), increment a counter associated with node i. To compute, for each leaf node i, how many local hyperrectangles contain I(i), sum the counters associated with nodes along the path from i to the root of Tp. Then continue as in Step 2 of Part II. In Step 3(b), each processor that has received a copy of S(i)decides which hyperrectangles contain the interval I(i), and creates the corresponding pairs. In Step 4(a), we create the pairs using a modi ed version of the algorithm for the data-copying problem (Section 2). Since we assume in Step 4 that d m, the total time complexity of Step 4 is O(Ts(m;p)). Step 5, nally, takes O(dn=p2) time. 
Since we assume that n p2, if d > m, the time complexity of Step 5 asymptotically exceeds the time complexities of Steps 1 and 3. The total time for Part III is thus O(Tsb(n=p;p)+ Ts(m;p)+ dn=p2). 2 Giving an algorithm for the two-dimensional case is now easy. It too consists of three parts, where Parts I and II are essentially the same as above. In Part II, we use the batched range-searching algo- rithm of Edelsbrunner and Overmars 17]. They give a divide-and- conquer algorithm for batched range searching that runs in O((m+ n)logd 1 n + mlogm+ k) time and uses O(m+ n) space. It is only Part III that deviates signi cantly from the one-dimensional case. The details of Part III are now as follows. Part III: 1. For each hyperrectangle q and node i, if q's x-range contains I(i) but not I(p(i)), create the node-hyperrectangle pair (i;q). 2. Foreach node i, determine c(i),the number of node-hyperrectangle pairs with node index i. Compute c = Pp 1 i=1 c(i). If c = 0, end the execution.
3. For each point p and each node i such that p's x-coordinate is contained in I(i) and c(i) > 0, create the node-point pair (i, p).

4. Solve the one-dimensional range-searching problem with input consisting of the node-point pairs and the node-hyperrectangle pairs. That is, for each node-hyperrectangle pair (i, q), find the node-point pairs (i, p) such that p is contained in q's y-range.

Theorem 2. We can solve the two-dimensional range-searching problem in time O(T_s(n log p, p) + T_s(m log p, p) + (m log p log(n/p) + k)/p).

Proof. Part I is exactly as in the one-dimensional case. Part II is the same as in the one-dimensional case, except for Steps 4(c) and 5(c), which now use the batched range-searching algorithm of Edelsbrunner and Overmars. (In Step 4(c), we modify this algorithm to compute only how many points are contained in each hyperrectangle.) Together, Parts I and II take O(T_s(n, p) + T_s(m, p) + ((m + n) log(n/p) + k)/p) time.

In Step 1 of Part III, it takes O(log p) time for each hyperrectangle q to find all nodes i in T_p such that q's x-range contains I(i) but not I(p(i)). Step 2 is similar to Step 2 of Part II. Step 3 takes O(n log p / p) time. In Step 4, we apply our one-dimensional range-searching algorithm to the node-hyperrectangle and node-point pairs created in Steps 1 and 3. Our one-dimensional algorithm assumes that the input is evenly distributed over the processors, and that the number of points and the number of hyperrectangles are both at least p². These assumptions are not necessarily satisfied by the node-hyperrectangle and node-point pairs. We can easily remedy this by adding dummy input as follows. Each processor counts how many node-point pairs it stores. By a reduction operation, we then find n_max, the maximum number of such pairs contained in any processor. Finally, each processor adds dummy pairs until it has exactly max(n_max, p) pairs. The same approach is used for the node-hyperrectangle pairs.
Since no processor stores more than O(n log p / p) node-point pairs and O(m log p / p) node-hyperrectangle pairs, it takes O(T_r(p, p) + (m + n) log p / p) time to add the dummy input. □

Generalizing the above approach to higher dimensions is straightforward, and we can easily derive the following result.

Theorem 3. We can solve the d-dimensional range-searching problem in time O(T_s(n log^{d-1} p, p) + T_s(m log^{d-1} p, p) + ((m + n) log^{d-1}(n/p) + m log^{d-1} p log(n/p) + k)/p).

4 Average-Case Efficient Algorithms for Range Searching

In this section, we present parallel algorithms for range searching that are based on the cell method [18]. In its simplest version, this
method is as follows. First, find the smallest hyperrectangle B that contains the set S. Divide B into equal-sized hyperrectangular cells, and record for each cell which points it contains. We call the resulting data structure a cell directory. To decide which points a hyperrectangle q contains, do as follows: for each cell intersected by q, access the corresponding entry in the cell directory and test, for each point contained in the cell, whether it is included within q. It is common to divide B into O(n) cells, in which case we can build the cell directory (e.g., a multidimensional array of pointers) in O(n) time. The total cost of solving the batched range-searching problem is then O(m + n + s + t), where s and t denote the total number of cell accesses and point inclusion tests, respectively. (The time complexity increases linearly with the dimension d of the problem. In this paper, we assume that d is a small constant.)

As already mentioned, the worst-case performance of this method is poor: we can easily create an input for which s + t is Ω(mn), although the output size k is zero. However, in many applications the cell method may outperform more sophisticated methods. For example, in an experimental evaluation of methods for range searching [2], we found it to be much faster than the range-tree method. This is due to its relative simplicity (small constants of proportionality). Moreover, one can show that if the points are evenly distributed in space and the shape of the query hyperrectangles is similar to the shape of the cells, then s + t is O(k). Finding an efficient parallelization of the cell method as described above is therefore important.

Algorithm I: Our first algorithm is based on the assumption that it is possible to store a copy of S and Q in each processor. The first step of the algorithm achieves this by multinode broadcasting. Then, each processor executes the sequential cell method.
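To make the sequential building block concrete, the cell method described above can be sketched as follows in two dimensions. This is a minimal sketch: the function names, the choice of a g × g grid with g ≈ √n, and the returned (s, t) counters are illustrative assumptions, not details given in the paper.

```python
# Minimal 2D sketch of the sequential cell method (illustrative names).
import math
from collections import defaultdict

def build_cell_directory(points):
    """Divide the bounding box B of the points into ~n equal-sized
    rectangular cells and record which points fall into each cell."""
    n = len(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    g = max(1, int(math.sqrt(n)))           # g x g grid, i.e. O(n) cells
    dx = (xmax - xmin) / g or 1.0           # avoid zero cell width
    dy = (ymax - ymin) / g or 1.0
    cells = defaultdict(list)
    for (x, y) in points:
        i = min(g - 1, int((x - xmin) / dx))
        j = min(g - 1, int((y - ymin) / dy))
        cells[(i, j)].append((x, y))
    return cells, (xmin, ymin, dx, dy, g)

def range_search(directory, query):
    """Report the points inside the axis-aligned rectangle
    query = (qx1, qy1, qx2, qy2); also count the cell accesses s
    and point inclusion tests t performed."""
    cells, (xmin, ymin, dx, dy, g) = directory
    qx1, qy1, qx2, qy2 = query
    i1 = max(0, min(g - 1, int((qx1 - xmin) / dx)))
    i2 = max(0, min(g - 1, int((qx2 - xmin) / dx)))
    j1 = max(0, min(g - 1, int((qy1 - ymin) / dy)))
    j2 = max(0, min(g - 1, int((qy2 - ymin) / dy)))
    out, s, t = [], 0, 0
    for i in range(i1, i2 + 1):
        for j in range(j1, j2 + 1):
            s += 1                          # one cell access
            for (x, y) in cells.get((i, j), []):
                t += 1                      # one point inclusion test
                if qx1 <= x <= qx2 and qy1 <= y <= qy2:
                    out.append((x, y))
    return out, s, t
```

Besides the answer, the sketch returns the per-query counts of cell accesses and inclusion tests; summing s(q) + t(q) over the queries gives exactly the load measure r(q) that the load-balancing step of Algorithm I below partitions Q by.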
To load-balance the computations, we divide Q into subsets for which the total number of cell accesses and point inclusion tests is about the same.

1. Copy the points and hyperrectangles in each processor to every other processor.

2. Locally build a cell directory for S. That is, compute B, the smallest hyperrectangle containing S. Divide B into O(n) equal-sized hyperrectangular cells, and record, for each cell, which points it contains.

3. For each hyperrectangle q, find s(q), the number of cells it intersects. Compute s = Σ_{q∈Q} s(q).

4. For each hyperrectangle q, find t(q), the total number of points contained in the cells intersected by q. Compute t = Σ_{q∈Q} t(q).

5. For each hyperrectangle q, let r(q) = s(q) + t(q). Divide Q into subsets Q(j), j = 1, 2, ..., p, such that Σ_{q∈Q(j)} r(q) = O(max(n, (s + t)/p)).
[Figure 2: Example with sixteen points and one hyperrectangle. The number beside each grid vertex is the number of points dominated by the vertex. The shaded region is the block of cells intersected by the hyperrectangle. The number of point inclusion tests is 12 + 0 − 4 − 0 = 8.]

6. Solve the range-searching problem with input S and Q(j) in the processor P(j).

Theorem 4. Algorithm I solves the batched range-searching problem in time O(T_mb((m + n)/p, p) + m + n + (s + t)/p), where m, n ≥ p.

Proof. Steps 1 through 3 take O(T_mb((m + n)/p, p) + m + n) time. In Step 4, we first compute, for each grid vertex v, how many points in S it dominates, that is, how many points lie in v's southwest quadrant. This can be done in O(n) time. Let d(v) denote the number of points dominated by the grid vertex v. Then t(q) = d(v_NE) + d(v_SW) − d(v_NW) − d(v_SE), where v_NE, v_SW, v_NW and v_SE denote the northeast, southwest, northwest and southeast vertices, respectively, of the block of cells intersected by the hyperrectangle q; see Figure 2. Thus, Step 4 takes O(n + m) time in total. In Step 5, the partitioning of Q into subsets can easily be done in O(m) time. Finally, Step 6 takes O(max(n, (s + t)/p)) time. □

If we can store a copy of S and Q on each processor, and if s + t is large compared with m + n, then this algorithm can be quite efficient. However, in many applications storing a copy of the input on each processor would be impossible. This suggests that we should investigate alternative parallelizations of the cell method.

Algorithm II: Briefly, this algorithm is as follows. For each nonempty cell, we create a list of the points it contains. For each intersected cell, we create a list of (copies of) the hyperrectangles that intersect
it. For each cell that is both nonempty and intersected, we then combine the two lists, that is, we do the corresponding point inclusion tests.

1. Compute B, the smallest hyperrectangle containing S. Divide B into O(n) equal-sized hyperrectangular cells. Identify each cell by a unique index.

2. Decide for each point p in which cell i it is contained. Create the cell-point pair (i, p).

3. For each hyperrectangle q and cell i such that q intersects cell i, create the cell-hyperrectangle pair (i, q). Compute s, the total number of cell-hyperrectangle pairs.

4. Globally sort the cell-point and cell-hyperrectangle pairs with respect to cell indexes. When comparing a cell-point pair and a cell-hyperrectangle pair with the same index, let the latter pair win. If cell i is both nonempty and intersected, there is now a list of cell-point pairs with index i (denoted pl(i)), followed by a list of cell-hyperrectangle pairs with index i (denoted hl(i)). It remains to test each point in pl(i) for inclusion within each hyperrectangle in hl(i).

5. Let n(i) and m(i) denote the lengths of pl(i) and hl(i), respectively, and let t(i) = n(i)m(i). Let A be the set of nonempty and intersected cells. For each cell i ∈ A, compute n(i), m(i) and t(i). Compute t = Σ_{i∈A} t(i).

6. Let I = {i ∈ A : t(i) < ⌈t_0/p⌉}, where t_0 = max(t, n + s).

   (a) For each cell i ∈ I, gather pl(i) and hl(i) into the lowest-indexed processor that contains elements of pl(i).

   (b) Let I(j) = {i ∈ I : pl(i) and hl(i) are in the processor P(j)}. For j = 1, 2, ..., p, compute t̂(j) = Σ_{i∈I(j)} t(i), p(j) = ⌊t̂(j)/⌈t_0/p⌉⌋ and f(j) = Σ_{k=1}^{j-1} p(k).

   (c) For j = 1, 2, ..., p, if p(j) = 0, then do the point inclusion tests corresponding to I(j) in the processor P(j). Otherwise, copy pl(i) and hl(i), i ∈ I(j), to the processors P(f(j)+1) through P(f(j)+p(j)). Decompose I(j) into subsets I(j, k), k = 1, 2, ..., p(j), such that Σ_{i∈I(j,k)} t(i) is O(t_0/p).
Do the point inclusion tests corresponding to I(j, k) in the processor P(f(j) + k).

7. Let E = A \ I. For each cell i ∈ E, do as follows.

   (a) Compute p(i) = ⌊t(i)/⌈t_0/p⌉⌋ and f(i) = Σ_{k∈E, k<i} p(k).
   (b) Divide the longer of pl(i) and hl(i) into equal-sized sublists ll(i, j), j = 1, 2, ..., p(i). Move ll(i, j) to the processor P(f(i) + j).

   (c) Create a copy of the shorter of pl(i) and hl(i) in each of the processors P(f(i)+1) through P(f(i)+p(i)). Do the corresponding point inclusion tests.

Theorem 5. Algorithm II solves the batched range-searching problem in O(T_s(n + s, p) + T_sb(m/p, p) + T_sb(t/p, p) + (m + n + s + t)/p) time, where m, n ≥ p².

Proof. Steps 1 and 2 together take O(T_r(n, p) + n/p) time. In Step 3, we can create the cell-hyperrectangle pairs by slightly modifying the algorithm for the data-copying problem (Section 2). By Lemma 1, this takes O(T_mr(m/p, p, (m + s)/p) + (m + s)/p) time. The global sort in Step 4 takes O(T_s(n + s, p)) time. In Step 5, we compute n(i) and m(i) by segmented reduction in O(T_r(n + s, p)) time. We can identify all cells that are both nonempty and intersected in O((n + s)/p + T_mr(1, p, 1)) time. We then compute t(i) and t in O(T_r(n + s, p)) time. Thus, in total, Step 5 takes O(T_r(n + s, p)) time.

Step 6(a) takes O(T_mr((n + s)/p, p, t_0/p)) time. This follows from the fact that, if cell i ∈ I, then m(i) + n(i) ≤ ⌈t_0/p⌉. Thus, no processor receives more than ⌈t_0/p⌉ pairs. In Step 6(b), we compute and broadcast t̂(j) to every processor in O((n + s)/p + T_mb(1, p)) time. In Step 6(c), decomposing I(j) into subsets of cost O(t_0/p) can easily be done in O(t_0/p) time. Step 6(c) takes O(T_mr(t_0/p, p, t_0/p) + T_sb(t_0/p, p) + t_0/p) time.

Step 7(a) is similar to Step 6(b). To describe Step 7(b), we assume that the list pl(i) is longer than hl(i). We divide pl(i) into p(i) sublists such that p(i)⌈n(i)/p(i)⌉ − n(i) sublists have length ⌊n(i)/p(i)⌋, whereas the remaining sublists have length ⌈n(i)/p(i)⌉. We use a segmented prefix sums computation to decide, for each list element, to which sublist it belongs. To move each sublist to its selected processor, we use the same techniques as in Step 2 of the algorithm for the data-copying problem (Section 2).
Since no sublist has more than O(t_0/p) elements, the total time for Step 7(b) is O(T_p(n + s, p) + T_mb(1, p) + T_mr((n + s)/p, p, t_0/p) + T_mr((n + s)/p, p, (n + s)/p) + T_sb((n + s)/p, p)). In Step 7(c), we copy the shorter list to the selected processors by monotone routing followed by segmented broadcasting. The length of a shortest list cannot exceed √t_0. Thus, the total time for Step 7(c) is O(T_mr((n + s)/p, p, √t_0) + T_sb(√t_0, p) + t_0/p). Since we assume that n ≥ p², it follows that √t_0 ≤ t_0/p. □

References

[1] Z-H. Zhong. Finite Element Procedures for Contact-Impact Problems. Oxford University Press, 1993.
[2] P-O. Fjallstrom, J. Petersson, L. Nilsson, and Z-H. Zhong. Evaluation of range searching methods for contact searching in mechanical engineering. To appear in International Journal of Computational Geometry & Applications.

[3] A. Aggarwal, B. Chazelle, L. Guibas, and C. O'Dunlaing. Parallel computational geometry. Algorithmica, 3:293–327, 1988.

[4] M.J. Atallah. Parallel techniques for computational geometry. Proc. IEEE, 80(9):1435–1448, 1992.

[5] S.G. Akl and K.A. Lyons. Parallel Computational Geometry. Prentice-Hall, 1993.

[6] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. 9th Annual ACM Symposium on Computational Geometry, pages 298–307, 1993.

[7] O. Devillers and A. Fabri. Scalable algorithms for bichromatic line segment intersection problems on coarse grained multicomputers. In Algorithms and Data Structures. Third Workshop, WADS'93, pages 277–288, 1993.

[8] X. Deng. A convex hull algorithm on coarse grained multiprocessors. In Proc. 5th Annual International Symposium on Algorithms and Computation (ISAAC 94), pages 634–642, 1994.

[9] F. Dehne, C. Kenyon, and A. Fabri. Scalable and architecture independent parallel geometric algorithms with high probability optimal time. In Proc. 6th IEEE Symposium on Parallel and Distributed Processing (SPDP), pages 586–593, 1994.

[10] F. Dehne, X. Deng, P. Dymond, A. Fabri, and A.A. Khokhar. A randomized parallel 3D convex hull algorithm for coarse grained multicomputers. In Proc. 7th ACM Symposium on Parallel Algorithms and Architectures, pages 27–33, 1995.

[11] I. Al-furaih, S. Aluru, S. Goil, and S. Ranka. Parallel construction of multidimensional binary search trees. In Proc. International Conference on Supercomputing (ICS'96), 1996.

[12] P-O. Fjallstrom. Parallel algorithms for geometric problems on coarse grained multicomputers. Technical Report LiTH-IDA-R-96-38, Dep. of Computer and Information Science, Linkoping University, 1996.

[13] P-O. Fjallstrom. Parallel interval-cover algorithms for coarse grained multicomputers. Technical Report LiTH-IDA-R-96-39, Dep. of Computer and Information Science, Linkoping University, 1996.
[14] A. Ferreira, C. Kenyon, A. Rau-Chaplin, and S. Ubeda. d-Dimensional range search on multicomputers. Technical Report 96-23, Laboratoire de l'Informatique du Parallelisme, Ecole Normale Superieure de Lyon, 1996.

[15] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., 1994.

[16] J.L. Bentley. Decomposable searching problems. Information Processing Letters, 8(5):244–251, 1979.

[17] H. Edelsbrunner and M.H. Overmars. Batched dynamic solutions to decomposable searching problems. Journal of Algorithms, 6:515–542, 1985.

[18] J.L. Bentley and J.H. Friedman. Data structures for range searching. Computing Surveys, 11:397–409, 1979.