1 Introduction
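A metric space is a nonempty set $M$ endowed with a metric, i.e., a function $d\colon M\times M\to[0,\infty)$ such that
• $d(x,y)=0$ if and only if $x=y$ (identity of indiscernibles),
• $d(x,y)=d(y,x)$ (symmetry), and
• $d(x,z)\le d(x,y)+d(y,z)$ (triangle inequality)
for all $x$, $y$, $z\in M$ [13].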
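For all $n\in\mathbb{Z}^+$, define $[n]=\{1,2,\ldots,n\}$.
Given $n\in\mathbb{Z}^+$ and oracle access to a metric $d\colon[n]\times[n]\to[0,\infty)$, metric 1-median asks for $\mathop{\mathrm{argmin}}_{y\in[n]}\sum_{x\in[n]}d(y,x)$, breaking ties arbitrarily.
It generalizes the classical median selection on the real line and has a brute-force $\Theta(n^2)$-time algorithm.
More generally, metric $k$-median asks for $c_1$, $c_2$, $\ldots$, $c_k\in[n]$ minimizing $\sum_{x\in[n]}\min_{i\in[k]}d(x,c_i)$.
Because $d$ defines $\Theta(n^2)$ nonzero distances, only $o(n^2)$-time algorithms are said to run in sublinear time [8].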
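For all $\alpha\ge 1$, an $\alpha$-approximate 1-median is a point $p\in[n]$ satisfying
\[ \sum_{x\in[n]} d(p,x)\le\alpha\cdot\min_{y\in[n]}\sum_{x\in[n]} d(y,x). \]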
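The definitions above can be made concrete with a minimal Python sketch of the brute-force algorithm and of the $\alpha$-approximation test. It is an illustration only: the names `total_distance`, `brute_force_1_median` and `is_alpha_approximate` are hypothetical, points are indexed from 0 rather than 1, and the distance oracle is modeled as an ordinary function.

```python
from typing import Callable

Metric = Callable[[int, int], float]  # stand-in for oracle access to d

def total_distance(d: Metric, n: int, y: int) -> float:
    """Sum of the distances from point y to all points 0..n-1."""
    return sum(d(y, x) for x in range(n))

def brute_force_1_median(d: Metric, n: int) -> int:
    """Return a point of minimum total distance using Theta(n^2) queries."""
    return min(range(n), key=lambda y: total_distance(d, n, y))

def is_alpha_approximate(d: Metric, n: int, p: int, alpha: float) -> bool:
    """Check the defining inequality of an alpha-approximate 1-median."""
    opt = total_distance(d, n, brute_force_1_median(d, n))
    return total_distance(d, n, p) <= alpha * opt

# Example metric: five points on the real line.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
line_metric = lambda i, j: abs(points[i] - points[j])
median = brute_force_1_median(line_metric, len(points))
```

On the real line the brute-force answer coincides with the classical median, here the point at coordinate 2.0.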
For all $\epsilon>0$, metric 1-median has a Monte Carlo $(1+\epsilon)$-approximation $O(n/\epsilon^2)$-time algorithm [8, 9].
Guha et al. [7] show that metric $k$-median has a Monte Carlo, -approximation, -time, -space and one-pass algorithm for all small as well as a deterministic, -approximation, -time, -space and one-pass algorithm.
Given points in with , the Monte Carlo algorithms of Kumar et al. [10] find a -approximate 1-median in time and a -approximate solution to metric $k$-median in time.
All randomized -approximation algorithms for metric $k$-median take time [11, 7].
Chang [2] shows that metric 1-median has a deterministic, -approximation, -time and nonadaptive algorithm for all constants , generalizing the results of Chang [1] and Wu [15].
On the other hand, he disproves the existence of deterministic -approximation -time algorithms for all constants and [3, 4].
In social network analysis, the closeness centrality of a point $v$ is the reciprocal of the average distance from $v$ to all points [14].
So metric 1-median asks for a point with the maximum closeness centrality.
Given oracle access to a graph metric, the Monte Carlo algorithms of Goldreich and Ron [6] and Eppstein and Wang [5] estimate the closeness centrality of a given point and those of all points, respectively.
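The link between closeness centrality and metric 1-median can be checked with a short sketch. The function names below are hypothetical, points are 0-indexed, and the computation is exact rather than sampled, so it does not reproduce the estimation algorithms cited above.

```python
def closeness_centrality(d, n, v):
    """Reciprocal of the average distance from v to all n points (needs n >= 2)."""
    average = sum(d(v, x) for x in range(n)) / n
    return 1.0 / average

def most_central_point(d, n):
    """A point of maximum closeness centrality, i.e., a 1-median."""
    return max(range(n), key=lambda v: closeness_centrality(d, n, v))

# Shortest-path metric of a path graph on 5 vertices: d(i, j) = |i - j|.
path_metric = lambda i, j: abs(i - j)
center = most_central_point(path_metric, 5)
```

Maximizing the reciprocal of the average distance is the same as minimizing the sum of distances, so `center` is the middle vertex of the path.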
All known sublinear-time algorithms for metric 1-median are either deterministic or Monte Carlo, the latter having a positive probability of failure.
For example, Indyk's Monte Carlo $(1+\epsilon)$-approximation algorithm outputs with a positive probability a solution without approximation guarantees.
In contrast, we show that metric 1-median has a randomized algorithm that always outputs a $(2+\epsilon)$-approximate solution in expected $O(n/\epsilon^2)$ time for all constants $\epsilon>0$.
So, excluding the known deterministic algorithms (which are Las Vegas only in the degenerate sense), this paper gives the first Las Vegas approximation algorithm for metric 1-median with an expected sublinear running time.
Note that deterministic sublinear-time algorithms for metric 1-median can be 4-approximate but not $(4-\epsilon)$-approximate for any constant $\epsilon>0$ [1, 4].
So our approximation ratio of $2+\epsilon$ beats that of any deterministic sublinear-time algorithm.
Inheriting Indyk's algorithm, our algorithm outputs a $(1+\epsilon)$-approximate 1-median in $O(n/\epsilon^2)$ time with probability $\Omega(1)$ for all constants $\epsilon>0$.
Below is our high-level and inaccurate sketch of proof, where , are small constants:
(i) Run Indyk's algorithm to find a probably $(1+\epsilon)$-approximate 1-median, $z$. Then let $r$ be the average distance from $z$ to all points.
(ii) For all $R>0$, denote by $B(z,R)$ the open ball with center $z$ and radius $R$. Use the triangle inequality (with details omitted here) to show $z$ to be a solution no worse than the points in , i.e.,
(1)
(iii) Take a uniformly random bijection . Then observe that
(2)
(3)
where the first (resp., second) inequality follows from the injectivity of (resp., the triangle inequality).
(iv) Assume for simplicity. So by inequalities (1)–(3), if the following inequality holds, then it serves as a witness that $z$ is $(2+\epsilon)$-approximate:
(4)
To guarantee outputting a $(2+\epsilon)$-approximate 1-median, output $z$ only when inequality (4) holds. Restart from item (i) whenever inequality (4) is false.
More details of item (iv) follow:
For a 1-median of , it will be easy to show
(5)
When $z$ in item (i) is indeed $(1+\epsilon)$-approximate,
(6)
Assuming , inequalities (5)–(6) make inequality (4) hold with high probability as long as is highly concentrated around its expectation.
The need for such concentration is why we restrict the radius of the codomain of to be in item (iii): large distances ruin concentration bounds.
To accommodate the points in , our witness for the approximation ratio of $z$ actually differs slightly from inequality (4), unlike in item (iv).
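The restart-until-witnessed idea of items (i)–(iv) can be sketched in Python under strong simplifying assumptions. This is not the paper's Las Vegas median: the candidate is picked by a naive stand-in instead of Indyk's algorithm, the random bijection ranges over all points rather than over a ball, and the witness is the hypothetical test "total distance of $z$ is at most $(2+\epsilon)$ times half the permutation cost". All function and parameter names are invented for illustration.

```python
import random

def total_distance(d, n, y):
    """Sum of the distances from y to all points 0..n-1."""
    return sum(d(y, x) for x in range(n))

def permutation_lower_bound(d, n, rng):
    """Half the cost of a uniformly random permutation pi of the points.

    For every point y, d(u, pi(u)) <= d(u, y) + d(y, pi(u)); summing over u
    gives sum_u d(u, pi(u)) <= 2 * sum_x d(y, x). So the returned value
    never exceeds the optimal total distance.
    """
    pi = list(range(n))
    rng.shuffle(pi)
    return 0.5 * sum(d(u, pi[u]) for u in range(n))

def las_vegas_median_sketch(d, n, eps, rng, sample_size=8, max_iters=50):
    for _ in range(max_iters):
        # Assumption: best of a few random candidates replaces Indyk's algorithm.
        candidates = [rng.randrange(n) for _ in range(sample_size)]
        z = min(candidates, key=lambda y: total_distance(d, n, y))
        # Output z only when the witness certifies a (2 + eps) ratio.
        if total_distance(d, n, z) <= (2 + eps) * permutation_lower_bound(d, n, rng):
            return z
    # Fall back to brute force so that the sketch always terminates.
    return min(range(n), key=lambda y: total_distance(d, n, y))
```

Whatever the random choices, the returned point is $(2+\epsilon)$-approximate: either the witness held, or the brute-force fallback returned an exact 1-median. Only the running time is random, which is the defining property of a Las Vegas algorithm.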
2 Definitions and preliminaries
For a metric space $(M,d)$, $x\in M$ and $R>0$, define
\[ B(x,R)=\{y\in M \mid d(x,y)<R\} \]
to be the open ball with center $x$ and radius $R$.
For brevity,
|
|
|
The
pairs in are ordered.
An algorithm $A$ with oracle access to $d$ is denoted by $A^d$ and may query $d$ on any $(x,y)\in[n]\times[n]$ for $d(x,y)$.
In this paper, all Landau symbols (such as $O(\cdot)$, $o(\cdot)$ and $\Omega(\cdot)$) are w.r.t. $n$.
The following result is due to Indyk.
Fact 1 ([8, 9]).
For all $\epsilon>0$, metric 1-median has a Monte Carlo $(1+\epsilon)$-approximation $O(n/\epsilon^2)$-time algorithm with a failure probability of at most .
Henceforth, denote Indyk's algorithm in Fact 1 by Indyk median.
It is given $n\in\mathbb{Z}^+$, $\epsilon>0$ and oracle access to a metric $d\colon[n]\times[n]\to[0,\infty)$.
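By convention, denote the expected value and the variance of a random variable $X$ by $\mathrm{E}[X]$ and $\mathrm{var}(X)$, respectively.
Chebyshev's inequality ([12]).
Let $X$ be a random variable with a finite expected value and a finite nonzero variance.
Then for all $k>0$,
\[ \Pr\left[\,\left|X-\mathrm{E}[X]\right|\ge k\sqrt{\mathrm{var}(X)}\,\right]\le\frac{1}{k^2}. \]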
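As a sanity check of Chebyshev's inequality, the following sketch computes both sides exactly for a random variable that is uniform over a finite list of values. The helper `chebyshev_check` is a hypothetical name, and rational arithmetic is used so that no rounding enters the comparison.

```python
from fractions import Fraction

def chebyshev_check(values, k):
    """For X uniform on `values`, return (Pr[|X - E[X]| >= k*sigma], 1/k^2)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    var = sum((Fraction(v) - mean) ** 2 for v in values) / n
    # Compare squared deviations with k^2 * var to avoid taking square roots.
    hits = sum(1 for v in values if (Fraction(v) - mean) ** 2 >= k * k * var)
    return Fraction(hits, n), 1 / Fraction(k) ** 2

# A fair six-sided die with k = 6/5.
tail, bound = chebyshev_check([1, 2, 3, 4, 5, 6], Fraction(6, 5))
```

For the die, the tail probability is 1/3 (only the outcomes 1 and 6 deviate enough), which is below the Chebyshev bound of 25/36.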
4 Probability of termination in any iteration
This section analyzes the probability of running line 7 in any particular iteration of the while loop of Las Vegas median.
The following lemma uses an easy averaging argument.
Lemma 5.
|
|
|
and, therefore,
|
|
|
Proof.
Clearly,
|
|
|
Then use line 4 of Las Vegas median. ∎
Henceforth, assume without loss of generality; otherwise, find a 1-median by brute force.
So by Lemma 5.
Define
|
|
|
(9) |
to
be the average distance
in .
Lemma 6.
.
Proof.
By
equationΒ (9) and
the triangle inequality,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Obviously,
the average distance from to the points in
is at most
that from to all points,
i.e.,
|
|
|
(11) |
Inequalities (10)–(11) and line 4 of Las Vegas median complete the proof. ∎
To analyze the probability
that
the condition in lineΒ 6
of Las Vegas median
holds,
we
shall
derive a concentration bound
for
|
|
|
whose
expected value and
variance
are
examined
in the
next
four lemmas.
Lemma 7.
With expectations taken over ,
|
|
|
(12) |
Proof.
For each ,
is a uniformly random
size- subset of
by lineΒ 5 of Las Vegas median.
Therefore,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(14) |
where the second (resp., last) equality follows from the identity of indiscernibles (resp., equation (9) and Lemma 5).
Finally, use equations (13)–(14), the linearity of expectation and Lemma 5. ∎
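The kind of expectation handled in the proof above rests on a simple fact: under a uniformly random bijection $\pi$ of a finite set $S$, each image $\pi(u)$ is uniformly distributed over $S$, so $\mathrm{E}[\sum_{u\in S}d(u,\pi(u))]=\frac{1}{|S|}\sum_{u,v\in S}d(u,v)$. The sketch below verifies this identity by exhaustive enumeration on a tiny set. It illustrates the fact only and does not reproduce the statement of Lemma 7; the function names and the symbol $S$ are assumptions.

```python
from fractions import Fraction
from itertools import permutations

def expected_matching_cost(points, d):
    """Exact E[sum_u d(u, pi(u))] over all bijections pi of `points`."""
    perms = list(permutations(points))
    total = sum(sum(d(u, v) for u, v in zip(points, perm)) for perm in perms)
    return Fraction(total, len(perms))

def average_pair_cost(points, d):
    """(1/|S|) times the sum of d(u, v) over all ordered pairs (u, v)."""
    return Fraction(sum(d(u, v) for u in points for v in points), len(points))

# Four points on the integer line with d(u, v) = |u - v|.
S = [0, 1, 3, 7]
dist = lambda u, v: abs(u - v)
```

Enumeration costs $|S|!$ evaluations, so this check is meant for very small sets only.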
Clearly,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(16) |
where the last equality follows from the linearity of expectation
and the separation of pairs
according to whether .
Lemma 8.
With expectations taken over ,
|
|
|
Proof.
Pick any distinct , .
By lineΒ 5 of Las Vegas median,
|
|
|
is a uniformly random size- subset of .
So
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Clearly,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
In summary,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Together with
LemmaΒ 5
and equationΒ (9),
this completes
the proof.
β
Lemma 9.
With expectations taken over ,
|
|
|
(17) |
Proof.
By lineΒ 5 of Las Vegas median,
is a uniformly random size- subset of
for each
.
Therefore,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
For all , ,
|
|
|
(19) |
where the first inequality follows from the triangle inequality.
By equations (9) and (18)–(19),
the left-hand side of inequality (17)
cannot exceed the optimal value of the following problem, called max square sum:
Find
for all , to maximize
|
|
|
(20) |
subject to
|
|
|
(21) |
|
|
|
(22) |
Above, constraint (21) (resp., (22)) mimics equation (9) (resp., inequality (19) and the non-negativeness of distances).
AppendixΒ A
bounds
the
optimal value of
max square sum
from
above by
|
|
|
This evaluates to be
at most
by LemmaΒ 5.
β
Recall that the variance of any random variable $X$ equals $\mathrm{E}[X^2]-(\mathrm{E}[X])^2$.
Lemma 10.
With variances taken over ,
|
|
|
Proof.
By equations (15)–(16) and Lemmas 8–9,
|
|
|
This and
LemmaΒ 7 imply
|
|
|
Finally, invoke LemmaΒ 6.
β
Lemma 11.
For
all
,
|
|
|
where the probability is taken over .
Proof.
Use Chebyshev's inequality and Lemmas 7 and 10. ∎
Let be a -median of , i.e.,
|
|
|
breaking ties arbitrarily.
So by the averaging argument,
|
|
|
(23) |
Lemma 12.
|
|
|
Proof.
We have
|
|
|
Clearly, .
β
Lemma 13.
For all sufficiently large ,
|
|
|
Proof.
We have
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
where the first
inequality (resp., the first
equality) follows from
the triangle inequality (resp.,
lineΒ 4 of Las Vegas median).
By
LemmasΒ 6Β andΒ 12,
|
|
|
(25) |
By inequalities (24)–(25) and Lemma 5,
.
β
Lemma 14.
For all sufficiently large ,
|
|
|
Proof.
By the triangle inequality,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Now sum up the above with the inequality in
LemmaΒ 12.
β
Lemma 15.
For all sufficiently large and
with
probability
greater than
,
|
|
|
(26) |
where the probability is taken over and the internal coin tosses of Indyk median
in lineΒ 3 of Las Vegas median.
Proof.
By LemmaΒ 11 with
,
|
|
|
(27) |
with probability at least
.
By FactΒ 1 and lineΒ 3 of Las Vegas median,
|
|
|
|
|
(28) |
|
|
|
|
|
(29) |
with probability at least
.
Now by the union bound, inequalities (27)–(29) hold simultaneously with probability at least .
It remains to derive inequality (26) from inequalities (27)–(29) for all sufficiently large $n$.
Line 4 of Las Vegas median, inequalities (28)–(29) and Lemma 14 give
|
|
|
(30) |
This and inequalityΒ (27) imply
|
|
|
|
|
(31) |
|
|
|
|
|
|
|
|
|
|
Clearly,
for all sufficiently large .
So inequalityΒ (31)
implies,
for all sufficiently large and after laborious calculations,
|
|
|
|
|
|
|
|
|
|
This implies inequalityΒ (26)
for all sufficiently large
(note that
by lineΒ 1
of Las Vegas Median).
β
Lemma 15 and lines 6–7 of Las Vegas median show the probability of termination in any iteration to be $\Omega(1)$.
Because the proof of Lemma 15 implies that inequalities (26)–(29) hold simultaneously with probability $\Omega(1)$ in any iteration of Las Vegas median, it happens with probability $\Omega(1)$ that in the first iteration, $z$ is returned in line 7 (because of inequality (26)) and is $(1+\epsilon)$-approximate (because of inequality (28)).
So Las Vegas median outputs a $(1+\epsilon)$-approximate 1-median with probability $\Omega(1)$ in the first iteration.
In summary, we have the following.
Lemma 16.
The first iteration of the while loop of Las Vegas median outputs a $(1+\epsilon)$-approximate 1-median with probability $\Omega(1)$.
5 Putting things together
We now show that metric 1-median has a Las Vegas $(2+\epsilon)$-approximation algorithm with an expected $O(n/\epsilon^2)$ running time for all constants $\epsilon>0$.
Our algorithm also outputs a $(1+\epsilon)$-approximate 1-median in $O(n/\epsilon^2)$ time with probability $\Omega(1)$.
Theorem 17.
For each constant $\epsilon>0$, metric 1-median has a randomized algorithm that (1) always outputs a $(2+\epsilon)$-approximate solution in an expected $O(n/\epsilon^2)$ time and that (2) outputs a $(1+\epsilon)$-approximate solution in $O(n/\epsilon^2)$ time with probability $\Omega(1)$.
Proof.
By Lemma 4, Las Vegas median outputs a $(2+\epsilon)$-approximate 1-median at termination.
To prevent Las Vegas median from running forever, find a 1-median by brute force (which obviously takes $O(n^2)$ time) after steps of computation.
By Fact 1, line 3 of Las Vegas median takes $O(n/\epsilon^2)$ time.
Line 5 takes $O(n)$ time by the Knuth shuffle.
Clearly, the other lines also take $O(n)$ time.
Consequently, each iteration of the while loop of Las Vegas median takes $O(n/\epsilon^2)$ time.
By Lemma 15 and lines 6–7, Las Vegas median runs for at most $O(1)$ iterations in expectation.
So its expected running time is $O(n/\epsilon^2)$.
Having shown each iteration of Las Vegas median to take $O(n/\epsilon^2)$ time, establish condition (2) of the theorem with Lemma 16. ∎
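The proof relies on the Knuth shuffle to draw a uniformly random bijection in linear time. A minimal Python sketch follows; `knuth_shuffle` and `random_bijection` are hypothetical helper names, and the set being permuted is passed in explicitly because the exact domain used in line 5 is not reproduced here.

```python
import random

def knuth_shuffle(items, rng=random):
    """Return a uniformly random permutation of `items` in O(len(items)) time."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
        a[i], a[j] = a[j], a[i]
    return a

def random_bijection(points, rng=random):
    """Map each point to its image under a uniformly random bijection."""
    return dict(zip(points, knuth_shuffle(points, rng)))

pi = random_bijection([3, 5, 8, 13], random.Random(0))
```

Each of the $m!$ permutations of an $m$-element list is produced with equal probability because position $i$ receives a uniformly chosen element among those not yet fixed.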
By Fact 1, Indyk median satisfies condition (2) in Theorem 17.
But it does not satisfy condition (1).
We briefly justify the optimality of the ratio of $2+\epsilon$ in Theorem 17.
Let $A$ be a randomized algorithm that always outputs a $(2-\epsilon)$-approximate 1-median.
Furthermore, denote by $z$ (resp., $Q$) the output (resp., the set of queries as unordered pairs) of $A^d$, where $d$ is the discrete metric (i.e., $d(x,x)=0$ and $d(x,y)=1$ for all distinct $x$, $y\in[n]$).
Without loss of generality, assume $\{z,x\}\in Q$ for all $x\in[n]\setminus\{z\}$ by adding dummy queries.
So $A$ knows that
\[ \sum_{x\in[n]} d(z,x)=n-1. \tag{32} \]
Furthermore, assume that never queries for the distance from a point to itself.
In the sequel,
consider the case
that .
By
the averaging argument, there exists a point
involved in at most queries in .
Clearly, cannot exclude the possibility that for all
satisfying .
In summary,
cannot rule out the case that
|
|
|
|
|
(33) |
Equations (32)–(33) contradict the guarantee that $z$ is $(2-\epsilon)$-approximate.
In summary, any randomized algorithm that always outputs a $(2-\epsilon)$-approximate 1-median must always make at least queries given oracle access to the discrete metric.
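The adversary argument above can be illustrated with a small sketch. Given the pairs an algorithm has queried (all answered with distance 1), it locates a least-queried point and builds an alternative metric that agrees on every queried pair but gives that point shorter distances elsewhere. The value 1/2 for the hidden distances is an assumption chosen here because it keeps the triangle inequality valid; the function names are hypothetical and points are 0-indexed.

```python
def least_queried_point(n, queries):
    """A point involved in the fewest queried (unordered) pairs."""
    counts = [0] * n
    for u, v in queries:
        counts[u] += 1
        counts[v] += 1
    return min(range(n), key=counts.__getitem__)

def adversarial_metric(n, queries, low=0.5):
    """Metric equal to 1 on queried pairs, `low` on unqueried pairs at x_hat."""
    asked = {frozenset(q) for q in queries}
    x_hat = least_queried_point(n, queries)

    def d(u, v):
        if u == v:
            return 0.0
        if x_hat in (u, v) and frozenset((u, v)) not in asked:
            return low  # assumed hidden distance, consistent with all answers
        return 1.0

    return x_hat, d

def is_metric(n, d):
    """Check the three metric axioms on points 0..n-1."""
    for x in range(n):
        for y in range(n):
            if (d(x, y) == 0) != (x == y) or d(x, y) != d(y, x):
                return False
            if any(d(x, w) > d(x, y) + d(y, w) for w in range(n)):
                return False
    return True
```

For example, with 6 points and an output `z = 0` that was queried against every other point, the least-queried point ends up with total distance 3 against 5 for `z`. As the number of points grows while the queries stay sparse, this ratio approaches 2.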
Appendix A Analyzing max square sum
Max square sum has an optimal solution, denoted , because its feasible solutions (i.e., those satisfying constraints (21)–(22)) form a closed and bounded subset of .
(Recall from elementary mathematical analysis that a continuous real-valued function on a closed and bounded subset of $\mathbb{R}^k$ has a maximum value, where $k\in\mathbb{Z}^+$.)
Note that must be feasible to max square sum.
Below is a consequence of constraint (21).
Lemma A.1.
|
|
|
(34) |
Proof.
Clearly,
|
|
|
Furthermore, the left-hand side of
inequalityΒ (34)
is an integer.
β
Lemma A.2.
|
|
|
Proof.
Assume otherwise.
Then
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
So by
constraintΒ (22) (and the feasibility of
to max square sum),
|
|
|
Consequently,
there exist distinct ,
satisfying
|
|
|
(35) |
By symmetry, assume
.
By
inequalityΒ (35),
there exists a small real number
such that
increasing by and simultaneously
decreasing by
will preserve
constraintsΒ (21)β(22).
I.e., the solution defined below is
feasible
to
max square sum:
|
|
|
(39) |
Clearly,
objectiveΒ (20)
w.r.t.Β
exceeds that w.r.t.Β
by
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
where the inequality holds
because and
.
In summary,
is a feasible solution
achieving a greater
objectiveΒ (20)
than the optimal solution
does, a contradiction.
β
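The exchange step in the proof above is an instance of a general convexity fact about sums of squares. As a worked illustration, with $a$ and $b$ standing for the two variables being adjusted and $\beta$ for the shift (placeholder symbols, not the paper's notation), moving mass from the smaller variable to the larger one strictly increases the objective while leaving the sum unchanged:

```latex
\[
  (a+\beta)^2 + (b-\beta)^2 - \left(a^2 + b^2\right)
  = 2\beta(a-b) + 2\beta^2 > 0
  \qquad\text{whenever } a \ge b \text{ and } \beta > 0 .
\]
```

This is why a solution with two variables strictly inside their allowed range cannot be optimal: a small shift of this kind stays feasible and improves the objective.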
We now
bound the optimal value of
max square sum.
Theorem A.3.
The optimal value of max square sum
is at most
|
|
|
Proof.
W.r.t.Β the optimal (and thus feasible) solution ,
objectiveΒ (20) equals
|
|
|
|
|
|
|
|
|
|
where if is true and otherwise, for any
predicate .
Now invoke LemmaΒ A.2.
β