Ching-Lueh Chang
Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan. Email: clchang@saturn.yzu.edu.tw
Supported in part by the Ministry of Science and Technology of Taiwan under grant 105-2221-E-155-047-.
Abstract
Let be a metric space.
We analyze the expected value and the variance
of
for a uniformly random permutation of ,
leading to the following results:
• Consider the problem of finding a point in with the minimum sum of distances to all points. We show that this problem has a randomized algorithm that (1) always outputs a -approximate solution in expected time and that (2) inherits Indyk's [9, 10] algorithm to output a -approximate solution in time with probability , where .
• The average distance in can be approximated in time to within a multiplicative factor in with probability , where .
• Assume to be a graph metric. Then the average distance in can be approximated in time to within a multiplicative factor in with probability , where .
1 Introduction
A metric space is a nonempty set endowed with a
metric,
i.e.,
a function
such that
For all , define .
Given and oracle access to a metric , metric 1-median asks for , breaking ties arbitrarily. It generalizes the classical median selection on the real line and has a brute-force -time algorithm. More generally, metric k-median asks for , , , minimizing . Because defines nonzero distances, only -time algorithms are said to run in sublinear time [9].
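The brute-force baseline mentioned above can be sketched as follows. This is a minimal illustration, not code from the paper: the points are assumed to be indexed 0, ..., n-1, the oracle is modeled as a Python callable `d(x, y)`, and the function name `brute_force_1_median` and the sample data are hypothetical.

```python
from typing import Callable


def brute_force_1_median(n: int, d: Callable[[int, int], float]) -> int:
    """Return a point of {0, ..., n-1} minimizing the sum of distances to all points."""
    best_point, best_sum = 0, float("inf")
    for x in range(n):
        # n oracle queries per candidate, hence n * n queries overall.
        total = sum(d(x, y) for y in range(n))
        if total < best_sum:  # strict '<' keeps the first minimizer on ties
            best_point, best_sum = x, total
    return best_point


# Hypothetical example: points on the real line with d(x, y) = |p_x - p_y|.
positions = [0, 2, 3, 10, 11]


def line_metric(x: int, y: int) -> float:
    return abs(positions[x] - positions[y])


median_index = brute_force_1_median(len(positions), line_metric)
```

On the real line the result coincides with the classical median of the positions, which is the sense in which the problem generalizes median selection.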
For all , an -approximate 1-median is a point satisfying
For all , metric 1-median has a Monte Carlo -approximation -time algorithm [9, 10].
Guha et al. [8] show that metric k-median has a Monte Carlo, -approximation, -time, -space and one-pass algorithm for all small as well as a deterministic, -approximation, -time, -space and one-pass algorithm. Given points in with , the Monte Carlo algorithms of Kumar et al. [11] find a -approximate 1-median in time and a -approximate solution to metric k-median in time. All randomized -approximation algorithms for metric k-median take time [12, 8]. Chang [3] shows that metric 1-median has a deterministic, -approximation, -time and nonadaptive algorithm for all constants , generalizing the results of Chang [2] and Wu [16].
On the other hand, he disproves the existence of deterministic -approximation -time algorithms for all constants and [4, 5].
In social network analysis, the closeness centrality of a point is the reciprocal of the average distance from to all points [15]. So metric 1-median asks for a point with the maximum closeness centrality. Given oracle access to a graph metric, the Monte Carlo algorithms of Goldreich and Ron [7] and Eppstein and Wang [6] estimate the closeness centrality of a given point and those of all points, respectively.
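The link between closeness centrality and the 1-median can be checked on a small instance. The sketch below computes closeness exactly from a full distance matrix; it is not one of the cited sampling algorithms. Averaging over all n points (including the point itself) is an assumption made here for illustration, since some definitions average over the other n-1 points. The names `closeness_centrality` and `path_dist` are hypothetical.

```python
from typing import List


def closeness_centrality(dist: List[List[float]], x: int) -> float:
    """Reciprocal of the average distance from x to all points (n >= 2 assumed)."""
    n = len(dist)
    average = sum(dist[x]) / n  # assumption: average over all n points
    return 1.0 / average


# Hypothetical example: shortest-path distances on the path graph 0 - 1 - 2.
path_dist = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
]

scores = [closeness_centrality(path_dist, x) for x in range(3)]
most_central = max(range(3), key=lambda x: scores[x])
one_median = min(range(3), key=lambda x: sum(path_dist[x]))
```

Because the reciprocal is decreasing, maximizing closeness and minimizing the distance sum select the same point, here the middle vertex.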
All known sublinear-time algorithms for metric 1-median are either deterministic or Monte Carlo, the latter having a positive probability of failure. For example, Indyk's Monte Carlo -approximation algorithm outputs with a positive probability a solution without approximation guarantees. In contrast, we show that metric 1-median has a randomized algorithm that always outputs a -approximate solution in expected time for all . So, excluding the known deterministic algorithms (which are Las Vegas only in the degenerate sense), this paper gives the first Las Vegas approximation algorithm for metric 1-median with an expected sublinear running time.
Note that deterministic sublinear-time algorithms for metric 1-median can be -approximate but not -approximate for any constant [2, 5]. So our approximation ratio of beats that of any deterministic sublinear-time algorithm. Inheriting Indyk's algorithm, our algorithm outputs a -approximate 1-median in time with probability for all .
Indyk [9, 10] gives a Monte Carlo -time algorithm that approximates the average distance in any metric space to within a multiplicative factor in , for all . Barhum, Goldreich and Shraibman [1] improve Indyk's time complexity of to . This paper gives a Monte Carlo -time algorithm that approximates the average distance in to within a multiplicative factor in , for all . For all , we present a Monte Carlo -time algorithm approximating the average distance of any graph metric to within a multiplicative factor in . But for general metrics, we do not know whether the running time of Barhum, Goldreich and Shraibman can be improved to .
2 Definitions and preliminaries
For a metric space ,
(1)
(2)
breaking ties arbitrarily in equation (2). So is the average distance in , and is a 1-median. An algorithm with oracle access to is denoted by and may query on any for . In this paper, all Landau symbols (such as , , and ) are w.r.t. .
The following result is due to Indyk.
For all , metric 1-median has a Monte Carlo -approximation -time algorithm with a failure probability of at most .
Henceforth, denote Indyk's algorithm in Fact 1 by Indyk median. It is given , and oracle access to a metric .
The following fact
on estimating the average distance
is due to Barhum, Goldreich and Shraibman.
This and Lemma 5 imply . So the left-hand side of inequality (5) is at least .
∎
Lemma 7.
For all and in each iteration of the while loop of Las Vegas median,
(7)
where the probability is taken over , , , and the random coin tosses of Indyk median.
Proof.
By Fact 1 and line 2 of Las Vegas median, the first condition within in equation (7) holds with probability at least over the random coin tosses of Indyk median. By Lemma 6, the second condition holds with probability at least over , , , . In summary, the first two conditions hold simultaneously with probability at least (note that the random coin tosses of Indyk median are independent of , , , ). Finally, the first two conditions together imply the third by inequality (3) and the easy fact that
∎
Theorem 8.
For all , metric 1-median has a randomized algorithm that (1) always outputs a -approximate solution in expected time and (2) outputs a -approximate solution in time with probability .
Proof.
By Lemma 7, each execution of lines 4–5 of Las Vegas median returns with probability . So the expected number of iterations is . By Fact 1, line 2 takes time. Line 3 takes time by the Knuth shuffle. Clearly, lines 4–5 take time. In summary, the expected running time of Las Vegas median is . To prevent Las Vegas median from running forever, find a 1-median by brute force (which obviously takes time) after steps of computation.
By Lemma 3, Las Vegas median is -approximate. By Lemma 7, is -approximate and is also returned in line 5 with probability in the first (in fact, any) iteration. Finally, the previous paragraph has shown each iteration to take time.
∎
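The step in the proof from a per-iteration success probability to a bound on the expected number of iterations is the standard geometric-distribution argument. As a sketch, write p for a lower bound on the probability that an iteration returns and T for the number of iterations; both symbols are introduced here for illustration, and each iteration is assumed to return with probability at least p regardless of the outcomes of earlier iterations.

```latex
\Pr[T \ge k] \le (1-p)^{k-1} \quad (k \ge 1),
\qquad
\mathbb{E}[T] \;=\; \sum_{k \ge 1} \Pr[T \ge k]
\;\le\; \sum_{k \ge 1} (1-p)^{k-1} \;=\; \frac{1}{p}.
```

In particular, a constant lower bound on p yields a constant upper bound on the expected number of iterations.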
By Fact 1, Indyk median satisfies condition (2) in Theorem 8. But it does not satisfy condition (1). We now justify the optimality of the ratio of in Theorem 8.
Let be a randomized algorithm that always outputs a -approximate 1-median. Furthermore, denote by (resp., ) the output (resp., the set of queries as unordered pairs) of , where is the discrete metric (i.e., and for all distinct , ). Without loss of generality, assume for all by adding dummy queries. So the queries in witness that
(8)
Assume without loss of generality that never queries for the distance from a point to itself. In the sequel, consider the case that . By the averaging argument, there exists a point involved in at most queries in (note that each query involves two points). Because every function with satisfies the triangle inequality, cannot exclude the possibility that for all satisfying . In summary, cannot rule out the case that
(9)
Equations (8)–(9) contradict the guarantee that is -approximate. Consequently, the case that should never happen. The next theorem summarizes the above.
Theorem 9.
Metric 1-median has no randomized algorithm that always outputs a -approximate solution and that makes fewer than queries with a positive probability given oracle access to the discrete metric, for any constant .
Lemmas 4 and 6 yield the following estimation of the average distance.
Theorem 10.
Given
, and
oracle access to a metric ,
a real number in
can be found
in time
with probability .
with probability .
The Knuth shuffle picks , , , in time. Then the left-hand side of relation (10) can be calculated in time.
∎
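The two steps of this proof, a Knuth shuffle followed by a single pass over consecutive pairs of the permutation, can be sketched as below. This is an illustration under stated assumptions rather than the paper's exact statistic: points are indexed 0, ..., n-1, the oracle is a callable `d(x, y)`, and the normalization (the mean over the matched pairs) is chosen here for readability, so relation (10) may scale the sum differently. The function names are hypothetical.

```python
import random
from typing import Callable, List, Optional


def knuth_shuffle(n: int, rng: random.Random) -> List[int]:
    """Return a uniformly random permutation of 0, ..., n-1 in O(n) time."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def matched_pair_average(
    n: int,
    d: Callable[[int, int], float],
    rng: Optional[random.Random] = None,
) -> float:
    """Average d over the floor(n/2) disjoint pairs of a random permutation (n >= 2)."""
    rng = rng or random.Random()
    perm = knuth_shuffle(n, rng)
    pairs = n // 2
    # One oracle query per pair, so about n/2 queries in total.
    total = sum(d(perm[2 * i], perm[2 * i + 1]) for i in range(pairs))
    return total / pairs
```

Each matched pair is uniformly distributed over ordered pairs of distinct points, so by linearity of expectation the returned value has the average distance over distinct pairs as its mean. Bounding the deviation from that mean is where the variance analysis is needed.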
Note that the estimation of the average distance in Theorem 10 has only one-sided error. The time complexity (resp., approximation ratio) in Theorem 10 is better (resp., worse) than that in Fact 2.
4 Estimating the average distance of a graph metric
Throughout this section, take any less than a small constant, e.g., . Define
(11)
(12)
where is as in equation (2). As , by equation (11). As in line 1 of average distance in Fig. 2, let be a uniformly random permutation. Clearly,
(15)
where the last equality follows from the linearity of expectation and the separation of pairs according to whether .
The next three lemmas analyze the variance of
By equations (1) and (4)–(19), the left-hand side of inequality (17) cannot exceed the optimal value of the following problem, called max square sum: Find for all , to maximize
(20)
subject to
(21)
(22)
Above, constraint (21) (resp., (22)) mimics equation (1) (resp., inequality (19) and the non-negativity of distances). Appendix A bounds the optimal value of max square sum from above by
This evaluates to at most
∎
Recall that the variance of any random variable $X$ equals $\mathbb{E}[X^2]-(\mathbb{E}[X])^2$.
We now arrive at an efficient estimation of the average distance on a graph.
Theorem 17.
Given ,
and oracle access to a graph metric
,
a real number in
can be found in time with probability
.
Proof.
Let be an undirected unweighted graph inducing the distance function . Then pick , with , i.e., is a furthest pair of vertices of . Find a simple shortest - path, denoted , in . By equation (12),
(23)
Now,
(24)
where the first inequality (resp., the second equality) follows from the triangle inequality (resp., being a shortest - path). (Footnote 4: It is easy to verify that if and otherwise.) By inequalities (23)–(24),
(25)
Because is a graph metric, for all distinct , . So by equation (12), for all sufficiently large . (Footnote 5: If , then . Otherwise, for all .) Finally, recall that . By equation (11),
(28)
for all sufficiently large . By inequalities (27)–(28), Lemma 16 with and recalling that ,
(29)
for all sufficiently large . Consequently, the output of line 2 of average distance in Fig. 2 is in with probability . Line 1 takes time by the Knuth shuffle. Clearly, line 2 also takes time.
∎
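The objects used in this proof (a graph metric induced by an undirected unweighted graph, and a furthest pair of vertices) can be made concrete with breadth-first search. The sketch below is only an illustration of those definitions and is not part of the algorithm of Fig. 2, which accesses the metric through an oracle. The graph is assumed to be connected and given as an adjacency dictionary; the names `bfs_distances`, `furthest_pair` and `path_graph` are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Tuple

Graph = Dict[Hashable, List[Hashable]]


def bfs_distances(adj: Graph, source: Hashable) -> Dict[Hashable, int]:
    """Shortest-path (hop) distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def furthest_pair(adj: Graph) -> Tuple[int, Optional[Hashable], Optional[Hashable]]:
    """Return (distance, s, t) for a pair of vertices at maximum distance."""
    best: Tuple[int, Optional[Hashable], Optional[Hashable]] = (0, None, None)
    for s in adj:
        dist = bfs_distances(adj, s)
        t = max(dist, key=dist.get)
        if dist[t] > best[0]:
            best = (dist[t], s, t)
    return best


# Hypothetical example: the path graph 0 - 1 - 2 - 3.
path_graph: Graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
diameter, s, t = furthest_pair(path_graph)
```

In such a metric every pair of distinct vertices is at distance at least 1, which is the property of graph metrics that the proof relies on.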
The time complexity of in Theorem 17 is independent of . But for general metrics, we do not know whether the time complexity of in Fact 2 can be improved to .
Appendix A Analyzing max square sum
Max square sum has an optimal solution, denoted , because its feasible solutions (i.e., those satisfying constraints (21)–(22)) form a closed and bounded subset of . (Recall from elementary mathematical analysis that a continuous real-valued function on a closed and bounded subset of has a maximum value, where .) Note that must be feasible to max square sum. Below is a consequence of constraint (21).
Lemma A.1.
(30)
Proof.
Clearly,
Furthermore, the left-hand side of inequality (30) is an integer.
∎
Lemma A.2.
Proof.
Assume otherwise. Then
So by constraint (22) (and the feasibility of to max square sum),
Consequently, there exist distinct , satisfying
(31)
By symmetry, assume . By inequality (31), there exists a small real number such that increasing by and simultaneously decreasing by will preserve constraints (21)–(22). I.e., the solution defined below is feasible to max square sum:
(35)
Clearly, objective (20) w.r.t. exceeds that w.r.t. by
where the inequality holds because and . In summary, is a feasible solution to max square sum achieving a greater objective (20) than the optimal solution does, a contradiction.
∎
where if is true and otherwise, for any predicate . Now invoke Lemma A.2.
∎
References
[1] K. Barhum, O. Goldreich, and A. Shraibman. On approximating the average distance between points. In Proceedings of the 10th International Workshop on Approximation and the 11th International Workshop on Randomization, and Combinatorial Optimization, pages 296–310, 2007.
[2] C.-L. Chang. Deterministic sublinear-time approximations for metric 1-median selection. Information Processing Letters, 113(8):288–292, 2013.
[3] C.-L. Chang. A deterministic sublinear-time nonadaptive algorithm for metric 1-median selection. Theoretical Computer Science, 602:149–157, 2015.
[4] C.-L. Chang. Metric 1-median selection: Query complexity vs. approximation ratio. In Proceedings of the 22nd International Computing and Combinatorics Conference, pages 131–142, Ho Chi Minh City, Vietnam, 2016. Full version at https://arxiv.org/abs/1509.05662.
[5] C.-L. Chang. A lower bound for metric 1-median selection. Journal of Computer and System Sciences, 84:44–51, 2017.
[6] D. Eppstein and J. Wang. Fast approximation of centrality. Journal of Graph Algorithms and Applications, 8(1):39–45, 2004.
[7] O. Goldreich and D. Ron. Approximating average parameters of graphs. Random Structures & Algorithms, 32(4):473–493, 2008.
[8] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
[9] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 428–434, 1999.
[11] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM, 57(2):5, 2010.
[12] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1–3):35–60, 2004.
[13] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, Cambridge, UK, 1995.