On ultrametric 1-median selection

Ching-Lueh Chang (Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan. Email: clchang@saturn.yzu.edu.tw)
Abstract

Consider the problem of finding a point in an ultrametric space with the minimum average distance to all points. We give this problem a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(1+\epsilon)$-approximation algorithm for all $\epsilon>0$.

1 Introduction

A metric space is a nonempty set $M$ endowed with a distance function $d\colon M\times M\to[0,\infty)$ satisfying

  • $d(x,y)=0$ if and only if $x=y$,

  • $d(x,y)=d(y,x)$, and

  • $d(x,z)\leq d(x,y)+d(y,z)$ (triangle inequality)

for all $x$, $y$, $z\in M$. With the triangle inequality strengthened to

\[ d(x,z)\leq\max\{d(x,y),\,d(y,z)\}, \]

we call $(M,d)$ an ultrametric space and $d$ an ultrametric (a.k.a. non-Archimedean metric or super-metric). The mathematical community studies ultrametrics extensively.
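To make the strengthened inequality concrete, the following minimal sketch checks the axioms above on a finite distance matrix. The helper names `is_ultrametric` and `bit_distance` are illustrative assumptions, not part of the paper; `bit_distance` (the position of the highest differing bit of two nonnegative integers) is just one convenient example of an ultrametric.

```python
from itertools import product

def is_ultrametric(dist):
    """Check the ultrametric axioms on a square matrix of nonnegative distances."""
    n = len(dist)
    for x, y in product(range(n), repeat=2):
        # d(x, y) = 0 iff x = y, and symmetry.
        if (dist[x][y] == 0) != (x == y) or dist[x][y] != dist[y][x]:
            return False
    for x, y, z in product(range(n), repeat=3):
        # Strengthened triangle inequality.
        if dist[x][z] > max(dist[x][y], dist[y][z]):
            return False
    return True

def bit_distance(a, b):
    """Hypothetical example ultrametric: position of the highest differing bit."""
    return (a ^ b).bit_length()

points = [0, 1, 2, 3, 5, 8]
ultra = [[bit_distance(a, b) for b in points] for a in points]
line = [[abs(a - b) for b in points] for a in points]

print(is_ultrametric(ultra))  # True
print(is_ultrametric(line))   # False: |0 - 2| > max(|0 - 1|, |1 - 2|)
```

The absolute-difference metric on the line is an ordinary metric but fails the strengthened inequality, which is why the second check is negative.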

Given an $n$-point metric space $(M,d)$, metric 1-median asks for a point in $M$, called a 1-median, with the minimum average distance to all points. Metric 1-median is a special case of classical $k$-median clustering and a generalization of classical median selection [3]. It can also be interpreted as finding the most important point because social network analysis often measures the importance of an actor $v$ by $v$’s closeness centrality, defined to be $v$’s average distance to all points [8]. Not surprisingly, metric 1-median is extensively studied, e.g., in the general [5, 6], Euclidean [7], streaming [4] and deterministic [2] cases. Indyk [5, 6] has the currently best upper bound for metric 1-median:

Theorem 1 ([5, 6]).

Metric 1-median has a Monte Carlo $O(n/\epsilon^{2})$-time $(1+\epsilon)$-approximation algorithm for all $\epsilon>0$.

The greatest strengths of Theorem 1 are the sublinear time complexity (of $O(n/\epsilon^{2})$) and the optimal approximation ratio (of $1+\epsilon$), where “sublinear” means “$o(n^{2})$” by convention because there are $\Theta(n^{2})$ distances. Furthermore, except for the dependence of the time complexity on $\epsilon$, all parameters in Theorem 1 are easily shown to be optimal [1, Sec. 7].

Chang [1, Sec. 6] uses Indyk’s [6, Sec. 6.1] technique to give a Monte Carlo algorithm for metric 1-median with time complexity independent of $n$ but at the cost of a worse approximation ratio:

Theorem 2 ([1, Sec. 6]).

For all $\epsilon>0$, metric 1-median has a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(2+\epsilon)$-approximation algorithm with success probability greater than $1-\epsilon$.

Let ultrametric 1-median be metric 1-median restricted to ultrametric spaces. The approximation ratio of $2+\epsilon$ in Theorem 2 cannot be improved to $2-\epsilon$ even if we require the success probability only to be a small constant [1, Sec. 7]. In contrast, this paper gives a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(1+\epsilon)$-approximation algorithm for ultrametric 1-median. So our algorithm has the optimal approximation ratio (of $1+\epsilon$) and a time complexity (of $O((\log^{2}(1/\epsilon))/\epsilon^{3})$) independent of $n$.

2 Algorithm

For all $n\in\mathbb{Z}^{+}$, $[n]\stackrel{\text{def.}}{=}\{1,2,\ldots,n\}$ by convention. Let $([n],d)$ be an ultrametric space, OPT a 1-median of $([n],d)$ and $\epsilon>0$. Order the points in $[n]$ as $p_{1}=\text{OPT}$, $p_{2}$, $\ldots$, $p_{n}$ so that

\[ 0=d(\text{OPT},p_{1})\leq d(\text{OPT},p_{2})\leq\cdots\leq d(\text{OPT},p_{n}). \tag{1} \]

Furthermore, let

\[ r^{*}\stackrel{\text{def.}}{=}\frac{1}{n}\cdot\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \tag{2} \]

be the average distance from a 1-median to all points. Because the brute-force algorithm for ultrametric 1-median takes $\Theta(n^{2})$ time and we want an $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time algorithm, assume $\epsilon\geq n^{-2/3}$ W.L.O.G. Furthermore, assume $\epsilon\leq 0.0001$ W.L.O.G. (It is easy to see that if our result holds when $\epsilon=0.0001$, then it also holds for all $\epsilon>0.0001$.)
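For reference, the brute-force baseline mentioned above can be sketched as follows. This is only an illustration: the function name `brute_force_one_median`, the example point set and the example ultrametric `bit_distance` are assumptions made here, and ties are broken in favor of the first point encountered.

```python
def brute_force_one_median(points, d):
    """Return (OPT, r_star) using all n^2 distance evaluations."""
    n = len(points)
    best, best_sum = None, float("inf")
    for x in points:
        total = sum(d(x, y) for y in points)
        if total < best_sum:  # strict comparison keeps the first minimizer
            best, best_sum = x, total
    return best, best_sum / n

def bit_distance(a, b):
    """Hypothetical example ultrametric: position of the highest differing bit."""
    return (a ^ b).bit_length()

points = [0, 1, 2, 3, 8]
opt, r_star = brute_force_one_median(points, bit_distance)
print(opt, r_star)  # 0 1.8
```

In this small example the points 0, 1, 2 and 3 all have total distance 9, so each of them is a 1-median and $r^{*}=9/5$.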

Lemma 3.

For all $1\leq\ell\leq n$,

\[ \sum_{i=1}^{n}\,d(p_{\ell},p_{i})\leq\left(1+\frac{\ell-1}{n-\ell+1}\right)\sum_{i=1}^{n}\,d(\text{OPT},p_{i}). \]
Proof.

We have

\begin{align*}
\sum_{i=1}^{n}\,d(p_{\ell},p_{i})
&= \sum_{i=1}^{\ell-1}\,d(p_{\ell},p_{i})+\sum_{i=\ell+1}^{n}\,d(p_{\ell},p_{i}) \\
&\leq \sum_{i=1}^{\ell-1}\,\max\{d(\text{OPT},p_{\ell}),\,d(\text{OPT},p_{i})\}+\sum_{i=\ell+1}^{n}\,\max\{d(\text{OPT},p_{\ell}),\,d(\text{OPT},p_{i})\} \\
&\stackrel{(1)}{\leq} \sum_{i=1}^{\ell-1}\,d(\text{OPT},p_{\ell})+\sum_{i=\ell+1}^{n}\,d(\text{OPT},p_{i}) \\
&\leq \sum_{i=1}^{\ell-1}\,d(\text{OPT},p_{\ell})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&= \sum_{i=1}^{\ell-1}\,\frac{1}{n-\ell+1}\cdot\sum_{j=\ell}^{n}\,d(\text{OPT},p_{\ell})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&\stackrel{(1)}{\leq} \sum_{i=1}^{\ell-1}\,\frac{1}{n-\ell+1}\cdot\sum_{j=\ell}^{n}\,d(\text{OPT},p_{j})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&\leq \sum_{i=1}^{\ell-1}\,\frac{1}{n-\ell+1}\cdot\sum_{j=1}^{n}\,d(\text{OPT},p_{j})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&= \frac{\ell-1}{n-\ell+1}\cdot\sum_{i=1}^{n}\,d(\text{OPT},p_{i})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}).
\end{align*}

∎
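As a sanity check of Lemma 3, the sketch below verifies the inequality for every $\ell$ on a small example. The function name `lemma3_holds`, the point set and the example ultrametric `bit_distance` are assumptions introduced for illustration. Since $1+(\ell-1)/(n-\ell+1)=n/(n-\ell+1)$, the comparison can be done in exact integer arithmetic.

```python
def bit_distance(a, b):
    """Hypothetical example ultrametric: position of the highest differing bit."""
    return (a ^ b).bit_length()

def lemma3_holds(points, d):
    """Check the inequality of Lemma 3 for every l = 1, ..., n."""
    n = len(points)
    total = {x: sum(d(x, y) for y in points) for x in points}
    opt = min(points, key=total.get)
    # p_1 = OPT, p_2, ..., p_n in nondecreasing distance from OPT, as in Eq. (1).
    ordered = sorted(points, key=lambda x: d(opt, x))
    # total[p_l] <= n / (n - l + 1) * total[OPT], cross-multiplied to stay in integers.
    return all(
        total[p] * (n - l + 1) <= n * total[opt]
        for l, p in enumerate(ordered, start=1)
    )

points = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55]
print(lemma3_holds(points, bit_distance))  # True
```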

In short, Lemma 3 says that $p_{\ell}$ is an approximate 1-median for all small $\ell$. Below is the key of the proof of Theorem 1.

Fact 4 ([6, Sec. 6.1]).

Pick $\boldsymbol{v}_{1}$, $\boldsymbol{v}_{2}$, $\ldots$, $\boldsymbol{v}_{k}$ independently and uniformly at random from $[n]$, where $k\in\mathbb{Z}^{+}$. Then for all $a$, $b\in[n]$ satisfying $\sum_{j=1}^{n}\,d(b,p_{j})>(1+\epsilon)\,\sum_{j=1}^{n}\,d(a,p_{j})$,

\[ \Pr\left[\sum_{j=1}^{k}\,d(b,\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(a,\boldsymbol{v}_{j})\right]<\exp\left(-\frac{\epsilon^{2}k}{64}\right). \]

The following lemma uses Indyk’s [6, Sec. 6.1] technique that Chang [1, Sec. 6] uses to prove Theorem 2.

Lemma 5.

Pick $\boldsymbol{v}_{1}$, $\boldsymbol{v}_{2}$, $\ldots$, $\boldsymbol{v}_{k}$ as in Fact 4, where $k=\lceil 10^{9}(\log(1/\epsilon))/\epsilon^{2}\rceil$. Let $x_{1}$, $x_{2}$, $\ldots$, $x_{h}\in[n]$, where $h=\lceil 10^{9}(\log(1/\epsilon))/\epsilon\rceil$, and

\[ t=\mathop{\mathrm{argmin}}_{i=1}^{h}\,\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j}), \tag{3} \]

breaking ties arbitrarily. Then

\[ \Pr\left[\sum_{j=1}^{n}\,d(x_{t},p_{j})\leq(1+\epsilon)\cdot\min_{i=1}^{h}\,\sum_{j=1}^{n}\,d(x_{i},p_{j})\right]>1-\epsilon. \]
Proof.

Let

\[ i^{*}=\mathop{\mathrm{argmin}}_{i=1}^{h}\,\sum_{j=1}^{n}\,d(x_{i},p_{j}), \tag{4} \]

breaking ties arbitrarily. Then

\begin{align*}
&\Pr\left[\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\min_{i=1}^{h}\,\sum_{j=1}^{n}\,d(x_{i},p_{j})\right] \\
&\stackrel{(4)}{=} \Pr\left[\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right] \\
&\stackrel{(3)}{=} \Pr\left[\left(\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{t},\boldsymbol{v}_{j})=\min_{i=1}^{h}\,\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j})\right)\right] \\
&\leq \Pr\left[\left(\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{t},\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(x_{i^{*}},\boldsymbol{v}_{j})\right)\right] \\
&\leq \Pr\left[\exists i\in[h],\,\left(\sum_{j=1}^{n}\,d(x_{i},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(x_{i^{*}},\boldsymbol{v}_{j})\right)\right] \\
&\leq \sum_{i=1}^{h}\,\Pr\left[\left(\sum_{j=1}^{n}\,d(x_{i},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(x_{i^{*}},\boldsymbol{v}_{j})\right)\right] \\
&\stackrel{\text{Fact 4}}{<} \sum_{i=1}^{h}\,\exp\left(-\frac{\epsilon^{2}k}{64}\right) \\
&= h\cdot\exp\left(-\frac{\epsilon^{2}k}{64}\right) \\
&< \epsilon,
\end{align*}

where the second inequality uses $t\in[h]$. ∎
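The final step, $h\cdot\exp(-\epsilon^{2}k/64)<\epsilon$, is stated without detail. One way to verify it is sketched below, using the standing assumption $\epsilon\leq 0.0001$ and assuming that $\log$ denotes the natural logarithm (with a base-2 logarithm, $k$ only grows and the bound $\log(1/\epsilon)\leq 1/\epsilon$ still holds). The constants in this sketch are deliberately loose.

```latex
\begin{align*}
\exp\left(-\frac{\epsilon^{2}k}{64}\right)
  &\leq \exp\left(-\frac{10^{9}}{64}\,\ln\frac{1}{\epsilon}\right)
   = \epsilon^{10^{9}/64}, \\
h &\leq \frac{10^{9}\log(1/\epsilon)}{\epsilon}+1
   \leq \frac{2\cdot 10^{9}\log(1/\epsilon)}{\epsilon}
   \leq \frac{2\cdot 10^{9}}{\epsilon^{2}}, \\
h\cdot\exp\left(-\frac{\epsilon^{2}k}{64}\right)
  &\leq 2\cdot 10^{9}\cdot\epsilon^{10^{9}/64-2}
   < \epsilon .
\end{align*}
```

The last inequality holds because $2\cdot 10^{9}\cdot\epsilon^{10^{9}/64-3}<1$ whenever $\epsilon\leq 0.0001$.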

In short, Lemma 5 says how to find a $((1+\epsilon)\kappa)$-approximate 1-median from $\{x_{1},x_{2},\ldots,x_{h}\}$ with probability greater than $1-\epsilon$, where $\kappa$ is the best approximation ratio among $x_{1}$, $x_{2}$, $\ldots$, $x_{h}$. Note that computing $t$ in Eq. (3) requires no knowledge of the ordering $p_{1}$, $p_{2}$, $\ldots$, $p_{n}$.

1:  $h\leftarrow\lceil 10^{9}(\log(1/\epsilon))/\epsilon\rceil$;
2:  $k\leftarrow\lceil 10^{9}(\log(1/\epsilon))/\epsilon^{2}\rceil$;
3:  Pick $\boldsymbol{u}_{1}$, $\boldsymbol{u}_{2}$, $\ldots$, $\boldsymbol{u}_{h}$, $\boldsymbol{v}_{1}$, $\boldsymbol{v}_{2}$, $\ldots$, $\boldsymbol{v}_{k}$ independently and uniformly at random from $[n]$;
4:  $t\leftarrow\mathop{\mathrm{argmin}}_{i=1}^{h}\,\sum_{j=1}^{k}\,d(\boldsymbol{u}_{i},\boldsymbol{v}_{j})$, breaking ties arbitrarily;
5:  return $\boldsymbol{u}_{t}$;

Figure 1: Algorithm approx. median for ultrametric 1-median
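The algorithm in Fig. 1 translates almost line by line into code. The sketch below is an illustration under several assumptions that are not in the paper: points are labeled $0,\ldots,n-1$ instead of $1,\ldots,n$, $\log$ is taken to be the natural logarithm, the distance is supplied as a function `d`, and the constant $10^{9}$ is exposed as a parameter `c` so the code can be run with small sample sizes. With `c` below $10^{9}$ the guarantees proved in this paper no longer apply.

```python
import math
import random

def approx_median(n, d, eps, c=10**9, rng=random):
    """Sketch of approx. median (Fig. 1) on the point set {0, ..., n-1}.

    c stands in for the paper's constant 10^9 (hypothetical parameter);
    rng is any object with a randrange method.
    """
    h = math.ceil(c * math.log(1 / eps) / eps)
    k = math.ceil(c * math.log(1 / eps) / eps ** 2)
    # Candidates u_1..u_h and reference samples v_1..v_k, drawn with replacement.
    u = [rng.randrange(n) for _ in range(h)]
    v = [rng.randrange(n) for _ in range(k)]
    # Return the candidate with the smallest sampled distance sum.
    return min(u, key=lambda x: sum(d(x, y) for y in v))

def bit_distance(a, b):
    """Hypothetical example ultrametric: position of the highest differing bit."""
    return (a ^ b).bit_length()

pick = approx_median(1000, bit_distance, eps=0.1, c=2, rng=random.Random(0))
print(pick)
```

The running time is dominated by the $h\cdot k$ distance evaluations in the final line, which does not depend on $n$.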
Lemma 6.

Algorithm approx. median in Fig. 1 outputs a $((1+\epsilon)(1+2\epsilon))$-approximate 1-median with probability greater than $1-2\epsilon$.

Proof.

With $h$ and $\boldsymbol{u}_{1}$, $\boldsymbol{u}_{2}$, $\ldots$, $\boldsymbol{u}_{h}$ as in approx. median,

\begin{align*}
\Pr\left[\exists i\in[h],\,\boldsymbol{u}_{i}\in\left\{p_{1},p_{2},\ldots,p_{\lceil\epsilon n\rceil}\right\}\right]
&= 1-\Pr\left[\forall i\in[h],\,\boldsymbol{u}_{i}\notin\left\{p_{1},p_{2},\ldots,p_{\lceil\epsilon n\rceil}\right\}\right] \\
&= 1-\left(1-\frac{\lceil\epsilon n\rceil}{n}\right)^{h} \\
&> 1-\epsilon. \tag{6}
\end{align*}

When there exists $1\leq i\leq h$ satisfying $\boldsymbol{u}_{i}\in\{p_{1},p_{2},\ldots,p_{\lceil\epsilon n\rceil}\}$, Lemma 3 asserts the existence of a $(1+2\epsilon)$-approximate 1-median in $\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots,\boldsymbol{u}_{h}\}$. So Eqs. (2)–(6) force $\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots,\boldsymbol{u}_{h}\}$ to contain a $(1+2\epsilon)$-approximate 1-median with probability greater than $1-\epsilon$. By Lemma 5 (with $\{x_{i}\}_{i=1}^{h}$ substituted by $\{\boldsymbol{u}_{i}\}_{i=1}^{h}$), approx. median outputs a $((1+\epsilon)\kappa)$-approximate 1-median with probability greater than $1-\epsilon$ if $\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots,\boldsymbol{u}_{h}\}$ contains a $\kappa$-approximate 1-median, for all $\kappa>0$. Now take $\kappa=1+2\epsilon$. ∎
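Two steps of this proof are left to the reader: the ratio $1+2\epsilon$ obtained from Lemma 3, and the final inequality in Eq. (6). A possible way to fill them in is sketched below, for $1\leq\ell\leq\lceil\epsilon n\rceil$ and $\epsilon\leq 0.0001$, again assuming that $\log$ is the natural logarithm (a base-2 logarithm only makes $h$ larger).

```latex
\begin{align*}
\frac{\ell-1}{n-\ell+1}
  &\leq \frac{\lceil\epsilon n\rceil-1}{n-\lceil\epsilon n\rceil+1}
   \leq \frac{\epsilon n}{n-\epsilon n}
   = \frac{\epsilon}{1-\epsilon}
   \leq 2\epsilon, \\
\left(1-\frac{\lceil\epsilon n\rceil}{n}\right)^{h}
  &\leq \left(1-\epsilon\right)^{h}
   \leq e^{-\epsilon h}
   \leq e^{-10^{9}\ln(1/\epsilon)}
   = \epsilon^{10^{9}}
   < \epsilon .
\end{align*}
```

The first line uses $\lceil\epsilon n\rceil-1<\epsilon n$ and $\epsilon\leq 1/2$; the second uses $1-x\leq e^{-x}$ and $h\geq 10^{9}\ln(1/\epsilon)/\epsilon$.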

Theorem 7.

Ultrametric 1-median has a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(1+\epsilon)$-approximation algorithm with success probability greater than $1-\epsilon$.

Proof.

Invoke Lemma 6 (with $\epsilon$ substituted by $\epsilon/4$) and calculate the running time of approx. median. ∎
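The calculation behind this proof can be spelled out roughly as follows. The sketch assumes that each distance evaluation and each random sample takes $O(1)$ time, which the paper does not state explicitly, and uses the standing assumption $\epsilon\leq 0.0001$.

```latex
\begin{align*}
\left(1+\frac{\epsilon}{4}\right)\left(1+\frac{2\epsilon}{4}\right)
  &= 1+\frac{3\epsilon}{4}+\frac{\epsilon^{2}}{8}
   \leq 1+\epsilon, \\
1-2\cdot\frac{\epsilon}{4}
  &= 1-\frac{\epsilon}{2}
   > 1-\epsilon, \\
h\cdot k
  &= O\left(\frac{\log(1/\epsilon)}{\epsilon}\right)\cdot O\left(\frac{\log(1/\epsilon)}{\epsilon^{2}}\right)
   = O\left(\frac{\log^{2}(1/\epsilon)}{\epsilon^{3}}\right).
\end{align*}
```

Substituting $\epsilon/4$ for $\epsilon$ changes $h$ and $k$ only by constant factors, so the bound on the $h\cdot k$ distance evaluations in line 4 of Fig. 1 is unaffected.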

References

  • [1] C.-L. Chang. Some results on approximate 1-median selection in metric spaces. Theoretical Computer Science, 426:1–12, 2012.
  • [2] C.-L. Chang. Metric 1-median selection: Query complexity vs. approximation ratio. ACM Transactions on Computation Theory, 9(4):20:1–20:23, 2018.
  • [3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
  • [4] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
  • [5] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 428–434, 1999.
  • [6] P. Indyk. High-dimensional computational geometry. PhD thesis, Stanford University, 2000.
  • [7] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM, 57(2):5, 2010.
  • [8] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.