Institut für Informatik
Freie Universität Berlin
E-Mail: mulzer@inf.fu-berlin.de
Five Proofs of Chernoff’s Bound with Applications
Footnote 1: Supported in part by DFG Grants MU 3501/1 and MU 3501/2 and ERC StG 757609.
Abstract
We discuss five ways of proving Chernoff’s bound and show how they lead to different extensions of the basic bound.
1 Introduction
Chernoff’s bound gives an estimate on the probability that a sum of independent Binomial random variables deviates from its expectation [14]. It has many variants and extensions that are known under various names such as Bernstein’s inequality or Hoeffding’s bound [4, 14]. Chernoff’s bound is one of the most basic and versatile tools in the life of a theoretical computer scientist, with a seemingly endless number of applications. Almost every contemporary textbook on algorithms or complexity theory contains a statement and a proof of the bound [2, 12, 16, 8], and there are several texts that discuss its various applications in great detail (e.g., the textbooks by Alon and Spencer [1], Dubhashi and Panconesi [10], Mitzenmacher and Upfal [19], Motwani and Raghavan [21], or the articles by Chung and Lu [6], Hagerup and Rüb [13], or McDiarmid [17]).
In the present survey, we will see five different ways of proving the basic Chernoff bound. The different techniques used in these proofs allow various generalizations and extensions, some of which we will also discuss.
2 The Basic Bound
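We begin with a statement of the basic Chernoff bound. For this, we first need a notion from information theory [9]. Let $P = (p_1, \dots, p_m)$ and $Q = (q_1, \dots, q_m)$ be two probability distributions on $m$ elements, i.e., $P, Q \in \mathbb{R}^m$ with $p_i, q_i \geq 0$, for $i = 1, \dots, m$, and $\sum_{i=1}^m p_i = \sum_{i=1}^m q_i = 1$. The Kullback-Leibler divergence or relative entropy of $P$ and $Q$ is defined as
$$D_{\mathrm{KL}}(P \| Q) := \sum_{i=1}^m p_i \ln \frac{p_i}{q_i}.$$
If $m = 2$, i.e., if $P = (p, 1-p)$ and $Q = (q, 1-q)$, we write $D_{\mathrm{KL}}(p \| q)$ for $D_{\mathrm{KL}}((p, 1-p) \| (q, 1-q))$. The Kullback-Leibler divergence measures the distance between the distributions $P$ and $Q$: it represents the expected loss of efficiency if we encode an $m$-letter alphabet with distribution $P$ with a code that is optimal for distribution $Q$. Now, the basic Chernoff bound is as follows: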
Theorem 2.1.
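Let $n \in \mathbb{N}$, $p \in [0,1]$, and let $X_1, \dots, X_n$ be independent random variables with $X_i \in \{0,1\}$ and $\Pr[X_i = 1] = p$, for $i = 1, \dots, n$. Set $X := \sum_{i=1}^n X_i$. Then, for any $t \in [0, 1-p]$, we have
$$\Pr[X \geq (p+t)n] \leq e^{-D_{\mathrm{KL}}(p+t \| p)\, n}.$$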
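To get a feeling for the statement, the following minimal sketch compares the exact upper tail of a Binomial$(n, p)$ variable with the bound $e^{-D_{\mathrm{KL}}(p+t \| p) n}$. The function names are hypothetical, and the threshold is passed as an integer $k = (p+t)n$ (an assumption made here to avoid floating-point rounding of $(p+t)n$).

```python
import math

def kl_bernoulli(a: float, b: float) -> float:
    """D_KL(a || b) for the two-point distributions (a, 1-a) and (b, 1-b), in nats.

    Uses the convention 0 * ln(0 / q) = 0 and assumes 0 < b < 1.
    """
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total

def chernoff_bound(n: int, p: float, k: int) -> float:
    """Bound exp(-D_KL(k/n || p) * n) on Pr[X >= k], meant for k >= p * n."""
    return math.exp(-kl_bernoulli(k / n, p) * n)

def binomial_upper_tail(n: int, p: float, k: int) -> float:
    """Exact Pr[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

if __name__ == "__main__":
    n, p = 100, 0.3
    for k in (30, 40, 50, 60):
        print(k, binomial_upper_tail(n, p, k), chernoff_bound(n, p, k))
```

For $k = n$ both quantities equal $p^n$, so the bound cannot be improved in general without further assumptions.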
3 Five Proofs for Theorem 2.1
We will now see five different ways of proving Theorem 2.1.
3.1 The Moment Method
The usual textbook proof of Theorem 2.1 uses the exponential function $\exp$ and Markov’s inequality. It is called the moment method, because $\exp$ simultaneously encodes all moments of $X$. This trick is often attributed to Bernstein [4]. It is very general and can be used to obtain several variants of Theorem 2.1, perhaps most prominently, the Azuma-Hoeffding inequality for martingales with bounded differences [14, 3].
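The proof goes as follows. Let $\lambda > 0$ be a parameter to be determined later. We have
$$\Pr[X \geq (p+t)n] = \Pr[\lambda X \geq \lambda (p+t) n] = \Pr\left[e^{\lambda X} \geq e^{\lambda (p+t) n}\right].$$
From Markov’s inequality, we obtain
$$\Pr\left[e^{\lambda X} \geq e^{\lambda (p+t) n}\right] \leq \frac{\mathbf{E}\left[e^{\lambda X}\right]}{e^{\lambda (p+t) n}}.$$
Now, the independence of the $X_i$ yields
$$\mathbf{E}\left[e^{\lambda X}\right] = \mathbf{E}\left[\prod_{i=1}^n e^{\lambda X_i}\right] = \prod_{i=1}^n \mathbf{E}\left[e^{\lambda X_i}\right] = \left(p e^{\lambda} + 1 - p\right)^n.$$
Thus,
$$\Pr[X \geq (p+t)n] \leq \left(\frac{p e^{\lambda} + 1 - p}{e^{\lambda (p+t)}}\right)^n, \tag{1}$$
for every $\lambda > 0$. Optimizing for $\lambda$ using calculus, we get that the right hand side is minimized if
$$e^{\lambda} = \frac{(1-p)(p+t)}{p(1-p-t)}.$$
Plugging this into (1), we get
$$\Pr[X \geq (p+t)n] \leq \left[\left(\frac{p}{p+t}\right)^{p+t} \left(\frac{1-p}{1-p-t}\right)^{1-p-t}\right]^n = e^{-D_{\mathrm{KL}}(p+t \| p)\, n},$$
as desired.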
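The optimization step can be checked numerically. The sketch below (hypothetical function names, illustrative values of $p$ and $t$) evaluates the per-trial base $(p e^{\lambda} + 1 - p)/e^{\lambda(p+t)}$ of the moment bound, plugs in the closed-form minimizer $e^{\lambda} = (1-p)(p+t)/(p(1-p-t))$, and compares the result with $e^{-D_{\mathrm{KL}}(p+t \| p)}$.

```python
import math

def moment_base(lam: float, p: float, t: float) -> float:
    """Per-trial base of the moment bound: (p e^lam + 1 - p) / e^(lam (p + t))."""
    return (p * math.exp(lam) + 1 - p) / math.exp(lam * (p + t))

def optimal_lambda(p: float, t: float) -> float:
    """Closed-form minimizer, assuming 0 < p < 1 and 0 < t < 1 - p."""
    return math.log((1 - p) * (p + t) / (p * (1 - p - t)))

def kl_base(p: float, t: float) -> float:
    """exp(-D_KL(p+t || p)), written as the product that appears in the proof."""
    a = p + t
    return (p / a) ** a * ((1 - p) / (1 - a)) ** (1 - a)

if __name__ == "__main__":
    p, t = 0.3, 0.2
    lam_star = optimal_lambda(p, t)
    # A coarse grid search should not find anything better than lam_star.
    grid = [i / 1000 for i in range(1, 5000)]
    lam_grid = min(grid, key=lambda lam: moment_base(lam, p, t))
    print(lam_star, lam_grid)
    print(moment_base(lam_star, p, t), kl_base(p, t))
```

Since the logarithm of the base is convex in $\lambda$, the stationary point is the global minimum, which is what the grid search illustrates.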
3.2 Chvátal’s Method
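The following proof of Theorem 2.1 is due to Chvátal [7]. As we will see below, it can be generalized to give tail bounds for the hypergeometric distribution. Let $B(n,p)$ be the random variable that gives the number of heads in $n$ independent Bernoulli trials with success probability $p$. Then,
$$\Pr[B(n,p) = l] = \binom{n}{l} p^l (1-p)^{n-l},$$
for $l = 0, \dots, n$. Thus, for any $\tau \geq 1$ and $k \geq pn$, we get
$$\Pr[B(n,p) \geq k] = \sum_{i=k}^n \binom{n}{i} p^i (1-p)^{n-i} \leq \sum_{i=0}^n \binom{n}{i} p^i (1-p)^{n-i} \tau^{i-k},$$
since $\tau^{i-k} \geq 1$ for $i \geq k$ and since all terms with $i < k$ are nonnegative. Using the Binomial theorem, we obtain
$$\Pr[B(n,p) \geq k] \leq \tau^{-k} \sum_{i=0}^n \binom{n}{i} (p\tau)^i (1-p)^{n-i} = \frac{(p\tau + 1 - p)^n}{\tau^k}.$$
If we write $k = (p+t)n$ and $\tau = e^{\lambda}$, we get
$$\Pr[B(n,p) \geq (p+t)n] \leq \left(\frac{p e^{\lambda} + 1 - p}{e^{\lambda (p+t)}}\right)^n.$$
This is the same as (1), so we can complete the proof of Theorem 2.1 as in Section 3.1.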
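The two steps of this argument, padding the tail sum with weights $\tau^{i-k}$ and collapsing it with the Binomial theorem, can be checked on a small instance. This is a sketch with hypothetical function names and illustrative parameters; it assumes the bound has the form $(p\tau + 1 - p)^n / \tau^k$ for $\tau \geq 1$.

```python
import math

def binomial_upper_tail(n: int, p: float, k: int) -> float:
    """Exact Pr[B(n, p) >= k]."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def padded_sum(n: int, p: float, k: int, tau: float) -> float:
    """Sum over all i = 0..n of the binomial terms, each weighted by tau^(i - k)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) * tau ** (i - k)
        for i in range(n + 1)
    )

def closed_form(n: int, p: float, k: int, tau: float) -> float:
    """(p * tau + 1 - p)^n / tau^k, the closed form given by the Binomial theorem."""
    return (p * tau + 1 - p) ** n / tau**k

if __name__ == "__main__":
    n, p, k = 20, 0.3, 10
    for tau in (1.0, 1.5, 7 / 3, 4.0):
        print(tau, binomial_upper_tail(n, p, k), padded_sum(n, p, k, tau), closed_form(n, p, k, tau))
```

For $p = 0.3$ and $k/n = 0.5$, the choice $\tau = 7/3$ corresponds to the optimal $e^{\lambda}$ of the moment method.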
3.3 The Impagliazzo-Kabanets Method
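The third proof is due to Impagliazzo and Kabanets [15], and it leads to a constructive version of the bound. Let $\lambda \in [0,1]$ be a parameter to be chosen later. Let $I \subseteq \{1, \dots, n\}$ be a random index set obtained by including each element $i \in \{1, \dots, n\}$ with probability $\lambda$. We estimate $\mathbf{E}\left[\prod_{i \in I} X_i\right]$ in two different ways, where the expectation is over the random choice of $X_1, \dots, X_n$ and $I$.
On the one hand, using the law of total expectation and independence, we have
$$\mathbf{E}\left[\prod_{i \in I} X_i\right] = \sum_{S \subseteq \{1, \dots, n\}} \Pr[I = S] \cdot \mathbf{E}\left[\prod_{i \in S} X_i\right] = \sum_{S \subseteq \{1, \dots, n\}} \lambda^{|S|} (1-\lambda)^{n-|S|} p^{|S|} = (\lambda p + 1 - \lambda)^n. \tag{2}$$
On the other hand, by the law of total expectation,
$$\mathbf{E}\left[\prod_{i \in I} X_i\right] \geq \mathbf{E}\left[\prod_{i \in I} X_i \,\middle|\, X \geq (p+t)n\right] \cdot \Pr[X \geq (p+t)n].$$
Now, fix $X_1, \dots, X_n$ with $X \geq (p+t)n$. For the fixed choice of $X_1, \dots, X_n$, the expectation $\mathbf{E}_I\left[\prod_{i \in I} X_i\right]$ is exactly the probability that $I$ avoids all the $n - X$ indices $i$ where $X_i = 0$. Thus, the conditional expectation is
$$\mathbf{E}\left[\prod_{i \in I} X_i \,\middle|\, X \geq (p+t)n\right] \geq (1-\lambda)^{n - (p+t)n} = (1-\lambda)^{(1-p-t)n},$$
so
$$\mathbf{E}\left[\prod_{i \in I} X_i\right] \geq (1-\lambda)^{(1-p-t)n} \cdot \Pr[X \geq (p+t)n].$$
Combining with (2),
$$\Pr[X \geq (p+t)n] \leq \left(\frac{\lambda p + 1 - \lambda}{(1-\lambda)^{1-p-t}}\right)^n. \tag{3}$$
Using calculus, we get that the right hand side is minimized for $\lambda = \frac{t}{(1-p)(p+t)}$ (note that $\lambda \leq 1$ for $t \leq 1-p$). Plugging this into (3),
$$\Pr[X \geq (p+t)n] \leq \left[\left(\frac{p}{p+t}\right)^{p+t} \left(\frac{1-p}{1-p-t}\right)^{1-p-t}\right]^n = e^{-D_{\mathrm{KL}}(p+t \| p)\, n},$$
as desired.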
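Both ingredients of this proof are easy to check on a small example. The sketch below (hypothetical names, illustrative parameters) first confirms by brute-force enumeration over all index sets that the expectation of the random product equals $(\lambda p + 1 - \lambda)^n$, and then checks that the choice $\lambda = t/((1-p)(p+t))$ turns the base $(\lambda p + 1 - \lambda)/(1-\lambda)^{1-p-t}$ into $e^{-D_{\mathrm{KL}}(p+t \| p)}$.

```python
import math
from itertools import combinations

def expectation_by_enumeration(n: int, p: float, lam: float) -> float:
    """Sum over all index sets S of Pr[I = S] * E[prod_{i in S} X_i]."""
    total = 0.0
    for size in range(n + 1):
        for _subset in combinations(range(n), size):
            # Pr[I = S] = lam^|S| (1 - lam)^(n - |S|), and E[prod X_i] = p^|S|.
            total += lam**size * (1 - lam) ** (n - size) * p**size
    return total

def ik_base(lam: float, p: float, t: float) -> float:
    """Per-trial base of the bound: (lam p + 1 - lam) / (1 - lam)^(1 - p - t)."""
    return (lam * p + 1 - lam) / (1 - lam) ** (1 - p - t)

def ik_optimal_lambda(p: float, t: float) -> float:
    """Minimizer lam = t / ((1 - p)(p + t)), assuming 0 < p < 1 and 0 < t < 1 - p."""
    return t / ((1 - p) * (p + t))

def kl_base(p: float, t: float) -> float:
    """exp(-D_KL(p+t || p)) as the product that appears in the proof."""
    a = p + t
    return (p / a) ** a * ((1 - p) / (1 - a)) ** (1 - a)

if __name__ == "__main__":
    print(expectation_by_enumeration(8, 0.3, 0.4), (0.4 * 0.3 + 0.6) ** 8)
    lam_star = ik_optimal_lambda(0.3, 0.2)
    print(lam_star, ik_base(lam_star, 0.3, 0.2), kl_base(0.3, 0.2))
```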
3.4 The Encoding Argument
The next proof stems from discussions with Luc Devroye, Gábor Lugosi, and Pat Morin, and it is inspired by an encoding argument [20]. A similar argument can also be derived from Xinjia Chen’s likelihood ratio method [5]. Let $\{0,1\}^n$ be the set of all bit strings of length $n$, and let $w : \{0,1\}^n \to [0,1]$ be a weight function. We call $w$ valid if $\sum_{x \in \{0,1\}^n} w(x) \leq 1$. The following lemma says that for any probability distribution $p_x$ on $\{0,1\}^n$, a valid weight function is unlikely to be substantially larger than $p_x$.
Lemma 3.1.
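Let $\mathcal{D}$ be a probability distribution on $\{0,1\}^n$ that assigns to each $x \in \{0,1\}^n$ a probability $p_x$, and let $w$ be a valid weight function. For any $s \geq 1$, we have
$$\Pr_{x \sim \mathcal{D}}[w(x) \geq s p_x] \leq 1/s.$$
Proof.
Let $Z_s = \{x \in \{0,1\}^n \mid w(x) \geq s p_x\}$. We have
$$\Pr_{x \sim \mathcal{D}}[w(x) \geq s p_x] = \sum_{x \in Z_s,\, p_x > 0} p_x \leq \sum_{x \in Z_s,\, p_x > 0} p_x \cdot \frac{w(x)}{s p_x} \leq \frac{1}{s} \sum_{x \in Z_s} w(x) \leq \frac{1}{s},$$
since $w(x)/(s p_x) \geq 1$ for $x \in Z_s$, $p_x > 0$, and since $w$ is valid. ∎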
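We now show that Lemma 3.1 implies Theorem 2.1. For this, we interpret the sequence $X_1, \dots, X_n$ as a bit string of length $n$. This induces a probability distribution $\mathcal{D}$ that assigns to each $x \in \{0,1\}^n$ the probability $p_x = p^{k_x} (1-p)^{n-k_x}$, where $k_x$ denotes the number of $1$-bits in $x$. We define a weight function $w$ by $w(x) = (p+t)^{k_x} (1-p-t)^{n-k_x}$, for $x \in \{0,1\}^n$. Then $w$ is valid, since $w(x)$ is the probability that $x$ is generated by setting each bit to $1$ independently with probability $p+t$. For $x \in \{0,1\}^n$, we have
$$\frac{w(x)}{p_x} = \left(\frac{p+t}{p}\right)^{k_x} \left(\frac{1-p-t}{1-p}\right)^{n-k_x}.$$
Since $\frac{p+t}{p} \cdot \frac{1-p}{1-p-t} \geq 1$, it follows that $w(x)/p_x$ is an increasing function of $k_x$. Hence, if $k_x \geq (p+t)n$, we have
$$\frac{w(x)}{p_x} \geq \left[\left(\frac{p+t}{p}\right)^{p+t} \left(\frac{1-p-t}{1-p}\right)^{1-p-t}\right]^n = e^{D_{\mathrm{KL}}(p+t \| p)\, n}.$$
We now apply Lemma 3.1 to $\mathcal{D}$ and $w$ to get
$$\Pr[X \geq (p+t)n] = \Pr_{x \sim \mathcal{D}}[k_x \geq (p+t)n] \leq \Pr_{x \sim \mathcal{D}}\left[w(x) \geq e^{D_{\mathrm{KL}}(p+t \| p)\, n}\, p_x\right] \leq e^{-D_{\mathrm{KL}}(p+t \| p)\, n},$$
as claimed in Theorem 2.1.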
See the survey [20] for a more thorough discussion of how this proof is related to coding theory.
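For small $n$, the lemma on valid weight functions can be verified by enumerating all bit strings. The following sketch assumes the weight function $w(x) = (p+t)^{k_x}(1-p-t)^{n-k_x}$ used in the argument above, where $k_x$ is the number of $1$-bits of $x$; the function name and the parameter values are hypothetical.

```python
import math
from itertools import product

def encoding_lemma_check(n: int, p: float, t: float, s: float):
    """Return (Pr_{x ~ D}[w(x) >= s * p_x], sum of all weights w(x))."""
    prob_mass = 0.0
    weight_sum = 0.0
    for bits in product((0, 1), repeat=n):
        k = sum(bits)  # number of 1-bits in x
        p_x = p**k * (1 - p) ** (n - k)            # probability of x under D
        w_x = (p + t) ** k * (1 - p - t) ** (n - k)  # weight of x
        weight_sum += w_x
        if w_x >= s * p_x:
            prob_mass += p_x
    return prob_mass, weight_sum

if __name__ == "__main__":
    for s in (1.0, 2.0, 5.0, 50.0):
        mass, total_weight = encoding_lemma_check(10, 0.3, 0.2, s)
        print(s, mass, 1 / s, total_weight)
```

The weights sum to $1$ because they form the distribution of $n$ independent bits with success probability $p+t$, so the weight function is valid and the printed mass should never exceed $1/s$.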
3.5 A Proof via Differential Privacy
The fifth proof of Chernoff’s bound is due to Steinke and Ullman [22], and it uses methods from the theory of differential privacy [11]. Unlike the previous four proofs, it seems to lead to a slightly weaker version of the bound. Let be a parameter to be determined later. The main idea is to bound the expectation of independent copies of .
Lemma 3.2.
Let and . Let be independent copies of , and set . Then,
We will give a proof of Lemma 3.2 below. First, however, we will see how we can use Lemma 3.2 to derive the following weaker version of Theorem 2.1.
Footnote 2: In the published version of this paper, the proof of Theorem 3.3 is based on an incorrect application of Markov’s inequality. We have changed Lemma 3.2 so that is fixed to . This ensures that Markov’s inequality is applied to a nonnegative random variable. We thank Natalia Shenkman for pointing this out to us.
Theorem 3.3.
Let , , and let be independent random variables with and , for . Set . Then, for any , we have
Proof.
We may assume that , since otherwise the theorem holds trivially. Set . Let be independent copies of and let . Then,
(4) |
On the other hand, Markov’s inequality gives
by Lemma 3.2. Thus, setting , and combining with (4), we get
since . Now the theorem follows from
which holds as , as is decreasing for , and as . ∎
It remains to prove Lemma 3.2. For this, we use an idea from differential privacy. Let , , be an -matrix with entries from . For a given parameter , we define a random variable with values in as follows: for , let be the sum of the entries in the -th row of . Set
Then, for , we define
The random variable is called a stable selector for (see the work by McSherry and Talwar [18] for more background). The next lemma states two interesting properties for . For a matrix , a vector , and a number we denote by the matrix obtained from by replacing the -th column of with .
Lemma 3.4.
Let be an matrix with entries in . We have
- Stability: For every vector and every ,
- Accuracy: Let be the sum of the -th row of . Then,
Proof.
Stability: for , let be the sum of the -th row of , and let be the sum of the -th row of . Since and differ in one column, and since the entries are from , we have . Hence,
and
as claimed.
Accuracy: The inequality is obvious. For the second inequality, we observe that by definition,
Thus,
since and since is a convex function. ∎
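The formulas defining the stable selector are not legible in this copy, so the following sketch rests on an assumption: that the selector is the standard exponential mechanism of McSherry and Talwar [18], which picks row $i$ with probability proportional to $e^{\eta a_i}$, where $a_i$ is the $i$-th row sum and $\eta > 0$ is the parameter. Under this assumption, replacing one $0/1$ column changes every selection probability by a factor of at most $e^{2\eta}$, and the expected selected row sum is at least $\max_i a_i - (\ln m)/\eta$. All function names are hypothetical.

```python
import math
import random

def row_sums(matrix):
    """Sum of the entries in each row."""
    return [sum(row) for row in matrix]

def selector_distribution(matrix, eta):
    """Assumed selector: Pr[S = i] = exp(eta * a_i) / sum_j exp(eta * a_j)."""
    sums = row_sums(matrix)
    top = max(sums)
    # Shifting by the maximum leaves the ratios unchanged and avoids overflow.
    weights = [math.exp(eta * (a - top)) for a in sums]
    total = sum(weights)
    return [w / total for w in weights]

def expected_selected_row_sum(matrix, eta):
    """Expected row sum of the row picked by the selector."""
    probs = selector_distribution(matrix, eta)
    return sum(q * a for q, a in zip(probs, row_sums(matrix)))

def replace_column(matrix, column, j):
    """Copy of the matrix with column j replaced by the given vector."""
    return [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(matrix)]

if __name__ == "__main__":
    random.seed(0)
    m, n, eta = 5, 12, 0.5
    A = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
    flipped = [1 - A[i][3] for i in range(m)]
    B = replace_column(A, flipped, 3)
    print(selector_distribution(A, eta))
    print(selector_distribution(B, eta))
    print(max(row_sums(A)), expected_selected_row_sum(A, eta))
```

A larger $\eta$ improves accuracy at the cost of stability, which is the trade-off the proof balances when it chooses the parameter.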
Lemma 3.4 shows that constitutes a reasonable mechanism of estimating the maximum row sum of without revealing too much information about any single column of . We can now use Lemma 3.4 to bound the expectation of the maximum of independent copies of and .
Lemma 3.5.
Let . Let be independent copies of , and set . Then, for any , we have
Proof.
Let be independent copies of , and let ; let be independent copies of and let ; and so on. We consider the random matrix whose entry in row and column is . Then, we can write , for . By the accuracy claim in Lemma 3.4,
(5) |
Now we bound . We unwrap the expectation for and get
Let be an independent copy of . Denote the entry in the -th row and -th column of by , and set , for . By the stability claim in Lemma 3.4, for every ,
Since the random variables , , , , are independent, the pairs and have the same distribution. Therefore, we can write
We can conclude the lemma by plugging this bound into (5). ∎
4 Useful Consequences
We now show several useful consequences of Theorem 2.1. These results can be derived directly from Theorem 2.1, and therefore they also hold for variants of the theorem with slightly different assumptions.
4.1 The Lower Tail
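First, we show that an analogous bound holds for the lower tail probability $\Pr[X \leq (p-t)n]$.
Corollary 4.1.
Let $X_1, \dots, X_n$ be independent random variables with $X_i \in \{0,1\}$ and $\Pr[X_i = 1] = p$, for $i = 1, \dots, n$. Set $X := \sum_{i=1}^n X_i$. Then, for any $t \in [0, p]$, we have
$$\Pr[X \leq (p-t)n] \leq e^{-D_{\mathrm{KL}}(p-t \| p)\, n}.$$
Proof.
By Theorem 2.1,
$$\Pr[X \leq (p-t)n] = \Pr[n - X \geq (1-p+t)n] = \Pr[X' \geq (1-p+t)n] \leq e^{-D_{\mathrm{KL}}(1-p+t \| 1-p)\, n},$$
where $X' = \sum_{i=1}^n X_i'$ with independent random variables $X_i' \in \{0,1\}$ such that $\Pr[X_i' = 1] = 1-p$. The result follows from $D_{\mathrm{KL}}(1-p+t \| 1-p) = D_{\mathrm{KL}}(p-t \| p)$. ∎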
4.2 Multiplicative Version
Next, we derive a multiplicative variant of Theorem 2.1. This well-known version of the bound can be found in the classic text by Motwani and Raghavan [21].
Corollary 4.2.
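Let $X_1, \dots, X_n$ be independent random variables with $X_i \in \{0,1\}$ and $\Pr[X_i = 1] = p$, for $i = 1, \dots, n$. Set $X := \sum_{i=1}^n X_i$ and $\mu := pn$. Then, for any $\delta \geq 0$, we have
$$\Pr[X \geq (1+\delta)\mu] \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu},$$
and, for any $\delta \in [0,1)$,
$$\Pr[X \leq (1-\delta)\mu] \leq \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^{\mu}.$$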
4.3 Useful Variants
The next few corollaries give some handy variants of the bound that are often more manageable in practice. First, we give a simple bound for the multiplicative lower tail.
Corollary 4.3.
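Let $X_1, \dots, X_n$ be independent random variables with $X_i \in \{0,1\}$ and $\Pr[X_i = 1] = p$, for $i = 1, \dots, n$. Set $X := \sum_{i=1}^n X_i$ and $\mu := pn$. Then, for any $\delta \in (0,1)$, we have
$$\Pr[X \leq (1-\delta)\mu] \leq e^{-\delta^2 \mu / 2}.$$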
Proof.
An only slightly more complicated bound can be found for the multiplicative upper tail.
Corollary 4.4.
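Let $X_1, \dots, X_n$ be independent random variables with $X_i \in \{0,1\}$ and $\Pr[X_i = 1] = p$, for $i = 1, \dots, n$. Set $X := \sum_{i=1}^n X_i$ and $\mu := pn$. Then, for any $\delta \geq 0$, we have
$$\Pr[X \geq (1+\delta)\mu] \leq e^{-\min\{\delta^2, \delta\}\, \mu / 4}.$$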
Proof.
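We may assume that $(1+\delta)p \leq 1$. Then, Theorem 2.1 gives
$$\Pr[X \geq (1+\delta)\mu] \leq e^{-D_{\mathrm{KL}}((1+\delta)p \| p)\, n}.$$
Define $f(\delta) := D_{\mathrm{KL}}((1+\delta)p \| p)$. Then,
$$f'(\delta) = p \ln(1+\delta) - p \ln\left(1 - \frac{\delta p}{1-p}\right)$$
and
$$f''(\delta) = \frac{p}{(1+\delta)(1-p-\delta p)} \geq \frac{p}{1+\delta}.$$
By Taylor’s theorem, we have
$$f(\delta) = f(0) + \delta f'(0) + \frac{\delta^2}{2} f''(\xi),$$
for some $\xi \in [0, \delta]$. Since $f(0) = f'(0) = 0$, it follows that
$$f(\delta) = \frac{\delta^2}{2} f''(\xi) \geq \frac{\delta^2 p}{2(1+\xi)} \geq \frac{\delta^2 p}{2(1+\delta)}.$$
For $\delta \geq 1$, we have $\delta/(\delta+1) \geq 1/2$; for $\delta < 1$, we have $1/(\delta+1) \geq 1/2$. This gives, for all $\delta \geq 0$,
$$f(\delta) \geq \min\{\delta^2, \delta\}\, p / 4,$$
and the claim follows. ∎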
The following corollary combines the two bounds. This variant can be found, e.g., in the book by Arora and Barak [2].
Corollary 4.5.
Let be independent random variables with and , for . Set and . Then, for any , we have
The following corollary, which appears, e.g., in the book by Motwani and Raghavan [21], is also sometimes useful.
Corollary 4.6.
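Let $X_1, \dots, X_n$ be independent random variables with $X_i \in \{0,1\}$ and $\Pr[X_i = 1] = p$, for $i = 1, \dots, n$. Set $X := \sum_{i=1}^n X_i$ and $\mu := pn$. For $t \geq 2e\mu$, we have
$$\Pr[X \geq t] \leq 2^{-t}.$$
Proof.
By Corollary 4.2,
$$\Pr[X \geq (1+\delta)\mu] \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu} \leq \left(\frac{e}{1+\delta}\right)^{(1+\delta)\mu}.$$
For $\delta \geq 2e - 1$, the denominator in the right hand side is at least $2e$, and the claim follows. ∎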
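The simplified bounds of this section trade tightness for convenience. The sketch below compares them with exact binomial tails on one illustrative instance. It assumes the standard forms $\left(e^{\delta}/(1+\delta)^{1+\delta}\right)^{\mu}$ and $e^{-\min\{\delta^2,\delta\}\mu/4}$ for the upper tail, $e^{-\delta^2\mu/2}$ for the lower tail, and $2^{-t}$ for $t \geq 2e\mu$; the function names are hypothetical.

```python
import math

def upper_tail(n: int, p: float, k: int) -> float:
    """Exact Pr[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def lower_tail(n: int, p: float, k: int) -> float:
    """Exact Pr[X <= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))

def multiplicative_upper(mu: float, delta: float) -> float:
    """(e^delta / (1 + delta)^(1 + delta))^mu."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def simple_upper(mu: float, delta: float) -> float:
    """exp(-min(delta^2, delta) * mu / 4)."""
    return math.exp(-min(delta**2, delta) * mu / 4)

def simple_lower(mu: float, delta: float) -> float:
    """exp(-delta^2 * mu / 2), meant for 0 <= delta <= 1."""
    return math.exp(-(delta**2) * mu / 2)

if __name__ == "__main__":
    n, p = 200, 0.1
    mu = p * n
    for k in (25, 30, 40, 60):
        delta = k / mu - 1
        print(k, upper_tail(n, p, k), multiplicative_upper(mu, delta), simple_upper(mu, delta))
    for k in (15, 10, 5):
        delta = 1 - k / mu
        print(k, lower_tail(n, p, k), simple_lower(mu, delta))
```

On this instance the exact tail, the multiplicative bound, and the simplified bound appear in increasing order, which reflects how each corollary is derived from the previous one.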
5 Generalizations
We mention a few generalizations of the proof techniques from Section 3. Since the consequences from Section 4 are based on simple algebraic manipulation of the bounds, the same consequences also hold for the generalized settings.
5.1 Hoeffding Extension
The moment method (Section 3.1) yields many generalizations of Theorem 2.1. The following result is known as Hoeffding’s extension [14]. It shows that the $X_i$ can actually be chosen to be continuous with varying expectations.
Theorem 5.1.
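Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [0,1]$ and $\mathbf{E}[X_i] = p_i$. Set $X := \sum_{i=1}^n X_i$ and $p := (1/n) \sum_{i=1}^n p_i$. Then, for any $t \in [0, 1-p]$, we have
$$\Pr[X \geq (p+t)n] \leq e^{-D_{\mathrm{KL}}(p+t \| p)\, n}.$$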
Proof.
5.2 Hypergeometric Distribution
Chvátal’s proof [7] from Section 3.2 generalizes to the hypergeometric distribution. We emphasize once again that this means that all the corollaries from Section 4 also apply to this case.
Theorem 5.2.
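Suppose we have an urn with $N$ balls, $P$ of which are red. We randomly draw $n$ balls from the urn without replacement. Let $H(N,P,n)$ denote the number of red balls in the sample. Set $p := P/N$. Then, for any $t \in [0, 1-p]$, we have
$$\Pr[H(N,P,n) \geq (p+t)n] \leq e^{-D_{\mathrm{KL}}(p+t \| p)\, n}.$$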
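As a sanity check of the hypergeometric statement, the sketch below compares the exact upper tail for sampling without replacement with the bound $e^{-D_{\mathrm{KL}}(k/n \| p) n}$, where $p$ is the fraction of red balls. The symbols `N`, `P`, `n` (urn size, number of red balls, sample size) and all function names are assumptions chosen for this illustration.

```python
import math

def hypergeometric_upper_tail(N: int, P: int, n: int, k: int) -> float:
    """Exact Pr[at least k red balls] when drawing n of N balls, P of them red."""
    favorable = sum(math.comb(P, l) * math.comb(N - P, n - l) for l in range(k, n + 1))
    return favorable / math.comb(N, n)

def kl_bernoulli(a: float, b: float) -> float:
    """D_KL(a || b) for two-point distributions, with 0 * ln(0 / q) = 0."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total

def kl_tail_bound(n: int, p: float, k: int) -> float:
    """exp(-D_KL(k/n || p) * n), meant for k >= p * n."""
    return math.exp(-kl_bernoulli(k / n, p) * n)

if __name__ == "__main__":
    N, P, n = 50, 15, 20
    p = P / N
    for k in (6, 8, 10, 12, 15):
        print(k, hypergeometric_upper_tail(N, P, n, k), kl_tail_bound(n, p, k))
```

Sampling without replacement concentrates at least as well as sampling with replacement, so the exact tail is typically well below the bound.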
Proof.
It is well known that
$$\Pr[H(N,P,n) = l] = \binom{P}{l} \binom{N-P}{n-l} \binom{N}{n}^{-1},$$
for $l = 0, \dots, n$.
Claim 5.3.
For every , we have
Proof.
Consider the following random experiment: take a random permutation of the balls in the urn. Let be the sequence of the first elements in the permutation. Let be the number of -subsets of that contain only red balls. We compute in two different ways. On the one hand,
(7) |
On the other hand, let with . Then the probability that all the balls in the positions indexed by are red is
Thus, by linearity of expectation . Together with (7), the claim follows. ∎
Claim 5.4.
For every , we have
Proof.
5.3 Negative Correlations
The proof by Impagliazzo and Kabanets [15] from Section 3.3 can be used to relax the independence assumption. It now suffices that the random variables are negatively correlated.
Theorem 5.5.
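Let $X_1, \dots, X_n$ be random variables with $X_i \in \{0,1\}$. Suppose there exist $p_i \in [0,1]$, $i = 1, \dots, n$, such that for every index set $I \subseteq \{1, \dots, n\}$, we have $\Pr\left[\bigwedge_{i \in I} X_i = 1\right] \leq \prod_{i \in I} p_i$. Set $X := \sum_{i=1}^n X_i$ and $p := (1/n) \sum_{i=1}^n p_i$. Then, for any $t \in [0, 1-p]$, we have
$$\Pr[X \geq (p+t)n] \leq e^{-D_{\mathrm{KL}}(p+t \| p)\, n}.$$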
Proof.
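Let $\lambda \in [0,1]$ be a parameter to be chosen later. Let $I \subseteq \{1, \dots, n\}$ be a random index set obtained by including each element $i \in \{1, \dots, n\}$ with probability $\lambda$. As before, we estimate the expectation $\mathbf{E}\left[\prod_{i \in I} X_i\right]$ in two different ways, where the expectation is over the random choice of $X_1, \dots, X_n$ and $I$. Similarly to before,
$$\mathbf{E}\left[\prod_{i \in I} X_i\right] = \sum_{S \subseteq \{1, \dots, n\}} \Pr[I = S] \cdot \mathbf{E}\left[\prod_{i \in S} X_i\right] \leq \sum_{S \subseteq \{1, \dots, n\}} \lambda^{|S|} (1-\lambda)^{n-|S|} \prod_{i \in S} p_i = \prod_{i=1}^n (\lambda p_i + 1 - \lambda) \leq (\lambda p + 1 - \lambda)^n, \tag{8}$$
by the arithmetic-geometric mean inequality. The proof of the lower bound remains unchanged and yields
$$\mathbf{E}\left[\prod_{i \in I} X_i\right] \geq (1-\lambda)^{(1-p-t)n} \cdot \Pr[X \geq (p+t)n],$$
as before. Combining with (8) and optimizing for $\lambda$ finishes the proof, see Section 3.3. ∎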
Acknowledgments.
This survey is based on lecture notes for a class on advanced algorithms at Freie Universität Berlin. I would like to thank all the students who took this class for their interest and participation. I would also like to thank Nabil Mustafa and Jonathan Ullman for valuable comments that improved this survey.
References
- [1] N. Alon and J. Spencer. The Probabilistic Method. Wiley-Interscience, 2016.
- [2] S. Arora and B. Barak. Computational Complexity – A Modern Approach. Cambridge University Press, 2009.
- [3] K. Azuma. Weighted sums of certain dependent random variables. Tôhoku Math. J. (2), 19:357–367, 1967.
- [4] S. N. Bernstein. Sobranie Sochinenii [Collected Works]. Nauka, Moscow, 1964.
- [5] X. Chen. A likelihood ratio approach for probabilistic inequalities. arXiv:1308.4123, 2013.
- [6] F. R. K. Chung and L. Lu. Concentration inequalities and martingale inequalities: A survey. Internet Mathematics, 3(1):79–127, 2006.
- [7] V. Chvátal. The tail of the hypergeometric distribution. Discrete Mathematics, 25(3):285–287, 1979.
- [8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
- [9] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, 2nd edition, 2006.
- [10] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.
- [11] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
- [12] O. Goldreich. Computational complexity – a conceptual perspective. Cambridge University Press, 2008.
- [13] T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Inform. Process. Lett., 33(6):305–308, 1990.
- [14] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.
- [15] R. Impagliazzo and V. Kabanets. Constructive proofs of concentration bounds. In Proc. 13th Int. Conf. Approx. (APPROX) and 14th Int. Conf. Rand. Comb. Opt. (RANDOM), pages 617–631, 2010.
- [16] J. M. Kleinberg and É. Tardos. Algorithm design. Addison-Wesley, 2006.
- [17] C. McDiarmid. Concentration. In Probabilistic methods for algorithmic discrete mathematics, volume 16 of Algorithms Combin., pages 195–248. Springer-Verlag, 1998.
- [18] F. McSherry and K. Talwar. Mechanism design via differential privacy. In Proc. 48th Annu. IEEE Symp. Found. Comput. Sci. (FOCS), pages 94–103, 2007.
- [19] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2nd edition, 2017.
- [20] P. Morin, W. Mulzer, and T. Reddad. Encoding arguments. ACM Comput. Surv., 50(3):46:1–46:36, 2017.
- [21] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
- [22] T. Steinke and J. Ullman. Subgaussian tail bounds via stability arguments. arXiv:1701.03493, 2017.