Grinstead and Snell's Introduction to Probability
The CHANCE Project
Version dated 4 July 2006

Copyright (C) 2006 Peter G. Doyle. This work is a version of Grinstead and Snell's 'Introduction to Probability, 2nd edition', published by the American Mathematical Society, Copyright (C) 2003 Charles M. Grinstead and J. Laurie Snell. This work is freely redistributable under the terms of the GNU Free Documentation License.

To our wives and in memory of Reese T. Prosser

Contents

Preface
1 Discrete Probability Distributions
  1.1 Simulation of Discrete Probabilities
  1.2 Discrete Probability Distributions
2 Continuous Probability Densities
  2.1 Simulation of Continuous Probabilities
  2.2 Continuous Density Functions
3 Combinatorics
  3.1 Permutations
  3.2 Combinations
  3.3 Card Shuffling
4 Conditional Probability
  4.1 Discrete Conditional Probability
  4.2 Continuous Conditional Probability
  4.3 Paradoxes
5 Distributions and Densities
  5.1 Important Distributions
  5.2 Important Densities
6 Expected Value and Variance
  6.1 Expected Value
  6.2 Variance of Discrete Random Variables
  6.3 Continuous Random Variables
7 Sums of Random Variables
  7.1 Sums of Discrete Random Variables
  7.2 Sums of Continuous Random Variables
8 Law of Large Numbers
  8.1 Discrete Random Variables
  8.2 Continuous Random Variables
9 Central Limit Theorem
  9.1 Bernoulli Trials
  9.2 Discrete Independent Trials
  9.3 Continuous Independent Trials
10 Generating Functions
  10.1 Discrete Distributions
  10.2 Branching Processes
  10.3 Continuous Densities
11 Markov Chains
  11.1 Introduction
  11.2 Absorbing Markov Chains
  11.3 Ergodic Markov Chains
  11.4 Fundamental Limit Theorem
  11.5 Mean First Passage Time
12 Random Walks
  12.1 Random Walks in Euclidean Space
  12.2 Gambler's Ruin
  12.3 Arc Sine Laws
Appendices
Index

Preface

Probability theory began in seventeenth century France when the two great French mathematicians, Blaise Pascal and Pierre de Fermat, corresponded over two problems from games of chance. Problems like those Pascal and Fermat solved continued to influence such early researchers as Huygens, Bernoulli, and DeMoivre in establishing a mathematical theory of probability.
Today, probability theory is a well-established branch of mathematics that finds applications in every area of scholarly activity from music to physics, and in daily experience from weather prediction to predicting the risks of new medical treatments.

This text is designed for an introductory probability course taken by sophomores, juniors, and seniors in mathematics, the physical and social sciences, engineering, and computer science. It presents a thorough treatment of probability ideas and techniques necessary for a firm understanding of the subject. The text can be used in a variety of course lengths, levels, and areas of emphasis.

For use in a standard one-term course, in which both discrete and continuous probability are covered, students should have taken as a prerequisite two terms of calculus, including an introduction to multiple integrals. In order to cover Chapter 11, which contains material on Markov chains, some knowledge of matrix theory is necessary.

The text can also be used in a discrete probability course. The material has been organized in such a way that the discrete and continuous probability discussions are presented in a separate, but parallel, manner. This organization dispels an overly rigorous or formal view of probability and offers some strong pedagogical value in that the discrete discussions can sometimes serve to motivate the more abstract continuous probability discussions. For use in a discrete probability course, students should have taken one term of calculus as a prerequisite.

Very little computing background is assumed or necessary in order to obtain full benefits from the use of the computing material and examples in the text. All of the programs that are used in the text have been written in each of the languages TrueBASIC, Maple, and Mathematica.

This book is distributed on the Web as part of the Chance Project, which is devoted to providing materials for beginning courses in probability and statistics.
The computer programs, solutions to the odd-numbered exercises, and current errata are also available at this site. Instructors may obtain all of the solutions by writing to either of the authors, at jlsnell@dartmouth.edu and cgrinst1@swarthmore.edu.

FEATURES

Level of rigor and emphasis: Probability is a wonderfully intuitive and applicable field of mathematics. We have tried not to spoil its beauty by presenting too much formal mathematics. Rather, we have tried to develop the key ideas in a somewhat leisurely style, to provide a variety of interesting applications of probability, and to show some of the nonintuitive examples that make probability such a lively subject.

Exercises: There are over 600 exercises in the text providing plenty of opportunity for practicing skills and developing a sound understanding of the ideas. The exercise sets contain routine exercises to be done with and without the use of a computer and more theoretical exercises to improve the understanding of basic concepts. More difficult exercises are indicated by an asterisk. A solution manual for all of the exercises is available to instructors.

Historical remarks: Introductory probability is a subject in which the fundamental ideas are still closely tied to those of the founders of the subject. For this reason, there are numerous historical comments in the text, especially as they deal with the development of discrete probability.

Pedagogical use of computer programs: Probability theory makes predictions about experiments whose outcomes depend upon chance. Consequently, it lends itself beautifully to the use of computers as a mathematical tool to simulate and analyze chance experiments.

In the text the computer is utilized in several ways. First, it provides a laboratory where chance experiments can be simulated and the students can get a feeling for the variety of such experiments.
This use of the computer in probability has already been beautifully illustrated by William Feller in the second edition of his famous text An Introduction to Probability Theory and Its Applications (New York: Wiley, 1950). In the preface, Feller wrote about his treatment of fluctuation in coin tossing: "The results are so amazing and so at variance with common intuition that even sophisticated colleagues doubted that coins actually misbehave as theory predicts. The record of a simulated experiment is therefore included."

In addition to providing a laboratory for the student, the computer is a powerful aid in understanding basic results of probability theory. For example, the graphical illustration of the approximation of the standardized binomial distributions to the normal curve is a more convincing demonstration of the Central Limit Theorem than many of the formal proofs of this fundamental result.

Finally, the computer allows the student to solve problems that do not lend themselves to closed-form formulas, such as waiting times in queues. Indeed, the introduction of the computer changes the way in which we look at many problems in probability. For example, being able to calculate exact binomial probabilities for experiments up to 1000 trials changes the way we view the normal and Poisson approximations.

ACKNOWLEDGMENTS

Anyone writing a probability text today owes a great debt to William Feller, who taught us all how to make probability come alive as a subject matter. If you find an example, an application, or an exercise that you really like, it probably had its origin in Feller's classic text, An Introduction to Probability Theory and Its Applications.

We are indebted to many people for their help in this undertaking. The approach to Markov Chains presented in the book was developed by John Kemeny and the second author. Reese Prosser was a silent co-author for the material on continuous probability in an earlier version of this book.
Mark Kernighan contributed 40 pages of comments on the earlier edition. Many of these comments were very thought-provoking; in addition, they provided a student's perspective on the book. Most of the major changes in this version of the book have their genesis in these notes.

Fuxing Hou and Lee Nave provided extensive help with the typesetting and the figures. John Finn provided valuable pedagogical advice on the text and the computer programs. Karl Knaub and Jessica Sklar are responsible for the implementations of the computer programs in Mathematica and Maple. Jessica and Gang Wang assisted with the solutions.

Finally, we thank the American Mathematical Society, and in particular Sergei Gelfand and John Ewing, for their interest in this book; their help in its production; and their willingness to make the work freely redistributable.

Chapter 1
Discrete Probability Distributions

1.1 Simulation of Discrete Probabilities

Probability

In this chapter, we shall first consider chance experiments with a finite number of possible outcomes $\omega_1, \omega_2, \ldots, \omega_n$. For example, we roll a die and the possible outcomes are 1, 2, 3, 4, 5, 6 corresponding to the side that turns up. We toss a coin with possible outcomes H (heads) and T (tails).

It is frequently useful to be able to refer to an outcome of an experiment. For example, we might want to write the mathematical expression which gives the sum of four rolls of a die. To do this, we could let $X_i$, $i = 1, 2, 3, 4$, represent the values of the outcomes of the four rolls, and then we could write the expression $X_1 + X_2 + X_3 + X_4$ for the sum of the four rolls. The $X_i$'s are called random variables. A random variable is simply an expression whose value is the outcome of a particular experiment. Just as in the case of other types of variables in mathematics, random variables can take on different values.

Let $X$ be the random variable which represents the roll of one die.
We shall assign probabilities to the possible outcomes of this experiment. We do this by assigning to each outcome $\omega_j$ a nonnegative number $m(\omega_j)$ in such a way that $m(\omega_1) + m(\omega_2) + \cdots + m(\omega_6) = 1$. The function $m(\omega_j)$ is called the distribution function of the random variable $X$. For the case of the roll of the die we would assign equal probabilities or probabilities 1/6 to each of the outcomes. With this assignment of probabilities, one could write $P(X \le 4) = \frac{2}{3}$ to mean that the probability is 2/3 that a roll of a die will have a value which does not exceed 4.

Let $Y$ be the random variable which represents the toss of a coin. In this case, there are two possible outcomes, which we can label as H and T. Unless we have reason to suspect that the coin comes up one way more often than the other way, it is natural to assign the probability of 1/2 to each of the two outcomes.

In both of the above experiments, each outcome is assigned an equal probability. This would certainly not be the case in general. For example, if a drug is found to be effective 30 percent of the time it is used, we might assign a probability .3 that the drug is effective the next time it is used and .7 that it is not effective. This last example illustrates the intuitive frequency concept of probability. That is, if we have a probability $p$ that an experiment will result in outcome $A$, then if we repeat this experiment a large number of times we should expect that the fraction of times that $A$ will occur is about $p$. To check intuitive ideas like this, we shall find it helpful to look at some of these problems experimentally. We could, for example, toss a coin a large number of times and see if the fraction of times heads turns up is about 1/2. We could also simulate this experiment on a computer.

Simulation

We want to be able to perform an experiment that corresponds to a given set of probabilities; for example, $m(\omega_1) = 1/2$, $m(\omega_2) = 1/3$, and $m(\omega_3) = 1/6$.
In this case, one could mark three faces of a six-sided die with an $\omega_1$, two faces with an $\omega_2$, and one face with an $\omega_3$.

In the general case we assume that $m(\omega_1), m(\omega_2), \ldots, m(\omega_n)$ are all rational numbers, with least common denominator $n$. If $n > 2$, we can imagine a long cylindrical die with a cross-section that is a regular $n$-gon. If $m(\omega_j) = n_j/n$, then we can label $n_j$ of the long faces of the cylinder with an $\omega_j$, and if one of the end faces comes up, we can just roll the die again. If $n = 2$, a coin could be used to perform the experiment.

We will be particularly interested in repeating a chance experiment a large number of times. Although the cylindrical die would be a convenient way to carry out a few repetitions, it would be difficult to carry out a large number of experiments. Since the modern computer can do a large number of operations in a very short time, it is natural to turn to the computer for this task.

Random Numbers

We must first find a computer analog of rolling a die. This is done on the computer by means of a random number generator. Depending upon the particular software package, the computer can be asked for a real number between 0 and 1, or an integer in a given set of consecutive integers. In the first case, the real numbers are chosen in such a way that the probability that the number lies in any particular subinterval of this unit interval is equal to the length of the subinterval. In the second case, each integer has the same probability of being chosen.

.203309  .762057  .151121  .623868
.932052  .415178  .716719  .967412
.069664  .670982  .352320  .049723
.750216  .784810  .089734  .966730
.946708  .380365  .027381  .900794

Table 1.1: Sample output of the program RandomNumbers.

Let $X$ be a random variable with distribution function $m(\omega)$, where $\omega$ is in the set $\{\omega_1, \omega_2, \omega_3\}$, and $m(\omega_1) = 1/2$, $m(\omega_2) = 1/3$, and $m(\omega_3) = 1/6$.
If our computer package can return a random integer in the set $\{1, 2, \ldots, 6\}$, then we simply ask it to do so, and make 1, 2, and 3 correspond to $\omega_1$, 4 and 5 correspond to $\omega_2$, and 6 correspond to $\omega_3$. If our computer package returns a random real number $r$ in the interval $(0, 1)$, then the expression $\lfloor 6r \rfloor + 1$ will be a random integer between 1 and 6. (The notation $\lfloor x \rfloor$ means the greatest integer not exceeding $x$, and is read "floor of $x$.")

The method by which random real numbers are generated on a computer is described in the historical discussion at the end of this section. The following example gives sample output of the program RandomNumbers.

Example 1.1 (Random Number Generation) The program RandomNumbers generates $n$ random real numbers in the interval $[0, 1]$, where $n$ is chosen by the user. When we ran the program with $n = 20$, we obtained the data shown in Table 1.1. □

Example 1.2 (Coin Tossing) As we have noted, our intuition suggests that the probability of obtaining a head on a single toss of a coin is 1/2. To have the computer toss a coin, we can ask it to pick a random real number in the interval $[0, 1]$ and test to see if this number is less than 1/2. If so, we shall call the outcome heads; if not we call it tails. Another way to proceed would be to ask the computer to pick a random integer from the set $\{0, 1\}$. The program CoinTosses carries out the experiment of tossing a coin $n$ times. Running this program, with $n = 20$, resulted in:

THTTTHTTTTHTTTTTHHTT.

Note that in 20 tosses, we obtained 5 heads and 15 tails. Let us toss a coin $n$ times, where $n$ is much larger than 20, and see if we obtain a proportion of heads closer to our intuitive guess of 1/2. The program CoinTosses keeps track of the number of heads. When we ran this program with $n = 1000$, we obtained 494 heads. When we ran it with $n = 10000$, we obtained 5039 heads.
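The two recipes just described, turning a random real number $r$ into a die face with $\lfloor 6r \rfloor + 1$ and calling a toss heads when $r < 1/2$, can be sketched in a few lines. The sketch below is in Python, which is not one of the languages the text's programs were written in, and it is not the program CoinTosses itself; the function names, the seed, and the optional parameter `p` are assumptions made for illustration.

```python
import math
import random


def random_die_face(rng):
    """Map a random real r in [0, 1) to an integer from 1 to 6 via floor(6r) + 1."""
    return math.floor(6 * rng.random()) + 1


def coin_tosses(n, rng, p=0.5):
    """Toss a coin n times; a toss is heads when the random real is less than p."""
    return "".join("H" if rng.random() < p else "T" for _ in range(n))


rng = random.Random(2006)  # arbitrary seed, chosen only for repeatability
for n in (20, 1000, 10000):
    tosses = coin_tosses(n, rng)
    heads = tosses.count("H")
    print(n, heads, heads / n)  # number of tosses, number of heads, proportion
```

The counts printed will differ from the 494 and 5039 heads reported above, since they depend on the random number generator and the seed.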
We notice that when we tossed the coin 10,000 times, the proportion of heads was close to the "true value" .5 for obtaining a head when a coin is tossed. A mathematical model for this experiment is called Bernoulli Trials (see Chapter 3). The Law of Large Numbers, which we shall study later (see Chapter 8), will show that in the Bernoulli Trials model, the proportion of heads should be near .5, consistent with our intuitive idea of the frequency interpretation of probability.

Of course, our program could be easily modified to simulate coins for which the probability of a head is $p$, where $p$ is a real number between 0 and 1. □

In the case of coin tossing, we already knew the probability of the event occurring on each experiment. The real power of simulation comes from the ability to estimate probabilities when they are not known ahead of time. This method has been used in the recent discoveries of strategies that make the casino game of blackjack favorable to the player. We illustrate this idea in a simple situation in which we can compute the true probability and see how effective the simulation is.

Example 1.3 (Dice Rolling) We consider a dice game that played an important role in the historical development of probability. The famous letters between Pascal and Fermat, which many believe started a serious study of probability, were instigated by a request for help from a French nobleman and gambler, Chevalier de Méré. It is said that de Méré had been betting that, in four rolls of a die, at least one six would turn up. He was winning consistently and, to get more people to play, he changed the game to bet that, in 24 rolls of two dice, a pair of sixes would turn up. It is claimed that de Méré lost with 24 and felt that 25 rolls were necessary to make the game favorable. It was un grand scandale that mathematics was wrong.

We shall try to see if de Méré is correct by simulating his various bets.
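One way such a simulation might be set up is sketched below in Python. This is not the text's own program; the function names, the seed, and the numbers of plays are assumptions chosen for illustration. Each play of the first bet rolls one die four times, and each play of the second bet rolls two dice a given number of times.

```python
import random


def de_mere_first_bet(rng):
    """One play: True if at least one six turns up in four rolls of a die."""
    return any(rng.randint(1, 6) == 6 for _ in range(4))


def de_mere_second_bet(rng, rolls=24):
    """One play: True if a pair of sixes turns up in `rolls` rolls of two dice."""
    for _ in range(rolls):
        first, second = rng.randint(1, 6), rng.randint(1, 6)
        if first == 6 and second == 6:
            return True
    return False


def win_fraction(play, plays, rng):
    """Fraction of `plays` independent plays that are won."""
    return sum(play(rng) for _ in range(plays)) / plays


rng = random.Random(1654)  # arbitrary seed
print(win_fraction(de_mere_first_bet, 10000, rng))
print(win_fraction(de_mere_second_bet, 10000, rng))
print(win_fraction(lambda r: de_mere_second_bet(r, rolls=25), 10000, rng))
```

Because the estimates fluctuate from run to run, a single run of this size should not be expected to separate the 24-roll and 25-roll bets reliably.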
The program DeMere1 simulates a large number of experiments, seeing, in each one, if a six turns up in four rolls of a die. When we ran this program for 1000 plays, a six came up in the first four rolls 48.6 percent of the time. When we ran it for 10,000 plays this happened 51.98 percent of the time.

We note that the result of the second run suggests that de Méré was correct in believing that his bet with one die was favorable; however, if we had based our conclusion on the first run, we would have decided that he was wrong. Accurate results by simulation require a large number of experiments. □

The program DeMere2 simulates de Méré's second bet that a pair of sixes will occur in $n$ rolls of a pair of dice. The previous simulation shows that it is important to know how many trials we should simulate in order to expect a certain degree of accuracy in our approximation. We shall see later that in these types of experiments, a rough rule of thumb is that, at least 95% of the time, the error does not exceed the reciprocal of the square root of the number of trials.

Fortunately, for this dice game, it will be easy to compute the exact probabilities. We shall show in the next section that for the first bet the probability that de Méré wins is $1 - (5/6)^4 = .518$.

[Figure 1.1: Peter's winnings in 40 plays of heads or tails.]

One can understand this calculation as follows: The probability that no 6 turns up on the first toss is $(5/6)$. The probability that no 6 turns up on either of the first two tosses is $(5/6)^2$. Reasoning in the same way, the probability that no 6 turns up on any of the first four tosses is $(5/6)^4$. Thus, the probability of at least one 6 in the first four tosses is $1 - (5/6)^4$.
Similarly, for the second bet, with 24 rolls, the probability that de Méré wins is $1 - (35/36)^{24} = .491$, and for 25 rolls it is $1 - (35/36)^{25} = .506$.

Using the rule of thumb mentioned above, it would require 27,000 rolls to have a reasonable chance to determine these probabilities with sufficient accuracy to assert that they lie on opposite sides of .5. It is interesting to ponder whether a gambler can detect such probabilities with the required accuracy from gambling experience. Some writers on the history of probability suggest that de Méré was, in fact, just interested in these problems as intriguing probability problems.

Example 1.4 (Heads or Tails) For our next example, we consider a problem where the exact answer is difficult to obtain but for which simulation easily gives the qualitative results. Peter and Paul play a game called heads or tails. In this game, a fair coin is tossed a sequence of times (we choose 40). Each time a head comes up Peter wins 1 penny from Paul, and each time a tail comes up Peter loses 1 penny to Paul. For example, if the results of the 40 tosses are

THTHHHHTTHTHHTTHHTTTTHHHTHHTHHHTHHHTTTHH.

Peter's winnings may be graphed as in Figure 1.1.

Peter has won 6 pennies in this particular game. It is natural to ask for the probability that he will win $j$ pennies; here $j$ could be any even number from $-40$ to 40. It is reasonable to guess that the value of $j$ with the highest probability is $j = 0$, since this occurs when the number of heads equals the number of tails. Similarly, we would guess that the values of $j$ with the lowest probabilities are $j = \pm 40$.

A second interesting question about this game is the following: How many times in the 40 tosses will Peter be in the lead?
Looking at the graph of his winnings (Figure 1.1), we see that Peter is in the lead when his winnings are positive, but we have to make some convention when his winnings are 0 if we want all tosses to contribute to the number of times in the lead. We adopt the convention that, when Peter's winnings are 0, he is in the lead if he was ahead at the previous toss and not if he was behind at the previous toss. With this convention, Peter is in the lead 34 times in our example. Again, our intuition might suggest that the most likely number of times to be in the lead is 1/2 of 40, or 20, and the least likely numbers are the extreme cases of 40 or 0.

It is easy to settle this by simulating the game a large number of times and keeping track of the number of games in which Peter's final winnings are $j$, and the number of games in which Peter is in the lead $k$ times. The proportions over all games then give estimates for the corresponding probabilities. The program HTSimulation carries out this simulation. Note that when there are an even number of tosses in the game, it is possible to be in the lead only an even number of times. We have simulated this game 10,000 times. The results are shown in Figures 1.2 and 1.3. These graphs, which we call spike graphs, were generated using the program Spikegraph. The vertical line, or spike, at position $x$ on the horizontal axis, has a height equal to the proportion of outcomes which equal $x$. Our intuition about Peter's final winnings was quite correct, but our intuition about the number of times Peter was in the lead was completely wrong. The simulation suggests that the least likely number of times in the lead is 20 and the most likely is 0 or 40. This is indeed correct, and the explanation for it is suggested by playing the game of heads or tails with a large number of tosses and looking at a graph of Peter's winnings. In Figure 1.4 we show the results of a simulation of the game for 1000 tosses, and in Figure 1.5 for 10,000 tosses.
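The bookkeeping for this game, including the convention for ties, can be sketched as follows. The sketch is in Python and is not the program HTSimulation; the names `score_game` and `simulate` and the seed are assumptions made for illustration. Scoring the 40-toss sequence given above reproduces the final winnings of 6 and the 34 times in the lead.

```python
import random
from collections import Counter


def score_game(tosses):
    """Return (final winnings, number of tosses at which Peter is in the lead)."""
    winnings = 0
    times_in_lead = 0
    for toss in tosses:
        previous = winnings
        winnings += 1 if toss == "H" else -1
        # In the lead when winnings are positive, or when they are 0
        # and Peter was ahead at the previous toss.
        if winnings > 0 or (winnings == 0 and previous > 0):
            times_in_lead += 1
    return winnings, times_in_lead


def simulate(games, tosses, rng):
    """Estimate the distributions of final winnings and of times in the lead."""
    final, lead = Counter(), Counter()
    for _ in range(games):
        sequence = "".join(rng.choice("HT") for _ in range(tosses))
        j, k = score_game(sequence)
        final[j] += 1
        lead[k] += 1
    return ({j: c / games for j, c in final.items()},
            {k: c / games for k, c in lead.items()})


print(score_game("THTHHHHTTHTHHTTHHTTTTHHHTHHTHHHTHHHTTTHH"))  # (6, 34)

final_props, lead_props = simulate(10000, 40, random.Random(40))  # arbitrary seed
for k in (0, 20, 40):
    print(k, lead_props.get(k, 0.0))  # estimated probability of k times in the lead
```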
In the second of these simulations, with 10,000 tosses, Peter was ahead most of the time. It is a remarkable fact, however, that, if play is continued long enough, Peter's winnings will continue to come back to 0, but there will be very long times between the times that this happens. These and related results will be discussed in Chapter 12. □

In all of our examples so far, we have simulated equiprobable outcomes. We illustrate next an example where the outcomes are not equiprobable.

Example 1.5 (Horse Races) Four horses (Acorn, Balky, Chestnut, and Dolby) have raced many times. It is estimated that Acorn wins 30 percent of the time, Balky 40 percent of the time, Chestnut 20 percent of the time, and Dolby 10 percent of the time.

We can have our computer carry out one race as follows: Choose a random number $x$. If $x < .3$ then we say that Acorn won. If $.3 \le x < .7$ then Balky wins. If .7 [...]

1. $m(\omega) \ge 0$, for all $\omega \in \Omega$, and
2. $\sum_{\omega \in \Omega} m(\omega) = 1$.

For any subset $E$ of $\Omega$, we define the probability of $E$ to be the number $P(E)$ given by $P(E) = \sum_{\omega \in E} m(\omega)$.

Example 1.7 Consider an experiment in which a coin is tossed twice. Let $X$ be the random variable which corresponds to this experiment. We note that there are several ways to record the outcomes of this experiment. We could, for example, record the two tosses, in the order in which they occurred. In this case, we have $\Omega = \{HH, HT, TH, TT\}$. We could also record the outcomes by simply noting the number of heads that appeared. In this case, we have $\Omega = \{0, 1, 2\}$. Finally, we could record the two outcomes, without regard to the order in which they occurred. In this case, we have $\Omega = \{HH, HT, TT\}$.

We will use, for the moment, the first of the sample spaces given above. We will assume that all four outcomes are equally likely, and define the distribution function $m(\omega)$ by $m(HH) = m(HT) = m(TH) = m(TT) = \frac{1}{4}$.

Let $E = \{HH, HT, TH\}$ be the event that at least one head comes up.
Then, the probability of $E$ can be calculated as follows: $P(E) = m(HH) + m(HT) + m(TH) = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4}$.

Similarly, if $F = \{HH, HT\}$ is the event that heads comes up on the first toss, then we have $P(F) = m(HH) + m(HT) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$.

Example 1.8 (Example 1.6 continued) The sample space for the experiment in which the die is rolled is the 6-element set $\Omega = \{1, 2, 3, 4, 5, 6\}$. We assumed that the die was fair, and we chose the distribution function defined by $m(i) = \frac{1}{6}$, for $i = 1, \ldots, 6$. If $E$ is the event that the result of the roll is an even number, then $E = \{2, 4, 6\}$ and $P(E) = m(2) + m(4) + m(6) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$.

Notice that it is an immediate consequence of the above definitions that, for every $\omega \in \Omega$, $P(\{\omega\}) = m(\omega)$. That is, the probability of the elementary event $\{\omega\}$, consisting of a single outcome $\omega$, is equal to the value $m(\omega)$ assigned to the outcome $\omega$ by the distribution function.

Example 1.9 Three people, A, B, and C, are running for the same office, and we assume that one and only one of them wins. The sample space may be taken as the 3-element set $\Omega = \{A, B, C\}$ where each element corresponds to the outcome of that candidate's winning. Suppose that A and B have the same chance of winning, but that C has only 1/2 the chance of A or B. Then we assign $m(A) = m(B) = 2m(C)$. Since $m(A) + m(B) + m(C) = 1$, we see that $2m(C) + 2m(C) + m(C) = 1$, which implies that $5m(C) = 1$. Hence, $m(A) = \frac{2}{5}$, $m(B) = \frac{2}{5}$, $m(C) = \frac{1}{5}$. Let $E$ be the event that either A or C wins. Then $E = \{A, C\}$, and $P(E) = m(A) + m(C) = \frac{2}{5} + \frac{1}{5} = \frac{3}{5}$.

In many cases, events can be described in terms of other events through the use of the standard constructions of set theory. We will briefly review the definitions of these constructions. The reader is referred to Figure 1.7 for Venn diagrams which illustrate these constructions.

Let $A$ and $B$ be two sets. Then the union of $A$ and $B$ is the set $A \cup B = \{x \mid x \in A \text{ or } x \in B\}$. The intersection of $A$ and $B$ is the set $A \cap B = \{x \mid x \in A \text{ and } x \in B\}$.
The difference of $A$ and $B$ is the set $A - B = \{x \mid x \in A \text{ and } x \notin B\}$. The set $A$ is a subset of $B$, written $A \subset B$, if every element of $A$ is also an element of $B$. Finally, the complement of $A$ is the set $\tilde{A} = \{x \mid x \in \Omega \text{ and } x \notin A\}$.

The reason that these constructions are important is that it is typically the case that complicated events described in English can be broken down into simpler events using these constructions. For example, if $A$ is the event that "it will snow tomorrow and it will rain the next day," $B$ is the event that "it will snow tomorrow," and $C$ is the event that "it will rain two days from now," then $A$ is the intersection of the events $B$ and $C$. Similarly, if $D$ is the event that "it will snow tomorrow or it will rain the next day," then $D = B \cup C$. (Note that care must be taken here, because sometimes the word "or" in English means that exactly one of the two alternatives will occur. The meaning is usually clear from context. In this book, we will always use the word "or" in the inclusive sense, i.e., $A$ or $B$ means that at least one of the two events $A$, $B$ is true.) The event $\tilde{B}$ is the event that "it will not snow tomorrow." Finally, if $E$ is the event that "it will snow tomorrow but it will not rain the next day," then $E = B - C$.

[Figure 1.7: Basic set operations. Panel labels: $A \cap B$, $A \cup B$, $A - B$.]

Properties

Theorem 1.1 The probabilities assigned to events by a distribution function on a sample space $\Omega$ satisfy the following properties:

1. $P(E) \ge 0$ for every $E \subset \Omega$.
2. $P(\Omega) = 1$.
3. If $E \subset F \subset \Omega$, then $P(E) \le P(F)$.
4. If $A$ and $B$ are disjoint subsets of $\Omega$, then $P(A \cup B) = P(A) + P(B)$.
5. $P(\tilde{A}) = 1 - P(A)$ for every $A \subset \Omega$.

Proof. For any event $E$ the probability $P(E)$ is determined from the distribution $m$ by $P(E) = \sum_{\omega \in E} m(\omega)$, for every $E \subset \Omega$. Since the function $m$ is nonnegative, it follows that $P(E)$ is also nonnegative. Thus, Property 1 is true.

Property 2 is proved by the equations $P(\Omega) = \sum_{\omega \in \Omega} m(\omega) = 1$.

Suppose that $E \subset F \subset \Omega$.
Then every element $\omega$ that belongs to $E$ also belongs to $F$. Therefore, $\sum_{\omega \in E} m(\omega) \le \sum_{\omega \in F} m(\omega)$, since each term in the left-hand sum is in the right-hand sum, and all the terms in both sums are non-negative. This implies that $P(E) \le P(F)$, and Property 3 is proved.

Suppose next that $A$ and $B$ are disjoint subsets of $\Omega$. Then every element $\omega$ of $A \cup B$ lies either in $A$ and not in $B$ or in $B$ and not in $A$. It follows that $P(A \cup B) = \sum_{\omega \in A \cup B} m(\omega) = \sum_{\omega \in A} m(\omega) + \sum_{\omega \in B} m(\omega) = P(A) + P(B)$, and Property 4 is proved.

Finally, to prove Property 5, consider the disjoint union $\Omega = A \cup \tilde{A}$. Since $P(\Omega) = 1$, the property of disjoint additivity (Property 4) implies that $1 = P(A) + P(\tilde{A})$, whence $P(\tilde{A}) = 1 - P(A)$. □

It is important to realize that Property 4 in Theorem 1.1 can be extended to more than two sets. The general finite additivity property is given by the following theorem.

Theorem 1.2 If $A_1, \ldots, A_n$ are pairwise disjoint subsets of $\Omega$ (i.e., no two of the $A_i$'s have an element in common), then $P(A_1 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i)$.

Proof. Let $\omega$ be any element in the union $A_1 \cup \cdots \cup A_n$. Then $m(\omega)$ occurs exactly once on each side of the equality in the statement of the theorem.

We shall often use the following consequence of the above theorem.

Theorem 1.3 Let $A_1, \ldots, A_n$ be pairwise disjoint events with $\Omega = A_1 \cup \cdots \cup A_n$, and let $E$ be any event. Then $P(E) = \sum_{i=1}^{n} P(E \cap A_i)$.

Proof. The sets $E \cap A_1, \ldots, E \cap A_n$ are pairwise disjoint, and their union is the set $E$. The result now follows from Theorem 1.2. □

Corollary 1.1 For any two events $A$ and $B$, $P(A) = P(A \cap B) + P(A \cap \tilde{B})$.

Property 4 can be generalized in another way. Suppose that $A$ and $B$ are subsets of $\Omega$ which are not necessarily disjoint. Then:

Theorem 1.4 If $A$ and $B$ are subsets of $\Omega$, then

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$.   (1.1)

Proof. The left side of Equation 1.1 is the sum of $m(\omega)$ for $\omega$ in either $A$ or $B$.
We must show that the right side of Equation 1.1 also adds $m(\omega)$ for $\omega$ in $A$ or $B$. If $\omega$ is in exactly one of the two sets, then it is counted in only one of the three terms on the right side of Equation 1.1. If it is in both $A$ and $B$, it is added twice from the calculations of $P(A)$ and $P(B)$ and subtracted once for $P(A \cap B)$. Thus it is counted exactly once by the right side. Of course, if $A \cap B = \emptyset$, then Equation 1.1 reduces to Property 4. (Equation 1.1 can also be generalized; see Theorem 3.8.) □

Tree Diagrams

Example 1.10 Let us illustrate the properties of probabilities of events in terms of three tosses of a coin. When we have an experiment which takes place in stages such as this, we often find it convenient to represent the outcomes by a tree diagram as shown in Figure 1.8. A path through the tree corresponds to a possible outcome of the experiment. For the case of three tosses of a coin, we have eight paths $\omega_1, \omega_2, \ldots, \omega_8$ and, assuming each outcome to be equally likely, we assign equal weight, 1/8, to each path. Let $E$ be the event "at least one head turns up." Then $\tilde{E}$ is the event "no heads turn up." This event occurs for only one outcome, namely, $\omega_8 = \text{TTT}$. Thus, $\tilde{E} = \{\text{TTT}\}$ and we have
$$P(\tilde{E}) = P(\{\text{TTT}\}) = m(\text{TTT}) = \frac{1}{8}.$$
By Property 5 of Theorem 1.1,
$$P(E) = 1 - P(\tilde{E}) = 1 - \frac{1}{8} = \frac{7}{8}.$$
Note that we shall often find it is easier to compute the probability that an event does not happen rather than the probability that it does. We then use Property 5 to obtain the desired probability.

Figure 1.8: Tree diagram for three tosses of a coin (columns: First toss, Second toss, Third toss, Outcome).

Let $A$ be the event "the first outcome is a head," and $B$ the event "the second outcome is a tail." By looking at the paths in Figure 1.8, we see that
$$P(A) = P(B) = \frac{1}{2}.$$
Moreover, $A \cap B = \{\omega_3, \omega_4\}$, and so $P(A \cap B) = 1/4$.
Using Theorem 1.4, we obtain
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}.$$
Since $A \cup B$ is the 6-element set,
$$A \cup B = \{\text{HHH}, \text{HHT}, \text{HTH}, \text{HTT}, \text{TTH}, \text{TTT}\},$$
we see that we obtain the same result by direct enumeration. □

In our coin tossing examples and in the die rolling example, we have assigned an equal probability to each possible outcome of the experiment. Corresponding to this method of assigning probabilities, we have the following definitions.

Uniform Distribution

Definition 1.3 The uniform distribution on a sample space $\Omega$ containing $n$ elements is the function $m$ defined by
$$m(\omega) = \frac{1}{n},$$
for every $\omega \in \Omega$. □

It is important to realize that when an experiment is analyzed to describe its possible outcomes, there is no single correct choice of sample space. For the experiment of tossing a coin twice in Example 1.2, we selected the 4-element set $\Omega = \{\text{HH}, \text{HT}, \text{TH}, \text{TT}\}$ as a sample space and assigned the uniform distribution function. These choices are certainly intuitively natural. On the other hand, for some purposes it may be more useful to consider the 3-element sample space $\bar{\Omega} = \{0, 1, 2\}$ in which 0 is the outcome "no heads turn up," 1 is the outcome "exactly one head turns up," and 2 is the outcome "two heads turn up." The distribution function $\bar{m}$ on $\bar{\Omega}$ defined by the equations
$$\bar{m}(0) = \frac{1}{4}, \quad \bar{m}(1) = \frac{1}{2}, \quad \bar{m}(2) = \frac{1}{4}$$
is the one corresponding to the uniform probability density on the original sample space $\Omega$. Notice that it is perfectly possible to choose a different distribution function. For example, we may consider the uniform distribution function on $\bar{\Omega}$, which is the function $\bar{q}$ defined by
$$\bar{q}(0) = \bar{q}(1) = \bar{q}(2) = \frac{1}{3}.$$
Although $\bar{q}$ is a perfectly good distribution function, it is not consistent with observed data on coin tossing.

Example 1.11 Consider the experiment that consists of rolling a pair of dice. We take as the sample space $\Omega$ the set of all ordered pairs $(i, j)$ of integers with $1 \le i \le 6$ and $1 \le j \le 6$.
Thus,
$$\Omega = \{(i, j) : 1 \le i, j \le 6\}.$$
(There is at least one other "reasonable" choice for a sample space, namely the set of all unordered pairs of integers, each between 1 and 6. For a discussion of why we do not use this set, see Example 3.14.) To determine the size of $\Omega$, we note that there are six choices for $i$, and for each choice of $i$ there are six choices for $j$, leading to 36 different outcomes. Let us assume that the dice are not loaded. In mathematical terms, this means that we assume that each of the 36 outcomes is equally likely, or equivalently, that we adopt the uniform distribution function on $\Omega$ by setting
$$m((i, j)) = \frac{1}{36},$$
1 [...] $P(A) + P(B) - 1$.

19 If $A$, $B$, and $C$ are any three events, show that
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(C \cap A) + P(A \cap B \cap C).$$

20 Explain why it is not possible to define a uniform distribution function (see Definition 1.3) on a countably infinite sample space. Hint: Assume $m(\omega) = a$ for all $\omega$, where $0 < a < 1$. Does $m(\omega)$ have all the properties of a distribution function?

21 In Example 1.13 find the probability that the coin turns up heads for the first time on the tenth, eleventh, or twelfth toss.

22 A die is rolled until the first time that a six turns up. We shall see that the probability that this occurs on the $n$th roll is $(5/6)^{n-1} \cdot (1/6)$. Using this fact, describe the appropriate infinite sample space and distribution function for the experiment of rolling a die until a six turns up for the first time. Verify that for your distribution function $\sum_{\omega} m(\omega) = 1$.

[22] See Knot X, in Lewis Carroll, Mathematical Recreations, vol. 2 (Dover, 1958).

23 Let $\Omega$ be the sample space $\Omega = \{0, 1, 2, \ldots\}$ and define a distribution function by
$$m(j) = (1 - r)^j r,$$
for some fixed $r$, $0 < r < 1$, and for $j = 0, 1, 2, \ldots$. Show that this is a distribution function for $\Omega$.

24 Our calendar has a 400-year cycle. B. H.
Brown noticed that the number of times the thirteenth of the month falls on each of the days of the week in the 4800 months of a cycle is as follows:

    Sunday     687
    Monday     685
    Tuesday    685
    Wednesday  687
    Thursday   684
    Friday     688
    Saturday   684

From this he deduced that the thirteenth was more likely to fall on Friday than on any other day. Explain what he meant by this.

25 Tversky and Kahneman[23] asked a group of subjects to carry out the following task. They are told that:

    Linda is 31, single, outspoken, and very bright. She majored in philosophy in college. As a student, she was deeply concerned with racial discrimination and other social issues, and participated in anti-nuclear demonstrations.

The subjects are then asked to rank the likelihood of various alternatives, such as:
(1) Linda is active in the feminist movement.
(2) Linda is a bank teller.
(3) Linda is a bank teller and active in the feminist movement.
Tversky and Kahneman found that between 85 and 90 percent of the subjects rated alternative (1) most likely, but alternative (3) more likely than alternative (2). Is it? They call this phenomenon the conjunction fallacy, and note that it appears to be unaffected by prior training in probability or statistics. Is this phenomenon a fallacy? If so, why? Can you give a possible explanation for the subjects' choices?

[23] K. McKean, "Decisions, Decisions," pp. 22-31.

26 Two cards are drawn successively from a deck of 52 cards. Find the probability that the second card is higher in rank than the first card. Hint: Show that $1 = P(\text{higher}) + P(\text{lower}) + P(\text{same})$ and use the fact that $P(\text{higher}) = P(\text{lower})$.

27 A life table is a table that lists for a given number of births the estimated number of people who will live to a given age. In Appendix C we give a life table based upon 100,000 births for ages from 0 to 85, both for women and for men.
Show how from this table you can estimate the probability $m(x)$ that a person born in 1981 would live to age $x$. Write a program to plot $m(x)$ both for men and for women, and comment on the differences that you see in the two cases.

*28 Here is an attempt to get around the fact that we cannot choose a "random integer."
(a) What, intuitively, is the probability that a "randomly chosen" positive integer is a multiple of 3?
(b) Let $P_3(N)$ be the probability that an integer, chosen at random between 1 and $N$, is a multiple of 3 (since the sample space is finite, this is a legitimate probability). Show that the limit
$$P_3 = \lim_{N \to \infty} P_3(N)$$
exists and equals 1/3. This formalizes the intuition in (a), and gives us a way to assign "probabilities" to certain events that are infinite subsets of the positive integers.
(c) If $A$ is any set of positive integers, let $A(N)$ mean the number of elements of $A$ which are less than or equal to $N$. Then define the "probability" of $A$ as
$$P(A) = \lim_{N \to \infty} A(N)/N$$
provided this limit exists. Show that this definition would assign probability 0 to any finite set and probability 1 to the set of all positive integers. Thus, the probability of the set of all integers is not the sum of the probabilities of the individual integers in this set. This means that the definition of probability given here is not a completely satisfactory definition.
(d) Let $A$ be the set of all positive integers with an odd number of digits. Show that $P(A)$ does not exist. This shows that under the above definition of probability, not all sets have probabilities.

29 (from Sholander[24]) In a standard clover-leaf interchange, there are four ramps for making right-hand turns, and inside these four ramps, there are four more ramps for making left-hand turns. Your car approaches the interchange from the south. A mechanism has been installed so that at each point where there exists a choice of directions, the car turns to the right with fixed probability $r$.
(a) If $r = 1/2$, what is your chance of emerging from the interchange going west?
(b) Find the value of $r$ that maximizes your chance of a westward departure from the interchange.

[24] M. Sholander, Problem #1034, Mathematics Magazine, vol. 52, no. 3 (May 1979), p. 183.

30 (from Benkoski[25]) Consider a "pure" cloverleaf interchange in which there are no ramps for right-hand turns, but only the two intersecting straight highways with cloverleaves for left-hand turns. (Thus, to turn right in such an interchange, one must make three left-hand turns.) As in the preceding problem, your car approaches the interchange from the south. What is the value of $r$ that maximizes your chances of an eastward departure from the interchange?

31 (from vos Savant[26]) A reader of Marilyn vos Savant's column wrote in with the following question:

    "My dad heard this story on the radio. At Duke University, two students had received A's in chemistry all semester. But on the night before the final exam, they were partying in another state and didn't get back to Duke until it was over. Their excuse to the professor was that they had a flat tire, and they asked if they could take a make-up test. The professor agreed, wrote out a test and sent the two to separate rooms to take it. The first question (on one side of the paper) was worth 5 points, and they answered it easily. Then they flipped the paper over and found the second question, worth 95 points: 'Which tire was it?' What was the probability that both students would say the same thing? My dad and I think it's 1 in 16. Is that right?"

(a) Is the answer 1/16?
(b) The following question was asked of a class of students. "I was driving to school today, and one of my tires went flat. Which tire do you think it was?" The responses were as follows: right front, 58%, left front, 11%, right rear, 18%, left rear, 13%.
Suppose that this distribution holds in the general population, and assume that the two test-takers are randomly chosen from the general population. What is the probability that they will give the same answer to the second question?

[25] S. Benkoski, Comment on Problem #1034, Mathematics Magazine, vol. 52, no. 3 (May 1979), pp. 183-184.
[26] M. vos Savant, Parade Magazine, 3 March 1996, p. 14.

Chapter 2
Continuous Probability Densities

2.1 Simulation of Continuous Probabilities

In this section we shall show how we can use computer simulations for experiments that have a whole continuum of possible outcomes.

Probabilities

Example 2.1 We begin by constructing a spinner, which consists of a circle of unit circumference and a pointer as shown in Figure 2.1. We pick a point on the circle and label it 0, and then label every other point on the circle with the distance, say $x$, from 0 to that point, measured counterclockwise. The experiment consists of spinning the pointer and recording the label of the point at the tip of the pointer. We let the random variable $X$ denote the value of this outcome. The sample space is clearly the interval $[0, 1)$. We would like to construct a probability model in which each outcome is equally likely to occur.

If we proceed as we did in Chapter 1 for experiments with a finite number of possible outcomes, then we must assign the probability 0 to each outcome, since otherwise, the sum of the probabilities, over all of the possible outcomes, would not equal 1. (In fact, summing an uncountable number of real numbers is a tricky business; in particular, in order for such a sum to have any meaning, at most countably many of the summands can be different from 0.) However, if all of the assigned probabilities are 0, then the sum is 0, not 1, as it should be.

In the next section, we will show how to construct a probability model in this situation. At present, we will assume that such a model can be constructed.
We will also assume that in this model, if $E$ is an arc of the circle, and $E$ is of length $p$, then the model will assign the probability $p$ to $E$. This means that if the pointer is spun, the probability that it ends up pointing to a point in $E$ equals $p$, which is certainly a reasonable thing to expect.

Figure 2.1: A spinner.

To simulate this experiment on a computer is an easy matter. Many computer software packages have a function which returns a random real number in the interval $[0, 1]$. Actually, the returned value is always a rational number, and the values are determined by an algorithm, so a sequence of such values is not truly random. Nevertheless, the sequences produced by such algorithms behave much like theoretically random sequences, so we can use such sequences in the simulation of experiments. On occasion, we will need to refer to such a function. We will call this function rnd. □

Monte Carlo Procedure and Areas

It is sometimes desirable to estimate quantities whose exact values are difficult or impossible to calculate. In some of these cases, a procedure involving chance, called a Monte Carlo procedure, can be used to provide such an estimate.

Example 2.2 In this example we show how simulation can be used to estimate areas of plane figures. Suppose that we program our computer to provide a pair $(x, y)$ of numbers, each chosen independently at random from the interval $[0, 1]$. Then we can interpret this pair $(x, y)$ as the coordinates of a point chosen at random from the unit square. Events are subsets of the unit square. Our experience with Example 2.1 suggests that the point is equally likely to fall in subsets of equal area. Since the total area of the square is 1, the probability of the point falling in a specific subset $E$ of the unit square should be equal to its area.
Thus, we can estimate the area of any subset of the unit square by estimating the probability that a point chosen at random from this square falls in the subset.

We can use this method to estimate the area of the region $E$ under the curve $y = x^2$ in the unit square (see Figure 2.2). We choose a large number of points $(x, y)$ at random and record what fraction of them fall in the region $E = \{(x, y) : y \le x^2\}$.

Figure 2.2: Area under $y = x^2$.

The program MonteCarlo will carry out this experiment for us. Running this program for 10,000 experiments gives an estimate of .325 (see Figure 2.3). From these experiments we would estimate the area to be about 1/3. Of course, for this simple region we can find the exact area by calculus. In fact,
$$\text{Area of } E = \int_0^1 x^2 \, dx = \frac{1}{3}.$$
We have remarked in Chapter 1 that, when we simulate an experiment of this type $n$ times to estimate a probability, we can expect the answer to be in error by at most $1/\sqrt{n}$ at least 95 percent of the time. For 10,000 experiments we can expect an accuracy of 0.01, and our simulation did achieve this accuracy.

This same argument works for any region $E$ of the unit square. For example, suppose $E$ is the circle with center $(1/2, 1/2)$ and radius 1/2. Then the probability that our random point $(x, y)$ lies inside the circle is equal to the area of the circle, that is,
$$P(E) = \pi \left(\frac{1}{2}\right)^2 = \frac{\pi}{4}.$$
If we did not know the value of $\pi$, we could estimate the value by performing this experiment a large number of times! □

The above example is not the only way of estimating the value of $\pi$ by a chance experiment. Here is another way, discovered by Buffon.[1]

[1] G. L. Buffon, in "Essai d'Arithmétique Morale," Oeuvres Complètes de Buffon avec Suppléments, tome iv, ed. Duménil (Paris, 1836).

Figure 2.3: Computing the area by simulation.
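The Monte Carlo area estimate of Example 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's MonteCarlo program: the function names `estimate_area`, `under_parabola`, and `in_inscribed_circle` are hypothetical, and Python's `random.Random.random()` stands in for rnd (it returns values in $[0, 1)$ rather than $[0, 1]$, which has no effect on the areas).

```python
import math
import random


def estimate_area(inside, n_points=10_000, seed=None):
    """Estimate the area of a region of the unit square by random sampling.

    `inside(x, y)` returns True when the point (x, y) lies in the region.
    """
    rng = random.Random(seed)  # private generator playing the role of rnd
    hits = 0
    for _ in range(n_points):
        # Each coordinate is drawn independently and uniformly from [0, 1).
        x, y = rng.random(), rng.random()
        if inside(x, y):
            hits += 1
    # The fraction of points in the region estimates its probability,
    # which equals its area because the square has area 1.
    return hits / n_points


def under_parabola(x, y):
    """Region E = {(x, y) : y <= x^2} from Example 2.2."""
    return y <= x * x


def in_inscribed_circle(x, y):
    """Disk with center (1/2, 1/2) and radius 1/2."""
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25


if __name__ == "__main__":
    n = 10_000
    area = estimate_area(under_parabola, n, seed=2006)
    pi_estimate = 4 * estimate_area(in_inscribed_circle, n, seed=2006)
    print(f"area under y = x^2: {area:.3f} (exact value 1/3)")
    print(f"estimate of pi: {pi_estimate:.3f} (math.pi = {math.pi:.3f})")
    print(f"rule-of-thumb error bound 1/sqrt(n): {1 / math.sqrt(n):.3f}")
```

Passing the region as a predicate keeps the sampling loop unchanged when the region changes, which is the sense in which "the same argument works for any region $E$ of the unit square." Note that the estimate of $\pi$ multiplies the area estimate by 4, so its error bound is four times larger.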
Buffon's Needle

Example 2.3 Suppose that we take a card table and draw across the top surface a set of parallel lines a unit distance apart. We then drop a common needle of unit length at random on this surface and observe whether or not the needle lies across one of the lines. We can describe the possible outcomes of this experiment by coordinates as follows: Let $d$ be the distance from the center of the needle to the nearest line. Next, let $L$ be the line determined by the needle, and define $\theta$ as the acute angle that the line $L$ makes with the set of parallel lines. (The reader should certainly be wary of this description of the sample space. We are attempting to coordinatize a set of line segments. To see why one must be careful in the choice of coordinates, see Example 2.6.) Using this description, we have $0 \le d \le 1/2$, and $0 \le \theta \le \pi/2$. Moreover, we see that the needle lies across the nearest line if and only if the hypotenuse of the triangle (see Figure 2.4) is less than half the length of the needle, that is,
$$\frac{d}{\sin\theta} < \frac{1}{2}.$$
Now we assume that when the needle drops, the pair $(\theta, d)$ is chosen at random from the rectangle $0 \le \theta \le \pi/2$, $0 \le d \le 1/2$. We observe whether the needle lies across the nearest line (i.e., whether $d \le (1/2)\sin\theta$). The probability of this event $E$ is the fraction of the area of the rectangle which lies inside $E$ (see Figure 2.5).

Figure 2.4: Buffon's experiment.

Figure 2.5: Set $E$ of pairs $(\theta, d)$ with $d < \frac{1}{2}\sin\theta$.

Now the area of the rectangle is $\pi/4$, while the area of $E$ is
$$\text{Area} = \int_0^{\pi/2} \frac{1}{2}\sin\theta \, d\theta = \frac{1}{2}.$$
Hence, we get
$$P(E) = \frac{1/2}{\pi/4} = \frac{2}{\pi}.$$
The program BuffonsNeedle simulates this experiment. In Figure 2.6, we show the position of every 100th needle in a run of the program in which 10,000 needles were "dropped." Our final estimate for $\pi$ is 3.139. While this was within 0.003 of the true value for $\pi$ we had no right to expect such accuracy. The reason for this is that our simulation estimates $P(E)$.
While we can expect this estimate to be in error by at most 0.01, a small error in $P(E)$ gets magnified when we use this to compute $\pi = 2/P(E)$. Perlman and Wichura, in their article "Sharpening Buffon's Needle,"[2] show that we can expect to have an error of not more than $5/\sqrt{n}$ about 95 percent of the time. Here $n$ is the number of needles dropped. Thus for 10,000 needles we should expect an error of no more than 0.05, and that was the case here. We see that a large number of experiments is necessary to get a decent estimate for $\pi$. □

Figure 2.6: Simulation of Buffon's needle experiment.

In each of our examples so far, events of the same size are equally likely. Here is an example where they are not. We will see many other such examples later.

Example 2.4 Suppose that we choose two random real numbers in $[0, 1]$ and add them together. Let $X$ be the sum. How is $X$ distributed?

To help understand the answer to this question, we can use the program Areabargraph. This program produces a bar graph with the property that on each interval, the area, rather than the height, of the bar is equal to the fraction of outcomes that fell in the corresponding interval. We have carried out this experiment 1000 times; the data is shown in Figure 2.7. It appears that the function defined by $x$, if 0 [...] $\sqrt{3}$ if its midpoint has distance $d < 1/2$ from the origin (see Figure 2.9). The following calculations determine the probability that $L > \sqrt{3}$ in each of the three cases.

1. $L > \sqrt{3}$ if $(x, y)$ lies inside a circle of radius 1/2, which occurs with probability
$$p = \frac{\pi (1/2)^2}{\pi (1)^2} = \frac{1}{4}.$$
2. $L > \sqrt{3}$ if $|r| < 1/2$, which occurs with probability
$$\frac{1/2 - (-1/2)}{1 - (-1)} = \frac{1}{2}.$$
3. $L > \sqrt{3}$ if $2\pi/3 < \alpha < 4\pi/3$, which occurs with probability
$$\frac{4\pi/3 - 2\pi/3}{2\pi - 0} = \frac{1}{3}.$$

We see that our simulations agree quite well with these theoretical values. □

Historical Remarks

G. L.
Buffon (1707-1788) was a natural scientist in the eighteenth century who applied probability to a number of his investigations. His work is found in his monumental 44-volume Histoire Naturelle and its supplements.[5] For example, he presented a number of mortality tables and used them to compute, for each age group, the expected remaining lifetime. From his table he observed: the expected remaining lifetime of an infant of one year is 33 years, while that of a man of 21 years is also approximately 33 years. Thus, a father who is not yet 21 can hope to live longer than his one-year-old son, but if the father is 40, the odds are already 3 to 2 that his son will outlive him.[6]

[5] G. L. Buffon, Histoire Naturelle, Générale et Particulière, avec la Description du Cabinet du Roy, 44 vols. (Paris: L'Imprimerie Royale, 1749-1803).

Buffon wanted to show that not all probability calculations rely only on algebra, but that some rely on geometrical calculations. One such problem was his famous "needle problem" as discussed in this chapter.[7] In his original formulation, Buffon describes a game in which two gamblers drop a loaf of French bread on a wide-board floor and bet on whether or not the loaf falls across a crack in the floor. Buffon asked: what length $L$ should the bread loaf be, relative to the width $W$ of the floorboards, so that the game is fair? He found the correct answer ($L = (\pi/4)W$) using essentially the methods described in this chapter. He also considered the case of a checkerboard floor, but gave the wrong answer in this case. The correct answer was given later by Laplace.

    Experimenter        Length of   Number of   Number of   Estimate
                        needle      casts       crossings   for pi
    Wolf, 1850          .8          5000        2532        3.1596
    Smith, 1855         .6          3204        1218.5      3.1553
    De Morgan, c.1860   1.0          600         382.5      3.137
    Fox, 1864           .75         1030         489        3.1595
    Lazzerini, 1901     .83         3408        1808        3.1415929
    Reina, 1925         .5419       2520         869        3.1795

Table 2.1: Buffon needle experiments to estimate $\pi$.
The literature contains descriptions of a number of experiments that were actually carried out to estimate $\pi$ by this method of dropping needles. N. T. Gridgeman[8] discusses the experiments shown in Table 2.1. (The halves for the number of crossings come from a compromise when it could not be decided if a crossing had actually occurred.) He observes, as we have, that 10,000 casts could do no more than establish the first decimal place of $\pi$ with reasonable confidence. Gridgeman points out that, although none of the experiments used even 10,000 casts, they are surprisingly good, and in some cases, too good. The fact that the number of casts is not always a round number would suggest that the authors might have resorted to clever stopping to get a good answer. Gridgeman comments that Lazzerini's estimate turned out to agree with a well-known approximation to $\pi$, 355/113 = 3.1415929, discovered by the fifth-century Chinese mathematician, Tsu Ch'ungchih. Gridgeman says that he did not have Lazzerini's original report, and while waiting for it (knowing only the needle crossed a line 1808 times in 3408 casts) deduced that the length of the needle must have been 5/6. He calculated this from Buffon's formula, assuming $\pi = 355/113$:
$$L = \frac{\pi P(E)}{2} = \frac{1}{2}\left(\frac{355}{113}\right)\left(\frac{1808}{3408}\right) = \frac{5}{6} = .8333.$$
Even with careful planning one would have to be extremely lucky to be able to stop so cleverly.

[6] G. L. Buffon, "Essai d'Arithmétique Morale," p. 301.
[7] ibid., pp. 277-278.
[8] N. T. Gridgeman, "Geometric Probability and the Number $\pi$," Scripta Mathematica, vol. 25, no. 3 (1960), pp. 183-195.

The second author likes to trace his interest in probability theory to the Chicago World's Fair of 1933 where he observed a mechanical device dropping needles and displaying the ever-changing estimates for the value of $\pi$. (The first author likes to trace his interest in probability theory to the second author.)
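Gridgeman's arithmetic for Lazzerini's needle can be checked exactly with rational numbers. The sketch below is only an illustration: the helper names `needle_length` and `pi_estimate` are hypothetical, and both rely on the relation $P(E) = 2L/\pi$ for a needle of length $L$ dropped on lines a unit distance apart, which is the formula behind the displayed calculation $L = \pi P(E)/2$.

```python
from fractions import Fraction


def needle_length(crossings, casts, pi_value):
    """Solve P(E) = 2L/pi for L, estimating P(E) by crossings/casts."""
    p_estimate = Fraction(crossings, casts)
    return pi_value * p_estimate / 2


def pi_estimate(length, crossings, casts):
    """Invert the same relation: pi is estimated by 2L / (crossings/casts).

    Pass decimal values as strings (for example "0.8") so that Fraction
    reads them exactly instead of as binary floating-point numbers.
    """
    return 2 * Fraction(length) * casts / Fraction(crossings)


if __name__ == "__main__":
    approx_pi = Fraction(355, 113)

    # Lazzerini: 1808 crossings in 3408 casts, assuming pi = 355/113.
    length = needle_length(1808, 3408, approx_pi)
    print(length)                            # 5/6

    # Going the other way, a needle of length 5/6 reproduces 355/113.
    print(pi_estimate(length, 1808, 3408))   # 355/113

    # Wolf's row of Table 2.1: length .8, 5000 casts, 2532 crossings.
    wolf = pi_estimate("0.8", 2532, 5000)
    print(round(float(wolf), 4))             # 3.1596
```

The exact cancellation to 5/6 occurs because 1808 = 16 · 113 and 3408 = 48 · 71, while 355 = 5 · 71, so the factors 113 and 71 cancel exactly.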
Exercises

*1 In the spinner problem (see Example 2.1) divide the unit circumference into three arcs of length 1/2, 1/3, and 1/6. Write a program to simulate the spinner experiment 1000 times and print out what fraction of the outcomes fall in each of the three arcs. Now plot a bar graph whose bars have width 1/2, 1/3, and 1/6, and areas equal to the corresponding fractions as determined by your simulation. Show that the heights of the bars are all nearly the same.

2 Do the same as in Exercise 1, but divide the unit circumference into five arcs of length 1/3, 1/4, 1/5, 1/6, and 1/20.

3 Alter the program MonteCarlo to estimate the area of the circle of radius 1/2 with center at $(1/2, 1/2)$ inside the unit square by choosing 1000 points at random. Compare your results with the true value of $\pi/4$. Use your results to estimate the value of $\pi$. How accurate is your estimate?

4 Alter the program MonteCarlo to estimate the area under the graph of $y = \sin \pi x$ inside the unit square by choosing 10,000 points at random. Now calculate the true value of this area and use your results to estimate the value of $\pi$. How accurate is your estimate?

5 Alter the program MonteCarlo to estimate the area under the graph of $y = 1/(x + 1)$ in the unit square in the same way as in Exercise 4. Calculate the true value of this area and use your simulation results to estimate the value of $\log 2$. How accurate is your estimate?

6 To simulate the Buffon's needle problem we choose independently the distance $d$ and the angle $\theta$ at random, with $0 < d < 1/2$ and $0 < \theta$ [...] 0. For waiting times produced in this way, the average waiting time is $1/\lambda$. For example, the times spent waiting for a car to pass on a highway, or the times between emissions of particles from a radioactive source, are simulated by a sequence of random numbers, each of which is chosen by computing $(-1/\lambda)\log(\text{rnd})$, where $1/\lambda$ is the average time between cars or emissions. Write a program to simulate the times between cars when the average time between cars is 30 seconds. Have your program compute an area bar graph for these times by breaking the time interval from 0 to 120 into 24 subintervals. On the same pair of axes, plot the function $f(x) = (1/30)e^{-(1/30)x}$. Does the function fit the bar graph well?

[9] P. S. Laplace, Théorie Analytique des Probabilités (Paris: Courcier, 1812).

10 In Exercise 9, the distribution came "out of a hat." In this problem, we will again consider an experiment whose outcomes are not equally likely. We will determine a function $f(x)$ which can be used to determine the probability of certain events. Let $T$ be the right triangle in the plane with vertices at the points $(0, 0)$, $(1, 0)$, and $(0, 1)$. The experiment consists of picking a point at random in the interior of $T$, and recording only the $x$-coordinate of the point. Thus, the sample space is the set $[0, 1]$, but the outcomes do not seem to be equally likely. We can simulate this experiment by asking a computer to return two random real numbers in $[0, 1]$, and recording the first of these two numbers if their sum is less than 1. Write this program and run it for 10,000 trials. Then make a bar graph of the result, breaking the interval $[0, 1]$ into 10 intervals. Compare the bar graph with the function $f(x) = 2 - 2x$. Now show that there is a constant $c$ such that the height of $T$ at the $x$-coordinate value $x$ is $c$ times $f(x)$ for every $x$ in $[0, 1]$. Finally, show that
$$\int_0^1 f(x)\,dx = 1.$$
How might one use the function $f(x)$ to determine the probability that the outcome is between .2 and .5?

11 Here is another way to pick a chord at random on the circle of unit radius. Imagine that we have a card table whose sides are of length 100.
We place coordinate axes on the table in such a way that each side of the table is parallel to one of the axes, and so that the center of the table is the origin. We now place a circle of unit radius on the table so that the center of the circle is the origin. Now pick out a point $(x_0, y_0)$ at random in the square, and an angle $\theta$ at random in the interval $(-\pi/2, \pi/2)$. Let $m = \tan\theta$. Then the equation of the line passing through $(x_0, y_0)$ with slope $m$ is
$$y = y_0 + m(x - x_0)$$
and the distance of this line from the center of the circle (i.e., the origin) is
$$d = \left| \frac{y_0 - m x_0}{\sqrt{m^2 + 1}} \right|.$$
We can use this distance formula to check whether the line intersects the circle (i.e., whether $d < 1$). If so, we consider the resulting chord a random chord. This describes an experiment of dropping a long straw at random on a table on which a circle is drawn.

Write a program to simulate this experiment 10000 times and estimate the probability that the length of the chord is greater than $\sqrt{3}$. How does your estimate compare with the results of Example 2.6?

2.2 Continuous Density Functions

In the previous section we have seen how to simulate experiments with a whole continuum of possible outcomes and have gained some experience in thinking about such experiments. Now we turn to the general problem of assigning probabilities to the outcomes and events in such experiments. We shall restrict our attention here to those experiments whose sample space can be taken as a suitably chosen subset of the line, the plane, or some other Euclidean space. We begin with some simple examples.

Spinners

Example 2.7 The spinner experiment described in Example 2.1 has the interval $[0, 1)$ as the set of possible outcomes. We would like to construct a probability model in which each outcome is equally likely to occur. We saw that in such a model, it is necessary to assign the probability 0 to each outcome. This does not at all mean that the probability of every event must be zero.
On the contrary, if we let the random variable $X$ denote the outcome, then the probability $P(0 \le X < 1)$ that the head of the spinner comes to rest somewhere in the circle, should be equal to 1. Also, the probability that it comes to rest in the upper half of the circle should be the same as for the lower half, so that P(0 [...] 0 }, for example, corresponds to the statement that the dart lands in the upper half of the target, and so forth. Unless there is reason to believe otherwise (and with experts at the game there may well be!), it is natural to assume that the coordinates are chosen at random. (When doing this with a computer, each coordinate is chosen uniformly from the interval $[-1, 1]$. If the resulting point does not lie inside the unit circle, the point is not counted.) Then the arguments used in the preceding example show that the probability of any elementary event, consisting of a single outcome, must be zero, and suggest that the probability of the event that the dart lands in any subset $E$ of the target should be determined by what fraction of the target area lies in $E$. Thus,
$$P(E) = \frac{\text{area of } E}{\text{area of target}} = \frac{\text{area of } E}{\pi}.$$
This can be written in the form
$$P(E) = \int_E f(x)\,dx$$
where $f(x)$ is the constant function with value $1/\pi$. In particular, if $E = \{(x, y) : x^2 + y^2 \le a^2\}$ is the event that the dart lands within distance $a < 1$ of the center of the target, then
$$P(E) = \frac{\pi a^2}{\pi} = a^2.$$
For example, the probability that the dart lies within a distance 1/2 of the center is 1/4. □

Example 2.9 In the dart game considered above, suppose that, instead of observing where the dart lands, we observe how far it lands from the center of the target.

In this case, we take as our sample space the set $\Omega$ of all circles with centers at the center of the target. It is convenient to describe these circles by their radii, so that each circle is identified by its radius $r$, 0 [...] 1.
From this we easily calculate that the density function of $X$ is

$$f_X(x) = \begin{cases} 0, & \text{if } x \le 0, \\ 1/(2\sqrt{x}), & \text{if } 0 < x \le 1, \\ 0, & \text{if } x > 1. \end{cases}$$

Note that $F_X(x)$ is continuous, but $f_X(x)$ is not. (See Figure 2.13.) □

When referring to a continuous random variable $X$ (say with a uniform density function), it is customary to say that "$X$ is uniformly distributed on the interval $[a, b]$." It is also customary to refer to the cumulative distribution function of $X$ as the distribution function of $X$. Thus, the word "distribution" is being used in several different ways in the subject of probability. (Recall that it also has a meaning when discussing discrete random variables.) When referring to the cumulative distribution function of a continuous random variable $X$, we will always use the word "cumulative" as a modifier, unless the use of another modifier, such as "normal" or "exponential," makes it clear. Since the phrase "uniformly densitied on the interval $[a, b]$" is not acceptable English, we will have to say "uniformly distributed" instead.

Example 2.14 In Example 2.4, we considered a random variable, defined to be the sum of two random real numbers chosen uniformly from $[0, 1]$. Let the random variables $X$ and $Y$ denote the two chosen real numbers. Define $Z = X + Y$. We will now derive expressions for the cumulative distribution function and the density function of $Z$.

Here we take for our sample space $\Omega$ the unit square in $\mathbf{R}^2$ with uniform density. A point $\omega \in \Omega$ then consists of a pair $(x, y)$ of numbers chosen at random. Then $0 \le Z \le 2$. Let $E_z$ denote the event that $Z \le z$. In Figure 2.14, we show the set $E_{.8}$. The event $E_z$, for any $z$ between 0 and 1, looks very similar to the shaded set in the figure. For $1 < z \le 2$, the set $E_z$ looks like the unit square with a triangle removed from the upper right-hand corner.

Figure 2.14: Calculation of distribution function for Example 2.14.
We can now calculate the probability distribution $F_Z$ of $Z$; it is given by

$$F_Z(z) = P(Z \le z) = \text{Area of } E_z = \begin{cases} 0, & \text{if } z < 0, \\ (1/2)z^2, & \text{if } 0 \le z \le 1, \\ 1 - (1/2)(2 - z)^2, & \text{if } 1 \le z \le 2, \\ 1, & \text{if } 2 < z. \end{cases}$$

The density function is obtained by differentiating this function:

$$f_Z(z) = \begin{cases} 0, & \text{if } z < 0, \\ z, & \text{if } 0 \le z \le 1, \\ 2 - z, & \text{if } 1 \le z \le 2, \\ 0, & \text{if } 2 < z. \end{cases}$$

Figure 2.15: Distribution and density functions for Example 2.14.

[...]

Figure 2.16: Calculation of $F_Z$ for Example 2.15.

The density $f_Z(z)$ is given again by the derivative of $F_Z(z)$:

$$f_Z(z) = \begin{cases} 0, & \text{if } z \le 0, \\ 2z, & \text{if } 0 \le z \le 1, \\ 0, & \text{if } z > 1. \end{cases}$$

The reader is referred to Figure 2.17 for the graphs of these functions.

We can verify this result by simulation, as follows: We choose values for $X$ and $Y$ at random from $[0, 1]$ with uniform distribution, calculate $Z = \sqrt{X^2 + Y^2}$, check whether $0 \le Z \le 1$, and present the results in a bar graph (see Figure 2.18). □

Figure 2.18: Simulation results for Example 2.15.

Example 2.16 Suppose Mr. and Mrs. Lockhorn agree to meet at the Hanover Inn between 5:00 and 6:00 P.M. on Tuesday. Suppose each arrives at a time between 5:00 and 6:00 chosen at random with uniform probability. What is the distribution function for the length of time that the first to arrive has to wait for the other? What is the density function?

Here again we can take the unit square to represent the sample space, and $(X, Y)$ as the arrival times (after 5:00 P.M.) for the Lockhorns. Let $Z = |X - Y|$. Then we have $F_X(x) = x$ and $F_Y(y) = y$. Moreover (see Figure 2.19),

$$F_Z(z) = P(Z \le z) = P(|X - Y| \le z) = \text{Area of } E.$$

Thus, we have

$$F_Z(z) = \begin{cases} 0, & \text{if } z \le 0, \\ 1 - (1 - z)^2, & \text{if } 0 \le z \le 1, \\ 1, & \text{if } z > 1. \end{cases}$$

The density $f_Z(z)$ is again obtained by differentiation:

$$f_Z(z) = \begin{cases} 0, & \text{if } z \le 0, \\ 2(1 - z), & \text{if } 0 \le z \le 1, \\ 0, & \text{if } z > 1. \end{cases}$$

Example 2.17 There are many occasions where we observe a sequence of occurrences which occur at "random" times.
For example, we might be observing emissions of a radioactive isotope, or cars passing a milepost on a highway, or light bulbs burning out. In such cases, we might define a random variable $X$ to denote the time between successive occurrences. Clearly, $X$ is a continuous random variable whose range consists of the non-negative real numbers. It is often the case that we can model $X$ by using the exponential density. This density is given by the formula

$$f(t) = \begin{cases} \lambda e^{-\lambda t}, & \text{if } t \ge 0, \\ 0, & \text{if } t < 0. \end{cases}$$

The number $\lambda$ is a positive real number, and represents the reciprocal of the average value of $X$. (This will be shown in Chapter 6.) Thus, if the average time between occurrences is 30 minutes, then $\lambda = 1/30$. A graph of this density function with $\lambda = 1/30$ is shown in Figure 2.20. One can see from the figure that even though the average value is 30, occasionally much larger values are taken on by $X$.

Figure 2.19: Calculation of $F_Z$.

Figure 2.20: Exponential density with $\lambda = 1/30$, that is, $f(t) = (1/30)e^{-(1/30)t}$.

Suppose that we have bought a computer that contains a Warp 9 hard drive. The salesperson says that the average time between breakdowns of this type of hard drive is 30 months. It is often assumed that the length of time between breakdowns is distributed according to the exponential density. We will assume that this model applies here, with $\lambda = 1/30$.

Figure 2.21: Residual lifespan of a hard drive.

Now suppose that we have been operating our computer for 15 months. We assume that the original hard drive is still running. We ask how long we should expect the hard drive to continue to run. One could reasonably expect that the hard drive will run, on the average, another 15 months.
(One might also guess that it will run more than 15 months, since the fact that it has already run for 15 months implies that we don't have a lemon.) The time which we have to wait is a new random variable, which we will call $Y$. Obviously, $Y = X - 15$. We can write a computer program to produce a sequence of simulated $Y$-values. To do this, we first produce a sequence of $X$'s, and discard those values which are less than or equal to 15 (these values correspond to the cases where the hard drive has quit running before 15 months). To simulate a value of $X$, we compute the value of the expression

$$\left(-\frac{1}{\lambda}\right) \log(rnd),$$

where $rnd$ represents a random real number between 0 and 1. (That this expression has the exponential density will be shown in Chapter 4.3.) Figure 2.21 shows an area bar graph of 10,000 simulated $Y$-values.

The average value of $Y$ in this simulation is 29.74, which is closer to the original average life span of 30 months than to the value of 15 months which was guessed above. Also, the distribution of $Y$ is seen to be close to the distribution of $X$. It is in fact the case that $X$ and $Y$ have the same distribution. This property is called the memoryless property, because the amount of time that we have to wait for an occurrence does not depend on how long we have already waited. The only continuous density function with this property is the exponential density. □

Assignment of Probabilities

A fundamental question in practice is: How shall we choose the probability density function in describing any given experiment? The answer depends to a great extent on the amount and kind of information available to us about the experiment. In some cases, we can see that the outcomes are equally likely. In some cases, we can see that the experiment resembles another already described by a known density.
In some cases, we can run the experiment a large number of times and make a reasonable guess at the density on the basis of the observed distribution of outcomes, as we did in Chapter 1. In general, the problem of choosing the right density function for a given experiment is a central problem for the experimenter and is not always easy to solve (see Example 2.6). We shall not examine this question in detail here but instead shall assume that the right density is already known for each of the experiments under study.

The introduction of suitable coordinates to describe a continuous sample space, and a suitable density to describe its probabilities, is not always so obvious, as our final example shows.

Infinite Tree

Example 2.18 Consider an experiment in which a fair coin is tossed repeatedly, without stopping. We have seen in Example 1.6 that, for a coin tossed $n$ times, the natural sample space is a binary tree with $n$ stages. On this evidence we expect that for a coin tossed repeatedly, the natural sample space is a binary tree with an infinite number of stages, as indicated in Figure 2.22.

It is surprising to learn that, although the $n$-stage tree is obviously a finite sample space, the unlimited tree can be described as a continuous sample space. To see how this comes about, let us agree that a typical outcome of the unlimited coin tossing experiment can be described by a sequence of the form $\omega = \{H\ H\ T\ H\ T\ T\ H \ldots\}$. If we write 1 for H and 0 for T, then $\omega = \{1\ 1\ 0\ 1\ 0\ 0\ 1 \ldots\}$. In this way, each outcome is described by a sequence of 0's and 1's.

Now suppose we think of this sequence of 0's and 1's as the binary expansion of some real number $x = .1101001\cdots$ lying between 0 and 1. (A binary expansion is like a decimal expansion but based on 2 instead of 10.) Then each outcome is described by a value of $x$, and in this way $x$ becomes a coordinate for the sample space, taking on all real values between 0 and 1.
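The correspondence between toss sequences and real numbers can be tried out on finite prefixes of an outcome. The following is a minimal Python sketch, not part of the original text: the function names `tosses_to_x` and `random_outcome_prefix` are hypothetical, and truncating an infinite sequence after finitely many tosses is an assumption made so that the computation terminates.

```python
import random

def tosses_to_x(tosses):
    """Read a finite toss sequence as a binary fraction .b1 b2 b3 ...,
    writing 1 for 'H' and 0 for 'T'."""
    return sum(2.0 ** -(i + 1) for i, toss in enumerate(tosses) if toss == "H")

def random_outcome_prefix(n_tosses, rng):
    """Simulate the first n_tosses tosses of a fair coin (a truncated outcome)."""
    return [rng.choice("HT") for _ in range(n_tosses)]

# The prefix H H T H T T H of the outcome used in the text.
print(tosses_to_x("HHTHTTH"))  # 0.8203125, i.e. the binary fraction .1101001

# A simulated outcome, truncated after 30 tosses (seed fixed for repeatability).
rng = random.Random(0)
print(tosses_to_x(random_outcome_prefix(30, rng)))
```

Every extra toss changes the value of $x$ by at most $2^{-k}$ at stage $k$, so a prefix of 30 tosses already pins down $x$ to about nine decimal places.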
(We note that it is possible for two different sequences to correspond to the same real number; for example, the sequences $\{T\ H\ H\ H\ H\ H \ldots\}$ and $\{H\ T\ T\ T\ T\ T \ldots\}$ both correspond to the real number 1/2. We will not concern ourselves with this apparent problem here.)

What probabilities should be assigned to the events of this sample space? Consider, for example, the event $E$ consisting of all outcomes for which the first toss comes up heads and the second tails. Every such outcome has the form $.10{*}{*}{*}{*}\cdots$, where $*$ can be either 0 or 1. Now if $x$ is our real-valued coordinate, then the value of $x$ for every such outcome must lie between $1/2 = .10000\cdots$ and $3/4 = .11000\cdots$, and moreover, every value of $x$ between 1/2 and 3/4 has a binary expansion of the form $.10{*}{*}{*}{*}\cdots$. This means that $\omega \in E$ if and only if $1/2 \le x < 3/4$, and in this way we see that we can describe $E$ by the interval $[1/2, 3/4)$. More generally, every event consisting of outcomes for which the results of the first $n$ tosses are prescribed is described by a binary interval of the form $[k/2^n, (k + 1)/2^n)$.

Figure 2.22: Tree for infinite number of tosses of a coin.

We have already seen in Section 1.2 that in the experiment involving $n$ tosses, the probability of any one outcome must be exactly $1/2^n$. It follows that in the unlimited toss experiment, the probability of any event consisting of outcomes for which the results of the first $n$ tosses are prescribed must also be $1/2^n$. But $1/2^n$ is exactly the length of the interval of $x$-values describing $E$! Thus we see that, just as with the spinner experiment, the probability of an event $E$ is determined by what fraction of the unit interval lies in $E$.

Consider again the statement: The probability is 1/2 that a fair coin will turn up heads when tossed.
We have suggested that one interpretation of this statement is that if we toss the coin indefinitely the proportion of heads will approach 1/2. That is, in our correspondence with binary sequences we expect to get a binary sequence with the proportion of 1's tending to 1/2. The event $E$ of binary sequences for which this is true is a proper subset of the set of all possible binary sequences. It does not contain, for example, the sequence $011011011\ldots$ (i.e., (011) repeated again and again). The event $E$ is actually a very complicated subset of the binary sequences, but its probability can be determined as a limit of probabilities for events with a finite number of outcomes whose probabilities are given by finite tree measures. When the probability of $E$ is computed in this way, its value is found to be 1. This remarkable result is known as the Strong Law of Large Numbers (or Law of Averages) and is one justification for our frequency concept of probability. We shall prove a weak form of this theorem in Chapter 8. □

Exercises

1 Suppose you choose at random a real number $X$ from the interval $[2, 10]$.
(a) Find the density function $f(x)$ and the probability of an event $E$ for this experiment, where $E$ is a subinterval $[a, b]$ of $[2, 10]$.
(b) From (a), find the probability that $X > 5$, that $5 < X < 7$, and that $X^2 - 12X + 35 > 0$.

2 Suppose you choose a real number $X$ from the interval $[2, 10]$ with a density function of the form $f(x) = Cx$, where $C$ is a constant.
(a) Find $C$.
(b) Find $P(E)$, where $E = [a, b]$ is a subinterval of $[2, 10]$.
(c) Find $P(X > 5)$, $P(X < 7)$, and $P(X^2 - 12X + 35 > 0)$.

3 Same as Exercise 2, but suppose $f(x) = C/x$.

4 Suppose you throw a dart at a circular target of radius 10 inches. Assuming that you hit the target and that the coordinates of the outcomes are chosen at random, find the probability that the dart falls
(a) within 2 inches of the center.
(b) within 2 inches of the rim.
(c) within the first quadrant of the target.
(d) within the first quadrant and within 2 inches of the rim.

5 Suppose you are watching a radioactive source that emits particles at a rate described by the exponential density $f(t) = \lambda e^{-\lambda t}$, where $\lambda = 1$, so that the probability $P([0, T])$ that a particle will appear in the next $T$ seconds is $P([0, T]) = \int_0^T \lambda e^{-\lambda t}\,dt$. Find the probability that a particle (not necessarily the first) will appear
(a) within the next second.
(b) within the next 3 seconds.
(c) between 3 and 4 seconds from now.
(d) after 4 seconds from now.

6 Assume that a new light bulb will burn out after $t$ hours, where $t$ is chosen from $[0, \infty)$ with an exponential density $f(t) = \lambda e^{-\lambda t}$. In this context, $\lambda$ is often called the failure rate of the bulb.
(a) Assume that $\lambda = 0.01$, and find the probability that the bulb will not burn out before $T$ hours. This probability is often called the reliability of the bulb.
(b) For what $T$ is the reliability of the bulb equal to 1/2?

7 Choose a number $B$ at random from the interval $[0, 1]$ with uniform density. Find the probability that
(a) $1/3 < B < 2/3$.
(b) $|B - 1/2| < 1/4$.
(c) $B < 1/4$ or $1 - B < 1/4$.
(d) $3B^2 < B$.

8 Choose independently two numbers $B$ and $C$ at random from the interval $[0, 1]$ with uniform density. Note that the point $(B, C)$ is then chosen at random in the unit square. Find the probability that
(a) $B + C < 1/2$.
(b) $BC < 1/2$.
(c) $|B - C| < 1/2$.
(d) $\max\{B, C\} < 1/2$.
(e) $\min\{B, C\} < 1/2$.
(f) $B < 1/2$ and $1 - C < 1/2$.
(g) conditions (c) and (f) both hold.
(h) $B^2 + C^2 < 1/2$.
(i) $(B - 1/2)^2 + (C - 1/2)^2 < 1/4$.

9 Suppose that we have a sequence of occurrences. We assume that the time $X$ between occurrences is exponentially distributed with $\lambda = 1/10$, so on the average, there is one occurrence every 10 minutes (see Example 2.17). You come upon this system at time 100, and wait until the next occurrence. Make a conjecture concerning how long, on the average, you will have to wait. Write a program to see if your conjecture is right.
10 As in Exercise 9, assume that we have a sequence of occurrences, but now assume that the time $X$ between occurrences is uniformly distributed between 5 and 15. As before, you come upon this system at time 100, and wait until the next occurrence. Make a conjecture concerning how long, on the average, you will have to wait. Write a program to see if your conjecture is right.

11 For examples such as those in Exercises 9 and 10, it might seem that at least you should not have to wait on average more than 10 minutes if the average time between occurrences is 10 minutes. Alas, even this is not true. To see why, consider the following assumption about the times between occurrences. Assume that the time between occurrences is 3 minutes with probability .9 and 73 minutes with probability .1. Show by simulation that the average time between occurrences is 10 minutes, but that if you come upon this system at time 100, your average waiting time is more than 10 minutes.

12 Take a stick of unit length and break it into three pieces, choosing the break points at random. (The break points are assumed to be chosen simultaneously.) What is the probability that the three pieces can be used to form a triangle? Hint: The sum of the lengths of any two pieces must exceed the length of the third, so each piece must have length $< 1/2$. Now use Exercise 8(g).

13 Take a stick of unit length and break it into two pieces, choosing the break point at random. Now break the longer of the two pieces at a random point. What is the probability that the three pieces can be used to form a triangle?

14 Choose independently two numbers $B$ and $C$ at random from the interval $[-1, 1]$ with uniform distribution, and consider the quadratic equation $x^2 + Bx + C = 0$. Find the probability that the roots of this equation
(a) are both real.
(b) are both positive.
Hints: (a) requires $0 \le B^2 - 4C$, (b) requires $0 \le B^2 - 4C$, $B \le 0$, $0 \le C$.
15 At the Tunbridge World's Fair, a coin toss game works as follows. Quarters are tossed onto a checkerboard. The management keeps all the quarters, but for each quarter landing entirely within one square of the checkerboard the management pays a dollar. Assume that the edge of each square is twice the diameter of a quarter, and that the outcomes are described by coordinates chosen at random. Is this a fair game?

16 Three points are chosen at random on a circle of unit circumference. What is the probability that the triangle defined by these points as vertices has three acute angles? Hint: One of the angles is obtuse if and only if all three points lie in the same semicircle. Take the circumference as the interval $[0, 1]$. Take one point at 0 and the others at $B$ and $C$.

17 Write a program to choose a random number $X$ in the interval $[2, 10]$ 1000 times and record what fraction of the outcomes satisfy $X > 5$, what fraction satisfy $5 < X < 7$, and what fraction satisfy $X^2 - 12X + 35 > 0$. How do these results compare with Exercise 1?

18 Write a program to choose a point $(X, Y)$ at random in a square of side 20 inches, doing this 10,000 times, and recording what fraction of the outcomes fall within 10 inches of the center; of these, what fraction fall between 8 and 10 inches of the center; and, of these, what fraction fall within the first quadrant of the square. How do these results compare with those of Exercise 4?

19 Write a program to simulate the problem described in Exercise 7 (see Exercise 17). How do the simulation results compare with the results of Exercise 7?

20 Write a program to simulate the problem described in Exercise 12.

21 Write a program to simulate the problem described in Exercise 16.

22 Write a program to carry out the following experiment. A coin is tossed 100 times and the number of heads that turn up is recorded. This experiment is then repeated 1000 times.
Have your program plot a bar graph for the proportion of the 1000 experiments in which the number of heads is $n$, for each $n$ in the interval $[35, 65]$. Does the bar graph look as though it can be fit with a normal curve?

23 Write a program that picks a random number between 0 and 1 and computes the negative of its logarithm. Repeat this process a large number of times and plot a bar graph to give the number of times that the outcome falls in each interval of length 0.1 in $[0, 10]$. On this bar graph plot a graph of the density $f(x) = e^{-x}$. How well does this density fit your graph?

Chapter 3 Combinatorics

3.1 Permutations

Many problems in probability theory require that we count the number of ways that a particular event can occur. For this, we study the topics of permutations and combinations. We consider permutations in this section and combinations in the next section.

Before discussing permutations, it is useful to introduce a general counting technique that will enable us to solve a variety of counting problems, including the problem of counting the number of possible permutations of $n$ objects.

Counting Problems

Consider an experiment that takes place in several stages and is such that the number of outcomes $m$ at the $n$th stage is independent of the outcomes of the previous stages. The number $m$ may be different for different stages. We want to count the number of ways that the entire experiment can be carried out.

Example 3.1 You are eating at Emile's restaurant and the waiter informs you that you have (a) two choices for appetizers: soup or juice; (b) three for the main course: a meat, fish, or vegetable dish; and (c) two for dessert: ice cream or cake. How many possible choices do you have for your complete meal? We illustrate the possible meals by a tree diagram shown in Figure 3.1.
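The same meals can also be listed by brute force. The following is a minimal Python sketch, not part of the original text (the variable names are ours); it enumerates every combination of appetizer, main course, and dessert, one per path through the tree.

```python
from itertools import product

# The three stages of the meal at Emile's restaurant.
appetizers = ["soup", "juice"]
main_courses = ["meat", "fish", "vegetable"]
desserts = ["ice cream", "cake"]

# Each element of the Cartesian product is one complete meal,
# i.e. one root-to-leaf path in the tree diagram.
meals = list(product(appetizers, main_courses, desserts))

for meal in meals:
    print(" / ".join(meal))

print("number of meals:", len(meals))
```

The count printed at the end equals `len(appetizers) * len(main_courses) * len(desserts)`, since every choice at one stage can be combined with every choice at the others.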
Your menu is decided in three stages: at each stage the number of possible choices does not depend on what is chosen in the previous stages: two choices at the first stage, three at the second, and two at the third. From the tree diagram we see that the total number of choices is the product of the number of choices at each stage. In this example we have $2 \cdot 3 \cdot 2 = 12$ possible menus. Our menu example is an example of the following general counting technique. □

Figure 3.1: Tree for your menu.

A Counting Technique

A task is to be carried out in a sequence of $r$ stages. There are $n_1$ ways to carry out the first stage; for each of these $n_1$ ways, there are $n_2$ ways to carry out the second stage; for each of these $n_2$ ways, there are $n_3$ ways to carry out the third stage, and so forth. Then the total number of ways in which the entire task can be accomplished is given by the product $N = n_1 \cdot n_2 \cdot \ldots \cdot n_r$.

Tree Diagrams

It will often be useful to use a tree diagram when studying probabilities of events relating to experiments that take place in stages and for which we are given the probabilities for the outcomes at each stage. For example, assume that the owner of Emile's restaurant has observed that 80 percent of his customers choose the soup for an appetizer and 20 percent choose juice. Of those who choose soup, 50 percent choose meat, 30 percent choose fish, and 20 percent choose the vegetable dish. Of those who choose juice for an appetizer, 30 percent choose meat, 40 percent choose fish, and 30 percent choose the vegetable dish. We can use this to estimate the probabilities at the first two stages as indicated on the tree diagram of Figure 3.2.

We choose for our sample space the set $\Omega$ of all possible paths $\omega = \omega_1, \omega_2, \ldots, \omega_6$ through the tree. How should we assign our probability distribution?
For example, what probability should we assign to the customer choosing soup and then the meat? If 8/10 of the customers choose soup and then 1/2 of these choose meat, a proportion $8/10 \cdot 1/2 = 4/10$ of the customers choose soup and then meat. This suggests choosing our probability distribution for each path through the tree to be the product of the probabilities at each of the stages along the path. This results in the probability distribution for the sample points $\omega$ indicated in Figure 3.2. (Note that $m(\omega_1) + \cdots + m(\omega_6) = 1$.) From this we see, for example, that the probability that a customer chooses meat is $m(\omega_1) + m(\omega_4) = .46$. We shall say more about these tree measures when we discuss the concept of conditional probability in Chapter 4. We return now to more counting problems.

    Path                 omega      m(omega)
    soup, meat           omega_1    .4
    soup, fish           omega_2    .24
    soup, vegetable      omega_3    .16
    juice, meat          omega_4    .06
    juice, fish          omega_5    .08
    juice, vegetable     omega_6    .06

Figure 3.2: Two-stage probability assignment.

Example 3.2 We can show that there are at least two people in Columbus, Ohio, who have the same three initials. Assuming that each person has three initials, there are 26 possibilities for a person's first initial, 26 for the second, and 26 for the third. Therefore, there are $26^3 = 17{,}576$ possible sets of initials. This number is smaller than the number of people living in Columbus, Ohio; hence, there must be at least two people with the same three initials. □

We consider next the celebrated birthday problem, often used to show that naive intuition cannot always be trusted in probability.

Birthday Problem

Example 3.3 How many people do we need to have in a room to make it a favorable bet (probability of success greater than 1/2) that two people in the room will have the same birthday?

Since there are 365 possible birthdays, it is tempting to guess that we would need about 1/2 this number, or 183. You would surely win this bet. In fact, the number required for a favorable bet is only 23.
To show this, we find the probability $p_r$ that, in a room with $r$ people, there is no duplication of birthdays; we will have a favorable bet if this probability is less than one half.

Assume that there are 365 possible birthdays for each person (we ignore leap years). Order the people from 1 to $r$. For a sample point $\omega$, we choose a possible sequence of length $r$ of birthdays each chosen as one of the 365 possible dates. There are 365 possibilities for the first element of the sequence, and for each of these choices there are 365 for the second, and so forth, making $365^r$ possible sequences of birthdays. We must find the number of these sequences that have no duplication of birthdays. For such a sequence, we can choose any of the 365 days for the first element, then any of the remaining 364 for the second, 363 for the third, and so forth, until we make $r$ choices. For the $r$th choice, there will be $365 - r + 1$ possibilities. Hence, the total number of sequences with no duplications is

$$365 \cdot 364 \cdot 363 \cdot \ldots \cdot (365 - r + 1).$$

Thus, assuming that each sequence is equally likely,

$$p_r = \frac{365 \cdot 364 \cdot \ldots \cdot (365 - r + 1)}{365^r}.$$

We denote the product $(n)(n - 1) \cdots (n - r + 1)$ by $(n)_r$ (read "$n$ down $r$," or "$n$ lower $r$"). Thus,

$$p_r = \frac{(365)_r}{(365)^r}.$$

The program Birthday carries out this computation and prints the probabilities for $r = 20$ to 25. Running this program, we get the results shown in Table 3.1.

    Number of people    Probability that all birthdays are different
    20                  .5885616
    21                  .5563117
    22                  .5243047
    23                  .4927028
    24                  .4616557
    25                  .4313003

Table 3.1: Birthday problem.

As we asserted above, the probability for no duplication changes from greater than one half to less than one half as we move from 22 to 23 people. To see how unlikely it is that we would lose our bet for larger numbers of people, we have run the program again, printing out values from $r = 10$ to $r = 100$ in steps of 10.
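The product formula for $p_r$ is easy to evaluate directly. The following is a minimal Python sketch of such a computation; it is not the text's program Birthday, and the function name `prob_all_different` and its `days` parameter are our own assumptions.

```python
def prob_all_different(r, days=365):
    """Probability that r people all have different birthdays,
    computed as (days)_r / days**r, one factor at a time."""
    p = 1.0
    for i in range(r):
        p *= (days - i) / days
    return p

# Values for r = 20, ..., 25, the range covered by Table 3.1.
for r in range(20, 26):
    print(r, f"{prob_all_different(r):.7f}")

# Values for r = 10, 20, ..., 100.
for r in range(10, 101, 10):
    print(r, f"{prob_all_different(r):.7f}")
```

Multiplying the ratios $(365 - i)/365$ one at a time, rather than forming $(365)_r$ and $365^r$ separately, keeps every intermediate value between 0 and 1.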
We see that in a room of 40 people the odds already heavily favor a duplication, and in a room of 100 the odds are overwhelmingly in favor of a duplication.

    Number of people    Probability that all birthdays are different
    10                  .8830518
    20                  .5885616
    30                  .2936838
    40                  .1087682
    50                  .0296264
    60                  .0058773
    70                  .0008404
    80                  .0000857
    90                  .0000062
    100                 .0000003

Table 3.2: Birthday problem.

We have assumed that birthdays are equally likely to fall on any particular day. Statistical evidence suggests that this is not true. However, it is intuitively clear (but not easy to prove) that this makes it even more likely to have a duplication with a group of 23 people. (See Exercise 19 to find out what happens on planets with more or fewer than 365 days per year.)

We now turn to the topic of permutations.

Permutations

Definition 3.1 Let $A$ be any finite set. A permutation of $A$ is a one-to-one mapping of $A$ onto itself. □

To specify a particular permutation we list the elements of $A$ and, under them, show where each element is sent by the one-to-one mapping. For example, if $A = \{a, b, c\}$ a possible permutation $\sigma$ would be

$$\sigma = \begin{pmatrix} a & b & c \\ b & c & a \end{pmatrix}.$$

By the permutation $\sigma$, $a$ is sent to $b$, $b$ is sent to $c$, and $c$ is sent to $a$. The condition that the mapping be one-to-one means that no two elements of $A$ are sent, by the mapping, into the same element of $A$.

We can put the elements of our set in some order and rename them $1, 2, \ldots, n$. Then, a typical permutation of the set $A = \{a_1, a_2, a_3, a_4\}$ can be written in the form

$$\begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{pmatrix},$$

indicating that $a_1$ went to $a_2$, $a_2$ to $a_1$, $a_3$ to $a_4$, and $a_4$ to $a_3$.

If we always choose the top row to be 1 2 3 4 then, to prescribe the permutation, we need only give the bottom row, with the understanding that this tells us where 1 goes, 2 goes, and so forth, under the mapping. When this is done, the permutation is often called a rearrangement of the $n$ objects $1, 2, 3, \ldots, n$.
For example, all possible permutations, or rearrangements, of the numbers A = {1, 2, 3} are:

123, 132, 213, 231, 312, 321.

It is an easy matter to count the number of possible permutations of n objects. By our general counting principle, there are n ways to assign the first element, for each of these we have n - 1 ways to assign the second object, n - 2 for the third, and so forth. This proves the following theorem.

Theorem 3.1 The total number of permutations of a set A of n elements is given by $n \cdot (n-1) \cdot (n-2) \cdots 1$.

It is sometimes helpful to consider orderings of subsets of a given set. This prompts the following definition.

Definition 3.2 Let A be an n-element set, and let k be an integer between 0 and n. Then a k-permutation of A is an ordered listing of a subset of A of size k. □

Using the same techniques as in the last theorem, the following result is easily proved.

Theorem 3.2 The total number of k-permutations of a set A of n elements is given by $n \cdot (n-1) \cdot (n-2) \cdots (n-k+1)$.

Factorials

The number given in Theorem 3.1 is called n factorial, and is denoted by n!. The expression 0! is defined to be 1 to make certain formulas come out simpler. The first few values of this function are shown in Table 3.3. The reader will note that this function grows very rapidly.

     n         n!
     0          1
     1          1
     2          2
     3          6
     4         24
     5        120
     6        720
     7       5040
     8      40320
     9     362880
    10    3628800

Table 3.3: Values of the factorial function.

The expression n! will enter into many of our calculations, and we shall need to have some estimate of its magnitude when n is large. It is clearly not practical to make exact calculations in this case. We shall instead use a result called Stirling's formula. Before stating this formula we need a definition.

Definition 3.3 Let $a_n$ and $b_n$ be two sequences of numbers. We say that $a_n$ is asymptotically equal to $b_n$, and write $a_n \sim b_n$, if
\[ \lim_{n \to \infty} \frac{a_n}{b_n} = 1 . \]
□

Example 3.4 If $a_n = n + \sqrt{n}$ and $b_n = n$ then, since $a_n/b_n = 1 + 1/\sqrt{n}$ and this ratio tends to 1 as n tends to infinity, we have $a_n \sim b_n$. □

Theorem 3.3 (Stirling's Formula) The sequence n! is asymptotically equal to
\[ n^n e^{-n} \sqrt{2\pi n} . \]

The proof of Stirling's formula may be found in most analysis texts. Let us verify this approximation by using the computer. The program StirlingApproximations prints n!, the Stirling approximation, and, finally, the ratio of these two numbers. Sample output of this program is shown in Table 3.4. Note that, while the ratio of the numbers is getting closer to 1, the difference between the exact value and the approximation is increasing, and indeed, this difference will tend to infinity as n tends to infinity, even though the ratio tends to 1. (This was also true in our Example 3.4 where $n + \sqrt{n} \sim n$, but the difference is $\sqrt{n}$.)

     n         n!    Approximation    Ratio
     1          1             .922    1.084
     2          2            1.919    1.042
     3          6            5.836    1.028
     4         24           23.506    1.021
     5        120          118.019    1.016
     6        720          710.078    1.013
     7       5040         4980.396    1.011
     8      40320        39902.395    1.010
     9     362880       359536.873    1.009
    10    3628800      3598696.619    1.008

Table 3.4: Stirling approximations to the factorial function.

Generating Random Permutations

We now consider the question of generating a random permutation of the integers between 1 and n. Consider the following experiment. We start with a deck of n cards, labelled 1 through n. We choose a random card out of the deck, note its label, and put the card aside. We repeat this process until all n cards have been chosen. It is clear that each permutation of the integers from 1 to n can occur as a sequence of labels in this experiment, and that each sequence of labels is equally likely to occur. In our implementations of the computer algorithms, the above procedure is called RandomPermutation.

Fixed Points

There are many interesting problems that relate to properties of a permutation chosen at random from the set of all permutations of a given finite set. For example, since a permutation is a one-to-one mapping of the set onto itself, it is interesting to ask how many points are mapped onto themselves. We call such points fixed points of the mapping.

Let $p_k(n)$ be the probability that a random permutation of the set {1, 2, ..., n} has exactly k fixed points. We will attempt to learn something about these probabilities using simulation. The program FixedPoints uses the procedure RandomPermutation to generate random permutations and count fixed points. The program prints the proportion of times that there are k fixed points as well as the average number of fixed points. The results of this program for 500 simulations for the cases n = 10, 20, and 30 are shown in Table 3.5.

    Number of fixed points      Fraction of permutations
                                n = 10   n = 20   n = 30
    0                            .362     .370     .358
    1                            .368     .396     .358
    2                            .202     .164     .192
    3                            .052     .060     .070
    4                            .012     .008     .020
    5                            .004     .002     .002
    Average number of
    fixed points                 .996     .948    1.042

Table 3.5: Fixed point distributions.

Notice the rather surprising fact that our estimates for the probabilities do not seem to depend very heavily on the number of elements in the permutation. For example, the probability that there are no fixed points, when n = 10, 20, or 30, is estimated to be between .35 and .37. We shall see later (see Example 3.12) that for $n \ge 10$ the exact probabilities $p_0(n)$ are, to six decimal place accuracy, equal to $1/e \approx .367879$. Thus, for all practical purposes, after n = 10 the probability that a random permutation of the set {1, 2, ..., n} has no fixed points does not depend upon n. These simulations also suggest that the average number of fixed points is close to 1. It can be shown (see Example 6.8) that the average is exactly equal to 1 for all n.
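The experiment just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's own RandomPermutation or FixedPoints programs: the function names, the seed, and the number of trials are all choices made here for the example.

```python
import random


def random_permutation(n, rng):
    """Draw cards labelled 1..n one at a time, without replacement."""
    deck = list(range(1, n + 1))
    drawn = []
    while deck:
        # Choose a random card still in the deck, note its label, set it aside.
        drawn.append(deck.pop(rng.randrange(len(deck))))
    return drawn


def count_fixed_points(perm):
    """Number of positions i (1-based) with perm[i - 1] == i."""
    return sum(1 for i, value in enumerate(perm, start=1) if value == i)


def fixed_point_experiment(n, trials, seed=0):
    """Return (fractions, average); fractions[k] estimates p_k(n).

    The seed and the trial count are illustrative assumptions.
    """
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        counts[count_fixed_points(random_permutation(n, rng))] += 1
    fractions = [c / trials for c in counts]
    average = sum(k * c for k, c in enumerate(counts)) / trials
    return fractions, average


if __name__ == "__main__":
    for n in (10, 20, 30):
        fractions, average = fixed_point_experiment(n, trials=500)
        print(n, [round(f, 3) for f in fractions[:6]], round(average, 3))
```

With only 500 trials the printed fractions fluctuate from run to run in roughly the way Table 3.5 does; raising the number of trials should tighten the estimates around 1/e for k = 0 and around 1 for the average.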
More picturesque versions of the fixed-point problem are: You have arranged the books on your book shelf in alphabetical order by author and they get returned to your shelf at random; what is the probability that exactly k of the books end up in their correct position? (The library problem.) In a restaurant n hats are checked and they are hopelessly scrambled; what is the probability that no one gets his own hat back? (The hat check problem.) In the Historical Remarks at the end of this section, we give one method for solving the hat check problem exactly. Another method is given in Example 3.12.

Records

Here is another interesting probability problem that involves permutations. Estimates for the amount of measured snow in inches in Hanover, New Hampshire, in the ten years from 1974 to 1983 are shown in Table 3.6.

    Date    Snowfall in inches
    1974     75
    1975     88
    1976     72
    1977    110
    1978     85
    1979     30
    1980     55
    1981     86
    1982     51
    1983     64

Table 3.6: Snowfall in Hanover.

Suppose we have started keeping records in 1974. Then our first year's snowfall could be considered a record snowfall starting from this year. A new record was established in 1975; the next record was established in 1977, and there were no new records established after this year. Thus, in this ten-year period, there were three records established: 1974, 1975, and 1977. The question that we ask is: How many records should we expect to be established in such a ten-year period? We can count the number of records in terms of a permutation as follows: We number the years from 1 to 10. The actual amounts of snowfall are not important but their relative sizes are. We can, therefore, change the numbers measuring snowfalls to numbers 1 to 10 by replacing the smallest number by 1, the next smallest by 2, and so forth. (We assume that there are no ties.) For our example, we obtain the data shown in Table 3.7.

    Year       1   2   3   4   5   6   7   8   9  10
    Ranking    6   9   5  10   7   1   3   8   2   4

Table 3.7: Ranking of total snowfall.
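The ranking step can be checked mechanically. The following is a small Python sketch using the snowfall figures of Table 3.6; the helper names `ranking` and `record_positions` are hypothetical, chosen only for this illustration.

```python
# Snowfall in inches, Hanover, 1974-1983 (Table 3.6).
snowfall = {
    1974: 75, 1975: 88, 1976: 72, 1977: 110, 1978: 85,
    1979: 30, 1980: 55, 1981: 86, 1982: 51, 1983: 64,
}


def ranking(values):
    """Replace the smallest value by 1, the next smallest by 2, and so on.

    Assumes there are no ties, as in the text.
    """
    rank_of = {v: r for r, v in enumerate(sorted(values), start=1)}
    return [rank_of[v] for v in values]


def record_positions(perm):
    """1-based positions whose entry exceeds every earlier entry."""
    records = []
    best = None
    for i, value in enumerate(perm, start=1):
        if best is None or value > best:
            records.append(i)
            best = value
    return records


amounts = [snowfall[year] for year in sorted(snowfall)]
ranks = ranking(amounts)
print(ranks)                    # [6, 9, 5, 10, 7, 1, 3, 8, 2, 4]
print(record_positions(ranks))  # [1, 2, 4]
```

The first printed list reproduces the ranking row of Table 3.7, and the second gives the years (numbered from 1) in which a record occurred, corresponding to 1974, 1975, and 1977.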
This gives us a permutation of the numbers from 1 to 10 and, from this permutation, we can read off the records; they are in years 1, 2, and 4. Thus we can define records for a permutation as follows:

Definition 3.4 Let $\sigma$ be a permutation of the set {1, 2, ..., n}. Then i is a record of $\sigma$ if either i = 1 or $\sigma(j) < \sigma(i)$ for every j = 1, ..., i - 1. □

Now if we regard all rankings of snowfalls over an n-year period to be equally likely (and allow no ties), we can estimate the probability that there will be k records in n years as well as the average number of records by simulation.

We have written a program Records that counts the number of records in randomly chosen permutations. We have run this program for the cases n = 10, 20, 30. For n = 10 the average number of records is 2.968, for 20 it is 3.656, and for 30 it is 3.960. We see now that the averages increase, but very slowly. We shall see later (see Example 6.11) that the average number is approximately log n. Since log 10 = 2.3, log 20 = 3, and log 30 = 3.4, this is consistent with the results of our simulations.

As remarked earlier, we shall be able to obtain formulas for exact results of certain problems of the above type. However, only minor changes in the problem make this impossible. The power of simulation is that minor changes in a problem do not make the simulation much more difficult. (See Exercise 20 for an interesting variation of the hat check problem.)

List of Permutations

Another method to solve problems that is not sensitive to small changes in the problem is to have the computer simply list all possible permutations and count the fraction that have the desired property. The program AllPermutations produces a list of all of the permutations of n. When we try running this program, we run into a limitation on the use of the computer. The number of permutations of n increases so rapidly that even to list all permutations of 20 objects is impractical.
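For small n, the listing approach can be sketched with Python's standard `itertools.permutations`. This is a stand-in written for illustration, not the book's AllPermutations program; the function names are assumptions, and the two properties counted (no fixed points, number of records) are simply the ones discussed in this section.

```python
from fractions import Fraction
from itertools import permutations


def count_fixed_points(perm):
    """Number of positions i (1-based) with perm[i - 1] == i."""
    return sum(1 for i, value in enumerate(perm, start=1) if value == i)


def count_records(perm):
    """Number of entries that exceed every earlier entry (Definition 3.4)."""
    records = 0
    best = 0  # entries are 1..n, so 0 is below all of them
    for value in perm:
        if value > best:
            records += 1
            best = value
    return records


def exact_by_listing(n):
    """List all n! permutations of 1..n.

    Returns (fraction with no fixed point, average number of records),
    both as exact fractions.
    """
    total = 0
    no_fixed = 0
    record_sum = 0
    for perm in permutations(range(1, n + 1)):
        total += 1
        if count_fixed_points(perm) == 0:
            no_fixed += 1
        record_sum += count_records(perm)
    return Fraction(no_fixed, total), Fraction(record_sum, total)


if __name__ == "__main__":
    # n! grows so fast that only small n are practical here.
    for n in range(1, 9):
        p0, mean_records = exact_by_listing(n)
        print(n, float(p0), float(mean_records))
```

Because every permutation is visited, the results are exact rather than estimated, but the running time grows like n!, which is the limitation described above.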
Historical Remarks

Our basic counting principle stated that if you can do one thing in r ways and for each of these another thing in s ways, then you can do the pair in rs ways. This is such a self-evident result that you might expect that it occurred very early in mathematics. N. L. Biggs suggests that we might trace an example of this principle as follows: First, he relates a popular nursery rhyme dating back to at least 1730:

    As I was going to St. Ives,
    I met a man with seven wives,
    Each wife had seven sacks,
    Each sack had seven cats,
    Each cat had seven kits.
    Kits, cats, sacks and wives,
    How many were going to St. Ives?

(You need our principle only if you are not clever enough to realize that you are supposed to answer one, since only the narrator is going to St. Ives; the others are going in the other direction!)

He also gives a problem appearing on one of the oldest surviving mathematical manuscripts of about 1650 B.C., roughly translated as:

    Houses        7
    Cats         49
    Mice        343
    Wheat      2401
    Hekat     16807
              -----
              19607

The following interpretation has been suggested: there are seven houses, each with seven cats; each cat kills seven mice; each mouse would have eaten seven heads of wheat, each of which would have produced seven hekat measures of grain. With this interpretation, the table answers the question of how many hekat measures were saved by the cats' actions. It is not clear why the writer of the table wanted to add the numbers together.¹

One of the earliest uses of factorials occurred in Euclid's proof that there are infinitely many prime numbers. Euclid argued that there must be a prime number between n and n! + 1 as follows: n! and n! + 1 cannot have common factors. Either n! + 1 is prime or it has a proper factor. In the latter case, this factor cannot divide n! and hence must be between n and n! + 1. If this factor is not prime, then it has a factor that, by the same argument, must be bigger than n.
In this way, we eventually reach a prime bigger than n, and this holds for all n.

The "n!" rule for the number of permutations seems to have occurred first in India. Examples have been found as early as 300 B.C., and by the eleventh century the general formula seems to have been well known in India and then in the Arab countries.

The hat check problem is found in an early probability book written by de Montmort and first printed in 1708.² It appears in the form of a game called Treize. In a simplified version of this game considered by de Montmort one turns over cards numbered 1 to 13, calling out 1, 2, ..., 13 as the cards are examined. De Montmort asked for the probability that no card that is turned up agrees with the number called out.

This probability is the same as the probability that a random permutation of 13 elements has no fixed point. De Montmort solved this problem by the use of a recursion relation as follows: let $w_n$ be the number of permutations of n elements with no fixed point (such permutations are called derangements). Then $w_1 = 0$ and $w_2 = 1$.

Now assume that $n \ge 3$ and choose a derangement of the integers between 1 and n. Let k be the integer in the first position in this derangement. By the definition of derangement, we have $k \ne 1$. There are two possibilities of interest concerning the position of 1 in the derangement: either 1 is in the kth position or it is elsewhere. In the first case, the n - 2 remaining integers can be positioned in $w_{n-2}$ ways without resulting in any fixed points. In the second case, we consider the set of integers {1, 2, ..., k - 1, k + 1, ..., n}. The numbers in this set must occupy the positions {2, 3, ..., n} so that none of the numbers other than 1 in this set are fixed, and also so that 1 is not in position k. The number of ways of achieving this kind of arrangement is just $w_{n-1}$. Since there are n - 1 possible values of k, we see that
\[ w_n = (n-1) w_{n-1} + (n-1) w_{n-2} \]
for $n \ge 3$. One might conjecture from this last equation that the sequence $\{w_n\}$ grows like the sequence $\{n!\}$. In fact, it is easy to prove by induction that
\[ w_n = n w_{n-1} + (-1)^n . \]
Then $p_i = w_i / i!$ satisfies
\[ p_i - p_{i-1} = \frac{(-1)^i}{i!} . \]
If we sum from i = 2 to n, and use the fact that $p_1 = 0$, we obtain
\[ p_n = \frac{1}{2!} - \frac{1}{3!} + \cdots + \frac{(-1)^n}{n!} . \]
This agrees with the first n + 1 terms of the expansion for $e^x$ for x = -1 and hence for large n is approximately $e^{-1} \approx .368$. David remarks that this was possibly the first use of the exponential function in probability.³ We shall see another way to derive de Montmort's result in the next section, using a method known as the Inclusion-Exclusion method.

¹ N. L. Biggs, "The Roots of Combinatorics," Historia Mathematica, vol. 6 (1979), pp. 109-136.
² P. R. de Montmort, Essay d'Analyse sur des Jeux de Hazard, 2d ed. (Paris: Quillau, 1713).

Recently, a related problem appeared in a column of Marilyn vos Savant.⁴ Charles Price wrote to ask about his experience playing a certain form of solitaire, sometimes called "frustration solitaire." In this particular game, a deck of cards is shuffled, and then dealt out, one card at a time. As the cards are being dealt, the player counts from 1 to 13, and then starts again at 1. (Thus, each number is counted four times.) If a number that is being counted coincides with the rank of the card that is being turned up, then the player loses the game. Price found that he rarely won and wondered how often he should win. Vos Savant remarked that the expected number of matches is 4 so it should be difficult to win the game.

Finding the chance of winning is a harder problem than the one that de Montmort solved because, when one goes through the entire deck, there are different patterns for the matches that might occur. For example matches may occur for two cards of the same rank, say two aces, or for two different ranks, say a two and a three.
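Although an exact analysis is harder, the game itself is easy to simulate. The Python sketch below plays frustration solitaire as described above and estimates both the chance of winning and the average number of matches; the function names, the seed, and the number of trials are assumptions made for this illustration.

```python
import random


def count_matches(rng):
    """Deal one shuffled 52-card deck and return the number of matches.

    Ranks are coded 1..13, with four cards of each rank. The count runs
    1, 2, ..., 13 four times; a match occurs when the rank of the card
    turned up equals the number being counted.
    """
    deck = [rank for rank in range(1, 14) for _ in range(4)]
    rng.shuffle(deck)
    return sum(1 for i, rank in enumerate(deck) if rank == i % 13 + 1)


def estimate(trials, seed=0):
    """Return (fraction of games won, average number of matches)."""
    rng = random.Random(seed)
    wins = 0
    total_matches = 0
    for _ in range(trials):
        matches = count_matches(rng)
        total_matches += matches
        if matches == 0:
            wins += 1
    return wins / trials, total_matches / trials


if __name__ == "__main__":
    win_fraction, average_matches = estimate(trials=20000)
    print("estimated chance of winning:", win_fraction)
    print("average number of matches:  ", average_matches)
```

The average number of matches should come out close to 4, in line with vos Savant's remark, and the fraction of games won should be small, consistent with Price's experience.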
A discussion of this problem can be found in Riordan.⁵ In this book, it is shown that as $n \to \infty$, the probability of no matches tends to $1/e^4$.

The original game of Treize is more difficult to analyze than frustration solitaire. The game of Treize is played as follows. One person is chosen as dealer and the others are players. Each player, other than the dealer, puts up a stake. The dealer shuffles the cards and turns them up one at a time calling out, "Ace, two, three, ..., king," just as in frustration solitaire. If the dealer goes through the 13 cards without a match he pays the players an amount equal to their stake, and the deal passes to someone else. If there is a match the dealer collects the players' stakes; the players put up new stakes, and the dealer continues through the deck, calling out, "Ace, two, three, ...." If the dealer runs out of cards he reshuffles and continues the count where he left off. He continues until there is a run of 13 without a match and then a new dealer is chosen.

The question at this point is how much money can the dealer expect to win from each player. De Montmort found that if each player puts up a stake of 1, say, then the dealer will win approximately .801 from each player.

³ F. N. David, Games, Gods and Gambling (London: Griffin, 1962), p. 146.
⁴ M. vos Savant, Ask Marilyn, Parade Magazine, Boston Globe, 21 August 1994.
⁵ J. Riordan, An Introduction to Combinatorial Analysis (New York: John Wiley & Sons, 1958).

Peter Doyle calculated the exact amount that the dealer can expect to win.
The answer is: 26516072156010218582227607912734182784642120482136091446715371962089931 52311343541724554334912870541440299239251607694113500080775917818512013 82176876653563173852874555859367254632009477403727395572807459384342747 87664965076063990538261189388143513547366316017004945507201764278828306 60117107953633142734382477922709835281753299035988581413688367655833113 24476153310720627474169719301806649152698704084383914217907906954976036 28528211590140316202120601549126920880824913325553882692055427830810368 57818861208758248800680978640438118582834877542560955550662878927123048 26997601700116233592793308297533642193505074540268925683193887821301442 70519791882/ 33036929133582592220117220713156071114975101149831063364072138969878007 99647204708825303387525892236581323015628005621143427290625658974433971 65719454122908007086289841306087561302818991167357863623756067184986491 35353553622197448890223267101158801016285931351979294387223277033396967 79797069933475802423676949873661605184031477561560393380257070970711959 69641268242455013319879747054693517809383750593488858698672364846950539 88868628582609905586271001318150621134407056983214740221851567706672080 94586589378459432799868706334161812988630496327287254818458879353024498 00322425586446741048147720934108061350613503856973048971213063937040515 59533731591. This is .803 to 3 decimal places. A description of the algorithm used to find this answer can be found on his Web page.6 A discussion of this problem and other problems can be found in Doyle et al.7 The birthday problem does not seem to have a very old history. Problems of this type were first discussed by von Mises.8 It was made popular in the 1950s by Feller's book.9 6P. Doyle, "Solution to Montmort's Probleme du Treize," http://math.ucsd.edu/~doyle/. 7P. Doyle, C. Grinstead, and J. Snell, "Frustration Solitaire," UMAP Jouirnal, vol. 16, no. 2 (1995), pp. 137-145. 8R. 
von Mises, "Über Aufteilungs- und Besetzungs-Wahrscheinlichkeiten," Revue de la Faculté des Sciences de l'Université d'Istanbul, N. S. vol. 4 (1938-39), pp. 145-163.
⁹ W. Feller, Introduction to Probability Theory and Its Applications, vol. 1, 3rd ed. (New York: John Wiley & Sons, 1968).

Stirling presented his formula
\[ n! \sim \sqrt{2\pi n} \left( \frac{n}{e} \right)^n \]
in his work Methodus Differentialis published in 1730.¹⁰ This approximation was used by de Moivre in establishing his celebrated central limit theorem that we will study in Chapter 9. De Moivre himself had independently established this approximation, but without identifying the constant $\pi$. Having established the approximation
\[ \frac{2B}{\sqrt{n}} \]
for the central term of the binomial distribution, where the constant B was determined by an infinite series, de Moivre writes:

    ... my worthy and learned Friend, Mr. James Stirling, who had applied himself after me to that inquiry, found that the Quantity B did denote the Square-root of the Circumference of a Circle whose Radius is Unity, so that if that Circumference be called c the Ratio of the middle Term to the Sum of all Terms will be expressed by $2/\sqrt{nc}$ ....¹¹

¹⁰ J. Stirling, Methodus Differentialis (London: Bowyer, 1730).
¹¹ A. de Moivre, The Doctrine of Chances, 3rd ed. (London: Millar, 1756).

Exercises

1 Four people are to be arranged in a row to have their picture taken. In how many ways can this be done?

2 An automobile manufacturer has four colors available for automobile exteriors and three for interiors. How many different color combinations can he produce?

3 In a digital computer, a bit is one of the integers {0, 1}, and a word is any string of 32 bits. How many different words are possible?

4 What is the probability that at least 2 of the presidents of the United States have died on the same day of the year? If you bet this has happened, would you win your bet?

5 There are three different routes connecting city A to city B. How many ways can a round trip be made from A to B and back? How many ways if it is desired to take a different route on the way back?

6 In arranging people around a circular table, we take into account their seats relative to each other, not the actual position of any one person. Show that n people can be arranged around a circular table in (n - 1)! ways.

7 Five people get on an elevator that stops at five floors. Assuming that each has an equal probability of going to any one floor, find the probability that they all get off at different floors.

8 A finite set $\Omega$ has n elements. Show that if we count the empty set and $\Omega$ as subsets, there are $2^n$ subsets of $\Omega$.

9 A more refined inequality for approximating n! is given by
\[ \sqrt{2\pi n} \left( \frac{n}{e} \right)^n e^{1/(12n+1)} < n! < \sqrt{2\pi n} \left( \frac{n}{e} \right)^n e^{1/(12n)} . \]
Write a computer program to illustrate this inequality for n = 1 to 9.

10 A deck of ordinary cards is shuffled and 13 cards are dealt. What is the probability that the last card dealt is an ace?

11 There are n applicants for the director of computing. The applicants are interviewed independently by each member of the three-person search committee and ranked from 1 to n. A candidate will be hired if he or she is ranked first by at least two of the three interviewers. Find the probability that a candidate will be accepted if the members of the committee really have no ability at all to judge the candidates and just rank the candidates randomly. In particular, compare this probability for the case of three candidates and the case of ten candidates.

12 A symphony orchestra has in its repertoire 30 Haydn symphonies, 15 modern works, and 9 Beethoven symphonies. Its program always consists of a Haydn symphony followed by a modern work, and then a Beethoven symphony.

   (a) How many different programs can it play?

   (b) How many different programs are there if the three pieces can be played in any order?
   (c) How many different three-piece programs are there if more than one piece from the same category can be played and they can be played in any order?

13 A certain state has license plates showing three numbers and three letters. How many different license plates are possible

   (a) if the numbers must come before the letters?

   (b) if there is no restriction on where the letters and numbers appear?

14 The door on the computer center has a lock which has five buttons numbered from 1 to 5. The combination of numbers that opens the lock is a sequence of five numbers and is reset every week.

   (a) How many combinations are possible if every button must be used once?

   (b) Assume that the lock can also have combinations that require you to push two buttons simultaneously and then the other three one at a time. How many more combinations does this permit?

15 A computing center has 3 processors that receive n jobs, with the jobs assigned to the processors purely at random so that all of the $3^n$ possible assignments are equally likely. Find the probability that exactly one processor has no jobs.

16 Prove that at least two people in Atlanta, Georgia, have the same initials, assuming no one has more than four initials.

17 Find a formula for the probability that among a set of n people, at least two have their birthdays in the same month of the year (assuming the months are equally likely for birthdays).

18 Consider the problem of finding the probability of more than one coincidence of birthdays in a group of n people. These include, for example, three people with the same birthday, or two pairs of people with the same birthday, or larger coincidences. Show how you could compute this probability, and write a computer program to carry out this computation. Use your program to find the smallest number of people for which it would be a favorable bet that there would be more than one coincidence of birthdays.
*19 Suppose that on planet Zorg a year has n days, and that the lifeforms there are equally likely to have hatched on any day of the year. We would like to estimate d, which is the minimum number of lifeforms needed so that the probability of at least two sharing a birthday exceeds 1/2.

   (a) In Example 3.3, it was shown that in a set of d lifeforms, the probability that no two lifeforms share a birthday is
   \[ \frac{(n)_d}{n^d} , \]
   where $(n)_d = (n)(n-1) \cdots (n-d+1)$. Thus, we would like to set this equal to 1/2 and solve for d.

   (b) Using Stirling's Formula, show that
   \[ \frac{(n)_d}{n^d} \sim \left( 1 + \frac{d}{n-d} \right)^{n-d+1/2} e^{-d} . \]

   (c) Now take the logarithm of the right-hand expression, and use the fact that for small values of x, we have
   \[ \log(1+x) \sim x - \frac{x^2}{2} . \]
   (We are implicitly using the fact that d is of smaller order of magnitude than n. We will also use this fact in part (d).)

   (d) Set the expression found in part (c) equal to -log(2), and solve for d as a function of n, thereby showing that
   \[ d \sim \sqrt{2 (\log 2)\, n} . \]
   Hint: If all three summands in the expression found in part (b) are used, one obtains a cubic equation in d. If the smallest of the three terms is thrown away, one obtains a quadratic equation in d.

   (e) Use a computer to calculate the exact values of d for various values of n. Compare these values with the approximate values obtained by using the answer to part (d).

20 At a mathematical conference, ten participants are randomly seated around a circular table for meals. Using simulation, estimate the probability that no two people sit next to each other at both lunch and dinner. Can you make an intelligent conjecture for the case of n participants when n is large?

21 Modify the program AllPermutations to count the number of permutations of n objects that have exactly j fixed points for j = 0, 1, 2, ..., n. Run your program for n = 2 to 6. Make a conjecture for the relation between the number that have 0 fixed points and the number that have exactly 1 fixed point.
A proof of the correct conjecture can be found in Wilf.¹²

22 Mr. Wimply Dimple, one of London's most prestigious watch makers, has come to Sherlock Holmes in a panic, having discovered that someone has been producing and selling crude counterfeits of his best selling watch. The 16 counterfeits so far discovered bear stamped numbers, all of which fall between 1 and 56, and Dimple is anxious to know the extent of the forger's work. All present agree that it seems reasonable to assume that the counterfeits thus far produced bear consecutive numbers from 1 to whatever the total number is.

"Chin up, Dimple," opines Dr. Watson. "I shouldn't worry overly much if I were you; the Maximum Likelihood Principle, which estimates the total number as precisely that which gives the highest probability for the series of numbers found, suggests that we guess 56 itself as the total. Thus, your forgers are not a big operation, and we shall have them safely behind bars before your business suffers significantly."

"Stuff, nonsense, and bother your fancy principles, Watson," counters Holmes. "Anyone can see that, of course, there must be quite a few more than 56 watches. Why, the odds of our having discovered precisely the highest numbered watch made are laughably negligible. A much better guess would be twice 56."

   (a) Show that Watson is correct that the Maximum Likelihood Principle gives 56.

   (b) Write a computer program to compare Holmes's and Watson's guessing strategies as follows: fix a total N and choose 16 integers randomly between 1 and N. Let m denote the largest of these. Then Watson's guess for N is m, while Holmes's is 2m. See which of these is closer to N. Repeat this experiment (with N still fixed) a hundred or more times, and determine the proportion of times that each comes closer. Whose seems to be the better strategy?

¹² H. S. Wilf, "A Bijection in the Theory of Derangements," Mathematics Magazine, vol. 57, no. 1 (1984), pp. 37-40.
23 Barbara Smith is interviewing candidates to be her secretary. As she interviews the candidates, she can determine the relative rank of the candidates but not the true rank. Thus, if there are six candidates and their true rank is 6, 1, 4, 2, 3, 5 (where 1 is best), then after she had interviewed the first three candidates she would rank them 3, 1, 2. As she interviews each candidate, she must either accept or reject the candidate. If she does not accept the candidate after the interview, the candidate is lost to her. She wants to decide on a strategy for deciding when to stop and accept a candidate that will maximize the probability of getting the best candidate. Assume that there are n candidates and they arrive in a random rank order.

   (a) What is the probability that Barbara gets the best candidate if she interviews all of the candidates? What is it if she chooses the first candidate?

   (b) Assume that Barbara decides to interview the first half of the candidates and then continue interviewing until getting a candidate better than any candidate seen so far. Show that she has a better than 25 percent chance of ending up with the best candidate.

24 For the task described in Exercise 23, it can be shown¹³ that the best strategy is to pass over the first k - 1 candidates where k is the smallest integer for which
\[ \frac{1}{k} + \frac{1}{k+1} + \cdots + \frac{1}{n-1} \le 1 . \]
Using this strategy the probability of getting the best candidate is approximately 1/e = .368. Write a program to simulate Barbara Smith's interviewing if she uses this optimal strategy, using n = 10, and see if you can verify that the probability of success is approximately 1/e.

¹³ E. B. Dynkin and A. A. Yushkevich, Markov Processes: Theorems and Problems, trans. J. S. Wood (New York: Plenum, 1969).

3.2 Combinations

Having mastered permutations, we now consider combinations. Let U be a set with n elements; we want to count the number of distinct subsets of the set U that have exactly j elements. The empty set and the set U are considered to be subsets of U. The empty set is usually denoted by $\emptyset$.

Example 3.5 Let U = {a, b, c}. The subsets of U are

$\emptyset$, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}. □

Binomial Coefficients

The number of distinct subsets with j elements that can be chosen from a set with n elements is denoted by $\binom{n}{j}$, and is pronounced "n choose j." The number $\binom{n}{j}$ is called a binomial coefficient. This terminology comes from an application to algebra which will be discussed later in this section.

In the above example, there is one subset with no elements, three subsets with exactly 1 element, three subsets with exactly 2 elements, and one subset with exactly 3 elements. Thus, $\binom{3}{0} = 1$, $\binom{3}{1} = 3$, $\binom{3}{2} = 3$, and $\binom{3}{3} = 1$. Note that there are $2^3 = 8$ subsets in all. (We have already seen that a set with n elements has $2^n$ subsets; see Exercise 3.1.8.) It follows that
\[ \binom{3}{0} + \binom{3}{1} + \binom{3}{2} + \binom{3}{3} = 2^3 = 8 . \]
In general,
\[ \binom{n}{0} = \binom{n}{n} = 1 , \]
since there is only one way to choose a set with no elements and only one way to choose a set with n elements. Assume that n > 0. Then the remaining values of $\binom{n}{j}$ are determined by the following recurrence relation:

Theorem 3.4 For integers n and j, with 0 < j < n, the binomial coefficients satisfy:
\[ \binom{n}{j} = \binom{n-1}{j} + \binom{n-1}{j-1} . \tag{3.1} \]

Proof. We wish to choose a subset of j elements. Choose an element u of U. Assume first that we do not want u in the subset. Then we must choose the j elements from a set of n - 1 elements; this can be done in $\binom{n-1}{j}$ ways. On the other hand, assume that we do want u in the subset. Then we must choose the other j - 1 elements from the remaining n - 1 elements of U; this can be done in $\binom{n-1}{j-1}$ ways. Since u is either in our subset or not, the number of ways that we can choose a subset of j elements is the sum of the number of subsets of j elements which have u as a member and the number which do not; this is what Equation 3.1 states. □

The binomial coefficient $\binom{n}{j}$ is defined to be 0, if j < 0 or if j > n.
With this definition, the restrictions on j in Theorem 3.4 are unnecessary.

          j = 0    1    2    3    4    5    6    7    8    9   10
    n = 0     1
        1     1    1
        2     1    2    1
        3     1    3    3    1
        4     1    4    6    4    1
        5     1    5   10   10    5    1
        6     1    6   15   20   15    6    1
        7     1    7   21   35   35   21    7    1
        8     1    8   28   56   70   56   28    8    1
        9     1    9   36   84  126  126   84   36    9    1
       10     1   10   45  120  210  252  210  120   45   10    1

Figure 3.3: Pascal's triangle.

Pascal's Triangle

The relation 3.1, together with the knowledge that
\[ \binom{n}{0} = \binom{n}{n} = 1 , \]
determines completely the numbers $\binom{n}{j}$. We can use these relations to determine the famous triangle of Pascal, which exhibits all these numbers in matrix form (see Figure 3.3).

The nth row of this triangle has the entries $\binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}$. We know that the first and last of these numbers are 1. The remaining numbers are determined by the recurrence relation Equation 3.1; that is, the entry $\binom{n}{j}$ for 0 < j < n in the nth row of Pascal's triangle is the sum of the entry immediately above and the one immediately to its left in the (n - 1)st row. For example, $\binom{5}{2} = 6 + 4 = 10$.

This algorithm for constructing Pascal's triangle can be used to write a computer program to compute the binomial coefficients. You are asked to do this in Exercise 4.

While Pascal's triangle provides a way to construct recursively the binomial coefficients, it is also possible to give a formula for $\binom{n}{j}$.

Theorem 3.5 The binomial coefficients are given by the formula
\[ \binom{n}{j} = \frac{(n)_j}{j!} . \tag{3.2} \]

Proof. Each subset of size j of a set of size n can be ordered in j! ways. Each of these orderings is a j-permutation of the set of size n. The number of j-permutations is $(n)_j$, so the number of subsets of size j is
\[ \frac{(n)_j}{j!} . \]
This completes the proof. □

The above formula can be rewritten in the form
\[ \binom{n}{j} = \frac{n!}{j!\,(n-j)!} . \]
This immediately shows that
\[ \binom{n}{j} = \binom{n}{n-j} . \]

When using Equation 3.2 in the calculation of $\binom{n}{j}$, if one alternates the multiplications and divisions, then all of the intermediate values in the calculation are integers.
Furthermore, none of these intermediate values exceed the final value. (See Exercise 40.)

Another point that should be made concerning Equation 3.2 is that if it is used to define the binomial coefficients, then it is no longer necessary to require n to be a positive integer. The variable j must still be a non-negative integer under this definition. This idea is useful when extending the Binomial Theorem to general exponents. (The Binomial Theorem for non-negative integer exponents is given below as Theorem 3.7.)

Poker Hands

Example 3.6 Poker players sometimes wonder why a four of a kind beats a full house. A poker hand is a random subset of 5 elements from a deck of 52 cards. A hand has four of a kind if it has four cards with the same value, for example, four sixes or four kings. It is a full house if it has three of one value and two of a second, for example, three twos and two queens. Let us see which hand is more likely. How many hands have four of a kind? There are 13 ways that we can specify the value for the four cards. For each of these, there are 48 possibilities for the fifth card. Thus, the number of four-of-a-kind hands is $13 \cdot 48 = 624$. Since the total number of possible hands is $\binom{52}{5} = 2598960$, the probability of a hand with four of a kind is 624/2598960 = .00024.

Now consider the case of a full house; how many such hands are there? There are 13 choices for the value which occurs three times; for each of these there are $\binom{4}{3} = 4$ choices for the particular three cards of this value that are in the hand. Having picked these three cards, there are 12 possibilities for the value which occurs twice; for each of these there are $\binom{4}{2} = 6$ possibilities for the particular pair of this value. Thus, the number of full houses is $13 \cdot 4 \cdot 12 \cdot 6 = 3744$, and the probability of obtaining a hand with a full house is 3744/2598960 = .0014. Thus, while both types of hands are unlikely, you are six times more likely to obtain a full house than four of a kind. □
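The counts in Example 3.6 can be reproduced with a short Python sketch. The helper `choose` below is a hypothetical name; it evaluates the quotient of Equation 3.2 by alternating one multiplication with one division, taking the factors n - j + 1, ..., n in increasing order (an ordering chosen here so that each division is exact). The standard-library function `math.comb` is used only as a cross-check.

```python
from math import comb


def choose(n, j):
    """Binomial coefficient for integers n >= 0 and j.

    Multiplies and divides alternately; after step i the running value
    equals C(n - j + i, i), so every division is exact.
    """
    if j < 0 or j > n:
        return 0
    result = 1
    for i in range(1, j + 1):
        result = result * (n - j + i) // i
    return result


total_hands = choose(52, 5)
four_of_a_kind = 13 * 48
full_house = 13 * choose(4, 3) * 12 * choose(4, 2)

print(total_hands)                             # 2598960
print(four_of_a_kind, full_house)              # 624 3744
print(round(four_of_a_kind / total_hands, 5))  # 0.00024
print(round(full_house / total_hands, 4))      # 0.0014
print(full_house // four_of_a_kind)            # 6

# Cross-check the helper against the standard library for small arguments.
assert all(choose(n, j) == comb(n, j) for n in range(12) for j in range(n + 1))
```

The ratio 3744/624 = 6 is the "six times more likely" of the example.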
Figure 3.4: Tree diagram of three Bernoulli trials. Each branch is labelled $p$ (success, S) or $q$ (failure, F); the eight paths $\omega$ and their probabilities $m(\omega)$ are:

$\omega_1$  SSS  $p^3$
$\omega_2$  SSF  $p^2q$
$\omega_3$  SFS  $p^2q$
$\omega_4$  SFF  $pq^2$
$\omega_5$  FSS  $qp^2$
$\omega_6$  FSF  $q^2p$
$\omega_7$  FFS  $q^2p$
$\omega_8$  FFF  $q^3$

Bernoulli Trials

Our principal use of the binomial coefficients will occur in the study of one of the important chance processes called Bernoulli trials.

Definition 3.5 A Bernoulli trials process is a sequence of $n$ chance experiments such that

1. Each experiment has two possible outcomes, which we may call success and failure.

2. The probability $p$ of success on each experiment is the same for each experiment, and this probability is not affected by any knowledge of previous outcomes. The probability $q$ of failure is given by $q = 1 - p$.

Example 3.7 The following are Bernoulli trials processes:

1. A coin is tossed ten times. The two possible outcomes are heads and tails. The probability of heads on any one toss is 1/2.

2. An opinion poll is carried out by asking 1000 people, randomly chosen from the population, if they favor the Equal Rights Amendment, the two outcomes being yes and no. The probability $p$ of a yes answer (i.e., a success) indicates the proportion of people in the entire population that favor this amendment.

3. A gambler makes a sequence of 1-dollar bets, betting each time on black at roulette at Las Vegas. Here a success is winning 1 dollar and a failure is losing 1 dollar. Since in American roulette the gambler wins if the ball stops on one of 18 out of 38 positions and loses otherwise, the probability of winning is $p = 18/38 = .474$.

To analyze a Bernoulli trials process, we choose as our sample space a binary tree and assign a probability distribution to the paths in this tree. Suppose, for example, that we have three Bernoulli trials. The possible outcomes are indicated in the tree diagram shown in Figure 3.4. We define $X$ to be the random variable which represents the outcome of the process, i.e., an ordered triple of S's and F's.
The probabilities assigned to the branches of the tree represent the probability for each individual trial. Let the outcome of the $i$th trial be denoted by the random variable $X_i$, with distribution function $m_i$. Since we have assumed that outcomes on any one trial do not affect those on another, we assign the same probabilities at each level of the tree. An outcome $\omega$ for the entire experiment will be a path through the tree. For example, $\omega_3$ represents the outcomes SFS. Our frequency interpretation of probability would lead us to expect a fraction $p$ of successes on the first experiment; of these, a fraction $q$ of failures on the second; and, of these, a fraction $p$ of successes on the third experiment. This suggests assigning probability $pqp$ to the outcome $\omega_3$. More generally, we assign a distribution function $m(\omega)$ for paths $\omega$ by defining $m(\omega)$ to be the product of the branch probabilities along the path $\omega$. Thus, the probability that the three events S on the first trial, F on the second trial, and S on the third trial occur is the product of the probabilities for the individual events. We shall see in the next chapter that this means that the events involved are independent in the sense that the knowledge of one event does not affect our prediction for the occurrences of the other events.

Binomial Probabilities

We shall be particularly interested in the probability that in $n$ Bernoulli trials there are exactly $j$ successes. We denote this probability by $b(n, p, j)$. Let us calculate the particular value $b(3, p, 2)$ from our tree measure. We see that there are three paths which have exactly two successes and one failure, namely $\omega_2$, $\omega_3$, and $\omega_5$. Each of these paths has the same probability $p^2 q$. Thus $b(3, p, 2) = 3p^2 q$. Considering all possible numbers of successes we have

$$b(3, p, 0) = q^3,$$
$$b(3, p, 1) = 3pq^2,$$
$$b(3, p, 2) = 3p^2 q,$$
$$b(3, p, 3) = p^3.$$

We can, in the same manner, carry out a tree measure for $n$ experiments and determine $b(n, p, j)$ for the general case of $n$ Bernoulli trials.
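The tree measure can be carried out by brute force for small $n$. The sketch below is an illustration only: the function names `tree_measure` and `successes_distribution` are hypothetical, and exact fractions are used so that the results can be compared with the formulas above without rounding.

```python
from fractions import Fraction
from itertools import product


def tree_measure(n, p):
    """Assign to each path of S's and F's the product of its branch probabilities."""
    q = 1 - p
    measure = {}
    for path in product("SF", repeat=n):
        weight = Fraction(1)
        for outcome in path:
            weight *= p if outcome == "S" else q
        measure["".join(path)] = weight
    return measure


def successes_distribution(n, p):
    """Return [b(n, p, 0), ..., b(n, p, n)] by summing path probabilities."""
    totals = [Fraction(0)] * (n + 1)
    for path, weight in tree_measure(n, p).items():
        totals[path.count("S")] += weight
    return totals


if __name__ == "__main__":
    p = Fraction(1, 3)
    print(tree_measure(3, p)["SFS"])      # probability p*q*p of the path SFS
    print(successes_distribution(3, p))   # q^3, 3pq^2, 3p^2q, p^3
```

The enumeration visits all $2^n$ paths, so it is useful only as a check for small $n$.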
Theorem 3.6 Given $n$ Bernoulli trials with probability $p$ of success on each experiment, the probability of exactly $j$ successes is
$$b(n, p, j) = \binom{n}{j} p^j q^{n-j},$$
where $q = 1 - p$.

Proof. We construct a tree measure as described above. We want to find the sum of the probabilities for all paths which have exactly $j$ successes and $n - j$ failures. Each such path is assigned a probability $p^j q^{n-j}$. How many such paths are there? To specify a path, we have to pick, from the $n$ possible trials, a subset of $j$ to be successes, with the remaining $n - j$ outcomes being failures. We can do this in $\binom{n}{j}$ ways. Thus the sum of the probabilities is
$$b(n, p, j) = \binom{n}{j} p^j q^{n-j}.$$

Example 3.8 A fair coin is tossed six times. What is the probability that exactly three heads turn up? The answer is
$$b(6, .5, 3) = \binom{6}{3} \left(\frac{1}{2}\right)^3 \left(\frac{1}{2}\right)^3 = 20 \cdot \frac{1}{64} = .3125.$$

Example 3.9 A die is rolled four times. What is the probability that we obtain exactly one 6? We treat this as Bernoulli trials with success = "rolling a 6" and failure = "rolling some number other than a 6." Then $p = 1/6$, and the probability of exactly one success in four trials is
$$b(4, 1/6, 1) = \binom{4}{1} \left(\frac{1}{6}\right)^1 \left(\frac{5}{6}\right)^3 = .386.$$

To compute binomial probabilities using the computer, multiply the function choose(n, k) by $p^k q^{n-k}$. The program BinomialProbabilities prints out the binomial probabilities $b(n, p, k)$ for $k$ between kmin and kmax, and the sum of these probabilities. We have run this program for $n = 100$, $p = 1/2$, kmin = 45, and kmax = 55; the output is shown in Table 3.8. Note that the individual probabilities are quite small. The probability of exactly 50 heads in 100 tosses of a coin is about .08. Our intuition tells us that this is the most likely outcome, which is correct; but, all the same, it is not a very likely outcome.

 k    b(n, p, k)
45    .0485
46    .0580
47    .0666
48    .0735
49    .0780
50    .0796
51    .0780
52    .0735
53    .0666
54    .0580
55    .0485

Table 3.8: Binomial probabilities for n = 100, p = 1/2.
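The book's BinomialProbabilities program is not reproduced in this section. The following Python sketch is a stand-in with the same inputs, written under the assumption that floating-point arithmetic is accurate enough for these sizes; the function names `b` and `binomial_probabilities` are chosen here for illustration.

```python
from math import comb


def b(n, p, k):
    """Binomial probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_probabilities(n, p, kmin, kmax):
    """Return the rows (k, b(n, p, k)) for kmin <= k <= kmax and their sum."""
    rows = [(k, b(n, p, k)) for k in range(kmin, kmax + 1)]
    return rows, sum(prob for _, prob in rows)


if __name__ == "__main__":
    rows, total = binomial_probabilities(100, 0.5, 45, 55)
    for k, prob in rows:
        print(f"{k:3d}  {prob:.4f}")
    print(f"sum  {total:.4f}")
```

Run with n = 100, p = 1/2, kmin = 45, and kmax = 55, the rows can be compared directly with Table 3.8.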
Binomial Distributions

Definition 3.6 Let $n$ be a positive integer, and let $p$ be a real number between 0 and 1. Let $B$ be the random variable which counts the number of successes in a Bernoulli trials process with parameters $n$ and $p$. Then the distribution $b(n, p, k)$ of $B$ is called the binomial distribution.

We can get a better idea about the binomial distribution by graphing this distribution for different values of $n$ and $p$ (see Figure 3.5). The plots in this figure were generated using the program BinomialPlot. We have run this program for $p = .5$ and $p = .3$. Note that even for $p = .3$ the graphs are quite symmetric. We shall have an explanation for this in Chapter 9. We also note that the highest probability occurs around the value $np$, but that these highest probabilities get smaller as $n$ increases. We shall see in Chapter 6 that $np$ is the mean or expected value of the binomial distribution $b(n, p, k)$.

Figure 3.5: Binomial distributions (panel p = .5; n = 40, n = 80, n = 160).

The following example gives a nice way to see the binomial distribution, when $p = 1/2$.

Example 3.10 A Galton board is a board in which a large number of BB-shots are dropped from a chute at the top of the board and deflected off a number of pins on their way down to the bottom of the board. The final position of each shot is the result of a number of random deflections either to the left or the right. We have written a program GaltonBoard to simulate this experiment. We have run the program for the case of 20 rows of pins and 10,000 shots being dropped. We show the result of this simulation in Figure 3.6.

Note that if we write 0 every time the shot is deflected to the left, and 1 every time it is deflected to the right, then the path of the shot can be described by a sequence of 0's and 1's of length $n$, just as for the $n$-fold coin toss.

The distribution shown in Figure 3.6 is an example of an empirical distribution, in the sense that it comes about by means of a sequence of experiments. As expected, this empirical distribution resembles the corresponding binomial distribution with parameters $n = 20$ and $p = 1/2$.

Figure 3.6: Simulation of the Galton board.

Hypothesis Testing

Example 3.11 Suppose that ordinary aspirin has been found effective against headaches 60 percent of the time, and that a drug company claims that its new aspirin with a special headache additive is more effective. We can test this claim as follows: we call their claim the alternate hypothesis, and its negation, that the additive has no appreciable effect, the null hypothesis. Thus the null hypothesis is that $p = .6$, and the alternate hypothesis is that $p > .6$, where $p$ is the probability that the new aspirin is effective.

We give the aspirin to $n$ people to take when they have a headache. We want to find a number $m$, called the critical value for our experiment, such that we reject the null hypothesis if at least $m$ people are cured, and otherwise we accept it. How should we determine this critical value?

First note that we can make two kinds of errors. The first, often called a type 1 error in statistics, is to reject the null hypothesis when in fact it is true. The second, called a type 2 error, is to accept the null hypothesis when it is false. To determine the probability of both these types of errors we introduce a function $\alpha(p)$, defined to be the probability that we reject the null hypothesis, where this probability is calculated under the assumption that the new aspirin is effective with probability $p$. In the present case, we have
$$\alpha(p) = \sum_{m \le k \le n} b(n, p, k).$$

Note that $\alpha(.6)$ is the probability of a type 1 error, since this is the probability of a high number of successes for an ineffective additive. So for a given $n$ we want to choose $m$ so as to make $\alpha(.6)$ quite small, to reduce the likelihood of a type 1 error.
But as $m$ increases above the most probable value $np = .6n$, $\alpha(.6)$, being the upper tail of a binomial distribution, approaches 0. Thus increasing $m$ makes a type 1 error less likely.

Now suppose that the additive really is effective, so that $p$ is appreciably greater than .6; say $p = .8$. (This alternative value of $p$ is chosen arbitrarily; the following calculations depend on this choice.) Then choosing $m$ well below $np = .8n$ will increase $\alpha(.8)$, since now $\alpha(.8)$ is all but the lower tail of a binomial distribution. Indeed, if we put $\beta(.8) = 1 - \alpha(.8)$, then $\beta(.8)$ gives us the probability of a type 2 error, and so decreasing $m$ makes a type 2 error less likely.

The manufacturer would like to guard against a type 2 error, since if such an error is made, then the test does not show that the new drug is better, when in fact it is. If the alternative value of $p$ is chosen closer to the value of $p$ given in the null hypothesis (in this case $p = .6$), then for a given test population, the value of $\beta$ will increase. So, if the manufacturer's statistician chooses an alternative value for $p$ which is close to the value in the null hypothesis, then it will be an expensive proposition (i.e., the test population will have to be large) to reject the null hypothesis with a small value of $\beta$.

What we hope to do then, for a given test population $n$, is to choose a value of $m$, if possible, which makes both these probabilities small. If we make a type 1 error we end up buying a lot of essentially ordinary aspirin at an inflated price; a type 2 error means we miss a bargain on a superior medication. Let us say that we want our critical number $m$ to make each of these undesirable cases less than 5 percent probable.

We write a program PowerCurve to plot, for $n = 100$ and selected values of $m$, the function $\alpha(p)$, for $p$ ranging from .4 to 1. The result is shown in Figure 3.7. We include in our graph a box (in dotted lines) from .6 to .8, with bottom and top at heights .05 and .95.
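The quantity that PowerCurve plots can be evaluated directly. The sketch below is not the book's program; it is a minimal Python illustration in which the function names `alpha` and `acceptable_critical_values` are hypothetical and floating-point sums are assumed to be accurate enough for $n = 100$. It computes $\alpha(p)$ as the upper tail of the binomial distribution and lists the critical values $m$ for which both the type 1 probability $\alpha(p_0)$ and the type 2 probability $1 - \alpha(p_1)$ fall below a chosen level.

```python
from math import comb


def b(n, p, k):
    """Binomial probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def alpha(n, m, p):
    """Probability of at least m successes in n trials with success probability p."""
    return sum(b(n, p, k) for k in range(m, n + 1))


def acceptable_critical_values(n, p_null, p_alt, level):
    """Critical values m whose type 1 and type 2 error probabilities are both below level."""
    return [
        m
        for m in range(n + 1)
        if alpha(n, m, p_null) < level and 1 - alpha(n, m, p_alt) < level
    ]


if __name__ == "__main__":
    print(acceptable_critical_values(100, 0.6, 0.8, 0.05))
```

Calling `acceptable_critical_values(100, 0.6, 0.8, 0.05)` corresponds to the box described in the text: the first condition is the bottom edge at height .05 over $p = .6$, the second is the top edge at height .95 over $p = .8$.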
Then a value for $m$ satisfies our requirements if and only if the graph of $\alpha$ enters the box from the bottom, and leaves from the top (why? Which is the type 1 and which is the type 2 criterion?). As $m$ increases, the graph of $\alpha$ moves to the right. A few experiments have shown us that $m = 69$ is the smallest value for $m$ that thwarts a type 1 error, while $m = 73$ is the largest which thwarts a type 2. So we may choose our critical value between 69 and 73. If we're more intent on avoiding a type 1 error we favor 73, and similarly we favor 69 if we regard a type 2 error as worse. Of course, the drug company may not be happy with having as much as a 5 percent chance of an error. They might insist on having a 1 percent chance of an error. For this we would have to increase the number $n$ of trials (see Exercise 28).

Figure 3.7: The power curve.

Binomial Expansion

We next remind the reader of an application of the binomial coefficients to algebra. This is the binomial expansion, from which we get the term binomial coefficient.

Theorem 3.7 (Binomial Theorem) The quantity $(a + b)^n$ can be expressed in the form
$$(a + b)^n = \sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j}.$$

Proof. To see that this expansion is correct, write
$$(a + b)^n = (a + b)(a + b) \cdots (a + b).$$
When we multiply this out we will have a sum of terms each of which results from a choice of an $a$ or $b$ for each of $n$ factors. When we choose $j$ $a$'s and $(n - j)$ $b$'s, we obtain a term of the form $a^j b^{n-j}$. To determine such a term, we have to specify $j$ of the $n$ terms in the product from which we choose the $a$. This can be done in $\binom{n}{j}$ ways. Thus, collecting these terms in the sum contributes a term $\binom{n}{j} a^j b^{n-j}$.

For example, we have
$$(a + b)^0 = 1,$$
$$(a + b)^1 = a + b,$$
$$(a + b)^2 = a^2 + 2ab + b^2,$$
$$(a + b)^3 = a^3 + 3a^2 b + 3ab^2 + b^3.$$
We see here that the coefficients of successive powers do indeed yield Pascal's triangle.

Corollary 3.1 The sum of the elements in the $n$th row of Pascal's triangle is $2^n$.
If the elements in the $n$th row of Pascal's triangle are added with alternating signs, the sum is 0.

Proof. The first statement in the corollary follows from the fact that
$$2^n = (1 + 1)^n = \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n},$$
and the second from the fact that
$$0 = (1 - 1)^n = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \cdots + (-1)^n \binom{n}{n}.$$

The first statement of the corollary tells us that the number of subsets of a set of $n$ elements is $2^n$. We shall use the second statement in our next application of the binomial theorem.

We have seen that, when $A$ and $B$ are any two events (cf. Section 1.2),
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
We now extend this theorem to a more general version, which will enable us to find the probability that at least one of a number of events occurs.

Inclusion-Exclusion Principle

Theorem 3.8 Let $P$ be a probability distribution on a sample space $\Omega$, and let $\{A_1, A_2, \ldots, A_n\}$ be a finite set of events. Then
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j) + \cdots$$

[...]

Use this fact to determine the value or values of $j$ which give $b(n, p, j)$ its greatest value. Hint: Consider the successive ratios as $j$ increases.

8 A die is rolled 30 times. What is the probability that a 6 turns up exactly 5 times? What is the most probable number of times that a 6 will turn up?

9 Find integers $n$ and $r$ such that the following equation is true:
$$\binom{13}{5} + 2\binom{13}{6} + \binom{13}{7} = \binom{n}{r}.$$

10 In a ten-question true-false exam, find the probability that a student gets a grade of 70 percent or better by guessing. Answer the same question if the test has 30 questions, and if the test has 50 questions.

11 A restaurant offers apple and blueberry pies and stocks an equal number of each kind of pie. Each day ten customers request pie. They choose, with equal probabilities, one of the two kinds of pie. How many pieces of each kind of pie should the owner provide so that the probability is about .95 that each customer gets the pie of his or her own choice?
12 A poker hand is a set of 5 cards randomly chosen from a deck of 52 cards. Find the probability of a
  (a) royal flush (ten, jack, queen, king, ace in a single suit).
  (b) straight flush (five in a sequence in a single suit, but not a royal flush).
  (c) four of a kind (four cards of the same face value).
  (d) full house (one pair and one triple, each of the same face value).
  (e) flush (five cards in a single suit but not a straight or royal flush).
  (f) straight (five cards in a sequence, not all the same suit). (Note that in straights, an ace counts high or low.)

13 If a set has $2n$ elements, show that it has more subsets with $n$ elements than with any other number of elements.

14 Let $b(2n, .5, n)$ be the probability that in $2n$ tosses of a fair coin exactly $n$ heads turn up. Using Stirling's formula (Theorem 3.3), show that $b(2n, .5, n) \sim 1/\sqrt{\pi n}$. Use the program BinomialProbabilities to compare this with the exact value for $n = 10$ to 25.

15 A baseball player, Smith, has a batting average of .300 and in a typical game comes to bat three times. Assume that Smith's hits in a game can be considered to be a Bernoulli trials process with probability .3 for success. Find the probability that Smith gets 0, 1, 2, and 3 hits.

16 The Siwash University football team plays eight games in a season, winning three, losing three, and ending two in a tie. Show that the number of ways that this can happen is
$$\binom{8}{3}\binom{5}{3} = \frac{8!}{3!\,3!\,2!}.$$

17 Using the technique of Exercise 16, show that the number of ways that one can put $n$ different objects into three boxes with $a$ in the first, $b$ in the second, and $c$ in the third is $n!/(a!\,b!\,c!)$.

18 Baumgartner, Prosser, and Crowell are grading a calculus exam. There is a true-false question with ten parts. Baumgartner notices that one student has only two out of the ten correct and remarks, "The student was not even bright enough to have flipped a coin to determine his answers." "Not so clear," says Prosser. "With 340 students I bet that if they all flipped coins to determine their answers there would be at least one exam with two or fewer answers correct." Crowell says, "I'm with Prosser. In fact, I bet that we should expect at least one exam in which no answer is correct if everyone is just guessing." Who is right in all of this?

19 A gin hand consists of 10 cards from a deck of 52 cards. Find the probability that a gin hand has
  (a) all 10 cards of the same suit.
  (b) exactly 4 cards in one suit and 3 in two other suits.
  (c) a 4, 3, 2, 1 distribution of suits.

20 A six-card hand is dealt from an ordinary deck of cards. Find the probability that:
  (a) All six cards are hearts.
  (b) There are three aces, two kings, and one queen.
  (c) There are three cards of one suit and three of another suit.

21 A lady wishes to color her fingernails on one hand using at most two of the colors red, yellow, and blue. How many ways can she do this?

22 How many ways can six indistinguishable letters be put in three mail boxes? Hint: One representation of this is given by a sequence |LL|L|LLL| where the |'s represent the partitions for the boxes and the L's the letters. Any possible way can be so described. Note that we need two bars at the ends and the remaining two bars and the six L's can be put in any order.

23 Using the method for the hint in Exercise 22, show that $r$ indistinguishable objects can be put in $n$ boxes in
$$\binom{n+r-1}{n-1} = \binom{n+r-1}{r}$$
different ways.

24 A travel bureau estimates that when 20 tourists go to a resort with ten hotels they distribute themselves as if the bureau were putting 20 indistinguishable objects into ten distinguishable boxes. Assuming this model is correct, find the probability that no hotel is left vacant when the first group of 20 tourists arrives.

25 An elevator takes on six passengers and stops at ten floors.
We can assign two different equiprobable measures for the ways that the passengers are discharged: (a) we consider the passengers to be distinguishable or (b) we consider them to be indistinguishable (see Exercise 23 for this case). For each case, calculate the probability that all the passengers get off at different floors.

26 You are playing heads or tails with Prosser but you suspect that his coin is unfair. Von Neumann suggested that you proceed as follows: Toss Prosser's coin twice. If the outcome is HT call the result win. If it is TH call the result lose. If it is TT or HH ignore the outcome and toss Prosser's coin twice again. Keep going until you get either an HT or a TH and call the result win or lose in a single play. Repeat this procedure for each play. Assume that Prosser's coin turns up heads with probability $p$.
  (a) Find the probability of HT, TH, HH, TT with two tosses of Prosser's coin.
  (b) Using part (a), show that the probability of a win on any one play is 1/2, no matter what $p$ is.

27 John claims that he has extrasensory powers and can tell which of two symbols is on a card turned face down (see Example 3.11). To test his ability he is asked to do this for a sequence of trials. Let the null hypothesis be that he is just guessing, so that the probability is 1/2 of his getting it right each time, and let the alternative hypothesis be that he can name the symbol correctly more than half the time. Devise a test with the property that the probability of a type 1 error is less than .05 and the probability of a type 2 error is less than .05 if John can name the symbol correctly 75 percent of the time.

28 In Example 3.11 assume the alternative hypothesis is that $p = .8$ and that it is desired to have the probability of each type of error less than .01. Use the program PowerCurve to determine values of $n$ and $m$ that will achieve this. Choose $n$ as small as possible.
29 A drug is assumed to be effective with an unknown probability $p$. To estimate $p$ the drug is given to $n$ patients. It is found to be effective for $m$ patients. The method of maximum likelihood for estimating $p$ states that we should choose the value for $p$ that gives the highest probability of getting what we got on the experiment. Assuming that the experiment can be considered as a Bernoulli trials process with probability $p$ for success, show that the maximum likelihood estimate for $p$ is the proportion $m/n$ of successes.

30 Recall that in the World Series the first team to win four games wins the series. The series can go at most seven games. Assume that the Red Sox and the Mets are playing the series. Assume that the Mets win each game with probability $p$. Fermat observed that even though the series might not go seven games, the probability that the Mets win the series is the same as the probability that they win four or more games in a series that was forced to go seven games no matter who wins the individual games.
  (a) Using the program PowerCurve of Example 3.11 find the probability that the Mets win the series for the cases $p = .5$, $p = .6$, $p = .7$.
  (b) Assume that the Mets have probability .6 of winning each game. Use the program PowerCurve to find a value of $n$ so that, if the series goes to the first team to win more than half the games, the Mets will have a 95 percent chance of winning the series. Choose $n$ as small as possible.

31 Each of the four engines on an airplane functions correctly on a given flight with probability .99, and the engines function independently of each other. Assume that the plane can make a safe landing if at least two of its engines are functioning correctly. What is the probability that the engines will allow for a safe landing?

32 A small boy is lost coming down Mount Washington.
The leader of the search team estimates that there is a probability $p$ that he came down on the east side and a probability $1 - p$ that he came down on the west side. He has $n$ people in his search team who will search independently and, if the boy is on the side being searched, each member will find the boy with probability $u$. Determine how he should divide the $n$ people into two groups to search the two sides of the mountain so that he will have the highest probability of finding the boy. How does this depend on $u$?

*33 $2n$ balls are chosen at random from a total of $2n$ red balls and $2n$ blue balls. Find a combinatorial expression for the probability that the chosen balls are equally divided in color. Use Stirling's formula to estimate this probability. Using BinomialProbabilities, compare the exact value with Stirling's approximation for $n = 20$.

34 Assume that every time you buy a box of Wheaties, you receive one of the pictures of the $n$ players on the New York Yankees. Over a period of time, you buy $m \ge n$ boxes of Wheaties.
  (a) Use Theorem 3.8 to show that the probability that you get all $n$ pictures is
  $$1 - \binom{n}{1}\left(\frac{n-1}{n}\right)^m + \binom{n}{2}\left(\frac{n-2}{n}\right)^m - \cdots + (-1)^{n-1}\binom{n}{n-1}\left(\frac{1}{n}\right)^m.$$
  Hint: Let $E_k$ be the event that you do not get the $k$th player's picture.
  (b) Write a computer program to compute this probability. Use this program to find, for given $n$, the smallest value of $m$ which will give probability $\ge .5$ of getting all $n$ pictures. Consider $n = 50$, 100, and 150 and show that $m = n \log n + n \log 2$ is a good estimate for the number of boxes needed. (For a derivation of this estimate, see Feller.$^{26}$)

*35 Prove the following binomial identity
$$\binom{2n}{n} = \sum_{j=0}^{n} \binom{n}{j}^2.$$
Hint: Consider an urn with $n$ red balls and $n$ blue balls inside. Show that each side of the equation equals the number of ways to choose $n$ balls from the urn.

36 Let $j$ and $n$ be positive integers, with $j < n$. An experiment consists of choosing, at random, a $j$-tuple of positive integers whose sum is at most $n$.
  (a) Find the size of the sample space. Hint: Consider $n$ indistinguishable balls placed in a row. Place $j$ markers between consecutive pairs of balls, with no two markers between the same pair of balls. (We also allow one of the $n$ markers to be placed at the end of the row of balls.) Show that there is a 1-1 correspondence between the set of possible positions for the markers and the set of $j$-tuples whose size we are trying to count.
  (b) Find the probability that the $j$-tuple selected contains at least one 1.

37 Let $n \pmod m$ denote the remainder when the integer $n$ is divided by the integer $m$. Write a computer program to compute the numbers $\binom{n}{j} \pmod m$ where $\binom{n}{j}$ is a binomial coefficient and $m$ is an integer. You can do this by using the recursion relations for generating binomial coefficients, doing all the arithmetic using the basic function mod(n, m). Try to write your program to make as large a table as possible. Run your program for the cases $m = 2$ to 7. Do you see any patterns? In particular, for the case $m = 2$ and $n$ a power of 2, verify that all the entries in the $(n - 1)$st row are 1. (The corresponding binomial numbers are odd.) Use your pictures to explain why this is true.

$^{26}$ W. Feller, Introduction to Probability Theory and its Applications, vol. I, 3rd ed. (New York: John Wiley & Sons, 1968), p. 106.

38 Lucas$^{27}$ proved the following general result relating to Exercise 37. If $p$ is any prime number, then $\binom{n}{j} \pmod p$ can be found as follows: Expand $n$ and $j$ in base $p$ as $n = s_0 + s_1 p + s_2 p^2 + \cdots + s_k p^k$ and $j = r_0 + r_1 p + r_2 p^2 + \cdots + r_k p^k$, respectively. (Here $k$ is chosen large enough to represent all numbers from 0 to $n$ in base $p$ using $k$ digits.) Let $s = (s_0, s_1, s_2, \ldots, s_k)$ and $r = (r_0, r_1, r_2, \ldots, r_k)$. Then
$$\binom{n}{j} \pmod p = \prod_{i=0}^{k} \binom{s_i}{r_i} \pmod p.$$
For example, if $p = 7$, $n = 12$, and $j = 9$, then
$$12 = 5 \cdot 7^0 + 1 \cdot 7^1, \qquad 9 = 2 \cdot 7^0 + 1 \cdot 7^1,$$
so that $s = (5, 1)$, $r = (2, 1)$, and this result states that
$$\binom{12}{9} \pmod p = \binom{5}{2}\binom{1}{1} \pmod 7.$$
Since $\binom{12}{9} = 220 = 3 \pmod 7$, and $\binom{5}{2} = 10 = 3 \pmod 7$, we see that the result is correct for this example. Show that this result implies that, for $p = 2$, the $(p^k - 1)$st row of your triangle in Exercise 37 has no zeros.

39 Prove that the probability of exactly $n$ heads in $2n$ tosses of a fair coin is given by the product of the odd numbers up to $2n - 1$ divided by the product of the even numbers up to $2n$.

40 Let $n$ be a positive integer, and assume that $j$ is a positive integer not exceeding $n/2$. Show that in Theorem 3.5, if one alternates the multiplications and divisions, then all of the intermediate values in the calculation are integers. Show also that none of these intermediate values exceed the final value.

$^{27}$ E. Lucas, "Théorie des Fonctions Numériques Simplement Périodiques," American J. Math., vol. 1 (1878), pp. 184-240, 289-321.

3.3 Card Shuffling

Much of this section is based upon an article by Brad Mann,$^{28}$ which is an exposition of an article by David Bayer and Persi Diaconis.$^{29}$

Riffle Shuffles

Given a deck of $n$ cards, how many times must we shuffle it to make it "random"? Of course, the answer depends upon the method of shuffling which is used and what we mean by "random." We shall begin the study of this question by considering a standard model for the riffle shuffle.

We begin with a deck of $n$ cards, which we will assume are labelled in increasing order with the integers from 1 to $n$. A riffle shuffle consists of a cut of the deck into two stacks and an interleaving of the two stacks. For example, if $n = 6$, the initial ordering is $(1, 2, 3, 4, 5, 6)$, and a cut might occur between cards 2 and 3. This gives rise to two stacks, namely $(1, 2)$ and $(3, 4, 5, 6)$. These are interleaved to form a new ordering of the deck. For example, these two stacks might form the ordering $(1, 3, 4, 2, 5, 6)$. In order to discuss such shuffles, we need to assign a probability distribution to the set of all possible shuffles.
There are several reasonable ways in which this can be done. We will give several different assignment strategies, and show that they are equivalent. (This does not mean that this assignment is the only reasonable one.) First, we assign the binomial probability $b(n, 1/2, k)$ to the event that the cut occurs after the $k$th card. Next, we assume that all possible interleavings, given a cut, are equally likely. Thus, to complete the assignment of probabilities, we need to determine the number of possible interleavings of two stacks of cards, with $k$ and $n - k$ cards, respectively.

We begin by writing the second stack in a line, with spaces in between each pair of consecutive cards, and with spaces at the beginning and end (so there are $n - k + 1$ spaces). We choose, with replacement, $k$ of these spaces, and place the cards from the first stack in the chosen spaces. This can be done in
$$\binom{n}{k}$$
ways. Thus, the probability of a given interleaving should be
$$\frac{1}{\binom{n}{k}}.$$

Next, we note that if the new ordering is not the identity ordering, it is the result of a unique cut-interleaving pair. If the new ordering is the identity, it is the result of any one of $n + 1$ cut-interleaving pairs.

$^{28}$ B. Mann, "How Many Times Should You Shuffle a Deck of Cards?", UMAP Journal, vol. 15, no. 4 (1994), pp. 303-331.
$^{29}$ D. Bayer and P. Diaconis, "Trailing the Dovetail Shuffle to its Lair," Annals of Applied Probability, vol. 2, no. 2 (1992), pp. 294-313.

We define a rising sequence in an ordering to be a maximal subsequence of consecutive integers in increasing order. For example, in the ordering
$$(2, 3, 5, 1, 4, 7, 6),$$
there are 4 rising sequences; they are $(1)$, $(2, 3, 4)$, $(5, 6)$, and $(7)$. It is easy to see that an ordering is the result of a riffle shuffle applied to the identity ordering if and only if it has no more than two rising sequences.
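The model just described, and the count of rising sequences, can be sketched in Python. This is an illustration under stated assumptions rather than a program from the text: the function names are hypothetical, the deck is assumed to hold the consecutive integers 1 to $n$, and an interleaving is encoded by choosing uniformly which $k$ of the $n$ final positions receive the cards of the first stack, which is one way to pick each of the $\binom{n}{k}$ interleavings with equal probability.

```python
import random


def riffle_shuffle(deck, rng):
    """Apply one riffle shuffle: binomial cut, then a uniformly chosen interleaving."""
    n = len(deck)
    # Cut after the kth card, where k has the binomial distribution b(n, 1/2, k).
    k = sum(rng.random() < 0.5 for _ in range(n))
    first, second = iter(deck[:k]), iter(deck[k:])
    # Choose which k of the n final positions are filled from the first stack.
    first_positions = set(rng.sample(range(n), k))
    return [next(first) if i in first_positions else next(second) for i in range(n)]


def rising_sequences(ordering):
    """Count the rising sequences in an ordering of consecutive integers."""
    position = {card: i for i, card in enumerate(ordering)}
    # A new rising sequence starts at card + 1 whenever it lies before card.
    breaks = sum(
        1
        for card in ordering
        if card + 1 in position and position[card + 1] < position[card]
    )
    return breaks + 1


if __name__ == "__main__":
    rng = random.Random(2006)  # fixed seed so the run is repeatable
    shuffled = riffle_shuffle(list(range(1, 11)), rng)
    print(shuffled, rising_sequences(shuffled))
    print(rising_sequences([2, 3, 5, 1, 4, 7, 6]))
```

Each stack keeps its internal order during the interleaving, which is why a single shuffle of the identity ordering can never produce more than two rising sequences.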
(If the ordering has two rising sequences, then these rising sequences correspond to the two stacks induced by the cut, and if the ordering has one rising sequence, then it is the identity ordering.) Thus, the sample space of orderings obtained by applying a riffle shuffle to the identity ordering is naturally described as the set of all orderings with at most two rising sequences.

It is now easy to assign a probability distribution to this sample space. Each ordering with two rising sequences is assigned the value
$$\frac{b(n, 1/2, k)}{\binom{n}{k}} = \frac{1}{2^n},$$
and the identity ordering is assigned the value
$$\frac{n + 1}{2^n}.$$

There is another way to view a riffle shuffle. We can imagine starting with a deck cut into two stacks as before, with the same probability assignment as before, i.e., the binomial distribution. Once we have the two stacks, we take cards, one by one, off of the bottom of the two stacks, and place them onto one stack. If there are $k_1$ and $k_2$ cards, respectively, in the two stacks at some point in this process, then we make the assumption that the probability that the next card to be taken comes from a given stack is proportional to the current stack size. This implies that the probability that we take the next card from the first stack equals
$$\frac{k_1}{k_1 + k_2},$$
and the corresponding probability for the second stack is
$$\frac{k_2}{k_1 + k_2}.$$

We shall now show that this process assigns the uniform probability to each of the possible interleavings of the two stacks. Suppose, for example, that an interleaving came about as the result of choosing cards from the two stacks in some order. The probability that this result occurred is the product of the probabilities at each point in the process, since the choice of card at each point is assumed to be independent of the previous choices. Each factor of this product is of the form
$$\frac{k_i}{k_1 + k_2},$$
where $i = 1$ or 2, and the denominator of each factor equals the number of cards left to be chosen. Thus, the denominator of the probability is just $n!$.
At the moment when a card is chosen from a stack that has i cards in it, the numerator of the corresponding factor in the probability is i, and the number of cards in this stack decreases by 1. Thus, the numerator is seen to be k!(n - k)!, since all cards in both stacks are eventually chosen. Therefore, this process assigns the probability
$$\frac{1}{\binom{n}{k}}$$
to each possible interleaving.

We now turn to the question of what happens when we riffle shuffle s times. It should be clear that if we start with the identity ordering, we obtain an ordering with at most $2^s$ rising sequences, since a riffle shuffle creates at most two rising sequences from every rising sequence in the starting ordering. In fact, it is not hard to see that each such ordering is the result of s riffle shuffles. The question becomes, then, in how many ways can an ordering with r rising sequences come about by applying s riffle shuffles to the identity ordering? In order to answer this question, we turn to the idea of an a-shuffle.

a-Shuffles

There are several ways to visualize an a-shuffle. One way is to imagine a creature with a hands who is given a deck of cards to riffle shuffle. The creature naturally cuts the deck into a stacks, and then riffles them together. (Imagine that!) Thus, the ordinary riffle shuffle is a 2-shuffle. As in the case of the ordinary 2-shuffle, we allow some of the stacks to have 0 cards. Another way to visualize an a-shuffle is to think about its inverse, called an a-unshuffle. This idea is described in the proof of the next theorem.

We will now show that an a-shuffle followed by a b-shuffle is equivalent to an ab-shuffle. This means, in particular, that s riffle shuffles in succession are equivalent to one $2^s$-shuffle. This equivalence is made precise by the following theorem.

Theorem 3.9 Let a and b be two positive integers. Let $S_{a,b}$ be the set of all ordered pairs in which the first entry is an a-shuffle and the second entry is a b-shuffle.
Let $S_{ab}$ be the set of all ab-shuffles. Then there is a 1-1 correspondence between $S_{a,b}$ and $S_{ab}$ with the following property. Suppose that $(T_1, T_2)$ corresponds to $T_3$. If $T_1$ is applied to the identity ordering, and $T_2$ is applied to the resulting ordering, then the final ordering is the same as the ordering that is obtained by applying $T_3$ to the identity ordering.

Proof. The easiest way to describe the required correspondence is through the idea of an unshuffle. An a-unshuffle begins with a deck of n cards. One by one, cards are taken from the top of the deck and placed, with equal probability, on the bottom of any one of a stacks, where the stacks are labelled from 0 to a - 1. After all of the cards have been distributed, we combine the stacks to form one stack by placing stack i on top of stack i + 1, for $0 \le i < a - 1$. It is easy to see that if one starts with a deck, there is exactly one way to cut the deck to obtain the a stacks generated by the a-unshuffle, and with these a stacks, there is exactly one way to interleave them to obtain the deck in the order that it was in before the unshuffle was performed. Thus, this a-unshuffle corresponds to a unique a-shuffle, and this a-shuffle is the inverse of the original a-unshuffle.

If we apply an ab-unshuffle $U_3$ to a deck, we obtain a set of ab stacks, which are then combined, in order, to form one stack. We label these stacks with ordered pairs of integers, where the first coordinate is between 0 and a - 1, and the second coordinate is between 0 and b - 1. Then we label each card with the label of its stack. The number of possible labels is ab, as required. Using this labelling, we can describe how to find a b-unshuffle and an a-unshuffle, such that if these two unshuffles are applied in this order to the deck, we obtain the same set of ab stacks as were obtained by the ab-unshuffle.
To obtain the b-unshuffle $U_2$, we sort the deck into b stacks, with the ith stack containing all of the cards with second coordinate i, for $0 \le i \le b - 1$. Then these stacks are combined to form one stack. The a-unshuffle $U_1$ proceeds in the same manner, except that the first coordinates of the labels are used. The resulting a stacks are then combined to form one stack.

The above description shows that the cards ending up on top are all those labelled (0, 0). These are followed by those labelled (0, 1), (0, 2), ..., (0, b - 1), (1, 0), (1, 1), ..., (a - 1, b - 1). Furthermore, the relative order of any pair of cards with the same labels is never altered. But this is exactly the same as an ab-unshuffle, if, at the beginning of such an unshuffle, we label each of the cards with one of the labels (0, 0), (0, 1), ..., (0, b - 1), (1, 0), (1, 1), ..., (a - 1, b - 1). This completes the proof. □

In Figure 3.11, we show the labels for a 2-unshuffle of a deck with 10 cards. There are 4 cards with the label 0 and 6 cards with the label 1, so if the 2-unshuffle is performed, the first stack will have 4 cards and the second stack will have 6 cards. When this unshuffle is performed, the deck ends up in the identity ordering.

In Figure 3.12, we show the labels for a 4-unshuffle of the same deck (because there are four labels being used). This figure can also be regarded as an example of a pair of 2-unshuffles, as described in the proof above. The first 2-unshuffle will use the second coordinate of the labels to determine the stacks. In this case, the two stacks contain the cards whose values are

{5, 1, 6, 2, 7} and {8, 9, 3, 4, 10}.

After this 2-unshuffle has been performed, the deck is in the order shown in Figure 3.11, as the reader should check.
If we wish to perform a 4-unshuffle on the deck, using the labels shown, we sort the cards lexicographically, obtaining the four stacks

{1, 2}, {3, 4}, {5, 6, 7}, and {8, 9, 10}.

When these stacks are combined, we once again obtain the identity ordering of the deck. The point of the above theorem is that both sorting procedures always lead to the same initial ordering.

[Figure 3.11: Before a 2-unshuffle.]

[Figure 3.12: Before a 4-unshuffle.]

Theorem 3.10 If D is any ordering that is the result of applying an a-shuffle and then a b-shuffle to the identity ordering, then the probability assigned to D by this pair of operations is the same as the probability assigned to D by the process of applying an ab-shuffle to the identity ordering.

Proof. Call the sample space of a-shuffles $S_a$. If we label the stacks by the integers from 0 to a - 1, then each cut-interleaving pair, i.e., shuffle, corresponds to exactly one n-digit base a integer, where the ith digit in the integer is the stack of which the ith card is a member. Thus, the number of cut-interleaving pairs is equal to the number of n-digit base a integers, which is $a^n$. Of course, not all of these pairs lead to different orderings. The number of pairs leading to a given ordering will be discussed later. For our purposes it is enough to point out that it is the cut-interleaving pairs that determine the probability assignment.

The previous theorem shows that there is a 1-1 correspondence between $S_{a,b}$ and $S_{ab}$. Furthermore, corresponding elements give the same ordering when applied to the identity ordering. Given any ordering D, let $m_1$ be the number of elements of $S_{a,b}$ which, when applied to the identity ordering, result in D. Let $m_2$ be the number of elements of $S_{ab}$ which, when applied to the identity ordering, result in D. The previous theorem implies that $m_1 = m_2$. Thus, both sets assign the probability
$$\frac{m_1}{(ab)^n}$$
to D. This completes the proof.
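Theorem 3.10 can be checked by brute force on a small deck. The sketch below is only an illustration under stated assumptions: it encodes a shuffle as a sequence of stack labels (our reading of the base-a integer in the proof, with `labels[i]` naming the stack that supplies the card landing in position i), and the function names are ours. It counts how many cut-interleaving pairs lead to each ordering, once for an a-shuffle followed by a b-shuffle and once for a single ab-shuffle.

```python
from collections import Counter
from itertools import product


def apply_shuffle(deck, labels, a):
    """Apply the a-shuffle encoded by `labels` to `deck` (top card first).

    Assumed encoding: labels[i] is the stack (0..a-1) supplying the card
    that lands in position i. Stack 0 is cut from the top of the deck,
    stack 1 comes next, and so on; stacks may be empty.
    """
    sizes = [0] * a
    for s in labels:
        sizes[s] += 1
    # next_card[s] = index in `deck` of the next unused card of stack s
    next_card = [sum(sizes[:s]) for s in range(a)]
    shuffled = []
    for s in labels:
        shuffled.append(deck[next_card[s]])
        next_card[s] += 1
    return tuple(shuffled)


def shuffle_counts(n, a):
    """Number of cut-interleaving pairs of an a-shuffle giving each ordering."""
    identity = tuple(range(1, n + 1))
    return Counter(apply_shuffle(identity, labels, a)
                   for labels in product(range(a), repeat=n))


def two_shuffle_counts(n, a, b):
    """Same count for an a-shuffle followed by a b-shuffle."""
    identity = tuple(range(1, n + 1))
    counts = Counter()
    for first in product(range(a), repeat=n):
        middle = apply_shuffle(identity, first, a)
        for second in product(range(b), repeat=n):
            counts[apply_shuffle(middle, second, b)] += 1
    return counts


# Both sides use (ab)^n equally likely pairs, so equal counts mean
# equal probabilities for every ordering.
print(two_shuffle_counts(4, 2, 3) == shuffle_counts(4, 6))
```

Because every label sequence is equally likely, comparing counts is the same as comparing probabilities. The check is exhaustive only for the small n chosen here; it illustrates the theorem rather than proving it.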
□

Connection with the Birthday Problem

There is another point that can be made concerning the labels given to the cards by the successive unshuffles. Suppose that we 2-unshuffle an n-card deck until the labels on the cards are all different. It is easy to see that this process produces each permutation with the same probability, i.e., this is a random process. To see this, note that if the labels become distinct on the sth 2-unshuffle, then one can think of this sequence of 2-unshuffles as one $2^s$-unshuffle, in which all of the stacks determined by the unshuffle have at most one card in them (remember, the stacks correspond to the labels). If each stack has at most one card in it, then given any two cards in the deck, it is equally likely that the first card has a lower or a higher label than the second card. Thus, each possible ordering is equally likely to result from this $2^s$-unshuffle.

Let T be the random variable that counts the number of 2-unshuffles until all labels are distinct. One can think of T as giving a measure of how long it takes in the unshuffling process until randomness is reached. Since shuffling and unshuffling are inverse processes, T also measures the number of shuffles necessary to achieve randomness. Suppose that we have an n-card deck, and we ask for $P(T \le s)$. This equals $1 - P(T > s)$. But T > s if and only if it is the case that not all of the labels after s 2-unshuffles are distinct. This is just the birthday problem; we are asking for the probability that at least two people have the same birthday, given that we have n people and there are $2^s$ possible birthdays. Using our formula from Example 3.3, we find that
$$P(T > s) = 1 - \binom{2^s}{n} \frac{n!}{2^{sn}}. \qquad (3.4)$$

In Chapter 6, we will define the average value of a random variable. Using this idea, and the above equation, one can calculate the average value of the random variable T (see Exercise 6.1.41). For example, if n = 52, then the average value of T is about 11.7.
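As a rough numerical check of this figure, the sketch below evaluates P(T > s) as a birthday-problem probability and sums it over s. It relies on the standard identity that the average of a nonnegative integer-valued random variable equals the sum of P(T > s) over s = 0, 1, 2, ..., which is not derived in this section. The function names and the truncation point `max_s` are our own choices.

```python
from fractions import Fraction


def prob_T_greater(n, s):
    """P(T > s): at least two of n cards share a label after s 2-unshuffles."""
    labels = 2 ** s                # number of possible "birthdays"
    if labels < n:
        return Fraction(1)         # more cards than labels: a repeat is certain
    all_distinct = Fraction(1)
    for i in range(n):
        all_distinct *= Fraction(labels - i, labels)
    return 1 - all_distinct


def average_T(n, max_s=100):
    """Approximate the average of T by summing P(T > s) for s < max_s.

    The neglected tail is tiny, since P(T > s) shrinks roughly like
    n(n - 1) / 2^(s + 1) once 2^s is much larger than n^2.
    """
    return float(sum(prob_T_greater(n, s) for s in range(max_s)))


print(round(average_T(52), 2))
```

For n = 52 the truncated sum comes out close to the value 11.7 quoted above.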
This means that, on the average, about 12 riffle shuffles are needed for the process to be considered random.

Cut-Interleaving Pairs and Orderings

As was noted in the proof of Theorem 3.10, not all of the cut-interleaving pairs lead to different orderings. However, there is an easy formula which gives the number of such pairs that lead to a given ordering.

Theorem 3.11 If an ordering of length n has r rising sequences, then the number of cut-interleaving pairs under an a-shuffle of the identity ordering which lead to the ordering is
$$\binom{n + a - r}{n}.$$

Proof. To see why this is true, we need to count the number of ways in which the cut in an a-shuffle can be performed which will lead to a given ordering with r rising sequences. We can disregard the interleavings, since once a cut has been made, at most one interleaving will lead to a given ordering. Since the given ordering has r rising sequences, r - 1 of the division points in the cut are determined. The remaining a - 1 - (r - 1) = a - r division points can be placed anywhere. The number of places to put these remaining division points is n + 1 (which is the number of spaces between the consecutive pairs of cards, including the positions at the beginning and the end of the deck). These places are chosen with repetition allowed, so the number of ways to make these choices is
$$\binom{n + a - r}{a - r} = \binom{n + a - r}{n}.$$
In particular, this means that if D is an ordering that is the result of applying an a-shuffle to the identity ordering, and if D has r rising sequences, then the probability assigned to D by this process is
$$\frac{\binom{n + a - r}{n}}{a^n}.$$
This completes the proof. □

The above theorem shows that the essential information about the probability assigned to an ordering under an a-shuffle is just the number of rising sequences in the ordering.
Thus, if we determine the number of orderings which contain exactly r rising sequences, for each r between 1 and n, then we will have determined the distribution function of the random variable which consists of applying a random a-shuffle to the identity ordering.

The number of orderings of {1, 2, ..., n} with r rising sequences is denoted by A(n, r), and is called an Eulerian number. There are many ways to calculate the values of these numbers; the following theorem gives one recursive method which follows immediately from what we already know about a-shuffles.

Theorem 3.12 Let a and n be positive integers. Then
$$a^n = \sum_{r=1}^{a} \binom{n + a - r}{n} A(n, r). \qquad (3.5)$$
Thus,
$$A(n, a) = a^n - \sum_{r=1}^{a-1} \binom{n + a - r}{n} A(n, r).$$
In addition, A(n, 1) = 1.

Proof. The second equation can be used to calculate the values of the Eulerian numbers, and follows immediately from Equation 3.5. The last equation is a consequence of the fact that the only ordering of {1, 2, ..., n} with one rising sequence is the identity ordering. Thus, it remains to prove Equation 3.5. We will count the set of a-shuffles of a deck with n cards in two ways. First, we know that there are $a^n$ such shuffles (this was noted in the proof of Theorem 3.10). But there are A(n, r) orderings of {1, 2, ..., n} with r rising sequences, and Theorem 3.11 states that for each such ordering, there are exactly
$$\binom{n + a - r}{n}$$
cut-interleaving pairs that lead to the ordering. Therefore, the right-hand side of Equation 3.5 counts the set of a-shuffles of an n-card deck. This completes the proof. □

Random Orderings and Random Processes

We now turn to the second question that was asked at the beginning of this section: What do we mean by a "random" ordering? It is somewhat misleading to think about a given ordering as being random or not random. If we want to choose a random ordering from the set of all orderings of {1, 2, ..., n}, we mean that we want every ordering to be chosen with the same probability, i.e., any ordering is as "random" as any other.
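The recursion of Theorem 3.12 is easy to turn into a short program. The sketch below (function names are ours) computes A(n, 1), ..., A(n, n) from the second equation of the theorem and, for small n, compares the result with a direct count of rising sequences over all orderings.

```python
from itertools import permutations
from math import comb


def eulerian_numbers(n):
    """Return [A(n, 1), ..., A(n, n)] using the recursion of Theorem 3.12."""
    A = {1: 1}  # A(n, 1) = 1: only the identity has one rising sequence
    for a in range(2, n + 1):
        A[a] = a ** n - sum(comb(n + a - r, n) * A[r] for r in range(1, a))
    return [A[r] for r in range(1, n + 1)]


def count_rising_sequences(ordering):
    """Number of rising sequences: one more than the number of values v
    that appear to the left of v - 1."""
    position = {card: i for i, card in enumerate(ordering)}
    return 1 + sum(position[v] < position[v - 1]
                   for v in range(2, len(ordering) + 1))


def eulerian_by_enumeration(n):
    """Brute-force count over all n! orderings (small n only)."""
    counts = [0] * n
    for ordering in permutations(range(1, n + 1)):
        counts[count_rising_sequences(ordering) - 1] += 1
    return counts


print(eulerian_numbers(4))                                  # [1, 11, 11, 1]
print(eulerian_numbers(5) == eulerian_by_enumeration(5))    # True
```

The enumeration is feasible only for small n, but the recursion itself uses exact integer arithmetic and remains practical for a 52-card deck.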
The word "random" should really be used to describe a process. We will say that a process that produces an object from a (finite) set of objects is a random process if each object in the set is produced with the same probability by the process. In the present situation, the objects are the orderings, and the process which produces these objects is the shuffling process. It is easy to see that no a-shuffle is really a random process, since if $T_1$ and $T_2$ are two orderings with a different number of rising sequences, then they are produced by an a-shuffle, applied to the identity ordering, with different probabilities.

Variation Distance

Instead of requiring that a sequence of shuffles yield a process which is random, we will define a measure that describes how far away a given process is from a random process. Let X be any process which produces an ordering of {1, 2, ..., n}. Define $f_X(\pi)$ to be the probability that X produces the ordering $\pi$. (Thus, X can be thought of as a random variable with distribution function f.) Let $\Omega_n$ be the set of all orderings of {1, 2, ..., n}. Finally, let $u(\pi) = 1/|\Omega_n|$ for all $\pi \in \Omega_n$. The function u is the distribution function of a process which produces orderings and which is random. For each ordering $\pi \in \Omega_n$, the quantity
$$|f_X(\pi) - u(\pi)|$$
is the difference between the actual and desired probabilities that X produces $\pi$. If we sum this over all orderings $\pi$ and call this sum S, we see that S = 0 if and only if X is random, and otherwise S is positive. It is easy to show that the maximum value of S is 2, so we will multiply the sum by 1/2 so that the value falls in the interval [0, 1]. Thus, we obtain the following sum as the formula for the variation distance between the two processes:
$$\| f_X - u \| = \frac{1}{2} \sum_{\pi \in \Omega_n} |f_X(\pi) - u(\pi)|.$$

Now we apply this idea to the case of shuffling. We let X be the process of s successive riffle shuffles applied to the identity ordering. We know that it is also possible to think of X as one $2^s$-shuffle.
We also know that $f_X$ is constant on the set of all orderings with r rising sequences, where r is any positive integer. Finally, we know the value of $f_X$ on an ordering with r rising sequences, and we know how many such orderings there are. Thus, in this specific case, we have
$$\| f_X - u \| = \frac{1}{2} \sum_{r=1}^{n} A(n, r) \left| \binom{2^s + n - r}{n} \Big/ 2^{ns} - \frac{1}{n!} \right|.$$
Since this sum has only n summands, it is easy to compute this for moderate sized values of n. For n = 52, we obtain the list of values given in Table 3.14.

Number of Riffle Shuffles    Variation Distance
 1                           1
 2                           1
 3                           1
 4                           0.9999995334
 5                           0.9237329294
 6                           0.6135495966
 7                           0.3340609995
 8                           0.1671586419
 9                           0.0854201934
10                           0.0429455489
11                           0.0215023760
12                           0.0107548935
13                           0.0053779101
14                           0.0026890130

Table 3.14: Distance to the random process.

To help in understanding these data, they are shown in graphical form in Figure 3.13. The program VariationList produces the data shown in both Table 3.14 and Figure 3.13. One sees that until 5 shuffles have occurred, the output of X is very far from random. After 5 shuffles, the distance from the random process is essentially halved each time a shuffle occurs.

[Figure 3.13: Distance to the random process.]

Given the distribution functions $f_X(\pi)$ and $u(\pi)$ as above, there is another way to view the variation distance $\| f_X - u \|$. Given any event T (which is a subset of $S_n$), we can calculate its probability under the process X and under the uniform process. For example, we can imagine that T represents the set of all permutations in which the first player in a 7-player poker game is dealt a straight flush (five consecutive cards in the same suit). It is interesting to consider how much the probability of this event after a certain number of shuffles differs from the probability of this event if all permutations are equally likely. This difference can be thought of as describing how close the process X is to the random process with respect to the event T.
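The variation distance formula for s riffle shuffles can be evaluated exactly with rational arithmetic. The following sketch is not the book's VariationList program; it is a minimal Python illustration, with names of our choosing, that combines the Eulerian-number recursion of Theorem 3.12 with the sum over r given above.

```python
from fractions import Fraction
from math import comb, factorial


def eulerian_numbers(n):
    """[A(n, 1), ..., A(n, n)] via the recursion of Theorem 3.12."""
    A = {1: 1}
    for a in range(2, n + 1):
        A[a] = a ** n - sum(comb(n + a - r, n) * A[r] for r in range(1, a))
    return [A[r] for r in range(1, n + 1)]


def variation_distance(n, s):
    """Exact variation distance between s riffle shuffles of an n-card
    deck (viewed as one 2^s-shuffle) and the uniform process."""
    A = eulerian_numbers(n)
    a = 2 ** s
    uniform = Fraction(1, factorial(n))
    total = Fraction(0)
    for r in range(1, n + 1):
        # probability of one particular ordering with r rising sequences
        f = Fraction(comb(a + n - r, n), a ** n)
        total += A[r - 1] * abs(f - uniform)
    return total / 2


for s in range(1, 11):
    print(s, round(float(variation_distance(52, s)), 10))
```

Run for n = 52, the printed values should agree with the entries of Table 3.14 to the precision shown there. Using `Fraction` avoids any rounding in the sum; only the final conversion to a float is approximate.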
Now consider the event T such that the absolute value of the difference between these two probabilities is as large as possible. It can be shown that this absolute value is the variation distance between the process X and the uniform process. (The reader is asked to prove this fact in Exercise 4.)

We have just seen that, for a deck of 52 cards, the variation distance between the 7-riffle shuffle process and the random process is about .334. It is of interest to find an event T such that the difference between the probabilities that the two processes produce T is close to .334. An event with this property can be described in terms of the game called New-Age Solitaire.

New-Age Solitaire

This game was invented by Peter Doyle. It is played with a standard 52-card deck. We deal the cards face up, one at a time, onto a discard pile. If an ace is encountered, say the ace of Hearts, we use it to start a Heart pile. Each suit pile must be built up in order, from ace to king, using only subsequently dealt cards. Once we have dealt all of the cards, we pick up the discard pile and continue. We define the Yin suits to be Hearts and Clubs, and the Yang suits to be Diamonds and Spades. The game ends when either both Yin suit piles have been completed, or both Yang suit piles have been completed. It is clear that if the ordering of the deck is produced by the random process, then the probability that the Yin suit piles are completed first is exactly 1/2.

Now suppose that we buy a new deck of cards, break the seal on the package, and riffle shuffle the deck 7 times. If one tries this, one finds that the Yin suits win about 75% of the time. This is 25% more than we would get if the deck were in truly random order. This deviation is reasonably close to the theoretical maximum of 33.4% obtained above.

Why do the Yin suits win so often?
In a brand new deck of cards, the suits are in the following order, from top to bottom: ace through king of Hearts, ace through king of Clubs, king through ace of Diamonds, and king through ace of Spades. Note that if the cards were not shuffled at all, then the Yin suit piles would be completed on the first pass, before any Yang suit cards are even seen. If we were to continue playing the game until the Yang suit piles are completed, it would take 13 passes through the deck to do this. Thus, one can see that in a new deck, the Yin suits are in the most advantageous order and the Yang suits are in the least advantageous order. Under 7 riffle shuffles, the relative advantage of the Yin suits over the Yang suits is preserved to a certain extent.

Exercises

1 Given any ordering $\sigma$ of {1, 2, ..., n}, we can define $\sigma^{-1}$, the inverse ordering of $\sigma$, to be the ordering in which the ith element is the position occupied by i in $\sigma$. For example, if $\sigma$ = (1, 3, 5, 2, 4, 7, 6), then $\sigma^{-1}$ = (1, 4, 2, 5, 3, 7, 6). (If one thinks of these orderings as permutations, then $\sigma^{-1}$ is the inverse of $\sigma$.)

A fall occurs between two positions in an ordering if the left position is occupied by a larger number than the right position. It will be convenient to say that every ordering has a fall after the last position. In the above example, $\sigma^{-1}$ has four falls. They occur after the second, fourth, sixth, and seventh positions. Prove that the number of rising sequences in an ordering $\sigma$ equals the number of falls in $\sigma^{-1}$.

2 Show that if we start with the identity ordering of {1, 2, ..., n}, then the probability that an a-shuffle leads to an ordering with exactly r rising sequences equals
$$\frac{\binom{n + a - r}{n} A(n, r)}{a^n},$$
for 1 [...]

[...] 4}. We assign the distribution function $m(\omega) = 1/6$ for $\omega = 1, 2, \ldots, 6$. Thus, P(F) = 1/6. Now suppose that the die is rolled and we are told that the event E has occurred. This leaves only two possible outcomes: 5 and 6.
In the absence of any other information, we would still regard these outcomes to be equally likely, so the probability of F becomes 1/2, making P(F|E) = 1/2. □

Example 4.2 In the Life Table (see Appendix C), one finds that in a population of 100,000 females, 89.835% can expect to live to age 60, while 57.062% can expect to live to age 80. Given that a woman is 60, what is the probability that she lives to age 80?

This is an example of a conditional probability. In this case, the original sample space can be thought of as a set of 100,000 females. The events E and F are the subsets of the sample space consisting of all women who live at least 60 years, and at least 80 years, respectively. We consider E to be the new sample space, and note that F is a subset of E. Thus, the size of E is 89,835, and the size of F is 57,062. So, the probability in question equals 57,062/89,835 = .6352. Thus, a woman who is 60 has a 63.52% chance of living to age 80. □

Example 4.3 Consider our voting example from Section 1.2: three candidates A, B, and C are running for office. We decided that A and B have an equal chance of winning and C is only 1/2 as likely to win as A. Let A be the event "A wins," B that "B wins," and C that "C wins." Hence, we assigned probabilities P(A) = 2/5, P(B) = 2/5, and P(C) = 1/5.

Suppose that before the election is held, A drops out of the race. As in Example 4.1, it would be natural to assign new probabilities to the events B and C which are proportional to the original probabilities. Thus, we would have $P(B|\tilde{A}) = 2/3$, and $P(C|\tilde{A}) = 1/3$, where $\tilde{A}$ is the event that A does not win. It is important to note that any time we assign probabilities to real-life events, the resulting distribution is only useful if we take into account all relevant information. In this example, we may have knowledge that most voters who favor A will vote for C if A is no longer in the race.
This will clearly make the probability that C wins greater than the value of 1/3 that was assigned above. □

In these examples we assigned a distribution function and then were given new information that determined a new sample space, consisting of the outcomes that are still possible, and caused us to assign a new distribution function to this space.

We want to make formal the procedure carried out in these examples. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_r\}$ be the original sample space with distribution function $m(\omega_j)$ assigned. Suppose we learn that the event E has occurred. We want to assign a new distribution function $m(\omega_j|E)$ to $\Omega$ to reflect this fact. Clearly, if a sample point $\omega_j$ is not in E, we want $m(\omega_j|E) = 0$. Moreover, in the absence of information to the contrary, it is reasonable to assume that the probabilities for $\omega_k$ in E should have the same relative magnitudes that they had before we learned that E had occurred. For this we require that
$$m(\omega_k|E) = c\, m(\omega_k)$$
for all $\omega_k$ in E, with c some positive constant. But we must also have
$$\sum_{E} m(\omega_k|E) = c \sum_{E} m(\omega_k) = 1.$$
Thus,
$$c = \frac{1}{\sum_{E} m(\omega_k)} = \frac{1}{P(E)}.$$
(Note that this requires us to assume that P(E) > 0.) Thus, we will define
$$m(\omega_k|E) = \frac{m(\omega_k)}{P(E)}$$
for $\omega_k$ in E. We will call this new distribution the conditional distribution given E. For a general event F, this gives
$$P(F|E) = \sum_{F \cap E} m(\omega_k|E) = \sum_{F \cap E} \frac{m(\omega_k)}{P(E)} = \frac{P(F \cap E)}{P(E)}.$$
We call P(F|E) the conditional probability of F occurring given that E occurs, and compute it using the formula
$$P(F|E) = \frac{P(F \cap E)}{P(E)}.$$

[Figure 4.1: Tree diagram. Columns: Urn, Color of ball, $\omega$, $p(\omega)$. The four paths $\omega_1, \ldots, \omega_4$ have probabilities 1/5, 3/10, 1/4, and 1/4.]

Example 4.4 (Example 4.1 continued) Let us return to the example of rolling a die. Recall that F is the event X = 6, and E is the event X > 4. Note that $E \cap F$ is the event F. So, the above formula gives
$$P(F|E) = \frac{P(F \cap E)}{P(E)} = \frac{1/6}{1/3} = \frac{1}{2},$$
in agreement with the calculations performed earlier. □

Example 4.5 We have two urns, I and II.
Urn I contains 2 black balls and 3 white balls. Urn II contains 1 black ball and 1 white ball. An urn is drawn at random and a ball is chosen at random from it. We can represent the sample space of this experiment as the paths through a tree as shown in Figure 4.1. The probabilities assigned to the paths are also shown.

Let B be the event "a black ball is drawn," and I the event "urn I is chosen." Then the branch weight 2/5, which is shown on one branch in the figure, can now be interpreted as the conditional probability P(B|I).

Suppose we wish to calculate P(I|B). Using the formula, we obtain
$$P(I|B) = \frac{P(I \cap B)}{P(B)} = \frac{P(I \cap B)}{P(B \cap I) + P(B \cap II)} = \frac{1/5}{1/5 + 1/4} = \frac{4}{9}.$$
□

[Figure 4.2: Reverse tree diagram. Columns: Color of ball, Urn, $\omega$, $p(\omega)$.]

Bayes Probabilities

Our original tree measure gave us the probabilities for drawing a ball of a given color, given the urn chosen. We have just calculated the inverse probability that a particular urn was chosen, given the color of the ball. Such an inverse probability is called a Bayes probability and may be obtained by a formula that we shall develop later. Bayes probabilities can also be obtained by simply constructing the tree measure for the two-stage experiment carried out in reverse order. We show this tree in Figure 4.2.

The paths through the reverse tree are in one-to-one correspondence with those in the forward tree, since they correspond to individual outcomes of the experiment, and so they are assigned the same probabilities. From the forward tree, we find that the probability of a black ball is
$$\frac{1}{2} \cdot \frac{2}{5} + \frac{1}{2} \cdot \frac{1}{2} = \frac{9}{20}.$$
The probabilities for the branches at the second level are found by simple division. For example, if x is the probability to be assigned to the top branch at the second level, we must have
$$\frac{9}{20} \cdot x = \frac{1}{5},$$
or x = 4/9. Thus, P(I|B) = 4/9, in agreement with our previous calculations.
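The urn calculation can be reproduced with a few lines of exact arithmetic. The sketch below is a generic illustration (the function name `bayes_posterior` and the dictionary layout are ours): it multiplies each first-stage probability by the corresponding branch weight to get the path probabilities, then divides by their sum.

```python
from fractions import Fraction


def bayes_posterior(prior, likelihood):
    """Return P(cause | evidence) for each cause.

    prior[c]      = P(c), the first-stage branch probability
    likelihood[c] = P(evidence | c), the second-stage branch weight
    """
    joint = {c: prior[c] * likelihood[c] for c in prior}  # path probabilities
    evidence = sum(joint.values())                        # P(evidence)
    return {c: joint[c] / evidence for c in prior}


# Example 4.5: each urn is chosen with probability 1/2;
# urn I holds 2 black of 5 balls, urn II holds 1 black of 2 balls.
prior = {"I": Fraction(1, 2), "II": Fraction(1, 2)}
black_given_urn = {"I": Fraction(2, 5), "II": Fraction(1, 2)}

posterior = bayes_posterior(prior, black_given_urn)
print(posterior["I"], posterior["II"])  # 4/9 5/9
```

The two values are the second-level branch weights on the black-ball side of the reverse tree.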
The reverse tree then displays all of the inverse, or Bayes, probabilities.

Example 4.6 We consider now a problem called the Monty Hall problem. This has long been a favorite problem but was revived by a letter from Craig Whitaker to Marilyn vos Savant for consideration in her column in Parade Magazine.1 Craig wrote:

Suppose you're on Monty Hall's Let's Make a Deal! You are given the choice of three doors, behind one door is a car, the others, goats. You pick a door, say 1, Monty opens another door, say 3, which has a goat. Monty says to you "Do you want to pick door 2?" Is it to your advantage to switch your choice of doors?

1. Marilyn vos Savant, Ask Marilyn, Parade Magazine, 9 September; 2 December; 17 February 1990, reprinted in Marilyn vos Savant, Ask Marilyn, St. Martins, New York, 1992.

Marilyn gave a solution concluding that you should switch, and if you do, your probability of winning is 2/3. Several irate readers, some of whom identified themselves as having a PhD in mathematics, said that this is absurd since after Monty has ruled out one door there are only two possible doors and they should still each have the same probability 1/2 so there is no advantage to switching. Marilyn stuck to her solution and encouraged her readers to simulate the game and draw their own conclusions from this. We also encourage the reader to do this (see Exercise 11).

Other readers complained that Marilyn had not described the problem completely. In particular, the way in which certain decisions were made during a play of the game was not specified. This aspect of the problem will be discussed in Section 4.3. We will assume that the car was put behind a door by rolling a three-sided die which made all three choices equally likely. Monty knows where the car is, and always opens a door with a goat behind it.
Finally, we assume that if Monty has a choice of doors (i.e., the contestant has picked the door with the car behind it), he chooses each door with probability 1/2. Marilyn clearly expected her readers to assume that the game was played in this manner. As is the case with most apparent paradoxes, this one can be resolved through careful analysis. We begin by describing a simpler, related question. We say that a contestant is using the "stay" strategy if he picks a door, and, if offered a chance to switch to another door, declines to do so (i.e., he stays with his original choice). Similarly, we say that the contestant is using the "switch" strategy if he picks a door, and, if offered a chance to switch to another door, takes the offer. Now suppose that a contestant decides in advance to play the "stay" strategy. His only action in this case is to pick a door (and decline an invitation to switch, if one is offered). What is the probability that he wins a car? The same question can be asked about the "switch" strategy. Using the "stay" strategy, a contestant will win the car with probability 1/3, since 1/3 of the time the door he picks will have the car behind it. On the other hand, if a contestant plays the "switch" strategy, then he will win whenever the door he originally picked does not have the car behind it, which happens 2/3 of the time. This very simple analysis, though correct, does not quite solve the problem that Craig posed. Craig asked for the conditional probability that you win if you switch, given that you have chosen door 1 and that Monty has chosen door 3. To solve this problem, we set up the problem before getting this information and then compute the conditional probability given this information. This is a process that takes place in several stages; the car is put behind a door, the contestant picks a door, and finally Monty opens a door. Thus it is natural to analyze this using a tree measure. 
Here we make an additional assumption that if Monty has a choice of doors (i.e., the contestant has picked the door with the car behind it) then he picks each door with probability 1/2. The assumptions we have made determine the branch probabilities and these in turn determine the tree measure. The resulting tree and tree measure are shown in Figure 4.3. It is tempting to reduce the tree's size by making certain assumptions such as: "Without loss of generality, we will assume that the contestant always picks door 1." We have chosen not to make any such assumptions, in the interest of clarity.

[Figure 4.3: The Monty Hall problem. Tree with columns: Placement of car, Door chosen by contestant, Door opened by Monty, Path probabilities.]

Now the given information, namely that the contestant chose door 1 and Monty chose door 3, means only two paths through the tree are possible (see Figure 4.4). For one of these paths, the car is behind door 1 and for the other it is behind door 2. The path with the car behind door 2 is twice as likely as the one with the car behind door 1. Thus the conditional probability is 2/3 that the car is behind door 2 and 1/3 that it is behind door 1, so if you switch you have a 2/3 chance of winning the car, as Marilyn claimed.

[Figure 4.4: Conditional probabilities for the Monty Hall problem. Tree with columns: Placement of car, Door chosen by contestant, Door opened by Monty, Conditional probability.]

At this point, the reader may think that the two problems above are the same, since they have the same answers. Recall that we assumed in the original problem that if the contestant chooses the door with the car, so that Monty has a choice of two doors, he chooses each of them with probability 1/2.
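The tree measure for this problem is small enough to enumerate directly. The sketch below builds the path probabilities under the assumptions stated in the text (car placed uniformly, contestant's choice uniform and independent of the car, Monty always opening a goat door) and then conditions on the contestant's door and the door Monty opens. The function names and the parameter `p_larger`, which controls how Monty chooses when two goat doors are available, are our own additions for illustration.

```python
from fractions import Fraction


def monty_paths(p_larger=Fraction(1, 2)):
    """Map each path (car, pick, opened) to its probability.

    p_larger is a hypothetical knob: the chance that Monty opens the
    higher-numbered door when he has two goat doors to choose from.
    """
    ninth = Fraction(1, 9)  # P(car) * P(pick) = 1/3 * 1/3
    paths = {}
    for car in (1, 2, 3):
        for pick in (1, 2, 3):
            options = [d for d in (1, 2, 3) if d != car and d != pick]
            if len(options) == 1:            # Monty's door is forced
                paths[(car, pick, options[0])] = ninth
            else:                            # contestant picked the car
                low, high = sorted(options)
                paths[(car, pick, low)] = ninth * (1 - p_larger)
                paths[(car, pick, high)] = ninth * p_larger
    return paths


def switch_wins_given(pick, opened, p_larger=Fraction(1, 2)):
    """P(switching wins | contestant chose `pick`, Monty opened `opened`)."""
    consistent = {path: p for path, p in monty_paths(p_larger).items()
                  if path[1] == pick and path[2] == opened}
    total = sum(consistent.values())
    wins = sum(p for (car, _, _), p in consistent.items() if car != pick)
    return wins / total


print(switch_wins_given(pick=1, opened=3))  # 2/3
```

With Monty choosing each available door with probability 1/2, the conditional probability of winning by switching, given door 1 chosen and door 3 opened, is 2/3, matching the two-path argument above.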
Now suppose instead that in the case that he has a choice, he chooses the door with the larger number with probability 3/4. In the "switch" vs. "stay" problem, the probability of winning with the "switch" strategy is still 2/3. However, in the original problem, if the contestant switches, he wins with probability 4/7. The reader can check this by noting that the same two paths as before are the only two possible paths in the tree. Given that the contestant has picked door 1, the path leading to a win, if the contestant switches, has probability 1/3, while the path which leads to a loss, if the contestant switches, has probability 1/4, so the conditional probability of a win is (1/3)/(1/3 + 1/4) = 4/7.

Independent Events

It often happens that the knowledge that a certain event E has occurred has no effect on the probability that some other event F has occurred, that is, that $P(F|E) = P(F)$. One would expect that in this case, the equation $P(E|F) = P(E)$ would also be true. In fact (see Exercise 1), each equation implies the other. If these equations are true, we might say that F is independent of E. For example, you would not expect the knowledge of the outcome of the first toss of a coin to change the probability that you would assign to the possible outcomes of the second toss, that is, you would not expect that the second toss depends on the first. This idea is formalized in the following definition of independent events.

Definition 4.1 Let E and F be two events. We say that they are independent if either 1) both events have positive probability and $P(E|F) = P(E)$ and $P(F|E) = P(F)$, or 2) at least one of the events has probability 0. □

As noted above, if both $P(E)$ and $P(F)$ are positive, then each of the above equations implies the other, so that to see whether two events are independent, only one of these equations must be checked (see Exercise 1). The following theorem provides another way to check for independence.

Theorem 4.1 Two events E and F are independent if and only if $P(E \cap F) = P(E)P(F)$.

Proof.
If either event has probability 0, then the two events are independent and the above equation is true, so the theorem is true in this case. Thus, we may assume that both events have positive probability in what follows. Assume that E and F are independent. Then $P(E|F) = P(E)$, and so
$$P(E \cap F) = P(E|F)P(F) = P(E)P(F).$$
Assume next that $P(E \cap F) = P(E)P(F)$. Then
$$P(E|F) = \frac{P(E \cap F)}{P(F)} = P(E).$$
Also,
$$P(F|E) = \frac{P(F \cap E)}{P(E)} = P(F).$$
Therefore, E and F are independent. □

Example 4.7 Suppose that we have a coin which comes up heads with probability p, and tails with probability q. Now suppose that this coin is tossed twice. Using a frequency interpretation of probability, it is reasonable to assign to the outcome (H, H) the probability $p^2$, to the outcome (H, T) the probability $pq$, and so on. Let E be the event that heads turns up on the first toss and F the event that tails turns up on the second toss. We will now check that with the above probability assignments, these two events are independent, as expected. We have $P(E) = p^2 + pq = p$, $P(F) = pq + q^2 = q$. Finally $P(E \cap F) = pq$, so $P(E \cap F) = P(E)P(F)$. □

Example 4.8 It is often, but not always, intuitively clear when two events are independent. In Example 4.7, let A be the event "the first toss is a head" and B the event "the two outcomes are the same." Then
$$P(B|A) = \frac{P(B \cap A)}{P(A)} = \frac{P\{HH\}}{P\{HH, HT\}} = \frac{1/4}{1/2} = \frac{1}{2} = P(B).$$
Therefore, A and B are independent, but the result was not so obvious. □

Example 4.9 Finally, let us give an example of two events that are not independent. In Example 4.7, let I be the event "heads on the first toss" and J the event "two heads turn up." Then $P(I) = 1/2$ and $P(J) = 1/4$. The event $I \cap J$ is the event "heads on both tosses" and has probability 1/4. Thus, I and J are not independent since $P(I)P(J) = 1/8 \ne P(I \cap J)$. □

We can extend the concept of independence to any finite set of events $A_1, A_2, \ldots, A_n$.
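The product test of Theorem 4.1 can be applied mechanically to the events of Examples 4.8 and 4.9. The sketch below is an illustration only: it assumes a fair coin (as the values 1/4 and 1/2 in those examples indicate), and the names `OMEGA`, `prob`, and `independent` are hypothetical helpers, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin; each outcome has probability 1/4.
OMEGA = {"".join(t): Fraction(1, 4) for t in product("HT", repeat=2)}

def prob(event):
    """Probability of an event, given as a set of outcomes."""
    return sum(OMEGA[w] for w in event)

def independent(e, f):
    """Product test of Theorem 4.1: P(E and F) == P(E) * P(F)."""
    return prob(e & f) == prob(e) * prob(f)

A = {w for w in OMEGA if w[0] == "H"}    # the first toss is a head
B = {w for w in OMEGA if w[0] == w[1]}   # the two outcomes are the same
J = {"HH"}                               # two heads turn up

if __name__ == "__main__":
    print("A, B independent:", independent(A, B))  # events of Example 4.8
    print("A, J independent:", independent(A, J))  # events of Example 4.9
```

Because the probabilities are stored as exact fractions, the equality test in `independent` is an exact comparison rather than a floating-point approximation.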
Definition 4.2 A set of events $\{A_1, A_2, \ldots, A_n\}$ is said to be mutually independent if for any subset $\{A_i, A_j, \ldots, A_m\}$ of these events we have
$$P(A_i \cap A_j \cap \cdots \cap A_m) = P(A_i)P(A_j) \cdots P(A_m),$$
or equivalently, if for any sequence $\bar{A}_1, \bar{A}_2, \ldots, \bar{A}_n$ with $\bar{A}_j = A_j$ or $\tilde{A}_j$ (the complement of $A_j$),
$$P(\bar{A}_1 \cap \bar{A}_2 \cap \cdots \cap \bar{A}_n) = P(\bar{A}_1)P(\bar{A}_2) \cdots P(\bar{A}_n).$$
(For a proof of the equivalence in the case n = 3, see Exercise 33.) □

Using this terminology, it is a fact that any sequence (S, S, F, F, S, ..., S) of possible outcomes of a Bernoulli trials process forms a sequence of mutually independent events. It is natural to ask: If all pairs of a set of events are independent, is the whole set mutually independent? The answer is not necessarily, and an example is given in Exercise 7. It is important to note that the statement
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2) \cdots P(A_n)$$
does not imply that the events $A_1, A_2, \ldots, A_n$ are mutually independent (see Exercise 8).

Joint Distribution Functions and Independence of Random Variables

It is frequently the case that when an experiment is performed, several different quantities concerning the outcomes are investigated.

Example 4.10 Suppose we toss a coin three times. The basic random variable X corresponding to this experiment has eight possible outcomes, which are the ordered triples consisting of H's and T's. We can also define the random variable $X_i$, for i = 1, 2, 3, to be the outcome of the ith toss. If the coin is fair, then we should assign the probability 1/8 to each of the eight possible outcomes. Thus, the distribution functions of $X_1$, $X_2$, and $X_3$ are identical; in each case they are defined by $m(H) = m(T) = 1/2$. □

If we have several random variables $X_1, X_2, \ldots, X_n$ which correspond to a given experiment, then we can consider the joint random variable $X = (X_1, X_2, \ldots, X_n)$ defined by taking an outcome $\omega$ of the experiment, and writing, as an n-tuple, the corresponding n outcomes for the random variables $X_1, X_2, \ldots, X_n$.
Thus, if the random variable $X_i$ has, as its set of possible outcomes, the set $R_i$, then the set of possible outcomes of the joint random variable X is the Cartesian product of the $R_i$'s, i.e., the set of all n-tuples of possible outcomes of the $X_i$'s.

Example 4.11 (Example 4.10 continued) In the coin-tossing example above, let $X_i$ denote the outcome of the ith toss. Then the joint random variable $X = (X_1, X_2, X_3)$ has eight possible outcomes. Suppose that we now define $Y_i$, for i = 1, 2, 3, as the number of heads which occur in the first i tosses. Then $Y_i$ has {0, 1, ..., i} as possible outcomes, so at first glance, the set of possible outcomes of the joint random variable $Y = (Y_1, Y_2, Y_3)$ should be the set
$$\{(a_1, a_2, a_3) : 0 \le a_1 \le 1,\ 0 \le a_2 \le 2,\ 0 \le a_3 \le 3\}.$$
However, the outcome (1, 0, 1) cannot occur, since we must have $a_1 \le a_2$ [...] $> 0$, $P(A \cap B \cap C) = P(A)P(B|A)P(C|A \cap B)$.

17 Prove that if A and B are independent so are (a) A and $\tilde{B}$. (b) $\tilde{A}$ and $\tilde{B}$.

18 A doctor assumes that a patient has one of three diseases $d_1$, $d_2$, or $d_3$. Before any test, he assumes an equal probability for each disease. He carries out a test that will be positive with probability .8 if the patient has $d_1$, .6 if he has disease $d_2$, and .4 if he has disease $d_3$. Given that the outcome of the test was positive, what probabilities should the doctor now assign to the three possible diseases?

19 In a poker hand, John has a very strong hand and bets 5 dollars. The probability that Mary has a better hand is .04. If Mary had a better hand she would raise with probability .9, but with a poorer hand she would only raise with probability .1. If Mary raises, what is the probability that she has a better hand than John does?

20 The Polya urn model for contagion is as follows: We start with an urn which contains one white ball and one black ball. At each second we choose a ball at random from the urn and replace this ball and add one more of the color chosen.
Write a program to simulate this model, and see if you can make any predictions about the proportion of white balls in the urn after a large number of draws. Is there a tendency to have a large fraction of balls of the same color in the long run?

21 It is desired to find the probability that in a bridge deal each player receives an ace. A student argues as follows. It does not matter where the first ace goes. The second ace must go to one of the other three players and this occurs with probability 3/4. Then the next must go to one of two, an event of probability 1/2, and finally the last ace must go to the player who does not have an ace. This occurs with probability 1/4. The probability that all these events occur is the product (3/4)(1/2)(1/4) = 3/32. Is this argument correct?

22 One coin in a collection of 65 has two heads. The rest are fair. If a coin, chosen at random from the lot and then tossed, turns up heads 6 times in a row, what is the probability that it is the two-headed coin?

23 You are given two urns and fifty balls. Half of the balls are white and half are black. You are asked to distribute the balls in the urns with no restriction placed on the number of either type in an urn. How should you distribute the balls in the urns to maximize the probability of obtaining a white ball if an urn is chosen at random and a ball drawn out at random? Justify your answer.

24 A fair coin is thrown n times. Show that the conditional probability of a head on any specified trial, given a total of k heads over the n trials, is k/n (k > 0).

25 (Johnsonbough⁸) A coin with probability p for heads is tossed n times. Let E be the event "a head is obtained on the first toss" and $F_k$ the event "exactly k heads are obtained." For which pairs (n, k) are E and $F_k$ independent?

26 Suppose that A and B are events such that $P(A|B) = P(B|A)$ and $P(A \cup B) = 1$ and $P(A \cap B) > 0$. Prove that $P(A) > 1/2$.
27 (Chung⁹) In London, half of the days have some rain. The weather forecaster is correct 2/3 of the time, i.e., the probability that it rains, given that she has predicted rain, and the probability that it does not rain, given that she has predicted that it won't rain, are both equal to 2/3. When rain is forecast, Mr. Pickwick takes his umbrella. When rain is not forecast, he takes it with probability 1/3. Find (a) the probability that Pickwick has no umbrella, given that it rains. (b) the probability that he brings his umbrella, given that it doesn't rain.

28 Probability theory was used in a famous court case: People v. Collins.¹⁰ In this case a purse was snatched from an elderly person in a Los Angeles suburb. A couple seen running from the scene were described as a black man with a beard and a mustache and a blond girl with hair in a ponytail. Witnesses said they drove off in a partly yellow car. Malcolm and Janet Collins were arrested. He was black and though clean shaven when arrested had evidence of recently having had a beard and a mustache. She was blond and usually wore her hair in a ponytail. They drove a partly yellow Lincoln. The prosecution called a professor of mathematics as a witness who suggested that a conservative set of probabilities for the characteristics noted by the witnesses would be as shown in Table 4.5. The prosecution then argued that the probability that all of these characteristics are met by a randomly chosen couple is the product of the probabilities or 1/12,000,000, which is very small. He claimed this was proof beyond a reasonable doubt that the defendants were guilty. The jury agreed and handed down a verdict of guilty of second-degree robbery.

8. R. Johnsonbough, "Problem #103," Two Year College Math Journal, vol. 8 (1977), p. 292.
9. K. L. Chung, Elementary Probability Theory With Stochastic Processes, 3rd ed. (New York: Springer-Verlag, 1979), p. 152.
10. M. W. Gray, "Statistics and the Law," Mathematics Magazine, vol.
56 (1983), pp. 67-81.

man with mustache              1/4
girl with blond hair           1/3
girl with ponytail             1/10
black man with beard           1/10
interracial couple in a car    1/1000
partly yellow car              1/10

Table 4.5: Collins case probabilities.

If you were the lawyer for the Collins couple how would you have countered the above argument? (The appeal of this case is discussed in Exercise 5.1.34.)

29 A student is applying to Harvard and Dartmouth. He estimates that he has a probability of .5 of being accepted at Dartmouth and .3 of being accepted at Harvard. He further estimates the probability that he will be accepted by both is .2. What is the probability that he is accepted by Dartmouth if he is accepted by Harvard? Is the event "accepted at Harvard" independent of the event "accepted at Dartmouth"?

30 Luxco, a wholesale lightbulb manufacturer, has two factories. Factory A sells bulbs in lots that consist of 1000 regular and 2000 softglow bulbs each. Random sampling has shown that on the average there tend to be about 2 bad regular bulbs and 11 bad softglow bulbs per lot. At factory B the lot size is reversed (there are 2000 regular and 1000 softglow per lot) and there tend to be 5 bad regular and 6 bad softglow bulbs per lot. The manager of factory A asserts, "We're obviously the better producer; our bad bulb rates are .2 percent and .55 percent compared to B's .25 percent and .6 percent. We're better at both regular and softglow bulbs by half of a tenth of a percent each." "Au contraire," counters the manager of B, "each of our 3000 bulb lots contains only 11 bad bulbs, while A's 3000 bulb lots contain 13. So our .37 percent bad bulb rate beats their .43 percent." Who is right?

31 Using the Life Table for 1981 given in Appendix C, find the probability that a male of age 60 in 1981 lives to age 80. Find the same probability for a female.
32 (a) There has been a blizzard and Helen is trying to drive from Woodstock to Tunbridge, which are connected like the top graph in Figure 4.6. Here p and q are the probabilities that the two roads are passable. What is the probability that Helen can get from Woodstock to Tunbridge? (b) Now suppose that Woodstock and Tunbridge are connected like the middle graph in Figure 4.6. What now is the probability that she can get from W to T? Note that if we think of the roads as being components of a system, then in (a) and (b) we have computed the reliability of a system whose components are (a) in series and (b) in parallel.

[Figure 4.6: From Woodstock to Tunbridge. Three road graphs: (a) two roads with probabilities p and q in series between Woodstock and Tunbridge; (b) two roads with probabilities p and q in parallel between W and T; (c) a network from W to T through intermediate points C and D, with road probabilities .8, .9, .95, .9, and .8.]

(c) Now suppose W and T are connected like the bottom graph in Figure 4.6. Find the probability of Helen's getting from W to T. Hint: If the road from C to D is impassable, it might as well not be there at all; if it is passable, then figure out how to use part (b) twice.

33 Let $A_1$, $A_2$, and $A_3$ be events, and let $B_i$ represent either $A_i$ or its complement $\tilde{A}_i$. Then there are eight possible choices for the triple $(B_1, B_2, B_3)$. Prove that the events $A_1$, $A_2$, $A_3$ are independent if and only if
$$P(B_1 \cap B_2 \cap B_3) = P(B_1)P(B_2)P(B_3),$$
for all eight of the possible choices for the triple $(B_1, B_2, B_3)$.

34 Four women, A, B, C, and D, check their hats, and the hats are returned in a random manner. Let $\Omega$ be the set of all possible permutations of A, B, C, D. Let $X_j = 1$ if the jth woman gets her own hat back and 0 otherwise. What is the distribution of $X_j$? Are the $X_i$'s mutually independent?

35 A box has numbers from 1 to 10. A number is drawn at random. Let $X_1$ be the number drawn. This number is replaced, and the ten numbers mixed. A second number $X_2$ is drawn. Find the distributions of $X_1$ and $X_2$. Are $X_1$ and $X_2$ independent? Answer the same questions if the first number is not replaced before the second is drawn.
                  Y
          -1      0      1      2
X   -1     0    1/36    1/6   1/12
     0   1/18     0    1/18     0
     1     0    1/36    1/6   1/12
     2   1/12     0    1/12    1/6

Table 4.6: Joint distribution.

36 A die is thrown twice. Let $X_1$ and $X_2$ denote the outcomes. Define $X = \min(X_1, X_2)$. Find the distribution of X.

*37 Given that $P(X = a) = r$, $P(\max(X, Y) = a) = s$, and $P(\min(X, Y) = a) = t$, show that you can determine $u = P(Y = a)$ in terms of r, s, and t.

38 A fair coin is tossed three times. Let X be the number of heads that turn up on the first two tosses and Y the number of heads that turn up on the third toss. Give the distribution of (a) the random variables X and Y. (b) the random variable Z = X + Y. (c) the random variable W = X - Y.

39 Assume that the random variables X and Y have the joint distribution given in Table 4.6. (a) What is $P(X > 1 \text{ and } Y \le 0)$? (b) What is the conditional probability that Y < 0 given that X = 2? (c) Are X and Y independent? (d) What is the distribution of Z = XY?

40 In the problem of points, discussed in the historical remarks in Section 3.2, two players, A and B, play a series of points in a game with player A winning each point with probability p and player B winning each point with probability q = 1 - p. The first player to win N points wins the game. Assume that N = 3. Let X be a random variable that has the value 1 if player A wins the series and 0 otherwise. Let Y be a random variable with value the number of points played in a game. Find the distribution of X and Y when p = 1/2. Are X and Y independent in this case? Answer the same questions for the case p = 2/3.

41 The letters between Pascal and Fermat, which are often credited with having started probability theory, dealt mostly with the problem of points described in Exercise 40. Pascal and Fermat considered the problem of finding a fair division of stakes if the game must be called off when the first player has won r games and the second player has won s games, with r < N and s < N.
Let P(r, s) be the probability that player A wins the game if he has already won r points and player B has won s points. Then (a) P(r, N) = 0 if r < N, (b) P(N, s) = 1 if s < N, (c) P(r, s) = pP(r + 1, s) + qP(r, s + 1) if r < N and s < N; and (a), (b), and (c) determine P(r, s) for r < N and s < N. Pascal used these facts to find P(r, s) by working backward: He first obtained P(N - 1, j) for j = N - 1, N - 2, ..., 0; then, from these values, he obtained P(N - 2, j) for j = N - 1, N - 2, ..., 0 and, continuing backward, obtained all the values P(r, s). Write a program to compute P(r, s) for given N, a, b, and p. Warning: Follow Pascal and you will be able to run N = 100; use recursion and you will not be able to run N = 20.

42 Fermat solved the problem of points (see Exercise 40) as follows: He realized that the problem was difficult because the possible ways the play might go are not equally likely. For example, when the first player needs two more games and the second needs three to win, two possible ways the series might go for the first player are WLW and LWLW. These sequences are not equally likely. To avoid this difficulty, Fermat extended the play, adding fictitious plays so that the series went the maximum number of games needed (four in this case). He obtained equally likely outcomes and used, in effect, the Pascal triangle to calculate P(r, s). Show that this leads to a formula for P(r, s) even for the case $p \ne 1/2$.

43 The Yankees are playing the Dodgers in a world series. The Yankees win each game with probability .6. What is the probability that the Yankees win the series? (The series is won by the first team to win four games.)

44 C. L. Anderson¹¹ has used Fermat's argument for the problem of points to prove the following result due to J. G. Kingston.
You are playing the game of points (see Exercise 40) but, at each point, when you serve you win with probability $p$, and when your opponent serves you win with probability $\bar{p}$. You will serve first, but you can choose one of the following two conventions for serving: for the first convention you alternate service (tennis), and for the second the person serving continues to serve until he loses a point and then the other player serves (racquetball). The first player to win N points wins the game. The problem is to show that the probability of winning the game is the same under either convention. (a) Show that, under either convention, you will serve at most N points and your opponent at most N - 1 points. (b) Extend the number of points to 2N - 1 so that you serve N points and your opponent serves N - 1. For example, you serve any additional points necessary to make N serves and then your opponent serves any additional points necessary to make him serve N - 1 points. The winner is now the person, in the extended game, who wins the most points. Show that playing these additional points has not changed the winner. (c) Show that (a) and (b) prove that you have the same probability of winning the game under either convention.

11. C. L. Anderson, "Note on the Advantage of First Serve," Journal of Combinatorial Theory, Series A, vol. 23 (1977), p. 363.

45 In the previous problem, assume that $p = 1 - \bar{p}$. (a) Show that under either service convention, the first player will win more often than the second player if and only if p > .5. (b) In volleyball, a team can only win a point while it is serving. Thus, any individual "play" either ends with a point being awarded to the serving team or with the service changing to the other team. The first team to win N points wins the game. (We ignore here the additional restriction that the winning team must be ahead by at least two points at the end of the game.)
Assume that each team has the same probability of winning the play when it is serving, i.e., that $p = 1 - \bar{p}$. Show that in this case, the team that serves first will win more than half the time, as long as p > 0. (If p = 0, then the game never ends.) Hint: Define p' to be the probability that a team wins the next point, given that it is serving. If we write q = 1 - p, then one can show that
$$p' = \frac{p}{1 - q^2}.$$
If one now considers this game in a slightly different way, one can see that the second service convention in the preceding problem can be used, with p replaced by p'.

46 A poker hand consists of 5 cards dealt from a deck of 52 cards. Let X and Y be, respectively, the number of aces and kings in a poker hand. Find the joint distribution of X and Y.

47 Let $X_1$ and $X_2$ be independent random variables and let $Y_1 = \phi_1(X_1)$ and $Y_2 = \phi_2(X_2)$. (a) Show that
$$P(Y_1 = r, Y_2 = s) = \sum_{\phi_1(a) = r,\ \phi_2(b) = s} P(X_1 = a, X_2 = b).$$
(b) Using (a), show that $P(Y_1 = r, Y_2 = s) = P(Y_1 = r)P(Y_2 = s)$ so that $Y_1$ and $Y_2$ are independent.

48 Let $\Omega$ be the sample space of an experiment. Let E be an event with P(E) > 0 and define $m_E(\omega)$ by $m_E(\omega) = m(\omega|E)$. Prove that $m_E(\omega)$ is a distribution function on E, that is, that $m_E(\omega) \ge 0$ and that $\sum_{\omega \in E} m_E(\omega) = 1$. The function $m_E$ is called the conditional distribution given E.

49 You are given two urns each containing two biased coins. The coins in urn I come up heads with probability $p_1$, and the coins in urn II come up heads with probability $p_2 \ne p_1$. You are given a choice of (a) choosing an urn at random and tossing the two coins in this urn or (b) choosing one coin from each urn and tossing these two coins. You win a prize if both coins turn up heads. Show that you are better off selecting choice (a).

50 Prove that, if $A_1, A_2, \ldots, A_n$ are independent events defined on a sample space $\Omega$ and if $0 < P(A_j) < 1$ for all j, then $\Omega$ must have at least $2^n$ points.

51 Prove that if $P(A|C) > P(B|C)$ and $P(A|\tilde{C}) > P(B|\tilde{C})$, then $P(A) > P(B)$.
52 A coin is in one of n boxes. The probability that it is in the ith box is $p_i$. If you search in the ith box and it is there, you find it with probability $a_i$. Show that the probability p that the coin is in the jth box, given that you have looked in the ith box and not found it, is
$$p = \begin{cases} p_j/(1 - a_i p_i), & \text{if } j \ne i, \\ (1 - a_i)p_i/(1 - a_i p_i), & \text{if } j = i. \end{cases}$$

53 George Wolford has suggested the following variation on the Linda problem (see Exercise 1.2.25). The registrar is carrying John and Mary's registration cards and drops them in a puddle. When he picks them up he cannot read the names but on the first card he picked up he can make out Mathematics 23 and Government 35, and on the second card he can make out only Mathematics 23. He asks you if you can help him decide which card belongs to Mary. You know that Mary likes government but does not like mathematics. You know nothing about John and assume that he is just a typical Dartmouth student. From this you estimate:

P(Mary takes Government 35) = .5,
P(Mary takes Mathematics 23) = .1,
P(John takes Government 35) = .3,
P(John takes Mathematics 23) = .2.

Assume that their choices for courses are independent events. Show that the card with Mathematics 23 and Government 35 showing is more likely to be Mary's than John's. The conjunction fallacy referred to in the Linda problem would be to assume that the event "Mary takes Mathematics 23 and Government 35" is more likely than the event "Mary takes Mathematics 23." Why are we not making this fallacy here?

54 (Suggested by Eisenberg and Ghosh¹²) A deck of playing cards can be described as a Cartesian product Deck = Suit × Rank, where Suit = {♣, ♦, ♥, ♠} and Rank = {2, 3, ..., 10, J, Q, K, A}. This just means that every card may be thought of as an ordered pair like (♦, 2). By a suit event we mean any event A contained in Deck which is described in terms of Suit alone.
For instance, if A is "the suit is red," then A = {♦, ♥} × Rank, so that A consists of all cards of the form (♦, r) or (♥, r) where r is any rank. Similarly, a rank event is any event described in terms of rank alone. (a) Show that if A is any suit event and B any rank event, then A and B are independent. (We can express this briefly by saying that suit and rank are independent.) (b) Throw away the ace of spades. Show that now no nontrivial (i.e., neither empty nor the whole space) suit event A is independent of any nontrivial rank event B. Hint: Here independence comes down to
$$c/51 = (a/51) \cdot (b/51),$$
where a, b, c are the respective sizes of A, B and $A \cap B$. It follows that 51 must divide ab, hence that 3 must divide one of a and b, and 17 the other. But the possible sizes for suit and rank events preclude this. (c) Show that the deck in (b) nevertheless does have pairs A, B of nontrivial independent events. Hint: Find 2 events A and B of sizes 3 and 17, respectively, which intersect in a single point. (d) Add a joker to a full deck. Show that now there is no pair A, B of nontrivial independent events. Hint: See the hint in (b); 53 is prime.

The following problems are suggested by Stanley Gudder in his article "Do Good Hands Attract?"¹³ He says that event A attracts event B if $P(B|A) > P(B)$ and repels B if $P(B|A) < P(B)$.

55 Let $R_i$ be the event that the ith player in a poker game has a royal flush. Show that a royal flush (A, K, Q, J, 10 of one suit) attracts another royal flush, that is $P(R_2|R_1) > P(R_2)$. Show that a royal flush repels full houses.

56 Prove that A attracts B if and only if B attracts A. Hence we can say that A and B are mutually attractive if A attracts B.

12. B. Eisenberg and B. K. Ghosh, "Independent Events in a Discrete Uniform Probability Space," The American Statistician, vol. 41, no. 1 (1987), pp. 52-56.
13. S. Gudder, "Do Good Hands Attract?" Mathematics Magazine, vol. 54, no. 1 (1981), pp. 13-16.
57 Prove that A neither attracts nor repels B if and only if A and B are independent.

58 Prove that A and B are mutually attractive if and only if $P(B|A) > P(B|\tilde{A})$.

59 Prove that if A attracts B, then A repels $\tilde{B}$.

60 Prove that if A attracts both B and C, and A repels $B \cap C$, then A attracts $B \cup C$. Is there any example in which A attracts both B and C and repels $B \cup C$?

61 Prove that if $B_1, B_2, \ldots, B_n$ are mutually disjoint and collectively exhaustive, and if A attracts some $B_i$, then A must repel some $B_j$.

62 (a) Suppose that you are looking in your desk for a letter from some time ago. Your desk has eight drawers, and you assess the probability that it is in any particular drawer as 10% (so there is a 20% chance that it is not in the desk at all). Suppose now that you start searching systematically through your desk, one drawer at a time. In addition, suppose that you have not found the letter in the first i drawers, where $0 \le i \le 7$. Let $p_i$ denote the probability that the letter will be found in the next drawer, and let $q_i$ denote the probability that the letter will be found in some subsequent drawer (both $p_i$ and $q_i$ are conditional probabilities, since they are based upon the assumption that the letter is not in the first i drawers). Show that the $p_i$'s increase and the $q_i$'s decrease. (This problem is from Falk et al.¹⁴) (b) The following data appeared in an article in the Wall Street Journal.¹⁵ For the ages 20, 30, 40, 50, and 60, the probability of a woman in the U.S. developing cancer in the next ten years is 0.5%, 1.2%, 3.2%, 6.4%, and 10.8%, respectively. At the same set of ages, the probability of a woman in the U.S. eventually developing cancer is 39.6%, 39.5%, 39.1%, 37.5%, and 34.2%, respectively. Do you think that the problem in part (a) gives an explanation for these data?
63 Here are two variations of the Monty Hall problem that are discussed by Granberg.¹⁶ (a) Suppose that everything is the same except that Monty forgot to find out in advance which door has the car behind it. In the spirit of "the show must go on," he makes a guess at which of the two doors to open and gets lucky, opening a door behind which stands a goat. Now should the contestant switch? (b) You have observed the show for a long time and found that the car is put behind door A 45% of the time, behind door B 40% of the time and behind door C 15% of the time. Assume that everything else about the show is the same. Again you pick door A. Monty opens a door with a goat and offers to let you switch. Should you? Suppose you knew in advance that Monty was going to give you a chance to switch. Should you have initially chosen door A?

14. R. Falk, A. Lipson, and C. Konold, "The ups and downs of the hope function in a fruitless search," in Subjective Probability, G. Wright and P. Ayton, (eds.) (Chichester: Wiley, 1994), pp. 353-377.
15. C. Crossen, "Fright by the numbers: Alarming disease data are frequently flawed," Wall Street Journal, 11 April 1996, p. B1.
16. D. Granberg, "To switch or not to switch," in The power of logical thinking, M. vos Savant, (New York: St. Martin's, 1996).

4.2 Continuous Conditional Probability

In situations where the sample space is continuous we will follow the same procedure as in the previous section. Thus, for example, if X is a continuous random variable with density function f(x), and if E is an event with positive probability, we define a conditional density function by the formula
$$f(x|E) = \begin{cases} f(x)/P(E), & \text{if } x \in E, \\ 0, & \text{if } x \notin E. \end{cases}$$
Then for any event F, we have
$$P(F|E) = \int_F f(x|E)\,dx.$$
The expression $P(F|E)$ is called the conditional probability of F given E.
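The definition can be tried out numerically. The sketch below is an illustration under stated assumptions rather than an example from the text: the density f(x) = 2x on [0, 1] and the events E = [1/2, 1] and F = [0, 3/4] are hypothetical choices, and `integrate` is a simple midpoint-rule helper written for this purpose.

```python
def integrate(g, a, b, n=30_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def density(x):
    """Hypothetical density f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

E = (0.5, 1.0)    # conditioning event (assumed for illustration)
F = (0.0, 0.75)   # event whose conditional probability we want

p_E = integrate(density, *E)

def cond_density(x):
    """f(x|E): f(x)/P(E) on E, and 0 off E."""
    return density(x) / p_E if E[0] <= x <= E[1] else 0.0

# P(F|E) as the integral of the conditional density over F.
p_F_given_E = integrate(cond_density, *F)

if __name__ == "__main__":
    print("P(E)   =", p_E)
    print("P(F|E) =", p_F_given_E)
```

For these choices the exact values are P(E) = 3/4 and P(F|E) = (5/16)/(3/4) = 5/12, which the numerical results should approximate closely.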
As in the previous section, it is easy to obtain an alternative expression for this probability:
$$P(F|E) = \int_F f(x|E)\,dx = \int_{E \cap F} \frac{f(x)}{P(E)}\,dx = \frac{P(E \cap F)}{P(E)}.$$
We can think of the conditional density function as being 0 except on E, and normalized to have integral 1 over E. Note that if the original density is a uniform density corresponding to an experiment in which all events of equal size are equally likely, then the same will be true for the conditional density.

Example 4.18 In the spinner experiment (cf. Example 2.1), suppose we know that the spinner has stopped with head in the upper half of the circle, $0 \le x \le 1/2$. What is the probability that $1/6 \le x \le 1/3$? Here E = [0, 1/2], F = [1/6, 1/3], and $F \cap E = F$. Hence
$$P(F|E) = \frac{P(F \cap E)}{P(E)} = \frac{1/6}{1/2} = \frac{1}{3},$$
which is reasonable, since F is 1/3 the size of E. The conditional density function here is given by
$$f(x|E) = \begin{cases} 2, & \text{if } x \in E, \\ 0, & \text{if } x \notin E. \end{cases}$$
[...] $> 0\}$, and $F = \{(x, y) : x^2 + y^2 < (1/2)^2\}$. Hence,
$$P(F|E) = \frac{P(F \cap E)}{P(E)} = \frac{(1/\pi)[(1/2)(\pi/4)]}{(1/\pi)(\pi/2)} = 1/4.$$
Here again, the size of $F \cap E$ is 1/4 the size of E. The conditional density function is
$$f((x, y)|E) = \begin{cases} f(x, y)/P(E) = 2/\pi, & \text{if } (x, y) \in E, \\ 0, & \text{if } (x, y) \notin E. \end{cases}$$

Example 4.20 We return to the exponential density (cf. Example 2.17). We suppose that we are observing a lump of plutonium-239. Our experiment consists of waiting for an emission, then starting a clock, and recording the length of time X that passes until the next emission. Experience has shown that X has an exponential density with some parameter $\lambda$, which depends upon the size of the lump. Suppose that when we perform this experiment, we notice that the clock reads r seconds, and is still running. What is the probability that there is no emission in a further s seconds? Let G(t) be the probability that the next particle is emitted after time t. Then
$$G(t) = \int_t^\infty \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\Big|_t^\infty = e^{-\lambda t}.$$
Let E be the event "the next particle is emitted after time r" and F the event "the next particle is emitted after time r + s."
Then

$$P(F|E) = \frac{P(F \cap E)}{P(E)} = \frac{G(r+s)}{G(r)} = \frac{e^{-\lambda(r+s)}}{e^{-\lambda r}} = e^{-\lambda s}.$$

This tells us the rather surprising fact that the probability that we have to wait s seconds more for an emission, given that there has been no emission in r seconds, is independent of the time r. This property (called the memoryless property) was introduced in Example 2.17. When trying to model various phenomena, this property is helpful in deciding whether the exponential density is appropriate.

The fact that the exponential density is memoryless means that it is reasonable to assume if one comes upon a lump of a radioactive isotope at some random time, then the amount of time until the next emission has an exponential density with the same parameter as the time between emissions. A well-known example, known as the "bus paradox," replaces the emissions by buses. The apparent paradox arises from the following two facts: 1) If you know that, on the average, the buses come by every 30 minutes, then if you come to the bus stop at a random time, you should only have to wait, on the average, for 15 minutes for a bus, and 2) Since the buses' arrival times are being modelled by the exponential density, then no matter when you arrive, you will have to wait, on the average, for 30 minutes for a bus.

The reader can now see that in Exercises 2.2.9, 2.2.10, and 2.2.11, we were asking for simulations of conditional probabilities, under various assumptions on the distribution of the interarrival times. If one makes a reasonable assumption about this distribution, such as the one in Exercise 2.2.10, then the average waiting time is more nearly one-half the average interarrival time. □

Independent Events

If E and F are two events with positive probability in a continuous sample space, then, as in the case of discrete sample spaces, we define E and F to be independent if P(E|F) = P(E) and P(F|E) = P(F).
As before, each of the above equations implies the other, so that to see whether two events are independent, only one of these equations must be checked. It is also the case that, if E and F are independent, then P(E ∩ F) = P(E)P(F).

Example 4.21 (Example 4.19 continued) In the dart game (see Example 4.19), let E be the event that the dart lands in the upper half of the target (y > 0) and F the event that the dart lands in the right half of the target (x > 0). Then P(E ∩ F) is the probability that the dart lies in the first quadrant of the target, and

$$P(E \cap F) = \frac{1}{\pi}\iint_{E \cap F} 1\,dx\,dy = \frac{\text{Area}(E \cap F)}{\pi} = \frac{\text{Area}(E)}{\pi}\cdot\frac{\text{Area}(F)}{\pi} = \left(\frac{1}{\pi}\iint_E 1\,dx\,dy\right)\left(\frac{1}{\pi}\iint_F 1\,dx\,dy\right) = P(E)P(F),$$

so that E and F are independent. What makes this work is that the events E and F are described by restricting different coordinates. This idea is made more precise below. □

Joint Density and Cumulative Distribution Functions

In a manner analogous with discrete random variables, we can define joint density functions and cumulative distribution functions for multi-dimensional continuous random variables.

Definition 4.6 Let $X_1, X_2, \ldots, X_n$ be continuous random variables associated with an experiment, and let $X = (X_1, X_2, \ldots, X_n)$. Then the joint cumulative distribution function of X is defined by

$$F(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n).$$

The joint density function of X satisfies the following equation:

$$F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\cdots\int_{-\infty}^{x_n} f(t_1, t_2, \ldots, t_n)\,dt_n\,dt_{n-1}\cdots dt_1.$$

□

It is straightforward to show that, in the above notation,

$$f(x_1, x_2, \ldots, x_n) = \frac{\partial^n F(x_1, x_2, \ldots, x_n)}{\partial x_1\,\partial x_2 \cdots \partial x_n}. \qquad (4.4)$$

Independent Random Variables

As with discrete random variables, we can define mutual independence of continuous random variables.

Definition 4.7 Let $X_1, X_2, \ldots, X_n$ be continuous random variables with cumulative distribution functions $F_1(x), F_2(x), \ldots, F_n(x)$. Then these random variables are mutually independent if

$$F(x_1, x_2, \ldots, x_n) = F_1(x_1)F_2(x_2)\cdots F_n(x_n)$$

for any choice of $x_1, x_2, \ldots, x_n$.
Thus, if $X_1, X_2, \ldots, X_n$ are mutually independent, then the joint cumulative distribution function of the random variable $X = (X_1, X_2, \ldots, X_n)$ is just the product of the individual cumulative distribution functions. When two random variables are mutually independent, we shall say more briefly that they are independent. □

Using Equation 4.4, the following theorem can easily be shown to hold for mutually independent continuous random variables.

Theorem 4.2 Let $X_1, X_2, \ldots, X_n$ be continuous random variables with density functions $f_1(x), f_2(x), \ldots, f_n(x)$. Then these random variables are mutually independent if and only if

$$f(x_1, x_2, \ldots, x_n) = f_1(x_1)f_2(x_2)\cdots f_n(x_n)$$

for any choice of $x_1, x_2, \ldots, x_n$. □

[Figure 4.7: $X_1$ and $X_2$ are independent.]

Let's look at some examples.

Example 4.22 In this example, we define three random variables, $X_1$, $X_2$, and $X_3$. We will show that $X_1$ and $X_2$ are independent, and that $X_1$ and $X_3$ are not independent. Choose a point $\omega = (\omega_1, \omega_2)$ at random from the unit square. Set $X_1 = \omega_1^2$, $X_2 = \omega_2^2$, and $X_3 = \omega_1 + \omega_2$. Find the joint distributions $F_{12}(r_1, r_2)$ and $F_{23}(r_2, r_3)$.

We have already seen (see Example 2.13) that

$$F_1(r_1) = P(-\infty < X_1 \le r_1) = \sqrt{r_1}, \quad \text{if } 0 \le r_1 \le 1,$$

[...]

[...] $\alpha > 1$ and $\beta > 1$ it is bell-shaped with the parameters $\alpha$ and $\beta$ determining its peak and its spread.

Assume that the experimenter has chosen a beta density to describe the state of his knowledge about x before the experiment. Then he gives the drug to n subjects and records the number i of successes. The number i is a discrete random variable, so we may conveniently describe the set of possible outcomes of this experiment by referring to the ordered pair (x, i).

We let m(i|x) denote the probability that we observe i successes given the value of x. By our assumptions, m(i|x) is the binomial distribution with probability x for success:

$$m(i|x) = b(n, x, i) = \binom{n}{i}x^i(1 - x)^j,$$

where j = n − i.
If x is chosen at random from [0, 1] with a beta density $B(\alpha, \beta, x)$, then the density function for the outcome of the pair (x, i) is

$$f(x, i) = m(i|x)B(\alpha, \beta, x) = \binom{n}{i}x^i(1 - x)^j\,\frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1 - x)^{\beta-1} = \binom{n}{i}\frac{1}{B(\alpha, \beta)}x^{\alpha+i-1}(1 - x)^{\beta+j-1}.$$

Now let m(i) be the probability that we observe i successes not knowing the value of x. Then

$$m(i) = \int_0^1 m(i|x)B(\alpha, \beta, x)\,dx = \binom{n}{i}\frac{1}{B(\alpha, \beta)}\int_0^1 x^{\alpha+i-1}(1 - x)^{\beta+j-1}\,dx = \binom{n}{i}\frac{B(\alpha+i, \beta+j)}{B(\alpha, \beta)}.$$

Hence, the probability density f(x|i) for x, given that i successes were observed, is

$$f(x|i) = \frac{f(x, i)}{m(i)} = \frac{x^{\alpha+i-1}(1 - x)^{\beta+j-1}}{B(\alpha+i, \beta+j)}, \qquad (4.5)$$

that is, f(x|i) is another beta density. This says that if we observe i successes and j failures in n subjects, then the new density for the probability that the drug is effective is again a beta density but with parameters $\alpha + i$, $\beta + j$.

Now we assume that before the experiment we choose a beta density with parameters $\alpha$ and $\beta$, and that in the experiment we obtain i successes in n trials. We have just seen that in this case, the new density for x is a beta density with parameters $\alpha + i$ and $\beta + j$.

Now we wish to calculate the probability that the drug is effective on the next subject. For any particular real number t between 0 and 1, the probability density that x has the value t is given by the expression in Equation 4.5. Given that x has the value t, the probability that the drug is effective on the next subject is just t. Thus, to obtain the probability that the drug is effective on the next subject, we integrate the product of the expression in Equation 4.5 and t over all possible values of t. We obtain:

$$\frac{1}{B(\alpha+i, \beta+j)}\int_0^1 t \cdot t^{\alpha+i-1}(1 - t)^{\beta+j-1}\,dt = \frac{B(\alpha+i+1, \beta+j)}{B(\alpha+i, \beta+j)} = \frac{(\alpha+i)!\,(\beta+j-1)!}{(\alpha+\beta+i+j)!}\cdot\frac{(\alpha+\beta+i+j-1)!}{(\alpha+i-1)!\,(\beta+j-1)!} = \frac{\alpha+i}{\alpha+\beta+n}.$$

If n is large, then our estimate for the probability of success after the experiment is approximately the proportion of successes observed in the experiment, which is certainly a reasonable conclusion.
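The update rule derived above can be checked with exact arithmetic. The following is a minimal sketch, assuming integer parameters so that the factorial form of the beta function used in the text applies; the function names (`beta_fn`, `posterior_params`, `prob_next_success`, `marginal`) are illustrative and do not come from the book's programs.

```python
from fractions import Fraction
from math import comb, factorial


def beta_fn(a, b):
    """B(a, b) for positive integers a and b: (a-1)!(b-1)!/(a+b-1)!."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def posterior_params(alpha, beta, successes, n):
    """Beta parameters after observing `successes` successes in n trials."""
    failures = n - successes
    return alpha + successes, beta + failures


def prob_next_success(alpha, beta, successes, n):
    """Probability of success on the next trial, computed as the ratio
    B(alpha+i+1, beta+j) / B(alpha+i, beta+j) from the derivation."""
    a, b = posterior_params(alpha, beta, successes, n)
    return beta_fn(a + 1, b) / beta_fn(a, b)


def marginal(alpha, beta, i, n):
    """m(i): probability of i successes in n trials, not knowing x."""
    return comb(n, i) * beta_fn(alpha + i, beta + n - i) / beta_fn(alpha, beta)


if __name__ == "__main__":
    # Uniform prior (alpha = beta = 1), 7 successes in 10 trials.
    print(posterior_params(1, 1, 7, 10))   # (8, 4)
    print(prob_next_success(1, 1, 7, 10))  # 2/3, i.e. (1 + 7)/(1 + 1 + 10)
```

With a uniform prior the marginal probabilities m(i) all equal 1/(n + 1), which gives a quick sanity check that they sum to 1.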
□

The next example is another in which the true probabilities are unknown and must be estimated based upon experimental data.

Example 4.24 (Two-armed bandit problem) You are in a casino and confronted by two slot machines. Each machine pays off either 1 dollar or nothing. The probability that the first machine pays off a dollar is x and that the second machine pays off a dollar is y. We assume that x and y are random numbers chosen independently from the interval [0, 1] and unknown to you. You are permitted to make a series of ten plays, each time choosing one machine or the other. How should you choose to maximize the number of times that you win?

One strategy that sounds reasonable is to calculate, at every stage, the probability that each machine will pay off and choose the machine with the higher probability. Let win(i), for i = 1 or 2, be the number of times that you have won on the ith machine. Similarly, let lose(i) be the number of times you have lost on the ith machine. Then, from Example 4.23, the probability p(i) that you win if you choose the ith machine is

$$p(i) = \frac{\text{win}(i) + 1}{\text{win}(i) + \text{lose}(i) + 2}.$$

Thus, if p(1) > p(2) you would play machine 1 and otherwise you would play machine 2.

[Figure 4.10: Play the best machine.]

We have written a program TwoArm to simulate this experiment. In the program, the user specifies the initial values for x and y (but these are unknown to the experimenter). The program calculates at each stage the two conditional densities for x and y, given the outcomes of the previous trials, and then computes p(i), for i = 1, 2. It then chooses the machine with the highest value for the probability of winning for the next play. The program prints the machine chosen on each play and the outcome of this play.
It also plots the new densities for x (solid line) and y (dotted line), showing only the current densities. We have run the program for ten plays for the case x = .6 and y = .7. The result is shown in Figure 4.10.

The run of the program shows the weakness of this strategy. Our initial probability for winning on the better of the two machines is .7. We start with the poorer machine and our outcomes are such that we always have a probability greater than .6 of winning and so we just keep playing this machine even though the other machine is better. If we had lost on the first play we would have switched machines. Our final density for y is the same as our initial density, namely, the uniform density. Our final density for x is different and reflects a much more accurate knowledge about x. The computer did pretty well with this strategy, winning seven out of the ten trials, but ten trials are not enough to judge whether this is a good strategy in the long run.

Another popular strategy is the play-the-winner strategy. As the name suggests, for this strategy we choose the same machine when we win and switch machines when we lose. The program TwoArm will simulate this strategy as well. In Figure 4.11, we show the results of running this program with the play-the-winner strategy and the same true probabilities of .6 and .7 for the two machines. After ten plays our densities for the unknown probabilities of winning suggest to us that the second machine is indeed the better of the two. We again won seven out of the ten trials.

[Figure 4.11: Play the winner.]

Neither of the strategies that we simulated is the best one in terms of maximizing our average winnings. This best strategy is very complicated but is reasonably approximated by the play-the-winner strategy.
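The two strategies just described can be sketched in a few lines. This is not the book's TwoArm program (it plots nothing and its source is not reproduced here); it is a rough illustration under two labeled assumptions: the first play is on machine 1, and ties between p(1) and p(2) go to machine 1, which matches the run described above in which play begins on the first machine. The names `p_win`, `play`, and `average_wins` are hypothetical.

```python
import random


def p_win(wins, losses):
    """Estimated payoff probability from Example 4.23 with a uniform
    prior: (wins + 1) / (wins + losses + 2)."""
    return (wins + 1) / (wins + losses + 2)


def play(strategy, x, y, plays=10, rng=None):
    """Return the number of wins in `plays` plays.

    strategy: "best" (play the machine with the higher p_win) or
              "winner" (stay after a win, switch after a loss).
    x, y:     true payoff probabilities of machines 1 and 2.
    """
    if strategy not in ("best", "winner"):
        raise ValueError("strategy must be 'best' or 'winner'")
    rng = rng or random.Random()
    payoff = (x, y)
    wins, losses = [0, 0], [0, 0]
    machine = 0  # assumption: the first play is on machine 1 (index 0)
    total = 0
    for _ in range(plays):
        if strategy == "best":
            p1 = p_win(wins[0], losses[0])
            p2 = p_win(wins[1], losses[1])
            machine = 0 if p1 >= p2 else 1  # assumption: ties go to machine 1
        if rng.random() < payoff[machine]:
            wins[machine] += 1
            total += 1
        else:
            losses[machine] += 1
            if strategy == "winner":
                machine = 1 - machine  # switch machines after a loss
    return total


def average_wins(strategy, repetitions=1000, plays=10, seed=0):
    """Average wins when x and y are drawn uniformly from [0, 1] each time."""
    rng = random.Random(seed)
    total = 0
    for _ in range(repetitions):
        x, y = rng.random(), rng.random()
        total += play(strategy, x, y, plays, rng)
    return total / repetitions


if __name__ == "__main__":
    for name in ("best", "winner"):
        print(name, average_wins(name))
```

Averaging over many randomly drawn pairs (x, y), rather than a single run of ten plays, is what allows the two strategies to be compared in the long run.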
Variations on this example have played an important role in the problem of clinical tests of drugs where experimenters face a similar situation. □

Exercises

1 Pick a point x at random (with uniform density) in the interval [0, 1]. Find the probability that x > 1/2, given that
(a) x > 1/4.
(b) x < 3/4.
(c) |x − 1/2| < 1/4.
(d) x² − x + 2/9 < 0.

2 A radioactive material emits α-particles at a rate described by the density function $f(t) = .1e^{-.1t}$. Find the probability that a particle is emitted in the first 10 seconds, given that
(a) no particle is emitted in the first second.
(b) no particle is emitted in the first 5 seconds.
(c) a particle is emitted in the first 3 seconds.
(d) a particle is emitted in the first 20 seconds.

3 The Acme Super light bulb is known to have a useful life described by the density function $f(t) = .01e^{-.01t}$, where time t is measured in hours.
(a) Find the failure rate of this bulb (see Exercise 2.2.6).
(b) Find the reliability of this bulb after 20 hours.
(c) Given that it lasts 20 hours, find the probability that the bulb lasts another 20 hours.
(d) Find the probability that the bulb burns out in the forty-first hour, given that it lasts 40 hours.

4 Suppose you toss a dart at a circular target of radius 10 inches. Given that the dart lands in the upper half of the target, find the probability that
(a) it lands in the right half of the target.
(b) its distance from the center is less than 5 inches.
(c) its distance from the center is greater than 5 inches.
(d) it lands within 5 inches of the point (0, 5).

5 Suppose you choose two numbers x and y, independently at random from the interval [0, 1]. Given that their sum lies in the interval [0, 1], find the probability that
(a) |x − y| < 1.
(b) xy < 1/2.
(c) max{x, y} < 1/2.
(d) x² + y² < 1/4.
(e) x > y.

6 Find the conditional density functions for the following experiments.
(a) A number x is chosen at random in the interval [0, 1], given that x > 1/4.
(b) A number t is chosen at random in the interval [0, ∞) with exponential density $e^{-t}$, given that 1 < t < 10.
(c) A dart is thrown at a circular target of radius 10 inches, given that it falls in the upper half of the target.
(d) Two numbers x and y are chosen at random in the interval [0, 1], given that x > y.

7 Let x and y be chosen at random from the interval [0, 1]. Show that the events x > 1/3 and y > 2/3 are independent events.

8 Let x and y be chosen at random from the interval [0, 1]. Which pairs of the following events are independent?
(a) x > 1/3.
(b) y > 2/3.
(c) x > y.
(d) x + y < 1.

9 Suppose that X and Y are continuous random variables with density functions $f_X(x)$ and $f_Y(y)$, respectively. Let f(x, y) denote the joint density function of (X, Y). Show that

$$\int_{-\infty}^{\infty} f(x, y)\,dy = f_X(x), \quad \text{and} \quad \int_{-\infty}^{\infty} f(x, y)\,dx = f_Y(y).$$

*10 In Exercise 2.2.12 you proved the following: If you take a stick of unit length and break it into three pieces, choosing the breaks at random (i.e., choosing two real numbers independently and uniformly from [0, 1]), then the probability that the three pieces form a triangle is 1/4. Consider now a similar experiment: First break the stick at random, then break the longer piece at random. Show that the two experiments are actually quite different, as follows:
(a) Write a program which simulates both cases for a run of 1000 trials, prints out the proportion of successes for each run, and repeats this process ten times. (Call a trial a success if the three pieces do form a triangle.) Have your program pick (x, y) at random in the unit square, and in each case use x and y to find the two breaks. For each experiment, have it plot (x, y) if (x, y) gives a success.
(b) Show that in the second experiment the theoretical probability of success is actually 2 log 2 − 1.

11 A coin has an unknown bias p that is assumed to be uniformly distributed between 0 and 1.
The coin is tossed n times and heads turns up j times and tails turns up k times. We have seen that the probability that heads turns up next time is

$$\frac{j + 1}{n + 2}.$$

Show that this is the same as the probability that the next ball is black for the Polya urn model of Exercise 4.1.20. Use this result to explain why, in the Polya urn model, the proportion of black balls does not tend to 0 or 1 as one might expect but rather to a uniform distribution on the interval [0, 1].

12 Previous experience with a drug suggests that the probability p that the drug is effective is a random quantity having a beta density with parameters $\alpha = 2$ and $\beta = 3$. The drug is used on ten subjects and found to be successful in four out of the ten patients. What density should we now assign to the probability p? What is the probability that the drug will be successful the next time it is used?

13 Write a program to allow you to compare the strategies play-the-winner and play-the-best-machine for the two-armed bandit problem of Example 4.24. Have your program determine the initial payoff probabilities for each machine by choosing a pair of random numbers between 0 and 1. Have your program carry out 20 plays and keep track of the number of wins for each of the two strategies. Finally, have your program make 1000 repetitions of the 20 plays and compute the average winning per 20 plays. Which strategy seems to be the best? Repeat these simulations with 20 replaced by 100. Does your answer to the above question change?

14 Consider the two-armed bandit problem of Example 4.24. Bruce Barnes proposed the following strategy, which is a variation on the play-the-best-machine strategy.
The machine with the greatest probability of winning is played unless the following two conditions hold: (a) the difference in the probabilities for winning is less than .08, and (b) the ratio of the number of times played on the more often played machine to the number of times played on the less often played machine is greater than 1.4. If the above two conditions hold, then the machine with the smaller probability of winning is played. Write a program to simulate this strategy. Have your program choose the initial payoff probabilities at random from the unit interval [0, 1], make 20 plays, and keep track of the number of wins. Repeat this experiment 1000 times and obtain the average number of wins per 20 plays. Implement a second strategy (for example, play-the-best-machine or one of your own choice), and see how this second strategy compares with Bruce's on average wins.

4.3 Paradoxes

Much of this section is based on an article by Snell and Vanderbei.¹⁸

¹⁸ J. L. Snell and R. Vanderbei, "Three Bewitching Paradoxes," in Topics in Contemporary Probability and Its Applications, CRC Press, Boca Raton, 1995.

One must be very careful in dealing with problems involving conditional probability. The reader will recall that in the Monty Hall problem (Example 4.6), if the contestant chooses the door with the car behind it, then Monty has a choice of doors to open. We made an assumption that in this case, he will choose each door with probability 1/2. We then noted that if this assumption is changed, the answer to the original question changes. In this section, we will study other examples of the same phenomenon.

Example 4.25 Consider a family with two children. Given that one of the children is a boy, what is the probability that both children are boys?

One way to approach this problem is to say that the other child is equally likely to be a boy or a girl, so the probability that both children are boys is 1/2. The "textbook" solution would be to draw the tree diagram and then form the conditional tree by deleting paths to leave only those paths that are consistent with the given information. The result is shown in Figure 4.12. We see that the probability of two boys given a boy in the family is not 1/2 but rather 1/3. □

[Figure 4.12: Tree for Example 4.25. Columns: first child, second child, unconditional probability, conditional probability.]

This problem and others like it are discussed in Bar-Hillel and Falk.¹⁹ These authors stress that the answer to conditional probabilities of this kind can change depending upon how the information given was actually obtained. For example, they show that 1/2 is the correct answer for the following scenario.

¹⁹ M. Bar-Hillel and R. Falk, "Some teasers concerning conditional probabilities," Cognition, vol. 11 (1982), pgs. 109-122.

Example 4.26 Mr. Smith is the father of two. We meet him walking along the street with a young boy whom he proudly introduces as his son. What is the probability that Mr. Smith's other child is also a boy?

As usual we have to make some additional assumptions. For example, we will assume that if Mr. Smith has a boy and a girl, he is equally likely to choose either one to accompany him on his walk. In Figure 4.13 we show the tree analysis of this problem and we see that 1/2 is, indeed, the correct answer. □

[Figure 4.13: Tree for Example 4.26. Columns: Mr. Smith's children, walking with Mr. Smith, unconditional probability, conditional probability.]

Example 4.27 It is not so easy to think of reasonable scenarios that would lead to the classical 1/3 answer. An attempt was made by Stephen Geller in proposing this problem to Marilyn vos Savant.²⁰ Geller's problem is as follows: A shopkeeper says she has two new baby beagles to show you, but she doesn't know whether they're both male, both female, or one of each sex. You tell her that you want only a male, and she telephones the fellow who's giving them a bath. "Is at least one a male?" she asks. "Yes," she informs you with a smile. What is the probability that the other one is male?

The reader is asked to decide whether the model which gives an answer of 1/3 is a reasonable one to use in this case. □

²⁰ M. vos Savant, "Ask Marilyn," Parade Magazine, 9 September; 2 December; 17 February 1990, reprinted in Marilyn vos Savant, Ask Marilyn, St. Martins, New York, 1992.

In the preceding examples, the apparent paradoxes could easily be resolved by clearly stating the model that is being used and the assumptions that are being made. We now turn to some examples in which the paradoxes are not so easily resolved.

Example 4.28 Two envelopes each contain a certain amount of money. One envelope is given to Ali and the other to Baba and they are told that one envelope contains twice as much money as the other. However, neither knows who has the larger prize. Before anyone has opened their envelope, Ali is asked if she would like to trade her envelope with Baba. She reasons as follows: Assume that the amount in my envelope is x. If I switch, I will end up with x/2 with probability 1/2, and 2x with probability 1/2. If I were given the opportunity to play this game many times, and if I were to switch each time, I would, on average, get

$$\frac{1}{2}\cdot\frac{x}{2} + \frac{1}{2}\cdot 2x = \frac{5}{4}x.$$

This is greater than my average winnings if I didn't switch.

Of course, Baba is presented with the same opportunity and reasons in the same way to conclude that he too would like to switch. So they switch and each thinks that his/her net worth just went up by 25%.
Since neither has yet opened any envelope, this process can be repeated and so again they switch. Now they are back with their original envelopes and yet they think that their fortune has increased 25% twice. By this reasoning, they could convince themselves that by repeatedly switching the envelopes, they could become arbitrarily wealthy. Clearly, something is wrong with the above reasoning, but where is the mistake?

One of the tricks of making paradoxes is to make them slightly more difficult than is necessary to further befuddle us. As John Finn has suggested, in this paradox we could just as well have started with a simpler problem. Suppose Ali and Baba know that I am going to give them either an envelope with $5 or one with $10 and I am going to toss a coin to decide which to give to Ali, and then give the other to Baba. Then Ali can argue that Baba has 2x with probability 1/2 and x/2 with probability 1/2. This leads Ali to the same conclusion as before. But now it is clear that this is nonsense, since if Ali has the envelope containing $5, Baba cannot possibly have half of this, namely $2.50, since that was not even one of the choices. Similarly, if Ali has $10, Baba cannot have twice as much, namely $20. In fact, in this simpler problem the possible outcomes are given by the tree diagram in Figure 4.14. From the diagram, it is clear that neither is made better off by switching. □

[Figure 4.14: John Finn's version of Example 4.28. Columns: in Ali's envelope, in Baba's envelope.]

In the above example, Ali's reasoning is incorrect because she infers that if the amount in her envelope is x, then the probability that her envelope contains the smaller amount is 1/2, and the probability that her envelope contains the larger amount is also 1/2. In fact, these conditional probabilities depend upon the distribution of the amounts that are placed in the envelopes.
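Finn's two-outcome version is small enough to enumerate exactly. The sketch below lists the two equally likely outcomes and computes the average amounts for keeping and for switching, together with the conditional probability that Ali holds the smaller amount given what is in her envelope. The names `outcomes`, `average_if_keep`, `average_if_switch`, and `prob_smaller_given` are illustrative only.

```python
from fractions import Fraction

# Finn's version: a fair coin decides whether Ali receives $5 or $10,
# and Baba receives the other envelope.
# Each entry is (probability, Ali's amount, Baba's amount).
outcomes = [
    (Fraction(1, 2), 5, 10),
    (Fraction(1, 2), 10, 5),
]


def average_if_keep():
    """Ali's long-run average if she never switches."""
    return sum(p * ali for p, ali, baba in outcomes)


def average_if_switch():
    """Ali's long-run average if she always switches."""
    return sum(p * baba for p, ali, baba in outcomes)


def prob_smaller_given(amount):
    """P(Ali holds the smaller amount | her envelope contains `amount`)."""
    total = sum(p for p, ali, baba in outcomes if ali == amount)
    smaller = sum(p for p, ali, baba in outcomes if ali == amount and ali < baba)
    return smaller / total


if __name__ == "__main__":
    print(average_if_keep(), average_if_switch())  # 15/2 15/2
    print(prob_smaller_given(5))                   # 1
    print(prob_smaller_given(10))                  # 0
```

The two averages agree, and the conditional probabilities are 1 and 0 rather than 1/2 and 1/2, which is exactly where Ali's argument goes wrong.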
For definiteness, let X denote the positive integer-valued random variable which represents the smaller of the two amounts in the envelopes. Suppose, in addition, that we are given the distribution of X, i.e., for each positive integer x, we are given the value of

$$p_x = P(X = x).$$

(In Finn's example, $p_5 = 1$, and $p_n = 0$ for all other values of n.) Then it is easy to calculate the conditional probability that an envelope contains the smaller amount, given that it contains x dollars. The two possible sample points are (x, x/2) and (x, 2x). If x is odd, then the first sample point has probability 0, since x/2 is not an integer, so the desired conditional probability is 1 that x is the smaller amount. If x is even, then the two sample points have probabilities $p_{x/2}$ and $p_x$, respectively, so the conditional probability that x is the smaller amount is

$$\frac{p_x}{p_{x/2} + p_x},$$

which is not necessarily equal to 1/2.

Steven Brams and D. Marc Kilgour²¹ study the problem, for different distributions, of whether or not one should switch envelopes, if one's objective is to maximize the long-term average winnings. Let x be the amount in your envelope. They show that for any distribution of X, there is at least one value of x such that you should switch. They give an example of a distribution for which there is exactly one value of x such that you should switch (see Exercise 5). Perhaps the most interesting case is a distribution in which you should always switch. We now give this example.

²¹ S. J. Brams and D. M. Kilgour, "The Box Problem: To Switch or Not to Switch," Mathematics Magazine, vol. 68, no. 1 (1995), p. 29.

Example 4.29 Suppose that we have two envelopes in front of us, and that one envelope contains twice the amount of money as the other (both amounts are positive integers). We are given one of the envelopes, and asked if we would like to switch.

As above, we let X denote the smaller of the two amounts in the envelopes, and let

$$p_x = P(X = x).$$

We are now in a position where we can calculate the long-term average winnings, if we switch. (This long-term average is an example of a probabilistic concept known as expectation, and will be discussed in Chapter 6.) Given that one of the two sample points has occurred, the probability that it is the point (x, x/2) is

$$\frac{p_{x/2}}{p_{x/2} + p_x},$$

and the probability that it is the point (x, 2x) is

$$\frac{p_x}{p_{x/2} + p_x}.$$

Thus, if we switch, our long-term average winnings are

$$\frac{p_{x/2}}{p_{x/2} + p_x}\cdot\frac{x}{2} + \frac{p_x}{p_{x/2} + p_x}\cdot 2x.$$

If this is greater than x, then it pays in the long run for us to switch. Some routine algebra shows that the above expression is greater than x if and only if

$$\frac{p_{x/2}}{p_{x/2} + p_x} < \frac{2}{3}. \qquad (4.6)$$

It is interesting to consider whether there is a distribution on the positive integers such that the inequality 4.6 is true for all even values of x. Brams and Kilgour²² give the following example.

²² ibid.

We define $p_x$ as follows:

$$p_x = \begin{cases} \frac{1}{3}\left(\frac{2}{3}\right)^{k-1}, & \text{if } x = 2^k, \\ 0, & \text{otherwise.} \end{cases}$$

It is easy to calculate (see Exercise 4) that for all relevant values of x, we have

$$\frac{p_{x/2}}{p_{x/2} + p_x} = \frac{3}{5},$$

which means that the inequality 4.6 is always true. □

So far, we have been able to resolve paradoxes by clearly stating the assumptions being made and by precisely stating the models being used. We end this section by describing a paradox which we cannot resolve.

Example 4.30 Suppose that we have two envelopes in front of us, and we are told that the envelopes contain X and Y dollars, respectively, where X and Y are different positive integers. We randomly choose one of the envelopes, and we open it, revealing X, say. Is it possible to determine, with probability greater than 1/2, whether X is the smaller of the two dollar amounts?

Even if we have no knowledge of the joint distribution of X and Y, the surprising answer is yes! Here's how to do it.
Toss a fair coin until the first time that heads turns up. Let Z denote the number of tosses required plus 1/2. If Z > X, then we say that X is the smaller of the two amounts, and if Z < X, then we say that X is the larger of the two amounts.

First, if Z lies between X and Y, then we are sure to be correct. Since X and Y are unequal, Z lies between them with positive probability. Second, if Z is not between X and Y, then Z is either greater than both X and Y, or is less than both X and Y. In either case, X is the smaller of the two amounts with probability 1/2, by symmetry considerations (remember, we chose the envelope at random). Thus, the probability that we are correct is greater than 1/2. □

Exercises

1 One of the first conditional probability paradoxes was provided by Bertrand.²³ It is called the Box Paradox. A cabinet has three drawers. In the first drawer there are two gold balls, in the second drawer there are two silver balls, and in the third drawer there is one silver and one gold ball. A drawer is picked at random and a ball chosen at random from the two balls in the drawer. Given that a gold ball was drawn, what is the probability that the drawer with the two gold balls was chosen?

2 The following problem is called the two aces problem. This problem, dating back to 1936, has been attributed to the English mathematician J. H. C. Whitehead (see Gridgeman²⁴). This problem was also submitted to Marilyn vos Savant by the master of mathematical puzzles Martin Gardner, who remarks that it is one of his favorites.

A bridge hand has been dealt, i.e., thirteen cards are dealt to each player. Given that your partner has at least one ace, what is the probability that he has at least two aces? Given that your partner has the ace of hearts, what is the probability that he has at least two aces? Answer these questions for a version of bridge in which there are eight cards, namely four aces and four kings, and each player is dealt two cards.
(The reader may wish to solve the problem with a 52-card deck.)

3 In the preceding exercise, it is natural to ask "How do we get the information that the given hand has an ace?" Gridgeman considers two different ways that we might get this information. (Again, assume the deck consists of eight cards.)
(a) Assume that the person holding the hand is asked to "Name an ace in your hand" and answers "The ace of hearts." What is the probability that he has a second ace?
(b) Suppose the person holding the hand is asked the more direct question "Do you have the ace of hearts?" and the answer is yes. What is the probability that he has a second ace?

4 Using the notation introduced in Example 4.29, show that in the example of Brams and Kilgour, if x is a positive power of 2, then

$$\frac{p_{x/2}}{p_{x/2} + p_x} = \frac{3}{5}.$$

5 Using the notation introduced in Example 4.29, let

$$p_x = \begin{cases} \ldots, & \text{if } x = 2^k, \\ 0, & \text{otherwise.} \end{cases}$$

Show that there is exactly one value of x such that if your envelope contains x, then you should switch.

*6 (For bridge players only. From Sutherland.²⁵) Suppose that we are the declarer in a hand of bridge, and we have the king, 9, 8, 7, and 2 of a certain suit, while the dummy has the ace, 10, 5, and 4 of the same suit. Suppose that we want to play this suit in such a way as to maximize the probability of having no losers in the suit. We begin by leading the 2 to the ace, and we note that the queen drops on our left. We then lead the 10 from the dummy, and our right-hand opponent plays the six (after playing the three on the first round). Should we finesse or play for the drop?

²³ J. Bertrand, Calcul des Probabilités, Gauthier-Villars, 1888.
²⁴ N. T. Gridgeman, Letter, American Statistician, 21 (1967), pgs. 38-39.
²⁵ E. Sutherland, "Restricted Choice Fact or Fiction?", Canadian Master Point, November 1, 1993.
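Returning to the guessing rule of Example 4.30, its success probability can be computed exactly for any particular pair of amounts, and cross-checked by simulation. The sketch below is an illustration only: it uses the fact, implicit in the example, that the number of tosses T satisfies P(T = k) = 1/2^k, so that Z = T + 1/2 falls strictly between two integers lo < hi exactly when lo ≤ T ≤ hi − 1. The function names `prob_correct` and `simulate` are hypothetical.

```python
import random
from fractions import Fraction


def prob_correct(x, y):
    """Exact probability that the rule of Example 4.30 guesses correctly
    when the envelopes hold the distinct positive integers x and y."""
    lo, hi = min(x, y), max(x, y)
    # P(Z lies between lo and hi) = sum of 1/2**k for k = lo, ..., hi - 1.
    between = sum(Fraction(1, 2 ** k) for k in range(lo, hi))
    # Certain to be right if Z is between; otherwise right half the time.
    return between + (1 - between) / 2


def simulate(x, y, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        # Open one of the two envelopes at random.
        opened, other = (x, y) if rng.random() < 0.5 else (y, x)
        tosses = 1
        while rng.random() < 0.5:  # tails: toss again
            tosses += 1
        z = tosses + 0.5
        guess_smaller = z > opened
        if guess_smaller == (opened < other):
            correct += 1
    return correct / trials


if __name__ == "__main__":
    print(prob_correct(1, 2))   # 3/4
    print(prob_correct(3, 4))   # 9/16
    print(simulate(1, 2))
```

The advantage over 1/2 is always positive but shrinks rapidly as the amounts grow, since Z rarely reaches large values.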
Chapter 5

Important Distributions and Densities

5.1 Important Distributions

In this chapter, we describe the discrete probability distributions and the continuous probability densities that occur most often in the analysis of experiments. We will also show how one simulates these distributions and densities on a computer.

Discrete Uniform Distribution

In Chapter 1, we saw that in many cases, we assume that all outcomes of an experiment are equally likely. If X is a random variable which represents the outcome of an experiment of this type, then we say that X is uniformly distributed. If the sample space S is of size n, where $0 < n < \infty$, then the distribution function $m(\omega)$ is defined to be $1/n$ for all $\omega \in S$. As is the case with all of the discrete probability distributions discussed in this chapter, this experiment can be simulated on a computer using the program GeneralSimulation. However, in this case, a faster algorithm can be used instead. (This algorithm was described in Chapter 1; we repeat the description here for completeness.) The expression

$$1 + \lfloor n\,(\mathit{rnd}) \rfloor$$

takes on as a value each integer between 1 and n with probability $1/n$ (the notation $\lfloor x \rfloor$ denotes the greatest integer not exceeding x). Thus, if the possible outcomes of the experiment are labelled $\omega_1, \omega_2, \ldots, \omega_n$, then we use the above expression to represent the subscript of the output of the experiment.

If the sample space is a countably infinite set, such as the set of positive integers, then it is not possible to have an experiment which is uniform on this set (see Exercise 3). If the sample space is an uncountable set, with positive, finite length, such as the interval [0, 1], then we use continuous density functions (see Section 5.2).

Binomial Distribution

The binomial distribution with parameters n, p, and k was defined in Chapter 3.
It is the distribution of the random variable which counts the number of heads which occur when a coin is tossed n times, assuming that on any one toss, the probability that a head occurs is p. The distribution function is given by the formula

$$b(n, p, k) = \binom{n}{k} p^k q^{n-k},$$

where $q = 1 - p$.

One straightforward way to simulate a binomial random variable X is to compute the sum of n independent 0-1 random variables, each of which takes on the value 1 with probability p. This method requires n calls to a random number generator to obtain one value of the random variable. When n is relatively large (say at least 30), the Central Limit Theorem (see Chapter 9) implies that the binomial distribution is well-approximated by the corresponding normal density function (which is defined in Section 5.2) with parameters $\mu = np$ and $\sigma = \sqrt{npq}$. Thus, in this case we can compute a value Y of a normal random variable with these parameters, and if $-1/2 \le Y < n + 1/2$, we can use the value $\lfloor Y + 1/2 \rfloor$ to represent the random variable X. If $Y < -1/2$ or $Y \ge n + 1/2$, we reject Y and compute another value. We will see in the next section how we can quickly simulate normal random variables.

Geometric Distribution

Consider a Bernoulli trials process continued for an infinite number of trials; for example, a coin tossed an infinite sequence of times. We showed in Section 2.2 how to assign a probability distribution to the infinite tree. Thus, we can determine the distribution for any random variable X relating to the experiment provided $P(X = a)$ can be computed in terms of a finite number of trials. For example, let T be the number of trials up to and including the first success. Then

$$P(T = 1) = p, \qquad P(T = 2) = qp, \qquad P(T = 3) = q^2 p,$$

and in general,

$$P(T = n) = q^{n-1} p.$$

Figure 5.1: Geometric distributions.

To show that this is a distribution, we must show that

$$p + qp + q^2 p + \cdots = 1.$$
The left-hand expression is just a geometric series with first term p and common ratio q, so its sum is

$$\frac{p}{1 - q},$$

which equals 1.

In Figure 5.1 we have plotted this distribution using the program GeometricPlot for the cases p = .5 and p = .2. We see that as p decreases we are more likely to get large values for T, as would be expected. In both cases, the most probable value for T is 1. This will always be true since

$$\frac{P(T = j + 1)}{P(T = j)} = q < 1.$$

To simulate the geometric distribution with parameter p, define Y to be the smallest integer satisfying the inequality

$$1 - q^Y \ge \mathit{rnd}. \qquad (5.1)$$

Then we have

$$P(Y = j) = P\left(1 - q^j \ge \mathit{rnd} > 1 - q^{j-1}\right) = q^{j-1} - q^j = q^{j-1}(1 - q) = q^{j-1} p.$$

Thus, Y is geometrically distributed with parameter p. To generate Y, all we have to do is solve Equation 5.1 for Y. We obtain

$$Y = \left\lceil \frac{\log(1 - \mathit{rnd})}{\log q} \right\rceil,$$

where the notation $\lceil x \rceil$ means the least integer which is greater than or equal to x. Since $\log(1 - \mathit{rnd})$ and $\log(\mathit{rnd})$ are identically distributed, Y can also be generated using the equation

$$Y = \left\lceil \frac{\log \mathit{rnd}}{\log q} \right\rceil.$$

Example 5.1 The geometric distribution plays an important role in the theory of queues, or waiting lines. For example, suppose a line of customers waits for service at a counter. It is often assumed that, in each small time unit, either 0 or 1 new customers arrive at the counter. The probability that a customer arrives is p and that no customer arrives is $q = 1 - p$. Then the time T until the next arrival has a geometric distribution. It is natural to ask for the probability that no customer arrives in the next k time units, that is, for $P(T > k)$. This is given by

$$P(T > k) = \sum_{j=k+1}^{\infty} q^{j-1} p = q^k (p + qp + q^2 p + \cdots) = q^k.$$

This probability can also be found by noting that we are asking for no successes (i.e., arrivals) in a sequence of k consecutive time units, where the probability of a success in any one time unit is p. Thus, the probability is just $q^k$, since arrivals in any two time units are independent events.

It is often assumed that the length of time required to service a customer also has a geometric distribution but with a different value for p.
This implies a rather special property of the service time. To see this, let us compute the conditional probability

$$P(T > r + s \mid T > r) = \frac{P(T > r + s)}{P(T > r)} = \frac{q^{r+s}}{q^r} = q^s.$$

Thus, the probability that the customer's service takes s more time units is independent of the length of time r that the customer has already been served. Because of this interpretation, this property is called the "memoryless" property, and is also obeyed by the exponential distribution. (Fortunately, not too many service stations have this property.) □

Negative Binomial Distribution

Suppose we are given a coin which has probability p of coming up heads when it is tossed. We fix a positive integer k, and toss the coin until the kth head appears. We let X represent the number of tosses. When k = 1, X is geometrically distributed. For a general k, we say that X has a negative binomial distribution. We now calculate the probability distribution of X. If X = x, then it must be true that there were exactly k - 1 heads thrown in the first x - 1 tosses, and a head must have been thrown on the xth toss. There are

$$\binom{x-1}{k-1}$$

sequences of length x with these properties, and each of them is assigned the same probability, namely

$$p^k q^{x-k}.$$

Therefore, if we define $u(x, k, p) = P(X = x)$, then

$$u(x, k, p) = \binom{x-1}{k-1} p^k q^{x-k}.$$

One can simulate this on a computer by simulating the tossing of a coin. The following algorithm is, in general, much faster. We note that X can be understood as the sum of k outcomes of a geometrically distributed experiment with parameter p. Thus, we can use the following sum as a means of generating X:

$$\sum_{j=1}^{k} \left\lceil \frac{\log \mathit{rnd}_j}{\log q} \right\rceil.$$

Example 5.2 A fair coin is tossed until the second time a head turns up. The distribution for the number of tosses is $u(x, 2, p)$ with p = 1/2. Thus the probability that x tosses are needed to obtain two heads is found by letting k = 2 in the above formula. We obtain

$$u(x, 2, 1/2) = \binom{x-1}{1} \frac{1}{2^x}$$

for x = 2, 3, ....
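The sum-of-geometrics generator described just before Example 5.2 can be sketched in a few lines, and the fair-coin case gives a quick check against the formula for u(x, 2, 1/2). This is a minimal illustration, not one of the book's programs: the function names `geometric`, `negative_binomial`, and `u` are hypothetical, Python's `random.random()` stands in for rnd, and 0 < p < 1 is assumed.

```python
import math
import random


def geometric(p, rng=random):
    """One geometric variate: trials up to and including the first success.

    Assumes 0 < p < 1. Uses Y = ceil(log(rnd) / log(q)) from the text.
    """
    q = 1.0 - p
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1]
    # and the logarithm is always defined.
    u_rand = 1.0 - rng.random()
    # max(1, ...) guards the rare case u_rand == 1, where the ceiling is 0.
    return max(1, math.ceil(math.log(u_rand) / math.log(q)))


def negative_binomial(k, p, rng=random):
    """Number of tosses needed for the k-th head, as a sum of k geometrics."""
    return sum(geometric(p, rng) for _ in range(k))


def u(x, k, p):
    """Exact probability that the k-th head occurs on toss x."""
    return math.comb(x - 1, k - 1) * p**k * (1.0 - p) ** (x - k)


if __name__ == "__main__":
    rng = random.Random(2006)  # fixed seed, chosen arbitrarily
    trials = 20_000
    counts = {}
    for _ in range(trials):
        x = negative_binomial(2, 0.5, rng)
        counts[x] = counts.get(x, 0) + 1
    # Compare simulated frequencies with u(x, 2, 1/2) = (x - 1) / 2^x.
    for x in range(2, 8):
        print(x, counts.get(x, 0) / trials, u(x, 2, 0.5))
```

Each variate costs k logarithms regardless of how small p is, whereas simulating the coin tosses directly takes about k/p steps on average.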
In Figure 5.2 we give a graph of the distribution for k = 2 and p = .25. Note that the distribution is quite asymmetric, with a long tail reflecting the fact that large values of x are possible. □

Figure 5.2: Negative binomial distribution with k = 2 and p = .25.

Poisson Distribution

The Poisson distribution arises in many situations. It is safe to say that it is one of the three most important discrete probability distributions (the other two being the uniform and the binomial distributions). The Poisson distribution can be viewed as arising from the binomial distribution or from the exponential density. We shall now explain its connection with the former; its connection with the latter will be explained in the next section.

Suppose that we have a situation in which a certain kind of occurrence happens at random over a period of time. For example, the occurrences that we are interested in might be incoming telephone calls to a police station in a large city. We want to model this situation so that we can consider the probabilities of events such as more than 10 phone calls occurring in a 5-minute time interval. Presumably, in our example, there would be more incoming calls between 6:00 and 7:00 P.M. than between 4:00 and 5:00 A.M., and this fact would certainly affect the above probability. Thus, to have a hope of computing such probabilities, we must assume that the average rate, i.e., the average number of occurrences per minute, is a constant. This rate we will denote by $\lambda$. (Thus, in a given 5-minute time interval, we would expect about $5\lambda$ occurrences.) This means that if we were to apply our model to the two time periods given above, we would simply use different rates for the two time periods, thereby obtaining two different probabilities for the given event.

Our next assumption is that the numbers of occurrences in two non-overlapping time intervals are independent.
In our example, this means that the events that there are j calls between 5:00 and 5:15 P.M. and k calls between 6:00 and 6:15 P.M. on the same day are independent.

We can use the binomial distribution to model this situation. We imagine that a given time interval is broken up into n subintervals of equal length. If the subintervals are sufficiently short, we can assume that two or more occurrences happen in one subinterval with a probability which is negligible in comparison with the probability of at most one occurrence. Thus, in each subinterval, we are assuming that there is either 0 or 1 occurrence. This means that the sequence of subintervals can be thought of as a sequence of Bernoulli trials, with a success corresponding to an occurrence in the subinterval.

To decide upon the proper value of p, the probability of an occurrence in a given subinterval, we reason as follows. On the average, there are $\lambda t$ occurrences in a time interval of length t. If this time interval is divided into n subintervals, then we would expect, using the Bernoulli trials interpretation, that there should be np occurrences. Thus, we want

$$\lambda t = np,$$

so

$$p = \frac{\lambda t}{n}.$$

We now wish to consider the random variable X, which counts the number of occurrences in a given time interval. We want to calculate the distribution of X. For ease of calculation, we will assume that the time interval is of length 1; for time intervals of arbitrary length t, see Exercise 11. We know that

$$P(X = 0) = b(n, p, 0) = (1 - p)^n = \left(1 - \frac{\lambda}{n}\right)^n.$$

For large n, this is approximately $e^{-\lambda}$. It is easy to calculate that for any fixed k, we have

$$\frac{b(n, p, k)}{b(n, p, k - 1)} = \frac{\lambda - (k - 1)p}{kq},$$

which, for large n (and therefore small p) is approximately $\lambda/k$. Thus, we have

$$P(X = 1) \approx \lambda e^{-\lambda},$$

and in general,

$$P(X = k) \approx \frac{\lambda^k}{k!} e^{-\lambda}. \qquad (5.2)$$

The above distribution is the Poisson distribution.
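For readers who want the ratio step spelled out, the following is a short worked version. It uses only the binomial formula, the relation $np = \lambda$ (the time interval has length 1), and $q = 1 - p$; nothing else is assumed.

$$\frac{b(n, p, k)}{b(n, p, k-1)} = \frac{\binom{n}{k} p^k q^{n-k}}{\binom{n}{k-1} p^{k-1} q^{n-k+1}} = \frac{n - k + 1}{k} \cdot \frac{p}{q} = \frac{np - (k-1)p}{kq} = \frac{\lambda - (k-1)p}{kq}.$$

As n grows with $\lambda$ fixed, $p = \lambda/n \to 0$ and $q \to 1$, so the ratio tends to $\lambda/k$. Applying this k times, starting from $P(X = 0) \approx e^{-\lambda}$, gives

$$P(X = k) \approx \frac{\lambda}{k} \cdot \frac{\lambda}{k-1} \cdots \frac{\lambda}{1} \cdot e^{-\lambda} = \frac{\lambda^k}{k!} e^{-\lambda}.$$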
We note that it must be checked that the distribution given in Equation 5.2 really is a distribution, i.e., that its values are non-negative and sum to 1. (See Exercise 12.)

The Poisson distribution is used as an approximation to the binomial distribution when the parameters n and p are large and small, respectively (see Examples 5.3 and 5.4). However, the Poisson distribution also arises in situations where it may not be easy to interpret or measure the parameters n and p (see Example 5.5).

Example 5.3 A typesetter makes, on the average, one mistake per 1000 words. Assume that he is setting a book with 100 words to a page. Let $S_{100}$ be the number of mistakes that he makes on a single page. Then the exact probability distribution for $S_{100}$ would be obtained by considering $S_{100}$ as a result of 100 Bernoulli trials with p = 1/1000. The expected value of $S_{100}$ is $\lambda = 100(1/1000) = .1$. The exact probability that $S_{100} = j$ is $b(100, 1/1000, j)$, and the Poisson approximation is

$$\frac{e^{-.1} (.1)^j}{j!}.$$

In Table 5.1 we give, for various values of n and p, the exact values computed by the binomial distribution and the Poisson approximation. □

       Poisson   Binomial     Poisson   Binomial     Poisson   Binomial
                 n = 100                n = 100                n = 1000
  j    λ = .1    p = .001     λ = 1     p = .01      λ = 10    p = .01
  0    .9048     .9048        .3679     .3660        .0000     .0000
  1    .0905     .0905        .3679     .3697        .0005     .0004
  2    .0045     .0045        .1839     .1849        .0023     .0022
  3    .0002     .0002        .0613     .0610        .0076     .0074
  4    .0000     .0000        .0153     .0149        .0189     .0186
  5                           .0031     .0029        .0378     .0374
  6                           .0005     .0005        .0631     .0627
  7                           .0001     .0001        .0901     .0900
  8                           .0000     .0000        .1126     .1128
  9                                                  .1251     .1256
 10                                                  .1251     .1257
 11                                                  .1137     .1143
 12                                                  .0948     .0952
 13                                                  .0729     .0731
 14                                                  .0521     .0520
 15                                                  .0347     .0345
 16                                                  .0217     .0215
 17                                                  .0128     .0126
 18                                                  .0071     .0069
 19                                                  .0037     .0036
 20                                                  .0019     .0018
 21                                                  .0009     .0009
 22                                                  .0004     .0004
 23                                                  .0002     .0002
 24                                                  .0001     .0001
 25                                                  .0000     .0000

Table 5.1: Poisson approximation to the binomial distribution.
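A comparison of this kind is easy to recompute. The sketch below evaluates the binomial probabilities and the Poisson approximation with $\lambda = np$ side by side. The helper names are hypothetical, and the three (n, p) pairs are assumptions chosen to match the three column pairs of Table 5.1 (they are inferred from the tabulated values, for example $e^{-.1} \approx .9048$ and $.99^{100} \approx .3660$).

```python
import math


def binomial_pmf(n, p, j):
    """Exact binomial probability b(n, p, j)."""
    return math.comb(n, j) * p**j * (1.0 - p) ** (n - j)


def poisson_pmf(lam, j):
    """Poisson probability e^(-lam) * lam^j / j!."""
    return math.exp(-lam) * lam**j / math.factorial(j)


def comparison_rows(n, p, j_max):
    """Rows (j, Poisson value, binomial value) for j = 0, ..., j_max."""
    lam = n * p
    return [(j, poisson_pmf(lam, j), binomial_pmf(n, p, j))
            for j in range(j_max + 1)]


if __name__ == "__main__":
    # Assumed parameter choices, one per column pair of the table.
    for n, p, j_max in [(100, 0.001, 4), (100, 0.01, 8), (1000, 0.01, 25)]:
        print(f"n = {n}, p = {p}, lambda = {n * p:g}")
        for j, pois, binom in comparison_rows(n, p, j_max):
            print(f"  {j:2d}  {pois:.4f}  {binom:.4f}")
```

The output is rounded to four places for easy comparison with the table; the agreement is closest where p is smallest.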
Example 5.4 In his book,[1] Feller discusses the statistics of flying bomb hits in the south of London during the Second World War.

Assume that you live in a district of size 10 blocks by 10 blocks so that the total district is divided into 100 small squares. How likely is it that the square in which you live will receive no hits if the total area is hit by 400 bombs?

We assume that a particular bomb will hit your square with probability 1/100. Since there are 400 bombs, we can regard the number of hits that your square receives as the number of successes in a Bernoulli trials process with n = 400 and p = 1/100. Thus we can use the Poisson distribution with $\lambda = 400 \cdot 1/100 = 4$ to approximate the probability that your square will receive j hits. This probability is $p(j) = e^{-4} 4^j / j!$. The expected number of squares that receive exactly j hits is then $100 \cdot p(j)$. It is easy to write a program LondonBombs to simulate this situation and compare the expected number of squares with j hits with the observed number. In Exercise 26 you are asked to compare the actual observed data with that predicted by the Poisson distribution.

In Figure 5.3, we have shown the simulated hits, together with a spike graph showing both the observed and predicted frequencies. The observed frequencies are shown as squares, and the predicted frequencies are shown as dots. □

If the reader would rather not consider flying bombs, he is invited to instead consider an analogous situation involving cookies and raisins. We assume that we have made enough cookie dough for 500 cookies. We put 600 raisins in the dough, and mix it thoroughly. One way to look at this situation is that we have 500 cookies, and after placing the cookies in a grid on the table, we throw 600 raisins at the cookies. (See Exercise 22.)

Example 5.5 Suppose that in a certain fixed amount A of blood, the average human has 40 white blood cells.
Let X be the random variable which gives the number of white blood cells in a random sample of size A from a random individual. We can think of X as binomially distributed with each white blood cell in the body representing a trial. If a given white blood cell turns up in the sample, then the trial corresponding to that blood cell was a success. Then p should be taken as the ratio of A to the total amount of blood in the individual, and n will be the number of white blood cells in the individual. Of course, in practice, neither of these parameters is very easy to measure accurately, but presumably the number 40 is easy to measure. But for the average human, we then have 40 = np, so we can think of X as being Poisson distributed, with parameter $\lambda = 40$. In this case, it is easier to model the situation using the Poisson distribution than the binomial distribution.

To simulate a Poisson random variable on a computer, a good way is to take advantage of the relationship between the Poisson distribution and the exponential density. This relationship and the resulting simulation algorithm will be described in the next section.

[1] ibid., p. 161.

Figure 5.3: Flying bomb hits.

Hypergeometric Distribution

Suppose that we have a set of N balls, of which k are red and N - k are blue. We choose n of these balls, without replacement, and define X to be the number of red balls in our sample. The distribution of X is called the hypergeometric distribution. We note that this distribution depends upon three parameters, namely N, k, and n. There does not seem to be a standard notation for this distribution; we will use the notation $h(N, k, n, x)$ to denote $P(X = x)$.
This probability can be found by noting that there are

$$\binom{N}{n}$$

different samples of size n, and the number of such samples with exactly x red balls is obtained by multiplying the number of ways of choosing x red balls from the set of k red balls and the number of ways of choosing n - x blue balls from the set of N - k blue balls. Hence, we have

$$h(N, k, n, x) = \frac{\binom{k}{x} \binom{N-k}{n-x}}{\binom{N}{n}}.$$

This distribution can be generalized to the case where there are more than two types of objects. (See Exercise 40.)

If we let N and k tend to $\infty$, in such a way that the ratio k/N remains fixed, then the hypergeometric distribution tends to the binomial distribution with parameters n and p = k/N. This is reasonable because if N and k are much larger than n, then whether we choose our sample with or without replacement should not affect the probabilities very much, and the experiment consisting of choosing with replacement yields a binomially distributed random variable (see Exercise 44).

An example of how this distribution might be used is given in Exercises 36 and 37. We now give another example involving the hypergeometric distribution. It illustrates a statistical test called Fisher's Exact Test.

Example 5.6 It is often of interest to consider two traits, such as eye color and hair color, and to ask whether there is an association between the two traits. Two traits are associated if knowing the value of one of the traits for a given person allows us to predict the value of the other trait for that person. The stronger the association, the more accurate the predictions become. If there is no association between the traits, then we say that the traits are independent. In this example, we will use the traits of gender and political party, and we will assume that there are only two possible genders, female and male, and only two possible political parties, Democratic and Republican.

Suppose that we have collected data concerning these traits.
To test whether there is an association between the traits, we first assume that there is no association between the two traits. This gives rise to an "expected" data set, in which knowledge of the value of one trait is of no help in predicting the value of the other trait. Our collected data set usually differs from this expected data set. If it differs by quite a bit, then we would tend to reject the assumption of independence of the traits. To nail down what is meant by "quite a bit," we decide which possible data sets differ from the expected data set by at least as much as ours does, and then we compute the probability that any of these data sets would occur under the assumption of independence of traits. If this probability is small, then it is unlikely that the difference between our collected data set and the expected data set is due entirely to chance.

Suppose that we have collected the data shown in Table 5.2.

            Democrat   Republican
  Female       24           4          28
  Male          8          14          22
               32          18          50

Table 5.2: Observed data.

The row and column sums are called marginal totals, or marginals. In what follows, we will denote the row sums by $t_{11}$ and $t_{12}$, and the column sums by $t_{21}$ and $t_{22}$. The ijth entry in the table will be denoted by $s_{ij}$. Finally, the size of the data set will be denoted by n. Thus, a general data table will look as shown in Table 5.3.

            Democrat   Republican
  Female      s11         s12          t11
  Male        s21         s22          t12
              t21         t22          n

Table 5.3: General data table.

We now explain the model which will be used to construct the "expected" data set. In the model, we assume that the two traits are independent. We then put $t_{21}$ yellow balls and $t_{22}$ green balls, corresponding to the Democratic and Republican marginals, into an urn. We draw $t_{11}$ balls, without replacement, from the urn, and call these balls females. The $t_{12}$ balls remaining in the urn are called males.
In the specific case under consideration, the probability of getting the actual data under this model is given by the expression

$$\frac{\binom{32}{24} \binom{18}{4}}{\binom{50}{28}},$$

i.e., a value of the hypergeometric distribution.

We are now ready to construct the expected data set. If we choose 28 balls out of 50, we should expect to see, on the average, the same percentage of yellow balls in our sample as in the urn. Thus, we should expect to see, on the average, $28(32/50) = 17.92 \approx 18$ yellow balls in our sample. (See Exercise 36.) The other expected values are computed in exactly the same way. Thus, the expected data set is shown in Table 5.4.

            Democrat   Republican
  Female       18          10          28
  Male         14           8          22
               32          18          50

Table 5.4: Expected data.

We note that the value of $s_{11}$ determines the other three values in the table, since the marginals are all fixed. Thus, in considering the possible data sets that could appear in this model, it is enough to consider the various possible values of $s_{11}$. In the specific case at hand, what is the probability of drawing exactly a yellow balls, i.e., what is the probability that $s_{11} = a$? It is

$$\frac{\binom{32}{a} \binom{18}{28-a}}{\binom{50}{28}}. \qquad (5.3)$$

We are now ready to decide whether our actual data differs from the expected data set by an amount which is greater than could be reasonably attributed to chance alone. We note that the expected number of female Democrats is 18, but the actual number in our data is 24. The other data sets which differ from the expected data set by more than ours correspond to those where the number of female Democrats equals 25, 26, 27, or 28. Thus, to obtain the required probability, we sum the expression in (5.3) from a = 24 to a = 28. We obtain a value of .000395. Thus, we should reject the hypothesis that the two traits are independent. □

Finally, we turn to the question of how to simulate a hypergeometric random variable X. Let us assume that the parameters for X are N, k, and n. We imagine that we have a set of N balls, labelled from 1 to N.
We decree that the first k of these balls are red, and the rest are blue. Suppose that we have chosen m balls, and that j of them are red. Then there are k - j red balls left, and N - m balls left. Thus, our next choice will be red with probability

$$\frac{k - j}{N - m}.$$

So at this stage, we choose a random number in [0, 1], and report that a red ball has been chosen if and only if the random number does not exceed the above expression. Then we update the values of m and j, and continue until n balls have been chosen.

Benford Distribution

Our next example of a distribution comes from the study of leading digits in data sets. It turns out that many data sets that occur "in real life" have the property that the first digits of the data are not uniformly distributed over the set {1, 2, ..., 9}. Rather, it appears that the digit 1 is most likely to occur, and that the distribution is monotonically decreasing on the set of possible digits. The Benford distribution appears, in many cases, to fit such data. Many explanations have been given for the occurrence of this distribution. Possibly the most convincing explanation is that this distribution is the only one that is invariant under a change of scale. If one thinks of certain data sets as somehow "naturally occurring," then the distribution should be unaffected by which units are chosen in which to represent the data, i.e., the distribution should be invariant under change of scale.

Figure 5.4: Leading digits in President Clinton's tax returns.

Theodore Hill[2] gives a general description of the Benford distribution, when one considers the first d digits of integers in a data set. We will restrict our attention to the first digit. In this case, the Benford distribution has distribution function

$$f(k) = \log_{10}(k + 1) - \log_{10}(k)$$

for $1 \le k \le 9$.

... 3). (c) Find $P(T > 6 \mid T > 3)$.
8 If a coin is tossed a sequence of times, what is the probability that the first head will occur after the fifth toss, given that it has not occurred in the first two tosses?

9 A worker for the Department of Fish and Game is assigned the job of estimating the number of trout in a certain lake of modest size. She proceeds as follows: She catches 100 trout, tags each of them, and puts them back in the lake. One month later, she catches 100 more trout, and notes that 10 of them have tags.
(a) Without doing any fancy calculations, give a rough estimate of the number of trout in the lake.
(b) Let N be the number of trout in the lake. Find an expression, in terms of N, for the probability that the worker would catch 10 tagged trout out of the 100 trout that she caught the second time.
(c) Find the value of N which maximizes the expression in part (b). This value is called the maximum likelihood estimate for the unknown quantity N. Hint: Consider the ratio of the expressions for successive values of N.

10 A census in the United States is an attempt to count everyone in the country. It is inevitable that many people are not counted. The U.S. Census Bureau proposed a way to estimate the number of people who were not counted by the latest census. Their proposal was as follows: In a given locality, let N denote the actual number of people who live there. Assume that the census counted $n_1$ people living in this area. Now, another census was taken in the locality, and $n_2$ people were counted. In addition, $n_{12}$ people were counted both times.
(a) Given N, $n_1$, and $n_2$, let X denote the number of people counted both times. Find the probability that X = k, where k is a fixed integer between 0 and $n_2$.
(b) Now assume that $X = n_{12}$. Find the value of N which maximizes the expression in part (a). Hint: Consider the ratio of the expressions for successive values of N.
11 Suppose that X is a random variable which represents the number of calls coming in to a police station in a one-minute interval. In the text, we showed that X could be modelled using a Poisson distribution with parameter $\lambda$, where this parameter represents the average number of incoming calls per minute. Now suppose that Y is a random variable which represents the number of incoming calls in an interval of length t. Show that the distribution of Y is given by

$$P(Y = k) = e^{-\lambda t} \frac{(\lambda t)^k}{k!},$$

i.e., Y is Poisson with parameter $\lambda t$. Hint: Suppose a Martian were to observe the police station. Let us also assume that the basic time interval used on Mars is exactly t Earth minutes. Finally, we will assume that the Martian understands the derivation of the Poisson distribution in the text. What would she write down for the distribution of Y?

12 Show that the values of the Poisson distribution given in Equation 5.2 sum to 1.

13 The Poisson distribution with parameter $\lambda = .3$ has been assigned for the outcome of an experiment. Let X be the outcome function. Find P(X = 0), P(X = 1), and P(X > 1).

14 On the average, only 1 person in 1000 has a particular rare blood type.
(a) Find the probability that, in a city of 10,000 people, no one has this blood type.
(b) How many people would have to be tested to give a probability greater than 1/2 of finding at least one person with this blood type?

15 Write a program for the user to input n, p, j and have the program print out the exact value of b(n, p, j) and the Poisson approximation to this value.

16 Assume that, during each second, a Dartmouth switchboard receives one call with probability .01 and no calls with probability .99. Use the Poisson approximation to estimate the probability that the operator will miss at most one call if she takes a 5-minute coffee break.

17 The probability of a royal flush in a poker hand is p = 1/649,740.
How large must n be to render the probability of having no royal flush in n hands smaller than 1/e?

18 A baker blends 600 raisins and 400 chocolate chips into a dough mix and, from this, makes 500 cookies.
(a) Find the probability that a randomly picked cookie will have no raisins.
(b) Find the probability that a randomly picked cookie will have exactly two chocolate chips.
(c) Find the probability that a randomly chosen cookie will have at least two bits (raisins or chips) in it.

19 The probability that, in a bridge deal, one of the four hands has all hearts is approximately $6.3 \times 10^{-12}$. In a city with about 50,000 bridge players the resident probability expert is called on the average once a year (usually late at night) and told that the caller has just been dealt a hand of all hearts. Should she suspect that some of these callers are the victims of practical jokes?

20 An advertiser drops 10,000 leaflets on a city which has 2000 blocks. Assume that each leaflet has an equal chance of landing on each block. What is the probability that a particular block will receive no leaflets?

21 In a class of 80 students, the professor calls on 1 student chosen at random for a recitation in each class period. There are 32 class periods in a term.
(a) Write a formula for the exact probability that a given student is called upon j times during the term.
(b) Write a formula for the Poisson approximation for this probability. Using your formula estimate the probability that a given student is called upon more than twice.

22 Assume that we are making raisin cookies. We put a box of 600 raisins into our dough mix, mix up the dough, then make from the dough 500 cookies. We then ask for the probability that a randomly chosen cookie will have 0, 1, 2, ... raisins. Consider the cookies as trials in an experiment, and let X be the random variable which gives the number of raisins in a given cookie.
Then we can regard the number of raisins in a cookie as the result of n = 600 independent trials with probability p = 1/500 for success on each trial. Since n is large and p is small, we can use the Poisson approximation with $\lambda = 600(1/500) = 1.2$. Determine the probability that a given cookie will have at least five raisins.

23 For a certain experiment, the Poisson distribution with parameter $\lambda = m$ has been assigned. Show that a most probable outcome for the experiment is the integer value k such that $m - 1 \le k \le m$. Under what conditions will there be two most probable values? Hint: Consider the ratio of successive probabilities.

24 When John Kemeny was chair of the Mathematics Department at Dartmouth College, he received an average of ten letters each day. On a certain weekday he received no mail and wondered if it was a holiday. To decide this he computed the probability that, in ten years, he would have at least 1 day without any mail. He assumed that the number of letters he received on a given day has a Poisson distribution. What probability did he find? Hint: Apply the Poisson distribution twice. First, to find the probability that, in 3000 days, he will have at least 1 day without mail, assuming each year has about 300 days on which mail is delivered.

25 Reese Prosser never puts money in a 10-cent parking meter in Hanover. He assumes that there is a probability of .05 that he will be caught. The first offense costs nothing, the second costs 2 dollars, and subsequent offenses cost 5 dollars each. Under his assumptions, how does the expected cost of parking 100 times without paying the meter compare with the cost of paying the meter each time?

  Number of deaths   Number of corps with x deaths in a given year
         0                             144
         1                              91
         2                              32
         3                              11
         4                               2

Table 5.5: Mule kicks.

26 Feller[5] discusses the statistics of flying bomb hits in an area in the south of London during the Second World War.
The area in question was divided into $24 \times 24 = 576$ small areas. The total number of hits was 537. There were 229 squares with 0 hits, 211 with 1 hit, 93 with 2 hits, 35 with 3 hits, 7 with 4 hits, and 1 with 5 or more. Assuming the hits were purely random, use the Poisson approximation to find the probability that a particular square would have exactly k hits. Compute the expected number of squares that would have 0, 1, 2, 3, 4, and 5 or more hits and compare this with the observed results.

27 Assume that the probability that there is a significant accident in a nuclear power plant during one year's time is .001. If a country has 100 nuclear plants, estimate the probability that there is at least one such accident during a given year.

28 An airline finds that 4 percent of the passengers that make reservations on a particular flight will not show up. Consequently, their policy is to sell 100 reserved seats on a plane that has only 98 seats. Find the probability that every person who shows up for the flight will find a seat available.

29 The king's coinmaster boxes his coins 500 to a box and puts 1 counterfeit coin in each box. The king is suspicious, but, instead of testing all the coins in 1 box, he tests 1 coin chosen at random out of each of 500 boxes. What is the probability that he finds at least one fake? What is it if the king tests 2 coins from each of 250 boxes?

30 (From Kemeny⁶) Show that, if you make 100 bets on the number 17 at roulette at Monte Carlo (see Example 6.13), you will have a probability greater than 1/2 of coming out ahead. What is your expected winning?

31 In one of the first studies of the Poisson distribution, von Bortkiewicz⁷ considered the frequency of deaths from kicks in the Prussian army corps. From the study of 14 corps over a 20-year period, he obtained the data shown in Table 5.5. Fit a Poisson distribution to this data and see if you think that the Poisson distribution is appropriate.

⁵ ibid., p. 161.
⁶ Private communication.
⁷ L.
von Bortkiewicz, Das Gesetz der kleinen Zahlen (Leipzig: Teubner, 1898), p. 24.

32 It is often assumed that the auto traffic that arrives at the intersection during a unit time period has a Poisson distribution with expected value m. Assume that the number of cars X that arrive at an intersection from the north in unit time has a Poisson distribution with parameter $\lambda = m$ and the number Y that arrive from the west in unit time has a Poisson distribution with parameter $\lambda = \bar{m}$. If X and Y are independent, show that the total number X + Y that arrive at the intersection in unit time has a Poisson distribution with parameter $\lambda = m + \bar{m}$.

33 Cars coming along Magnolia Street come to a fork in the road and have to choose either Willow Street or Main Street to continue. Assume that the number of cars that arrive at the fork in unit time has a Poisson distribution with parameter $\lambda = 4$. A car arriving at the fork chooses Main Street with probability 3/4 and Willow Street with probability 1/4. Let X be the random variable which counts the number of cars that, in a given unit of time, pass by Joe's Barber Shop on Main Street. What is the distribution of X?

34 In the appeal of the People v. Collins case (see Exercise 4.1.28), the counsel for the defense argued as follows: Suppose, for example, there are 5,000,000 couples in the Los Angeles area and the probability that a randomly chosen couple fits the witnesses' description is 1/12,000,000. Then the probability that there are two such couples given that there is at least one is not at all small. Find this probability. (The California Supreme Court overturned the initial guilty verdict.)

35 A manufactured lot of brass turnbuckles has S items of which D are defective. A sample of s items is drawn without replacement. Let X be a random variable that gives the number of defective items in the sample. Let p(d) = P(X = d).
(a) Show that
\[ p(d) = \frac{\binom{D}{d}\binom{S-D}{s-d}}{\binom{S}{s}} . \]
Thus, X is hypergeometric.
(b) Prove the following identity, known as Euler's formula:
\[ \sum_{d=0}^{\min(D,s)} \binom{D}{d}\binom{S-D}{s-d} = \binom{S}{s} . \]

36 A bin of 1000 turnbuckles has an unknown number D of defectives. A sample of 100 turnbuckles has 2 defectives. The maximum likelihood estimate for D is the number of defectives which gives the highest probability for obtaining the number of defectives observed in the sample. Guess this number D and then write a computer program to verify your guess.

37 There are an unknown number of moose on Isle Royale (a National Park in Lake Superior). To estimate the number of moose, 50 moose are captured and tagged. Six months later 200 moose are captured and it is found that 8 of these were tagged. Estimate the number of moose on Isle Royale from these data, and then verify your guess by computer program (see Exercise 36).

38 A manufactured lot of buggy whips has 20 items, of which 5 are defective. A random sample of 5 items is chosen to be inspected. Find the probability that the sample contains exactly one defective item
(a) if the sampling is done with replacement.
(b) if the sampling is done without replacement.

39 Suppose that N and k tend to $\infty$ in such a way that k/N remains fixed. Show that
\[ h(N, k, n, x) \to b(n, k/N, x) . \]

40 A bridge deck has 52 cards with 13 cards in each of four suits: spades, hearts, diamonds, and clubs. A hand of 13 cards is dealt from a shuffled deck. Find the probability that the hand has
(a) a distribution of suits 4, 4, 3, 2 (for example, four spades, four hearts, three diamonds, two clubs).
(b) a distribution of suits 5, 3, 3, 2.

41 Write a computer algorithm that simulates a hypergeometric random variable with parameters N, k, and n.

42 You are presented with four different dice. The first one has two sides marked 0 and four sides marked 4. The second one has a 3 on every side. The third one has a 2 on four sides and a 6 on two sides, and the fourth one has a 1 on three sides and a 5 on three sides.
You allow your friend to pick any of the four dice he wishes. Then you pick one of the remaining three and you each roll your die. The person with the largest number showing wins a dollar. Show that you can choose your die so that you have probability 2/3 of winning no matter which die your friend picks. (See Tenney and Foster.⁸)

43 The students in a certain class were classified by hair color and eye color. The conventions used were: Brown and black hair were considered dark, and red and blonde hair were considered light; black and brown eyes were considered dark, and blue and green eyes were considered light. They collected the data shown in Table 5.6. Are these traits independent? (See Example 5.6.)

44 Suppose that in the hypergeometric distribution, we let N and k tend to $\infty$ in such a way that the ratio k/N approaches a real number p between 0 and 1. Show that the hypergeometric distribution tends to the binomial distribution with parameters n and p.

⁸ R. L. Tenney and C. C. Foster, Non-transitive Dominance, Math. Mag. 49 (1976) no. 3, pgs. 115-120.

              Dark Eyes   Light Eyes
Dark Hair        28          15         43
Light Hair        9          23         32
                 37          38         75
Table 5.6: Observed data.

Figure 5.5: Distribution of choices in the Powerball lottery.

45 (a) Compute the leading digits of the first 100 powers of 2, and see how well these data fit the Benford distribution.
(b) Multiply each number in the data set of part (a) by 3, and compare the distribution of the leading digits with the Benford distribution.

46 In the Powerball lottery, contestants pick 5 different integers between 1 and 45, and in addition, pick a bonus integer from the same range (the bonus integer can equal one of the first five integers chosen). Some contestants choose the numbers themselves, and others let the computer choose the numbers.
The data shown in Table 5.7 are the contestant-chosen numbers in a certain state on May 3, 1996. A spike graph of the data is shown in Figure 5.5. The goal of this problem is to check the hypothesis that the chosen numbers are uniformly distributed. To do this, compute the value v of the random variable $\chi^2$ given in Example 5.6. In the present case, this random variable has 44 degrees of freedom. One can find, in a $\chi^2$ table, the value $v_0 = 59.43$, which represents a number with the property that a $\chi^2$-distributed random variable takes on values that exceed $v_0$ only 5% of the time. Does your computed value of v exceed $v_0$? If so, you should reject the hypothesis that the contestants' choices are uniformly distributed.

Integer  Times Chosen    Integer  Times Chosen    Integer  Times Chosen
   1        2646            2        2934            3        3352
   4        3000            5        3357            6        2892
   7        3657            8        3025            9        3362
  10        2985           11        3138           12        3043
  13        2690           14        2423           15        2556
  16        2456           17        2479           18        2276
  19        2304           20        1971           21        2543
  22        2678           23        2729           24        2414
  25        2616           26        2426           27        2381
  28        2059           29        2039           30        2298
  31        2081           32        1508           33        1887
  34        1463           35        1594           36        1354
  37        1049           38        1165           39        1248
  40        1493           41        1322           42        1423
  43        1207           44        1259           45        1224
Table 5.7: Numbers chosen by contestants in the Powerball lottery.

5.2 Important Densities

In this section, we will introduce some important probability density functions and give some examples of their use. We will also consider the question of how one simulates a given density using a computer.

Continuous Uniform Density

The simplest density function corresponds to the random variable U whose value represents the outcome of the experiment consisting of choosing a real number at random from the interval [a, b].
\[ f(\omega) = \begin{cases} 1/(b - a), & \text{if } a \le \omega \le b, \\ 0, & \text{otherwise.} \end{cases} \]
It is easy to simulate this density on a computer. We simply calculate the expression $(b - a)\,\mathrm{rnd} + a$.

Exponential and Gamma Densities

The exponential density function is defined by
\[ f(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } 0 \le x < \infty, \\ 0, & \text{otherwise.} \end{cases} \]
[...] If $x \ge 0$, then we have
\[ F(x) = P(T \le x) = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x} . \]
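As a hedged illustration of the two formulas above, the following Python sketch simulates the uniform density on [a, b] with the expression (b - a) rnd + a and evaluates the exponential cumulative distribution function $F(x) = 1 - e^{-\lambda x}$. The function names, the use of Python's `random` module in place of rnd, the interval [2, 5], and the fixed seed are assumptions made for this example; they do not come from the text.

```python
import math
import random


def uniform_sample(a, b, rng):
    """Return one value from the uniform density on [a, b] via (b - a) * rnd + a."""
    return (b - a) * rng.random() + a


def exponential_cdf(x, lam):
    """F(x) = 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


rng = random.Random(0)  # fixed seed (an assumption) so that runs are repeatable
samples = [uniform_sample(2.0, 5.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print(sample_mean)                # should be close to the midpoint 3.5
print(exponential_cdf(1.0, 2.0))  # equals 1 - e^(-2)
```

A bar graph of `samples` should look roughly flat over [2, 5], which is one informal way to check that the expression produces the intended density.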
Both the exponential density and the geometric distribution share a property known as the "memoryless" property. This property was introduced in Example 5.1; it says that
\[ P(T > r + s \mid T > r) = P(T > s) . \]
This can be demonstrated to hold for the exponential density by computing both sides of this equation. The right-hand side is just
\[ 1 - F(s) = e^{-\lambda s} , \]
while the left-hand side is
\[ \frac{P(T > r + s)}{P(T > r)} = \frac{1 - F(r + s)}{1 - F(r)} = \frac{e^{-\lambda(r+s)}}{e^{-\lambda r}} = e^{-\lambda s} . \]

There is a very important relationship between the exponential density and the Poisson distribution. We begin by defining $X_1, X_2, \ldots$ to be a sequence of independent exponentially distributed random variables with parameter $\lambda$. We might think of $X_i$ as denoting the amount of time between the ith and (i + 1)st emissions of a particle by a radioactive source. (As we shall see in Chapter 6, we can think of the parameter $\lambda$ as representing the reciprocal of the average length of time between emissions. This parameter is a quantity that might be measured in an actual experiment of this type.)

We now consider a time interval of length t, and we let Y denote the random variable which counts the number of emissions that occur in the time interval. We would like to calculate the distribution function of Y (clearly, Y is a discrete random variable). If we let $S_n$ denote the sum $X_1 + X_2 + \cdots + X_n$, then it is easy to see that
\[ P(Y = n) = P(S_n \le t \text{ and } S_{n+1} > t) . \]
Since the event $S_{n+1} \le t$ is a subset of the event $S_n \le t$, the above probability is seen to be equal to
\[ P(S_n \le t) - P(S_{n+1} \le t) . \tag{5.4} \]
[...]
\[ g_n(x) = \begin{cases} \lambda \dfrac{(\lambda x)^{n-1}}{(n-1)!} e^{-\lambda x}, & \text{if } x > 0, \\ 0, & \text{otherwise.} \end{cases} \]
This density is an example of a gamma density with parameters $\lambda$ and n. The general gamma density allows n to be any positive real number. We shall not discuss this general density.

It is easy to show by induction on n that the cumulative distribution function of $S_n$ is given by:
\[ G_n(x) = \begin{cases} 1 - e^{-\lambda x}\left(1 + \dfrac{\lambda x}{1!} + \cdots + \dfrac{(\lambda x)^{n-1}}{(n-1)!}\right), & \text{if } x > 0, \\ 0, & \text{otherwise.} \end{cases} \]
Using this expression, the quantity in (5.4) is easy to compute; we obtain
\[ e^{-\lambda t} \frac{(\lambda t)^n}{n!} , \]
which the reader will recognize as the probability that a Poisson-distributed random variable, with parameter $\lambda t$, takes on the value n.

The above relationship will allow us to simulate a Poisson distribution, once we have found a way to simulate an exponential density. The following random variable does the job:
\[ Y = -\frac{1}{\lambda} \log(\mathrm{rnd}) . \tag{5.5} \]
Using Corollary 5.2 (below), one can derive the above expression (see Exercise 3). We content ourselves for now with a short calculation that should convince the reader that the random variable Y has the required property. We have
\[ P(Y \le y) = P\left(-\frac{1}{\lambda}\log(\mathrm{rnd}) \le y\right) = P(\log(\mathrm{rnd}) \ge -\lambda y) = P(\mathrm{rnd} \ge e^{-\lambda y}) = 1 - e^{-\lambda y} . \]
This last expression is seen to be the cumulative distribution function of an exponentially distributed random variable with parameter $\lambda$.

To simulate a Poisson random variable W with parameter $\lambda$, we simply generate a sequence of values of an exponentially distributed random variable with the same parameter, and keep track of the subtotals $S_k$ of these values. We stop generating the sequence when the subtotal first exceeds 1, the length of a unit time interval (taking t = 1 in the relationship above gives the Poisson parameter $\lambda t = \lambda$). Assume that we find that $S_n$ [...]

[...] $1/\lambda > 1/\mu$, so the average interarrival time is greater than the average service time, i.e., customers are served more quickly, on average, than new ones arrive. Thus, in this case, it is reasonable to expect that N(t) remains small. However, if $\lambda > \mu$ then customers arrive more quickly than they are served, and, as expected, N(t) appears to grow without limit.

We can now ask: How long will a customer have to wait in the queue for service? To examine this question, we let $W_i$ be the length of time that the ith customer has to remain in the system (waiting in line and being served). Then we can present these data in a bar graph, using the program Queue, to give some idea of how the $W_i$ are distributed (see Figure 5.8). (Here $\lambda = 1$ and $\mu = 1.1$.) We see that these waiting times appear to be distributed exponentially. This is always the case when $\lambda < \mu$.
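The simulation ideas in this passage can be sketched in Python as follows. This is only an illustrative sketch, not the text's program Queue: it assumes a single server working first come, first served, with exponential interarrival times (parameter $\lambda$) and exponential service times (parameter $\mu$). The exponential values come from Equation 5.5, and the Poisson value is obtained by counting how many exponential gaps with parameter $\lambda$ fit into one unit of time (t = 1, so the count has parameter $\lambda t = \lambda$). All function names, the seed, and the choice $\lambda = 1$, $\mu = 2$ are assumptions made for the example.

```python
import math
import random


def exponential_sample(lam, rng):
    """Equation 5.5: Y = -(1/lam) * log(rnd).

    1 - rng.random() lies in (0, 1], which avoids log(0); it has the
    same uniform distribution as rnd.
    """
    return -math.log(1.0 - rng.random()) / lam


def poisson_sample(lam, rng):
    """Count how many exponential gaps with parameter lam fit into one unit of time."""
    count = 0
    subtotal = exponential_sample(lam, rng)
    while subtotal <= 1.0:
        count += 1
        subtotal += exponential_sample(lam, rng)
    return count


def times_in_system(lam, mu, customers, rng):
    """Return W_i (waiting in line plus being served) for each customer.

    Assumed setup: one server, first come first served, exponential
    interarrival times (lam) and exponential service times (mu).
    """
    arrival = 0.0
    previous_departure = 0.0
    waits = []
    for _ in range(customers):
        arrival += exponential_sample(lam, rng)    # time of the next arrival
        start = max(arrival, previous_departure)   # wait if the server is busy
        previous_departure = start + exponential_sample(mu, rng)
        waits.append(previous_departure - arrival)
    return waits


rng = random.Random(1)  # fixed seed (an assumption) for repeatability
waits = times_in_system(lam=1.0, mu=2.0, customers=200_000, rng=rng)
print(sum(waits) / len(waits))  # average time in the system
```

With $\lambda < \mu$, a bar graph of the values in `waits` can be compared with an exponential shape, in the spirit of Figure 5.8.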
The proof of this fact is too complicated to give here, but we can verify it by simulation for different choices of $\lambda$ and $\mu$, as above. □

Functions of a Random Variable

Before continuing our list of important densities, we pause to consider random variables which are functions of other random variables. We will prove a general theorem that will allow us to derive expressions such as Equation 5.5.

Theorem 5.1 Let X be a continuous random variable, and suppose that $\phi(x)$ is a strictly increasing function on the range of X. Define $Y = \phi(X)$. Suppose that X and Y have cumulative distribution functions $F_X$ and $F_Y$ respectively. Then these functions are related by
\[ F_Y(y) = F_X(\phi^{-1}(y)) . \]
If $\phi(x)$ is strictly decreasing on the range of X, then
\[ F_Y(y) = 1 - F_X(\phi^{-1}(y)) . \]

Proof. Since $\phi$ is a strictly increasing function on the range of X, the events $(X \le \phi^{-1}(y))$ and $(\phi(X) \le y)$ are equal. Thus, we have
\[ F_Y(y) = P(Y \le y) = P(\phi(X) \le y) = P(X \le \phi^{-1}(y)) = F_X(\phi^{-1}(y)) . \]
If $\phi(x)$ is strictly decreasing on the range of X, then we have
\[ F_Y(y) = P(\phi(X) \le y) = P(X \ge \phi^{-1}(y)) = 1 - P(X < \phi^{-1}(y)) = 1 - F_X(\phi^{-1}(y)) . \]
[...] If $v > v_0$, the statistician rejects the hypothesis that the two traits are independent. In the present case, $v_0 = 7.815$, so we would not reject the hypothesis that the two traits are independent. □

Cauchy Density

The following example is from Feller.¹⁰

Example 5.11 Suppose that a mirror is mounted on a vertical axis, and is free to revolve about that axis. The axis of the mirror is 1 foot from a straight wall of infinite length. A pulse of light is shone onto the mirror, and the reflected ray hits the wall. Let $\theta$ be the angle between the reflected ray and the line that is perpendicular to the wall and that runs through the axis of the mirror. We assume that $\theta$ is uniformly distributed between $-\pi/2$ and $\pi/2$. Let X represent the distance between the point on the wall that is hit by the reflected ray and the point on the wall that is closest to the axis of the mirror. We now determine the density of X. Let B be a fixed positive quantity.
Then $X \ge B$ if and only if $\tan(\theta) \ge B$, which happens if and only if $\theta \ge \arctan(B)$. This happens with probability
\[ \frac{\pi/2 - \arctan(B)}{\pi} . \]
Thus, for positive B, the cumulative distribution function of X is
\[ F(B) = 1 - \frac{\pi/2 - \arctan(B)}{\pi} . \]
Therefore, the density function for positive B is
\[ f(B) = \frac{1}{\pi(1 + B^2)} . \]
Since the physical situation is symmetric with respect to $\theta = 0$, it is easy to see that the above expression for the density is correct for negative values of B as well.

The Law of Large Numbers, which we will discuss in Chapter 8, states that in many cases, if we take the average of independent values of a random variable, then the average approaches a specific number as the number of values increases. It turns out that if one does this with a Cauchy-distributed random variable, the average does not approach any specific number. □

¹⁰ W. Feller, An Introduction to Probability Theory and Its Applications, vol. 2 (New York: Wiley, 1966).

Exercises

1 Choose a number U from the unit interval [0, 1] with uniform distribution. Find the cumulative distribution and density for the random variables
(a) $Y = U + 2$.
(b) $Y = U^3$.

2 Choose a number U from the interval [0, 1] with uniform distribution. Find the cumulative distribution and density for the random variables
(a) $Y = 1/(U + 1)$.
(b) $Y = \log(U + 1)$.

3 Use Corollary 5.2 to derive the expression for the random variable given in Equation 5.5. Hint: The random variables $1 - \mathrm{rnd}$ and $\mathrm{rnd}$ are identically distributed.

4 Suppose we know a random variable Y as a function of the uniform random variable U: $Y = \phi(U)$, and suppose we have calculated the cumulative distribution function $F_Y(y)$ and thence the density $f_Y(y)$. How can we check whether our answer is correct? An easy simulation provides the answer: Make a bar graph of $Y = \phi(\mathrm{rnd})$ and compare the result with the graph of $f_Y(y)$. These graphs should look similar. Check your answers to Exercises 1 and 2 by this method.
5 Choose a number U from the interval [0, 1] with uniform distribution. Find the cumulative distribution and density for the random variables
(a) $Y = |U - 1/2|$.

6 Check your results for Exercise 5 by simulation as described in Exercise 4.

7 Explain how you can generate a random variable whose cumulative distribution function is
\[ F(x) = \begin{cases} 0, & \text{if } x < 0, \\ x^2, & \text{if } 0 \le x \le 1, \\ 1, & \text{if } x > 1. \end{cases} \]

8 Write a program to generate a sample of 1000 random outcomes each of which is chosen from the distribution given in Exercise 7. Plot a bar graph of your results and compare this empirical density with the density for the cumulative distribution given in Exercise 7.

9 Let U, V be random numbers chosen independently from the interval [0, 1] with uniform distribution. Find the cumulative distribution and density of each of the variables
(a) $Y = U + V$.
(b) $Y = |U - V|$.

10 Let U, V be random numbers chosen independently from the interval [0, 1]. Find the cumulative distribution and density for the random variables
(a) $Y = \max(U, V)$.
(b) $Y = \min(U, V)$.

11 Write a program to simulate the random variables of Exercises 9 and 10 and plot a bar graph of the results. Compare the resulting empirical density with the density found in Exercises 9 and 10.

12 A number U is chosen at random in the interval [0, 1]. Find the probability that
(a) $R = U^2 < 1/4$.
(b) $S = U(1 - U) < 1/4$.
(c) $T = U/(1 - U) < 1/4$.

13 Find the cumulative distribution function F and the density function f for each of the random variables R, S, and T in Exercise 12.

14 A point P in the unit square has coordinates X and Y chosen at random in the interval [0, 1]. Let D be the distance from P to the nearest edge of the square, and E the distance to the nearest corner. What is the probability that
(a) D < 1/4?
(b) E < 1/4?

15 In Exercise 14 find the cumulative distribution F and density f for the random variable D.
16 Let X be a random variable with density function $cx(1 - x)$, if 0 [...] $a > 0$, $a = 0$, and $a < 0$ require different arguments.

19 Let X be a random variable with density function $f_X$, and let $Y = X + b$, $Z = aX$, and $W = aX + b$, where $a \ne 0$. Find the density functions $f_Y$, $f_Z$, and $f_W$. (See Exercise 18.)

20 Let X be a random variable uniformly distributed over [c, d], and let $Y = aX + b$. For what choice of a and b is Y uniformly distributed over [0, 1]?

21 Let X be a random variable with cumulative distribution function F strictly increasing on the range of X. Let $Y = F(X)$. Show that Y is uniformly distributed in the interval [0, 1]. (The formula $X = F^{-1}(Y)$ then tells us how to construct X from a uniform random variable Y.)

22 Let X be a random variable with cumulative distribution function F. The median of X is the value m for which F(m) = 1/2. Then X < m with probability 1/2 and X > m with probability 1/2. Find m if X is
(a) uniformly distributed over the interval [a, b].
(b) normally distributed with parameters $\mu$ and $\sigma$.
(c) exponentially distributed with parameter $\lambda$.

23 Let X be a random variable with density function $f_X$. The mean of X is the value $\mu = \int_{-\infty}^{\infty} x f_X(x)\,dx$. Then $\mu$ gives an average value for X (see Section 6.3). Find $\mu$ if X is distributed uniformly, normally, or exponentially, as in Exercise 22.

Test Score    Letter grade
$\mu + \sigma$ [...]

[...] 50).
(b) $P(X < 60)$.
(c) $P(X > 90)$.
(d) $P(60 < X < 80)$.

26 Bridies' Bearing Works manufactures bearing shafts whose diameters are normally distributed with parameters $\mu = 1$, $\sigma = .002$. The buyer's specifications require these diameters to be $1.000 \pm .003$ cm. What fraction of the manufacturer's shafts are likely to be rejected? If the manufacturer improves her quality control, she can reduce the value of $\sigma$. What value of $\sigma$ will ensure that no more than 1 percent of her shafts are likely to be rejected?
27 A final examination at Podunk University is constructed so that the test scores are approximately normally distributed, with parameters $\mu$ and $\sigma$. The instructor assigns letter grades to the test scores as shown in Table 5.10 (this is the process of "grading on the curve"). What fraction of the class gets A, B, C, D, F?

28 (Ross¹¹) An expert witness in a paternity suit testifies that the length (in days) of a pregnancy, from conception to delivery, is approximately normally distributed, with parameters $\mu = 270$, $\sigma = 10$. The defendant in the suit is able to prove that he was out of the country during the period from 290 to 240 days before the birth of the child. What is the probability that the defendant was in the country when the child was conceived?

29 Suppose that the time (in hours) required to repair a car is an exponentially distributed random variable with parameter $\lambda = 1/2$. What is the probability that the repair time exceeds 4 hours? If it exceeds 4 hours what is the probability that it exceeds 8 hours?

¹¹ S. Ross, A First Course in Probability Theory, 2d ed. (New York: Macmillan, 1984).

30 Suppose that the number of years a car will run is exponentially distributed with parameter $\mu = 1/4$. If Prosser buys a used car today, what is the probability that it will still run after 4 years?

31 Let U be a uniformly distributed random variable on [0, 1]. What is the probability that the equation
\[ x^2 + 4Ux + 1 = 0 \]
has two distinct real roots $x_1$ and $x_2$?

32 Write a program to simulate the random variables whose densities are given by the following, making a suitable bar graph of each and comparing the exact density with the bar graph.
(a) $f_X(x) = e^{-x}$ on $[0, \infty)$ (but just do it on [0, 10]).
(b) $f_X(x) = 2x$ on [0, 1].
(c) $f_X(x) = 3x^2$ on [0, 1].
(d) $f_X(x) = 4|x - 1/2|$ on [0, 1].
33 Suppose we are observing a process such that the time between occurrences is exponentially distributed with $\lambda = 1/30$ (i.e., the average time between occurrences is 30 minutes). Suppose that the process starts at a certain time and we start observing the process 3 hours later. Write a program to simulate this process. Let T denote the length of time that we have to wait, after we start our observation, for an occurrence. Have your program keep track of T. What is an estimate for the average value of T?

34 Jones puts in two new lightbulbs: a 60 watt bulb and a 100 watt bulb. It is claimed that the lifetime of the 60 watt bulb has an exponential density with average lifetime 200 hours ($\lambda = 1/200$). The 100 watt bulb also has an exponential density but with average lifetime of only 100 hours ($\lambda = 1/100$). Jones wonders what is the probability that the 100 watt bulb will outlast the 60 watt bulb.
If X and Y are two independent random variables with exponential densities $f(x) = \lambda e^{-\lambda x}$ and $g(x) = \mu e^{-\mu x}$, respectively, then the probability that X is less than Y is given by
\[ P(X < Y) = \int_0^\infty f(x)(1 - G(x))\,dx , \]
where G(x) is the cumulative distribution function for g(x). Explain why this is the case. Use this to show that P(X [...]

[...] 1. Thus,
\[ E(Y) = \sum_{n=1}^{\infty} 2^n \frac{1}{2^n} , \]
which is a divergent sum. Thus, Y has no expectation. This example is called the St. Petersburg Paradox. The fact that the above sum is infinite suggests that a player should be willing to pay any fixed amount per game for the privilege of playing this game. The reader is asked to consider how much he or she would be willing to pay for this privilege. It is unlikely that the reader's answer is more than 10 dollars; therein lies the paradox.

In the early history of probability, various mathematicians gave ways to resolve this paradox. One idea (due to G. Cramer) consists of assuming that the amount of money in the world is finite.
He thus assumes that there is some fixed value of n such that if the number of tosses equals or exceeds n, the payment is $2^n$ dollars. The reader is asked to show in Exercise 20 that the expected value of the payment is now finite.

Daniel Bernoulli and Cramer also considered another way to assign value to the payment. Their idea was that the value of a payment is some function of the payment; such a function is now called a utility function. Examples of reasonable utility functions might include the square-root function or the logarithm function. In both cases, the value of 2n dollars is less than twice the value of n dollars. It can easily be shown that in both cases, the expected utility of the payment is finite (see Exercise 20). □

Example 6.4 Let T be the time for the first success in a Bernoulli trials process. Then we take as sample space $\Omega$ the integers 1, 2, ... and assign the geometric distribution
\[ m(j) = P(T = j) = q^{j-1} p . \]
Thus,
\[ E(T) = 1 \cdot p + 2qp + 3q^2 p + \cdots = p(1 + 2q + 3q^2 + \cdots) . \]
Now if $|x| < 1$, then
\[ 1 + x + x^2 + x^3 + \cdots = \frac{1}{1 - x} . \]
Differentiating this formula, we get
\[ 1 + 2x + 3x^2 + \cdots = \frac{1}{(1 - x)^2} , \]
so
\[ E(T) = \frac{p}{(1 - q)^2} = \frac{p}{p^2} = \frac{1}{p} . \]
In particular, we see that if we toss a fair coin a sequence of times, the expected time until the first heads is 1/(1/2) = 2. If we roll a die a sequence of times, the expected number of rolls until the first six is 1/(1/6) = 6. □

Interpretation of Expected Value

In statistics, one is frequently concerned with the average value of a set of data. The following example shows that the ideas of average value and expected value are very closely related.

Example 6.5 The heights of the women on the Swarthmore basketball team are 5' 9", 5' 9", 5' 6", 5' 8", 5' 11", 5' 5", 5' 7", 5' 6", 5' 6", 5' 7", 5' 10", and 6' 0". A statistician would compute the average height (in inches) as follows:
\[ \frac{69 + 69 + 66 + 68 + 71 + 65 + 67 + 66 + 66 + 67 + 70 + 72}{12} = \frac{816}{12} = 68 . \]
One can also interpret this number as the expected value of a random variable. To see this, let an experiment consist of choosing one of the women at random, and let X denote her height. Then the expected value of X equals 68. □

Of course, just as with the frequency interpretation of probability, to interpret expected value as an average outcome requires further justification. We know that for any finite experiment the average of the outcomes is not predictable. However, we shall eventually prove that the average will usually be close to E(X) if we repeat the experiment a large number of times. We first need to develop some properties of the expected value. Using these properties, and those of the concept of the variance to be introduced in the next section, we shall be able to prove the Law of Large Numbers. This theorem will justify mathematically both our frequency concept of probability and the interpretation of expected value as the average value to be expected in a large number of experiments.

X      Y
HHH    1
HHT    2
HTH    3
HTT    2
THH    2
THT    3
TTH    2
TTT    1
Table 6.2: Tossing a coin three times.

Expectation of a Function of a Random Variable

Suppose that X is a discrete random variable with sample space $\Omega$, and $\phi(x)$ is a real-valued function with domain $\Omega$. Then $\phi(X)$ is a real-valued random variable. One way to determine the expected value of $\phi(X)$ is to first determine the distribution function of this random variable, and then use the definition of expectation. However, there is a better way to compute the expected value of $\phi(X)$, as demonstrated in the next example.

Example 6.6 Suppose a coin is tossed 9 times, with the result
HHHTTTTHT.
The first set of three heads is called a run. There are three more runs in this sequence, namely the next four tails, the next head, and the next tail. We do not consider the first two tosses to constitute a run, since the third toss has the same value as the first two.
Now suppose an experiment consists of tossing a fair coin three times. Find the expected number of runs.
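The notion of a run can be made concrete with a short Python sketch. It counts runs as maximal blocks of equal consecutive tosses and then averages the run count over the eight equally likely outcomes of three tosses of a fair coin. The function name `count_runs` and the use of strings of 'H' and 'T' to represent outcomes are assumptions made for this illustration.

```python
from itertools import product


def count_runs(tosses):
    """Number of maximal blocks of equal consecutive symbols in a toss sequence."""
    if not tosses:
        return 0
    # A new run starts at every position where the symbol changes.
    return 1 + sum(1 for a, b in zip(tosses, tosses[1:]) if a != b)


nine_toss_runs = count_runs("HHHTTTTHT")  # the nine-toss sequence above
print(nine_toss_runs)

# All 2^3 outcomes of three tosses; a fair coin makes them equally likely.
outcomes = ["".join(p) for p in product("HT", repeat=3)]
expected_runs = sum(count_runs(o) for o in outcomes) / len(outcomes)
print(expected_runs)
```

Averaging over the individual outcomes, rather than first tabulating how often each run count occurs, is a direct way to carry out the computation when the outcomes are equally likely.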
It will be helpful to think of two random variables, X and Y, associated with this experiment. We let X denote the sequence of heads and tails that results when the experiment is performed, and Y denote the number of runs in the outcome X. The possible outcomes of X and the corresponding values of Y are shown in Table 6.2.

To calculate E(Y) using the definition of expectation, we first must find the distribution function m(y) of Y, i.e., we group together those values of X with a common value of Y and add their probabilities. In this case, we calculate that the distribution function of Y is: m(1) = 1/4, m(2) = 1/2, and m(3) = 1/4. One easily finds that E(Y) = 2.

Now suppose we didn't group the values of X with a common Y-value, but instead, for each X-value x, we multiply the probability of x and the corresponding value of Y, and add the results. We obtain
\[ 1\left(\tfrac{1}{8}\right) + 2\left(\tfrac{1}{8}\right) + 3\left(\tfrac{1}{8}\right) + 2\left(\tfrac{1}{8}\right) + 2\left(\tfrac{1}{8}\right) + 3\left(\tfrac{1}{8}\right) + 2\left(\tfrac{1}{8}\right) + 1\left(\tfrac{1}{8}\right) , \]
which equals 2.

This illustrates the following general principle. If X and Y are two random variables, and Y can be written as a function of X, then one can compute the expected value of Y using the distribution function of X. □

Theorem 6.1 If X is a discrete random variable with sample space $\Omega$ and distribution function m(x), and if $\phi : \Omega \to \mathbf{R}$ is a function, then
\[ E(\phi(X)) = \sum_{x \in \Omega} \phi(x) m(x) , \]
provided the series converges absolutely. □

The proof of this theorem is straightforward, involving nothing more than grouping values of X with a common Y-value, as in Example 6.6.

The Sum of Two Random Variables

Many important results in probability theory concern sums of random variables. We first consider what it means to add two random variables.

Example 6.7 We flip a coin and let X have the value 1 if the coin comes up heads and 0 if the coin comes up tails. Then, we roll a die and let Y denote the face that comes up. What does X + Y mean, and what is its distribution?
This question is easily answered in this case, by considering, as we did in Chapter 4, the joint random variable Z = (X, Y), whose outcomes are ordered pairs of the form (x, y), where $0 \le x \le 1$ and $1 \le y \le 6$. The description of the experiment makes it reasonable to assume that X and Y are independent, so the distribution function of Z is uniform, with 1/12 assigned to each outcome. Now it is an easy matter to find the set of outcomes of X + Y, and its distribution function. □

In Example 6.1, the random variable X denoted the number of heads which occur when a fair coin is tossed three times. It is natural to think of X as the sum of the random variables $X_1$, $X_2$, $X_3$, where $X_i$ is defined to be 1 if the ith toss comes up heads, and 0 if the ith toss comes up tails. The expected values of the $X_i$'s are extremely easy to compute. It turns out that the expected value of X can be obtained by simply adding the expected values of the $X_i$'s. This fact is stated in the following theorem.

Theorem 6.2 Let X and Y be random variables with finite expected values. Then
\[ E(X + Y) = E(X) + E(Y) , \]
and if c is any constant, then
\[ E(cX) = cE(X) . \]

Proof. Let the sample spaces of X and Y be denoted by $\Omega_X$ and $\Omega_Y$, and suppose that
\[ \Omega_X = \{x_1, x_2, \ldots\} \]
and
\[ \Omega_Y = \{y_1, y_2, \ldots\} . \]
Then we can consider the random variable X + Y to be the result of applying the function [...]

[...] q, it is favorable. □

If you are in a casino, you will see players adopting elaborate systems of play to try to make unfavorable games favorable. Two such systems, the martingale doubling system and the more conservative Labouchere system, were described in Exercises 1.1.9 and 1.1.10. Unfortunately, such systems cannot change even a fair game into a favorable game.

Even so, it is a favorite pastime of many people to develop systems of play for gambling games and for other games such as the stock market. We close this section with a simple illustration of such a system.
Stock Prices

Example 6.16 Let us assume that a stock increases or decreases in value each day by 1 dollar, each with probability 1/2. Then we can identify this simplified model with our familiar game of heads or tails. We assume that a buyer, Mr. Ace, adopts the following strategy. He buys the stock on the first day at its price V. He then waits until the price of the stock increases by one to V + 1 and sells. He then continues to watch the stock until its price falls back to V. He buys again and waits until it goes up to V + 1 and sells. Thus he holds the stock in intervals during which it increases by 1 dollar. In each such interval, he makes a profit of 1 dollar. However, we assume that he can do this only for a finite number of trading days. Thus he can lose if, in the last interval that he holds the stock, it does not get back up to V + 1; and this is the only way he can lose. In Figure 6.4 we illustrate a typical history if Mr. Ace must stop in twenty days. Mr. Ace holds the stock under his system during the days indicated by broken lines. We note that for the history shown in Figure 6.4, his system nets him a gain of 4 dollars.

[Figure 6.4: Mr. Ace's system.]

We have written a program StockSystem to simulate the fortune of Mr. Ace if he uses his system over an n-day period. If one runs this program a large number of times, for n = 20, say, one finds that his expected winnings are very close to 0, but the probability that he is ahead after 20 days is significantly greater than 1/2. For small values of n, the exact distribution of winnings can be calculated. The distribution for the case n = 20 is shown in Figure 6.5. Using this distribution, it is easy to calculate that the expected value of his winnings is exactly 0. This is another instance of the fact that a fair game (a martingale) remains fair under quite general systems of play.
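The exact calculation for small n can be sketched in a few lines of Python. This is only an illustration and not the authors' StockSystem program: the function name `winnings_distribution` and the state encoding are assumptions made here. The sketch assumes that Mr. Ace holds the stock exactly when its price is at or below V, and that any stock still held at the end of day n is valued at that day's price.

```python
from collections import defaultdict
from fractions import Fraction


def winnings_distribution(n):
    """Exact distribution of Mr. Ace's winnings after n days (hypothetical helper).

    A state is (price relative to V, number of completed 1-dollar gains).
    Mr. Ace holds the stock whenever the relative price is <= 0.
    """
    half = Fraction(1, 2)
    states = {(0, 0): Fraction(1)}
    for _ in range(n):
        nxt = defaultdict(Fraction)
        for (price, realized), prob in states.items():
            for step in (1, -1):
                new_price = price + step
                new_realized = realized
                # He was holding (price <= 0) and the price reached V + 1: sell.
                if price <= 0 and new_price == 1:
                    new_realized += 1
                nxt[(new_price, new_realized)] += prob * half
        states = nxt

    dist = defaultdict(Fraction)
    for (price, realized), prob in states.items():
        # Stock still held at the end (price <= 0) contributes a paper loss.
        dist[realized + min(price, 0)] += prob
    return dict(dist)


dist = winnings_distribution(20)
expected = sum(k * p for k, p in dist.items())
p_ahead = sum(p for k, p in dist.items() if k > 0)
print("expected winnings:", expected)
print("probability of being ahead:", float(p_ahead))
```

Exact rational arithmetic is used so that the expected value can be compared with 0 without rounding error; the printed probability of being ahead can be set against the value quoted in the text.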
Although the expected value of his winnings is 0, the probability that Mr. Ace is ahead after 20 days is about .610. Thus, he would be able to tell his friends that his system gives him a better chance of being ahead than that of someone who simply buys the stock and holds it, if our simple random model is correct. There have been a number of studies to determine how random the stock market is. □

[Figure 6.5: Winnings distribution for n = 20.]

Historical Remarks

With the Law of Large Numbers to bolster the frequency interpretation of probability, we find it natural to justify the definition of expected value in terms of the average outcome over a large number of repetitions of the experiment. The concept of expected value was used before it was formally defined; and when it was used, it was considered not as an average value but rather as the appropriate value for a gamble. For example, recall, from the Historical Remarks section of Chapter 1, Section 1.2, Pascal's way of finding the value of a three-game series that had to be called off before it was finished.

Pascal first observed that if each player has only one game to win, then the stake of 64 pistoles should be divided evenly. Then he considered the case where one player has won two games and the other one.

Then consider, Sir, if the first man wins, he gets 64 pistoles, if he loses he gets 32. Thus if they do not wish to risk this last game, but wish to separate without playing it, the first man must say: "I am certain to get 32 pistoles, even if I lose I still get them; but as for the other 32 pistoles, perhaps I will get them, perhaps you will get them, the chances are equal. Let us then divide these 32 pistoles in half and give one half to me as well as my 32 which are mine for sure."
He will then have 48 pistoles and the other 16.[2]

[2] Quoted in F. N. David, Games, Gods and Gambling (London: Griffin, 1962), p. 231.

Note that Pascal reduced the problem to a symmetric bet in which each player gets the same amount and takes it as obvious that in this case the stakes should be divided equally.

The first systematic study of expected value appears in Huygens' book. Like Pascal, Huygens finds the value of a gamble by assuming that the answer is obvious for certain symmetric situations and uses this to deduce the expected value for the general situation. He does this in steps. His first proposition is

Prop. I. If I expect a or b, either of which, with equal probability, may fall to me, then my Expectation is worth (a + b)/2, that is, the half Sum of a and b.[3]

[3] C. Huygens, Calculating in Games of Chance, translation attributed to John Arbuthnot (London, 1692), p. 34.

Huygens proved this as follows: Assume that two players A and B play a game in which each player puts up a stake of (a + b)/2 with an equal chance of winning the total stake. Then the value of the game to each player is (a + b)/2. For example, if the game had to be called off, clearly each player should just get back his original stake. Now, by symmetry, this value is not changed if we add the condition that the winner of the game has to pay the loser an amount b as a consolation prize. Then for player A the value is still (a + b)/2. But what are his possible outcomes for the modified game? If he wins he gets the total stake a + b and must pay B an amount b, so he ends up with a. If he loses he gets an amount b from player B. Thus player A wins a or b with equal chances and the value to him is (a + b)/2. Huygens illustrated this proof in terms of an example.
If you are offered a game in which you have an equal chance of winning 2 or 8, the expected value is 5, since this game is equivalent to the game in which each player stakes 5 and agrees to pay the loser 2, a game in which the value is obviously 5.

Huygens' second proposition is

Prop. II. If I expect a, b, or c, either of which, with equal facility, may happen, then the Value of my Expectation is (a + b + c)/3, or the third of the Sum of a, b, and c.[4]

His argument here is similar. Three players, A, B, and C, each stake (a + b + c)/3 in a game they have an equal chance of winning. The value of this game to player A is clearly the amount he has staked. Further, this value is not changed if A enters into an agreement with B that if one of them wins he pays the other a consolation prize of b, and with C that if one of them wins he pays the other a consolation prize of c. By symmetry these agreements do not change the value of the game. In this modified game, if A wins he wins the total stake a + b + c minus the consolation prizes b + c, giving him a final winning of a. If B wins, A wins b, and if C wins, A wins c. Thus A finds himself in a game with value (a + b + c)/3 and with outcomes a, b, and c occurring with equal chance. This proves Proposition II.

More generally, this reasoning shows that if there are n outcomes $a_1, a_2, \ldots, a_n$, all occurring with the same probability, the expected value is $\frac{a_1 + a_2 + \cdots + a_n}{n}$.

In his third proposition Huygens considered the case where you win a or b but with unequal probabilities. He assumed there are p chances of winning a, and q chances of winning b, all having the same probability. He then showed that the expected value is $E = \frac{p}{p+q} \cdot a + \frac{q}{p+q} \cdot b$. This follows by considering an equivalent gamble with p + q outcomes all occurring with the same probability and with a payoff of a in p of the outcomes and b in q of the outcomes.
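The reduction in the third proposition is easy to check numerically. The following Python sketch is an illustration only; the function names `huygens_value` and `weighted_value` are hypothetical and do not come from the text. It builds the equivalent gamble with p + q equally likely outcomes, averages it as in the general form of Proposition II, and compares the result with the weighted formula.

```python
from fractions import Fraction


def huygens_value(a, b, p, q):
    """Value of a gamble with p chances of winning a and q chances of winning b,
    computed by listing p + q equally likely outcomes and averaging them."""
    outcomes = [a] * p + [b] * q
    return Fraction(sum(outcomes), len(outcomes))


def weighted_value(a, b, p, q):
    """The same value written as a weighted average."""
    return Fraction(p, p + q) * a + Fraction(q, p + q) * b


# Equal chances of 2 or 8, as in the example for Proposition I.
print(huygens_value(2, 8, 1, 1), weighted_value(2, 8, 1, 1))
# Two chances of winning 10 against one chance of winning 4.
print(huygens_value(10, 4, 2, 1), weighted_value(10, 4, 2, 1))
```

Because p and q are whole numbers of chances, the sketch covers exactly the case of rational probabilities.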
This allowed Huygens to compute the expected value for experiments with unequal probabilities, at least when these probabilities are rational numbers.

Thus, instead of defining the expected value as a weighted average, Huygens assumed that the expected values of certain symmetric gambles are known and deduced the other values from these. Although this requires a good deal of clever manipulation, Huygens ended up with values that agree with those given by our modern definition of expected value. One advantage of this method is that it gives a justification for the expected value in cases where it is not reasonable to assume that you can repeat the experiment a large number of times, as for example, in betting that at least two presidents died on the same day of the year. (In fact, three did; all three died on July 4, and two of them were signers of the Declaration of Independence.)

[4] ibid., p. 35.

In his book, Huygens calculated the expected value of games using techniques similar to those which we used in computing the expected value for roulette at Monte Carlo. For example, his proposition XIV is:

Prop. XIV. If I were playing with another by turns, with two Dice, on this Condition, that if I throw 7 I gain, and if he throws 6 he gains allowing him the first Throw: To find the proportion of my Hazard to his.[5]

[5] ibid., p. 47.

A modern description of this game is as follows. Huygens and his opponent take turns rolling a pair of dice. The game is over if Huygens rolls a 7 or his opponent rolls a 6. His opponent rolls first. What is the probability that Huygens wins the game?

To solve this problem Huygens let x be his chance of winning when his opponent threw first and y his chance of winning when he threw first. Then on the first roll his opponent wins on 5 out of the 36 possibilities. Thus, $x = \frac{31}{36}\, y$. But when Huygens rolls he wins on 6 out of the 36 possible outcomes, and in the other 30, he is led back to where his chances are x. Thus $y = \frac{6}{36} + \frac{30}{36}\, x$. From these two equations Huygens found that x = 31/61.

Another early use of expected value appeared in Pascal's argument to show that a rational person should believe in the existence of God.[6] Pascal said that we have to make a wager whether to believe or not to believe. Let p denote the probability that God does not exist. His discussion suggests that we are playing a game with two strategies, believe and not believe, with payoffs as shown in Table 6.4.

[6] Quoted in I. Hacking, The Emergence of Probability (Cambridge: Cambridge Univ. Press, 1975).

              God does not exist    God exists
                      p                1 - p
believe              -u                  v
not believe           0                 -x

Table 6.4: Payoffs.

Here -u represents the cost to you of passing up some worldly pleasures as a consequence of believing that God exists. If you do not believe, and God is a vengeful God, you will lose x. If God exists and you do believe you will gain v. Now to determine which strategy is best you should compare the two expected values $p(-u) + (1 - p)v$ and $p \cdot 0 + (1 - p)(-x)$, and choose the larger of the two. In general, the choice will depend upon the value of p. But Pascal assumed that the value of v is infinite and so the strategy of believing is best no matter what probability you assign for the existence of God. This example is considered by some to be the beginning of decision theory. Decision analyses of this kind appear today in many fields, and, in particular, are an important part of medical diagnostics and corporate business decisions.

Age    Survivors
 0       100
 6        64
16        40
26        25
36        16
46        10
56         6
66         3
76         1

Table 6.5: Graunt's mortality data.

Another early use of expected value was to decide the price of annuities. The study of statistics has its origins in the use of the bills of mortality kept in the parishes in London from 1603. These records kept a weekly tally of christenings and burials.
From these John Graunt made estimates for the population of London and also provided the first mortality data,[7] shown in Table 6.5.

As Hacking observes, Graunt apparently constructed this table by assuming that after the age of 6 there is a constant probability of about 5/8 of surviving for another decade.[8] For example, of the 64 people who survive to age 6, 5/8 of 64 or 40 survive to 16, 5/8 of these 40 or 25 survive to 26, and so forth. Of course, he rounded off his figures to the nearest whole person.

Clearly, a constant mortality rate cannot be correct throughout the whole range, and later tables provided by Halley were more realistic in this respect.[9]

[7] ibid., p. 108.
[8] ibid., p. 109.
[9] E. Halley, "An Estimate of The Degrees of Mortality of Mankind," Phil. Trans. Royal. Soc., vol. 17 (1693), pp. 596-610; 654-656.

A terminal annuity provides a fixed amount of money during a period of n years. To determine the price of a terminal annuity one needs only to know the appropriate interest rate. A life annuity provides a fixed amount during each year of the buyer's life. The appropriate price for a life annuity is the expected value of the terminal annuity evaluated for the random lifetime of the buyer. Thus, the work of Huygens in introducing expected value and the work of Graunt and Halley in determining mortality tables led to a more rational method for pricing annuities. This was one of the first serious uses of probability theory outside the gambling houses.

Although expected value plays a role now in every branch of science, it retains its importance in the casino. In 1962, Edward Thorp's book Beat the Dealer[10] provided the reader with a strategy for playing the popular casino game of blackjack that would assure the player a positive expected winning. This book forevermore changed the belief of the casinos that they could not be beat.

[10] E. Thorp, Beat the Dealer (New York: Random House, 1962).

Exercises

1 A card is drawn at random from a deck consisting of cards numbered 2 through 10. A player wins 1 dollar if the number on the card is odd and loses 1 dollar if the number is even. What is the expected value of his winnings?

2 A card is drawn at random from a deck of playing cards. If it is red, the player wins 1 dollar; if it is black, the player loses 2 dollars. Find the expected value of the game.

3 In a class there are 20 students: 3 are 5'6", 5 are 5'8", 4 are 5'10", 4 are 6', and 4 are 6'2". A student is chosen at random. What is the student's expected height?

4 In Las Vegas the roulette wheel has a 0 and a 00 and then the numbers 1 to 36 marked on equal slots; the wheel is spun and a ball stops randomly in one slot. When a player bets 1 dollar on a number, he receives 36 dollars if the ball stops on this number, for a net gain of 35 dollars; otherwise, he loses his dollar bet. Find the expected value for his winnings.

5 In a second version of roulette in Las Vegas, a player bets on red or black. Half of the numbers from 1 to 36 are red, and half are black. If a player bets a dollar on black, and if the ball stops on a black number, he gets his dollar back and another dollar. If the ball stops on a red number or on 0 or 00 he loses his dollar. Find the expected winnings for this bet.

6 A die is rolled twice. Let X denote the sum of the two numbers that turn up, and Y the difference of the numbers (specifically, the number on the first roll minus the number on the second). Show that E(XY) = E(X)E(Y). Are X and Y independent?

*7 Show that, if X and Y are random variables taking on only two values each, and if E(XY) = E(X)E(Y), then X and Y are independent.

8 A royal family has children until it has a boy or until it has three children, whichever comes first. Assume that each child is a boy with probability 1/2. Find the expected number of boys in this royal family and the expected number of girls.
9 If the first roll in a game of craps is neither a natural nor craps, the player can make an additional bet, equal to his original one, that he will make his point before a seven turns up. If his point is four or ten he is paid off at 2 : 1 odds; if it is a five or nine he is paid off at odds 3 : 2; and if it is a six or eight he is paid off at odds 6 : 5. Find the player's expected winnings if he makes this additional bet when he has the opportunity.

10 In Example 6.16 assume that Mr. Ace decides to buy the stock and hold it until it goes up 1 dollar and then sell and not buy again. Modify the program StockSystem to find the distribution of his profit under this system after a twenty-day period. Find the expected profit and the probability that he comes out ahead.

11 On September 26, 1980, the New York Times reported that a mysterious stranger strode into a Las Vegas casino, placed a single bet of 777,000 dollars on the "don't pass" line at the crap table, and walked away with more than 1.5 million dollars. In the "don't pass" bet, the bettor is essentially betting with the house. An exception occurs if the roller rolls a 12 on the first roll. In this case, the roller loses and the "don't pass" bettor just gets back the money bet instead of winning. Show that the "don't pass" bettor has a more favorable bet than the roller.

12 Recall that in the martingale doubling system (see Exercise 1.1.10), the player doubles his bet each time he loses. Suppose that you are playing roulette in a fair casino where there are no 0's, and you bet on red each time. You then win with probability 1/2 each time. Assume that you enter the casino with 100 dollars, start with a 1-dollar bet and employ the martingale system. You stop as soon as you have won one bet, or in the unlikely event that black turns up six times in a row so that you are down 63 dollars and cannot make the required 64-dollar bet. Find your expected winnings under this system of play.
13 You have 80 dollars and play the following game. An urn contains two white balls and two black balls. You draw the balls out one at a time without replacement until all the balls are gone. On each draw, you bet half of your present fortune that you will draw a white ball. What is your expected final fortune?

14 In the hat check problem (see Example 3.12), it was assumed that N people check their hats and the hats are handed back at random. Let $X_j = 1$ if the jth person gets his or her hat and 0 otherwise. Find $E(X_j)$ and $E(X_j \cdot X_k)$ for j not equal to k. Are $X_j$ and $X_k$ independent?

15 A box contains two gold balls and three silver balls. You are allowed to choose successively balls from the box at random. You win 1 dollar each time you draw a gold ball and lose 1 dollar each time you draw a silver ball. After a draw, the ball is not replaced. Show that, if you draw until you are ahead by 1 dollar or until there are no more gold balls, this is a favorable game.

16 Gerolamo Cardano in his book, The Gambling Scholar, written in the early 1500s, considers the following carnival game. There are six dice. Each of the dice has five blank sides. The sixth side has a number between 1 and 6, a different number on each die. The six dice are rolled and the player wins a prize depending on the total of the numbers which turn up.

(a) Find, as Cardano did, the expected total without finding its distribution.

(b) Large prizes were given for large totals with a modest fee to play the game. Explain why this could be done.

17 Let X be the first time that a failure occurs in an infinite sequence of Bernoulli trials with probability p for success. Let $p_k = P(X = k)$ for k = 1, 2, .... Show that $p_k = p^{k-1} q$ where q = 1 - p. Show that $\sum_k p_k = 1$. Show that E(X) = 1/q. What is the expected number of tosses of a coin required to obtain the first tail?

18 Exactly one of six similar keys opens a certain door.
If you try the keys, one after another, what is the expected number of keys that you will have to try before success?

19 A multiple choice exam is given. A problem has four possible answers, and exactly one answer is correct. The student is allowed to choose a subset of the four possible answers as his answer. If his chosen subset contains the correct answer, the student receives three points, but he loses one point for each wrong answer in his chosen subset. Show that if he just guesses a subset uniformly and randomly his expected score is zero.

20 You are offered the following game to play: a fair coin is tossed until heads turns up for the first time (see Example 6.3). If this occurs on the first toss you receive 2 dollars, if it occurs on the second toss you receive $2^2 = 4$ dollars and, in general, if heads turns up for the first time on the nth toss you receive $2^n$ dollars.

(a) Show that the expected value of your winnings does not exist (i.e., is given by a divergent sum) for this game. Does this mean that this game is favorable no matter how much you pay to play it?

(b) Assume that you only receive $2^{10}$ dollars if any number greater than or equal to ten tosses are required to obtain the first head. Show that your expected value for this modified game is finite and find its value.

(c) Assume that you pay 10 dollars for each play of the original game. Write a program to simulate 100 plays of the game and see how you do.

(d) Now assume that the utility of n dollars is $\sqrt{n}$. Write an expression for the expected utility of the payment, and show that this expression has a finite value. Estimate this value. Repeat this exercise for the case that the utility function is log(n).

21 Let X be a random variable which is Poisson distributed with parameter $\lambda$. Show that $E(X) = \lambda$. Hint: Recall that $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$.

22 Recall that in Exercise 1.1.14, we considered a town with two hospitals.
In the large hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. We were interested in guessing which hospital would have on the average the largest number of days with the property that more than 60 percent of the children born on that day are boys. For each hospital find the expected number of days in a year that have the property that more than 60 percent of the children born on that day were boys.

23 An insurance company has 1,000 policies on men of age 50. The company estimates that the probability that a man of age 50 dies within a year is .01. Estimate the number of claims that the company can expect from beneficiaries of these men within a year.

24 Using the life table for 1981 in Appendix C, write a program to compute the expected lifetime for males and females of each possible age from 1 to 85. Compare the results for males and females. Comment on whether life insurance should be priced differently for males and females.

*25 A deck of ESP cards consists of 20 cards each of two types: say ten stars, ten circles (normally there are five types). The deck is shuffled and the cards turned up one at a time. You, the alleged percipient, are to name the symbol on each card before it is turned up. Suppose that you are really just guessing at the cards. If you do not get to see each card after you have made your guess, then it is easy to calculate the expected number of correct guesses, namely ten. If, on the other hand, you are guessing with information, that is, if you see each card after your guess, then, of course, you might expect to get a higher score. This is indeed the case, but calculating the correct expectation is no longer easy. But it is easy to do a computer simulation of this guessing with information, so we can get a good idea of the expectation by simulation.
(This is similar to the way that skilled blackjack players make blackjack into a favorable game by observing the cards that have already been played. See Exercise 29.)

(a) First, do a simulation of guessing without information, repeating the experiment at least 1000 times. Estimate the expected number of correct answers and compare your result with the theoretical expectation.

(b) What is the best strategy for guessing with information?

(c) Do a simulation of guessing with information, using the strategy in (b). Repeat the experiment at least 1000 times, and estimate the expectation in this case.

(d) Let S be the number of stars and C the number of circles in the deck. Let h(S, C) be the expected winnings using the optimal guessing strategy in (b). Show that h(S, C) satisfies the recursion relation $h(S, C) = \frac{S}{S+C}\, h(S-1, C) + \frac{C}{S+C}\, h(S, C-1) + \frac{\max(S, C)}{S+C}$, and h(0, 0) = h(-1, 0) = h(0, -1) = 0. Using this relation, write a program to compute h(S, C) and find h(10, 10). Compare the computed value of h(10, 10) with the result of your simulation in (c). For more about this exercise and Exercise 26 see Diaconis and Graham.[11]

*26 Consider the ESP problem as described in Exercise 25. You are again guessing with information, and you are using the optimal guessing strategy of guessing star if the remaining deck has more stars, circle if more circles, and tossing a coin if the number of stars and circles are equal. Assume that $S \ge C$, where S is the number of stars and C the number of circles. We can plot the results of a typical game on a graph, where the horizontal axis represents the number of steps and the vertical axis represents the difference between the number of stars and the number of circles that have been turned up. A typical game is shown in Figure 6.6. In this particular game, the order in which the cards were turned up is (C, S, S, S, S, C, C, S, S, C). Thus, in this particular game, there were six stars and four circles in the deck.
This means, in particular, that every game played with this deck would have a graph which ends at the point (10, 2). We define the line L to be the horizontal line which goes through the ending point on the graph (so its vertical coordinate is just the difference between the number of stars and circles in the deck).

[Figure 6.6: Random walk for ESP.]

(a) Show that, when the random walk is below the line L, the player guesses right when the graph goes up (star is turned up) and, when the walk is above the line, the player guesses right when the walk goes down (circle turned up). Show from this property that the subject is sure to have at least S correct guesses.

(b) When the walk is at a point (x, x) on the line L the number of stars and circles remaining is the same, and so the subject tosses a coin. Show that the probability that the walk reaches (x, x) is $\frac{\binom{S}{x}\binom{C}{x}}{\binom{S+C}{2x}}$. Hint: The outcomes of 2x cards is a hypergeometric distribution (see Section 5.1).

(c) Using the results of (a) and (b) show that the expected number of correct guesses under intelligent guessing is $S + \sum_{x=1}^{C} \frac{1}{2} \frac{\binom{S}{x}\binom{C}{x}}{\binom{S+C}{2x}}$.

[11] P. Diaconis and R. Graham, "The Analysis of Sequential Experiments with Feedback to Subjects," Annals of Statistics, vol. 9 (1981), pp. 3-23.

27 It has been said[12] that a Dr. B. Muriel Bristol declined a cup of tea stating that she preferred a cup into which milk had been poured first. The famous statistician R. A. Fisher carried out a test to see if she could tell whether milk was put in before or after the tea. Assume that for the test Dr. Bristol was given eight cups of tea: four in which the milk was put in before the tea and four in which the milk was put in after the tea.

(a) What is the expected number of correct guesses the lady would make if she had no information after each test and was just guessing?
(b) Using the result of Exercise 26 find the expected number of correct guesses if she was told the result of each guess and used an optimal guessing strategy.

[12] J. F. Box, R. A. Fisher, The Life of a Scientist (New York: John Wiley and Sons, 1978).

28 In a popular computer game the computer picks an integer from 1 to n at random. The player is given k chances to guess the number. After each guess the computer responds "correct," "too small," or "too big."

(a) Show that if $n \le 2^k - 1$, then there is a strategy that guarantees you will correctly guess the number in k tries.

(b) Show that if $n \ge 2^k - 1$, there is a strategy that assures you of identifying one of $2^k - 1$ numbers and hence gives a probability of $(2^k - 1)/n$ of winning. Why is this an optimal strategy? Illustrate your result in terms of the case n = 9 and k = 3.

29 In the casino game of blackjack the dealer is dealt two cards, one face up and one face down, and each player is dealt two cards, both face down. If the dealer is showing an ace the player can look at his down cards and then make a bet called an insurance bet. (Expert players will recognize why it is called insurance.) If you make this bet you will win the bet if the dealer's second card is a ten card: namely, a ten, jack, queen, or king. If you win, you are paid twice your insurance bet; otherwise you lose this bet. Show that, if the only cards you can see are the dealer's ace and your two cards and if your cards are not ten cards, then the insurance bet is an unfavorable bet. Show, however, that if you are playing two hands simultaneously, and you have no ten cards, then it is a favorable bet. (Thorp[13] has shown that the game of blackjack is favorable to the player if he or she can keep good enough track of the cards that have been played.)

30 Assume that, every time you buy a box of Wheaties, you receive a picture of one of the n players for the New York Yankees (see Exercise 3.2.34).
Let $X_k$ be the number of additional boxes you have to buy, after you have obtained k - 1 different pictures, in order to obtain the next new picture. Thus $X_1 = 1$, $X_2$ is the number of boxes bought after this to obtain a picture different from the first picture obtained, and so forth.

(a) Show that $X_k$ has a geometric distribution with p = (n - k + 1)/n.

(b) Simulate the experiment for a team with 26 players (25 would be more accurate but we want an even number). Carry out a number of simulations and estimate the expected time required to get the first 13 players and the expected time to get the second 13. How do these expectations compare?

(c) Show that, if there are 2n players, the expected time to get the first half of the players is $2n\left(\frac{1}{2n} + \frac{1}{2n-1} + \cdots + \frac{1}{n+1}\right)$, and the expected time to get the second half is $2n\left(\frac{1}{n} + \frac{1}{n-1} + \cdots + 1\right)$.

(d) In Example 6.11 we stated that $1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} \sim \log n + .5772 + \frac{1}{2n}$. Use this to estimate the expression in (c). Compare these estimates with the exact values and also with your estimates obtained by simulation for the case n = 26.

[13] E. Thorp, Beat the Dealer (New York: Random House, 1962).

*31 (Feller[14]) A large number, N, of people are subjected to a blood test. This can be administered in two ways: (1) Each person can be tested separately; in this case N tests are required. (2) The blood samples of k persons can be pooled and analyzed together. If this test is negative, this one test suffices for the k people. If the test is positive, each of the k persons must be tested separately, and in all, k + 1 tests are required for the k people. Assume that the probability p that a test is positive is the same for all people and that these events are independent.

(a) Find the probability that the test for a pooled sample of k people will be positive.

(b) What is the expected value of the number X of tests necessary under plan (2)? (Assume that N is divisible by k.)
(c) For small p, show that the value of k which will minimize the expected number of tests under the second plan is approximately $1/\sqrt{p}$.

[14] W. Feller, Introduction to Probability Theory and Its Applications, 3rd ed., vol. 1 (New York: John Wiley and Sons, 1968), p. 240.

32 Write a program to add random numbers chosen from [0, 1] until the first time the sum is greater than one. Have your program repeat this experiment a number of times to estimate the expected number of selections necessary in order that the sum of the chosen numbers first exceeds 1. On the basis of your experiments, what is your estimate for this number?

*33 The following related discrete problem also gives a good clue for the answer to Exercise 32. Randomly select with replacement $t_1, t_2, \ldots, t_r$ from the set (1/n, 2/n, ..., n/n). Let X be the smallest value of r satisfying $t_1 + t_2 + \cdots + t_r > 1$. Then $E(X) = (1 + 1/n)^n$. To prove this, we can just as well choose $t_1, t_2, \ldots, t_r$ randomly with replacement from the set (1, 2, ..., n) and let X be the smallest value of r for which $t_1 + t_2 + \cdots + t_r > n$.

(a) Use Exercise 3.2.36 to show that $P(X \ge j + 1) = \binom{n}{j}\left(\frac{1}{n}\right)^j$.

(b) Show that $E(X) = \sum_{j=0}^{n} P(X \ge j + 1)$.

(c) From these two facts, find an expression for E(X). This proof is due to Harris Schultz.[15]

*34 (Banach's Matchbox[16]) A man carries in each of his two front pockets a box of matches originally containing N matches. Whenever he needs a match, he chooses a pocket at random and removes one from that box. One day he reaches into a pocket and finds the box empty.

(a) Let $p_r$ denote the probability that the other pocket contains r matches. Define a sequence of counter random variables as follows: Let $X_i = 1$ if the ith draw is from the left pocket, and 0 if it is from the right pocket. Interpret $p_r$ in terms of $S_n = X_1 + X_2 + \cdots + X_n$. Find a binomial expression for $p_r$.
(b) Write a computer program to compute the $p_r$, as well as the probability that the other pocket contains at least $r$ matches, for $N = 100$ and $r$ from 0 to 50.

(c) Show that $(N - r)p_r = (1/2)(2N + 1)p_{r+1} - (1/2)(r + 1)p_{r+1}$.

(d) Evaluate $\sum_r p_r$.

(e) Use (c) and (d) to determine the expectation $E$ of the distribution $\{p_r\}$.

(f) Use Stirling's formula to obtain an approximation for $E$. How many matches must each box contain to ensure a value of about 13 for the expectation $E$? (Take $\pi = 22/7$.)

35 A coin is tossed until the first time a head turns up. If this occurs on the $n$th toss and $n$ is odd you win $2^n/n$, but if $n$ is even then you lose $2^n/n$. Then if your expected winnings exist they are given by the convergent series
$$1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots$$
called the alternating harmonic series. It is tempting to say that this should be the expected value of the experiment. Show that if we were to do this, the expected value of an experiment would depend upon the order in which the outcomes are listed.

36 Suppose we have an urn containing $c$ yellow balls and $d$ green balls. We draw $k$ balls, without replacement, from the urn. Find the expected number of yellow balls drawn. Hint: Write the number of yellow balls drawn as the sum of $c$ random variables.

[15] H. Schultz, "An Expected Value Problem," Two-Year Mathematics Journal, vol. 10, no. 4 (1979), pp. 277-78.
[16] W. Feller, Introduction to Probability Theory, vol. 1, p. 166.

37 The reader is referred to Example 6.13 for an explanation of the various options available in Monte Carlo roulette.

(a) Compute the expected winnings of a 1 franc bet on red under option (a).

(b) Repeat part (a) for option (b).

(c) Compare the expected winnings for all three options.

*38 (from Pittel [17]) Telephone books, $n$ in number, are kept in a stack. The probability that the book numbered $i$ (where $1 \le i \le n$) is consulted for a given phone call is $p_i > 0$, where the $p_i$'s sum to 1.
After a book is used, it is placed at the top of the stack. Assume that the calls are independent and evenly spaced, and that the system has been employed indefinitely far into the past. Let $d_i$ be the average depth of book $i$ in the stack. Show that $d_i \le d_j$ whenever $p_i \ge p_j$. Thus, on the average, the more popular books have a tendency to be closer to the top of the stack. Hint: Let $p_{ij}$ denote the probability that book $i$ is above book $j$. Show that $p_{ij} = p_{ij}(1 - p_j) + p_{ji}p_i$.

*39 (from Propp [18]) In the previous problem, let $P$ be the probability that at the present time, each book is in its proper place, i.e., book $i$ is $i$th from the top. Find a formula for $P$ in terms of the $p_i$'s. In addition, find the least upper bound on $P$, if the $p_i$'s are allowed to vary. Hint: First find the probability that book 1 is in the right place. Then find the probability that book 2 is in the right place, given that book 1 is in the right place. Continue.

*40 (from H. Shultz and B. Leonard [19]) A sequence of random numbers in $[0, 1)$ is generated until the sequence is no longer monotone increasing. The numbers are chosen according to the uniform distribution. What is the expected length of the sequence? (In calculating the length, the term that destroys monotonicity is included.) Hint: Let $a_1, a_2, \ldots$ be the sequence and let $X$ denote the length of the sequence. Then
$$P(X > k) = P(a_1 < a_2 < \cdots < a_k),$$
and the probability on the right-hand side is easy to calculate. Furthermore, one can show that
$$E(X) = 1 + P(X > 1) + P(X > 2) + \cdots.$$

41 Let $T$ be the random variable that counts the number of 2-unshuffles performed on an $n$-card deck until all of the labels on the cards are distinct. This random variable was discussed in Section 3.3. Using Equation 3.4 in that section, together with the formula
$$E(T) = \sum_{s=0}^{\infty} P(T > s)$$
that was proved in Exercise 33, show that
$$E(T) = \sum_{s=0}^{\infty}\left(1 - \binom{2^s}{n}\frac{n!}{2^{sn}}\right).$$
Show that for $n = 52$, this expression is approximately equal to 11.7. (As was stated in Chapter 3, this means that on the average, almost 12 riffle shuffles of a 52-card deck are required in order for the process to be considered random.)

[17] B. Pittel, Problem #1195, Mathematics Magazine, vol. 58, no. 3 (May 1985), pg. 183.
[18] J. Propp, Problem #1159, Mathematics Magazine, vol. 57, no. 1 (Feb. 1984), pg. 50.
[19] H. Shultz and B. Leonard, "Unexpected Occurrences of the Number e," Mathematics Magazine, vol. 62, no. 4 (October, 1989), pp. 269-271.

6.2 Variance of Discrete Random Variables

The usefulness of the expected value as a prediction for the outcome of an experiment is increased when the outcome is not likely to deviate too much from the expected value. In this section we shall introduce a measure of this deviation, called the variance.

Variance

Definition 6.3 Let $X$ be a numerically valued random variable with expected value $\mu = E(X)$. Then the variance of $X$, denoted by $V(X)$, is
$$V(X) = E((X - \mu)^2).$$ □

Note that, by Theorem 6.1, $V(X)$ is given by
$$V(X) = \sum_x (x - \mu)^2 m(x), \qquad (6.1)$$
where $m$ is the distribution function of $X$.

Standard Deviation

The standard deviation of $X$, denoted by $D(X)$, is $D(X) = \sqrt{V(X)}$. We often write $\sigma$ for $D(X)$ and $\sigma^2$ for $V(X)$.

Example 6.17 Consider one roll of a die. Let $X$ be the number that turns up. To find $V(X)$, we must first find the expected value of $X$. This is
$$\mu = E(X) = 1\left(\frac16\right) + 2\left(\frac16\right) + 3\left(\frac16\right) + 4\left(\frac16\right) + 5\left(\frac16\right) + 6\left(\frac16\right) = \frac72.$$
To find the variance of $X$, we form the new random variable $(X - \mu)^2$ and compute its expectation. We can easily do this using the following table.

x    m(x)   (x - 7/2)^2
1    1/6    25/4
2    1/6    9/4
3    1/6    1/4
4    1/6    1/4
5    1/6    9/4
6    1/6    25/4

Table 6.6: Variance calculation.

From this table we find $E((X - \mu)^2)$ is
$$V(X) = \frac16\left(\frac{25}{4} + \frac94 + \frac14 + \frac14 + \frac94 + \frac{25}{4}\right) = \frac{35}{12},$$
and the standard deviation $D(X) = \sqrt{35/12} \approx 1.707$. □

Calculation of Variance

We next prove a theorem that gives us a useful alternative form for computing the variance.

Theorem 6.6 If $X$ is any random variable with $E(X) = \mu$, then
$$V(X) = E(X^2) - \mu^2.$$
Proof. We have
$$V(X) = E((X - \mu)^2) = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2.$$ □

Using Theorem 6.6, we can compute the variance of the outcome of a roll of a die by first computing
$$E(X^2) = 1\left(\frac16\right) + 4\left(\frac16\right) + 9\left(\frac16\right) + 16\left(\frac16\right) + 25\left(\frac16\right) + 36\left(\frac16\right) = \frac{91}{6},$$
and
$$V(X) = E(X^2) - \mu^2 = \frac{91}{6} - \left(\frac72\right)^2 = \frac{35}{12},$$
in agreement with the value obtained directly from the definition of $V(X)$.

Properties of Variance

The variance has properties very different from those of the expectation. If $c$ is any constant, $E(cX) = cE(X)$ and $E(X + c) = E(X) + c$. These two statements imply that the expectation is a linear function. However, the variance is not linear, as seen in the next theorem.

Theorem 6.7 If $X$ is any random variable and $c$ is any constant, then
$$V(cX) = c^2 V(X)$$
and
$$V(X + c) = V(X).$$

Proof. Let $\mu = E(X)$. Then $E(cX) = c\mu$, and
$$V(cX) = E((cX - c\mu)^2) = E(c^2(X - \mu)^2) = c^2 E((X - \mu)^2) = c^2 V(X).$$
To prove the second assertion, we note that, to compute $V(X + c)$, we would replace $x$ by $x + c$ and $\mu$ by $\mu + c$ in Equation 6.1. Then the $c$'s would cancel, leaving $V(X)$. □

We turn now to some general properties of the variance. Recall that if $X$ and $Y$ are any two random variables, $E(X + Y) = E(X) + E(Y)$. This is not always true for the case of the variance. For example, let $X$ be a random variable with $V(X) \ne 0$, and define $Y = -X$. Then $V(X) = V(Y)$, so that $V(X) + V(Y) = 2V(X)$. But $X + Y$ is always 0 and hence has variance 0. Thus $V(X + Y) \ne V(X) + V(Y)$.

In the important case of mutually independent random variables, however, the variance of the sum is the sum of the variances.

Theorem 6.8 Let $X$ and $Y$ be two independent random variables. Then
$$V(X + Y) = V(X) + V(Y).$$

Proof. Let $E(X) = a$ and $E(Y) = b$. Then
$$V(X + Y) = E((X + Y)^2) - (a + b)^2 = E(X^2) + 2E(XY) + E(Y^2) - a^2 - 2ab - b^2.$$
Since $X$ and $Y$ are independent, $E(XY) = E(X)E(Y) = ab$. Thus,
$$V(X + Y) = E(X^2) - a^2 + E(Y^2) - b^2 = V(X) + V(Y).$$ □
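The die calculation and Theorems 6.6 through 6.8 can be checked with exact rational arithmetic. The following is a minimal sketch rather than a program from the text: the helper names `expectation` and `variance`, and the choice to store a distribution as a dictionary from values to probabilities, are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def expectation(dist, f=lambda x: x):
    """E(f(X)) for a finite distribution stored as {value: probability}."""
    return sum(f(x) * p for x, p in dist.items())

def variance(dist):
    """V(X) computed directly from the definition E((X - mu)^2)."""
    mu = expectation(dist)
    return expectation(dist, lambda x: (x - mu) ** 2)

# One roll of a fair die.
die = {x: Fraction(1, 6) for x in range(1, 7)}

mu = expectation(die)
v_definition = variance(die)
v_shortcut = expectation(die, lambda x: x * x) - mu ** 2  # Theorem 6.6

# Theorem 6.7 with c = 3: the distribution of 3X.
tripled = {3 * x: p for x, p in die.items()}

# Theorem 6.8: distribution of X + Y for two independent rolls.
two_dice = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + px * py

print(mu, v_definition, v_shortcut)  # 7/2 35/12 35/12
print(variance(tripled))             # 105/4, which is 9 * 35/12
print(variance(two_dice))            # 35/6, which is 35/12 + 35/12
```

Using `Fraction` keeps every value exact, so the agreement between the definition and the shortcut formula is an equality rather than an approximate floating-point match.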
It is easy to extend this proof, by mathematical induction, to show that the variance of the sum of any number of mutually independent random variables is the sum of the individual variances. Thus we have the following theorem.

Theorem 6.9 Let $X_1, X_2, \ldots, X_n$ be an independent trials process with $E(X_j) = \mu$ and $V(X_j) = \sigma^2$. Let
$$S_n = X_1 + X_2 + \cdots + X_n$$
be the sum, and
$$A_n = \frac{S_n}{n}$$
be the average. Then
$$E(S_n) = n\mu, \qquad V(S_n) = n\sigma^2, \qquad \sigma(S_n) = \sigma\sqrt{n},$$
$$E(A_n) = \mu, \qquad V(A_n) = \frac{\sigma^2}{n}, \qquad \sigma(A_n) = \frac{\sigma}{\sqrt{n}}.$$

Proof. Since all the random variables $X_j$ have the same expected value, we have
$$E(S_n) = E(X_1) + \cdots + E(X_n) = n\mu,$$
$$V(S_n) = V(X_1) + \cdots + V(X_n) = n\sigma^2,$$
and
$$\sigma(S_n) = \sigma\sqrt{n}.$$
We have seen that, if we multiply a random variable $X$ with mean $\mu$ and variance $\sigma^2$ by a constant $c$, the new random variable has expected value $c\mu$ and variance $c^2\sigma^2$. Thus,
$$E(A_n) = E\left(\frac{S_n}{n}\right) = \frac{n\mu}{n} = \mu,$$
and
$$V(A_n) = V\left(\frac{S_n}{n}\right) = \frac{V(S_n)}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
Finally, the standard deviation of $A_n$ is given by
$$\sigma(A_n) = \frac{\sigma}{\sqrt{n}}.$$ □

[Figure 6.7: Empirical distribution of $A_n$ (panels for $n = 10$ and $n = 100$).]

The last equation in the above theorem implies that in an independent trials process, if the individual summands have finite variance, then the standard deviation of the average goes to 0 as $n \to \infty$. Since the standard deviation tells us something about the spread of the distribution around the mean, we see that for large values of $n$, the value of $A_n$ is usually very close to the mean of $A_n$, which equals $\mu$, as shown above. This statement is made precise in Chapter 8, where it is called the Law of Large Numbers. For example, let $X$ represent the roll of a fair die. In Figure 6.7, we show the distribution of a random variable $A_n$ corresponding to $X$, for $n = 10$ and $n = 100$.

Example 6.18 Consider $n$ rolls of a die. We have seen that, if $X_j$ is the outcome of the $j$th roll, then $E(X_j) = 7/2$ and $V(X_j) = 35/12$.
Thus, if $S_n$ is the sum of the outcomes, and $A_n = S_n/n$ is the average of the outcomes, we have $E(A_n) = 7/2$ and $V(A_n) = (35/12)/n$. Therefore, as $n$ increases, the expected value of the average remains constant, but the variance tends to 0. If the variance is a measure of the expected deviation from the mean this would indicate that, for large $n$, we can expect the average to be very near the expected value. This is in fact the case, and we shall justify it in Chapter 8. □

Bernoulli Trials

Consider next the general Bernoulli trials process. As usual, we let $X_j = 1$ if the $j$th outcome is a success and 0 if it is a failure. If $p$ is the probability of a success, and $q = 1 - p$, then
$$E(X_j) = 0q + 1p = p,$$
$$E(X_j^2) = 0^2 q + 1^2 p = p,$$
and
$$V(X_j) = E(X_j^2) - (E(X_j))^2 = p - p^2 = pq.$$
Thus, for Bernoulli trials, if $S_n = X_1 + X_2 + \cdots + X_n$ is the number of successes, then $E(S_n) = np$, $V(S_n) = npq$, and $D(S_n) = \sqrt{npq}$. If $A_n = S_n/n$ is the average number of successes, then $E(A_n) = p$, $V(A_n) = pq/n$, and $D(A_n) = \sqrt{pq/n}$. We see that the expected proportion of successes remains $p$ and the variance tends to 0. This suggests that the frequency interpretation of probability is a correct one. We shall make this more precise in Chapter 8.

Example 6.19 Let $T$ denote the number of trials until the first success in a Bernoulli trials process. Then $T$ is geometrically distributed. What is the variance of $T$? In Example 4.15, we saw that
$$m_T = \begin{pmatrix} 1 & 2 & 3 & \cdots \\ p & qp & q^2p & \cdots \end{pmatrix}.$$
In Example 6.4, we showed that $E(T) = 1/p$. Thus,
$$V(T) = E(T^2) - 1/p^2,$$
so we need only find
$$E(T^2) = 1p + 4qp + 9q^2p + \cdots = p(1 + 4q + 9q^2 + \cdots).$$
To evaluate this sum, we start again with
$$1 + x + x^2 + \cdots = \frac{1}{1 - x}.$$
Differentiating, we obtain
$$1 + 2x + 3x^2 + \cdots = \frac{1}{(1 - x)^2}.$$
Multiplying by $x$,
$$x + 2x^2 + 3x^3 + \cdots = \frac{x}{(1 - x)^2}.$$
Differentiating again gives
$$1 + 4x + 9x^2 + \cdots = \frac{1 + x}{(1 - x)^3}.$$
Thus,
$$E(T^2) = p\,\frac{1 + q}{(1 - q)^3} = \frac{1 + q}{p^2}$$
and
$$V(T) = E(T^2) - (E(T))^2 = \frac{1 + q}{p^2} - \frac{1}{p^2} = \frac{q}{p^2}.$$
For example, the variance for the number of tosses of a coin until the first head turns up is $(1/2)/(1/2)^2 = 2$.
The variance for the number of rolls of a die until the first six turns up is $(5/6)/(1/6)^2 = 30$. Note that, as $p$ decreases, the variance increases rapidly. This corresponds to the increased spread of the geometric distribution as $p$ decreases (noted in Figure 5.1). □

Poisson Distribution

Just as in the case of expected values, it is easy to guess the variance of the Poisson distribution with parameter $\lambda$. We recall that the variance of a binomial distribution with parameters $n$ and $p$ equals $npq$. We also recall that the Poisson distribution could be obtained as a limit of binomial distributions, if $n$ goes to $\infty$ and $p$ goes to 0 in such a way that their product is kept fixed at the value $\lambda$. In this case, $npq = \lambda q$ approaches $\lambda$, since $q$ goes to 1. So, given a Poisson distribution with parameter $\lambda$, we should guess that its variance is $\lambda$. The reader is asked to show this in Exercise 29.

Exercises

1 A number is chosen at random from the set $S = \{-1, 0, 1\}$. Let $X$ be the number chosen. Find the expected value, variance, and standard deviation of $X$.

2 A random variable $X$ has the distribution
$$p_X = \begin{pmatrix} 0 & 1 & 2 & 4 \\ 1/3 & 1/3 & 1/6 & 1/6 \end{pmatrix}.$$
Find the expected value, variance, and standard deviation of $X$.

3 You place a 1-dollar bet on the number 17 at Las Vegas, and your friend places a 1-dollar bet on black (see Exercises 1.1.6 and 1.1.7). Let $X$ be your winnings and $Y$ be her winnings. Compare $E(X)$, $E(Y)$, and $V(X)$, $V(Y)$. What do these computations tell you about the nature of your winnings if you and your friend make a sequence of bets, with you betting each time on a number and your friend betting on a color?

4 $X$ is a random variable with $E(X) = 100$ and $V(X) = 15$. Find
(a) $E(X^2)$.
(b) $E(3X + 10)$.
(c) $E(-X)$.
(d) $V(-X)$.
(e) $D(-X)$.

5 In a certain manufacturing process, the (Fahrenheit) temperature never varies by more than $2^\circ$ from $62^\circ$.
The temperature is, in fact, a random variable $F$ with distribution
$$p_F = \begin{pmatrix} 60 & 61 & 62 & 63 & 64 \\ 1/10 & 2/10 & 4/10 & 2/10 & 1/10 \end{pmatrix}.$$
(a) Find $E(F)$ and $V(F)$.
(b) Define $T = F - 62$. Find $E(T)$ and $V(T)$, and compare these answers with those in part (a).
(c) It is decided to report the temperature readings on a Celsius scale, that is, $C = (5/9)(F - 32)$. What is the expected value and variance for the readings now?

6 Write a computer program to calculate the mean and variance of a distribution which you specify as data. Use the program to compare the variances for the following densities, both having expected value 0:
$$p_X = \begin{pmatrix} -2 & -1 & 0 & 1 & 2 \\ 3/11 & 2/11 & 1/11 & 2/11 & 3/11 \end{pmatrix};$$
$$p_Y = \begin{pmatrix} -2 & -1 & 0 & 1 & 2 \\ 1/11 & 2/11 & 5/11 & 2/11 & 1/11 \end{pmatrix}.$$

7 A coin is tossed three times. Let $X$ be the number of heads that turn up. Find $V(X)$ and $D(X)$.

8 A random sample of 2400 people are asked if they favor a government proposal to develop new nuclear power plants. If 40 percent of the people in the country are in favor of this proposal, find the expected value and the standard deviation for the number $S_{2400}$ of people in the sample who favored the proposal.

9 A die is loaded so that the probability of a face coming up is proportional to the number on that face. The die is rolled with outcome $X$. Find $V(X)$ and $D(X)$.

10 Prove the following facts about the standard deviation.
(a) $D(X + c) = D(X)$.
(b) $D(cX) = |c|D(X)$.

11 A number is chosen at random from the integers $1, 2, 3, \ldots, n$. Let $X$ be the number chosen. Show that $E(X) = (n + 1)/2$ and $V(X) = (n - 1)(n + 1)/12$. Hint: The following identity may be useful:
$$1^2 + 2^2 + \cdots + n^2 = \frac{(n)(n + 1)(2n + 1)}{6}.$$

12 Let $X$ be a random variable with $\mu = E(X)$ and $\sigma^2 = V(X)$. Define $X^* = (X - \mu)/\sigma$. The random variable $X^*$ is called the standardized random variable associated with $X$. Show that this standardized random variable has expected value 0 and variance 1.

13 Peter and Paul play Heads or Tails (see Example 1.4). Let $W_n$ be Peter's winnings after $n$ matches. Show that $E(W_n) = 0$ and $V(W_n) = n$.
14 Find the expected value and the variance for the number of boys and the number of girls in a royal family that has children until there is a boy or until there are three children, whichever comes first.

15 Suppose that $n$ people have their hats returned at random. Let $X_i = 1$ if the $i$th person gets his or her own hat back and 0 otherwise. Let $S_n = \sum_{i=1}^{n} X_i$. Then $S_n$ is the total number of people who get their own hats back. Show that
(a) $E(X_i^2) = 1/n$.
(b) $E(X_i \cdot X_j) = 1/n(n - 1)$ for $i \ne j$.
(c) $E(S_n^2) = 2$ (using (a) and (b)).
(d) $V(S_n) = 1$.

16 Let $S_n$ be the number of successes in $n$ independent trials. Use the program BinomialProbabilities (Section 3.2) to compute, for given $n$, $p$, and $j$, the probability $P(-j\sqrt{npq}$ ...

... 0. □

Example 6.27 Let $Z$ be a standard normal random variable with density function
$$f_Z(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$
Since this density function is symmetric with respect to the $y$-axis, it is easy to show that
$$\int_{-\infty}^{\infty} x f_Z(x)\,dx$$
has value 0. The reader should recall, however, that the expectation is defined to be the above integral only if the integral
$$\int_{-\infty}^{\infty} |x| f_Z(x)\,dx$$
is finite, which one can easily show is the case. Thus, the expected value of $Z$ is 0.

To calculate the variance of $Z$, we begin by applying Theorem 6.15:
$$V(Z) = \int_{-\infty}^{+\infty} x^2 f_Z(x)\,dx - \mu^2.$$
If we write $x^2$ as $x \cdot x$, and integrate by parts, we obtain
$$\frac{1}{\sqrt{2\pi}}\left(-x e^{-x^2/2}\right)\Big|_{-\infty}^{+\infty} + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{-x^2/2}\,dx.$$
The first summand above can be shown to equal 0, since as $x \to \pm\infty$, $e^{-x^2/2}$ gets small more quickly than $x$ gets large. The second summand is just the standard normal density integrated over its domain, so the value of this summand is 1. Therefore, the variance of the standard normal density equals 1.

Now let $X$ be a (not necessarily standard) normal random variable with parameters $\mu$ and $\sigma$. Then the density function of $X$ is
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2/2\sigma^2}.$$
We can write $X = \sigma Z + \mu$, where $Z$ is a standard normal random variable.
Since $E(Z) = 0$ and $V(Z) = 1$ by the calculation above, Theorems 6.10 and 6.14 imply that
$$E(X) = E(\sigma Z + \mu) = \mu,$$
$$V(X) = V(\sigma Z + \mu) = \sigma^2.$$ □

Example 6.28 Let $X$ be a continuous random variable with the Cauchy density function
$$f_X(x) = \frac{a}{\pi}\,\frac{1}{a^2 + x^2}.$$
Then the expectation of $X$ does not exist, because the integral
$$\frac{a}{\pi}\int_{-\infty}^{+\infty} \frac{|x|\,dx}{a^2 + x^2}$$
diverges. Thus the variance of $X$ also fails to exist. Densities whose variance is not defined, like the Cauchy density, behave quite differently in a number of important respects from those whose variance is finite. We shall see one instance of this difference in Section 8.2. □

Independent Trials

Corollary 6.1 If $X_1, X_2, \ldots, X_n$ is an independent trials process of real-valued random variables, with $E(X_i) = \mu$ and $V(X_i) = \sigma^2$, and if
$$S_n = X_1 + X_2 + \cdots + X_n, \qquad A_n = \frac{S_n}{n},$$
then
$$E(S_n) = n\mu, \qquad E(A_n) = \mu, \qquad V(S_n) = n\sigma^2, \qquad V(A_n) = \frac{\sigma^2}{n}.$$
It follows that if we set
$$S_n^* = \frac{S_n - n\mu}{\sqrt{n\sigma^2}},$$
then
$$E(S_n^*) = 0, \qquad V(S_n^*) = 1.$$
We say that $S_n^*$ is a standardized version of $S_n$ (see Exercise 12 in Section 6.2). □

Queues

Example 6.29 Let us consider again the queueing problem, that is, the problem of the customers waiting in a queue for service (see Example 5.7). We suppose again that customers join the queue in such a way that the time between arrivals is an exponentially distributed random variable $X$ with density function
$$f_X(t) = \lambda e^{-\lambda t}.$$
Then the expected value of the time between arrivals is simply $1/\lambda$ (see Example 6.26), as was stated in Example 5.7. The reciprocal $\lambda$ of this expected value is often referred to as the arrival rate. The service time of an individual who is first in line is defined to be the amount of time that the person stays at the head of the line before leaving. We suppose that the customers are served in such a way that the service time is another exponentially distributed random variable $Y$ with density function
$$f_Y(t) = \mu e^{-\mu t}.$$
Then the expected value of the service time is
$$E(Y) = \int_0^{\infty} t f_Y(t)\,dt = \frac{1}{\mu}.$$
The reciprocal $\mu$ of this expected value is often referred to as the service rate.

We expect on grounds of our everyday experience with queues that if the service rate is greater than the arrival rate, then the average queue size will tend to stabilize, but if the service rate is less than the arrival rate, then the queue will tend to increase in length without limit (see Figure 5.7). The simulations in Example 5.7 tend to bear out our everyday experience. We can make this conclusion more precise if we introduce the traffic intensity as the product
$$\rho = (\text{arrival rate})(\text{average service time}) = \frac{\lambda}{\mu} = \frac{1/\mu}{1/\lambda}.$$
The traffic intensity is also the ratio of the average service time to the average time between arrivals. If the traffic intensity is less than 1 the queue will perform reasonably, but if it is greater than 1 the queue will grow indefinitely large. In the critical case of $\rho = 1$, it can be shown that the queue will become large but there will always be times at which the queue is empty [22].

In the case that the traffic intensity is less than 1 we can consider the length of the queue as a random variable $Z$ whose expected value is finite,
$$E(Z) = N.$$
The time spent in the queue by a single customer can be considered as a random variable $W$ whose expected value is finite,
$$E(W) = T.$$
Then we can argue that, when a customer joins the queue, he expects to find $N$ people ahead of him, and when he leaves the queue, he expects to find $\lambda T$ people behind him. Since, in equilibrium, these should be the same, we would expect to find that
$$N = \lambda T.$$
This last relationship is called Little's law for queues [23]. We will not prove it here. A proof may be found in Ross [24]. Note that in this case we are counting the waiting time of all customers, even those that do not have to wait at all. In our simulation in Section 4.2, we did not consider these customers.
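The bookkeeping behind the traffic intensity and Little's law can be written out in a few lines. This is only an illustrative sketch: the function names are hypothetical, the rates (arrival rate 1, service rate 1.1) are the ones quoted for the simulation later in this example, and the expected queue length of 10 passed to the second function is used here simply as an input value.

```python
def traffic_intensity(arrival_rate, service_rate):
    """rho = (arrival rate) * (average service time) = lambda / mu."""
    return arrival_rate / service_rate

def waiting_time_from_queue_length(mean_queue_length, arrival_rate):
    """Little's law N = lambda * T, solved for the expected time T."""
    return mean_queue_length / arrival_rate

rho = traffic_intensity(arrival_rate=1.0, service_rate=1.1)
print(rho < 1)  # True: the queue is expected to stabilize

# With an expected queue length N = 10 and arrival rate 1, T = N / lambda.
print(waiting_time_from_queue_length(10.0, 1.0))  # 10.0
```

Keeping the two relations as separate functions mirrors the text: the traffic intensity decides whether an equilibrium exists at all, and Little's law then converts between expected queue length and expected time in the queue.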
If we knew the expected queue length then we could use Little's law to obtain the expected waiting time, since
$$T = \frac{N}{\lambda}.$$
The queue length is a random variable with a discrete distribution. We can estimate this distribution by simulation, keeping track of the queue lengths at the times at which a customer arrives. We show the result of this simulation (using the program Queue) in Figure 6.8.

[22] L. Kleinrock, Queueing Systems, vol. 2 (New York: John Wiley and Sons, 1975).
[23] ibid., p. 17.
[24] S. M. Ross, Applied Probability Models with Optimization Applications (San Francisco: Holden-Day, 1970).

[Figure 6.8: Distribution of queue lengths.]

We note that the distribution appears to be a geometric distribution. In the study of queueing theory it is shown that the distribution for the queue length in equilibrium is indeed a geometric distribution with
$$s_j = (1 - \rho)\rho^j \quad \text{for } j = 0, 1, 2, \ldots,$$
if $\rho < 1$. The expected value of a random variable with this distribution is
$$N = \frac{\rho}{1 - \rho}$$
(see Example 6.4). Thus by Little's result the expected waiting time is
$$T = \frac{\rho}{\lambda(1 - \rho)} = \frac{1}{\mu - \lambda},$$
where $\mu$ is the service rate, $\lambda$ the arrival rate, and $\rho$ the traffic intensity.

In our simulation, the arrival rate is 1 and the service rate is 1.1. Thus, the traffic intensity is $1/1.1 = 10/11$, the expected queue size is
$$\frac{10/11}{1 - 10/11} = 10,$$
and the expected waiting time is
$$\frac{1}{1.1 - 1} = 10.$$
In our simulation the average queue size was 8.19 and the average waiting time was 7.37. In Figure 6.9, we show the histogram for the waiting times. This histogram suggests that the density for the waiting times is exponential with parameter $\mu - \lambda$, and this is the case. □

[Figure 6.9: Distribution of queue waiting times.]

Exercises

1 Let $X$ be a random variable with range $[-1, 1]$ and let $f_X(x)$ be the density function of $X$.
Find $\mu(X)$ and $\sigma^2(X)$ if, for $|x| < 1$,
(a) $f_X(x) = 1/2$.
(b) $f_X(x) = |x|$.
(c) $f_X(x) = 1 - |x|$.
(d) $f_X(x) = (3/2)x^2$.

2 Let $X$ be a random variable with range $[-1, 1]$ and $f_X$ its density function. Find $\mu(X)$ and $\sigma^2(X)$ if, for $|x| > 1$, $f_X(x) = 0$, and for $|x| < 1$,
(a) $f_X(x) = (3/4)(1 - x^2)$.
(b) $f_X(x) = (\pi/4)\cos(\pi x/2)$.
(c) $f_X(x) = (x + 1)/2$.
(d) $f_X(x) = (3/8)(x + 1)^2$.

3 The lifetime, measured in hours, of the ACME super light bulb is a random variable $T$ with density function $f_T(t) = \lambda^2 t e^{-\lambda t}$, where $\lambda = .05$. What is the expected lifetime of this light bulb? What is its variance?

4 Let $X$ be a random variable with range $[-1, 1]$ and density function $f_X(x) = ax + b$ if $|x| < 1$.
(a) Show that if $\int_{-1}^{+1} f_X(x)\,dx = 1$, then $b = 1/2$.
(b) Show that if $f_X(x) \ge 0$, then $-1/2 \le a \le 1/2$.
(c) Show that $\mu = (2/3)a$, and hence that $-1/3 \le \mu \le 1/3$.
(d) Show that $\sigma^2(X) = (2/3)b - (4/9)a^2 = 1/3 - (4/9)a^2$.

5 Let $X$ be a random variable with range $[-1, 1]$ and density function $f_X(x) = ax^2 + bx + c$ if $|x| < 1$ and 0 otherwise.
(a) Show that $2a/3 + 2c = 1$ (see Exercise 4).
(b) Show that $2b/3 = \mu(X)$.
(c) Show that $2a/5 + 2c/3 = \sigma^2(X)$.
(d) Find $a$, $b$, and $c$ if $\mu(X) = 0$, $\sigma^2(X) = 1/15$, and sketch the graph of $f_X$.
(e) Find $a$, $b$, and $c$ if $\mu(X) = 0$, $\sigma^2(X) = 1/2$, and sketch the graph of $f_X$.

6 Let $T$ be a random variable with range $[0, \infty]$ and $f_T$ its density function. Find $\mu(T)$ and $\sigma^2(T)$ if, for $t < 0$, $f_T(t) = 0$, and for $t > 0$,
(a) $f_T(t) = 3e^{-3t}$.
(b) $f_T(t) = 9te^{-3t}$.
(c) $f_T(t) = 3/(1 + t)^4$.

7 Let $X$ be a random variable with density function $f_X$. Show, using elementary calculus, that the function
$$\phi(a) = E((X - a)^2)$$
takes its minimum value when $a = \mu(X)$, and in that case $\phi(a) = \sigma^2(X)$.

8 Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. Let $Y = aX^2 + bX + c$. Find the expected value of $Y$.

9 Let $X$, $Y$, and $Z$ be independent random variables, each with mean $\mu$ and variance $\sigma^2$.
(a) Find the expected value and variance of $S = X + Y + Z$.
(b) Find the expected value and variance of $A = (1/3)(X + Y + Z)$.
(c) Find the expected value of $S^2$ and $A^2$.

10 Let $X$ and $Y$ be independent random variables with uniform density functions on $[0, 1]$. Find
(a) $E(|X - Y|)$.
(b) $E(\max(X, Y))$.
(c) $E(\min(X, Y))$.
(d) $E(X^2 + Y^2)$.
(e) $E((X + Y)^2)$.

11 The Pilsdorff Beer Company runs a fleet of trucks along the 100 mile road from Hangtown to Dry Gulch. The trucks are old, and are apt to break down at any point along the road with equal probability. Where should the company locate a garage so as to minimize the expected distance from a typical breakdown to the garage? In other words, if $X$ is a random variable giving the location of the breakdown, measured, say, from Hangtown, and $b$ gives the location of the garage, what choice of $b$ minimizes $E(|X - b|)$? Now suppose $X$ is not distributed uniformly over $[0, 100]$, but instead has density function $f_X(x) = 2x/10{,}000$. Then what choice of $b$ minimizes $E(|X - b|)$?

12 Find $E(X^Y)$, where $X$ and $Y$ are independent random variables which are uniform on $[0, 1]$. Then verify your answer by simulation.

13 Let $X$ be a random variable that takes on nonnegative values and has distribution function $F(x)$. Show that
$$E(X) = \int_0^{\infty} (1 - F(x))\,dx.$$
Hint: Integrate by parts. Illustrate this result by calculating $E(X)$ by this method if $X$ has an exponential distribution $F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$, and $F(x) = 0$ otherwise.

14 Let $X$ be a continuous random variable with density function $f_X(x)$. Show that if
$$\int_{-\infty}^{+\infty} x^2 f_X(x)\,dx < \infty,$$
then
$$\int_{-\infty}^{+\infty} |x| f_X(x)\,dx < \infty.$$
Hint: Except on the interval $[-1, 1]$, the first integrand is greater than the second integrand.

15 Let $X$ be a random variable distributed uniformly over $[0, 20]$. Define a new random variable $Y$ by $Y = \lfloor X \rfloor$ (the greatest integer in $X$). Find the expected value of $Y$. Do the same for $Z = \lfloor X + .5 \rfloor$. Compute $E(|X - Y|)$ and $E(|X - Z|)$. (Note that $Y$ is the value of $X$ rounded off to the nearest smallest integer, while $Z$ is the value of $X$ rounded off to the nearest integer. Which method of rounding off is better? Why?)
16 Assume that the lifetime of a diesel engine part is a random variable $X$ with density $f_X$. When the part wears out, it is replaced by another with the same density. Let $N(t)$ be the number of parts that are used in time $t$. We want to study the random variable $N(t)/t$. Since parts are replaced on the average every $E(X)$ time units, we expect about $t/E(X)$ parts to be used in time $t$. That is, we expect that
$$\lim_{t \to \infty} E\left(\frac{N(t)}{t}\right) = \frac{1}{E(X)}.$$
This result is correct but quite difficult to prove. Write a program that will allow you to specify the density $f_X$, and the time $t$, and simulate this experiment to find $N(t)/t$. Have your program repeat the experiment 500 times and plot a bar graph for the random outcomes of $N(t)/t$. From this data, estimate $E(N(t)/t)$ and compare this with $1/E(X)$. In particular, do this for $t = 100$ with the following two densities:
(a) $f_X = e^{-t}$.
(b) $f_X = te^{-t}$.

17 Let $X$ and $Y$ be random variables. The covariance $\mathrm{cov}(X, Y)$ is defined by (see Exercise 6.2.23)
$$\mathrm{cov}(X, Y) = E((X - \mu(X))(Y - \mu(Y))).$$
(a) Show that $\mathrm{cov}(X, Y) = E(XY) - E(X)E(Y)$.
(b) Using (a), show that $\mathrm{cov}(X, Y) = 0$, if $X$ and $Y$ are independent. (Caution: the converse is not always true.)
(c) Show that $V(X + Y) = V(X) + V(Y) + 2\,\mathrm{cov}(X, Y)$.

18 Let $X$ and $Y$ be random variables with positive variance. The correlation of $X$ and $Y$ is defined as
$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{V(X)V(Y)}}.$$
(a) Using Exercise 17(c), show that 0 ...

... $E(Y)$, then the expected value for $X$ given $Y = y$ will be less than $y$ (i.e., we have regression on the mean).

22 A point $Y$ is chosen at random from $[0, 1]$. A second point $X$ is then chosen from the interval $[0, Y]$. Find the density for $X$. Hint: Calculate $f_{X|Y}$ as in Exercise 21 and then use
$$f_X(x) = \int_x^1 f_{X|Y}(x|y) f_Y(y)\,dy.$$
Can you also derive your result geometrically?

*23 Let $X$ and $V$ be two standard normal random variables. Let $\rho$ be a real number between $-1$ and 1.
(a) Let $Y = \rho X + \sqrt{1 - \rho^2}\,V$. Show that $E(Y) = 0$ and $\mathrm{Var}(Y) = 1$.
We shall see later (see Example 7.5 and Example 10.17), that the sum of two independent normal random variables is again normal. Thus, assuming this fact, we have shown that $Y$ is standard normal.

(b) Using Exercises 17 and 18, show that the correlation of $X$ and $Y$ is $\rho$.

(c) In Exercise 20, the joint density function $f_{X,Y}(x, y)$ for the random variable $(X, Y)$ is given. Now suppose that we want to know the set of points $(x, y)$ in the $xy$-plane such that $f_{X,Y}(x, y) = C$ for some constant $C$. This set of points is called a set of constant density. Roughly speaking, a set of constant density is a set of points where the outcomes $(X, Y)$ are equally likely to fall. Show that for a given $C$, the set of points of constant density is a curve whose equation is
$$x^2 - 2\rho xy + y^2 = D,$$
where $D$ is a constant which depends upon $C$. (This curve is an ellipse.)

(d) One can plot the ellipse in part (c) by using the parametric equations
$$x = \frac{r\cos\theta}{\sqrt{2(1 - \rho)}} + \frac{r\sin\theta}{\sqrt{2(1 + \rho)}},$$
$$y = \frac{r\cos\theta}{\sqrt{2(1 - \rho)}} - \frac{r\sin\theta}{\sqrt{2(1 + \rho)}}.$$
Write a program to plot 1000 pairs $(X, Y)$ for $\rho = -1/2, 0, 1/2$. For each plot, have your program plot the above parametric curves for $r = 1, 2, 3$.

*24 Following Galton, let us assume that the fathers and sons have heights that are dependent normal random variables. Assume that the average height is 68 inches, standard deviation is 2.7 inches, and the correlation coefficient is .5 (see Exercises 20 and 21). That is, assume that the heights of the fathers and sons have the form $2.7X + 68$ and $2.7Y + 68$, respectively, where $X$ and $Y$ are correlated standardized normal random variables, with correlation coefficient .5.
(a) What is the expected height for the son of a father whose height is 72 inches?
(b) Plot a scatter diagram of the heights of 1000 father and son pairs. Hint: You can choose standardized pairs as in Exercise 23 and then plot $(2.7X + 68, 2.7Y + 68)$.
*25 When we have pairs of data $(x_i, y_i)$ that are outcomes of the pairs of dependent random variables $X$, $Y$, we can estimate the correlation coefficient $\rho$ by
\[ \bar r = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{(n-1)\, s_X s_Y} , \]
where $\bar x$ and $\bar y$ are the sample means for $X$ and $Y$, respectively, and $s_X$ and $s_Y$ are the sample standard deviations for $X$ and $Y$ (see Exercise 6.2.17). Write a program to compute the sample means, variances, and correlation for such dependent data. Use your program to compute these quantities for Galton's data on heights of parents and children given in Appendix B.

Plot the equal density ellipses as defined in Exercise 23 for $r = 4$, 6, and 8, and on the same graph print the values that appear in the table at the appropriate points. For example, print 12 at the point (70.5, 68.2), indicating that there were 12 cases where the parent's height was 70.5 and the child's was 68.2. See if Galton's data is consistent with the equal density ellipses.

26 (from Hamming²⁵) Suppose you are standing on the bank of a straight river.

(a) Choose, at random, a direction which will keep you on dry land, and walk 1 km in that direction. Let $P$ denote your position. What is the expected distance from $P$ to the river?

(b) Now suppose you proceed as in part (a), but when you get to $P$, you pick a random direction (from among all directions) and walk 1 km. What is the probability that you will reach the river before the second walk is completed?

27 (from Hamming²⁶) A game is played as follows: A random number $X$ is chosen uniformly from $[0, 1]$. Then a sequence $Y_1, Y_2, \ldots$ of random numbers is chosen independently and uniformly from $[0, 1]$. The game ends the first time that $Y_i > X$. You are then paid $(i - 1)$ dollars. What is a fair entrance fee for this game?

28 A long needle of length $L$ much bigger than 1 is dropped on a grid with horizontal and vertical lines one unit apart. Show that the average number $a$ of lines crossed is approximately
\[ a = \frac{4L}{\pi} . \]

²⁵ R. W. Hamming, The Art of Probability for Scientists and Engineers (Redwood City: Addison-Wesley, 1991), p. 192.
²⁶ ibid., p. 205.

Chapter 7

Sums of Independent Random Variables

7.1 Sums of Discrete Random Variables

In this chapter we turn to the important question of determining the distribution of a sum of independent random variables in terms of the distributions of the individual constituents. In this section we consider only sums of discrete random variables, reserving the case of continuous random variables for the next section.

We consider here only random variables whose values are integers. Their distribution functions are then defined on these integers. We shall find it convenient to assume here that these distribution functions are defined for all integers, by defining them to be 0 where they are not otherwise defined.

Convolutions

Suppose $X$ and $Y$ are two independent discrete random variables with distribution functions $m_1(x)$ and $m_2(x)$. Let $Z = X + Y$. We would like to determine the distribution function $m_3(x)$ of $Z$. To do this, it is enough to determine the probability that $Z$ takes on the value $z$, where $z$ is an arbitrary integer. Suppose that $X = k$, where $k$ is some integer. Then $Z = z$ if and only if $Y = z - k$. So the event $Z = z$ is the union of the pairwise disjoint events
\[ (X = k) \ \mbox{and}\ (Y = z - k) , \]
where $k$ runs over the integers. Since these events are pairwise disjoint, and since $X$ and $Y$ are independent, we have
\[ P(Z = z) = \sum_{k = -\infty}^{\infty} P(X = k) \cdot P(Y = z - k) . \]
Thus, we have found the distribution function of the random variable $Z$. This leads to the following definition.

Definition 7.1 Let $X$ and $Y$ be two independent integer-valued random variables, with distribution functions $m_1(x)$ and $m_2(x)$ respectively. Then the convolution of $m_1(x)$ and $m_2(x)$ is the distribution function $m_3 = m_1 * m_2$ given by
\[ m_3(j) = \sum_{k} m_1(k) \cdot m_2(j - k) , \]
for $j = \ldots, -2, -1, 0, 1, 2, \ldots$. The function $m_3(x)$ is the distribution function of the random variable $Z = X + Y$.
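The convolution sum in Definition 7.1 can be sketched in a few lines of Python. This is only an illustration, not a program from the text: the function name `convolve`, the dictionary representation (integer value mapped to probability, missing integers meaning probability 0), and the two small distributions are all assumptions made here for the example.

```python
from fractions import Fraction

def convolve(m1, m2):
    """Return the distribution of X + Y for independent integer-valued X and Y.

    m1 and m2 map integers to probabilities; integers that are absent are
    treated as having probability 0, as in the text.
    """
    m3 = {}
    for k, pk in m1.items():          # the event X = k
        for i, pi in m2.items():      # the event Y = i, so that Z = k + i
            m3[k + i] = m3.get(k + i, 0) + pk * pi
    return dict(sorted(m3.items()))

# Made-up illustration: X uniform on {0, 1}, Y uniform on {-1, 0, 1}.
x = {0: Fraction(1, 2), 1: Fraction(1, 2)}
y = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
z = convolve(x, y)
for value, prob in z.items():
    print(value, prob)
```

Looping over all pairs $(k, i)$ and accumulating at $j = k + i$ is the same sum as $\sum_k m_1(k)\, m_2(j - k)$, just organized by pairs instead of by $j$. Exact fractions are used so that the probabilities can be compared with hand calculations.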
It is easy to see that the convolution operation is commutative, and it is straightforward to show that it is also associative.

Now let $S_n = X_1 + X_2 + \cdots + X_n$ be the sum of $n$ independent random variables of an independent trials process with common distribution function $m$ defined on the integers. Then the distribution function of $S_1$ is $m$. We can write
\[ S_n = S_{n-1} + X_n . \]
Thus, since we know the distribution function of $X_n$ is $m$, we can find the distribution function of $S_n$ by induction.

Example 7.1 A die is rolled twice. Let $X_1$ and $X_2$ be the outcomes, and let $S_2 = X_1 + X_2$ be the sum of these outcomes. Then $X_1$ and $X_2$ have the common distribution function:
\[ m = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \end{pmatrix} . \]
The distribution function of $S_2$ is then the convolution of this distribution with itself. Thus,
\begin{eqnarray*}
P(S_2 = 2) &=& m(1)m(1) = \frac16 \cdot \frac16 = \frac1{36} , \\
P(S_2 = 3) &=& m(1)m(2) + m(2)m(1) = \frac16 \cdot \frac16 + \frac16 \cdot \frac16 = \frac2{36} , \\
P(S_2 = 4) &=& m(1)m(3) + m(2)m(2) + m(3)m(1) = \frac16 \cdot \frac16 + \frac16 \cdot \frac16 + \frac16 \cdot \frac16 = \frac3{36} .
\end{eqnarray*}
Continuing in this way we would find $P(S_2 = 5) = 4/36$, $P(S_2 = 6) = 5/36$, $P(S_2 = 7) = 6/36$, $P(S_2 = 8) = 5/36$, $P(S_2 = 9) = 4/36$, $P(S_2 = 10) = 3/36$, $P(S_2 = 11) = 2/36$, and $P(S_2 = 12) = 1/36$.

The distribution for $S_3$ would then be the convolution of the distribution for $S_2$ with the distribution for $X_3$. Thus
\begin{eqnarray*}
P(S_3 = 3) &=& P(S_2 = 2)P(X_3 = 1) = \frac1{36} \cdot \frac16 = \frac1{216} , \\
P(S_3 = 4) &=& P(S_2 = 3)P(X_3 = 1) + P(S_2 = 2)P(X_3 = 2) = \frac2{36} \cdot \frac16 + \frac1{36} \cdot \frac16 = \frac3{216} ,
\end{eqnarray*}
and so forth.

This is clearly a tedious job, and a program should be written to carry out this calculation. To do this we first write a program to form the convolution of two densities $p$ and $q$ and return the density $r$. We can then write a program to find the density for the sum $S_n$ of $n$ independent random variables with a common density $p$, at least in the case that the random variables have a finite number of possible values.

Running this program for the example of rolling a die $n$ times for $n = 10, 20, 30$ results in the distributions shown in Figure 7.1.
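The text does not list its programs, so the following is only a hedged sketch of what the two programs just described might look like in Python. The names `convolve` and `n_fold_convolution`, and the choice to store a density as a list indexed by the value of the random variable, are assumptions made for this illustration.

```python
from fractions import Fraction

def convolve(p, q):
    """Convolve two densities given as lists: p[i] = P(X = i), q[j] = P(Y = j)."""
    r = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

def n_fold_convolution(p, n):
    """Density of S_n = X_1 + ... + X_n for independent X_i with common density p."""
    r = p
    for _ in range(n - 1):
        r = convolve(r, p)      # S_k = S_{k-1} + X_k, as in the induction above
    return r

# One roll of a fair die: index i holds P(X = i); index 0 has probability 0.
die = [Fraction(0)] + [Fraction(1, 6)] * 6

s2 = n_fold_convolution(die, 2)
s3 = n_fold_convolution(die, 3)
print("P(S_2 = 7) =", s2[7])
print("P(S_3 = 4) =", s3[4])
```

The loop mirrors the induction $S_n = S_{n-1} + X_n$. For the plots with $n = 10, 20, 30$ one would call `n_fold_convolution(die, n)` and pass the resulting list to whatever plotting tool is at hand.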
We see that, as in the case of Bernoulli trials, the distributions become bell-shaped. We shall discuss in Chapter 9 a very general theorem called the Central Limit Theorem that will explain this phenomenon. □

Example 7.2 A well-known method for evaluating a bridge hand is: an ace is assigned a value of 4, a king 3, a queen 2, and a jack 1. All other cards are assigned a value of 0. The point count of the hand is then the sum of the values of the cards in the hand. (It is actually more complicated than this, taking into account voids in suits, and so forth, but we consider here this simplified form of the point count.) If a card is dealt at random to a player, then the point count for this card has distribution
\[ \begin{pmatrix} 0 & 1 & 2 & 3 & 4 \\ 36/52 & 4/52 & 4/52 & 4/52 & 4/52 \end{pmatrix} . \]
Let us regard the total hand of 13 cards as 13 independent trials with this common distribution. (Again this is not quite correct because we assume here that we are always choosing a card from a full deck.) Then the distribution for the point count $C$ for the hand can be found from the program NFoldConvolution by using the distribution for a single card and choosing $n = 13$. A player with a point count of 13 or more is said to have an opening bid. The probability of having an opening bid is then
\[ P(C \ge 13) . \]
Since we have the distribution of $C$, it is easy to compute this probability. Doing this we find that
\[ P(C \ge 13) = .2845 , \]
so that about one in four hands should be an opening bid according to this simplified model. A more realistic discussion of this problem can be found in Epstein, The Theory of Gambling and Statistical Logic.¹ □

¹ R. A. Epstein, The Theory of Gambling and Statistical Logic, rev. ed. (New York: Academic Press, 1977).

[Figure 7.1: Density of $S_n$ for rolling a die $n$ times; panels for $n = 10$, $n = 20$, and $n = 30$.]

For certain special distributions it is possible to find an expression for the distribution that results from convolving the distribution with itself $n$ times.

The convolution of two binomial distributions, one with parameters $m$ and $p$ and the other with parameters $n$ and $p$, is a binomial distribution with parameters $(m + n)$ and $p$. This fact follows easily from a consideration of the experiment which consists of first tossing a coin $m$ times, and then tossing it $n$ more times.

The convolution of $k$ geometric distributions with common parameter $p$ is a negative binomial distribution with parameters $p$ and $k$. This can be seen by considering the experiment which consists of tossing a coin until the $k$th head appears.

Exercises

1 A die is rolled three times. Find the probability that the sum of the outcomes is

(a) greater than 9.

(b) an odd number.

2 The price of a stock on a given trading day changes according to the distribution
\[ \begin{pmatrix} -1 & 0 & 1 & 2 \\ 1/4 & 1/2 & 1/8 & 1/8 \end{pmatrix} . \]
Find the distribution for the change in stock price after two (independent) trading days.

3 Let $X_1$ and $X_2$ be independent random variables with common distribution
\[ \begin{pmatrix} 0 & 1 & 2 \\ 1/8 & 3/8 & 1/2 \end{pmatrix} . \]
Find the distribution of the sum $X_1 + X_2$.

4 In one play of a certain game you win an amount $X$ with distribution
\[ \begin{pmatrix} 1 & 2 & 3 \\ 1/4 & 1/4 & 1/2 \end{pmatrix} . \]
Using the program NFoldConvolution find the distribution for your total winnings after ten (independent) plays. Plot this distribution.

5 Consider the following two experiments: the first has outcome $X$ taking on the values 0, 1, and 2 with equal probabilities; the second results in an (independent) outcome $Y$ taking on the value 3 with probability 1/4 and 4 with probability 3/4. Find the distribution of

(a) $Y + X$.

(b) $Y - X$.

6 People arrive at a queue according to the following scheme: During each minute of time either 0 or 1 person arrives. The probability that 1 person arrives is $p$ and that no person arrives is $q = 1 - p$.
Let $C_r$ be the number of customers arriving in the first $r$ minutes. Consider a Bernoulli trials process with a success if a person arrives in a unit time and failure if no person arrives in a unit time. Let $T_r$ be the number of failures before the $r$th success.

(a) What is the distribution for $T_r$?

(b) What is the distribution for $C_r$?

(c) Find the mean and variance for the number of customers arriving in the first $r$ minutes.

7 (a) A die is rolled three times with outcomes $X_1$, $X_2$, and $X_3$. Let $Y_3$ be the maximum of the values obtained. Show that
\[ P(Y_3 \le j) = P(X_1 \le j)^3 . \]
Use this to find the distribution of $Y_3$. Does $Y_3$ have a bell-shaped distribution?

(b) Now let $Y_n$ be the maximum value when $n$ dice are rolled. Find the distribution of $Y_n$. Is this distribution bell-shaped for large values of $n$?

8 A baseball player is to play in the World Series. Based upon his season play, you estimate that if he comes to bat four times in a game the number of hits he will get has a distribution
\[ p_X = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 \\ .4 & .2 & .2 & .1 & .1 \end{pmatrix} . \]
Assume that the player comes to bat four times in each game of the series.

(a) Let $X$ denote the number of hits that he gets in a series. Using the program NFoldConvolution, find the distribution of $X$ for each of the possible series lengths: four-game, five-game, six-game, seven-game.

(b) Using one of the distributions found in part (a), find the probability that his batting average exceeds .400 in a four-game series. (The batting average is the number of hits divided by the number of times at bat.)

(c) Given the distribution $p_X$, what is his long-term batting average?

9 Prove that you cannot load two dice in such a way that the probabilities for any sum from 2 to 12 are the same. (Be sure to consider the case where one or more sides turn up with probability zero.)

10 (Lévy²) Assume that $n$ is an integer, not prime. Show that you can find two distributions $a$ and $b$ on the nonnegative integers such that the convolution of $a$ and $b$ is the equiprobable distribution on the set $0, 1, 2, \ldots, n - 1$. If $n$ is prime this is not possible, but the proof is not so easy. (Assume that neither $a$ nor $b$ is concentrated at 0.)

11 Assume that you are playing craps with dice that are loaded in the following way: faces two, three, four, and five all come up with the same probability $(1/6) + r$. Faces one and six come up with probability $(1/6) - 2r$, with $0 < r < .02$. Write a computer program to find the probability of winning at craps with these dice, and using your program find which values of $r$ make craps a favorable game for the player with these dice.

² See M. Krasner and B. Ranulac, "Sur une Propriété des Polynomes de la Division du Cercle"; and the following note by J. Hadamard, in C. R. Acad. Sci., vol. 204 (1937), pp. 397-399.

7.2 Sums of Continuous Random Variables

In this section we consider the continuous version of the problem posed in the previous section: How are sums of independent random variables distributed?

Convolutions

Definition 7.2 Let $X$ and $Y$ be two continuous random variables with density functions $f(x)$ and $g(y)$, respectively. Assume that both $f(x)$ and $g(y)$ are defined for all real numbers. Then the convolution $f * g$ of $f$ and $g$ is the function given by
\[ (f * g)(z) = \int_{-\infty}^{+\infty} f(z - y) g(y)\, dy = \int_{-\infty}^{+\infty} g(z - x) f(x)\, dx . \]

This definition is analogous to the definition, given in Section 7.1, of the convolution of two distribution functions. Thus it should not be surprising that if $X$ and $Y$ are independent, then the density of their sum is the convolution of their densities. This fact is stated as a theorem below, and its proof is left as an exercise (see Exercise 1).

Theorem 7.1 Let $X$ and $Y$ be two independent random variables with density functions $f_X(x)$ and $f_Y(y)$ defined for all $x$. Then the sum $Z = X + Y$ is a random variable with density function $f_Z(z)$, where $f_Z$ is the convolution of $f_X$ and $f_Y$. □

To get a better understanding of this important result, we will look at some examples.
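Before turning to cases that can be done by hand, it may help to see that the convolution integral of Definition 7.2 can also be approximated numerically. The sketch below is not from the text: the function `convolve_at`, the midpoint rule, and the two test densities (uniform on $[0, 1]$ and uniform on $[0, 2]$) are assumptions chosen for illustration. For this pair a direct calculation gives $(f * g)(z) = z/2$ on $[0, 1]$, $1/2$ on $[1, 2]$, and $(3 - z)/2$ on $[2, 3]$.

```python
def convolve_at(f, g, z, lo, hi, steps=10_000):
    """Approximate (f * g)(z), the integral of f(z - y) g(y) dy over [lo, hi],
    by the midpoint rule. The interval [lo, hi] should contain the support of g."""
    dy = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * dy
        total += f(z - y) * g(y)
    return total * dy

def uniform01(x):
    """Uniform density on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def uniform02(x):
    """Uniform density on [0, 2]."""
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

for z in (0.5, 1.5, 2.5):
    first = convolve_at(uniform01, uniform02, z, 0.0, 2.0)   # integral of f(z - y) g(y) dy
    second = convolve_at(uniform02, uniform01, z, 0.0, 1.0)  # integral of g(z - x) f(x) dx
    print(z, round(first, 4), round(second, 4))
```

The two columns agree up to discretization error, which illustrates the equality of the two integrals in Definition 7.2.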
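Sum of Two Independent Uniform Random Variables

Example 7.3 Suppose we choose independently two numbers at random from the interval $[0, 1]$ with uniform probability density. What is the density of their sum?

Let $X$ and $Y$ be random variables describing our choices and $Z = X + Y$ their sum. Then we have
\[ f_X(x) = f_Y(x) = \begin{cases} 1 & \mbox{if } 0 \le x \le 1, \\ 0 & \mbox{otherwise;} \end{cases} \]
and the density function for the sum is given by
\[ f_Z(z) = \int_{-\infty}^{+\infty} f_X(z - y) f_Y(y)\, dy . \]
Since $f_Y(y) = 1$ if $0 \le y \le 1$ and 0 otherwise, this becomes
\[ f_Z(z) = \int_0^1 f_X(z - y)\, dy . \]
Now the integrand is 0 unless $0 \le z - y \le 1$ (i.e., unless $z - 1 \le y \le z$) and then it is 1. So if $0 \le z \le 1$, we have
\[ f_Z(z) = \int_0^z dy = z , \]
while if $1 < z \le 2$, we have
\[ f_Z(z) = \int_{z-1}^1 dy = 2 - z , \]
and if $z < 0$ or $z > 2$ we have $f_Z(z) = 0$ (see Figure 7.2). Hence,
\[ f_Z(z) = \begin{cases} z & \mbox{if } 0 \le z \le 1, \\ 2 - z & \mbox{if } 1 < z \le 2, \\ 0 & \mbox{otherwise.} \end{cases} \]
[…]

Sum of Two Independent Exponential Random Variables

Example 7.4 […] Let $X$ and $Y$ be independent random variables, each with the exponential density with parameter $\lambda$, and let $Z = X + Y$. Then
\[ f_X(x) = f_Y(x) = \begin{cases} \lambda e^{-\lambda x} & \mbox{if } x \ge 0, \\ 0 & \mbox{otherwise;} \end{cases} \]
and so, if $z > 0$,
\begin{eqnarray*}
f_Z(z) &=& \int_{-\infty}^{+\infty} f_X(z - y) f_Y(y)\, dy \\
&=& \int_0^z \lambda e^{-\lambda(z - y)} \lambda e^{-\lambda y}\, dy \\
&=& \int_0^z \lambda^2 e^{-\lambda z}\, dy \\
&=& \lambda^2 z e^{-\lambda z} ,
\end{eqnarray*}
while if $z < 0$, $f_Z(z) = 0$ (see Figure 7.3). Hence,
\[ f_Z(z) = \begin{cases} \lambda^2 z e^{-\lambda z} & \mbox{if } z \ge 0, \\ 0 & \mbox{otherwise.} \end{cases} \]

Sum of Two Independent Normal Random Variables

Example 7.5 It is an interesting and important fact that the convolution of two normal densities with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$ is again a normal density, with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$. We will show this in the special case that both random variables are standard normal. The general case can be done in the same way, but the calculation is messier. Another way to show the general result is given in Example 10.17.

Suppose $X$ and $Y$ are two independent random variables, each with the standard normal density (see Example 5.8). We have
\[ f_X(x) = f_Y(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} , \]
and so
\begin{eqnarray*}
f_Z(z) &=& f_X * f_Y(z) \\
&=& \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-(z - y)^2/2} e^{-y^2/2}\, dy \\
&=& \frac{1}{2\pi} e^{-z^2/4} \int_{-\infty}^{+\infty} e^{-(y - z/2)^2}\, dy \\
&=& \frac{1}{2\sqrt{\pi}} e^{-z^2/4} \left[ \frac{1}{\sqrt{\pi}} \int_{-\infty}^{+\infty} e^{-(y - z/2)^2}\, dy \right] .
\end{eqnarray*}
Here we have used $(z - y)^2 + y^2 = 2(y - z/2)^2 + z^2/2$. The expression in the brackets equals 1, since it is the integral of the normal density function with $\mu = z/2$ and $\sigma = 1/\sqrt{2}$. So, we have
\[ f_Z(z) = \frac{1}{\sqrt{4\pi}} e^{-z^2/4} . \]

Sum of Two Independent Cauchy Random Variables

Example 7.6 Choose two numbers at random from the interval $(-\infty, +\infty)$ with the Cauchy density with parameter $a = 1$ (see Example 5.10).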
Then
\[ f_X(x) = f_Y(x) = \frac{1}{\pi(1 + x^2)} , \]
and $Z = X + Y$ has density
\[ f_Z(z) = \frac{1}{\pi^2} \int_{-\infty}^{+\infty} \frac{1}{1 + (z - y)^2} \cdot \frac{1}{1 + y^2}\, dy . \]
This integral requires some effort, and we give here only the result (see Section 10.3, or Dwass³):
\[ f_Z(z) = \frac{2}{\pi(4 + z^2)} . \]
Now, suppose that we ask for the density function of the average
\[ A = (1/2)(X + Y) \]
of $X$ and $Y$. Then $A = (1/2)Z$. Exercise 5.2.19 shows that if $U$ and $V$ are two continuous random variables with density functions $f_U(x)$ and $f_V(x)$, respectively, and if $V = aU$, then
\[ f_V(x) = \left(\frac{1}{a}\right) f_U\!\left(\frac{x}{a}\right) . \]
Thus, we have
\[ f_A(z) = 2 f_Z(2z) = \frac{1}{\pi(1 + z^2)} . \]
Hence, the density function for the average of two random variables, each having a Cauchy density, is again a random variable with a Cauchy density; this remarkable property is a peculiarity of the Cauchy density. One consequence of this is if the error in a certain measurement process had a Cauchy density and you averaged a number of measurements, the average could not be expected to be any more accurate than any one of your individual measurements! □

³ M. Dwass, "On the Convolution of Cauchy Distributions," American Mathematical Monthly, vol. 92, no. 1 (1985), pp. 55-57; see also R. Nelson, letters to the Editor, ibid., p. 679.

Rayleigh Density

Example 7.7 Suppose $X$ and $Y$ are two independent standard normal random variables. Now suppose we locate a point $P$ in the $xy$-plane with coordinates $(X, Y)$ and ask: What is the density of the square of the distance of $P$ from the origin? (We have already simulated this problem in Example 5.9.) Here, with the preceding notation, we have
\[ f_X(x) = f_Y(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} . \]
Moreover, if $X^2$ denotes the square of $X$, then (see Theorem 5.1 and the discussion following)
\begin{eqnarray*}
f_{X^2}(r) &=& \begin{cases} \frac{1}{2\sqrt{r}} \left( f_X(\sqrt{r}) + f_X(-\sqrt{r}) \right) & \mbox{if } r > 0, \\ 0 & \mbox{otherwise.} \end{cases} \\
&=& \begin{cases} \frac{1}{\sqrt{2\pi r}} e^{-r/2} & \mbox{if } r > 0, \\ 0 & \mbox{otherwise.} \end{cases}
\end{eqnarray*}
This is a gamma density with $\lambda = 1/2$, $\beta = 1/2$ (see Example 7.4). Now let $R^2 = X^2 + Y^2$. Then
\begin{eqnarray*}
f_{R^2}(r) &=& \int_{-\infty}^{+\infty} f_{X^2}(r - s) f_{Y^2}(s)\, ds \\
&=& \frac{1}{4\pi} \int_0^r e^{-(r - s)/2} \left( \frac{r - s}{2} \right)^{-1/2} e^{-s/2} \left( \frac{s}{2} \right)^{-1/2} ds \\
&=& \begin{cases} \frac12 e^{-r/2} & \mbox{if } r \ge 0, \\ 0 & \mbox{otherwise.} \end{cases}
\end{eqnarray*}
Hence, $R^2$ has a gamma density with $\lambda = 1/2$, $\beta = 1$. We can interpret this result as giving the density for the square of the distance of $P$ from the center of a target if its coordinates are normally distributed.

The density of the random variable $R$ is obtained from that of $R^2$ in the usual way (see Theorem 5.1), and we find
\[ f_R(r) = \begin{cases} \frac12 e^{-r^2/2} \cdot 2r = r e^{-r^2/2} & \mbox{if } r \ge 0, \\ 0 & \mbox{otherwise.} \end{cases} \]
Physicists will recognize this as a Rayleigh density. Our result here agrees with our simulation in Example 5.9. □

Chi-Squared Density

More generally, the same method shows that the sum of the squares of $n$ independent normally distributed random variables with mean 0 and standard deviation 1 has a gamma density with $\lambda = 1/2$ and $\beta = n/2$. Such a density is called a chi-squared density with $n$ degrees of freedom. This density was introduced in Section 4.3. In Example 5.10, we used this density to test the hypothesis that two traits were independent.

Another important use of the chi-squared density is in comparing experimental data with a theoretical discrete distribution, to see whether the data supports the theoretical model. More specifically, suppose that we have an experiment with a finite set of outcomes. If the set of outcomes is countably infinite, we group them into finitely many sets of outcomes. We propose a theoretical distribution which we think will model the experiment well. We obtain some data by repeating the experiment a number of times. Now we wish to check how well the theoretical distribution fits the data.

Let $X$ be the random variable which represents a theoretical outcome in the model of the experiment, and let $m(x)$ be the distribution function of $X$. In a manner similar to what was done in Example 5.10, we calculate the value of the expression
\[ V = \sum_x \frac{(o_x - n \cdot m(x))^2}{n \cdot m(x)} , \]
where the sum runs over all possible outcomes $x$, $n$ is the number of data points, and $o_x$ denotes the number of outcomes of type $x$ observed in the data. Then for moderate or large values of $n$, the quantity $V$ is approximately chi-squared distributed, with $\nu - 1$ degrees of freedom, where $\nu$ represents the number of possible outcomes. The proof of this is beyond the scope of this book, but we will illustrate the reasonableness of this statement in the next example. If the value of $V$ is very large, when compared with the appropriate chi-squared density function, then we would tend to reject the hypothesis that the model is an appropriate one for the experiment at hand. We now give an example of this procedure.

    Outcome    Observed Frequency
       1              15
       2               8
       3               7
       4               5
       5               7
       6              18

Table 7.1: Observed data.

Example 7.8 Suppose we are given a single die. We wish to test the hypothesis that the die is fair. Thus, our theoretical distribution is the uniform distribution on the integers between 1 and 6. So, if we roll the die $n$ times, the expected number of data points of each type is $n/6$. Thus, if $o_i$ denotes the actual number of data points of type $i$, for $1 \le i \le 6$, then the expression
\[ V = \sum_{i=1}^{6} \frac{(o_i - n/6)^2}{n/6} \]
is approximately chi-squared distributed with 5 degrees of freedom.

Now suppose that we actually roll the die 60 times and obtain the data in Table 7.1. If we calculate $V$ for this data, we obtain the value 13.6. The graph of the chi-squared density with 5 degrees of freedom is shown in Figure 7.4. One sees that values as large as 13.6 are rarely taken on by $V$ if the die is fair, so we would reject the hypothesis that the die is fair. (When using this test, a statistician will reject the hypothesis if the data gives a value of $V$ which is larger than 95% of the values one would expect to obtain if the hypothesis is true.)

In Figure 7.5, we show the results of rolling a die 60 times, then calculating $V$, and then repeating this experiment 1000 times. The program that performs these calculations is called DieTest.
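The listing of DieTest is not given here, so the following Python sketch is only a guess at its outline. The function names, the fixed seed, and the use of Python's `random` module are assumptions made for this illustration. It first recomputes $V$ for the data of Table 7.1 and then repeats the 60-roll experiment 1000 times with a simulated fair die.

```python
import random

def chi_squared_statistic(observed, probs):
    """V = sum over outcomes x of (o_x - n*m(x))^2 / (n*m(x))."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

fair = [1 / 6] * 6

# Observed frequencies for outcomes 1..6 from Table 7.1 (60 rolls).
table_7_1 = [15, 8, 7, 5, 7, 18]
print("V for Table 7.1:", round(chi_squared_statistic(table_7_1, fair), 4))

def simulate_v(rolls, experiments, rng):
    """Roll a fair die `rolls` times, compute V, and repeat `experiments` times."""
    values = []
    for _ in range(experiments):
        counts = [0] * 6
        for _ in range(rolls):
            counts[rng.randrange(6)] += 1
        values.append(chi_squared_statistic(counts, fair))
    return values

rng = random.Random(0)  # fixed seed (an arbitrary choice) so that runs are repeatable
vs = simulate_v(60, 1000, rng)
print("average V over the experiments:", round(sum(vs) / len(vs), 3))
print("fraction of experiments with V >= 13.6:", sum(v >= 13.6 for v in vs) / len(vs))
```

For Table 7.1 the squared deviations from the expected count 10 are 25, 4, 9, 25, 9, and 64, which sum to 136, so $V = 136/10 = 13.6$ as stated. A histogram of the simulated values `vs` is what one would compare with the chi-squared density with 5 degrees of freedom.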
We have superimposed the chi-squared density with 5 degrees of freedom; one can see that the data values fit the curve fairly well, which supports the statement that the chi-squared density is the correct one to use. □

So far we have looked at several important special cases for which the convolution integral can be evaluated explicitly. In general, the convolution of two continuous densities cannot be evaluated explicitly, and we must resort to numerical methods. Fortunately, these prove to be remarkably effective, at least for bounded densities.

[Figure 7.4: Chi-squared density with 5 degrees of freedom.]

[Figure 7.5: Rolling a fair die; 1000 experiments, 60 rolls per experiment.]

[Figure 7.6: Convolution of $n$ uniform densities.]

Independent Trials

We now consider briefly the distribution of the sum of $n$ independent random variables, all having the same density function. If $X_1, X_2, \ldots, X_n$ are these random variables and $S_n = X_1 + X_2 + \cdots + X_n$ is their sum, then we will have
\[ f_{S_n}(x) = \left( f_{X_1} * f_{X_2} * \cdots * f_{X_n} \right)(x) , \]
where the right-hand side is an $n$-fold convolution. It is possible to calculate this density for general values of $n$ in certain simple cases.

Example 7.9 Suppose the $X_i$ are uniformly distributed on the interval $[0, 1]$. Then
\[ f_{X_i}(x) = \begin{cases} 1 & \mbox{if } 0 \le x \le 1, \\ 0 & \mbox{otherwise,} \end{cases} \]
[…]

[…] $> 0 \}$ in each case?

3 Suppose again that $Z = X + Y$. Find $f_Z$ if

(a) $f_X(x) = f_Y(x) = x/2$ if […], and 0 otherwise.

(b) $f_X(x) = f_Y(x) = (1/2)(x - {}$[…]

(c) $f_X(x) = 1/2$ if $0$ […], and 0 otherwise; […]

[…] $> 0 \}$ in each case?

4 Let $X$, $Y$, and $Z$ be independent random variables with
\[ f_X(x) = f_Y(x) = f_Z(x) = \begin{cases} 1 & \mbox{if } 0 < x < 1, \\ 0 & \mbox{otherwise.} \end{cases} \]
Suppose that $W = X + Y + Z$. Find $f_W$ directly, and compare your answer with that given by the formula in Example 7.9. Hint: See Example 7.3.
5 Suppose that $X$ and $Y$ are independent and $Z = X + Y$. Find $f_Z$ if

(a)
\[ f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \mbox{if } x > 0, \\ 0 & \mbox{otherwise,} \end{cases} \qquad f_Y(x) = \begin{cases} \mbox{[…]} & \mbox{if } x > 0, \\ 0 & \mbox{otherwise.} \end{cases} \]

(b)
\[ f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \mbox{if } x > 0, \\ 0 & \mbox{otherwise,} \end{cases} \]
and $f_Y(x) = 1$ if $0$ […]

[…] 22 minutes.

10 Let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables each of which has an exponential density with mean $\mu$. Let $M$ be the minimum value of the $X_j$. Show that the density for $M$ is exponential with mean $\mu/n$. Hint: Use cumulative distribution functions.

11 A company buys 100 lightbulbs, each of which has an exponentially distributed lifetime with mean 1000 hours. What is the expected time for the first of these bulbs to burn out? (See Exercise 10.)

12 An insurance company assumes that the time between claims from each of its homeowners' policies is exponentially distributed with mean $\mu$. It would like to estimate $\mu$ by averaging the times for a number of policies, but this is not very practical since the time between claims is about 30 years. At Galambos'⁵ suggestion the company puts its customers in groups of 50 and observes the time of the first claim within each group. Show that this provides a practical way to estimate the value of $\mu$.

13 Particles are subject to collisions that cause them to split into two parts with each part a fraction of the parent. Suppose that this fraction is uniformly distributed between 0 and 1. Following a single particle through several splittings we obtain a fraction of the original particle $Z_n = X_1 \cdot X_2 \cdot \ldots \cdot X_n$ where each $X_j$ is uniformly distributed between 0 and 1. Show that the density for the random variable $Z_n$ is
\[ f_n(z) = \frac{1}{(n-1)!} (-\log z)^{n-1} . \]
Hint: Show that $Y_k = -\log X_k$ is exponentially distributed. Use this to find the density function for $S_n = Y_1 + Y_2 + \cdots + Y_n$, and from this the cumulative distribution and density of $Z_n = e^{-S_n}$.

14 Assume that $X_1$ and $X_2$ are independent random variables, each having an exponential density with parameter $\lambda$.
Show that $Z = X_1 - X_2$ has density
\[ f_Z(z) = (1/2) \lambda e^{-\lambda |z|} . \]

15 Suppose we want to test a coin for fairness. We flip the coin $n$ times and record the number of times $X_0$ that the coin turns up tails and the number of times $X_1 = n - X_0$ that the coin turns up heads. Now we set
\[ Z = \sum_{i=0}^{1} \frac{(X_i - n/2)^2}{n/2} . \]
Then for a fair coin $Z$ has approximately a chi-squared distribution with $2 - 1 = 1$ degree of freedom. Verify this by computer simulation first for a fair coin ($p = 1/2$) and then for a biased coin ($p = 1/3$).

⁵ J. Galambos, Introductory Probability Theory (New York: Marcel Dekker, 1984), p. 159.

16 Verify your answers in Exercise 2(a) by computer simulation: Choose $X$ and $Y$ from $[-1, 1]$ with uniform density and calculate $Z = X + Y$. Repeat this experiment 500 times, recording the outcomes in a bar graph on $[-2, 2]$ with 40 bars. Does the density $f_Z$ calculated in Exercise 2(a) describe the shape of your bar graph? Try this for Exercises 2(b) and 2(c), too.

17 Verify your answers to Exercise 3 by computer simulation.

18 Verify your answer to Exercise 4 by computer simulation.

19 The support of a function $f(x)$ is defined to be the set
\[ \{ x : f(x) > 0 \} . \]
Suppose that $X$ and $Y$ are two continuous random variables with density functions $f_X(x)$ and $f_Y(y)$, respectively, and suppose that the supports of these density functions are the intervals $[a, b]$ and $[c, d]$, respectively. Find the support of the density function of the random variable $X + Y$.

20 Let $X_1, X_2, \ldots, X_n$ be a sequence of independent random variables, all having a common density function $f_X$ with support $[a, b]$ (see Exercise 19). Let $S_n = X_1 + X_2 + \cdots + X_n$, with density function $f_{S_n}$. Show that the support of $f_{S_n}$ is the interval $[na, nb]$. Hint: Write $f_{S_n} = f_{S_{n-1}} * f_X$. Now use Exercise 19 to establish the desired result by induction.

21 Let $X_1, X_2, \ldots, X_n$ be a sequence of independent random variables, all having a common density function $f_X$. Let $A = S_n/n$ be their average.
Find $f_A$ if

(a) $f_X(x) = (1/\sqrt{2\pi}) e^{-x^2/2}$ (normal density).

(b) $f_X(x) = e^{-x}$ (exponential density).

Hint: Write $f_A(x)$ in terms of $f_{S_n}(x)$.

Chapter 8

Law of Large Numbers

8.1 Law of Large Numbers for Discrete Random Variables

We are now in a position to prove our first fundamental theorem of probability. We have seen that an intuitive way to view the probability of a certain outcome is as the frequency with which that outcome occurs in the long run, when the experiment is repeated a large number of times. We have also defined probability mathematically as a value of a distribution function for the random variable representing the experiment. The Law of Large Numbers, which is a theorem proved about the mathematical model of probability, shows that this model is consistent with the frequency interpretation of probability. This theorem is sometimes called the law of averages. To find out what would happen if this law were not true, see the article by Robert M. Coates.¹

¹ R. M. Coates, "The Law," The World of Mathematics, ed. James R. Newman (New York: Simon and Schuster, 1956).

Chebyshev Inequality

To discuss the Law of Large Numbers, we first need an important inequality called the Chebyshev Inequality.

Theorem 8.1 (Chebyshev Inequality) Let $X$ be a discrete random variable with expected value $\mu = E(X)$, and let $\epsilon > 0$ be any positive real number. Then
\[ P(|X - \mu| \ge \epsilon) \le \frac{V(X)}{\epsilon^2} . \]

Proof. Let $m(x)$ denote the distribution function of $X$. Then the probability that $X$ differs from $\mu$ by at least $\epsilon$ is given by
\[ P(|X - \mu| \ge \epsilon) = \sum_{|x - \mu| \ge \epsilon} m(x) . \]
We know that
\[ V(X) = \sum_x (x - \mu)^2 m(x) , \]
and this is clearly at least as large as
\[ \sum_{|x - \mu| \ge \epsilon} (x - \mu)^2 m(x) , \]
since all the summands are nonnegative and we have restricted the range of summation in the second sum. But this last sum is at least
\[ \sum_{|x - \mu| \ge \epsilon} \epsilon^2 m(x) = \epsilon^2 \sum_{|x - \mu| \ge \epsilon} m(x) = \epsilon^2 P(|X - \mu| \ge \epsilon) . \]
So,
\[ P(|X - \mu| \ge \epsilon) \le \frac{V(X)}{\epsilon^2} . \]
□

Note that $X$ in the above theorem can be any discrete random variable, and $\epsilon$ any positive number.
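As a small numerical illustration of the inequality, one can compare the exact tail probability with the bound $V(X)/\epsilon^2$ for a concrete distribution. The sketch below uses a single roll of a fair die; the choice of distribution, the values of $\epsilon$, and the helper names are assumptions made here, not part of the text.

```python
from fractions import Fraction

def mean_and_variance(m):
    """Expected value and variance of a distribution m mapping values to probabilities."""
    mu = sum(x * p for x, p in m.items())
    var = sum((x - mu) ** 2 * p for x, p in m.items())
    return mu, var

def tail_probability(m, eps):
    """Exact P(|X - mu| >= eps)."""
    mu, _ = mean_and_variance(m)
    return sum(p for x, p in m.items() if abs(x - mu) >= eps)

die = {x: Fraction(1, 6) for x in range(1, 7)}
mu, var = mean_and_variance(die)
print("mu =", mu, " V(X) =", var)
for eps in (Fraction(1), Fraction(2), Fraction(5, 2)):
    exact = tail_probability(die, eps)
    bound = var / eps ** 2
    print("eps =", eps, " exact =", exact, " Chebyshev bound =", bound)
```

In each case the exact probability does not exceed the bound, and for small $\epsilon$ the bound exceeds 1 and so says nothing. The inequality trades sharpness for complete generality.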
Example 8.1 Let $X$ be any random variable with $E(X) = \mu$ and $V(X) = \sigma^2$. Then, if $\epsilon = k\sigma$, Chebyshev's Inequality states that
\[ P(|X - \mu| \ge k\sigma) \le \frac{\sigma^2}{k^2 \sigma^2} = \frac{1}{k^2} . \]
[…]

[…] it is possible to give an example of a random variable for which Chebyshev's Inequality is in fact an equality. To see this, given $\epsilon > 0$, choose $X$ with distribution
\[ p_X = \begin{pmatrix} -\epsilon & +\epsilon \\ 1/2 & 1/2 \end{pmatrix} . \]
Then $E(X) = 0$, $V(X) = \epsilon^2$, and
\[ P(|X - \mu| \ge \epsilon) = \frac{V(X)}{\epsilon^2} = 1 . \]

We are now prepared to state and prove the Law of Large Numbers.

Law of Large Numbers

Theorem 8.2 (Law of Large Numbers) Let $X_1, X_2, \ldots, X_n$ be an independent trials process, with finite expected value $\mu = E(X_j)$ and finite variance $\sigma^2 = V(X_j)$. Let $S_n = X_1 + X_2 + \cdots + X_n$. Then for any $\epsilon > 0$,
\[ P\left( \left| \frac{S_n}{n} - \mu \right| \ge \epsilon \right) \to 0 \]
as $n \to \infty$. Equivalently,
\[ P\left( \left| \frac{S_n}{n} - \mu \right| < \epsilon \right) \to 1 \]
as $n \to \infty$.

Proof. Since $X_1, X_2, \ldots, X_n$ are independent and have the same distributions, we can apply Theorem 6.9. We obtain
\[ V(S_n) = n\sigma^2 , \]
and
\[ V\left( \frac{S_n}{n} \right) = \frac{\sigma^2}{n} . \]
Also we know that
\[ E\left( \frac{S_n}{n} \right) = \mu . \]
By Chebyshev's Inequality, for any $\epsilon > 0$,
\[ P\left( \left| \frac{S_n}{n} - \mu \right| \ge \epsilon \right) \le \frac{\sigma^2}{n\epsilon^2} . \]
Thus, for fixed $\epsilon$,
\[ P\left( \left| \frac{S_n}{n} - \mu \right| \ge \epsilon \right) \to 0 \]
as $n \to \infty$, or equivalently,
\[ P\left( \left| \frac{S_n}{n} - \mu \right| < \epsilon \right) \to 1 \]
as $n \to \infty$. □

Law of Averages

Note that $S_n/n$ is an average of the individual outcomes, and one often calls the Law of Large Numbers the "law of averages." It is a striking fact that we can start with a random experiment about which little can be predicted and, by taking averages, obtain an experiment in which the outcome can be predicted with a high degree of certainty. The Law of Large Numbers, as we have stated it, is often called the "Weak Law of Large Numbers" to distinguish it from the "Strong Law of Large Numbers" described in Exercise 15.

Consider the important special case of Bernoulli trials with probability $p$ for success. Let $X_j = 1$ if the $j$th outcome is a success and 0 if it is a failure. Then $S_n = X_1 + X_2 + \cdots + X_n$ is the number of successes in $n$ trials and $\mu = E(X_1) = p$. The Law of Large Numbers states that for any $\epsilon > 0$
\[ P\left( \left| \frac{S_n}{n} - p \right| < \epsilon \right) \to 1 \]
as $n \to \infty$.
The above statement says that, in a large number of repetitions of a Bernoulli experiment, we can expect the proportion of times the event will occur to be near $p$. This shows that our mathematical model of probability agrees with our frequency interpretation of probability.

Coin Tossing

Let us consider the special case of tossing a coin $n$ times with $S_n$ the number of heads that turn up. Then the random variable $S_n/n$ represents the fraction of times heads turns up and will have values between 0 and 1. The Law of Large Numbers predicts that the outcomes for this random variable will, for large $n$, be near 1/2.

In Figure 8.1, we have plotted the distribution for this example for increasing values of $n$. We have marked the outcomes between .45 and .55 by dots at the top of the spikes. We see that as $n$ increases the distribution gets more and more concentrated around .5 and a larger and larger percentage of the total area is contained within the interval $(.45, .55)$, as predicted by the Law of Large Numbers.

Die Rolling

Example 8.2 Consider $n$ rolls of a die. Let $X_j$ be the outcome of the $j$th roll. Then $S_n = X_1 + X_2 + \cdots + X_n$ is the sum of the first $n$ rolls. This is an independent trials process with $E(X_j) = 7/2$. Thus, by the Law of Large Numbers, for any $\epsilon > 0$
\[ P\left( \left| \frac{S_n}{n} - \frac72 \right| \ge \epsilon \right) \to 0 \]
as $n \to \infty$. An equivalent way to state this is that, for any $\epsilon > 0$,
\[ P\left( \left| \frac{S_n}{n} - \frac72 \right| < \epsilon \right) \to 1 \]
as $n \to \infty$. […]

[…]
\[ P(|A_{100} - .3| \ge .1) \le .21 , \]
or if $n = 1000$,
\[ P(|A_{1000} - .3| \ge .1) \le .021 . \]
These can be rewritten as
\begin{eqnarray*}
P(.2 < A_{100} < .4) &\ge& .79 , \\
P(.2 < A_{1000} < .4) &\ge& .979 .
\end{eqnarray*}
These values should be compared with the actual values, which are (to six decimal places)
\begin{eqnarray*}
P(.2 < A_{100} < .4) &\approx& .962549 , \\
P(.2 < A_{1000} < .4) &\approx& 1 .
\end{eqnarray*}
The program Law can be used to carry out the above calculations in a systematic way.
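The listing of Law is not reproduced here, so the sketch below is only one way such a comparison might be organized. It assumes that $A_n$ is the average of $n$ Bernoulli trials with success probability .3, an assumption consistent with the mean .3 and the bound $.21 = (.3)(.7)/(100 \cdot (.1)^2)$ quoted above. The helper `binomial(n, p, x)` follows the name used in the exercises, while the other names are hypothetical. Exact fractions are used so that the strict inequalities $.2 < A_n < .4$ are not blurred by floating-point rounding.

```python
from fractions import Fraction
from math import comb

def binomial(n, p, x):
    """b(n, p, x): probability of exactly x successes in n Bernoulli trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_average_within(n, p, eps):
    """Exact P(|S_n/n - p| < eps) when S_n is binomial with parameters n and p."""
    return sum(binomial(n, p, x) for x in range(n + 1)
               if abs(Fraction(x, n) - p) < eps)

def chebyshev_lower_bound(n, p, eps):
    """Chebyshev's Inequality: P(|S_n/n - p| < eps) >= 1 - p(1 - p)/(n eps^2)."""
    return 1 - p * (1 - p) / (n * eps ** 2)

p, eps = Fraction(3, 10), Fraction(1, 10)
for n in (100, 1000):
    bound = chebyshev_lower_bound(n, p, eps)
    exact = prob_average_within(n, p, eps)
    print(n, float(bound), round(float(exact), 6))
```

The exact probabilities are much closer to 1 than the Chebyshev bounds, which is the comparison the example is drawing.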
Historical Remarks

The Law of Large Numbers was first proved by the Swiss mathematician James Bernoulli in the fourth part of his work Ars Conjectandi published posthumously in 1713.² As often happens with a first proof, Bernoulli's proof was much more difficult than the proof we have presented using Chebyshev's inequality. Chebyshev developed his inequality to prove a general form of the Law of Large Numbers (see Exercise 12). The inequality itself appeared much earlier in a work by Bienaymé, and in discussing its history Maistrov remarks that it was referred to as the Bienaymé-Chebyshev Inequality for a long time.³

In Ars Conjectandi Bernoulli provides his reader with a long discussion of the meaning of his theorem with lots of examples. In modern notation he has an event that occurs with probability $p$ but he does not know $p$. He wants to estimate $p$ by the fraction $\bar{p}$ of the times the event occurs when the experiment is repeated a number of times. He discusses in detail the problem of estimating, by this method, the proportion of white balls in an urn that contains an unknown number of white and black balls. He would do this by drawing a sequence of balls from the urn, replacing the ball drawn after each draw, and estimating the unknown proportion of white balls in the urn by the proportion of the balls drawn that are white. He shows that, by choosing $n$ large enough he can obtain any desired accuracy and reliability for the estimate. He also provides a lively discussion of the applicability of his theorem to estimating the probability of dying of a particular disease, of different kinds of weather occurring, and so forth.

² J. Bernoulli, The Art of Conjecturing IV, trans. Bing Sung, Technical Report No. 2, Dept. of Statistics, Harvard Univ., 1966.
³ L. E. Maistrov, Probability Theory: A Historical Approach, trans. and ed. Samuel Kotz (New York: Academic Press, 1974), p. 202.
In speaking of the number of trials necessary for making a judgement, Bernoulli observes that the "man on the street" believes the "law of averages."

Further, it cannot escape anyone that for judging in this way about any event at all, it is not enough to use one or two trials, but rather a great number of trials is required. And sometimes the stupidest man -- by some instinct of nature per se and by no previous instruction (this is truly amazing) -- knows for sure that the more observations of this sort that are taken, the less the danger will be of straying from the mark.⁴

But he goes on to say that he must contemplate another possibility.

Something further must be contemplated here which perhaps no one has thought about till now. It certainly remains to be inquired whether after the number of observations has been increased, the probability is increased of attaining the true ratio between the number of cases in which some event can happen and in which it cannot happen, so that this probability finally exceeds any given degree of certainty; or whether the problem has, so to speak, its own asymptote -- that is, whether some degree of certainty is given which one can never exceed.⁵

Bernoulli recognized the importance of this theorem, writing:

Therefore, this is the problem which I now set forth and make known after I have already pondered over it for twenty years. Both its novelty and its very great usefulness, coupled with its just as great difficulty, can exceed in weight and value all the remaining chapters of this thesis.⁶

⁴ Bernoulli, op. cit., p. 38.
⁵ Ibid., p. 39.
⁶ Ibid., p. 42.

Bernoulli concludes his long proof with the remark:

Whence, finally, this one thing seems to follow: that if observations of all events were to be continued throughout all eternity, (and hence the ultimate probability would tend toward perfect certainty), everything in
the world would be perceived to happen in fixed ratios and according to a constant law of alternation, so that even in the most accidental and fortuitous occurrences we would be bound to recognize, as it were, a certain necessity and, so to speak, a certain fate. I do not know whether Plato wished to aim at this in his doctrine of the universal return of things, according to which he predicted that all things will return to their original state after countless ages have passed.⁷

Exercises

1 A fair coin is tossed 100 times. The expected number of heads is 50, and the standard deviation for the number of heads is $(100 \cdot 1/2 \cdot 1/2)^{1/2} = 5$. What does Chebyshev's Inequality tell you about the probability that the number of heads that turn up deviates from the expected number 50 by three or more standard deviations (i.e., by at least 15)?

2 Write a program that uses the function binomial(n, p, x) to compute the exact probability that you estimated in Exercise 1. Compare the two results.

3 Write a program to toss a coin 10,000 times. Let $S_n$ be the number of heads in the first $n$ tosses. Have your program print out, after every 1000 tosses, $S_n - n/2$. On the basis of this simulation, is it correct to say that you can expect heads about half of the time when you toss a coin a large number of times?

4 A 1-dollar bet on craps has an expected winning of $-.0141$. What does the Law of Large Numbers say about your winnings if you make a large number of 1-dollar bets at the craps table? Does it assure you that your losses will be small? Does it assure you that if $n$ is very large you will lose?

5 Let $X$ be a random variable with $E(X) = 0$ and $V(X) = 1$. What integer value $k$ will assure us that $P(|X| \ge k) \le .01$?

6 Let $S_n$ be the number of successes in $n$ Bernoulli trials with probability $p$ for success on each trial. Show, using Chebyshev's Inequality, that for any $\epsilon > 0$
$$P\left(\left|\frac{S_n}{n} - p\right| \ge \epsilon\right) \le \frac{p(1-p)}{n\epsilon^2}.$$

7 Find the maximum possible value for $p(1 - p)$ if $0 < p < 1$.
Using this result and Exercise 6, show that the estimate
$$P\left(\left|\frac{S_n}{n} - p\right| \ge \epsilon\right) \le \frac{1}{4n\epsilon^2}$$
is valid for any $p$.

⁷ Ibid., pp. 65-66.

8 A fair coin is tossed a large number of times. Does the Law of Large Numbers assure us that, if $n$ is large enough, with probability > .99 the number of heads that turn up will not deviate from $n/2$ by more than 100?

9 In Exercise 6.2.15, you showed that, for the hat check problem, the number $S_n$ of people who get their own hats back has $E(S_n) = V(S_n) = 1$. Using Chebyshev's Inequality, show that $P(S_n \ge 11) \le .01$ for any $n \ge 11$.

10 Let $X$ be any random variable which takes on values $0, 1, 2, \ldots, n$ and has $E(X) = V(X) = 1$. Show that, for any positive integer $k$,
$$P(X \ge k+1) \le \frac{1}{k^2}.$$

[...] $> 0$,
$$P\left(\left|\frac{S_n}{n} - \frac{\cdots}{n}\right| < \epsilon\right) \to 1$$
as $n \to \infty$.

⁸ P. L. Chebyshev, "On Mean Values," J. Math. Pure. Appl., vol. 12 (1867), pp. 177-184.

13 A fair coin is tossed repeatedly. Before each toss, you are allowed to decide whether to bet on the outcome. Can you describe a betting system with infinitely many bets which will enable you, in the long run, to win more than half of your bets? (Note that we are disallowing a betting system that says to bet until you are ahead, then quit.) Write a computer program that implements this betting system. As stated above, your program must decide whether to bet on a particular outcome before that outcome is determined. For example, you might select only outcomes that come after there have been three tails in a row. See if you can get more than 50% heads by your "system."

*14 Prove the following analogue of Chebyshev's Inequality:
$$P(|X - E(X)| \ge \epsilon) \le \frac{1}{\epsilon} E(|X - E(X)|).$$

*15 We have proved a theorem often called the "Weak Law of Large Numbers." Most people's intuition and our computer simulations suggest that, if we toss a coin a sequence of times, the proportion of heads will really approach 1/2; that is, if $S_n$ is the number of heads in $n$ times, then we will have
$$A_n = \frac{S_n}{n} \to \frac{1}{2}$$
as $n \to \infty$.
Of course, we cannot be sure of this since we are not able to toss the coin an infinite number of times, and, if we could, the coin could come up heads every time. However, the "Strong Law of Large Numbers," proved in more advanced courses, states that
$$P\left(\frac{S_n}{n} \to \frac{1}{2}\right) = 1.$$
Describe a sample space $\Omega$ that would make it possible for us to talk about the event
$$E = \left\{\omega : \frac{S_n}{n} \to \frac{1}{2}\right\}.$$
Could we assign the equiprobable measure to this space? (See Example 2.18.)

*16 In this exercise, we shall construct an example of a sequence of random variables that satisfies the weak law of large numbers, but not the strong law. The distribution of $X_i$ will have to depend on $i$, because otherwise both laws would be satisfied. (This problem was communicated to us by David Maslen.)

Suppose we have an infinite sequence of mutually independent events $A_1, A_2, \ldots$. Let $a_i = P(A_i)$, and let $r$ be a positive integer.

(a) Find an expression of the probability that none of the $A_i$ with $i > r$ occur.

(b) Use the fact that $1 - x \le e^{-x}$ to show that
$$P(\text{No } A_i \text{ with } i > r \text{ occurs}) \le e^{-\sum_{i > r} a_i}.$$

(c) (The first Borel-Cantelli lemma) Prove that if $\sum_{i=1}^{\infty} a_i$ diverges, then
$$P(\text{infinitely many } A_i \text{ occur}) = 1.$$

Now, let $X_i$ be a sequence of mutually independent random variables such that for each positive integer $i \ge 2$, [...]. When $i = 1$ we let $X_i = 0$ with probability 1. As usual we let $S_n = X_1 + \cdots + X_n$. Note that the mean of each $X_i$ is 0.

(d) Find the variance of $S_n$.

(e) Show that the sequence $(X_i)$ satisfies the Weak Law of Large Numbers, i.e. prove that for any $\epsilon > 0$
$$P\left(\left|\frac{S_n}{n}\right| \ge \epsilon\right) \to 0,$$
as $n$ tends to infinity.

We now show that $\{X_i\}$ does not satisfy the Strong Law of Large Numbers. Suppose that $S_n/n \to 0$. Then because
$$\frac{X_n}{n} = \frac{S_n}{n} - \frac{n-1}{n} \cdot \frac{S_{n-1}}{n-1},$$
we know that $X_n/n \to 0$. From the definition of limits, we conclude that the inequality $|X_i| \ge \frac{1}{2} i$ can only be true for finitely many $i$.

(f) Let $A_i$ be the event $|X_i| \ge \frac{1}{2} i$. Find $P(A_i)$. Show that $\sum_i P(A_i)$ diverges (use the Integral Test).
(g) Prove that $A_i$ occurs for infinitely many $i$.

(h) Prove that
$$P\left(\frac{S_n}{n} \to 0\right) = 0,$$
and hence that the Strong Law of Large Numbers fails for the sequence $\{X_i\}$.

*17 Let us toss a biased coin that comes up heads with probability $p$ and assume the validity of the Strong Law of Large Numbers as described in Exercise 15. Then, with probability 1,
$$\frac{S_n}{n} \to p$$
as $n \to \infty$. If $f(x)$ is a continuous function on the unit interval, then we also have
$$f\left(\frac{S_n}{n}\right) \to f(p).$$
Finally, we could hope that
$$E\left(f\left(\frac{S_n}{n}\right)\right) \to E(f(p)) = f(p).$$
Show that, if all this is correct, as in fact it is, we would have proven that any continuous function on the unit interval is a limit of polynomial functions. This is a sketch of a probabilistic proof of an important theorem in mathematics called the Weierstrass approximation theorem.

8.2 Law of Large Numbers for Continuous Random Variables

In the previous section we discussed in some detail the Law of Large Numbers for discrete probability distributions. This law has a natural analogue for continuous probability distributions, which we consider somewhat more briefly here.

Chebyshev Inequality

Just as in the discrete case, we begin our discussion with the Chebyshev Inequality.

Theorem 8.3 (Chebyshev Inequality) Let $X$ be a continuous random variable with density function $f(x)$. Suppose $X$ has a finite expected value $\mu = E(X)$ and finite variance $\sigma^2 = V(X)$. Then for any positive number $\epsilon > 0$ we have
$$P(|X - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}.$$

The proof is completely analogous to the proof in the discrete case, and we omit it.

Note that this theorem says nothing if $\sigma^2 = V(X)$ is infinite.

Example 8.4 Let $X$ be any continuous random variable with $E(X) = \mu$ and $V(X) = \sigma^2$. Then, if $\epsilon = k\sigma = k$ standard deviations for some integer $k$, then
$$P(|X - \mu| \ge k\sigma) \le \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.$$

[...] $\epsilon > 0$ we have
$$\lim_{n \to \infty} P\left(\left|\frac{S_n}{n} - \mu\right| \ge \epsilon\right) = 0,$$
or equivalently,
$$\lim_{n \to \infty} P\left(\left|\frac{S_n}{n} - \mu\right| < \epsilon\right) = 1.$$

Note that this theorem is not necessarily true if $\sigma^2$ is infinite (see Example 8.8).
As in the discrete case, the Law of Large Numbers says that the average value of $n$ independent trials tends to the expected value as $n \to \infty$, in the precise sense that, given $\epsilon > 0$, the probability that the average value and the expected value differ by more than $\epsilon$ tends to 0 as $n \to \infty$.

Once again, we suppress the proof, as it is identical to the proof in the discrete case.

Uniform Case

Example 8.5 Suppose we choose at random $n$ numbers from the interval [0, 1] with uniform distribution. Then if $X_i$ describes the $i$th choice, we have
$$\mu = E(X_i) = \int_0^1 x\,dx = \frac{1}{2},$$
$$\sigma^2 = V(X_i) = \int_0^1 x^2\,dx - \mu^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$
Hence,
$$E\left(\frac{S_n}{n}\right) = \frac{1}{2}, \qquad V\left(\frac{S_n}{n}\right) = \frac{1}{12n},$$
and for any $\epsilon > 0$,
$$P\left(\left|\frac{S_n}{n} - \frac{1}{2}\right| \ge \epsilon\right) \le \frac{1}{12n\epsilon^2}.$$
This says that if we choose $n$ numbers at random from [0, 1], then the chances are better than $1 - 1/(12n\epsilon^2)$ that the difference $|S_n/n - 1/2|$ is less than $\epsilon$. Note that $\epsilon$ plays the role of the amount of error we are willing to tolerate: If we choose $\epsilon = 0.1$, say, then the chances that $|S_n/n - 1/2|$ is less than 0.1 are better than $1 - 100/(12n)$. For $n = 100$, this is about .92, but if $n = 1000$, this is better than .99 and if $n = 10{,}000$, this is better than .999.

We can illustrate what the Law of Large Numbers says for this example graphically. The density for $A_n = S_n/n$ is determined by
$$f_{A_n}(x) = n f_{S_n}(nx).$$
We have seen in Section 7.2 that we can compute the density $f_{S_n}(x)$ for the sum of $n$ uniform random variables. In Figure 8.2 we have used this to plot the density for $A_n$ for various values of $n$. We have shaded in the area for which $A_n$ would lie between .45 and .55. We see that as we increase $n$, we obtain more and more of the total area inside the shaded region. The Law of Large Numbers tells us that we can obtain as much of the total area as we please inside the shaded region by choosing $n$ large enough (see also Figure 8.1). □
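The uniform case can also be checked by simulation. The sketch below is an illustration rather than one of the book's programs: it compares the Chebyshev guarantee $1 - 1/(12n\epsilon^2)$ with the fraction of simulated averages that actually fall within $\epsilon$ of 1/2. The function names, the number of repetitions, and the fixed seed are assumptions made only for this example.

```python
import random


def chebyshev_lower_bound(n, eps):
    """Chebyshev guarantee 1 - 1/(12 n eps^2) for P(|S_n/n - 1/2| < eps)."""
    return 1 - 1 / (12 * n * eps**2)


def fraction_within(n, eps, trials, rng):
    """Fraction of simulated averages of n uniform numbers within eps of 1/2."""
    hits = 0
    for _ in range(trials):
        average = sum(rng.random() for _ in range(n)) / n
        if abs(average - 0.5) < eps:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    for n in (100, 1000):
        print(n, chebyshev_lower_bound(n, 0.1), fraction_within(n, 0.1, 500, rng))
```

In runs of this kind the simulated fraction is typically much closer to 1 than the guarantee, which is consistent with Chebyshev's Inequality being only a bound.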
Figure 8.2: Illustration of Law of Large Numbers, uniform case.

Normal Case

Example 8.6 Suppose we choose $n$ real numbers at random, using a normal distribution with mean 0 and variance 1. Then
$$\mu = E(X_i) = 0, \qquad \sigma^2 = V(X_i) = 1.$$
Hence,
$$E\left(\frac{S_n}{n}\right) = 0, \qquad V\left(\frac{S_n}{n}\right) = \frac{1}{n},$$
and, for any $\epsilon > 0$,
$$P\left(\left|\frac{S_n}{n} - 0\right| \ge \epsilon\right) \le \frac{1}{n\epsilon^2}.$$
In this case it is possible to compare the Chebyshev estimate for $P(|S_n/n - \mu| \ge \epsilon)$ in the Law of Large Numbers with exact values, since we know the density function for $S_n/n$ exactly (see Example 7.9). The comparison is shown in Table 8.1, for $\epsilon = .1$. The data in this table was produced by the program LawContinuous. We see here that the Chebyshev estimates are in general not very accurate. □

   n     P(|S_n/n| >= .1)   Chebyshev
  100        .31731          1.00000
  200        .15730           .50000
  300        .08326           .33333
  400        .04550           .25000
  500        .02535           .20000
  600        .01431           .16667
  700        .00815           .14286
  800        .00468           .12500
  900        .00270           .11111
 1000        .00157           .10000

Table 8.1: Chebyshev estimates.

Monte Carlo Method

Here is a somewhat more interesting example.

Example 8.7 Let $g(x)$ be a continuous function defined for $x \in [0, 1]$ with values in [0, 1]. In Section 2.1, we showed how to estimate the area of the region under the graph of $g(x)$ by the Monte Carlo method, that is, by choosing a large number of random values for $x$ and $y$ with uniform distribution and seeing what fraction of the points $P(x, y)$ fell inside the region under the graph (see Example 2.2).

Here is a better way to estimate the same area (see Figure 8.3). Let us choose a large number of independent values $X_n$ at random from [0, 1] with uniform density, set $Y_n = g(X_n)$, and find the average value of the $Y_n$. Then this average is our estimate for the area.
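The averaging procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not a program from the book; the function name, the sample size, and the choice $g(x) = x^2$ (whose area under the graph on [0, 1] is 1/3) are assumptions made for the example.

```python
import random


def average_value_estimate(g, n, rng):
    """Average of g(X) over n independent uniform draws X from [0, 1]."""
    return sum(g(rng.random()) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable

    def g(x):
        # Illustrative choice of g with values in [0, 1]; the true area is 1/3.
        return x * x

    print(average_value_estimate(g, 100_000, rng))
```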
To see this, note that if the density function for $X_n$ is uniform,
$$\mu = E(Y_n) = \int_0^1 g(x) f(x)\,dx = \int_0^1 g(x)\,dx = \text{average value of } g(x),$$
while the variance is
$$\sigma^2 = E((Y_n - \mu)^2) = \int_0^1 (g(x) - \mu)^2\,dx < 1,$$
since for all $x$ in [0, 1], $g(x)$ is in [0, 1], hence $\mu$ is in [0, 1], and so $|g(x) - \mu| \le 1$. Now let $A_n = (1/n)(Y_1 + Y_2 + \cdots + Y_n)$. Then by Chebyshev's Inequality, we have
$$P(|A_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2} < \frac{1}{n\epsilon^2}.$$
This says that to get within $\epsilon$ of the true value for $\mu = \int_0^1 g(x)\,dx$ with probability at least $p$, we should choose $n$ so that $1/(n\epsilon^2) \le 1 - p$ (i.e., so that $n \ge 1/(\epsilon^2(1 - p))$). Note that this method tells us how large to take $n$ to get a desired accuracy. □

Figure 8.3: Area problem.

The Law of Large Numbers requires that the variance $\sigma^2$ of the original underlying density be finite: $\sigma^2 < \infty$. In cases where this fails to hold, the Law of Large Numbers may fail, too. An example follows.

Cauchy Case

Example 8.8 Suppose we choose $n$ numbers from $(-\infty, +\infty)$ with a Cauchy density with parameter $a = 1$. We know that for the Cauchy density the expected value and variance are undefined (see Example 6.28). In this case, the density function for
$$A_n = \frac{S_n}{n}$$
is given by (see Example 7.6)
$$\frac{1}{\pi(1 + x^2)},$$
that is, the density function for $A_n$ is the same for all $n$. In this case, as $n$ increases, the density function does not change at all, and the Law of Large Numbers does not hold. □

Exercises

1 Let $X$ be a continuous random variable with mean $\mu = 10$ and variance $\sigma^2 = 100/3$. Using Chebyshev's Inequality, find an upper bound for the following probabilities.

(a) $P(|X - 10| \ge 2)$.
(b) $P(|X - 10| \ge 5)$.
(c) $P(|X - 10| \ge 9)$.
(d) $P(|X - 10| \ge 20)$.

2 Let $X$ be a continuous random variable with values uniformly distributed over the interval [0, 20].

(a) Find the mean and variance of $X$.
(b) Calculate $P(|X - 10| \ge 2)$, $P(|X - 10| \ge 5)$, $P(|X - 10| \ge 9)$, and $P(|X - 10| \ge 20)$ exactly. How do your answers compare with those of Exercise 1?
How good is Chebyshev's Inequality in this case?

3 Let $X$ be the random variable of Exercise 2.

(a) Calculate the function $f(x) = P(|X - 10| \ge x)$.
(b) Now graph the function $f(x)$, and on the same axes, graph the Chebyshev function $g(x) = 100/(3x^2)$. Show that $f(x) \le g(x)$ for all $x > 0$, but that $g(x)$ is not a very good approximation for $f(x)$.

4 Let $X$ be a continuous random variable with values exponentially distributed over $[0, \infty)$ with parameter $\lambda = 0.1$.

(a) Find the mean and variance of $X$.
(b) Using Chebyshev's Inequality, find an upper bound for the following probabilities: $P(|X - 10| \ge 2)$, $P(|X - 10| \ge 5)$, $P(|X - 10| \ge 9)$, and $P(|X - 10| \ge 20)$.
(c) Calculate these probabilities exactly, and compare with the bounds in (b).

5 Let $X$ be a continuous random variable with values normally distributed over $(-\infty, +\infty)$ with mean $\mu = 0$ and variance $\sigma^2 = 1$.

(a) Using Chebyshev's Inequality, find upper bounds for the following probabilities: $P(|X| \ge 1)$, $P(|X| \ge 2)$, and $P(|X| \ge 3)$.
(b) The area under the normal curve between $-1$ and 1 is .6827, between $-2$ and 2 is .9545, and between $-3$ and 3 it is .9973 (see the table in Appendix A). Compare your bounds in (a) with these exact values. How good is Chebyshev's Inequality in this case?

6 If $X$ is normally distributed, with mean $\mu$ and variance $\sigma^2$, find an upper bound for the following probabilities, using Chebyshev's Inequality.

(a) $P(|X - \mu| \ge \sigma)$.
(b) $P(|X - \mu| \ge 2\sigma)$.
(c) $P(|X - \mu| \ge 3\sigma)$.
(d) $P(|X - \mu| \ge 4\sigma)$.

Now find the exact value using the program NormalArea or the normal table in Appendix A, and compare.

7 If $X$ is a random variable with mean $\mu \ne 0$ and variance $\sigma^2$, define the relative deviation $D$ of $X$ from its mean by
$$D = \left|\frac{X - \mu}{\mu}\right|.$$

(a) Show that $P(D \ge a) \le \sigma^2/(\mu^2 a^2)$.
(b) If $X$ is the random variable of Exercise 1, find an upper bound for $P(D \ge .2)$, $P(D \ge .5)$, $P(D \ge .9)$, and $P(D \ge 2)$.

8 Let $X$ be a continuous random variable and define the standardized version $X^*$ of $X$ by:
$$X^* = \frac{X - \mu}{\sigma}.$$

(a) Show that $P(|X^*| \ge a) \le 1/a^2$.
(b) If $X$ is the random variable of Exercise 1, find bounds for $P(|X^*| \ge 2)$, $P(|X^*| \ge 5)$, and $P(|X^*| \ge 9)$.

9 (a) Suppose a number $X$ is chosen at random from [0, 20] with uniform probability. Find a lower bound for the probability that $X$ lies between 8 and 12, using Chebyshev's Inequality.
(b) Now suppose 20 real numbers are chosen independently from [0, 20] with uniform probability. Find a lower bound for the probability that their average lies between 8 and 12.
(c) Now suppose 100 real numbers are chosen independently from [0, 20]. Find a lower bound for the probability that their average lies between 8 and 12.

10 A student's score on a particular calculus final is a random variable with values in [0, 100], mean 70, and variance 25.

(a) Find a lower bound for the probability that the student's score will fall between 65 and 75.
(b) If 100 students take the final, find a lower bound for the probability that the class average will fall between 65 and 75.

11 The Pilsdorff beer company runs a fleet of trucks along the 100 mile road from Hangtown to Dry Gulch, and maintains a garage halfway in between. Each of the trucks is apt to break down at a point $X$ miles from Hangtown, where $X$ is a random variable uniformly distributed over [0, 100].

(a) Find a lower bound for the probability $P(|X - 50| \le 10)$.
(b) Suppose that in one bad week, 20 trucks break down. Find a lower bound for the probability $P(|A_{20} - 50| \le 10)$, where $A_{20}$ is the average of the distances from Hangtown at the time of breakdown.

12 A share of common stock in the Pilsdorff beer company has a price $Y_n$ on the $n$th business day of the year. Finn observes that the price change $X_n = Y_{n+1} - Y_n$ appears to be a random variable with mean $\mu = 0$ and variance $\sigma^2 = 1/4$. If $Y_1 = 30$, find a lower bound for the following probabilities, under the assumption that the $X_n$'s are mutually independent.

(a) $P(25 \le Y_2 \le 35)$.
(b) $P(25 \le Y_{11} \le 35)$.
(c) $P(25 \le Y_{101} \le 35)$.
13 Suppose one hundred numbers $X_1, X_2, \ldots, X_{100}$ are chosen independently at random from [0, 20]. Let $S = X_1 + X_2 + \cdots + X_{100}$ be the sum, $A = S/100$ the average, and $S^* = (S - 1000)/(10/\sqrt{3})$ the standardized sum. Find lower bounds for the probabilities

(a) $P(|S - 1000| \le 100)$.
(b) $P(|A - 10| \le 1)$.
(c) $P(|S^*| \le \sqrt{3})$.

14 Let $X$ be a continuous random variable normally distributed on $(-\infty, +\infty)$ with mean 0 and variance 1. Using the normal table provided in Appendix A, or the program NormalArea, find values for the function $f(x) = P(|X| \ge x)$ as $x$ increases from 0 to 4.0 in steps of .25. Note that for $x \ge 0$ the table gives $NA(0, x) = P(0 \le X \le x)$ and thus $P(|X| \ge x) = 2(.5 - NA(0, x))$. Plot by hand the graph of $f(x)$ using these values, and the graph of the Chebyshev function $g(x) = 1/x^2$, and compare (see Exercise 3).

15 Repeat Exercise 14, but this time with mean 10 and variance 3. Note that the table in Appendix A presents values for a standard normal variable. Find the standardized version $X^*$ for $X$, find values for $f^*(x) = P(|X^*| \ge x)$ as in Exercise 14, and then rescale these values for $f(x) = P(|X - 10| \ge x)$. Graph and compare this function with the Chebyshev function $g(x) = 3/x^2$.

16 Let $Z = X/Y$ where $X$ and $Y$ have normal densities with mean 0 and standard deviation 1. Then it can be shown that $Z$ has a Cauchy density.

(a) Write a program to illustrate this result by plotting a bar graph of 1000 samples obtained by forming the ratio of two standard normal outcomes. Compare your bar graph with the graph of the Cauchy density. Depending upon which computer language you use, you may or may not need to tell the computer how to simulate a normal random variable. A method for doing this was described in Section 5.2.
(b) We have seen that the Law of Large Numbers does not apply to the Cauchy density (see Example 8.8). Simulate a large number of experiments with Cauchy density and compute the average of your results.
Do these averages seem to be approaching a limit? If so can you explain why this might be?

17 Show that, if $X \ge 0$, then $P(X \ge a) \le E(X)/a$.

18 (Lamperti⁹) Let $X$ be a non-negative random variable. What is the best upper bound you can give for $P(X \ge a)$ if you know

(a) $E(X) = 20$.
(b) $E(X) = 20$ and $V(X) = 25$.
(c) $E(X) = 20$, $V(X) = 25$, and $X$ is symmetric about its mean.

⁹ Private communication.

Chapter 9

Central Limit Theorem

9.1 Central Limit Theorem for Bernoulli Trials

The second fundamental theorem of probability is the Central Limit Theorem. This theorem says that if $S_n$ is the sum of $n$ mutually independent random variables, then the distribution function of $S_n$ is well-approximated by a certain type of continuous function known as a normal density function, which is given by the formula
$$f_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)},$$
as we have seen in Chapter 4.3. In this section, we will deal only with the case that $\mu = 0$ and $\sigma = 1$. We will call this particular normal density function the standard normal density, and we will denote it by $\phi(x)$:
$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$
A graph of this function is given in Figure 9.1. It can be shown that the area under any normal density equals 1.

The Central Limit Theorem tells us, quite generally, what happens when we have the sum of a large number of independent random variables each of which contributes a small amount to the total. In this section we shall discuss this theorem as it applies to the Bernoulli trials and in Section 9.2 we shall consider more general processes. We will discuss the theorem in the case that the individual random variables are identically distributed, but the theorem is true, under certain conditions, even if the individual random variables have different distributions.

Bernoulli Trials

Consider a Bernoulli trials process with probability $p$ for success on each trial. Let $X_i = 1$ or 0 according as the $i$th outcome is a success or failure, and let $S_n = X_1 + X_2 + \cdots + X_n$. Then $S_n$ is the number of successes in $n$ trials.
We know that $S_n$ has as its distribution the binomial probabilities $b(n, p, j)$. In Section 3.2, we plotted these distributions for $p = .3$ and $p = .5$ for various values of $n$ (see Figure 3.5).

Figure 9.1: Standard normal density.

We note that the maximum values of the distributions appeared near the expected value $np$, which causes their spike graphs to drift off to the right as $n$ increased. Moreover, these maximum values approached 0 as $n$ increased, which causes the spike graphs to flatten out.

Standardized Sums

We can prevent the drifting of these spike graphs by subtracting the expected number of successes $np$ from $S_n$, obtaining the new random variable $S_n - np$. Now the maximum values of the distributions will always be near 0.

To prevent the spreading of these spike graphs, we can normalize $S_n - np$ to have variance 1 by dividing by its standard deviation $\sqrt{npq}$ (see Exercise 6.2.12 and Exercise 6.2.16).

Definition 9.1 The standardized sum of $S_n$ is given by
$$S_n^* = \frac{S_n - np}{\sqrt{npq}}.$$
$S_n^*$ always has expected value 0 and variance 1. □

Suppose we plot a spike graph with the spikes placed at the possible values of $S_n^*$: $x_0, x_1, \ldots, x_n$, where
$$x_j = \frac{j - np}{\sqrt{npq}}. \qquad (9.1)$$
We make the height of the spike at $x_j$ equal to the distribution value $b(n, p, j)$. An example of this standardized spike graph, with $n = 270$ and $p = .3$, is shown in Figure 9.2. This graph is beautifully bell-shaped. We would like to fit a normal density to this spike graph. The obvious choice to try is the standard normal density, since it is centered at 0, just as the standardized spike graph is. In this figure, we have drawn this standard normal density. The reader will note that a horrible thing has occurred: Even though the shapes of the two graphs are the same, the heights are quite different.

Figure 9.2: Normalized binomial distribution and standard normal density.
If we want the two graphs to fit each other, we must modify one of them; we choose to modify the spike graph. Since the shapes of the two graphs look fairly close, we will attempt to modify the spike graph without changing its shape. The reason for the differing heights is that the sum of the heights of the spikes equals 1, while the area under the standard normal density equals 1. If we were to draw a continuous curve through the top of the spikes, and find the area under this curve, we see that we would obtain, approximately, the sum of the heights of the spikes multiplied by the distance between consecutive spikes, which we will call $\epsilon$. Since the sum of the heights of the spikes equals one, the area under this curve would be approximately $\epsilon$. Thus, to change the spike graph so that the area under this curve has value 1, we need only multiply the heights of the spikes by $1/\epsilon$. It is easy to see from Equation 9.1 that
$$\epsilon = \frac{1}{\sqrt{npq}}.$$

In Figure 9.3 we show the standardized sum $S_n^*$ for $n = 270$ and $p = .3$, after correcting the heights, together with the standard normal density. (This figure was produced with the program CLTBernoulliPlot.) The reader will note that the standard normal fits the height-corrected spike graph extremely well. In fact, one version of the Central Limit Theorem (see Theorem 9.1) says that as $n$ increases, the standard normal density will do an increasingly better job of approximating the height-corrected spike graphs corresponding to a Bernoulli trials process with $n$ summands.

Figure 9.3: Corrected spike graph with standard normal density.

Let us fix a value $x$ on the $x$-axis and let $n$ be a fixed positive integer. Then, using Equation 9.1, the point $x_j$ that is closest to $x$ has a subscript $j$ given by the formula
$$j = \langle np + x\sqrt{npq} \rangle,$$
where $\langle a \rangle$ means the integer nearest to $a$. Thus the height of the spike above $x_j$ will be
$$\sqrt{npq}\, b(n, p, j) = \sqrt{npq}\, b(n, p, \langle np + x\sqrt{npq} \rangle).$$
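The height-corrected spike can be computed directly and set beside the standard normal density. The following Python sketch is an illustration and not the book's program CLTBernoulliPlot; the function names are assumptions, and Python's built-in `round` stands in for the nearest-integer bracket (it rounds exact ties to the even integer, which does not matter for this comparison).

```python
from math import comb, exp, pi, sqrt


def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)


def b(n, p, j):
    """Binomial probability of exactly j successes in n trials."""
    return comb(n, j) * p**j * (1 - p)**(n - j)


def corrected_spike_height(n, p, x):
    """Height sqrt(npq) * b(n, p, j) of the corrected spike nearest to x."""
    sd = sqrt(n * p * (1 - p))
    j = round(n * p + x * sd)  # integer nearest to np + x*sqrt(npq)
    return sd * b(n, p, j)


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, corrected_spike_height(270, 0.3, x), phi(x))
```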
For large $n$, we have seen that the height of the spike is very close to the height of the normal density at $x$. This suggests the following theorem.

Theorem 9.1 (Central Limit Theorem for Binomial Distributions) For the binomial distribution $b(n, p, j)$ we have
$$\lim_{n \to \infty} \sqrt{npq}\, b(n, p, \langle np + x\sqrt{npq} \rangle) = \phi(x),$$
where $\phi(x)$ is the standard normal density.

The proof of this theorem can be carried out using Stirling's approximation from Section 3.1. We indicate this method of proof by considering the case $x = 0$. In this case, the theorem states that
$$\lim_{n \to \infty} \sqrt{npq}\, b(n, p, \langle np \rangle) = \frac{1}{\sqrt{2\pi}} = .3989\ldots.$$
In order to simplify the calculation, we assume that $np$ is an integer, so that $\langle np \rangle = np$. Then
$$\sqrt{npq}\, b(n, p, np) = \sqrt{npq}\, p^{np} q^{nq} \frac{n!}{(np)!\,(nq)!}.$$
Recall that Stirling's formula (see Theorem 3.3) states that
$$n! \sim \sqrt{2\pi n}\, n^n e^{-n} \quad \text{as } n \to \infty.$$
Using this, we have
$$\sqrt{npq}\, b(n, p, np) \sim \frac{\sqrt{npq}\, p^{np} q^{nq} \sqrt{2\pi n}\, n^n e^{-n}}{\sqrt{2\pi np}\,\sqrt{2\pi nq}\,(np)^{np} (nq)^{nq} e^{-np} e^{-nq}},$$
which simplifies to $1/\sqrt{2\pi}$.

Approximating Binomial Distributions

We can use Theorem 9.1 to find approximations for the values of binomial distribution functions. If we wish to find an approximation for $b(n, p, j)$, we set
$$j = np + x\sqrt{npq}$$
and solve for $x$, obtaining
$$x = \frac{j - np}{\sqrt{npq}}.$$
Theorem 9.1 then says that $\sqrt{npq}\, b(n, p, j)$ is approximately equal to $\phi(x)$, so
$$b(n, p, j) \approx \frac{\phi(x)}{\sqrt{npq}} = \frac{1}{\sqrt{npq}}\, \phi\left(\frac{j - np}{\sqrt{npq}}\right).$$

Example 9.1 Let us estimate the probability of exactly 55 heads in 100 tosses of a coin. For this case $np = 100 \cdot 1/2 = 50$ and $\sqrt{npq} = \sqrt{100 \cdot 1/2 \cdot 1/2} = 5$. Thus $x_{55} = (55 - 50)/5 = 1$ and
$$P(S_{100} = 55) \sim \frac{\phi(1)}{5} = \frac{1}{5}\left(\frac{1}{\sqrt{2\pi}} e^{-1/2}\right) = .0484.$$
To four decimal places, the actual value is .0485, and so the approximation is very good.

The program CLTBernoulliLocal illustrates this approximation for any choice of $n$, $p$, and $j$. We have run this program for two examples. The first is the probability of exactly 50 heads in 100 tosses of a coin; the estimate is .0798, while the actual value, to four decimal places, is .0796.
The second example is the probability of exactly eight sixes in 36 rolls of a die; here the estimate is .1196, while the actual value, to four decimal places, is .1093.

The individual binomial probabilities tend to 0 as $n$ tends to infinity. In most applications we are not interested in the probability that a specific outcome occurs, but rather in the probability that the outcome lies in a given interval, say the interval $[a, b]$. In order to find this probability, we add the heights of the spike graphs for values of $j$ between $a$ and $b$. This is the same as asking for the probability that the standardized sum $S_n^*$ lies between $a^*$ and $b^*$, where $a^*$ and $b^*$ are the standardized values of $a$ and $b$. But as $n$ tends to infinity the sum of these areas could be expected to approach the area under the standard normal density between $a^*$ and $b^*$. The Central Limit Theorem states that this does indeed happen.

Theorem 9.2 (Central Limit Theorem for Bernoulli Trials) Let $S_n$ be the number of successes in $n$ Bernoulli trials with probability $p$ for success, and let $a$ and $b$ be two fixed real numbers. Then
$$\lim_{n \to \infty} P\left(a \le \frac{S_n - np}{\sqrt{npq}} \le b\right) = \int_a^b \phi(x)\,dx.$$

This theorem can be proved by adding together the approximations to $b(n, p, k)$ given in Theorem 9.1. It is also a special case of the more general Central Limit Theorem (see Section 10.3).

We know from calculus that the integral on the right side of this equation is equal to the area under the graph of the standard normal density $\phi(x)$ between $a$ and $b$. We denote this area by $NA(a^*, b^*)$. Unfortunately, there is no simple way to integrate the function $e^{-x^2/2}$, and so we must either use a table of values or else a numerical integration program. (See Figure 9.4 for values of $NA(0, z)$. A more extensive table is given in Appendix A.)
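As an alternative to a table or a numerical integration routine, the normal area can be written in terms of the error function, which Python's standard library provides as `math.erf`. The sketch below is an illustration, not the book's program NormalArea; the function names are assumptions. It relies on the standard identity that the standard normal distribution function equals $\frac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Area under the standard normal density to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_area(a, b):
    """Area under the standard normal density between a and b."""
    return standard_normal_cdf(b) - standard_normal_cdf(a)


if __name__ == "__main__":
    for z in (1.0, 2.0, 3.0):
        print(z, round(normal_area(0.0, z), 4))
```

The values returned for `normal_area(0, z)` can be compared against the entries of Figure 9.4.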
It is clear from the symmetry of the standard normal density that areas such as that between $-2$ and 3 can be found from this table by adding the area from 0 to 2 (same as that from $-2$ to 0) to the area from 0 to 3.

Approximation of Binomial Probabilities

Suppose that $S_n$ is binomially distributed with parameters $n$ and $p$. We have seen that the above theorem shows how to estimate a probability of the form
$$P(i \le S_n \le j), \qquad (9.2)$$
where $i$ and $j$ are integers between 0 and $n$. As we have seen, the binomial distribution can be represented as a spike graph, with spikes at the integers between 0 and $n$, and with the height of the $k$th spike given by $b(n, p, k)$. For moderate-sized values of $n$, if we standardize this spike graph, and change the heights of its spikes, in the manner described above, the sum of the heights of the spikes is approximated by the area under the standard normal density between $i^*$ and $j^*$. It turns out that a slightly more accurate approximation is afforded by the area under the standard normal density between the standardized values corresponding to $(i - 1/2)$ and $(j + 1/2)$; these values are
$$i^* = \frac{i - 1/2 - np}{\sqrt{npq}}$$
and
$$j^* = \frac{j + 1/2 - np}{\sqrt{npq}}.$$
Thus,
$$P(i \le S_n \le j) \approx NA\left(\frac{i - \frac{1}{2} - np}{\sqrt{npq}}, \frac{j + \frac{1}{2} - np}{\sqrt{npq}}\right).$$

  z    NA(z)     z    NA(z)     z    NA(z)     z    NA(z)
 .0    .0000    1.0   .3413    2.0   .4772    3.0   .4987
 .1    .0398    1.1   .3643    2.1   .4821    3.1   .4990
 .2    .0793    1.2   .3849    2.2   .4861    3.2   .4993
 .3    .1179    1.3   .4032    2.3   .4893    3.3   .4995
 .4    .1554    1.4   .4192    2.4   .4918    3.4   .4997
 .5    .1915    1.5   .4332    2.5   .4938    3.5   .4998
 .6    .2257    1.6   .4452    2.6   .4953    3.6   .4998
 .7    .2580    1.7   .4554    2.7   .4965    3.7   .4999
 .8    .2881    1.8   .4641    2.8   .4974    3.8   .4999
 .9    .3159    1.9   .4713    2.9   .4981    3.9   .5000

Figure 9.4: Table of values of $NA(0, z)$, the normal area from 0 to $z$.

[...]
$$P(S_{1700} > 1060) = P(S_{1700} \ge 1061) = P\left(S_{1700}^* \ge \frac{1060.5 - 1020}{20}\right) = P(S_{1700}^* \ge 2.025).$$
From Table 9.4, if we interpolate, we would estimate this probability to be $.5 - .4784 = .0216$.
Thus, the college is fairly safe using this admission policy. □

Applications to Statistics

There are many important questions in the field of statistics that can be answered using the Central Limit Theorem for independent trials processes. The following example is one that is encountered quite frequently in the news. Another example of an application of the Central Limit Theorem to statistics is given in Section 9.2.

Example 9.4 One frequently reads that a poll has been taken to estimate the proportion of people in a certain population who favor one candidate over another in a race with two candidates. (This model also applies to races with more than two candidates, and to ballot propositions.) Clearly, it is not possible for pollsters to ask everyone for their preference. What is done instead is to pick a subset of the population, called a sample, and ask everyone in the sample for their preference. Let $p$ be the actual proportion of people in the population who are in favor of candidate A and let $q = 1 - p$. If we choose a sample of size $n$ from the population, the preferences of the people in the sample can be represented by random variables $X_1, X_2, \ldots, X_n$, where $X_i = 1$ if person $i$ is in favor of candidate A, and $X_i = 0$ if person $i$ is in favor of candidate B. Let $S_n = X_1 + X_2 + \cdots + X_n$. If each subset of size $n$ is chosen with the same probability, then $S_n$ is hypergeometrically distributed. If $n$ is small relative to the size of the population (which is typically true in practice), then $S_n$ is approximately binomially distributed, with parameters $n$ and $p$.

The pollster wants to estimate the value $p$. An estimate for $p$ is provided by the value $\bar p = S_n/n$, which is the proportion of people in the sample who favor candidate A. The Central Limit Theorem says that the random variable $\bar p$ is approximately normally distributed.
(In fact, our version of the Central Limit Theorem says that the distribution function of the random variable
\[ S_n^* = \frac{S_n - np}{\sqrt{npq}} \]
is approximated by the standard normal density.) But we have
\[ \bar p = \frac{S_n - np}{\sqrt{npq}} \sqrt{\frac{pq}{n}} + p , \]
i.e., $\bar p$ is just a linear function of $S_n^*$. Since the distribution of $S_n^*$ is approximated by the standard normal density, the distribution of the random variable $\bar p$ must also be bell-shaped. We also know how to write the mean and standard deviation of $\bar p$ in terms of $p$ and $n$. The mean of $\bar p$ is just $p$, and the standard deviation is
\[ \sqrt{\frac{pq}{n}} . \]
Thus, it is easy to write down the standardized version of $\bar p$; it is
\[ \bar p^* = \frac{\bar p - p}{\sqrt{pq/n}} . \]

Since the distribution of the standardized version of $\bar p$ is approximated by the standard normal density, we know, for example, that 95% of its values will lie within two standard deviations of its mean, and the same is true of $\bar p$. So we have
\[ P\left( p - 2\sqrt{\frac{pq}{n}} < \bar p < p + 2\sqrt{\frac{pq}{n}} \right) \approx .95 . \]
[...] 1111.

Figure 9.5: Polling simulation.

So if the pollster chooses $n$ to be 1200, say, and calculates $\bar p$ using his sample of size 1200, then 19 times out of 20 (i.e., 95% of the time), his confidence interval, which is of length 6%, will contain the true value of $p$. This type of confidence interval is typically reported in the news as follows: this survey has a 3% margin of error. In fact, most of the surveys that one sees reported in the paper will have sample sizes around 1000.

A somewhat surprising fact is that the size of the population has apparently no effect on the sample size needed to obtain a 95% confidence interval for $p$ with a given margin of error. To see this, note that the value of $n$ that was needed depended only on the number .03, which is the margin of error. In other words, whether the population is of size 100,000 or 100,000,000, the pollster needs only to choose a sample of size 1200 or so to get the same accuracy of estimate of $p$.
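The dependence of the sample size on the margin of error alone can be sketched in a few lines. The sketch assumes the two-standard-deviation rule used above together with the elementary bound $pq \le 1/4$, which gives $2\sqrt{pq/n} \le 1/\sqrt{n}$; it also assumes binomial sampling (a very large population). The function names and the simulation parameters are illustrative choices, not taken from the text.

```python
import math
import random

def sample_size_for_margin(margin):
    """Smallest n with 1/sqrt(n) <= margin (uses pq <= 1/4; no population size needed)."""
    return math.ceil(1.0 / margin ** 2)

def coverage(p, n, margin, trials, seed=0):
    """Fraction of simulated polls whose sample proportion lies within `margin` of p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each respondent favors candidate A with probability p (binomial model).
        successes = sum(1 for _ in range(n) if rng.random() < p)
        if abs(successes / n - p) < margin:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(sample_size_for_margin(0.03))      # on the order of 1100
    print(coverage(0.54, 1200, 0.03, 1000))  # should be near .95
```

Note that the population size never enters `sample_size_for_margin`; only the margin does.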
(We did use the fact that the sample size was small relative to the population size in the statement that $S_n$ is approximately binomially distributed.)

In Figure 9.5, we show the results of simulating the polling process. The population is of size 100,000, and for the population, $p = .54$. The sample size was chosen to be 1200. The spike graph shows the distribution of $\bar p$ for 10,000 randomly chosen samples. For this simulation, the program kept track of the number of samples for which $\bar p$ was within 3% of .54. This number was 9648, which is close to 95% of the number of samples used.

Another way to see what the idea of confidence intervals means is shown in Figure 9.6. In this figure, we show 100 confidence intervals, obtained by computing $\bar p$ for 100 different samples of size 1200 from the same population as before. The reader can see that most of these confidence intervals (96, to be exact) contain the true value of $p$.

Figure 9.6: Confidence interval simulation.

The Gallup Poll has used these polling techniques in every Presidential election since 1936 (and in innumerable other elections as well). Table 9.1$^1$ shows the results of their efforts. The reader will note that most of the approximations to $p$ are within 3% of the actual value of $p$. The sample sizes for these polls were typically around 1500. (In the table, both the predicted and actual percentages for the winning candidate refer to the percentage of the vote among the "major" political parties. In most elections, there were two major parties, but in several elections, there were three.)

Year   Winning Candidate   Gallup Final Survey   Election Result   Deviation
1936   Roosevelt           55.7%                 62.5%             6.8%
1940   Roosevelt           52.0%                 55.0%             3.0%
1944   Roosevelt           51.5%                 53.3%             1.8%
1948   Truman              44.5%                 49.9%             5.4%
1952   Eisenhower          51.0%                 55.4%             4.4%
1956   Eisenhower          59.5%                 57.8%             1.7%
1960   Kennedy             51.0%                 50.1%             0.9%
1964   Johnson             64.0%                 61.3%             2.7%
1968   Nixon               43.0%                 43.5%             0.5%
1972   Nixon               62.0%                 61.8%             0.2%
1976   Carter              48.0%                 50.0%             2.0%
1980   Reagan              47.0%                 50.8%             3.8%
1984   Reagan              59.0%                 59.1%             0.1%
1988   Bush                56.0%                 53.9%             2.1%
1992   Clinton             49.0%                 43.2%             5.8%
1996   Clinton             52.0%                 50.1%             1.9%

Table 9.1: Gallup Poll accuracy record.

This technique also plays an important role in the evaluation of the effectiveness of drugs in the medical profession. For example, it is sometimes desired to know what proportion of patients will be helped by a new drug. This proportion can be estimated by giving the drug to a subset of the patients, and determining the proportion of this sample who are helped by the drug. □

$^1$ The Gallup Poll Monthly, November 1992, No. 326, p. 33. Supplemented with the help of Lydia K. Saab, The Gallup Organization.

Historical Remarks

The Central Limit Theorem for Bernoulli trials was first proved by Abraham de Moivre and appeared in his book, The Doctrine of Chances, first published in 1718.$^2$

De Moivre spent his years from age 18 to 21 in prison in France because of his Protestant background. When he was released he left France for England, where he worked as a tutor to the sons of noblemen. Newton had presented a copy of his Principia Mathematica to the Earl of Devonshire. The story goes that, while de Moivre was tutoring at the Earl's house, he came upon Newton's work and found that it was beyond him. It is said that he then bought a copy of his own and tore it into separate pages, learning it page by page as he walked around London to his tutoring jobs. De Moivre frequented the coffeehouses in London, where he started his probability work by calculating odds for gamblers. He also met Newton at such a coffeehouse and they became fast friends. De Moivre dedicated his book to Newton.

$^2$ A. de Moivre, The Doctrine of Chances, 3d ed. (London: Millar, 1756).

The Doctrine of Chances provides the techniques for solving a wide variety of gambling problems.
In the midst of these gambling problems de Moivre rather modestly introduces his proof of the Central Limit Theorem, writing

A Method of approximating the Sum of the Terms of the Binomial $(a + b)^n$ expanded into a Series, from whence are deduced some practical Rules to estimate the Degree of Assent which is to be given to Experiments.$^3$

De Moivre's proof used the approximation to factorials that we now call Stirling's formula. De Moivre states that he had obtained this formula before Stirling but without determining the exact value of the constant $\sqrt{2\pi}$. While he says it is not really necessary to know this exact value, he concedes that knowing it "has spread a singular Elegancy on the Solution."

The complete proof and an interesting discussion of the life of de Moivre can be found in the book Games, Gods and Gambling by F. N. David.$^4$

$^3$ ibid., p. 243.
$^4$ F. N. David, Games, Gods and Gambling (London: Griffin, 1962).

Exercises

1 Let $S_{100}$ be the number of heads that turn up in 100 tosses of a fair coin. Use the Central Limit Theorem to estimate
(a) $P(S_{100} < 45)$.
(b) $P(45 < S_{100} < 55)$.
(c) $P(S_{100} > 63)$.
(d) $P(S_{100} < 57)$.

2 Let $S_{200}$ be the number of heads that turn up in 200 tosses of a fair coin. Estimate
(a) $P(S_{200} = 100)$.
(b) $P(S_{200} = 90)$.
(c) $P(S_{200} = 80)$.

3 A true-false examination has 48 questions. June has probability 3/4 of answering a question correctly. April just guesses on each question. A passing score is 30 or more correct answers. Compare the probability that June passes the exam with the probability that April passes it.

4 Let $S$ be the number of heads in 1,000,000 tosses of a fair coin. Use (a) Chebyshev's inequality, and (b) the Central Limit Theorem, to estimate the probability that $S$ lies between 499,500 and 500,500. Use the same two methods to estimate the probability that $S$ lies between 499,000 and 501,000, and the probability that $S$ lies between 498,500 and 501,500.
5 A rookie is brought to a baseball club on the assumption that he will have a .300 batting average. (Batting average is the ratio of the number of hits to the number of times at bat.) In the first year, he comes to bat 300 times and his batting average is .267. Assume that his at bats can be considered Bernoulli trials with probability .3 for success. Could such a low average be considered just bad luck or should he be sent back to the minor leagues? Comment on the assumption of Bernoulli trials in this situation.

6 Once upon a time, there were two railway trains competing for the passenger traffic of 1000 people leaving from Chicago at the same hour and going to Los Angeles. Assume that passengers are equally likely to choose each train. How many seats must a train have to assure a probability of .99 or better of having a seat for each passenger?

7 Assume that, as in Example 9.3, Dartmouth admits 1750 students. What is the probability of too many acceptances?

8 A club serves dinner to members only. They are seated at 12-seat tables. The manager observes over a long period of time that 95 percent of the time there are between six and nine full tables of members, and the remainder of the time the numbers are equally likely to fall above or below this range. Assume that each member decides to come with a given probability $p$, and that the decisions are independent. How many members are there? What is $p$?

9 Let $S_n$ be the number of successes in $n$ Bernoulli trials with probability .8 for success on each trial. Let $A_n = S_n/n$ be the average number of successes. In each case give the value for the limit, and give a reason for your answer.
(a) $\lim_{n \to \infty} P(A_n = .8)$.
(b) $\lim_{n \to \infty} P(.7n < S_n < .9n)$.
(c) $\lim_{n \to \infty} P(S_n < .8n + .8\sqrt{n})$.
(d) $\lim_{n \to \infty} P(.79 < A_n < .81)$.

10 Find the probability that among 10,000 random digits the digit 3 appears not more than 931 times.
11 Write a computer program to simulate 10,000 Bernoulli trials with probability .3 for success on each trial. Have the program compute the 95 percent confidence interval for the probability of success based on the proportion of successes. Repeat the experiment 100 times and see how many times the true value of .3 is included within the confidence limits.

12 A balanced coin is flipped 400 times. Determine the number $x$ such that the probability that the number of heads is between $200 - x$ and $200 + x$ is approximately .80.

13 A noodle machine in Spumoni's spaghetti factory makes about 5 percent defective noodles even when properly adjusted. The noodles are then packed in crates containing 1900 noodles each. A crate is examined and found to contain 115 defective noodles. What is the approximate probability of finding at least this many defective noodles if the machine is properly adjusted?

14 A restaurant feeds 400 customers per day. On the average 20 percent of the customers order apple pie.
(a) Give a range (called a 95 percent confidence interval) for the number of pieces of apple pie ordered on a given day such that you can be 95 percent sure that the actual number will fall in this range.
(b) How many customers must the restaurant have, on the average, to be at least 95 percent sure that the number of customers ordering pie on that day falls in the 19 to 21 percent range?

15 Recall that if $X$ is a random variable, the cumulative distribution function of $X$ is the function $F(x)$ defined by $F(x) = P(X$ [...]

[...] 80 before the probability that his error is less than .0001 exceeds .95.

9.3. CONTINUOUS INDEPENDENT TRIALS

We have already noticed that the estimate in the Chebyshev inequality is not always a good one, and here is a case in point. If we assume that $n$ is large enough so that the density for $S_n$ is approximately normal, then we have
\[ P\left( \left| \frac{S_n}{n} - \mu \right| < .0001 \right) = P\left( -.5\sqrt{n} < S_n^* < +.5\sqrt{n} \right) , \]
and this last expression is greater than .95 if $.5\sqrt{n} \ge 2$.
This says that it suffices to take $n = 16$ measurements for the same results. This second calculation is stronger, but depends on the assumption that $n = 16$ is large enough to establish the normal density as a good approximation to $S_n^*$, and hence to $S_n$. The Central Limit Theorem here says nothing about how large $n$ has to be. In most cases involving sums of independent random variables, a good rule of thumb is that for $n \ge 30$, the approximation is a good one. In the present case, if we assume that the errors are approximately normally distributed, then the approximation is probably fairly good even for $n = 16$. □

Estimating the Mean

Example 9.10 (Continuation of Example 9.9) Now suppose our surveyor is measuring an unknown distance with the same instruments under the same conditions. He takes 36 measurements and averages them. How sure can he be that his measurement lies within .0002 of the true value? Again using the normal approximation, we get
\[ P\left( \left| \frac{S_n}{n} - \mu \right| < .0002 \right) = P\left( |S_n^*| < .5\sqrt{n} \right) = \frac{1}{\sqrt{2\pi}} \int_{-3}^{3} e^{-x^2/2}\, dx \approx .997 . \]

This means that the surveyor can be 99.7 percent sure that his average is within .0002 of the true value. To improve his confidence, he can take more measurements, or require less accuracy, or improve the quality of his measurements (i.e., reduce the variance $\sigma^2$). In each case, the Central Limit Theorem gives quantitative information about the confidence of a measurement process, assuming always that the normal approximation is valid.

Now suppose the surveyor does not know the mean or standard deviation of his measurements, but assumes that they are independent. How should he proceed? Again, he makes several measurements of a known distance and averages them. As before, the average error is approximately normally distributed, but now with unknown mean and variance. □

Sample Mean

If he knows the variance $\sigma^2$ of the error distribution is .0002, then he can estimate the mean $\mu$ by taking the average, or sample mean of, say, 36 measurements:
\[ \bar\mu = \frac{x_1 + x_2 + \cdots + x_n}{n} , \]
where $n = 36$. Then, as before, $E(\bar\mu) = \mu$. Moreover, the preceding argument shows that
\[ P(|\bar\mu - \mu| < .0002) \approx .997 . \]
The interval $(\bar\mu - .0002, \bar\mu + .0002)$ is called the 99.7% confidence interval for $\mu$ (see Example 9.4).

Sample Variance

If he does not know the variance $\sigma^2$ of the error distribution, then he can estimate $\sigma^2$ by the sample variance:
\[ \bar\sigma^2 = \frac{(x_1 - \bar\mu)^2 + (x_2 - \bar\mu)^2 + \cdots + (x_n - \bar\mu)^2}{n} , \]
where $n = 36$. The Law of Large Numbers, applied to the random variables $(X_i - \mu)^2$, says that for large $n$, the sample variance $\bar\sigma^2$ lies close to the variance $\sigma^2$, so that the surveyor can use $\bar\sigma^2$ in place of $\sigma^2$ in the argument above.

Experience has shown that, in most practical problems of this type, the sample variance is a good estimate for the variance, and can be used in place of the variance to determine confidence levels for the sample mean. This means that we can rely on the Law of Large Numbers for estimating the variance, and the Central Limit Theorem for estimating the mean.

We can check this in some special cases. Suppose we know that the error distribution is normal, with unknown mean and variance. Then we can take a sample of $n$ measurements, find the sample mean $\bar\mu$ and sample variance $\bar\sigma^2$, and form
\[ T_n^* = \frac{S_n - n\mu}{\sqrt{n}\, \bar\sigma} , \]
where $n = 36$. We expect $T_n^*$ to be a good approximation for $S_n^*$ for large $n$.

t-Density

The statistician W. S. Gosset$^{13}$ has shown that in this case $T_n^*$ has a density function that is not normal but rather a t-density with $n$ degrees of freedom. (The number $n$ of degrees of freedom is simply a parameter which tells us which t-density to use.) In this case we can use the t-density in place of the normal density to determine confidence levels for $\mu$. As $n$ increases, the t-density approaches the normal density.
Indeed, even for $n = 8$ the t-density and normal density are practically the same (see Figure 9.15).

$^{13}$ W. S. Gosset discovered the distribution we now call the t-distribution while working for the Guinness Brewery in Dublin. He wrote under the pseudonym "Student." The results discussed here first appeared in Student, "The Probable Error of a Mean," Biometrika, vol. 6 (1908), pp. 1-24.

Figure 9.15: Graph of t-density for $n = 1, 3, 8$ and the normal density with $\mu = 0$, $\sigma = 1$.

Exercises

Notes on computer problems:

(a) Simulation: Recall (see Corollary 5.2) that
\[ X = F^{-1}(rnd) \]
will simulate a random variable with density $f(x)$ and distribution
\[ F(x) = \int_{-\infty}^{x} f(t)\, dt . \]
In the case that $f(x)$ is a normal density function with mean $\mu$ and standard deviation $\sigma$, where neither $F$ nor $F^{-1}$ can be expressed in closed form, use instead
\[ X = \sigma \sqrt{-2 \log(rnd)}\, \cos 2\pi(rnd) + \mu . \]

(b) Bar graphs: you should aim for about 20 to 30 bars (of equal width) in your graph. You can achieve this by a good choice of the range $[xmin, xmax]$ and the number of bars (for instance, $[\mu - 3\sigma, \mu + 3\sigma]$ with 30 bars will work in many cases). Experiment!

1 Let $X$ be a continuous random variable with mean $\mu(X)$ and variance $\sigma^2(X)$, and let $X^* = (X - \mu)/\sigma$ be its standardized version. Verify directly that $\mu(X^*) = 0$ and $\sigma^2(X^*) = 1$.

2 Let $\{X_k\}$, $1 \le k \le n$, be a sequence of independent random variables, all with mean 0 and variance 1, and let $S_n$, $S_n^*$, and $A_n$ be their sum, standardized sum, and average, respectively. Verify directly that $S_n^* = S_n/\sqrt{n} = \sqrt{n} A_n$.

3 Let $\{X_k\}$, $1 \le k \le n$, be a sequence of random variables, all with mean $\mu$ and variance $\sigma^2$, and $Y_k = X_k^*$ be their standardized versions. Let $S_n$ and $T_n$ be the sum of the $X_k$ and $Y_k$, and $S_n^*$ and $T_n^*$ their standardized version. Show that $S_n^* = T_n^* = T_n/\sqrt{n}$.

4 Suppose we choose independently 25 numbers at random (uniform density) from the interval $[0, 20]$.
Write the normal densities that approximate the densities of their sum $S_{25}$, their standardized sum $S_{25}^*$, and their average $A_{25}$.

5 Write a program to choose independently 25 numbers at random from $[0, 20]$, compute their sum $S_{25}$, and repeat this experiment 1000 times. Make a bar graph for the density of $S_{25}$ and compare it with the normal approximation of Exercise 4. How good is the fit? Now do the same for the standardized sum $S_{25}^*$ and the average $A_{25}$.

6 In general, the Central Limit Theorem gives a better estimate than Chebyshev's inequality for the average of a sum. To see this, let $A_{25}$ be the average calculated in Exercise 5, and let $N$ be the normal approximation for $A_{25}$. Modify your program in Exercise 5 to provide a table of the function $F(x) = P(|A_{25} - 10| \ge x)$ = fraction of the total of 1000 trials for which $|A_{25} - 10| \ge x$. Do the same for the function $f(x) = P(|N - 10| \ge x)$. (You can use the normal table, Table 9.4, or the procedure NormalArea for this.) Now plot on the same axes the graphs of $F(x)$, $f(x)$, and the Chebyshev function $g(x) = 4/(3x^2)$. How do $f(x)$ and $g(x)$ compare as estimates for $F(x)$?

7 The Central Limit Theorem says the sums of independent random variables tend to look normal, no matter what crazy distribution the individual variables have. Let us test this by a computer simulation. Choose independently 25 numbers from the interval $[0, 1]$ with the probability density $f(x)$ given below, and compute their sum $S_{25}$. Repeat this experiment 1000 times, and make up a bar graph of the results. Now plot on the same graph the density $\phi(x) = \mathrm{normal}(x, \mu(S_{25}), \sigma(S_{25}))$. How well does the normal density fit your bar graph in each case?
(a) $f(x) = 1$.
(b) $f(x) = 2x$.
(c) $f(x) = 3x^2$.
(d) $f(x) = 4|x - 1/2|$.

8 Repeat the experiment described in Exercise 7 but now choose the 25 numbers from $[0, \infty)$, using $f(x) = e^{-x}$.

9 How large must $n$ be before $S_n = X_1 + X_2 + \cdots + X_n$ is approximately normal?
This number is often surprisingly small. Let us explore this question with a computer simulation. Choose $n$ numbers from $[0, 1]$ with probability density $f(x)$, where $n = 3, 6, 12, 20$, and $f(x)$ is each of the densities in Exercise 7. Compute their sum $S_n$, repeat this experiment 1000 times, and make up a bar graph of 20 bars of the results. How large must $n$ be before you get a good fit?

10 A surveyor is measuring the height of a cliff known to be about 1000 feet. He assumes his instrument is properly calibrated and that his measurement errors are independent, with mean $\mu = 0$ and variance $\sigma^2 = 10$. He plans to take $n$ measurements and form the average. Estimate, using (a) Chebyshev's inequality and (b) the normal approximation, how large $n$ should be if he wants to be 95 percent sure that his average falls within 1 foot of the true value. Now estimate, using (a) and (b), what value $\sigma^2$ should have if he wants to make only 10 measurements with the same confidence.

11 The price of one share of stock in the Pilsdorff Beer Company (see Exercise 8.2.12) is given by $Y_n$ on the $n$th day of the year. Finn observes that the differences $X_n = Y_{n+1} - Y_n$ appear to be independent random variables with a common distribution having mean $\mu = 0$ and variance $\sigma^2 = 1/4$. If $Y_1 = 100$, estimate the probability that $Y_{365}$ is
(a) 100.
(b) 110.
(c) 120.

12 Test your conclusions in Exercise 11 by computer simulation. First choose 364 numbers $X_i$ with density $f(x) = \mathrm{normal}(x, 0, 1/4)$. Now form the sum $Y_{365} = 100 + X_1 + X_2 + \cdots + X_{364}$, and repeat this experiment 200 times. Make up a bar graph on $[50, 150]$ of the results, superimposing the graph of the approximating normal density. What does this graph say about your answers in Exercise 11?

13 Physicists say that particles in a long tube are constantly moving back and forth along the tube, each with a velocity $V_k$ (in cm/sec) at any given moment that is normally distributed, with mean $\mu = 0$ and variance $\sigma^2 = 1$. Suppose there are $10^{20}$ particles in the tube.
(a) Find the mean and variance of the average velocity of the particles.
(b) What is the probability that the average velocity is $10^{-9}$ cm/sec?

14 An astronomer makes $n$ measurements of the distance between Jupiter and a particular one of its moons. Experience with the instruments used leads her to believe that for the proper units the measurements will be normally distributed with mean $d$, the true distance, and variance 16. She performs a series of $n$ measurements. Let
\[ A_n = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
be the average of these measurements.
(a) Show that $P(|A_n - d| < 8$ [...]

[...] $> 0$ for $1 \le j \le n$, and that

10.1. DISCRETE DISTRIBUTIONS

We note that $g(t)$ is differentiable for all $t$, since it is a finite linear combination of exponential functions. If we compute $g'(t)/g(t)$, we obtain
\[ \frac{x_1 p(x_1) e^{t x_1} + \cdots + x_n p(x_n) e^{t x_n}}{p(x_1) e^{t x_1} + \cdots + p(x_n) e^{t x_n}} . \]
Dividing both top and bottom by $e^{t x_n}$, we obtain the expression
\[ \frac{x_1 p(x_1) e^{t(x_1 - x_n)} + \cdots + x_n p(x_n)}{p(x_1) e^{t(x_1 - x_n)} + \cdots + p(x_n)} . \]
Since $x_n$ is the largest of the $x_j$'s, this expression approaches $x_n$ as $t$ goes to $\infty$. So we have shown that
\[ x_n = \lim_{t \to \infty} \frac{g'(t)}{g(t)} . \]
To find $p(x_n)$, we simply divide $g(t)$ by $e^{t x_n}$ and let $t$ go to $\infty$. Once $x_n$ and $p(x_n)$ have been determined, we can subtract $p(x_n) e^{t x_n}$ from $g(t)$, and repeat the above procedure with the resulting function, obtaining, in turn, $x_{n-1}, \ldots, x_1$ and $p(x_{n-1}), \ldots, p(x_1)$. □

If we delete the hypothesis that $X$ have finite range in the above theorem, then the conclusion is no longer necessarily true.

Ordinary Generating Functions

In the special but important case where the $x_j$ are all nonnegative integers, $x_j = j$, we can prove this theorem in a simpler way. In this case, we have
\[ g(t) = \sum_{j=0}^{n} e^{tj} p(j) , \]
and we see that $g(t)$ is a polynomial in $e^t$.
If we write $z = e^t$, and define the function $h$ by
\[ h(z) = \sum_{j=0}^{n} z^j p(j) , \]
then $h(z)$ is a polynomial in $z$ containing the same information as $g(t)$, and in fact
\[ h(z) = g(\log z) , \qquad g(t) = h(e^t) . \]
The function $h(z)$ is often called the ordinary generating function for $X$. Note that $h(1) = g(0) = 1$, $h'(1) = g'(0) = \mu_1$, and $h''(1) = g''(0) - g'(0) = \mu_2 - \mu_1$. It follows from all this that if we know $g(t)$, then we know $h(z)$, and if we know $h(z)$, then we can find the $p(j)$ by Taylor's formula:
\[ p(j) = \text{coefficient of } z^j \text{ in } h(z) = \frac{h^{(j)}(0)}{j!} . \]

CHAPTER 10. GENERATING FUNCTIONS

For example, suppose we know that the moments of a certain discrete random variable $X$ are given by
\[ \mu_0 = 1 , \qquad \mu_k = \frac12 + \frac{2^k}{4} , \quad \text{for } k \ge 1 . \]
Then the moment generating function $g$ of $X$ is
\[ g(t) = \sum_{k=0}^{\infty} \frac{\mu_k t^k}{k!} = 1 + \frac12 \sum_{k=1}^{\infty} \frac{t^k}{k!} + \frac14 \sum_{k=1}^{\infty} \frac{(2t)^k}{k!} = \frac14 + \frac12 e^t + \frac14 e^{2t} . \]
This is a polynomial in $z = e^t$, and
\[ h(z) = \frac14 + \frac12 z + \frac14 z^2 . \]
Hence, $X$ must have range $\{0, 1, 2\}$, and $p$ must have values $\{1/4, 1/2, 1/4\}$.

Properties

Both the moment generating function $g$ and the ordinary generating function $h$ have many properties useful in the study of random variables, of which we can consider only a few here. In particular, if $X$ is any discrete random variable and $Y = X + a$, then
\[ g_Y(t) = E(e^{tY}) = E(e^{t(X+a)}) = e^{ta} E(e^{tX}) = e^{ta} g_X(t) , \]
while if $Y = bX$, then
\[ g_Y(t) = E(e^{tY}) = E(e^{tbX}) = g_X(bt) . \]
In particular, if
\[ X^* = \frac{X - \mu}{\sigma} , \]
then (see Exercise 11)
\[ g_{X^*}(t) = e^{-\mu t/\sigma} g_X\left( \frac{t}{\sigma} \right) . \]

If $X$ and $Y$ are independent random variables and $Z = X + Y$ is their sum, with $p_X$, $p_Y$, and $p_Z$ the associated distribution functions, then we have seen in Chapter 7 that $p_Z$ is the convolution of $p_X$ and $p_Y$, and we know that convolution involves a rather complicated calculation. But for the generating functions we have instead the simple relations
\[ g_Z(t) = g_X(t) g_Y(t) , \qquad h_Z(z) = h_X(z) h_Y(z) , \]
that is, $g_Z$ is simply the product of $g_X$ and $g_Y$, and similarly for $h_Z$.
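The product relations can be checked numerically on small examples. The sketch below is only an illustration: the two distributions and the evaluation points are arbitrary choices, and the helper names `convolve`, `ogf`, and `mgf` are not from the text. Distributions on $\{0, 1, 2, \ldots\}$ are stored as lists whose $j$th entry is $p(j)$.

```python
import math

def convolve(px, py):
    """Distribution of Z = X + Y for independent X, Y on {0, 1, 2, ...}."""
    pz = [0.0] * (len(px) + len(py) - 1)
    for i, a in enumerate(px):
        for j, b in enumerate(py):
            pz[i + j] += a * b
    return pz

def ogf(p, z):
    """Ordinary generating function h(z) = sum over j of p(j) z^j."""
    return sum(pj * z ** j for j, pj in enumerate(p))

def mgf(p, t):
    """Moment generating function g(t) = sum over j of p(j) e^{tj}."""
    return sum(pj * math.exp(t * j) for j, pj in enumerate(p))

if __name__ == "__main__":
    px = [0.25, 0.5, 0.25]   # the distribution {1/4, 1/2, 1/4} on {0, 1, 2}
    py = [1 / 3, 2 / 3]      # an arbitrary second distribution on {0, 1}
    pz = convolve(px, py)
    z, t = 0.7, 0.3          # arbitrary evaluation points
    print(ogf(pz, z), ogf(px, z) * ogf(py, z))
    print(mgf(pz, t), mgf(px, t) * mgf(py, t))
```

In each printed pair the two numbers should agree up to rounding, since multiplying the polynomials $h_X$ and $h_Y$ performs exactly the convolution of their coefficients.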
To see this, first note that if $X$ and $Y$ are independent, then $e^{tX}$ and $e^{tY}$ are independent (see Exercise 5.2.38), and hence
\[ E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) . \]
It follows that
\[ g_Z(t) = E(e^{tZ}) = E(e^{t(X+Y)}) = E(e^{tX}) E(e^{tY}) = g_X(t) g_Y(t) , \]
and, replacing $t$ by $\log z$, we also get
\[ h_Z(z) = h_X(z) h_Y(z) . \]

Example 10.5 If $X$ and $Y$ are independent discrete random variables with range $\{0, 1, 2, \ldots, n\}$ and binomial distribution
\[ p_X(j) = p_Y(j) = \binom{n}{j} p^j q^{n-j} , \]
and if $Z = X + Y$, then we know (cf. Section 7.1) that the range of $Z$ is $\{0, 1, 2, \ldots, 2n\}$ and $Z$ has binomial distribution
\[ p_Z(j) = (p_X * p_Y)(j) = \binom{2n}{j} p^j q^{2n-j} . \]
Here we can easily verify this result by using generating functions. We know that
\[ g_X(t) = g_Y(t) = \sum_{j=0}^{n} e^{tj} \binom{n}{j} p^j q^{n-j} = (pe^t + q)^n , \]
and
\[ h_X(z) = h_Y(z) = (pz + q)^n . \]
Hence, we have
\[ g_Z(t) = g_X(t) g_Y(t) = (pe^t + q)^{2n} , \]
or, what is the same,
\[ h_Z(z) = h_X(z) h_Y(z) = (pz + q)^{2n} = \sum_{j=0}^{2n} \binom{2n}{j} (pz)^j q^{2n-j} , \]
from which we can see that the coefficient of $z^j$ is just $p_Z(j) = \binom{2n}{j} p^j q^{2n-j}$. □

Example 10.6 If $X$ and $Y$ are independent discrete random variables with the non-negative integers $\{0, 1, 2, 3, \ldots\}$ as range, and with geometric distribution function
\[ p_X(j) = p_Y(j) = q^j p , \]
then
\[ g_X(t) = g_Y(t) = \frac{p}{1 - qe^t} , \]
and if $Z = X + Y$, then
\[ g_Z(t) = g_X(t) g_Y(t) = \frac{p^2}{1 - 2qe^t + q^2 e^{2t}} . \]
If we replace $e^t$ by $z$, we get
\[ h_Z(z) = \frac{p^2}{(1 - qz)^2} = p^2 \sum_{k=0}^{\infty} (k + 1) q^k z^k , \]
and we can read off the values of $p_Z(j)$ as the coefficient of $z^j$ in this expansion for $h(z)$, even though $h(z)$ is not a polynomial in this case. The distribution $p_Z$ is a negative binomial distribution (see Section 5.1). □

Here is a more interesting example of the power and scope of the method of generating functions.

Heads or Tails

Example 10.7 In the coin-tossing game discussed in Example 1.4, we now consider the question "When is Peter first in the lead?" Let $X_k$ describe the outcome of the $k$th trial in the game:
\[ X_k = \begin{cases} +1, & \text{if $k$th toss is heads,} \\ -1, & \text{if $k$th toss is tails.} \end{cases} \]
Then the $X_k$ are independent random variables describing a Bernoulli process.
Let $S_0 = 0$, and, for $n \ge 1$, let
\[ S_n = X_1 + X_2 + \cdots + X_n . \]
Then $S_n$ describes Peter's fortune after $n$ trials, and Peter is first in the lead after $n$ trials if $S_k \le 0$ for $1 \le k < n$ and $S_n = 1$. This can happen when $n = 1$, in which case $S_1 = X_1 = 1$, or when $n > 1$, in which case $S_1 = X_1 = -1$. In the latter case, $S_k = 0$ for $k = n - 1$, and perhaps for other $k$ between 1 and $n$. Let $m$ be the least such value of $k$; then $S_m = 0$ and $S_k < 0$ for $1 \le k < m$. In this case Peter loses on the first trial, regains his initial position in the next $m - 1$ trials, and gains the lead in the next $n - m$ trials.

Let $p$ be the probability that the coin comes up heads, and let $q = 1 - p$. Let $r_n$ be the probability that Peter is first in the lead after $n$ trials. Then from the discussion above, we see that
\[ \begin{aligned} r_n &= 0 , && \text{if $n$ even,} \\ r_1 &= p && \text{(= probability of heads in a single toss),} \\ r_n &= q(r_1 r_{n-2} + r_3 r_{n-4} + \cdots + r_{n-2} r_1) , && \text{if $n > 1$, $n$ odd.} \end{aligned} \]

Now let $T$ describe the time (that is, the number of trials) required for Peter to take the lead. Then $T$ is a random variable, and since $P(T = n) = r_n$, $r$ is the distribution function for $T$. We introduce the generating function $h_T(z)$ for $T$:
\[ h_T(z) = \sum_{n=0}^{\infty} r_n z^n . \]
Then, by using the relations above, we can verify the relation
\[ h_T(z) = pz + qz(h_T(z))^2 . \]
If we solve this quadratic equation for $h_T(z)$, we get
\[ h_T(z) = \frac{1 \pm \sqrt{1 - 4pqz^2}}{2qz} = \frac{2pz}{1 \mp \sqrt{1 - 4pqz^2}} . \]
Of these two solutions, we want the one that has a convergent power series in $z$ (i.e., that is finite for $z = 0$). Hence we choose
\[ h_T(z) = \frac{1 - \sqrt{1 - 4pqz^2}}{2qz} = \frac{2pz}{1 + \sqrt{1 - 4pqz^2}} . \]

Now we can ask: What is the probability that Peter is ever in the lead? This probability is given by (see Exercise 10)
\[ \sum_{n=0}^{\infty} r_n = h_T(1) = \frac{1 - \sqrt{1 - 4pq}}{2q} = \frac{1 - |p - q|}{2q} = \begin{cases} p/q, & \text{if } p < q, \\ 1, & \text{if } p \ge q, \end{cases} \]
so that Peter is sure to be in the lead eventually if $p \ge q$. How long will it take? That is, what is the expected value of $T$? This value is given by
\[ E(T) = h_T'(1) = \begin{cases} 1/(p - q), & \text{if } p > q, \\ \infty, & \text{if } p = q. \end{cases} \]
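Both formulas can be checked against the recurrence for $r_n$. The following is a minimal sketch under stated assumptions: the function name `first_lead_probs`, the truncation points, and the sample values of $p$ are illustrative choices, and the infinite sums are replaced by partial sums, which is harmless here because the terms decay geometrically when $p \ne q$.

```python
def first_lead_probs(p, max_n):
    """Return [r_0, ..., r_max_n], where r_n = P(Peter is first in the lead at trial n)."""
    q = 1.0 - p
    r = [0.0] * (max_n + 1)   # r_n = 0 for even n
    if max_n >= 1:
        r[1] = p
    for n in range(3, max_n + 1, 2):
        # r_n = q (r_1 r_{n-2} + r_3 r_{n-4} + ... + r_{n-2} r_1)
        r[n] = q * sum(r[i] * r[n - 1 - i] for i in range(1, n - 1, 2))
    return r

if __name__ == "__main__":
    # p < q: the probability of ever taking the lead should be near p/q.
    r = first_lead_probs(0.4, 301)
    print(sum(r), 0.4 / 0.6)
    # p > q: the lead is certain, and E(T) should be near 1/(p - q).
    r = first_lead_probs(0.6, 501)
    print(sum(r), sum(n * rn for n, rn in enumerate(r)), 1 / (0.6 - 0.4))
```

For $p = q = 1/2$ the partial sums of $n r_n$ keep growing as the truncation point increases, in line with $E(T) = \infty$.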
This says that if $p > q$, then Peter can expect to be in the lead by about $1/(p - q)$ trials, but if $p = q$, he can expect to wait a long time.

A related problem, known as the Gambler's Ruin problem, is studied in Exercise 23 and in Section 12.2. □

Exercises

1 Find the generating functions, both ordinary $h(z)$ and moment $g(t)$, for the following discrete probability distributions.
(a) The distribution describing a fair coin.
(b) The distribution describing a fair die.
(c) The distribution describing a die that always comes up 3.
(d) The uniform distribution on the set $\{n, n + 1, n + 2, \ldots, n + k\}$.
(e) The binomial distribution on $\{n, n + 1, n + 2, \ldots, n + k\}$.
(f) The geometric distribution on $\{0, 1, 2, \ldots\}$ with $p(j) = 2/3^{j+1}$.

2 For each of the distributions (a) through (d) of Exercise 1 calculate the first and second moments, $\mu_1$ and $\mu_2$, directly from their definition, and verify that $h(1) = 1$, $h'(1) = \mu_1$, and $h''(1) = \mu_2 - \mu_1$.

3 Let $p$ be a probability distribution on $\{0, 1, 2\}$ with moments $\mu_1 = 1$, $\mu_2 = 3/2$.
(a) Find its ordinary generating function $h(z)$.
(b) Using (a), find its moment generating function.
(c) Using (b), find its first six moments.
(d) Using (a), find $p_0$, $p_1$, and $p_2$.

4 In Exercise 3, the probability distribution is completely determined by its first two moments. Show that this is always true for any probability distribution on $\{0, 1, 2\}$. Hint: Given $\mu_1$ and $\mu_2$, find $h(z)$ as in Exercise 3 and use $h(z)$ to determine $p$.

5 Let $p$ and $p'$ be the two distributions
\[ p = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 1/3 & 0 & 0 & 2/3 & 0 \end{pmatrix} , \qquad p' = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 0 & 2/3 & 0 & 0 & 1/3 \end{pmatrix} . \]
(a) Show that $p$ and $p'$ have the same first and second moments, but not the same third and fourth moments.
(b) Find the ordinary and moment generating functions for $p$ and $p'$.

6 Let $p$ be the probability distribution
\[ p = \begin{pmatrix} 0 & 1 & 2 \\ 0 & 1/3 & 2/3 \end{pmatrix} , \]
and let $p_n = p * p * \cdots * p$ be the $n$-fold convolution of $p$ with itself.
(a) Find $p_2$ by direct calculation (see Definition 7.1).
(b) Find the ordinary generating functions $h(z)$ and $h_2(z)$ for $p$ and $p_2$, and verify that $h_2(z) = (h(z))^2$.
(c) Find $h_n(z)$ from $h(z)$.
(d) Find the first two moments, and hence the mean and variance, of $p_n$ from $h_n(z)$. Verify that the mean of $p_n$ is $n$ times the mean of $p$.
(e) Find those integers $j$ for which $p_n(j) > 0$ from $h_n(z)$.

7 Let $X$ be a discrete random variable with values in $\{0, 1, 2, \ldots, n\}$ and moment generating function $g(t)$. Find, in terms of $g(t)$, the generating functions for
(a) $-X$.
(b) $X + 1$.
(c) $3X$.
(d) $aX + b$.

8 Let $X_1, X_2, \ldots, X_n$ be an independent trials process, with values in $\{0, 1\}$ and mean $\mu = 1/3$. Find the ordinary and moment generating functions for the distribution of
(a) $S_1 = X_1$. Hint: First find $X_1$ explicitly.
(b) $S_2 = X_1 + X_2$.
(c) $S_n = X_1 + X_2 + \cdots + X_n$.

9 Let $X$ and $Y$ be random variables with values in $\{1, 2, 3, 4, 5, 6\}$ with distribution functions $p_X$ and $p_Y$ given by

$$p_X(j) = a_j, \qquad p_Y(j) = b_j .$$

(a) Find the ordinary generating functions $h_X(z)$ and $h_Y(z)$ for these distributions.
(b) Find the ordinary generating function $h_Z(z)$ for the distribution $Z = X + Y$.
(c) Show that $h_Z(z)$ cannot ever have the form

$$h_Z(z) = \frac{z^2 + z^3 + \cdots + z^{12}}{11} .$$

Hint: $h_X$ and $h_Y$ must have at least one nonzero root, but $h_Z(z)$ in the form given has no nonzero real roots.

It follows from this observation that there is no way to load two dice so that the probability that a given sum will turn up when they are tossed is the same for all sums (i.e., that all outcomes are equally likely).

10 Show that if

$$h(z) = \frac{1 - \sqrt{1 - 4pqz^2}}{2qz},$$

then

$$h(1) = \begin{cases} p/q, & \text{if } p \leq q, \\ 1, & \text{if } p \geq q, \end{cases}$$

and

$$h'(1) = \begin{cases} 1/(p - q), & \text{if } p > q, \\ \infty, & \text{if } p = q. \end{cases}$$

11 Show that if $X$ is a random variable with mean $\mu$ and variance $\sigma^2$, and if $X^* = (X - \mu)/\sigma$ is the standardized version of $X$, then

$$g_{X^*}(t) = e^{-\mu t/\sigma}\, g_X\!\left(\frac{t}{\sigma}\right).$$

10.2 Branching Processes

Historical Background

In this section we apply the theory of generating functions to the study of an important chance process called a branching process.
Until recently it was thought that the theory of branching processes originated with the following problem posed by Francis Galton in the Educational Times in 1873.[1]

Problem 4001: A large nation, of whom we will only concern ourselves with the adult males, $N$ in number, and who each bear separate surnames, colonise a district. Their law of population is such that, in each generation, $a_0$ per cent of the adult males have no male children who reach adult life; $a_1$ have one such male child; $a_2$ have two; and so on up to $a_5$ who have five.
Find (1) what proportion of the surnames will have become extinct after $r$ generations; and (2) how many instances there will be of the same surname being held by $m$ persons.

[1] D. G. Kendall, "Branching Processes Since 1873," Journal of London Mathematics Society, vol. 41 (1966), p. 386.

The first attempt at a solution was given by Reverend H. W. Watson. Because of a mistake in algebra, he incorrectly concluded that a family name would always die out with probability 1. However, the methods that he employed to solve the problems were, and still are, the basis for obtaining the correct solution.

Heyde and Seneta discovered an earlier communication by Bienaymé (1845) that anticipated Galton and Watson by 28 years. Bienaymé showed, in fact, that he was aware of the correct solution to Galton's problem. Heyde and Seneta in their book I. J. Bienaymé: Statistical Theory Anticipated,[2] give the following translation from Bienaymé's paper:

If ... the mean of the number of male children who replace the number of males of the preceding generation were less than unity, it would be easily realized that families are dying out due to the disappearance of the members of which they are composed. However, the analysis shows further that when this mean is equal to unity families tend to disappear, although less rapidly ....
The analysis also shows clearly that if the mean ratio is greater than unity, the probability of the extinction of families with the passing of time no longer reduces to certainty. It only approaches a finite limit, which is fairly simple to calculate and which has the singular characteristic of being given by one of the roots of the equation (in which the number of generations is made infinite) which is not relevant to the question when the mean ratio is less than unity.[3]

Although Bienaymé does not give his reasoning for these results, he did indicate that he intended to publish a special paper on the problem. The paper was never written, or at least has never been found. In his communication Bienaymé indicated that he was motivated by the same problem that occurred to Galton. The opening paragraph of his paper as translated by Heyde and Seneta says,

A great deal of consideration has been given to the possible multiplication of the numbers of mankind; and recently various very curious observations have been published on the fate which allegedly hangs over the aristocracy and middle classes; the families of famous men, etc. This fate, it is alleged, will inevitably bring about the disappearance of the so-called families fermées.[4]

A much more extensive discussion of the history of branching processes may be found in two papers by David G. Kendall.[5]

[2] C. C. Heyde and E. Seneta, I. J. Bienaymé: Statistical Theory Anticipated (New York: Springer Verlag, 1977).
[3] Ibid., pp. 117-118.
[4] Ibid., p. 118.
[5] D. G. Kendall, "Branching Processes Since 1873," pp. 385-406; and "The Genealogy of Genealogy: Branching Processes Before (and After) 1873," Bulletin London Mathematics Society, vol. 7 (1975), pp. 225-253.

Figure 10.1: Tree diagram for Example 10.8.
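Branching processes have served not only as crude models for population growth but also as models for certain physical processes such as chemical and nuclear chain reactions.

Problem of Extinction

We turn now to the first problem posed by Galton (i.e., the problem of finding the probability of extinction for a branching process). We start in the 0th generation with 1 male parent. In the first generation we shall have 0, 1, 2, 3, ... male offspring with probabilities $p_0, p_1, p_2, p_3, \ldots$. If in the first generation there are $k$ offspring, then in the second generation there will be $X_1 + X_2 + \cdots + X_k$ offspring, where $X_1, X_2, \ldots, X_k$ are independent random variables, each with the common distribution $p_0, p_1, p_2, \ldots$. This description enables us to construct a tree, and a tree measure, for any number of generations.

Examples

Example 10.8 Assume that $p_0 = 1/2$, $p_1 = 1/4$, and $p_2 = 1/4$. Then the tree measure for the first two generations is shown in Figure 10.1.

Note that we use the theory of sums of independent random variables to assign branch probabilities. For example, if there are two offspring in the first generation, the probability that there will be two in the second generation is

$$P(X_1 + X_2 = 2) = p_0 p_2 + p_1 p_1 + p_2 p_0 = \frac12 \cdot \frac14 + \frac14 \cdot \frac14 + \frac14 \cdot \frac12 = \frac{5}{16} .$$

We now study the probability that our process dies out (i.e., that at some generation there are no offspring).

Let $d_m$ be the probability that the process dies out by the $m$th generation. Of course, $d_0 = 0$. In our example, $d_1 = 1/2$ and $d_2 = 1/2 + 1/8 + 1/16 = 11/16$ (see Figure 10.1). Note that we must add the probabilities for all paths that lead to 0 by the $m$th generation. It is clear from the definition that

$$0 = d_0 \leq d_1 \leq d_2 \leq \cdots \leq 1 .$$

Hence, $d_m$ converges to a limit $d$, $0 \leq d \leq 1$, and $d$ is the probability that the process will ultimately die out. To determine it, we express $d_m$ in terms of the possible outcomes of the first generation. If there are $j$ offspring in the first generation, then to die out by the $m$th generation, each of these lines must die out in $m - 1$ generations. Since they proceed independently, this probability is $(d_{m-1})^j$. Therefore

$$d_m = p_0 + p_1 d_{m-1} + p_2 (d_{m-1})^2 + p_3 (d_{m-1})^3 + \cdots .$$

Let $h(z) = p_0 + p_1 z + p_2 z^2 + \cdots$ be the ordinary generating function for the $p_i$. Then $d_m = h(d_{m-1})$, and since $d_m \to d$, the value $d$ satisfies the equation $d = h(d)$. One solution of this equation is always $d = 1$, since $1 = p_0 + p_1 + p_2 + \cdots$. The solutions of $d = h(d)$ are the intersections of the graphs of $y = z$ and $y = h(z)$, so we study the graph of $y = h(z)$. We note that $h(0) = p_0$. Also,

$$h'(z) = p_1 + 2p_2 z + 3p_3 z^2 + \cdots , \qquad (10.4)$$

and

$$h''(z) = 2p_2 + 3 \cdot 2\, p_3 z + 4 \cdot 3\, p_4 z^2 + \cdots .$$

From this we see that for $z \geq 0$, $h'(z) \geq 0$ and $h''(z) \geq 0$. Thus for nonnegative $z$, $h(z)$ is an increasing function and is concave upward.

Figure 10.2: Graphs of $y = z$ and $y = h(z)$, panels (a), (b), (c).

Therefore the graph of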
$y = h(z)$ can intersect the line $y = z$ in at most two points. Since we know it must intersect the line $y = z$ at $(1, 1)$, we know that there are just three possibilities, as shown in Figure 10.2.

In case (a) the equation $d = h(d)$ has roots $\{d, 1\}$ with $0 \leq d < 1$. In the second case (b) it has only the one root $d = 1$. In case (c) it has two roots $\{1, d\}$ where $1 < d$. Since we are looking for a solution $0 \leq d \leq 1$, we see in cases (b) and (c) that our only solution is 1. In these cases we can conclude that the process will die out with probability 1. However in case (a) we are in doubt. We must study this case more carefully.

From Equation 10.4 we see that

$$h'(1) = p_1 + 2p_2 + 3p_3 + \cdots = m ,$$

where $m$ is the expected number of offspring produced by a single parent. In case (a) we have $h'(1) > 1$, in (b) $h'(1) = 1$, and in (c) $h'(1) < 1$. Thus our three cases correspond to $m > 1$, $m = 1$, and $m < 1$. We assume now that $m > 1$. Recall that $d_0 = 0$, $d_1 = h(d_0) = p_0$, $d_2 = h(d_1)$, ..., and $d_n = h(d_{n-1})$. We can construct these values geometrically, as shown in Figure 10.3.

We can see geometrically, as indicated for $d_0$, $d_1$, $d_2$, and $d_3$ in Figure 10.3, that the points $(d_i, h(d_i))$ will always lie above the line $y = z$. Hence, they must converge to the first intersection of the curves $y = z$ and $y = h(z)$ (i.e., to the root $d < 1$). This leads us to the following theorem. □

Figure 10.3: Geometric determination of $d$.

Theorem 10.2 Consider a branching process with generating function $h(z)$ for the number of offspring of a given parent. Let $d$ be the smallest root of the equation $z = h(z)$. If the mean number $m$ of offspring produced by a single parent is $\leq 1$, then $d = 1$ and the process dies out with probability 1. If $m > 1$ then $d < 1$ and the process dies out with probability $d$. □

We shall often want to know the probability that a branching process dies out by a particular generation, as well as the limit of these probabilities. Let $d_n$ be
the probability of dying out by the $n$th generation. Then we know that $d_1 = p_0$. We know further that $d_n = h(d_{n-1})$ where $h(z)$ is the generating function for the number of offspring produced by a single parent. This makes it easy to compute these probabilities.

The program Branch calculates the values of $d_n$. We have run this program for 12 generations for the case that a parent can produce at most two offspring and the probabilities for the number produced are $p_0 = .2$, $p_1 = .5$, and $p_2 = .3$. The results are given in Table 10.1.

We see that the probability of dying out by 12 generations is about .6. We shall see in the next example that the probability of eventually dying out is 2/3, so that even 12 generations is not enough to give an accurate estimate for this probability.

We now assume that at most two offspring can be produced. Then

$$h(z) = p_0 + p_1 z + p_2 z^2 .$$

In this simple case the condition $z = h(z)$ yields the equation

$$d = p_0 + p_1 d + p_2 d^2 ,$$

which is satisfied by $d = 1$ and $d = p_0/p_2$. Thus, in addition to the root $d = 1$ we have the second root $d = p_0/p_2$. The mean number $m$ of offspring produced by a single parent is

$$m = p_1 + 2p_2 = 1 - p_0 - p_2 + 2p_2 = 1 - p_0 + p_2 .$$

Thus, if $p_0 > p_2$, $m < 1$ and the second root is $> 1$. If $p_0 = p_2$, we have a double root $d = 1$. If $p_0 < p_2$, $m > 1$ and the second root $d$ is less than 1 and represents the probability that the process will die out.

Generation   Probability of dying out
 1           .2
 2           .312
 3           .385203
 4           .437116
 5           .475879
 6           .505878
 7           .529713
 8           .549035
 9           .564949
10           .578225
11           .589416
12           .598931

Table 10.1: Probability of dying out.

$p_0$ = .2092
$p_1$ = .2584
$p_2$ = .2360
$p_3$ = .1593
$p_4$ = .0828
$p_5$ = .0357
$p_6$ = .0133
$p_7$ = .0042
$p_8$ = .0011
$p_9$ = .0002
$p_{10}$ = .0000

Table 10.2: Distribution of number of female children.

Example 10.9 Keyfitz[6] compiled and analyzed data on the continuation of the female family line among Japanese women.
His estimates of the basic probability distribution for the number of female children born to Japanese women of ages 45-49 in 1960 are given in Table 10.2.

The expected number of girls in a family is then 1.837 so the probability $d$ of extinction is less than 1. If we run the program Branch, we can estimate that $d$ is in fact only about .324. □

[6] N. Keyfitz, Introduction to the Mathematics of Population, rev. ed. (Reading, PA: Addison Wesley, 1977).

Distribution of Offspring

So far we have considered only the first of the two problems raised by Galton, namely the probability of extinction. We now consider the second problem, that is, the distribution of the number $Z_n$ of offspring in the $n$th generation. The exact form of the distribution is not known except in very special cases. We shall see, however, that we can describe the limiting behavior of $Z_n$ as $n \to \infty$.

We first show that the generating function $h_n(z)$ of the distribution of $Z_n$ can be obtained from $h(z)$ for any branching process.

We recall that the value of the generating function at the value $z$ for any random variable $X$ can be written as

$$h(z) = E(z^X) = p_0 + p_1 z + p_2 z^2 + \cdots .$$

That is, $h(z)$ is the expected value of an experiment which has outcome $z^j$ with probability $p_j$.

Let $S_n = X_1 + X_2 + \cdots + X_n$ where each $X_j$ has the same integer-valued distribution $(p_j)$ with generating function $k(z) = p_0 + p_1 z + p_2 z^2 + \cdots$. Let $k_n(z)$ be the generating function of $S_n$. Then using one of the properties of ordinary generating functions discussed in Section 10.1, we have

$$k_n(z) = (k(z))^n ,$$

since the $X_j$'s are independent and all have the same distribution.

Consider now the branching process $Z_n$. Let $h_n(z)$ be the generating function of $Z_n$. Then

$$h_{n+1}(z) = E(z^{Z_{n+1}}) = \sum_k E(z^{Z_{n+1}} \mid Z_n = k)\, P(Z_n = k) .$$

If $Z_n = k$, then $Z_{n+1} = X_1 + X_2 + \cdots + X_k$ where $X_1, X_2, \ldots, X_k$ are independent random variables with common generating function $h(z)$. Thus

$$E(z^{Z_{n+1}} \mid Z_n = k) = E(z^{X_1 + X_2 + \cdots + X_k}) = (h(z))^k ,$$

and

$$h_{n+1}(z) = \sum_k (h(z))^k P(Z_n = k) .$$
But

$$h_n(z) = \sum_k P(Z_n = k) z^k .$$

Thus,

$$h_{n+1}(z) = h_n(h(z)) . \qquad (10.5)$$

If we differentiate Equation 10.5 and use the chain rule we have

$$h_{n+1}'(z) = h_n'(h(z))\, h'(z) .$$

Putting $z = 1$ and using the fact that $h(1) = 1$, $h'(1) = m$, and $h_n'(1) = m_n$ = the mean number of offspring in the $n$th generation, we have

$$m_{n+1} = m_n \cdot m .$$

Thus, $m_2 = m \cdot m = m^2$, $m_3 = m_2 \cdot m = m^3$, and in general

$$m_n = m^n .$$

Thus, for a branching process with $m > 1$, the mean number of offspring grows exponentially at a rate $m$.

Examples

Example 10.10 For the branching process of Example 10.8 we have

$$h(z) = 1/2 + (1/4)z + (1/4)z^2 ,$$
$$h_2(z) = h(h(z)) = 1/2 + (1/4)[1/2 + (1/4)z + (1/4)z^2] + (1/4)[1/2 + (1/4)z + (1/4)z^2]^2$$
$$= 11/16 + (1/8)z + (9/64)z^2 + (1/32)z^3 + (1/64)z^4 .$$

The probabilities for the number of offspring in the second generation agree with those obtained directly from the tree measure (see Figure 10.1). □

It is clear that even in the simple case of at most two offspring, we cannot easily carry out the calculation of $h_n(z)$ by this method. However, there is one special case in which this can be done.

Example 10.11 Assume that the probabilities $p_1, p_2, \ldots$ form a geometric series: $p_k = bc^{k-1}$, $k = 1, 2, \ldots$, with $0 < b \leq 1 - c$ and $0 < c < 1$. Then we have

$$p_0 = 1 - p_1 - p_2 - \cdots = 1 - b - bc - bc^2 - \cdots = 1 - \frac{b}{1 - c} .$$

The generating function $h(z)$ for this distribution is

$$h(z) = p_0 + p_1 z + p_2 z^2 + \cdots = 1 - \frac{b}{1 - c} + bz + bcz^2 + bc^2 z^3 + \cdots = 1 - \frac{b}{1 - c} + \frac{bz}{1 - cz} .$$

From this we find

$$h'(z) = \frac{b}{(1 - cz)^2} \qquad \text{and} \qquad m = h'(1) = \frac{b}{(1 - c)^2} .$$

We know that if $m \leq 1$ the process will surely die out and $d = 1$. To find the probability $d$ when $m > 1$ we must find a root $d < 1$ of the equation $z = h(z)$, or

$$z = 1 - \frac{b}{1 - c} + \frac{bz}{1 - cz} .$$

This leads us to a quadratic equation. We know that $z = 1$ is one solution. The other is found to be

$$d = \frac{1 - b - c}{c(1 - c)} .$$

It is easy to verify that $d < 1$ just when $m > 1$.

It is possible in this case to find the distribution of $Z_n$. This is done by first finding the generating function $h_n(z)$.[7] The result for $m \neq 1$ is:

$$h_n(z) = 1 - m^n \left[\frac{1 - d}{m^n - d}\right] + \frac{m^n \left[\dfrac{1 - d}{m^n - d}\right]^2 z}{1 - \left[\dfrac{m^n - 1}{m^n - d}\right] z} .$$

The coefficients of the powers of $z$ give the distribution for $Z_n$:

$$P(Z_n = 0) = 1 - m^n\, \frac{1 - d}{m^n - d} = \frac{d(m^n - 1)}{m^n - d}$$

and

$$P(Z_n = j) = m^n \left(\frac{1 - d}{m^n - d}\right)^2 \left(\frac{m^n - 1}{m^n - d}\right)^{j-1} ,$$

for $j \geq 1$. □

Example 10.12 Let us re-examine the Keyfitz data to see if a distribution of the type considered in Example 10.11 could reasonably be used as a model for this population.
We would have to estimate from the data the parameters $b$ and $c$ for the formula $p_k = bc^{k-1}$. Recall that

$$m = \frac{b}{(1 - c)^2} \qquad (10.6)$$

and the probability $d$ that the process dies out is

$$d = \frac{1 - b - c}{c(1 - c)} . \qquad (10.7)$$

Solving Equation 10.6 and 10.7 for $b$ and $c$ gives

$$c = \frac{m - 1}{m - d}$$

and

$$b = m \left(\frac{1 - d}{m - d}\right)^2 .$$

We shall use the value 1.837 for $m$ and .324 for $d$ that we found in the Keyfitz example. Using these values, we obtain $b = .3666$ and $c = .5533$. Note that $(1 - c)^2 < b < 1 - c$, as required. In Table 10.3 we give for comparison the probabilities $p_0$ through $p_{10}$ as calculated by the geometric distribution versus the empirical values.

[7] T. E. Harris, The Theory of Branching Processes (Berlin: Springer, 1963), p. 9.

 j    Data     Geometric Model
 0    .2092    .1816
 1    .2584    .3666
 2    .2360    .2028
 3    .1593    .1122
 4    .0828    .0621
 5    .0357    .0344
 6    .0133    .0190
 7    .0042    .0105
 8    .0011    .0058
 9    .0002    .0032
10    .0000    .0018

Table 10.3: Comparison of observed and expected frequencies.

The geometric model tends to favor the larger numbers of offspring but is similar enough to show that this modified geometric distribution might be appropriate to use for studies of this kind.

Recall that if $S_n = X_1 + X_2 + \cdots + X_n$ is the sum of independent random variables with the same distribution then the Law of Large Numbers states that $S_n/n$ converges to a constant, namely $E(X_1)$. It is natural to ask if there is a similar limiting theorem for branching processes.

Consider a branching process with $Z_n$ representing the number of offspring after $n$ generations. Then we have seen that the expected value of $Z_n$ is $m^n$. Thus we can scale the random variable $Z_n$ to have expected value 1 by considering the random variable

$$W_n = \frac{Z_n}{m^n} .$$

In the theory of branching processes it is proved that this random variable $W_n$ will tend to a limit as $n$ tends to infinity.
However, unlike the case of the Law of Large Numbers where this limit is a constant, for a branching process the limiting value of the random variables $W_n$ is itself a random variable.

Although we cannot prove this theorem here we can illustrate it by simulation. This requires a little care. When a branching process survives, the number of offspring is apt to get very large. If in a given generation there are 1000 offspring, the offspring of the next generation are the result of 1000 chance events, and it will take a while to simulate these 1000 experiments. However, since the final result is the sum of 1000 independent experiments we can use the Central Limit Theorem to replace these 1000 experiments by a single experiment with normal density having the appropriate mean and variance. The program BranchingSimulation carries out this process.

We have run this program for the Keyfitz example, carrying out 10 simulations and graphing the results in Figure 10.4.

The expected number of female offspring per female is 1.837, so that we are graphing the outcome for the random variables $W_n = Z_n/(1.837)^n$. For three of the simulations the process died out, which is consistent with the value $d = .3$ that we found for this example. For the other seven simulations the value of $W_n$ tends to a limiting value which is different for each simulation. □

Figure 10.4: Simulation of $Z_n/m^n$ for the Keyfitz example.

Example 10.13 We now examine the random variable $Z_n$ more closely for the case $m > 1$ (see Example 10.11). Fix a value $t > 0$; let $[tm^n]$ be the integer part of $tm^n$. Then

$$P(Z_n = [tm^n]) = m^n \left(\frac{1 - d}{m^n - d}\right)^2 \left(\frac{m^n - 1}{m^n - d}\right)^{[tm^n] - 1} = \frac{1}{m^n} \left(\frac{1 - d}{1 - d/m^n}\right)^2 \left(\frac{1 - 1/m^n}{1 - d/m^n}\right)^{tm^n + a} ,$$

where $|a| \leq 2$. Thus, as $n \to \infty$,

$$m^n P(Z_n = [tm^n]) \to (1 - d)^2\, \frac{e^{-t}}{e^{-td}} = (1 - d)^2 e^{-t(1 - d)} .$$

For $t = 0$,

$$P(Z_n = 0) \to d .$$
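The limit in Example 10.13 can be checked numerically from the closed form for $P(Z_n = j)$ in Example 10.11. The sketch below uses the Keyfitz values $m = 1.837$ and $d = .324$ from Example 10.12; the function names and the choices $n = 25$, $t = 1$ are illustrative assumptions, not part of the text.

```python
import math


def geometric_params(m, d):
    """Recover b and c of p_k = b c^(k-1) from the mean m and extinction probability d."""
    c = (m - 1.0) / (m - d)
    b = m * ((1.0 - d) / (m - d)) ** 2
    return b, c


def prob_zn(j, n, m, d):
    """P(Z_n = j) for the geometric offspring law of Example 10.11 (requires m != 1)."""
    mn = m ** n
    if j == 0:
        return d * (mn - 1.0) / (mn - d)
    return mn * ((1.0 - d) / (mn - d)) ** 2 * ((mn - 1.0) / (mn - d)) ** (j - 1)


def scaled_prob(t, n, m, d):
    """m^n * P(Z_n = [t m^n]), where [x] is the integer part of x."""
    mn = m ** n
    return mn * prob_zn(int(t * mn), n, m, d)


def limit_density(t, d):
    """Limiting value (1 - d)^2 * exp(-t (1 - d)) for t > 0."""
    return (1.0 - d) ** 2 * math.exp(-t * (1.0 - d))


if __name__ == "__main__":
    m, d = 1.837, 0.324
    print("b, c =", geometric_params(m, d))
    for n in (5, 10, 25):
        print(n, scaled_prob(1.0, n, m, d), limit_density(1.0, d))
```

With $n = 1$ the same function reproduces the offspring distribution itself, so `prob_zn(1, 1, m, d)` equals $b$ and `prob_zn(2, 1, m, d)` equals $bc$, which is a convenient consistency check on the formulas.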
We can compare this limit for $m^n P(Z_n = [tm^n])$ with the Central Limit Theorem for sums $S_n$ of integer-valued independent random variables (see Theorem 9.3), which states that if $t$ is an integer and $u = (t - n\mu)/\sqrt{\sigma^2 n}$, then as $n \to \infty$,

$$\sqrt{\sigma^2 n}\; P(S_n = u\sqrt{\sigma^2 n} + \mu n) \to \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} .$$

We see that the forms of these statements are quite similar. It is possible to prove a limit theorem for a general class of branching processes that states that under suitable hypotheses, as $n \to \infty$,

$$m^n P(Z_n = [tm^n]) \to k(t) ,$$

for $t > 0$, and

$$P(Z_n = 0) \to d .$$

However, unlike the Central Limit Theorem for sums of independent random variables, the function $k(t)$ will depend upon the basic distribution that determines the process. Its form is known for only a very few examples similar to the one we have considered here. □

Chain Letter Problem

Example 10.14 An interesting example of a branching process was suggested by Free Huizinga.[8] In 1978, a chain letter called the "Circle of Gold," believed to have started in California, found its way across the country to the theater district of New York. The chain required a participant to buy a letter containing a list of 12 names for 100 dollars. The buyer gives 50 dollars to the person from whom the letter was purchased and then sends 50 dollars to the person whose name is at the top of the list. The buyer then crosses off the name at the top of the list and adds her own name at the bottom in each letter before it is sold again.

Let us first assume that the buyer may sell the letter only to a single person. If you buy the letter you will want to compute your expected winnings. (We are ignoring here the fact that the passing on of chain letters through the mail is a federal offense with certain obvious resulting penalties.) Assume that each person involved has a probability $p$ of selling the letter.
Then you will receive 50 dollars with probability $p$ and another 50 dollars if the letter is sold to 12 people, since then your name would have risen to the top of the list. This occurs with probability $p^{12}$, and so your expected winnings are $-100 + 50p + 50p^{12}$. Thus the chain in this situation is a highly unfavorable game.

It would be more reasonable to allow each person involved to make a copy of the list and try to sell the letter to at least 2 other people. Then you would have a chance of recovering your 100 dollars on these sales, and if any of the letters is sold 12 times you will receive a bonus of 50 dollars for each of these cases. We can consider this as a branching process with 12 generations. The members of the first generation are the letters you sell. The second generation consists of the letters sold by members of the first generation, and so forth.

Let us assume that the probabilities that each individual sells letters to 0, 1, or 2 others are $p_0$, $p_1$, and $p_2$, respectively. Let $Z_1, Z_2, \ldots, Z_{12}$ be the number of letters in the first 12 generations of this branching process. Then your expected winnings are

$$50(E(Z_1) + E(Z_{12})) = 50m + 50m^{12} ,$$

where $m = p_1 + 2p_2$ is the expected number of letters you sold. Thus to be favorable we must have

$$50m + 50m^{12} > 100 ,$$

or

$$m + m^{12} > 2 .$$

But this will be true if and only if $m > 1$. We have seen that this will occur in the quadratic case if and only if $p_2 > p_0$. Let us assume for example that $p_0 = .2$, $p_1 = .5$, and $p_2 = .3$. Then $m = 1.1$ and the chain would be a favorable game. Your expected profit would be

$$50(1.1 + 1.1^{12}) - 100 \approx 112 .$$

The probability that you receive at least one payment from the 12th generation is $1 - d_{12}$. We find from our program Branch that $d_{12} = .599$. Thus, $1 - d_{12} = .401$ is the probability that you receive some bonus. The maximum that you could receive from the chain would be $50(2 + 2^{12}) = 204{,}900$ if everyone were to successfully sell two letters.

[8] Private communication.
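The figures quoted for this case can be reproduced with a few lines of Python. The text does not list the program Branch, so the following is only a minimal sketch of the iteration $d_n = h(d_{n-1})$ starting from $d_0 = 0$; the function names are hypothetical, and the offspring probabilities are the ones assumed above.

```python
def offspring_pgf(probs):
    """Return h(z) = p_0 + p_1 z + p_2 z^2 + ... for the list probs = [p_0, p_1, ...]."""
    def h(z):
        total = 0.0
        for p in reversed(probs):  # Horner evaluation of the polynomial
            total = total * z + p
        return total
    return h


def extinction_by_generation(probs, generations):
    """Return [d_1, ..., d_generations], using d_0 = 0 and d_n = h(d_{n-1})."""
    h = offspring_pgf(probs)
    d = 0.0
    values = []
    for _ in range(generations):
        d = h(d)
        values.append(d)
    return values


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.3]
    m = sum(j * p for j, p in enumerate(probs))   # mean number of letters sold
    d = extinction_by_generation(probs, 12)
    print("m =", m)
    print("expected profit =", 50 * (m + m ** 12) - 100)
    print("d_12 =", d[-1], " 1 - d_12 =", 1 - d[-1])
    print("maximum payout =", 50 * (2 + 2 ** 12))
```

Iterating for many more generations drives $d_n$ toward the smaller root $p_0/p_2 = 2/3$ of $z = h(z)$, in line with Theorem 10.2.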
Of course you cannot always expect to be lucky enough to collect this maximum. (What is the probability of this happening?)

To simulate this game, we need only simulate a branching process for 12 generations. Using a slightly modified version of our program BranchingSimulation we carried out twenty such simulations, giving the results shown in Table 10.4.

 Z1  Z2  Z3  Z4  Z5  Z6  Z7  Z8  Z9 Z10 Z11 Z12  Profit
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  1   1   2   3   2   3   2   1   2   3   3   6     250
  0   0   0   0   0   0   0   0   0   0   0   0    -100
  2   4   4   2   3   4   4   3   2   2   1   1      50
  1   2   3   5   4   3   3   3   5   8   6   6     250
  0   0   0   0   0   0   0   0   0   0   0   0    -100
  2   3   2   2   2   1   2   3   3   3   4   6     300
  1   2   1   1   1   1   2   1   0   0   0   0     -50
  0   0   0   0   0   0   0   0   0   0   0   0    -100
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  2   3   2   3   3   3   5   9  12  12  13  15     750
  1   1   1   0   0   0   0   0   0   0   0   0     -50
  1   2   2   3   3   0   0   0   0   0   0   0     -50
  1   1   1   1   2   2   3   4   4   6   4   5     200
  1   1   0   0   0   0   0   0   0   0   0   0     -50
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  1   1   2   3   3   4   2   3   3   3   3   2      50
  1   2   4   6   6   9  10  13  16  17  15  18     850
  1   0   0   0   0   0   0   0   0   0   0   0     -50

Table 10.4: Simulation of chain letter (finite distribution case).

Note that we were quite lucky on a few runs, but we came out ahead only a little less than half the time. The process died out by the twelfth generation in 12 out of the 20 experiments, in good agreement with the probability $d_{12} = .599$ that we calculated using the program Branch.

Let us modify the assumptions about our chain letter to let the buyer sell the letter to as many people as she can instead of to a maximum of two. We shall assume, in fact, that a person has a large number $N$ of acquaintances and a small probability $p$ of persuading any one of them to buy the letter. Then the distribution for the number of letters that she sells will be a binomial distribution with mean $m = Np$. Since $N$ is large and $p$ is small, we can assume that the probability $p_j$ that an individual sells the letter to $j$ people is given by the Poisson distribution

$$p_j = \frac{e^{-m} m^j}{j!} .$$

The generating function for the Poisson distribution is

$$h(z) = \sum_{j=0}^{\infty} \frac{e^{-m} m^j z^j}{j!} = e^{-m} \sum_{j=0}^{\infty} \frac{m^j z^j}{j!} = e^{-m} e^{mz} = e^{m(z-1)} .$$

The expected number of letters that an individual passes on is $m$, and again to be favorable we must have $m > 1$. Let us assume again that $m = 1.1$. Then we can find again the probability $1 - d_{12}$ of a bonus from Branch. The result is .232. Although the expected winnings are the same, the variance is larger in this case, and the buyer has a better chance for a reasonably large profit. We again carried out 20 simulations using the Poisson distribution with mean 1.1. The results are shown in Table 10.5.

 Z1  Z2  Z3  Z4  Z5  Z6  Z7  Z8  Z9 Z10 Z11 Z12  Profit
  1   2   6   7   7   8  11   9   7   6   6   5     200
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  1   1   1   0   0   0   0   0   0   0   0   0     -50
  0   0   0   0   0   0   0   0   0   0   0   0    -100
  1   1   1   1   1   1   2   4   9   7   9   7     300
  2   3   3   4   2   0   0   0   0   0   0   0       0
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  2   1   0   0   0   0   0   0   0   0   0   0       0
  3   3   4   7  11  17  14  11  11  10  16  25    1300
  0   0   0   0   0   0   0   0   0   0   0   0    -100
  1   2   2   1   1   3   1   0   0   0   0   0     -50
  0   0   0   0   0   0   0   0   0   0   0   0    -100
  2   3   1   0   0   0   0   0   0   0   0   0       0
  3   1   0   0   0   0   0   0   0   0   0   0      50
  1   0   0   0   0   0   0   0   0   0   0   0     -50
  3   4   4   7  10  11   9  11  12  14  13  10     550
  1   3   3   4   9   5   7   9   8   8   6   3     100
  1   0   4   6   6   9  10  13   0   0   0   0     -50
  1   0   0   0   0   0   0   0   0   0   0   0     -50

Table 10.5: Simulation of chain letter (Poisson case).

We note that, as before, we came out ahead less than half the time, but we also had one large profit. In only 6 of the 20 cases did we receive any profit. This is again in reasonable agreement with our calculation of a probability .232 for this happening. □

Exercises

1 Let $Z_1, Z_2, \ldots, Z_N$ describe a branching process in which each parent has $j$ offspring with probability $p_j$. Find the probability $d$ that the process eventually dies out if
(a) $p_0 = 1/2$, $p_1 = 1/4$, and $p_2 = 1/4$.
(b) $p_0 = 1/3$, $p_1 = 1/3$, and $p_2 = 1/3$.
(c) $p_0 = 1/3$, $p_1 = 0$, and $p_2 = 2/3$.
(d) $p_j = 1/2^{j+1}$, for $j = 0, 1, 2, \ldots$.
(e) $p_j = (1/3)(2/3)^j$, for $j = 0, 1, 2, \ldots$.
(f) $p_j = e^{-2} 2^j / j!$, for $j = 0, 1, 2, \ldots$ (estimate $d$ numerically).

2 Let $Z_1, Z_2, \ldots, Z_N$ describe a branching process in which each parent has $j$ offspring with probability $p_j$. Find the probability $d$ that the process dies out if
(a) $p_0 = 1/2$, $p_1 = p_2 = 0$, and $p_3 = 1/2$.
(b) $p_0 = p_1 = p_2 = p_3 = 1/4$.
(c) $p_0 = t$, $p_1 = 1 - 2t$, $p_2 = 0$, and $p_3 = t$, where $t < 1/2$.

3 In the chain letter problem (see Example 10.14) find your expected profit if
(a) $p_0 = 1/2$, $p_1 = 0$, and $p_2 = 1/2$.
(b) $p_0 = 1/6$, $p_1 = 1/2$, and $p_2 = 1/3$.
Show that if $p_0 > 1/2$, you cannot expect to make a profit.

4 Let $S_N = X_1 + X_2 + \cdots + X_N$, where the $X_i$'s are independent random variables with common distribution having generating function $f(z)$. Assume that $N$ is an integer valued random variable independent of all of the $X_j$ and having generating function $g(z)$. Show that the generating function for $S_N$ is $h(z) = g(f(z))$. Hint: Use the fact that

$$h(z) = E(z^{S_N}) = \sum_k E(z^{S_N} \mid N = k)\, P(N = k) .$$

5 We have seen that if the generating function for the offspring of a single parent is $f(z)$, then the generating function for the number of offspring after two generations is given by $h(z) = f(f(z))$. Explain how this follows from the result of Exercise 4.

6 Consider a queueing process such that in each minute either 1 or 0 customers arrive with probabilities $p$ or $q = 1 - p$, respectively. (The number $p$ is called the arrival rate.) When a customer starts service she finishes in the next minute with probability $r$. (The number $r$ is called the service rate.) Thus when a customer begins being served she will finish being served in $j$ minutes with probability $(1 - r)^{j-1} r$, for $j = 1, 2, 3, \ldots$.
(a) Find the generating function $f(z)$ for the number of customers who arrive in one minute and the generating function $g(z)$ for the length of time that a person spends in service once she begins service.
(b) Consider a customer branching process by considering the offspring of a customer to be the customers who arrive while she is being served. Using Exercise 4, show that the generating function for our customer branching process is $h(z) = g(f(z))$.
(c) If we start the branching process with the arrival of the first customer, then the length of time until the branching process dies out will be the busy period for the server. Find a condition in terms of the arrival rate and service rate that will assure that the server will ultimately have a time when he is not busy.

7 Let $N$ be the expected total number of offspring in a branching process. Let $m$ be the mean number of offspring of a single parent. Show that

$$N = 1 + \left(\sum_k p_k \cdot k\right) N = 1 + mN$$

and hence that $N$ is finite if and only if $m < 1$ and in that case $N = 1/(1 - m)$.

8 Consider a branching process such that the number of offspring of a parent is $j$ with probability $1/2^{j+1}$ for $j = 0, 1, 2, \ldots$.
(a) Using the results of Example 10.11 show that the probability that there are $j$ offspring in the $n$th generation is

$$P(Z_n = j) = \begin{cases} \dfrac{1}{n(n+1)} \left(\dfrac{n}{n+1}\right)^j , & \text{if } j \geq 1, \\[2ex] \dfrac{n}{n+1} , & \text{if } j = 0. \end{cases}$$

(b) Show that the probability that the process dies out exactly at the $n$th generation is $1/n(n + 1)$.
(c) Show that the expected lifetime is infinite even though $d = 1$.

10.3 Generating Functions for Continuous Densities

In the previous section, we introduced the concepts of moments and moment generating functions for discrete random variables. These concepts have natural analogues for continuous random variables, provided some care is taken in arguments involving convergence.

Moments

If $X$ is a continuous random variable defined on the probability space $\Omega$, with density function $f_X$, then we define the $n$th moment of $X$ by the formula

$$\mu_n = E(X^n) = \int_{-\infty}^{+\infty} x^n f_X(x)\, dx ,$$

provided the integral

$$E(|X|^n) = \int_{-\infty}^{+\infty} |x|^n f_X(x)\, dx$$

is finite.
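Then, just as in the discrete case, we see that $\mu_0 = 1$, $\mu_1 = \mu$, and $\mu_2 - \mu_1^2 = \sigma^2$.

Moment Generating Functions

Now we define the moment generating function $g(t)$ for $X$ by the formula

$$g(t) = \sum_{k=0}^{\infty} \frac{\mu_k t^k}{k!} = \sum_{k=0}^{\infty} \frac{E(X^k) t^k}{k!} = E(e^{tX}) = \int_{-\infty}^{+\infty} e^{tx} f_X(x)\, dx ,$$

provided this series converges. Then, as before, we have

$$\mu_n = g^{(n)}(0) .$$

Examples

Example 10.15 Let $X$ be a continuous random variable with range $[0, 1]$ and density function $f_X(x) = 1$ for $0 \leq x \leq 1$ (uniform density). Then

$$\mu_n = \int_0^1 x^n\, dx = \frac{1}{n + 1} ,$$

and

$$g(t) = \sum_{k=0}^{\infty} \frac{t^k}{(k + 1)!} = \frac{e^t - 1}{t} .$$

Here the series converges for all $t$. Alternatively, we have

$$g(t) = \int_{-\infty}^{+\infty} e^{tx} f_X(x)\, dx = \int_0^1 e^{tx}\, dx = \frac{e^t - 1}{t} .$$

Then (by L'Hôpital's rule)

$$\mu_0 = g(0) = \lim_{t \to 0} \frac{e^t - 1}{t} = 1 ,$$
$$\mu_1 = g'(0) = \lim_{t \to 0} \frac{te^t - e^t + 1}{t^2} = \frac12 ,$$
$$\mu_2 = g''(0) = \lim_{t \to 0} \frac{t^3 e^t - 2t^2 e^t + 2te^t - 2t}{t^4} = \frac13 .$$

In particular, we verify that $\mu = g'(0) = 1/2$ and

$$\sigma^2 = g''(0) - (g'(0))^2 = \frac13 - \frac14 = \frac{1}{12}$$

as before (see Example 6.25). □

Example 10.16 Let $X$ have range $[0, \infty)$ and density function $f_X(x) = \lambda e^{-\lambda x}$ (exponential density with parameter $\lambda$). In this case

$$\mu_n = \int_0^{\infty} x^n \lambda e^{-\lambda x}\, dx = \lambda (-1)^n \frac{d^n}{d\lambda^n} \int_0^{\infty} e^{-\lambda x}\, dx = \lambda (-1)^n \frac{d^n}{d\lambda^n} \left[\frac{1}{\lambda}\right] = \frac{n!}{\lambda^n} ,$$

and

$$g(t) = \sum_{k=0}^{\infty} \frac{\mu_k t^k}{k!} = \sum_{k=0}^{\infty} \left[\frac{t}{\lambda}\right]^k = \frac{\lambda}{\lambda - t} .$$

Here the series converges only for $|t| < \lambda$. Alternatively, we have

$$g(t) = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\, dx = \left. \frac{\lambda e^{(t - \lambda)x}}{t - \lambda} \right|_0^{\infty} = \frac{\lambda}{\lambda - t} .$$

Now we can verify directly that

$$\mu_n = g^{(n)}(0) = \left. \frac{\lambda\, n!}{(\lambda - t)^{n+1}} \right|_{t=0} = \frac{n!}{\lambda^n} .$$

□

Example 10.17 Let $X$ have range $(-\infty, +\infty)$ and density function

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

(normal density). In this case we have

$$\mu_n = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} x^n e^{-x^2/2}\, dx = \begin{cases} \dfrac{(2m)!}{2^m m!} , & \text{if } n = 2m, \\[1ex] 0 , & \text{if } n = 2m + 1. \end{cases}$$

(These moments are calculated by integrating once by parts to show that $\mu_n = (n - 1)\mu_{n-2}$, and observing that $\mu_0 = 1$ and $\mu_1 = 0$.) Hence,

$$g(t) = \sum_{n=0}^{\infty} \frac{\mu_n t^n}{n!} = \sum_{m=0}^{\infty} \frac{t^{2m}}{2^m m!} = e^{t^2/2} .$$

This series converges for all values of $t$. Again we can verify that $g^{(n)}(0) = \mu_n$.

Let $X$ be a normal random variable with parameters $\mu$ and $\sigma$.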
It is easy to show that the moment generating function of $X$ is given by
\[ e^{t\mu + (\sigma^2/2)t^2} . \]
Now suppose that $X$ and $Y$ are two independent normal random variables with parameters $\mu_1$, $\sigma_1$, and $\mu_2$, $\sigma_2$, respectively. Then, the product of the moment generating functions of $X$ and $Y$ is
\[ e^{t(\mu_1 + \mu_2) + ((\sigma_1^2 + \sigma_2^2)/2)t^2} . \]
This is the moment generating function for a normal random variable with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$. Thus, the sum of two independent normal random variables is again normal. (This was proved for the special case that both summands are standard normal in Example 7.5.) □

In general, the series defining $g(t)$ will not converge for all $t$. But in the important special case where $X$ is bounded (i.e., where the range of $X$ is contained in a finite interval), we can show that the series does converge for all $t$.

Theorem 10.3 Suppose $X$ is a continuous random variable with range contained in the interval $[-M, M]$. Then the series
\[ g(t) = \sum_{k=0}^{\infty} \frac{\mu_k t^k}{k!} \]
converges for all $t$ to an infinitely differentiable function $g(t)$, and $g^{(n)}(0) = \mu_n$.

Proof. We have
\[ \mu_k = \int_{-M}^{+M} x^k f_X(x)\,dx , \]
so
\[ |\mu_k| \le \int_{-M}^{+M} |x|^k f_X(x)\,dx \le M^k \int_{-M}^{+M} f_X(x)\,dx = M^k . \]
Hence, for all $N$ we have
\[ \sum_{k=0}^{N} \left| \frac{\mu_k t^k}{k!} \right| \le \sum_{k=0}^{N} \frac{(M|t|)^k}{k!} \le e^{M|t|} , \]
which shows that the power series converges for all $t$. We know that the sum of a convergent power series is always differentiable. □

Moment Problem

Theorem 10.4 If $X$ is a bounded random variable, then the moment generating function $g_X(t)$ of $X$ determines the density function $f_X(x)$ uniquely.

Sketch of the Proof. We know that
\[ g_X(t) = \sum_{k=0}^{\infty} \frac{\mu_k t^k}{k!} = \int_{-\infty}^{+\infty} e^{tx} f_X(x)\,dx . \]
If we replace $t$ by $i\tau$, where $\tau$ is real and $i = \sqrt{-1}$, then the series converges for all $\tau$, and we can define the function
\[ k_X(\tau) = g_X(i\tau) = \int_{-\infty}^{+\infty} e^{i\tau x} f_X(x)\,dx . \]
The function $k_X(\tau)$ is called the characteristic function of $X$, and is defined by the above equation even when the series for $g_X$ does not converge. This equation says that $k_X$ is the Fourier transform of $f_X$.
It is known that the Fourier transform has an inverse, given by the formula
\[ f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-i\tau x} k_X(\tau)\,d\tau , \]
suitably interpreted.9 Here we see that the characteristic function $k_X$, and hence the moment generating function $g_X$, determines the density function $f_X$ uniquely under our hypotheses. □

9 H. Dym and H. P. McKean, Fourier Series and Integrals (New York: Academic Press, 1972).

Sketch of the Proof of the Central Limit Theorem

With the above result in mind, we can now sketch a proof of the Central Limit Theorem for bounded continuous random variables (see Theorem 9.6). To this end, let $X$ be a continuous random variable with density function $f_X$, mean $\mu = 0$ and variance $\sigma^2 = 1$, and moment generating function $g(t)$ defined by its series for all $t$.

Let $X_1, X_2, \ldots, X_n$ be an independent trials process with each $X_i$ having density $f_X$, and let $S_n = X_1 + X_2 + \cdots + X_n$, and $S_n^* = (S_n - n\mu)/\sqrt{n\sigma^2} = S_n/\sqrt{n}$. Then each $X_i$ has moment generating function $g(t)$, and since the $X_i$ are independent, the sum $S_n$, just as in the discrete case (see Section 10.1), has moment generating function
\[ g_n(t) = (g(t))^n , \]
and the standardized sum $S_n^*$ has moment generating function
\[ g_n^*(t) = \left( g\!\left( \frac{t}{\sqrt{n}} \right) \right)^n . \]
We now show that, as $n \to \infty$, $g_n^*(t) \to e^{t^2/2}$, where $e^{t^2/2}$ is the moment generating function of the normal density $n(x) = (1/\sqrt{2\pi})\,e^{-x^2/2}$ (see Example 10.17).

To show this, we set $u(t) = \log g(t)$, and
\[ u_n^*(t) = \log g_n^*(t) = n \log g\!\left( \frac{t}{\sqrt{n}} \right) = n\,u\!\left( \frac{t}{\sqrt{n}} \right) , \]
and show that $u_n^*(t) \to t^2/2$ as $n \to \infty$. First we note that
\[ u(0) = \log g(0) = 0 , \]
\[ u'(0) = \frac{g'(0)}{g(0)} = \frac{\mu_1}{1} = 0 , \]
\[ u''(0) = \frac{g''(0)\,g(0) - (g'(0))^2}{(g(0))^2} = \frac{\mu_2 - \mu_1^2}{1} = \sigma^2 = 1 . \]
Now by using L'Hôpital's rule twice, we get
\[ \lim_{n \to \infty} u_n^*(t) = \lim_{s \to \infty} \frac{u(t/\sqrt{s})}{s^{-1}} = \lim_{s \to \infty} \frac{u'(t/\sqrt{s})\,t}{2 s^{-1/2}} = \lim_{s \to \infty} u''\!\left( \frac{t}{\sqrt{s}} \right) \frac{t^2}{2} = \sigma^2 \frac{t^2}{2} = \frac{t^2}{2} . \]
Hence, $g_n^*(t) \to e^{t^2/2}$ as $n \to \infty$.
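The limit just derived can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes, purely for illustration, that $X$ has the uniform density on $[-\sqrt{3}, \sqrt{3}]$, which is bounded with mean 0 and variance 1 and has moment generating function $g(t) = \sinh(\sqrt{3}\,t)/(\sqrt{3}\,t)$. The names g, g_star, and errors are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)

def g(t):
    """mgf of the uniform density on [-sqrt(3), sqrt(3)] (mean 0, variance 1).

    This particular density is an assumption chosen for illustration.
    """
    if t == 0.0:
        return 1.0
    return math.sinh(SQRT3 * t) / (SQRT3 * t)

def g_star(t, n):
    """mgf of the standardized sum S_n^* = S_n / sqrt(n), i.e. (g(t/sqrt(n)))^n."""
    return g(t / math.sqrt(n)) ** n

t = 1.0
limit = math.exp(t * t / 2.0)  # mgf of the standard normal density at t

# Distance between g_n^*(t) and its claimed limit for growing n.
errors = [abs(g_star(t, n) - limit) for n in (1, 10, 100, 1000, 10000)]
for n, err in zip((1, 10, 100, 1000, 10000), errors):
    print(n, err)
```

For this choice of density the printed distances shrink steadily as $n$ grows, which is what the L'Hôpital argument predicts. Any other bounded density with mean 0 and variance 1 could be substituted by replacing g.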
Now to complete the proof of the Central Limit Theorem, we must show that if $g_n^*(t) \to e^{t^2/2}$, then under our hypotheses the distribution functions $F_n^*(x)$ of the $S_n^*$ must converge to the distribution function $F_N^*(x)$ of the normal variable $N$; that is, that
\[ F_n^*(a) = P(S_n^* \le a) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2}\,dx , \]
and furthermore, that the density functions $f_n^*(x)$ of the $S_n^*$ must converge to the density function for $N$; that is, that
\[ f_n^*(x) \to \frac{1}{\sqrt{2\pi}} e^{-x^2/2} , \]
as $n \to \infty$.

Since the densities, and hence the distributions, of the $S_n^*$ are uniquely determined by their moment generating functions under our hypotheses, these conclusions are certainly plausible, but their proofs involve a detailed examination of characteristic functions and Fourier transforms, and we shall not attempt them here.

In the same way, we can prove the Central Limit Theorem for bounded discrete random variables with integer values (see Theorem 9.4). Let $X$ be a discrete random variable with density function $p(j)$, mean $\mu = 0$, variance $\sigma^2 = 1$, and moment generating function $g(t)$, and let $X_1, X_2, \ldots, X_n$ form an independent trials process with common density $p$. Let $S_n = X_1 + X_2 + \cdots + X_n$ and $S_n^* = S_n/\sqrt{n}$, with densities $p_n$ and $p_n^*$, and moment generating functions $g_n(t)$ and $g_n^*(t) = \left( g(t/\sqrt{n}) \right)^n$. Then we have
\[ g_n^*(t) \to e^{t^2/2} , \]
just as in the continuous case, and this implies in the same way that the distribution functions $F_n^*(x)$ converge to the normal distribution; that is, that
\[ F_n^*(a) = P(S_n^* \le a) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2}\,dx , \]
as $n \to \infty$.

The corresponding statement about the distribution functions $p_n^*$, however, requires a little extra care (see Theorem 9.3). The trouble arises because the distribution $p(x)$ is not defined for all $x$, but only for integer $x$. It follows that the distribution $p_n^*(x)$ is defined only for $x$ of the form $j/\sqrt{n}$, and these values change as $n$ changes.

We can fix this, however, by introducing the function $\bar{p}(x)$, defined by the formula $\bar{p}(x) = p(j)$, if $j - 1/2$ ...

Proof. The proof of this theorem is left as an exercise (Exercise 18).
□

We note that if we want to examine the behavior of the chain under the assumption that it starts in a certain state $s_i$, we simply choose $u$ to be the probability vector with $i$th entry equal to 1 and all other entries equal to 0.

Example 11.3 In the Land of Oz example (Example 11.1) let the initial probability vector $u$ equal $(1/3, 1/3, 1/3)$. Then we can calculate the distribution of the states after three days using Theorem 11.2 and our previous calculation of $P^3$. We obtain
\[ u^{(3)} = uP^3 = (1/3,\ 1/3,\ 1/3) \begin{pmatrix} .406 & .203 & .391 \\ .406 & .188 & .406 \\ .391 & .203 & .406 \end{pmatrix} = (.401,\ .198,\ .401) . \]
□

Examples

The following examples of Markov chains will be used throughout the chapter for exercises.

Example 11.4 The President of the United States tells person A his or her intention to run or not to run in the next election. Then A relays the news to B, who in turn relays the message to C, and so forth, always to some new person. We assume that there is a probability $a$ that a person will change the answer from yes to no when transmitting it to the next person and a probability $b$ that he or she will change it from no to yes. We choose as states the message, either yes or no. The transition matrix is then
\[ P = \begin{array}{c|cc} & \text{yes} & \text{no} \\ \hline \text{yes} & 1 - a & a \\ \text{no} & b & 1 - b \end{array} \]
The initial state represents the President's choice. □

Example 11.5 Each time a certain horse runs in a three-horse race, he has probability 1/2 of winning, 1/4 of coming in second, and 1/4 of coming in third, independent of the outcome of any previous race. We have an independent trials process, but it can also be considered from the point of view of Markov chain theory. The transition matrix is
\[ P = \begin{array}{c|ccc} & W & P & S \\ \hline W & .5 & .25 & .25 \\ P & .5 & .25 & .25 \\ S & .5 & .25 & .25 \end{array} \]
□

Example 11.6 In the Dark Ages, Harvard, Dartmouth, and Yale admitted only male students.
Assume that, at that time, 80 percent of the sons of Harvard men went to Harvard and the rest went to Yale, 40 percent of the sons of Yale men went to Yale, and the rest split evenly between Harvard and Dartmouth; and of the sons of Dartmouth men, 70 percent went to Dartmouth, 20 percent to Harvard, and 10 percent to Yale. We form a Markov chain with transition matrix
\[ P = \begin{array}{c|ccc} & H & Y & D \\ \hline H & .8 & .2 & 0 \\ Y & .3 & .4 & .3 \\ D & .2 & .1 & .7 \end{array} \]
□

Example 11.7 Modify Example 11.6 by assuming that the son of a Harvard man always went to Harvard. The transition matrix is now
\[ P = \begin{array}{c|ccc} & H & Y & D \\ \hline H & 1 & 0 & 0 \\ Y & .3 & .4 & .3 \\ D & .2 & .1 & .7 \end{array} \]
□

Example 11.8 (Ehrenfest Model) The following is a special case of a model, called the Ehrenfest model,3 that has been used to explain diffusion of gases. The general model will be discussed in detail in Section 11.5. We have two urns that, between them, contain four balls. At each step, one of the four balls is chosen at random and moved from the urn that it is in into the other urn. We choose, as states, the number of balls in the first urn. The transition matrix is then
\[ P = \begin{array}{c|ccccc} & 0 & 1 & 2 & 3 & 4 \\ \hline 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1/4 & 0 & 3/4 & 0 & 0 \\ 2 & 0 & 1/2 & 0 & 1/2 & 0 \\ 3 & 0 & 0 & 3/4 & 0 & 1/4 \\ 4 & 0 & 0 & 0 & 1 & 0 \end{array} \]
□

3 P. and T. Ehrenfest, "Über zwei bekannte Einwände gegen das Boltzmannsche H-Theorem," Physikalische Zeitschrift, vol. 8 (1907), pp. 311-314.

Example 11.9 (Gene Model) The simplest type of inheritance of traits in animals occurs when a trait is governed by a pair of genes, each of which may be of two types, say G and g. An individual may have a GG combination or Gg (which is genetically the same as gG) or gg. Very often the GG and Gg types are indistinguishable in appearance, and then we say that the G gene dominates the g gene. An individual is called dominant if he or she has GG genes, recessive if he or she has gg, and hybrid with a Gg mixture.
In the mating of two animals, the offspring inherits one gene of the pair from each parent, and the basic assumption of genetics is that these genes are selected at random, independently of each other. This assumption determines the probability of occurrence of each type of offspring. The offspring of two purely dominant parents must be dominant, of two recessive parents must be recessive, and of one dominant and one recessive parent must be hybrid.

In the mating of a dominant and a hybrid animal, each offspring must get a G gene from the former and has an equal chance of getting G or g from the latter. Hence there is an equal probability for getting a dominant or a hybrid offspring. Again, in the mating of a recessive and a hybrid, there is an even chance for getting either a recessive or a hybrid. In the mating of two hybrids, the offspring has an equal chance of getting G or g from each parent. Hence the probabilities are 1/4 for GG, 1/2 for Gg, and 1/4 for gg.

Consider a process of continued matings. We start with an individual of known genetic character and mate it with a hybrid. We assume that there is at least one offspring. An offspring is chosen at random and is mated with a hybrid and this process repeated through a number of generations. The genetic type of the chosen offspring in successive generations can be represented by a Markov chain. The states are dominant, hybrid, and recessive, and indicated by GG, Gg, and gg respectively.

The transition probabilities are
\[ P = \begin{array}{c|ccc} & GG & Gg & gg \\ \hline GG & .5 & .5 & 0 \\ Gg & .25 & .5 & .25 \\ gg & 0 & .5 & .5 \end{array} \]
□

Example 11.10 Modify Example 11.9 as follows: Instead of mating the oldest offspring with a hybrid, we mate it with a dominant individual. The transition matrix is
\[ P = \begin{array}{c|ccc} & GG & Gg & gg \\ \hline GG & 1 & 0 & 0 \\ Gg & .5 & .5 & 0 \\ gg & 0 & 1 & 0 \end{array} \]
□

Example 11.11 We start with two animals of opposite sex, mate them, select two of their offspring of opposite sex, and mate those, and so forth.
To simplify the example, we will assume that the trait under consideration is independent of sex.

Here a state is determined by a pair of animals. Hence, the states of our process will be: $s_1 = (GG, GG)$, $s_2 = (GG, Gg)$, $s_3 = (GG, gg)$, $s_4 = (Gg, Gg)$, $s_5 = (Gg, gg)$, and $s_6 = (gg, gg)$.

We illustrate the calculation of transition probabilities in terms of the state $s_2$. When the process is in this state, one parent has GG genes, the other Gg. Hence, the probability of a dominant offspring is 1/2. Then the probability of transition to $s_1$ (selection of two dominants) is 1/4, transition to $s_2$ is 1/2, and to $s_4$ is 1/4. The other states are treated the same way. The transition matrix of this chain is:
\[ P = \begin{array}{c|cccccc} & GG,GG & GG,Gg & GG,gg & Gg,Gg & Gg,gg & gg,gg \\ \hline GG,GG & 1.000 & .000 & .000 & .000 & .000 & .000 \\ GG,Gg & .250 & .500 & .000 & .250 & .000 & .000 \\ GG,gg & .000 & .000 & .000 & 1.000 & .000 & .000 \\ Gg,Gg & .062 & .250 & .125 & .250 & .250 & .062 \\ Gg,gg & .000 & .000 & .000 & .250 & .500 & .250 \\ gg,gg & .000 & .000 & .000 & .000 & .000 & 1.000 \end{array} \]
□

Example 11.12 (Stepping Stone Model) Our final example is another example that has been used in the study of genetics. It is called the stepping stone model.4 In this model we have an $n$-by-$n$ array of squares, and each square is initially any one of $k$ different colors. For each step, a square is chosen at random. This square then chooses one of its eight neighbors at random and assumes the color of that neighbor. To avoid boundary problems, we assume that if a square $S$ is on the left-hand boundary, say, but not at a corner, it is adjacent to the square $T$ on the right-hand boundary in the same row as $S$, and $S$ is also adjacent to the squares just above and below $T$. A similar assumption is made about squares on the upper and lower boundaries. The top left-hand corner square is adjacent to three obvious neighbors, namely the squares below it, to its right, and diagonally below and to the right.
It has five other neighbors, which are as follows: the other three corner squares, the square below the upper right-hand corner, and the square to the right of the bottom left-hand corner. The other three corners also have, in a similar way, eight neighbors. (These adjacencies are much easier to understand if one imagines making the array into a cylinder by gluing the top and bottom edge together, and then making the cylinder into a doughnut by gluing the two circular boundaries together.) With these adjacencies, each square in the array is adjacent to exactly eight other squares.

A state in this Markov chain is a description of the color of each square. For this Markov chain the number of states is $k^{n^2}$, which for even a small array of squares is enormous. This is an example of a Markov chain that is easy to simulate but difficult to analyze in terms of its transition matrix. The program SteppingStone simulates this chain. We have started with a random initial configuration of two colors with $n = 20$ and show the result after the process has run for some time in Figure 11.2.

4 S. Sawyer, "Results for The Stepping Stone Model for Migration in Population Genetics," Annals of Probability, vol. 4 (1979), pp. 699-728.

[Figure 11.1: Initial state of the stepping stone model.]

[Figure 11.2: State of the stepping stone model after 10,000 steps.]

This is an example of an absorbing Markov chain. This type of chain will be studied in Section 11.2. One of the theorems proved in that section, applied to the present example, implies that with probability 1, the stones will eventually all be the same color.
By watching the program run, you can see that territories are established and a battle develops to see which color survives. At any time the probability that a particular color will win out is equal to the proportion of the array of this color. You are asked to prove this in Exercise 11.2.32. □

Exercises

1 It is raining in the Land of Oz. Determine a tree and a tree measure for the next three days' weather. Find $w^{(1)}$, $w^{(2)}$, and $w^{(3)}$ and compare with the results obtained from $P$, $P^2$, and $P^3$.

2 In Example 11.4, let $a = 0$ and $b = 1/2$. Find $P$, $P^2$, and $P^3$. What would $P^n$ be? What happens to $P^n$ as $n$ tends to infinity? Interpret this result.

3 In Example 11.5, find $P$, $P^2$, and $P^3$. What is $P^n$?

4 For Example 11.6, find the probability that the grandson of a man from Harvard went to Harvard.

5 In Example 11.7, find the probability that the grandson of a man from Harvard went to Harvard.

6 In Example 11.9, assume that we start with a hybrid bred to a hybrid. Find $u^{(1)}$, $u^{(2)}$, and $u^{(3)}$. What would $u^{(n)}$ be?

7 Find the matrices $P^2$, $P^3$, $P^4$, and $P^n$ for the Markov chain determined by the transition matrix $P = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$. Do the same for the transition matrix $P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. Interpret what happens in each of these processes.

8 A certain calculating machine uses only the digits 0 and 1. It is supposed to transmit one of these digits through several stages. However, at every stage, there is a probability $p$ that the digit that enters this stage will be changed when it leaves and a probability $q = 1 - p$ that it won't. Form a Markov chain to represent the process of transmission by taking as states the digits 0 and 1. What is the matrix of transition probabilities?

9 For the Markov chain in Exercise 8, draw a tree and assign a tree measure assuming that the process begins in state 0 and moves through two stages of transmission. What is the probability that the machine, after two stages, produces the digit 0 (i.e., the correct digit)? What is the probability that the machine never changed the digit from 0? Now let $p = .1$.
Using the program MatrixPowers, compute the 100th power of the transition matrix. Interpret the entries of this matrix. Repeat this with $p = .2$. Why do the 100th powers appear to be the same?

10 Modify the program MatrixPowers so that it prints out the average $A_n$ of the powers $P^n$, for $n = 1$ to $N$. Try your program on the Land of Oz example and compare $A_n$ and $P^n$.

11 Assume that a man's profession can be classified as professional, skilled laborer, or unskilled laborer. Assume that, of the sons of professional men, 80 percent are professional, 10 percent are skilled laborers, and 10 percent are unskilled laborers. In the case of sons of skilled laborers, 60 percent are skilled laborers, 20 percent are professional, and 20 percent are unskilled. Finally, in the case of unskilled laborers, 50 percent of the sons are unskilled laborers, and 25 percent each are in the other two categories. Assume that every man has at least one son, and form a Markov chain by following the profession of a randomly chosen son of a given family through several generations. Set up the matrix of transition probabilities. Find the probability that a randomly chosen grandson of an unskilled laborer is a professional man.

12 In Exercise 11, we assumed that every man has a son. Assume instead that the probability that a man has at least one son is .8. Form a Markov chain with four states. If a man has a son, the probability that this son is in a particular profession is the same as in Exercise 11. If there is no son, the process moves to state four which represents families whose male line has died out. Find the matrix of transition probabilities and find the probability that a randomly chosen grandson of an unskilled laborer is a professional man.

13 Write a program to compute $u^{(n)}$ given $u$ and $P$. Use this program to compute $u^{(10)}$ for the Land of Oz example, with $u = (0, 1, 0)$, and with $u = (1/3, 1/3, 1/3)$.
14 Using the program MatrixPowers, find $P^1$ through $P^6$ for Examples 11.9 and 11.10. See if you can predict the long-range probability of finding the process in each of the states for these examples.

15 Write a program to simulate the outcomes of a Markov chain after $n$ steps, given the initial starting state and the transition matrix $P$ as data (see Example 11.12). Keep this program for use in later problems.

16 Modify the program of Exercise 15 so that it keeps track of the proportion of times in each state in $n$ steps. Run the modified program for different starting states for Example 11.1 and Example 11.8. Does the initial state affect the proportion of time spent in each of the states if $n$ is large?

17 Prove Theorem 11.1.

18 Prove Theorem 11.2.

19 Consider the following process. We have two coins, one of which is fair, and the other of which has heads on both sides. We give these two coins to our friend, who chooses one of them at random (each with probability 1/2). During the rest of the process, she uses only the coin that she chose. She now proceeds to toss the coin many times, reporting the results. We consider this process to consist solely of what she reports to us.
(a) Given that she reports a head on the $n$th toss, what is the probability that a head is thrown on the $(n + 1)$st toss?
(b) Consider this process as having two states, heads and tails. By computing the other three transition probabilities analogous to the one in part (a), write down a "transition matrix" for this process.
(c) Now assume that the process is in state "heads" on both the $(n - 1)$st and the $n$th toss. Find the probability that a head comes up on the $(n + 1)$st toss.
(d) Is this process a Markov chain?

11.2 Absorbing Markov Chains

The subject of Markov chains is best studied by considering special types of Markov chains. The first type that we shall study is called an absorbing Markov chain.
Definition 11.1 A state $s_i$ of a Markov chain is called absorbing if it is impossible to leave it (i.e., $p_{ii} = 1$). A Markov chain is absorbing if it has at least one absorbing state, and if from every state it is possible to go to an absorbing state (not necessarily in one step). □

Definition 11.2 In an absorbing Markov chain, a state which is not absorbing is called transient. □

Drunkard's Walk

Example 11.13 A man walks along a four-block stretch of Park Avenue (see Figure 11.3). If he is at corner 1, 2, or 3, then he walks to the left or right with equal probability. He continues until he reaches corner 4, which is a bar, or corner 0, which is his home. If he reaches either home or the bar, he stays there.

[Figure 11.3: Drunkard's walk.]

We form a Markov chain with states 0, 1, 2, 3, and 4. States 0 and 4 are absorbing states. The transition matrix is then
\[ P = \begin{array}{c|ccccc} & 0 & 1 & 2 & 3 & 4 \\ \hline 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1/2 & 0 & 1/2 & 0 & 0 \\ 2 & 0 & 1/2 & 0 & 1/2 & 0 \\ 3 & 0 & 0 & 1/2 & 0 & 1/2 \\ 4 & 0 & 0 & 0 & 0 & 1 \end{array} \]
The states 1, 2, and 3 are transient states, and from any of these it is possible to reach the absorbing states 0 and 4. Hence the chain is an absorbing chain. When a process reaches an absorbing state, we shall say that it is absorbed. □

The most obvious question that can be asked about such a chain is: What is the probability that the process will eventually reach an absorbing state? Other interesting questions include: (a) What is the probability that the process will end up in a given absorbing state? (b) On the average, how long will it take for the process to be absorbed? (c) On the average, how many times will the process be in each transient state? The answers to all these questions depend, in general, on the state from which the process starts as well as the transition probabilities.

Canonical Form

Consider an arbitrary absorbing Markov chain. Renumber the states so that the transient states come first. If there are $r$ absorbing states and $t$ transient states, the transition matrix will have the following canonical form
\[ P = \begin{array}{c|cc} & \text{TR.} & \text{ABS.} \\ \hline \text{TR.} & Q & R \\ \text{ABS.} & 0 & I \end{array} \]
Here $I$ is an $r$-by-$r$ identity matrix, $0$ is an $r$-by-$t$ zero matrix, $R$ is a nonzero $t$-by-$r$ matrix, and $Q$ is a $t$-by-$t$ matrix. The first $t$ states are transient and the last $r$ states are absorbing.

In Section 11.1, we saw that the entry $p_{ij}^{(n)}$ of the matrix $P^n$ is the probability of being in the state $s_j$ after $n$ steps, when the chain is started in state $s_i$. A standard matrix algebra argument shows that $P^n$ is of the form
\[ P^n = \begin{array}{c|cc} & \text{TR.} & \text{ABS.} \\ \hline \text{TR.} & Q^n & * \\ \text{ABS.} & 0 & I \end{array} \]
where the asterisk $*$ stands for the $t$-by-$r$ matrix in the upper right-hand corner of $P^n$. (This submatrix can be written in terms of $Q$ and $R$, but the expression is complicated and is not needed at this time.) The form of $P^n$ shows that the entries of $Q^n$ give the probabilities for being in each of the transient states after $n$ steps for each possible transient starting state. For our first theorem we prove that the probability of being in the transient states after $n$ steps approaches zero. Thus every entry of $Q^n$ must approach zero as $n$ approaches infinity (i.e., $Q^n \to 0$).

Probability of Absorption

Theorem 11.3 In an absorbing Markov chain, the probability that the process will be absorbed is 1 (i.e., $Q^n \to 0$ as $n \to \infty$).

Proof. From each nonabsorbing state $s_j$ it is possible to reach an absorbing state. Let $m_j$ be the minimum number of steps required to reach an absorbing state, starting from $s_j$. Let $p_j$ be the probability that, starting from $s_j$, the process will not reach an absorbing state in $m_j$ steps. Then $p_j < 1$. Let $m$ be the largest of the $m_j$ and let $p$ be the largest of $p_j$. The probability of not being absorbed in $m$ steps is less than or equal to $p$, in $2m$ steps less than or equal to $p^2$, etc. Since $p < 1$ these probabilities tend to 0. Since the probability of not being absorbed in $n$ steps is monotone decreasing, these probabilities also tend to 0, hence $\lim_{n \to \infty} Q^n = 0$. □
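The bound in this proof can be watched at work on the drunkard's walk of Example 11.13. The sketch below is an illustration rather than one of the book's programs: it reads the block $Q$ for the transient states 1, 2, 3 off the transition matrix of that example and computes the row sums of $Q^n$, which are the probabilities of not yet being absorbed after $n$ steps. The helper names matmul, matrix_power, and prob_not_absorbed are hypothetical. For this chain $m = 2$ and $p = 1/2$, so the proof bounds the probability of surviving $2k$ steps by $(1/2)^k$.

```python
from fractions import Fraction as F

# Transient-to-transient block Q for the drunkard's walk of Example 11.13
# (rows and columns are the transient states 1, 2, 3).
Q = [
    [F(0), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1, 2)],
    [F(0), F(1, 2), F(0)],
]

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def matrix_power(A, n):
    """A raised to the n-th power by repeated multiplication (n >= 0)."""
    size = len(A)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, A)
    return result

def prob_not_absorbed(n):
    """Row sums of Q^n: chance of still being in a transient state after n steps."""
    return [sum(row) for row in matrix_power(Q, n)]

for n in (1, 2, 4, 10, 20):
    print(n, [str(x) for x in prob_not_absorbed(n)])
```

Exact fractions are used so that the decay is visible without rounding. In this example the row sums of $Q^{2k}$ equal $(1/2)^k$ for every starting state, so the bound from the proof is attained.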
The Fundamental Matrix

Theorem 11.4 For an absorbing Markov chain the matrix $I - Q$ has an inverse $N$ and $N = I + Q + Q^2 + \cdots$. The $ij$-entry $n_{ij}$ of the matrix $N$ is the expected number of times the chain is in state $s_j$, given that it starts in state $s_i$. The initial state is counted if $i = j$.

Proof. Let $(I - Q)x = 0$; that is $x = Qx$. Then, iterating this we see that $x = Q^n x$. Since $Q^n \to 0$, we have $Q^n x \to 0$, so $x = 0$. Thus $(I - Q)^{-1} = N$ exists. Note next that
\[ (I - Q)(I + Q + Q^2 + \cdots + Q^n) = I - Q^{n+1} . \]
Thus multiplying both sides by $N$ gives
\[ I + Q + Q^2 + \cdots + Q^n = N(I - Q^{n+1}) . \]
Letting $n$ tend to infinity we have
\[ N = I + Q + Q^2 + \cdots . \]
Let $s_i$ and $s_j$ be two transient states, and assume throughout the remainder of the proof that $i$ and $j$ are fixed. Let $X^{(k)}$ be a random variable which equals 1 if the chain is in state $s_j$ after $k$ steps, and equals 0 otherwise. For each $k$, this random variable depends upon both $i$ and $j$; we choose not to explicitly show this dependence in the interest of clarity. We have
\[ P(X^{(k)} = 1) = q_{ij}^{(k)} , \]
and
\[ P(X^{(k)} = 0) = 1 - q_{ij}^{(k)} , \]
where $q_{ij}^{(k)}$ is the $ij$th entry of $Q^k$. These equations hold for $k = 0$ since $Q^0 = I$. Therefore, since $X^{(k)}$ is a 0-1 random variable, $E(X^{(k)}) = q_{ij}^{(k)}$.

The expected number of times the chain is in state $s_j$ in the first $n$ steps, given that it starts in state $s_i$, is clearly
\[ E\bigl(X^{(0)} + X^{(1)} + \cdots + X^{(n)}\bigr) = q_{ij}^{(0)} + q_{ij}^{(1)} + \cdots + q_{ij}^{(n)} . \]
Letting $n$ tend to infinity we have
\[ E\bigl(X^{(0)} + X^{(1)} + \cdots\bigr) = q_{ij}^{(0)} + q_{ij}^{(1)} + \cdots = n_{ij} . \]
□

Definition 11.3 For an absorbing Markov chain $P$, the matrix $N = (I - Q)^{-1}$ is called the fundamental matrix for $P$. The entry $n_{ij}$ of $N$ gives the expected number of times that the process is in the transient state $s_j$ if it is started in the transient state $s_i$. □
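Both descriptions of $N$ in Theorem 11.4, as the inverse of $I - Q$ and as the limit of the partial sums $I + Q + \cdots + Q^n$, can be compared directly. The following sketch is an illustration under stated assumptions, not the book's AbsorbingChain program: it uses the block $Q$ of the drunkard's walk (transient states 1, 2, 3 of Example 11.13), inverts $I - Q$ by Gauss-Jordan elimination over exact fractions, and accumulates partial sums of the series. The helpers identity, matmul, inverse, and partial_sum are hypothetical names.

```python
from fractions import Fraction as F

def identity(n):
    """n-by-n identity matrix with exact Fraction entries."""
    return [[F(int(i == j)) for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inverse(A):
    """Inverse of a nonsingular square matrix by Gauss-Jordan elimination."""
    n = len(A)
    # Augment A with the identity and reduce the left half to the identity.
    M = [list(row) + e for row, e in zip(A, identity(n))]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [x / pv for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

# Block Q for the transient states 1, 2, 3 of the drunkard's walk.
Q = [
    [F(0), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1, 2)],
    [F(0), F(1, 2), F(0)],
]
I3 = identity(3)

# Fundamental matrix N = (I - Q)^{-1}.
N = inverse([[I3[i][j] - Q[i][j] for j in range(3)] for i in range(3)])

def partial_sum(n):
    """I + Q + Q^2 + ... + Q^n."""
    total, term = identity(3), identity(3)
    for _ in range(n):
        term = matmul(term, Q)
        total = [[total[i][j] + term[i][j] for j in range(3)] for i in range(3)]
    return total

for row in N:
    print([str(x) for x in row])
```

Comparing partial_sum(n) with N for increasing n shows the series closing in on the inverse, with the gap given by $N Q^{n+1}$ as in the proof.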
Example 11.14 (Example 11.13 continued) In the Drunkard's Walk example, the transition matrix in canonical form is
\[ P = \begin{array}{c|ccc|cc} & 1 & 2 & 3 & 0 & 4 \\ \hline 1 & 0 & 1/2 & 0 & 1/2 & 0 \\ 2 & 1/2 & 0 & 1/2 & 0 & 0 \\ 3 & 0 & 1/2 & 0 & 0 & 1/2 \\ \hline 0 & 0 & 0 & 0 & 1 & 0 \\ 4 & 0 & 0 & 0 & 0 & 1 \end{array} \]
From this we see that the matrix $Q$ is
\[ Q = \begin{pmatrix} 0 & 1/2 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1/2 & 0 \end{pmatrix} , \]
and
\[ I - Q = \begin{pmatrix} 1 & -1/2 & 0 \\ -1/2 & 1 & -1/2 \\ 0 & -1/2 & 1 \end{pmatrix} . \]
Computing $(I - Q)^{-1}$, we find
\[ N = (I - Q)^{-1} = \begin{array}{c|ccc} & 1 & 2 & 3 \\ \hline 1 & 3/2 & 1 & 1/2 \\ 2 & 1 & 2 & 1 \\ 3 & 1/2 & 1 & 3/2 \end{array} \]
From the middle row of $N$, we see that if we start in state 2, then the expected number of times in states 1, 2, and 3 before being absorbed are 1, 2, and 1. □

Time to Absorption

We now consider the question: Given that the chain starts in state $s_i$, what is the expected number of steps before the chain is absorbed? The answer is given in the next theorem.

Theorem 11.5 Let $t_i$ be the expected number of steps before the chain is absorbed, given that the chain starts in state $s_i$, and let $t$ be the column vector whose $i$th entry is $t_i$. Then
\[ t = Nc , \]
where $c$ is a column vector all of whose entries are 1.

Proof. If we add all the entries in the $i$th row of $N$, we will have the expected number of times in any of the transient states for a given starting state $s_i$, that is, the expected time required before being absorbed. Thus, $t_i$ is the sum of the entries in the $i$th row of $N$. If we write this statement in matrix form, we obtain the theorem. □

Absorption Probabilities

Theorem 11.6 Let $b_{ij}$ be the probability that an absorbing chain will be absorbed in the absorbing state $s_j$ if it starts in the transient state $s_i$. Let $B$ be the matrix with entries $b_{ij}$. Then $B$ is a $t$-by-$r$ matrix, and
\[ B = NR , \]
where $N$ is the fundamental matrix and $R$ is as in the canonical form.

Proof. We have
\[ b_{ij} = \sum_n \sum_k q_{ik}^{(n)} r_{kj} = \sum_k \sum_n q_{ik}^{(n)} r_{kj} = \sum_k n_{ik} r_{kj} = (NR)_{ij} . \]
This completes the proof. □

Another proof of this is given in Exercise 34.

Example 11.15 (Example 11.14 continued) In the Drunkard's Walk example, we found that
\[ N = \begin{array}{c|ccc} & 1 & 2 & 3 \\ \hline 1 & 3/2 & 1 & 1/2 \\ 2 & 1 & 2 & 1 \\ 3 & 1/2 & 1 & 3/2 \end{array} \]
Hence,
\[ t = Nc = \begin{pmatrix} 3/2 & 1 & 1/2 \\ 1 & 2 & 1 \\ 1/2 & 1 & 3/2 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 4 \\ 3 \end{pmatrix} . \]
Thus, starting in states 1, 2, and 3, the expected times to absorption are 3, 4, and 3, respectively.

From the canonical form,
\[ R = \begin{array}{c|cc} & 0 & 4 \\ \hline 1 & 1/2 & 0 \\ 2 & 0 & 0 \\ 3 & 0 & 1/2 \end{array} \]
Hence,
\[ B = NR = \begin{pmatrix} 3/2 & 1 & 1/2 \\ 1 & 2 & 1 \\ 1/2 & 1 & 3/2 \end{pmatrix} \cdot \begin{pmatrix} 1/2 & 0 \\ 0 & 0 \\ 0 & 1/2 \end{pmatrix} = \begin{array}{c|cc} & 0 & 4 \\ \hline 1 & 3/4 & 1/4 \\ 2 & 1/2 & 1/2 \\ 3 & 1/4 & 3/4 \end{array} \]
Here the first row tells us that, starting from state 1, there is probability 3/4 of absorption in state 0 and 1/4 of absorption in state 4. □

Computation

The fact that we have been able to obtain these three descriptive quantities in matrix form makes it very easy to write a computer program that determines these quantities for a given absorbing chain matrix.

The program AbsorbingChain calculates the basic descriptive quantities of an absorbing Markov chain.

We have run the program AbsorbingChain for the example of the drunkard's walk (Example 11.13) with 5 blocks. The results are as follows:
\[ Q = \begin{array}{c|cccc} & 1 & 2 & 3 & 4 \\ \hline 1 & .00 & .50 & .00 & .00 \\ 2 & .50 & .00 & .50 & .00 \\ 3 & .00 & .50 & .00 & .50 \\ 4 & .00 & .00 & .50 & .00 \end{array} \]
\[ R = \begin{array}{c|cc} & 0 & 5 \\ \hline 1 & .50 & .00 \\ 2 & .00 & .00 \\ 3 & .00 & .00 \\ 4 & .00 & .50 \end{array} \]
\[ N = \begin{array}{c|cccc} & 1 & 2 & 3 & 4 \\ \hline 1 & 1.60 & 1.20 & .80 & .40 \\ 2 & 1.20 & 2.40 & 1.60 & .80 \\ 3 & .80 & 1.60 & 2.40 & 1.20 \\ 4 & .40 & .80 & 1.20 & 1.60 \end{array} \]
\[ t = \begin{array}{c|c} 1 & 4.00 \\ 2 & 6.00 \\ 3 & 6.00 \\ 4 & 4.00 \end{array} \]
\[ B = \begin{array}{c|cc} & 0 & 5 \\ \hline 1 & .80 & .20 \\ 2 & .60 & .40 \\ 3 & .40 & .60 \\ 4 & .20 & .80 \end{array} \]
Note that the probability of reaching the bar before reaching home, starting at $x$, is $x/5$ (i.e., proportional to the distance of home from the starting point). (See Exercise 24.)

Exercises

1 In Example 11.4, for what values of $a$ and $b$ do we obtain an absorbing Markov chain?

2 Show that Example 11.7 is an absorbing Markov chain.

3 Which of the genetics examples (Examples 11.9, 11.10, and 11.11) are absorbing?

4 Find the fundamental matrix $N$ for Example 11.10.

5 For Example 11.11, verify that the following matrix is the inverse of $I - Q$ and hence is the fundamental matrix $N$.
\[ N = \begin{pmatrix} 8/3 & 1/6 & 4/3 & 2/3 \\ 4/3 & 4/3 & 8/3 & 4/3 \\ 4/3 & 1/3 & 8/3 & 4/3 \\ 2/3 & 1/6 & 4/3 & 8/3 \end{pmatrix} \]
Find $Nc$ and $NR$. Interpret the results.
6 In the Land of Oz example (Example 11.1), change the transition matrix by making R an absorbing state. This gives
\[ P = \begin{array}{c|ccc} & R & N & S \\ \hline R & 1 & 0 & 0 \\ N & 1/2 & 0 & 1/2 \\ S & 1/4 & 1/4 & 1/2 \end{array} \]
Find the fundamental matrix $N$, and also $Nc$ and $NR$. Interpret the results.

7 In Example 11.8, make states 0 and 4 into absorbing states. Find the fundamental matrix $N$, and also $Nc$ and $NR$, for the resulting absorbing chain. Interpret the results.

8 In Example 11.13 (Drunkard's Walk) of this section, assume that the probability of a step to the right is 2/3, and a step to the left is 1/3. Find $N$, $Nc$, and $NR$. Compare these with the results of Example 11.15.

9 A process moves on the integers 1, 2, 3, 4, and 5. It starts at 1 and, on each successive step, moves to an integer greater than its present position, moving with equal probability to each of the remaining larger integers. State five is an absorbing state. Find the expected number of steps to reach state five.

10 Using the result of Exercise 9, make a conjecture for the form of the fundamental matrix if the process moves as in that exercise, except that it now moves on the integers from 1 to $n$. Test your conjecture for several different values of $n$. Can you conjecture an estimate for the expected number of steps to reach state $n$, for large $n$? (See Exercise 11 for a method of determining this expected number of steps.)

*11 Let $b_k$ denote the expected number of steps to reach $n$ from $n - k$, in the process described in Exercise 9.
(a) Define $b_0 = 0$. Show that for $k > 0$, we have
\[ b_k = 1 + \frac{1}{k}\bigl(b_{k-1} + b_{k-2} + \cdots + b_0\bigr) . \]
(b) Let
\[ f(x) = b_0 + b_1 x + b_2 x^2 + \cdots . \]
Using the recursion in part (a), show that $f(x)$ satisfies the differential equation
\[ (1 - x)^2 y' - (1 - x) y - 1 = 0 . \]
(c) Show that the general solution of the differential equation in part (b) is
\[ y = \frac{-\log(1 - x)}{1 - x} + \frac{c}{1 - x} , \]
where $c$ is a constant.
(d) Use part (c) to show that
\[ b_k = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{k} . \]

12 Three tanks fight a three-way duel.
Tank A has probability 1/2 of destroying the tank at which it fires, tank B has probability 1/3 of destroying the tank at which it fires, and tank C has probability 1/6 of destroying the tank at which it fires. The tanks fire together and each tank fires at the strongest opponent not yet destroyed. Form a Markov chain by taking as states the subsets of the set of tanks. Find N, Nc, and NR, and interpret your results. Hint: Take as states ABC, AC, BC, A, B, C, and none, indicating the tanks that could survive starting in state ABC. You can omit AB because this state cannot be reached from ABC.

13 Smith is in jail and has 3 dollars; he can get out on bail if he has 8 dollars. A guard agrees to make a series of bets with him. If Smith bets A dollars, he wins A dollars with probability .4 and loses A dollars with probability .6. Find the probability that he wins 8 dollars before losing all of his money if

(a) he bets 1 dollar each time (timid strategy).

(b) he bets, each time, as much as possible but not more than necessary to bring his fortune up to 8 dollars (bold strategy).

(c) Which strategy gives Smith the better chance of getting out of jail?

14 With the situation in Exercise 13, consider the strategy such that for $i < 4$, Smith bets $\min(i, 4 - i)$, and for $i \ge 4$, he bets according to the bold strategy, where i is his current fortune. Find the probability that he gets out of jail using this strategy. How does this probability compare with that obtained for the bold strategy?

15 Consider the game of tennis when deuce is reached. If a player wins the next point, he has advantage. On the following point, he either wins the game or the game returns to deuce. Assume that for any point, player A has probability .6 of winning the point and player B has probability .4 of winning the point.

(a) Set this up as a Markov chain with state 1: A wins; 2: B wins; 3: advantage A; 4: deuce; 5: advantage B.

(b) Find the absorption probabilities.
(c) At deuce, find the expected duration of the game and the probability that B will win.

Exercises 16 and 17 concern the inheritance of color-blindness, which is a sex-linked characteristic. There is a pair of genes, g and G, of which the former tends to produce color-blindness, the latter normal vision. The G gene is dominant. But a man has only one gene, and if this is g, he is color-blind. A man inherits one of his mother's two genes, while a woman inherits one gene from each parent. Thus a man may be of type G or g, while a woman may be type GG or Gg or gg. We will study a process of inbreeding similar to that of Example 11.11 by constructing a Markov chain.

16 List the states of the chain. Hint: There are six. Compute the transition probabilities. Find the fundamental matrix N, Nc, and NR.

17 Show that in both Example 11.11 and the example just given, the probability of absorption in a state having genes of a particular type is equal to the proportion of genes of that type in the starting state. Show that this can be explained by the fact that a game in which your fortune is the number of genes of a particular type in the state of the Markov chain is a fair game.$^{5}$

18 Assume that a student going to a certain four-year medical school in northern New England has, each year, a probability q of flunking out, a probability r of having to repeat the year, and a probability p of moving on to the next year (in the fourth year, moving on means graduating).

(a) Form a transition matrix for this process taking as states F, 1, 2, 3, 4, and G where F stands for flunking out and G for graduating, and the other states represent the year of study.

(b) For the case q = .1, r = .2, and p = .7 find the time a beginning student can expect to be in the second year. How long should this student expect to be in medical school?

(c) Find the probability that this beginning student will graduate.

19 (E.
Brown$^{6}$) Mary and John are playing the following game: They have a three-card deck marked with the numbers 1, 2, and 3 and a spinner with the numbers 1, 2, and 3 on it. The game begins by dealing the cards out so that the dealer gets one card and the other person gets two. A move in the game consists of a spin of the spinner. The person having the card with the number that comes up on the spinner hands that card to the other person. The game ends when someone has all the cards.

(a) Set up the transition matrix for this absorbing Markov chain, where the states correspond to the number of cards that Mary has.

(b) Find the fundamental matrix.

(c) On the average, how many moves will the game last?

(d) If Mary deals, what is the probability that John will win the game?

20 Assume that an experiment has m equally probable outcomes. Show that the expected number of independent trials before the first occurrence of k consecutive occurrences of one of these outcomes is $(m^k - 1)/(m - 1)$. Hint: Form an absorbing Markov chain with states 1, 2, ..., k with state i representing the length of the current run. The expected time until a run of k is 1 more than the expected time until absorption for the chain started in state 1. It has been found that, in the decimal expansion of $\pi$, starting with the 24,658,601st digit, there is a run of nine 7's. What would your result say about the expected number of digits necessary to find such a run if the digits are produced randomly?

$^{5}$ H. Gonshor, "An Application of Random Walk to a Problem in Population Genetics," American Math Monthly, vol. 94 (1987), pp. 668-671.
$^{6}$ Private communication.

21 (Roberts$^{7}$) A city is divided into 3 areas 1, 2, and 3. It is estimated that amounts $u_1$, $u_2$, and $u_3$ of pollution are emitted each day from these three areas. A fraction $q_{ij}$ of the pollution from region i ends up the next day at region j. A fraction $q_i = 1 - \sum_j q_{ij} > 0$ goes into the atmosphere and escapes.
Let $w_i^{(n)}$ be the amount of pollution in area i after n days.

(a) Show that $w^{(n)} = u + uQ + \cdots + uQ^{n-1}$.

(b) Show that $w^{(n)} \to w$, and show how to compute w from u.

(c) The government wants to limit pollution levels to a prescribed level by prescribing w. Show how to determine the levels of pollution u which would result in a prescribed limiting value w.

22 In the Leontief economic model,$^{8}$ there are n industries 1, 2, ..., n. The ith industry requires an amount $0 \le q_{ij} \le 1$ of goods (in dollar value) from company j to produce 1 dollar's worth of goods. The outside demand on the industries, in dollar value, is given by the vector $d = (d_1, d_2, \ldots, d_n)$. Let Q be the matrix with entries $q_{ij}$.

(a) Show that if the industries produce total amounts given by the vector $x = (x_1, x_2, \ldots, x_n)$ then the amounts of goods of each type that the industries will need just to meet their internal demands is given by the vector $xQ$.

(b) Show that in order to meet the outside demand d and the internal demands the industries must produce total amounts given by a vector $x = (x_1, x_2, \ldots, x_n)$ which satisfies the equation $x = xQ + d$.

(c) Show that if Q is the Q-matrix for an absorbing Markov chain, then it is possible to meet any outside demand d.

(d) Assume that the row sums of Q are less than or equal to 1. Give an economic interpretation of this condition. Form a Markov chain by taking the states to be the industries and the transition probabilities to be the $q_{ij}$. Add one absorbing state 0. Define
\[
q_{i0} = 1 - \sum_j q_{ij}.
\]
Show that this chain will be absorbing if every company is either making a profit or ultimately depends upon a profit-making company.

(e) Define $xc$ to be the gross national product. Find an expression for the gross national product in terms of the demand vector d and the vector t giving the expected time to absorption.

23 A gambler plays a game in which on each play he wins one dollar with probability p and loses one dollar with probability $q = 1 - p$. The Gambler's Ruin problem is the problem of finding the probability $w_x$ of winning an amount T before losing everything, starting with state x. Show that this problem may be considered to be an absorbing Markov chain with states 0, 1, 2, ..., T with 0 and T absorbing states. Suppose that a gambler has probability p = .48 of winning on each play. Suppose, in addition, that the gambler starts with 50 dollars and that T = 100 dollars. Simulate this game 100 times and see how often the gambler is ruined. This estimates $w_{50}$.

$^{7}$ F. Roberts, Discrete Mathematical Models (Englewood Cliffs, NJ: Prentice Hall, 1976).
$^{8}$ W. W. Leontief, Input-Output Economics (Oxford: Oxford University Press, 1966).

24 Show that $w_x$ of Exercise 23 satisfies the following conditions:

(a) $w_x = p w_{x+1} + q w_{x-1}$ for $x = 1, 2, \ldots, T - 1$.

(b) $w_0 = 0$.

(c) $w_T = 1$.

Show that these conditions determine $w_x$. Show that, if $p = q = 1/2$, then
\[
w_x = \frac{x}{T}
\]
satisfies (a), (b), and (c) and hence is the solution. If $p \ne q$, show that
\[
w_x = \frac{(q/p)^x - 1}{(q/p)^T - 1}
\]
satisfies these conditions and hence gives the probability of the gambler winning.

25 Write a program to compute the probability $w_x$ of Exercise 24 for given values of x, p, and T. Study the probability that the gambler will ruin the bank in a game that is only slightly unfavorable, say p = .49, if the bank has significantly more money than the gambler.

*26 We considered the two examples of the Drunkard's Walk corresponding to the cases n = 4 and n = 5 blocks (see Example 11.13). Verify that in these two examples the expected time to absorption, starting at x, is equal to $x(n - x)$. See if you can prove that this is true in general. Hint: Show that if $f(x)$ is the expected time to absorption then $f(0) = f(n) = 0$ and
\[
f(x) = (1/2) f(x - 1) + (1/2) f(x + 1) + 1
\]
for $0 < x < n$. Show that if $f_1(x)$ and $f_2(x)$ are two solutions, then their difference $g(x)$ is a solution of the equation
\[
g(x) = (1/2) g(x - 1) + (1/2) g(x + 1).
\]
Also, $g(0) = g(n) = 0$.
Show that it is not possible for $g(x)$ to have a strict maximum or a strict minimum at the point i, where $1 \le i \le n - 1$. Use this to show that $g(i) = 0$ for all i. This shows that there is at most one solution. Then verify that the function $f(x) = x(n - x)$ is a solution.

27 Consider an absorbing Markov chain with state space S. Let f be a function defined on S with the property that
\[
f(i) = \sum_{j \in S} p_{ij} f(j)
\]
or in vector form
\[
f = Pf.
\]
Then f is called a harmonic function for P. If you imagine a game in which your fortune is $f(i)$ when you are in state i, then the harmonic condition means that the game is fair in the sense that your expected fortune after one step is the same as it was before the step.

(a) Show that for f harmonic
\[
f = P^n f
\]
for all n.

(b) Show, using (a), that for f harmonic
\[
f = P^\infty f,
\]
where
\[
P^\infty = \lim_{n \to \infty} P^n = \begin{pmatrix} 0 & B \\ 0 & I \end{pmatrix}.
\]

(c) Using (b), prove that when you start in a transient state i your expected final fortune
\[
\sum_k b_{ik} f(k)
\]
is equal to your starting fortune $f(i)$. In other words, a fair game on a finite state space remains fair to the end. (Fair games in general are called martingales. Fair games on infinite state spaces need not remain fair with an unlimited number of plays allowed. For example, consider the game of Heads or Tails (see Example 1.4). Let Peter start with 1 penny and play until he has 2. Then Peter will be sure to end up 1 penny ahead.)

28 A coin is tossed repeatedly. We are interested in finding the expected number of tosses until a particular pattern, say B = HTH, occurs for the first time. If, for example, the outcomes of the tosses are HHTTHTH we say that the pattern B has occurred for the first time after 7 tosses. Let $T_B$ be the time to obtain pattern B for the first time. Li$^{9}$ gives the following method for determining $E(T_B)$.

We are in a casino and, before each toss of the coin, a gambler enters, pays 1 dollar to play, and bets that the pattern B = HTH will occur on the next three tosses. If H occurs, he wins 2 dollars and bets this amount that the next outcome will be T. If he wins, he wins 4 dollars and bets this amount that H will come up next time. If he wins, he wins 8 dollars and the pattern has occurred. If at any time he loses, he leaves with no winnings.

$^{9}$ S-Y. R. Li, "A Martingale Approach to the Study of Occurrence of Sequence Patterns in Repeated Experiments," Annals of Probability, vol. 8 (1980), pp. 1171-1176.

Let A and B be two patterns. Let AB be the amount the gamblers win who arrive while the pattern A occurs and bet that B will occur. For example, if A = HT and B = HTH then AB = 4, since the first gambler bet on H and won 2 dollars and then bet this amount on T and won 4 dollars, while the second gambler bet on H and lost. (This is the value AB = 4 used in Exercise 29.) If A = HH and B = HTH, then AB = 2 since the first gambler bet on H and won but then bet on T and lost and the second gambler bet on H and won. If A = B = HTH then AB = BB = 8 + 2 = 10.

Now for each gambler coming in, the casino takes in 1 dollar. Thus the casino takes in $T_B$ dollars. How much does it pay out? The only gamblers who go off with any money are those who arrive during the time the pattern B occurs and they win the amount BB. But since all the bets made are perfectly fair bets, it seems quite intuitive that the expected amount the casino takes in should equal the expected amount that it pays out. That is, $E(T_B) = BB$.

Since we have seen that for B = HTH, BB = 10, the expected time to reach the pattern HTH for the first time is 10. If we had been trying to get the pattern B = HHH, then BB = 8 + 4 + 2 = 14 since all the last three gamblers are paid off in this case. Thus the expected time to get the pattern HHH is 14. To justify this argument, Li used a theorem from the theory of martingales (fair games).
We can obtain these expectations by considering a Markov chain whose states are the possible initial segments of the sequence HTH; these states are HTH, HT, H, and $\emptyset$, where $\emptyset$ is the empty set. Then, for this example, the transition matrix is

\[
\begin{array}{c|cccc} & \text{HTH} & \text{HT} & \text{H} & \emptyset \\ \hline \text{HTH} & 1 & 0 & 0 & 0 \\ \text{HT} & .5 & 0 & 0 & .5 \\ \text{H} & 0 & .5 & .5 & 0 \\ \emptyset & 0 & 0 & .5 & .5 \end{array}
\]

and if B = HTH, $E(T_B)$ is the expected time to absorption for this chain started in state $\emptyset$.

Show, using the associated Markov chain, that the values $E(T_B) = 10$ and $E(T_B) = 14$ are correct for the expected time to reach the patterns HTH and HHH, respectively.

29 We can use the gambling interpretation given in Exercise 28 to find the expected number of tosses required to reach pattern B when we start with pattern A. To be a meaningful problem, we assume that pattern A does not have pattern B as a subpattern. Let $E_A(T_B)$ be the expected time to reach pattern B starting with pattern A. We use our gambling scheme and assume that the first k coin tosses produced the pattern A. During this time, the gamblers made an amount AB. The total amount the gamblers will have made when the pattern B occurs is BB. Thus, the amount that the gamblers made after the pattern A has occurred is BB - AB. Again by the fair game argument, $E_A(T_B) = BB - AB$.

For example, suppose that we start with pattern A = HT and are trying to get the pattern B = HTH. Then we saw in Exercise 28 that AB = 4 and BB = 10 so $E_A(T_B) = BB - AB = 6$.

Verify that this gambling interpretation leads to the correct answer for all starting states in the examples that you worked in Exercise 28.

30 Here is an elegant method due to Guibas and Odlyzko$^{10}$ to obtain the expected time to reach a pattern, say HTH, for the first time. Let $f(n)$ be the number of sequences of length n which do not have the pattern HTH. Let $f_p(n)$ be the number of sequences that have the pattern for the first time after n tosses. To each element of $f(n)$, add the pattern HTH.
Then divide the resulting sequences into three subsets: the set where HTH occurs for the first time at time n + 1 (for this, the original sequence must have ended with HT); the set where HTH occurs for the first time at time n + 2 (cannot happen for this pattern); and the set where the sequence HTH occurs for the first time at time n + 3 (the original sequence ended with anything except HT). Doing this, we have
\[
f(n) = f_p(n + 1) + f_p(n + 3).
\]
Thus,
\[
\frac{f(n)}{2^n} = \frac{2 f_p(n + 1)}{2^{n+1}} + \frac{2^3 f_p(n + 3)}{2^{n+3}}.
\]
If T is the time that the pattern occurs for the first time, this equality states that
\[
P(T > n) = 2 P(T = n + 1) + 8 P(T = n + 3).
\]
Show that if you sum this equality over all n you obtain
\[
\sum_{n=0}^{\infty} P(T > n) = 2 + 8 = 10.
\]
Show that for any nonnegative integer-valued random variable
\[
E(T) = \sum_{n=0}^{\infty} P(T > n),
\]
and conclude that $E(T) = 10$. Note that this method of proof makes very clear that $E(T)$ is, in general, equal to the expected amount the casino pays out and avoids the martingale system theorem used by Li.

$^{10}$ L. J. Guibas and A. M. Odlyzko, "String Overlaps, Pattern Matching, and Non-transitive Games," Journal of Combinatorial Theory, Series A, vol. 30 (1981), pp. 183-208.

31 In Example 11.11, define $f(i)$ to be the proportion of G genes in state i. Show that f is a harmonic function (see Exercise 27). Why does this show that the probability of being absorbed in state (GG, GG) is equal to the proportion of G genes in the starting state? (See Exercise 17.)

32 Show that the stepping stone model (Example 11.12) is an absorbing Markov chain. Assume that you are playing a game with red and green squares, in which your fortune at any time is equal to the proportion of red squares at that time.
Give an argument to show that this is a fair game in the sense that your expected winning after each step is just what it was before this step. Hint: Show that for every possible outcome in which your fortune will decrease by one there is another outcome of exactly the same probability where it will increase by one. Use this fact and the results of Exercise 27 to show that the probability that a particular color wins out is equal to the proportion of squares that are initially of this color.

33 Consider a random walker who moves on the integers 0, 1, ..., N, moving one step to the right with probability p and one step to the left with probability $q = 1 - p$. If the walker ever reaches 0 or N he stays there. (This is the Gambler's Ruin problem of Exercise 23.) If $p = q$ show that the function
\[
f(i) = i
\]
is a harmonic function (see Exercise 27), and if $p \ne q$ then
\[
f(i) = \left(\frac{q}{p}\right)^i
\]
is a harmonic function. Use this and the result of Exercise 27 to show that the probability $b_{iN}$ of being absorbed in state N starting in state i is
\[
b_{iN} = \begin{cases} \dfrac{i}{N}, & \text{if } p = q, \\[2ex] \dfrac{(q/p)^i - 1}{(q/p)^N - 1}, & \text{if } p \ne q. \end{cases}
\]
For an alternative derivation of these results see Exercise 24.

34 Complete the following alternate proof of Theorem 11.6. Let $s_i$ be a transient state and $s_j$ be an absorbing state. If we compute $b_{ij}$ in terms of the possibilities on the outcome of the first step, then we have the equation
\[
b_{ij} = p_{ij} + \sum_k p_{ik} b_{kj},
\]
where the summation is carried out over all transient states $s_k$. Write this in matrix form, and derive from this equation the statement
\[
B = NR.
\]

35 In Monte Carlo roulette (see Example 6.6), under option (c), there are six states (S, W, L, E, $P_1$, and $P_2$). The reader is referred to Figure 6.2, which contains a tree for this option. Form a Markov chain for this option, and use the program AbsorbingChain to find the probabilities that you win, lose, or break even for a 1 franc bet on red. Using these probabilities, find the expected winnings for this bet.
For a more general discussion of Markov chains applied to roulette, see the article of H. Sagan referred to in Example 6.13.

36 We consider next a game called Penney-ante by its inventor W. Penney.$^{11}$ There are two players; the first player picks a pattern A of H's and T's, and then the second player, knowing the choice of the first player, picks a different pattern B. We assume that neither pattern is a subpattern of the other pattern. A coin is tossed a sequence of times, and the player whose pattern comes up first is the winner. To analyze the game, we need to find the probability $p_A$ that pattern A will occur before pattern B and the probability $p_B = 1 - p_A$ that pattern B occurs before pattern A. To determine these probabilities we use the results of Exercises 28 and 29. There you were asked to show that the expected time to reach a pattern B for the first time is
\[
E(T_B) = BB,
\]
and that, starting with pattern A, the expected time to reach pattern B is
\[
E_A(T_B) = BB - AB.
\]

(a) Show that the odds that the first player will win are given by John Conway's formula$^{12}$:
\[
\frac{p_A}{1 - p_A} = \frac{p_A}{p_B} = \frac{BB - BA}{AA - AB}.
\]
Hint: Explain why
\[
E(T_B) = E(T_{A \text{ or } B}) + p_A E_A(T_B)
\]
and thus
\[
BB = E(T_{A \text{ or } B}) + p_A (BB - AB).
\]
Interchange A and B to find a similar equation involving $p_B$. Finally, note that
\[
p_A + p_B = 1.
\]
Use these equations to solve for $p_A$ and $p_B$.

(b) Assume that both players choose a pattern of the same length k. Show that, if k = 2, this is a fair game, but, if k = 3, the second player has an advantage no matter what choice the first player makes. (It has been shown that, for $k \ge 3$, if the first player chooses $a_1, a_2, \ldots, a_k$, then the optimal strategy for the second player is of the form $b, a_1, \ldots, a_{k-1}$ where b is the better of the two choices H or T.$^{13}$)

$^{11}$ W. Penney, "Problem: Penney-Ante," Journal of Recreational Math, vol. 2 (1969), p. 241.
$^{12}$ M. Gardner, "Mathematical Games," Scientific American, vol. 10 (1974), pp. 120-125.
$^{13}$ Guibas and Odlyzko, op. cit.
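Several of the exercises above ask for N, Nc, and NR. The following is a minimal Python sketch of how such quantities might be computed, in the spirit of the program AbsorbingChain but not a reproduction of it: the function names `inverse`, `mat_mul`, and `absorbing_chain` are assumptions made for illustration. It uses exact fractions and is run on the drunkard's walk with 5 blocks, whose values of N, t, and B are reported earlier in this section.

```python
from fractions import Fraction as F


def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(A)
    # Augment A with the identity matrix: [A | I].
    M = [list(row) + [F(int(i == j)) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and swap it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so that the pivot entry is 1.
        M[col] = [x / M[col][col] for x in M[col]]
        # Clear the column in every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]


def mat_mul(A, B):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def absorbing_chain(Q, R):
    """Return (N, t, B) with N = (I - Q)^(-1), t = Nc, and B = NR."""
    n = len(Q)
    i_minus_q = [[F(int(i == j)) - Q[i][j] for j in range(n)] for i in range(n)]
    N = inverse(i_minus_q)
    t = [sum(row) for row in N]   # row sums of N, i.e. Nc
    B = mat_mul(N, R)
    return N, t, B


# Drunkard's walk with 5 blocks: transient states 1..4, absorbing states 0 and 5.
h = F(1, 2)
Q = [[0, h, 0, 0], [h, 0, h, 0], [0, h, 0, h], [0, 0, h, 0]]
R = [[h, 0], [0, 0], [0, 0], [0, h]]
N, t, B = absorbing_chain(Q, R)
print("t =", [str(x) for x in t])
print("B =", [[str(x) for x in row] for row in B])
```

Working in exact fractions rather than floating point is a design choice of this sketch; it lets the output be compared entry by entry with hand computations such as those requested in Exercises 4 through 8.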
11.3 Ergodic Markov Chains

A second important kind of Markov chain we shall study in detail is an ergodic Markov chain, defined as follows.

Definition 11.4 A Markov chain is called an ergodic chain if it is possible to go from every state to every state (not necessarily in one move). □

In many books, ergodic Markov chains are called irreducible.

Definition 11.5 A Markov chain is called a regular chain if some power of the transition matrix has only positive elements. □

In other words, for some n, it is possible to go from any state to any state in exactly n steps. It is clear from this definition that every regular chain is ergodic. On the other hand, an ergodic chain is not necessarily regular, as the following examples show.

Example 11.16 Let the transition matrix of a Markov chain be defined by

\[
P = \begin{array}{c|cc} & 1 & 2 \\ \hline 1 & 0 & 1 \\ 2 & 1 & 0 \end{array}
\]

Then it is clear that it is possible to move from any state to any state, so the chain is ergodic. However, if n is odd, then it is not possible to move from state 1 to state 1 in n steps, and if n is even, then it is not possible to move from state 1 to state 2 in n steps, so the chain is not regular. □

A more interesting example of an ergodic, non-regular Markov chain is provided by the Ehrenfest urn model.

Example 11.17 Recall the Ehrenfest urn model (Example 11.8). The transition matrix for this example is

\[
P = \begin{array}{c|ccccc} & 0 & 1 & 2 & 3 & 4 \\ \hline 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1/4 & 0 & 3/4 & 0 & 0 \\ 2 & 0 & 1/2 & 0 & 1/2 & 0 \\ 3 & 0 & 0 & 3/4 & 0 & 1/4 \\ 4 & 0 & 0 & 0 & 1 & 0 \end{array}
\]

In this example, if we start in state 0 we will, after any even number of steps, be in either state 0, 2 or 4, and after any odd number of steps, be in states 1 or 3. Thus this chain is ergodic but not regular. □

Regular Markov Chains

Any transition matrix that has no zeros determines a regular Markov chain. However, it is possible for a regular Markov chain to have a transition matrix that has zeros.
The transition matrix of the Land of Oz example of Section 11.1 has $p_{NN} = 0$ but the second power $P^2$ has no zeros, so this is a regular Markov chain.

An example of a nonregular Markov chain is an absorbing chain. For example, let

\[
P = \begin{pmatrix} 1 & 0 \\ 1/2 & 1/2 \end{pmatrix}
\]

be the transition matrix of a Markov chain. Then all powers of P will have a 0 in the upper right-hand corner.

We shall now discuss two important theorems relating to regular chains.

Theorem 11.7 Let P be the transition matrix for a regular chain. Then, as $n \to \infty$, the powers $P^n$ approach a limiting matrix W with all rows the same vector w. The vector w is a strictly positive probability vector (i.e., the components are all positive and they sum to one). □

In the next section we give two proofs of this fundamental theorem. We give here the basic idea of the first proof.

We want to show that the powers $P^n$ of a regular transition matrix tend to a matrix with all rows the same. This is the same as showing that $P^n$ converges to a matrix with constant columns. Now the jth column of $P^n$ is $P^n y$ where y is a column vector with 1 in the jth entry and 0 in the other entries. Thus we need only prove that for any column vector y, $P^n y$ approaches a constant vector as n tends to infinity.

Since each row of P is a probability vector, Py replaces y by averages of its components. Here is an example:

\[
\begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/3 & 1/3 & 1/3 \\ 1/2 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}
= \begin{pmatrix} 1/2 \cdot 1 + 1/4 \cdot 2 + 1/4 \cdot 3 \\ 1/3 \cdot 1 + 1/3 \cdot 2 + 1/3 \cdot 3 \\ 1/2 \cdot 1 + 1/2 \cdot 2 + 0 \cdot 3 \end{pmatrix}
= \begin{pmatrix} 7/4 \\ 2 \\ 3/2 \end{pmatrix}.
\]

The result of the averaging process is to make the components of Py more similar than those of y. In particular, the maximum component decreases (from 3 to 2) and the minimum component increases (from 1 to 3/2). Our proof will show that as we do more and more of this averaging to get $P^n y$, the difference between the maximum and minimum component will tend to 0 as $n \to \infty$. This means $P^n y$ tends to a constant vector. The ijth entry of $P^n$, $p_{ij}^{(n)}$, is the probability that the process will be in state $s_j$ after n steps if it starts in state $s_i$.
If we denote the common row of W by w, then Theorem 11.7 states that the probability of being in $s_j$ in the long run is approximately $w_j$, the jth entry of w, and is independent of the starting state.

Example 11.18 Recall that for the Land of Oz example of Section 11.1, the sixth power of the transition matrix P is, to three decimal places,

\[
P^6 = \begin{array}{c|ccc} & R & N & S \\ \hline R & .4 & .2 & .4 \\ N & .4 & .2 & .4 \\ S & .4 & .2 & .4 \end{array}
\]

Thus, to this degree of accuracy, the probability of rain six days after a rainy day is the same as the probability of rain six days after a nice day, or six days after a snowy day. Theorem 11.7 predicts that, for large n, the rows of $P^n$ approach a common vector. It is interesting that this occurs so soon in our example. □

Theorem 11.8 Let P be a regular transition matrix, let
\[
W = \lim_{n \to \infty} P^n,
\]
let w be the common row of W, and let c be the column vector all of whose components are 1. Then

(a) $wP = w$, and any row vector v such that $vP = v$ is a constant multiple of w.

(b) $Pc = c$, and any column vector x such that $Px = x$ is a multiple of c.

Proof. To prove part (a), we note that from Theorem 11.7,
\[
P^n \to W.
\]
Thus,
\[
P^{n+1} = P^n \cdot P \to WP.
\]
But $P^{n+1} \to W$, and so $W = WP$, and $w = wP$.

Let v be any vector with $vP = v$. Then $v = vP^n$, and passing to the limit, $v = vW$. Let r be the sum of the components of v. Then it is easily checked that $vW = rw$. So, $v = rw$.

To prove part (b), assume that $x = Px$. Then $x = P^n x$, and again passing to the limit, $x = Wx$. Since all rows of W are the same, the components of $Wx$ are all equal, so x is a multiple of c. □

Note that an immediate consequence of Theorem 11.8 is the fact that there is only one probability vector v such that $vP = v$.

Fixed Vectors

Definition 11.6 A row vector w with the property $wP = w$ is called a fixed row vector for P. Similarly, a column vector x such that $Px = x$ is called a fixed column vector for P.
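The statements of Example 11.18 and Definition 11.6 can be checked directly for the Land of Oz matrix. The following Python sketch is offered only as an illustration; the helper names `mat_mul` and `mat_pow` are assumptions, not programs from the text. It raises P to the sixth power in exact fractions, rounds the entries to three decimal places, and then tests whether w = (.4, .2, .4) is a fixed row vector and c = (1, 1, 1) a fixed column vector.

```python
from fractions import Fraction as F


def mat_mul(A, B):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mat_pow(P, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(P)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


# Land of Oz transition matrix, states in the order R, N, S.
P = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 2), F(0), F(1, 2)],
    [F(1, 4), F(1, 4), F(1, 2)],
]

P6 = mat_pow(P, 6)
rows_rounded = [[round(float(x), 3) for x in row] for row in P6]

w = [[F(2, 5), F(1, 5), F(2, 5)]]   # candidate fixed row vector (1 x 3)
c = [[F(1)], [F(1)], [F(1)]]        # column vector of ones (3 x 1)

is_fixed_row = mat_mul(w, P) == w      # tests wP = w
is_fixed_column = mat_mul(P, c) == c   # tests Pc = c

print(rows_rounded)
print(is_fixed_row, is_fixed_column)
```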
Thus, the common row of W is the unique vector w which is both a fixed row vector for P and a probability vector. Theorem 11.8 shows that any fixed row vector for P is a multiple of w and any fixed column vector for P is a constant vector.

One can also state Definition 11.6 in terms of eigenvalues and eigenvectors. A fixed row vector is a left eigenvector of the matrix P corresponding to the eigenvalue 1. A similar statement can be made about fixed column vectors.

We will now give several different methods for calculating the fixed row vector w for a regular Markov chain.

Example 11.19 By Theorem 11.7 we can find the limiting vector w for the Land of Oz from the fact that
\[
w_1 + w_2 + w_3 = 1
\]
and
\[
\begin{pmatrix} w_1 & w_2 & w_3 \end{pmatrix}
\begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{pmatrix}
= \begin{pmatrix} w_1 & w_2 & w_3 \end{pmatrix}.
\]
These relations lead to the following four equations in three unknowns:
\[
\begin{aligned}
w_1 + w_2 + w_3 &= 1, \\
(1/2) w_1 + (1/2) w_2 + (1/4) w_3 &= w_1, \\
(1/4) w_1 + (1/4) w_3 &= w_2, \\
(1/4) w_1 + (1/2) w_2 + (1/2) w_3 &= w_3.
\end{aligned}
\]
Our theorem guarantees that these equations have a unique solution. If the equations are solved, we obtain the solution
\[
w = \begin{pmatrix} .4 & .2 & .4 \end{pmatrix},
\]
in agreement with that predicted from $P^6$, given in Example 11.2. □

To calculate the fixed vector, we can assume that the value at a particular state, say state one, is 1, and then use all but one of the linear equations from $wP = w$. This set of equations will have a unique solution and we can obtain w from this solution by dividing each of its entries by their sum to give the probability vector w. We will now illustrate this idea for the above example.

Example 11.20 (Example 11.19 continued) We set $w_1 = 1$, and then solve the first and second linear equations from $wP = w$. We have
\[
\begin{aligned}
(1/2) + (1/2) w_2 + (1/4) w_3 &= 1, \\
(1/4) + (1/4) w_3 &= w_2.
\end{aligned}
\]
If we solve these, we obtain
\[
\begin{pmatrix} w_1 & w_2 & w_3 \end{pmatrix} = \begin{pmatrix} 1 & 1/2 & 1 \end{pmatrix}.
\]
Now we divide this vector by the sum of the components, to obtain the final answer:
\[
w = \begin{pmatrix} .4 & .2 & .4 \end{pmatrix}.
\]
This method can be easily programmed to run on a computer.
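As an illustration of how the method of Example 11.20 might be programmed, here is a minimal Python sketch. It is not the book's program FixedVector; the names `solve_linear` and `fixed_vector` are assumptions, and exact fractions are used so that the result can be compared with the hand computation. The function sets the first component to 1, solves all but the last of the equations from wP = w for the remaining components, and then normalizes.

```python
from fractions import Fraction as F


def solve_linear(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and swap it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def fixed_vector(P):
    """Fixed probability row vector of P, found by setting the first component to 1."""
    n = len(P)
    # Equation j of wP = w reads: sum_i w_i * P[i][j] = w_j.
    # With w_0 = 1 known, move its term to the right-hand side and keep
    # equations j = 0, ..., n-2 (all but the last) for the unknowns w_1..w_{n-1}.
    A = [[P[i][j] - (1 if i == j else 0) for i in range(1, n)] for j in range(n - 1)]
    b = [(1 if j == 0 else 0) - P[0][j] for j in range(n - 1)]
    w = [F(1)] + solve_linear(A, b)
    total = sum(w)
    return [x / total for x in w]


# Land of Oz transition matrix, states in the order R, N, S.
oz = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 2), F(0), F(1, 2)],
    [F(1, 4), F(1, 4), F(1, 2)],
]
w = fixed_vector(oz)
print([str(x) for x in w])
```

For the Land of Oz matrix the intermediate solution is (1, 1/2, 1), as in the example, and the normalized vector is (2/5, 1/5, 2/5).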
□

As mentioned above, we can also think of the fixed row vector w as a left eigenvector of the transition matrix P. Thus, if we write I to denote the identity matrix, then w satisfies the matrix equation
\[
wP = wI,
\]
or equivalently,
\[
w(P - I) = 0.
\]
Thus, w is in the left nullspace of the matrix $P - I$. Furthermore, Theorem 11.8 states that this left nullspace has dimension 1. Certain computer programming languages can find nullspaces of matrices. In such languages, one can find the fixed row probability vector for a matrix P by computing the left nullspace and then normalizing a vector in the nullspace so the sum of its components is 1.

The program FixedVector uses one of the above methods (depending upon the language in which it is written) to calculate the fixed row probability vector for regular Markov chains.

So far we have always assumed that we started in a specific state. The following theorem generalizes Theorem 11.7 to the case where the starting state is itself determined by a probability vector.

Theorem 11.9 Let P be the transition matrix for a regular chain and v an arbitrary probability vector. Then
\[
\lim_{n \to \infty} vP^n = w,
\]
where w is the unique fixed probability vector for P.

Proof. By Theorem 11.7,
\[
\lim_{n \to \infty} P^n = W.
\]
Hence,
\[
\lim_{n \to \infty} vP^n = vW.
\]
But the entries in v sum to 1, and each row of W equals w. From these statements, it is easy to check that
\[
vW = w.
\]
□

If we start a Markov chain with initial probabilities given by v, then the probability vector $vP^n$ gives the probabilities of being in the various states after n steps. Theorem 11.9 then establishes the fact that, even in this more general class of processes, the probability of being in $s_j$ approaches $w_j$.

Equilibrium

We also obtain a new interpretation for w. Suppose that our starting vector picks state $s_i$ as a starting state with probability $w_i$, for all i. Then the probability of being in the various states after n steps is given by $wP^n = w$, and is the same on all steps.
This method of starting provides us with a process that is called "stationary." The fact that w is the only probability vector for which $wP = w$ shows that we must have a starting probability vector of exactly the kind described to obtain a stationary process.

Many interesting results concerning regular Markov chains depend only on the fact that the chain has a unique fixed probability vector which is positive. This property holds for all ergodic Markov chains.

Theorem 11.10 For an ergodic Markov chain, there is a unique probability vector w such that $wP = w$ and w is strictly positive. Any row vector such that $vP = v$ is a multiple of w. Any column vector x such that $Px = x$ is a constant vector.

Proof. This theorem states that Theorem 11.8 is true for ergodic chains. The result follows easily from the fact that, if P is an ergodic transition matrix, then $\bar{P} = (1/2)I + (1/2)P$ is a regular transition matrix with the same fixed vectors (see Exercises 25-28). □

For ergodic chains, the fixed probability vector has a slightly different interpretation. The following two theorems, which we will not prove here, furnish an interpretation for this fixed vector.

Theorem 11.11 Let P be the transition matrix for an ergodic chain. Let $A_n$ be the matrix defined by
\[
A_n = \frac{I + P + P^2 + \cdots + P^n}{n + 1}.
\]
Then $A_n \to W$, where W is a matrix all of whose rows are equal to the unique fixed probability vector w for P. □

If P is the transition matrix of an ergodic chain, then Theorem 11.10 states that there is only one fixed row probability vector for P. Thus, we can use the same techniques that were used for regular chains to solve for this fixed vector. In particular, the program FixedVector works for ergodic chains.

To interpret Theorem 11.11, let us assume that we have an ergodic chain that starts in state $s_i$. Let $X^{(m)} = 1$ if the mth step is to state $s_j$ and 0 otherwise.
Then the average number of times in state $s_j$ in the first $n$ steps is given by

$$H^{(n)} = \frac{X^{(0)} + X^{(1)} + X^{(2)} + \cdots + X^{(n)}}{n + 1}.$$

But $X^{(m)}$ takes on the value 1 with probability $p_{ij}^{(m)}$ and 0 otherwise. Thus $E(X^{(m)}) = p_{ij}^{(m)}$, and the $ij$th entry of $A_n$ gives the expected value of $H^{(n)}$, that is, the expected proportion of times in state $s_j$ in the first $n$ steps if the chain starts in state $s_i$.

If we call being in state $s_j$ success and any other state failure, we could ask if a theorem analogous to the law of large numbers for independent trials holds. The answer is yes and is given by the following theorem.

Theorem 11.12 (Law of Large Numbers for Ergodic Markov Chains) Let $H_j^{(n)}$ be the proportion of times in $n$ steps that an ergodic chain is in state $s_j$. Then for any $\epsilon > 0$,

$$P\left(|H_j^{(n)} - w_j| > \epsilon\right) \to 0,$$

independent of the starting state $s_i$. □

We have observed that every regular Markov chain is also an ergodic chain. Hence, Theorems 11.11 and 11.12 apply also for regular chains. For example, this gives us a new interpretation for the fixed vector $w = (.4, .2, .4)$ in the Land of Oz example. Theorem 11.11 predicts that, in the long run, it will rain 40 percent of the time in the Land of Oz, be nice 20 percent of the time, and snow 40 percent of the time.

Simulation

We illustrate Theorem 11.12 by writing a program to simulate the behavior of a Markov chain. SimulateChain is such a program.

Example 11.21 In the Land of Oz, there are 525 days in a year. We have simulated the weather for one year in the Land of Oz, using the program SimulateChain. The results are shown in Table 11.2.
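SimulateChain itself is not listed here. A simulation of this kind might be sketched as follows; the function names, the seed, and the run lengths are illustrative choices, and the transition probabilities are assumed to be those of the Land of Oz chain from earlier in the chapter. A different seed gives different counts, so the output will not reproduce the published tables.

```python
import random

STATES = "RNS"  # rain, nice, snow
# Assumed Land of Oz transition probabilities, one row per current state.
P = {"R": [0.50, 0.25, 0.25],
     "N": [0.50, 0.00, 0.50],
     "S": [0.25, 0.25, 0.50]}

def simulate_chain(start, n_days, rng):
    """Return the weather on days 1..n_days as a string, starting from `start`."""
    state, days = start, []
    for _ in range(n_days):
        # Draw the next state according to the row of P for the current state.
        state = rng.choices(STATES, weights=P[state])[0]
        days.append(state)
    return "".join(days)

def state_fractions(path):
    """Proportion of days spent in each state."""
    return {s: path.count(s) / len(path) for s in STATES}

rng = random.Random(2006)                 # arbitrary seed, for repeatability
year = simulate_chain("R", 525, rng)      # one Land of Oz year
long_run = simulate_chain("R", 100_000, rng)

print(year[:58])
print(state_fractions(year))
print(state_fractions(long_run))
```

By Theorem 11.12 the fractions from the longer run should lie close to .4, .2, and .4, while the one-year fractions can deviate noticeably.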
SSRNRNSSSSSSNRSNSSRNSRNSSSNSRRRNSSSNRRSSSSNRSSNSRRRRRRNSSS
SSRRRSNSNRRRRSRSRNSNSRRNRRNRSSNSRNRNSSRRSRNSSSNRSRRSSNRSNR
RNSSSSNSSNSRSRRNSSNSSRNSSRRNRRRSRNRRRNSSSNRNSRNSNRNRSSSRSS
NRSSSNSSSSSSNSSSNSNSRRNRNRRRRSRRRSSSSNRRSSSSRSRRRNRRRSSSSR
RNRRRSRSSRRRRSSRNRRRRRRNSSRNRSSSNRNSNRRRRNRRRNRSNRRNSRRSNR
RRRSSSRNRRRNSNSSSSSRRRRSRNRSSRRRRSSSRRRNRNRRRSRSRNSNSSRRRR
RNSNRNSNRRNRRRRRRSSSNRSSRSNRSSSNSNRNSNSSSNRRSRRRNRRRRNRNRS
SSNSRSNRNRRSNRRNSRSSSRNSRRSSNSRRRNRRSNRRNSSSSSNRNSSSSSSSNR
NSRRRNSSRRRNSSSNRRSRNSSRRNRRNRSNRRRRRRRRRNSNRRRRRNSRRSSSSN
SNS

State  Times  Fraction
R      217    .413
N      109    .208
S      199    .379

Table 11.2: Weather in the Land of Oz.

We note that the simulation gives a proportion of times in each of the states not too different from the long run predictions of .4, .2, and .4 assured by Theorem 11.7. To get better results we have to simulate our chain for a longer time. We do this for 10,000 days without printing out each day's weather. The results are shown in Table 11.3. We see that the results are now quite close to the theoretical values of .4, .2, and .4.

State  Times  Fraction
R      4010   .401
N      1902   .190
S      4088   .409

Table 11.3: Comparison of observed and predicted frequencies for the Land of Oz.

Examples of Ergodic Chains

The computation of the fixed vector $w$ may be difficult if the transition matrix is very large. It is sometimes useful to guess the fixed vector on purely intuitive grounds. Here is a simple example to illustrate this kind of situation.

Example 11.22 A white rat is put into the maze of Figure 11.4. There are nine compartments with connections between the compartments as indicated. The rat moves through the compartments at random. That is, if there are $k$ ways to leave a compartment, it chooses each of these with equal probability.
We can represent the travels of the rat by a Markov chain process with transition matrix given by

$$
P = \begin{array}{c|ccccccccc}
  & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline
1 & 0 & 1/2 & 0 & 0 & 0 & 1/2 & 0 & 0 & 0 \\
2 & 1/3 & 0 & 1/3 & 0 & 1/3 & 0 & 0 & 0 & 0 \\
3 & 0 & 1/2 & 0 & 1/2 & 0 & 0 & 0 & 0 & 0 \\
4 & 0 & 0 & 1/3 & 0 & 1/3 & 0 & 0 & 0 & 1/3 \\
5 & 0 & 1/4 & 0 & 1/4 & 0 & 1/4 & 0 & 1/4 & 0 \\
6 & 1/3 & 0 & 0 & 0 & 1/3 & 0 & 1/3 & 0 & 0 \\
7 & 0 & 0 & 0 & 0 & 0 & 1/2 & 0 & 1/2 & 0 \\
8 & 0 & 0 & 0 & 0 & 1/3 & 0 & 1/3 & 0 & 1/3 \\
9 & 0 & 0 & 0 & 1/2 & 0 & 0 & 0 & 1/2 & 0
\end{array}
$$

That this chain is not regular can be seen as follows: From an odd-numbered state the process can go only to an even-numbered state, and from an even-numbered state it can go only to an odd-numbered state. Hence, starting in state $i$ the process will be alternately in even-numbered and odd-numbered states. Therefore, odd powers of $P$ will have 0's for the odd-numbered entries in row 1. On the other hand, a glance at the maze shows that it is possible to go from every state to every other state, so that the chain is ergodic.

[Figure 11.4: The maze problem. The nine compartments form a 3-by-3 grid with rows 1 2 3, 6 5 4, and 7 8 9.]

To find the fixed probability vector for this matrix, we would have to solve ten equations in nine unknowns. However, it would seem reasonable that the times spent in each compartment should, in the long run, be proportional to the number of entries to each compartment. Thus, we try the vector whose $j$th component is the number of entries to the $j$th compartment:

$$x = (2 \;\; 3 \;\; 2 \;\; 3 \;\; 4 \;\; 3 \;\; 2 \;\; 3 \;\; 2).$$

It is easy to check that this vector is indeed a fixed vector so that the unique probability vector is this vector normalized to have sum 1:

$$w = \left(\tfrac{1}{12} \;\; \tfrac{1}{8} \;\; \tfrac{1}{12} \;\; \tfrac{1}{8} \;\; \tfrac{1}{6} \;\; \tfrac{1}{8} \;\; \tfrac{1}{12} \;\; \tfrac{1}{8} \;\; \tfrac{1}{12}\right).$$

Example 11.23 (Example 11.8 continued) We recall the Ehrenfest urn model of Example 11.8.
The transition matrix for this chain is as follows:

$$
P = \begin{array}{c|ccccc}
  & 0 & 1 & 2 & 3 & 4 \\ \hline
0 & .000 & 1.000 & .000 & .000 & .000 \\
1 & .250 & .000 & .750 & .000 & .000 \\
2 & .000 & .500 & .000 & .500 & .000 \\
3 & .000 & .000 & .750 & .000 & .250 \\
4 & .000 & .000 & .000 & 1.000 & .000
\end{array}
$$

If we run the program FixedVector for this chain, we obtain the vector

$$
w = \begin{array}{ccccc}
0 & 1 & 2 & 3 & 4 \\ \hline
.0625 & .2500 & .3750 & .2500 & .0625
\end{array}
$$

By Theorem 11.12, we can interpret these values for $w_i$ as the proportion of times the process is in each of the states in the long run. For example, the proportion of times in state 0 is .0625 and the proportion of times in state 2 is .375. The astute reader will note that these numbers are the binomial distribution 1/16, 4/16, 6/16, 4/16, 1/16. We could have guessed this answer as follows: If we consider a particular ball, it simply moves randomly back and forth between the two urns. This suggests that the equilibrium state should be just as if we randomly distributed the four balls in the two urns. If we did this, the probability that there would be exactly $j$ balls in one urn would be given by the binomial distribution $b(n, p, j)$ with $n = 4$ and $p = 1/2$.

Exercises

1 Which of the following matrices are transition matrices for regular Markov chains?
(a) $P = \begin{pmatrix} .5 & .5 \\ .5 & .5 \end{pmatrix}$.
(b) P = 1 .
(c) $P = \begin{pmatrix} 1/3 & 0 & 2/3 \\ 0 & 1 & 0 \\ 0 & 1/5 & 4/5 \end{pmatrix}$.
(d)3P/1. 1 0
(e) $P = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$.

2 Consider the Markov chain with transition matrix
$$P = \begin{pmatrix} 1/2 & 1/3 & 1/6 \\ 3/4 & 0 & 1/4 \\ 0 & 1 & 0 \end{pmatrix}.$$
(a) Show that this is a regular Markov chain.
(b) The process is started in state 1; find the probability that it is in state 3 after two steps.
(c) Find the limiting probability vector $w$.

3 Consider the Markov chain with general 2 × 2 transition matrix
$$P = \begin{pmatrix} 1 - a & a \\ b & 1 - b \end{pmatrix}.$$

4 Find the fixed probability vector $w$ for the matrices in Exercise 3 that are ergodic.

5 Find the fixed probability vector $w$ for each of the following regular matrices.
.75 .25 (a) P (.5 ).
(b) P =( .
(c) $P = \begin{pmatrix} 3/4 & 1/4 & 0 \\ 0 & 2/3 & 1/3 \\ 1/4 & 1/4 & 1/2 \end{pmatrix}$.

6 Consider the Markov chain with transition matrix in Exercise 3, with $a = b = 1$. Show that this chain is ergodic but not regular. Find the fixed probability vector and interpret it. Show that $P^n$ does not tend to a limit, but that
$$A_n = \frac{I + P + P^2 + \cdots + P^n}{n + 1}$$
does.

7 Consider the Markov chain with transition matrix of Exercise 3, with $a = 0$ and $b = 1/2$. Compute directly the unique fixed probability vector, and use your result to prove that the chain is not ergodic.

8 Show that the matrix
$$P = \begin{pmatrix} 1 & 0 & 0 \\ 1/4 & 1/2 & 1/4 \\ 0 & 0 & 1 \end{pmatrix}$$
has more than one fixed probability vector. Find the matrix that $P^n$ approaches as $n \to \infty$, and verify that it is not a matrix all of whose rows are the same.

9 Prove that, if a 3-by-3 transition matrix has the property that its column sums are 1, then (1/3, 1/3, 1/3) is a fixed probability vector. State a similar result for $n$-by-$n$ transition matrices. Interpret these results for ergodic chains.

10 Is the Markov chain in Example 11.10 ergodic?

11 Is the Markov chain in Example 11.11 ergodic?

12 Consider Example 11.13 (Drunkard's Walk). Assume that if the walker reaches state 0, he turns around and returns to state 1 on the next step and, similarly, if he reaches 4 he returns on the next step to state 3. Is this new chain ergodic? Is it regular?

13 For Example 11.4 when $P$ is ergodic, what is the proportion of people who are told that the President will run? Interpret the fact that this proportion is independent of the starting state.

14 Consider an independent trials process to be a Markov chain whose states are the possible outcomes of the individual trials. What is its fixed probability vector? Is the chain always regular? Illustrate this for Example 11.5.

15 Show that Example 11.8 is an ergodic chain, but not a regular chain. Show that its fixed probability vector $w$ is a binomial distribution.

16 Show that Example 11.9 is regular and find the limiting vector.

17 Toss a fair die repeatedly.
Let $S_n$ denote the total of the outcomes through the $n$th toss. Show that there is a limiting value for the proportion of the first $n$ values of $S_n$ that are divisible by 7, and compute the value for this limit. Hint: The desired limit is an equilibrium probability vector for an appropriate seven state Markov chain.

18 Let $P$ be the transition matrix of a regular Markov chain. Assume that there are $r$ states and let $N(r)$ be the smallest integer $n$ such that $P$ is regular if and only if $P^{N(r)}$ has no zero entries. Find a finite upper bound for $N(r)$. See if you can determine $N(3)$ exactly.

*19 Define $f(r)$ to be the smallest integer $n$ such that for all regular Markov chains with $r$ states, the $n$th power of the transition matrix has all entries positive. It has been shown$^{14}$ that $f(r) = r^2 - 2r + 2$.
(a) Define the transition matrix of an $r$-state Markov chain as follows: For states $s_i$, with $i = 1, 2, \ldots, r - 2$, $P(i, i + 1) = 1$, $P(r - 1, r) = P(r - 1, 1) = 1/2$, and $P(r, 1) = 1$. Show that this is a regular Markov chain.
(b) For $r = 3$, verify that the fifth power is the first power that has no zeros.
(c) Show that, for general $r$, the smallest $n$ such that $P^n$ has all entries positive is $n = f(r)$.

20 A discrete time queueing system of capacity $n$ consists of the person being served and those waiting to be served. The queue length $x$ is observed each second. If $0 < x < n$, then with probability $p$, the queue size is increased by one by an arrival and, independently, with probability $r$, it is decreased by one because the person being served finishes service. If $x = 0$, only an arrival (with probability $p$) is possible. If $x = n$, an arrival will depart without waiting for service, and so only the departure (with probability $r$) of the person being served is possible. Form a Markov chain with states given by the number of customers in the queue. Modify the program FixedVector so that you can input $n$, $p$, and $r$, and the program will construct the transition matrix and compute the fixed vector.
The quantity $s = p/r$ is called the traffic intensity. Describe the differences in the fixed vectors according as $s < 1$, $s = 1$, or $s > 1$.

$^{14}$ E. Seneta, Non-Negative Matrices: An Introduction to Theory and Applications, Wiley, New York, 1973, pp. 52-54.

21 Write a computer program to simulate the queue in Exercise 20. Have your program keep track of the proportion of the time that the queue length is $j$ for $j = 0, 1, \ldots, n$ and the average queue length. Show that the behavior of the queue length is very different depending upon whether the traffic intensity $s$ has the property $s < 1$, $s = 1$, or $s > 1$.

22 In the queueing problem of Exercise 20, let $S$ be the total service time required by a customer and $T$ the time between arrivals of the customers.
(a) Show that $P(S = j) = (1 - r)^{j-1} r$ and $P(T = j) = (1 - p)^{j-1} p$, for $j > 0$.
(b) Show that $E(S) = 1/r$ and $E(T) = 1/p$.
(c) Interpret the conditions $s < 1$, $s = 1$ and $s > 1$ in terms of these expected values.

23 In Exercise 20 the service time $S$ has a geometric distribution with $E(S) = 1/r$. Assume that the service time is, instead, a constant time of $t$ seconds. Modify your computer program of Exercise 21 so that it simulates a constant time service distribution. Compare the average queue length for the two types of distributions when they have the same expected service time (i.e., take $t = 1/r$). Which distribution leads to the longer queues on the average?

24 A certain experiment is believed to be described by a two-state Markov chain with the transition matrix $P$, where
$$P = \begin{pmatrix} .5 & .5 \\ p & 1 - p \end{pmatrix}$$
and the parameter $p$ is not known. When the experiment is performed many times, the chain ends in state one approximately 20 percent of the time and in state two approximately 80 percent of the time. Compute a sensible estimate for the unknown parameter $p$ and explain how you found it.

25 Prove that, in an $r$-state ergodic chain, it is possible to go from any state to any other state in at most $r - 1$ steps.
26 Let $P$ be the transition matrix of an $r$-state ergodic chain. Prove that, if the diagonal entries $p_{ii}$ are positive, then the chain is regular.

27 Prove that if $P$ is the transition matrix of an ergodic chain, then $(1/2)(I + P)$ is the transition matrix of a regular chain. Hint: Use Exercise 26.

28 Prove that $P$ and $(1/2)(I + P)$ have the same fixed vectors.

29 In his book, Wahrscheinlichkeitsrechnung und Statistik,$^{15}$ A. Engle proposes an algorithm for finding the fixed vector for an ergodic Markov chain when the transition probabilities are rational numbers. Here is his algorithm: For each state $i$, let $a_i$ be the least common multiple of the denominators of the non-zero entries in the $i$th row. Engle describes his algorithm in terms of moving chips around on the states; indeed, for small examples, he recommends implementing the algorithm this way. Start by putting $a_i$ chips on state $i$ for all $i$. Then, at each state, redistribute the $a_i$ chips, sending $a_i p_{ij}$ to state $j$. The number of chips at state $i$ after this redistribution need not be a multiple of $a_i$. For each state $i$, add just enough chips to bring the number of chips at state $i$ up to a multiple of $a_i$. Then redistribute the chips in the same manner. This process will eventually reach a point where the number of chips at each state, after the redistribution, is the same as before redistribution. At this point, we have found a fixed vector.

$^{15}$ A. Engle, Wahrscheinlichkeitsrechnung und Statistik, vol. 2 (Stuttgart: Klett Verlag, 1976).

Here is an example:

$$
P = \begin{array}{c|ccc}
  & 1 & 2 & 3 \\ \hline
1 & 1/2 & 1/4 & 1/4 \\
2 & 1/2 & 0 & 1/2 \\
3 & 1/2 & 1/4 & 1/4
\end{array}
$$

We start with $a = (4, 2, 4)$. The chips after successive redistributions are shown in Table 11.4. We find that $a = (20, 8, 12)$ is a fixed vector.

(4 2 4)
(5 2 3)
(8 2 4)
(7 3 4)
(8 4 4)
(8 3 5)
(8 4 8)
(10 4 6)
(12 4 8)
(12 5 7)
(12 6 8)
(13 5 8)
(16 6 8)
(15 6 9)
(16 6 12)
(17 7 10)
(20 8 12)
(20 8 12)

Table 11.4: Distribution of chips.
(a) Write a computer program to implement this algorithm.
(b) Prove that the algorithm will stop. Hint: Let $b$ be a vector with integer components that is a fixed vector for $P$ and such that each coordinate of the starting vector $a$ is less than or equal to the corresponding component of $b$. Show that, in the iteration, the components of the vectors are always increasing, and always less than or equal to the corresponding component of $b$.

30 (Coffman, Kaduta, and Shepp$^{16}$) A computing center keeps information on a tape in positions of unit length. During each time unit there is one request to occupy a unit of tape. When this arrives the first free unit is used. Also, during each second, each of the units that are occupied is vacated with probability $p$. Simulate this process, starting with an empty tape. Estimate the expected number of sites occupied for a given value of $p$. If $p$ is small, can you choose the tape long enough so that there is a small probability that a new job will have to be turned away (i.e., that all the sites are occupied)? Form a Markov chain with states the number of sites occupied. Modify the program FixedVector to compute the fixed vector. Use this to check your conjecture by simulation.

*31 (Alternate proof of Theorem 11.8) Let $P$ be the transition matrix of an ergodic Markov chain. Let $x$ be any column vector such that $Px = x$. Let $M$ be the maximum value of the components of $x$. Assume that $x_i = M$. Show that if $p_{ij} > 0$ then $x_j = M$. Use this to prove that $x$ must be a constant vector.

32 Let $P$ be the transition matrix of an ergodic Markov chain. Let $w$ be a fixed probability vector (i.e., $w$ is a row vector with $wP = w$). Show that if $w_i = 0$ and $p_{ji} > 0$ then $w_j = 0$. Use this to show that the fixed probability vector for an ergodic chain cannot have any 0 entries.

33 Find a Markov chain that is neither absorbing nor ergodic.
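For Exercise 29(a), the chip-moving algorithm might be sketched as follows. This is one possible reading of the description in the exercise, not Engle's own program; the names `chip_fixed_vector` and `P_example` are illustrative, and the loop relies on the termination claim of part (b) rather than enforcing an iteration limit.

```python
from fractions import Fraction as F
from functools import reduce
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def chip_fixed_vector(P):
    """Run the chip algorithm on a matrix of Fractions.

    Returns the final chip vector and the list of all chip vectors seen,
    alternating 'after redistribution' and 'after topping up'.
    """
    r = len(P)
    # a[i]: least common multiple of the denominators of the non-zero
    # entries in row i.
    a = [reduce(lcm, (p.denominator for p in row if p != 0)) for row in P]
    chips = a[:]
    history = [tuple(chips)]
    while True:
        # State i sends chips[i] * P[i][j] chips to state j; this is a whole
        # number because chips[i] is a multiple of a[i].
        moved = [int(sum(chips[i] * P[i][j] for i in range(r)))
                 for j in range(r)]
        history.append(tuple(moved))
        if moved == chips:          # unchanged by redistribution: fixed vector
            return tuple(chips), history
        # Add just enough chips to reach the next multiple of a[j].
        chips = [-(-m // a[j]) * a[j] for j, m in enumerate(moved)]
        history.append(tuple(chips))

# The example matrix from the exercise.
P_example = [[F(1, 2), F(1, 4), F(1, 4)],
             [F(1, 2), F(0),    F(1, 2)],
             [F(1, 2), F(1, 4), F(1, 4)]]

fixed, history = chip_fixed_vector(P_example)
for row in history:
    print(row)
```

Dividing the returned vector by the sum of its entries gives the fixed probability vector; for the example this is (1/2, 1/5, 3/10).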
11.4 Fundamental Limit Theorem for Regular Chains

The fundamental limit theorem for regular Markov chains states that if $P$ is a regular transition matrix then

$$\lim_{n \to \infty} P^n = W,$$

where $W$ is a matrix with each row equal to the unique fixed probability row vector $w$ for $P$. In this section we shall give two very different proofs of this theorem.

Our first proof is carried out by showing that, for any column vector $y$, $P^n y$ tends to a constant vector. As indicated in Section 11.3, this will show that $P^n$ converges to a matrix with constant columns or, equivalently, to a matrix with all rows the same.

The following lemma says that if an $r$-by-$r$ transition matrix has no zero entries, and $y$ is any column vector with $r$ entries, then the vector $Py$ has entries which are "closer together" than the entries are in $y$.

$^{16}$ E. G. Coffman, J. T. Kaduta, and L. A. Shepp, "On the Asymptotic Optimality of First-Storage Allocation," IEEE Trans. Software Engineering, vol. II (1985), pp. 235-239.

Lemma 11.1 Let $P$ be an $r$-by-$r$ transition matrix with no zero entries. Let $d$ be the smallest entry of the matrix. Let $y$ be a column vector with $r$ components, the largest of which is $M_0$ and the smallest $m_0$. Let $M_1$ and $m_1$ be the largest and smallest component, respectively, of the vector $Py$. Then

$$M_1 - m_1 \le (1 - 2d)(M_0 - m_0).$$

Proof. In the discussion following Theorem 11.7, it was noted that each entry in the vector $Py$ is a weighted average of the entries in $y$. The largest weighted average that could be obtained in the present case would occur if all but one of the entries of $y$ have value $M_0$ and one entry has value $m_0$, and this one small entry is weighted by the smallest possible weight, namely $d$. In this case, the weighted average would equal

$$d m_0 + (1 - d) M_0.$$

Similarly, the smallest possible weighted average equals

$$d M_0 + (1 - d) m_0.$$

Thus,

$$M_1 - m_1 \le \bigl(d m_0 + (1 - d) M_0\bigr) - \bigl(d M_0 + (1 - d) m_0\bigr) = (1 - 2d)(M_0 - m_0).$$

This completes the proof of the lemma.
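The contraction in Lemma 11.1 can be observed on a small example. The sketch below uses a 2-by-2 matrix with no zero entries and a starting vector chosen purely for illustration; the names `apply` and `spread` are not from the text.

```python
from fractions import Fraction as F

# Illustrative transition matrix with no zero entries, and a column vector y.
P = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]
y = [F(1), F(0)]

def apply(P, y):
    """Return the column vector P y."""
    return [sum(p * v for p, v in zip(row, y)) for row in P]

def spread(y):
    """Largest component minus smallest component."""
    return max(y) - min(y)

d = min(min(row) for row in P)        # smallest entry of P
spreads = [spread(y)]                 # M_0 - m_0
for _ in range(5):
    y = apply(P, y)
    spreads.append(spread(y))         # M_n - m_n for n = 1, ..., 5

for n in range(1, len(spreads)):
    bound = (1 - 2 * d) * spreads[n - 1]
    print(n, spreads[n], "<=", bound)
```

Here $d = 1/4$, so the lemma guarantees that the spread at least halves at every step; for this particular matrix it shrinks by a factor of 4.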
We turn now to the proof of the fundamental limit theorem for regular Markov chains.

Theorem 11.13 (Fundamental Limit Theorem for Regular Chains) If $P$ is the transition matrix for a regular Markov chain, then

$$\lim_{n \to \infty} P^n = W,$$

where $W$ is a matrix with all rows equal. Furthermore, all entries in $W$ are strictly positive.

Proof. We prove this theorem for the special case that $P$ has no 0 entries. The extension to the general case is indicated in Exercise 5. Let $y$ be any $r$-component column vector, where $r$ is the number of states of the chain. We assume that $r > 1$, since otherwise the theorem is trivial. Let $M_n$ and $m_n$ be, respectively, the maximum and minimum components of the vector $P^n y$. The vector $P^n y$ is obtained from the vector $P^{n-1} y$ by multiplying on the left by the matrix $P$. Hence each component of $P^n y$ is an average of the components of $P^{n-1} y$. Thus

$$M_0 \ge M_1 \ge M_2 \ge \cdots$$

and

$$m_0 \le m_1 \le m_2 \le \cdots.$$

Each sequence is monotone and bounded:

$$m_0 \le m_n \le M_n \le M_0.$$

Hence, each of these sequences will have a limit as $n$ tends to infinity. Let $M$ be the limit of $M_n$ and $m$ the limit of $m_n$. We know that $m \le M$. We shall prove that $M - m = 0$. This will be the case if $M_n - m_n$ tends to 0. Let $d$ be the smallest element of $P$. Since all entries of $P$ are strictly positive, we have $d > 0$. By our lemma

$$M_n - m_n \le (1 - 2d)(M_{n-1} - m_{n-1}).$$

From this we see that

$$M_n - m_n \le (1 - 2d)^n (M_0 - m_0).$$

Since $r \ge 2$, we must have $d \le 1/2$, so $0 \le 1 - 2d < 1$, so the difference $M_n - m_n$ tends to 0 as $n$ tends to infinity. Since every component of $P^n y$ lies between $m_n$ and $M_n$, each component must approach the same number $u = M = m$. This shows that

$$\lim_{n \to \infty} P^n y = \mathbf{u},$$

where $\mathbf{u}$ is a column vector all of whose components equal $u$.

Now let $y$ be the vector with $j$th component equal to 1 and all other components equal to 0. Then $P^n y$ is the $j$th column of $P^n$. Doing this for each $j$ proves that the columns of $P^n$ approach constant column vectors. That is, the rows of $P^n$ approach a common row vector $w$, or,

$$\lim_{n \to \infty} P^n = W.$$

It remains to show that all entries in $W$ are strictly positive. As before, let $y$ be the vector with $j$th component equal to 1 and all other components equal to 0. Then $Py$ is the $j$th column of $P$, and this column has all entries strictly positive. The minimum component of the vector $Py$ was defined to be $m_1$, hence $m_1 > 0$. Since $m_1 \le m$, we have $m > 0$. Note finally that this value of $m$ is just the $j$th component of $w$, so all components of $w$ are strictly positive. □

Doeblin's Proof

We give now a very different proof of the main part of the fundamental limit theorem for regular Markov chains. This proof was first given by Doeblin,$^{17}$ a brilliant young mathematician who was killed in his twenties in the Second World War.

$^{17}$ W. Doeblin, "Exposé de la Théorie des Chaînes Simples Constantes de Markov à un Nombre Fini d'États," Rev. Math. de l'Union Interbalkanique, vol. 2 (1937), pp. 77-105.

Theorem 11.14 Let $P$ be the transition matrix for a regular Markov chain with fixed vector $w$. Then for any initial probability vector $u$, $uP^n \to w$ as $n \to \infty$.

Proof. Let $X_0, X_1, \ldots$ be a Markov chain with transition matrix $P$ started in state $s_i$. Let $Y_0, Y_1, \ldots$ be a Markov chain with transition probability $P$ started with initial probabilities given by $w$. The $X$ and $Y$ processes are run independently of each other.

We consider also a third Markov chain $P^*$ which consists of watching both the $X$ and $Y$ processes. The states for $P^*$ are pairs $(s_i, s_j)$. The transition probabilities are given by

$$P^*[(i, j), (k, l)] = P(i, k) \cdot P(j, l).$$

Since $P$ is regular there is an $N$ such that $P^N(i, j) > 0$ for all $i$ and $j$. Thus for the $P^*$ chain it is also possible to go from any state $(s_i, s_j)$ to any other state $(s_k, s_l)$ in at most $N$ steps. That is, $P^*$ is also a regular Markov chain.

We know that a regular Markov chain will reach any state in a finite time. Let $T$ be the first time the chain $P^*$ is in a state of the form $(s_k, s_k)$.
In other words, $T$ is the first time that the $X$ and the $Y$ processes are in the same state. Then we have shown that

$$P[T > n] \to 0 \quad \text{as } n \to \infty.$$

If we watch the $X$ and $Y$ processes after the first time they are in the same state we would not predict any difference in their long range behavior. Since this will happen no matter how we started these two processes, it seems clear that the long range behavior should not depend upon the starting state. We now show that this is true.

We first note that if $n \ge T$, then since $X$ and $Y$ are both in the same state at time $T$,

$$P(X_n = j \mid n \ge T) = P(Y_n = j \mid n \ge T).$$

If we multiply both sides of this equation by $P(n \ge T)$, we obtain

$$P(X_n = j,\ n \ge T) = P(Y_n = j,\ n \ge T). \qquad (11.1)$$

We know that for all $n$,

$$P(Y_n = j) = w_j.$$

But

$$P(Y_n = j) = P(Y_n = j,\ n \ge T) + P(Y_n = j,\ n < T),$$

and the second summand on the right-hand side goes to 0 as $n$ goes to $\infty$, since $P(n < T)$ goes to 0 as $n$ goes to $\infty$. So,

$$P(Y_n = j,\ n \ge T) \to w_j$$

as $n$ goes to $\infty$. From Equation 11.1, we see that

$$P(X_n = j,\ n \ge T) \to w_j$$

as $n$ goes to $\infty$. But by similar reasoning to that used above, the difference between this last expression and $P(X_n = j)$ goes to 0 as $n$ goes to $\infty$. Therefore,

$$P(X_n = j) \to w_j$$

as $n$ goes to $\infty$. This completes the proof. □

In the above proof, we have said nothing about the rate at which the distributions of the $X_n$'s approach the fixed distribution $w$. In fact, it can be shown that$^{18}$

$$\sum_{j=1}^{r} |P(X_n = j) - w_j| \le 2P(T > n).$$

The left-hand side of this inequality can be viewed as the distance between the distribution of the Markov chain after $n$ steps, starting in state $s_i$, and the limiting distribution $w$.

Exercises

1 Define $P$ and $y$ by
$$P = \begin{pmatrix} .5 & .5 \\ .25 & .75 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$
Compute $Py$, $P^2 y$, and $P^4 y$ and show that the results are approaching a constant vector. What is this vector?

2 Let $P$ be a regular $r \times r$ transition matrix and $y$ any $r$-component column vector. Show that the value of the limiting constant vector for $P^n y$ is $wy$.

3 Let
$$P = \begin{pmatrix} 1 & 0 & 0 \\ .25 & 0 & .75 \\ 0 & 0 & 1 \end{pmatrix}$$
be a transition matrix of a Markov chain. Find two fixed vectors of $P$ that are linearly independent. Does this show that the Markov chain is not regular?
4 Describe the set of all fixed column vectors for the chain given in Exercise 3.

5 The theorem that $P^n \to W$ was proved only for the case that $P$ has no zero entries. Fill in the details of the following extension to the case that $P$ is regular. Since $P$ is regular, for some $N$, $P^N$ has no zeros. Thus, the proof given shows that $M_{nN} - m_{nN}$ approaches 0 as $n$ tends to infinity. However, the difference $M_n - m_n$ can never increase. (Why?) Hence, if we know that the differences obtained by looking at every $N$th time tend to 0, then the entire sequence must also tend to 0.

6 Let $P$ be a regular transition matrix and let $w$ be the unique non-zero fixed vector of $P$. Show that no entry of $w$ is 0.

$^{18}$ T. Lindvall, Lectures on the Coupling Method (New York: Wiley 1992).

7 Here is a trick to try on your friends. Shuffle a deck of cards and deal them out one at a time. Count the face cards each as ten. Ask your friend to look at one of the first ten cards; if this card is a six, she is to look at the card that turns up six cards later; if this card is a three, she is to look at the card that turns up three cards later, and so forth. Eventually she will reach a point where she is to look at a card that turns up $x$ cards later but there are not $x$ cards left. You then tell her the last card that she looked at even though you did not know her starting point. You tell her you do this by watching her, and she cannot disguise the times that she looks at the cards. In fact you just do the same procedure and, even though you do not start at the same point as she does, you will most likely end at the same point. Why?

8 Write a program to play the game in Exercise 7.

9 (Suggested by Peter Doyle) In the proof of Theorem 11.14, we assumed the existence of a fixed vector $w$.
To avoid this assumption, beef up the coupling argument to show (without assuming the existence of a stationary distribution $w$) that for appropriate constants $C$ and $r < 1$, the distance between $\alpha P^n$ and $\beta P^n$ is at most $C r^n$ for any starting distributions $\alpha$ and $\beta$. Apply this in the case where $\beta = \alpha P$ to conclude that the sequence $\alpha P^n$ is a Cauchy sequence, and that its limit is a matrix $W$ whose rows are all equal to a probability vector $w$ with $wP = w$. Note that the distance between $\alpha P^n$ and $w$ is at most $C r^n$, so in freeing ourselves from the assumption about having a fixed vector we've proved that the convergence to equilibrium takes place exponentially fast.

11.5 Mean First Passage Time for Ergodic Chains

In this section we consider two closely related descriptive quantities of interest for ergodic chains: the mean time to return to a state and the mean time to go from one state to another state.

Let $P$ be the transition matrix of an ergodic chain with states $s_1, s_2, \ldots, s_r$. Let $w = (w_1, w_2, \ldots, w_r)$ be the unique probability vector such that $wP = w$. Then, by the Law of Large Numbers for Markov chains, in the long run the process will spend a fraction $w_j$ of the time in state $s_j$. Thus, if we start in any state, the chain will eventually reach state $s_j$; in fact, it will be in state $s_j$ infinitely often.

Another way to see this is the following: Form a new Markov chain by making $s_j$ an absorbing state, that is, define $p_{jj} = 1$. If we start at any state other than $s_j$, this new process will behave exactly like the original chain up to the first time that state $s_j$ is reached. Since the original chain was an ergodic chain, it was possible to reach $s_j$ from any other state. Thus the new chain is an absorbing chain with a single absorbing state $s_j$ that will eventually be reached. So if we start the original chain at a state $s_i$ with $i \ne j$, we will eventually reach the state $s_j$.

Let $N$ be the fundamental matrix for the new chain.
The entries of $N$ give the expected number of times in each state before absorption. In terms of the original chain, these quantities give the expected number of times in each of the states before reaching state $s_j$ for the first time. The $i$th component of the vector $Nc$ gives the expected number of steps before absorption in the new chain, starting in state $s_i$. In terms of the old chain, this is the expected number of steps required to reach state $s_j$ for the first time starting at state $s_i$.

[Figure 11.5: The maze problem. The nine compartments form a 3-by-3 grid with rows 1 2 3, 6 5 4, and 7 8 9.]

Mean First Passage Time

Definition 11.7 If an ergodic Markov chain is started in state $s_i$, the expected number of steps to reach state $s_j$ for the first time is called the mean first passage time from $s_i$ to $s_j$. It is denoted by $m_{ij}$. By convention $m_{ii} = 0$. □

Example 11.24 Let us return to the maze example (Example 11.22). We shall make this ergodic chain into an absorbing chain by making state 5 an absorbing state. For example, we might assume that food is placed in the center of the maze and once the rat finds the food, he stays to enjoy it (see Figure 11.5).

The new transition matrix in canonical form is

$$
P = \begin{array}{c|cccccccc|c}
  & 1 & 2 & 3 & 4 & 6 & 7 & 8 & 9 & 5 \\ \hline
1 & 0 & 1/2 & 0 & 0 & 1/2 & 0 & 0 & 0 & 0 \\
2 & 1/3 & 0 & 1/3 & 0 & 0 & 0 & 0 & 0 & 1/3 \\
3 & 0 & 1/2 & 0 & 1/2 & 0 & 0 & 0 & 0 & 0 \\
4 & 0 & 0 & 1/3 & 0 & 0 & 0 & 0 & 1/3 & 1/3 \\
6 & 1/3 & 0 & 0 & 0 & 0 & 1/3 & 0 & 0 & 1/3 \\
7 & 0 & 0 & 0 & 0 & 1/2 & 0 & 1/2 & 0 & 0 \\
8 & 0 & 0 & 0 & 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\
9 & 0 & 0 & 0 & 1/2 & 0 & 0 & 1/2 & 0 & 0 \\ \hline
5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
$$

If we compute the fundamental matrix $N$, we obtain

$$
N = \frac{1}{8}\begin{pmatrix}
14 & 9 & 4 & 3 & 9 & 4 & 3 & 2 \\
6 & 14 & 6 & 4 & 4 & 2 & 2 & 2 \\
4 & 9 & 14 & 9 & 3 & 2 & 3 & 4 \\
2 & 4 & 6 & 14 & 2 & 2 & 4 & 6 \\
6 & 4 & 2 & 2 & 14 & 6 & 4 & 2 \\
4 & 3 & 2 & 3 & 9 & 14 & 9 & 4 \\
2 & 2 & 2 & 4 & 4 & 6 & 14 & 6 \\
2 & 3 & 4 & 9 & 3 & 4 & 9 & 14
\end{pmatrix}.
$$

The expected time to absorption for different starting states is given by the vector $Nc$, where

$$Nc = (6,\ 5,\ 6,\ 5,\ 5,\ 6,\ 5,\ 6)^T,$$

with components listed in the order of the transient states 1, 2, 3, 4, 6, 7, 8, 9. We see that, starting from compartment 1, it will take on the average six steps to reach food. It is clear from symmetry that we should get the same answer for starting at state 3, 7, or 9.
It is also clear that it should take one more step, starting at one of these states, than it would starting at 2, 4, 6, or 8. Some of the results obtained from $N$ are not so obvious. For instance, we note that the expected number of times in the starting state is 14/8 regardless of the state in which we start. □

Mean Recurrence Time

A quantity that is closely related to the mean first passage time is the mean recurrence time, defined as follows. Assume that we start in state $s_i$; consider the length of time before we return to $s_i$ for the first time. It is clear that we must return, since we either stay at $s_i$ the first step or go to some other state $s_j$, and from any other state $s_j$, we will eventually reach $s_i$ because the chain is ergodic.

Definition 11.8 If an ergodic Markov chain is started in state $s_i$, the expected number of steps to return to $s_i$ for the first time is the mean recurrence time for $s_i$. It is denoted by $r_i$. □

We need to develop some basic properties of the mean first passage time. Consider the mean first passage time from $s_i$ to $s_j$; assume that $i \ne j$. This may be computed as follows: take the expected number of steps required given the outcome of the first step, multiply by the probability that this outcome occurs, and add. If the first step is to $s_j$, the expected number of steps required is 1; if it is to some other state $s_k$, the expected number of steps required is $m_{kj}$ plus 1 for the step already taken. Thus,

$$m_{ij} = p_{ij} + \sum_{k \ne j} p_{ik}(m_{kj} + 1),$$

or, since $\sum_k p_{ik} = 1$,

$$m_{ij} = 1 + \sum_{k \ne j} p_{ik} m_{kj}. \qquad (11.2)$$

Similarly, starting in $s_i$, it must take at least one step to return. Considering all possible first steps gives us

$$r_i = \sum_k p_{ik}(m_{ki} + 1) \qquad (11.3)$$

$$\phantom{r_i} = 1 + \sum_k p_{ik} m_{ki}. \qquad (11.4)$$

Mean First Passage Matrix and Mean Recurrence Matrix

Let us now define two matrices $M$ and $D$. The $ij$th entry $m_{ij}$ of $M$ is the mean first passage time to go from $s_i$ to $s_j$ if $i \ne j$; the diagonal entries are 0. The matrix $M$ is called the mean first passage matrix.
The matrix $D$ is the matrix with all entries 0 except the diagonal entries $d_{ii} = r_i$. The matrix $D$ is called the mean recurrence matrix. Let $C$ be an $r \times r$ matrix with all entries 1. Using Equation 11.2 for the case $i \ne j$ and Equation 11.4 for the case $i = j$, we obtain the matrix equation

$$
M = PM + C - D, \tag{11.5}
$$

or

$$
(I - P)M = C - D. \tag{11.6}
$$

Equation 11.6 with $m_{ii} = 0$ implies Equations 11.2 and 11.4. We are now in a position to prove our first basic theorem.

Theorem 11.15 For an ergodic Markov chain, the mean recurrence time for state $s_i$ is $r_i = 1/w_i$, where $w_i$ is the $i$th component of the fixed probability vector for the transition matrix.

Proof. Multiplying both sides of Equation 11.6 by $w$ and using the fact that $w(I - P) = 0$ gives

$$
wC - wD = 0.
$$

Here $wC$ is a row vector with all entries 1 and $wD$ is a row vector with $i$th entry $w_i r_i$. Thus

$$
(1, 1, \ldots, 1) = (w_1 r_1, w_2 r_2, \ldots, w_n r_n)
$$

and $r_i = 1/w_i$, as was to be proved. □

Corollary 11.1 For an ergodic Markov chain, the components of the fixed probability vector $w$ are strictly positive.

Proof. We know that the values of $r_i$ are finite and so $w_i = 1/r_i$ cannot be 0. □

Example 11.25 In Example 11.22 we found the fixed probability vector for the maze example to be

$$
w = \left(\tfrac{1}{12},\ \tfrac{1}{8},\ \tfrac{1}{12},\ \tfrac{1}{8},\ \tfrac{1}{6},\ \tfrac{1}{8},\ \tfrac{1}{12},\ \tfrac{1}{8},\ \tfrac{1}{12}\right).
$$

Hence, the mean recurrence times are given by the reciprocals of these probabilities. That is,

$$
r = (12,\ 8,\ 12,\ 8,\ 6,\ 8,\ 12,\ 8,\ 12).
$$

Returning to the Land of Oz, we found that the weather in the Land of Oz could be represented by a Markov chain with states rain, nice, and snow. In Section 11.3 we found that the limiting vector was $w = (2/5, 1/5, 2/5)$. From this we see that the mean number of days between rainy days is 5/2, between nice days is 5, and between snowy days is 5/2. □

Fundamental Matrix

We shall now develop a fundamental matrix for ergodic chains that will play a role similar to that of the fundamental matrix $N = (I - Q)^{-1}$ for absorbing chains.
As was the case with absorbing chains, the fundamental matrix can be used to find a number of interesting quantities involving ergodic chains. Using this matrix, we will give a method for calculating the mean first passage times for ergodic chains that is easier to use than the method given above. In addition, we will state (but not prove) the Central Limit Theorem for Markov Chains, the statement of which uses the fundamental matrix.

We begin by considering the case that $P$ is the transition matrix of a regular Markov chain. Since there are no absorbing states, we might be tempted to try $Z = (I - P)^{-1}$ for a fundamental matrix. But $I - P$ does not have an inverse. To see this, recall that a matrix $R$ has an inverse if and only if $Rx = 0$ implies $x = 0$. But since $Pc = c$ we have $(I - P)c = 0$, and so $I - P$ does not have an inverse.

We recall that if we have an absorbing Markov chain, and $Q$ is the restriction of the transition matrix to the set of transient states, then the fundamental matrix $N$ could be written as

$$
N = I + Q + Q^2 + \cdots .
$$

The reason that this power series converges is that $Q^n \to 0$, so this series acts like a convergent geometric series.

This idea might prompt one to try to find a similar series for regular chains. Since we know that $P^n \to W$, we might consider the series

$$
I + (P - W) + (P^2 - W) + \cdots . \tag{11.7}
$$

We now use special properties of $P$ and $W$ to rewrite this series. The special properties are: 1) $PW = W$, and 2) $W^k = W$ for all positive integers $k$. These facts are easy to verify, and are left as an exercise (see Exercise 22). Since $wP = w$, we also have $WP = W$, so $P$ and $W$ commute and the Binomial Theorem applies to powers of $P - W$. Using these facts, we see that

$$
\begin{aligned}
(P - W)^n &= \sum_{i=0}^{n} (-1)^i \binom{n}{i} P^{n-i} W^i \\
&= P^n + \sum_{i=1}^{n} (-1)^i \binom{n}{i} W^i \\
&= P^n + \sum_{i=1}^{n} (-1)^i \binom{n}{i} W \\
&= P^n + \left( \sum_{i=1}^{n} (-1)^i \binom{n}{i} \right) W .
\end{aligned}
$$

If we expand the expression $(1 - 1)^n$, using the Binomial Theorem, we obtain the expression in parentheses above, except that we have an extra term (which equals 1). Since $(1 - 1)^n = 0$, we see that the expression in parentheses equals $-1$. So we have

$$
(P - W)^n = P^n - W ,
$$

for all $n \ge 1$.
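The identity $(P - W)^n = P^n - W$ can be checked numerically. The following is a minimal sketch, not part of the text's own programs: it uses the Land of Oz transition matrix and its fixed vector $w = (2/5, 1/5, 2/5)$ with exact rational arithmetic, and the helper names (`matmul`, `matsub`, `matpow`, `identity_holds`) are chosen here purely for illustration.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matsub(A, B):
    """Entrywise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matpow(A, n):
    """A raised to the n-th power by repeated multiplication (n >= 0)."""
    size = len(A)
    R = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        R = matmul(R, A)
    return R

# Land of Oz chain (states rain, nice, snow) and its fixed vector.
P = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 2), F(0),    F(1, 2)],
     [F(1, 4), F(1, 4), F(1, 2)]]
w = [F(2, 5), F(1, 5), F(2, 5)]
W = [list(w) for _ in range(3)]   # every row of W is w

def identity_holds(n):
    """True when (P - W)^n equals P^n - W exactly."""
    return matpow(matsub(P, W), n) == matsub(matpow(P, n), W)

# The two special properties used in the derivation.
print(matmul(P, W) == W, matmul(W, W) == W)
# The identity itself, for n = 1, ..., 6.
print([identity_holds(n) for n in range(1, 7)])
```

Note that the identity is claimed only for $n \ge 1$; for $n = 0$ the left side is $I$ while the right side is $I - W$.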
We can now rewrite the series in 11.7 as

$$
I + (P - W) + (P - W)^2 + \cdots .
$$

Since the $n$th term in this series is equal to $P^n - W$, the $n$th term goes to 0 as $n$ goes to infinity. This is sufficient to show that this series converges, and sums to the inverse of the matrix $I - P + W$. We call this inverse the fundamental matrix associated with the chain, and we denote it by $Z$.

In the case that the chain is ergodic, but not regular, it is not true that $P^n \to W$ as $n \to \infty$. Nevertheless, the matrix $I - P + W$ still has an inverse, as we will now show.

Proposition 11.1 Let $P$ be the transition matrix of an ergodic chain, and let $W$ be the matrix all of whose rows are the fixed probability row vector for $P$. Then the matrix

$$
I - P + W
$$

has an inverse.

Proof. Let $x$ be a column vector such that

$$
(I - P + W)x = 0 .
$$

To prove the proposition, it is sufficient to show that $x$ must be the zero vector. Multiplying this equation by $w$ and using the fact that $w(I - P) = 0$ and $wW = w$, we have

$$
w(I - P + W)x = wx = 0 .
$$

Since every row of $W$ is $w$, this gives $Wx = 0$. Therefore,

$$
(I - P)x = 0 .
$$

But this means that $x = Px$ is a fixed column vector for $P$. By Theorem 11.10, this can only happen if $x$ is a constant vector. Since $wx = 0$, and $w$ has strictly positive entries, we see that $x = 0$. This completes the proof. □

As in the regular case, we will call the inverse of the matrix $I - P + W$ the fundamental matrix for the ergodic chain with transition matrix $P$, and we will use $Z$ to denote this fundamental matrix.

Example 11.26 Let $P$ be the transition matrix for the weather in the Land of Oz. Then

$$
\begin{aligned}
I - P + W &= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
- \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{pmatrix}
+ \begin{pmatrix} 2/5 & 1/5 & 2/5 \\ 2/5 & 1/5 & 2/5 \\ 2/5 & 1/5 & 2/5 \end{pmatrix} \\
&= \begin{pmatrix} 9/10 & -1/20 & 3/20 \\ -1/10 & 6/5 & -1/10 \\ 3/20 & -1/20 & 9/10 \end{pmatrix},
\end{aligned}
$$

so

$$
Z = (I - P + W)^{-1} = \begin{pmatrix} 86/75 & 1/25 & -14/75 \\ 2/25 & 21/25 & 2/25 \\ -14/75 & 1/25 & 86/75 \end{pmatrix}.
$$

□

Using the Fundamental Matrix to Calculate the Mean First Passage Matrix

We shall show how one can obtain the mean first passage matrix $M$ from the fundamental matrix $Z$ for an ergodic Markov chain.
Before stating the theorem which gives the first passage times, we need a few facts about $Z$.

Lemma 11.2 Let $Z = (I - P + W)^{-1}$, and let $c$ be a column vector of all 1's. Then

$$
Zc = c , \qquad wZ = w , \qquad \text{and} \qquad Z(I - P) = I - W .
$$

Proof. Since $Pc = c$ and $Wc = c$,

$$
c = (I - P + W)c .
$$

If we multiply both sides of this equation on the left by $Z$, we obtain

$$
Zc = c .
$$

Similarly, since $wP = w$ and $wW = w$,

$$
w = w(I - P + W) .
$$

If we multiply both sides of this equation on the right by $Z$, we obtain

$$
wZ = w .
$$

Finally, we have

$$
\begin{aligned}
(I - P + W)(I - W) &= I - W - P + W + W - W \\
&= I - P .
\end{aligned}
$$

Multiplying on the left by $Z$, we obtain

$$
I - W = Z(I - P) .
$$

This completes the proof. □

The following theorem shows how one can obtain the mean first passage times from the fundamental matrix.

Theorem 11.16 The mean first passage matrix $M$ for an ergodic chain is determined from the fundamental matrix $Z$ and the fixed row probability vector $w$ by

$$
m_{ij} = \frac{z_{jj} - z_{ij}}{w_j} .
$$

Proof. We showed in Equation 11.6 that

$$
(I - P)M = C - D .
$$

Thus,

$$
Z(I - P)M = ZC - ZD ,
$$

and from Lemma 11.2,

$$
Z(I - P)M = C - ZD .
$$

Again using Lemma 11.2, we have

$$
M - WM = C - ZD
$$

or

$$
M = C - ZD + WM .
$$

From this equation, we see that

$$
m_{ij} = 1 - z_{ij} r_j + (wM)_j . \tag{11.8}
$$

But $m_{jj} = 0$, and so

$$
0 = 1 - z_{jj} r_j + (wM)_j ,
$$

or

$$
(wM)_j = z_{jj} r_j - 1 . \tag{11.9}
$$

From Equations 11.8 and 11.9, we have

$$
m_{ij} = (z_{jj} - z_{ij}) \cdot r_j .
$$

Since $r_j = 1/w_j$,

$$
m_{ij} = \frac{z_{jj} - z_{ij}}{w_j} .
$$

□

Example 11.27 (Example 11.26 continued) In the Land of Oz example, we find that

$$
Z = (I - P + W)^{-1} = \begin{pmatrix} 86/75 & 1/25 & -14/75 \\ 2/25 & 21/25 & 2/25 \\ -14/75 & 1/25 & 86/75 \end{pmatrix}.
$$

We have also seen that $w = (2/5, 1/5, 2/5)$. So, for example,

$$
m_{12} = \frac{z_{22} - z_{12}}{w_2} = \frac{21/25 - 1/25}{1/5} = 4 ,
$$

by Theorem 11.16. Carrying out the calculations for the other entries of $M$, we obtain

$$
M = \begin{pmatrix} 0 & 4 & 10/3 \\ 8/3 & 0 & 8/3 \\ 10/3 & 4 & 0 \end{pmatrix}.
$$

□

Computation

The program ErgodicChain calculates the fundamental matrix, the fixed vector, the mean recurrence matrix $D$, and the mean first passage matrix $M$. We have run the program for the Ehrenfest urn model (Example 11.8).
We obtain:

$$
P = \begin{array}{c|ccccc}
  & 0 & 1 & 2 & 3 & 4 \\ \hline
0 & .0000 & 1.0000 & .0000 & .0000 & .0000 \\
1 & .2500 & .0000 & .7500 & .0000 & .0000 \\
2 & .0000 & .5000 & .0000 & .5000 & .0000 \\
3 & .0000 & .0000 & .7500 & .0000 & .2500 \\
4 & .0000 & .0000 & .0000 & 1.0000 & .0000
\end{array}
$$

$$
w = \begin{array}{ccccc}
0 & 1 & 2 & 3 & 4 \\ \hline
.0625 & .2500 & .3750 & .2500 & .0625
\end{array}
$$

$$
r = \begin{array}{ccccc}
0 & 1 & 2 & 3 & 4 \\ \hline
16.0000 & 4.0000 & 2.6667 & 4.0000 & 16.0000
\end{array}
$$

$$
M = \begin{array}{c|ccccc}
  & 0 & 1 & 2 & 3 & 4 \\ \hline
0 & .0000 & 1.0000 & 2.6667 & 6.3333 & 21.3333 \\
1 & 15.0000 & .0000 & 1.6667 & 5.3333 & 20.3333 \\
2 & 18.6667 & 3.6667 & .0000 & 3.6667 & 18.6667 \\
3 & 20.3333 & 5.3333 & 1.6667 & .0000 & 15.0000 \\
4 & 21.3333 & 6.3333 & 2.6667 & 1.0000 & .0000
\end{array}
$$

From the mean first passage matrix, we see that the mean time to go from 0 balls in urn 1 to 2 balls in urn 1 is 2.6667 steps while the mean time to go from 2 balls in urn 1 to 0 balls in urn 1 is 18.6667. This reflects the fact that the model exhibits a central tendency. Of course, the physicist is interested in the case of a large number of molecules, or balls, and so we should consider this example for $n$ so large that we cannot compute it even with a computer.

Ehrenfest Model

Example 11.28 (Example 11.23 continued) Let us consider the Ehrenfest model (see Example 11.8) for gas diffusion for the general case of $2n$ balls. Every second, one of the $2n$ balls is chosen at random and moved from the urn it was in to the other urn. If there are $i$ balls in the first urn, then with probability $i/2n$ we take one of them out and put it in the second urn, and with probability $(2n - i)/2n$ we take a ball from the second urn and put it in the first urn. At each second we let the number $i$ of balls in the first urn be the state of the system. Then from state $i$ we can pass only to state $i - 1$ and $i + 1$, and the transition probabilities are given by

$$
p_{ij} = \begin{cases}
\dfrac{i}{2n}, & \text{if } j = i - 1, \\[1ex]
1 - \dfrac{i}{2n}, & \text{if } j = i + 1, \\[1ex]
0, & \text{otherwise.}
\end{cases}
$$

This defines the transition matrix of an ergodic, non-regular Markov chain (see Exercise 15). Here the physicist is interested in long-term predictions about the state occupied.
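The transition probabilities just given are easy to turn into a matrix for any number of balls. As a minimal sketch (the function names `ehrenfest_matrix` and `times` are illustrative and are not taken from the ErgodicChain program), the code below builds the matrix for 4 balls, which is the case shown in the program output above, and checks with exact fractions that the printed vector $w$ is fixed and that the printed recurrence times are its reciprocals.

```python
from fractions import Fraction as F
from math import comb

def ehrenfest_matrix(balls):
    """Transition matrix on states 0..balls (number of balls in urn 1)."""
    P = [[F(0)] * (balls + 1) for _ in range(balls + 1)]
    for i in range(balls + 1):
        if i > 0:
            P[i][i - 1] = F(i, balls)        # a ball leaves urn 1
        if i < balls:
            P[i][i + 1] = 1 - F(i, balls)    # a ball enters urn 1
    return P

def times(w, P):
    """Row vector w multiplied by the matrix P."""
    return [sum(w[i] * P[i][j] for i in range(len(w)))
            for j in range(len(P[0]))]

P = ehrenfest_matrix(4)

# The printed w = (.0625, .25, .375, .25, .0625), written as exact fractions.
w = [F(comb(4, i), 2 ** 4) for i in range(5)]

print(times(w, P) == w)                  # is w a fixed vector of P?
r = [1 / wi for wi in w]                 # mean recurrence times, r_i = 1/w_i
print([float(x) for x in r])
```

Calling `ehrenfest_matrix` with a larger even number of balls gives the general $2n$-ball chain described in the example.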
In Example 11.23, we gave an intuitive reason for expecting that the fixed vector $w$ is the binomial distribution with parameters $2n$ and $1/2$. It is easy to check that this is correct. So,

$$
w_i = \frac{\binom{2n}{i}}{2^{2n}} .
$$

Thus the mean recurrence time for state $i$ is

$$
r_i = \frac{2^{2n}}{\binom{2n}{i}} .
$$

Consider in particular the central term $i = n$. We have seen that this term is approximately $1/\sqrt{\pi n}$. Thus we may approximate $r_n$ by $\sqrt{\pi n}$.

This model was used to explain the concept of reversibility in physical systems. Assume that we let our system run until it is in equilibrium. At this point, a movie is made, showing the system's progress. The movie is then shown to you, and you are asked to tell if the movie was shown in the forward or the reverse direction. It would seem that there should always be a tendency to move toward an equal proportion of balls so that the correct order of time should be the one with the most transitions from $i$ to $i - 1$ if $i > n$ and $i$ to $i + 1$ if $i < n$.

In Figure 11.6 we show the results of simulating the Ehrenfest urn model for the case of $n = 50$ and 1000 time units, using the program EhrenfestUrn. The top graph shows these results graphed in the order in which they occurred and the bottom graph shows the same results but with time reversed. There is no apparent difference.

[Figure 11.6: Ehrenfest simulation. Two panels, "Time forward" and "Time reversed"; the horizontal axis runs from 0 to 1000 time units and the vertical axis from 40 to 65.]

We note that if we had not started in equilibrium, the two graphs would typically look quite different. □

Reversibility

If the Ehrenfest model is started in equilibrium, then the process has no apparent time direction. The reason for this is that this process has a property called reversibility. Define $X_n$ to be the number of balls in the left urn at step $n$. We can calculate, for a general ergodic chain, the reverse transition probability:
$$
\begin{aligned}
P(X_{n-1} = j \mid X_n = i) &= \frac{P(X_{n-1} = j,\ X_n = i)}{P(X_n = i)} \\
&= \frac{P(X_{n-1} = j)\, P(X_n = i \mid X_{n-1} = j)}{P(X_n = i)} \\
&= \frac{P(X_{n-1} = j)\, p_{ji}}{P(X_n = i)} .
\end{aligned}
$$

In general, this will depend upon $n$, since $P(X_n = i)$ and also $P(X_{n-1} = j)$ change with $n$. However, if we start with the vector $w$ or wait until equilibrium is reached, this will not be the case. Then we can define

$$
p^*_{ij} = \frac{w_j p_{ji}}{w_i}
$$

as a transition matrix for the process watched with time reversed.

Let us calculate a typical transition probability for the reverse chain $P^* = \{p^*_{ij}\}$ in the Ehrenfest model. For example,

$$
\begin{aligned}
p^*_{i,i-1} &= \frac{w_{i-1}\, p_{i-1,i}}{w_i}
= \frac{\binom{2n}{i-1}}{2^{2n}} \times \frac{2n - i + 1}{2n} \times \frac{2^{2n}}{\binom{2n}{i}} \\
&= \frac{(2n)!}{(i-1)!\,(2n - i + 1)!} \times \frac{(2n - i + 1)\, i!\, (2n - i)!}{2n\,(2n)!} \\
&= \frac{i}{2n} = p_{i,i-1} .
\end{aligned}
$$

Similar calculations for the other transition probabilities show that $P^* = P$. When this occurs the process is called reversible. Clearly, an ergodic chain is reversible if, and only if, for every pair of states $s_i$ and $s_j$, $w_i p_{ij} = w_j p_{ji}$. In particular, for the Ehrenfest model this means that $w_i p_{i,i-1} = w_{i-1} p_{i-1,i}$. Thus, in equilibrium, the pairs $(i, i - 1)$ and $(i - 1, i)$ should occur with the same frequency. While many of the Markov chains that occur in applications are reversible, this is a very strong condition. In Exercise 12 you are asked to find an example of a Markov chain which is not reversible.

The Central Limit Theorem for Markov Chains

Suppose that we have an ergodic Markov chain with states $s_1, s_2, \ldots, s_k$. It is natural to consider the distribution of the random variables $S_j^{(n)}$, which denotes the number of times that the chain is in state $s_j$ in the first $n$ steps. The $j$th component $w_j$ of the fixed probability row vector $w$ is the proportion of times that the chain is in state $s_j$ in the long run. Hence, it is reasonable to conjecture that the expected value of the random variable $S_j^{(n)}$, as $n \to \infty$, is asymptotic to $n w_j$, and it is easy to show that this is the case (see Exercise 23).

It is also natural to ask whether there is a limiting distribution of the random variables $S_j^{(n)}$.
The answer is yes, and in fact, this limiting distribution is the normal distribution. As in the case of independent trials, one must normalize these random variables. Thus, we must subtract from $S_j^{(n)}$ its expected value, and then divide by its standard deviation. In both cases, we will use the asymptotic values of these quantities, rather than the values themselves. Thus, in the first case, we will use the value $n w_j$. It is not so clear what we should use in the second case. It turns out that the quantity

$$
\sigma_j^2 = 2 w_j z_{jj} - w_j - w_j^2 \tag{11.10}
$$

represents the asymptotic variance. Armed with these ideas, we can state the following theorem.

Theorem 11.17 (Central Limit Theorem for Markov Chains) For an ergodic chain, for any real numbers $r < s$, we have

$$
P\left( r < \frac{S_j^{(n)} - n w_j}{\sqrt{n \sigma_j^2}} < s \right) \to \frac{1}{\sqrt{2\pi}} \int_r^s e^{-x^2/2}\, dx
$$

as $n \to \infty$, for any choice of starting state, where $\sigma_j^2$ is the quantity defined in Equation 11.10. □

Historical Remarks

Markov chains were introduced by Andrei Andreevich Markov (1856-1922) and were named in his honor. He was a talented undergraduate who received a gold medal for his undergraduate thesis at St. Petersburg University. Besides being an active research mathematician and teacher, he was also active in politics and participated in the liberal movement in Russia at the beginning of the twentieth century. In 1913, when the government celebrated the 300th anniversary of the House of Romanov, Markov organized a counter-celebration of the 200th anniversary of Bernoulli's discovery of the Law of Large Numbers.

Markov was led to develop Markov chains as a natural extension of sequences of independent random variables. In his first paper, in 1906, he proved that for a Markov chain with positive transition probabilities and numerical states the average of the outcomes converges to the expected value of the limiting distribution (the fixed vector). In a later paper he proved the central limit theorem for such chains. Writing about Markov, A. P.
Youschkevitch remarks:

    Markov arrived at his chains starting from the internal needs of probability theory, and he never wrote about their applications to physical science. For him the only real examples of the chains were literary texts, where the two states denoted the vowels and consonants.¹⁹

In a paper written in 1913,²⁰ Markov chose a sequence of 20,000 letters from Pushkin's Eugene Onegin to see if this sequence can be approximately considered a simple chain. He obtained the Markov chain with transition matrix

$$
\begin{array}{c|cc}
  & \text{vowel} & \text{consonant} \\ \hline
\text{vowel} & .128 & .872 \\
\text{consonant} & .663 & .337
\end{array}
$$

The fixed vector for this chain is (.432, .568), indicating that we should expect about 43.2 percent vowels and 56.8 percent consonants in the novel, which was borne out by the actual count.

Claude Shannon considered an interesting extension of this idea in his book The Mathematical Theory of Communication,²¹ in which he developed the information-theoretic concept of entropy. Shannon considers a series of Markov chain approximations to English prose. He does this first by chains in which the states are letters and then by chains in which the states are words. For example, for the case of words he presents first a simulation where the words are chosen independently but with appropriate frequencies.

    REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

He then notes the increased resemblance to ordinary English text when the words are chosen as a Markov chain, in which case he obtains

    THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

A simulation like the last one is carried out by opening a book and choosing the first word, say it is the.
Then the book is read until the word the appears again and the word after this is chosen as the second word, which turned out to be head. The book is then read until the word head appears again and the next word, and, is chosen, and so on.

¹⁹ See Dictionary of Scientific Biography, ed. C. C. Gillispie (New York: Scribner's Sons, 1970), pp. 124-130.
²⁰ A. A. Markov, "An Example of Statistical Analysis of the Text of Eugene Onegin Illustrating the Association of Trials into a Chain," Bulletin de l'Académie Impériale des Sciences de St. Petersburg, ser. 6, vol. 7 (1913), pp. 153-162.
²¹ C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (Urbana: Univ. of Illinois Press, 1964).

Other early examples of the use of Markov chains occurred in Galton's study of the problem of survival of family names in 1889 and in the Markov chain introduced by P. and T. Ehrenfest in 1907 for diffusion. Poincaré in 1912 discussed card shuffling in terms of an ergodic Markov chain defined on a permutation group. Brownian motion, a continuous time version of random walk, was introduced in 1900-1901 by L. Bachelier in his study of the stock market, and in 1905-1907 in the works of A. Einstein and M. Smoluchowsky in their study of physical processes.

One of the first systematic studies of finite Markov chains was carried out by M. Fréchet.²² The treatment of Markov chains in terms of the two fundamental matrices that we have used was developed by Kemeny and Snell²³ to avoid the use of eigenvalues that one of these authors found too complex. The fundamental matrix $N$ occurred also in the work of J. L. Doob and others in studying the connection between Markov processes and classical potential theory. The fundamental matrix $Z$ for ergodic chains appeared first in the work of Fréchet, who used it to find the limiting variance for the central limit theorem for Markov chains.
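The computational results of this section (the fixed vector $w$, the fundamental matrix $Z = (I - P + W)^{-1}$, the mean recurrence times $r_i = 1/w_i$ of Theorem 11.15, and the mean first passage matrix of Theorem 11.16) can be collected in a few lines of code. What follows is only a sketch in the spirit of the ErgodicChain program, whose source is not reproduced here; the function names `inverse` and `ergodic_chain` are our own, and the code assumes that the input is the transition matrix of an ergodic chain given with exact `Fraction` entries.

```python
from fractions import Fraction as F

def inverse(A):
    """Gauss-Jordan inverse of a nonsingular square matrix of Fractions."""
    n = len(A)
    aug = [list(row) + [F(int(i == j)) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ergodic_chain(P):
    """Return (w, Z, r, M) for an ergodic transition matrix P."""
    n = len(P)
    # Solve w(I - P) = 0 with sum(w) = 1: take the transposed system and
    # replace its last (redundant) equation by the normalization.
    A = [[F(int(i == j)) - P[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [F(1)] * n
    w = [row[-1] for row in inverse(A)]      # solution of A w = e_n
    # Fundamental matrix Z = (I - P + W)^(-1), every row of W being w.
    Z = inverse([[F(int(i == j)) - P[i][j] + w[j] for j in range(n)]
                 for i in range(n)])
    r = [1 / wi for wi in w]                 # Theorem 11.15
    M = [[(Z[j][j] - Z[i][j]) / w[j] for j in range(n)]
         for i in range(n)]                  # Theorem 11.16
    return w, Z, r, M

# Land of Oz chain (states rain, nice, snow).
P = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 2), F(0),    F(1, 2)],
     [F(1, 4), F(1, 4), F(1, 2)]]
w, Z, r, M = ergodic_chain(P)
for name, value in (("w", [w]), ("Z", Z), ("r", [r]), ("M", M)):
    print(name)
    for row in value:
        print("  ", [str(x) for x in row])
```

Exact fractions are used so that the output can be compared entry by entry with Examples 11.26 and 11.27; for large chains a floating-point linear algebra routine would be the more practical choice.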
²² M. Fréchet, "Théorie des événements en chaîne dans le cas d'un nombre fini d'états possibles," in Recherches théoriques Modernes sur le calcul des probabilités, vol. 2 (Paris, 1938).
²³ J. G. Kemeny and J. L. Snell, Finite Markov Chains.

Exercises

1 Consider the Markov chain with transition matrix

$$
P = \begin{pmatrix} 1/2 & 1/2 \\ 1/4 & 3/4 \end{pmatrix}.
$$

Find the fundamental matrix $Z$ for this chain. Compute the mean first passage matrix using $Z$.

2 A study of the strengths of Ivy League football teams shows that if a school has a strong team one year it is equally likely to have a strong team or average team next year; if it has an average team, half the time it is average next year, and if it changes it is just as likely to become strong as weak; if it is weak it has 2/3 probability of remaining so and 1/3 of becoming average.

(a) A school has a strong team. On the average, how long will it be before it has another strong team?
(b) A school has a weak team; how long (on the average) must the alumni wait for a strong team?

3 Consider Example 11.4 with $a = .5$ and $b = .75$. Assume that the President says that he or she will run. Find the expected length of time before the first time the answer is passed on incorrectly.

4 Find the mean recurrence time for each state of Example 11.4 for $a = .5$ and $b = .75$. Do the same for general $a$ and $b$.

5 A die is rolled repeatedly. Show by the results of this section that the mean time between occurrences of a given number is 6.

6 For the Land of Oz example (Example 11.1), make rain into an absorbing state and find the fundamental matrix $N$. Interpret the results obtained from this chain in terms of the original chain.

7 A rat runs through the maze shown in Figure 11.7. At each step it leaves the room it is in by choosing at random one of the doors out of the room.

[Figure 11.7: Maze for Exercise 7.]

(a) Give the transition matrix $P$ for this Markov chain.
(b) Show that it is an ergodic chain but not a regular chain.
(c) Find the fixed vector.
(d) Find the expected number of steps before reaching Room 5 for the first time, starting in Room 1.

8 Modify the program ErgodicChain so that you can compute the basic quantities for the queueing example of Exercise 11.3.20. Interpret the mean recurrence time for state 0.

9 Consider a random walk on a circle of circumference $n$. The walker takes one unit step clockwise with probability $p$ and one unit counterclockwise with probability $q = 1 - p$. Modify the program ErgodicChain to allow you to input $n$ and $p$ and compute the basic quantities for this chain.

(a) For which values of $n$ is this chain regular? ergodic?
(b) What is the limiting vector $w$?
(c) Find the mean first passage matrix for $n = 5$ and $p = .5$. Verify that $m_{ij} = d(n - d)$, where $d$ is the clockwise distance from $i$ to $j$.

10 Two players match pennies and have between them a total of 5 pennies. If at any time one player has all of the pennies, to keep the game going, he gives one back to the other player and the game will continue. Show that this game can be formulated as an ergodic chain. Study this chain using the program ErgodicChain.

11 Calculate the reverse transition matrix for the Land of Oz example (Example 11.1). Is this chain reversible?

12 Give an example of a three-state ergodic Markov chain that is not reversible.

13 Let $P$ be the transition matrix of an ergodic Markov chain and $P^*$ the reverse transition matrix. Show that they have the same fixed probability vector $w$.

14 If $P$ is a reversible Markov chain, is it necessarily true that the mean time to go from state $i$ to state $j$ is equal to the mean time to go from state $j$ to state $i$? Hint: Try the Land of Oz example (Example 11.1).

15 Show that any ergodic Markov chain with a symmetric transition matrix (i.e., $p_{ij} = p_{ji}$) is reversible.

16 (Crowell²⁴) Let $P$ be the transition matrix of an ergodic Markov chain.
Show that

$$
(I + P + \cdots + P^{n-1})(I - P + W) = I - P^n + nW ,
$$

and from this show that

$$
\frac{I + P + \cdots + P^{n-1}}{n} \to W ,
$$

as $n \to \infty$.

17 An ergodic Markov chain is started in equilibrium (i.e., with initial probability vector $w$). The mean time until the next occurrence of state $s_i$ is $\bar{m}_i = \sum_k w_k m_{ki} + w_i r_i$. Show that $\bar{m}_i = z_{ii}/w_i$, by using the facts that $wZ = w$ and $m_{ki} = (z_{ii} - z_{ki})/w_i$.

18 A perpetual craps game goes on at Charley's. Jones comes into Charley's on an evening when there have already been 100 plays. He plans to play until the next time that snake eyes (a pair of ones) are rolled. Jones wonders how many times he will play. On the one hand he realizes that the average time between snake eyes is 36 so he should play about 18 times as he is equally likely to have come in on either side of the halfway point between occurrences of snake eyes. On the other hand, the dice have no memory, and so it would seem that he would have to play for 36 more times no matter what the previous outcomes have been. Which, if either, of Jones's arguments do you believe? Using the result of Exercise 17, calculate the expected time to reach snake eyes, in equilibrium, and see if this resolves the apparent paradox. If you are still in doubt, simulate the experiment to decide which argument is correct. Can you give an intuitive argument which explains this result?

19 Show that, for an ergodic Markov chain (see Theorem 11.16),

$$
\sum_j m_{ij} w_j = \sum_j z_{jj} - 1 = K .
$$

The second expression above shows that the number $K$ is independent of $i$. The number $K$ is called Kemeny's constant. A prize was offered to the first person to give an intuitively plausible reason for the above sum to be independent of $i$. (See also Exercise 24.)

²⁴ Private communication.

[Figure 11.8: Simplified Monopoly. A board of four squares, GO, A, B, and C, carrying the amounts 15, -30, -5, and 20 respectively.]
20 Consider a game played as follows: You are given a regular Markov chain with transition matrix $P$, fixed probability vector $w$, and a payoff function $f$ which assigns to each state $s_i$ an amount $f_i$ which may be positive or negative. Assume that $wf = 0$. You watch this Markov chain as it evolves, and every time you are in state $s_i$ you receive an amount $f_i$. Show that your expected winning after $n$ steps can be represented by a column vector $g^{(n)}$, with

$$
g^{(n)} = (I + P + P^2 + \cdots + P^n) f .
$$

Show that as $n \to \infty$, $g^{(n)} \to g$ with $g = Zf$.

21 A highly simplified game of "Monopoly" is played on a board with four squares as shown in Figure 11.8. You start at GO. You roll a die and move clockwise around the board a number of squares equal to the number that turns up on the die. You collect or pay an amount indicated on the square on which you land. You then roll the die again and move around the board in the same manner from your last position. Using the result of Exercise 20, estimate the amount you should expect to win in the long run playing this version of Monopoly.

22 Show that if $P$ is the transition matrix of a regular Markov chain, and $W$ is the matrix each of whose rows is the fixed probability vector corresponding to $P$, then $PW = W$, and $W^k = W$ for all positive integers $k$.

23 Assume that an ergodic Markov chain has states $s_1, s_2, \ldots, s_k$. Let $S_j^{(n)}$ denote the number of times that the chain is in state $s_j$ in the first $n$ steps. Let $w$ denote the fixed probability row vector for this chain. Show that, regardless of the starting state, the expected value of $S_j^{(n)}$, divided by $n$, tends to $w_j$ as $n \to \infty$. Hint: If the chain starts in state $s_i$, then the expected value of $S_j^{(n)}$ is given by the expression

$$
\sum_{h=0}^{n} p_{ij}^{(h)} .
$$

24 In the course of a walk with Snell along Minnehaha Avenue in Minneapolis in the fall of 1983, Peter Doyle²⁵ suggested the following explanation for the constancy of Kemeny's constant (see Exercise 19). Choose a target state according to the fixed vector $w$.
Start from state $i$ and wait until the time $T$ that the target state occurs for the first time. Let $K_i$ be the expected value of $T$. Observe that

$$
K_i + w_i \cdot 1/w_i = \sum_j P_{ij} K_j + 1 ,
$$

and hence

$$
K_i = \sum_j P_{ij} K_j .
$$

By the maximum principle, $K_i$ is a constant. Should Peter have been given the prize?

²⁵ Private communication.

Chapter 12

Random Walks

12.1 Random Walks in Euclidean Space

In the last several chapters, we have studied sums of random variables with the goal being to describe the distribution and density functions of the sum. In this chapter, we shall look at sums of discrete random variables from a different perspective. We shall be concerned with properties which can be associated with the sequence of partial sums, such as the number of sign changes of this sequence, the number of terms in the sequence which equal 0, and the expected size of the maximum term in the sequence.

We begin with the following definition.

Definition 12.1 Let $\{X_k\}_{k=1}^{\infty}$ be a sequence of independent, identically distributed discrete random variables. For each positive integer $n$, we let $S_n$ denote the sum $X_1 + X_2 + \cdots + X_n$. The sequence $\{S_n\}_{n=1}^{\infty}$ is called a random walk. If the common range of the $X_k$'s is $\mathbf{R}^m$, then we say that $\{S_n\}$ is a random walk in $\mathbf{R}^m$. □

We view the sequence of $X_k$'s as being the outcomes of independent experiments. Since the $X_k$'s are independent, the probability of any particular (finite) sequence of outcomes can be obtained by multiplying the probabilities that each $X_k$ takes on the specified value in the sequence. Of course, these individual probabilities are given by the common distribution of the $X_k$'s. We will typically be interested in finding probabilities for events involving the related sequence of $S_n$'s. Such events can be described in terms of the $X_k$'s, so their probabilities can be calculated using the above idea.

There are several ways to visualize a random walk. One can imagine that a particle is placed at the origin in $\mathbf{R}^m$ at time $n = 0$.
The sum $S_n$ represents the position of the particle at the end of $n$ seconds. Thus, in the time interval $[n - 1, n]$, the particle moves (or jumps) from position $S_{n-1}$ to $S_n$. The vector representing this motion is just $S_n - S_{n-1}$, which equals $X_n$. This means that in a random walk, the jumps are independent and identically distributed. If $m = 1$, for example, then one can imagine a particle on the real line that starts at the origin, and at the end of each second, jumps one unit to the right or the left, with probabilities given by the distribution of the $X_k$'s. If $m = 2$, one can visualize the process as taking place in a city in which the streets form square city blocks. A person starts at one corner (i.e., at an intersection of two streets) and goes in one of the four possible directions according to the distribution of the $X_k$'s. If $m = 3$, one might imagine being in a jungle gym, where one is free to move in any one of six directions (left, right, forward, backward, up, and down). Once again, the probabilities of these movements are given by the distribution of the $X_k$'s.

Another model of a random walk (used mostly in the case where the range is $\mathbf{R}^1$) is a game, involving two people, which consists of a sequence of independent, identically distributed moves. The sum $S_n$ represents the score of the first person, say, after $n$ moves, with the assumption that the score of the second person is $-S_n$. For example, two people might be flipping coins, with a match or non-match representing $+1$ or $-1$, respectively, for the first player. Or, perhaps one coin is being flipped, with a head or tail representing $+1$ or $-1$, respectively, for the first player.

Random Walks on the Real Line

We shall first consider the simplest non-trivial case of a random walk in $\mathbf{R}^1$, namely the case where the common distribution function of the random variables $X_n$ is given by

$$
f_X(x) = \begin{cases} 1/2, & \text{if } x = \pm 1, \\ 0, & \text{otherwise.} \end{cases}
$$
This situation corresponds to a fair coin being flipped, with $S_n$ representing the number of heads minus the number of tails which occur in the first $n$ flips. We note that in this situation, all paths of length $n$ have the same probability, namely $2^{-n}$.

It is sometimes instructive to represent a random walk as a polygonal line, or path, in the plane, where the horizontal axis represents time and the vertical axis represents the value of $S_n$. Given a sequence $\{S_n\}$ of partial sums, we first plot the points $(n, S_n)$, and then for each $k < n$, we connect $(k, S_k)$ and $(k + 1, S_{k+1})$ with a straight line segment. The length of a path is just the difference in the time values of the beginning and ending points on the path. The reader is referred to Figure 12.1. This figure, and the process it illustrates, are identical with the example, given in Chapter 1, of two people playing heads or tails.

[Figure 12.1: A random walk of length 40. The horizontal axis shows time from 0 to 40 and the vertical axis shows $S_n$ from -10 to 10.]

Returns and First Returns

We say that an equalization has occurred, or there is a return to the origin at time $n$, if $S_n = 0$. We note that this can only occur if $n$ is an even integer. To calculate the probability of an equalization at time $2m$, we need only count the number of paths of length $2m$ which begin and end at the origin. The number of such paths is clearly

$$
\binom{2m}{m} .
$$

Since each path has probability $2^{-2m}$, we have the following theorem.

Theorem 12.1 The probability of a return to the origin at time $2m$ is given by

$$
u_{2m} = \binom{2m}{m} 2^{-2m} .
$$

The probability of a return to the origin at an odd time is 0. □

A random walk is said to have a first return to the origin at time $2m$ if $m > 0$, and $S_{2k} \ne 0$ for all $k < m$. We denote the probability of this event by $f_{2m}$, and we set $f_0 = 0$. With this notation, $f_{2m} 2^{2m}$ is the number of paths of length $2m$ from $(0, 0)$ to $(2m, 0)$ that do not touch the horizontal axis except at the endpoints.

Theorem 12.2 For $n \ge 1$, the probabilities $\{u_{2k}\}$ and $\{f_{2k}\}$ are related by the equation

$$
u_{2n} = f_0 u_{2n} + f_2 u_{2n-2} + \cdots + f_{2n} u_0 .
$$

Proof. There are $u_{2n} 2^{2n}$ paths of length $2n$ which have endpoints $(0, 0)$ and $(2n, 0)$.
The collection of such paths can be partitioned into $n$ sets, depending upon the time of the first return to the origin. A path in this collection which has a first return to the origin at time $2k$ consists of an initial segment from $(0, 0)$ to $(2k, 0)$, in which no interior points are on the horizontal axis, and a terminal segment from $(2k, 0)$ to $(2n, 0)$, with no further restrictions on this segment. Thus, the number of paths in the collection which have a first return to the origin at time $2k$ is given by
\[ f_{2k} 2^{2k} \, u_{2n-2k} 2^{2n-2k} = f_{2k} u_{2n-2k} 2^{2n}. \]
If we sum over $k$, we obtain the equation
\[ u_{2n} 2^{2n} = f_0 u_{2n} 2^{2n} + f_2 u_{2n-2} 2^{2n} + \cdots + f_{2n} u_0 2^{2n}. \]
Dividing both sides of this equation by $2^{2n}$ completes the proof. □

The expression in the right-hand side of the above theorem should remind the reader of a sum that appeared in Definition 7.1 of the convolution of two distributions. The convolution of two sequences is defined in a similar manner. The above theorem says that the sequence $\{u_{2n}\}$ is the convolution of itself and the sequence $\{f_{2n}\}$. Thus, if we represent each of these sequences by an ordinary generating function, then we can use the above relationship to determine the value $f_{2n}$.

Theorem 12.3 For $m \ge 1$, the probability of a first return to the origin at time $2m$ is given by
\[ f_{2m} = \frac{u_{2m}}{2m-1} = \frac{\binom{2m}{m}}{(2m-1) 2^{2m}}. \]
Proof. We begin by defining the generating functions
\[ U(x) = \sum_{m=0}^{\infty} u_{2m} x^m \]
and
\[ F(x) = \sum_{m=0}^{\infty} f_{2m} x^m. \]
Theorem 12.2 says that
\[ U(x) = 1 + U(x) F(x). \qquad (12.1) \]
(The presence of the 1 on the right-hand side is due to the fact that $u_0$ is defined to be 1, but Theorem 12.2 only holds for $n \ge 1$.) We note that both generating functions certainly converge on the interval $(-1, 1)$, since all of the coefficients are at most 1 in absolute value. Thus, we can solve the above equation for $F(x)$, obtaining
\[ F(x) = \frac{U(x) - 1}{U(x)}. \]
Now, if we can find a closed-form expression for the function $U(x)$, we will also have a closed-form expression for $F(x)$.
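Before continuing with the closed form, the relation in Theorem 12.2 can be checked numerically against the formula stated in Theorem 12.3. The sketch below is an illustration only, assuming Python with exact rational arithmetic from the standard `fractions` module; the helper names `u` and `first_return_probs` are hypothetical.

```python
from fractions import Fraction
from math import comb

def u(m):
    """u_{2m} = C(2m, m) / 2^{2m}, the return probability of Theorem 12.1."""
    return Fraction(comb(2 * m, m), 4 ** m)

def first_return_probs(n_max):
    """Solve u_{2n} = f_2 u_{2n-2} + ... + f_{2n} u_0 for f_2, ..., f_{2 n_max}."""
    f = [Fraction(0)]                        # f_0 = 0
    for n in range(1, n_max + 1):
        partial = sum(f[k] * u(n - k) for k in range(1, n))
        f.append(u(n) - partial)             # u_0 = 1, so f_{2n} = u_{2n} - partial
    return f

f = first_return_probs(10)
for m in range(1, 11):
    # Compare with the closed form stated in Theorem 12.3.
    assert f[m] == u(m) / (2 * m - 1)
print(f[1], f[2], f[3])
```

The recursion determines each $f_{2n}$ from earlier terms, which is why the convolution identity alone is enough to pin down the whole sequence.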
From Theorem 12.1, we have
\[ U(x) = \sum_{m=0}^{\infty} \binom{2m}{m} 2^{-2m} x^m. \]
In Wilf,$^1$ we find that
\[ \frac{1}{\sqrt{1-4x}} = \sum_{m=0}^{\infty} \binom{2m}{m} x^m. \]
The reader is asked to prove this statement in Exercise 1. If we replace $x$ by $x/4$ in the last equation, we see that
\[ U(x) = \frac{1}{\sqrt{1-x}}. \]

$^1$H. S. Wilf, generatingfunctionology (Boston: Academic Press, 1990), p. 50.

Therefore, we have
\[ F(x) = \frac{U(x) - 1}{U(x)} = \frac{(1-x)^{-1/2} - 1}{(1-x)^{-1/2}} = 1 - (1-x)^{1/2}. \]
Although it is possible to compute the value of $f_{2m}$ using the Binomial Theorem, it is easier to note that $F'(x) = U(x)/2$, so that the coefficients $f_{2m}$ can be found by integrating the series for $U(x)$. We obtain, for $m \ge 1$,
\[ f_{2m} = \frac{u_{2m-2}}{2m} = \frac{\binom{2m-2}{m-1}}{m 2^{2m-1}} = \frac{\binom{2m}{m}}{(2m-1) 2^{2m}} = \frac{u_{2m}}{2m-1}, \]
since
\[ \binom{2m-2}{m-1} = \frac{m}{2(2m-1)} \binom{2m}{m}. \]
This completes the proof of the theorem. □

Probability of Eventual Return

In the symmetric random walk process in $\mathbf{R}^m$, what is the probability that the particle eventually returns to the origin? We first examine this question in the case that $m = 1$, and then we consider the general case. The results in the next two examples are due to Pólya.$^2$

$^2$G. Pólya, "Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz," Math. Ann., vol. 84 (1921), pp. 149-160.

Example 12.1 (Eventual Return in $\mathbf{R}^1$) One has to approach the idea of eventual return with some care, since the sample space seems to be the set of all walks of infinite length, and this set is non-denumerable. To avoid difficulties, we will define $w_n$ to be the probability that a first return has occurred no later than time $n$. Thus, $w_n$ concerns the sample space of all walks of length $n$, which is a finite set. In terms of the $w_n$'s, it is reasonable to define the probability that the particle eventually returns to the origin to be
\[ w_* = \lim_{n \to \infty} w_n. \]
This limit clearly exists and is at most one, since the sequence $\{w_n\}_{n=1}^{\infty}$ is an increasing sequence, and all of its terms are at most one.

In terms of the $f_n$ probabilities, we see that
\[ w_{2n} = \sum_{i=1}^{n} f_{2i}. \]
Thus,
\[ w_* = \sum_{i=1}^{\infty} f_{2i}. \]
In the proof of Theorem 12.3, the generating function
\[ F(x) = \sum_{m=0}^{\infty} f_{2m} x^m \]
was introduced. There it was noted that this series converges for $x \in (-1, 1)$. In fact, it is possible to show that this series also converges for $x = \pm 1$ by using Exercise 4, together with the fact that
\[ f_{2m} = \frac{u_{2m}}{2m-1}. \]
(This fact was proved in the proof of Theorem 12.3.) Since we also know that
\[ F(x) = 1 - (1-x)^{1/2}, \]
we see that
\[ w_* = F(1) = 1. \]
Thus, with probability one, the particle returns to the origin. An alternative proof of the fact that $w_* = 1$ can be obtained by using the results in Exercise 2. □

Example 12.2 (Eventual Return in $\mathbf{R}^m$) We now turn our attention to the case that the random walk takes place in more than one dimension. We define $f_{2n}^{(m)}$ to be the probability that the first return to the origin in $\mathbf{R}^m$ occurs at time $2n$. The quantity $u_{2n}^{(m)}$ is defined in a similar manner. Thus, $f_{2n}^{(1)}$ and $u_{2n}^{(1)}$ equal $f_{2n}$ and $u_{2n}$, which were defined earlier. If, in addition, we define $u_0^{(m)} = 1$ and $f_0^{(m)} = 0$, then one can mimic the proof of Theorem 12.2, and show that for all $m \ge 1$,
\[ u_{2n}^{(m)} = f_0^{(m)} u_{2n}^{(m)} + f_2^{(m)} u_{2n-2}^{(m)} + \cdots + f_{2n}^{(m)} u_0^{(m)}. \qquad (12.2) \]
We continue to generalize previous work by defining
\[ U^{(m)}(x) = \sum_{n=0}^{\infty} u_{2n}^{(m)} x^n \]
and
\[ F^{(m)}(x) = \sum_{n=0}^{\infty} f_{2n}^{(m)} x^n. \]
Then, by using Equation 12.2, we see that
\[ U^{(m)}(x) = 1 + U^{(m)}(x) F^{(m)}(x), \]
as before. These functions will always converge in the interval $(-1, 1)$, since all of their coefficients are at most one in magnitude. In fact, since
\[ w_*^{(m)} = \sum_{n=0}^{\infty} f_{2n}^{(m)} \le 1 \]
for all $m$, the series for $F^{(m)}(x)$ converges at $x = 1$ as well, and $F^{(m)}(x)$ is left-continuous at $x = 1$, i.e.,
\[ \lim_{x \uparrow 1} F^{(m)}(x) = F^{(m)}(1). \]
Thus, we have
\[ w_*^{(m)} = \lim_{x \uparrow 1} F^{(m)}(x) = \lim_{x \uparrow 1} \frac{U^{(m)}(x) - 1}{U^{(m)}(x)}, \qquad (12.3) \]
so to determine $w_*^{(m)}$, it suffices to determine
\[ \lim_{x \uparrow 1} U^{(m)}(x). \]
We let $u^{(m)}$ denote this limit.
We claim that
\[ u^{(m)} = \sum_{n=0}^{\infty} u_{2n}^{(m)}. \]
(This claim is reasonable; it says that to find out what happens to the function $U^{(m)}(x)$ at $x = 1$, just let $x = 1$ in the power series for $U^{(m)}(x)$.) To prove the claim, we note that the coefficients $u_{2n}^{(m)}$ are non-negative, so $U^{(m)}(x)$ increases monotonically on the interval $[0, 1)$. Thus, for each $K$, we have
\[ \sum_{n=0}^{K} u_{2n}^{(m)} \le \lim_{x \uparrow 1} U^{(m)}(x) = u^{(m)} \le \sum_{n=0}^{\infty} u_{2n}^{(m)}. \]
By letting $K \to \infty$, we see that
\[ u^{(m)} = \sum_{n=0}^{\infty} u_{2n}^{(m)}. \]
This establishes the claim.

From Equation 12.3, we see that if $u^{(m)} < \infty$, then the probability of an eventual return is
\[ \frac{u^{(m)} - 1}{u^{(m)}}, \]
while if $u^{(m)} = \infty$, then the probability of eventual return is 1. To complete the example, we must estimate the sum
\[ \sum_{n=0}^{\infty} u_{2n}^{(m)}. \]
In Exercise 12, the reader is asked to show that
\[ u_{2n}^{(2)} = \frac{1}{4^{2n}} \binom{2n}{n}^2. \]
Using Stirling's Formula, it is easy to show that (see Exercise 13)
\[ \binom{2n}{n} \sim \frac{2^{2n}}{\sqrt{\pi n}}, \]
so
\[ u_{2n}^{(2)} \sim \frac{1}{\pi n}. \]
From this it follows easily that
\[ \sum_{n=0}^{\infty} u_{2n}^{(2)} \]
diverges, so $w_*^{(2)} = 1$, i.e., in $\mathbf{R}^2$, the probability of an eventual return is 1.

When $m = 3$, Exercise 12 shows that
\[ u_{2n}^{(3)} = \frac{1}{2^{2n}} \binom{2n}{n} \sum_{j,k} \left( \frac{1}{3^n} \frac{n!}{j!\,k!\,(n-j-k)!} \right)^2. \]
Let $M$ denote the largest value of
\[ \frac{1}{3^n} \frac{n!}{j!\,k!\,(n-j-k)!} \]
over all non-negative values of $j$ and $k$ with $j + k \le n$. It is easy, using Stirling's Formula, to show that
\[ M \sim \frac{c}{n} \]
for some constant $c$. Thus, we have
\[ u_{2n}^{(3)} \le \frac{1}{2^{2n}} \binom{2n}{n} \, M \sum_{j,k} \frac{1}{3^n} \frac{n!}{j!\,k!\,(n-j-k)!}. \]
Using Exercise 14, one can show that the right-hand expression is at most
\[ \frac{c'}{n^{3/2}}, \]
where $c'$ is a constant. Thus,
\[ \sum_{n=0}^{\infty} u_{2n}^{(3)} \]
converges, so $w_*^{(3)}$ is strictly less than one. This means that in $\mathbf{R}^3$, the probability of an eventual return to the origin is strictly less than one (in fact, it is approximately .34).

One may summarize these results by stating that one should not get drunk in more than two dimensions. □

Expected Number of Equalizations

We now give another example of the use of generating functions to find a general formula for terms in a sequence, where the sequence is related by recursion relations to other sequences.
Exercise 9 gives still another example.

Example 12.3 (Expected Number of Equalizations) In this example, we will derive a formula for the expected number of equalizations in a random walk of length $2m$. As in the proof of Theorem 12.3, the method has four main parts. First, a recursion is found which relates the $m$th term in the unknown sequence to earlier terms in the same sequence and to terms in other (known) sequences. An example of such a recursion is given in Theorem 12.2. Second, the recursion is used to derive a functional equation involving the generating functions of the unknown sequence and one or more known sequences. Equation 12.1 is an example of such a functional equation. Third, the functional equation is solved for the unknown generating function. Last, using a device such as the Binomial Theorem, integration, or differentiation, a formula for the $m$th coefficient of the unknown generating function is found.

We begin by defining $g_{2m}$ to be the number of equalizations among all of the random walks of length $2m$. (For each random walk, we disregard the equalization at time 0.) We define $g_0 = 0$. Since the number of walks of length $2m$ equals $2^{2m}$, the expected number of equalizations among all such random walks is $g_{2m}/2^{2m}$. Next, we define the generating function $G(x)$:
\[ G(x) = \sum_{k=0}^{\infty} g_{2k} x^k. \]
Now we need to find a recursion which relates the sequence $\{g_{2k}\}$ to one or both of the known sequences $\{f_{2k}\}$ and $\{u_{2k}\}$. We consider $m$ to be a fixed positive integer, and consider the set of all paths of length $2m$ as the disjoint union
\[ E_2 \cup E_4 \cup \cdots \cup E_{2m} \cup H, \]
where $E_{2k}$ is the set of all paths of length $2m$ with first equalization at time $2k$, and $H$ is the set of all paths of length $2m$ with no equalization. It is easy to show (see Exercise 3) that
\[ |E_{2k}| = f_{2k} 2^{2m}. \]
We claim that the number of equalizations among all paths belonging to the set $E_{2k}$ is equal to
\[ |E_{2k}| + 2^{2k} f_{2k} g_{2m-2k}. \qquad (12.4) \]
Each path in $E_{2k}$ has one equalization at time $2k$, so the total number of such equalizations is just $|E_{2k}|$. This is the first summand in Equation 12.4. There are $2^{2k} f_{2k}$ different initial segments of length $2k$ among the paths in $E_{2k}$. Each of these initial segments can be augmented to a path of length $2m$ in $2^{2m-2k}$ ways, by adjoining all possible paths of length $2m - 2k$. The number of equalizations obtained by adjoining all of these paths to any one initial segment is $g_{2m-2k}$, by definition. This gives the second summand in Equation 12.4. Since $k$ can range from 1 to $m$, we obtain the recursion
\[ g_{2m} = \sum_{k=1}^{m} \left( |E_{2k}| + 2^{2k} f_{2k} g_{2m-2k} \right). \qquad (12.5) \]
The second summand in the typical term above should remind the reader of a convolution. In fact, if we multiply the generating function $G(x)$ by the generating function
\[ F(4x) = \sum_{k=0}^{\infty} 2^{2k} f_{2k} x^k, \]
the coefficient of $x^m$ equals
\[ \sum_{k=0}^{m} 2^{2k} f_{2k} g_{2m-2k}. \]
Thus, the product $G(x)F(4x)$ is part of the functional equation that we are seeking. The first summand in the typical term in Equation 12.5 gives rise to the sum
\[ 2^{2m} \sum_{k=1}^{m} f_{2k}. \]
From Exercise 2, we see that this sum is just $(1 - u_{2m}) 2^{2m}$. Thus, we need to create a generating function whose $m$th coefficient is this term; this generating function is
\[ \sum_{m=0}^{\infty} (1 - u_{2m}) 2^{2m} x^m, \]
or
\[ \sum_{m=0}^{\infty} 2^{2m} x^m - \sum_{m=0}^{\infty} u_{2m} 2^{2m} x^m. \]
The first sum is just $(1 - 4x)^{-1}$, and the second sum is $U(4x)$. So, the functional equation which we have been seeking is
\[ G(x) = F(4x) G(x) + \frac{1}{1-4x} - U(4x). \]
If we solve this equation for $G(x)$, and simplify, we obtain
\[ G(x) = \frac{1}{(1-4x)^{3/2}} - \frac{1}{1-4x}. \qquad (12.6) \]
We now need to find a formula for the coefficient of $x^m$. The first summand in Equation 12.6 is $(1/2)U'(4x)$, where the prime denotes the derivative of $U(4x)$ with respect to $x$, so the coefficient of $x^m$ in this function is
\[ u_{2m+2} 2^{2m+1} (m+1). \]
The second summand in Equation 12.6 is the sum of a geometric series with common ratio $4x$, so the coefficient of $x^m$ is $2^{2m}$. Thus, we obtain
RANDOM WALKS IN EUCLIDEAN SPACE 481 92m =u2m+222m+1(m + 1) - 22m 1 2m+2 = - (+1)- 22m 2 m+1 m+1 We recall that the quotient g2m/22m is the expected number of equalizations among all paths of length 2m. Using Exercise 4, it is easy to show that 92m 2 2m - 2m.' 22m 7 In particular, this means that the average number of equalizations among all paths of length 4m is not twice the average number of equalizations among all paths of length 2m. In order for the average number of equalizations to double, one must quadruple the lengths of the random walks. D It is interesting to note that if we define M = max Sk, 0 1, f2m = u2m-2 - U2m (b) Using part (a), find a closed-form expression for the sum f2 + f4+ + f2m (c) Using part (b), show that 00 E f2m= 1. m=1 (One can also obtain this statement from the fact that  482 CHAPTER 12. RANDOM WALKS (d) Using parts (a) and (b), show that the probability of no equalization in the first 2m outcomes equals the probability of an equalization at time 2m. 3 Using the notation of Example 12.3, show that E2k| = f2k22m 4 Using Stirling's Formula, show that 1 u2m~ 5 A lead change in a random walk occurs at time 2k if S2k-1 and S2k+1 are of opposite sign. (a) Give a rigorous argument which proves that among all walks of length 2m that have an equalization at time 2k, exactly half have a lead change at time 2k. (b) Deduce that the total number of lead changes among all walks of length 2m equals 1 2(92m - u2m) . (c) Find an asymptotic expression for the average number of lead changes in a random walk of length 2m. 6 (a) Show that the probability that a random walk of length 2m has a last return to the origin at time 2k, where 0 < k 0, S2 > 0, ..., S2m> 0) = u2m . Hint: First explain why P(Si>0, S2>0, ...,S2m > 0) 1 = -P(Si#O0,S2#0,...,S2m#0). 2 Then use Exercise 7, together with the observation that if no equalization occurs in the first 2m outcomes, then the path goes through the point (1, 1) and remains on or above the horizontal line x= 1. 
*9 In Feller,3 one finds the following theorem: Let Mn be the random variable which gives the maximum value of Sk, for 1 0, then P(Mn= r)={ pnr, if r=n (mod 2), P(M Pn,r+1, if r n (mod 2). (a) Using this theorem, show that 1m 2m E(M2m) =2m (4k - 1) , 22m k1m+k and if n = 2m + 1, then 1 m f2m+1 E(M2m+1) 22m+1 Z(4k + 1) .m k ) (b) For m;> 1, define m 2m rm =Ik m+k k=1 m+k and m 2m + 1 sm = zk( 221) k=1 m 1 By using the identity n n-1 n-1 ( k-1+ k1) show that sm =2rm(-22m (2m) 3W. Feller, Introduction to Probability Theory anid its Applications, vol. I, 3rd ed. (New York: John Wiley & Sons, 1968).  484 CHAPTER 12. RANDOM WALKS and r'm = 2sm-1 + 22m-1 if m>2. (c) Define the generating functions 00 R(x) =Z1:rkxk k=1 and 00 S(x)= szkxk. k=1 Show that S(x) =2R(x)- 1 -14x+ v/1- 4x and 1 R(x) = 2xS(x) + x(. 1 - 4x (d) Show that R(x)= (1 -x and 1 1 1 1 S~)=2 (1 - 4x)3/2 21 - 4.x (e) Show that 72m - 1) rm=m () and 1 2m+1 1 sm -(m +1) (22m) 2 m 2 (f) Show that E M m 2m + 1 2m - 1 E(M2m) 22m-1 ) + 22m+1 (m 2 ' and m+1 (2m+2 11 E(M2m+1) =22m±1 m + 1 J 2 The reader should compare these formulas with the expression for g~m/2(2m) in Example 12.3.  12.1. RANDOM WALKS IN EUCLIDEAN SPACE 485 *10 (from K. Levasseur4) A parent and his child play the following game. A deck of 2n cards, n red and n black, is shuffled. The cards are turned up one at a time. Before each card is turned up, the parent and the child guess whether it will be red or black. Whoever makes more correct guesses wins the game. The child is assumed to guess each color with the same probability, so she will have a score of n, on average. The parent keeps track of how many cards of each color have already been turned up. If more black cards, say, than red cards remain in the deck, then the parent will guess black, while if an equal number of each color remain, then the parent guesses each color with probability 1/2. What is the expected number of correct guesses that will be made by the parent? 
Hint: Each of the $\binom{2n}{n}$ possible orderings of red and black cards corresponds to a random walk of length $2n$ that returns to the origin at time $2n$. Show that between each pair of successive equalizations, the parent will be right exactly once more than he will be wrong. Explain why this means that the average number of correct guesses by the parent is greater than $n$ by exactly one-half the average number of equalizations. Now define the random variable $X_i$ to be 1 if there is an equalization at time $2i$, and 0 otherwise. Then, among all relevant paths, we have
\[ E(X_i) = P(X_i = 1) = \frac{\binom{2n-2i}{n-i} \binom{2i}{i}}{\binom{2n}{n}}. \]
Thus, the expected number of equalizations equals
\[ E\left( \sum_{i=1}^{n} X_i \right) = \frac{1}{\binom{2n}{n}} \sum_{i=1}^{n} \binom{2n-2i}{n-i} \binom{2i}{i}. \]
One can now use generating functions to find the value of the sum.

It should be noted that in a game such as this, a more interesting question than the one asked above is what is the probability that the parent wins the game? For this game, this question was answered by D. Zagier.$^5$ He showed that the probability of winning is asymptotic (for large $n$) to the quantity
\[ \frac{1}{2} + \frac{1}{2\sqrt{2}}. \]

$^4$K. Levasseur, "How to Beat Your Kids at Their Own Game," Mathematics Magazine vol. 61, no. 5 (December 1988), pp. 301-305.
$^5$D. Zagier, "How Often Should You Beat Your Kids?" Mathematics Magazine vol. 63, no. 2 (April 1990), pp. 89-92.

*11 Prove that
\[ u_{2n}^{(2)} = \frac{1}{4^{2n}} \sum_{k=0}^{n} \frac{(2n)!}{k!\,k!\,(n-k)!\,(n-k)!}, \]
and
\[ u_{2n}^{(3)} = \frac{1}{6^{2n}} \sum_{j,k} \frac{(2n)!}{j!\,j!\,k!\,k!\,(n-j-k)!\,(n-j-k)!}, \]
where the last sum extends over all non-negative $j$ and $k$ with $j + k \le n$. Also show that this last expression may be rewritten as
\[ \frac{1}{2^{2n}} \binom{2n}{n} \sum_{j,k} \left( \frac{1}{3^n} \frac{n!}{j!\,k!\,(n-j-k)!} \right)^2. \]

*12 Prove that if $n \ge 0$, then
\[ \sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}. \]
Hint: Write the sum as
\[ \sum_{k=0}^{n} \binom{n}{k} \binom{n}{n-k} \]
and explain why this is a coefficient in the product
\[ (1 + x)^n (1 + x)^n. \]
Use this, together with Exercise 11, to show that
\[ u_{2n}^{(2)} = \frac{1}{4^{2n}} \binom{2n}{n} \sum_{k=0}^{n} \binom{n}{k}^2 = \frac{1}{4^{2n}} \binom{2n}{n}^2. \]

*13 Using Stirling's Formula, prove that
\[ \binom{2n}{n} \sim \frac{2^{2n}}{\sqrt{\pi n}}. \]

*14 Prove that
\[ \sum_{j,k} \frac{1}{3^n} \frac{n!}{j!\,k!\,(n-j-k)!} = 1, \]
where the sum extends over all non-negative $j$ and $k$ such that $j + k \le n$. Hint: Count how many ways one can place $n$ labelled balls in 3 labelled urns.

*15 Using the result proved for the random walk in $\mathbf{R}^3$ in Example 12.2, explain why the probability of an eventual return in $\mathbf{R}^n$ is strictly less than one, for all $n \ge 3$. Hint: Consider a random walk in $\mathbf{R}^n$ and disregard all but the first three coordinates of the particle's position.

12.2 Gambler's Ruin

In the last section, the simplest kind of symmetric random walk in $\mathbf{R}^1$ was studied. In this section, we remove the assumption that the random walk is symmetric. Instead, we assume that $p$ and $q$ are non-negative real numbers with $p + q = 1$, and that the common distribution function of the jumps of the random walk is
\[ f_X(x) = \begin{cases} p, & \text{if } x = 1, \\ q, & \text{if } x = -1. \end{cases} \]
One can imagine the random walk as representing a sequence of tosses of a weighted coin, with a head appearing with probability $p$ and a tail appearing with probability $q$. An alternative formulation of this situation is that of a gambler playing a sequence of games against an adversary (sometimes thought of as another person, sometimes called "the house") where, in each game, the gambler has probability $p$ of winning.

The Gambler's Ruin Problem

The above formulation of this type of random walk leads to a problem known as the Gambler's Ruin problem. This problem was introduced in Exercise 23, but we will give the description of the problem again. A gambler starts with a "stake" of size $s$. She plays until her capital reaches the value $M$ or the value 0. In the language of Markov chains, these two values correspond to absorbing states. We are interested in studying the probability of occurrence of each of these two outcomes.

One can also assume that the gambler is playing against an "infinitely rich" adversary. In this case, we would say that there is only one absorbing state, namely when the gambler's stake is 0.
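The version of the game with two absorbing states, 0 and $M$, can be explored by simulation. The following is a rough sketch, assuming Python's standard `random` module; the function name `ruin_frequency`, the seed, and the parameter values are illustrative assumptions, and the result is only a Monte Carlo estimate of the probability of ruin.

```python
import random

def ruin_frequency(s, M, p, trials, rng):
    """Fraction of simulated games, started at stake s, that reach 0 before M."""
    ruined = 0
    for _ in range(trials):
        stake = s
        while 0 < stake < M:                          # play until an absorbing state
            stake += 1 if rng.random() < p else -1    # win with probability p
        if stake == 0:
            ruined += 1
    return ruined / trials

rng = random.Random(1711)                  # arbitrary fixed seed for repeatability
fair = ruin_frequency(5, 10, 0.5, 20000, rng)
print(fair)
```

Changing `p` away from 1/2 shows how strongly even a small bias in each game affects the estimated frequency of ruin.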
Under this assumption, one can ask for the proba- bility that the gambler is eventually ruined. We begin by defining qk to be the probability that the gambler's stake reaches 0, i.e., she is ruined, before it reaches M, given that the initial stake is k. We note that qo = 1 and qu = 0. The fundamental relationship among the qk's is the following: qk = pqk+1 + qqk-1 where 1 < k K p, the expression approaches 1. In the case p = q = 1/2, we recall that qz = 1 - z/M. Thus, if M -- o, we see that the probability of eventual ruin tends to 1. Historical Remarks In 1711, De Moivre, in his book De Mesura Sortis, gave an ingenious derivation of the probability of ruin. The following description of his argument is taken from David.6 The notation used is as follows: We imagine that there are two players, A and B, and the probabilities that they win a game are p and q, respectively. The players start with a and b counters, respectively. Imagine that each player starts with his counters before him in a pile, and that nominal values are assigned to the counters in the following manner. A's bottom counter is given the nominal value q/p; the next is given the nominal value (q/p)2, and so on until his top counter which has the nominal value (q/p)a. B's top counter is valued (q/p)a+l, and so on downwards until his bottom counter which is valued (q/p)a+b. After each game the loser's top counter is transferred to the top of the winner's pile, and it is always the top counter which is staked for the next game. Then in terms of the nominal values B's stake is always q/p times A's, so that at every game each player's nominal expectation is nil. This remains true throughout the play; therefore A's chance of winning all B's counters, multiplied by his nominal gain if he does so, must equal B's chance multiplied by B's nominal gain. Thus, Pa(( a±1+- - +(9)ab =Pb(( +- - +( a) . (12.8) 6F. N. David, Games, Gods and Gambling (London: Griffin, 1962).  490 CHAPTER 12. 
RANDOM WALKS Using this equation, together with the fact that Pa + Pb = 1, it can easily be shown that P-_(q/p)a -1 a (/)pga+b - 1 if p + q, and a Pa a+b if p =q =1/2. In terms of modern probability theory, de Moivre is changing the values of the counters to make an unfair game into a fair game, which is called a martingale. With the new values, the expected fortune of player A (that is, the sum of the nominal values of his counters) after each play equals his fortune before the play (and similarly for player B). (For a simpler martingale argument, see Exercise 9.) De Moivre then uses the fact that when the game ends, it is still fair, thus Equation 12.8 must be true. This fact requires proof, and is one of the central theorems in the area of martingale theory. Exercises 1 In the gambler's ruin problem, assume that the gambler initial stake is 1 dollar, and assume that her probability of success on any one game is p. Let T be the number of games until 0 is reached (the gambler is ruined). Show that the generating function for T is 1 - 1 - 4pqz2 h(z) = 2pz and that h(1) = qp i '

p and h'1= 1/(q -p), if q >p, h'(1V- oo, if q p. Interpret your results in terms of the time T to reach 0. (See also Exam- ple 10.7.) 2 Show that the Taylor series expansion for 1 - x is /1- O = 1/2 n=0 where the binomial coefficient (1/) is (1/2) (1/2)(1/2 -1). -(1/2 - n+1)  12.2. GAMBLER'S RUIN 491 Using this and the result of Exercise 1, show that the probability that the gambler is ruined on the nth step is r ___k-1ifn=2k-i, PT(T)()0, ifn=2k. 3 For the gambler's ruin problem, assume that the gambler starts with k dollars. Let Tk be the time to reach 0 for the first time. (a) Show that the generating function hk(t) for Tk is the kth power of the generating function for the time T to ruin starting at 1. Hint: Let Tk = U1 + U2 + - - - + Uk, where U3 is the time for the walk starting at j to reach j - 1 for the first time. (b) Find hk(1) and h' (1) and interpret your results. 4 (The next three problems come from Feller.7) As in the text, assume that M is a fixed positive integer. (a) Show that if a gambler starts with an stake of 0 (and is allowed to have a negative amount of money), then the probability that her stake reaches the value of M before it returns to 0 equals p(1 - q1). (b) Show that if the gambler starts with a stake of M then the probability that her stake reaches 0 before it returns to M equals qqM_1. 5 Suppose that a gambler starts with a stake of 0 dollars. (a) Show that the probability that her stake never reaches M before return- ing to 0 equals 1 - p(1 - q1). (b) Show that the probability that her stake reaches the value M exactly k times before returning to 0 equals p(1 - q1)(1 - qqM_1)k-1(qq_1). Hint: Use Exercise 4. 6 In the text, it was shown that if q < p, there is a positive probability that a gambler, starting with a stake of 0 dollars, will never return to the origin. Thus, we will now assume that q > p. 
Using Exercise 5, show that if a gambler starts with a stake of 0 dollars, then the expected number of times her stake equals $M$ before returning to 0 equals $(p/q)^M$, if $q > p$ and 1, if $q = p$. (We quote from Feller: "The truly amazing implications of this result appear best in the language of fair games. A perfect coin is tossed until the first equalization of the accumulated numbers of heads and tails. The gambler receives one penny for every time that the accumulated number of heads exceeds the accumulated number of tails by $m$. The 'fair entrance fee' equals 1 independent of $m$.")

$^7$W. Feller, op. cit., pg. 367.

7 In the game in Exercise 6, let $p = q = 1/2$ and $M = 10$. What is the probability that the gambler's stake equals $M$ at least 20 times before it returns to 0?

8 Write a computer program which simulates the game in Exercise 6 for the case $p = q = 1/2$, and $M = 10$.

9 In de Moivre's description of the game, we can modify the definition of player A's fortune in such a way that the game is still a martingale (and the calculations are simpler). We do this by assigning nominal values to the counters in the same way as de Moivre, but each player's current fortune is defined to be just the value of the counter which is being wagered on the next game. So, if player A has $a$ counters, then his current fortune is $(q/p)^a$ (we stipulate this to be true even if $a = 0$). Show that under this definition, player A's expected fortune after one play equals his fortune before the play, if $p \ne q$. Then, as de Moivre does, write an equation which expresses the fact that player A's expected final fortune equals his initial fortune. Use this equation to find the probability of ruin of player A.

10 Assume in the gambler's ruin problem that $p = q = 1/2$.

(a) Using Equation 12.7, together with the facts that $q_0 = 1$ and $q_M = 0$, show that for $0 \le z \le M$,
\[ q_z = \frac{M - z}{M}. \]

(b) In Equation 12.8, let $p \to 1/2$ (and since $q = 1 - p$, $q \to 1/2$ as well).
Show that in the limit, M-z qz = M Hint: Replace q by 1 - p, and use L'Hopital's rule. 11 In American casinos, the roulette wheels have the integers between 1 and 36, together with 0 and 00. Half of the non-zero numbers are red, the other half are black, and 0 and 00 are green. A common bet in this game is to bet a dollar on red. If a red number comes up, the bettor gets her dollar back, and also gets another dollar. If a black or green number comes up, she loses her dollar. (a) Suppose that someone starts with 40 dollars, and continues to bet on red until either her fortune reaches 50 or 0. Find the probability that her fortune reaches 50 dollars. (b) How much money would she have to start with, in order for her to have a 95% chance of winning 10 dollars before going broke? (c) A casino owner was once heard to remark that "If we took 0 and 00 off of the roulette wheel, we would still make lots of money, because people would continue to come in and play until they lost all of their money." Do you think that such a casino would stay in business?  12.3. ARC SINE LAWS 493 12.3 Arc Sine Laws In Exercise 12.1.6, the distribution of the time of the last equalization in the sym- metric random walk was determined. If we let a2k,2m denote the probability that a random walk of length 2m has its last equalization at time 2k, then we have a2k,2m = U2ku2m-2k We shall now show how one can approximate the distribution of the a's with a simple function. We recall that 1 U2k ~ - Therefore, as both k and m go to o, we have 1 a2k,2m '7km-k This last expression can be written as 1 wm (k/m)(1 - k/m) Thus, if we define 1 7 rx(1 -x) for 0 -x).) We imagine that we have a random walk of length n in which each summand has the distribution F(x), where F is continuous and symmetric. The subscript of the first maximum of such a walk is the unique subscript k such that Sk > So, ..., Sk > Sk _1, Sk;> Sk+1,..., Sk > Sn . We define the random variable Kn to be the subscript of the first maximum. 
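The definition of the subscript of the first maximum translates directly into a short routine. This is a minimal sketch, assuming Python; the function name `first_max_index` is hypothetical, and the standard normal steps are just one convenient example of a continuous symmetric distribution, chosen here as an assumption.

```python
import random
from itertools import accumulate

def first_max_index(steps):
    """Return the subscript k of the first maximum of S_0, S_1, ..., S_n."""
    sums = [0.0] + list(accumulate(steps))   # S_0 = 0, then the partial sums
    best = 0
    for k, s in enumerate(sums):
        if s > sums[best]:                   # strict '>' keeps the earliest maximum
            best = k
    return best

rng = random.Random(12)                      # arbitrary fixed seed
steps = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(first_max_index(steps))
```

Running the routine over many independent walks and recording the ratio of the returned subscript to $n$ gives an empirical picture of how $K_n$ is distributed.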
We can now state the following theorem concerning the random variable Ks. Theorem 12.6 Let F be a symmetric continuous distribution function, and let a be a fixed real number strictly between 0 and 1. Then as n - o, we have 2 P(Kn < na) --arcsin a. A version of this theorem that holds for a symmetric random walk can also be found in Feller. Exercises 1 For a random walk of length 2m, define Ek to equal 1 if Sk > 0, or if Sk-1 = 1 and Sk = 0. Define Ek to equal -1 in all other cases. Thus, Ek gives the side of the t-axis that the random walk is on during the time interval [k - 1, k]. A "law of large numbers" for the sequence {Ek} would say that for any 6 > 0, we would have P -(