Norm Matloff
Department of Computer Science
University of California, Davis

This work is distributed under a Creative Commons license; see the Preface for details.

Contents

1 Discrete Probability Models
   1.1 ALOHA Network Example
   1.2 Basic Ideas of Probability
      1.2.1 The Crucial Notion of a Repeatable Experiment
      1.2.2 Our Definitions
      1.2.3 Basic Probability Computations: ALOHA Network Example
      1.2.4 Bayes' Theorem
      1.2.5 ALOHA in the Notebook Context
      1.2.6 Simulation
         1.2.6.1 Simulation of the ALOHA Example
         1.2.6.2 Rolling Dice
      1.2.7 Combinatorics-Based Probability Computation
         1.2.7.1 Which Is More Likely in Five Cards, One King or Two Hearts?
         1.2.7.2 "Association Rules" in Data Mining
   1.3 Discrete Random Variables
   1.4 Independence, Expected Value and Variance
      1.4.1 Independent Random Variables
      1.4.2 Expected Value
         1.4.2.1 Intuitive Definition
         1.4.2.2 Computation and Properties of Expected Value
         1.4.2.3 Casinos, Insurance Companies and "Sum Users," Compared to Others
      1.4.3 Variance
      1.4.4 Is a Variance of X Large or Small?
      1.4.5 Chebychev's Inequality
      1.4.6 The Coefficient of Variation
      1.4.7 Covariance
      1.4.8 A Combinatorial Example
      1.4.9 Expected Value, Etc. in the ALOHA Example
      1.4.10 Reconciliation of Math and Intuition (optional section)
   1.5 Distributions
      1.5.1 Basic Notions
      1.5.2 Parametric Families of pmfs
         1.5.2.1 The Geometric Family of Distributions
         1.5.2.2 The Binomial Family of Distributions
         1.5.2.3 The Poisson Family of Distributions
         1.5.2.4 The Negative Binomial Family of Distributions
         1.5.2.5 The Power Law Family of Distributions
   1.6 Recognizing Distributions When You See Them
      1.6.1 A Coin Game
      1.6.2 Tossing a Set of Four Coins
      1.6.3 The ALOHA Example Again
   1.7 A Cautionary Tale
      1.7.1 Trick Coins, Tricky Example
      1.7.2 Intuition in Retrospect
      1.7.3 Implications for Modeling
   1.8 Why Not Just Do All Analysis by Simulation?
   1.9 Tips on Finding Probabilities, Expected Values and So On

2 Continuous Probability Models
   2.1 A Random Dart
   2.2 Density Functions
      2.2.1 Motivation, Definition and Interpretation
      2.2.2 Use of Densities to Find Probabilities and Expected Values
   2.3 Famous Parametric Families of Continuous Distributions
      2.3.1 The Uniform Distributions
         2.3.1.1 Density and Properties
         2.3.1.2 Example: Modeling of Disk Performance
         2.3.1.3 Example: Modeling of Denial-of-Service Attack
      2.3.2 The Normal (Gaussian) Family of Continuous Distributions
         2.3.2.1 Density and Properties
         2.3.2.2 Example: Network Intrusion
         2.3.2.3 The Central Limit Theorem
         2.3.2.4 Example: Coin Tosses
         2.3.2.5 Museum Demonstration
         2.3.2.6 Optional topic: Formal Statement of the CLT
         2.3.2.7 Importance in Modeling
      2.3.3 The Chi-Square Family of Distributions
         2.3.3.1 Density and Properties
         2.3.3.2 Importance in Modeling
      2.3.4 The Exponential Family of Distributions
         2.3.4.1 Density and Properties
         2.3.4.2 Connection to the Poisson Distribution Family
         2.3.4.3 Importance in Modeling
      2.3.5 The Gamma Family of Distributions
         2.3.5.1 Density and Properties
         2.3.5.2 Example: Network Buffer
         2.3.5.3 Importance in Modeling
   2.4 Describing "Failure"
      2.4.1 Memoryless Property
      2.4.2 Hazard Functions
         2.4.2.1 Basic Concepts
      2.4.3 Example: Software Reliability Models
   2.5 A Cautionary Tale: the Bus Paradox
   2.6 Choosing a Model
   2.7 A General Method for Simulating a Random Variable

3 Multivariate Probability Models
   3.1 Multivariate Distributions
      3.1.1 Why Are They Needed?
      3.1.2 Discrete Case
      3.1.3 Multivariate Densities
         3.1.3.1 Motivation and Definition
         3.1.3.2 Use of Multivariate Densities in Finding Probabilities and Expected Values
         3.1.3.3 Example: a Triangular Distribution
   3.2 More on Co-variation of Random Variables
      3.2.1 Covariance
      3.2.2 Correlation
      3.2.3 Example: Continuation of Section 3.1.3.3
      3.2.4 Example: a Catchup Game
   3.3 Sets of Independent Random Variables
      3.3.1 Properties
         3.3.1.1 Probability Mass Functions and Densities Factor
         3.3.1.2 Expected Values Factor
         3.3.1.3 Covariance Is 0
         3.3.1.4 Variances Add
         3.3.1.5 Convolution
      3.3.2 Examples
         3.3.2.1 Example: Dice
         3.3.2.2 Example: Ethernet
         3.3.2.3 Example: Analysis of Seek Time
         3.3.2.4 Example: Backup Battery
   3.4 Matrix Formulations
      3.4.1 Properties of Mean Vectors
      3.4.2 Properties of Covariance Matrices
   3.5 Conditional Distributions
      3.5.1 Conditional Pmfs and Densities
      3.5.2 Conditional Expectation
      3.5.3 The Law of Total Expectation (advanced topic)
         3.5.3.1 Expected Value As a Random Variable
         3.5.3.2 The Famous Formula (Theorem of Total Expectation)
      3.5.4 What About the Variance?
      3.5.5 Example: Trapped Miner
      3.5.6 Example: Analysis of Hash Tables
   3.6 Parametric Families of Distributions
      3.6.1 The Multinomial Family of Distributions
         3.6.1.1 Probability Mass Function
         3.6.1.2 Means and Covariances
         3.6.1.3 Application: Text Mining
      3.6.2 The Multivariate Normal Family of Distributions
         3.6.2.1 Densities and Properties
         3.6.2.2 The Multivariate Central Limit Theorem
         3.6.2.3 Example: Dice Game
         3.6.2.4 Application: Data Mining
   3.7 Simulation of Random Vectors
   3.8 Transform Methods (advanced topic)
         3.8.0.5 Generating Functions
         3.8.0.6 Moment Generating Functions
      3.8.1 Example: Network Packets
         3.8.1.1 Poisson Generating Function
         3.8.1.2 Sums of Independent Poisson Random Variables Are Poisson Distributed
         3.8.1.3 Random Number of Bits in Packets on One Link (advanced topic)
      3.8.2 Other Uses of Transforms
   3.9 Vector Space Interpretations (for the mathematically adventurous only)
      3.9.1 Properties of Correlation
      3.9.2 Conditional Expectation As a Projection
   3.10 Proof of the Law of Total Expectation

4 Introduction to Statistical Inference
   4.1 What Statistics Is All About
   4.2 Introduction to Confidence Intervals
      4.2.1 How Long Should We Run a Simulation?
      4.2.2 Confidence Intervals for Means
         4.2.2.1 Sampling Distributions
         4.2.2.2 Our First Confidence Interval
      4.2.3 Meaning of Confidence Intervals
         4.2.3.1 A Weight Survey in Davis
         4.2.3.2 Back to Our Bus Simulation
         4.2.3.3 One More Point About Interpretation
      4.2.4 Sampling With and Without Replacement
      4.2.5 Other Confidence Levels
      4.2.6 "The Standard Error of the Estimate"
      4.2.7 Why Not Divide by n-1? The Notion of Bias
      4.2.8 And What About the Student-t Distribution?
      4.2.9 Confidence Intervals for Proportions
         4.2.9.1 Derivation
         4.2.9.2 Examples
         4.2.9.3 Interpretation
         4.2.9.4 (Non-)Effect of the Population Size
         4.2.9.5 Planning Ahead
      4.2.10 One-Sided Confidence Intervals
      4.2.11 Confidence Intervals for Differences of Means or Proportions
         4.2.11.1 Independent Samples
         4.2.11.2 Random Sample Size
         4.2.11.3 Dependent Samples
      4.2.12 Example: Machine Classification of Forest Covers
      4.2.13 Exact Confidence Intervals
      4.2.14 Slutsky's Theorem (advanced topic)
         4.2.14.1 The Theorem
         4.2.14.2 Why It's Valid to Substitute s for σ
         4.2.14.3 Example: Confidence Interval for a Ratio Estimator
      4.2.15 The Delta Method: Confidence Intervals for General Functions of Means or Proportions (advanced topic)
         4.2.15.1 The Theorem
         4.2.15.2 Example: Square Root Transformation
         4.2.15.3 Example: Confidence Interval for σ²
      4.2.16 Simultaneous Confidence Intervals
         4.2.16.1 The Bonferroni Method
         4.2.16.2 Scheffe's Method (advanced topic)
         4.2.16.3 Example
         4.2.16.4 Other Methods for Simultaneous Inference
      4.2.17 The Bootstrap Method for Forming Confidence Intervals (advanced topic)
   4.3 Hypothesis Testing
      4.3.1 The Basics
      4.3.2 General Testing Based on Normally Distributed Estimators
      4.3.3 Example: Network Security
      4.3.4 The Notion of "p-Values"
      4.3.5 What's Random and What Is Not
      4.3.6 One-Sided H_A
      4.3.7 Exact Tests
      4.3.8 What's Wrong with Hypothesis Testing
      4.3.9 What to Do Instead
      4.3.10 Decide on the Basis of "the Preponderance of Evidence"
   4.4 General Methods of Estimation
      4.4.1 Example: Guessing the Number of Raffle Tickets Sold
      4.4.2 Method of Moments
      4.4.3 Method of Maximum Likelihood
      4.4.4 Example: Estimating the Parameters of a Gamma Distribution
         4.4.4.1 Method of Moments
         4.4.4.2 MLEs
      4.4.5 More Examples
      4.4.6 What About Confidence Intervals?
      4.4.7 Bayesian Methods (advanced topic)
      4.4.8 The Empirical cdf
   4.5 Real Populations and Conceptual Populations
   4.6 Nonparametric Density Estimation
      4.6.1 Basic Ideas
      4.6.2 Histograms
      4.6.3 Kernel-Based Density Estimation (advanced topic)
      4.6.4 Proper Use of Density Estimates

5 Introduction to Model Building
   5.1 Bias Vs. Variance
   5.2 "Desperate for Data"
      5.2.1 Mathematical Formulation of the Problem
      5.2.2 Bias and Variance of the Two Predictors
      5.2.3 Implications
   5.3 Assessing "Goodness of Fit" of a Model
      5.3.1 The Chi-Square Goodness of Fit Test
      5.3.2 Kolmogorov-Smirnov Confidence Bands
   5.4 Bias Vs. Variance, Again
   5.5 Robustness

6 Statistical Relations Between Variables
   6.1 The Goals: Prediction and Understanding
   6.2 Example Applications: Software Engineering, Networks, Text Mining
   6.3 Regression Analysis
      6.3.1 What Does "Relationship" Really Mean?
      6.3.2 Multiple Regression: More Than One Predictor Variable
      6.3.3 Interaction Terms
      6.3.4 Nonrandom Predictor Variables
      6.3.5 Prediction
      6.3.6 Optimality of the Regression Function
      6.3.7 Parametric Estimation of Linear Regression Functions
         6.3.7.1 Meaning of "Linear"
         6.3.7.2 Point Estimates and Matrix Formulation
         6.3.7.3 Back to Our ALOHA Example
         6.3.7.4 Approximate Confidence Intervals
         6.3.7.5 Once Again, Our ALOHA Example
         6.3.7.6 Estimation Vs. Prediction
         6.3.7.7 Exact Confidence Intervals
      6.3.8 The Famous "Error Term" (advanced topic)
      6.3.9 Model Selection
         6.3.9.1 The Overfitting Problem in Regression
         6.3.9.2 Methods for Predictor Variable Selection
      6.3.10 Nonlinear Parametric Regression Models
      6.3.11 Nonparametric Estimation of Regression Functions
      6.3.12 Regression Diagnostics
      6.3.13 Nominal Variables
      6.3.14 The Case in Which All Predictors Are Nominal Variables: "Analysis of Variance"
         6.3.14.1 It's a Regression!
         6.3.14.2 Interaction Terms
         6.3.14.3 Now Consider Parsimony
         6.3.14.4 Reparameterization
   6.4 The Classification Problem
      6.4.1 Meaning of the Regression Function
         6.4.1.1 The Mean Here Is a Probability
         6.4.1.2 Optimality of the Regression Function
      6.4.2 Parametric Models for the Regression Function in Classification Problems
         6.4.2.1 The Logistic Model: Form
         6.4.2.2 The Logistic Model: Intuitive Motivation
         6.4.2.3 The Logistic Model: Theoretical Foundation
      6.4.3 Nonparametric Estimation of Regression Functions for Classification (advanced topic)
         6.4.3.1 Use the Kernel Method, CART, Etc.
         6.4.3.2 SVMs
      6.4.4 Variable Selection in Classification Problems
         6.4.4.1 Problems Inherited from the Regression Context
         6.4.4.2 Example: Forest Cover Data
      6.4.5 Y Must Have a Marginal Distribution!
   6.5 Principal Components Analysis
      6.5.1 Dimension Reduction and the Principle of Parsimony
      6.5.2 How to Calculate Them
      6.5.3 Example: Forest Cover Data
   6.6 Log-Linear Models
      6.6.1 The Setting
      6.6.2 The Data
      6.6.3 The Models
      6.6.4 Parameter Estimation
      6.6.5 The Goal: Parsimony Again
   6.7 Simpson's (Non-)Paradox

7 Markov Chains
   7.1 Discrete-Time Markov Chains
      7.1.1 Example: Finite Random Walk
      7.1.2 Long-Run Distribution
         7.1.2.1 Periodic Chains
         7.1.2.2 The Meaning of the Term "Stationary Distribution"
      7.1.3 Example: Stuck-At 0 Fault
         7.1.3.1 Description
         7.1.3.2 Initial Analysis
         7.1.3.3 Going Beyond Finding π
      7.1.4 Example: Shared-Memory Multiprocessor
         7.1.4.1 The Model
         7.1.4.2 Going Beyond Finding π
      7.1.5 Example: Slotted ALOHA
         7.1.5.1 Going Beyond Finding π
   7.2 Hidden Markov Models
   7.3 Continuous-Time Markov Chains
      7.3.1 Holding-Time Distribution
      7.3.2 The Notion of "Rates"
      7.3.3 Stationary Distribution
      7.3.4 Minima of Independent Exponentially Distributed Random Variables
      7.3.5 Example: Machine Repair
      7.3.6 Continuous-Time Birth/Death Processes
      7.3.7 Example: Computer Worm
   7.4 Hitting Times Etc.
      7.4.1 Some Mathematical Conditions
      7.4.2 Example: Random Walks
      7.4.3 Finding Hitting and Recurrence Times
      7.4.4 Example: Finite Random Walk
      7.4.5 Example: Tree-Searching

8 Introduction to Queuing Models
   8.1 Introduction
   8.2 M/M/1
      8.2.1 Steady-State Probabilities
      8.2.2 Mean Queue Length
      8.2.3 Distribution of Residence Time/Little's Rule
   8.3 Multi-Server Models
   8.4 Loss Models
   8.5 Nonexponential Service Times
   8.6 Reversed Markov Chains
      8.6.1 Markov Property
      8.6.2 Long-Run State Proportions
      8.6.3 Form of the Transition Rates of the Reversed Chain
      8.6.4 Reversible Markov Chains
         8.6.4.1 Conditions for Checking Reversibility
         8.6.4.2 Making New Reversible Chains from Old Ones
         8.6.4.3 Example: Queues with a Common Waiting Area
         8.6.4.4 Closed-Form Expression for π for Any Reversible Markov Chain
   8.7 Networks of Queues
      8.7.1 Tandem Queues
      8.7.2 Jackson Networks
         8.7.2.1 Open Networks
      8.7.3 Closed Networks

9 Renewal Theory and Some Applications
   9.1 Introduction
      9.1.1 The Light Bulb Example, Generalized
      9.1.2 Duality Between "Lifetime Domain" and "Counts Domain"
   9.2 Where We Are Going
   9.3 Properties of Poisson Processes
      9.3.1 Definition
      9.3.2 Alternate Characterizations of Poisson Processes
         9.3.2.1 Exponential Interrenewal Times
         9.3.2.2 Stationary, Independent Increments
      9.3.3 Conditional Distribution of Renewal Times
      9.3.4 Decomposition and Superposition of Poisson Processes
      9.3.5 Nonhomogeneous Poisson Processes
         9.3.5.1 Example: Software Reliability
   9.4 Properties of General Renewal Processes
      9.4.1 The Regenerative Nature of Renewal Processes
      9.4.2 Some of the Main Theorems
         9.4.2.1 The Functions Fn Sum to m
         9.4.2.2 The Renewal Equation
         9.4.2.3 The Function m(t) Uniquely Determines F(t)
         9.4.2.4 Asymptotic Behavior of m(t)
   9.5 Alternating Renewal Processes
      9.5.1 Definition and Main Result
      9.5.2 Example: Inventory Problem (difficult)
   9.6 Residual-Life Distribution
      9.6.1 Residual-Life Distribution
      9.6.2 Age Distribution
      9.6.3 Mean of the Residual and Age Distributions
      9.6.4 Example: Estimating Web Page Modification Rates
      9.6.5 Example: The (S,s) Inventory Model Again
      9.6.6 Example: Disk File Model
      9.6.7 Example: Event Sets in Discrete Event Simulation (difficult)
      9.6.8 Example: Memory Paging Model

Preface

Why is this book different from all other books on probability and statistics?

First, the book stresses computer science applications. Though other books of this nature have been published, notably the outstanding text by K.S. Trivedi, this book has much more coverage of statistics, including a full chapter titled Statistical Relations Between Variables. This should prove especially helpful as machine learning and data mining play a greater role in computer science.

Second, there is a strong emphasis on modeling: What do probabilistic models really mean, in real-life terms? How does one choose a model? How do we assess the practical usefulness of models? This aspect is so important that there is a separate chapter for this as well, titled Introduction to Model Building. Throughout the text, there is considerable discussion of the intuition involving probabilistic concepts. For instance, when probability density functions are introduced, there is an extended discussion regarding the intuitive meaning of densities and their relation to the inherently-discrete nature of real data due to the finite precision of measurement. However, all models and so on are described precisely in terms of random variables and distributions.

Finally, the R statistical/data manipulation language is used throughout. Again, several excellent texts on probability and statistics have been written that feature R, but this book, by virtue of having a computer science audience, uses R in a more sophisticated manner. It is recommended that my online tutorial on R programming, R for Programmers (http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf), be used as a supplement.

As prerequisites, the student must know calculus, basic matrix algebra, and have skill in programming. As with any text in probability and statistics, it is also extremely helpful if the student has a good sense of math intuition, and does not treat mathematics as simply memorization of formulas.

A note regarding the chapters on statistics: It is crucial that students apply the concepts in thought-provoking exercises on real data.
Nowadays there are many good sources for real data sets available. Here are a few to get you started:

* UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html
* UCLA Statistics Dept. data sets, http://www.stat.ucla.edu/data/
* Dr. B's Wide World of Web Data, http://research.ed.asu.edu/multimedia/DrB/Default.htm
* StatSci.org, at http://www.statsci.org/datasets.html
* University of Edinburgh School of Informatics, http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html

Note that R has the capability of reading files on the Web, e.g.

> z <- read.table("http://heather.cs.ucdavis.edu/~matloff/z")

This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States License. The details may be viewed at http://creativecommons.org/licenses/by-nd/3.0/us/, but in essence it states that you are free to use, copy and distribute the work, but you must attribute the work to me and not "alter, transform, or build upon" it. If you are using the book, either in teaching a class or for your own learning, I would appreciate your informing me. I retain copyright in all non-U.S. jurisdictions, but permission to use these materials in teaching is still granted, provided the licensing information here is displayed.

Chapter 1

Discrete Probability Models

1.1 ALOHA Network Example

Throughout this book, we will be discussing both "classical" probability examples involving coins, cards and dice, and also examples involving applications to computer science. The latter will involve diverse fields such as data mining, machine learning, computer networks, software engineering and bioinformatics.

In this section, an example from computer networks is presented which will be used at a number of points in this chapter. Probability analysis is used extensively in the development of new, faster types of networks.

Today's Ethernet evolved from an experimental network developed at the University of Hawaii, called ALOHA. A number of network nodes would occasionally try to use the same radio channel to communicate with a central computer. The nodes couldn't hear each other, due to the obstruction of mountains between them. If only one of them made an attempt to send, it would be successful, and it would receive an acknowledgement message in response from the central computer. But if more than one node were to transmit, a collision would occur, garbling all the messages. The sending nodes would timeout after waiting for an acknowledgement which never came, and try sending again later. To avoid having too many collisions, nodes would engage in random backoff, meaning that they would refrain from sending for a while even though they had something to send.

One variation is slotted ALOHA, which divides time into intervals which I will call "epochs." Each epoch will have duration 1.0, so epoch 1 extends from time 0.0 to 1.0, epoch 2 extends from 1.0 to 2.0 and so on. In the version we will consider here, in each epoch, if a node is "active," i.e. has a message to send, it will either send or refrain from sending, with probability p and 1-p, respectively. The value of p is set by the designer of the network. (Real Ethernet hardware does something like this, using a random number generator inside the chip.)

The other parameter q in our model is the probability that a node which had been "inactive" generates a message during an epoch, and thus becomes "active." Think of what happens when you are at a computer.
You are not typing constantly, and when you are not typing, the time until you hit a key again will be random. Our parameter q models that randomness.

Let n be the number of nodes, which we'll assume for simplicity is two.

Assume also for simplicity that the timing is as follows. Arrival of a new message happens in the middle of an epoch, and the decision as to whether to send versus back off is made near the end of an epoch, say 90% into the epoch.

For example, say that at the beginning of the epoch which extends from time 15.0 to 16.0, node A has something to send but node B does not. At time 15.5, node B will either generate a message to send or not, with probability q and 1-q, respectively. Suppose B does generate a new message. At time 15.9, node A will either try to send or refrain, with probability p and 1-p, and node B will do the same. Suppose A refrains but B sends. Then B's transmission will be successful, and at the start of epoch 16 B will be inactive, while node A will still be active. On the other hand, suppose both A and B try to send at time 15.9; both will fail, and thus both will be active at time 16.0, and so on. Be sure to keep in mind that in our simple model here, during the time a node is active, it won't generate any additional new messages.

Let's observe the network for two epochs, epoch 1 and epoch 2. Assume that the network consists of just two nodes, called node 1 and node 2, both of which start out active. Let X1 and X2 denote the numbers of active nodes at the very end of epochs 1 and 2, after possible transmissions. We'll take p to be 0.4 and q to be 0.8 in this example.

Let's find P(X1 = 2), the probability that X1 = 2, and then get to the main point, which is to ask what we really mean by this probability.

How could X1 = 2 occur? There are two possibilities:

* both nodes try to send; this has probability p^2
* neither node tries to send; this has probability (1-p)^2

Thus

P(X1 = 2) = p^2 + (1-p)^2 = 0.52                                   (1.1)

1.2 Basic Ideas of Probability

1.2.1 The Crucial Notion of a Repeatable Experiment

It's crucial to understand what that 0.52 figure really means in a practical sense. To this end, let's put the ALOHA example aside for a moment, and consider the "experiment" consisting of rolling two dice, say a blue one and a yellow one. Let X and Y denote the number of dots we get on the blue and yellow dice, respectively, and consider the meaning of P(X + Y = 6) = 5/36.

In the mathematical theory of probability, we talk of a sample space, which consists of the possible outcomes (X,Y), seen in Table 1.1. In a theoretical treatment, we place weights of 1/36 on each of the points in the space, reflecting the fact that each of the 36 points is equally likely, and then say, "What we mean by P(X + Y = 6) = 5/36 is that the outcomes (1,5), (2,4), (3,3), (4,2), (5,1) have total weight 5/36."

1,1   1,2   1,3   1,4   1,5   1,6
2,1   2,2   2,3   2,4   2,5   2,6
3,1   3,2   3,3   3,4   3,5   3,6
4,1   4,2   4,3   4,4   4,5   4,6
5,1   5,2   5,3   5,4   5,5   5,6
6,1   6,2   6,3   6,4   6,5   6,6

Table 1.1: Sample Space for the Dice Example

Though the notion of sample space is presented in every probability textbook, and is central to the advanced theory of probability, most probability computations do not rely on explicitly writing down a sample space. In this particular example it is useful for us as a vehicle for explaining the concepts, but we will NOT use it much.
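If you like, the 5/36 figure can be verified by brute force in R, by enumerating the sample space and totaling the weights of the qualifying points. Here is a minimal sketch (expand.grid() is a base-R function that generates all combinations; the variable name sspace is arbitrary):

# enumerate the 36-point sample space; each point carries weight 1/36
sspace <- expand.grid(x=1:6, y=1:6)
# total weight of the points satisfying X + Y = 6
sum(sspace$x + sspace$y == 6) / nrow(sspace)  # 5/36, about 0.139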
But the intuitive notion, which is FAR more important, of what P(X + Y = 6) = 5/36 means is the following. Imagine doing the experiment many, many times, recording the results in a large notebook:

* Roll the dice the first time, and write the outcome on the first line of the notebook.
* Roll the dice the second time, and write the outcome on the second line of the notebook.
* Roll the dice the third time, and write the outcome on the third line of the notebook.
* Roll the dice the fourth time, and write the outcome on the fourth line of the notebook.
* Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.

The first 9 lines of the notebook might look like Table 1.2.

notebook line   outcome             blue+yellow = 6?
1               blue 2, yellow 6    No
2               blue 3, yellow 1    No
3               blue 1, yellow 1    No
4               blue 4, yellow 2    Yes
5               blue 1, yellow 1    No
6               blue 3, yellow 4    No
7               blue 5, yellow 1    Yes
8               blue 3, yellow 6    No
9               blue 2, yellow 5    No

Table 1.2: Notebook for the Dice Problem

Here 2/9 of these lines say Yes. But after many, many repetitions, approximately 5/36 of the lines will say Yes. For example, after doing the experiment 720 times, approximately (5/36) × 720 = 100 lines will say Yes.

This is what probability really is: In what fraction of the lines does the event of interest happen? It sounds simple, but if you always think about this "lines in the notebook" idea, probability problems are a lot easier to solve. And it is the fundamental basis of computer simulation.
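In fact, the notebook process itself is easy to mimic on a computer, anticipating the simulation material in Section 1.2.6. In the minimal sketch below (the variable names are arbitrary), each iteration of the loop plays the role of one line of the notebook:

# simulate many "notebook lines" for the two-dice experiment
nreps <- 100000
yescount <- 0  # number of lines saying Yes
for (line in 1:nreps) {
   blue <- sample(1:6,1)    # roll the blue die
   yellow <- sample(1:6,1)  # roll the yellow die
   # does this line say Yes for the event blue+yellow = 6?
   if (blue + yellow == 6) yescount <- yescount + 1
}
print(yescount/nreps)  # long-run proportion of Yes lines

The printed proportion will vary a little from run to run, but for large nreps it will be near 5/36 = 0.139.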
1.2.2 Our Definitions

These definitions are intuitive, rather than rigorous math, but intuition is what we need. Keep in mind that we are making definitions below, not listing properties.

* We assume an "experiment" which is (at least in concept) repeatable. The experiment of rolling two dice is repeatable, and even the ALOHA experiment is so. (We simply watch the network for a long time, collecting data on pairs of consecutive epochs in which there are two active stations at the beginning.) On the other hand, the econometricians, in forecasting 2009, cannot "repeat" 2008. Yet all of the econometricians' tools assume that events in 2008 were affected by various sorts of randomness, and we think of repeating the experiment in a conceptual sense.

* We imagine performing the experiment a large number of times, recording the result of each repetition on a separate line in a notebook.

* We say A is an event for this experiment if it is a possible boolean (i.e. yes-or-no) outcome of the experiment. In the above example, here are some events:
   * X+Y = 6
   * X = 1
   * Y = 3
   * X-Y = 4

* A random variable is a numerical outcome of the experiment, such as X and Y here, as well as X+Y, 2XY and even sin(XY).

* For any event of interest A, imagine a column on A in the notebook. The kth line in the notebook, k = 1,2,3,..., will say Yes or No, depending on whether A occurred or not during the kth repetition of the experiment. For instance, we have such a column in our table above, for the event {blue+yellow = 6}.

* For any event of interest A, we define P(A) to be the long-run proportion of lines with Yes entries.

* For any events A, B, imagine a new column in our notebook, labeled "A and B." In each line, this column will say Yes if and only if there are Yes entries for both A and B. P(A and B) is then the long-run proportion of lines with Yes entries in the new column labeled "A and B." (In most textbooks, what we call "A and B" here is written A ∩ B, indicating the intersection of two sets in the sample space. But again, we do not take a sample space point of view here.)

* For any events A, B, imagine a new column in our notebook, labeled "A or B." In each line, this column will say Yes if and only if at least one of the entries for A and B says Yes. (In the sample space approach, this is written A ∪ B.)

* For any events A, B, imagine a new column in our notebook, labeled "A | B" and pronounced "A given B." In each line:
   * This new column will say "NA" ("not applicable") if the B entry is No.
   * If it is a line in which the B column says Yes, then this new column will say Yes or No, depending on whether the A column says Yes or No.

Think of probabilities in this "notebook" context:

* P(A) means the long-run proportion of lines in the notebook in which the A column says Yes.
* P(A or B) means the long-run proportion of lines in the notebook in which the A-or-B column says Yes.
* P(A and B) means the long-run proportion of lines in the notebook in which the A-and-B column says Yes.
* P(A | B) means the long-run proportion of lines in the notebook in which the A | B column says Yes, among the lines which do NOT say NA.

A hugely common mistake is to confuse P(A and B) and P(A | B). This is where the notebook view becomes so important. Compare the quantities P(X = 1 and S = 6) = 1/36 and P(X = 1 | S = 6) = 1/5, where S = X+Y (think of adding an S column to the notebook too):

* After a large number of repetitions of the experiment, approximately 1/36 of the lines of the notebook will have the property that both X = 1 and S = 6 (since X = 1 and S = 6 is equivalent to X = 1 and Y = 5).

* After a large number of repetitions of the experiment, if we look only at the lines in which S = 6, then among those lines, approximately 1/5 of those lines will show X = 1.

The quantity P(A | B) is called the conditional probability of A, given B.

Note that and has higher logical precedence than or. For example, P(A and B or C) means P[(A and B) or C]. Also, not has higher precedence than and.

Here are some more very important definitions and properties:

* Suppose A and B are events such that it is impossible for them to occur in the same line of the notebook. They are said to be disjoint events. Then

P(A or B) = P(A) + P(B)                                            (1.2)

Again, this terminology disjoint stems from the set-theoretic sample space approach, where it means that A ∩ B = ∅. That mathematical terminology works fine for our dice example, but in my experience people have major difficulty applying it correctly in more complicated problems. This is another illustration of why I put so much emphasis on the "notebook" framework.

* If A and B are not disjoint, then

P(A or B) = P(A) + P(B) - P(A and B)                               (1.3)

In the disjoint case, that subtracted term is 0, so (1.3) reduces to (1.2).

* Events A and B are said to be stochastically independent, usually just stated as independent (the term stochastic is just a fancy synonym for random), if

P(A and B) = P(A) P(B)                                             (1.4)

In calculating an "and" probability, how does one know whether the events are independent? The answer is that this will typically be clear from the problem. If we toss the blue and yellow dice, for instance, it is clear that one die has no impact on the other, so events involving the blue die are independent of events involving the yellow die. On the other hand, in the ALOHA example, it's clear that events involving X1 are NOT independent of those involving X2.

* If A and B are not independent, the equation (1.4) generalizes to

P(A and B) = P(A) P(B|A)                                           (1.5)

Note that if A and B actually are independent, then P(B|A) = P(B), and (1.5) reduces to (1.4).
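The contrast between P(X = 1 and S = 6) and P(X = 1 | S = 6) can be seen concretely by simulation too. The sketch below (again just an illustrative toy, with arbitrary variable names) estimates both quantities; note how the conditional version restricts attention to the lines in which S = 6, in effect discarding the NA lines:

nreps <- 100000
x <- sample(1:6, nreps, replace=TRUE)  # blue die, one value per notebook line
y <- sample(1:6, nreps, replace=TRUE)  # yellow die
s <- x + y
# P(X = 1 and S = 6): proportion among ALL the lines
print(mean(x == 1 & s == 6))  # near 1/36, about 0.028
# P(X = 1 | S = 6): proportion among only the lines having S = 6
print(mean(x[s == 6] == 1))   # near 1/5 = 0.20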
1.2.3 Basic Probability Computations: ALOHA Network Example

Please keep in mind that the notebook idea is simply a vehicle to help you understand what the concepts really mean. This is crucial for your intuition and your ability to apply this material in the real world. But the notebook idea is NOT for the purpose of calculating probabilities. Instead, we use the properties of probability, as seen in the following.

Let's look at all of this in the ALOHA context. In Equation (1.1) we found that

P(X1 = 2) = p^2 + (1-p)^2 = 0.52                                   (1.6)

How did we get this? Let Ci denote the event that node i tries to send, i = 1,2. Then using the definitions above, our steps would be

P(X1 = 2) = P(C1 and C2 or not C1 and not C2)                      (1.7)
          = P(C1 and C2) + P(not C1 and not C2)   (from (1.2))     (1.8)
          = P(C1) P(C2) + P(not C1) P(not C2)     (from (1.4))     (1.9)
          = p^2 + (1-p)^2                                          (1.10)

Here are the reasons for these steps:

(1.7): We listed the ways in which the event {X1 = 2} could occur.

(1.8): Write G = C1 and C2, H = D1 and D2, where Di = not Ci, i = 1,2. Then the events G and H are clearly disjoint; if in a given line of our notebook there is a Yes for G, then definitely there will be a No for H, and vice versa.

(1.9): The two nodes act physically independently of each other. Thus the events C1 and C2 are stochastically independent, so we applied (1.4). Then we did the same for D1 and D2.

Note carefully that in Equation (1.7), our first step was to "break big events down into small events," in this case breaking the event {X1 = 2} down into the events C1 and C2 and D1 and D2. This is a central part of most probability computations. In calculating a probability, ask yourself, "How can it happen?"

Good tip: When you solve problems like this, write out the and and or conjunctions like I've done above. This helps!

Now, what about P(X2 = 2)? Again, we break big events down into small events, in this case according to the value of X1:

P(X2 = 2) = P(X1 = 0 and X2 = 2 or X1 = 1 and X2 = 2 or X1 = 2 and X2 = 2)
          = P(X1 = 0 and X2 = 2) + P(X1 = 1 and X2 = 2) + P(X1 = 2 and X2 = 2)   (1.11)

Since X1 cannot be 0, that first term, P(X1 = 0 and X2 = 2), is 0. To deal with the second term, P(X1 = 1 and X2 = 2), we'll use (1.5). Due to the time-sequential nature of our experiment here, it is natural (but certainly not "mandated," as we'll often see situations to the contrary) to take A and B to be {X1 = 1} and {X2 = 2}, respectively. So, we write

P(X1 = 1 and X2 = 2) = P(X1 = 1) P(X2 = 2 | X1 = 1)                (1.12)

To calculate P(X1 = 1), we use the same kind of reasoning as in Equation (1.1). For the event in question to occur, either node A would send and B wouldn't, or A would refrain from sending and B would send. Thus

P(X1 = 1) = 2p(1-p) = 0.48                                         (1.13)

Now we need to find P(X2 = 2 | X1 = 1). This again involves breaking big events down into small ones. If X1 = 1, then X2 = 2 can occur only if both of the following occur:

* Event A: Whichever node was the one to successfully transmit during epoch 1 (and we are given that there indeed was one, since X1 = 1) now generates a new message.
* Event B: During epoch 2, no successful transmission occurs, i.e. either they both try to send or neither tries to send.

Recalling the definitions of p and q in Section 1.1, we have that

P(X2 = 2 | X1 = 1) = q[p^2 + (1-p)^2] = 0.41                       (1.14)

Thus P(X1 = 1 and X2 = 2) = 0.48 × 0.41 = 0.20.

We go through a similar analysis for P(X1 = 2 and X2 = 2): We recall that P(X1 = 2) = 0.52 from before, and find that P(X2 = 2 | X1 = 2) = 0.52 as well. So we find P(X1 = 2 and X2 = 2) to be 0.52^2 = 0.27. Putting all this together, we find that P(X2 = 2) = 0.47.

Let's do one more; let's find P(X1 = 1 | X2 = 2). [Pause a minute here to make sure you understand that this is quite different from P(X2 = 2 | X1 = 1).] From (1.5), we know that

P(X1 = 1 | X2 = 2) = P(X1 = 1 and X2 = 2) / P(X2 = 2)              (1.15)

We computed both numerator and denominator here before, in Equations (1.12) and (1.11), so we see that P(X1 = 1 | X2 = 2) = 0.20/0.47 = 0.43.

1.2.4 Bayes' Theorem

Following (1.15) above, we noted that the ingredients had already been computed, in (1.12) and (1.11). If we go back to the derivations in those two equations and substitute in (1.15), we have

P(X1 = 1 | X2 = 2)
 = P(X1 = 1 and X2 = 2) / P(X2 = 2)                                (1.16)
 = P(X1 = 1 and X2 = 2) / [P(X1 = 1 and X2 = 2) + P(X1 = 2 and X2 = 2)]   (1.17)
 = P(X1 = 1) P(X2 = 2 | X1 = 1) / [P(X1 = 1) P(X2 = 2 | X1 = 1) + P(X1 = 2) P(X2 = 2 | X1 = 2)]   (1.18)

Looking at this in more generality, for events A and B we would find that

P(A | B) = P(A) P(B|A) / [P(A) P(B|A) + P(not A) P(B | not A)]     (1.19)

This is known as Bayes' Theorem or Bayes' Rule.

1.2.5 ALOHA in the Notebook Context

Think of doing the ALOHA "experiment" many, many times.

* Run the network for two epochs, starting with both nodes active, the first time, and write the outcome on the first line of the notebook.
* Run the network for two epochs, starting with both nodes active, the second time, and write the outcome on the second line of the notebook.
* Run the network for two epochs, starting with both nodes active, the third time, and write the outcome on the third line of the notebook.
* Run the network for two epochs, starting with both nodes active, the fourth time, and write the outcome on the fourth line of the notebook.
* Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.

The first seven lines of the notebook might look like Table 1.3.

notebook line   X1 = 2   X2 = 2   X1 = 2 and X2 = 2   X2 = 2 | X1 = 2
1               Yes      No       No                  No
2               No       No       No                  NA
3               Yes      Yes      Yes                 Yes
4               Yes      No       No                  No
5               Yes      Yes      Yes                 Yes
6               No       No       No                  NA
7               No       Yes      No                  NA

Table 1.3: Top of Notebook for Two-Epoch ALOHA Experiment

We see that:

* Among those first seven lines in the notebook, 4/7 of them have X1 = 2. After many, many lines, this proportion will be approximately 0.52.

* Among those first seven lines in the notebook, 3/7 of them have X2 = 2. After many, many lines, this proportion will be approximately 0.47. (Don't make anything of the fact that these probabilities nearly add up to 1.)

* Among those first seven lines in the notebook, 2/7 of them have X1 = 2 and X2 = 2. After many, many lines, this proportion will be approximately 0.27.

* Among the first seven lines in the notebook, four of them do not say NA in the X2 = 2 | X1 = 2 column. Among these four lines, two say Yes, a proportion of 2/4. After many, many lines, this proportion will be approximately 0.52.
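By the way, every number in Sections 1.2.3 and 1.2.4 follows directly from p = 0.4 and q = 0.8, so the arithmetic is easy to check in R. Here is one such check, a small sketch separate from the simulation code below (the variable names are arbitrary); the slight discrepancies from 0.20, 0.47 and 0.43 above are just rounding in the hand computations:

p <- 0.4; q <- 0.8
px1eq2 <- p^2 + (1-p)^2        # P(X1 = 2), Equation (1.1); 0.52
px1eq1 <- 2 * p * (1-p)        # P(X1 = 1), Equation (1.13); 0.48
pcond1 <- q * (p^2 + (1-p)^2)  # P(X2 = 2 | X1 = 1), Equation (1.14); 0.416
pcond2 <- p^2 + (1-p)^2        # P(X2 = 2 | X1 = 2); 0.52
# P(X2 = 2), breaking down according to the value of X1, as in (1.11):
px2eq2 <- px1eq1*pcond1 + px1eq2*pcond2  # about 0.47
# P(X1 = 1 | X2 = 2), via Bayes' Rule, as in (1.18):
print(px1eq1*pcond1 / px2eq2)            # about 0.43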
1.2.6 Simulation

To simulate whether a simple event occurs or not, we typically use the R function runif(). This function generates random numbers from the interval (0,1), with all the points inside being equally likely. So for instance the probability that the function returns a value in (0,0.5) is 0.5. Thus here is code to simulate tossing a coin:

if (runif(1) < 0.5) heads <- TRUE else heads <- FALSE

The argument 1 means we wish to generate just one random number from the interval (0,1).

1.2.6.1 Simulation of the ALOHA Example

Following is a computation via simulation of the approximate values of P(X1 = 2), P(X2 = 2) and P(X2 = 2 | X1 = 1), using the R statistical language, the language of choice of professional statisticians. It is open source, it's statistically correct (not all statistical packages are so), has dazzling graphics capabilities, etc. To learn about the syntax (e.g. <- as the assignment operator), see my introduction to R for programmers at http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf.

# finds P(X1 = 2), P(X2 = 2) and P(X2 = 2|X1 = 1) in ALOHA example
sim <- function(p,q,nreps) {
   countx2eq2 <- 0
   countx1eq1 <- 0
   countx1eq2 <- 0
   countx2eq2givx1eq1 <- 0
   # simulate nreps repetitions of the experiment
   for (i in 1:nreps) {
      numsend <- 0  # no messages sent so far
      # simulate A and B's decision on whether to send in epoch 1
      for (i in 1:2)
         if (runif(1) < p) numsend <- numsend + 1
      if (numsend == 1) X1 <- 1
      else X1 <- 2
      if (X1 == 2) countx1eq2 <- countx1eq2 + 1
      # now simulate epoch 2
      # if X1 = 1 then one node may generate a new message
      numactive <- X1
      if (X1 == 1 && runif(1) < q) numactive <- numactive + 1
      # send?
      if (numactive == 1)
         if (runif(1) < p) X2 <- 0
         else X2 <- 1
      else {  # numactive = 2
         numsend <- 0
         for (i in 1:2)
            if (runif(1) < p) numsend <- numsend + 1
         if (numsend == 1) X2 <- 1
         else X2 <- 2
      }
      if (X2 == 2) countx2eq2 <- countx2eq2 + 1
      if (X1 == 1) {  # do tally for the cond. prob.
         countx1eq1 <- countx1eq1 + 1
         if (X2 == 2) countx2eq2givx1eq1 <- countx2eq2givx1eq1 + 1
      }
   }
   # print results
   cat("P(X1 = 2):", countx1eq2/nreps, "\n")
   cat("P(X2 = 2):", countx2eq2/nreps, "\n")
   cat("P(X2 = 2 | X1 = 1):", countx2eq2givx1eq1/countx1eq1, "\n")
}

Note that each of the nreps iterations of the main for loop is analogous to one line in our hypothetical notebook. So, to find (the approximate value of) P(X1 = 2), we divide the count of the number of times X1 = 2 occurred by the number of iterations.

Note especially that the way we calculated P(X2 = 2 | X1 = 1) was to count the number of times X2 = 2, among those times that X1 = 1, just like in the notebook case.

Remember, simulation results are only approximate. The larger the value we use for nreps, the more accurate our simulation results are likely to be. The question of how large we need to make nreps will be addressed in a later chapter.

1.2.6.2 Rolling Dice

If we roll three dice, what is the probability that their total is 8? We could count all the possibilities, or we could get an approximate answer via simulation:

# roll d dice; find P(total = k)

# simulate roll of one die; the possible return values are 1,2,3,4,5,6,
# all equally likely
roll <- function() return(sample(1:6,1))

probtotk <- function(d,k,nreps) {
   count <- 0
   # do the experiment nreps times
   for (rep in 1:nreps) {
      sum <- 0
      # roll d dice and find their sum
      for (j in 1:d) sum <- sum + roll()
      if (sum == k) count <- count + 1
   }
   return(count/nreps)
}

The call to the built-in R function sample() here says to take a sample of size 1 from the sequence of numbers 1,2,3,4,5,6. That's just what we want to simulate the rolling of a die. The code

for (j in 1:d) sum <- sum + roll()

then simulates the tossing of a die d times, and computes the sum.
See the examples in R's online documentation. Here's how we could do the 1-king problem via simulation: 1 # use simulation to find P(1 king) when deal a 5-card hand from a 2 # standard deck 3 4 # think of the 52 cards as being labeled 1-52, with the 4 kings having 5 # numbers 1-4 6 7 sim <- function(nreps) { 8 countlking <- 0 # count of number of hands with 1 king 9 for (rep in 1:nreps) { 10 hand <- sample(1:52,5,replace=FALSE) # deal hand 11 kings <- intersect(1:4,hand) # find which kings, if any, are in hand 12 if (length(kings) == 1) countlking <- countlking + 1 13 } 14 print(countlking/nreps) 15 } 1.2.7.2 "Association Rules" in Data Mining The field of data mining is a branch of computer science, but it is largely an application of various statistical methods to really huge databases. One of the applications of data mining is called the market basket problem. Here the data consists of records of sales transactions, say of books at Amazon.com. The business' goal is exemplified by Amazon's suggestion to customers that "Patrons who bought this book also tended to buy the following books."6 The goal of the market basket problem is to sift through sales transaction records to produce association rules, patterns in which sales of some combinations of books imply likely sales of other related books. The notation for association rules is A, B - C, D, E, meaning in the book sales example that customers who bought books A and B also tended to buy books C, D and E. Here A and B are called the antecedents of the rule, and C, D and E are called the consequents. Let's suppose here that we are only interested in rules with a single consequent. We will present some methods for finding good rules in another chapter, but for now, let's look at how many possible rules there are. Obviously, it would be impractical to use rules with a large number of antecedents.7. Suppose the business has a total of 20 products available for sale. What percentage of potential rules have three or fewer antecedents?8 6Some Customers appreciate such tips, while others view it as insulting or an invasion of privacy, but we'll not address such issues here. 7In addition, there are serious statistical problems that would arise, to be discussed in another chapter. 8Be sure to note that this is also a probability, namely the probability that a randomly chosen rule will have three or fewer antecedents.  14 CHAPTER 1. DISCRETE PROBABILITY MODELS For each k = 1,...,19, there are (k) possible sets of antecedents, thus this many possible rules. The fraction of potential rules using three or fewer antecedents is then E3k=l (20 k) . (201 k) 20 Ek19 1 ( k) . (201 k) 23180 10485740 (1.22) So, this is just scratching the surface. And note that with only 20 products, there are already over ten million possible rules. With 50 products, this number is 2.81 x 1016! Imagine what happens in a case like Amazon, with millions of products. These staggering numbers show what a tremendous challenge data miners face. 1.3 Discrete Random Variables In our dice example, the random variable X could take on six values in the set { 1,2,3,4,5,6}. This is a finite set. In the ALOHA example, X1 and X2 each take on values in the set {0,1,2}, again a finite set.9 Now think of another experiment, in which we toss a coin until we get heads. Let N be the number of tosses needed. Then N can take on values in the set { 1,2,3,... } This is a countably infinite set. 
Now think of one more experiment, in which we throw a dart at the interval (0,1), and assume that the place that is hit, R, can take on any of the values between 0 and 1. This is an uncountably infinite set. We say that X, X1, X2 and N are discrete random variables, while R is continuous. We'll discuss continu- ous random variables in a later chapter. 1.4 Independence, Expected Value and Variance The concepts and properties introduced in this section form the very core of probability and statistics. Except for some specific calculations, these apply to both discrete and continuous random variablescalculations, these apply to both discrete and continuous random variables 1.4.1 Independent Random Variables We already have a definition for the independence of events; what about independence of random variables? Random variables U and V are said to be independent if for any sets I and J, the events {X is in I} and {Y is in J} are independent, i.e. P(X is in I and Y is in J) = P(X is in I) P(Y is in J). 9We could even say that X1 takes on only values in the set { 1,2}, but if we were to look at many epochs rather than just two, it would be easier not to make an exceptional case.  1.4. INDEPENDENCE, EXPECTED VALUE AND VARIANCE 15 1.4.2 Expected Value 1.4.2.1 Intuitive Definition Consider a repeatable experiment with random variable X. We say that the expected value of X is the long-run average value of X, as we repeat the experiment indefinitely. In our notebook, there will be a column for X. Let Xi denote the value of X in the ith row of the notebook. Then the long-run average of X is . X1 + ... + Xn hmn (1.23) m-oo n Suppose for instance our experiment is to toss 10 coins. Let X denote the number of heads we get out of 10. We might get four heads in the first repetition of the experiment, i.e. X1 = 4, seven heads in the second repetition, so X2 = 7, and so on. Intuitively, the long-run average value of X will be 5. (This will be proven below.) Thus we say that the expected value of X is 5, and write E(X) = 5. 1.4.2.2 Computation and Properties of Expected Value Continuing the coin toss example above, let Kin be the number of times the value i occurs among X1, ..., Xn, i= 0,...,10, n = 1,2,3,... For instance, K4,20 is the number of times we get four heads, in the first 20 repetitions of our experiment. Then E(X) = limX1+"+X (1.24) n-oo n li 0-Kon+1-Kin+2-K2n...+10-K1o,n (1.25) 10 Zi -limK (1.26) i=0 But limno i is the long-run proportion of the time that X = i. In other words, it's P(X = i)! So, 10 E(X) = ji - P(X = i) (1.27) i=0 So in general, the expected value of a discrete random variable X which takes value in the set A is E(X) =ZcP(X =c) (1.28) ce A Note that (1.28) is the formula we'll use. The preceding equations were derivation, to motivate the formula. Note too that 1.28 is not the definition of expected value; that was in 1.23. It is quite important to distinguish between all of these, in terms of goals.  16 CHAPTER 1. DISCRETE PROBABILITY MODELS It will be shown in Section 1.5.2.2 that in our example above in which X is the number of heads we get in 10 tosses of a coin, P(X i) (10)0.5(1 - 0.5)10- (1.29) So 10 E(X) = (1010.5(1 - 0.5)10-i (1.30) i=0 It turns out that E(X) = 5. For X in our dice example, 6 E(X) Sc. - =3.5 (1.31) It is customary to use capital letters for random variables, e.g. X here, and lower-case letters for values taken on by a random variable, e.g. c here. Please adhere to this convention. 
By the way, it is also customary to write EX instead of E(X), whenever removal of the parentheses does not cause any ambiguity. An example in which it would produce ambiguity is E(U2). The expression EU2 might be taken to mean either E(U2), which is what we want, or (EU)2, which is not what we want. For S = X+Y in the dice example, 1 2 3 1 E(S) = 2 - - + 36 + 4.- + ...12 -3 = 7 (1.32) 36 36 36 36 In the case of N, tossing a coin until we get a head: E(N) = c - = 2 (1.33) (We will not go into the details here concerning how the sum of this particular infinite series is computed.) Some people like to think of E(X) using a center of gravity analogy. Forget that analogy! Think notebook! Intuitively, E(X) is the long-run average value of X among all the lines of the notebook. So for instance in our dice example, E(X) = 3.5, where X was the number of dots on the blue die, means that if we do the experiment thousands of times, with thousands of lines in our notebook, the average value of X in those lines will be about 3.5. With S = X+Y, B(S) = 7. This means that in the long-run average in column S in Table 1.4 is 7. Of course, by symmetry, E(Y) will be 3.5 too, where Y is the number of dots showing on the yellow die. That means we wasted our time calculating in Equation (1.32); we should have realized beforehand that B(S) is 2 x 3.5 =7.  1.4. INDEPENDENCEEXPECTED VALUE AND VARIANCE 17 notebook line outcome blue+yellow = 6? S 1 blue 2, yellow 6 No 8 2 blue 3, yellow 1 No 4 3 blue 1, yellow 1 No 2 4 blue 4, yellow 2 Yes 6 5 blue 1, yellow 1 No 2 6 blue 3, yellow 4 No 7 7 blue 5, yellow1 Yes 6 8 blue 3, yellow 6 No 9 9 blue 2, yellow 5 No 7 Table 1.4: Expanded Notebook for the Dice Problem In other words, for any random variables U and V, the expected value of a new random variable D = U+V is the sum of the expected values of U and V: E(U + V) =E(U) + E(V) (1.34) Note carefully that U and V do NOT need to be independent random variables for this relation to hold. You should convince yourself of this fact intuitively by thinking about the notebook notion. Say we look at 10000 lines of the notebook, which has columns for the values of U, V and U+V. It makes no difference whether we average U+V in that column, or average U and V in their columns and then add-either way, we'll get the same result. While you are at it, convince yourself that E(aU + b) =aE(U) + b (1.35) for any constants a and b. For instance, say U is temperature in Celsius. Then the temperature in Fahrenheit is W = 9U + 32. So, W is a new random variable, and we can get is expected from that of U by using (1.35) with a =9 andb= 32. But if U and V are independent, then E(UV) = EU -EV (1.36) In the dice example, for instance, let D denote the product of the numbers of blue dots and yellow dots, i.e. D = XY. Then E(D) = 3.52 = 12.25 (1.37) Consider a function go of one variable, and let W = g(X). W is then a random variable too. Say X takes on  18 CHAPTER 1. DISCRETE PROBABILITY MODELS values in A, as in (1.28). Then W takes on values in B = {g(c) : ceA}. Define Ad= {c: cE A,g(c) d} (1.38) Then P(W = d) = P(X EAd) (1.39) so E(W) Z= dP(W = d) (1.40) deB = dZ P(X = c) (1.41) dEB cEAd = g(c)P(X = c) (1.42) cEA The properties of expected value discussed above are key to the entire remainder of this book. You should notice immediately when you are in a setting in which they are applicable. For instance, if you see the expected value of the sum of two random variables, you should instinctively think of (1.34 right away. 
1.4.2.3 Casinos, Insurance Companies and "Sum Users," Compared to Others The expected value is intended as a measure of central tendency, i.e. as some sort of definition of the probablistic "middle" in the range of a random variable. It plays an absolutely central role in probability and statistics. Yet one should understand its limitations. First, note that the term expected value itself is a misnomer. We do not expect W to be 91/6 in this last example; in fact, it is impossible for W to take on that value. Second, the expected value is what we call the mean in everyday life. And the mean is terribly overused. Consider, for example, an attempt to describe how wealthy (or not) people are in the city of Davis. If suddenly Bill Gates were to move into town, that would skew the value of the mean beyond recognition. Even without Gates, there is a question as to whether the mean has that much meaning. More subtly than that, there is the basic question of what the mean means. What, for example, does Equation (1.23) mean in the context of people's incomes in Davis? We would sample a person at random and record his/her income as X1. Then we'd sample another person, to get X2, and so on. Fine, but in that context, what would (1.23) mean? The answer is, not much. For a casino, though, (1.23) means plenty. Say X is the amount a gambler wins on a play of a roulette wheel, and suppose (1.23) is equal to $1.88. Then after, say, 1000 plays of the wheel (not necessarily by the same gambler), the casino knows it will have paid out a total about about $1,880. So if the casino charges, say  1.4. INDEPENDENCE, EXPECTED VALUE AND VARIANCE 19 $1.95 per play, it will have made a profit of about $70 over those 1000 plays. It might be a bit more or less than that amount, but the casino can be pretty sure that it will be around $70, and they can plan their business accordingly. The same principle holds for insurance companies, concerning how much they pay out in claims. With a large number of customers, they know ("expect"!) approximately how much they will pay out, and thus can set their premiums accordingly. The key point in the casino and insurance companies examples is that they are interested in totals, e.g. total payouts on a blackjack table over a month's time, or total insurance claims paid in a year. Another example might be the number of defectives in a batch of computer chips; the manufacturer is interested in the total number of defectives chips produced, say in a month. By contrast, in describing how wealthy people of a town are, the total income of all the residents is not relevant. Similarly, in describing how well students did on an exam, the sum of the scores of all the students doesn't tell us much. A better description might involve percentiles, including the 50th percentile, the median. Nevertheless, the mean has certain mathematical properties, such as (1.34), that have allowed the rich devel- opment of the fields of probability and statistics over the years. The median, by contrast, does not have nice mathematical properties. So, the mean has become entrenched as a descriptive measure, and we will use it often. 1.4.3 Variance While the expected value tells us the average value a random variable takes on, we also need a measure of the random variable's variability-how much does it wander from one line of the notebook to another? In other words, we want a measure of dispersion. 
The classical measure is variance, defined to be the mean squared difference between a random variable and its mean: Var(U) = E[(U - EU)2] (1.43) For X in the die example, this would be Var(X) = E[(X - 3.5)2] (1.44) To evaluate this, apply (1.42) with g(c) = (c - 3.5)2: 6 Var(X) =Z(c - 3.5)2 -= 2.92 (1.45) You can see that variance does indeed give us a measure of dispersion. If the values of U are mostly clustered near its mean, the variance will be small; if there is wide variation in U, the variance will be large.  20 CHAPTER 1. DISCRETE PROBABILITY MODELS The properties of E in (1.34) and (1.35) can be used to show that Var(U) = E(U2) - (EU)2 (1.46) The term E(U2) is again evaluated using (1.42). Thus for example, if X is the number of dots which come up when we roll a die, and W = X2, then E(W)= i2j _ _(1.47) i=1 An important property of variance is that Var(cU) = c2Var(U) (1.48) for any random variable U and constant c. It should make sense to you: If we multiply a random variable by 5, say, then its average squared distance to its mean should increase by a factor of 25. And shifting data over by a constant does not change the amount of variation in them, so Var(cU + d) = c2Var(U) (1.49) for any constant d. The square root of the variance is called the standard deviation. The squaring in the definition of variance produces some distortion, by exaggerating the importance of the larger differences. It would be more natural to use the mean absolute deviation (MAD), E(|U - EUI). However, this is less tractable mathematically, so the statistical pioneers chose to use the mean squared difference, which lends itself to lots of powerful and beautiful math, in which the Pythagorean Theorem pops up in abstract vector spaces. (See Section 3.9.2 for details.) As with expected values, the properties of variance discussed above, and also in Section 3.2.1 below, are key to the entire remainder of this book. You should notice immediately when you are in a setting in which they are applicable. For instance, if you see the variance of the sum of two random variables, you should instinctively think of (1.61 right away. 1.4.4 Is a Variance of X Large or Small? Recall that the variance of a random variable X is suppose to be a measure of the dispersion of X, meaning the amount that X varies from one instance (one line in our notebook) to the next. But if Var(X) is, say, 2.5, is that a lot of variability or not? We will pursue this question here.  1.4. INDEPENDENCE, EXPECTED VALUE AND VARIANCE 1.4.5 Chebychev's Inequality This inequality states that for a random variable X with mean y and variance a2, 21 1 P(IX - pt >c) -c2 C2 (1.50) In other words, X does not often stray more than, say, 3 standard deviations from its mean. This gives some concrete meaning to the concept of variance/standard deviation. To prove (1.50), let's first state and prove Markov's Inequality: For any nonnegative random variable Y, BY P(Y > d) d - d (1.51) To prove (1.51), let Z equal 1 if Y > d, 0 otherwise. Then Y>dZ (1.52) (think of the two cases), so EY > dEZ The right-hand side of (1.53) is dP(Y > d), so (1.51) follows. Now to prove (1.50), define Y = (X - p)2 and set d = c2U2. Then (1.51) says P[(X - p)2 > c 2 2J cU (1.53) (1.54) (1.55) (1.56) the left-hand side of (1.55) is the same as the left-hand side of (1.50). The numerator of the right-hand size of (1.55) is simply Var(X), i.e. a2, so we are done. 
1.4.6 The Coefficient of Variation Continuing our discussion of the magnitude of a variance, look at our remark following (1.50):  22 CHAPTER 1. DISCRETE PROBABILITY MODELS In other words, X does not often stray more than, say, 3 standard deviations from its mean. This gives some concrete meaning to the concept of variance/standard deviation. This suggests that any discussion of the size of Var(X) should relate to the size of E(X). Accordingly, one often looks at the coefficient of variation, defined to be the ratio of the standard deviation to the mean: coef. of var. Var(X) EX (1.57) This is a scale-free measure (e.g. inches divided by inches), and serves as a good way to judge whether a variance is large or not. 1.4.7 Covariance This is a topic we'll cover fully in Chapter 3, but at least introduce here. A measure of the degree to which U and V vary together is their covariance, Cov(U, V) = E[(U - EU)(V - EV)] (1.58) Except for a divisor, this is essentially correlation. If U is usually large at the same time Y is small, for instance, then you can see that the covariance between them witll be negative. On the other hand, if they are usually large together or small together, the covariance will be positive. Again, one can use the properties of E() to show that Cov(U, V) = E(UV) -EU -EV (1.59) Also Var(U + V) = Var(U) + Var(V) + 2Cov(U, V) (1.60) If U and V are independent, then Cov(U,V) = 0 and Var(U + V) = Var(U) + Var(V) (1.61) 1.4.8 A Combinatorial Example A committee of four people is drawn at random from a set of six men and three women. Suppose we are concerned that there may be quite a gender imbalance in the membership of the committee. Toward that end, let M and W denote the numbers of men and women in our committee, and let D = M-W. Let's find E(D).  1.4. INDEPENDENCE, EXPECTED VALUE AND VARIANCE D can take on the values 4-0, 3-1, 2-2 and 1-3, i.e. 4, 2, 0 and -2. So, 23 ED=-2"P(D=-2)+0"P(D=0)+2"P(D=2)+4"P(D=4) (1.62) Now, using reasoning along the lines in Section 1.2.7, we have P(D=-2)=P(M=1andW=3) ( 6)(3) (94) (1.63) After similar calculations for the other probabilities in (1.62), we find the ED = 1.33. If we were to perform this experiment many times, i.e. choose committees again and again, on average we would have a bit more than one more man than women on the committee. 1.4.9 Expected Value, Etc. in the ALOHA Example Finding expected values etc. in the ALOHA example is straightforward. For instance, EX1= 0 P(X1 =0) + 1 - P(X1 =1) + 2 - P(X1 =2)= 1 . 0.48 + 2 . 0.52 =1.52 (1.64) Here is R code to find various values approximately by simulation: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 # finds E(X1), E(X2), Var(X2), Cov(X1,X2) sim <- function(p,q,nreps) { sumxl <- 0 sumx2 <- 0 sumx2sq <- 0 sumxlx2 <- 0 for (i in 1:nreps) { numsend <- 0 for (i in 1:2) if (runif(1) < p) numsend <- num if (numsend == 1) Xl <- 1 else Xl <- 2 numactive <- Xl if (X1 == 1 && runif (1) < q) numact if (numactive == 1) if (runif(1) < p) X2 <- 0 else X2 <- 1 else { # numactive = 2 numsend <- 0 for (i in 1:2) if (runif(1) < p) numsend <-7 if (numsend == 1) X2 <- 1 else X2 <- 2 send + 1 ive numactive + 1 numsend + 1 } sumxl <- sumx2 <- sumx2sq sumxlx2 sumxl + X1 sumx2 + X2 :- sumx2sq + X2^2 :- sumxlx2 + X1*X2 } # print results meanxl <- sumxl /nreps cat ("E (X1) : ", meanxl, "\n")  24 CHAPTER 1. 
DISCRETE PROBABILITY MODELS 33 meanx2 <- sumx2 /nreps 34 cat ("E (X2) : ", meanx2, "\n") 35 cat("Var(X2) :",sumx2sq/nreps - meanx2^2,"\n") 36 cat ("Cov (X1, X2) : ",sumx2/nreps, "\n") 37 } As a check on your understanding so far, you should find at least one of these values by hand, and see if it jibes with the simulation output. 1.4.10 Reconciliation of Math and Intuition (optional section) Here I have been promoting the notebook idea over the sterile, confusing mathematical definitions in the theory of probability. It is worth noting, though, that the theory actually does imply the notebook notion, through a theorem known as the Strong Law of Large Numbers: Consider a random variable U, and a sequence of independent random variables U1, U2, ... which all have the same distribution as U, i.e. they are "repetitions" of the experiment which generates U. Then U1 +...+ Un limU . = E(U) with probability 1 (1.65) n-oo n In other words, the average value of U in all the lines of the notebook will indeed converge to EU. 1.5 Distributions 1.5.1 Basic Notions For the type of random variables we've discussed so far, the distribution of a random variable U is simply a list of all the values it takes on, and their associated probabilities: Example: For X in the dice example, the distribution of X is 1 1 1 1 1 1 {(1, ), (2, ), (3, ), (4, ), (5, ), (6, )} (1.66) Example: In the ALOHA example, distribution of X1 is {(O, 0.00), (1, 0.48), (2, 0.52)} (1.67) Example: In our example in which N is the number of tosses of a coin needed to get the first head, the distribution is {(1 ),(,), (3' -), ...} (1.68)  1.5. DISTRIBUTIONS 25 It is common to express this in functional notation. We define the probability mass function (pmf) of a discrete random variable V, denoted pv, as PV(k) = P(V = k) (1.69) for any value k which V can take on. (Please keep in mind the notation. It is customary to use the lower-case p, with a subscript consisting of the name of the random variable.) Example: In (1.68), P k k=1 2 ... (1.70) Example: In the dice example, which S = X+Y, (I k=2 k 36' 3, k=3 36' 1.5.2 Parameteric Families of pmfs 1.5.2.1 The Geometric Family of Distributions Recall our example of tossing a coin until we get the first head, with N denoting the number of tosses needed. In order for this to take k tosses, we need k-1 tails and then a head. Thus PNk1 We might call getting a head a "success," and refer to a tail as a "failure." Of course, these words don't mean anything; we simply refer to the outcome of interest as ''success.'' Define M to be the number of rolls of a die needed until the number 5 shows up. Then PN(k) (-) -,k = 1, 2 ... (1.73) 6 6  26 CHAPTER 1. DISCRETE PROBABILITY MODELS reflecting the fact that the event {M = k} occurs if we get k-1 non-5s and then a 5. Here "success" is getting a5. The tosses of the coin and the rolls of the die are known as Bernoulli trials, which is a sequence of indepen- dent 1-0-valued random variables B2, i = 1,2,3,... BZ is 1 for success, 0 for failure, with success probability p. For instance, p is 1/2 in the coin case, and 1/6 in the die example. In general, suppose the random variable U is defined to be the number of trials needed to get a success in a sequence of Bernoulli trials. Then pU(k) = (1 - p)kip, k = 1, 2, ... (1.74) Note that there is a different distribution for each value of p, so we call this a parametric family of distri- butions, indexed by the parameter p. We say that U is geometrically distributed with parameter p. 
It can be shown that 1 E((U) = 1.75) p (which should make good intuitive sense to you) and VarU = 2 (1.76) p By the way, if we were to think of an experiment involving a geometric distribution in terms of our notebook idea, the notebook would have an infinite number of columns, one for each Bi. Within each row of the notebook, the BZ entries would be 0 until the first 1, then NA ("not applicable" after that). 1.5.2.2 The Binomial Family of Distributions A geometric distribution arises when we have Bernoulli trials with parameter p, with a variable number of trials (N) but a fixed number of successes (1). A binomial distribution arises when we have the opposite-a fixed number of Bernoulli trials (n) but a variable number of successes (say X).10 For example, say we toss a coin five times, and let X be the number of heads we get. We say that X is binomially distributed with parameters n = 5 and p = 1/2. Let's find P(X = 2). There are many orders in which that could occur, such as HHTTT, TTHHT, HTTHT and so on. Each order has probability 0.52(1 - 0.5)3, and there are (Q orders. Thus PX=2)- 0)0.52(1 - 0.5)3 0)732 =5/16 (1.77) ioNote again the custom of using capital letters for random variables, and lower-case letters for constants.  1.5. DISTRIBUTIONS 27 For general n and p, P(X = k) = (1)pk(i - p)"-k (1.78) So again we have a parametric family of distributions, in this case a family having two parameters, n and p. Let's write X as a sum of those 0-1 Bernoulli variables we used in the discussion of the geometric distribution above: X = Bi (1.79) i=1 where Bi is 1 or 0, depending on whether there is success on the ith trial or not. Then the reader should use our earlier properties of E() and Var() in Section 1.4 to fill in the details in the following derivations of the expected value and variance of a binomial random variable: EX = E(B1 + ..., +Bn) = EB1 + ... + EBn = np (1.80) and Var (X) = Var (B1 + ..., +Bn) = Var (B1) + ... + Var (Bn) = np(1 - p) (1.81) Again, (1.80) should make good intuitive sense to you. 1.5.2.3 The Poisson Family of Distributions Another famous parametric family of distributions is the set of Poisson Distributions, which is used to model unbounded counts. The pmf is P( =k)= , k =0, 1,2, ... (1.82) k! The parameter for the family, A, turns out to be the value of E(X) and also Var(X). The Poisson family is very often used to model count data. For example, if you go to a certain bank every day and count the number of customers who arrive between 11:00 and 11:15 a.m., you will probably find that that distribution is well approximated by a Poisson distribution for some A. 1.5.2.4 The Negative Binomial Family of Distributions Recall that a typical example of the geometric distribution family (Section 1.5.2.1) arises as N, the number of tosses of a coin needed to get our first head. Now generalize that, with N now being the number of tosses  28 CHAPTER 1. DISCRETE PROBABILITY MODELS needed to get our rth head, where r is a fixed value. Let's find P(N = k), k = r, r+1, ... For concreteness, look at the case r = 3, k = 5. In other words, we are finding the probability that it will take us 5 tosses to accumulate 3 heads. 
First note the equivalence of two events: {N = 5} = {2 heads in the first 4 tosses and head on the 5th toss} (1.83) That event described before the "and" corresponds to a binomial probability: P(2 heads in the first 4 tosses) ( ) (1) (1.84) Since the probability of a head on the knth toss is 1/2 and the tosses are independent, we find that P(N = 5) ( 2) (2 -16(1.85) The negative binomial distribution family, indexed by parameters r and p, corresponds to random variables which count the number of independent trials with success probability p needed until we get r successes. The pmf is P(N =k) = (1 -p)k-rpr,k= , r+ 1, .. (1.86) We can write N=G1+...+Gr (1.87) where GZ is the number of tosses between the successes numbers i-1 and i. But each GZ has a geometric distribution! Since the mean of that distribution is 1/p, we have that 1 E(N) = r - - (1.88) p In fact, those r geometric variables are also independent, so we know the variance of N is the sum of their variances: Var(N) =r - 2 (1.89)  1.6. RECOGNIZING DISTRIBUTIONS WHEN YOU SEE THEM 29 1.5.2.5 The Power Law Family of Distributions Here px (k) = ck--7, k = 1, 2, 3, ... (1.90) It is required that -y> 1, as otherwise the sum of probabilities will be infinite. For - satisfying that condition, the value c is chosen so that that sum is 1.0: 00 00 1.0 =Zck- c k- dk =c (1.91) k=1J1 Here again we have a parametric family of distributions, indexed by the parameter -y. The power law family is an old-fashioned model (an old-fashioned term for distribution is law), but there has been a resurgence of interest in it in recent years. It turns out that many types of networks in the real world exhibit approximately power law behavior. For instance, in a famous study of the Web (A. Barabasi and R. Albert, Emergence of Scaling in Random Networks, Science, 1999, 509-512), it was found that the number of links leading to a Web page has an approximate power law distribution with -y= 2.1. The number of links leading out of a Web page was found to be approximately power-law distributed, with -y= 2.7. 1.6 Recognizing Distributions When You See Them Many random variables one encounters do not have a distribution in some famous parametric family. But many do, and it's important to be alert to this point, and recognize one when you see one. 1.6.1 A Coin Game Consider a game played by Jack and Jill. Each of them tosses a coin many times, but Jack gets a head start of two tosses. So by the time Jack has had, for instance, 8 tosses, Jill has had only 6; when Jack tosses for the 15th time, Jill has her 13th toss; etc. Let Xk denote the number of heads Jack has gotten through his kth toss, and let Yk be the head count for Jill at that same time, i.e. among only k-2 tosses for her. (So, Yi1 = Y2 = 0.) Let's find the probability that Jill is winning after the kth toss, i.e. P(Y > X6). Your first reaction might be, "Aha, binomial distribution!" You would be on the right track, but the problem is that you would not be thinking precisely enough. Just WHAT has a binomial distribution? The answer is that both X6 and Y6 have binomial distributions, both with p = 0.5, but n = 6 for X6 while n = 4 for Y6. Now, as usual, ask the famous question, "How can it happen?" How can it happen that Y6 > X6? Well, we could have, for example, Y6 3 and X6 1, as well as many other possibilities. Let's write it  30 CHAPTER 1. DISCRETE PROBABILITY MODELS mathematically: 4 i-1 P(Ye > X6) = E P(Ye = i and X = j) (1.92) i=1 j=O Make SURE your understand this equation. 
Now, to evaluate P(Y = i and X6= j), we see the "and" so we ask whether Y6 and X6 are independent. They in fact are; Jill's coin tosses certainly don't affect Jack's. So, P(Y= iandX6 =j) =P(Y =i) .P(X6 =j) (1.93) It is at this point that we finally use the fact that X6 and Y have binomial distributions. We have P(Y6 = i) = 0.5'(1 - 0.5)4-i (1.94) and P(X6 = j) = 0.5(1- 0.5)6-3 (1.95) We would then substitute (1.94) and (1.95) in (1.92). We could then evaluate it by hand, but it would be more convenient to use R's dbinom() function: prob <-O0 2 for (i in 1:4) 3 for (j in 0:(i-1)) 4 prob <- prob + dbinom(i,4,0.5) * dbinom(j,6,0.5) 5 print (prob) We get an answer of about 0.17. If Jack and Jill were to play this game repeatedly, stopping each time after the 6th toss, then Jill would win about 17% of the time. 1.6.2 Tossing a Set of Four Coins Consider a game in which we have a set of four coins. We keep tossing the set of four until we have a situation in which exactly two of them come up heads. Let N denote the numbr of times we must toss the set of four coins. For instance, on the first toss of the set of four, the outcome might be HTHH. The second might be TTTH, and the third could be THHT. In the situation, N = 3. Let's find P(N = 5). Here we recognize that N has a geometric distribution, with "success" defined as getting two heads in our set of four coins. What value does the parameter p have here?  1.7. A CAUTIONARY TALE 31 Well, p is P(X = 2), where X is the number of heads we get from a toss of the set of four coins. We recognize that X is binomial! Thus p .4-=3(1.96) Thus using the fact that N has a geometric distribution, P(N = 5) = (1 - p)4p = 0.057 (1.97) 1.6.3 The ALOHA Example Again As an illustration of how commonly these parametric families arise, let's again look at the ALOHA example. Consider the general case, with transmission probability p, message creation probability q, and m network nodes. We will not restrict our observation to just two epochs. Suppose Xi = m, i.e. at the end of epoch i all nodes have a message to send. Then the number which attempt to send during epoch i+1 will be binomially distributed, with parameters m and p.11 For instance, the probability that there is a successful transmission is equal to the probability that exactly one of the m nodes attempts to send, p(1 - p)m-1 = mp(1 - p)m-1 (1.98) Now in that same setting, Xi = m, let K be the number of epochs it will take before some message actually gets through. In other words, we will have Xi = m, Xi+1 = m, Xi+2 = m,... but finally Xi+K-1 = m -1. Then K will be geometrically distributed, with success probability equal to (1.98). There is no Poisson distribution in this example, but it is central to the analysis of Ethernet, and almost any other network. We will discuss this at various points in later chapters. 1.7 A Cautionary Tale 1.7.1 Trick Coins, Tricky Example Suppose we have two trick coins in a box. They look identical, but one of them, denoted coin 1, is heavily weighted toward heads, with a 0.9 probability of heads, while the other, denoted coin 2, is biased in the opposite direction, with a 0.9 probability of tails. Let C1 and C2 denote the events that we get coin 1 or coin 2, respectively. Our experiment consists of choosing a coin at random from the box, and then tossing it n times. Let By denote the outcome of the ith toss, i = 1,2,3,..., where By 1 means heads and By 0 means tails. Let Xi=B1 + ... + Bi, so Xi is a count of the number of heads obtained through the ith toss. 
"Note that this is a conditional distribution, given X2 m.  32 CHAPTER 1. DISCRETE PROBABILITY MODELS The question is: "Does the random variable Xi have a binomial distribution?" Or, more simply, the ques- tion is, "Are the random variables BZ independent?" To most people's surprise, the answer is No (to both questions). Why not? The variables BZ are indeed 0-1 variables, and they have a common success probability. But they are not independent! Let's see why they aren't. Consider the events A = {Bi = 1}, i = 1,2,3,... In fact, just look at the first two. By definition, they are independent if and only if P(A1 and A2) = P(A1)P(A2) (1.99) First, what is P(A1)? Now, wait a minute! Don't answer, "Well, it depends on which coin we get," because this is NOT a conditional probability. Yes, the conditional probabilities P(A1|C1) and P(A1|C2) are 0.9 and 0.1, respectively, but the unconditional probability is P(A1) = 0.5. You can deduce that either by the symmetry of the situation, or by P(A1) = P(C1)P(A1|C1) + P(C2)P(A1|C2) = (0.5)(0.9) + (0.5)(0.1) = 0.5 (1.100) You should think of all this in the notebook context. Each line of the notebook would consist of a report of three things: which coin we get; the outcome of the first toss; and the outcome of the second toss. (Note by the way that in our experiment we don't know which coin we get, but conceptually it should have a column in the notebook.) If we do this experiment for many, many lines in the notebook, about 90% of the lines in which the coin column says "1" will show Heads in the second column. But 50% of the lines overall will show Heads in that column. So, the right hand side of Equation (1.99) is equal to 0.25. What about the left hand side? P(A1 and A2) = P(A1 and A2 and C1) + P(A1 and A2 and C2) (1.101) = P(A1 and A2|C1)P(C1) + P(A1 andA2|C2)P(C2) (1.102) = (0.9)2(0.5) + (0.1)2(0.5) (1.103) = 0.41 (1.104) Well, 0.41 is not equal to 0.25, so you can see that the events are not independent, contrary to our first intuition. And that also means that Xi is not binomial. 1.7.2 Intuition in Retrospect To get some intuition here, think about what would happen if we tossed the chosen coin 10000 times instead of just twice. If the tosses were independent, then for example knowledge of the first 9999 tosses should not tell us anything about the 10000th toss. But that is not the case at all. After 9999 tosses, we are going to have a very good idea as to which coin we had chosen, because by that time we will have gotten about 9000 heads (in the case of coin C1) or about 1000 heads (in the case of C2). In the former case, we know that the  1.8. WHY NOT JUST DO ALL ANALYSIS BY SIMULATION? 33 10000th toss is likely to be a head, while in the latter case it is likely to be tails. In other words, earlier tosses do indeed give us information about later tosses, so the tosses aren't independent. 1.7.3 Implications for Modeling The lesson to be learned is that independence can definitely be a tricky thing, not to be assumed cavalierly. And in creating probability models of real systems, we must give very, very careful thought to the conditional and unconditional aspects of our models--it can make a huge difference, as we saw above. Also, the conditional aspects often play a key role in formulating models of nonindependence. This trick coin example is just that-tricky-but similar situations occur often in real life. If in some medical study, say, we sample people at random from the population, the people are independent of each other. 
But if we sample families from the population, and then look at children within the families, the children within a family are not independent of each other. 1.8 Why Not Just Do All Analysis by Simulation? Now that computer speeds are so fast, one might ask why we need to do mathematical probability analysis; why not just do everything by simulation? There are a number of reasons: " Even with a fast computer, simulations of complex systems can take days, weeks or even months. " Mathematical analysis can provide us with insights that may not be clear in simulation. " Like all software, simulation programs are prone to bugs. The chance of having an uncaught bug in a simulation program is reduced by doing mathematical analysis for a special case of the system being simulated. This serves as a partial check. " Statistical analysis is used in many professions, including engineering and computer science, and in order to conduct meaningful, useful statistical analysis, one needs a firm understanding of probability principles. An example of that second point arose in the computer security research of a graduate student at UCD, C. Senthilkumar, who was working on a way to more quickly detect the spread of a malicious computer worm. He was evaluating his proposed method by simulation, and found that things "hit a wall" at a certain point. He wasn't sure if this was a real limitation; maybe, for example, he just wasn't running his simulation on the right set of parameters to go beyond this limit. But a mathematical analysis showed that the limit was indeed real. 1.9 Tips on Finding Probabilities, Expected Values and So On First, do not write/think nonsense. For example, the expression "P(A) or P(B)" is nonsense-do you see why?  34 CHAPTER 1. DISCRETE PROBABILITY MODELS Similarly, don't use "formulas" that you didn't learn and are in fact false. For example, in an expression involving a random variable X, one can NOT replace X by EX! (How would you like it if your professor were to lose your exam, and then tell you, "Well, I'll just assign you a score that is equal to the class mean"?) As noted before, in calculating a probability, ask yourself, "How can it happen?" Then you will typically have a set of and/or terms, which you compute individually and add together. And until you get used to it, write down every step, including reasons, as you see in (1.7)-(1.9). Another point is that you should define variables, e.g. "Let X denote the number of heads." Write it down! This makes it much easier to translate from words to math expressions and equations. Exercises 1. This problem concerns the ALOHA network model of Section 1.1. Feel free to use (but cite) computations already in the example. (a) P(X1 = 2 and X2 =1), for the same values of p and q in the examples. (b) Find P(X2 = 0). (c) Find (P(X1 =1X2 =1). 2. Consider a game in which one rolls a single die until one accumulates a total of at least four dots. Let X denote the number of rolls needed. Find P(X < 2) and E(X). 3. Recall the committee example in Section 1.4.8. Suppose now, though, that the selection protocol is that there must be at least one man and at least one woman on the committee. Find E(D) and Var (D). 4. Consider the game in Section 1.6.1. Find E(Z) and Var(Z), where Z = Y - X6. 5. Say we choose six cards from a standard deck, one at a time WITHOUT replacement. Let N be the number of kings we get. Does N have a binomial distribution? Choose one: (i) Yes. (ii) No, since trials are not independent. 
(iii) No, since the probability of success is not constant from trial to trial. (iv) No, since the number of trials is not fixed. (v) (ii) and (iii). (iv) (ii) and (iv). (vii) (iii) and (iv). 6. Suppose we have n independent trials, with the probability of success on the ith trial being p2. Let X = the number of successes. Use the fact that "the variance of the sum is the sum of the variance" for independent random variables to derive Var(X). 7. You bought three tickets in a lottery, for which 60 tickets were sold in all. There will be five prizes given. Find the probability that you win at least one prize, and the probability that you win exactly one prize. 8. Two five-person committees are to be formed from your group of 20 people. In order to foster commu- nication, we set a requirement that the two committees have the same chair but no other overlap. Find the probability that you and your friend are both chosen for some committee. 9. Consider a device that lasts either one, two or three months, with probabilities 0.1, 0.7 and 0.2, respec- tively. We carry one spare. Find the probability that we have some device still working just before four months have elapsed.  1.9. TIPS ON FINDING PROBABILITIES, EXPECTED VALUES AND SO ON 35 10. A building has six floors, and is served by two freight elevators, named Mike and Ike. The destination floor of any order of freight is equally likely to be any of floors 2 through 6. Once an elevator reaches any of these floors, it stays there until summoned. When an order arrives to the building, whichever elevator is currently closer to floor 1 will be summoned, with elevator Ike being the one summoned in the case in which they are both on the same floor. Find the probability that after the summons, elevator Mike is on floor 3. Assume that only one order of freight can fit in an elevator at a time. Also, suppose the average time between arrivals of freight to the building is much larger than the time for an elevator to travel between the bottom and top floors; this assumption allows us to neglect travel time. 11. Without resorting to using the fact that () = n!/[k!(n - k!)], find c and d such that () ) n k + (d) (1.105) 12. Prove Equation (1.46), and also show that b = EU minimizes the quantity E] (U - b)2 13. Show that if X is a nonnegative-integer valued random variable, then oo EX = P(X > i) (1.106) i=1 Hint: Write i =>- 1, and when you see an iterated sum, reverse the order of summation. 14. A civil engineer is collecting data on a certain road. She needs to have data on 25 trucks, and 10 percent of the vehicles on that road are trucks. Find the probability that she will need to wait for more than 200 vehicles to pass before she gets the needed data. 15. Suppose we toss a fair time n times, resulting in X heads. Show that the term expected value is a misnomer, by showing that lim P(X = n/2) = 0 (1.107) n->oo Use Stirling's approximation, k! 2 kIk) (1.108)  36 CHAPTER 1. DISCRETE PROBABILITY MODELS  Chapter 2 Continuous Probability Models 2.1 A Random Dart Imagine that we throw a dart at random at the interval (0,1). Let D denote the spot we hit. By "at random" we mean that all subintervals of equal length are equally likely to get hit. For instance, the probability of the dart landing in (0.7,0.8) is the same as for (0.2,0.3), (0.537,0.637) and so on. The first crucial point to note is that P(D=c) 0 =(2.1) for any individual point c. 
That can be seen by the fact that c is in as tiny a subinterval as you wish, or by the fact that the interval (c,c), or even [c,c], has length 0. Or, reason that there are infinitely many points, and if they all had some nonzero probability w, say, then the probabilities would sum to infinity instead of to 1; thus they must have probability 0. That may sound odd to you, but remember, this is an idealization. D actually cannot be just any old point in (0,1). Our dart has nonzero thickness, our measuring instrument has only finite precision, and so on. So it really is an idealization, though an extremely useful one. It's like the assumption of "massless string" in physics analyses; there is no such thing, but it's a good approximation to reality. But Equation (2.1) presents a problem for us in defining the term distribution for variables like this. We defined it for a discrete random variable Y as a list of the values Y takes on, together with their probabilities. But that would be impossible here-all the probabilities of individual values here are 0. Instead, we define the distribution of a random variable W which puts 0 probability on individual points in another way. To set this up, we first must define, for any random variable W (including discrete ones), its cumulative distribution function (cdf): Fw(t) = P(W <;t), -oo < t < oo (2.2) (Please keep in mind the notation. It is customary to use capital F to denote a cdf, with a subscript consisting 37  38 CHAPTER 2. CONTINUOUS PROBABILITY MODELS of the name of the random variable.) What is t here? It's simply an argument to a function. The function here has domain (-oc, oc), and we must thus define that function for every value of t. For instance, consider our "random dart" example above. We know that, for example FD(0.23) = P(D 0.23) = 0.23 (2.3) In general for our dart, 0, if t <0 FD (t) = , if 0 < t1 1i, if t > 1 (2.4) Here is the graph of FD: (0 U- N' I I I II ' -0.5 0.0 0.5 1.0 1.5 t The cdf of a discrete random variable is defined as in Equation (2.2) too. For example, say Z is the number  2.1. A RANDOM DART of heads we get from two tosses of a coin. Then 39 0, Fz (t) =.2', 0.75, 1, if t < 0 if 0 2 (2.5) For instance, Fz(1.2) = P(Z < 1.2) = P(z = 0 or Z = 1) = 0.25 + 0.50 = 0.75. (Make sure you confirm this!) Fz is graphed below: N J_ ' I I I I I I I' -0.5 0.0 0.5 1 .0 1 .5 2.0 2.5 t The fact that one cannot get a noninteger number of heads is what makes the cdf of Z flat between consecu- tive integers. In the graphs you see that FD in (2.4) is continuous while Fz in (2.5) has jumps. For this reason, we call random variables like D-ones which have 0 probability for individual points-continuous random variables. At this level of study of probability, most random variables are either discrete or continuous, but some are not.  40 CHAPTER 2. CONTINUOUS PROBABILITY MODELS 2.2 Density Functions Intuition is key here. Make SURE you develop a good intuitive understanding of density functions, as it is vital in being able to apply probability well. We will use it a lot in our course. 2.2.1 Motivation, Definition and Interpretation OK, now we have a name for random variables that have probability 0 for individual points-"continuous" and we have solved the problem of how to describe their distribution. Now we need something which will be continuous random variables' analog of a probability mass function. Think as follows. From (2.2) we can see that for a discrete random variable, its cdf can be calculated by summing is pmf. 
Recall that in the continuous world, we integrate instead of sum. So, our continuous-case analog of the pmf should be something that integrates to the cdf. That of course is the derivative of the cdf, which is called the density. It is defined as fw(t) =+Fw(t), -oo < t < oo (2.6) (Please keep in mind the notation. It is customary to use lower-case f to denote a density, with a subscript consisting of the name of the random variable.) Recall from calculus that an integral is the area under the curve, derived as the limit of the sums of areas of rectangles drawn at the curve, as the rectangles become narrower and narrower. Since the integral is a limit of sums, its symbol f is shaped like an S. Now look at Figure 2.1, depicting a density function fx. (It so happens that in this example, the density is an increasing function, but most are not.) A rectangle is drawn, positioned horizontally at 1.3 i 0.1, and with height equal fx (1.3). The area of the rectangle approximates the area under the curve in that region, which in turn is a probability: 2(0.1)fx(1.3) j fx(t) dt = P(1.2 < X < 1.4) (2.7) 1.2 In other words, for any density fx at any point t, and for small values of c, 2cfx(t) P(t - c < X < t +c) (2.8) Thus we have: Intrepetation of Density Functions For any density fx and any two points r and s, P(r- c 2.5) = 2t/15dt =0.65 42.5 2- 1 for s in (1,4) (cdf is 0 for t < 1, and1 for t > 4) (2.17) (2.18) Fx(s) 2t/15 dt  2.3. FAMOUS PARAMETRIC FAMILIES OF CONTINUOUS DISTRIBUTIONS 43 2.3 Famous Parametric Families of Continuous Distributions 2.3.1 The Uniform Distributions 2.3.1.1 Density and Properties In our dart example, we can imagine throwing the dart at the interval (q,r) (so this will be a two-parameter family). Then to be a uniform distribution, i.e. with all the points being "equally likely," the density must be constant in that interval. But it also must integrate to 1 [see (2.11). So, that constant must be 1 divided by the length of the interval: 1 fD (t) =(2-19) r-q for tin (q,r), 0 elsewhere. It easily shown that E(D) =tr and Var(D) = 12(r - q)2. The notation for this family is U(q,r). 2.3.1.2 Example: Modeling of Disk Performance Uniform distributions are often used to model computer disk requests. Recall that a disk consists of a large number of concentric rings, called tracks. When a program issues a request to read or write a file, the read/write head must be positioned above the track of the first part of the file. This move, which is called a seek, can be a significant factor in disk performance in large systems, e.g. a database for a bank. If the number of tracks is large, the position of the read/write head, which I'll denote at X, is like a continuous random variable, and often this position is modeled by a uniform distribution. This situation may hold just before a defragmentation operation. After that operation, the files tend to be bunched together in the central tracks of the disk, so as to reduce seek time, and X will not have a uniform distribution anymore. Each track consists of a certain number of sectors of a given size, say 512 bytes each. Once the read/write head reaches the proper track, we must wait for the desired sector to rotate around and pass under the read/write head. It should be clear that a uniform distribution is a good model for this rotational delay. 
2.3.1.3 Example: Modeling of Denial-of-Service Attack In one facet of computer security, it has been found that a uniform distribution is actually a warning of trouble, a possible indication of a denial-of-service attack. Here the attacker tries to monopolize, say, a Web server, by inundating it with service requests. According to the research of David Marchette,2 attackers choose uniformly distributed false IP addresses, a pattern not normally seen at servers. 2Statistical Methods for Network and Computer Security, David J. Marchette, Naval Surface Warfare Center, ri on . mat h. iastate.edu/IA/2003/foils/marchette.pdf.  44 CHAPTER 2. CONTINUOUS PROBABILITY MODELS 2.3.2 The Normal (Gaussian) Family of Continuous Distributions These are the famous "bell-shaped curves," so called because their densities have that shape.3 2.3.2.1 Density and Properties Density and Parameters: The density for a normal distribution is 1 _ -__ 2 fw(t) = e-o.5(,)2 , -oo < t < oo (2.20) Again, this is a two-parameter family, indexed by the parameters y and o, which turn out to be the mean4 and standard deviation y and o, The notation for it is N(p, o.2) (it is customary to state the variance o2 rather than the standard deviation). Closure Under Affine Transformation: The family is closed under affine transformations, meaning that if X has the distribution N(p, o.2), then Y = cX + d has the distribution N(cp + d, c2.2), i.e. Y too has a normal distribution. Consider this statement carefully. It is saying much more than simply that Y has mean xp + d and variance c2.2, which would follow from (1.49) even if X did not have a normal distribution. The key point is that this new variable Y is also a member of the normal family, i.e. its density is still given by (2.20), now with the new mean and variance. Let's derive this. For convenience, suppose c > 0. Then Fy (t) = P(Y < t) (definition of Fy) (2.21) = P(cX + d < t) (definition of Y) (2.22) = P X 535): Let Z = (X - 500)/15. From our discussion above, we know that Z has a N(0,1) distribution, so P(X 535) = (P > ~so 1 = i(35/15) =0.01 (2.31)  46 CHAPTER 2. CONTINUOUS PROBABILITY MODELS Again, traditionally we would obtain that 0.01 value from a N(0,1) cdf table in a book. With R, we would just use the function pnormO: > 1 - pnorm(535,500,15) [1] 0.009815329 Anyway, that 0.01 probability makes us suspicious. While it could really be Jill, this would be unusual behavior for Jill, so we start to suspect that it isn't her. Of course, this is a very crude analysis, and real intrusion detection systems are much more complex, but you can see the main ideas here. 2.3.2.3 The Central Limit Theorem The Central Limit Theorem (CLT) says, roughly speaking, that a random variable which is a sum of many components will have an approximate normal distribution.5 So, for instance, human weights are approximately normally distributed, since a person is made of many components. The same is true for SAT test scores,6 as the total score is the sum of scores on the individual problems. Binomially distributed random variables, though discrete, also are approximately normally distributed. This comes from the fact that if say T has a binomial distribution with n trials, then we can write T = T1 +....+Tn, where Ti is 1 for a success and 0 for a failure. Since we have a sum, the CLT applies. Thus we use the CLT if we have binomial distributions with large n. 2.3.2.4 Example: Coin Tosses For example, let's find the approximate probability of getting more than 12 heads in 20 tosses of a coin. 
X, the number of heads, has a binomial distribution with n = 20 and p = 0.5. Its mean and variance are then np = 10 and np(1-p) = 5. So, let Z = (X - 10)/√5, and write

P(X > 12) = P(Z > (12-10)/√5) ≈ 1 - Φ(0.894) = 0.186   (2.32)

Or:

> 1 - pnorm(12,10,sqrt(5))
[1] 0.1855467

The exact answer is 0.132. Remember, the reason we could do this was that X is approximately normal, from the CLT. This is an approximation of the distribution of a discrete random variable by a continuous one, which introduces additional error.

⁵There are many versions of the CLT. The basic one requires that the summands be independent and identically distributed, but more advanced versions are broader in scope.
⁶This refers to the raw scores, before scaling by the testing company.

We can get better accuracy by accounting for the fact that X is discrete, replacing 12 by 12.5 above. (Think of the number 13 "owning" the region between 12.5 and 13.5.) This is customary, and in this case gives us 0.1317762, while the exact answer to seven decimal places is 0.131588. This is called the correction for continuity. Of course, for larger n this adjustment is not necessary.

2.3.2.5 Museum Demonstration

Many science museums have the following visual demonstration of the CLT. There are many balls in a chute, with a triangular array of r rows of pins beneath the chute. Each ball falls through the rows of pins, bouncing left and right with probability 0.5 each, eventually being collected into one of r bins, numbered 0 to r. A ball will end up in bin i if it bounces rightward in i of the r rows of pins, i = 0, 1, ..., r.

Key point: Let X denote the bin number at which a ball ends up. X is the number of rightward bounces ("successes") in r rows ("trials"). Therefore X has a binomial distribution with n = r and p = 0.5.

Each bin is wide enough for only one ball, so the balls in a bin will stack up. And since there are many balls, the height of the stack in bin i will be approximately proportional to P(X = i). And since the latter will be approximately given by the CLT, the stacks of balls will roughly look like the famous bell-shaped curve!

There are many online simulations of this museum demonstration, such as http://www.rand.org/statistics/applets/clt.html and http://www.jcu.edu/math/isep/Quincunx/Quincunx.html. By collecting the balls in bins, the apparatus basically simulates a histogram for X, which will then be approximately bell-shaped.

2.3.2.6 Optional topic: Formal Statement of the CLT

Definition 1 A sequence of random variables L1, L2, L3, ... converges in distribution to a random variable M if

lim_{n→∞} P(Ln ≤ t) = P(M ≤ t),  for all t   (2.33)

Note by the way that these random variables need not be defined on the same probability space.

The formal statement of the CLT is:

Theorem 2 Suppose X1, X2, ... are independent random variables, all having the same distribution which has mean m and variance v². Then

Z = (X1 + ... + Xn - nm) / (v√n)   (2.34)

converges in distribution to a N(0,1) random variable.

2.3.2.7 Importance in Modeling

Normal distributions play a key role in statistics. Most of the classical statistical procedures assume that one has sampled from a population having an approximately normal distribution. This should come as no surprise, knowing the CLT. The latter implies that many things in nature do have approximate normal distributions.
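As a quick illustration of the CLT in action, here is a small simulation sketch of ours, not from the text; the choice of 50 summands and 10000 repetitions is arbitrary:

sums <- replicate(10000, sum(runif(50)))
hist(sums,breaks=50)   # should look bell-shaped, centered near 50 x 0.5 = 25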
2.3.3 The Chi-Square Family of Distributions

2.3.3.1 Density and Properties

Let Z1, Z2, ..., Zk be independent N(0,1) random variables. Then the distribution of

Y = Z1² + ... + Zk²   (2.35)

is called chi-square with k degrees of freedom. We write such a distribution as χ²_k. Chi-square is a one-parameter family of distributions.

It turns out that chi-square is a special case of the gamma family in Section 2.3.5 below, with r = k/2 and λ = 0.5.

2.3.3.2 Importance in Modeling

This distribution is used widely in statistical applications. As will be seen in our chapters on statistics, many statistical methods involve a sum of squared normal random variables.⁷

⁷The motivation for the term degrees of freedom will be explained in those chapters too.

2.3.4 The Exponential Family of Distributions

2.3.4.1 Density and Properties

The densities in this family have the form

f_W(t) = λe^{-λt},  0 < t < ∞   (2.36)

This is a one-parameter family of distributions.⁸ After integration, one finds that E(W) = 1/λ and Var(W) = 1/λ². You might wonder why it is customary to index the family via λ rather than 1/λ (see (2.36)), since the latter is the mean. But this is actually quite natural, for the reason cited in the following subsection.

⁸In the mathematical theory of statistics, the term exponential family has a broader meaning than this.

2.3.4.2 Connection to the Poisson Distribution Family

Suppose the lifetimes of a set of light bulbs are independent and identically distributed (i.i.d.), and consider the following process. At time 0, we install a light bulb, which burns an amount of time X1. Then we install a second light bulb, with lifetime X2. Then a third, with lifetime X3, and so on.

Let

Tr = X1 + ... + Xr   (2.37)

denote the time of the rth replacement. Also, let N(t) denote the number of replacements up to and including time t.⁹ Then it can be shown that if the common distribution of the Xi is exponential with parameter λ, then N(t) has a Poisson distribution with mean λt. And the converse is true too: if the Xi are independent and identically distributed and N(t) is Poisson, then the Xi must have exponential distributions. In other words, N(t) will have a Poisson distribution if and only if the lifetimes are exponentially distributed.

You can see the "only if" part quickly, by the following argument. First, note that

P(X1 > t) = P[N(t) = 0] = e^{-λt}   (2.38)

Then

f_X1(t) = d/dt (1 - e^{-λt}) = λe^{-λt}   (2.39)

The collection of random variables N(t), t > 0, is called a Poisson process. The relation E[N(t)] = λt says that replacements are occurring at an average rate of λ per unit time. Thus λ is called the intensity parameter of the process. It is this "rate" interpretation that makes λ a natural indexing parameter in (2.36).

⁹Again, since the lifetimes are continuous random variables, the probability of a replacement occurring exactly at time t is 0, so the phrase "and including" is unnecessary here.

2.3.4.3 Importance in Modeling

Many distributions in real life have been found to be approximately exponentially distributed. A famous example is the lifetimes of air conditioners on airplanes. Another famous example is interarrival times, such as customers coming into a bank or messages going out onto a computer network. It is used in software reliability studies too.
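Here is a small simulation of our own illustrating that exponential/Poisson connection (the values λ = 0.5, t = 10 and nreps are arbitrary choices). We repeatedly run the light bulb process up to time t and count replacements; the counts should behave like a Poisson random variable with mean λt = 5, so the sample mean and variance should nearly agree:

nreps <- 10000
nburnouts <- vector(length=nreps)
for (rep in 1:nreps) {
   tot <- 0
   count <- 0
   while (TRUE) {
      tot <- tot + rexp(1,0.5)   # lifetime of the next bulb
      if (tot > 10) break
      count <- count + 1
   }
   nburnouts[rep] <- count
}
mean(nburnouts)   # should be near 0.5 * 10 = 5
var(nburnouts)    # for a Poisson distribution, also near 5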
2.3.5 The Gamma Family of Distributions

2.3.5.1 Density and Properties

Recall Equation (2.37), in which the random variable Tr was defined to be the time of the rth light bulb replacement. Tr is the sum of r independent exponentially distributed random variables with parameter λ. The distribution of Tr is called an Erlang distribution, with density

f_Tr(t) = λ^r t^{r-1} e^{-λt} / (r-1)!,  t > 0   (2.40)

This is a two-parameter family.

We can generalize this by allowing r to take noninteger values, by defining a generalization of the factorial function:

Γ(r) = ∫_0^∞ x^{r-1} e^{-x} dx   (2.41)

This is called the gamma function, and it gives us the gamma family of distributions, more general than the Erlang:

f_W(t) = λ^r t^{r-1} e^{-λt} / Γ(r),  t > 0   (2.42)

(Note that Γ(r) is merely serving as the constant that makes the density integrate to 1.0. It doesn't have meaning of its own.)

This is again a two-parameter family, with r and λ as parameters. A gamma distribution has mean r/λ and variance r/λ². In the case of integer r, this follows from (2.37) and the fact that an exponentially distributed random variable has mean 1/λ and variance 1/λ², and it can be derived in general. Note again that the gamma reduces to the exponential when r = 1.

Recall from above that the gamma distribution, or at least the Erlang, arises as a sum of independent random variables. Thus the Central Limit Theorem implies that the gamma distribution should be approximately normal for large (integer) values of r. We see in Figure 2.2 that even with r = 10 it is rather close to normal.

It also turns out that the chi-square distribution with d degrees of freedom is a gamma distribution, with r = d/2 and λ = 0.5.

2.3.5.2 Example: Network Buffer

Suppose in a network context (not our ALOHA example), a node does not transmit until it has accumulated five messages in its buffer. Suppose the times between message arrivals are independent and exponentially distributed with mean 100 milliseconds. Let's find the probability that more than 552 ms will pass before a transmission is made, starting with an empty buffer.

Let X1 be the time until the first message arrives, X2 the time from then to the arrival of the second message, and so on. Then the time until we accumulate five messages is Y = X1 + ... + X5. Then from the definition of the gamma family, we see that Y has a gamma distribution with r = 5 and λ = 0.01. Then

P(Y > 552) = ∫_{552}^∞ (0.01⁵ t⁴ e^{-0.01t} / 4!) dt   (2.43)

This integral could be evaluated via repeated integration by parts, but let's use R instead:

> 1 - pgamma(552,5,0.01)
[1] 0.3544101

2.3.5.3 Importance in Modeling

As seen in (2.37), sums of exponentially distributed random variables often arise in applications. Such sums have gamma distributions.

You may ask what the meaning is of a gamma distribution in the case of noninteger r. There is no particular meaning, but when we have a real data set, we often wish to summarize it by fitting a parametric family to it, meaning that we try to find a member of the family that approximates our data well.

In this regard, the gamma family provides us with densities which rise near t = 0, then gradually decrease to 0 as t becomes large, so the family is useful if our data seem to look like this. Graphs of some gamma densities are shown in Figure 2.2.

2.4 Describing "Failure"

In addition to density functions, another useful description of a distribution is its hazard function. Again think of the lifetimes of light bulbs, not necessarily assuming an exponential distribution. Intuitively, the hazard function states the likelihood of a bulb failing in the next short interval of time, given that it has lasted up to now. To understand this, let's first talk about a certain property of the exponential distribution family.
[Figure 2.2: Various Gamma Densities, for r = 1.0, 5.0 and 10.0, with λ = 1.0.]

2.4.1 Memoryless Property

One of the reasons the exponential family of distributions is so famous is that it has a property that makes many practical stochastic models mathematically tractable: the exponential distributions are memoryless. What this means is that for positive t and u,

P(W > t+u | W > t) = P(W > u)   (2.44)

Let's derive this:

P(W > t+u | W > t) = P(W > t+u and W > t) / P(W > t)   (2.45)
                   = P(W > t+u) / P(W > t)   (2.46)
                   = (∫_{t+u}^∞ λe^{-λs} ds) / (∫_t^∞ λe^{-λs} ds)   (2.47)
                   = e^{-λu}   (2.48)
                   = P(W > u)   (2.49)

We say that this means that "time starts over" at time t, or that W "doesn't remember" what happened before time t.

It is difficult for the beginning modeler to fully appreciate the memoryless property. Let's make it concrete. Consider the problem of waiting to cross the railroad tracks on Eighth Street in Davis, just west of J Street. One cannot see down the tracks, so we don't know whether the end of the train will come soon or not.

If we are driving, the issue at hand is whether to turn off the car's engine. If we leave it on, and the end of the train does not come for a long time, we will be wasting gasoline; if we turn it off, and the end does come soon, we will have to start the engine again, which also wastes gasoline. (Or, we may be deciding whether to stay there, or go way over to the Covell Rd. railroad overpass.)

Suppose our policy is to turn off the engine if the end of the train won't come for at least s seconds. Suppose also that we arrived at the railroad crossing just when the train first arrived, and we have already waited for r seconds. Will the end of the train come within s more seconds, so that we will keep the engine on? If the length of the train were exponentially distributed (if there are typically many cars, we can model it as continuous even though it is discrete), Equation (2.44) would say that the fact that we have waited r seconds so far is of no value at all in predicting whether the train will end within the next s seconds. The chance of it lasting at least s more seconds right now is no more and no less than the chance it had of lasting at least s seconds when it first arrived.

The memorylessness of exponential distributions implies that a Poisson process N(t) also has a "time starts over" property (called the Markov property). Recall our example in Section 2.3.4.2 in which N(t) was the number of light bulb burnouts up to time t. The memorylessness property means that if we start counting afresh from some time z, then the number of burnouts after time z, i.e. Q(u) = N(z+u) - N(z), also is a Poisson process. In other words, Q(u) has a Poisson distribution with parameter λu. Moreover, Q(u) is independent of N(t) for any t ≤ z.

By the way, the exponential distributions are the only continuous distributions which are memoryless. This too has implications for the theory.

2.4.2 Hazard Functions

2.4.2.1 Basic Concepts

Suppose the lifetime of a light bulb L were discrete. Suppose a particular bulb has already lasted 80 hours. The probability of it failing in the next hour would be

P(L = 81 | L > 80) = P(L = 81)/P(L > 80) = p_L(81)/(1 - F_L(80))   (2.50)

By analogy, for continuous L we define

h_L(t) = f_L(t) / (1 - F_L(t))   (2.51)

Again, the interpretation is that h_L(t) is the likelihood of the item failing very soon after t, given that it has lasted t amount of time.
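As a quick numerical look at (2.51), of our own devising rather than the text's: we can evaluate the hazard functions of an exponential and a uniform distribution directly; as discussed below, the former is constant while the latter increases:

t <- seq(0.1,2,0.1)
dexp(t,0.5) / (1 - pexp(t,0.5))   # constant, with value lambda = 0.5
t <- seq(0.1,0.9,0.1)
dunif(t) / (1 - punif(t))         # equals 1/(1-t), increasing in t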
Note carefully that the word "failure" here should not be taken literally. In our Davis railroad crossing example above, "failure" means that the train ends, a "failure" which those of us who are waiting will welcome!

Since we know that exponentially distributed random variables are memoryless, we would expect intuitively that their hazard functions are constant. We can verify this by evaluating (2.51) for an exponential density with parameter λ; sure enough, the hazard function is constant, with value λ.

The reader should verify that in contrast to an exponential distribution's constant failure rate, a uniform distribution has an increasing failure rate (IFR). Some distributions have decreasing failure rates, while most have non-monotone rates.

Hazard function models have been used extensively in software testing. Here "failure" is the discovery of a bug, and quantities of interest include the mean time until the next bug is discovered, and the total number of bugs.

People have what is called a "bathtub-shaped" hazard function. It is high near 0 (reflecting infant mortality) and after, say, 70, but is low and rather flat in between.

You may have noticed that the right-hand side of (2.51) is the derivative of -ln[1 - F_L(t)]. Therefore

∫_0^t h_L(s) ds = -ln[1 - F_L(t)]   (2.52)

so that

1 - F_L(t) = e^{-∫_0^t h_L(s) ds}   (2.53)

and thus¹⁰

f_L(t) = h_L(t) e^{-∫_0^t h_L(s) ds}   (2.54)

In other words, just as we can find the hazard function knowing the density, we can also go in the reverse direction. This establishes that there is a one-to-one correspondence between densities and hazard functions.

This may guide our choice of parametric family for modeling some random variable. We may not only have a good idea of what general shape the density takes on, but may also have an idea of what the hazard function looks like. These two pieces of information can help guide us in our choice of model.

¹⁰Recall that the derivative of the integral of a function is the original function!

2.4.3 Example: Software Reliability Models

Hazard function models have been used successfully to model the "arrivals" (i.e. discoveries) of bugs in software. Questions that arise are, for instance, "When are we ready to ship?", meaning when can we believe with some confidence that most bugs have been found?

Typically one collects data on bug discoveries from a number of projects of similar complexity, and estimates the hazard function from that data. See for example Accurate Software Reliability Estimation, by Jason Allen Denton, Dept. of Computer Science, Colorado State University, 1999, and the many references therein.

2.5 A Cautionary Tale: the Bus Paradox

Suppose you arrive at a bus stop, at which buses arrive according to a Poisson process with intensity parameter 0.1, i.e. 0.1 arrival per minute. Recall that this means that the interarrival times have an exponential distribution with mean 10 minutes. What is the expected value of your waiting time until the next bus?

Well, our first thought might be that since the exponential distribution is memoryless, "time starts over" when we reach the bus stop. Therefore our mean wait should be 10.

On the other hand, we might think that on average we will arrive halfway between two consecutive buses. Since the mean time between buses is 10 minutes, the halfway point is at 5 minutes. Thus it would seem that our mean wait should be 5 minutes.

Which analysis is correct? Actually, the correct answer is 10 minutes. So, what is wrong with the second analysis, which concluded that the mean wait is 5 minutes?
The problem is that the second analysis did not take into account the fact that although inter-bus intervals have an exponential distribution with mean 10, the particular inter-bus interval that we encounter is special.

Imagine a bag full of sticks, of different lengths. We reach into the bag and choose a stick at random. The key point is that not all sticks are equally likely to be chosen; the longer ones will have a greater chance of being selected.¹¹ (The formal name for this is length-biased sampling.)

¹¹Another example was suggested to me by UCD grad student Shubhabrata Sengupta: Think of a large parking lot on which hundreds of buckets are placed, of various diameters. We throw a ball high into the sky, and see what size bucket it lands in. Here the density would be proportional to the square of the diameter.

Similarly, the particular inter-bus interval that we hit is likely to be a longer interval. To see this, suppose we observe the comings and goings of buses for a very long time, and plot their arrivals on a time line on a wall. In some cases two successive marks on the time line are close together, sometimes far apart. If we were to stand far from the wall and throw a dart at it, we would hit the interval between some pair of consecutive marks. Intuitively we are more apt to hit a wider interval than a narrower one.

Once one recognizes this and carefully finds the density of that interval, we discover that that interval does indeed tend to be longer, so much so that the expected value of this interval is 20 minutes! In other words, if we throw a dart at the wall, say, 1000 times, the mean of the 1000 intervals we would hit would be about 20. This is in contrast to the mean of all of the intervals on the wall, which would be 10. Thus the halfway point comes at 10 minutes, consistent with the analysis which appealed to the memoryless property.

Actually, we can intuitively reason out what the density is of the length of the particular inter-bus interval that we hit, as follows.

First consider the bag-of-sticks example, and suppose (somewhat artificially) that stick length X is a discrete random variable. Let Y denote the length of the stick that we pick. Suppose that, say, stick lengths 2 and 6 each comprise 10% of the sticks in the bag, i.e.

p_X(2) = p_X(6) = 0.1   (2.55)

Intuitively, one would then reason that

p_Y(6) = 3 p_Y(2)   (2.56)

In other words, the sticks of length 2 are just as numerous as those of length 6, but since the latter are three times as long, they should have triple the chance of being chosen. Note that this is not some absolute physical law. Different people might draw sticks from the bag in different ways. But it is a reasonable model.

Now let X denote interarrival times between buses, and Y denote the interarrival time that we hit. The analog of (2.56) would be that f_Y(t) is proportional to t f_X(t), i.e.

f_Y(t) = c t f_X(t)   (2.57)

for some constant c. Recalling that f_Y must integrate to 1, we see that

c ∫_0^∞ t f_X(t) dt = 1   (2.58)

But that integral is just E(X)! The latter quantity is 10, so that

c = 1/E(X) = 0.1   (2.59)

So,

f_Y(t) = 0.01 t e^{-0.1t}   (2.60)

You may recognize this as an Erlang density.
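The 20-minute claim is easy to check by simulation. The sketch below is ours, not from the text: we generate a Poisson process with rate 0.1, "arrive" at the arbitrarily chosen time 1000, and record the length of the interarrival interval straddling that time.

nreps <- 5000
intlen <- vector(length=nreps)
for (rep in 1:nreps) {
   lastarrival <- 0
   while (TRUE) {
      gap <- rexp(1,0.1)   # next interarrival time
      if (lastarrival + gap > 1000) break
      lastarrival <- lastarrival + gap
   }
   intlen[rep] <- gap      # the interval containing time 1000
}
mean(intlen)   # comes out near 20, not 10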
2.6 Choosing a Model

The parametric families presented here are often used in the real world. As indicated previously, this may be done on an empirical basis. We would collect data on a random variable X, and plot the frequencies of its values in a histogram. If for example the plot looks roughly like the curves in Figure 2.2, we could choose this as the family for our model.

Or, our choice may arise from theory. If for instance our knowledge of the setting in which we are working says that our distribution is memoryless, that forces us to use the exponential density family.

In either case, the question as to which member of the family we choose will be settled by using some kind of procedure which finds the member of the family which best fits our data. We will discuss this in detail in our chapters on statistics.

Note that we may choose not to use a parametric family at all. We may simply find that our data does not fit any of the common parametric families (there are many others than those presented here) very well. Procedures that do not assume any parametric family are termed nonparametric.

2.7 A General Method for Simulating a Random Variable

Suppose we wish to simulate a random variable X with cdf F_X for which there is no R function. This can be done via F_X^{-1}(U), where U has a U(0,1) distribution. In other words, we call runif() and then plug the result into the inverse of the cdf of X. Here "inverse" is in the sense that, for instance, squaring and "square-rooting," exp() and ln(), etc. are inverse operations of each other.

For example, say X has the density 2t on (0,1). Then F_X(t) = t², so F_X^{-1}(s) = s^{0.5}. We can then generate X in R as sqrt(runif(1)). Here's why:

For brevity, denote F_X^{-1} as G and F_X as H. Our generated random variable is G(U). Then

P[G(U) ≤ t] = P[U ≤ G^{-1}(t)] = P[U ≤ H(t)] = H(t)   (2.61)

In other words, the cdf of G(U) is F_X! So, G(U) has the same distribution as X.

Note that this method, though valid, is not necessarily practical, since computing F_X^{-1} may not be easy.

Exercises

1. Suppose X has a uniform distribution on (-1,1), and let Y = X². Find f_Y. Hint: First find F_Y(t).

2. "All that glitters is not gold," and not every bell-shaped density is normal. The family of Cauchy distributions, having density

f_X(t) = (1/π) b/(b² + t²),  -∞ < t < ∞

with b > 0, is one such example.

5. ... Find P(W > 0.7). Hint: P(W > 0.7) = P(W² > 0.49).

6. Suppose a manufacturer of some electronic component finds that its lifetime is exponentially distributed with mean 10000 hours. They give a refund if the item fails before 500 hours. Let N be the number of items they have sold, up to and including the one on which they make the first refund. Find EN and Var(N).

7. For the density a e^{-bt}, t > 0, show that we must have a = b. Then show that the mean and variance for this distribution are 1/b and 1/b², respectively.

8. Consider the "random bucket" example in Footnote 11. Suppose bucket diameter D, measured in meters, has a uniform distribution on (1,2). Let W denote the diameter of the bucket in which the tossed ball lands.

(a) Find the density, mean and variance of W, and also P(W > 1.5).

(b) Write an R function that will generate random variates having the distribution of W.

9. Suppose that computer roundoff error in computing the square roots of numbers in a certain range is distributed uniformly on (-0.5,0.5), and that we will be computing the sum of n such square roots. Find a number c such that the probability is approximately 95% that the sum is in error by no more than c.

10. A certain public parking garage charges parking fees of $1.50 for the first hour or fraction thereof, and $1 per hour after that. So, someone who stays 57 minutes pays $1.50, someone who parks for one hour and 12 minutes pays $1.70, and so on.
Suppose parking times T are exponentially distributed with mean 1.5 hours. Let W denote the total fee paid. Find E(W) and Var(W).

11. In Section 2.4.1, we showed that the exponential distribution is memoryless. In fact, it is the only continuous distribution with that property. Show that the U(0,1) distribution does NOT have that property. To do this, evaluate both sides of (2.44).

Chapter 3

Multivariate Probability Models

3.1 Multivariate Distributions

3.1.1 Why Are They Needed?

Most applications of probability and statistics involve the interaction between variables. For instance, when you buy a book at Amazon.com, the software will likely inform you of other books that people bought in conjunction with the one you selected. Amazon is relying on the fact that sales of certain pairs or groups of books are correlated. Individual pmfs p_X and densities f_X don't describe these correlations. We need something more. We need ways to describe multivariate distributions.

3.1.2 Discrete Case

Say we roll a blue die and a yellow one. Let X and Y denote the number of dots which appear on the blue and yellow dice, respectively, and let S denote the total number of dots appearing on the two dice. We will not discuss Y much here, focusing on X and S.

Recall that the distribution of X is defined to be a list of all the values X takes on, and their associated probabilities:

{(1, 1/6), (2, 1/6), (3, 1/6), (4, 1/6), (5, 1/6), (6, 1/6)}   (3.1)

We can write this more compactly (but equivalently) by defining X's probability mass function (pmf):

p_X(i) = P(X = i) = 1/6,  i = 1, 2, ..., 6   (3.2)

The distribution of S is defined similarly, either as a list,

{(2, 1/36), (3, 2/36), (4, 3/36), (5, 4/36), (6, 5/36), (7, 6/36), (8, 5/36), (9, 4/36), (10, 3/36), (11, 2/36), (12, 1/36)}   (3.3)

or via its pmf p_S.¹

But it may also be important to describe how X and S vary jointly. For example, intuitively we would feel that X and S are positively correlated. How do we describe their joint variation?

To do this, we define the bivariate probability mass function of X and S. Just as the univariate pmf of X is defined to be p_X(i) = P(X = i), we define the bivariate pmf as

p_{X,S}(i,j) = P(X = i and S = j) = 1/36,  i = 1,...,6, j = i+1,...,i+6   (3.4)

Expected values are calculated in the analogous manner. Recall that for a function g() of X,

E[g(X)] = Σ_i g(i) p_X(i)   (3.5)

So, for any function g() of two discrete random variables U and V, define

E[g(U,V)] = Σ_{i,j} g(i,j) p_{U,V}(i,j)   (3.6)

For instance:

E(XS) = Σ_{i=1}^6 Σ_j i j p_{X,S}(i,j) = Σ_{i=1}^6 Σ_{j=i+1}^{i+6} i j (1/36)   (3.7)

The univariate pmfs, called marginal pmfs, can of course be recovered from the bivariate pmf. To get p_X() from p_{X,S}(), we sum over the values of S. For example, let's find p_X(3), which is the probability that X = 3. How could the event X = 3 happen? Well, S could be anywhere from 4 to 9, each combination having probability 1/36. So,

p_X(3) = Σ_j p_{X,S}(3,j) = 6 · (1/36) = 1/6   (3.8)

That is consistent with our univariate calculation of p_X(3), as of course it should be.

¹Recall that the convention for denoting pmfs is to use the letter 'p' with a subscript indicating the random variable.

We get consistent results for expected values too. Treating X as a function of X and S, we have

E(X) = Σ_{i=1}^6 Σ_{j=i+1}^{i+6} i p_{X,S}(i,j)   (3.9)

but the right-hand side (RHS) of (3.9) reduces to

E(X) = Σ_{i=1}^6 i Σ_{j=i+1}^{i+6} p_{X,S}(i,j) = Σ_{i=1}^6 i p_X(i)   (3.10)

from (3.8). The last expression in (3.10) is E(X) as defined in the univariate setting, so everything is indeed consistent.
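Here is a small R sketch of our own (pxs is just our name for the pmf matrix) that builds the bivariate pmf of X and S by enumerating the 36 equally likely outcomes of the two dice, then recovers the marginal pmf of X as in (3.8):

pxs <- matrix(0,nrow=6,ncol=12)
for (i in 1:6)
   for (j in 1:6)   # j is the number on the yellow die
      pxs[i,i+j] <- pxs[i,i+j] + 1/36
rowSums(pxs)   # the marginal pmf of X; each entry should be 1/6
sum(pxs[3,])   # P(X = 3), as in (3.8)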
3.1.3 Multivariate Densities

3.1.3.1 Motivation and Definition

Extending our previous definition of cdf for a single variable, we define the two-dimensional cdf for a pair of random variables X and Y as

F_{X,Y}(u,v) = P(X ≤ u and Y ≤ v)   (3.11)

If F_{X,Y} is a continuous function, we define the density of X and Y to be the mixed partial derivative

f_{X,Y}(s,t) = ∂²/(∂s ∂t) F_{X,Y}(s,t)   (3.12)

Just as univariate probabilities are integrals of the density, bivariate probabilities are double integrals: for a region A in the plane,

P[(X,Y) ∈ A] = ∫∫_A f_{X,Y}(s,t) ds dt   (3.13)

and expected values are computed by

E[g(X,Y)] = ∫∫ g(s,t) f_{X,Y}(s,t) ds dt   (3.14)

As an example, suppose the random pair (X,Y) has density

f_{X,Y}(s,t) = 8st,  0 < t < s < 1   (3.16)

and 0 elsewhere, and we wish to find P(X+Y > 1). This calculation will involve a double integral. The region A in (3.13) is {(s,t): s+t > 1, 0 < t < s < 1}. We have a choice of integrating in the order ds dt or dt ds. The latter will turn out to be more convenient. The limits in the double integral are obtained through the following reasoning, as shown in this figure:

[Figure: the region of integration, with s on the horizontal axis running from 0.0 to 1.0, and the subregion A shown striped.]

Here s represents X and t represents Y. The gray area is the region in which (X,Y) ranges. The subregion A in (3.13), corresponding to the event X+Y > 1, is shown in the striped area in the figure. The dark vertical line shows all the points (s,t) in the striped region for a typical value of s in the integration process. Since s is the variable in the outer integral, consider it fixed for the time being and ask where t will range for that s. We see that for X = s, Y will range from 1-s to s; thus we set the inner integral's limits to 1-s and s. Finally, we then ask where s can range, and see from the picture that for points in A it ranges from 0.5 to 1. Thus those are the limits for the outer integral:

P(X+Y > 1) = ∫_{0.5}^1 ∫_{1-s}^s 8st dt ds = ∫_{0.5}^1 8s(s - 0.5) ds = 5/6   (3.17)

Following (3.14),

E[√(X+Y)] = ∫_0^1 ∫_0^s √(s+t) · 8st dt ds   (3.18)

Let's find the marginal density f_Y(t). So we must "integrate out" the s in (3.16):

f_Y(t) = ∫_t^1 8st ds = 4t - 4t³   (3.19)

3.2 More on Co-variation of Random Variables

3.2.1 Covariance

The covariance between random variables X and Y is defined as

Cov(X,Y) = E[(X - EX)(Y - EY)]   (3.20)

Suppose that typically when X is larger than its mean, Y is also larger than its mean, and vice versa for below-mean values. Then (3.20) will likely be positive. In other words, if X and Y are positively correlated (a term we will define formally later but keep intuitive for now), then their covariance is positive. Similarly, if X is often smaller than its mean whenever Y is larger than its mean, the covariance and correlation between them will be negative. All of this is roughly speaking, of course, since it depends on how much X is larger or smaller than its mean, etc.

Covariance is linear in both arguments:

Cov(aX + bY, cU + dV) = ac Cov(X,U) + ad Cov(X,V) + bc Cov(Y,U) + bd Cov(Y,V)   (3.21)

for any constants a, b, c and d. Also

Cov(X, Y+q) = Cov(X,Y)   (3.22)

for any constant q, and so on.

Note that

Cov(X,X) = Var(X)   (3.23)

for any X with finite variance.

Also, here is a shortcut way to find the covariance:

Cov(X,Y) = E(XY) - EX · EY   (3.24)

The proof will help you review some important issues, namely (a) E(U+V) = EU + EV, (b) E(cU) = c EU and Ec = c for any constant c, and (c) EX and EY are constants in (3.24):

Cov(X,Y) = E[(X - EX)(Y - EY)]  (definition)   (3.25)
         = E[XY - EX · Y - EY · X + EX · EY]  (algebra)   (3.26)
         = E(XY) + E[-EX · Y] + E[-EY · X] + E[EX · EY]  (E[U+V] = EU + EV)   (3.27)
         = E(XY) - EX · EY  (E[cU] = c EU, Ec = c)   (3.28)

Another important property:

Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y)   (3.29)

This comes from (3.24), the relation Var(X) = E(X²) - (EX)² and the corresponding one for Y. Just substitute and do the algebra.
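Both (3.24) and (3.29) are easy to check numerically. The lines below are our own sketch, with X uniform on (0,1) and Y equal to X plus independent uniform noise, an arbitrary choice that makes the two variables correlated:

x <- runif(100000)
y <- x + runif(100000)
mean(x*y) - mean(x)*mean(y); cov(x,y)    # (3.24): both near 1/12
var(x) + var(y) + 2*cov(x,y); var(x+y)   # (3.29): both near 5/12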
3.2.2 Correlation

Covariance does measure how much or little X and Y vary together, but it is hard to decide whether a given value of covariance is "large" or not. For instance, if we are measuring lengths in feet and change to inches, then (3.21) shows that the covariance will increase by a factor of 12² = 144. Thus it makes sense to scale covariance according to the variables' standard deviations. Accordingly, the correlation between two random variables X and Y is defined by

ρ(X,Y) = Cov(X,Y) / (√Var(X) · √Var(Y))   (3.30)

So, correlation is unitless, i.e. does not involve units like feet, pounds, etc.

It is shown later in this chapter (Section 3.9.1) that

-1 ≤ ρ(X,Y) ≤ 1

with equality if and only if one of the variables is an exact linear function of the other.

3.2.4 Example: A Catchup Game

Consider the following simple game, played by two players, with 10 turns for each. In each turn, a player wins a random amount, uniformly distributed on (0,1); but if, even with that amount, his cumulative winnings would still trail the other player's, he receives a bonus of 0.2. Let X and Y denote the two players' total winnings after the 10 turns. Let's use simulation to find ρ(X,Y), and also F_{X,Y}(5.8, 5.2):

taketurn <- function(mywinnings,hiswinnings) {
   win <- runif(1)
   if (mywinnings + win >= hiswinnings) return(win)
   else return(win+0.2)
}

cdf2 <- function(xy,t1,t2) {  # 2-dim. cdf
   tmp <- xy[xy[,1] <= t1 & xy[,2] <= t2,]
   return(nrow(tmp)/nrow(xy))
}

nreps <- 10000
nturns <- 10
xyvals <- matrix(nrow=nreps,ncol=2)
for (rep in 1:nreps) {
   x <- 0
   y <- 0
   for (turn in 1:nturns) {
      # x's turn
      x <- x + taketurn(x,y)
      # y's turn
      y <- y + taketurn(y,x)
   }
   xyvals[rep,] <- c(x,y)
}
print(cor(xyvals[,1],xyvals[,2]))
print(cdf2(xyvals,5.8,5.2))

The output is 0.65 and 0.03. So, X and Y are indeed positively correlated, as we had surmised. Note the use of R's built-in function cor() to compute correlation.

Note too that the bonus makes the two players' winnings "leapfrog" over each other. Without it, we would have EX = EY = 5.0, and F_{X,Y}(5.8, 5.2) somewhat greater than 0.25. (The latter would be the value of F_{X,Y}(5.0, 5.0).) But the bonus moves the distributions of X and Y more toward 10.0.

3.3 Sets of Independent Random Variables

Great mathematical tractability can be achieved by assuming that the Xi in a random vector X = (X1, ..., Xk) are independent. In many applications, this is a reasonable assumption.

3.3.1 Properties

In the next few sections, we will look at some commonly-used properties of sets of independent random variables. For simplicity, consider the case k = 2, with X and Y being independent (scalar) random variables.

3.3.1.1 Probability Mass Functions and Densities Factor

If X and Y are independent, then

p_{X,Y} = p_X p_Y   (3.48)

in the discrete case, and

f_{X,Y} = f_X f_Y   (3.49)

in the continuous case. In other words, the joint pmf/density is the product of the marginal ones.

This is easily seen in the discrete case:

p_{X,Y}(i,j) = P(X = i and Y = j)  (definition)   (3.50)
             = P(X = i) P(Y = j)  (independence)   (3.51)
             = p_X(i) p_Y(j)  (definition)   (3.52)

Here is the proof for the continuous case:

f_{X,Y}(u,v) = ∂²/(∂u ∂v) F_{X,Y}(u,v)   (3.53)
             = ∂²/(∂u ∂v) P(X ≤ u and Y ≤ v)   (3.54)
             = ∂²/(∂u ∂v) [P(X ≤ u) · P(Y ≤ v)]   (3.55)
             = ∂²/(∂u ∂v) [F_X(u) · F_Y(v)]   (3.56)
             = f_X(u) f_Y(v)   (3.57)

3.3.1.2 Expected Values Factor

If X and Y are independent, then

E(XY) = E(X) E(Y)   (3.58)

To prove this, use (3.48) and (3.49) for the discrete and continuous cases.

3.3.1.3 Covariance Is 0

If X and Y are independent, then from (3.58) and (3.24), we have

Cov(X,Y) = 0   (3.59)

and thus ρ(X,Y) = 0 as well.

However, the converse is false. A counterexample is the random pair (V,W) that is uniformly distributed on the unit disk, {(s,t): s² + t² ≤ 1}.

3.3.1.4 Variances Add

If X and Y are independent, then from (3.29) and (3.58), we have

Var(X+Y) = Var(X) + Var(Y)   (3.60)

3.3.1.5 Convolution

If X and Y are nonnegative, continuous random variables, and we set Z = X+Y, then the density of Z is the convolution of the densities of X and Y:

f_Z(t) = ∫_0^t f_X(s) f_Y(t-s) ds   (3.61)

You can get intuition on this by considering the discrete case.
Say U and V are nonnegative integer-valued random variables, and set W = U+V. Let's find p_W:

p_W(k) = P(W = k)  (by definition)   (3.62)
       = P(U + V = k)  (substitution)   (3.63)
       = Σ_{i=0}^k P(U = i and V = k-i)  ("In what ways can it happen?")   (3.64)
       = Σ_{i=0}^k p_{U,V}(i, k-i)  (by definition)   (3.65)
       = Σ_{i=0}^k p_U(i) p_V(k-i)  (from Section 3.3.1.1)   (3.66)

Review the analogy between densities and pmfs in our unit on continuous random variables, Section 2.2.1, and then see how (3.61) is analogous to (3.62) through (3.66):

* k in (3.62) is analogous to t in (3.61)
* the limits 0 to k in (3.66) are analogous to the limits 0 to t in (3.61)
* the expression k-i in (3.66) is analogous to t-s in (3.61)
* and so on

3.3.2 Examples

3.3.2.1 Example: Dice

In Section 3.2.1, we speculated that the correlation between X, the number on the blue die, and S, the total of the two dice, was positive. Let's compute it.

Write S = X + Y, where Y is the number on the yellow die. Then using the properties of covariance presented above, we have that

Cov(X,S) = Cov(X, X+Y)  (by definition)   (3.67)
         = Cov(X,X) + Cov(X,Y)  (from (3.21))   (3.68)
         = Var(X) + 0  (from (3.23), (3.59))   (3.69)

Also, from (3.60),

Var(S) = Var(X+Y) = Var(X) + Var(Y)   (3.70)

But Var(Y) = Var(X). So the correlation between X and S is

ρ(X,S) = Var(X) / (√Var(X) · √(2 Var(X))) = 1/√2 ≈ 0.707   (3.71)

Since correlation is at most 1 in absolute value, 0.707 is considered a fairly high correlation. Of course, we did expect X and S to be highly correlated.

3.3.2.2 Example: Ethernet

Consider this network, essentially Ethernet. Here nodes can send at any time. Transmission time is 0.1 seconds. Nodes can also "hear" each other; one node will not start transmitting if it hears that another has a transmission in progress, and even when that transmission ends, the node that had been waiting will wait an additional random time, to reduce the possibility of colliding with some other node that had been waiting.

Suppose two nodes hear a third transmitting, and thus refrain from sending. Let X and Y be their random backoff times, i.e. the random times they wait before trying to send. Let's find the probability that they clash, which is P(|X - Y| < 0.1).

Assume that X and Y are independent and exponentially distributed with mean 0.2, i.e. they each have density 5e^{-5t} on (0,∞). Then from (3.49), we know that their joint density is the product of their marginal densities,

f_{X,Y}(s,t) = 25 e^{-5(s+t)},  s,t > 0   (3.72)

Now

P(|X-Y| < 0.1) = 1 - P(|X-Y| > 0.1) = 1 - P(X > Y + 0.1) - P(Y > X + 0.1)   (3.73)

Look at that first probability. Applying (3.13) with A = {(s,t): s > t+0.1, 0 < s,t}, we have

P(X > Y + 0.1) = ∫_0^∞ ∫_{t+0.1}^∞ 25 e^{-5(s+t)} ds dt = 0.303   (3.74)

By symmetry, P(Y > X + 0.1) is the same. So, the probability of a clash is 0.394, rather high. We may wish to increase our mean backoff time, though a more detailed analysis is needed.
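That 0.394 figure is easy to confirm by simulation; these few lines are our own check, not part of the text:

n <- 100000
x <- rexp(n,5)
y <- rexp(n,5)
mean(abs(x-y) < 0.1)   # should be near 0.394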
3.3.2.3 Example: Analysis of Seek Time

This will be an analysis of seek time on a disk. Suppose we have mapped the innermost track to 0 and the outermost one to 1, and assume that (a) the number of tracks is large enough that we can treat the requested track number as a continuous random variable in the interval [0,1], and (b) the track number requested has a uniform distribution on that interval.

Consider two consecutive service requests for the disk, denoting their track numbers by X and Y. In the simplest model, we assume that X and Y are independent, so that the joint distribution of X and Y is the product of their marginals, and is thus equal to 1 on the square 0 < X,Y < 1.

The seek distance will be |X - Y|. Its mean value is found by taking g(s,t) in (3.14) to be |s - t|:

∫_0^1 ∫_0^1 |s - t| · 1 ds dt = 1/3   (3.75)

By the way, what about the assumptions here? The independence would be a good assumption, for instance, for a heavily-used file server accessed by many different machines. Two successive requests are likely to be from different machines, thus independent. In fact, even within the same machine, if we have a lot of users at this time, successive requests can be assumed independent. On the other hand, successive requests from a particular user probably can't be modeled this way.

As mentioned in our unit on continuous random variables, page 43, if it's been a while since we've done a defragmenting operation, the assumption of a uniform distribution for requests is probably good.

Once again, this is just scratching the surface. Much more sophisticated models are used for more detailed work.

3.3.2.4 Example: Backup Battery

Suppose we have a portable machine that has compartments for two batteries. The main battery has lifetime X with mean 2.0 hours, and the backup's lifetime Y has mean 1.0 hour. One replaces the first by the second as soon as the first fails. The lifetimes of the batteries are exponentially distributed and independent. Let's find the density of W, the time that the system is operational (i.e. the sum of the lifetimes of the two batteries).

Recall that if the two batteries had the same mean lifetimes, W would have a gamma distribution. But that's not the case here. However, we notice that the distribution of W is a convolution of two exponential densities, as it is the sum of two nonnegative independent random variables. Using (3.61), we have

f_W(t) = ∫_0^t f_X(s) f_Y(t-s) ds = ∫_0^t 0.5 e^{-0.5s} e^{-(t-s)} ds = e^{-0.5t} - e^{-t},  0 < t < ∞   (3.76)

3.4 Matrix Formulations

When dealing with multivariate distributions, some very messy equations can be greatly compactified through the use of matrix algebra. We will introduce this here.

Throughout this section, consider a random vector W = (W1, ..., Wk)', where ' denotes matrix transpose, and a vector written horizontally like this without a ' means a row vector.

3.4.1 Properties of Mean Vectors

The expected value of W is defined to be the vector

EW = (EW1, ..., EWk)'   (3.77)

The linearity of the components implies that of the vectors. For any scalar constants c and d, and any random vectors V and W, we have

E(cV + dW) = c EV + d EW   (3.78)

where the multiplication and equality is now in the vector sense.

3.4.2 Properties of Covariance Matrices

The covariance matrix Σ of W is the k x k matrix whose (i,j)th element is Cov(Wi, Wj). Note that that means that the diagonal elements of the matrix are the variances of the Wi, and that the matrix is symmetric. We write the covariance matrix of W as Cov(W). Here are some important properties:

* Say c is a constant scalar, and define Q = cW. Then Q is a k-component random vector like W, and

Cov(Q) = c² Cov(W)   (3.79)

* If A is an r x k nonrandom matrix, define Q = AW. Then Q is an r-component random vector, and

Cov(Q) = A Cov(W) A'   (3.80)

* Suppose V and W are independent random vectors, meaning that each component in V is independent of each component of W.
(But this does NOT mean that the components within V are independent of each other, and similarly for W.) Then

Cov(V + W) = Cov(V) + Cov(W)   (3.81)

3.5 Conditional Distributions

The key to good probability modeling and statistical analysis is to understand conditional probability. The issue arises constantly.

3.5.1 Conditional Pmfs and Densities

First, let's review: In many repetitions of our "experiment," P(A) is the long-run proportion of the time that A occurs. By contrast, P(A|B) is the long-run proportion of the time that A occurs, among those repetitions in which B occurs. Keep this in your mind at all times.

Now we apply this to pmfs, densities, etc. We define the conditional pmf as follows for discrete random variables X and Y:

p_{Y|X}(j|i) = P(Y = j | X = i) = p_{X,Y}(i,j) / p_X(i)   (3.82)

By analogy, we define the conditional density for continuous X and Y:

f_{Y|X}(t|s) = f_{X,Y}(s,t) / f_X(s)   (3.83)

3.5.2 Conditional Expectation

Conditional expectations are defined as straightforward extensions of (3.82) and (3.83):

E(Y | X = i) = Σ_j j p_{Y|X}(j|i)   (3.84)

E(Y | X = s) = ∫ t f_{Y|X}(t|s) dt   (3.85)

3.5.3 The Law of Total Expectation (advanced topic)

3.5.3.1 Expected Value As a Random Variable

For a random variable Y and an event A, the quantity E(Y|A) is the long-run average of Y, among the times when A occurs. Note several things about the expression E(Y|A):

* The expression evaluates to a constant.
* The item to the left of the | symbol is a random variable (Y).
* The item on the right of the | symbol is an event (A).

By contrast, for the quantity E(Y|W) defined below, for a random variable W, it is the case that:

* The expression itself is a random variable, not a constant.
* The item to the left of the | symbol is again a random variable (Y).
* But the item to the right of the | symbol is also a random variable (W).

It will be very important to keep these differences in mind.

Consider the function g(t) defined as²

g(t) = E(Y | W = t)   (3.86)

In this case, the item to the right of the | is an event, and thus g(t) is a constant (for each value of t), not a random variable.

Now, define the random variable Q to be g(W). Since W is a random variable, then Q is too. The quantity E(Y|W) is then defined to be Q. (Before reading any further, re-read the two sets of bulleted items above, and make sure you understand the difference between E(Y|W=t) and E(Y|W).)

One can view E(Y|W) as a projection in an abstract vector space. This is very elegant, and actually aids the intuition. If (and only if) you are mathematically adventurous, read the details in Section 3.9.2.

²Of course, the t is just a placeholder, and any other letter could be used.

3.5.3.2 The Famous Formula (Theorem of Total Expectation)

An extremely useful formula, given only scant or no mention in most undergraduate probability courses, is

E(Y) = E[E(Y|W)]   (3.87)

for any random variables Y and W.

The RHS of (3.87) looks odd at first, but it's merely E[g(W)]; since Q = E(Y|W) is a random variable, we can certainly ask what its expected value is.

Equation (3.87) is a bit abstract. It's a very useful abstraction, enabling streamlined writing and thinking about the process. Still, you may find it helpful to consider the case of discrete W, in which (3.87) has the more concrete form

EY = Σ_i P(W = i) · E(Y | W = i)   (3.88)

To see this intuitively, think of measuring the heights and weights of all the adults in Davis. Say we measure height to the nearest inch, so that height is discrete.
We look at all the adults in Davis who are 72 inches tall, and write down their mean weight. Then we write down the mean weight of all adults of height 68. Then we write down the mean weight of all adults of height 75, and so on. Then (3.87) says that if we take the average of all the numbers we write down, the average of the averages, then we get the mean weight among all adults in Davis.

Note carefully, though, that this is a weighted average. If for instance people of height 69 inches are more numerous in the population, then their mean weight will receive greater emphasis in our average of all the means we've written down. This is seen in (3.88), with the weights being the quantities P(W = i).

The relation (3.87) is proved in the discrete case in Section 3.10.

3.5.4 What About the Variance?

By the way, one might guess that the analog of the Theorem of Total Expectation for variance is

Var(Y) = E[Var(Y|W)]   (3.89)

But this is false. Think for example of the extreme case in which Y = W. Then Var(Y|W) would be 0, but Var(Y) would be nonzero.

The correct formula, called the Law of Total Variance, is

Var(Y) = E[Var(Y|W)] + Var[E(Y|W)]   (3.90)

Deriving this formula is easy, by simply evaluating both sides, and using the relation Var(X) = E(X²) - (EX)². This exercise is left to the reader.

3.5.5 Example: Trapped Miner

(Adapted from Stochastic Processes, by Sheldon Ross, Wiley, 1996.)

A miner is trapped in a mine, and has a choice of three doors. Though he doesn't realize it, if he chooses to exit the first door, it will take him to safety after 2 hours of travel. If he chooses the second one, it will lead back to the mine after 3 hours of travel. The third one leads back to the mine after 5 hours of travel. Suppose the doors look identical, and if he returns to the mine he does not remember which door(s) he tried earlier. What is the expected time until he reaches safety?

Let Y be the time it takes to reach safety, and let W denote the number of the door chosen (1, 2 or 3) on the first try. Then let us consider what values E(Y|W) can have. If W = 1, then Y = 2, so

E(Y | W = 1) = 2   (3.91)

If W = 2, things are a bit more complicated. The miner will go on a 3-hour excursion, and then be back in his original situation, and thus have a further expected wait of EY, since "time starts over." In other words,

E(Y | W = 2) = 3 + EY   (3.92)

Similarly,

E(Y | W = 3) = 5 + EY   (3.93)

In summary, now considering the random variable E(Y|W), we have

Q = E(Y|W) = 2 w.p. 1/3;  3 + EY w.p. 1/3;  5 + EY w.p. 1/3   (3.94)

where "w.p." means "with probability." So, using (3.87) or (3.88), we have

EY = EQ = 2 × (1/3) + (3 + EY) × (1/3) + (5 + EY) × (1/3) = 10/3 + (2/3) EY   (3.95)

Equating the extreme left and extreme right ends of this series of equations, we can solve for EY, which we find to be 10.

It is left to the reader to see how this would change if we assume that the miner remembers which doors he has already tried.
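Before moving on, here is a quick simulation check of that answer, our own sketch rather than the author's:

nreps <- 10000
total <- 0
for (rep in 1:nreps) {
   t <- 0
   repeat {
      door <- sample(1:3,1)   # the doors look identical, and he forgets past tries
      if (door == 1) { t <- t + 2; break }
      t <- t + ifelse(door == 2, 3, 5)
   }
   total <- total + t
}
total/nreps   # should be near 10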
3.5.6 Example: Analysis of Hash Tables

(Famous example, adapted from various sources.)

Consider a database table consisting of m cells, only some of which are currently occupied. Each time a new key must be inserted, it is used in a hash function to find an unoccupied cell. Since multiple keys map to the same table cell, we may have to probe multiple times before finding an unoccupied cell.

We wish to find E(Y), where Y is the number of probes needed to insert a new key. One approach to doing so would be to condition on W, the number of currently occupied cells at the time we do a search. After finding E(Y|W), we can use the Theorem of Total Expectation to find EY. We will make two assumptions (to be discussed later):

(a) Given that W = k, each probe will collide with an existing cell with probability k/m, with successive probes being independent.

(b) W is uniformly distributed on the set 1, 2, ..., m, i.e. P(W = k) = 1/m for each k.

To calculate E(Y|W=k), we note that given W = k, then Y is the number of independent trials until a "success" is reached, where "success" means that our probe turns out to be to an unoccupied cell. This is a geometric distribution, i.e.

P(Y = r | W = k) = (k/m)^{r-1} (1 - k/m)   (3.96)

The mean of this geometric distribution is, from (1.75),

1 / (1 - k/m)   (3.97)

Then

EY = E[E(Y|W)]   (3.98)
   = Σ_{k=1}^{m-1} (1/m) E(Y | W = k)   (3.99)
   = Σ_{k=1}^{m-1} (1/m) · 1/(1 - k/m)   (3.100)
   = 1 + 1/2 + 1/3 + ... + 1/(m-1)   (3.101)
   ≈ ∫_1^m (1/u) du   (3.102)
   = ln(m)   (3.103)

where the approximation is something you might remember from calculus (you can picture it by drawing rectangles to approximate the area under the curve).

Now, what about our assumptions, (a) and (b)? The assumption in (a) of each cell having probability k/m should be reasonably accurate if k is much smaller than m, because hash functions tend to distribute probes uniformly, and the assumption of independence of successive probes is all right too, since it is very unlikely that we would hit the same cell twice. However, if k is not much smaller than m, the accuracy will suffer.

Assumption (b) is more subtle, with differing interpretations. For example, the model may concern one specific database, in which case the assumption may be questionable. Presumably W grows over time, in which case the assumption would make no sense; W doesn't even have a distribution then. We could instead think of a database which grows and shrinks as time progresses. However, even here, it would seem that W would probably oscillate around some value like m/2, rather than being uniformly distributed as assumed here. Thus, this model is probably not very realistic. However, even idealized models can sometimes provide important insights.
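Here is a simulation sketch of this model, ours rather than the text's, with m = 100 an arbitrary choice; under assumptions (a) and (b), the mean probe count should come out near the harmonic sum in (3.101), with the ln(m) of (3.103) a rougher approximation:

m <- 100
nreps <- 10000
y <- vector(length=nreps)
for (rep in 1:nreps) {
   k <- sample(1:(m-1),1)   # number of currently occupied cells
   probes <- 1
   while (runif(1) < k/m) probes <- probes + 1   # collision, so probe again
   y[rep] <- probes
}
mean(y)            # should be near 5.2 for this m
sum(1/(1:(m-1)))   # the harmonic sum in (3.101), about 5.18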
3.6 Parametric Families of Distributions

Since there are so many ways in which random variables can correlate with each other, there are rather few parametric families commonly used to model multivariate distributions (other than those arising from sets of independent random variables having a distribution in a common parametric univariate family). We will discuss two here.

3.6.1 The Multinomial Family of Distributions

3.6.1.1 Probability Mass Function

This is a generalization of the binomial family.

Suppose one tosses a die 8 times. What is the probability that the results consist of two 1s, one 2, one 4, three 5s and one 6? Well, if the tosses occur in that order, i.e. the two 1s come first, then the 2, etc., then the probability is

(1/6)² (1/6)¹ (1/6)⁰ (1/6)¹ (1/6)³ (1/6)¹   (3.104)

But there are many different orderings, in fact

8! / (2! 1! 0! 1! 3! 1!)   (3.105)

of them.

From this, we can see the following. Suppose:

* we have n trials, each of which has r possible outcomes or categories
* the trials are independent
* the ith outcome has probability pi

Let Xi denote the number of trials with outcome i, i = 1,...,r. Then we say that X1, ..., Xr have a multinomial distribution, and the joint pmf of X1, ..., Xr is

p_{X1,...,Xr}(j1,...,jr) = (n! / (j1! ··· jr!)) p1^{j1} ··· pr^{jr}   (3.106)

Note that this family of distributions has r+1 parameters.

3.6.1.2 Means and Covariances

Now look at the vector X = (X1, ..., Xr)'. Let's find its mean vector and covariance matrix.

First, note that the marginal distributions of the Xi are binomial! So,

EXi = n pi and Var(Xi) = n pi (1 - pi)   (3.107)

So we know EX now:

EX = (n p1, ..., n pr)'   (3.108)

What about Cov(X)? To this end, let Tki equal 1 or 0, depending on whether the kth trial results in outcome i, k = 1,...,n and i = 1,...,r. We say that Tki is the indicator variable for the event that the kth trial results in outcome i. This is a simple concept, but it has powerful uses, as you'll see.

Make sure you understand that

Xi = Σ_{k=1}^n Tki   (3.109)

From (3.109), you can see that

X = U1 + ... + Un   (3.110)

where

Uk = (Tk1, ..., Tkr)'   (3.111)

Now, here's where the power of the matrix operations in Section 3.4 will be seen:

Cov(X) = Cov(U1 + ... + Un)  (from (3.110))   (3.112)
       = Cov(U1) + ... + Cov(Un)  (from (3.81))   (3.113)
       = n Cov(U1)  (all have the same distribution)   (3.114)

Now, for i ≠ j, we have from (3.24)

Cov(T1i, T1j) = E(T1i T1j) - ET1i · ET1j   (3.115)

But T1i T1j = 0! (The first trial cannot result in both outcome i and outcome j.) And ET1i = pi, and the same for the j case. So,

Cov(T1i, T1j) = -pi pj   (3.116)

Of course, for i = j, Cov(T1i, T1i) = Var(T1i) = pi(1 - pi), since T1i has a binomial distribution with number of trials equal to 1.

Putting all this together, and recalling (3.114), we see that

Cov(X) = n ·
[ p1(1-p1)   -p1p2    ...   -p1pr   ]
[ -p1p2     p2(1-p2)  ...   -p2pr   ]
[   ...        ...    ...    ...    ]
[ -p1pr      -p2pr    ...  pr(1-pr) ]
   (3.117)

Note too that if we define R = X/n, so that R is the vector of proportions in the various categories (e.g. X1/n is the fraction of trials that resulted in category 1), then from (3.117) and (3.79), we have

Cov(R) = (1/n) ·
[ p1(1-p1)   -p1p2    ...   -p1pr   ]
[ -p1p2     p2(1-p2)  ...   -p2pr   ]
[   ...        ...    ...    ...    ]
[ -p1pr      -p2pr    ...  pr(1-pr) ]
   (3.118)

Whew! That was a workout, but these formulas will become very useful later on, both in this unit and subsequent ones.

3.6.1.3 Application: Text Mining

One of the branches of computer science in which the multinomial family plays a prominent role is text mining. One goal is automatic document classification. We want to write software that will make reasonably accurate guesses as to whether a document is about sports, the stock market, elections etc., based on the frequencies of various key words the program finds in the document.

Many of the simpler methods for this use the bag of words model. We have r key words we've decided are useful for the classification process, and the model assumes that statistically the frequencies of those words in a given document category, say sports, follow a multinomial distribution. Each category has its own set of probabilities p1, ..., pr. For instance, if "Barry Bonds" is considered one word, its probability will be much higher in the sports category than in the elections category, say. So, the observed frequencies of the words in a particular document will hopefully enable our software to make a fairly good guess as to the category the document belongs to.

Once again, this is a very simple model here, designed to just introduce the topic to you. Clearly the multinomial assumption of independence between trials is grossly incorrect here, and most models are much more complex than this.
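As an empirical check of (3.117), we can simulate multinomial vectors with R's rmultinom() and compare the sample covariance matrix with the formula; the values of n and p below are arbitrary choices of ours:

n <- 8
p <- c(0.2,0.3,0.5)
x <- rmultinom(50000,n,p)    # one column per repetition
cov(t(x))                    # sample covariance matrix
n * (diag(p) - p %*% t(p))   # the matrix in (3.117); should nearly agree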
3.6.2 The Multivariate Normal Family of Distributions

Note to the reader: This is a more difficult section, but worth putting extra effort into, as so many statistical applications in computer science make use of it. It will seem hard at times, but in the end won't be too bad.

3.6.2.1 Densities and Properties

Intuitively, this family has densities which are shaped like multidimensional bells, just like the univariate normal has the famous one-dimensional bell shape.

Let's look at the bivariate case first. The joint distribution of X1 and X2 is said to be bivariate normal if their density is

f_{X1,X2}(s,t) = (1/(2π σ1 σ2 √(1-ρ²))) exp{ -(1/(2(1-ρ²))) [ ((s-μ1)/σ1)² + ((t-μ2)/σ2)² - 2ρ(s-μ1)(t-μ2)/(σ1σ2) ] },  -∞ < s,t < ∞   (3.119)

Here μ1 and μ2 are the means of X1 and X2, σ1 and σ2 their standard deviations, and ρ their correlation.

3.8 Transform Methods (advanced topic)

3.8.0.5 Generating Functions

For a random variable V taking on nonnegative integer values, the generating function of V is defined by

g_V(s) = E(s^V) = Σ_{i=0}^∞ s^i P(V = i)   (3.137)

For s > 1, the series in (3.137) may not converge. For 0 ≤ s ≤ 1, the series does converge. To see this, note that if s = 1, we just get the sum of all probabilities, which is 1.0. If a nonnegative s is less than 1, then s^i will also be less than 1, so we still have convergence.

One use of the generating function is, as its name implies, to generate the probabilities of values for the random variable in question. In other words, if you have the generating function but not the probabilities, you can obtain the probabilities from the function. Here's why: For clarity, write (3.137) as

g_V(s) = P(V = 0) + s P(V = 1) + s² P(V = 2) + ...   (3.139)

From this we see that

g_V(0) = P(V = 0)   (3.140)

So, we can obtain P(V = 0) from the generating function. Now differentiating (3.139) with respect to s, we have

g_V'(s) = d/ds [P(V = 0) + s P(V = 1) + s² P(V = 2) + ...] = P(V = 1) + 2s P(V = 2) + ...   (3.141)

So, we can obtain P(V = 1) from g_V'(0), and in a similar manner can calculate the other probabilities from the higher derivatives.

3.8.0.6 Moment Generating Functions

The generating function is handy, but it is limited to discrete random variables. More generally, we can use the moment generating function, defined for any random variable X as

m_X(t) = E[e^{tX}]   (3.142)

for any t for which the expected value exists.

That last restriction is anathema to mathematicians, so they use the characteristic function,

φ_X(t) = E[e^{itX}]   (3.143)

which exists for any t. However, it makes use of pesky complex numbers, so we'll stay clear of it here.

Differentiating (3.142) with respect to t, we have

m_X'(t) = E[X e^{tX}]   (3.144)

We see then that

m_X'(0) = EX   (3.145)

So, if we just know the moment-generating function of X, we can obtain EX from it. Also,

m_X''(t) = E(X² e^{tX})   (3.146)

so

m_X''(0) = E(X²)   (3.147)

In this manner, we can for various k obtain E(X^k), the kth moment of X, hence the name.

3.8.1 Example: Network Packets

As an example, suppose the number of packets N received on a network link in a given time period has a Poisson distribution with mean μ, i.e.

P(N = k) = e^{-μ} μ^k / k!,  k = 0, 1, 2, 3, ...   (3.148)

3.8.1.1 Poisson Generating Function

Let's first find its generating function:

g_N(t) = Σ_{k=0}^∞ t^k e^{-μ} μ^k / k! = e^{-μ} Σ_{k=0}^∞ (μt)^k / k! = e^{-μ+μt}   (3.149)

where we made use of the Taylor series from calculus,

e^u = Σ_{k=0}^∞ u^k / k!   (3.150)
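Here is a two-line numerical check of (3.149), ours rather than the text's, estimating E(t^N) by simulation at arbitrarily chosen values of μ and t:

mu <- 3; t <- 0.7
n <- rpois(100000,mu)
mean(t^n)        # simulation estimate of the generating function at t
exp(-mu+mu*t)    # the closed form (3.149); should nearly agree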
3.8.1.3 Random Number of Bits in Packets on One Link (advanced topic)

Consider just one of the two links now, and for convenience denote the number of packets on the link by N, and its mean as μ. Continue to assume that N has a Poisson distribution.

Let B denote the number of bits in a packet, with B_1, ..., B_N denoting the bit counts in the N packets. We assume the B_i are independent and identically distributed. The total number of bits received during that time period is

T = B_1 + ... + B_N   (3.152)

Suppose the generating function of B is known to be h(s). Then what is the generating function of T?

g_T(s) = E(s^T)   (3.153)
       = E[E(s^T | N)]   (3.154)
       = E[E(s^{B_1+...+B_N} | N)]   (3.155)
       = E[E(s^{B_1}|N) ... E(s^{B_N}|N)]   (3.156)
       = E[h(s)^N]   (3.157)
       = g_N[h(s)]   (3.158)
       = e^{-μ+μh(s)}   (3.159)

Here is how these steps were made:

• From the first line to the second, we used the Theorem of Total Expectation.
• From the second to the third, we just used the definition of T.
• From the third to the fourth, we used algebra plus the fact that the expected value of a product of independent random variables is the product of their individual expected values.
• From the fourth to the fifth, we used the definition of h(s).
• From the fifth to the sixth, we used the definition of g_N.
• From the sixth to the last, we used the formula for the generating function of a Poisson distribution with mean μ.

We can then get all the information about T we need from this formula, such as its mean, variance, probabilities and so on, as seen previously.

3.8.2 Other Uses of Transforms

Transform techniques are used heavily in queuing analysis, including for models of computer networks. The techniques are also used extensively in modeling of hardware and software reliability.

3.9 Vector Space Interpretations (for the mathematically adventurous only)

3.9.1 Properties of Correlation

Let V be the set of all random variables with finite variance and mean 0. Treat this as a vector space, with the sum of two vectors X and Y taken to be the random variable X+Y, and, for a constant c, the vector cX being the random variable cX. Note that V is closed under these operations, as it must be.

Define an inner product on this space:

(X,Y) = E(XY) = Cov(X,Y)   (3.160)

(Recall that Cov(X,Y) = E(XY) - EX·EY, and that we are working with random variables that have mean 0.) Thus the norm of a vector X is

||X|| = (X,X)^{1/2} = √(E(X²)) = √(Var(X))   (3.161)

again since E(X) = 0.

The famous Cauchy-Schwarz Inequality for inner products says

|(X,Y)| ≤ ||X|| ||Y||   (3.162)

i.e.

|ρ(X,Y)| ≤ 1   (3.163)

Also, the Cauchy-Schwarz Inequality yields equality if and only if one vector is a scalar multiple of the other, i.e. Y = cX for some c. When we then translate this to random variables of nonzero means, we get Y = cX + d.

In other words, the correlation between two random variables is between -1 and 1, with equality if and only if one is an exact linear function of the other.
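A quick R illustration of (3.163) and the equality condition (my own toy example, not from the text): correlation is exactly 1 for an exact linear function, and strictly inside (-1,1) once noise is added.

x <- rnorm(10000)
cor(x, 3*x + 2)                  # exactly 1: Y is an exact linear function of X
cor(x, 3*x + 2 + rnorm(10000))   # strictly between -1 and 1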
3.9.2 Conditional Expectation As a Projection

Continue to consider the vector space in Section 3.9.1. For a random variable X, let W denote the subspace of V consisting of all functions h(X) with mean 0 and finite variance. (Again, note that this subspace is indeed closed under vector addition and scalar multiplication.)

Now consider any Y in V. Recall that the projection of Y onto W is the closest vector T in W to Y, i.e. T minimizes

||Y - T|| = (E[(Y-T)²])^{1/2}   (3.164)

To find the minimizing T, consider first the minimization of

E[(S - c)²]   (3.165)

with respect to constants c for some random variable S. Expanding the square, we have

E[(S - c)²] = E(S²) - 2c·ES + c²   (3.166)

Taking d/dc and setting the result to 0, we find that the minimizing c is c = ES.

Getting back to (3.164), use the Law of Total Expectation to write

E[(Y - T)²] = E(E[(Y - T)² | X])   (3.167)

From what we learned with (3.165), applied to the conditional (i.e. inner) expectation in (3.167), we see that the T which minimizes (3.167) is T = E(Y|X).

In other words, the conditional mean is a projection!

Nice, but is this useful in any way? The answer is yes, in the sense that it guides the intuition. All this is related to issues of statistical prediction (here we would be predicting Y from X), and the geometry can really guide our insight. This is not very evident without getting deeply into the prediction issue, but let's explore some of the implications of the geometry.

For example, a projection is perpendicular to the line connecting the projection to the original vector. So

0 = (E(Y|X), Y - E(Y|X)) = Cov[E(Y|X), Y - E(Y|X)]   (3.168)

This says that the prediction E(Y|X) is uncorrelated with the prediction error, Y - E(Y|X). This in turn has statistical importance. Of course, (3.168) could have been derived directly, but the geometry of the vector space interpretation is what suggested we look at the quantity in the first place. Again, the point is that the vector space view can guide our intuition.

Similarly, the Pythagorean Theorem holds, so

||Y||² = ||E(Y|X)||² + ||Y - E(Y|X)||²   (3.169)

which means that

Var(Y) = Var[E(Y|X)] + Var[Y - E(Y|X)]   (3.170)

Equation (3.170) is a common theme in linear models in statistics, the decomposition of variance.

3.10 Proof of the Law of Total Expectation

Let's prove (3.87) for the case in which W and Y take values only in the set {1,2,3,...}. Recall that if T is an integer-valued random variable and we have some function h(), then L = h(T) is another random variable, and its expected value can be calculated as⁵

E(L) = Σ_k h(k) P(T = k)   (3.171)

In our case here, Q is a function of W, so we find its expectation from the distribution of W:

E(Q) = Σ_{i=1}^∞ g(i) P(W = i)
     = Σ_{i=1}^∞ E(Y|W = i) P(W = i)
     = Σ_{i=1}^∞ [ Σ_{j=1}^∞ j P(Y = j|W = i) ] P(W = i)
     = Σ_{j=1}^∞ j Σ_{i=1}^∞ P(Y = j|W = i) P(W = i)
     = Σ_{j=1}^∞ j P(Y = j)
     = E(Y)

In other words,

E(Y) = E[E(Y|W)]   (3.172)

⁵This is sometimes called The Law of the Unconscious Statistician, by nasty probability theorists who look down on statisticians. Their point is that technically E(L) = Σ_k k P(L = k), and that (3.171) must be proven, whereas the statisticians supposedly think it's a definition.
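As a sanity check on (3.170) and (3.172), here is a minimal R simulation (my own toy setup, not from the text): X is uniform on {1,2,3} and Y given X = i is Poisson with mean i, so that E(Y|X) = X.

nreps <- 100000
x <- sample(1:3, nreps, replace = TRUE)   # X uniform on {1,2,3}
y <- rpois(nreps, x)                      # Y | X = i is Poisson(i), so E(Y|X) = X
mean(y); mean(x)               # E(Y) vs. E[E(Y|X)] = EX, per (3.172)
var(y); var(x) + var(y - x)    # Var(Y) vs. Var[E(Y|X)] + Var[Y - E(Y|X)], per (3.170)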
Exercises

1. Suppose the random pair (X,Y) has the density f_{X,Y}(s,t) = 8st on the triangle {(s,t): 0 < t < s < 1}. Find ρ(X,Y) and f_X(s).

2. In the catchup game in Section 3.2.4, let V and W denote the winnings of the two players after only one turn. Find P(V > 0.4).

3. Suppose Type 1 batteries have exponentially distributed lifetimes with mean 2.0 hours, while Type 2 battery lifetimes are exponentially distributed with mean 1.5.

(a) Suppose we have a portable machine that has compartments for two batteries, a main, of Type 1, and a backup, of Type 2. One replaces the first by the second as soon as the first fails. Find the density of W, the time that the system is operational, i.e. the sum of the lifetimes of the two batteries.

(b) Suppose we have a large box containing a mixture of the two types of batteries, in proportions q and 1-q. We reach into the box, choose a battery at random, then use it. Let Y be the lifetime of the battery we choose. Use the Law of Total Variance, (3.90), to find Var(Y).

4. Newspapers at a certain vending machine cost 25 cents. Suppose 60% of the customers pay with quarters, 20% use two dimes and a nickel, 15% insert a dime and three nickels, and 5% deposit five nickels. When the vendor collects the money, five coins fall to the ground. Let X, Y and Z denote the numbers of quarters, dimes and nickels among these five coins.

(a) Is the joint distribution of (X,Y,Z) a member of a parametric family presented in this chapter? If so, which one?

(b) Find P(X = 2, Y = 2, Z = 1).

(c) Find ρ(X,Y).

5. Bus lines A and B intersect at a certain transfer point, with the schedule stating that buses from both lines will arrive there at 3:00 p.m. However, they are often late, by amounts X and Y for the two buses. The bivariate density is

f_{X,Y}(s,t) = 2 - s - t, 0 < s,t < 1   (3.173)

(a) Find P(X + Y > 0.6).

(b) Find EW.

6. Show that

ρ(aX + b, cY + d) = ρ(X,Y)   (3.174)

for any constants a, b, c and d with ac > 0.

7. Suppose we wish to predict a random variable Y by using another random variable, X. We may consider predictors of the form cX + d for constants c and d. Show that the values of c and d that minimize the mean squared prediction error, E[(Y - cX - d)²], are

c = \frac{E(XY) - EX·EY}{Var(X)}   (3.175)

d = \frac{E(X²)·EY - EX·E(XY)}{Var(X)}   (3.176)

8. Programs A and B consist of r and s modules, respectively, of which c modules are common to both. As a simple model, assume that each module has probability p of being correct, with the modules acting independently. Let X and Y denote the numbers of correct modules in A and B, respectively. Find the correlation ρ(X,Y) as a function of r, s, c and p.

Hint: Write X = X_1 + ... + X_{r-c}, where X_i is 1 or 0, depending on whether module i of A is correct, for the nonoverlapping modules of A. Do the same for B, and for the set of common modules.

9. Use transform methods to derive some properties of the Poisson family:

(a) Show that for any Poisson random variable, its mean and variance are equal.

(b) Suppose X and Y are independent random variables, each having a Poisson distribution. Show that Z = X + Y again has a Poisson distribution.

10. Suppose one keeps rolling a die. Let S_n denote the total number of dots after n rolls, mod 8, and let T be the number of rolls needed for the event S_n = 0 to occur. Find E(T), using an approach like that in the "trapped miner" example in Section 3.5.5.

11. In our ordinary coins which we use every day, each one has a slightly different probability of heads, which we'll call H. Say H has the distribution N(0.5, 0.03²). We choose a coin from a batch at random, then toss it 10 times. Let N be the number of heads we get. Find Var(N).

12. Jack and Jill play a dice game, in which one wins $1 per dot. There are three dice, die A, die B and die C. Jill always rolls dice A and B. Jack always rolls just die C, but he also gets credit for 90% of die B. For instance, say in a particular roll A, B and C are 3, 1 and 6, respectively.
Then Jill would win $4 and Jack would get $6.90. Let X and Y be Jill's and Jack's total winnings after 100 rolls. Use the Central Limit Theorem to find the approximate values of P(X > 650, Y < 660) and P(Y > 1.06X).

Hints: This will follow a similar pattern to the dice game in Section 3.6.2.3, in which we win $5 for one dot, and $2 for two or three dots. Remember, in that example, the key was that we noticed that the pair (X,Y) was a sum of random pairs. That meant that (X,Y) had an approximate bivariate normal distribution, so we could find probabilities if we had the mean vector and covariance matrix of (X,Y). Thus we needed to find EX, EY, Var(X), Var(Y) and Cov(X,Y). We used the various properties of E(), Var() and Cov() to get those quantities.

You will do the same thing here. Write X = U_1 + ... + U_100, where U_i is Jill's winnings on the i-th roll. Write Y as a similar sum of the V_i. You probably will find it helpful to define A_i, B_i and C_i as the numbers of dots appearing on dice A, B and C on the i-th roll. Then find EX, etc. Again, make sure to utilize the various properties of E(), Var() and Cov().

13. Show that if random variables U and V are independent,

Var(UV) = E(U²)·Var(V) + Var(U)·(EV)²   (3.177)

Chapter 4

Introduction to Statistical Inference

4.1 What Statistics Is All About

If you follow the events involving space travel,¹ you may hear statements like, "There is a 40% chance that weather conditions on Friday will be good enough to launch the space shuttle." Your response might be curiosity as to the following questions:

• What does that 40% figure really mean?
• How accurate is that figure?
• What data was used to obtain that figure, and what mathematical model was used?

Well, these are typical statistical issues. If you thought that statistics is nothing more than adding up columns of numbers and plugging into formulas, you are badly mistaken.

Actually, statistics is an application of probability theory. We employ probabilistic models for the behavior of our sample data, and infer from the data accordingly; hence the name, statistical inference.

¹Personally, I don't.

4.2 Introduction to Confidence Intervals

4.2.1 How Long Should We Run a Simulation?

In our simulations in previous units, it was never quite clear how long the simulation should be run, i.e. what value to set for nreps. Now we will finally address this issue.

As our example, recall the Bus Paradox from Section 2.5: Buses arrive at a certain bus stop at random times, with interarrival times being independent exponentially distributed random variables with mean 10 minutes. You arrive at the bus stop every day at a certain time, say four hours (240 minutes) after the buses start their morning run. What is your mean wait for the next bus?

We found mathematically that, due to the memoryless property of the exponential distribution, our wait is again exponentially distributed with mean 10. But suppose we didn't know that, and we wished to find the answer via simulation. We could write a program:

doexpt <- function(opt) {
   lastarrival <- 0.0
   while (lastarrival < opt)
      lastarrival <- lastarrival + rexp(1,0.1)
   return(lastarrival-opt)
}

observationpt <- 240
nreps <- 10000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
cat("approx. mean wait =",mean(waits),"\n")

Running the program yields

approx. mean wait = 9.653743

Was 10,000 iterations enough?
How close is this value 9.653743 to the true expected value of waiting time?²

What we would like to do is something like what the pollsters do during presidential elections, when they say "Ms. X is supported by 62% of the voters, with a margin of error of 4%." In fact, we will do exactly this, in the next section.

²Of course, continue to ignore the fact that we know that this value is 10.0. What we're trying to do here is figure out how to answer "how close is it" questions in general, when we don't know the true mean.

4.2.2 Confidence Intervals for Means

4.2.2.1 Sampling Distributions

In our example in Section 4.2.1, let W denote the random wait time one experiences in general in this situation. We are using the program to estimate E(W), which we will denote by μ. While we're at it, let's denote Var(W) by σ².³

Before we go on, it's important that you first recall what EW and Var(W) really mean, say via our "notebook" view. We come to the bus stop every day at the same time, and each day we would record our waiting time on a separate line of the notebook. EW and Var(W) would mean the mean and the variance of all the values we record (after an infinite number of days). Similarly, F_W(12.2), for instance, would mean the long-run proportion of notebook lines in which W ≤ 12.2.

So, what if we estimate EW and Var(W) by sampling for only n days, either by actually waiting at the bus stop, or by simulating n days as we did in the above program (where our n was nreps)? How accurate are our estimates?

³Remember, this does NOT mean that W has a normal distribution. The use of μ and σ² to denote mean and variance is standard, regardless of distribution.

Let W_i denote the i-th waiting time, i = 1,2,...,n, and let \bar{W} denote the sample mean,

\bar{W} = \frac{1}{n} Σ_{i=1}^n W_i

\bar{W} is what the program prints out. The key points are that

• The random variables W_i each have the distribution F_W, and thus each have mean μ and variance σ².

• The random variables W_i are independent.

• The mean of \bar{W} is also μ:

E(\bar{W}) = E(\frac{1}{n} Σ_{i=1}^n W_i)   (4.2)
= \frac{1}{n} Σ_{i=1}^n E(W_i)  (for const. c, E(cU) = c·EU; E[U+V] = EU + EV)   (4.3)
= \frac{1}{n} · nμ  (E(W_i) = μ)   (4.4)
= μ   (4.5)

• The variance of \bar{W} is 1/n of the population variance:

Var(\bar{W}) = \frac{1}{n²} Var(Σ_{i=1}^n W_i)  (for const. c, Var[cU] = c²Var[U])   (4.6)
= \frac{1}{n²} Σ_{i=1}^n Var(W_i)  (for U,V indep., Var[U+V] = Var[U] + Var[V])   (4.7)
= \frac{1}{n²} · nσ²   (4.8)
= \frac{σ²}{n}   (4.9)

Let's think of the notebook example in a different context. Here our experiment is to sample 20 wait times, again either by personally going to the bus stop 20 times or running the above program with nreps equal to 20. Each line of the notebook would consist of data from 20 visits to the bus stop, with wait times W_1,...,W_20 and the sample mean \bar{W}. (So our notebook would have a column for W_1, one for W_2, ..., one for W_20, and especially one for \bar{W}.) Here is what we would find:

• When we say that each W_i has the same distribution as the population, we mean the following, say for i = 1: If we were to gather together all the values of W_1, one from each of the infinitely many lines of the notebook, then their average would be 10.0. Also, the long-run proportion of lines for which W_1 ≤ 4, say, would be equal to P(W ≤ 4). It can be shown that W_1 actually has an exponential distribution. (This follows from the memoryless property.) So,

P(W ≤ 4) = ∫_0^4 0.1 e^{-0.1t} dt = 0.33   (4.10)

Therefore the long-run proportion of lines for which W_1 ≤ 4 would be 0.33. And if we were to calculate the standard deviation of all those values of W_1, we'd get σ (which we also know to be 10, since the mean and standard deviation are equal in the case of exponential distributions).

• Equation (4.5) says that if we were to average all the values of \bar{W} over all the lines of the notebook, we'd get 10.0 there too. If we were to calculate the standard deviation of those values of \bar{W}, we'd get σ/√n (which we know to be 10/√20 ≈ 2.24).

These points are absolutely key, forming the very basis of statistics. You should spend extra time pondering them.
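Those last two bullet points are easy to check empirically. Here is a minimal R sketch (my own, continuing the bus example and using the fact, noted above, that the wait itself is exponential with mean 10): simulate many "notebook lines," each consisting of n = 20 wait times, and look at the mean and standard deviation of the resulting \bar{W} values.

nlines <- 10000    # notebook lines
n <- 20            # wait times per line
wbars <- replicate(nlines, mean(rexp(n, 0.1)))   # one Wbar per line
mean(wbars)        # should be near mu = 10, per (4.5)
sd(wbars)          # should be near sigma/sqrt(n) = 10/sqrt(20) = 2.24, per (4.9)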
4.2.2.2 Our First Confidence Interval

The Central Limit Theorem then tells us that

Z = \frac{\bar{W} - μ}{σ/√n}   (4.11)

has an approximately N(0,1) distribution. We will be interested in the central 95% of that distribution, which due to symmetry leaves 2.5% of the area in the left tail and 2.5% in the right one. Through the R call qnorm(0.025), or by consulting an N(0,1) cdf table in a book, we find that the cutoff points are at -1.96 and 1.96. Thus

0.95 ≈ P(-1.96 ≤ \frac{\bar{W} - μ}{σ/√n} ≤ 1.96)   (4.12)

Doing a bit of algebra on the inequalities yields

0.95 ≈ P(\bar{W} - 1.96\frac{σ}{√n} ≤ μ ≤ \bar{W} + 1.96\frac{σ}{√n})   (4.13)

Now remember, not only do we not know μ, we also don't know σ. But we can estimate it, as follows. Recall that by definition

σ² = E[(W - μ)²]   (4.14)

Let's estimate σ² by taking sample analogs. The sample analog of μ is \bar{W}. What about the sample analog of the "E()"? Well, since E() means averaging over the whole population of Ws, the sample analog is to average over the sample. So, we get

s² = \frac{1}{n} Σ_{i=1}^n (W_i - \bar{W})²   (4.15)

In other words, just as it is natural to estimate the population mean of W by its sample mean, the same holds for Var(W): The population variance of W is the mean squared distance from W to its population mean. Therefore it is natural to estimate Var(W) by the average squared distance of the W_i from their sample mean. We use s² as our symbol for this estimate of population variance. We thus take our estimate of σ to be s, the square root of that quantity.

By the way, (4.15) is equal to

s² = \frac{1}{n} Σ_{i=1}^n W_i² - \bar{W}²   (4.16)

(Caution: This way of computing s² is subject to more roundoff error.)

One can show (the details will be given at the end of this section) that (4.13) is still valid if we substitute s for σ, i.e.

0.95 ≈ P(\bar{W} - 1.96\frac{s}{√n} ≤ μ ≤ \bar{W} + 1.96\frac{s}{√n})   (4.17)

4.2.9 Confidence Intervals for Proportions

Suppose we are interested in estimating not E(W) but rather a probability, say P(W > 6.4). We can estimate it from our simulation output by the proportion of sample values exceeding 6.4, adding these lines to our program:

prop <- length(waits[waits > 6.4]) / nreps
cat("approx. P(W > 6.4) =",prop,"\n")

The value printed out for the probability is 0.516. We again ask the question, how can we gauge the accuracy of this number as an estimator of the true probability P(W > 6.4)?

4.2.9.1 Derivation

It turns out that we already have our answer, because a probability is just a special case of a mean. To see this, let

Y = 1 if W > 6.4, and 0 otherwise   (4.29)

Then

E(Y) = 1·P(Y = 1) + 0·P(Y = 0) = P(W > 6.4)   (4.30)

Let p denote this probability, and let \hat{p} denote our estimate of it; \hat{p} is our prop in the program. In (4.16), take W_i to be our Y_i here, and note that Y_i² = Y_i. That means that

s² = \hat{p} - \hat{p}² = \hat{p}(1 - \hat{p})   (4.31)

Equation (4.18) becomes

(\hat{p} - 1.96√(\hat{p}(1-\hat{p})/n), \hat{p} + 1.96√(\hat{p}(1-\hat{p})/n))   (4.32)
4.2.9.2 Examples

We incorporate that into our program:

doexpt <- function(opt) {
   lastarrival <- 0.0
   while (lastarrival < opt)
      lastarrival <- lastarrival + rexp(1,0.1)
   return(lastarrival-opt)
}

observationpt <- 240
nreps <- 1000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
wbar <- mean(waits)
cat("approx. mean wait =",wbar,"\n")
s2 <- mean(waits^2) - wbar^2
s <- sqrt(s2)
radius <- 1.96*s/sqrt(nreps)
cat("approx. CI for EW =",wbar-radius,"to",wbar+radius,"\n")
prop <- length(waits[waits > 6.4]) / nreps
s2 <- prop*(1-prop)
s <- sqrt(s2)
radius <- 1.96*s/sqrt(nreps)
cat("approx. P(W > 6.4) =",prop,", with a margin of error of",radius,"\n")

In this case, we get a margin of error of 0.03, thus an interval of (0.51,0.57). We would say, "We don't know the exact value of P(W > 6.4), so we ran a simulation. The latter estimates this probability to be 0.54, with a 95% margin of error of 0.03."

Note again that this uses the same principles as our Davis weights example. Suppose we were interested in estimating the proportion of adults in Davis who weigh more than 150 pounds. Suppose that proportion is 0.45 in our sample of 1000 people. This would be our estimate \hat{p} for the population proportion p, and an approximate 95% confidence interval (4.32) for the population proportion would be (0.42,0.48). We would then say, "We are 95% confident that the true population proportion p of people who weigh over 150 pounds is between 0.42 and 0.48."

Note also that although we've used the word proportion in the Davis weights example instead of probability, they are the same. If I choose an adult at random from the population, the probability that his/her weight is more than 150 is equal to the proportion of adults in the population who have weights of more than 150.

And the same principles are used in opinion polls during presidential elections. Here p is the population proportion of people who plan to vote for the given candidate. This is an unknown quantity, which is exactly the point of polling a sample of people: to estimate that unknown quantity p. Our estimate is \hat{p}, the proportion of people in our sample who plan to vote for the given candidate, and n is the number of people that we poll. We again use (4.32).

4.2.9.3 Interpretation

The same interpretation holds as before. Consider the examples in the last section:

• If each of you and 99 friends were to run the R program at the beginning of Section 4.2.9.2, you 100 people would get 100 confidence intervals for P(W > 6.4). About 95 of you would have intervals that do contain that number.

• If each of you and 99 friends were to sample 1000 people in Davis and come up with confidence intervals for the true population proportion of people who weigh more than 150 pounds, about 95 of you would have intervals that do contain that true population proportion.

• If each of you and 99 friends were to sample 1200 people in an election campaign, to estimate the true population proportion of people who will vote for candidate X, about 95 of you will have intervals that do contain this population proportion.

4.2.9.4 (Non-)Effect of the Population Size

Note that in both the Davis and election examples, it doesn't matter what the size of the population is. The approximate distribution of \hat{p} is N(p, p(1-p)/n), and thus the accuracy of \hat{p} depends only on p and n. So when people ask, "How can a presidential election poll get by with sampling only 1200 people, when there are more than 100,000,000 voters in the U.S.?" now you know the answer. (We'll discuss the question "Why 1200?" below.)

Another way to see this is to think of a situation in which we wish to estimate the probability p of heads for a certain coin. We toss the coin n times, and use \hat{p} as our estimate of p. Here our "population" (the population of all coin tosses) is infinite, yet it is still the case that 1200 tosses would be enough to get a good estimate of p.
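To see this dependence on n (and only n) numerically, here is a quick R sketch of the margin of error 1.96√(\hat{p}(1-\hat{p})/n) from (4.32), for an illustrative \hat{p} = 0.5 (the values of n are my own picks):

phat <- 0.5
n <- c(100, 1200, 10000)
1.96 * sqrt(phat * (1 - phat) / n)   # margins: about 0.098, 0.028, 0.0098
# note that the population size appears nowhere in this computation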
4.2.9.5 Planning Ahead

Now, why do the pollsters sample 1200 people?

First, note that the maximum possible value of \hat{p}(1-\hat{p}) is 0.25.¹⁰ Then the pollsters know that their margin of error with n = 1200 will be at most 1.96 × 0.5/√1200, or about 3%, even before they poll anyone. They consider 3% to be sufficiently accurate for their purposes, so 1200 is the n they choose.

¹⁰Use calculus to find the maximum value of f(x) = x(1-x).

4.2.10 One-Sided Confidence Intervals

Confidence intervals as discussed so far give one both an upper and lower bound for the parameter of interest. (From here on, the word parameter is used in a broader context than just parametric families of distributions. The term will refer to any population quantity.) In some applications, we are interested in having only an upper bound, or only a lower bound. One can go through the same kind of reasoning as in Section 4.2 above to obtain approximate 95% one-sided confidence intervals:

(\bar{W} - 1.65\frac{s}{√n}, ∞)   (4.33)

(-∞, \bar{W} + 1.65\frac{s}{√n})   (4.34)

Note the constant 1.65, which is the 0.95 quantile of the N(0,1) distribution, compared to 1.96, the 0.975 quantile.

4.2.11 Confidence Intervals for Differences of Means or Proportions

4.2.11.1 Independent Samples

Suppose in our sampling of people in Davis we are mainly interested in the difference in weights between men and women. Let \bar{X} and n_1 denote the sample mean and sample size for men, and let \bar{Y} and n_2 be the corresponding quantities for the women. Denote the population means and variances by μ_i and σ_i², i = 1,2. We wish to find a confidence interval for μ_1 - μ_2. The natural estimator for that quantity is \bar{X} - \bar{Y}.

In order to form a confidence interval for μ_1 - μ_2 using \bar{X} - \bar{Y}, we need to know the distribution of that latter quantity. To see this, recall that this is how we eventually got (4.18); we started by noting the distribution of \bar{W}, or more precisely the distribution of (\bar{W} - μ)/(σ/√n) in (4.11), and then used that to derive our confidence interval.

So, here we need to know the distribution of \bar{X} - \bar{Y}. Note first that \bar{X} and \bar{Y} are independent. They come from separate people. Also, as noted before, they are approximately normally distributed. So, they jointly have an approximately bivariate normal distribution. Then from our earlier unit on multivariate distributions, page 85, we know that the linear combination

\bar{X} - \bar{Y} = 1·\bar{X} + (-1)·\bar{Y}   (4.35)

will also have an approximately normal distribution, with mean μ_1 + (-1)μ_2 and variance σ_1²/n_1 + (-1)²σ_2²/n_2.

If we then let s_i², i = 1,2 denote the two sample variances, we have that

Z = \frac{(\bar{X} - \bar{Y}) - (μ_1 - μ_2)}{√(s_1²/n_1 + s_2²/n_2)}   (4.36)

has an approximate N(0,1) distribution, and working as before, we have that an approximate 95% confidence interval for μ_1 - μ_2 is

(\bar{X} - \bar{Y} - 1.96√(\frac{s_1²}{n_1} + \frac{s_2²}{n_2}), \bar{X} - \bar{Y} + 1.96√(\frac{s_1²}{n_1} + \frac{s_2²}{n_2}))   (4.37)

A similar derivation gives us an approximate 95% confidence interval for the difference in two population proportions p_1 - p_2:

(\hat{p}_1 - \hat{p}_2 - 1.96√(\frac{s_1²}{n_1} + \frac{s_2²}{n_2}), \hat{p}_1 - \hat{p}_2 + 1.96√(\frac{s_1²}{n_1} + \frac{s_2²}{n_2}))   (4.38)

where

s_i² = \hat{p}_i(1 - \hat{p}_i)   (4.39)

Example: In a network security application, C. Mano et al¹¹ compare round-trip travel time for packets involved in the same application in certain wired and wireless networks. The data was as follows:

              wired    wireless
sample mean   2.000    11.520
sample s.d.   6.299    9.939
sample size   436      344

We had observed quite a difference, 11.52 versus 2.00, but could it be due to sampling variation? Maybe we have unusual samples? This calls for a confidence interval!

Then a 95% confidence interval for the difference between wireless and wired networks is

11.520 - 2.000 ± 1.96√(\frac{9.939²}{344} + \frac{6.299²}{436}) = 9.52 ± 1.22   (4.40)

So you can see that there is a big difference between the two networks, even after allowing for sampling variation.

¹¹RIPPS: Rogue Identifying Packet Payload Slicer Detecting Unauthorized Wireless Hosts Through Network Traffic Conditioning, C. Mano and a ton of other authors, ACM Transactions on Information Systems and Security, to appear.
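Here is a minimal R sketch of the computation in (4.37)/(4.40), using the summary statistics from the table above (the variable names are my own):

# summary statistics from the wired/wireless example
xbar <- 11.520; s1 <- 9.939; n1 <- 344   # wireless
ybar <-  2.000; s2 <- 6.299; n2 <- 436   # wired
diff   <- xbar - ybar
radius <- 1.96 * sqrt(s1^2/n1 + s2^2/n2)
c(diff - radius, diff + radius)          # approximately (8.3, 10.7)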
4.2.11.2 Random Sample Size

In our Davis weights example in Section 4.2.3.1, we were implicitly assuming that the sample sizes of the two groups, n_1 and n_2, were nonrandom. For instance, we might sample 500 men and 500 women.

On the other hand, we might simply sample 1000 people without regard to gender. Then the numbers of men and women in the sample would be random. Think once again of our notebook view. In our first sample of 1000 people, we might have 492 men and 508 women. In our second sample, the gender breakdown might be 505 and 495, and so on. In keeping with the convention to denote random quantities by capital letters, we might write the numbers of men and women in our sample as N_1 and N_2.

However, in most cases it should not matter. As long as there is not some odd property of our sampling method, e.g. one in which there would be a tendency for larger samples to have shorter men, we can simply do our inference conditionally on N_1 and N_2, thus treating them as constants.

4.2.11.3 Dependent Samples

Note carefully, though, that a key point above was the independence of the two samples. By contrast, suppose we wish, for instance, to find a confidence interval for ν_1 - ν_2, the difference in mean weights in Davis of 15-year-old and 10-year-old children, and suppose our data consist of pairs of weight measurements at the two ages on the same children. In other words, we have a sample of n children, and for the i-th child we have his/her weight U_i at age 15 and V_i at age 10. Let \bar{U} and \bar{V} denote the sample means.

The problem is that the two sample means are not independent. If a child is heavier than his/her peers at age 15, he/she was probably heavier than them when they were all age 10. In other words, for each i, V_i and U_i are positively correlated, and thus the same is true for \bar{V} and \bar{U}. Thus we cannot use (4.37).

However, the random variables T_i = U_i - V_i, i = 1,2,...,n are still independent. Thus we can use (4.18), so that our approximate 95% confidence interval is

(\bar{T} - 1.96\frac{s}{√n}, \bar{T} + 1.96\frac{s}{√n})   (4.41)

where s² is the sample variance of the T_i.
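A minimal R sketch of this paired approach (the data are fabricated toy values for illustration only; a real analysis would of course need a larger n for the normal approximation to be trustworthy): form the differences, then apply the one-sample interval (4.41).

# toy paired data: weights of the same n children at ages 15 and 10
u <- c(110, 125, 98, 140, 132, 118, 105, 122)   # age 15 (hypothetical)
v <- c( 72,  85, 66,  95,  90,  80,  70,  84)   # age 10 (hypothetical)
d <- u - v                         # per-child differences T_i
n <- length(d)
radius <- 1.96 * sd(d)/sqrt(n)     # sd() divides by n-1; immaterial for large n
c(mean(d) - radius, mean(d) + radius)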
A common situation in which we have dependent samples is that in which we are comparing two dependent proportions. Suppose for example that there are three candidates running for a political office, A, B and C. We poll 1,000 voters and ask whom they plan to vote for. Let p_A, p_B and p_C be the three population proportions of people planning to vote for the various candidates, and let \hat{p}_A, \hat{p}_B and \hat{p}_C be the corresponding sample proportions.

Suppose we wish to form a confidence interval for p_A - p_B. Clearly, the two sample proportions are not independent random variables, since for instance if \hat{p}_A = 1 then we know for sure that \hat{p}_B is 0. To deal with this, we could set up variables U_i and V_i as above, with for example U_i being 1 or 0, according to whether the i-th person in our sample plans to vote for A or not. But we can do better.

Let N_A, N_B and N_C denote the actual numbers of people in our sample who state they will vote for the various candidates, so that for instance \hat{p}_A = N_A/1000. Well, the point is that the vector (N_A, N_B, N_C)' has a multinomial distribution. Thus we know that

\hat{p}_A - \hat{p}_B = 0.001 N_A - 0.001 N_B   (4.42)

has variance

0.001 p_A(1 - p_A) + 0.001 p_B(1 - p_B) + 0.002 p_A p_B   (4.43)

(Recall from (3.117) that Cov(N_A, N_B) = -1000 p_A p_B, so the covariance term here adds rather than subtracts.) So, the standard error of \hat{p}_A - \hat{p}_B is

√(0.001 \hat{p}_A(1 - \hat{p}_A) + 0.001 \hat{p}_B(1 - \hat{p}_B) + 0.002 \hat{p}_A \hat{p}_B)   (4.44)

4.2.12 Example: Machine Classification of Forest Covers

Remote sensing is machine classification of type from variables observed aerially, typically by satellite. In the application we'll consider here, researchers want to predict forest cover type for a given location (there are seven different types), from known geographic data, as direct observation is too expensive and may suffer from land access permission issues. (See Blackard, Jock A. and Denis J. Dean, 2000, "Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables," Computers and Electronics in Agriculture, 24(3):131-151.) There were over 50,000 observations, but for simplicity we'll just use the first 1,000 here.

One of the variables was the amount of hillside shade at noon, which we'll call HS12. Let's find an approximate 95% confidence interval for the difference in population mean HS12 values in cover type 1 and type 2 locations. The two sample means were 223.8 and 226.3, with s values of 15.3 and 14.3, and the sample sizes were 226 and 585. So our confidence interval is

223.8 - 226.3 ± 1.96√(\frac{15.3²}{226} + \frac{14.3²}{585}) = -2.5 ± 2.3 = (-4.8, -0.3)   (4.45)

Now let's find a confidence interval for the difference in population proportions of sites that have cover types 1 and 2. Our sample estimate is

\hat{p}_1 - \hat{p}_2 = 0.226 - 0.585 = -0.359   (4.46)

The standard error of this quantity, from (4.44), is

√(0.001·0.226·0.774 + 0.001·0.585·0.415 + 0.002·0.226·0.585) = 0.026   (4.47)

That gives us a confidence interval of

-0.359 ± 1.96·0.026 = (-0.410, -0.308)   (4.48)

4.2.13 Exact Confidence Intervals

Recall how we derived our previous confidence intervals. We began with a probability statement involving our estimator, and then did some algebra to turn it around into a formula for a confidence interval. Those operations had nothing to do with the approximate nature of the distributions involved. We can do the same thing if we have exact distributions.

For example, suppose we have a random sample X_1,...,X_10 from an exponential distribution with parameter λ. Let's find an exact 95% confidence interval for λ. Let

T = X_1 + ... + X_10   (4.49)

Recall that T has a gamma distribution with parameters 10 (the "shape," in R's terminology) and λ. Let q(λ) denote the 0.95 quantile of this distribution, i.e. the point to the right of which there is only 5% of the area under the density. Note carefully that this is indeed a function of λ; it has different values for different λ. Then:

0.95 = P[T ≤ q(λ)] = P[q^{-1}(T) ≥ λ]   (4.50)

(Here we have used the fact that q() is a decreasing function.) So, an EXACT 95% one-sided confidence interval for λ is

(0, q^{-1}(T))   (4.51)

Now, what IS q^{-1}? Recall what q() is, the 0.95 quantile of the gamma distribution with shape 10. It always helps intuition to look at some specific numbers:

> qgamma(0.95,10,2.5)
[1] 6.282087
> qgamma(0.95,10,4)
[1] 3.926304

So, q(2.5) = 6.28 and q(4) = 3.92. That means q^{-1}(6.28) = 2.5 and q^{-1}(3.92) = 4. You can now see how we can form the interval. Say T = 16.4. Then we do some trial-and-error until we find a number w such that qgamma(0.95,10,w) = 16.4. Our confidence interval is then (0,w).
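The trial-and-error can be automated. Here is a minimal R sketch (my own), using uniroot() to solve qgamma(0.95,10,w) = 16.4 for w:

tobs <- 16.4   # observed value of T
# q(w) = qgamma(0.95, 10, w) is decreasing in w; find the root of q(w) - tobs
w <- uniroot(function(w) qgamma(0.95, 10, w) - tobs, c(0.01, 100))$root
w              # about 0.96; the exact 95% interval for lambda is (0, w)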
4.2.14 Slutsky's Theorem (advanced topic)

(The reader should review Section 2.3.2.6 before continuing.)

Since one generally does not know the value of σ in (4.13), we replace it by s, yielding (4.17). Why was that legitimate?

The answer depends on the theorem below. First, we need a definition.

Definition 3 We say that a sequence of random variables L_n converges in probability to the random variable L if for every ε > 0,

lim_{n→∞} P(|L_n - L| > ε) = 0   (4.52)

This is a little weaker than convergence with probability 1, as in the Strong Law of Large Numbers (SLLN, Section 1.4.10). Convergence with probability 1 implies convergence in probability, but not vice versa.

So for example, if Q_1, Q_2, Q_3, ... are i.i.d. with mean w, then the SLLN implies that

L_n = \frac{Q_1 + ... + Q_n}{n}   (4.53)

converges with probability 1 to w, and thus L_n converges in probability to w too.

4.2.14.1 The Theorem

Theorem 4 Slutsky's Theorem (abridged version): Consider random variables X_n, Y_n, and X, such that X_n converges in distribution to X and Y_n converges in probability to a constant c. Then:

(a) X_n + Y_n converges in distribution to X + c.

(b) X_n/Y_n converges in distribution to X/c.

4.2.14.2 Why It's Valid to Substitute s for σ

We now return to the question raised above. In our context here, we take

X_n = \frac{\bar{W} - μ}{σ/√n}   (4.54)

Y_n = \frac{s}{σ}   (4.55)

We know that (4.54) converges in distribution to N(0,1), while (4.55) converges in probability to 1. Thus for large n, we have that

\frac{\bar{W} - μ}{s/√n}   (4.56)

has an approximate N(0,1) distribution, so that (4.17) is valid.

4.2.14.3 Example: Confidence Interval for a Ratio Estimator

Again consider the example in Section 4.2.3.1 of weights of men and women in Davis, but this time suppose we wish to form a confidence interval for the ratio of the means,

γ = \frac{μ_1}{μ_2}   (4.57)

Again, the natural estimator is

\hat{γ} = \frac{\bar{X}}{\bar{Y}}   (4.58)

How can we construct a confidence interval from this estimator? If it were a linear combination of \bar{X} and \bar{Y}, we'd have no problem, since a linear combination of multivariate normal random variables is again normal.

That is not exactly the case here, but it's close. Since \bar{Y} converges in probability to μ_2, Slutsky's Theorem (Section 4.2.14) tells us that the problem here really is one of such a linear combination. We can form a confidence interval for μ_1, then divide both endpoints of the interval by \bar{Y}, yielding a confidence interval for γ.

4.2.15 The Delta Method: Confidence Intervals for General Functions of Means or Proportions (advanced topic)

The delta method is a great way to derive asymptotic distributions of quantities that are functions of random variables whose asymptotic distributions are already known.

4.2.15.1 The Theorem

Theorem 5 Suppose R_1,...,R_k are estimators of η_1,...,η_k based on a random sample of size n, such that the random vector

√n (R_1 - η_1, R_2 - η_2, ..., R_k - η_k)'   (4.59)

has an asymptotically multivariate normal distribution with mean 0 and nonsingular covariance matrix Σ = (σ_{ij}). Let h be a smooth scalar function of k variables, with h_i denoting its i-th partial derivative. Consider the random variable

Y = h(R_1, ..., R_k)   (4.60)
Then √n [Y - h(η_1,...,η_k)] converges in distribution to a normal distribution with mean 0 and variance

(ν_1,...,ν_k) Σ (ν_1,...,ν_k)'   (4.61)

provided not all of

ν_i = h_i(η_1,...,η_k), i = 1,...,k   (4.62)

are 0.

Informally, the theorem says that Y will be approximately normal with mean h(η_1,...,η_k) and variance 1/n times (4.61). This can be used to form confidence intervals for h(η_1,...,η_k). Of course, the quantities in (4.61) are typically estimated from the sample. In other words, our approximate 95% confidence interval for h(η_1,...,η_k) is

h(R_1,...,R_k) ± 1.96 \frac{1}{√n} √((\hat{ν}_1,...,\hat{ν}_k) \hat{Σ} (\hat{ν}_1,...,\hat{ν}_k)')   (4.63)

Proof

We'll cover the case k = 1 (dropping the subscript 1 for convenience). Recall the Mean Value Theorem from calculus:¹²

h(R) = h(η) + h'(W)(R - η)   (4.64)

for some W between η and R. Rewriting this, we have

√n [h(R) - h(η)] = h'(W) √n (R - η)   (4.65)

It can be shown (and should be intuitively plausible to you) that if a sequence of random variables converges in distribution to a constant, the convergence is in probability too. So, R - η converges in probability to 0, forcing W to converge in probability to η, and thus h'(W) to converge in probability to h'(η). Then from Slutsky's Theorem, the asymptotic distribution of (4.65) is the same as that of √n h'(η)(R - η). The result follows.

¹²This is where the "delta" in the name of the method comes from, an allusion to the fact that derivatives are limits of difference quotients.

4.2.15.2 Example: Square Root Transformation

It used to be common, and to some degree is still common today, for statistical analysts to apply a square-root transformation to Poisson data. The delta method sheds light on the motivation for this.

Consider a random variable X that is Poisson-distributed with mean λ. Recall from Section 3.8.1.2 that sums of independent Poisson random variables are themselves Poisson distributed. For that reason, X has the same distribution as

Y_1 + ... + Y_k   (4.66)

where the Y_i are i.i.d. Poisson random variables each having mean λ/k. By the Central Limit Theorem, Y_1 + ... + Y_k has an approximate normal distribution, with mean and variance λ. (This is not quite a rigorous argument, as the mean of the Y_i depends on k, so our treatment here is informal.)

Now consider W = √X = √(Y_1 + ... + Y_k). Let h(t) = √t, so that h'(t) = 1/(2√t). The delta method then says that W also has an approximate normal distribution, with asymptotic variance

[h'(λ)]² λ = \frac{1}{4λ} · λ = \frac{1}{4}   (4.67)

So, the (asymptotic) variance of √X is a constant, independent of λ. This becomes relevant in regression analysis, where, as we will discuss in Chapter 6, a classical assumption is that a certain collection of random variables all have the same variance.
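Here is a minimal R check of this variance-stabilizing effect (the λ values are my own picks): the variance of √X stays near 1/4 while the variance of X itself grows with λ.

for (lambda in c(10, 100, 1000)) {
   x <- rpois(100000, lambda)
   cat("lambda =", lambda, " var(X) =", var(x),
       " var(sqrt(X)) =", var(sqrt(x)), "\n")
}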
4.2.15.3 Example: Confidence Interval for σ²

Recall that in Section 4.2.7 we noted that (4.18) is only an approximate confidence interval for the mean. An exact interval is available using the Student t-distribution, if the population is normally distributed. We pointed out that (4.18) is very close to the exact interval for even moderately large n anyway, and since no population is exactly normal, (4.18) is good enough.

Note that one of the implications of this, and of the fact that (4.18) did not assume any particular population distribution, is that a Student-t based confidence interval works well even for non-normal populations. We say that the Student-t interval is robust to the normality assumption.

But what about a confidence interval for a variance? Here one can form an exact interval based on the chi-square distribution, if the population is normal. In this case, though, the interval does NOT work well for non-normal populations; it is NOT robust to the normality assumption. So, let's derive an interval that doesn't assume normality; we'll use the delta method.

Warning: This will get a little messy. Write

σ² = E(W²) - (EW)²   (4.68)

and from (4.16) write our estimator of σ² as

s² = \frac{1}{n} Σ_{i=1}^n W_i² - \bar{W}² = T_2 - T_1²   (4.69)

where T_1 = \frac{1}{n} Σ_{i=1}^n W_i and T_2 = \frac{1}{n} Σ_{i=1}^n W_i².

(We are using our old notation, with W_1,...,W_n being a random sample from our population, and with W representing a random variable having the population distribution.)

Since E(T_2) = E(W²) and E(T_1) = EW, we take our function h to be

h(u,v) = u - v²   (4.70)

In other words, in the notation of the theorem, R_1 is our T_2 and R_2 is our T_1.

We'll need the variances of T_1 and T_2, and their covariance. We already have their means, as noted above. We also have the variance of T_1, from (4.9):

Var(T_1) = \frac{1}{n} Var(W)   (4.71)

Now for the variance of T_2: Using (4.9) but on W² instead of W, we have

Var(T_2) = \frac{1}{n} Var(W²) = \frac{1}{n}[E(W⁴) - E(W²)²]   (4.72)

Now for the covariance:

Cov(T_1,T_2) = \frac{1}{n²} Σ_{i=1}^n Cov(W_i, W_i²) = \frac{1}{n²} · n · Cov(W, W²) = \frac{1}{n} Cov(W, W²)   (4.73)

But from the famous formula for covariance,

Cov(W, W²) = E(W³) - EW · E(W²)   (4.74)

To summarize:

Var(T_2) = \frac{1}{n}[E(W⁴) - E(W²)²]   (4.75)

Var(T_1) = \frac{1}{n}[E(W²) - (EW)²]   (4.76)

Cov(T_1,T_2) = \frac{1}{n}[E(W³) - EW·E(W²)]   (4.77)

(ν_1, ν_2)' = (1, -2 E(T_1))' = (1, -2 EW)'   (4.78)

The asymptotic variance to use in our confidence interval for σ² is seen in (4.61) to be

\frac{1}{n}[E(W⁴) - E(W²)² + 4(EW)²{E(W²) - (EW)²} - 4EW{E(W³) - EW·E(W²)}]   (4.79)

Now, we do not know the value of E(W^m) here, m = 1,2,3,4. So, we estimate E(W^m) as

\frac{1}{n} Σ_{i=1}^n W_i^m   (4.80)

Our confidence interval is then s² plus and minus 1.96 times the square root of this quantity. It should be noted, though, that estimating means of higher powers of a random variable requires larger samples in order to achieve comparable accuracy. Our confidence interval here may need a rather large sample to be accurate, as opposed to the situation with (4.18), in which even n = 20 should work well.
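A minimal R sketch of this interval (varCI is my own helper function, not from the text; w is the data vector):

varCI <- function(w) {
   n <- length(w)
   m <- sapply(1:4, function(k) mean(w^k))   # estimates of E(W^m), per (4.80)
   s2 <- m[2] - m[1]^2                       # s^2, per (4.69)
   # estimated asymptotic variance of s2, per (4.79)
   avar <- (m[4] - m[2]^2 + 4*m[1]^2*(m[2] - m[1]^2)
            - 4*m[1]*(m[3] - m[1]*m[2])) / n
   c(s2 - 1.96*sqrt(avar), s2 + 1.96*sqrt(avar))
}
varCI(rexp(5000, 0.1))   # e.g. exponential data; the true variance is 100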
4.2.16 Simultaneous Confidence Intervals

Suppose in our study of heights, weights and so on of people in Davis, we are interested in estimating a number of different quantities, forming a confidence interval for each one. Though our confidence level for each one of them will be 95%, our overall confidence level will be less than that. In other words, we cannot say we are 95% confident that all the intervals contain their respective population values.

In some cases we may wish to construct confidence intervals in such a way that we can say we are 95% confident that all the intervals are correct. This branch of statistics is known as simultaneous inference or multiple inference.

Usually this kind of methodology is used in the comparison of several treatments. Though this term originated in the life sciences, e.g. comparing the effectiveness of several different medications for controlling hypertension, it can be applied in any context. For instance, we might be interested in comparing how well programmers do in several different programming languages, say Python, Ruby and Perl. We'd form three groups of programmers, one for each language, with say 20 programmers per group. Then we would have them write code for a given application. Our measurement could be the length of time T that it takes for them to develop the program to the point at which it runs correctly on a suite of test cases.

Let T_{ij} be the value of T for the j-th programmer in the i-th group, i = 1,2,3, j = 1,2,...,20. We would then wish to compare the three "treatments," i.e. programming languages, by estimating μ_i = E(T_{i1}), i = 1,2,3. Our estimators would be \bar{U}_i = Σ_{j=1}^{20} T_{ij}/20, i = 1,2,3. Since we are comparing the three population means, we may not be satisfied with simply forming ordinary 95% confidence intervals for each mean. We may wish to form confidence intervals which jointly have confidence level 95%.¹³

Note very, very carefully what this means. As usual, think of our notebook idea. Each line of the notebook would contain the 60 observations; different lines would involve different sets of 60 people. So, there would be 60 columns for the raw data, and three columns for the \bar{U}_i. We would also have six more columns for the confidence intervals (lower and upper bounds) for the μ_i. Finally, imagine three more columns, one for each confidence interval, with the entry for each being either Right or Wrong. A confidence interval is labeled Right if it really does contain its target population value, and otherwise is labeled Wrong.

Now, if we construct individual 95% confidence intervals, that means that in a given Right/Wrong column, in the long run 95% of the entries will say Right. But for simultaneous intervals, we hope that within a line we see three Rights, and that 95% of all lines will have that property.

In our context here, if we set up our three intervals to have individual confidence levels of 95%, their simultaneous level will be 0.95³ = 0.86, since the three confidence intervals are independent. Conversely, if we want a simultaneous level of 0.95, we could take each one at a 98.3% level, since 0.95^{1/3} ≈ 0.983.

However, in general the intervals we wish to form will not be independent, so the above "cube root method" would not work. Here we will give a short introduction to more general procedures.

Note that "nothing in life is free." If we want simultaneous confidence intervals, they will be wider.

¹³The word may is important here. It really is a matter of philosophy as to whether one uses simultaneous inference procedures.

4.2.16.1 The Bonferroni Method

One simple approach is Bonferroni's Inequality:

Lemma 6 Suppose A_1,...,A_g are events. Then

P(A_1 or ... or A_g) ≤ Σ_{i=1}^g P(A_i)   (4.81)

You can easily see this for g = 2:

P(A_1 or A_2) = P(A_1) + P(A_2) - P(A_1 and A_2) ≤ P(A_1) + P(A_2)   (4.82)

One can then prove the general case by mathematical induction.

Now to apply this to forming simultaneous confidence intervals, take A_i to be the event that the i-th confidence interval is incorrect, i.e. fails to include the population quantity being estimated. Then (4.81) says that if, say, we form two confidence intervals, each having individual confidence level (100 - 5/2)%, i.e. 97.5%, then the overall collective confidence level for those two intervals is at least 95%. Here's why: Let A_1 be the event that the first interval is wrong, and A_2 the corresponding event for the second interval. Then

overall conf. level = P(not A_1 and not A_2)   (4.83)
= 1 - P(A_1 or A_2)   (4.84)
≥ 1 - P(A_1) - P(A_2)   (4.85)
= 1 - 0.025 - 0.025   (4.86)
= 0.95   (4.87)
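A quick R sketch (my own) of the price paid for Bonferroni intervals: for g simultaneous intervals at overall level 95%, the multiplier 1.96 is replaced by the 1 - 0.05/(2g) normal quantile.

g <- 1:5
qnorm(1 - 0.05/(2*g))   # multipliers: 1.96, 2.24, 2.39, 2.50, 2.58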
4.2.16.2 Scheffe's Method (advanced topic)

The Bonferroni method is unsuitable for more than a few intervals; each one would have to have such a high individual confidence level that the intervals would be very wide. Many alternatives exist, a famous one being Scheffe's method.¹⁴

Theorem 7 Suppose R_1,...,R_k have an approximately multivariate normal distribution, with mean vector μ = (μ_i) and covariance matrix Σ = (σ_{ij}). Let \hat{Σ} be a consistent estimator of Σ.

For any constants c_1,...,c_k, consider linear combinations of the R_i,

Σ_{i=1}^k c_i R_i   (4.88)

which estimate

Σ_{i=1}^k c_i μ_i   (4.89)

Form the confidence intervals

Σ_{i=1}^k c_i R_i ± √(k χ²_{α;k}) s(c_1,...,c_k)   (4.90)

where

[s(c_1,...,c_k)]² = (c_1,...,c_k) \hat{Σ} (c_1,...,c_k)'   (4.91)

and where χ²_{α;k} is the upper-α percentile of a chi-square distribution with k degrees of freedom.¹⁵

Then all of these intervals (for infinitely many values of the c_i!) have simultaneous confidence level 1 - α.

By the way, if we are interested in constructing confidence intervals only for contrasts, i.e. for c_i having the property that Σ c_i = 0, then the number of degrees of freedom reduces to k-1, thus producing narrower intervals.

Just as in Section 4.2.7 we avoided the t-distribution, here we have avoided the F distribution, which is used instead of chi-square in the "exact" form of Scheffe's method.

¹⁴The name is pronounced "sheh-FAY."

¹⁵Recall that the distribution of the sum of squares of g independent N(0,1) random variables is called chi-square with g degrees of freedom. It is tabulated in the R statistical package's function qchisq().

4.2.16.3 Example

For example, again consider the Davis heights example in Section 4.2.11. Suppose we want to find approximate 95% confidence intervals for two population quantities, μ_1 and μ_2. These correspond to values of (c_1, c_2) of (1,0) and (0,1). Since the two samples are independent, σ_{12} = 0. The chi-square value is 5.99,¹⁶ so the square root in (4.90) is 3.46. So, we would compute (4.18) for \bar{X} and then for \bar{Y}, but would use 3.46 instead of 1.96.

This actually is not as good as Bonferroni in this case. For Bonferroni, we would find two 97.5% confidence intervals, which would use 2.24 instead of 1.96.

Scheffe's method is too conservative if we are forming just a small number of intervals, but it is great if we form a lot of them. Moreover, it is very general, usable whenever we have a set of approximately normal estimators.

¹⁶Obtained from R via qchisq(0.95,2).

4.2.16.4 Other Methods for Simultaneous Inference

There are many other methods for simultaneous inference. It should be noted, though, that many of them are limited in scope, in contrast to Scheffe's method, which is usable whenever one has multivariate normal estimators, and Bonferroni's method, which is universally usable.

4.2.17 The Bootstrap Method for Forming Confidence Intervals (advanced topic)

Many statistical applications can be quite complex, which makes them very difficult to analyze mathematically. Fortunately, there is a fairly general method for finding confidence intervals called the bootstrap. Here is a very brief overview.

Say we are estimating some population value θ based on i.i.d. random variables Q_i, i = 1,...,n. Say our estimator is \hat{θ}. Then we draw k new "samples" of size n, by drawing values with replacement from the Q_i. For each sample, we recompute \hat{θ}, giving us values \hat{θ}_i, i = 1,...,k. We sort these latter values and find the 0.025 and 0.975 quantiles, i.e. the 2.5% and 97.5% points of the values \hat{θ}_i, i = 1,...,k. These two points form our confidence interval for θ.

R includes the boot() function to do the mechanics of this for us.
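Here is a minimal hand-rolled version of that procedure (my own sketch, bootstrapping the sample median of toy exponential data):

q <- rexp(100, 0.1)   # toy data; we estimate the population median
k <- 1000             # number of bootstrap resamples
boots <- replicate(k, median(sample(q, replace = TRUE)))   # resample, re-estimate
quantile(boots, c(0.025, 0.975))   # approximate 95% CI for the median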
4.3 Hypothesis Testing

4.3.1 The Basics

Suppose you have a coin which you want to assess for "fairness." Let p be the probability of heads for the coin. You could toss the coin, say, 100 times, and then form a confidence interval for p using (4.32). The width of the interval would tell you whether 100 tosses was enough for the accuracy you want, and the location of the interval would tell you whether the coin is "fair" enough.

For instance, if your interval were (0.49,0.54), you might feel satisfied that this coin is reasonably fair. In fact, note carefully that even if the interval were, say, (0.502,0.506), you would still consider the coin to be reasonably fair.

Unfortunately, this entire process would be counter to the traditional usage of statistics. Most users of statistics would use the toss data to test the null hypothesis

H_0: p = 0.5   (4.92)

against the alternate hypothesis

H_A: p ≠ 0.5   (4.93)

The approach is to consider H_0 "innocent until proven guilty." We form the test statistic

Z = \frac{\hat{p} - 0.5}{√(\hat{p}(1-\hat{p})/n)}   (4.94)

Under H_0 the random variable Z would have an approximate N(0,1) distribution.

The basic idea is that if Z turns out to have a value which is rare for that distribution, we say, "Rather than believe we've observed a rare event, we choose instead to abandon our assumption that H_0 is true."

So, what do we take for our cutoff value for "rareness"? This probability is called the significance level, denoted by α. The classical value for α is 0.05. If H_0 were true, Z would have an approximate N(0,1) distribution, and thus would be less than -1.96 or greater than 1.96 only 5% of the time, a "rare event." So, if Z does stray that far (i.e. 1.96 or more in either direction) from 0, we reject H_0, and decide that p ≠ 0.5. We say, "The value of p is significantly different from 0.5"; more on this below, as it is NOT what it sounds like.

Let X be the number of heads we get from our 100 tosses. Note that our rule for decision making formulated above is equivalent (do the algebra to see this for yourself) to saying that we will accept H_0 if 40 < X < 60, and reject it otherwise.

4.3.2 General Testing Based on Normally Distributed Estimators

Suppose \hat{θ} is an approximately normally distributed estimator of some population value θ. Then to test H_0: θ = c, form the test statistic

Z = \frac{\hat{θ} - c}{s.e.(\hat{θ})}   (4.95)

where s.e.(\hat{θ}) is the standard error of \hat{θ}, and proceed as before: Reject H_0: θ = c at the significance level of α = 0.05 if |Z| ≥ 1.96.

4.3.3 Example: Network Security

Let's look at the network security example in Section 4.2.11.1 again. Here \hat{θ} = \bar{X} - \bar{Y}, and c is presumably 0 (depending on the goals of Mano et al). If you review the material leading up to (4.36), you'll see that

s.e.(\bar{X} - \bar{Y}) = √(\frac{s_1²}{n_1} + \frac{s_2²}{n_2})   (4.96)

In that example, we found that the standard error was 0.61. So, our test statistic (4.95) is

Z = \frac{\bar{X} - \bar{Y} - 0}{0.61} = \frac{11.52 - 2.00}{0.61} = 15.61   (4.97)

This is definitely larger in absolute value than 1.96, so we reject H_0, and conclude that the population mean round-trip times are different in the wired and wireless cases.

4.3.4 The Notion of "p-Values"

In the example above, the Z value, 15.61, was far larger than the cutoff for rejection of H_0, 1.96. You might say that we "resoundingly" rejected H_0.
When data analysts encounter such a situation, they want to indicate it in their reports. This is done through something called the observed significance level, more often called the p-value.

To illustrate this, let's look at a somewhat milder case, in which Z = 2.14. By checking a table of the N(0,1) distribution, or say by calling pnorm(2.14) in R, we would find that the N(0,1) distribution has area 0.016 to the right of 2.14, and of course an equal area to the left of -2.14. In other words, in the general formulation in Section 4.3.2, we would be able to reject H_0 even at the much more stringent significance level of 0.032 instead of 0.05. This would be a stronger statement, and in the research community it is customary to say, "The p-value was 0.032."

In our example above in which Z was 15.61, the value is literally "off the chart"; pnorm(15.61) returns a value of 1. Of course, it's a tiny bit less than 1, but it is so far out in the right tail of the N(0,1) distribution that the area to the right is essentially 0. So, this would be treated as very, very highly significant.

If many tests are performed and are summarized in a table, it is customary to denote the ones with small p-values by asterisks. This is generally one asterisk for p under 0.05, two for p less than 0.01, three for 0.001, etc. The more asterisks, the more "significant" the data is supposed to be. Well, that's a common interpretation, but careful analysts know it to be misleading, as we will now discuss.

4.3.5 What's Random and What Is Not

It is crucial to keep in mind that H_0 is not an event or any other kind of random entity. This coin either has p = 0.5 or it doesn't. If we repeat the experiment, we will get a different value of X, but p doesn't change. So for example, it would be wrong and meaningless to speak of the "probability that H_0 is true."

Similarly, it would be wrong and meaningless to write 0.05 = P(|Z| > 1.96 | H_0), again because H_0 is not an event and this kind of conditional probability would not make sense. What is customarily written is something like

0.05 = P_{H_0}(|Z| > 1.96)   (4.98)

This is read aloud as "the probability that |Z| is larger than 1.96 under H_0," with the phrase under H_0 referring to the probability measure in the case in which H_0 is true.

4.3.6 One-Sided H_A

Suppose that, somehow, we are sure that our coin in the example above is either fair or it is more heavily weighted towards heads. Then we would take our alternate hypothesis to be

H_A: p > 0.5   (4.99)

A "rare event" which could make us abandon our belief in H_0 would now be if Z in (4.94) is very large in the positive direction. So, with α = 0.05, our rule would now be to reject H_0 if Z > 1.65.

The same would be the case if our null hypothesis were

H_0: p ≤ 0.5   (4.100)

instead of

H_0: p = 0.5   (4.101)

Then (4.98) would change to

0.05 ≥ P_{H_0}(Z > 1.65)   (4.102)

4.3.7 Exact Tests

Remember, the tests we've seen so far are all approximate. In (4.94), for instance, \hat{p} had an approximate normal distribution, so that the distribution of Z was approximately N(0,1). Thus the significance level α was approximate, as were the p-values and so on.¹⁷

But the only reason our tests were approximate is that we only had the approximate distribution of our test statistic Z, or equivalently, we only had the approximate distribution of our estimator, e.g. \hat{p}. If we have an exact distribution to work with, then we can perform an exact test.

¹⁷Another class of probabilities which would be approximate would be the power values. These are the probabilities of rejecting H_0 if the latter is not true. We would speak, for instance, of the power of our test at p = 0.55, meaning the chances that we would reject the null hypothesis if the true population value of p were 0.55.
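In R, the two-sided p-value for an approximately N(0,1) test statistic can be computed directly (a one-line sketch of the calculation described above):

z <- 2.14
2 * (1 - pnorm(abs(z)))   # two-sided p-value; about 0.032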
If many tests are performed and are summarized in a table, it is customary to denote the ones with small p-values by asterisks. This is generally one asterisk for p under 0.05, two for p under 0.01, three for p under 0.001, etc. The more asterisks, the more "significant" the data is supposed to be. Well, that's a common interpretation, but careful analysts know it to be misleading, as we will now discuss.

4.3.5 What's Random and What Is Not

It is crucial to keep in mind that H0 is not an event or any other kind of random entity. This coin either has p = 0.5 or it doesn't. If we repeat the experiment, we will get a different value of X, but p doesn't change. So for example, it would be wrong and meaningless to speak of the "probability that H0 is true."

Similarly, it would be wrong and meaningless to write 0.05 = P(|Z| > 1.96 | H0), again because H0 is not an event and this kind of conditional probability would not make sense. What is customarily written is something like

0.05 = P_H0(|Z| > 1.96)   (4.98)

This is read aloud as "the probability that |Z| is larger than 1.96 under H0," with the phrase under H0 referring to the probability measure in the case in which H0 is true.

4.3.6 One-Sided HA

Suppose that, somehow, we are sure that our coin in the example above is either fair or weighted more heavily towards heads. Then we would take our alternate hypothesis to be

HA: p > 0.5   (4.99)

A "rare event" which could make us abandon our belief in H0 would now be if Z in (4.94) is very large in the positive direction. So, with α = 0.05, our rule would now be to reject H0 if Z > 1.65.

The same would be the case if our null hypothesis were

H0: p ≤ 0.5   (4.100)

instead of

H0: p = 0.5   (4.101)

Then (4.98) would change to

0.05 ≥ P_H0(Z > 1.65)   (4.102)

4.3.7 Exact Tests

Remember, the tests we've seen so far are all approximate. In (4.94), for instance, p̂ had an approximate normal distribution, so that the distribution of Z was approximately N(0,1). Thus the significance level α was approximate, as were the p-values and so on.17

But the only reason our tests were approximate is that we only had the approximate distribution of our test statistic Z, or equivalently, we only had the approximate distribution of our estimator, e.g. p̂. If we have an exact distribution to work with, then we can perform an exact test.

17 Another class of probabilities which would be approximate would be the power values. These are the probabilities of rejecting H0 if the latter is not true. We would speak, for instance, of the power of our test at p = 0.55, meaning the chances that we would reject the null hypothesis if the true population value of p were 0.55.

Let's consider the coin example again. To keep things simple, let's suppose we toss the coin 10 times. We will make our decision based on X, the number of heads out of 10 tosses. Suppose we set our threshold for "strong evidence" against H0 to be 8 heads, i.e. we will reject H0 if X ≥ 8. What will α be?

α = P(X ≥ 8) = Σ from i=8 to 10 of C(10,i)(1/2)^10 ≈ 0.055   (4.103)

That's not 0.05. Clearly we cannot get an exact significance level of 0.05,18 but our α here is known exactly: 0.055.

18 Actually, it could be done by introducing some randomization to our test.
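We can check this significance level directly in R, since X has an exact binomial distribution under H0:

sum(dbinom(8:10, 10, 0.5))   # 0.0546875, i.e. about 0.055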
Of course, if you are willing to assume that you are sampling from a normally distributed population, then the Student-t test is nominally exact. The R function t.test() performs this operation.

As another example, suppose lifetimes of lightbulbs are exponentially distributed with mean μ. In the past, μ = 1000, but there is a claim that the new light bulbs are improved and μ > 1000. To test that claim, we will sample 10 lightbulbs, getting lifetimes X₁, ..., X₁₀, and compute the sample mean X̄. We will then perform a hypothesis test of

H0: μ = 1000   (4.104)

vs.

HA: μ > 1000   (4.105)

It is natural to have our test take the form in which we reject H0 if

X̄ > w   (4.106)

for some constant w chosen so that

P(X̄ > w) = 0.05   (4.107)

under H0. Suppose we want an exact test, not one based on a normal approximation.

Recall that 10X̄, the sum of the Xᵢ, has a gamma distribution, with r = 10 and λ = 0.001. So, we can find the w for which P(X̄ > w) = 0.05 by using R's qgamma():

> qgamma(0.95,10,0.001)
[1] 15705.22

So, we reject H0 if our sample mean is larger than 1570.5.

4.3.8 What's Wrong with Hypothesis Testing

Hypothesis testing is a time-honored approach, used by tens of thousands of people every day. But it is "wrong." I use the quotation marks here because, although hypothesis testing is mathematically correct, it is at best noninformative and at worst seriously misleading.

To begin with, it's absurd to test H0 in the first place. No coin is absolutely perfectly balanced, with p = 0.5000000000000000000000000000... We know that before even collecting any data.

But much worse is this word "significant." Say our coin actually has p = 0.502. From anyone's point of view, that's a fair coin! But look what happens in (4.94) as the sample size n grows. If we have a large enough sample, eventually the denominator in (4.94) will be small enough, and p̂ will be close enough to 0.502, that Z will be larger than 1.96 and we will declare that p is "significantly" different from 0.5. But it isn't! Yes, p is different from 0.5, but NOT in any significant sense.

This is especially a problem in computer science applications of statistics, because they often use very large data sets. A data mining application, for instance, may consist of hundreds of thousands of retail purchases. The same is true for data on visits to a Web site, network traffic data and so on. In all of these, the standard use of hypothesis testing can result in our pouncing on very small differences that are quite insignificant to us, yet will be declared "significant" by the test. Conversely, if our sample is too small, we can miss a difference that actually is significant, i.e. important to us, and we would declare that p is NOT significantly different from 0.5.

In summary, the two basic problems with hypothesis testing are

• H0 is improperly specified. What we are really interested in here is whether p is near 0.5, not whether it is exactly 0.5 (which we know is not the case anyway).

• Use of the word significant is grossly improper (or, if you wish, grossly misinterpreted).

Hypothesis testing forms the very core usage of statistics, yet you can now see that it is, as I said above, "at best noninformative and at worst seriously misleading." This is widely recognized by thinking statisticians. For instance, see http://www.indiana.edu/~stigtsts/quotsagn.html for a nice collection of quotes from famous statisticians on this point. There is an entire chapter devoted to this issue in one of the best-selling elementary statistics textbooks in the nation.19 But the practice of hypothesis testing is too deeply entrenched for things to have any prospect of changing.

19 Statistics, third edition, by David Freedman, Robert Pisani and Roger Purves, W.W. Norton, 1997.

4.3.9 What to Do Instead

In the coin example, we could set limits of fairness, say requiring that p be no more than 0.01 from 0.5 in order to consider the coin fair. We could then test the hypothesis

H0: 0.49 ≤ p ≤ 0.51   (4.108)

Such an approach is almost never used in practice, as it is somewhat difficult to use and explain. But even more importantly, what if the true value of p were, say, 0.51001? Would we still really want to reject the coin in such a scenario?

Note carefully that I am not saying that we should not make a decision. We do have to decide, e.g. decide whether a new hypertension drug is safe or in this case decide whether this coin is "fair" enough for practical purposes, say for determining which team gets the kickoff in the Super Bowl. But it should be an informed decision, and even testing the modified H0 above would be much less informative than a confidence interval.

Forming a confidence interval is the far superior approach. The width of the interval shows us whether n is large enough for p̂ to be reasonably accurate, and the location of the interval tells us whether the coin is fair enough for our purposes.

Note that in making such a decision, we do NOT simply check whether 0.5 is in the interval. That would make the confidence interval reduce to a hypothesis test, which is what we are trying to avoid. If for example the interval is (0.502,0.505), we would probably be quite satisfied that the coin is fair enough for our purposes, even though 0.5 is not in the interval.

Hypothesis testing is also used for model building, such as for predictor variable selection in regression analysis (a method to be covered in a later unit). The problem is even worse there, because there is no reason to use α = 0.05 as the cutoff point for selecting a variable. In fact, even if one uses hypothesis testing for this purpose (again, very questionable), some studies have found that the best values of α for this kind of application are in the range 0.25 to 0.40. In model building, we still can and should use confidence intervals. However, it does take more work to do so. We will return to this point in our unit on modeling, Chapter 5.
4.3.10 Decide on the Basis of "the Preponderance of Evidence"

In the movies, you see stories of murder trials in which the accused must be "proven guilty beyond the shadow of a doubt." But in most noncriminal trials, the standard of proof is considerably lighter, preponderance of evidence. This is the standard you must use when making decisions based on statistical data. Such data cannot "prove" anything in a mathematical sense. Instead, it should be taken merely as evidence. The width of the confidence interval tells us the likely accuracy of that evidence. We must then weigh that evidence against other information we have about the subject being studied, and then ultimately make a decision on the basis of the preponderance of all the evidence.

Yes, juries must make a decision. But they don't base their verdict on some formula. Similarly, you the data analyst should not base your decision on the blind application of a method that is usually of little relevance to the problem at hand, namely hypothesis testing.

4.4 General Methods of Estimation

In the preceding sections, we often referred to certain estimators as being "natural." For example, if we are estimating a population mean, an obvious choice of estimator would be the sample mean. But in many applications, it is less clear what a "natural" estimate for a parameter of interest would be.20 We will present general methods for estimation in this section.

20 Recall from Section 4.2.10 that we are now using the term parameter to mean any population quantity, rather than an index into a parametric family of distributions.

4.4.1 Example: Guessing the Number of Raffle Tickets Sold

You've just bought a raffle ticket, and find that you have ticket number 68. You check with a couple of friends, and find that their numbers are 46 and 79. Let c be the total number of tickets. How should we estimate c, using our data 68, 46 and 79?

It is reasonable to assume that each of the three of you is equally likely to get assigned any of the numbers 1,2,...,c. In other words, the numbers we get, Xᵢ, i = 1,2,3, are uniformly distributed on the set {1,2,...,c}. We can also assume that they are independent; that's not exactly true, since we are sampling without replacement, but for large c (or better stated, for n/c small) it's close enough. So, we are assuming that the Xᵢ are independent and identically distributed, famously written as i.i.d. in the statistics world, on the set {1,2,...,c}. How do we use the Xᵢ to estimate c?

4.4.2 Method of Moments

One approach, an intuitive one, would be to reason as follows. Note first that

E(X) = (c + 1)/2   (4.109)

Let's solve for c:

c = 2EX − 1   (4.110)

We know that we can use

X̄ = (1/n) Σ from i=1 to n of Xᵢ   (4.111)

to estimate EX, so by (4.110), 2X̄ − 1 is an estimate of c. Thus we take our estimator for c to be

ĉ = 2X̄ − 1   (4.112)

This estimator is called the Method of Moments estimator of c.

Let's step back and review what we did:

• We wrote our parameter as a function of the population mean EX of our data item X. Here, that resulted in (4.110).

• In that function, we substituted our sample mean X̄ for EX, and substituted our estimator ĉ for the parameter c, yielding (4.112). We then solved for our estimator.

We say that an estimator θ̂ of some parameter θ is consistent if

lim as n → ∞ of θ̂ = θ   (4.113)

where n is the sample size. In other words, as the sample size grows, the estimator eventually converges to the true population value. Of course here X̄ is a consistent estimator of EX. Thus you can see from (4.110) and (4.112) that ĉ is a consistent estimator of c. In other words, the Method of Moments generally gives us consistent estimators.
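A quick simulation sketch makes the consistency visible; the true c and the sample sizes here are arbitrary choices, and sampling with replacement reflects the i.i.d. approximation above.

c <- 10000
for (n in c(5, 50, 500)) {
   x <- sample(1:c, n, replace=TRUE)
   print(2*mean(x) - 1)   # the Method of Moments estimate (4.112)
}

The printed estimates will typically be seen to settle near the true c as n grows.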
What if we have more than one parameter to estimate, say θ₁, ..., θᵣ? We generalize what we did above. To see how, recall that E(Xⁱ) is called the ith moment of X;21 let's denote it by ηᵢ. Also, note that although we derived (4.110) by solving (4.109) for c, we did start with (4.109). So we do the following:

21 Hence the name, Method of Moments.

• For i = 1,...,r we write ηᵢ as a function gᵢ of all the θₖ.

• For i = 1,...,r we set

η̂ᵢ = (1/n) Σ from j=1 to n of Xⱼⁱ   (4.114)

• We substitute the θ̂ₖ in the gᵢ and then solve for them.

In the above example with the raffle, we had r = 1, θ₁ = c, g₁(c) = (c+1)/2 and so on. A two-parameter example will be given below.

4.4.3 Method of Maximum Likelihood

Another method, much more commonly used, is called the Method of Maximum Likelihood. In our example above, it means asking the question, "What value of c would have made our data, 68, 46, 79, most likely to happen?" Well, let's find what is called the likelihood, i.e. the probability of our particular data values occurring:

L = P(X₁ = 68, X₂ = 46, X₃ = 79) = (1/c)³ if c ≥ 79, and 0 otherwise   (4.115)

Now keep in mind that c is a fixed, though unknown, constant. It is not a random variable. What we are doing here is just asking "What if" questions, e.g. "If c were 85, how likely would our data be? What about c = 91?"

Well then, what value of c maximizes (4.115)? Clearly, it is c = 79. Any smaller value of c gives us a likelihood of 0. And for c larger than 79, the larger c is, the smaller (4.115) is. So, our maximum likelihood estimator (MLE) is 79. In general, if our sample size in this problem were n, our MLE for c would be

ĉ = max over i of Xᵢ   (4.116)

4.4.4 Example: Estimating the Parameters of a Gamma Distribution

As another example, suppose we have a random sample X₁, ..., Xₙ from a gamma distribution,

fX(t) = (1/Γ(c)) λᶜ tᶜ⁻¹ e^(−λt), t > 0   (4.117)

for some unknown c and λ. How do we estimate c and λ from the Xᵢ?

4.4.4.1 Method of Moments

Let's try the Method of Moments, as follows. We have two population parameters to estimate, c and λ, so we need to involve two moments of X. Those could be EX and E(X²), but here it would be more convenient to use EX and Var(X). We know from our previous unit on continuous random variables, Chapter 2, that

EX = c/λ   (4.118)

Var(X) = c/λ²   (4.119)

In our earlier notation, this would be r = 2, θ₁ = c, θ₂ = λ, g₁(c,λ) = c/λ and g₂(c,λ) = c/λ². Switching to sample analogs and estimates, we have

X̄ = ĉ/λ̂   (4.120)

s² = ĉ/λ̂²   (4.121)

Dividing the first of these two quantities by the second yields

λ̂ = X̄/s²   (4.122)

which then gives

ĉ = X̄²/s²   (4.123)

4.4.4.2 MLEs

What about the MLEs of c and λ? Remember, the Xᵢ are continuous random variables, so the likelihood function, i.e. the analog of (4.115), is the product of the density values:

L = Π from i=1 to n of (1/Γ(c)) λᶜ Xᵢᶜ⁻¹ e^(−λXᵢ)   (4.124)

In general, it is usually easier to maximize the log likelihood (and maximizing this is the same as maximizing the original likelihood):

l = (c − 1) Σ ln(Xᵢ) − λ Σ Xᵢ + nc ln(λ) − n ln(Γ(c))   (4.125)

One then takes the partial derivatives of (4.125) with respect to c and λ. The solution values, ĉ and λ̂, are then the MLEs of c and λ. Unfortunately, these equations do not have closed-form solutions, so they must be solved numerically.
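Here is a sketch of both approaches in R, on simulated data; the true parameter values c = 2 and λ = 0.5 are arbitrary, and optim() is just one of several ways to do the numerical maximization.

x <- rgamma(1000, shape=2, rate=0.5)   # simulated data
lambdahat <- mean(x) / var(x)          # Method of Moments, (4.122)
chat <- mean(x)^2 / var(x)             # Method of Moments, (4.123)
# negative of the log likelihood (4.125); optim() minimizes, so we negate
negll <- function(p)
   -sum(dgamma(x, shape=p[1], rate=p[2], log=TRUE))
# maximize numerically, starting from the Method of Moments estimates
mle <- optim(c(chat, lambdahat), negll,
   method="L-BFGS-B", lower=c(1e-6, 1e-6))$par

Using the Method of Moments estimates as starting values for the numerical search is a common practice.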
4.4.5 More Examples

Suppose fW(t) = ctᶜ⁻¹ for t in (0,1), with the density being 0 elsewhere, for some unknown c > 0. We have a random sample W₁, ..., Wₙ from this density. Let's find the Method of Moments estimator.

EW = ∫ from 0 to 1 of t · ctᶜ⁻¹ dt = c/(c + 1)   (4.126)

So, set

W̄ = ĉ/(ĉ + 1)   (4.127)

yielding

ĉ = W̄/(1 − W̄)   (4.128)

What about the MLE?

L = Π from i=1 to n of cWᵢᶜ⁻¹   (4.129)

so

l = ln L = n ln c + (c − 1) Σ ln Wᵢ   (4.130)

Then set

0 = n/c + Σ ln Wᵢ   (4.131)

and thus

ĉ = −n / Σ ln Wᵢ   (4.132)

As in Section 4.4.3, not every MLE can be determined by taking derivatives. Consider a continuous analog of the example in that section, with fW(t) = 1/c on (0,c), 0 elsewhere, for some c > 0. The likelihood is

(1/c)ⁿ   (4.133)

as long as

c ≥ max over i of Wᵢ   (4.134)

and is 0 otherwise. So,

ĉ = max over i of Wᵢ   (4.135)

as before.

Let's find the bias of this estimator. The bias is Eĉ − c. To get Eĉ we need the density of that estimator, which we get as follows:

P(ĉ ≤ t) = P(all Wᵢ ≤ t) (definition)   (4.136)
         = (t/c)ⁿ (density of W)   (4.137)

So,

f_ĉ(t) = ntⁿ⁻¹/cⁿ   (4.138)

Integrating t against this density, we find that

Eĉ = [n/(n + 1)] c   (4.139)

So the bias is Eĉ − c = −c/(n + 1), i.e. ĉ is on average only slightly below c; not bad at all.

4.4.6 What About Confidence Intervals?

Usually we are not satisfied with simply forming estimates (called point estimates). We also want some indication of how accurate these estimates are, in the form of confidence intervals (interval estimates).

In many special cases, finding confidence intervals can be done easily on an ad hoc basis. Look, for instance, at the Method of Moments Estimator in Section 4.4.2. Our estimator (4.112) is a linear function of X̄, so we easily obtain a confidence interval for c from one for EX.

Another example is (4.132). Taking the limit as n → ∞, the equation shows us (and we could verify) that

1/c = −E(ln W)   (4.140)

Defining Xᵢ = ln Wᵢ and X̄ = (X₁ + ... + Xₙ)/n, we can obtain a confidence interval for EX in the usual way. We then see from (4.140) that we can form a confidence interval for c by applying the transformation u → −1/u to each endpoint; since the endpoints are negative, this transformation preserves their order.

What about in general? For the Method of Moments case, our estimators are functions of the sample moments, and since the latter are formed from sums and thus are asymptotically normal, the delta method can be used to show that our estimators are asymptotically normal and to obtain asymptotic variances for them.

There is a well-developed asymptotic theory for MLEs, which under certain conditions not only shows asymptotic normality with a determined asymptotic variance, but also establishes that MLEs are in a certain sense optimal among all estimators. We will not pursue this here.
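To make that last example concrete, here is a sketch in R with simulated data; the true c = 3 is an arbitrary choice, and W is generated as U^(1/c) for U uniform on (0,1), which has exactly the density ctᶜ⁻¹.

n <- 1000
w <- runif(n)^(1/3)      # sample from c*t^(c-1) with c = 3
x <- log(w)
ci <- mean(x) + c(-1.96, 1.96) * sd(x)/sqrt(n)   # CI for E(ln W)
c(-1/ci[1], -1/ci[2])    # CI for c, via (4.140)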
"Prior" to collecting any data, we have a certain belief about p. This is very controversial, and many people-including me-consider it to be highly inappropriate. They/I feel that there is nothing wrong using one's gut feelings to make a decision, but it should NOT be part of the mathematical analysis of the data. One's hunches can play a role in deciding the "preponderance of evidence," as discussed in Section 4.3.10. On the other hand, maybe we have actual data on coins, presumed to be a random sample from the population of all coins of that type, and we assume that our coin now is chosenly randomly from that population. Say we have formed a normal or other model for p based on that data. It would be fine to use this in estimating p for a new coin, and the second L above would be appropriate. In this case, we would be using an empirical prior. 4.4.8 The Empirical cdf Recall that Fx, the cdf of X, is defined as Fx (t) = P(X <;t), - 00 < t < oo (4.143) Define its sample analog, called the empirical distribution function, by - # of Xi in (-oot) Fx (t) = n (4.144) In other words, Fx (t) is the proportion of X that are below t in the population, and Fx (t) is the value of that proportion in our sample. Fx (t) estimates Fx (t) for each t. Graphically, Fx is a step function, with jumps at the values of the Xi. Specifically, let Y, j = 1,...,n denote  140 CHAPTER 4. INTRODUCTION TO STATISTICAL INFERENCE the sorted version of the X1.22 Then 0, for t < Yi FX(t)= n, forY Yn Here is a simple example. Say n = 4 and our data are 4.8, 1.2, 2.2 and 6.1. We can plot the empirical cdf by calling R's ecdf() function: > plot (ecdf (x) ) Here is the graph: ecdf(x) C) _ T 00 X LL _ 0 N _ O O _ O I 0 2 4 6 x 4.5 Real Populations and Conceptual Populations In our example in Section 4.2.3.1, we were sampling from a real population. However, in many, probably most applications of statistics, either the population or the sampling is more conceptual. Consider the experiment comparing three scripting languages in Section 3.169. We think of our program- mers as being a random sample from the population of all programmers, but that is probably an idealization. It may be, for example, that they all work at the same company, in which case we must think of them as a "random sample" from the rather conceptual "population" of all programmers who might work at this company.23 22A common notation for this is Y = X(j), meaning that Y is the jth smallest of the Xi. These are called the order statistics of our sample. 23You're probably wondering why we haven't discussed other factors, such as differing levels of experience among the program- mers. This will be dealt with in our unit on regression analysis, Chapter 6.  4.6. NONPARAMETRIC DENSITY ESTIMATION 141 And what about our raffle example in Section 4.4.1? Certainly we can imagine various kinds of randomness that contribute to the numbers people get on their raffle tickets. Maybe, for instance, you were in a traffic jam on the way to the the place where you bought the ticket, so you bought it a little later than you might have and thus got a higher number. But I've always emphasized the notion of a repeatable experiment in these notes. How can that happen here? You could imagine, for instance, the raffle chair suddenly losing all the tickets, and asking everyone to draw again, resulting in different ticket numbers. Or you can imagine the "population" of all raffles that you might submit to which have the same value of c. 
4.6 Nonparametric Density Estimation

Consider the Bus Paradox example again. Recall that W denoted the time until the next bus arrives. This is called the forward recurrence time. The backward recurrence time is the time since the last bus was here, which we will denote by R.

Suppose we are interested in estimating the density of R, fR, based on the sample data R₁, ..., Rₙ that we gather in our simulation in Section 4.2.1, where n = 1000. How can we do this?24

24 Actually, our unit on renewal theory, Chapter 9, proves that R has an exponential distribution. However, here we'll pretend we don't know that.

We could, of course, assume that fR is a member of some parametric family of distributions, say the two-parameter gamma family. We would then estimate those two parameters as in Section 4.4, and possibly check our assumption using goodness-of-fit procedures, discussed in our unit on modeling, Chapter 5. On the other hand, we may wish to estimate fR without making any parametric assumptions. In fact, one reason we may wish to do so is to visualize the data in order to search for a suitable parametric model.

If we do not assume any parametric model, we have in essence changed our problem from estimating a finite number of parameters to an infinite-parameter problem; the "parameters" are the values of fR(t) for all the different values of t. Of course, we probably are willing to assume some structure on fR, such as continuity, but then we still would have an infinite-parameter problem.

We call such estimation nonparametric, meaning that we don't use a parametric model. However, you can see that it is really infinite-parametric estimation. As discussed in our unit on modeling, Chapter 5, the more complex the model, the higher the variance of its estimator. So, nonparametric estimators will have higher variance than parametric ones. The nonparametric estimators will also generally have smaller bias, of course.

4.6.1 Basic Ideas

Recall that

fR(t) = FR'(t), where FR(t) = P(R ≤ t)   (4.146)

From calculus, that means that for small h > 0,

fR(t) ≈ [FR(t + h) − FR(t − h)] / (2h) = P(t − h < R ≤ t + h) / (2h)

We can estimate the probability in the numerator by the corresponding sample proportion, yielding the estimator

f̂R(t) = #{i: Rᵢ in (t − h, t + h)} / (2hn)   (4.150)
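This estimator is simple enough to code directly; here is a minimal sketch (the function name fhat and its arguments are ours, for illustration).

# estimated density at the point t, from data x, using half-width h
fhat <- function(t, x, h) sum(abs(x - t) < h) / (2 * h * length(x))

With the simulated waits data generated in the next section, for instance, fhat(10, waits, 2) would estimate the density at t = 10.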
4.6.2 Histograms

Applying this idea over a grid of intervals at once gives the familiar histogram. Here is simulation code generating the Rᵢ:

doexpt <- function(opt) {
   # simulate bus arrivals up through the observation point opt, then
   # return the backward recurrence time R at that point
   lastarrival <- 0
   repeat {
      # exponential interarrival times; the rate 0.1 (mean 10) is assumed,
      # matching the Bus Paradox example
      newlastarrival <- lastarrival + rexp(1, 0.1)
      if (newlastarrival > opt) return(opt - lastarrival)
      else lastarrival <- newlastarrival
   }
}

observationpt <- 240
nreps <- 10000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
hist(waits)

Note that I used the default number of intervals, 20. Here is the result:

[Histogram of waits, 20 intervals]

The density seems to have a shape like that of the exponential parametric family. (This is not surprising, because it is exponential, but remember we're pretending we don't know that.) Here is the plot with 100 intervals:

[Histogram of waits, 100 intervals]

Again, a similar shape, though more raggedy.

4.6.3 Kernel-Based Density Estimation (advanced topic)

No matter what the interval width is, the histogram will consist of a bunch of rectangles, rather than a curve. That is basically because the estimate at any particular value of t depends only on the observations that fall into t's interval. We could get a smoother result if we used all our data to estimate the density at t, but put more weight on the data that is closer to t. One way to do this is called kernel-based density estimation, which in R is handled by the function density().

We need a set of weights, more precisely a weight function k, called the kernel. Any nonnegative function which integrates to 1, i.e. a density function in its own right, will work. Our estimator is then

f̂R(t) = (1/nh) Σ from i=1 to n of k((t − Rᵢ)/h)   (4.152)

To make this idea concrete, take k to be the uniform density on (−1,1), which has the value 0.5 on (−1,1) and 0 elsewhere. Then (4.152) reduces to (4.150). Note how the parameter h, called the bandwidth, continues to control how far away from t we wish to go for data points.

But as mentioned, what we really want is to include all data points, so we typically use a kernel with support on all of (−∞, ∞). In R, the default kernel is that of the N(0,1) density. The bandwidth h controls how much smoothing we do; smaller values of h place heavier weights on data points near t and much lighter weights on the distant points. The default bandwidth in R is taken to be the standard deviation of k.

For our data here, I took the defaults:

> plot(density(r))

The result is seen in Figure 4.1.

[Figure 4.1: Kernel estimate, default bandwidth (N = 1000, bandwidth = 1.942)]

I then tried it with a bandwidth of 0.5. See Figure 4.2. This curve oscillates a lot, so an analyst might think 0.5 is too small. (We are prejudiced here, because we know the true population density is exponential.)

[Figure 4.2: Kernel estimate, bandwidth 0.5]

4.6.4 Proper Use of Density Estimates

There is no good, practical way to choose a good bin width or bandwidth. Moreover, there is also no good way to form a reasonable confidence band for a density estimate. So, density estimates should be used as exploratory tools, not as firm bases for decision making. You will probably find it quite unsettling to learn that there is no exact answer to the problem. But that's real life!

Exercises

Note to instructor: See the Preface for a list of sources of real data on which exercises can be assigned to complement the theoretical exercises below.

1. Suppose we draw a sample of size 2 from a population in which X has the values 10, 15 and 12. Find the probability mass function of X̄, first assuming sampling with replacement, then assuming sampling without replacement.

2. Suppose lifetimes of lightbulbs are exponentially distributed with mean μ. In the past, μ was 1000, but there is a claim that the new light bulbs are improved and μ > 1000. To test that claim, we will sample 20 lightbulbs, getting lifetimes X₁, ..., X₂₀, and compute X̄ = (X₁ + ... + X₂₀)/20. We will then perform a hypothesis test of H0: μ = 1000 vs. HA: μ > 1000. It is natural to have our test take the form in which we reject H0 if X̄ > r for some constant r chosen so that P(X̄ > r) = 0.05 under H0. Suppose we want an exact test, not one based on a normal approximation. Find r.

3. Consider the Method of Moments Estimator ĉ in the raffle example, Section 4.4.1. Find the exact value of Var(ĉ). Use the facts that 1 + 2 + ... + r = r(r+1)/2 and 1² + 2² + ... + r² = r(r+1)(2r+1)/6.

4. Suppose W has a uniform distribution on (−c,c), and we draw a random sample of size n, W₁, ..., Wₙ. Find the Method of Moments and Maximum Likelihood estimators.
5. An urn contains w marbles, one of which is black and the rest being white. We draw marbles from the urn one at a time, without replacement, until we draw the black one; let N denote the number of draws needed. Find the Method of Moments estimator of w based on N.

6. In the raffle example, Section 4.4.1, find a (1 − α) confidence interval for c based on ĉ, the Maximum Likelihood Estimate of c. Hint: Use the example in Section 4.2.13 as a guide.

7. In many applications, observations come in correlated clusters. For instance, we may sample r trees at random, then s leaves within each tree. Clearly, leaves from the same tree will be more similar to each other than leaves on different trees. In this context, suppose we have a random sample X₁, ..., Xₙ, n even, such that there is correlation within pairs. Specifically, suppose the pair (X₂ᵢ₊₁, X₂ᵢ₊₂) has a bivariate normal distribution with mean (μ, μ) and covariance matrix

( 1  ρ )
( ρ  1 )   (4.153)

i = 0,...,n/2 − 1, with the n/2 pairs being independent. Find the Method of Moments estimators of μ and ρ.

8. Candidates A, B and C are vying for election. Let p₁, p₂ and p₃ denote the fractions of people planning to vote for them. We poll n people at random, yielding estimates p̂₁, p̂₂ and p̂₃. Candidate B claims that she has more supporters than the other two candidates combined. Give a formula for an approximate 95% confidence interval for p₂ − (p₁ + p₃).

Chapter 5

Introduction to Model Building

All models are wrong, but some are useful. (George Box¹)

[Mathematical models] should be made as simple as possible, but not simpler. (Albert Einstein²)

1 George Box (1919-) is a famous statistician, with several statistical procedures named after him.

2 The reader is undoubtedly aware of Einstein's (1879-1955) famous theories of relativity, but may not know his connections to probability theory. His work on Brownian motion, which describes the path of a molecule as it is bombarded by others, is probabilistic in nature, and later developed into a major branch of probability theory. Einstein was also a pioneer in quantum mechanics, which is probabilistic as well. At one point, he doubted the validity of quantum theory, and made his famous remark, "God does not play dice with the universe."

The above quote by Box says it all. Consider for example the family of normal distributions. In real life, random variables are bounded (no person's height is negative or greater than 500 inches) and are inherently discrete, due to the finite precision of our measuring instruments. Thus, technically, no random variable in practice can have an exact normal distribution. Yet the assumption of normality pervades statistics, and has been enormously successful, provided one understands its approximate nature.

The situation is similar to that of physics. We know that in many analyses of bodies in motion, we can neglect the effect of air resistance, but that in some situations one must include that factor in our model.

So, the field of probability and statistics is fundamentally about modeling. The field is extremely useful, provided the user understands the modeling issues well. For this reason, this book contains this separate chapter on modeling issues.

5.1 Bias Vs. Variance

Consider a general estimator Q of some population value b. Then a common measure of the quality of the estimator Q is the mean squared error (MSE),

E[(Q − b)²]   (5.1)

Of course, the smaller the MSE, the better.
One can break (5.1) down into variance and (squared) bias components, as follows:³

MSE(Q) = E[(Q − b)²]   (5.2)
       = E[{(Q − EQ) + (EQ − b)}²]   (5.3)
       = E[(Q − EQ)²] + 2E[(Q − EQ)(EQ − b)] + E[(EQ − b)²]   (5.4)
       = Var(Q) + (EQ − b)²   (5.5)
       = variance + squared bias   (5.6)

(The middle term in (5.4) is 0, since EQ − b is a constant and E(Q − EQ) = 0.)

3 In reading the above derivation, keep in mind that EQ and b are constants.

In other words, in discussing the accuracy of an estimator, especially in comparing two or more candidates to use for our estimator, the average squared error has two main components, one for variance and one for bias. In building a model, these two components are often at odds with each other; we may be able to find an estimator with smaller bias but more variance, or vice versa.

5.2 "Desperate for Data"

Suppose we have the samples of men's and women's heights described in Section 4.2.11, and say we wish to predict the height H of a new person who we know to be a man but about whom we know nothing else. The question is, should we take gender into account in our prediction? If so, we would predict the man to be of height⁴

T1 = X̄   (5.7)

our estimate for the mean height of all men. If not, then we predict his height to be

T2 = (X̄ + Ȳ)/2   (5.8)

our estimate of the mean height of all people (assuming that half the population is male).

4 Assuming that predicting too high and too low are of equal concern to us, etc.

Recalling our notation from Section 4.2.11, assume that n₁ = n₂, and call the common value n. Also, for simplicity, let's assume that σ₁ = σ₂ = σ.

5.2.1 Mathematical Formulation of the Problem

Let's formalize this a bit. Let G denote gender, 1 for male, 2 for female. Then our random quantity here is (H,G). (Our "experiment" here is to choose a person from the population at random. Thus the height and gender will be random variables.)

Then the correct population model is

E(H|G = i) = μᵢ   (5.9)

and our predictor T1 reflects this.⁵ However, T2 makes the simplifying assumption that μ₁ = μ₂, so that

E(H|G = i) = μ   (5.10)

where μ is the common value of μ₁ and μ₂. We'll refer to (5.9) as the complex model (two parameters, not counting variances), and to (5.10) as the simple model (one parameter, not counting variances).

5 We are calling it a predictor rather than an estimator as in other examples. This follows custom, which is to use the latter term when the target is a constant, e.g. a population mean, and to use the former term when the target is a random quantity. It is not a major distinction.

5.2.2 Bias and Variance of the Two Predictors

Since the true model is (5.9), T1 is unbiased, from (4.5). But the predictor T2 from the simple model is biased:

E(T2|G = 1) = E(0.5X̄ + 0.5Ȳ) (definition)   (5.11)
            = 0.5EX̄ + 0.5EȲ (linearity of E)   (5.12)
            = 0.5μ₁ + 0.5μ₂ [from (4.5)]   (5.13)
            ≠ μ₁   (5.14)

On the other hand, T2 has a smaller variance: Recalling (4.9), we have

Var(T1|G = 1) = σ²/n   (5.15)

And

Var(T2|G = 1) = Var(0.5X̄ + 0.5Ȳ)   (5.16)
              = 0.5² Var(X̄) + 0.5² Var(Ȳ) (properties of Var)   (5.17)
              = σ²/(2n) [from (4.9)]   (5.18)
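These bias and variance findings are easy to check by simulation. In this sketch the population values are taken from Exercise 1 at the end of this chapter (μ₁ = 70, μ₂ = 66, σ = 4), while the sample size n is an arbitrary choice worth varying.

mu1 <- 70; mu2 <- 66; sigma <- 4; n <- 5
nreps <- 10000
t1err <- t2err <- vector(length=nreps)
for (i in 1:nreps) {
   xbar <- mean(rnorm(n, mu1, sigma))   # men's sample mean
   ybar <- mean(rnorm(n, mu2, sigma))   # women's sample mean
   t1err[i] <- (xbar - mu1)^2
   t2err[i] <- (0.5*xbar + 0.5*ybar - mu1)^2
}
mean(t1err)   # approximates MSE(T1)
mean(t2err)   # approximates MSE(T2)

Rerunning with smaller and larger n illustrates the tradeoff derived in the next section.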
5.2.3 Implications

These findings are highly instructive. You might at first think that "of course" T1 would be the better predictor than T2. But for a small sample size, the smaller (actually 0) bias of T1 is not enough to counteract its larger variance. T2 is biased, yes, but it is based on double the sample size and thus has half the variance.

In light of (5.6), we see that T1, the "true" predictor, may not necessarily be the better of the two predictors. Granted, it has no bias whereas T2 does have a bias, but the latter has a smaller variance. Let's consider this in more detail, using (5.5):

MSE(T1) = σ²/n + 0² = σ²/n   (5.19)

MSE(T2) = σ²/(2n) + (0.5μ₁ + 0.5μ₂ − μ₁)² = σ²/(2n) + [(μ₂ − μ₁)/2]²   (5.20)

T1 is a better predictor than T2 if (5.19) is smaller than (5.20), which is true if

[(μ₂ − μ₁)/2]² > σ²/(2n)   (5.21)

So you can see that T1 is better only if either

• n is large enough, or

• the difference in population mean heights between men and women is large enough, or

• there is not much variation within each population, e.g. most men have very similar heights.

Since that third item, small within-population variance, is rarely seen, let's concentrate on the first two items. The big revelation here is that:

A more complex model is more accurate than a simpler one only if either

• we have enough data to support it, or

• the complex model is sufficiently different from the simpler one.

In the height/gender example above, if n is too small, we are "desperate for data," and thus make use of the female data to augment our male data. Though women tend to be shorter than men, the bias that results from the augmentation is offset by the reduction in estimator variance that we get. But if n is large enough, the variance will be small in either model, so when we go to the more complex model, the advantage gained by reducing the bias will more than compensate for the increase in variance.

THIS IS AN ABSOLUTELY FUNDAMENTAL NOTION IN STATISTICS.

This was a very simple example, but you can see that in complex settings, fitting too rich a model can result in very high MSEs for the estimates. In essence, everything becomes noise. (Some people have cleverly coined the term noise mining, a play on the term data mining.) This is the famous overfitting problem.

In our unit on statistical relations, Chapter 6, we will show the results of a scary experiment done at the Wharton School, the University of Pennsylvania's business school. The researchers deliberately added fake data to a prediction equation, and standard statistical software identified it as "significant"! This is partly a problem with the word itself, as we saw in Section 4.3.8, but also a problem of using far too complex a model, as will be seen in that future unit.

5.3 Assessing "Goodness of Fit" of a Model

Our example in Section 4.4.4 concerned how to estimate the parameters of a gamma distribution, given a sample from the distribution. But that assumed that we had already decided that the gamma model was reasonable in our application. Here we will be concerned with how we might come to such decisions. Assume we have a random sample X₁, ..., Xₙ from a distribution having density fX.

5.3.1 The Chi-Square Goodness of Fit Test

The classic way to do this would be the Chi-Square Goodness of Fit Test. We would set

H0: fX is a member of the exponential parametric family   (5.22)

This would involve partitioning (0, ∞) into k intervals (sᵢ₋₁, sᵢ) of our choice, and setting

Nᵢ = number of Xⱼ in (sᵢ₋₁, sᵢ)   (5.23)

We would then find the Maximum Likelihood Estimate (MLE) of λ, on the assumption that the distribution of X really is exponential. The MLE turns out to be the reciprocal of the sample mean, i.e.

λ̂ = 1/X̄   (5.24)

This would be considered the parameter of the "best-fitting" exponential density for our data. We would then estimate the probabilities

pᵢ = P[X in (sᵢ₋₁, sᵢ)] = e^(−λ̂sᵢ₋₁) − e^(−λ̂sᵢ), i = 1, ..., k   (5.25)

Note that Nᵢ has a binomial distribution, with n trials and success probability pᵢ. Using this, the expected value E(Nᵢ) is estimated to be

νᵢ = n[e^(−λ̂sᵢ₋₁) − e^(−λ̂sᵢ)]   (5.27)

Our test statistic would then be

Q = Σ from i=1 to k of (Nᵢ − νᵢ)²/νᵢ   (5.28)

where νᵢ is the expected value of Nᵢ under the assumption of "exponentialness." It can be shown that Q is approximately chi-square distributed with k−2 degrees of freedom. Note that only large values of Q should be suspicious, i.e. should lead us to reject H0; if Q is small, it indicates a good fit. If Q were large enough to be a "rare event," say larger than χ²(0.95, k−2), we would decide NOT to use the exponential model; otherwise, we would use it.

Hopefully the reader has immediately recognized the problem here. If we have a large sample, this procedure will pounce on tiny deviations from the exponential distribution, and we would decide not to use the exponential model, even if those deviations were quite minor. Again, no model is 100% correct, and thus a goodness of fit test will eventually tell us not to use any model at all.
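Here is a sketch of the whole procedure in R, on simulated data; the true rate 0.5, the sample size and the choice of k = 5 intervals with these endpoints are all arbitrary.

x <- rexp(1000, 0.5)
lambdahat <- 1/mean(x)                    # (5.24)
s <- c(0, 0.5, 1, 2, 4, Inf)              # interval endpoints
ni <- table(cut(x, s))                    # the N_i, as in (5.23)
phati <- exp(-lambdahat*head(s,-1)) - exp(-lambdahat*tail(s,-1))  # (5.25)
nui <- length(x) * phati                  # (5.27)
q <- sum((ni - nui)^2 / nui)              # (5.28)
q > qchisq(0.95, 5-2)                     # TRUE would mean rejecting H0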
5.3.2 Kolmogorov-Smirnov Confidence Bands

Again consider the problem above, in which we were assessing the fit of an exponential model. In line with our major point that confidence intervals are far superior to hypothesis tests, we now present Kolmogorov-Smirnov confidence bands, which work as follows.

Recall the concept of empirical cdfs, presented in Section 4.4.8. It turns out that the distribution of

M = max over −∞ < t < ∞ of |F̂X(t) − FX(t)|   (5.29)

does not depend on FX, and its quantiles have been tabulated. For larger n, the 0.95 quantile of M is approximately 1.358/√n. That gives us the band

F̂X(t) ± 1.358/√n, −∞ < t < ∞

which contains the entire true cdf FX with probability approximately 0.95. Comparing this band to the cdf of a fitted parametric model then shows us directly how well, and where, the model fits, rather than reducing the question to a single accept/reject decision.
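A sketch of computing such a band in R, where x is assumed to hold whatever sample we are assessing:

n <- length(x)
fhat <- ecdf(x)
t <- seq(min(x), max(x), length=100)
upper <- pmin(fhat(t) + 1.358/sqrt(n), 1)
lower <- pmax(fhat(t) - 1.358/sqrt(n), 0)

The pmin() and pmax() calls just clip the band to [0,1], since a cdf cannot leave that range.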
5.4 Robustness

Consider the classic procedures for performing hypothesis tests and forming confidence intervals for a population mean μ, based on the Student-t statistic

T = (X̄ − μ) / (s/√n)   (5.36)

where s² is the unbiased version of the sample variance,

s² = [1/(n − 1)] Σ from i=1 to n of (Xᵢ − X̄)²

The distribution of T, under the assumption of a normal population, has been tabulated, and tables for it appear in virtually every textbook on statistics. But what if the population is not normal, as is inevitably the case? The answer is that it doesn't matter. For large n, even for samples having, say, n = 20, the distribution of T is close to N(0,1) by the Central Limit Theorem regardless of whether the population is normal. In other words, these procedures are robust to the normality assumption.

By contrast, consider the classic procedure for performing hypothesis tests and forming confidence intervals for a population variance σ², which relies on the statistic

K = (n − 1)s²/σ²   (5.37)

where again s² is the unbiased version of the sample variance. If the sampled population is normal, then K can be shown to have a chi-square distribution with n−1 degrees of freedom. This then sets up the tests or intervals. However, it has been shown that these procedures are not robust to the assumption of a normal population. See The Analysis of Variance: Fixed, Random, and Mixed Models, by Hardeo Sahai and Mohammed I. Ageel, Springer, 2000, and the earlier references they cite, especially the pioneering work of Scheffé.

Exercises

Note to instructor: See the Preface for a list of sources of real data on which exercises can be assigned to complement the theoretical exercises below.

1. In our example in Section 5.2, assume μ₁ = 70, μ₂ = 66, σ = 4 and the distribution of height is normal in the two populations. Suppose we are predicting the height of a man who, unknown to us, has height 68. We hope to guess within two inches. Find P(|T1 − 68| < 2) and P(|T2 − 68| < 2) for various values of n.

2. In Section 4.2.16 we discussed simultaneous inference, the forming of confidence intervals whose joint confidence level was 95% or some other target value. The Kolmogorov-Smirnov confidence band in Section 5.3.2 allows us to compute infinitely many confidence intervals for FX(t) at different values of t, at a "price" of only 1.358. Still, if we are just estimating FX(t) at a single value of t, an individual confidence interval using (4.32) would be narrower than that given to us by Kolmogorov-Smirnov. Compare the widths of these two intervals in a situation in which the true value of FX(t) = 0.4.

Chapter 6

Statistical Relations Between Variables

6.1 The Goals: Prediction and Understanding

Prediction is difficult, especially when it's about the future. (Yogi Berra¹)

1 Yogi Berra (1925-) is a former baseball player and manager, famous for his malapropisms, such as "When you reach a fork in the road, take it"; "That restaurant is so crowded that no one goes there anymore"; and "I never said half the things I really said."

In this unit we are interested in relations between variables. Before beginning, it is important to understand the typical goals in analyzing such relations:

• Prediction: Here we are trying to predict one variable from one or more others.

• Understanding: Here we wish to determine which variables have a greater effect on a given variable.

Denote the predictor variables by X⁽¹⁾, ..., X⁽ʳ⁾. The variable to be predicted, Y, is often called the response variable. A common statistical methodology used for such analyses is called regression analysis. In the important special cases in which the response variable Y is an indicator variable,² taking on just the values 1 and 0 to indicate class membership, we call this the classification problem. (If we have more than two classes, we need several Ys.)

2 Sometimes called a dummy variable.

In the above context, we are interested in the relation of a single variable Y with other variables X⁽ⁱ⁾. But in some applications, we are interested in the more symmetric problem of relations among variables X⁽ⁱ⁾ (with there being no Y). A typical tool for the case of continuous random variables is principal components analysis, and a popular one for the discrete case is the log-linear model; both will be discussed later in this unit.

6.2 Example Applications: Software Engineering, Networks, Text Mining

Example: As an aid in deciding which applicants to admit to a graduate program in computer science, we might try to predict Y, a faculty rating of a student after completion of his/her first year in the program, from X⁽¹⁾ = the student's CS GRE score, X⁽²⁾ = the student's undergraduate GPA and various other variables. Here our goal would be Prediction, but educational researchers might do the same thing with the goal of Understanding. For an example of the latter, see Predicting Academic Performance in the School of Computing & Information Technology (SCIT), 35th ASEE/IEEE Frontiers in Education Conference, by Paul Golding and Sophia McNamarah, 2005.

Example: In a paper, Estimation of Network Distances Using Off-line Measurements, Computer Communications, by Danny Raz, Nidhan Choudhuri and Prasun Sinha, 2006, the authors wanted to predict Y, the round-trip time (RTT) for packets in a network, using the predictor variables X⁽¹⁾ = geographical distance between the two nodes, X⁽²⁾ = number of router-to-router hops, and other variables. The goal here is primarily Prediction.

Example: In a paper, Productivity Analysis of Object-Oriented Software Developed in a Commercial Environment, Software - Practice and Experience, by Thomas E. Potok, Mladen Vouk and Andy Rindos, 1999, the authors mainly had an Understanding goal: What impact, positive or negative, does the use of object-oriented programming have on programmer productivity? Here they predicted Y = number of person-months needed to complete the project, from X⁽¹⁾ = size of the project as measured in lines of code, X⁽²⁾ = 1 or 0 depending on whether an object-oriented or procedural approach was used, and other variables.
Example: Most text mining applications are classification problems. For example, the paper Untangling Text Data Mining, Proceedings of ACL'99, by Marti Hearst, 1999, cites, inter alia, an application in which the analysts wished to know what proportion of patents come from publicly funded research. They were using a patent database, which of course is far too huge to feasibly search by hand. That meant that they needed to be able to (reasonably reliably) predict Y = 1 or 0 according to whether the patent was publicly funded, from a number of X⁽ⁱ⁾, each of which was an indicator variable for a given key word, such as "NSF." They would then treat the predicted Y values as the real ones, and estimate their proportion from them.

6.3 Regression Analysis

6.3.1 What Does "Relationship" Really Mean?

Consider the Davis city population example again. In addition to the random variable W for weight, let H denote the person's height. Suppose we are interested in exploring the relationship between height and weight.

As usual, we must first ask, what does that really mean? What do we mean by "relationship"? Clearly, there is no exact relationship; for instance, we cannot exactly predict a person's weight from his/her height.

Intuitively, though, we would guess that mean weight increases with height. To state this precisely, take Y to be the weight W and X⁽¹⁾ to be the height H, and define

m_W;H(t) = E(W|H = t)   (6.1)

This looks abstract, but it is just common-sense stuff. For example, m_W;H(68) would be the mean weight of all people in the population of height 68 inches. The value of m_W;H(t) varies with t, and we would expect that a graph of it would show an increasing trend with t, reflecting that taller people tend to be heavier.

We call m_W;H the regression function of W on H. In general, m_Y;X(t) means the mean of Y among all units in the population for which X = t. Note the word population in that last sentence. The function m() is a population function.

Now, let's again suppose we have a random sample of 1000 people from Davis, with

(H₁,W₁), ..., (H₁₀₀₀,W₁₀₀₀)   (6.2)

being their heights and weights. We again wish to use this data to estimate population values. But the difference here is that we are estimating a whole function now, the whole curve m. That means we are estimating infinitely many values, with one m_W;H(t) value for each t.³ How do we do this?

3 Of course, the population of Davis is finite, but there is the conceptual population of all people who could live in Davis.

The traditional method is to choose a parametric model for the regression function. That way we estimate only a finite number of quantities instead of an infinite number. Typically the parametric model chosen is linear, i.e. we assume that m_W;H(t) is a linear function of t:

m_W;H(t) = ct + d   (6.3)

for some constants c and d. If this assumption is reasonable, meaning that though it may not be exactly true it is reasonably close, then it is a huge gain for us over a nonparametric model. Do you see why? Again, the answer is that instead of having to estimate an infinite number of quantities, we now must estimate only two quantities, the parameters c and d.

Equation (6.3) is thus called a parametric model of m_W;H(). The set of straight lines indexed by c and d is a two-parameter family, analogous to parametric families of distributions, such as the two-parameter gamma family; the difference, of course, is that in the gamma case we were modeling a density function, and here we are modeling a regression function. Note that c and d are indeed population parameters in the same sense that, for instance, r and λ are parameters in the gamma distribution family. We will see how to estimate c and d in Section 6.3.7.
6.3.2 Multiple Regression: More Than One Predictor Variable

Note that X and t could be vector-valued. For instance, we could have Y be weight and have X be the pair

X = (X⁽¹⁾, X⁽²⁾) = (H,A) = (height, age)   (6.4)

so as to study the relationship of weight with height and age. If we used a linear model, we would write

m_W;H,A(t) = β₀ + β₁t₁ + β₂t₂   (6.5)

In other words,

mean weight = β₀ + β₁ height + β₂ age   (6.6)

(It is traditional to use the Greek letter β to name the coefficients in a linear regression model.) So for instance m_W;H,A(68, 37.2) would be the mean weight in the population of all people having height 68 and age 37.2.

6.3.3 Interaction Terms

Equation (6.5) implicitly says that, for instance, the effect of age on weight is the same at all height levels. In other words, the difference in mean weight between 30-year-olds and 40-year-olds is the same regardless of whether we are looking at tall people or short people. To see that, just plug 40 and 30 for age in (6.5), with the same number for height in both, and subtract; you get 10β₂, an expression that has no height term.

If we feel that the assumption is not a good one (there are also data plotting techniques to help assess this), we can add an interaction term to (6.5), consisting of the product of the two original predictors. Our new predictor variable X⁽³⁾ is equal to X⁽¹⁾X⁽²⁾, and thus our regression function is

m_W;H,A(t) = β₀ + β₁t₁ + β₂t₂ + β₃t₁t₂   (6.7)

If you perform the same subtraction described above, you'll see that this more complex model does not assume, as the old one did, that the difference in mean weight between 30-year-olds and 40-year-olds is the same regardless of whether we are looking at tall people or short people.

Recall the study of object-oriented programming in Section 6.2. The authors there set X⁽³⁾ = X⁽¹⁾X⁽²⁾. The reader should make sure to understand that without this term, we are basically saying that the effect (whether positive or negative) of using object-oriented programming is the same for any code size.

Though the idea of adding interaction terms to a regression model is tempting, it can easily get out of hand. If we have k basic predictor variables, then there are C(k,2) potential two-way interaction terms, C(k,3) three-way terms and so on. Unless we have a very large amount of data, we run a big risk of overfitting (Section 6.3.9.1). And with so many interaction terms, the model would be difficult to interpret.
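In R's lm() function, to be introduced in Section 6.3.7.3, an interaction term like the one in (6.7) can be written directly in the model formula. In this sketch, df is a hypothetical data frame with columns weight, height and age:

lm(weight ~ height + age + height:age, data=df)
# the shorthand height*age expands to the same three predictor terms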
6.3.4 Nonrandom Predictor Variables

In our weight/height/age example above, all three variables are random. If we repeat the "experiment," i.e. we choose another sample of 1000 people, these new people will have different weights, different heights and different ages from the people in the first sample. But we must point out that the function m_Y;X makes sense even if X is nonrandom. To illustrate this, let's look at the ALOHA network example in our introductory unit on discrete probability, Section 1.1.

# simulation of simple form of slotted ALOHA

# a node is active if it has a message to send (it will never have more
# than one in this model), inactive otherwise

# the inactives have a chance to go active earlier within a slot, after
# which the actives (including those newly-active) may try to send; if
# there is a collision, no message gets through

# parameters of the system:
#    s = number of nodes
#    b = probability an active node refrains from sending
#    q = probability an inactive node becomes active

# parameters of the simulation:
#    nslots = number of slots to be simulated
#    nb = number of values of b to run; they will be evenly spaced in (0,1)

# will find mean message delay as a function of b

# we will rely on the "ergodicity" of this process, which is a Markov
# chain (see http://heather.cs.ucdavis.edu/~matloff/132/PLN/Markov.tex),
# which means that we look at just one repetition of observing the chain
# through many time slots

# main loop, running the simulation for many values of b
alohamain <- function(s,q,nslots,nb) {
   deltab <- 0.7 / nb  # we'll try nb values of b in (0.2,0.9)
   md <- matrix(nrow=nb,ncol=2)
   b <- 0.2
   for (i in 1:nb) {
      b <- b + deltab
      md[i,] <- alohasim(s,b,q,nslots)
   }
   return(md)
}

# simulate the process for nslots slots
alohasim <- function(s,b,q,nslots) {
   # status[i,1] = 1 or 0, for node i active or not
   # status[i,2] = if node i active, then epoch in which msg was created
   # (could try a list structure instead of a matrix)
   status <- matrix(nrow=s,ncol=2)
   # start with all active with msg created at time 0
   for (node in 1:s) status[node,] <- c(1,0)
   nsent <- 0  # number of successful transmits so far
   sumdelay <- 0  # total delay among successful transmits so far
   # now simulate the nslots slots
   for (slot in 1:nslots) {
      # check for new actives
      for (node in 1:s) {
         if (!status[node,1])  # inactive
            if (runif(1) < q) status[node,] <- c(1,slot)
      }
      # check for attempted transmissions
      ntrysend <- 0
      for (node in 1:s) {
         if (status[node,1])  # active
            if (runif(1) > b) {
               ntrysend <- ntrysend + 1
               whotried <- node
            }
      }
      if (ntrysend == 1) {  # something gets through iff exactly one tries
         # do our bookkeeping
         sumdelay <- sumdelay + slot - status[whotried,2]
         # this node now back to inactive
         status[whotried,1] <- 0
         nsent <- nsent + 1
      }
   }
   return(c(b,sumdelay/nsent))
}

A minor change is that I replaced the probability p, the probability that an active node sends, in the original example by b, the probability of not sending (b for "backoff"). Let A denote the time (measured in slots) between the creation of a message and the time it is successfully transmitted. We are interested in mean delay, i.e. the mean of A. We are particularly interested in the effect of b here on that mean. Our goal here, as described in Section 6.1, could be Prediction, so that we could have an idea of how much delay to expect in future settings. Or, we may wish to explore finding an optimal b, i.e. one that minimizes the mean delay, in which case our goal would be more in the direction of Understanding.
I ran the program with certain arguments, and then plotted the data:

> md <- alohamain(4,0.1,1000,100)
> plot(md,cex=0.5,xlab="b",ylab="A")

The plot is shown in Figure 6.1.

[Figure 6.1: Scatter Plot]

Note that though our values of b here are nonrandom, the A values are indeed random. To dramatize that point, I ran the program again. (Remember, unless you specify otherwise, R will use a different seed for its random number stream each time you run a program.) I've superimposed this second data set on the first, using filled circles this time to represent the points:

md2 <- alohamain(4,0.1,1000,100)
points(md2,cex=0.5,pch=19)

The plot is shown in Figure 6.2.

[Figure 6.2: Scatter Plot, Two Data Sets]

We do expect some kind of U-shaped relation, as seen here. For b too small, the nodes are clashing with each other a lot, causing long delays in message transmission. For b too large, we are needlessly backing off in many cases in which we actually would get through.

This looks like a quadratic relationship, meaning the following. Take our response variable Y to be A, take our first predictor X⁽¹⁾ to be b, and take our second predictor X⁽²⁾ to be b². Then when we say A and b have a quadratic relationship, we mean

m_A;b(t) = β₀ + β₁t + β₂t²   (6.8)

for some constants β₀, β₁, β₂. So, we are using a three-parameter family for our model of m_A;b. No model is exact, but our data seem to indicate that this one is reasonably good, and if further investigation confirms that, it provides for a nice compact summary of the situation. Again, we'll see how to estimate the βᵢ in Section 6.3.7.

We could also try adding two more predictor variables, consisting of X⁽³⁾ = q and X⁽⁴⁾ = s. We would collect more data, in which we varied the values of q and s, and then could entertain the model

m_A;b,q,s = β₀ + β₁b + β₂b² + β₃q + β₄s   (6.9)

6.3.5 Prediction

So, we've taken our data on weight/height/age, and estimated the function m using that data, yielding m̂. Now, a new person comes in, of height 70.4 and age 24.8. What should we predict his weight to be? The answer is that we predict his weight to be our estimated mean weight for his height/age group,

m̂_W;H,A(70.4, 24.8)   (6.10)

If our model is (6.5), then (6.10) is

m̂_W;H,A(70.4, 24.8) = β̂₀ + β̂₁ · 70.4 + β̂₂ · 24.8   (6.11)

where the β̂ᵢ are estimated from our data as in Section 6.3.7 below.

6.3.6 Optimality of the Regression Function

In predicting Y from X (with X random), we might assess our predictive ability by the mean squared prediction error (MSPE):

MSPE = E[(Y − w(X))²]   (6.12)

where w is some function we will use to form our prediction for Y based on X. What w is best, i.e. which w minimizes MSPE? To answer this question, condition on X in (6.12):

MSPE = E[E{(Y − w(X))² | X}]   (6.13)

Theorem 8 The best w is m, i.e. the best way to predict Y from X is to "plug in" X in the regression function.

We need this lemma:

Lemma 9 For any random variable Z, the constant c which minimizes

E[(Z − c)²]   (6.14)

is

c = EZ   (6.15)

Proof Expand (6.14) to

E(Z²) − 2cEZ + c²   (6.16)

and use calculus to find the best c.

Now apply the lemma to the inner expectation in (6.13), with Z being Y and c being some function of X. The minimizing value is EZ, i.e. E(Y|X), since our expectation here is conditional on X. All of this tells us that the best function w in (6.12) is m_Y;X. This proves the theorem.
All of this tells us that the best function w in (6.12) is $m_{Y;X}$. This proves the theorem.

6.3.7 Parametric Estimation of Linear Regression Functions

6.3.7.1 Meaning of "Linear"

Here we model $m_{Y;X}$ as a linear function of $X^{(1)}, ..., X^{(r)}$:

m_{Y;X}(t) = \beta_0 + \beta_1 t^{(1)} + ... + \beta_r t^{(r)}     (6.17)

Note that the term linear regression does NOT necessarily mean that the graph of the regression function is a straight line or a plane. Instead, the word linear refers to the regression function being linear in the parameters. So, for instance, (6.8) is a linear model; if for example we multiply $\beta_0$, $\beta_1$ and $\beta_2$ by 8, then m is multiplied by 8.

6.3.7.2 Point Estimates and Matrix Formulation

So, how do we estimate the $\beta_i$? Look for instance at (6.8). Keep in mind that in (6.8), the $\beta_i$ are population values. We need to estimate them from our data. How do we do that?

Let $(b_i, A_i)$ denote the ith pair from the simulation. In the program, this is md[i,]. Our estimated parameters will be denoted by $\hat{\beta}_i$. Using the result of Section 6.3.6 as a guide, the estimation methodology involves finding the values of the $\hat{\beta}_i$ which minimize the sum of squared differences between the actual A values and their predicted values:

\sum_{i=1}^{100} [A_i - (\hat{\beta}_0 + \hat{\beta}_1 b_i + \hat{\beta}_2 b_i^2)]^2     (6.18)

Obviously, this is a calculus problem. We set the partial derivatives of (6.18) with respect to the $\hat{\beta}_i$ to 0, giving us three linear equations in three unknowns, and then solve.

For the general case (6.17), we have r+1 equations in r+1 unknowns. This is most conveniently expressed in matrix terms. Let $X_i^{(j)}$ be the value of $X^{(j)}$ for the ith observation in our sample, and let $Y_i$ be the corresponding Y value. Plugging this data into (6.17), we have

E(Y_i | X_i^{(1)}, ..., X_i^{(r)}) = \beta_0 + \beta_1 X_i^{(1)} + ... + \beta_r X_i^{(r)},   i = 1, ..., n     (6.19)

That's a system of n linear equations, which from your linear algebra class you know can be represented more compactly by a matrix. That would be

E(V|Q) = Q\beta     (6.20)

where (with ' denoting matrix transpose, and a vector without a ' being a row vector)

V = (Y_1, ..., Y_n)'     (6.21)

\beta = (\beta_0, \beta_1, ..., \beta_r)'     (6.22)

and Q is the n x (r+1) matrix whose (i,j) element is $X_i^{(j-1)}$, with $X_i^{(0)}$ taken to be 1. For instance, if we are predicting weight from height and age, then row 5 of Q would consist of a 1, then the height and age of the fifth person in our sample.

Now to estimate the $\beta_i$, let

\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_r)'     (6.23)

Then it can be shown that, after all the partial derivatives are taken and set to 0, the solution is

\hat{\beta} = (Q'Q)^{-1} Q'V     (6.24)

6.3.7.3 Back to Our ALOHA Example

R or any other statistical package does the work for us. In R, we can use the lm() ("linear model") function:

> md <- cbind(md,md[,1]^2)
> lmout <- lm(md[,2] ~ md[,1] + md[,3])

First I added a new column to the data matrix, consisting of $b^2$. I then called lm(), with the argument

md[,2] ~ md[,1] + md[,3]

R documentation calls this model specification argument the formula. It states that I wish to use the first and third columns of md, i.e. b and $b^2$, as predictors, and use A, i.e. the second column, as the response variable.⁴

⁴Unfortunately, R did not allow me to put the squared column directly into the formula, forcing me to use cbind() to make a new matrix.

The return value from this call, which I've stored in lmout, is an object of class lm. One of the member variables of that class, coefficients, is the vector $\hat{\beta}$:

> lmout$coefficients
(Intercept)     md[, 1]     md[, 3]
   27.56852   -90.72585    79.98616

So, $\hat{\beta}_0$ = 27.57 and so on. The result is

\hat{m}_{A;b}(t) = 27.57 - 90.73t + 79.99t^2     (6.25)
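As a check on our understanding of (6.24), we can compute $\hat{\beta}$ directly with matrix operations; a minimal sketch, assuming md is the three-column matrix (b, A, $b^2$) built above:

# form Q by prepending a column of 1s to the predictor columns b and b^2
Q <- cbind(1, md[,1], md[,3])
V <- md[,2]
# betahat = (Q'Q)^{-1} Q'V, as in (6.24); solve() inverts the 3x3 matrix
betahat <- solve(t(Q) %*% Q) %*% t(Q) %*% V
betahat  # should match lmout$coefficients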
Another member variable of the lm class is fitted.values. This is the "fitted curve," meaning the values of (6.25) at $b_1, ..., b_{100}$. I plotted this curve on the same graph,

> lines(cbind(md[,1],lmout$fitted.values))

See Figure 6.3. As you can see, the fit looks fairly good. What should we look for? Remember, we don't expect the curve to go through the points; we are estimating the mean of A for each b, not the A values themselves. There is always variation around the mean. If for instance we are looking at the relationship between people's heights and weights, the mean weight for people of height 70 inches might be, say, 160 pounds, but we know that some 70-inch-tall people weigh more than this and some weigh less.

Figure 6.3: Quadratic Fit Superimposed

However, there seems to be a tendency for our estimates $\hat{m}_{A;b}(t)$ to be too low for values in the middle range of t, and possibly too high for t around 0.3 or 0.4. However, with a sample size of only 100, it's difficult to tell. It's always important to keep in mind that the data are random; a different sample may show somewhat different patterns. Nevertheless, we should consider a more complex model.

So I tried a quartic, i.e. fourth-degree, polynomial model. I added third- and fourth-power columns to md, calling the result md4, and invoked the call

lm(md4[,2] ~ md4[,1] + md4[,3] + md4[,4] + md4[,5])

The result was

> lmout$coefficients
(Intercept)    md4[, 1]    md4[, 3]    md4[, 4]    md4[, 5]
   95.98882  -664.02780  1731.90848 -1973.00660   835.89714

In other words, we have an estimated regression function of

\hat{m}_{A;b}(t) = 95.98882 - 664.02780t + 1731.90848t^2 - 1973.00660t^3 + 835.89714t^4     (6.26)

Figure 6.4: Fourth Degree Fit Superimposed

The fit is shown in Figure 6.4. It looks much better. On the other hand, we have to worry about overfitting. We return to this issue in Section 6.3.9.1.
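As with the quadratic fit, the quartic curve can be superimposed on the scatter plot with lines(); a one-line sketch, assuming the quartic fit was assigned to lmout as above:

> lines(cbind(md4[,1],lmout$fitted.values))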
6.3.7.4 Approximate Confidence Intervals

As usual, we should not be satisfied with just point estimates, in this case the $\hat{\beta}_i$. We need an indication of how accurate they are, so we need confidence intervals. In other words, we need to use the $\hat{\beta}_i$ to form confidence intervals for the $\beta_i$.

For instance, recall the study on object-oriented programming in Section 6.1. The goal there was primarily Understanding, specifically assessing the impact of OOP. That impact is measured by $\beta_2$. Thus, we want to find a confidence interval for $\beta_2$.

Equation (6.24) shows that the $\hat{\beta}_i$ are linear combinations of the components of V, i.e. the $Y_j$. So, the Central Limit Theorem implies that the $\hat{\beta}_i$ are approximately normally distributed. That in turn means that, in order to form confidence intervals, we need standard errors for the $\hat{\beta}_i$. How will we get them?

Note carefully that so far we have made NO assumptions other than (6.17). Now, though, we need to add an assumption:⁵

Var(Y|X = t) = \sigma^2     (6.27)

for all t. Note that this and the independence of the sample observations (e.g. the various people sampled in the Davis height/weight example are independent of each other) imply that

Cov(V|Q) = \sigma^2 I     (6.28)

where I is the usual identity matrix (1s on the diagonal, 0s off the diagonal).

⁵Actually, we could derive some usable, though messy, standard errors without this assumption.

Be sure you understand what this means. In the Davis weights example, for instance, it means that the variance of weight among 72-inch-tall people is the same as that for 65-inch-tall people. That is not quite true (the taller group has larger variance), but it's probably accurate enough for our purposes here.

Keep in mind that the derivation below is conditional on the $X_i^{(j)}$, which is the standard approach, especially since there is the case of nonrandom X. Thus we will later get conditional confidence intervals, which is fine. To avoid clutter, I will sometimes not show the conditioning explicitly, and thus for instance will write Cov(V) instead of Cov(V|Q).

We can derive the covariance matrix of $\hat{\beta}$ as follows. First, one can easily derive that for any m x 1 random vector M and constant (i.e. nonrandom) matrix c with m columns,

Cov(cM) = c Cov(M) c'     (6.29)

Also, one can show that the transpose of the product of two matrices is the reverse product of the transposes. In (6.29), set c = (Q'Q)^{-1}Q' and M = V. Then from (6.24),

Cov(\hat{\beta}) = [(Q'Q)^{-1}Q'] Cov(V) [(Q'Q)^{-1}Q']'     (6.30)
              = (Q'Q)^{-1}Q' (\sigma^2 I) Q [(Q'Q)^{-1}]'     (6.31)
              = \sigma^2 (Q'Q)^{-1}     (6.32)

Here we have used the fact that Q'Q is a symmetric matrix, which implies the same property for its inverse.

Whew! That's a lot of work for you, if your linear algebra is rusty. But it's worth it, because (6.32) now gives us what we need for confidence intervals. Here's how:

First, we need to estimate $\sigma^2$. Recalling that for any random variable U, Var(U) = E[(U - EU)^2], we have

\sigma^2 = Var(Y|X = t)     (6.33)
        = Var(Y|X^{(1)} = t_1, ..., X^{(r)} = t_r)     (6.34)
        = E[ \{Y - m_{Y;X}(t)\}^2 ]     (6.35)
        = E[ (Y - \beta_0 - \beta_1 t_1 - ... - \beta_r t_r)^2 ]     (6.36)

Thus, a natural estimate for $\sigma^2$ would be the sample analog, where we replace E() by averaging over our sample, and replace population quantities by sample estimates:

s^2 = \frac{1}{n} \sum_{i=1}^{n} [Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i^{(1)} + ... + \hat{\beta}_r X_i^{(r)})]^2     (6.37)

So, the estimated covariance matrix for $\hat{\beta}$ is

\widehat{Cov}(\hat{\beta}) = s^2 (Q'Q)^{-1}     (6.38)

6.3.7.5 Once Again, Our ALOHA Example

In R we can obtain (6.38) via the generic function vcov():

> vcov(lmout)
            (Intercept)    md4[, 1]    md4[, 3]    md4[, 4]    md4[, 5]
(Intercept)    92.73734   -794.4755    2358.860   -2915.238    1279.981
md4[, 1]     -794.47553   6896.8443  -20705.705   25822.832  -11422.355
md4[, 3]     2358.86046 -20705.7047   62804.912  -79026.086   35220.412
md4[, 4]    -2915.23828  25822.8320  -79026.086  100239.652  -44990.271
md4[, 5]     1279.98125 -11422.3550   35220.412  -44990.271   20320.809

What is this telling us? For instance, it is saying that the (4,4) position in the matrix (6.38), numbering rows and columns from 0 to 4 to match $\beta_0$ through $\beta_4$, is equal to 20320.809, so the standard error of $\hat{\beta}_4$ is the square root of this, 142.6. Thus an approximate 95% confidence interval for the true population $\beta_4$ is

835.89714 \pm 1.96 \cdot 142.6 = (556.4, 1115.4)     (6.39)

That interval is quite wide. Remember what this tells us: our sample of size 100 is not very large. On the other hand, the interval is quite far from 0, which indicates that our fourth-degree model is legitimately better than our quadratic one.

By the way, applying the R function summary() to a linear model object such as lmout here gives standard errors for the $\hat{\beta}_i$ and lots of other information.
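The interval computation (6.39) can also be done directly from the lm object; a minimal sketch:

# standard error of betahat_4: square root of the last diagonal element
se4 <- sqrt(vcov(lmout)[5,5])
# approximate 95% confidence interval for beta_4, as in (6.39)
coef(lmout)[5] + c(-1.96,1.96) * se4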
6.3.7.6 Estimation Vs. Prediction

In statistical parlance, there is a keen distinction made between the words estimation and prediction. To explain this, let's again consider the example of predicting Y = weight from X = (height, age). Say we have someone of height 67 inches and age 27, and want to guess, i.e. predict, her weight. From Section 6.3.6, we know that the best prediction is m[(67,27)]. However, we do not know the value of that quantity, so we must estimate it from our data. So, our predicted value for this person's weight will be $\hat{m}[(67,27)]$, i.e. our estimate for the value of the regression function at the point (67,27).

6.3.7.7 Exact Confidence Intervals

Note carefully that we have not assumed that Y, given X, is normally distributed. In the height/weight context, for example, such an assumption would mean that weights in a specific height subpopulation, say all people of height 70 inches, have a normal distribution.

If we do make such an assumption, then we can get exact confidence intervals (which of course only hold if we really do have an exact normal distribution in the population). This again uses Student-t distributions. In that analysis, $s^2$ has n-(r+1) in its denominator instead of our n, just as there was n-1 in the denominator for $s^2$ when we estimated a single population variance. The number of degrees of freedom in the Student-t distribution is likewise n-(r+1). But as before, for even moderately large n, it doesn't matter.

6.3.8 The Famous "Error Term" (advanced topic)

Books on regression analysis (and there are hundreds, if not thousands, of these) generally introduce the subject as follows. They consider the linear case with r = 1, and write

Y = \beta_0 + \beta_1 X + \epsilon,   E\epsilon = 0     (6.40)

with $\epsilon$ being independent of X. They also assume that $\epsilon$ has a normal distribution with variance $\sigma^2$.

Let's see how this compares to what we have been assuming here so far. In the linear case with r = 1, we would write

m_{Y;X}(t) = E(Y|X = t) = \beta_0 + \beta_1 t     (6.41)

Note that in our context, we would define $\epsilon$ as

\epsilon = Y - m_{Y;X}(X)     (6.42)

Equation (6.40) is consistent with (6.41): The former has $E\epsilon = 0$, and so does the latter, since

E\epsilon = EY - E[m_{Y;X}(X)] = EY - E[E(Y|X)] = EY - EY = 0     (6.43)

In order to produce confidence intervals, we later added the assumption (6.27), which you can see is consistent with (6.40), since the latter assumes that $Var(\epsilon) = \sigma^2$ no matter what value X has.

Now, what about the normality assumption in (6.40)? That would be equivalent to saying that in our context, the conditional distribution of Y given X is normal, which is an assumption we did not make. Note that in the weight/height example, this assumption would say that, for instance, the distribution of weights among people of height 68.2 inches is normal.

No matter what the context is, the variable $\epsilon$ is called the error term. Originally this was an allusion to measurement error, e.g. in chemistry experiments, but the modern interpretation would be prediction error, i.e. how much error we make when we use $m_{Y;X}(t)$ to predict Y.
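To make the classical formulation concrete, here is a small simulation of (6.40); a minimal sketch, with parameter values of our own choosing:

# simulate Y = beta0 + beta1*X + eps, with eps normal, mean 0, independent of X
n <- 100
x <- runif(n)
eps <- rnorm(n, sd=2)           # the "error term"
y <- 3 + 1.5*x + eps            # beta0 = 3, beta1 = 1.5, arbitrary choices
lm(y ~ x)$coefficients          # estimates should be near (3, 1.5)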
6.3.9 Model Selection

The issues raised in Chapter 5 become crucial in regression and classification problems. In this unit, we will typically deal with models having large numbers of parameters. A central principle will be that simpler models are preferable, provided of course they are accurate. Hence the Einstein quote above. Simpler models are often called parsimonious.

Here I use the term model selection to mean deciding which predictor variables we will use. If we have data on many predictors, we almost certainly will not be able to use them all, for the following reason:

6.3.9.1 The Overfitting Problem in Regression

Recall (6.8). There we assumed a second-degree polynomial for $m_{A;b}$. Why not a third-degree, or fourth, and so on? You can see that if we carry this notion to its extreme, we get absurd results. If we fit a polynomial of degree 99 to our 100 points, we can make our fitted curve exactly pass through every point! This clearly would give us a meaningless, useless curve. We are simply fitting the noise.

Recall that we analyzed this problem in Section 5.2.3 in our unit on modeling. There we noted an absolutely fundamental principle in statistics:

In choosing between a simpler model and a more complex one, the latter is more accurate only if either

• we have enough data to support it, or
• the complex model is sufficiently different from the simpler one

This is extremely important in regression analysis. For example, look at our regression model for A against b in the ALOHA simulation in earlier sections. We did analyses for a simpler model, a quadratic polynomial, and a more complex model, a quartic (polynomial of degree 4). Rephrasing the above points in this context, we would say,

In choosing between the quadratic and quartic models, the latter is more accurate only if either

• we have enough data to support it, or
• at least one of the coefficients $\beta_3$ and $\beta_4$ is quite different from 0

In the weight/height/age example in Section 6.3.1, this would be phrased as

In deciding whether to predict from height only, versus from both height and age, the latter is more accurate only if either

• we have enough data to support it, or
• the coefficient $\beta_2$ is quite different from 0

If we use too many predictor variables,⁶ our data is "diluted," by being "shared" by so many $\hat{\beta}_i$. As a result, $Var(\hat{\beta}_i)$ will be large, with big implications: Whether our goal is Prediction or Understanding, our estimates will be so poor that neither goal is achieved.

⁶In the ALOHA example above, b, b², b³ and b⁴ are separate predictors, even though they are of course correlated.

The questions raised in turn by the above considerations, i.e. How much data is enough data?, and How different from 0 is "quite different"?, are addressed below in Section 6.3.9.2.

A detailed mathematical example of overfitting in regression is presented in my paper A Careful Look at the Use of Statistical Methodology in Data Mining (book chapter), by N. Matloff, in Foundations of Data Mining and Granular Computing, edited by T.Y. Lin, Wesley Chu and L. Matzlack, Springer-Verlag Lecture Notes in Computer Science, 2005.
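To see the fitting-the-noise phenomenon concretely on a small scale, here is a quick simulation of our own (not from the text): with n points, a polynomial of degree n-1 can pass through every point exactly.

set.seed(9999)   # arbitrary seed, for reproducibility
n <- 15
x <- seq(0.2, 0.9, length.out=n)
y <- 2 - 3*x + 4*x^2 + rnorm(n, sd=0.5)   # the true relation is quadratic
fit2 <- lm(y ~ poly(x, 2))     # matches the truth
fit14 <- lm(y ~ poly(x, 14))   # one parameter per data point
sum(fit14$residuals^2)   # essentially 0: the curve hits every point, noise and all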
6.3.9.2 Methods for Predictor Variable Selection

So, we typically must discard some, maybe many, of our predictor variables. In the weight/height/age example, we may need to discard the age variable. In the ALOHA example, we might need to discard b⁴ and even b³. How do we make these decisions?

Note carefully that this is an unsolved problem. If anyone ever claims they have a foolproof way to do this, then they do not understand the problem in the first place. Entire books have been written on this subject (e.g. Subset Selection in Regression, by Alan Miller, pub. by Chapman and Hall, 2002), discussing myriad different methods, but again, none of them is foolproof.

Most of the methods for variable selection use hypothesis testing in one form or another. Typically this takes the form

H_0: \beta_i = 0     (6.44)

In the context of (6.6), this would mean testing

H_0: \beta_2 = 0     (6.45)

If we reject $H_0$, then we use the age variable; otherwise we discard it.

I hope I've convinced you that this is not a good idea. As usual, the hypothesis test is asking the wrong question. For instance, in the weight/height/age example, the test is asking whether $\beta_2$ is zero or not, whereas what we want to know is whether $\beta_2$ is far enough from 0 for age to give us better predictions of weight. Those are two very, very different questions.

A very interesting example of overfitting using real data may be found in the paper Honest Confidence Intervals for the Error Variance in Stepwise Regression, by Foster and Stine, www-stat.wharton.upenn.edu/~stine/research/honests2.pdf. The authors, of the University of Pennsylvania Wharton School, took real financial data and deliberately added a number of extra "predictors" that were in fact random noise, independent of the real data. They then tested the hypothesis (6.44). They found that each of the fake predictors was "significantly" related to Y! This illustrates both the dangers of hypothesis testing and the possible need for multiple inference procedures.⁷ This problem has always been known by thinking statisticians, but the Wharton study certainly dramatized it.

⁷They added so many predictors that r became greater than n. However, the problems they found would have been there to a large degree even if r were less than n but r/n was substantial.

Well, then, what can be done instead? First, there is the same alternative to hypothesis testing that we discussed before: confidence intervals. We saw an example of that in (6.39). Granted, the interval was very wide, telling us that it would be nice to have more data. But even the lower bound of that interval is far from zero, so it looks pretty safe to use b⁴ as a predictor.

Moreover, a confidence interval for $\beta_j$ tells us whether the variable $X^{(j)}$ would have much value as a predictor. Once again, consider the weight/height/age example. Suppose our confidence interval for $\beta_2$ is (0.04, 0.56). That would say that, for instance, a 10-year difference in age only makes about half a pound difference in mean weight, in which case age would be of almost no value in predicting weight.

A method that enjoys some popularity in certain circles is the Akaike Information Criterion (AIC). It uses a formula, backed by some theoretical analysis, which creates a tradeoff between richness of the model and size of the standard errors of the $\hat{\beta}_i$. The R statistical package includes a function AIC() for this, which is used by step() in the regression case.

The most popular alternative to hypothesis testing for variable selection today is probably cross validation. Here we split our data into a training set, which we use to estimate the $\beta_i$, and a validation set, in which we see how well our fitted model predicts new data, say in terms of average squared prediction error. We do this for several models, i.e. several sets of predictors, and choose the one which does best in the validation set. I like this method very much, though I often simply stick with confidence intervals.

A rough rule of thumb is that one should have $r < \sqrt{n}$.⁸

⁸Asymptotic Behavior of Likelihood Methods for Exponential Families When the Number of Parameters Tends to Infinity, Stephen Portnoy, Annals of Statistics, 1988.
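Here is one way the cross-validation comparison of our quadratic and quartic ALOHA models might look; a minimal sketch (the split and all object names are our own choices), assuming the matrix md4 from Section 6.3.7.3:

# randomly split the 100 observations into training and validation sets
set.seed(12345)   # arbitrary
trainidx <- sample(1:100, 50)
train <- md4[trainidx,]
valid <- md4[-trainidx,]
# fit the quadratic model on the training set only
fit2 <- lm(train[,2] ~ train[,1] + train[,3])
# predict the validation set "by hand" from the estimated coefficients
pred2 <- cbind(1, valid[,1], valid[,3]) %*% coef(fit2)
mean((valid[,2] - pred2)^2)   # average squared prediction error, quadratic
# same for the quartic model
fit4 <- lm(train[,2] ~ train[,1] + train[,3] + train[,4] + train[,5])
pred4 <- cbind(1, valid[,1], valid[,3], valid[,4], valid[,5]) %*% coef(fit4)
mean((valid[,2] - pred4)^2)   # choose the model with the smaller error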
6.3.10 Nonlinear Parametric Regression Models

We pointed out in Section 6.3.7.1 that the word linear in linear regression model means linear in $\beta$, not in t. This is the most popular approach, as it is computationally easy, but nonlinear models are often used.

The most famous of these is the logistic model, for the case in which Y takes on only the values 0 and 1. As we have seen before, in this case the expected value becomes a probability. The logistic model for a nonvector X is then

m_{Y;X}(t) = P(Y = 1|X = t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t)}}     (6.46)

It extends to the case of vector-valued X in the obvious way.

The logistic model is quite widely used in computer science, in medicine, economics, psychology and so on.

Here is an example of a nonlinear model used in the kinetics of chemical reactions, with r = 3:⁹

m_{Y;X}(t) = \frac{\beta_1 t^{(2)} - t^{(3)}/\beta_5}{1 + \beta_2 t^{(1)} + \beta_3 t^{(2)} + \beta_4 t^{(3)}}     (6.47)

Here the X vector is (hydrogen, n-pentane, isopentane)'.

⁹See http://www.mathworks.com/index.html?scid=docframe_homepage.

Unfortunately, in most cases, the least-squares estimates of the parameters in nonlinear regression do not have closed-form solutions, and numerical methods must be used. But R does that for you, via the nls() function in general, and via glm() for the logistic and related models in particular.

6.3.11 Nonparametric Estimation of Regression Functions

In some applications, there may be no obvious parametric model for $m_{Y;X}$. Or, we may have a parametric model that we are considering, but we would like to have some kind of nonparametric estimation method available as a means of checking the validity of our parametric model. So, how do we estimate a regression function nonparametrically?

To guide our intuition on this, let's turn again to the Davis example of the relationship between height and weight. Consider estimation of the quantity $m_{W;H}(68.2)$, the population mean weight of all people of height 68.2. We could take our estimate $\hat{m}_{W;H}(68.2)$ to be the average weight of all the people in our sample who have that height. But we may have very few people of that height, so that our estimate may have a high variance, i.e. may not be very accurate.

What we could do instead is to take the mean weight of all the people in our sample whose heights are near 68.2, say between 67.7 and 68.7. That would bias things a bit, but we'd get a lower variance. All nonparametric regression methods work like this, though with many variations.

As our definition of "near," we could take all people in our sample whose heights are within h amount of 68.2. This should remind you of our density estimators in Section 4.6 of our unit on estimation and testing. As we saw there, a generalization would be to use a kernel method. For instance, for univariate X and t:

\hat{m}_{Y;X}(t) = \frac{\sum_{i=1}^{n} Y_i \, k\left(\frac{t - X_i}{h}\right)}{\sum_{i=1}^{n} k\left(\frac{t - X_i}{h}\right)}     (6.48)

There is an R package that includes a function nkreg() to do this. The R base has a similar method, called LOESS. Note: That is the method name, but the R function is called lowess().

Other types of nonparametric methods include Classification and Regression Trees (CART), nearest-neighbor methods, support vector machines, splines, etc.
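In the ALOHA setting, for instance, a nonparametric fit takes just one call; a minimal sketch using base R's lowess(), with md as before:

# smooth A against b with no parametric model assumed;
# f controls the fraction of the data treated as "near" each point
plot(md[,1], md[,2], xlab="b", ylab="A")
lines(lowess(md[,1], md[,2], f=2/3))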
6.3.12 Regression Diagnostics

Researchers in regression analysis have devised some diagnostic methods, meaning methods to check the fit of a model, the validity of assumptions [e.g. (6.27)], search for data points that may have an undue influence (and may actually be in error), and so on. The R package has tons of diagnostic methods. See for example Chapter 4 of Linear Models with R, Julian Faraway, Chapman and Hall, 2005.

6.3.13 Nominal Variables

Recall our example in Section 6.2 concerning a study of software engineer productivity. To review, the authors of the study predicted Y = number of person-months needed to complete the project, from $X^{(1)}$ = size of the project as measured in lines of code, $X^{(2)}$ = 1 or 0 depending on whether an object-oriented or procedural approach was used, and other variables. As mentioned at the time, $X^{(2)}$ is called an indicator variable.

Let's generalize that a bit. Suppose we are comparing two different object-oriented languages, C++ and Java, as well as the procedural language C. Then we could change the definition of $X^{(2)}$ to have the value 1 for C++ and 0 for non-C++, and we could add another variable, $X^{(3)}$, which has the value 1 for Java and 0 for non-Java. Use of the C language would be implied by the situation $X^{(2)} = X^{(3)} = 0$.

Here we are dealing with a nominal variable, Language, which has three values, C++, Java and C, and representing it by the two indicator variables $X^{(2)}$ and $X^{(3)}$. Note that we do NOT want to represent Language by a single variable having the values 0, 1 and 2, which would imply that C has, for instance, double the impact of Java.

You can see that if a nominal variable takes on q values, we need q-1 indicator variables to represent it. We say that the variable has q levels.
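As an illustration, here is how the two indicators might be built in R; a minimal sketch with made-up data (the vector lang and all names are hypothetical):

lang <- c("C++","Java","C","Java","C++")   # hypothetical raw data
x2 <- as.numeric(lang == "C++")    # 1 for C++, 0 otherwise
x3 <- as.numeric(lang == "Java")   # 1 for Java, 0 otherwise
# C is implied by x2 = x3 = 0; note only q-1 = 2 indicators for q = 3 levels
cbind(x2, x3)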
6.3.14 The Case in Which All Predictors Are Nominal Variables: "Analysis of Variance"

Continuing the ideas in Section 6.3.13, suppose in the software engineering study they had kept the project size constant, and instead of $X^{(1)}$ being project size, this variable recorded whether the programmer uses an integrated development environment (IDE). Say $X^{(1)}$ is 1 or 0, depending on whether the programmer uses the Eclipse IDE or no IDE, respectively. Continue to assume the study included the nominal Language variable, i.e. assume the study included the indicator variables $X^{(2)}$ (C++) and $X^{(3)}$ (Java). Now all of our predictors would be nominal/indicator variables. Regression analysis in such settings is called analysis of variance (ANOVA).

Each nominal variable is called a factor. So, in our software engineering example, the factors are IDE and Language. Note again that in terms of the actual predictor variables, each factor is represented by one or more indicator variables; here IDE has one indicator variable and Language has two.

Analysis of variance is a classic statistical procedure, used heavily in agriculture, for example. We will not go into details here, but mention it briefly both for the sake of completeness and for its relevance to Sections 6.3.3 and 6.6. (The reader is strongly advised to review Section 6.3.3 before continuing.)

6.3.14.1 It's a Regression!

The term analysis of variance is a misnomer. A more appropriate name would be analysis of means, as it is in fact a regression analysis, as follows.

First, note that in our software engineering example we basically are talking about six groups, because there are six different combinations of values for the triple $(X^{(1)}, X^{(2)}, X^{(3)})$. For instance, the triple (1,0,1) means that the programmer is using an IDE and programming in Java. Note that triples of the form (w,1,1) are impossible.

So, all that is happening here is that we have six groups with six means. But that is a regression! Remember, for variables U and V, $m_{V;U}(t)$ is the mean of all values of V in the subpopulation group of people (or cars or whatever) defined by U = t. If U is a continuous variable, then we have infinitely many such groups, thus infinitely many means. In our software engineering example, we only have six groups, but the principle is the same. We can thus cast the problem in regression terms:

m_{Y;X}(i,j,k) = E(Y | X^{(1)} = i, X^{(2)} = j, X^{(3)} = k),   i,j,k = 0,1,   j+k \leq 1     (6.49)

Note the restriction $j + k \leq 1$, which reflects the fact that j and k can't both be 1.

Again, keep in mind that we are working with means. For instance, $m_{Y;X}(0,1,0)$ is the population mean project completion time for the programmers who do not use Eclipse and who program in C++.

Since the triple (i,j,k) can take on only six values, m can be modeled fully generally in the following six-parameter linear form:

m_{Y;X}(i,j,k) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 ij + \beta_5 ik     (6.50)

where $\beta_4$ and $\beta_5$ are the coefficients of two interaction terms, as in Section 6.3.3.

6.3.14.2 Interaction Terms

It is crucial to understand the interaction terms. Without the ij and ik terms, for instance, our model would be

m_{Y;X}(i,j,k) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k     (6.51)

which would mean (as in Section 6.3.3) that the difference between using Eclipse and no IDE is the same for all three programming languages, C++, Java and C. That common difference would be $\beta_1$. If this condition (that the impact of using an IDE is the same across languages) doesn't hold, at least approximately, then we would use the full model, (6.50). More on this below.

Note carefully that there is no interaction term corresponding to jk, since that quantity is 0, and thus there is no three-way interaction term corresponding to ijk either.

But suppose we add a third factor, Education, represented by the indicator $X^{(4)}$, having the value 1 if the programmer has at least a Master's degree, 0 otherwise. Then m would take on 12 values, and the full model would have 12 parameters:

m_{Y;X}(i,j,k,l) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 l + \beta_5 ij + \beta_6 ik + \beta_7 il + \beta_8 jl + \beta_9 kl + \beta_{10} ijl + \beta_{11} ikl     (6.52)

Again, there would be no ijkl term, as jk = 0.

Here $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are called the main effects, as opposed to the coefficients of the interaction terms, called of course the interaction effects.

The no-interaction version would be

m_{Y;X}(i,j,k,l) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 l     (6.53)

6.3.14.3 Now Consider Parsimony

In the three-factor example above, we have 12 groups and 12 means. Why not just treat it that way, instead of applying the powerful tool of regression analysis? The answer lies in our desire for parsimony, as noted in Section 6.3.9.1.

If for example (6.53) were to hold, at least approximately, we would have a far more satisfying model. We could for instance then talk of "the" effect of using an IDE, rather than qualifying such a statement by stating what the effect would be for each different language and education level. Moreover, if our sample size is not very large, we would get more accurate estimates of the various subpopulation means.

Or it could be that, while (6.53) doesn't hold, a model with only two-way interactions,

m_{Y;X}(i,j,k,l) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 l + \beta_5 ij + \beta_6 ik + \beta_7 il + \beta_8 jl + \beta_9 kl     (6.54)

does work well. This would not be as nice as (6.53), but it still would be more parsimonious than (6.52).

Accordingly, the major thrust of ANOVA is to decide how rich a model is needed to do a good job of describing the situation under study. There is an implied hierarchy of models of interest here:

• the full model, including two- and three-way interactions, (6.52)
• the model with two-factor interactions only, (6.54)
• the no-interaction model, (6.53)
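Any model in this hierarchy can be fit with lm(); a minimal sketch for the two-factor full model (6.50), simulating hypothetical data (all names and parameter values are our own):

n <- 60
ide <- rbinom(n, 1, 0.5)                        # 1 = uses Eclipse, 0 = no IDE
lang <- sample(c("C++","Java","C"), n, replace=TRUE)
cpp <- as.numeric(lang == "C++")
java <- as.numeric(lang == "Java")
y <- 10 - 2*ide + cpp + 3*java + rnorm(n)       # arbitrary true group means
# the full model (6.50); a:b in a formula denotes the interaction (product) term
lm(y ~ ide + cpp + java + ide:cpp + ide:java)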
Traditionally these are determined via hypothesis testing, which involves certain partitionings of sums of squares similar to (6.18). (This is where the name analysis of variance stems from.) The null distribution of the test statistic often turns out to be an F-distribution. Of course, in this book, we consider hypothesis testing inappropriate, preferring to give some careful thought to the estimated parameters, but it is standard. Further testing can be done on individual $\beta_i$ and so on. Often people use simultaneous inference procedures, discussed briefly in Section 4.2.16 of our unit on estimation and testing, since many tests are performed.

6.3.14.4 Reparameterization

Classical ANOVA uses a somewhat different parameterization than the one we've considered here. For instance, consider a single-factor setting (called one-way ANOVA) with three levels. Our predictors are then $X^{(1)}$ and $X^{(2)}$. Taking our approach here, we would write

m_{Y;X}(i,j) = \beta_0 + \beta_1 i + \beta_2 j     (6.55)

The traditional formulation would be

\mu_i = \mu + \alpha_i,   i = 1, 2, 3     (6.56)

where

\mu = \frac{\mu_1 + \mu_2 + \mu_3}{3}     (6.57)

and

\alpha_i = \mu_i - \mu     (6.58)

Of course, the two formulations are equivalent. It is left to the reader to check that, for instance,

\mu = \beta_0 + \frac{\beta_1 + \beta_2}{3}     (6.59)

There are similar formulations for ANOVA designs with more than one factor.

Note that the classical formulation overparameterizes the problem. In the one-way example above, for instance, there are four parameters ($\mu, \alpha_1, \alpha_2, \alpha_3$) but only three groups. This would make the system indeterminate, but we add the constraint

\sum_{i=1}^{3} \alpha_i = 0     (6.60)

Equation (6.24) then must make use of generalized matrix inverses.

6.4 The Classification Problem

As mentioned earlier, in the special case in which Y is an indicator variable, with the value 1 if the object is in a class and 0 if not, the regression problem is called the classification problem. It is also sometimes called pattern recognition, in which case the predictors are called features. Also, the term machine learning usually refers to classification problems.

If there are c classes, we need c (or c-1) Y variables, which I will denote by $Y^{(i)}$, i = 1,...,c.

6.4.1 Meaning of the Regression Function

6.4.1.1 The Mean Here Is a Probability

Now, here is a key point: Since the mean of any indicator random variable is the probability that the variable is equal to 1, the regression function in classification problems reduces to

m_{Y;X}(t) = P(Y = 1|X = t)     (6.61)

(Remember that X and t are vector-valued.)

For concreteness, let's look at the patent example in Section 6.1. Again, Y will be 1 or 0, depending on whether the patent had public funding. We'll take $X^{(1)}$ to be an indicator variable for the presence or absence of "NSF" in the patent, $X^{(2)}$ to be an indicator variable for "NIH," and take $X^{(3)}$ to be the number of claims in the patent. This last predictor might be relevant, e.g. if industrial patents are lengthier. So, $m_{Y;X}[(1,0,5)]$ would be the population proportion of all patents that are publicly funded, among those that contain the word "NSF," do not contain "NIH," and make five claims.

6.4.1.2 Optimality of the Regression Function

Again, our context is that we want to guess Y, knowing X. Since Y is 0-1 valued, our guess for Y based on X, g(X), should be 0-1 valued too. What is the best g?
Again, since Y and g are 0-1 valued, our criterion should be what I will call Probability of Correct Classification (PCC):

PCC = P[Y = g(X)]     (6.62)

Now proceed as in (6.13):

PCC = E[ P\{Y = g(X)|X\} ]     (6.63)

The analog of Lemma 9 is

Lemma 10 Suppose W takes on values in the set A = {0,1}, and consider the problem of maximizing

P(W = c),   c \in A     (6.64)

The solution is

c = 1 if P(W = 1) > 0.5, and c = 0 otherwise     (6.65)

Proof

Recalling that c is either 1 or 0, we have

P(W = c) = P(W = 1)c + [1 - P(W = 1)](1 - c)     (6.66)
         = [2P(W = 1) - 1]c + 1 - P(W = 1)     (6.67)

The result follows.

Applying this to (6.63), we see that the best g is given by

g(t) = 1 if m_{Y;X}(t) > 0.5, and g(t) = 0 otherwise     (6.68)

So we find that the regression function is again optimal, in this new context.

6.4.2 Parametric Models for the Regression Function in Classification Problems

Remember, we often try a parametric model for our regression function first, as it means we are estimating a finite number of quantities, instead of an infinite number.

6.4.2.1 The Logistic Model: Form

The most common parametric model in the classification problem is the logistic model (often called the logit model), seen in Section 6.3.10. In its r-predictor form, it is

m_{Y;X}(t) = P(Y = 1|X = t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t_1 + ... + \beta_r t_r)}}     (6.69)

For instance, consider the patent example. Under the logistic model, the population proportion of all patents that are publicly funded, among those that contain the word "NSF," do not contain "NIH," and make five claims, would have the value

\frac{1}{1 + e^{-(\beta_0 + \beta_1 + 5\beta_3)}}     (6.70)

6.4.2.2 The Logistic Model: Intuitive Motivation

The logistic function itself,

\frac{1}{1 + e^{-u}}     (6.71)

has values between 0 and 1, and is thus a candidate for modeling a probability. Also, it is monotonic in u, making it further attractive, as in many classification problems we believe that $m_{Y;X}(t)$ should be monotonic in the predictor variables.

6.4.2.3 The Logistic Model: Theoretical Foundation

But there are much stronger reasons to use the logit model, as it includes many common parametric models for X. To see this, note that we can write, for vector-valued discrete X and t,

P(Y = 1|X = t) = \frac{P(Y = 1 \text{ and } X = t)}{P(X = t)}     (6.72)
             = \frac{P(Y = 1)P(X = t|Y = 1)}{P(X = t)}     (6.73)
             = \frac{P(Y = 1)P(X = t|Y = 1)}{P(Y = 1)P(X = t|Y = 1) + P(Y = 0)P(X = t|Y = 0)}     (6.74)
             = \frac{1}{1 + \frac{(1-q)P(X = t|Y = 0)}{q P(X = t|Y = 1)}}     (6.75)

where q = P(Y = 1) is the proportion of members of the population which have Y = 1. (Keep in mind that this probability is unconditional!!!! In the patent example, for instance, if say q = 0.12, then 12% of all patents in the patent population, without regard to words used, numbers of claims, etc., are publicly funded.)

If X is a continuous random vector, then the analog of (6.75) is

P(Y = 1|X = t) = \frac{1}{1 + \frac{(1-q) f_{X|Y=0}(t)}{q f_{X|Y=1}(t)}}     (6.76)

Now suppose X, given Y, has a normal distribution. In other words, within each class, X is normally distributed. Consider the case of just one predictor variable, i.e. r = 1. Suppose that given Y = i, X has the distribution $N(\mu_i, \sigma^2)$, i = 0,1. Then

f_{X|Y=i}(t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-0.5\left(\frac{t - \mu_i}{\sigma}\right)^2\right]     (6.77)

After doing some elementary but rather tedious algebra, (6.76) reduces to the logistic form

\frac{1}{1 + e^{-(\beta_0 + \beta_1 t)}}     (6.78)

where $\beta_0$ and $\beta_1$ are functions of $\mu_0$, $\mu_1$ and $\sigma$.

In other words, if X is normally distributed in both classes, with the same variance but different means, then $m_{Y;X}$ has the logistic form!
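For the record, here is what that tedious algebra yields; a worked computation of our own, using only (6.76) and (6.77). Expanding the ratio of the two normal densities gives

\frac{(1-q) f_{X|Y=0}(t)}{q \, f_{X|Y=1}(t)}
   = \exp\left[\ln\frac{1-q}{q} + \frac{\mu_1^2 - \mu_0^2}{2\sigma^2}
     - \frac{\mu_1 - \mu_0}{\sigma^2}\, t\right]
   = e^{-(\beta_0 + \beta_1 t)}

so that

\beta_1 = \frac{\mu_1 - \mu_0}{\sigma^2}, \qquad
\beta_0 = \ln\frac{q}{1-q} - \frac{\mu_1^2 - \mu_0^2}{2\sigma^2}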
And the same is true if X is multivariate normal in each class, with different mean vectors but equal covariance matrices. (The algebra is even more tedious here, but it does work out.) So, not only does the logistic model have an intuitively appealing form, it is also implied by one of the most famous distributions X can have within each class: the multivariate normal.

If you reread the derivation above, you will see that the logit model will hold for any within-class distributions for which

\ln\left(\frac{f_{X|Y=1}(t)}{f_{X|Y=0}(t)}\right)     (6.79)

(or its discrete analog) is linear in t. Well, guess what: this condition is true for exponential distributions too! Work it out for yourself. In fact, a number of famous distributions imply the logit model.

6.4.3 Nonparametric Estimation of Regression Functions for Classification (advanced topic)

6.4.3.1 Use the Kernel Method, CART, Etc.

Since the classification problem is a special case of the general regression problem, nonparametric regression methods can be used here too.

6.4.3.2 SVMs

There are also some methods which have been developed exclusively, or mainly, for classification. One of them, which has been getting a lot of publicity in computer science circles, is support vector machines (SVMs). To explain the SVM concept, consider the case r = 2, i.e. two predictor variables $X^{(1)}$ and $X^{(2)}$. What an SVM would do is use our sample data to draw a curve in the $X^{(1)}$-$X^{(2)}$ plane, with our classification rule then being, "Guess Y to be 1 if X is on one side of the curve, and guess it to be 0 if X is on the other side."

DON'T BUY SNAKE OIL! There are no "magic" solutions to statistical problems. SVMs do very well in some situations, not so well in others. I highly recommend the site www.dtreg.com/benchmarks.htm, which compares six different types of classification function estimators, including logistic regression and SVM, on several dozen real data sets. The overall percent misclassification rates, averaged over all the data sets, were fairly close, ranging from a high of 25.3% to a low of 19.2%. The much-vaunted SVM came in at 20.3%. That's nice, but it was only a tad better than logit's 20.9%. Considering that the latter has a big advantage in that one gets an actual equation for the classification function, complete with parameters which we can estimate and make confidence intervals for, it is not clear just what role SVM and the other nonparametric estimators should play, in general, though in specific applications they may be appropriate.

6.4.4 Variable Selection in Classification Problems

6.4.4.1 Problems Inherited from the Regression Context

In Section 6.3.9.2, it was pointed out that the problem of predictor variable selection in regression is unsolved. Since the classification problem is a special case of regression, there is no surefire way to select predictor variables there either.
6.4.4.2 Example: Forest Cover Data

And again, using hypothesis testing to choose predictors is not the answer. To illustrate this, let's look again at the forest cover data we saw in Section 4.2.12. There were seven classes of forest cover there. Let's restrict attention to classes 1 and 2. In my R analysis I had the class 1 and 2 data in objects cov1 and cov2, respectively. I combined them,

> cov1and2 <- rbind(cov1,cov2)

and created a new variable to serve as Y:

> cov1and2[,56] <- ifelse(cov1and2[,55] == 1,1,0)

Let's see how well we can predict a site's class from the variable HS12 (hillside shade at noon) that we investigated in that past unit, using a logistic model.

In R we fit logistic models via the glm() function, for generalized linear models. The word generalized here refers to models in which some function of $m_{Y;X}(t)$ is linear in the parameters $\beta_i$. For the classification model,

\ln\left(\frac{m_{Y;X}(t)}{1 - m_{Y;X}(t)}\right) = \beta_0 + \beta_1 t^{(1)} + ... + \beta_r t^{(r)}     (6.80)

This kind of generalized linear model is specified in R by setting the named argument family to binomial. Here is the call:

> g <- glm(cov1and2[,56] ~ cov1and2[,8],family=binomial)

The result was:

> summary(g)

Call:
glm(formula = cov1and2[, 56] ~ cov1and2[, 8], family = binomial)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.165  -0.820  -0.775   1.504   1.741

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.515820   1.148665   1.320   0.1870
cov1and2[, 8]  -0.010960   0.005103  -2.148   0.0317 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 959.72  on 810  degrees of freedom
Residual deviance: 955.14  on 809  degrees of freedom
AIC: 959.14

Number of Fisher Scoring iterations: 4

So, $\hat{\beta}_1$ = -0.01. This is tiny, as can be seen from our data in the last unit. There we found that the estimated mean values of HS12 for cover types 1 and 2 were 223.8 and 226.3, a difference of only 2.5. That difference in essence gets multiplied by 0.01. More concretely, in (6.46), plug in our estimates 1.52 and -0.01 from our R output above, first taking t to be 223.8 and then 226.3. The results are 0.328 and 0.322, respectively. In other words, HS12 isn't having much effect on the probability of cover type 1, and so it cannot be a good predictor of cover type.
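Those two values can be reproduced in one line; a quick check of our own, using the rounded estimates:

# predicted P(cover type 1) at the two mean HS12 values, via (6.46)
1 / (1 + exp(-(1.52 - 0.01*c(223.8, 226.3))))   # about 0.328 and 0.322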
Yet the R output says that $\hat{\beta}_1$ is "significantly" different from 0, with a p-value of 0.03. Thus, we see once again that hypothesis testing does not achieve our goal. Again, cross validation is a better method for choosing predictors.

6.4.5 Y Must Have a Marginal Distribution!

In our material here, we have tacitly assumed that the vector (Y,X) has a distribution. That may seem like an odd and puzzling remark to make here, but it is absolutely crucial. Let's see what it means.

Consider the study on object-oriented programming in Section 6.1, but turned around. (This example will be somewhat contrived, but it will illustrate the principle.) Suppose we know how many lines of code are in a project, which we will still call $X^{(1)}$, and we know how long it took to complete, which we will now take as $X^{(2)}$, and from this we want to guess whether object-oriented or procedural programming was used (without being able to look at the code, of course), which is now our new Y.

Here is our huge problem: Given our sample data, there is no way to estimate q in (6.75). That's because the authors of the study simply took two groups of programmers and had one group use object-oriented programming and had the other group use procedural programming. If we had sampled programmers at random from actual projects done at this company, that would enable us to estimate q, the population proportion of projects done with OOP. But we can't do that with the data that we do have. Indeed, in this setting, it may not even make sense to speak of q in the first place.

Mathematically speaking, if you think about the process under which the data was collected in this study, there does exist some conditional distribution of X given Y, but Y itself has no distribution. So, we can NOT estimate P(Y = 1|X). About the best we can do is try to guess Y on the basis of whichever value of i makes $f_{X|Y=i}(X)$ larger.

6.5 Principal Components Analysis

6.5.1 Dimension Reduction and the Principle of Parsimony

Consider a random vector $X = (X_1, X_2)'$. Suppose the two components of X are highly correlated with each other. Then for some constants c and d,

X_2 \approx c + dX_1     (6.81)

Then in a sense there is really just one random variable here, as the second is nearly a linear function of the first. The second provides us with almost no new information, once we have the first. In other words, even though the vector X roams in two-dimensional space, it usually sticks close to a one-dimensional object, namely the line (6.81). We saw a graph illustrating this in our unit on multivariate distributions, page 84.

In general, consider a k-component random vector

X = (X_1, ..., X_k)'     (6.82)

We again wish to investigate whether just a few, say w, of the $X_i$ tell almost the whole story, i.e. whether most $X_j$ can be expressed approximately as linear combinations of these few $X_i$. In other words, even though X is k-dimensional, it tends to stick close to some w-dimensional subspace.

Note that although (6.81) is phrased in prediction terms, we are not (or more accurately, not necessarily) interested in prediction here. We have not designated one of the $X_i$ to be a response variable and the rest to be predictors.

Once again, the Principle of Parsimony is key. If we have, say, 20 or 30 variables, it would be nice if we could reduce that to, for example, three or four. This may be easier to understand and work with, albeit with the complication that our new variables would be linear combinations of the old ones.

6.5.2 How to Calculate Them

Here's how it works. The theory of linear algebra says that since $\Sigma$, the covariance matrix of X, is symmetric, it is diagonalizable, i.e. there is a real matrix Q for which

Q'\Sigma Q = D     (6.83)

where D is a diagonal matrix. (This is a special case of singular value decomposition.) The columns $C_i$ of Q are the eigenvectors of $\Sigma$, and it turns out that they are orthogonal to each other, i.e. their dot product is 0.

Let

W_i = C_i' X,   i = 1, ..., k     (6.84)

so that the $W_i$ are scalar random variables, and set

W = (W_1, ..., W_k)'     (6.85)

Then

W = Q'X     (6.86)

Now, using the material on covariance matrices from our unit on multivariate analysis, page 75,

Cov(W) = Cov(Q'X) = Q'Cov(X)Q = D (from (6.83))     (6.87)

Note too that if X has a multivariate normal distribution (which we are not assuming), then W does too.

Let's recap:

• We have created new random variables $W_i$ as linear combinations of our original $X_j$.
• The $W_i$ are uncorrelated. Thus if in addition X has a multivariate normal distribution, so that W does too, then the $W_i$ will be independent.
• The variance of $W_i$ is given by the ith diagonal element of D.

The $W_i$ are called the principal components of the distribution of X. It is customary to relabel the $W_i$ so that $W_1$ has the largest variance, $W_2$ has the second-largest, and so on. We then choose those $W_i$ that have the larger variances, and discard the others, because the latter, having small variances, are close to constant and thus carry no information.
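The diagonalization can be carried out directly in R; a minimal sketch with made-up data in which, as in (6.81), $X_2$ is nearly a linear function of $X_1$:

# hypothetical data: x2 is nearly 1 + 2*x1
x1 <- rnorm(200)
x2 <- 1 + 2*x1 + rnorm(200, sd=0.1)
x <- cbind(x1, x2)
eig <- eigen(cov(x))    # cov(x) estimates Sigma; eig$vectors plays the role of Q
w <- x %*% eig$vectors  # columns of w hold the principal component scores W_i
eig$values              # the diagonal of D: one large variance, one near 0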
All this will become clearer in the example below.

6.5.3 Example: Forest Cover Data

Let's try using principal components analysis on the forest cover data set we've looked at before. There are 10 continuous variables (also many discrete ones, but there is another tool for that case, the log-linear model, discussed in Section 6.6).

In my R run, the data set (now not restricted to just two forest cover types, but consisting only of the first 1000 observations) was in the object f. Here are the call and the results:

> prc <- prcomp(f[,1:10])
> summary(prc)
Importance of components:
                            PC1      PC2      PC3      PC4      PC5      PC6
Standard deviation     1812.394 1613.287 1.89e+02 1.10e+02 96.93455 30.16789
Proportion of Variance    0.552    0.438 6.01e-03 2.04e-03  0.00158  0.00015
Cumulative Proportion     0.552    0.990 9.96e-01 9.98e-01  0.99968  0.99984
                            PC7      PC8 PC9  PC10
Standard deviation     25.95478 16.78595 4.2 0.783
Proportion of Variance  0.00011  0.00005 0.0 0.000
Cumulative Proportion   0.99995  1.00000 1.0 1.000

You can see from the variance values here that R has scaled the $W_i$ so that their variances sum to 1.0. (It has not done so for the standard deviations, which are for the nonscaled variables.) This is fine, as we are only interested in the variances relative to each other, i.e. saving the principal components with the larger variances.

What we see here is that eight of the 10 principal components have very small variances, i.e. are close to constant. In other words, though we have 10 variables $X_1, ..., X_{10}$, there are really only two variables' worth of information carried in them.

So for example if we wish to predict forest cover type from these 10 variables, we should only use two of them. We could use $W_1$ and $W_2$, but for the sake of interpretability we stick to the original X vector; we can use any two of the $X_i$. The coefficients of the linear combinations which produce W from X, i.e. the Q matrix, are available via prc$rotation.

6.6 Log-Linear Models

Here we discuss a procedure which is something of an analog of principal components for discrete variables. Our material on ANOVA will also come into play. It is recommended that the reader review Sections 6.3.14 and 6.5 before continuing.

6.6.1 The Setting

Let's consider a variation on the software engineering example in Sections 6.2 and 6.3.14. Assume we have the factors IDE, Language and Education. Our change, of extreme importance, is that we will now assume that these factors are RANDOM. What does this mean?

In the original example described in Section 6.2, programmers were assigned to languages, and in our extensions of that example, we continued to assume this. Thus for example the number of programmers who use an IDE and program in Java was fixed; if we repeated the experiment, that number would stay the same. If we were sampling from some programmer population, our new sample would have new programmers, but the number using an IDE and Java would be the same as before, as our study procedure specifies this.

By contrast, let's now assume that we simply sample programmers at random, and ask them whether they prefer to use an IDE or not, and which language they prefer.¹⁰ Then for example the number of programmers who prefer to use an IDE and program in Java will be random, not fixed; if we repeat the experiment, we will get a different count.

¹⁰Other sampling schemes are possible too.

Suppose now we wish to investigate relations between the factors. Are choice of platform and language related to education, for instance?
6.6.2 The Data

Denote our three factors by $X^{(s)}$, s = 1,2,3. Here $X^{(1)}$, IDE, will take on the values 1 and 2 instead of 1 and 0 as before, 1 meaning that the programmer prefers to use an IDE, and 2 meaning not so. $X^{(3)}$, Education, changes this way too, and $X^{(2)}$, Language, will take on the values 1 for C++, 2 for Java and 3 for C. Note that we no longer use indicator variables.

Let $X_r^{(s)}$ denote the value of $X^{(s)}$ for the rth programmer in our sample, r = 1,2,...,n. Our data are the counts

N_{ijk} = \text{number of r such that } X_r^{(1)} = i, X_r^{(2)} = j \text{ and } X_r^{(3)} = k     (6.88)

For instance, if we sample 100 programmers, our data might look like this:

prefers to use IDE:

                     C++  Java  C
Bachelor's or less    18    22  6
Master's or more      15    10  4

prefers not to use IDE:

                     C++  Java  C
Bachelor's or less     7     6  3
Master's or more       4     2  3

So for example $N_{122} = 10$ and $N_{212} = 4$. Here we have a three-dimensional contingency table. Each $N_{ijk}$ value is a cell in the table.

6.6.3 The Models

Let $p_{ijk}$ be the population probability of a randomly-chosen programmer falling into cell ijk, i.e.

p_{ijk} = P(X^{(1)} = i \text{ and } X^{(2)} = j \text{ and } X^{(3)} = k) = E(N_{ijk})/n     (6.89)

As mentioned, we are interested in relations between the factors, in the form of independence, full and partial.

Consider first the case of full independence:

p_{ijk} = P(X^{(1)} = i \text{ and } X^{(2)} = j \text{ and } X^{(3)} = k)     (6.90)
       = P(X^{(1)} = i) \cdot P(X^{(2)} = j) \cdot P(X^{(3)} = k)     (6.91)

Taking logs of both sides in (6.91), we see that independence of the three factors is equivalent to saying

\log(p_{ijk}) = a_i + b_j + c_k     (6.92)

for some numbers $a_i$, $b_j$ and $c_k$. The numbers must be nonpositive, and since

\sum_m P(X^{(s)} = m) = 1     (6.93)

we must have, for instance,

\sum_{g=1}^{2} \exp(c_g) = 1     (6.94)

The point is that (6.92) looks like our no-interaction ANOVA models, e.g. (6.51).

On the other hand, if we assume instead that Education is independent of IDE and Language but that IDE and Language are not independent of each other, our model would be

\log(p_{ijk}) = \log[P(X^{(1)} = i \text{ and } X^{(2)} = j) \cdot P(X^{(3)} = k)]     (6.95)
            = a_i + b_j + d_{ij} + c_k     (6.96)

Here we have written $\log P(X^{(1)} = i \text{ and } X^{(2)} = j)$ as a sum of "main effects" $a_i$ and $b_j$, and "interaction effects" $d_{ij}$, analogous to ANOVA.

Another possible model would have IDE and Language conditionally independent, given Education, meaning that at any level of education, a programmer's preference to use an IDE or not, and his choice of programming language, are not related. We'd write the model this way:
Take for example (6.96), where we write Pijk e= ai+bj+dij+ck (6.100) We then substitute (6.100) in (6.99), and maximize the latter with respect to the ai, by, di and ck, subject to constraints such as (6.94). The maximization may be messy. But certain cases have been worked out in closed form, and in any case today one would typically do the computation by computer. In R, for example, there is the loglin() function for this purpose. 6.6.5 The Goal: Parsimony Again Again, we'd like "the simplest model possible, but not simpler." This means a model with as much indepen- dence between factors as possible, subject to the model being accurate. Classical log-linear model procedures do model selection by hypothesis testing, testing whether various interaction terms are 0. The tests often parallel ANOVA testing, with chi-square distributions arising instead of F-distributions.  6.7. SIMPSON'S (NON-)PARADOX 193 6.7 Simpson's (Non-)Paradox Suppose each individual in a population either possesses or does not possess traits A, B and C, and that we wish to predict trait A. Let A, B and C denote the situations in which the individual does not possess the given trait. Simpson's Paradox then describes a situation in which P(A|B) > P(A B) (6.101) and yet P(A|B, C) < P(A|B, C) (6.102) In other words, the possession of trait B seems to have a positive predictive power for A by itself, but when in addition trait C is held constant, the relation between B and A turns negative. An example is given by Fabris and Freitas,il concerning a classic study of tuberculosis mortality in 1910. Here the attribute A is mortality, B is city (Richmond, with B being New York), and C is race (African- American, with C being Caucasian). In probability terms, the data show that (these of course are sample estimates) " P(mortality Richmond) = 0.0022 " P(mortality New York) = 0.0019 " P(mortality Richmond, black) = 0.0033 " P(mortality New York, black) = 0.0056 " P(mortality Richmond, white) = 0.0016 " P(mortality New York, white) = 0.0018 The data also show that " P(black Richmond) = 0.37 " P(black New York) = 0.002 a point which will become relevant below. At first, New York looks like it did a better job than Richmond. However, once one accounts for race, we find that New York is actually worse than Richmond. Why the reversal? The answer stems from the fact that racial inequities being what they were at the time, blacks with the disease fared much worse than whites. "C.C. Fabris and A.A. Freitas. Discovering Surprising Patterns by Detecting Occurrences of Simpson's Paradox. In Research and Development in Intelligent Systems XVI (Proc. ES99, The 19th SGES Int. Conf on Knowledge-Based Systems and Applied Artificial Intelligence), 148-160. Springer-Verlag, 1999  194 CHAPTER 6. STATISTICAL RELATIONS BETWEEN VARIABLES Richmond's population was 37% black, proportionally far more than New York's 0.2%. So, Richmond's heavy concentration of blacks made its overall mortality rate look worse than New York's, even though things were actually much worse in New York. But is this really a "paradox"? Closer consideration of this example reveals that the only reason this example (and others like it) is surprising is that the predictors were used in the wrong order. One normally looks for predictors one at a time, first finding the best single predictor, then the best pair of predictors, and so on. If this were done on the above data set, the first predictor variable chosen would be race, not city. 
In other words, the sequence of analysis would look something like this:

• P(mortality | Richmond) = 0.0022
• P(mortality | New York) = 0.0019
• P(mortality | black) = 0.0048
• P(mortality | white) = 0.0018
• P(mortality | black, Richmond) = 0.0033
• P(mortality | black, New York) = 0.0056
• P(mortality | white, Richmond) = 0.0016
• P(mortality | white, New York) = 0.0018

The analyst would have seen that race is a better predictor than city, and thus would have chosen race as the best single predictor. The analyst would then investigate the race/city predictor pair, and would never reach a point at which city alone was in the selected predictor set. Thus no anomalies would arise.

Exercises

Note to instructor: See the Preface for a list of sources of real data on which exercises can be assigned to complement the theoretical exercises below.

1. Suppose we are interested in documents of a certain type, which we'll call Type 1. Everything that is not Type 1 we'll call Type 2, with a proportion q of all documents being Type 1. Our goal will be to try to guess document type by the presence or absence of a certain word; we will guess Type 1 if the word is present, and otherwise will guess Type 2.

Let T denote document type, and let W denote the event that the word is in the document. Also, let p_i be the proportion of documents that contain the word, among all documents of Type i, i = 1,2. The event C will denote our guessing correctly.

Find the overall probability of correct classification, P(C), and also P(C|W).

Hint: Be careful of your conditional and unconditional probabilities here.

2. In the quartic model in the ALOHA simulation example, find an approximate 95% confidence interval for the true population mean wait if our backoff parameter b is set to 0.6.

Hint: You will need to use the fact that a linear combination of the components of a multivariate normal random vector has a univariate normal distribution, as discussed in Section 3.6.2.1.

3. Consider the linear regression model with one predictor, i.e. r = 1. Let Y_i and X_i represent the values of the response and predictor variables for the ith observation in our sample.

(a) Assume as in Section 6.3.7.4 that Var(Y|X = t) is a constant in t, σ². Find the exact value of Cov(β̂₀, β̂₁), as a function of the X_i and σ². Your final answer should be in scalar, i.e. non-matrix, form.

(b) Suppose we wish to fit the model m_{Y;X}(t) = β₁t, i.e. the usual linear model but without the constant term β₀. Derive a formula for the least-squares estimate of β₁.

4. Suppose the random pair (X, Y) has density 8st on 0 < t < s < 1. Find m_{Y;X}(s) and Var(Y|X = s), 0 < s < 1.

5. We showed that (6.76) reduces to the logistic model in the case in which the distribution of X given Y is normal. Show that this is also true in the case in which that distribution is exponential, i.e.

f_{X|Y}(t, i) = λ_i e^{-λ_i t}, t > 0   (6.103)

6. In this problem, you will conduct an R simulation experiment similar to that of Foster and Stine on overfitting, discussed in Section 6.3.9.2.

Generate data X_i^(j), i = 1,...,n, j = 1,...,r from a N(0,1) distribution, and ε_i, i = 1,...,n from N(0,4). Set Y_i = X_i^(1) + ε_i, i = 1,...,n. This simulates drawing a random sample of n observations from an (r+1)-variate population.

Now suppose the analyst, unaware that Y is related only to X^(1), fits the model

m_{Y;X^(1),...,X^(r)}(t_1, ..., t_r) = β₀ + β₁t₁ + ... + β_r t_r   (6.104)

In actuality, β_j = 0 for j > 1 (and for j = 0). But the analyst wouldn't know this.
Suppose the analyst selects predictors by testing the hypotheses H₀: β_i = 0, as in Section 6.3.9.2, with α = 0.05. Do this for various values of r and n. You should find that, for fixed n and increasing r, some of the predictors are declared to be "significantly" related to Y (complete with asterisks) when in fact they are not, while X^(1), which really is related to Y, may be declared NOT "significant." This illustrates the folly of using hypothesis testing to do variable selection.

7. Consider a random pair (X, Y) for which the linear model E(Y|X) = β₀ + β₁X holds, and think about predicting Y, first without X and then with X, minimizing mean squared prediction error (MSPE) in each case. From Section 6.3.6, we know that without X, the best predictor is EY, while with X it is E(Y|X), which under our assumption here is β₀ + β₁X. Show that the reduction in MSPE accrued by using X, i.e.

{E[(Y - EY)²] - E[{Y - E(Y|X)}²]} / E[(Y - EY)²]   (6.105)

is equal to ρ²(X, Y).

Chapter 7

Markov Chains

One of the most famous stochastic models is that of a Markov chain. This type of model is widely used in computer science, biology, physics and so on.

7.1 Discrete-Time Markov Chains

7.1.1 Example: Finite Random Walk

To motivate this discussion, let us start with a simple example: Consider a random walk on the set of integers between 1 and 5, moving randomly through that set, say one move per second, according to the following scheme. If we are currently at position i, then one time period later we will be at either i-1, i or i+1, according to the outcome of rolling a fair die: we move to i-1 if the die comes up 1 or 2, stay at i if the die comes up 3 or 4, and move to i+1 in the case of a 5 or 6. For the special cases i = 1 and i = 5, we simply move back to 2 or 4, respectively. (In random walk terminology, these are called reflecting barriers.)

The integers 1 through 5 form the state space for this process; if we are currently at 4, for instance, we say we are in state 4. Let X_t represent the position of the particle at time t, t = 0,1,2,....

The random walk is a Markov process. The process is "memoryless," meaning that we can "forget the past"; given the present and the past, the future depends only on the present:

P(X_{t+1} = s_{t+1} | X_t = s_t, X_{t-1} = s_{t-1}, ..., X_0 = s_0) = P(X_{t+1} = s_{t+1} | X_t = s_t)   (7.1)

The term Markov process is the general one. If the state space is discrete, i.e. finite or countably infinite, then we usually use the more specialized term, Markov chain.

Although this equation has a very complex look, it has a very simple meaning: The distribution of our next position, given our current position and all our past positions, depends only on the current position.¹ It is clear that the random walk process above does have this property; for instance, if we are now at position 4, the probability that our next state will be 3 is 1/3, no matter where we were in the past.

¹This can be generalized, so that the future depends on the present and also on the state one unit of time ago, etc. However, such models become quite unwieldy.

Continuing this example, let p_ij denote the probability of going from position i to position j in one step. For example, p_21 = p_23 = 1/3 while p_24 = 0 (we can reach position 4 from position 2 in two steps, but not in one step).
The numbers p_ij are called the one-step transition probabilities of the process. Denote by P the matrix whose entries are the p_ij:

   P =  (  0    1    0    0    0  )
        ( 1/3  1/3  1/3   0    0  )
        (  0   1/3  1/3  1/3   0  )   (7.2)
        (  0    0   1/3  1/3  1/3 )
        (  0    0    0    1    0  )

By the way, it turns out that the matrix P^k gives the k-step transition probabilities. In other words, the element (i,j) of this matrix gives the probability of going from i to j in k steps.

7.1.2 Long-Run Distribution

In typical applications we are interested in the long-run distribution of the process, for example the long-run proportion of the time that we are at position 4. For each state i, define

π_i = lim_{t→∞} N_it/t   (7.3)

where N_it is the number of visits the process makes to state i among times 1, 2, ..., t. In most practical cases, this proportion will exist and be independent of our initial position X₀. The π_i are called the steady-state probabilities, or the stationary distribution, of the Markov chain.

Intuitively, the existence of the π_i implies that as t approaches infinity, the system approaches steady state, in the sense that

lim_{t→∞} P(X_t = i) = π_i   (7.4)

Actually, the limit (7.4) may not exist in some cases. We'll return to that point later, but for typical cases it does exist, and we will usually assume this.

It then suggests a way to calculate the values π_i, as follows. First note that

P(X_{t+1} = i) = Σ_k P(X_t = k and X_{t+1} = i) = Σ_k P(X_t = k) P(X_{t+1} = i | X_t = k) = Σ_k P(X_t = k) p_ki   (7.5)

where the sum goes over all states k. For example, in our random walk example above, we would have

P(X_{t+1} = 3) = Σ_{k=1}^{5} P(X_t = k and X_{t+1} = 3) = Σ_{k=1}^{5} P(X_t = k) P(X_{t+1} = 3 | X_t = k) = Σ_{k=1}^{5} P(X_t = k) p_k3   (7.6)

Then as t → ∞ in Equation (7.5), intuitively we would have

π_i = Σ_k π_k p_ki   (7.7)

Remember, here we know the p_ki and want to find the π_i. Solving these equations (one for each i), called the balance equations, gives us the π_i.

A matrix formulation is also useful. Letting π denote the row vector of the elements π_i, i.e. π = (π₁, π₂, ...), these equations (one for each i) then have the matrix form

π = πP   (7.8)

or, taking transposes (denoted here by '),

(I - P')π' = 0   (7.9)

Note that there is also the constraint

Σ_i π_i = 1   (7.10)

For the random walk problem above, for instance, the solution is π = (1/11, 3/11, 3/11, 3/11, 1/11). Thus in the long run we will spend 1/11 of our time at position 1, 3/11 of our time at position 2, and so on.

Equation (7.9) can be used to calculate the π_i. It turns out that one of the equations in the system is redundant. We thus eliminate one of them, say by removing the last row of I - P' in (7.9). To reflect (7.10), we replace the removed row by a row of all 1s, and in the right-hand side of (7.9) we replace the last 0 by a 1. We can then solve the system, say with R's solve() function. Or one can note from (7.8) that π is a left eigenvector of P with eigenvalue 1, so one can call eigen() on P'.

But Equation (7.9) may not be easy to solve. For instance, if the state space is infinite, then this matrix equation represents infinitely many scalar equations. In such cases, you may need to find some clever trick which will allow you to solve the system, or in many cases a clever trick to analyze the process in some way other than explicit solution of the system of equations. And even for finite state spaces, the matrix may be extremely large. In some cases, you may need to resort to numerical methods, or symbolic math packages.
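Here is a minimal R sketch (ours, not from the text) carrying out these computations for the random walk chain: it checks the two-step claim about going from position 2 to position 4 by computing P², and then solves the modified system just described.

   P <- matrix(c(0,   1,   0,   0,   0,
                 1/3, 1/3, 1/3, 0,   0,
                 0,   1/3, 1/3, 1/3, 0,
                 0,   0,   1/3, 1/3, 1/3,
                 0,   0,   0,   1,   0), nrow=5, byrow=TRUE)
   P %*% P                     # 2-step probabilities; element (2,4) is 1/9, not 0
   A <- diag(5) - t(P)         # rows are the balance equations (7.7)
   A[5,] <- 1                  # replace the redundant last equation by (7.10)
   solve(A, c(0, 0, 0, 0, 1))  # returns 1/11, 3/11, 3/11, 3/11, 1/11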
7.1.2.1 Periodic Chains

Note again that even if Equation (7.9) has a solution, this does not imply that (7.4) holds. For instance, suppose we alter the random walk example above so that p_{i,i-1} = p_{i,i+1} = 1/2 for i = 2, 3, 4, i.e. we never stay put, with transitions out of states 1 and 5 remaining as before. In this case, the solution to Equation (7.9) is

π = (1/8, 1/4, 1/4, 1/4, 1/8)   (7.11)

This solution is still valid, in the sense that Equation (7.3) will hold. For example, we will spend 1/4 of our time at position 4 in the long run. But the limit of P(X_t = 4) will not be 1/4, and in fact the limit will not even exist. If say X₀ is even, then X_t can be even only for even values of t. We say that this Markov chain is periodic with period 2, meaning that returns to a given state can only occur after amounts of time which are multiples of 2.

7.1.2.2 The Meaning of the Term "Stationary Distribution"

Though we have informally defined the term stationary distribution in terms of long-run proportions, the technical definition is this:

Definition 11 Consider a Markov chain. Suppose we have a vector π of nonnegative numbers that sum to 1. Let X₀ have the distribution π. If that results in X₁ having that distribution too (and thus also all X_n), we say that π is the stationary distribution of this Markov chain.

Note that this definition stems from (7.5).

In our (first) random walk example above, this would mean that if we have X₀ distributed on the integers 1 through 5 with probabilities (1/11, 3/11, 3/11, 3/11, 1/11), then for example P(X₁ = 1) = 1/11, P(X₁ = 4) = 3/11, etc. This is indeed the case, as you can verify using (7.5) with t = 0.

In our "notebook" view, here is what we would do. Imagine that we generate a random integer between 1 and 5 according to the probabilities (1/11, 3/11, 3/11, 3/11, 1/11),² and set X₀ to that number. We would then simulate one move of the walk, say by rolling an ordinary die to decide whether to go left, stay put or go right, each with probability 1/3 (with the reflecting rule applying at positions 1 and 5), yielding X₁. We would then write down X₀ and X₁ on the first line of our notebook. We would then do this experiment again, recording the results on the second line, then again and again. In the long run, 3/11 of the lines would have, for instance, X₀ = 4, and 3/11 of the lines would have X₁ = 4. In other words, X₁ would have the same distribution as X₀.

²Say by rolling an 11-sided die.

7.1.3 Example: Stuck-At 0 Fault

7.1.3.1 Description

In the above example, the labels for the states consisted of single integers i. In some other examples, convenient labels may be r-tuples, for example 2-tuples (i,j).

Consider a serial communication line. Let B₁, B₂, B₃, ... denote the sequence of bits transmitted on this line. It is reasonable to assume the B_i to be independent, and that P(B_i = 0) and P(B_i = 1) are both equal to 0.5.

Suppose that the receiver will eventually fail, with the type of failure being stuck at 0, meaning that after failure it will report all future received bits to be 0, regardless of their true value. Once failed, the receiver stays failed, and should be replaced. Eventually the new receiver will also fail, and we will replace it; we continue this process indefinitely.

Let p denote the probability that the receiver fails on any given bit, with independence between bits in terms of receiver failure. Then the lifetime of the receiver, that is, the time to failure, is geometrically distributed with "success" probability p, i.e. the probability of failing on receipt of the ith bit after the receiver is installed is (1-p)^{i-1} p, for i = 1,2,3,...
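As a quick numerical illustration (ours, not part of the model), this pmf is easy to evaluate in R; note that R's dgeom() counts the number of bits before the failure, so the lifetime pmf above corresponds to dgeom(i-1, p):

   p <- 0.001        # an arbitrary illustrative failure probability
   i <- 1:5
   (1-p)^(i-1) * p   # P(lifetime = i), as given above
   dgeom(i-1, p)     # the same values, under R's parameterization
   1/p               # mean lifetime, here 1000 bits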
However, the problem is that we will not know whether a receiver has failed (unless we test it once in a while, which we are not including in this example). If the receiver reports a long string of 0s, we should suspect that the receiver has failed, but of course we cannot be sure that it has; it is still possible that the message being transmitted just happened to contain a long string of 0s.

Suppose we adopt the policy that, if we receive k consecutive 0s, we will replace the receiver with a new unit. Here k is a design parameter; what value should we choose for it? If we use a very small value, then we will incur great expense, due to the fact that we will be replacing receiver units at an unnecessarily high rate. On the other hand, if we make k too large, then we will often wait too long to replace the receiver, and the resulting error rate in received bits will be sizable.

Resolution of this tradeoff between expense and accuracy depends on the relative importance of the two. (There are also other possibilities, involving the addition of redundant bits for error detection, such as parity bits. For simplicity, we will not consider such refinements here. However, the analysis of more complex systems would be similar to the one below.)

7.1.3.2 Initial Analysis

A natural state space in this example would be

{(i,j) : i = 0, 1, ..., k-1; j = 0, 1; i+j ≠ 0}   (7.12)

where i represents the number of consecutive 0s that we have received so far, and j represents the state of the receiver (0 for failed, 1 for nonfailed). Note that when we are in a state of the form (k-1,j), if we receive a 0 on the next bit (whether it is a true 0 or the receiver has failed), our new state will be (0,1), as we will install a new receiver. Note too that there is no state (0,0), since if the receiver is down it must have received at least one bit.

The calculation of the transition matrix P is straightforward, though it requires careful thought. For example, suppose the current state is (2,1), and that we are investigating the expense and bit accuracy corresponding to a policy having k = 5. What can happen upon receipt of the next bit? The next bit will have a true value of either 0 or 1, with probability 0.5 each. The receiver will change from working to failed status with probability p. Thus our next state could be:

• (3,1), if a 0 arrives, and the receiver does not fail;
• (0,1), if a 1 arrives, and the receiver does not fail; or
• (3,0), if the receiver fails.

The probabilities of these three transitions out of state (2,1) are:

p_{(2,1),(3,1)} = 0.5(1-p)   (7.13)
p_{(2,1),(0,1)} = 0.5(1-p)   (7.14)
p_{(2,1),(3,0)} = p   (7.15)

Other entries of the matrix P can be computed similarly. Note by the way that from state (4,1) we will go to (0,1), no matter what happens.

Formally specifying the matrix P using the 2-tuple notation as above would be very cumbersome. In this case, it would be much easier to map to a one-dimensional labeling. For example, if k = 5, the nine states (1,0), ..., (4,0), (0,1), (1,1), ..., (4,1) could be renamed states 1, 2, ..., 9. Then we could form P under this labeling, and the transition probabilities above would appear as

p_78 = 0.5(1-p)   (7.16)
p_75 = 0.5(1-p)   (7.17)
p_73 = p   (7.18)
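To make this concrete, here is a rough R sketch (ours, not from the text) that assembles the full 9x9 matrix for k = 5 under this numbering, reading the transition rules off exactly as in the example above, and then solves for π as in Section 7.1.2:

   stuckat0pi <- function(p) {
      P <- matrix(0, 9, 9)
      P[1,2] <- P[2,3] <- P[3,4] <- 1   # receiver down: reported bit is always 0
      P[4,5] <- 1                       # (4,0): fifth consecutive 0, so we replace
      for (i in 5:8) {                  # states (0,1), (1,1), (2,1), (3,1)
         P[i,i+1] <- 0.5*(1-p)          # true 0 arrives, receiver does not fail
         P[i,5] <- P[i,5] + 0.5*(1-p)   # true 1 arrives, receiver does not fail
         P[i,i-4] <- p                  # receiver fails, so a 0 is reported
      }
      P[9,5] <- 1                       # (4,1): we go to (0,1) no matter what
      A <- diag(9) - t(P)               # the balance equations, as in (7.9)
      A[9,] <- 1                        # replace the redundant equation by (7.10)
      solve(A, c(rep(0,8), 1))
   }
   stuckat0pi(0.001)                    # 0.001 is an arbitrary illustrative value of p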
7.1.3.3 Going Beyond Finding π

Finding the π_i should be just the first step. We then want to use them to calculate various quantities of interest.³ For instance, in this example, it would also be useful to find the error rate ε, and the mean time (i.e., the mean number of bit receptions) between receiver replacements, μ.

³Note that unlike a classroom setting, where those quantities would be listed for the students to calculate, in research we must decide on our own which quantities are of interest.

We can find both ε and μ in terms of the π_i, in the following manner.

The quantity ε is the proportion of the time during which the true value of the received bit is 1 but the receiver is down, which is 0.5 times the proportion of the time spent in states of the form (i,0):

ε = 0.5(π₁ + π₂ + π₃ + π₄)   (7.19)

This should be clear intuitively, but it would also be instructive to present a more formal derivation of the same thing. Let E_n be the event that the nth bit is received in error, with D_n denoting the event that the receiver is down at that time. Then

ε = lim_{n→∞} P(E_n)   (7.20)
  = lim_{n→∞} P(B_n = 1 and D_n)   (7.21)
  = lim_{n→∞} P(B_n = 1) P(D_n)   (7.22)
  = 0.5(π₁ + π₂ + π₃ + π₄)   (7.23)

Here we used the fact that the true bit value B_n and the receiver state are independent.

Equations (7.20)-(7.23) follow a pattern we'll use repeatedly in this chapter. In subsequent examples we will not show the steps with the limits, but the limits are indeed there. Make sure to mentally go through these steps yourself.⁴

⁴The other way to work this out rigorously is to assume that X₀ has the distribution π, as in Section 7.1.2.2. Then no limits are needed in (7.20)-(7.23). But this may be more difficult to understand.

Now to get μ in terms of the π_i, note that since μ is the long-run average number of bits between receiver replacements, it is the reciprocal of η, the long-run fraction of bits that result in replacements. For example, say we replace the receiver on average every 20 bits. Over a period of 1000 bits, then (speaking on an intuitive level) that would mean about 50 replacements, so that approximately 0.05 (50 out of 1000) of all bits result in replacements. So,

μ = 1/η   (7.24)

Again suppose k = 5. A replacement will occur only from states of the form (4,j), and even then only under the condition that the next reported bit is a 0. In other words, there are three possible ways in which replacement can occur:

(a) We are in state (4,0). Here, since the receiver has failed, the next reported bit will definitely be a 0, regardless of that bit's true value. We will then have a total of k = 5 consecutive received 0s, and therefore will replace the receiver.

(b) We are in the state (4,1), and the next bit to arrive is a true 0. It then will be reported as a 0, our fifth consecutive 0, and we will replace the receiver, as in (a).

(c) We are in the state (4,1), and the next bit to arrive is a true 1, but the receiver fails at that time, resulting in the reported value being a 0. Again we have five consecutive reported 0s, so we replace the receiver.

Therefore,

η = π₄ + π₉(0.5 + 0.5p)   (7.25)

Again, make sure you work through the full version of (7.25), using the pattern in (7.20)-(7.23). Thus

μ = 1/η = 1/(π₄ + 0.5π₉(1+p))   (7.26)

This kind of analysis could be used as the core of a cost-benefit tradeoff investigation to determine a good value of k. (Note that the π_i are functions of k, and that the above equations for the case k = 5 must be modified for other values of k.)
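For instance, continuing the sketch from Section 7.1.3.2 (again ours, with p = 0.001 an arbitrary illustrative value), ε and μ follow immediately from the computed π:

   p <- 0.001
   piv <- stuckat0pi(p)                    # stationary distribution, from the earlier sketch
   eps <- 0.5 * sum(piv[1:4])              # error rate, per (7.19)
   mu <- 1 / (piv[4] + 0.5*piv[9]*(1+p))   # mean time between replacements, per (7.26)
   c(eps, mu)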
7.1.4 Example: Shared-Memory Multiprocessor

(Adapted from Probability and Statistics, with Reliability, Queuing and Computer Science Applications, by K.S. Trivedi, Prentice-Hall, 1982 and 2002, but similar to many models in the research literature.)

7.1.4.1 The Model

Consider a shared-memory multiprocessor system with m memory modules and m CPUs. The address space is partitioned into m chunks, based on either the most-significant or least-significant log₂ m bits in the address.⁵ The CPUs will need to access the memory modules in some random way, depending on the programs they are running.

⁵You may recognize this as high-order and low-order interleaving, respectively.

To make this idea concrete, consider the Intel assembly language instruction

   add %eax, (%ebx)

which adds the contents of the EAX register to the word in memory pointed to by the EBX register. Execution of that instruction will (absent cache and other similar effects, as we will assume here and below) involve two accesses to memory: one to fetch the old value of the word pointed to by EBX, and another to store the new value. Moreover, the instruction itself must be fetched from memory. So, altogether the processing of this instruction involves three memory accesses.

Since different programs are made up of different instructions, use different register values and so on, the sequences of memory addresses generated by the CPUs are modeled as random. In our model here, the CPUs are assumed to act independently of each other, and successive requests from a given CPU are independent of each other too. A CPU will choose the ith module with probability q_i. A memory request takes one unit of time to process, though the wait may be longer due to queuing. In this very simplistic model, as soon as a CPU's memory request is fulfilled, it generates another one. On the other hand, while a CPU has one memory request pending, it does not generate another.

Let's assume a crossbar interconnect, which means there are m² separate paths from CPUs to memory modules, so that if the m CPUs have memory requests to m different memory modules, then all the requests can be fulfilled simultaneously. Also, assume as an approximation that we can ignore communication delays.

How good are these assumptions? One weakness, for instance, is that many instructions do not use memory at all, except for the instruction fetch, and as mentioned, even the latter may be suppressed due to cache effects.

Another example of potential problems with the assumptions involves the fact that many programs will have code like

   for (i = 0; i < 10000; i++) sum += x[i];

Since the elements of the array x will be stored in consecutive addresses, successive memory requests from the CPU while executing this code will not be independent. The assumption would be more justified if we were including cache effects, or (as noticed by Earl Barr) if we were studying a timesharing system with a small quantum size.

Thus, many models of systems like this have been quite complex, in order to capture the effects of things like caching and nonindependence. Nevertheless, one can often get some insight from even very simple models. In any case, for our purposes here it is best to stick to simple models, so as to understand more easily.

Our state will be an m-tuple (N₁, ..., N_m), where N_i is the number of requests currently pending at memory module i. Recalling our assumption that a CPU generates another memory request immediately after the previous one is fulfilled, we always have that N₁ + ... + N_m = m. It is straightforward to find the transition probabilities p_ij.
Here are a couple of examples, with m = 2:

• p_{(2,0),(1,1)}: Recall that state (2,0) means that currently there are two requests pending at Module 1, one being served and one in the queue, and no requests at Module 2. For the transition (2,0) → (1,1) to occur, when the request being served at Module 1 is done, that CPU must make a new request, this time for Module 2. This will occur with probability q₂. Meanwhile, the request which had been queued at Module 1 will now start service. So, p_{(2,0),(1,1)} = q₂.

• p_{(1,1),(1,1)}: In state (1,1), both pending requests will finish in this cycle. To go to (1,1) again, the two CPUs must request different modules from each other: CPUs 1 and 2 choose Modules 1 and 2, or 2 and 1. Each of those two possibilities has probability q₁q₂, so p_{(1,1),(1,1)} = 2q₁q₂.

We then solve for the π_i, using (7.7). It turns out, for example, that

π_{(1,1)} = q₁q₂ / (1 - 2q₁q₂)   (7.27)

7.1.4.2 Going Beyond Finding π

Let B denote the number of memory requests fulfilled in a given memory cycle. Then we may be interested in E(B), the number of requests completed per unit time, i.e. per cycle. We can find E(B) as follows. Let S denote the current state. Then, continuing the case m = 2, we have from the Law of Total Expectation,⁶

E(B) = E[E(B|S)]   (7.28)
     = P(S = (2,0)) E(B|S = (2,0)) + P(S = (1,1)) E(B|S = (1,1)) + P(S = (0,2)) E(B|S = (0,2))   (7.29)
     = π_{(2,0)} E(B|S = (2,0)) + π_{(1,1)} E(B|S = (1,1)) + π_{(0,2)} E(B|S = (0,2))   (7.30)

⁶Actually, we could take a more direct route in this case, noting that B can only take on the values 1 and 2. Then E(B) = P(B = 1) + 2P(B = 2) = (π_{(2,0)} + π_{(0,2)}) + 2π_{(1,1)}. But the analysis via the Law of Total Expectation extends better to the case of general m.

All this equation is doing is finding the overall mean of B by breaking down into the cases for the different states. Now if we are in state (2,0), only one request will be completed this cycle, and B will be 1. Thus E(B|S = (2,0)) = 1. Similarly, E(B|S = (1,1)) = 2 and so on. After doing all the algebra, we find that

E(B) = (1 - q₁q₂) / (1 - 2q₁q₂)   (7.31)

The maximum value of E(B) occurs when q₁ = q₂ = 1/2, in which case E(B) = 1.5. This is a lot less than the maximum capacity of the memory system, which is m = 2 requests per cycle. So, we can learn a lot even from this simple model, in this case learning that there may be a substantial underutilization of the system. This is a common theme in probabilistic modeling: Simple models may be worthwhile in terms of insight provided, even if their numerical predictions may not be too accurate.

7.1.5 Example: Slotted ALOHA

Recall the slotted ALOHA model from Chapter 1:

• Time is divided into slots, or epochs.

• There are n nodes, each of which is either idle or has a single message transmission pending. So, a node doesn't generate a new message until the old one is successfully transmitted (a very unrealistic assumption, but we're keeping things simple here).

• In the middle of each time slot, each of the idle nodes generates a message with probability q.

• If more than one node attempts to send within a given time slot, there is a collision, and each of the transmissions involved will fail.

• So, we include a backoff mechanism: Just before the end of each time slot, each node with a message will attempt to send it only with probability p, with the transmission time occupying the remainder of the slot.
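Before analyzing this process as a Markov chain, it may help to see the model in action. Here is a rough R simulation sketch (ours, not from the text; the values n = 4, q = 0.3 and p = 0.4 are arbitrary illustrative choices), which estimates the long-run distribution of the number of active nodes and the proportion of slots carrying a successful transmission:

   simaloha <- function(n, q, p, nslots) {
      active <- 0                # number of nodes with a message pending
      statecount <- rep(0, n+1)  # tallies for states 0, 1, ..., n
      nsuccess <- 0
      for (slot in 1:nslots) {
         statecount[active+1] <- statecount[active+1] + 1
         # middle of slot: each idle node generates a message with probability q
         active <- active + rbinom(1, n-active, q)
         # end of slot: each active node attempts to send with probability p;
         # exactly one attempt means a success, otherwise a collision or silence
         if (rbinom(1, active, p) == 1) {
            active <- active - 1
            nsuccess <- nsuccess + 1
         }
      }
      list(pi = statecount/nslots, throughput = nsuccess/nslots)
   }
   simaloha(4, 0.3, 0.4, 100000)

Varying p in calls like this, with n and q held fixed, already exhibits the tradeoff discussed next.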
So, p is a design parameter, which must be chosen carefully. If p is too large, we will have too many collisions, thus increasing the average time to send a message. If p is too small, a node will often refrain from sending even if no other node is there to collide with.

Define our state for any given time slot to be the number of nodes currently having a message to send at the very beginning of the time slot (before new messages are generated). Then for 0 ≤ i ≤ n and 0 ≤ j