Content

Binomial random variables

Define \(X\) to be the number of successes in \(n\) independent Bernoulli trials, each with probability of success \(p\). We now derive the probability function of \(X\). First we note that \(X\) can take the values \(0,1,2,\dots,n\). Hence, there are \(n+1\) possible discrete values for \(X\). We are interested in the probability that \(X\) takes a particular value \(x\). If there are \(x\) successes observed, then there must be \(n-x\) failures observed. One way for this to happen is for the first \(x\) Bernoulli trials to be successes, and the remaining \(n-x\) to be failures. The probability of this is given by \[ \underbrace{p \times p \times \dots \times p}_{x \text{ of these}} \times \underbrace{(1-p) \times (1-p) \times \dots \times (1-p)}_{n-x \text{ of these}} = p^x (1-p)^{n-x}. \]

However, this is only one of the many ways that we can obtain \(x\) successes and \(n-x\) failures. They could occur in exactly the opposite order: all \(n-x\) failures first, then \(x\) successes. This outcome, which has the same number of successes as the first outcome but with a different arrangement, also has probability \(p^x (1-p)^{n-x}\). There are many other ways to arrange \(x\) successes and \(n-x\) failures among the \(n\) possible positions. Once we position the \(x\) successes, the positions of the \(n-x\) failures are inevitable.

The number of ways of arranging \(x\) successes in \(n\) trials is equal to the number of ways of choosing \(x\) objects from \(n\) objects, which is equal to \[ \dbinom{n}{x} = \dfrac{n!}{x!(n-x)!} = \dfrac{n \times (n-1) \times (n-2) \times \dots \times (n-x+2) \times (n-x+1)}{x \times (x-1) \times (x-2) \times \dots \times 2 \times 1}. \] (This is discussed in the module The binomial theorem.) Recall that \(0!\) is defined to be 1, and that \(\binom{n}{n} = \binom{n}{0} = 1\). We can now determine the total probability of obtaining \(x\) successes. Each outcome with \(x\) successes and \(n-x\) failures has individual probability \(p^x (1-p)^{n-x}\), and there are \(\binom{n}{x}\) possible arrangements. Hence, the total probability of \(x\) successes in \(n\) independent Bernoulli trials is \[ \dbinom{n}{x}\, p^x (1-p)^{n-x}, \quad \text{for } x = 0,1,2,\dots,n. \]

This is the binomial distribution, and it is arguably the most important discrete distribution of all, because of the range of its application.

If \(X\) is the number of successes in \(n\) independent Bernoulli trials, each with probability of success \(p\), then the probability function \(p_X(x)\) of \(X\) is given by \[ p_X(x) = \Pr(X=x) = \dbinom{n}{x}\, p^x (1-p)^{n-x}, \quad \text{for } x = 0,1,2,\dots,n, \] and \(X\) is a binomial random variable with parameters \(n\) and \(p\). We write \(X \stackrel{\mathrm{d}}{=} \mathrm{Bi}(n,p)\). Notes.

Example: Multiple-choice test

A teacher sets a short multiple-choice test for her students. The test consists of five questions, each with four choices. For each question, she uses a random number generator to assign the letters A, B, C, D to the choices. So for each question, the chance that any letter corresponds to the correct answer is equal to \(\dfrac{1}{4}\), and the assignments of the letters are independent for different questions. One of her students, Barry, has not studied at all for the test. He decides that he might as well guess 'A' for each question, since there is no penalty for incorrect answers.

Let \(X\) be the number of questions that Barry gets right. His chance of getting a question right is just the chance that the letter 'A' was assigned to the correct answer, which is \(\dfrac{1}{4}\). There are five questions. The outcome for each question can be regarded as 'correct' or 'not correct', and hence as one of two possible outcomes. Each question is independent. We therefore have the structure of a sequence of independent Bernoulli trials, each with probability of success (a correct answer) equal to \(\dfrac{1}{4}\). So \(X\) has a binomial distribution with parameters 5 and \(\dfrac{1}{4}\), that is, \(X \stackrel{\mathrm{d}}{=} \mathrm{Bi}(5,\frac{1}{4})\). The probability function of \(X\) is shown in figure 1.

Graph showing Binomial Distribution for n = 5 and p = one quarter.
Detailed description

Figure 1: The probability function \(p_X(x)\) for \(X \stackrel{\mathrm{d}}{=} \mathrm{Bi}(5,\frac{1}{4})\).

We can see from the graph that Barry's chance of getting all five questions correct is very small; it is just visible on the graph as a very small value. On the other hand, the chance that he gets none correct (all wrong) is about 0.24, and the chance that he gets one correct is almost 0.4. What are these probabilities, more precisely? The probability function of \(X\) is given by \[ p_X(x) = \dbinom{5}{x}\, \Bigl(\dfrac{1}{4}\Bigr)^x \Bigl(\dfrac{3}{4}\Bigr)^{n-x}, \quad \text{for } x = 0,1,2,3,4,5. \] So \(p_X(5) = \bigl(\dfrac{1}{4}\bigr)^5 \approx 0.0010\); Barry's chance of 'full marks' is one in a thousand. On the other hand, \(p_X(0) = \bigl(\dfrac{3}{4}\bigr)^5 \approx 0.2373\) and \(p_X(1) = 5 \times \dfrac{1}{4} \times \bigl(\dfrac{3}{4}\bigr)^4 \approx 0.3955\). Completing the distribution, we find that \(p_X(2) \approx 0.2637\), \(p_X(3) \approx 0.0879\) and \(p_X(4) \approx 0.0146\).
In general, it is easier to calculate binomial probabilities using technology. Microsoft Excel provides a function for this, called \(\sf \text{BINOM.DIST}\). To use this function to calculate \(p_X(x)\) for \(X \stackrel{\mathrm{d}}{=} \mathrm{Bi}(n,p)\), four arguments are required: \(x,\ n,\ p,\ \sf \text{FALSE}\). For example, to find the value of \(p_X(2)\) for \(X \stackrel{\mathrm{d}}{=} \mathrm{Bi}(5,\frac{1}{4})\), enter the formula \[ \sf \text{=BINOM.DIST(2, 5, 0.25, FALSE)} \]

in a cell; this should produce the result 0.2637 (to four decimal places). The somewhat off-putting argument \(\sf \text{FALSE}\) is required to get the actual probability function \(p_X(x)\). If \(\sf \text{TRUE}\) is used instead, then the result is the cumulative probability \(\sum_{k=0}^{x} p_X(k)\), which we do not consider here.

Exercise 1

Casey buys a Venus chocolate bar every day during a promotion that promises 'one in six wrappers is a winner'. Assume that the conditions of the binomial distribution apply: the outcomes for Casey's purchases are independent, and the population of Venus chocolate bars is effectively infinite.

  1. What is the distribution of the number of winning wrappers in seven days?
  2. Find the probability that Casey gets no winning wrappers in seven days.
  3. Casey gets no winning wrappers for the first six days of the week. What is the chance that he will get a winning wrapper on the seventh day?
  4. Casey buys a bar every day for six weeks. Find the probability that he gets at least three winning wrappers.
  5. How many days of purchases are required so that Casey's chance of getting at least one winning wrapper is 0.95 or greater?
Exercise 2

Which of the following situations are suitable for modelling with the binomial distribution? For those which are not, explain what the problem is.

  1. A normal six-sided die is rolled 15 times, and the number of fours obtained is observed.
  2. For a period of two weeks in a particular city, the number of days on which rain occurs is recorded.
  3. There are five prizes in a raffle of 100 tickets. Each ticket is either blue, green, yellow or pink, and there are 25 tickets of each colour. The number of blue tickets among the five winning tickets drawn in the raffle is recorded.
  4. Twenty classes at primary schools are chosen at random, and all the students in these classes are surveyed. The total number of students who like their teacher is recorded.
  5. Genetic testing is performed on a random sample of 50 Australian women. The number of women in the sample with a gene associated with breast cancer is recorded.
Exercise 3

Ecologists are studying the distributions of plant species in a large forest. They do this by choosing \(n\) quadrats at random (a quadrat is a small rectangular area of land), and examining in detail the species found in each of these quadrats. Suppose that a particular plant species is present in a proportion \(k\) of all possible quadrats, and distributed randomly throughout the forest. How large should the sample size \(n\) be, in order to detect the presence of the species in the forest with probability at least 0.9?

It is useful to get a general picture of the binomial distribution. figure 2 shows nine binomial distributions, each with \(n=20\); the values of \(p\) range from 0.01 to 0.99.

Nine binomial distribution graphs each with n = 20 and p = 0.1, 0.5, 0.2, 0.35, 0.5, 0.65, 0.8, 0,95 and 0.99.
Detailed description

Figure 2: Nine binomial distributions with parameters \(n=20\) and \(p\) as labelled.

Interactive 1 CDF of interactive 1Interactive 2 CDF of interactive 2

A number of features of this figure are worth noting:

Next page - Content - Mean and variance