Bernoulli and Binomial Distributions

In Machine Learning, we often ask "Yes/No" questions: Will a user click this ad? Is this transaction fraudulent? Does the image contain a cat? These binary outcomes are modeled using the Bernoulli and Binomial distributions.

1. The Bernoulli Distribution

A Bernoulli Distribution is the simplest discrete distribution. It represents a single trial with exactly two possible outcomes: Success (1) and Failure (0).

The Math

If $p$ is the probability of success, then $1-p$ (often denoted as $q$) is the probability of failure.

$$P(X = x) = p^x (1-p)^{1-x} \quad \text{for } x \in \{0, 1\}$$

  • Mean ($\mu$): $p$
  • Variance ($\sigma^2$): $p(1-p)$
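As a quick sanity check, here is a minimal sketch (using NumPy, with an arbitrary $p = 0.3$ chosen for illustration) that simulates many Bernoulli trials and compares the empirical mean and variance to the formulas above:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
p = 0.3  # assumed success probability, chosen for illustration

# A Bernoulli trial is a Binomial with n=1: returns 1 with probability p, else 0
samples = rng.binomial(n=1, p=p, size=100_000)

print(f"Empirical mean:     {samples.mean():.4f}  (theory: {p})")
print(f"Empirical variance: {samples.var():.4f}  (theory: {p * (1 - p):.4f})")
```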

2. The Binomial Distribution

The Binomial Distribution is the sum of $n$ independent Bernoulli trials. It tells us the probability of getting exactly $k$ successes in $n$ attempts.

The 4 Conditions (B.I.N.S.)

For a variable to follow a Binomial distribution, it must meet these criteria:

  1. Binary: Only two outcomes per trial (Success/Failure).
  2. Independent: The outcome of one trial doesn't affect the next.
  3. Number: The number of trials ($n$) is fixed in advance.
  4. Same: The probability of success ($p$) is the same for every trial.

The Formula

The Probability Mass Function (PMF) is:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Where $\binom{n}{k}$ is the "n-choose-k" combination formula: $\frac{n!}{k!(n-k)!}$.
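To make the formula concrete, here is a small sketch (with hypothetical values $n = 10$, $k = 3$, $p = 0.5$) that evaluates the PMF both directly from the formula and via `scipy.stats.binom`:

```python
from math import comb
from scipy.stats import binom

n, k, p = 10, 3, 0.5  # hypothetical values: 3 successes in 10 fair trials

# PMF computed directly from the formula above
manual = comb(n, k) * p**k * (1 - p)**(n - k)

# Same value from scipy.stats
library = binom.pmf(k, n, p)

print(f"Manual: {manual:.6f}, scipy: {library:.6f}")  # both approx. 0.117188
```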


3. Visualizing the Trials

If we have $n = 3$ trials, the possible outcomes can be visualized as a tree. The Binomial distribution simply groups these outcomes by the total number of successes, as the sketch below shows.
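A minimal sketch of the same idea in code: enumerate all $2^3 = 8$ outcomes of three trials and group them by success count, which recovers the binomial coefficients $1, 3, 3, 1$:

```python
from itertools import product
from collections import Counter

# Enumerate all 2^3 = 8 outcomes of three Bernoulli trials (1 = success)
outcomes = list(product([0, 1], repeat=3))

# Group outcomes by their total number of successes
counts = Counter(sum(outcome) for outcome in outcomes)

for k in sorted(counts):
    print(f"k = {k}: {counts[k]} outcome(s)")  # 1, 3, 3, 1 (the binomial coefficients)
```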


4. Why this matters in Machine Learning

A. Binary Classification

When you train a Logistic Regression model, you are essentially assuming your target variable follows a Bernoulli distribution. The model outputs the parameter $p$ (the probability of the positive class).
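As an illustration (with toy data invented for this example), scikit-learn's `LogisticRegression` exposes this $p$ through `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, binary labels (invented for illustration)
X = np.array([[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns [P(y=0), P(y=1)]; the second column is the Bernoulli parameter p
p_hat = model.predict_proba(np.array([[3.0]]))[:, 1]
print(f"Estimated p for x = 3.0: {p_hat[0]:.3f}")
```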

B. Evaluation (A/B Testing)

If you show an ad to 1,000 people ($n$) and 50 click it, you use the Binomial distribution to calculate the confidence interval of your click-through rate.
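One way to compute this, sketched below using the common normal approximation to the Binomial (exact methods such as Clopper-Pearson also exist):

```python
from scipy.stats import norm

n, k = 1_000, 50            # 50 clicks out of 1,000 impressions
p_hat = k / n               # observed click-through rate

# Normal approximation: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
z = norm.ppf(0.975)         # two-sided 95% confidence
margin = z * (p_hat * (1 - p_hat) / n) ** 0.5

print(f"CTR = {p_hat:.3f}, 95% CI approx. ({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```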

C. Logistic Loss (Cross-Entropy)

The "Loss Function" used in most neural networks is derived directly from the likelihood of a Bernoulli distribution. Minimizing this loss is equivalent to finding the p that best fits your binary data.

$$\text{Loss} = -\frac{1}{n} \sum \left[ y \log(p) + (1-y) \log(1-p) \right]$$
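A direct NumPy implementation of this formula (the clipping constant `eps` is an assumption added here to avoid taking `log(0)`):

```python
import numpy as np

def bernoulli_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # keep probabilities away from 0 and 1
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
print(f"Loss: {bernoulli_cross_entropy(y, p):.4f}")
```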

5. Summary Table

| Feature | Bernoulli | Binomial |
| --- | --- | --- |
| Number of trials | $1$ | $n$ |
| Outcomes | $0$ or $1$ | $0, 1, 2, \dots, n$ |
| Mean | $p$ | $np$ |
| Variance | $p(1-p)$ | $np(1-p)$ |

The Binomial distribution counts successes over a fixed number of trials. But what if we are counting the number of events happening over a fixed interval of time or space? For that, we turn to the Poisson distribution.