distributions

Law of large numbers

  • Ensures that empirical averages computed from large samples are good estimates of the population mean Central limit theorem
  • Enables the assumption of normally distributed data from a large sample even if the population is not normally distributed

Normal Distribution (Gaussian)

  • A symmetrical bell shaped distribution where 68% of data fall within 1 standard deviation, 95% fall within 2 standard deviations and 99.7% fall within 3 standard deviations
  • The PDF is given by $f(x|\mu, \sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • A special case is the standard normal (also called z-distribution)
    • where the mean is 0 and standard deviation is 1, so is variance
    • denoted as $N(0, 1)$
  • z-score is a measure that describes a value’s relative position to the mean of a group of values, or how many standard deviation away from the mean it is
    • $Z=\frac{x-\mu}{\sigma}$
  • Key properties
    • a linear combination of normally distributed variables are also normally distributed
    • Central limit theorem
    • transformation - non-normal distributions can be transformed to follow normal using log transform, square root or box-cox transformation -> Feature Transformation Techniques
  • Applications
    • Many ML and statistical inference techniques assume the data follows a normal distribution

Binomial Distributions

  • A discrete probability distribution which models the number of successful outcome of independent Bernoulli trails, often used to model situations where there are exactly 2 outcomes
    • The distribution gives the probability of having $k$ successes out of $N$ trails given a success probability of $p$ and failure of $q=1-p$
    • calculated by $P(X=k)=\binom{N}{k}(p^k)(1-p)^{(n-k)}=\binom{N}{k}(p^k)(q^{(n-k)})$
      • $\binom{N}{k}$ is the binomial coefficient representing the number of ways of choosing $k$ out of $N$ trails calculated by $\frac{N!}{k!(n-k)!}$
      • factorials are the product of all positive integers equal and less than the number
      • factorial simplifications
        • 15!/3! = 5!
  • The properties of the distribution are: mean ($\mu=Np$), variance $(\sigma^2=np(1-p))$, and standard deviation ($\sigma=\sqrt{np(1-p)}$)

Poisson Distribution

  • A discrete probability distribution which models the number of times something will happen within a period
  • Assumptions
    • The occurrence rate is a known constant ($\lambda$)
    • The events occur independently from each other
  • Gives the probability of observing exactly $k$ events in interval
    • calculated by $P(X=k)=\frac{e^{-\lambda}\lambda^k}{k!}$
  • Applications
    • Queueing theory to model the number of customers arriving
    • Epidemiology to model the number of rare diseases or occurrence in a population
    • Manufacturing to estimate the number of defects in a batch of products
  • Relation to other distributions
    • The poisson distribution is a limiting case of the binomial distribution where $n$ is large and $p$ is small and $np=\lambda$ remains constant
    • The inter-event time follows an exponential distribution with parameter $\lambda$