Statistics for Economists and Intelligent Data Analysts


Designed by Professor Sasha Shapoval
University of Lodz

Bayesian Approach

Overview

In this tutorial, we discuss the Bayesian approach to estimating probabilities. Suppose we want to determine (the probability of) the state of the world based on conducted experiments. You had an opinion about the possible states of the world (that is, a probability distribution over them), then you observed some data; how should that opinion change?

First, we consider an artificial example, where the challenge is to estimate the probability of success in Bernoulli trials.

Second, we turn to a practical case: constructing a (naive) spam filter. The construction allows us to practice the Bayesian approach and enjoy a technical realization in Python.

Bernoulli trials

Formulation of the problem

A coin is flipped $N$ times; Heads ($H$) comes up in $n$ of the trials. What is the probability that Heads comes up in a single flip?

Notation: Clearly, we do not know this probability $p$ exactly. Let us call its estimate $\hat{p}$.

Naive answer: The expected value of successes in the Bernoulli trials is $Np$. Therefore, we can estimate $p$ as the fraction of successes during $N$ observations: $$ \hat{p} = \frac{n}{N} $$

  • Question: What is a potential drawback of this estimating strategy?

  • * The estimate is suspicious if the number of Bernoulli trials is small (say, $N=3$)

    $\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$ Improvement: Evidently, additional knowledge is required to improve the previous answer when there are only a few trials. First, let us change the task and look for (an estimate of) the distribution of the probability of success in the Bernoulli trials, given the observed data. In other words, we consider $p$ as a random variable $\mathbf{p}$ with an unknown distribution. $$ \begin{equation} \Prob{\mathbf{p} = t | \mathrm{data}} = \frac{ \Prob{\mathrm{data} | \mathbf{p} = t}\Prob{\mathbf{p} = t} } { \sum_{t'} \Prob{\mathrm{data} | \mathbf{p} = t'}\Prob{\mathbf{p} = t'} } \quad\quad\text{BAYES' RULE} \label{eq:bayes} \end{equation} $$ $\Prob{\mathbf{p} = t | \mathrm{data}}$ is called the a posteriori probability. The computation is based on the a priori probabilities $\Prob{\mathbf{p} = t}$ and the likelihood function $\Prob{\mathrm{data} | \mathbf{p} = t}$.

    The likelihood function is estimated from the data. The choice of the a priori probabilities is more questionable.

  • Question: Which quantities in the above equation are easier to estimate, and which are more complicated?
  • How can they be estimated?

  • * The likelihood $\Prob{\mathrm{data} | \mathbf{p} = t}$ is estimated from the data (the training stage); a small numerical sketch follows below
    * The a priori probabilities $\Prob{\mathbf{p} = t}$ are always difficult to estimate.
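    To make Bayes' rule concrete, here is a small numerical sketch (not part of the lecture code; the grid of values, the uniform prior, and the data $N=3$, $n=2$ are illustrative assumptions): we put a prior on a grid of candidate values $t$ and update it with the observed Bernoulli data.

    import numpy as np

    # Hypothetical data: n = 2 successes in N = 3 Bernoulli trials.
    N, n = 3, 2
    t = np.linspace(0.01, 0.99, 99)            # candidate values of p
    prior = np.ones_like(t) / len(t)           # a priori probabilities P{p = t}, uniform here
    likelihood = t**n * (1 - t)**(N - n)       # P{data | p = t}
    posterior = likelihood * prior
    posterior /= posterior.sum()               # denominator of Bayes' rule
    print(t[np.argmax(posterior)])             # the most probable value of p a posteriori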

    Questions aside (for review)

  • Task:
  • A family has two kids. One of them is a boy. What is the probability that the other is also a boy?
  • * Do it yourself
  • Task:
  • Can we say that there is a definite inequality between $\mathbb{P}(A | B)$ and $\mathbb{P}(A)$?
  • * No. See examples below.

    $\mathbb{P}(A | B) \lessgtr \mathbb{P}(A)$: any of $<$, $=$, $>$ is possible.

    Example 1


    Two coins are flipped; the equally likely outcomes are HH, HT, TH, TT.

    A = {HH}
    B = {HH, HT, TH} (at least one Heads)

    $\mathbb{P}(A | B) = 1/3$
    $\mathbb{P}(A) = 1/4$

    In this example, $\mathbb{P} (A | B) > \mathbb{P}(A)$

    Example 2
    The same coins
    A = {HH}
    B = {TT}
    $\mathbb{P}(A | B) = 0$
    $\mathbb{P}(A) = 1/4$

    In this example, $\mathbb{P} (A | B) < \mathbb{P}(A)$

    Example 3
    The same coins
    A = {HT, TH} (exactly one Heads)
    B = H* = {HH, HT} (the first coin is Heads)
    $\mathbb{P}(A | B) = 1/2$ (given Heads on the first coin, we are waiting for Tails on the second)
    $\mathbb{P}(A) = 2/4 = 1/2$

    In this example, $\mathbb{P} (A | B) = \mathbb{P}(A)$

  • Task*:
  • Let $X_1$ and $X_2$ be independent random variables with identical cumulative distribution function $F$. Find the cumulative distribution function of $Y = \max\{X_1, X_2\}$. The same question for $n$ random variables, and for the minimum instead of the maximum.
  • * $G(y) = \mathbb{P}\{Y < y\} = \mathbb{P}\{\max\{X_1, X_2\} < y\} = \mathbb{P}\{X_1 < y, X_2 < y\} = (F(y))^2$; for $n$ variables, $G(y) = (F(y))^n$, and for the minimum $\mathbb{P}\{\min\{X_1,\ldots,X_n\} < y\} = 1 - (1-F(y))^n$
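    As a quick sanity check (an illustrative sketch, not required by the task), one can compare the empirical distribution of $\max\{X_1, X_2\}$ for two independent uniform variables with $F(y)^2$:

    import numpy as np

    # Empirical check of G(y) = F(y)^2 for two independent uniform(0, 1) variables,
    # for which F(y) = y on [0, 1]; y = 0.7 is an arbitrary test point.
    rng = np.random.default_rng(0)
    x = rng.random((100_000, 2))
    y = 0.7
    print((x.max(axis=1) < y).mean(), y**2)    # both values are close to 0.49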

    $\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$ $\newcommand{\Exp}{\mathbf{E}}$ $\newcommand{\Beta}{\mathbf{B}}$

    Returning to the probability of success

    The probability of observing the data, which is $n$ successes in $N$ trials, given the probability $t$ of success, is (up to the binomial coefficient $\binom{N}{n}$, which cancels in Bayes' rule) $$ \begin{equation} \Prob{\mathrm{data} | \mathbf{p} = t} = t^n (1-t)^{N-n}. \label{e:prob:data} \end{equation} $$

    As we discussed, the a priori probabilities are unknown. However, we may want the a posteriori probabilities to have the same form as the priors. Looking for priors of the form $\Prob{\mathbf{p} = t} = t^{a-1} (1-t)^{b-1}\, dt$ (up to a normalizing constant), we obtain a posteriori probabilities of the same type but with different $a$ and $b$.

  • Why?
  • * Because $\displaystyle \Prob{\mathbf{p} = t | \mathrm{data}} \sim \Prob{\mathrm{data} | \mathbf{p} = t}\cdot \Prob{\mathbf{p} = t} = t^n(1-t)^{N-n} \cdot t^{a-1} (1-t)^{b-1}\, dt$
    Main assumption: Beta distribution of priors

    So, our main assumption is that the prior distribution, i.e., the unconditional distribution of $p$, has the probability density $$ \frac{1}{B(a,b)} t^{a-1} (1-t)^{b-1}. $$ The factor $1/B(a,b)$ ensures that the density integrates to $1$, as expected.

  • Question: Prove that $\displaystyle B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$, where $\displaystyle \Gamma(z) = \int_0^{+\infty} t^{z-1} e^{-t}\, dt$ is the Gamma function
  • Prove: $\displaystyle \Exp Z = \frac{a}{a+b} $, if $Z \sim \Beta(a,b)$ is the beta distribution with parameters $a$ and $b$

  • * Do it yourself
    * $\displaystyle \Exp Z = \frac{1}{B(a,b)}\int_0^1 t \cdot t^{a-1}(1-t)^{b-1}\, dt = \frac{B(a+1,b)}{B(a,b)}$, and $\displaystyle B(a+1, b) = \frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+1+b)} = \frac{a\Gamma(a)\Gamma(b)}{(a+b)\Gamma(a+b)} = \frac{a}{a+b} B(a, b)$, hence $\displaystyle \Exp Z = \frac{a}{a+b}$

    $\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$ $\newcommand{\Exp}{\mathbf{E}}$ $\newcommand{\Beta}{\mathbf{B}}$ The a posteriori distribution is proportional to $$ t^{n+a-1} (1-t)^{N-n+b-1}. $$ The previous discussion reveals the normalizing constant: $B(a+n, N-n+b)$. Finally, the a posteriori distribution is $$ \Prob{\mathbf{p} = t \,|\, \mathrm{data}} = \frac{1}{B(a+n,N-n+b)} t^{n+a-1} (1-t)^{N-n+b-1} \, dt. $$ Equivalently, we could write that $\mathbf{p} \,|\, \mathrm{data} \sim \Beta(a+n, N-n+b)$. Eventually, the probability of success given the data is obtained by integrating over all possible values $t$ of the random variable $\mathbf{p}$: $$ \Prob{\mathrm{success}\,|\, \mathrm{data}} = \int_0^1 \underbrace{\Prob{\mathrm{success} \,|\, \mathbf{p} = t}}_{t}\cdot \Prob{\mathbf{p} = t \,|\, \mathrm{data}} = \int_0^1 t \cdot \frac{1}{B(a+n,N-n+b)} t^{n+a-1} (1-t)^{N-n+b-1} \, dt = \Exp \Beta(a+n,N-n+b) = \frac{a+n}{a+b+N} $$

    Conclusion:

    The application of the Bayesian approach with beta-distributed priors changes the estimate from $$ \hat{p} = \frac{n}{N} $$ to $$ \hat{p} = \frac{a+n}{a+b+N} $$


  • Question: What happens to these estimates as the sample size increases?

  • * If $n$ and $N$ are large, the influence of $a$ and $b$ disappears (the simulation sketch below illustrates this).
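    A minimal simulation sketch (not part of the lecture code; the true probability $p = 0.3$, the prior parameters $a = b = 2$, and the sample sizes are illustrative assumptions) contrasting the two estimates:

    import numpy as np

    # Naive estimate n/N versus Bayesian estimate (a+n)/(a+b+N).
    rng = np.random.default_rng(1)
    p_true, a, b = 0.3, 2, 2
    for N in (3, 3000):
        n = rng.binomial(N, p_true)
        print(N, n / N, (a + n) / (a + b + N))
    # For N = 3 the two estimates may differ noticeably; for N = 3000 they almost coincide.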

    $\newcommand{\Exp}{\mathbf{E}}$ $\newcommand{\Var}{\mathbf{Var}}$

    Expected value and the standard deviation of the estimates

    Our estimates obtained with the naive and the Bayesian approaches depend on the sample (on the observations). As functions of the sample, they can be treated as random variables:

    $$ \hat{\mathbf{p}}_0 = \frac{X_N}{N}, \quad \hat{\mathbf{p}}_b = \frac{a+X_N}{a+b+N}, $$

    where $X_N$ is the number of successes in $N$ Bernoulli trials with the unknown probability of success $p$. Therefore,

    $$ \Exp \hat{\mathbf{p}}_0 = \frac{\Exp X_N}{N} = \frac{Np}{N} = p, \quad \Exp \hat{\mathbf{p}}_b = \frac{a+Np}{a+b+N} $$

    The expected value does not, in general, coincide with the true value in the case of the Bayesian approach.

    $$ \Var \big(\hat{\mathbf{p}}_0\big) = \frac{\Var X_N}{N^2} = \frac{p(1-p)}{N}, \quad \Var \big(\hat{\mathbf{p}}_b\big) = \frac{Np(1-p)}{(a+b+N)^2} $$

    Thus, the variance of the estimate indeed decreases when we turn to the Bayesian approach, since $(a+b+N)^2 > N^2$.

    Exercise

    Compare two strategies

    • Independently sample $2n$ elements and apply the Bayesian approach to estimate the probability of success in the Bernoulli trials
    • Sample $n$ elements, update the priors and then repeat the procedure of sampling of $n$ elements and computing the a posteriori probabilities

    Detection of spam

    Formulation of the problem

    We want to determine a characteristic of an email (does it belong to spam?) based on some data (a set of emails). The answer can be given in a probabilistic form: the probability that the email belongs to a class, given the data. This is the conditional probability we have just discussed. Denoting membership in the class of interest by $\theta^*$, we can write

    $$ \mathbb{P}\{\theta^* | \mathrm{data}\} = \frac{ \mathbb{P}\{\mathrm{data} | \theta^*\}\mathbb{P}\{\theta^*\} } { \sum_{\theta} \mathbb{P}\{\mathrm{data} | \theta\}\mathbb{P}\{\theta\} } $$

    $\mathbb{P}\{\theta^* | \mathrm{data}\}$ is called the a posteriori probability. The computation is based on a priori probabilities $\mathbb{P}\{\theta\}$ and the likelihood function $\mathbb{P}\{\mathrm{data} | \theta\}$.

    As we have already discussed, the likelihood function is estimated with the data. The choice of the a priori probabilities is questionable.

    The example and the code are taken from the book Explorations in Computing: An Introduction to Computer Science and Python Programming by John S. Conery

    $\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$

    Probability that the message is spam, based on single words
    • We simplify the problem and estimate the probability that the email belongs to spam based on single words
    • A typical message will have dozens of words, some of which suggest that the message is spam and others that point in the opposite direction. We will develop a method for combining the probabilities of individual words into a single overall probability for the entire message.

    Let $w$ be a single word of the mail. The probability $\mathbb{P}\{ \mathrm{spam} | w \}$, i.e., the probability that the message is spam given that we see the word $w$ in the message, is estimated with Bayes' rule:

    $$ \Prob{ \mathrm{spam} | w } = \frac{ \Prob{ w | \mathrm{spam} }\Prob{\mathrm{spam}} } { \Prob{ w | \mathrm{spam} } \Prob{\mathrm{spam}} + \Prob{ w | \mathrm{good} } \Prob{\mathrm{good}} } $$
    • The probabilities $\mathbb{P}\{ w | \mathrm{good} \}$ and $\mathbb{P}\{ w | \mathrm{spam} \}$ are estimated during the training.
    • A priori probabilities $\mathbb{P}\{\mathrm{spam}\}$ and $\mathbb{P}\{\mathrm{good}\}$ are not known
    • The simplest way is to assign $0.5$ to these unknown probabilities.
    • Let us denote by $\mathrm{spamcity}(w)$ the corresponding estimate of $\mathbb{P}\{ \mathrm{spam} | w \}$. Eventually, $$ \mathrm{spamcity}(w) = \frac{ \mathbb{P}\{ w | \mathrm{spam} \} } { \mathbb{P}\{ w | \mathrm{spam} \} + \mathbb{P}\{ w | \mathrm{good} \} }. $$
  • Task:
  • If $\mathrm{spamcity}(w)$ is close to $1$, the word $w$ appears in spam messages more frequently than in non-spam messages. Is it correct?
  • * Yes

    Python: Installation

    • Read here to check whether Python is in your PATH
    • In the case of Anaconda, the required variables are added to the path when it runs

    The following code defines the function spamcity():

    def spamcity(w, pbad, pgood):
        # Compute the probability that a message is spam when it contains the word w.
        # The dictionaries pbad and pgood hold P(w|spam) and P(w|good), respectively.
        if w in pbad and w in pgood:
            return pbad[w] / ( pbad[w] + pgood[w] )
        else:
            return None
    

    You can place it into a separate file antispam.py

    New words (a comment about spamcity())

    An important detail we haven’t considered yet is what to do if a word in an incoming message was never seen during training. This is clearly something we have to anticipate: new words are continually being added to the English language, and spammers like to disguise words, e.g. writing "m0ney" instead of "money." To address this problem, the spamcity() function first needs to make sure the word is defined in both dictionaries. The if statement on line 4 tests whether the word is in both dictionaries, and if so the spamicity equation is used to compute the value returned by the function. If the word is missing from one (or both) of the dictionaries, the special object None is returned as a signal that spamicity is not defined for this word.

    Training (taken from our predecessors)
    • A user classified the emails as spam / non-spam (this part of the job has been done for us, but anybody can write the appropriate code)
    • The code computed the frequency of words in the emails of both types
    • The frequencies are saved into files bad.txt and good.txt placed in the root directory

    Example

    The word secret occurs $252$ times in $1000$ spam emails and $31$ times in $600$ good emails. The corresponding frequencies are $252/1000 = 0.252$ and $31/600 \approx 0.052$. As a result,

    $P\{$occurrence of the word secret, given that the email is bad$\} = 0.252$, $P\{$occurrence of the word secret, given that the email is good$\} \approx 0.052$.

    Thus, training specifies $P\{w | \textrm{bad}\}$ and $P\{w | \textrm{good}\}$ for words $w$ from files bad.txt and good.txt.
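    With the choice $\mathbb{P}\{\mathrm{spam}\} = \mathbb{P}\{\mathrm{good}\} = 0.5$ made above, these (illustrative) frequencies give $$ \mathrm{spamcity}(\text{secret}) = \frac{0.252}{0.252 + 0.052} \approx 0.83, $$ so an email containing the word secret is much more likely to be spam.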

    Design of the code

    • Split the email into separate words (to be discussed)
    • Compute the probability that the email is spam given each word it contains (with spamcity())
    • Aggregate these probabilities into a single probability (to be discussed)
    Splitting the email into separate words
    • Split into words with string.split()
    • Care about different forms of a word
    • In the simplest version of the spam filter, just the string library is used

    Word frequency

    The plan for the word frequency function is to use a dictionary that associates strings with integers, where the strings are words found in the input file and the integers are the number of times a word has been seen. We’ll call this dictionary count, and we will initialize it as an empty dictionary:

    count = {}

    If x is a word then count[x] is its count.

    When a word appears in the file for the first time, its count has to be initialized.

    A call of the form setdefault(x,y) means "check to see if the item x is in the dictionary, and if not, insert it and give it the value y."

    If the item already is in the dictionary the method doesn’t do anything.

    Inspect the method setdefault(); for example, here
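    A short illustration of this behaviour (the word 'money' is arbitrary):

    count = {}
    count.setdefault('money', 0)   # 'money' is new, so it is inserted with value 0
    count['money'] += 1
    count.setdefault('money', 0)   # 'money' is already present, so nothing happens
    print(count)
    
    {'money': 1}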

    Dealing with different forms of a word:

    • identify different forms of the same word, e.g. university. and university (trailing punctuation), University and university (capitalization)

    s.strip('.1') returns the string obtained from s by removing the symbols '.' and '1' from the ends of the string (if any), so we use

    w.strip(punctuation).lower()

    to obtain a canonical form of the word w. The string of punctuation symbols is defined in the string library. Use

    from string import punctuation

    • Details regarding strip() method are here

    Note: stripping the punctuation string is a naive way to avoid counting different forms of the same word separately

    Function tokenize(s)

    • takes a string
    • splits it into words
    • strips the punctuation
    • converts the words to lower case

    To write this function, populate the cell with the code

    from string import punctuation
    
    def tokenize(s):
        "Return the list of words in string s"
        a = [ ]
        for x in s.split():
            a.append( x.strip(punctuation).lower() )
        return a
    

    Alternatively, one can place these lines into a separate file, say with name antispam.py (as in this tutorial)
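    A quick check (the sample string is made up; run the cell above or import tokenize from antispam first):

    print(tokenize("Send Money, m0ney!!"))
    
    ['send', 'money', 'm0ney']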

    Counting words in a file

    We extend the task of creating a dictionary of the words from the email by designing a standard function wf() (from "word frequency") that computes the frequency of each word.

    • The input is the file with the email
    • The output is the dictionary count that contains words and their frequencies
    • The structure of the code was discussed above; a possible realization, placed below, may be copied into a cell or into the file antispam.py
    def wf(fn):
        "Make a dictionary of word frequencies"
        count = { }
        for line in open(fn):
            for w in tokenize(line):
                count.setdefault(w, 0)
                count[w] += 1
        return count
    

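    For a quick check, wf() can be applied to any small text file, for example the message file msg1.txt used in the testing section below:

    counts = wf('msg1.txt')
    counts.get('money', 0)   # number of times the word 'money' occurs in the file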
    Reading probabilities from bad.txt and good.txt to dictionaries

    A function named load_probabilities will read the data from one of the training sets and return it in a dictionary object that associates a word with its probability. The two files are named good.txt and bad.txt, so to create the two dictionaries we just find the paths to the files and pass them to load_probabilities (the function is written to the file antispam.py):

    def load_probabilities(fn):
        prob = { }
        with open(fn, 'r') as f:
            for line in f:
                p, w = line.split()
                prob.update({w: (float(p))})
        return prob
    
  • Question:
  • Describe prob returned by the function load_probabilities()
  • * prob is a dictionary
    * its keys are the words
    * its values are the probabilities, technically the values from the first column of the file
  • Question:
  • Why did we not verify that the words are distinct when forming the dictionary?
  • * It is assumed that this job is performed earlier, when the files good.txt and bad.txt are created.
    Dealing with interesting words

    One approach, introduced by computer scientist Paul Graham, is to consider only the most “interesting” words. The idea is to use words that have either very high or very low spamicity, since those are the ones that give us the most information about the message. A word with a spamicity of $0.5$ is not very informative since it appears equally often in spam and good messages. Words farther away from $0.5$, either closer to $0$ or closer to $1$, will tell us more about the message.

    Reduction of the words to use

    If we know the spamicity of a word we can define the “interestingness quotient” (IQ) of the word using this formula: $$ \mathrm{IQ} = | 0.5 − s | $$ Boring words, with a spamicity around $0.5$, will have an IQ close to $0$. But words with a very high spamicity near $1.0$, or a very low spamicity close to $0$, will lead to higher IQ values.

    In order to find the interesting words in a message we can store the words in a special type of list known as a priority queue. Structurally, a priority queue is just like a list: it’s a linear container with references to other objects. What distinguishes a priority queue from a regular list is that the objects in the priority queue are always sorted. Each time we add a new item to a priority queue, the item is automatically saved in a location that preserves the order. In our spam filtering program we will store words according to how interesting they are, so that words with a high IQ value will always be toward the front of the list.

    The antispam module has a class named WordQueue that implements this behavior. The size of the queue is specified when the WordQueue object is created. For example, if we want to keep track of the 10 most interesting words in a message we should create a WordQueue with room for $10$ words:

    pq = WordQueue(10)

    To display the new WordQueue object pass it to a function named view_queue:

    view_queue(pq)

    But first some words have to be added into the queue by the method insert:

    pq.insert(word, probability)

    For example,

    pq.insert('money', 0.8)

    or

    pq.insert('money', spamcity('money', pbad, pgood))
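    The WordQueue class itself comes with the antispam module accompanying the book; purely as an illustration (a minimal sketch under the stated assumptions, not the book's implementation), such a class might look as follows:

    class WordQueue:
        "Minimal sketch: keep the `size` most interesting words, sorted by IQ = |0.5 - p|"
        def __init__(self, size):
            self.size = size
            self.items = []                      # list of (word, probability) pairs
        def insert(self, word, probability):
            # keep the list sorted by IQ and drop the least interesting word if the queue is full
            self.items.append((word, probability))
            self.items.sort(key=lambda wp: abs(0.5 - wp[1]), reverse=True)
            if len(self.items) > self.size:
                self.items.pop()
        def probs(self):
            # probabilities of the stored words, used by combined_probability() below
            return [p for _, p in self.items]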

    $\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$

    Combined probability of spam

    • Let $p_i$ be the probability that a message containing the word $w_i$ is spam; equivalently, $p_i = \mathrm{spamcity}(w_i)$
    • Let $q_i = 1 - p_i$
    • Then, processing a message, we deal with a set of words and the corresponding probabilities $p_1, \ldots, p_n$, where $n$ may be large
    • If the underlying events were independent, we could combine the probabilities related to the individual words into an overall probability that the message belongs to spam
    • The equation is $$ \Prob{\mathrm{spam}} = \frac{p_1 \cdot \ldots \cdot p_n} { p_1 \cdot \ldots \cdot p_n + q_1 \cdot \ldots \cdot q_n} $$

    The following code (placed into antispam.py) realizes the procedure:

    def combined_probability(queue):
        #queue is of class WordQueue defined above
        p = q = 1.0
        for x in queue.probs():
            p *= x
            q *= (1.0 - x)
        return p / (p + q)
    
  • Question:
  • Your decision is based on the words with the spamcities $p_1 = 0.99$, $p_2 = 0.15$, $p_3 = 0.10$. What is the probability that the message is spam?
  • * $P = \frac{0.99 \cdot 0.15 \cdot 0.10}{0.99 \cdot 0.15 \cdot 0.10 + (1 - 0.99) \cdot (1 - 0.15) \cdot (1 - 0.10)} = 0.66$
    0.99 * 0.15 * 0.10 / ( 0.99 * 0.15 * 0.10 + (1-0.99) * (1-0.15) * (1-0.10))
    
    0.6599999999999998

    Aggregating the above reasoning into the final code

    Populate a cell with the following code or place it into antispam.py

    def pspam(fn):
        "Compute the probability the message in file fn is spam"
        queue = WordQueue(15) #queue of interesting words
        pgood = load_probabilities('good.txt')
        pbad = load_probabilities('bad.txt')
        for line in open(fn):
            for word in tokenize(line):#word is lower case without punctuation
                p = spamcity(word, pbad, pgood) #Prob{spam | word}
                if p is not None:
                    queue.insert(word, p) #add the word to the queue
        return combined_probability(queue)
    

    Testing

    You are encouraged to test individual functions, as in the following cell:

    from antispam import *
    pq = WordQueue(10)
    pbad = load_probabilities('bad.txt')
    pgood = load_probabilities('good.txt')
    pq.insert('money', spamcity('money', pbad, pgood))
    print(pq)
    
    [('money', 0.8856345885634589, 0.3856345885634589)]
    
    Testing the whole code
    • Load all classes and functions from antispam.py
    from antispam import *
    

    Alternatively, if you place the above functions

    • tokenize()
    • wf()
    • spamcity()
    • load_probabilities()
    • combined_probability()

    into the cells, run these cells

    • Run pspam() with a file that contains the content of an email.
    pspam('msg1.txt')
    
    from antispam import *
    fn = 'email1.txt'
    print(pspam(fn))
    
    1.1670071921857966e-08
    
    print(pspam('msg1.txt'))
    
    0.9293048326577117
    
    print(pspam('msg2.txt'))
    
    4.400695206741519e-05
    

    Summary

    • We discussed the Bayesian approach to estimating a posteriori probabilities
    • estimated the probability of success in Bernoulli trials using a beta distribution for the priors
    • designed a simple spam filter, including
      • aggregation of (preliminary) "decisions" based on the spamicity of individual words
      • simple parsing of the text to create a dictionary of distinct words and compute their frequencies
      • application of machine learning to make the final decision; the training process, which builds the database of frequencies of individual words in spam and non-spam emails, is not performed here