In this tutorial, we discuss the Bayesian approach to estimating probabilities. Assume that we want to determine (the probability of) the state of the world from conducted experiments. You had an opinion about the possibilities (= the probability distribution) of the states of the world, then you observed some data; so how should that opinion change?
First, we consider an artificial example, where the challenge is to estimate the probability of success in Bernoulli trials.
Second, we turn to a practical case: constructing a (naive) spam filter. The construction allows us to practice the Bayesian approach and enjoy a technical realization in Python.
A coin is flipped $N$ times; Heads comes up in $n$ of the trials. What is the probability that Heads comes up in a single flip?
Notation: clearly, we do not know this probability $p$ exactly. Let's call its estimate $\hat{p}$.
Naive answer:
The expected number of successes in $N$ Bernoulli trials is $Np$. Therefore, we can estimate $p$ as the fraction of successes among the $N$ observations:
$$
\hat{p} = \frac{n}{N}
$$
$\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$ Improvement: Evidently, additional knowledge should allow us to improve the previous answer when only a few trials are available. First, let's change the task and look for (an estimate of) the distribution of the success probability in the Bernoulli trials, given the observed data. In other words, we consider $p$ as a random variable $\mathbf{p}$ with an unknown distribution. $$ \begin{equation} \Prob{\mathbf{p} = t | \mathrm{data}} = \frac{ \Prob{\mathrm{data} | \mathbf{p} = t}\Prob{\mathbf{p} = t} } { \sum_{t'} \Prob{\mathrm{data} | \mathbf{p} = t'}\Prob{\mathbf{p} = t'} } \quad\quad\text{BAYES' RULE} \label{eq:bayes} \end{equation} $$ $\Prob{\mathbf{p} = t | \mathrm{data}}$ is called the a posteriori probability. The computation is based on the a priori probabilities $\Prob{\mathbf{p} = t}$ and the likelihood function $\Prob{\mathrm{data} | \mathbf{p} = t}$.
The likelihood function is estimated from the data. The choice of the a priori probabilities is less obvious.
Depending on the events, the conditional probability $\mathbb{P}(A \mid B)$ can be greater than, equal to, or less than $\mathbb{P}(A)$, as the following examples show.
Example 1
Two coins
HH, HT, TH, TT
A = HH
B = {HH, HT, TH}
$\mathbb{P}(A | B) = 1/3$
$\mathbb{P}(A) = 1/4$
In this example, $\mathbb{P} (A | B) > \mathbb{P}(A)$
Example 2
The same coins
A = HH,
B = TT
$\mathbb{P}(A | B) = 0$
$\mathbb{P}(A) = 1/4$
In this example, $\mathbb{P} (A | B) < \mathbb{P}(A)$
Example 3
The same coins
A = {HT, TH}
B = H* = {HH, HT}
$\mathbb{P}(A | B) = 1/2$ (because, given Heads on the first coin, we need Tails on the second)
$\mathbb{P}(A) = 2/4 = 1/2$
In this example, $\mathbb{P} (A | B) = \mathbb{P}(A)$
$\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$ $\newcommand{\Exp}{\mathbf{E}}$ $\newcommand{\Beta}{\mathbf{B}}$
The probability of observing the data, that is, of observing $n$ successes in $N$ trials (in a particular order), given that the probability of success is $t$, is $$ \begin{equation} \Prob{\mathrm{data} | \mathbf{p} = t} = t^n (1-t)^{N-n}. \label{e:prob:data} \end{equation} $$
As we discussed, the a priori probabilities are unknown. However, we may want the a posteriori probabilities to have the same form as the priors. Looking for priors among probabilities of the type $\Prob{\mathbf{p} = t} = t^{a-1} (1-t)^{b-1}\, dt$ (up to a normalizing constant), we obtain a posteriori probabilities of the same type but with different $a$ and $b$.
So, our main assumption is that the prior, i.e., the unconditional distribution of $p$, has the probability density $$ \frac{1}{B(a,b)} t^{a-1} (1-t)^{b-1}. $$ The factor $1/B(a,b)$ ensures that the density integrates to $1$, as expected.
$\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$ $\newcommand{\Exp}{\mathbf{E}}$ $\newcommand{\Beta}{\mathbf{B}}$ The a posteriori distribution is proportional to $$ t^{n+a-1} (1-t)^{N-n+b-1}. $$ The previous discussion reveals the normalizing constant: $B(a+n, N-n+b)$. Finally, the a posteriori distribution is $$ \Prob{\mathbf{p} = t \,|\, \mathrm{data}} = \frac{1}{B(a+n,N-n+b)} t^{n+a-1} (1-t)^{N-n+b-1} \, dt. $$ Equivalently, we could write $\mathbf{p} \,|\, \mathrm{data} \sim \Beta(a+n, N-n+b)$. Eventually, the probability of success, given the data, is obtained by integrating over all possible values $t$ of the random variable $\mathbf{p}$: $$ \Prob{\mathrm{success}\,|\, \mathrm{data}} = \int_0^1 \underbrace{\Prob{\mathrm{success} \,|\, \mathbf{p} = t}}_{t}\cdot \Prob{\mathbf{p} = t \,|\, \mathrm{data}} = \int_0^1 t \cdot \frac{1}{B(a+n,N-n+b)} t^{n+a-1} (1-t)^{N-n+b-1} \, dt = \Exp \Beta(a+n,N-n+b) = \frac{a+n}{a+b+N} $$
Conclusion:
Applying the Bayesian approach with a beta-distributed prior changes the estimate from $$ \hat{p} = \frac{n}{N} $$ to $$ \hat{p} = \frac{a+n}{a+b+N}. $$
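As a quick numeric illustration (the counts $n$, $N$ and the prior parameters $a$, $b$ below are made up for this example), the following sketch computes both estimates and checks the posterior mean against scipy.stats.beta:

from scipy.stats import beta

# Made-up data and prior parameters, for illustration only.
N, n = 10, 7          # 7 Heads in 10 flips
a, b = 2, 2           # Beta(2, 2) prior, symmetric around 0.5

p_naive = n / N                      # frequency estimate
p_bayes = (a + n) / (a + b + N)      # posterior mean of Beta(a + n, b + N - n)

posterior = beta(a + n, b + N - n)
assert abs(posterior.mean() - p_bayes) < 1e-12

print(p_naive, p_bayes)              # 0.7 vs ~0.643: the prior pulls the estimate toward 0.5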
$\newcommand{\Exp}{\mathbf{E}}$ $\newcommand{\Var}{\mathbf{Var}}$
Our estimates obtained with the naive and the Bayesian approaches depend on the sample (on the observations). As functions of the sample, they can be treated as random variables:
$$ \hat{\mathbf{p}}_0 = \frac{X_N}{N}, \quad \hat{\mathbf{p}}_b = \frac{a+X_N}{a+b+N}, $$where $X_N$ is the number of successes in $N$ Bernoulli trials with the unknown probability of success $p$. Therefore,
$$ \Exp \hat{\mathbf{p}}_0 = \frac{\Exp X_N}{N} = \frac{Np}{N} = p, \quad \Exp \hat{\mathbf{p}}_b = \frac{a+Np}{a+b+N} $$The expected value does not, in general, coincide with the true value in the case of the Bayesian approach.
$$ \Var \big(\hat{\mathbf{p}}_0\big) = \frac{\Var X_N}{N^2} = \frac{p(1-p)}{N}, \quad \Var \big(\hat{\mathbf{p}}_b\big) = \frac{Np(1-p)}{(a+b+N)^2} $$Thus, the variance of the estimate indeed decreases when we turn to the Bayesian approach.
Compare two strategies
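A sketch of such a comparison (the true $p$, the prior parameters, and the number of simulated samples below are chosen arbitrarily): simulate many samples of $N$ Bernoulli trials and compare the bias and the mean squared error of the two estimators.

import numpy as np

rng = np.random.default_rng(0)

# Arbitrary settings for the experiment.
p_true, N = 0.3, 20
a, b = 2, 2
trials = 100_000

X = rng.binomial(N, p_true, size=trials)     # number of successes in each simulated sample
p0 = X / N                                   # naive estimates
pb = (a + X) / (a + b + N)                   # Bayesian estimates

for name, est in [("naive", p0), ("Bayes", pb)]:
    bias = est.mean() - p_true
    mse = ((est - p_true) ** 2).mean()
    print(f"{name:5s}: bias = {bias:+.4f}, MSE = {mse:.5f}")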
We are interested in a characteristic (does the email belong to spam?) based on some data (a set of emails). The answer can be given in probabilistic form: the probability that the item belongs to a class, given the data. This is the conditional probability we have just discussed. If membership in the class of interest is denoted by $\theta^*$, we can write
$$ \mathbb{P}\{\theta^* | \mathrm{data}\} = \frac{ \mathbb{P}\{\mathrm{data} | \theta^*\}\mathbb{P}\{\theta^*\} } { \sum_{\theta} \mathbb{P}\{\mathrm{data} | \theta\}\mathbb{P}\{\theta\} } $$$\mathbb{P}\{\theta^* | \mathrm{data}\}$ is called the a posteriori probability. The computation is based on a priori probabilities $\mathbb{P}\{\theta\}$ and the likelihood function $\mathbb{P}\{\mathrm{data} | \theta\}$.
As we have already discussed, the likelihood function is estimated from the data. The choice of the a priori probabilities is less obvious.
The example and the code are taken from the book Explorations in Computing: An Introduction to Computer Science and Python Programming by John S. Conery
$\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$
Let $w$ be a single word of the mail. The probability $\mathbb{P}\{ \mathrm{spam} | w \}$, which is the probability that the message is spam given that we see the word $w$ in the message, is estimated with Bayes' rule:
$$ \Prob{ \mathrm{spam} \,|\, w } = \frac{ \Prob{ w \,|\, \mathrm{spam} }\Prob{\mathrm{spam}} } { \Prob{ w \,|\, \mathrm{spam} } \Prob{\mathrm{spam}} + \Prob{ w \,|\, \mathrm{good} } \Prob{\mathrm{good}} } $$
Python: Installation
The following code defines the function spamcity(). Note that it assumes equal priors $\Prob{\mathrm{spam}} = \Prob{\mathrm{good}}$, so they cancel in Bayes' rule.
def spamcity(w, pbad, pgood):
    # Compute the probability a message is spam when it contains the word w.
    # The dictionaries pbad and pgood hold P(w|spam) and P(w|good), respectively.
    if w in pbad and w in pgood:
        return pbad[w] / (pbad[w] + pgood[w])
    else:
        return None
You can place it into a separate file antispam.py.
New words (a comment about spamcity())
An important detail we haven’t considered yet is what to do if a word in an incoming message was never seen during training. This is clearly something we have to anticipate: new words are continually being added to the English language, and spammers like to disguise words, e.g. writing "m0ney" instead of "money."
To address this problem, the spamicity function first needs to make sure the word is defined in both dictionaries. The if statement tests whether the word is in both dictionaries, and if so the spamicity equation is used to compute the value returned by the function. If the word is missing from one (or both) of the dictionaries, the special object None is returned as a signal that spamicity is not defined for this word.
Example
The word secret occurs $252$ times in $1000$ spam emails and $31$ times in $600$ good emails. The corresponding frequencies are $252/1000 = 0.252$ and $31/600 \approx 0.052$. As a result,
$\mathbb{P}\{$occurrence of the word secret given that the email is bad$\} = 0.252$,
$\mathbb{P}\{$occurrence of the word secret given that the email is good$\} \approx 0.052$.
Thus, training specifies $\mathbb{P}\{w \mid \textrm{bad}\}$ and $\mathbb{P}\{w \mid \textrm{good}\}$ for words $w$ from the files bad.txt and good.txt.
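Plugging the example numbers into spamcity() (the single-word dictionaries below exist only for this illustration):

pbad  = {'secret': 0.252}        # P(w | bad)  from the spam training set
pgood = {'secret': 31 / 600}     # P(w | good) from the good training set

print(spamcity('secret', pbad, pgood))   # about 0.83: 'secret' strongly suggests spam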
The dictionaries used by spamcity() come from counting words in the training files; the basic tool for splitting a line into words is the string method split().
Word frequency
The plan for the word frequency function is to use a dictionary that associates strings with integers, where the strings are words found in the input file and the integers are the number of times a word has been seen. We’ll call this dictionary count, and we will initialize it as an empty dictionary:
count = {}
If x is a word, then count[x] is its count.
When a word appears for the first time in the file, its count has to be initialized.
A call of the form setdefault(x, y) means "check to see if the item x is in the dictionary, and if not, insert it and give it the value y." If the item is already in the dictionary, the method doesn’t do anything.
Inspect the method setdefault(); see, for example, the Python documentation.
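A quick illustration of setdefault() on a toy dictionary:

count = {}
count.setdefault('money', 0)   # 'money' is absent, so it is inserted with value 0
count['money'] += 1
count.setdefault('money', 0)   # 'money' is already present, so nothing happens
count['money'] += 1
print(count)                   # {'money': 2}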
Dealing with different forms of a word: university vs. university, (trailing punctuation), and university vs. University (capitalization).
s.strip('.1') returns the string obtained from s by removing the symbols '.' and '1' from the ends of the string (if any), so we use w.strip(punctuation).lower() to obtain the universal form of the word w. The string with punctuation symbols is defined in the standard library; use
from string import punctuation
Details of the strip() method are in the Python documentation. Note: the punctuation string is a naive way to avoid repetitions of the same word in different forms.
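A quick check of the normalization applied to single words:

from string import punctuation

print('University,'.strip(punctuation).lower())   # 'university'
print('"m0ney!"'.strip(punctuation).lower())      # 'm0ney'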
Function tokenize(s)
To write this function, populate the cell with the code
from string import punctuation

def tokenize(s):
    "Return the list of words in string s"
    a = []
    for x in s.split():
        a.append(x.strip(punctuation).lower())
    return a
Alternatively, one can place these lines into a separate file, say with the name antispam.py (as in this tutorial).
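A small usage check (the sample sentence is made up):

print(tokenize('Send M0NEY, now!'))   # ['send', 'm0ney', 'now']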
Counting words in a file
We extend the task of creating a dictionary of the words from an email by designing a standard function wf() (from "word frequency") that computes the frequency of each word. It returns a dictionary count that contains the words and their frequencies.
def wf(fn):
    "Make a dictionary of word frequencies"
    count = {}
    for line in open(fn):
        for w in tokenize(line):
            count.setdefault(w, 0)
            count[w] += 1
    return count
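The tutorial assumes good.txt and bad.txt already contain lines of the form "probability word". If you want to produce such a file yourself from a training corpus, a minimal sketch could look like this (the helper write_probabilities() and the file corpus_bad.txt are hypothetical, not part of the book's module; the probability is estimated as count divided by the number of training messages, as in the secret example above):

def write_probabilities(corpus_fn, out_fn, n_messages):
    "Write 'probability word' lines, estimating P(w | class) as count / n_messages."
    count = wf(corpus_fn)
    with open(out_fn, 'w') as f:
        for w, c in count.items():
            f.write(f"{c / n_messages} {w}\n")

# Hypothetical usage: 1000 spam messages concatenated in corpus_bad.txt.
# write_probabilities('corpus_bad.txt', 'bad.txt', 1000)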
Reading probabilities from bad.txt and good.txt to dictionaries
A function named load_probabilities will read the data from one of the training sets and return it in a dictionary object that associates a word with its probability. The two files are named good.txt and bad.txt, so to create the two dictionaries we just find the paths to the files and pass them to load_probabilities (the function can also be placed in the file antispam.py):
def load_probabilities(fn):
    "Read 'probability word' lines from fn into a dictionary {word: probability}."
    prob = {}
    with open(fn, 'r') as f:
        for line in f:
            p, w = line.split()
            prob[w] = float(p)
    return prob
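A usage sketch, assuming good.txt and bad.txt are available in the working directory:

pgood = load_probabilities('good.txt')
pbad = load_probabilities('bad.txt')

print(spamcity('secret', pbad, pgood))   # spamicity of a single word, or None if unseen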
One approach, introduced by the computer scientist Paul Graham, is to consider only the most “interesting” words. The idea is to use words that have either very high or very low spamicity, since those are the ones that give us the most information about the message. A word with a spamicity of $0.5$ is not very informative, since it appears equally often in spam and good messages. Words farther away from $0.5$, either closer to $0$ or closer to $1$, will tell us more about the message.
If we know the spamicity $s$ of a word, we can define the “interestingness quotient” (IQ) of the word using this formula: $$ \mathrm{IQ} = | 0.5 - s | $$ Boring words, with a spamicity around $0.5$, will have an IQ close to $0$. But words with a very high spamicity near $1.0$, or a very low spamicity close to $0$, will lead to higher IQ values.
In order to find the interesting words in a message we can store the words in a special type of list known as a priority queue. Structurally, a priority queue is just like a list: it’s a linear container with references to other objects. What distinguishes a priority queue from a regular list is that the objects in the priority queue are always sorted. Each time we add a new item to a priority queue, the item is automatically saved in a location that preserves the order. In our spam filtering program we will store words according to how interesting they are, so that words with a high IQ value will always be toward the front of the list.
The antispam module has a class named WordQueue that implements this behavior. The size of the queue is specified when the WordQueue object is created. For example, if we want to keep track of the 10 most interesting words in a message we should create a WordQueue with room for $10$ words:
pq = WordQueue(10)
To display the new WordQueue object, pass it to a function named view_queue:
view_queue(pq)
But first some words have to be added to the queue with the method insert:
pq.insert(word, probability)
For example,
pq.insert('money', 0.8)
or
pq.insert('money', spamcity('money', pbad, pgood))
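If the antispam module is not at hand, a minimal stand-in with the same interface could look like the sketch below (the real WordQueue from the book may be implemented differently; only insert() and probs() are assumed here):

class SimpleWordQueue:
    "A toy priority queue keeping the `capacity` most interesting words."

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []                      # (word, spamicity) pairs, most interesting first

    def insert(self, word, prob):
        self.items.append((word, prob))
        # Sort by IQ = |0.5 - spamicity|, highest first, and keep only `capacity` items.
        self.items.sort(key=lambda item: abs(0.5 - item[1]), reverse=True)
        del self.items[self.capacity:]

    def probs(self):
        "Return the spamicities of the stored words (used by combined_probability below)."
        return [prob for _, prob in self.items]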
$\newcommand{\Prob}[1]{\mathbb{P}\{#1\}}$
Let $p_i$ denote the probability that a message containing the word $w_i$ is spam; you could also write that $p_i = \mathrm{spamcity}(w_i)$. The probabilities of the interesting words are combined as $$ \Prob{\mathrm{spam} \,|\, w_1, \dots, w_k} = \frac{p_1 p_2 \cdots p_k}{p_1 p_2 \cdots p_k + (1-p_1)(1-p_2)\cdots(1-p_k)}. $$ The following code (placed into antispam.py) realizes the procedure:
def combined_probability(queue):
    # queue is of class WordQueue defined above
    p = q = 1.0
    for x in queue.probs():
        p *= x
        q *= (1.0 - x)
    return p / (p + q)
For example, for three interesting words with spamicities $0.99$, $0.15$, and $0.10$, the combined probability is
0.99 * 0.15 * 0.10 / ( 0.99 * 0.15 * 0.10 + (1-0.99) * (1-0.15) * (1-0.10))   # = 0.66
Populate a cell with the following code or place it into antispam.py
def pspam(fn):
    "Compute the probability the message in file fn is spam"
    queue = WordQueue(15)                        # queue of interesting words
    pgood = load_probabilities('good.txt')
    pbad = load_probabilities('bad.txt')
    for line in open(fn):
        for word in tokenize(line):              # word is lower case without punctuation
            p = spamcity(word, pbad, pgood)      # Prob{spam | word}
            if p is not None:
                queue.insert(word, p)            # add the word to the queue
    return combined_probability(queue)
Testing
You are encouraged to test individual functions, as in the following cell:
from antispam import *

pbad = load_probabilities('bad.txt')
pgood = load_probabilities('good.txt')
pq = WordQueue(10)
pq.insert('money', spamcity('money', pbad, pgood))
print(pq)
To use the module, run a cell with
from antispam import *
Alternatively, if you placed the above functions
tokenize()
wf()
spamcity()
load_probabilities()
combined_probability()
into cells, run these cells. Then call pspam() with a file that contains the content of an email:
pspam('msg1.txt')
from antispam import *
fn = 'email1.txt'
print(pspam(fn))
print(pspam('msg1.txt'))
print(pspam('msg2.txt'))