Statistics for Economists and Intelligent Data Analysts


Designed by Professor Sasha Shapoval
University of Lodz

Goodness of fit and the Pearson test

Numerical example and theoretical background

  • $5$ events are observed, whereas the background event rate is $\nu_b = 0.5$
  • Is a new type of event observed?

$n_{\text{obs}} = 5$ $$ \mathsf{Prob}\{ n \ge n_{\text{obs}} \} = 1-\sum_{n=0}^{n_{\text{obs}}-1} f(n;0,\nu_b) = 1 - \sum_{n=0}^{n_{\text{obs}}-1} \frac{\nu_b^n}{n!}e^{-\nu_b} $$

import numpy as np

def prob_greater_poisson(n, p):
    """Return P{X >= n} for a Poisson random variable X with mean p."""
    #accumulate P{X <= n-1} term by term, updating p**i and i! incrementally
    s = 0.0
    power_cur = 1.0      #current value of p**i
    factorial_cur = 1.0  #current value of i!
    exp_val = np.exp(-p)
    for i in range(n):
        s += power_cur / factorial_cur * exp_val
        power_cur *= p
        factorial_cur *= i + 1
    return 1 - s
ss = prob_greater_poisson(5, 0.5)
print('Probability to observe these or more extreme data, p-value:', ss)
Probability to observe these or more extreme data, p-value: 0.00017211562995578245
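The same tail probability can be cross-checked with SciPy (a sketch added for verification; `poisson.sf(k, mu)` returns $P\{n > k\}$, so we pass $n_{\text{obs}} - 1$):

from scipy.stats import poisson
#P{n >= 5} = P{n > 4}; should reproduce prob_greater_poisson(5, 0.5)
print(poisson.sf(5 - 1, 0.5))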

Possible interpretation of the above computation:
the probability of obtaining 5 or more events equals $1.7 \cdot 10^{-4}$ under the hypothesis that they are generated by the background process alone

A more detailed investigation: the plus-minus interval

  • the variance of a Poisson random variable equals its mean: $\sigma^2=\mathbf{Var}X = \mathbf{E}X$
  • our estimate of the total rate $\nu = \nu_s + \nu_b$ is $\hat{\nu} = n_{\text{obs}} = 5$
  • adding $\sigma$ as the plus/minus component, we get $\hat{\nu} = 5 \pm \sqrt{5}$
  • the same for the signal event rate: $\hat{\nu}_s = \hat{\nu} - \nu_b = 4.5 \pm 2.3$
  • the interval $\hat{\nu}_s = 4.5 \pm 4.6$ with the doubled plus/minus component includes $0$, which is misleading once we recall that the $p$-value equals $1.7\cdot 10^{-4}$

  • Question:
    How can we resolve this contradiction between the argument based on the p-value and the interval estimate?
  • * The interval estimate involves the probability that a Poisson random variable with mean $\nu=5$ fluctuates downward to $\nu_b = 0.5$ or lower. That interpretation is incorrect * On the contrary, the `p-value` indicates that a Poisson random variable with mean $\nu_b = 0.5$ fluctuates upward to $n_{\text{obs}} = 5$ or higher (see the sketch after this list)
  • The practice of displaying measured values of Poisson variables with an error bar given by the square root of the observed value unfortunately encourages the incorrect interpretation
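To make the contrast concrete, a minimal sketch (added here for illustration, reusing SciPy's Poisson distribution) compares the two probabilities:

from scipy.stats import poisson
#downward: a Poisson variable with mean 5 fluctuates all the way down to 0 events
print('P{n = 0 | nu = 5} =', poisson.pmf(0, 5))      #exp(-5), about 6.7e-3
#upward: a Poisson variable with mean 0.5 fluctuates to 5 or more events (the p-value)
print('P{n >= 5 | nu = 0.5} =', poisson.sf(4, 0.5))  #about 1.7e-4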

Errors in the parameter estimates

  • Assume that the background rate is $0.8$ instead of $0.5$. Then the p-value is
ss = prob_greater_poisson(5, 0.8)
print('Probability to observe these or more extreme data, p-value:', ss)
print('New_probability / Old_probability = ', 
      prob_greater_poisson(5, 0.8) / prob_greater_poisson(5, 0.5))
Probability to observe these or more extreme data, p-value: 0.0014113101458866906
New_probability / Old_probability =  8.199779103439152
  • Thus, the new probability is approximately $8$ times larger
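Since the p-value reacts strongly to the assumed background rate, a short sketch (an addition, reusing the function defined above) scans a few values of $\nu_b$:

#sensitivity of the p-value to the assumed background rate
for nu_b in [0.3, 0.5, 0.8, 1.0]:
    print('nu_b =', nu_b, '; p-value =', prob_greater_poisson(5, nu_b))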

Deviation of the binned data from the background

x_back = [i+0.5 for i in range(20)]
v_back = [1.50, 1.55, 1.75, 1.85, 1.95, 1.94, 1.87, 1.76, 1.70, 1.68, 1.65, 1.64, 1.60, 1.58,
         1.55, 1.20, 1.0, 0.50, 0.22, 0.10]
x_obs = [i+0.5 for i in range(20)]
v_obs = [1, 2, 2, 0, 3, 1, 1, 2, 5, 6, 1, 0, 1, 2, 1, 3, 1, 0, 1, 0]
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4))
ax = plt.subplot(1,1,1)
ax.plot(x_back, v_back, label='background', linestyle='--')
ax.plot(x_obs, v_obs, label='observations', marker='o', markersize=6)
ax.set_ylabel('Frequency')
ax.legend()
plt.show()
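As a quick numerical summary of the figure (an added check; these totals reappear later when $n_{\text{tot}}$ is treated as fixed):

#total observed count versus the total expected from the background
print('n_tot =', sum(v_obs), '; nu_tot = {:.2f}'.format(sum(v_back)))
#n_tot = 33 ; nu_tot = 28.59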

Outliers

  • Are the values $n(8.5) = 5$ and $n(9.5) = 6$ outliers that are inconsistent with the background level?
print('Background rate when the observed values seem to be extreme', 
      v_back[8] + v_back[9])
Background rate when the observed values seem to be extreme 3.38

Attempt to use the previous approach

  • To what extent are $5+6=11$ events extreme when the background rate is $3.38$?
  • The above formula for the p-value returns
print('Probability {n >= 11 | rate = 3.38} = ', prob_greater_poisson(11, 3.38))
Probability {n >= 11 | rate = 3.38} =  0.0007729287726739864
  • What is the conclusion? Outliers?

Drawback

  • The location of the peak is chosen a posteriori. The computed probability refers to a peak at the specific place chosen after the experiment is conducted. It does not take into account that the other data are in better agreement with the background
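One way to quantify this effect is to ask how often the background alone produces a combined count of $11$ or more in some pair of adjacent bins, anywhere in the histogram. A minimal Monte Carlo sketch (added here; not part of the original computation):

from scipy.stats import poisson
n_trials = 10000
count = 0
for _ in range(n_trials):
    sample = poisson.rvs(v_back)                              #one background-only histogram
    pair_sums = [sample[i] + sample[i+1] for i in range(len(sample)-1)]
    if max(pair_sums) >= 11:                                  #a 'peak' somewhere in the data
        count += 1
print('P{some adjacent pair >= 11 | background} ~', count / n_trials)

The resulting probability is larger than the single-location value $7.7\cdot 10^{-4}$ computed above, which illustrates the penalty for selecting the peak location after seeing the data.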

Pearson's $\chi^2$ test

  • Statistic: $\displaystyle\chi^2 = \sum_{i=1}^N \frac{(n_i - \nu_i)^2}{\nu_i}$
    If the data $\mathbf{n} = (n_1,\ldots,n_N)$ are Poisson distributed with mean values $\mathbf{\nu} = (\nu_1,\ldots,\nu_N)$ and the number of entries in each bin is not too small, then
  • the statistic $\chi^2$ follows the $\chi^2$ distribution with $N$ degrees of freedom. This holds regardless of the distribution of the variable $\mathbf{x}$
  • a larger $\chi^2$ corresponds to a larger discrepancy between the data and the hypothesis
  • In our example,
#p[] is the background process; n[] corresponds to observations
def chisquare_simple(n, p):
    #check for None first, before len() is called on the arguments
    if n is None or p is None or len(n) != len(p) or len(n) == 0:
        return None
    chisquare = 0
    for i in range(len(n)):
        chisquare += (n[i] - p[i])**2 / p[i]
    return chisquare
print(chisquare_simple(v_obs, v_back))
29.6491370705683
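The same value can be cross-checked with a vectorized NumPy expression (an equivalent reformulation, not a different test; note that `scipy.stats.chisquare` is not a drop-in replacement here, since it requires the observed and expected totals to agree):

import numpy as np
n_arr, nu_arr = np.array(v_obs), np.array(v_back)
print(np.sum((n_arr - nu_arr)**2 / nu_arr))   #same value as chisquare_simple above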
from scipy.stats import chi2
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
dfs = [10, 20, 40] #degrees of freedom
for df in dfs:
    x = np.linspace(chi2.ppf(0.001, df), chi2.ppf(0.999, df), 1000)
    ax.plot(x, chi2.pdf(x, df), '-', lw=5, alpha=0.6, label=f'df = {df}')
plt.title(r'$\chi^2$ pdf') #raw string avoids the invalid \c escape
plt.legend()
plt.show()

Returning to our example

df = len(v_obs)
chi2.ppf(0.01, df), chi2.ppf(0.99, df) #print quantiles
(8.2603983325464, 37.56623478662507)
fig, ax = plt.subplots(1, 1)
dfr = 20 #degrees of freedom
x = np.linspace(chi2.ppf(0.01, dfr), chi2.ppf(0.99, dfr), 100)
ax.plot(x, chi2.pdf(x, dfr), 'r-', lw=5, alpha=0.6, label='chi2 pdf')
level = 0.01
quantile1 = chi2.ppf(level, dfr)
quantile2 = chi2.ppf(1-level, dfr)
ax.plot([quantile1, quantile1], [min(chi2.pdf(x, dfr)), max(chi2.pdf(x, dfr))], 'b--', lw=1)
ax.plot([quantile2, quantile2], [min(chi2.pdf(x, dfr)), max(chi2.pdf(x, dfr))], 'b--', lw=1)
ax.set_title('$\\chi^2$ probability density and quantiles')
plt.show()
df = len(v_obs)
alph = [0.001, 0.005, 0.01]
stat = chisquare_simple(v_obs, v_back)
print('statistics', stat)
for a in alph:
    print('significance level = ', a, '; cut-off = ', chi2.ppf(1-a, df), '; chi-2 > cut-off',
         stat > chi2.ppf(1-a, df))
statistics 29.6491370705683
significance level =  0.001 ; cut-off =  45.31474661812586 ; chi-2 > cut-off False
significance level =  0.005 ; cut-off =  39.99684631293865 ; chi-2 > cut-off False
significance level =  0.01 ; cut-off =  37.56623478662507 ; chi-2 > cut-off False
  • Question:
    Can we reject the hypothesis that just the background process is observed?
  • * No, at the chosen significance levels
  • Question:
    What p-value corresponds to our observations?
  • * Type 1-chi2.cdf(stat, df)
    1-chi2.cdf(stat, df)
    
    0.07574188362914624
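Equivalently, and numerically more robust in the far tail, one can use the survival function:

from scipy.stats import chi2
print(chi2.sf(stat, df))   #the same value as 1 - chi2.cdf(stat, df)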

More realistic assumptions

  • We assumed that each $n_i$ is a Poisson random variable and the sum $n_{\text{tot}} = \sum_{i=1}^N n_i$ is itself a Poisson variable with a predicted mean value $\nu_{\text{tot}} = \sum_{i=1}^N \nu_i$
  • New set of assumptions: $n_{\text{tot}}$ is fixed (from observations) and the $n_i$ are multinomially distributed with $p_i = \nu_i / n_{\text{tot}}$
  • The statistic is then $\displaystyle\chi^2 = \sum_{i=1}^N \frac{(n_i - p_i n_{\text{tot}})^2}{p_i n_{\text{tot}}}$
  • As a result, the statistic $\chi^2$ is still asymptotically $\chi^2$ distributed, but the number of degrees of freedom is $N-1$
  • Nothing changes in the statistic in our example; only the number of degrees of freedom is reduced by one
  • The standard computation (now with $N-1$ degrees of freedom) is repeated below

import numpy as np
from scipy.stats import chi2
df_mod = len(v_obs) - 1
alph = [0.001, 0.005, 0.01, 0.1]
stat = chisquare_simple(v_obs, v_back)
print('statistics = ', stat, 'p-value = ', '{:.3f}'.format(1-chi2.cdf(stat, df_mod)))
for a in alph:
    print('significance level = ', a, '; cut-off = {:.3f}'.format(chi2.ppf(1-a, df_mod)), '; chi-2 > cut-off',
         stat > chi2.ppf(1-a, df_mod))
print(df_mod)

statistics =  29.6491370705683 p-value =  0.056
significance level =  0.001 ; cut-off = 43.820 ; chi-2 > cut-off False
significance level =  0.005 ; cut-off = 38.582 ; chi-2 > cut-off False
significance level =  0.01 ; cut-off = 36.191 ; chi-2 > cut-off False
significance level =  0.1 ; cut-off = 27.204 ; chi-2 > cut-off True
19


Drawback: small number of entries in the bins

We must abandon the theoretical distribution and construct the distribution of the statistic with the Monte Carlo method.

The following code generates $N = 1000$ samples of the $\chi^2$ statistic and sorts the resulting array

from scipy.stats import poisson
N = 1000 #number of repetitions in the Monte Carlo simulation
stat_mc = []
for _ in range(N):
    v_mc = [poisson.rvs(v) for v in v_back]          #one background-only histogram
    stat_mc.append(chisquare_simple(v_mc, v_back))   #its chi-square statistic

stat_mc = sorted(stat_mc)

import bisect
alph = [0.001, 0.005, 0.01, 0.1] #must be between 0 and 1
stat = chisquare_simple(v_obs, v_back)
print('statistics = ', stat, 'p-value = {:.3f}'.format(1-bisect.bisect_left(stat_mc, stat) / len(stat_mc)))
for a in alph:
    cutoff = stat_mc[int(round((1-a)*N))]
    print('significance level = ', a, '; cut-off = {:.3f}'.format(cutoff), '; statistics > cut-off', stat > cutoff)

statistics =  29.6491370705683 p-value = 0.104
significance level =  0.001 ; cut-off = 90.917 ; statistics > cut-off False
significance level =  0.005 ; cut-off = 51.150 ; statistics > cut-off False
significance level =  0.01 ; cut-off = 46.912 ; statistics > cut-off False
significance level =  0.1 ; cut-off = 29.869 ; statistics > cut-off False

We observe an increase of the $p$-value from $0.056$ to $0.104$
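The Monte Carlo p-value itself carries a statistical error; a quick sketch (an added estimate, treating the number of exceedances as binomial) shows its scale:

import numpy as np
p_mc = 0.104                          #the Monte Carlo p-value obtained above
se = np.sqrt(p_mc * (1 - p_mc) / N)   #binomial standard error with N = 1000 repetitions
print('p-value = {:.3f} +/- {:.3f}'.format(p_mc, se))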
