Examples:
Note, we can test a lot of hypotheses about a lot of population parameters, e.g.
We will focus on hypotheses about the population regression slope (β1), i.e. the causal effect† of X on Y
† With a model this simple, it's almost certainly not causal, but this is the ultimate direction we are heading...
Null hypothesis assigns a value (or a range) to a population parameter
Alternative hypothesis must mathematically contradict the null hypothesis
A null hypothesis, H0
An alternative hypothesis, Ha
A test statistic to determine if we reject H0 when the statistic reaches a "critical value"
A conclusion whether or not to reject H0 in favor of Ha
Sample statistic (^β1) will rarely be exactly equal to the hypothesized parameter (β1)
Difference between observed statistic and true parameter could be because:
Parameter is not the hypothesized value
Parameter is truly hypothesized value but sampling variability gave us a different estimate
We cannot distinguish between these two possibilities with any certainty
Type I error (false positive): rejecting H0 when it is in fact true
Type II error (false negative): failing to reject H0 when it is in fact false
|  |  | **Truth** |  |
|---|---|---|---|
|  |  | Null is True | Null is False |
| **Judgment** | Reject Null | TYPE I ERROR (False +) | CORRECT (True +) |
|  | Don't Reject Null | CORRECT (True -) | TYPE II ERROR (False -) |
|  |  | **Truth** |  |
|---|---|---|---|
|  |  | Defendant is Innocent | Defendant is Guilty |
| **Judgment** | Convict | TYPE I ERROR (False +) | CORRECT (True +) |
|  | Acquit | CORRECT (True -) | TYPE II ERROR (False -) |
William Blackstone
(1723-1780)
"It is better that ten guilty persons escape than that one innocent suffer."
Blackstone, William, 1765-1770, Commentaries on the Laws of England
The probability of a Type I error is defined as α:

α = P(Reject H0 | H0 is true)
The confidence level is defined as (1−α)
The probability of a Type II error is defined as β:
β=P(Don't reject H0|H0 is false)
|  |  | **Truth** |  |
|---|---|---|---|
|  |  | Null is True | Null is False |
| **Judgment** | Reject Null | TYPE I ERROR (α) | CORRECT (1−β) |
|  | Don't Reject Null | CORRECT (1−α) | TYPE II ERROR (β) |
Power=1−β=P(Reject H0|H0 is false)
p(δ≥δi|H0 is true)
After running our test, we need to make a decision between the competing hypotheses
Compare p-value with pre-determined α (commonly, α=0.05, 95% confidence level)
If p<α: statistically significant evidence sufficient to reject H0 in favor of Ha
If p≥α: insufficient evidence to reject H0
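As a minimal sketch of this decision rule in R (the p-value here is a made-up number, purely for illustration):

```r
# Hypothetical p-value from some hypothesis test (made up for illustration)
p_value <- 0.003
# Pre-determined significance level (95% confidence level)
alpha <- 0.05

decision <- if (p_value < alpha) {
  "Reject H0 in favor of Ha"          # statistically significant evidence
} else {
  "Insufficient evidence to reject H0"
}
decision
```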
Sir Ronald A. Fisher
(1890—1962)
"The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis."
Fisher, Ronald A., 1935, The Design of Experiments
Modern philosophy of science is largely based on hypothesis testing and falsifiability, which form the "Scientific Method"†
For something to be "scientific", it must be falsifiable, or at least testable
Hypotheses can be corroborated by evidence, but remain tentative until falsified by data suggesting an alternative hypothesis
"All swans are white" is a hypothesis rejected upon discovery of a single black swan
A rigorous statistics course (ECMG 212 or MATH 112) will spend weeks going through different types of tests:
See today's class notes page for more
An R package called infer
Calculate a statistic, δi†, from a sample of data
Simulate a world where δ is null (H0)
Examine the distribution of δ across the null world
Calculate the probability that δi could exist in the null world
Decide if δi is statistically significant
† δ can stand in for any test-statistic in any hypothesis test! For our purposes, δ is the slope of our regression sample, ˆβ1.
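The five steps above can be sketched by hand in base R. This is a hedged illustration on toy data (x, y, and all values here are invented, not the CASchool data):

```r
# A sketch of the five simulation steps using base R and toy data
set.seed(42)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2)

# 1. Calculate a statistic (our delta_i) from the sample: the OLS slope
delta_i <- coef(lm(y ~ x))[2]

# 2.-3. Simulate the null world (no relationship between x and y) by
#       shuffling y, and examine the distribution of slopes there
null_slopes <- replicate(1000, coef(lm(sample(y) ~ x))[2])

# 4. Probability that a slope as extreme as ours arises in the null world
p_value <- mean(abs(null_slopes) >= abs(delta_i))

# 5. Decide if delta_i is statistically significant at alpha = 0.05
p_value < 0.05
```

Shuffling y is exactly the "permutation" that infer's generate(type = "permute") automates below.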
Our hypothesis test, which lm() runs for us automatically:

H0: β1 = 0
H1: β1 ≠ 0

infer allows you to run through these steps manually to understand the process:

specify() a model
hypothesize() the null
generate() simulations of the null world
calculate() the p-value
visualize() with a histogram (optional)
Test statistic (δ): measures how far what we observed in our sample (^β1) is from what we would expect if the null hypothesis were true (β1=0)
Rejection region: if the test statistic reaches a "critical value" of δ, then we reject the null hypothesis
† Again, see today's class notes for more on the t-distribution. k is the number of independent variables our model has, in this case, with just one X, k=1. We use two degrees of freedom to calculate ^β0 and ^β1, hence we have n−2 df.
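A hedged sketch in R of where the critical value comes from, using n = 420 and k = 1 from the class size regression:

```r
# Critical value t* for a two-sided test at alpha = 0.05,
# from a t-distribution with n - k - 1 = 418 degrees of freedom
n <- 420
k <- 1
t_star <- qt(1 - 0.05 / 2, df = n - k - 1)
t_star  # close to the normal's 1.96, since df is large
```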
Our world, and a world where β1=0 by assumption.
| term | estimate |
|---|---|
| (Intercept) | 698.932952 |
| str | -2.279808 |
| term | estimate |
|---|---|
| (Intercept) | 647.8027952 |
| str | 0.3235038 |
# save as obs_slope
sample_slope <- school_reg_tidy %>% # this is the regression tidied with broom
  filter(term == "str") %>%
  pull(estimate)

# confirm what it is
sample_slope
## [1] -2.279808
data %>% specify(y ~ x)
The specify() function is essentially a lm() function for regression (for our purposes):

CASchool %>%
  specify(testscr ~ str)
| testscr | str |
|---|---|
| 690.8 | 17.88991 |
| 661.2 | 21.52466 |
| 643.6 | 18.69723 |
%>% hypothesize(null = "independence")
In infer's language, we are hypothesizing that str and testscr are independent (β1 = 0)†:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence")
| testscr | str |
|---|---|
| 690.8 | 17.88991 |
| 661.2 | 21.52466 |
| 643.6 | 18.69723 |
† type can be either "point" (for specific point estimates for a single variable, such as a sample mean, $\bar{x}$) or "independence" (for hypotheses about two samples or a relationship between variables).
%>% generate(reps = n, type = "permute")
We set the number of replicates (reps) and set the type equal to "permute"

Note: we use a permutation instead of a bootstrap for hypothesis testing!

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")
%>% calculate(stat = "")
We calculate sample statistics for each of the 1,000 replicate samples

In our case, calculate the slope (^β1) for each replicate:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")
Other stats available for calculation: "mean", "median", "prop", "diff in means", "diff in props", etc. (see the package information)
%>% get_p_value(obs_stat = "", direction = "both")
We can calculate the p-value of our sample_slope (-2.28) in our simulated null distribution

For the two-sided alternative Ha: β1 ≠ 0, we double the raw p-value:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  get_p_value(obs_stat = sample_slope, direction = "both")
| p_value |
|---|
| 0 |
%>% visualize()
CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  visualize()
%>% visualize()
We can add our sample_slope to show our finding on the null distribution:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  visualize(obs_stat = sample_slope)
%>% visualize()+shade_p_value()
Add shade_p_value to see what p is:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  visualize(obs_stat = sample_slope) +
  shade_p_value(obs_stat = sample_slope, direction = "two_sided")
%>% visualize()+shade_ci()
If we save our confidence interval endpoints (the tibble of them from 4 slides ago) as ci_values:

simulations %>%
  visualize(obs_stat = sample_slope) +
  shade_confidence_interval(ci_values)
infer's visualize() function is just a wrapper function for ggplot()

You can take your saved simulations tibble and just ggplot a normal histogram:

simulations %>%
  ggplot(data = .) +
  aes(x = stat) +
  geom_histogram(color = "white", fill = "indianred") +
  geom_vline(xintercept = sample_slope, color = "blue",
             size = 2, linetype = "dashed") +
  labs(x = expression(paste("Distribution of ", hat(beta[1]), " under ", H[0],
                            " that ", beta[1] == 0)),
       y = "Samples") +
  theme_classic(base_family = "Fira Sans Condensed", base_size = 20)
R does things the old-fashioned way, using a theoretical null distribution instead of simulation
A t-distribution with n−k−1 df†
Calculate a t-statistic for ^β1:
$$\text{test statistic} = \frac{\text{estimate} - \text{null hypothesis}}{\text{standard error of estimate}}$$
† k is the number of X variables.
t has the same interpretation as Z, number of std. dev. away from the distribution's center†
Compares to a critical value of t∗ (determined by α & n−k−1)
† Think of our simulated distribution, the center was 0.
‡ The 68-95-99.7% empirical rule!
$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)} = \frac{-2.28 - 0}{0.48} = -4.75$$
Our sample slope is 4.75 standard deviations below the mean under H0
p-value: prob. of a test statistic at least as large (in magnitude) as ours if the null hypothesis were true†
† Think of our simulated distribution, the center was 0.
Ha:β1<0
p-value: Prob(t<ti)
Ha:β1>0
p-value: Prob(t>ti)
Ha:β1≠0
p-value: 2×Prob(t>|ti|)
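These formulas can be checked in R with pt(), using the t-statistic and degrees of freedom from our regression:

```r
# Two-sided p-value for our estimate (values from the regression output)
t_i <- (-2.2798 - 0) / 0.4798   # (estimate - null) / standard error, about -4.75
df  <- 420 - 1 - 1              # n - k - 1 = 418

# Ha: beta_1 != 0  =>  p = 2 x Prob(t > |t_i|)
p_two_sided <- 2 * pt(abs(t_i), df = df, lower.tail = FALSE)
p_two_sided  # on the order of 1e-06, matching the lm() output below
```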
summary(school_reg)
## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124, Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
We can also use broom's tidy() (with confidence intervals):

tidy(school_reg, conf.int = TRUE)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 |
H0: β1 = 0
Ha: β1 ≠ 0
Because the hypothesis test's p-value < α (0.05)...
We have sufficient evidence to reject H0 in favor of our alternative hypothesis. Our sample suggests that there is a relationship between class size and test scores.
Using the confidence intervals:
We are 95% confident that the true marginal effect of class size on test scores is between −3.22 and −1.34.
Confidence intervals are all two-sided by nature: $CI_{0.95} = \left( \hat{\beta}_1 - 2 \times se(\hat{\beta}_1),\; \hat{\beta}_1 + 2 \times se(\hat{\beta}_1) \right)$

A hypothesis test (t-test) of H0: β1 = 0 computes a t-value of¹ $t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$
If a confidence interval contains the H0 value (i.e. 0, for our test), then we fail to reject H0.
1 Since our null hypothesis is that β1,0=0, the test statistic simplifies to this neat fraction.
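A small R sketch of this duality, using the estimate and standard error from the slides (the slides' factor of 2 approximates the exact critical value t* ≈ 1.966):

```r
# 95% CI for beta_1 from estimate +/- t* x se (values from tidy() above)
beta_hat <- -2.2798
se_hat   <- 0.4798
t_star   <- qt(0.975, df = 418)

ci <- c(beta_hat - t_star * se_hat, beta_hat + t_star * se_hat)
ci  # roughly (-3.22, -1.34), matching tidy(school_reg, conf.int = TRUE)

# The interval excludes 0 (our H0 value), so we reject H0: beta_1 = 0
!(ci[1] <= 0 & ci[2] >= 0)
```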
❌ p is the probability that the alternative hypothesis is false
❌ p is the probability that the null hypothesis is true
❌ p is the probability that our observed effects were produced purely by random chance
❌ p tells us how significant our finding is
“The widespread use of 'statistical significance' (generally interpreted as p ≤ 0.05) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”
Wasserstein, Ronald L. and Nicole A. Lazar, 2016, "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2): 129-133
“No economist has achieved scientific success as a result of a statistically significant coefficient. Massed observations, clever common sense, elegant theorems, new policies, sagacious economic reasoning, historical perspective, relevant accounting, these have all led to scientific success. Statistical significance has not.”
McCloskey, Deirdre N. and Stephen Ziliak, 1996, The Cult of Statistical Significance, p. 112
Again, p-value is the probability that, if the null hypothesis were true, we obtain (by pure random chance) a test statistic at least as extreme as the one we estimated for our sample
A low p-value means either (and we can't distinguish which):

The null hypothesis is false (the parameter is not the hypothesized value)

The null hypothesis is true, but sampling variability gave us an unusual sample by pure random chance
|  | Test Score |
|---|---|
| Intercept | 698.93 *** |
|  | (9.47) |
| STR | -2.28 *** |
|  | (0.48) |
| N | 420 |
| R-Squared | 0.05 |
| SER | 18.58 |

*** p < 0.001; ** p < 0.01; * p < 0.05
Statistical significance is shown by asterisks, common (but not always!) standard:
Rare, but sometimes regression tables include p-values for estimates