Problem for identification: endogeneity
Problem for inference: randomness
An independent variable (X) is exogenous if its variation is unrelated to other factors that affect the dependent variable (Y)
An independent variable (X) is endogenous if its variation is related to other factors that affect the dependent variable (Y)
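The two definitions can be illustrated with a small simulation. This is only a sketch: the variable names, sample size, and coefficients are made up for illustration, with `z` standing in for "other factors that affect Y".

```r
# Simulated illustration (all names and numbers are hypothetical)
set.seed(42)
n <- 10000

z       <- rnorm(n)        # another factor that affects Y
x_exog  <- rnorm(n)        # varies independently of z
x_endog <- z + rnorm(n)    # part of its variation comes from z

# How strongly is each X related to the other factor?
cor_exog  <- cor(x_exog, z)    # close to 0: exogenous
cor_endog <- cor(x_endog, z)   # clearly positive: endogenous

cor_exog
cor_endog
```

When X moves together with z, differences in Y across values of X mix the effect of X with the effect of z.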
Common in statistics, easy to fix
Inferential Statistics: making claims about a wider population using sample data
$$\text{Sample} \underbrace{\rightarrow}_{\text{statistical inference}} \text{Population} \underbrace{\rightarrow}_{\text{causal identification}} \text{Unobserved Parameters}$$
We want to identify causal relationships between population variables
We'll use sample statistics to infer something about population parameters
Data are information with context
Individuals are the entities described by a set of data
Variables are particular characteristics about an individual
Observations or cases are the separate individuals described by a collection of variables
Individuals and observations are not necessarily the same: one individual can be observed more than once (e.g. in several time periods)
Categorical data place an individual into one of several possible categories
In R: character or factor type data
factor ⟹ specific possible categories
diamonds %>%
  count(cut) %>%
  mutate(frequency = n / sum(n),
         percent = round(frequency * 100, 2))
cut | n | frequency | percent |
---|---|---|---|
Fair | 1610 | 0.0298480 | 2.98 |
Good | 4906 | 0.0909529 | 9.10 |
Very Good | 12082 | 0.2239896 | 22.40 |
Premium | 13791 | 0.2556730 | 25.57 |
Ideal | 21551 | 0.3995365 | 39.95 |
A good way to represent categorical data is with a frequency table
Count (n): total number of individuals in a category
Frequency: proportion of a category's occurrence relative to all data
Charts and graphs are always better ways to visualize data
A bar graph represents categories as bars, with lengths proportional to the count or relative frequency of each category
ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  guides(fill = "none") +
  theme_pander(base_family = "Fira Sans Condensed", base_size = 20)
Avoid pie charts!
People are not good at judging 2-d differences (angles, area)
People are good at judging 1-d differences (length)
diamonds %>%
  count(cut) %>%
  ggplot(data = .) +
  aes(x = "", y = n) +
  geom_col(aes(fill = cut)) +
  geom_label(aes(label = cut, color = cut),
             position = position_stack(vjust = 0.5)) +
  guides(color = "none", fill = "none") +
  theme_void()
diamonds %>%
  count(cut) %>%
  mutate(cut_name = as.factor(cut)) %>%
  ggplot(., aes(x = cut_name, y = n, color = cut)) +
  geom_point(stat = "identity", fill = "black", size = 12) +
  geom_segment(aes(x = cut_name, y = 0, xend = cut_name, yend = n), size = 2) +
  geom_text(aes(label = n), color = "white", size = 3) +
  coord_flip() +
  labs(x = "Cut") +
  theme_pander(base_family = "Fira Sans Condensed", base_size = 20) +
  guides(color = "none")
library(treemapify)
diamonds %>%
  count(cut) %>%
  ggplot(., aes(area = n, fill = cut)) +
  geom_treemap() +
  guides(fill = "none") +
  geom_treemap_text(aes(label = cut), colour = "white", place = "topleft", grow = TRUE)
Quantitative variables take on numerical values of equal units that describe an individual
We can mathematically manipulate only quantitative data
In R: numeric type data
integer if whole number
double if has decimals
Discrete data are finite, with a countable number of alternatives
Categorical: place data into categories
Quantitative: integers
Continuous data are infinitely divisible, with an uncountable number of alternatives
Many discrete variables may be treated as if they are continuous
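A quick way to see these distinctions is to ask R how it stores a variable. The values below are hypothetical and chosen only to show one variable of each kind.

```r
# Hypothetical values, chosen only to show how R stores each kind of variable
age    <- c(23L, 18L, 28L)               # whole numbers, stored as integer
income <- c(41000.50, 52600, 48000)      # has decimals, stored as double
sex    <- factor(c("Male", "Male", "Female"),
                 levels = c("Female", "Male"))  # categorical with fixed levels

typeof(age)        # "integer"
typeof(income)     # "double"
is.numeric(age)    # TRUE: integers and doubles are both numeric
is.numeric(sex)    # FALSE: a factor is categorical, not quantitative
levels(sex)        # the specific possible categories

mean(age)          # arithmetic is meaningful for quantitative data
```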
ID | Name | Age | Sex | Income |
---|---|---|---|---|
1 | John | 23 | Male | 41000 |
2 | Emile | 18 | Male | 52600 |
3 | Natalya | 28 | Female | 48000 |
4 | Lakisha | 31 | Female | 60200 |
5 | Cheng | 36 | Male | 81900 |
The most common data structure we use is a spreadsheet: in R, a data.frame or tibble
A row contains data about all variables for a single individual
A column contains data about a single variable across all individuals
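The row and column layout can be sketched in base R by rebuilding the table. The object name `people` is an assumption made for this sketch.

```r
# Rebuild the example table as a base R data.frame
# (the object name `people` is an assumption for this sketch)
people <- data.frame(
  ID     = 1:5,
  Name   = c("John", "Emile", "Natalya", "Lakisha", "Cheng"),
  Age    = c(23, 18, 28, 31, 36),
  Sex    = c("Male", "Male", "Female", "Female", "Male"),
  Income = c(41000, 52600, 48000, 60200, 81900)
)

dim(people)          # 5 rows (individuals) and 5 columns (variables)
people[3, ]          # one row: every variable for a single individual
people$Age           # one column: a single variable across all individuals
people[3, "Name"]    # one cell: row 3 of the Name column
```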
Each cell can be referenced by its row and column, in that order: df[row,column]
example[3,2] # value in row 3, column 2
## # A tibble: 1 x 1
##   Name   
##   <chr>  
## 1 Natalya
In dplyr, filter() chooses rows (observations) and select() chooses columns (variables)
It is common to use some notation like the following:
Let $\{x_1, x_2, \cdots, x_n\}$ be a simple data series on variable $X$
Quick Check: Let x represent the score on a homework assignment: 75,100,92,87,79,0,95
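In R, this subscript notation maps directly onto vector indexing. A short sketch using the homework scores from the Quick Check:

```r
# Homework scores from the Quick Check
x <- c(75, 100, 92, 87, 79, 0, 95)

n <- length(x)   # n: the number of observations in the series
n

x[1]             # x_1, the first observation
x[6]             # x_6, the sixth observation
x[n]             # x_n, the last observation
```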
ID | Name | Age | Sex | Income |
---|---|---|---|---|
1 | John | 23 | Male | 41000 |
2 | Emile | 18 | Male | 52600 |
3 | Natalya | 28 | Female | 48000 |
4 | Lakisha | 31 | Female | 60200 |
5 | Cheng | 36 | Male | 81900 |
Cross-sectional data: observations of individuals at a given point in time
Each observation is a unique individual, $x_i$
Simplest and most common data
A "snapshot" to compare differences across individuals
Year | GDP | Unemployment | CPI |
---|---|---|---|
1950 | 8.2 | 0.06 | 100 |
1960 | 9.9 | 0.04 | 118 |
1970 | 10.2 | 0.08 | 130 |
1980 | 12.4 | 0.08 | 190 |
1985 | 13.6 | 0.06 | 196 |
Time-series data: observations of the same individual(s) over time
Each observation is a time period, $x_t$
Often used for macroeconomics, finance, and forecasting
Unique challenges for time series
A "moving picture" to see how individuals change over time
City | Year | Murders | Population | UR |
---|---|---|---|---|
Philadelphia | 1986 | 5 | 3.700 | 8.7 |
Philadelphia | 1990 | 8 | 4.200 | 7.2 |
D.C. | 1986 | 2 | 0.250 | 5.4 |
D.C. | 1990 | 10 | 0.275 | 5.5 |
New York | 1986 | 3 | 6.400 | 9.6 |
Panel, or longitudinal, data: a time series for each cross-sectional entity
Each observation is an individual in a time period, $x_{it}$
More common today for serious researchers; unique challenges and benefits
A combination of "snapshot" comparisons over time
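The $x_{it}$ structure can be sketched by rebuilding the panel table in base R. The object names are assumptions for this sketch; the values are copied from the table.

```r
# Rebuild the panel table (`crime` is an assumed object name)
crime <- data.frame(
  City       = c("Philadelphia", "Philadelphia", "D.C.", "D.C.", "New York"),
  Year       = c(1986, 1990, 1986, 1990, 1986),
  Murders    = c(5, 8, 2, 10, 3),
  Population = c(3.700, 4.200, 0.250, 0.275, 6.400),
  UR         = c(8.7, 7.2, 5.4, 5.5, 9.6)
)

# Each observation is identified by the pair (i, t) = (City, Year)
any(duplicated(crime[, c("City", "Year")]))   # FALSE: no pair repeats

# Holding t fixed gives a cross-section (a "snapshot" across cities)
cross_section_1986 <- crime[crime$Year == 1986, ]

# Holding i fixed gives a time series for one city
philadelphia_series <- crime[crime$City == "Philadelphia", ]

cross_section_1986
philadelphia_series
```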
Variables take on different values; we can describe a variable's distribution (of these values)
We want to visualize and analyze distributions to search for meaningful patterns using statistics
Descriptive Statistics: describes or summarizes the properties of a sample
Inferential Statistics: infers properties about a larger population from the properties of a sample†
† We'll encounter inferential statistics mainly in the context of regression later.
A common way to present a quantitative variable's distribution is a histogram
Divide up values into bins of a certain size, and count the number of values falling within each bin, representing them visually as bars
Example: a class of 13 students takes a quiz (out of 100 points) with the following results:
{0,62,66,71,71,74,76,79,83,86,88,93,95}
quizzes<-tibble(scores = c(0,62,66,71,71,74,76,79,83,86,88,93,95))
h <- ggplot(quizzes, aes(x = scores)) +
  geom_histogram(breaks = seq(0, 100, 10), color = "white", fill = "#56B4E9") +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  scale_y_continuous(limits = c(0, 6), expand = c(0, 0)) +
  labs(x = "Scores", y = "Number of Students") +
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 20)
h
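The binning step behind a histogram can be sketched without plotting, using base R's cut() and table(). This is an illustration of the idea rather than what ggplot2 does internally; for these scores no value other than 0 sits exactly on a bin edge, so the edge convention does not change the counts.

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

# Bin edges every 10 points, matching breaks = seq(0, 100, 10) in the plot
edges <- seq(0, 100, 10)

# Assign each score to a bin; intervals are closed on the right, e.g. (70,80],
# and include.lowest = TRUE puts a score of exactly 0 in the first bin
bins <- cut(scores, breaks = edges, include.lowest = TRUE)

# Count the scores in each bin: these counts are the bar heights
bin_counts <- table(bins)
bin_counts
```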
The mode of a variable is simply its most frequent value
A variable can have multiple modes
Example: a class of 13 students takes a quiz (out of 100 points) with the following results:
{0,62,66,71,71,74,76,79,83,86,88,93,95}
Surprisingly, R has no built-in function for the statistical mode (base R's mode() reports how an object is stored instead)
A workaround in dplyr:
quizzes %>% count(scores) %>% arrange(desc(n))
## # A tibble: 12 x 2
##    scores     n
##     <dbl> <int>
##  1     71     2
##  2      0     1
##  3     62     1
##  4     66     1
##  5     74     1
##  6     76     1
##  7     79     1
##  8     83     1
##  9     86     1
## 10     88     1
## 11     93     1
## 12     95     1
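A base R alternative is a small helper built on table(). The function name `stat_mode` is made up for this sketch, and it returns every value tied for most frequent, since a variable can have multiple modes.

```r
# A small helper for the statistical mode (the name `stat_mode` is made up here)
stat_mode <- function(x) {
  counts <- table(x)                             # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]    # value(s) tied for most frequent
  as.numeric(top)                                # table() keeps values as text labels
}

scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

stat_mode(scores)             # 71
stat_mode(c(1, 1, 2, 2, 3))   # two modes: 1 and 2
```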
Looking at a histogram, the modes are the "peaks" of the distribution
May be unimodal, bimodal, trimodal, etc
A distribution is symmetric if it looks roughly the same on either side of the "center"
The thinner ends (far left and far right) are called the tails of a distribution
Outlier: extreme value that does not appear part of the general pattern of a distribution
Can strongly affect descriptive statistics
Might be the most informative part of the data
Could be the result of errors
Should always be explored and discussed!
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
For $N$ values of variable $x$, "mu" is the sum of all individual $x$ values ($x_i$) from 1 to $N$, divided by the number of values, $N$†
See today's class notes for more about the summation operator, $\Sigma$; it'll come up again!
† Note the mean need not be an actual value of the data!
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Example:
{0,62,66,71,71,74,76,79,83,86,88,93,95}
$$\bar{x} = \frac{1}{13}(0+62+66+71+71+74+76+79+83+86+88+93+95) = \frac{944}{13} \approx 72.62$$
quizzes %>% summarize(mean=mean(scores))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  72.6
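The same number can be built up by hand from the formula, which makes the role of the summation explicit. A minimal base R sketch:

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

n     <- length(scores)   # number of values
total <- sum(scores)      # the summation: x_1 + x_2 + ... + x_n
x_bar <- total / n        # divide the sum by n

total
x_bar

all.equal(x_bar, mean(scores))   # TRUE: matches the built-in mean()
x_bar %in% scores                # FALSE: the mean is not one of the actual scores
```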
Example: the same scores without the outlier (0):
{62,66,71,71,74,76,79,83,86,88,93,95}
$$\bar{x} = \frac{1}{12}(62+66+71+71+74+76+79+83+86+88+93+95) = \frac{944}{12} \approx 78.67$$
quizzes %>% filter(scores>0) %>% summarize(mean=mean(scores))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  78.7
{0,62,66,71,71,74,76,79,83,86,88,93,95}
The median is the midpoint of the distribution
Arrange values in numerical order
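The procedure can be sketched as a small function: sort the values, then take the middle one (or average the two middle ones when the count is even). The name `manual_median` is made up for this sketch; R's built-in median() gives the same results.

```r
# Median by hand (the function name is hypothetical)
manual_median <- function(x) {
  x_sorted <- sort(x)                       # arrange values in numerical order
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]                   # odd n: the single middle value
  } else {
    mean(x_sorted[c(n / 2, n / 2 + 1)])     # even n: average the two middle values
  }
}

scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

manual_median(scores)               # 76
manual_median(scores[scores > 0])   # 77.5
```

Dropping the 0 moves the median from 76 to 77.5, while the mean moves from 72.62 to 78.67: the median is much less sensitive to the outlier.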
symmetric %>% summarize(mean = mean(x), median = median(x))
## # A tibble: 1 x 2
##    mean median
##   <dbl>  <dbl>
## 1     4      4
leftskew %>% summarize(mean = mean(x), median = median(x))
##       mean median
## 1 4.615385      5
rightskew %>% summarize(mean = mean(x), median = median(x))
## # A tibble: 1 x 2
##    mean median
##   <dbl>  <dbl>
## 1  3.38      3
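The pattern in these outputs (mean pulled toward the longer tail, median staying near the bulk of the data) can be reproduced with made-up series. The three vectors below are hypothetical and are not the data behind the outputs above.

```r
# Hypothetical series, invented only to illustrate the pattern
symmetric_x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7)
right_x     <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7)   # longer tail on the right
left_x      <- 8 - right_x                          # mirror image: tail on the left

c(mean = mean(symmetric_x), median = median(symmetric_x))   # equal
c(mean = mean(right_x),     median = median(right_x))       # mean above median
c(mean = mean(left_x),      median = median(left_x))        # mean below median
```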
The more variation in the data, the less a measure of central tendency tells us
Beyond just the center, we also want to measure the spread
Simplest metric is the range: $\text{range} = \max - \min$
# Base R summary command (includes Mean)
summary(quizzes$scores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   76.00   72.62   86.00   95.00
quizzes %>% # dplyr
  summarize(Min = min(scores),
            Q1 = quantile(scores, 0.25),
            Median = median(scores),
            Q3 = quantile(scores, 0.75),
            Max = max(scores))
## # A tibble: 1 x 5
##     Min    Q1 Median    Q3   Max
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0    71     76    86    95
quizzes %>% summarize("37th percentile" = quantile(scores,0.37))
## # A tibble: 1 x 1
##   `37th percentile`
##               <dbl>
## 1              72.3
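The result 72.3 is not one of the scores because quantile() interpolates between neighbouring values. The sketch below reproduces the calculation by hand, assuming quantile()'s default method (type 7), which places the p-th quantile at position (n - 1)p + 1 in the sorted data.

```r
scores <- sort(c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95))
n <- length(scores)
p <- 0.37

h     <- (n - 1) * p + 1   # fractional position in the sorted data (5.44)
lower <- floor(h)          # the 5th value sits just below that position
frac  <- h - lower         # how far to move toward the 6th value

# Linear interpolation between the 5th value (71) and the 6th value (74)
manual_q <- scores[lower] + frac * (scores[lower + 1] - scores[lower])
manual_q

all.equal(manual_q, unname(quantile(scores, p)))   # TRUE
```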
Boxplots are a great way to visualize the 5 number summary
Height of box: Q1 to Q3 (known as the interquartile range (IQR), the middle 50% of data)
Line inside box: median (50th percentile)
"Whiskers" extend to the most extreme data points within 1.5×IQR of the box (below Q1 and above Q3)
Points beyond whiskers are outliers
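The 1.5×IQR rule can be checked numerically for the quiz scores. This sketch assumes quartiles as computed by quantile(); other software may compute quartiles slightly differently.

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

q1  <- quantile(scores, 0.25, names = FALSE)   # bottom of the box
q3  <- quantile(scores, 0.75, names = FALSE)   # top of the box
iqr <- q3 - q1                                 # same value as IQR(scores)

# Whiskers can reach at most 1.5 * IQR beyond the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers <- scores[scores < lower_fence | scores > upper_fence]

c(q1 = q1, q3 = q3, iqr = iqr, lower = lower_fence, upper = upper_fence)
outliers
```

Here the fences are 48.5 and 108.5, so only the score of 0 is flagged.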
Example:
Quiz 1: {0,62,66,71,71,74,76,79,83,86,88,93,95}
Quiz 2: {50,62,72,73,79,81,82,82,86,90,94,98,99}
quizzes_new %>% summary()
##     student       quiz_1          quiz_2     
##  Min.   : 1   Min.   : 0.00   Min.   :50.00  
##  1st Qu.: 4   1st Qu.:71.00   1st Qu.:73.00  
##  Median : 7   Median :76.00   Median :82.00  
##  Mean   : 7   Mean   :72.62   Mean   :80.62  
##  3rd Qu.:10   3rd Qu.:86.00   3rd Qu.:90.00  
##  Max.   :13   Max.   :95.00   Max.   :99.00
I don't like the options available for printing out summary statistics
So I wrote my own R function called summary_table() that makes nice summary tables (it uses dplyr and tidyr!). To use:
1) Download the summaries.R file from the website† and move it to your working directory/project folder
2) Load the function with the source() command:‡
source("summaries.R")
† One day I'll make this part of a package I'll write.
‡ If it were a package, then you'd load it with library(). But you can run a single .R script with source().
3) The function has at least 2 arguments: the data.frame (automatically piped in if you use the pipe!) and then all variables you want to summarize, separated by commas†
mpg %>% summary_table(hwy, cty, cyl)
## # A tibble: 3 x 9
##   Variable   Obs   Min    Q1 Median    Q3   Max  Mean `Std. Dev.`
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>       <dbl>
## 1 cty        234     9    14     17    19    35 16.9         4.26
## 2 cyl        234     4     4      6     8     8  5.89        1.61
## 3 hwy        234    12    18     24    27    44 23.4         5.95
† There is one restriction: No variable name can have an underscore (_) in it. You will have to rename them or else you will break the function!
4) When knitted in R markdown, it looks nicer:
mpg %>% summary_table(hwy, cty, cyl) %>% knitr::kable(., format="html")
Variable | Obs | Min | Q1 | Median | Q3 | Max | Mean | Std. Dev. |
---|---|---|---|---|---|---|---|---|
cty | 234 | 9 | 14 | 17 | 19 | 35 | 16.86 | 4.26 |
cyl | 234 | 4 | 4 | 6 | 8 | 8 | 5.89 | 1.61 |
hwy | 234 | 12 | 18 | 24 | 27 | 44 | 23.44 | 5.95 |
More on markdown and making final products nicer when we discuss your paper project (have you forgotten?)
Every observation $i$ deviates from the mean of the data: $\text{deviation}_i = x_i - \mu$
There are as many deviations as there are data points (n)
We can measure the average or standard deviation of a variable from its mean
Before we get there...
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
Why do we square deviations?
What are these units?
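One way to approach both questions is to compute the deviations for the quiz scores. In this sketch the 13 scores are treated as the whole population (an assumption made only so that dividing by N applies).

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

deviations <- scores - mean(scores)   # one deviation per data point

# Raw deviations cancel out: positives and negatives sum to (essentially) zero
sum(deviations)

# Squaring makes every term non-negative, so the average is informative
avg_sq_dev <- mean(deviations^2)      # units: points squared
avg_sq_dev

# The square root returns to the original units (points)
sqrt(avg_sq_dev)
```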
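$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$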
Example: Calculate the sample standard deviation for the following series:
{2,4,6,8,10}
sd(c(2,4,6,8,10))
## [1] 3.162278
# first let's save our data in a tibble
sd_example <- tibble(x = c(2, 4, 6, 8, 10))
# first find the mean (just so we know)
sd_example %>%
  summarize(mean(x))
## # A tibble: 1 x 1
##   `mean(x)`
##       <dbl>
## 1         6
# now let's make some more columns:
sd_example <- sd_example %>%
  mutate(deviations = x - mean(x),       # take deviations from mean
         deviations_sq = deviations^2)   # square them
sd_example # see what we made
## # A tibble: 5 x 3
##       x deviations deviations_sq
##   <dbl>      <dbl>         <dbl>
## 1     2         -4            16
## 2     4         -2             4
## 3     6          0             0
## 4     8          2             4
## 5    10          4            16
sd_example %>%
  # sum the squared deviations
  summarize(sum_sq_devs = sum(deviations_sq),
            # divide by n-1 to get variance
            variance = sum_sq_devs / (n() - 1),
            # square root to get sd
            std_dev = sqrt(variance))
## # A tibble: 1 x 3
##   sum_sq_devs variance std_dev
##         <dbl>    <dbl>   <dbl>
## 1          40       10    3.16
You Try: Calculate the sample standard deviation for the following series:
{1,3,5,7}
sd(c(1,3,5,7))
## [1] 2.581989
Population size: $N$
Mean: $\mu$
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample size: $n$
Mean: $\bar{x}$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$
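The difference between the two sets of formulas is only the divisor. R's var() and sd() use the sample versions (dividing by n - 1), so the population versions have to be computed by hand. A sketch using the series {2,4,6,8,10}, treated as a full population purely for illustration:

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)

# Sample formulas: divide by n - 1 (what var() and sd() do)
sample_var <- var(x)
sample_sd  <- sd(x)

# Population formulas: divide by N (no built-in function, so compute directly)
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)

c(sample_var = sample_var, pop_var = pop_var)
c(sample_sd = sample_sd, pop_sd = pop_sd)

# The two variances differ only by the factor (n - 1) / n
all.equal(pop_var, sample_var * (n - 1) / n)
```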