2.1 — Data 101 & Descriptive Statistics

ECON 480 • Econometrics • Fall 2020

Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsF20
metricsF20.classes.ryansafner.com

Outline

The Two Big Problems with Data

Data 101

Descriptive Statistics

Measures of Center

Measures of Dispersion

The Two Big Problems with Data

Two Big Problems with Data

We want to use econometrics to identify causal relationships and make inferences about them

Problem for identification: endogeneity
Problem for inference: randomness

Identification Problem: Endogeneity

An independent variable is exogenous if its variation is unrelated to other factors that affect the dependent variable
An independent variable is endogenous if its variation is related to other factors that affect the dependent variable

Inference Problem: Randomness

Data is random due to natural sampling variation
- Taking one sample of a population will yield slightly different information than another sample of the same population
Common in statistics, easy to fix
Inferential Statistics: making claims about a wider population using sample data
- We use common tools and techniques to deal with randomness
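Sampling variation is easy to see in a small simulation. The sketch below assumes a made-up population of 10,000 normally distributed values; the names `population`, `sample_a`, and `sample_b` and all of the numbers are illustrative, not course data.

```r
# Simulate a hypothetical population (values are made up for illustration)
set.seed(480) # make the random draws reproducible
population <- rnorm(10000, mean = 50000, sd = 10000)

# Draw two independent samples of 100 individuals each
sample_a <- sample(population, size = 100)
sample_b <- sample(population, size = 100)

# The two sample means differ from each other and from the population mean
mean(population)
mean(sample_a)
mean(sample_b)
```

Rerunning with a different seed gives different sample means again, which is the randomness that inferential statistics has to account for.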

The Two Problems: Where We're Heading...Ultimately

[Diagram: Sample, Population, Unobserved Parameters]

We want to identify causal relationships between population variables
- Logically first thing to consider
- Endogeneity problem
We'll use sample statistics to infer something about population parameters
- In practice, we'll only ever have a finite sample distribution of data
- We don't know the population distribution of data
- Randomness problem

Data 101

Data are information with context
Individuals are the entities described by a set of data
- e.g. persons, households, firms, countries

Data 101

Variables are particular characteristics about an individual
- e.g. age, income, profits, population, GDP, marital status, type of legal institutions
Observations or cases are the separate individuals described by a collection of variables
- e.g. for one individual, we have their age, sex, income, education, etc.
Individuals and observations are not necessarily the same:
- e.g. we can have multiple observations on the same individual over time

Categorical Data

Categorical data place an individual into one of several possible categories
- e.g. sex, season, political party
- may be responses to survey questions
- can be recorded as numbers (e.g. age, zip code) and still be treated as categories
In R: character or factor type data
- factor: restricted to a specific set of possible categories

Categorical Data: Visualizing I

diamonds %>%
  count(cut) %>%
  mutate(frequency = n / sum(n),
         percent = round(frequency * 100, 2))

Summary of diamonds by cut
cut	n	frequency	percent
Fair	1610	0.0298480	2.98
Good	4906	0.0909529	9.10
Very Good	12082	0.2239896	22.40
Premium	13791	0.2556730	25.57
Ideal	21551	0.3995365	39.95

A good way to represent categorical data is with a frequency table
Count (n): total number of individuals in a category
Frequency: proportion of a category's occurrence relative to all data
- Multiply proportions by 100 to get percentages

Categorical Data: Visualizing II

Charts and graphs are usually better ways to visualize data than tables
A bar graph represents categories as bars, with lengths proportional to the count or relative frequency of each category

ggplot(diamonds, aes(x=cut,
                     fill=cut))+
  geom_bar()+
  guides(fill = "none")+ # hide the legend (fill = F is deprecated in current ggplot2)
  theme_pander(base_family = "Fira Sans Condensed",
           base_size=20)

Categorical Data: Visualizing III

Avoid pie charts!
People are not good at judging 2-d differences (angles, area)
People are good at judging 1-d differences (length)

Categorical Data: Visualizing IV

Maybe a stacked bar chart

diamonds %>%
  count(cut) %>%
ggplot(data = .)+
  aes(x = "",
      y = n)+
  geom_col(aes(fill = cut))+
  geom_label(aes(label = cut,
                 color = cut),
             position = position_stack(vjust = 0.5)
             )+
  guides(color = "none", # hide both legends (F is deprecated in current ggplot2)
         fill = "none")+
  theme_void()

Categorical Data: Visualizing IV

Maybe a lollipop chart

diamonds %>%
  count(cut) %>%
  mutate(cut_name = as.factor(cut)) %>%
ggplot(., aes(x = cut_name, y = n, color = cut))+
 geom_point(stat="identity",
            fill="black",
            size=12)  +
  geom_segment(aes(x = cut_name, y = 0,
                   xend = cut_name,
                   yend = n), size = 2)+
  geom_text(aes(label = n),color="white", size=3) +
  coord_flip()+
  labs(x = "Cut")+
  theme_pander(base_family = "Fira Sans Condensed",
                base_size=20)+
  guides(color = "none") # hide the legend (F is deprecated in current ggplot2)

Categorical Data: Visualizing IV

Maybe a treemap

library(treemapify)
diamonds %>%
  count(cut) %>%
ggplot(., aes(area = n, fill = cut)) +
  geom_treemap() +
  guides(fill = "none") + # hide the legend (FALSE is deprecated in current ggplot2)
  geom_treemap_text(aes(label = cut),
                    colour = "white",
                    place = "topleft",
                    grow = TRUE)

Quantitative Data I

Quantitative variables take on numerical values of equal units that describe an individual
- Units: points, dollars, inches
- Context: GPA, prices, height
We can mathematically manipulate only quantitative data
- e.g. sum, average, standard deviation
In R: numeric type data
- integer if whole number
- double if has decimals
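A short sketch of how R stores these types, using made-up values. One caveat worth knowing: a whole number typed without the `L` suffix is still stored as a double.

```r
# Storage types of a few hypothetical values
whole   <- 3L    # the L suffix makes an integer
decimal <- 3.5   # numbers with decimals are doubles
plain   <- 3     # without the L suffix, R stores this as a double too

typeof(whole)    # "integer"
typeof(decimal)  # "double"
typeof(plain)    # "double"

# Both integers and doubles count as numeric
is.numeric(whole)
is.numeric(decimal)
```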

Discrete Data

Discrete data are finite, with a countable number of alternatives
Categorical: place data into categories
- e.g. letter grades: A, B, C, D, F
- e.g. class level: freshman, sophomore, junior, senior
Quantitative: integers
- e.g. SAT Score, number of children, age (years)

Continuous Data

Continuous data are infinitely divisible, with an uncountable number of alternatives
- e.g. weight, length, temperature, GPA
Many discrete variables may be treated as if they are continuous
- e.g. SAT scores (whole points), wages (dollars and cents)

Spreadsheets

ID	Name	Age	Sex	Income
1	John	23	Male	41000
2	Emile	18	Male	52600
3	Natalya	28	Female	48000
4	Lakisha	31	Female	60200
5	Cheng	36	Male	81900

The most common data structure we use is a spreadsheet
- In R: a data.frame or tibble
A row contains data about all variables for a single individual
A column contains data about a single variable across all individuals

Spreadsheets
 
ID	Name	Age	Sex	Income
1	John	23	Male	41000
2	Emile	18	Male	52600
3	Natalya	28	Female	48000
4	Lakisha	31	Female	60200
5	Cheng	36	Male	81900

Each cell can be referenced by its row and column (in that order!), df[row,column]

example[3,2] # value in row 3, column 2

## # A tibble: 1 x 1
##   Name   
##   <chr>  
## 1 Natalya
Recall how to “subset” data frames from 1.2; though it’s now much easier with filter() and select()!
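The slides do not show how `example` was created, so the sketch below rebuilds it from the spreadsheet as an assumption, then compares base R indexing with the dplyr approach. The name `third_name` is purely illustrative.

```r
library(tibble)
library(dplyr)

# Rebuild the spreadsheet from the slide (assumed construction of `example`)
example <- tibble(
  ID     = 1:5,
  Name   = c("John", "Emile", "Natalya", "Lakisha", "Cheng"),
  Age    = c(23, 18, 28, 31, 36),
  Sex    = c("Male", "Male", "Female", "Female", "Male"),
  Income = c(41000, 52600, 48000, 60200, 81900)
)

# Base R: row 3, column 2 (returns a 1 x 1 tibble)
example[3, 2]

# dplyr: pick rows with filter() and columns with select()
third_name <- example %>%
  filter(ID == 3) %>%
  select(Name)
third_name
```

The dplyr version refers to rows and columns by what they contain rather than by position, so it keeps working if the rows are reordered.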

   

Spreadsheets II

It is common to use some notation like the following:
Let $x = \{x_1, x_2, \cdots, x_n\}$ be a simple data series on variable $X$
- $n$ individual observations
- $x_i$ is the value of the $i^{th}$ observation for $i = 1, 2, \cdots, n$

Quick Check: Let $x$ represent the score on a homework assignment:

What is ?
What is ?
What is ?

Datasets: Cross-Sectional

ID	Name	Age	Sex	Income
1	John	23	Male	41000
2	Emile	18	Male	52600
3	Natalya	28	Female	48000
4	Lakisha	31	Female	60200
5	Cheng	36	Male	81900

Cross-sectional data: observations of individuals at a given point in time
Each observation is a unique individual
Simplest and most common data
A "snapshot" to compare differences across individuals

Datasets: Time-Series

Year	GDP	Unemployment	CPI
1950	8.2	0.06	100
1960	9.9	0.04	118
1970	10.2	0.08	130
1980	12.4	0.08	190
1985	13.6	0.06	196

Time-series data: observations of the same individual(s) over time
Each observation is a time period
Often used for macroeconomics, finance, and forecasting
Unique challenges for time series
A "moving picture" to see how individuals change over time

Datasets: Panel

City	Year	Murders	Population	UR
Philadelphia	1986	5	3.700	8.7
Philadelphia	1990	8	4.200	7.2
D.C.	1986	2	0.250	5.4
D.C.	1990	10	0.275	5.5
New York	1986	3	6.400	9.6

Panel, or longitudinal dataset: a time-series for each cross-sectional entity
- Must be same individuals over time
Each observation is an individual in a time period
More common today for serious researchers; unique challenges and benefits
A combination of "snapshot" comparisons over time
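As a rough sketch, the panel above can be entered as a tibble to see its structure: each city contributes one row per year observed. The object names `panel` and `periods` are assumptions made for this illustration.

```r
library(tibble)
library(dplyr)

# The panel from the slide: one row per city per year
panel <- tibble(
  City       = c("Philadelphia", "Philadelphia", "D.C.", "D.C.", "New York"),
  Year       = c(1986, 1990, 1986, 1990, 1986),
  Murders    = c(5, 8, 2, 10, 3),
  Population = c(3.700, 4.200, 0.250, 0.275, 6.400),
  UR         = c(8.7, 7.2, 5.4, 5.5, 9.6)
)

# Number of time periods observed for each city
periods <- panel %>%
  count(City)
periods

# A single cross-section is one "snapshot" year of the panel
panel %>%
  filter(Year == 1986)
```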

Descriptive Statistics

Variables and Distributions

Variables take on different values; we can describe a variable's distribution (of these values)
We want to visualize and analyze distributions to search for meaningful patterns using statistics

Two Branches of Statistics

Two main branches of statistics:

Descriptive Statistics: describes or summarizes the properties of a sample
Inferential Statistics: infers properties about a larger population from the properties of a sample^†

^† We'll encounter inferential statistics mainly in the context of regression later.

Histograms

A common way to present a quantitative variable's distribution is a histogram
- The quantitative analog to the bar graph for a categorical variable
Divide up values into bins of a certain size, and count the number of values falling within each bin, representing them visually as bars

Histogram: Example

Example: a class of 13 students takes a quiz (out of 100 points) with the following results:

quizzes<-tibble(scores = c(0,62,66,71,71,74,76,79,83,86,88,93,95))

h<-ggplot(quizzes,aes(x=scores))+
  geom_histogram(breaks = seq(0,100,10),
                 color = "white",
                 fill = "#56B4E9")+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_y_continuous(limits = c(0,6), expand = c(0,0))+
  labs(x = "Scores",
       y = "Number of Students")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed",
           base_size=20)
h
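The binning behind a histogram like this one can be reproduced by hand. The sketch below uses base R's `cut()` and `table()` on the same 13 scores, assuming right-closed 10-point bins (the default for `geom_histogram()`); `bins` and `bin_counts` are illustrative names.

```r
# The 13 quiz scores from the example
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

# Assign each score to a 10-point bin; bins are closed on the right,
# and include.lowest = TRUE keeps the score of 0 in the first bin
bins <- cut(scores, breaks = seq(0, 100, 10), include.lowest = TRUE)

# Count the scores in each bin: these counts are the bar heights
bin_counts <- table(bins)
bin_counts
```

Changing the `breaks` argument changes the counts, which is why the look of a histogram depends on the bin width.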

Descriptive Statistics

We are often interested in the shape or pattern of a distribution, particularly:
- Measures of center
- Measures of dispersion
- Shape of distribution

Measures of Center

Mode

The mode of a variable is simply its most frequent value
A variable can have multiple modes

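Example: a class of 13 students takes a quiz (out of 100 points) with the following results: $\{0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95\}$. The mode is 71, the only score that occurs twice.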

Mode

R has no built-in function for the statistical mode, surprisingly (base R's mode() reports an object's storage mode instead)
A workaround in dplyr:

quizzes %>%
  count(scores) %>%
  arrange(desc(n))

## # A tibble: 12 x 2
##    scores     n
##     <dbl> <int>
##  1     71     2
##  2      0     1
##  3     62     1
##  4     66     1
##  5     74     1
##  6     76     1
##  7     79     1
##  8     83     1
##  9     86     1
## 10     88     1
## 11     93     1
## 12     95     1

Looking at a histogram, the modes are the "peaks" of the distribution
- Note: depends on how wide you make the bins!
May be unimodal, bimodal, trimodal, etc
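The counting workaround can also be wrapped in a small base R helper. This is only a sketch: `stat_modes()` is a hypothetical name, not a function from any package, and it assumes a numeric input.

```r
# Hypothetical helper: return every value tied for the highest count
stat_modes <- function(x) {
  counts <- table(x)                          # count each distinct value
  top <- names(counts)[counts == max(counts)] # values with the largest count
  as.numeric(top)                             # table() stores values as text
}

scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)
stat_modes(scores)           # one mode
stat_modes(c(1, 1, 2, 3, 3)) # two modes (bimodal)
```

Returning all tied values, rather than just the first, keeps multimodal data from being silently reported as unimodal.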

Symmetry and Skew I

A distribution is symmetric if it looks roughly the same on either side of the "center"
The thinner ends (far left and far right) are called the tails of a distribution

Symmetry and Skew I

If one tail stretches farther than the other, the distribution is skewed in the direction of the longer tail

Outliers

Outlier: extreme value that does not appear part of the general pattern of a distribution
Can strongly affect descriptive statistics
Might be the most informative part of the data
Could be the result of errors
Should always be explored and discussed!

Arithmetic Mean (Population)

The natural measure of the center of a population's distribution is its "average" or arithmetic mean $\mu$

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

For $N$ values of variable $x$, "mu" ($\mu$) is the sum of all individual $x$ values ($x_i$) from 1 to $N$, divided by the number of values $N$^†
See today's class notes for more about the summation operator, $\Sigma$; it'll come up again!

^† Note the mean need not be an actual value of the data!

Arithmetic Mean (Sample)

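When we have a sample, we compute the sample mean $\bar{x}$

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

For $n$ values of variable $x$, "x-bar" ($\bar{x}$) is the sum of all individual $x$ values ($x_i$) divided by the number of values $n$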

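Example: for the 13 quiz scores, $\bar{x} = \frac{0 + 62 + 66 + 71 + 71 + 74 + 76 + 79 + 83 + 86 + 88 + 93 + 95}{13} = \frac{944}{13} \approx 72.62$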

quizzes %>%
  summarize(mean=mean(scores))

## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  72.6
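The same number can be built up step by step from the definition. The sketch below spells out the sum and the count in base R; the names `n`, `total`, and `x_bar` are illustrative.

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

n <- length(scores)   # number of observations
total <- sum(scores)  # the summation: x_1 + x_2 + ... + x_n
x_bar <- total / n    # sample mean

x_bar
all.equal(x_bar, mean(scores)) # matches the built-in mean()
```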

Arithmetic Mean: Affected by Outliers

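If we drop the outlier (0), the mean rises:

Example: $\bar{x} = \frac{62 + 66 + 71 + 71 + 74 + 76 + 79 + 83 + 86 + 88 + 93 + 95}{12} = \frac{944}{12} \approx 78.67$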

quizzes %>%
  filter(scores>0) %>%
  summarize(mean=mean(scores))

## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  78.7

Median

The median is the midpoint of the distribution
- 50% to the left of the median, 50% to the right of the median
Arrange values in numerical order
- For odd $n$: median is middle observation
- For even $n$: median is average of two middle observations
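The odd/even rule can be written out directly. The sketch below defines a hypothetical helper, `median_by_hand()`, and tries it on the quiz scores with and without the outlier; in practice the built-in `median()` does the same job.

```r
# Hypothetical helper implementing the odd/even rule
median_by_hand <- function(x) {
  x <- sort(x)          # arrange values in numerical order
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]      # odd n: the middle observation
  } else {
    mean(x[c(n / 2, n / 2 + 1)]) # even n: average the two middle observations
  }
}

scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)
median_by_hand(scores)      # odd n (13 scores)
median_by_hand(scores[-1])  # even n (12 scores, outlier dropped)
```

Dropping the outlier moves the median only from 76 to 77.5, a much smaller shift than the one in the mean.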

Mean, Median, and Outliers

Mean, Median, Symmetry, Skew I

Symmetric distribution: mean $\approx$ median

symmetric %>%
  summarize(mean = mean(x),
            median = median(x))

## # A tibble: 1 x 2
##    mean median
##   <dbl>  <dbl>
## 1     4      4

Mean, Median, Symmetry, Skew II

Left-skewed: mean $<$ median

leftskew %>%
  summarize(mean = mean(x),
            median = median(x))

##       mean median
## 1 4.615385      5

Mean, Median, Symmetry, Skew III

Right-skewed: mean $>$ median

rightskew %>%
  summarize(mean = mean(x),
            median = median(x))

## # A tibble: 1 x 2
##    mean median
##   <dbl>  <dbl>
## 1  3.38      3
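The datasets behind these three slides are not shown, so the sketch below uses made-up vectors (`left_demo` and `right_demo`, both hypothetical) to illustrate how a long tail pulls the mean away from the median.

```r
# Made-up data for illustration (not the datasets used on the slides)
left_demo  <- c(1, 4, 5, 5, 6, 6, 6)  # long tail to the left
right_demo <- c(1, 1, 1, 2, 2, 3, 6)  # long tail to the right

# Left skew: the low values pull the mean below the median
c(mean = mean(left_demo), median = median(left_demo))

# Right skew: the high values pull the mean above the median
c(mean = mean(right_demo), median = median(right_demo))
```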

Measures of Dispersion

Measures of Dispersion: Range

The more variation in the data, the less a measure of central tendency will tell us
Beyond just the center, we also want to measure the spread
Simplest metric is the range: $range = max - min$
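A small sketch of the range for the quiz scores. Note that base R's `range()` returns the two endpoints rather than their difference; `score_range` is an illustrative name.

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

# range() returns the minimum and maximum, not their difference
range(scores)

# The range as a single number: max - min
score_range <- max(scores) - min(scores)
score_range
```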

Measures of Dispersion: 5 Number Summary I

Common set of summary statistics of a distribution: the "five number summary":
Minimum value
25th percentile (Q1, median of first 50% of data)
50th percentile (median, Q2)
75th percentile (Q3, median of last 50% of data)
Maximum value

# Base R summary command (includes Mean)
summary(quizzes$scores)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   76.00   72.62   86.00   95.00

quizzes %>% # dplyr
  summarize(Min = min(scores),
            Q1 = quantile(scores, 0.25),
            Median = median(scores),
            Q3 = quantile(scores, 0.75),
            Max = max(scores))

## # A tibble: 1 x 5
##     Min    Q1 Median    Q3   Max
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0    71     76    86    95
   

Measures of Dispersion: 5 Number Summary II

The $p^{th}$ percentile of a distribution is the value that places $p$ percent of values beneath it

quizzes %>%
  summarize("37th percentile" = quantile(scores,0.37))

## # A tibble: 1 x 1
##   `37th percentile`
##               <dbl>
## 1              72.3

Boxplots I

Boxplots are a great way to visualize the 5 number summary
Height of box: $Q_1$ to $Q_3$ (known as the interquartile range (IQR), middle 50% of data)
Line inside box: median ($50^{th}$ percentile)
"Whiskers" identify data within $1.5 \times IQR$ of the box
Points beyond whiskers are outliers
- common definition: an outlier lies more than $1.5 \times IQR$ below $Q_1$ or above $Q_3$
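The widely used $1.5 \times IQR$ rule for flagging outliers can be sketched in a few lines for the quiz scores. The names `lower_fence` and `upper_fence` are illustrative, and the quartiles come from R's default `quantile()` method.

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

q1  <- unname(quantile(scores, 0.25))
q3  <- unname(quantile(scores, 0.75))
iqr <- q3 - q1                    # same as IQR(scores)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Scores outside the fences are flagged as outliers
outliers <- scores[scores < lower_fence | scores > upper_fence]
outliers
```

Only the score of 0 falls outside the fences, which matches the outlier dropped in the mean example.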

Comparisons I

Boxplots (and five number summaries) are great for comparing two distributions

Example:

Comparisons II

quizzes_new %>% summary()

##     student       quiz_1          quiz_2     
##  Min.   : 1   Min.   : 0.00   Min.   :50.00  
##  1st Qu.: 4   1st Qu.:71.00   1st Qu.:73.00  
##  Median : 7   Median :76.00   Median :82.00  
##  Mean   : 7   Mean   :72.62   Mean   :80.62  
##  3rd Qu.:10   3rd Qu.:86.00   3rd Qu.:90.00  
##  Max.   :13   Max.   :95.00   Max.   :99.00

Aside: Making Nice Summary Tables I

I don't like the options available for printing out summary statistics
So I wrote my own R function called summary_table() that makes nice summary tables (it uses dplyr and tidyr!). To use:

1) Download the summaries.R file from the website^† and move it to your working directory/project folder
2) Load the function with the source() command:^‡

source("summaries.R")

^† One day I'll make this part of a package I'll write.

^‡ If it were a package, then you'd load it with library(). But you can run a single .R script with source().

Aside: Making Nice Summary Tables II

3) The function has at least 2 arguments: the data.frame (automatically piped in if you use the pipe!) and then all variables you want to summarize, separated by commas^†

mpg %>%
  summary_table(hwy, cty, cyl)

## # A tibble: 3 x 9
##   Variable   Obs   Min    Q1 Median    Q3   Max  Mean `Std. Dev.`
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>       <dbl>
## 1 cty        234     9    14     17    19    35 16.9         4.26
## 2 cyl        234     4     4      6     8     8  5.89        1.61
## 3 hwy        234    12    18     24    27    44 23.4         5.95

^† There is one restriction: No variable name can have an underscore (_) in it. You will have to rename them or else you will break the function!

Aside: Making Nice Summary Tables II

4) When knitted in R markdown, it looks nicer:

mpg %>%
  summary_table(hwy, cty, cyl) %>%
  knitr::kable(., format="html")

Variable	Obs	Min	Q1	Median	Q3	Max	Mean	Std. Dev.
cty	234	9	14	17	19	35	16.86	4.26
cyl	234	4	4	6	8	8	5.89	1.61
hwy	234	12	18	24	27	44	23.44	5.95

We'll talk more about using markdown and making final products nicer when we discuss your paper project (have you forgotten?)

Measures of Dispersion: Deviations

Every observation $i$ deviates from the mean of the data:

$$deviation_i = x_i - \mu$$

There are as many deviations as there are data points ($N$)
We can measure the average or standard deviation of a variable from its mean
Before we get there...

Variance (Population)

The population variance $\sigma^2$ of a population distribution measures the average of the squared deviations from the population mean $\mu$

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Why do we square deviations?
What are these units?

Standard Deviation (Population)

Square root the variance to get the population standard deviation $\sigma$, the average deviation from the population mean (in same units as $x$)

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

Variance (Sample)

The sample variance $s^2$ of a sample distribution measures the average of the squared deviations from the sample mean $\bar{x}$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Why do we divide by $n-1$?
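One way to see what the divisor does is to compute the variance both ways on a small series. The sketch below uses the values 2, 4, 6, 8, 10; `pop_style_var` and `sample_var` are illustrative names. Because the deviations are measured from the sample mean rather than the unknown population mean, dividing by $n$ tends to understate the spread, and dividing by $n-1$ corrects for that.

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)
sq_devs <- (x - mean(x))^2        # squared deviations from the mean

pop_style_var <- sum(sq_devs) / n        # divide by n
sample_var    <- sum(sq_devs) / (n - 1)  # divide by n - 1

pop_style_var
sample_var
all.equal(sample_var, var(x))     # R's var() uses the n - 1 divisor
```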

Standard Deviation (Sample)

Square root the sample variance to get the sample standard deviation $s$, the average deviation from the sample mean (in same units as $x$)

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Sample Standard Deviation: Example

Example: Calculate the sample standard deviation for the following series: $\{2, 4, 6, 8, 10\}$

sd(c(2,4,6,8,10))

## [1] 3.162278

The Steps to Calculate sd(), Coded I

#  first let's save our data in a tibble
sd_example<-tibble(x=c(2,4,6,8,10))
# first find the mean (just so we know)
sd_example %>%
  summarize(mean(x))

## # A tibble: 1 x 1
##   `mean(x)`
##       <dbl>
## 1         6

# now let's make some more columns:
sd_example <- sd_example %>%
  mutate(deviations = x-mean(x), # take deviations from mean
         deviations_sq = deviations^2) # square them

The Steps to Calculate sd(), Coded II

sd_example # see what we made

## # A tibble: 5 x 3
##       x deviations deviations_sq
##   <dbl>      <dbl>         <dbl>
## 1     2         -4            16
## 2     4         -2             4
## 3     6          0             0
## 4     8          2             4
## 5    10          4            16
   

The Steps to Calculate sd(), Coded III

sd_example %>%
  # sum the squared deviations
  summarize(sum_sq_devs = sum(deviations_sq), 
            # divide by n-1 to get variance
            variance = sum_sq_devs/(n()-1), 
            # square root to get sd
            std_dev = sqrt(variance))

## # A tibble: 1 x 3
##   sum_sq_devs variance std_dev
##         <dbl>    <dbl>   <dbl>
## 1          40       10    3.16
   

Sample Standard Deviation: You Try

You Try: Calculate the sample standard deviation for the following series: $\{1, 3, 5, 7\}$

sd(c(1,3,5,7))

## [1] 2.581989

Descriptive Statistics: Populations vs. Samples

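Population parameters

Population size: $N$
Mean: $\mu$
Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample statistics

Sample size: $n$
Mean: $\bar{x}$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$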
