You go into data analysis with the tools you know, not the tools you need
The next 2-3 weeks are all about giving you the tools you need
We will extend them as we learn specific models


Free and open source
A very large community
Can handle virtually any data format
Makes replication easy
Can integrate into documents (with R markdown)
R is a language so it can do everything

library("gapminder")
library("ggplot2")
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent))+
  geom_point(alpha=0.3)+
  geom_smooth(method = "lm")+
  scale_x_log10(breaks=c(1000,10000, 100000), labels=scales::dollar)+
  labs(x = "GDP/Capita", y = "Life Expectancy (Years)")+
  facet_wrap(~continent)+
  guides(color = "none")+
  theme_light()

library(gapminder)
The average GDP per capita is $`r round(mean(gapminder$gdpPercap),2)` with a standard deviation of $`r round(sd(gapminder$gdpPercap),2)`.
The average GDP per capita is $7215.33 with a standard deviation of $9857.45.
R is the programming language that executes commands
R Studio is an integrated development environment (IDE) that makes your coding life a lot easier
R Markdown
R Studio
R is like your car's engine, R Studio is the dashboard
You will do everything in R Studio
R itself is just a command language (you could run it in your computer's shell/terminal/command prompt)

R Studio
R Studio has 4 window panes:

†May not be immediately visible until you create new files.
You don't “learn R”, you learn how to do things in R
To learn this, you need to learn how to search for what you want to do
My #rstats learning path:
1. Install R
2. Install RStudio
3. Google "How do I [THING I WANT TO DO] in R?"
Repeat step 3 ad infinitum.
- Jesse Mostipak (@kierisi), August 18, 2017

Type individual commands into the console window
Great for testing individual commands to see what happens
Not saved! Not reproducible! Not recommended!
2+2
## [1] 4
summary(mpg$hwy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12.00   18.00   24.00   23.44   27.00   44.00

Source pane is a text-editor
Make .R files: all input commands in a single script
Comment with #
Can run any or all of script at once
Can save, reproduce, and send to others!
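As a rough sketch of what such a script might contain (the file name, object names, and numbers below are made up for illustration), these lines could be saved in a .R file and run one at a time or all at once:

```r
# A minimal example script; could be saved as, e.g., first_script.R (hypothetical name)

# store some hypothetical exam scores in an object
scores = c(82, 91, 77, 68, 95)

# compute two summary statistics and save them as objects
avg_score = mean(scores)  # average score
top_score = max(scores)   # highest score

# print the results
avg_score
top_score
```

Because every step is written down, anyone who receives the file can rerun it and get the same results.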

A later lecture: R Markdown, a simple markup language to write documents in
Can integrate text, R code, figures, citations & bibliographies in a single plain-text file & output into a variety of formats: PDF, webpage, slides, Word doc, etc.

Practicing typing at the Command line/Console
Learning different commands and objects relevant for data analysis
Saving and running .R scripts
Later: R markdown, literate programming, workflow management
Today may seem a bit overwhelming
R assumes a default (often inconvenient) "working directory" on your computer: the first place it looks to open or save files
Find out where this is with getwd()
Change it with setwd("path/to/folder")†
Soon I'll show you better ways where you won't ever have to worry about this
† Note the path is OS-specific. For Windows it might be C:/Documents/. For Mac it is often your username folder.
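A small sketch of checking and changing the working directory. The temporary folder from tempdir() is used here only as a stand-in because it exists on every system; in practice you would give the path to your own folder:

```r
# remember the current working directory so it can be restored later
old_wd = getwd()
old_wd

# switch to another folder (tempdir() is just a placeholder for your own path)
setwd(tempdir())
getwd()

# list the files R can now find without a full path
list.files()

# go back to where we started
setwd(old_wd)
```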

Hadley Wickham
Chief Scientist, R Studio
"There’s an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters." - R for Data Science, Ch. 4


help(function_name) or ?(function_name) to get documentation on a function
From Kieran Healy's excellent (free online!) book on Data Visualization.

# starts a comment, R will ignore everything on the rest of that line
# Run regression of y on x, save as reg1
reg1<-lm(y~x, data=data) # runs regression
summary(reg1)$coefficients # prints coefficients
I follow this style guide (you are not required to)†
Naming objects and files will become important‡
"my webpage in html" gets turned into http://my%20webpage%20in%20html.html
i_use_underscores
some.people.use.snake.case
othersUseCamelCase
† Also described in today's course notes page and the course reference page.
‡ Consider your folders on your computer as well...
You'll have to get used to the fact that you are coding in commands to execute
Start with the easiest: simple math operators and calculations:
> 2+2
## [1] 4
R prompts for input with > and gives output starting with ## [1]
2^3
## [1] 8
sqrt(25)
## [1] 5
log(6)
## [1] 1.791759
pi/2
## [1] 1.570796
Load a package with library(): library("package_name")
If you don't have a package yet, install it first with install.packages(): install.packages("package_name")
creating objects: assign values with = (or <-)
running functions on objects: function_name(object_name)
# make an object
my_object = -c(1,2,3,4,5)
# look at it
my_object
## [1] -1 -2 -3 -4 -5
# find the sum
sum(my_object)
## [1] -15
# find the mean
mean(my_object)
## [1] -3
Functions have "arguments," the input(s)
Some functions may have multiple inputs
The argument of a function can be another function!
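To make the idea of arguments concrete, here is a short sketch with base R's round() and mean(); the vector prices is a made-up example:

```r
# a hypothetical vector of prices
prices = c(3.456, 2.1, 9.999)

# round() takes two arguments: the object to round and the number of digits
rounded_positional = round(prices, 1)

# the second argument can also be given by name, which is easier to read
rounded_named = round(prices, digits = 1)

# both calls do the same thing
identical(rounded_positional, rounded_named)

# the output of one function can be the argument of another
round(mean(prices), digits = 2)
```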
# find the sd
sd(my_object)
## [1] 1.581139
# round everything in my object to two decimals
round(my_object,2)
## [1] -1 -2 -3 -4 -5
# round the sd to two decimals
round(sd(my_object),2)
## [1] 1.58
Numeric objects are just numbers†
Can be mathematically manipulated
x = 2
y = 3
x+y
## [1] 5
x*y
## [1] 6
† R may classify these as integer, or as double if there are decimal values.
Character objects are "strings" of text held inside quote marks
Can contain spaces, so long as contained within quote marks
name = "Ryan Safner"
address = "Washington D.C."
name
## [1] "Ryan Safner"
address
## [1] "Washington D.C."
Logical objects are TRUE or FALSE indicators
>, <: greater than, less than
>=, <=: greater than or equal to, less than or equal to
==, !=: is equal to, is not equal to†
%in%: is a member of the set of (∈)
&: "AND"
|: "OR"
† One = assigns a value (like <-). Two == evaluate a conditional statement.
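The set-membership operator %in% is the least familiar of these, so here is a brief sketch; the objects weekend and today are made up for illustration:

```r
# a hypothetical set of weekend days
weekend = c("Saturday", "Sunday")
today = "Tuesday"

today %in% weekend        # FALSE: Tuesday is not in the set
"Sunday" %in% weekend     # TRUE: Sunday is in the set
!(today %in% weekend)     # TRUE: ! negates the result

# with a vector on the left, each element is checked in turn
c(2, 5, 9) %in% c(1, 2, 3)  # TRUE FALSE FALSE
```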
z = 10 # set z equal to 10
z==10 # is z equal to 10?
## [1] TRUE
"red"=="blue" # is red equal to blue?
## [1] FALSE
z > 1 & z < 12 # is z > 1 AND < 12?
## [1] TRUE
z <= 1 | z==10 # is z <= 1 OR equal to 10?
## [1] TRUE
Factor objects contain categorical data: membership in mutually exclusive groups
Look like strings, behave more like logicals, but with more than two options
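One way factors like these might be created is with base R's factor() function. This is only a sketch: the object names class_year, class_factor, and class_ordered are assumptions, not taken from the slides.

```r
# hypothetical vector of class standings stored as text
class_year = c("senior", "junior", "freshman", "junior", "freshman",
               "sophomore", "junior", "freshman", "senior", "junior")

# convert to a factor, listing the possible categories explicitly
class_factor = factor(class_year,
                      levels = c("freshman", "sophomore", "junior", "senior"))
class_factor

# an ordered factor also records that the categories have a ranking
class_ordered = factor(class_year,
                       levels = c("freshman", "sophomore", "junior", "senior"),
                       ordered = TRUE)
class_ordered

# with an ordering, comparisons become meaningful
class_ordered[1] > class_ordered[2]  # is senior ranked above junior?
```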
## [1] senior junior freshman junior freshman sophomore junior
## [8] freshman senior junior
## Levels: freshman sophomore junior senior

## [1] senior junior freshman junior freshman sophomore junior
## [8] freshman senior junior
## Levels: freshman < sophomore < junior < senior
Vector: the simplest type of object, just a collection of objects
Make a vector using the combine c() function
# create a vector called vec
vec = c(1,"orange", 83.5, pi)
# look at vec
vec
## [1] "1" "orange" "83.5" "3.14159265358979"
Data frame: what we'll be using almost always
Think like a "spreadsheet"
Each column is a vector (variable)
Each row is an observation (pair of values for all variables)
library("ggplot2")
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
Dataframes are really just combinations of (column) vectors
You can make data frames by combining named vectors with data.frame() or creating each column/vector in each argument
# make two vectors
fruits = c("apple","orange","pear","kiwi","pineapple")
numbers = c(3.3,2.0,6.1,7.5,4.2)
# combine into dataframe
df = data.frame(fruits,numbers)
# do it all in one step (note the = instead of <-)
df = data.frame(fruits=c("apple","orange","pear","kiwi","pineapple"),
                numbers=c(3.3,2.0,6.1,7.5,4.2))
# look at it
df
##      fruits numbers
## 1     apple     3.3
## 2    orange     2.0
## 3      pear     6.1
## 4      kiwi     7.5
## 5 pineapple     4.2
Create an object with the assignment operator = or <-
my_vector = c(1,2,3,4,5)
my_vector # view an object by typing its name
## [1] 1 2 3 4 5
my_vector = c(2,7,9,1,5) # assigning again overwrites the object
my_vector
## [1] 2 7 9 1 5
Check an object's class with class()
class("six")
## [1] "character"
class(6)
## [1] "numeric"
Test whether an object belongs to a class with the is.() functions
is.numeric("six")
## [1] FALSE
is.character("six")
## [1] TRUE
Convert between classes with as.object_class(): character, numeric, etc!
as.character(6)
## [1] "6"
as.numeric("six")
## [1] NA
mixed_vector = c(pi, 12, "apple", 6.32)
class(mixed_vector)
## [1] "character"
mixed_vector
## [1] "3.14159265358979" "12" "apple" "6.32"
df
##      fruits numbers
## 1     apple     3.3
## 2    orange     2.0
## 3      pear     6.1
## 4      kiwi     7.5
## 5 pineapple     4.2
class(df$fruits)
## [1] "character"
class(df$numbers)
## [1] "numeric"
†Remember each column in a data frame is a vector!
Use the str() command to view an object's structure
class(df)
## [1] "data.frame"
str(df)
## 'data.frame':	5 obs. of  2 variables:
##  $ fruits : chr  "apple" "orange" "pear" "kiwi" ...
##  $ numbers: num  3.3 2 6.1 7.5 4.2
View the first few (n) rows with head()
head(df)
##      fruits numbers
## 1     apple     3.3
## 2    orange     2.0
## 3      pear     6.1
## 4      kiwi     7.5
## 5 pineapple     4.2
head(df, n=2)
##   fruits numbers
## 1  apple     3.3
## 2 orange     2.0
Get summary statistics† with summary()
summary(df)
##     fruits             numbers    
##  Length:5           Min.   :2.00  
##  Class :character   1st Qu.:3.30  
##  Mode  :character   Median :4.20  
##                     Mean   :4.62  
##                     3rd Qu.:6.10  
##                     Max.   :7.50  
† For numeric data only; character data shows only its length and class, and factor data shows a frequency table
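For categorical columns, counts are usually more useful than means. A minimal sketch, using a made-up data frame called pets:

```r
# hypothetical data frame with one text column and one numeric column
pets = data.frame(animal = c("cat", "dog", "cat", "bird", "dog", "cat"),
                  weight = c(4.1, 20.3, 3.8, 0.1, 25.0, 5.2))

# table() counts how often each value appears in a column
table(pets$animal)

# once the column is converted to a factor, summary() reports the same counts
pets$animal = factor(pets$animal)
summary(pets$animal)
```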

data.frame objects can be viewed in their own panel by clicking on the name of the object
my_vector = c(2,4,5,10)
my_vector+4 # add 4 to all elements
## [1] 6 8 9 14
my_vector^2 # square all elements
## [1] 4 16 25 100
length(my_vector) # how many elements
## [1] 4
sum(my_vector) # add all elements
## [1] 21
max(my_vector) # find largest element
## [1] 10
min(my_vector) # find smallest element
## [1] 2
mean(my_vector) # mean of all elements
## [1] 5.25
median(my_vector) # median of all elements
## [1] 4.5
sd(my_vector) # standard deviation
## [1] 3.40343
If you leave a command unfinished, R shows a + sign waiting for you to finish the command
> 2+(2*3
+
Either finish the command (e.g. add the missing parenthesis) or hit Esc to cancel
mtcars
##                    mpg cyl  disp  hp drat    wt  qsec
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30
## Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90
## Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40
Each element of a data frame is indexed by row and column: df[r,c]
Leaving r or c blank selects all rows or columns
Select multiple values with c()†
Select a range of values with :
These work for both r and c!
† You can also "negate" values, selecting everything except for values with a - in front of them.
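Negative indexing might look like the sketch below. It uses a small made-up subset of R's built-in mtcars data, called small_cars here, so the row and column counts are easy to follow:

```r
# a small hypothetical data frame: the first 5 rows and 4 columns of built-in mtcars
small_cars = mtcars[1:5, 1:4]

# every row except the first
small_cars[-1, ]

# every column except the first two
small_cars[, -c(1, 2)]

# every row except rows 1 through 3 (note the parentheses around the range)
small_cars[-(1:3), ]
```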
mtcars[1,] # first row
##           mpg cyl disp  hp drat   wt  qsec
## Mazda RX4  21   6  160 110  3.9 2.62 16.46
mtcars[c(1,3,4),] # first, third, and fourth rows
##                 mpg cyl disp  hp drat    wt  qsec
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44
mtcars[1:3,] # first three rows
##                mpg cyl disp  hp drat    wt  qsec
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61
mtcars[,6] # select column 6
## [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
mtcars[,2:4] # select columns 2 through 4
##                   cyl  disp  hp
## Mazda RX4           6 160.0 110
## Mazda RX4 Wag       6 160.0 110
## Datsun 710          4 108.0  93
## Hornet 4 Drive      6 258.0 110
## Hornet Sportabout   8 360.0 175
## Valiant             6 225.0 105
## Duster 360          8 360.0 245
## Merc 240D           4 146.7  62
## Merc 230            4 140.8  95
## Merc 280            6 167.6 123
## Merc 280C           6 167.6 123
## Merc 450SE          8 275.8 180
[[]] selects a column by position
mtcars[[6]]
## [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
$ selects a column by name
mtcars$wt # same thing
## [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
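A quick sketch confirming that these column selectors return the same vector. It uses R's full built-in mtcars data (the slides display only some of its rows and columns); the object names are made up for illustration:

```r
# three ways to pull out the weight column as a plain vector
by_position = mtcars[[6]]     # column 6 by position
by_name     = mtcars[["wt"]]  # the same column by name
by_dollar   = mtcars$wt       # the same column with $

# all three give identical results
identical(by_position, by_name)
identical(by_name, by_dollar)
```

Selecting by name is usually safer than by position, since the code keeps working if columns are reordered.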
mtcars[mtcars$wt>4,] # select obs with wt>4
##             mpg cyl  disp  hp drat   wt qsec
## Merc 450SE 16.4   8 275.8 180 3.07 4.07 17.4
mtcars[mtcars$cyl==6,] # select obs with exactly 6 cyl
##                 mpg cyl  disp  hp drat    wt  qsec
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90
mtcars[mtcars$wt<4 & mtcars$wt>2,] # select obs where 2<wt<4
##                    mpg cyl  disp  hp drat    wt  qsec
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30
## Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90
mtcars[mtcars$cyl==4 | mtcars$cyl==6,] # select obs with 4 OR 6 cyl
##                 mpg cyl  disp  hp drat    wt  qsec
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90
Next class: data visualization with ggplot2
And then: data wrangling with tidyverse
And then: literate programming and workflow management with R Markdown, R Projects, maybe git
Finally: onto statistics and econometric theory!