1.4 — Data Wrangling in the tidyverse — R Practice
Getting Set Up
Before we begin, start a new file with New File → R Script. As you work through this sheet in the console in R, also add (copy/paste) your commands that work into this new file. At the end, save it, and run it to execute all of your commands at once.
First things first, load
Warm Up to
select() the variables
select() all variables except
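A minimal sketch of both forms of select(), using a small made-up tibble rather than the worksheet's data. The object name `toy`, its values, and the column name `lifeExp` are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data; all values are invented for illustration
toy <- tibble(
  country = c("Aland", "Bland"),
  year    = c(1997, 2007),
  lifeExp = c(70.1, 65.3),
  pop     = c(5e6, 1.2e9)
)

# Keep only the named columns
kept <- select(toy, country, year)

# Keep everything except one column by prefixing it with a minus sign
dropped <- select(toy, -pop)

names(kept)
names(dropped)
```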
arrange() by year.
arrange() by year, but in descending order.
arrange() by year, then by life expectancy.
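The three sorting variants above can be sketched on a made-up tibble. The object `toy`, its values, and the column name `lifeExp` are hypothetical stand-ins, not the worksheet's data.

```r
library(dplyr)

# Hypothetical toy data, deliberately out of order
toy <- tibble(
  country = c("Aland", "Aland", "Bland", "Bland"),
  year    = c(2007, 1997, 2007, 1997),
  lifeExp = c(72, 70, 65, 60)
)

# Ascending is the default
by_year <- arrange(toy, year)

# Wrap a column in desc() to sort it in descending order
by_year_desc <- arrange(toy, desc(year))

# Additional columns break ties left to right
by_year_life <- arrange(toy, year, lifeExp)
by_year_life
```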
filter() observations with pop greater than 1 billion.
Of those, look only at
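As a rough illustration of filtering on one condition and then narrowing further, here is a sketch on invented data. The country names are hypothetical placeholders, not the ones the exercise has in mind.

```r
library(dplyr)

# Hypothetical toy data; country names and populations are invented
toy <- tibble(
  country = c("Aland", "Bland", "Cland"),
  pop     = c(5e6, 1.2e9, 1.3e9)
)

# One condition: keep rows where pop exceeds 1 billion
big <- filter(toy, pop > 1e9)

# Conditions separated by commas must all hold (logical AND)
big_one <- filter(toy, pop > 1e9, country == "Bland")
big_one
```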
Try out the pipe (%>%) if you haven’t already, by chaining commands:
select() your data to look only at country in the year 1997, for countries that have a gdpPercap greater than 20,000, and arrange() them alphabetically.
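One way such a chain might look, sketched on made-up data (the object `toy` and its values are hypothetical). Note the ordering: filter() has to come before select(), because once only country is kept there is no year or gdpPercap column left to filter on.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  country   = c("Norland", "Aland", "Bland", "Aland"),
  year      = c(1997, 1997, 1997, 2007),
  gdpPercap = c(30000, 25000, 1500, 28000)
)

rich_1997 <- toy %>%
  filter(year == 1997, gdpPercap > 20000) %>%  # rows first, while the columns still exist
  select(country) %>%                          # then narrow to the one column
  arrange(country)                             # character columns sort alphabetically

rich_1997
```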
mutate() a new variable called GDP that is equal to gdpPercap * pop.
mutate() a new population variable that is the pop in millions.
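Both mutate() steps can be sketched together on invented data. The name `pop_mil` for the population-in-millions column is an assumption; the worksheet does not name it.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  country   = c("Aland", "Bland"),
  pop       = c(5e6, 1.2e9),
  gdpPercap = c(25000, 1500)
)

with_new <- toy %>%
  mutate(
    GDP     = gdpPercap * pop,  # total GDP from per-capita GDP and population
    pop_mil = pop / 1e6         # hypothetical column name: population in millions
  )

with_new
```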
summarize() to get the average GDP per capita.
Get the number of observations, average, minimum, maximum, and standard deviation for GDP per capita.
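A sketch of computing several summary statistics in a single summarize() call, on made-up numbers. The output column names (`n_obs`, `avg`, and so on) are arbitrary choices for this example.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(gdpPercap = c(1000, 2000, 3000))

stats <- toy %>%
  summarize(
    n_obs = n(),              # number of rows
    avg   = mean(gdpPercap),
    low   = min(gdpPercap),
    high  = max(gdpPercap),
    sd    = sd(gdpPercap)     # sample standard deviation
  )

stats
```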
Get the average GDP per capita over time. Hint, first
Get the average GDP per capita by continent.
Get the average GDP per capita by year and by continent. Hint: do year first; if you do continent first, there are no years to group by!
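Grouped averages can be sketched as follows, on invented data with hypothetical continent labels. The column name `avg_gdp` is an arbitrary choice, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  continent = c("X", "X", "Y", "Y", "X", "Y"),
  year      = c(1997, 1997, 1997, 1997, 2007, 2007),
  gdpPercap = c(1000, 3000, 10000, 20000, 4000, 30000)
)

# One row per year
by_year <- toy %>%
  group_by(year) %>%
  summarize(avg_gdp = mean(gdpPercap))

# One row per year-continent combination
by_year_cont <- toy %>%
  group_by(year, continent) %>%
  summarize(avg_gdp = mean(gdpPercap), .groups = "drop")

by_year_cont
```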
Then save this as another
gdp. Create a
ggplot of a
line graph of average continent GDP over time using the
Try it again all in one command with the pipe %>%. Instead of saving the data as gdp, pipe it right into ggplot! Hint: You can use . as a placeholder.
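A sketch of piping a summarized table straight into ggplot(), with `.` standing in for the piped data. The data and the names `toy` and `avg_gdp` are hypothetical. Layers are still added with `+`, not `%>%`, once ggplot() has been called.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  continent = c("X", "X", "Y", "Y", "X", "Y"),
  year      = c(1997, 1997, 1997, 1997, 2007, 2007),
  gdpPercap = c(1000, 3000, 10000, 20000, 4000, 30000)
)

p <- toy %>%
  group_by(year, continent) %>%
  summarize(avg_gdp = mean(gdpPercap), .groups = "drop") %>%
  ggplot(data = ., aes(x = year, y = avg_gdp, color = continent)) +  # `.` is the piped-in table
  geom_line()

p
```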
Example: the Economics of College Majors
Now let’s step it up to work with some data “in the wild” to answer some research questions. This will have you combine your
dplyr skills and add some new things such as importing with
Let’s look at fivethirtyeight’s article "The Economic Guide To Picking A College Major". fivethirtyeight is great about making the data behind their articles public; we can download all of their data here. Search for college majors and click download (the blue arrow button). This will download a .zip file that contains many spreadsheets. Unzip it with a program that unzips files (such as WinZip, 7-zip, the Unarchiver, etc).
We will look at the
The description in the readme file for the data is as follows:
||Rank by median earnings|
||Major code, FO1DP in ACS PUMS|
||Category of major from Carnevale et al|
||Total number of people with major|
||Sample size (unweighted) of full-time, year-round ONLY (used for earnings)|
||Women as share of total|
||Number employed (ESR == 1 or 2)|
||Employed 35 hours or more|
||Employed less than 35 hours|
||Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)|
||Number unemployed (ESR == 3)|
||Unemployed / (Unemployed + Employed)|
||Median earnings of full-time, year-round workers|
||25th percentile of earnings|
||75th percentile of earnings|
||Number with job requiring a college degree|
||Number with job not requiring a college degree|
||Number in low-wage service jobs|
Import the data with read_csv() and assign it to an object called majors. One way to avoid error messages is to move (on your computer)
recent_grads.csv to the same folder as R’s working directory, which again you can check with
The first argument of this command is the name of the original file, in quotes. If the file is in a different folder, the argument is the full path in quotes.
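To illustrate the file-path argument without depending on the downloaded file, this sketch writes a tiny made-up CSV to a temporary location and reads it back. The names `demo_path` and `majors_demo`, and the two rows of data, are hypothetical.

```r
library(readr)

# Hypothetical stand-in for a downloaded spreadsheet, written to a temp file
demo_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(Major = c("Demo Major A", "Demo Major B"), Median = c(40000, 55000)),
  demo_path
)

getwd()  # prints R's current working directory

# The first argument is the file name or full path, in quotes
# (here it is held in a variable instead)
majors_demo <- read_csv(demo_path)
majors_demo
```

With a file sitting in the working directory, the bare file name in quotes would be enough; anywhere else needs the full path, as `demo_path` holds here.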
Look at the data with glimpse(). This is a souped-up version of
What are all of the unique values of Major? How many are there?
Which major has the lowest unemployment rate?
What are the top 3 majors that have the highest percentage of women?
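The three questions above share a pattern: find distinct values, or sort and take the first few rows. A sketch on invented data follows; the column names `Unemployment_rate` and `ShareWomen` are assumptions about how the file labels those variables, and `majors_demo` is a hypothetical stand-in for the imported data.

```r
library(dplyr)

# Hypothetical toy data; majors and values are invented for illustration
majors_demo <- tibble(
  Major             = c("A", "B", "C", "D"),
  Unemployment_rate = c(0.05, 0.00, 0.10, 0.02),
  ShareWomen        = c(0.2, 0.9, 0.5, 0.7)
)

# Unique values, and how many there are
unique_majors <- distinct(majors_demo, Major)
n_majors <- n_distinct(majors_demo$Major)

# Lowest value: sort ascending and keep the first row
lowest_unemp <- majors_demo %>%
  arrange(Unemployment_rate) %>%
  head(1)

# Top 3: sort descending and keep the first three rows
top3_women <- majors_demo %>%
  arrange(desc(ShareWomen)) %>%
  head(3)

top3_women
```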
Make a boxplot of Median wage by Major_Category. You won’t be able to read the labels easily, so add theme(axis.text.x=element_text(angle=45, hjust=1)) to angle the x-axis labels (and move them down by 1).
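A sketch of the boxplot with angled labels, on invented data. The tibble `majors_demo` and its category labels are hypothetical; the column names follow the wording of the exercise.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; categories and values are invented for illustration
majors_demo <- tibble(
  Major_Category = c("Category One", "Category One", "Category Two", "Category Two"),
  Median         = c(60000, 70000, 30000, 35000)
)

p <- ggplot(majors_demo, aes(x = Major_Category, y = Median)) +
  geom_boxplot() +
  # Rotate the x-axis labels 45 degrees; hjust = 1 aligns each label's end with its tick
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```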
Which major category is the least popular in this sample? Hint: use
Is there a systematic difference in median earnings between STEM majors and non-STEM majors? First define:
Next, make a variable called
stem, for whether or not a
"not_stem". Hint: try out the ifelse() function, which has three inputs: condition(s) for a variable(s), what to do if TRUE (the if), and what to do if FALSE (the else), i.e.
You’ll of course need to change the do_this into something!
median for stem and not stem groups.
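The ifelse() step and the grouped median can be sketched on invented data. The vector `stem_demo` is a hypothetical stand-in for the worksheet's own STEM definition, and the category labels, the "stem" label, and the earnings values are made up.

```r
library(dplyr)

# Hypothetical list of categories counted as STEM for this sketch only
stem_demo <- c("Engineering", "Computers & Mathematics")

# Hypothetical toy data; values are invented for illustration
majors_demo <- tibble(
  Major_Category = c("Engineering", "Arts", "Computers & Mathematics", "Education"),
  Median         = c(60000, 30000, 50000, 34000)
)

by_stem <- majors_demo %>%
  # ifelse(condition, value if TRUE, value if FALSE)
  mutate(stem = ifelse(Major_Category %in% stem_demo, "stem", "not_stem")) %>%
  group_by(stem) %>%
  summarize(median_earnings = median(Median))

by_stem
```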