1.3 — Data Visualization with ggplot2 — R Practice
Getting Set Up
Before we begin, start a new file with File \(\rightarrow\) New File \(\rightarrow\) R Script. As you work through this sheet in the R console, also add (copy/paste) the commands that work into this new file. At the end, save it and run it to execute all of your commands at once.
“Our Plot” from Class
Download the file and run it in RStudio on your computer (or open the file in our RStudio Cloud project and run it there) to see our plot from class.
Exploring the Data
1
We will look at GDP per capita and life expectancy using some data from the Gapminder project. There is a handy package called gapminder that contains a small snippet of this data for exploratory analysis. Install and load the package gapminder. Type ?gapminder and hit enter to see a description of the data.
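One possible way to carry out the install-and-load step is sketched below. The requireNamespace() guard is an addition of this sketch (it just avoids reinstalling a package you already have); a plain install.packages("gapminder") works too.

```r
# Install once per machine; skipped if the package is already available
if (!requireNamespace("gapminder", quietly = TRUE)) {
  install.packages("gapminder")
}

library(gapminder)  # load the package (needed in every new R session)
?gapminder          # open the help page describing the dataset
```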
2
Let’s get a quick look at gapminder to see what we’re dealing with.
- Use str() to get the structure of the gapminder data.
- What variables are there?
- Look at the head() of the dataset to get an idea of what the data looks like.
- Get summary() statistics of all variables.
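The bullet points above map onto base R functions roughly as follows. This is a minimal sketch; names() is one extra call used here to list the variables, and it is not required by the exercise.

```r
library(gapminder)

str(gapminder)      # structure: column names, types, first few values
names(gapminder)    # just the variable names
head(gapminder)     # first six rows
summary(gapminder)  # summary statistics for every column
```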
Simple Plots in Base R
3
Let’s make sure you can do some basic plots before we get into the gg. Use base R’s hist() function to plot a histogram of gdpPercap.
4
Use base R’s boxplot() function to plot a boxplot of gdpPercap.
5
Now make it a boxplot by continent. Hint: use formula notation with ~.
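To illustrate the formula notation from the hint, here is a small sketch that uses lifeExp rather than gdpPercap, so the exercise itself is left for you. The formula reads "y ~ group": the variable on the left is split into one box per level of the variable on the right.

```r
library(gapminder)

# One box of life expectancy per continent
boxplot(lifeExp ~ continent, data = gapminder)
```

The data argument lets you write bare column names inside the formula instead of gapminder$lifeExp and gapminder$continent.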
6
Now make a scatterplot of gdpPercap on the \(x\)-axis and lifeExp on the \(y\)-axis.
Plots with ggplot2
7
Load the package ggplot2 (you should have installed it previously; if not, install it first with install.packages("ggplot2")).
8
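Let’s first make a bar graph to see how many countries are in each continent (strictly, the bars count rows, so each country is counted once for every year it appears in the data). The only aesthetic (aes) you need is to map continent to x. Bar graphs are great for representing categorical data, but not quantitative data.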
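A sketch of the bar graph, followed by a quick check of what the bar heights represent. geom_bar() counts rows per category; the tapply() line is an extra of this sketch that counts distinct countries instead, for comparison. The object name n_countries is made up for this sketch.

```r
library(gapminder)
library(ggplot2)

# Bar heights = number of rows (country-year observations) per continent
p <- ggplot(gapminder, aes(x = continent)) +
  geom_bar()
print(p)

# The same row counts as a table
table(gapminder$continent)

# Number of distinct countries per continent
n_countries <- tapply(gapminder$country, gapminder$continent,
                      function(x) length(unique(x)))
n_countries
```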
9
For quantitative data, we want a histogram to visualize the distribution of a variable. Make a histogram of gdpPercap. Your only aesthetic here is to map gdpPercap to x.
10
Now let’s try adding some color: specifically, add an aesthetic that maps continent to fill.
In general, color refers to the outside border of a geom (except points), while fill is the interior of an object.
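To see the difference between color and fill, the sketch below sets both to fixed values on a histogram of lifeExp. The particular colors and bins = 30 are arbitrary choices for illustration.

```r
library(gapminder)
library(ggplot2)

# color = border of each bar, fill = interior of each bar
p <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(color = "black", fill = "skyblue", bins = 30)
print(p)
```

Here both are set to constants outside aes(); mapping a variable such as continent to fill inside aes() colors the interiors by group instead.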
11
Instead of a histogram, change the geom to make it a density graph. To avoid overplotting, add alpha=0.4 as an argument to the geom (alpha changes the transparency of a fill).
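One way this step might look is sketched below. Note that alpha goes inside geom_density() but outside aes(), because it is a fixed setting rather than a mapping from a variable.

```r
library(gapminder)
library(ggplot2)

# Overlapping densities by continent, made semi-transparent with alpha
p <- ggplot(gapminder, aes(x = gdpPercap, fill = continent)) +
  geom_density(alpha = 0.4)
print(p)
```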
12
Redo your plot from question 11 for lifeExp instead of gdpPercap.
13
Now let’s try a scatterplot of lifeExp (as y) on gdpPercap (as x). You’ll need both as aesthetics. The geom here is geom_point().
14
Add some color by mapping continent to color in your aesthetics.
15
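Now let’s try adding a regression line with geom_smooth(). Add this layer on top of your geom_point() layer. (By default geom_smooth() draws a flexible smoothed curve; use method = "lm" if you want a straight line.)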
16
Did you notice that you got multiple regression lines (colored by continent)? That’s because we set a global aesthetic mapping continent to color. If we want just one regression line, we instead need to move color = continent inside the aes() of geom_point(). This maps continent to color only for the points, not for anything else.
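The global-versus-local distinction can be sketched side by side. The object names p_global and p_local are made up for this sketch.

```r
library(gapminder)
library(ggplot2)

# Global mapping: color = continent applies to every layer,
# so geom_smooth() fits one line per continent
p_global <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  geom_smooth()

# Local mapping: color = continent applies to the points only,
# so geom_smooth() fits a single line through all the data
p_local <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  geom_smooth()

print(p_global)
print(p_local)
```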
17
Now add an aesthetic to your points to map pop to size.
18
Change the color of the regression line to "black". Try first by putting this inside an aes() in your geom_smooth(), and try a second time by just putting it inside geom_smooth() without an aes(). What’s the difference, and why?
19
Another way to separate out continents is with faceting. Add +facet_wrap(~continent) to create subplots by continent.
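As a small illustration of faceting on a simpler plot than the one in the exercise, the sketch below splits a histogram of lifeExp into one panel per continent. The choice of bins = 30 is arbitrary.

```r
library(gapminder)
library(ggplot2)

# One histogram panel per continent
p <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30) +
  facet_wrap(~continent)
print(p)
```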
20
Remove the facet layer. The scale of the x-axis is quite annoying: a lot of points are clustered at the low end. Let’s try changing the scale by adding a layer: +scale_x_log10().
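A stripped-down sketch of the log scale, shown on a plain scatterplot. Only the axis changes: each step along the x-axis becomes a tenfold increase in gdpPercap, which spreads out the crowded low values.

```r
library(gapminder)
library(ggplot2)

# Same points as before, but x is drawn on a base-10 log scale
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
print(p)
```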
21
Now let’s fix the labels by adding +labs(). Inside labs(), make proper axis titles for x and y, and add a title to the plot. If you want to change the names of the legends (continent color and population size), add entries for color and size.
22
Now let’s try subsetting by looking only at the Americas. Take the gapminder dataframe and subset it to only the rows where continent=="Americas". Assign this to a new dataframe object (call it something like america). Now, use this as your data, and redo the graph from question 17. (You might want to take a look at your new dataframe to make sure it worked first!)
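One base R way to do the subsetting is sketched below (other approaches, such as subset(), work as well). The second example reuses the same pattern on a different column; the name year_2002 is made up for this sketch.

```r
library(gapminder)

# Keep only the rows where continent is "Americas" (and all columns)
america <- gapminder[gapminder$continent == "Americas", ]

# Check that it worked
head(america)
table(america$continent)  # other continents remain as factor levels with 0 rows

# Same pattern with a condition on another column
year_2002 <- gapminder[gapminder$year == 2002, ]
nrow(year_2002)
```

Passing america as the data argument of ggplot() then restricts the plot to those rows.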
23
Try this again for the whole world, but just for observations in the year 2002.