2.2 — Random Variables and Distributions — Class Notes
Contents
Tuesday, September 8, 2019
Overview
Today we finish your crash course/review of basic statistics with random variables and distributions.
Slides
Live Class Session on Zoom
The live class Zoom meeting link can be found on Blackboard (see LIVE ZOOM MEETINGS
on the left navigation menu), starting at 11:30 AM.
If you are unable to join today’s live session, or if you want to review, you can find the recording stored on Blackboard via Panopto (see Class Recordings
on the left navigation menu).
Problem Set
Problem Set 1 answers are posted on that page in various formats.
Problem set 2 (on classes 2.1-2.2) is posted shortly, and is will be due by Sunday September 13.
Math Appendix: Properties of Expected Value and Variance
There are several useful mathematical properties of expected value and variance.
Property 1: the expected value of a constant is itself, and the variance of a constant is 0.
\[\begin{align*} E(c)&=c\\ var(c)&=0\\ sd(c)&=0\\ \end{align*}\]
For any constant, \(c\)
- Example: \(E(2)=2\), \(var(2)=0\), \(sd(2)=0\)
Property 2: adding or subtracting a constant to a random variable and then taking the mean or variance:
\[\begin{align*} E(X \pm c)&=E(X) \pm c\\ var(X \pm c)&=X\\ sd(X \pm c)&=X\\ \end{align*}\]
For any constant, \(c\)
- Example: \(E(2+X)=2+E(X)\), \(var(2+X)=var(X)\), \(sd(2+X)=sd(X)\)
Property 3: multiplying a constant to a random variable and then taking the mean or variance:
\[\begin{align*} E(aX)&=E(X) aE(X)\\ var(aX)&=a^2var(X)\\ sd(aX)&=|a|sd(X)\\ \end{align*}\]
For any constant, \(a\)
- Example: \(E(2X)=2E(X)\), \(var(2X)=4var(X)\), \(sd(2X)=2sd(X)\)
Property 4: the expected value of the sum of two random variables is equal to the sum of each random variable’s expected value:
\[E(X \pm Y)=E(X) \pm E(Y)\]
R Appendix: Graphing Statistical and Mathematical Functions in R
The mosaic
package is useful for making and using mathematical functions in R
.
Creating Mathematical Functions
You can create custom mathematical functions using mosaic by defining an R function()
with multiple arguments. As a simple example, make the function \(f(x) = 10x-x^2\) (with one argument, \(x\) since it is a univariate function) as follows:
# store as a named function, I'll call it "my_function"
my_function<-function(x){10*x-x^2}
# look at it
my_function
## function(x){10*x-x^2}
There are some notational requirements from R
for making functions. Any coefficient in front of a variable (such as the 10 in 10x
must be explicitly multiplied by the variable, as in 10*x
).
To use the function to calculate its value at a particular value of x
, simply define what the (x)
is and run your named function on it:
## [1] 16
## [1] 16 24
## [1] 16 21 24 25 24 21
## [1] 16 24
Graphing Mathematical Functions
In ggplot
there is a dedicated stat_function()
(equivalent to a geom_
layer) to graph mathematical and statistical functions. All that is needed is a data.frame
of a range of x
values to act as the source for data
, and set x
equal to those values for aes
thetics.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ purrr 0.3.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x mosaic::count() masks dplyr::count()
## x purrr::cross() masks mosaic::cross()
## x mosaic::do() masks dplyr::do()
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x ggstance::geom_errorbarh() masks ggplot2::geom_errorbarh()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x mosaic::stat() masks ggplot2::stat()
## x mosaic::tally() masks dplyr::tally()
## x tidyr::unpack() masks Matrix::unpack()
Then we add the stat_function
, where fun =
is the most important argument where you define the to function to graph as your function created above, for example, our my_function
.
You can also adjust things like size, color, and line type.
ggplot(data = data.frame(x = 1:10))+
aes(x = x)+
stat_function(fun = my_function, color = "blue", size = 2, linetype = "dashed")
Bult-in Statistical Functions
There are some standard statistical distributions built into R. They require a combination of a specific prefix and a distribution.
Prefixes:
Action/Type | Prefix |
---|---|
random draw | r |
density (pdf) | d |
cumulative density (cdf) | p |
quantile (inverse cdf) | q |
Distributions:
Distribution | Name in R |
---|---|
Normal | norm |
Uniform | unif |
Student’s t | t |
Binomial | binom |
Negative binomial | nbinom |
Hypergeometric | hyper |
Weibull | weibull |
Beta | beta |
Gamma | gamma |
Thus, what you want is a combination of the prefix and the distribution.
Some common examples:
- Take random draws from a normal distribution:
rnorm(n = 10, # take 10 draws from a normal distribution with:
mean = 2, # mean of 2
sd = 1) # sd of 1
## [1] 2.6070850 2.4363768 2.4640655 1.6825498 0.1926861 1.8054952 0.4379302
## [8] 2.9041258 2.5405607 1.3065504
- Get probability of a random variable being less than or equal to a value (cdf) from a normal distribution:
# find probability of area to the LEFT of a number on pdf (note this = cdf of that number!)
pnorm(q = 80, # number is 80 from a distribution where:
mean = 200, # mean is 100
sd = 100, # sd is 100
lower.tail = TRUE) # looking to the LEFT in lower tail
## [1] 0.1150697
- Find the value of a distribution that is a specified percentile.
# find the 38th percentile value
qnorm(p = 0.38, # 38th percentile from a distribution where:
mean = 200, # mean is 200
sd = 100) # sd is 100
## [1] 169.4519
Graphing Statistical Functions
You can also graph these commonly used statistical functions by setting fun =
the named functions in your stat_function()
layer. If you want to specify the mean and standard deviation, use args = list()
to include the required arguments from the named function above (e.g. dnorm
needs mean
and sd
).
ggplot(data = data.frame(x = -400:600))+
aes(x = x)+
stat_function(fun = dnorm, args = list(mean = 200, sd = 200), color = "blue", size = 2, linetype = "dashed")
If you don’t include this, it will graph the standard distribution:
ggplot(data = data.frame(x = -4:4))+
aes(x = x)+
stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed")
To add shading under a distribution, simply add a duplicate of the stat_function()
layer, but add geom="area"
to indicate the area beneath the function should be filled, and you can limit the domain of the fill
with xlim=c(start,end)
, where start
and end
are the x-values for the endpoints of the fill.
# graph normal distribution and shade area between -2 and 2
ggplot(data = data.frame(x = -4:4))+
aes(x = x)+
stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed")+
stat_function(fun = dnorm, xlim = c(-2,2), geom = "area", fill = "green", alpha=0.5)
Hence, here is one graph from my slides:
ggplot(data = tibble(x=35:115))+
aes(x = x)+
stat_function(fun = dnorm, args = list(mean = 75, sd = 10), size = 2, color="blue")+
stat_function(fun = dnorm, args = list(mean = 75, sd = 10), geom = "area", xlim = c(65,85), fill="blue", alpha=0.5)+
labs(x = "X",
y = "Probability")+
scale_x_continuous(breaks = seq(35,115,5))+
theme_classic(base_family = "Fira Sans Condensed",
base_size=20)