2.2 — Random Variables and Distributions — Class Notes


Tuesday, September 8, 2019

Overview

Today we finish your crash course/review of basic statistics with random variables and distributions.

Slides

Live Class Session on Zoom

The live class Zoom meeting link can be found on Blackboard (see LIVE ZOOM MEETINGS on the left navigation menu), starting at 11:30 AM.

If you are unable to join today’s live session, or if you want to review, you can find the recording stored on Blackboard via Panopto (see Class Recordings on the left navigation menu).

Problem Set

Problem Set 1 answers are posted on that page in various formats.

Problem Set 2 (on classes 2.1-2.2) will be posted shortly, and will be due by Sunday, September 13.

Math Appendix: Properties of Expected Value and Variance

There are several useful mathematical properties of expected value and variance.

Property 1: the expected value of a constant is itself, and the variance of a constant is 0.

\[\begin{align*} E(c)&=c\\ var(c)&=0\\ sd(c)&=0\\ \end{align*}\]

For any constant, \(c\)

Example: \(E(2)=2\), \(var(2)=0\), \(sd(2)=0\)

Property 2: adding a constant to (or subtracting a constant from) a random variable shifts the mean by that constant, but leaves the variance and standard deviation unchanged:

\[\begin{align*} E(X \pm c)&=E(X) \pm c\\ var(X \pm c)&=var(X)\\ sd(X \pm c)&=sd(X)\\ \end{align*}\]

For any constant, \(c\)

Example: \(E(2+X)=2+E(X)\), \(var(2+X)=var(X)\), \(sd(2+X)=sd(X)\)
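
As a quick numerical check of Property 2, here is a minimal sketch in R. The values in x and the constant c_shift are made up purely for illustration. R's var() and sd() compute sample versions, but the same shift property holds for them.

```r
# hypothetical sample values, chosen only for illustration
x <- c(1, 4, 6, 9)
c_shift <- 2 # the constant being added

# the mean shifts by the constant
mean(x + c_shift)
mean(x) + c_shift

# the variance and standard deviation do not change
var(x + c_shift)
var(x)
sd(x + c_shift)
sd(x)
```

Each pair of lines should print the same number.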

Property 3: multiplying a random variable by a constant and then taking the mean or variance:

\[\begin{align*} E(aX)&=aE(X)\\ var(aX)&=a^2var(X)\\ sd(aX)&=|a|sd(X)\\ \end{align*}\]

For any constant, \(a\)

Example: \(E(2X)=2E(X)\), \(var(2X)=4var(X)\), \(sd(2X)=2sd(X)\)
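
One place where the shift and scaling rules work together is standardizing a random variable. As an illustrative sketch, write \(\mu=E(X)\) and \(\sigma=sd(X)\) (notation assumed here, with \(\sigma>0\)) and define \(Z=\frac{X-\mu}{\sigma}\). Subtracting the constant \(\mu\) and multiplying by the constant \(\frac{1}{\sigma}\) gives:

\[\begin{align*} E(Z)&=\frac{1}{\sigma}E(X-\mu)=\frac{1}{\sigma}\big(E(X)-\mu\big)=0\\ var(Z)&=\frac{1}{\sigma^2}var(X-\mu)=\frac{1}{\sigma^2}var(X)=1\\ sd(Z)&=\frac{1}{\sigma}sd(X)=1\\ \end{align*}\]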

Property 4: the expected value of the sum (or difference) of two random variables is equal to the sum (or difference) of their expected values:

\[E(X \pm Y)=E(X) \pm E(Y)\]
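
A small numerical sketch of Property 4 in R, using two made-up vectors x and y (hypothetical values, paired element by element) in place of random variables:

```r
# hypothetical paired values, chosen only for illustration
x <- c(1, 4, 6, 9)
y <- c(2, 2, 5, 7)

# mean of the sum equals the sum of the means
mean(x + y)
mean(x) + mean(y)

# mean of the difference equals the difference of the means
mean(x - y)
mean(x) - mean(y)
```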

R Appendix: Graphing Statistical and Mathematical Functions in R

The mosaic package is useful for making and using mathematical functions in R.

# install.packages("mosaic") # install for first use
library(mosaic) # load

Creating Mathematical Functions

You can create custom mathematical functions in R by defining a function() with one or more arguments. As a simple example, make the function \(f(x) = 10x-x^2\) (with one argument, \(x\), since it is a univariate function) as follows:

# store as a named function, I'll call it "my_function"
my_function<-function(x){10*x-x^2}

# look at it
my_function

## function(x){10*x-x^2}

There are some notational requirements from R for making functions. Any coefficient in front of a variable (such as the 10 in 10x) must be explicitly multiplied by the variable, as in 10*x.

To calculate the function's value at a particular value of x, pass that value (or a vector of values) as the argument to your named function:

# f of 2 
my_function(2)

## [1] 16

# f of 2 and 4
my_function(c(2,4))

## [1] 16 24

# f of 2 through 7
my_function(2:7)

## [1] 16 21 24 25 24 21

# ALTERNATIVELY
# define x first as a vector and then run function on it

x<-c(2,4)
my_function(x)

## [1] 16 24
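
Functions can also take more than one argument. As a minimal sketch, here is a hypothetical two-argument function (the name my_function_2 and the extra term 2y are made up for illustration), with a default value for the second argument:

```r
# hypothetical function: g(x, y) = 10x - x^2 + 2y, with y defaulting to 0
my_function_2 <- function(x, y = 0){10*x - x^2 + 2*y}

# g of x = 2, y = 3
my_function_2(x = 2, y = 3)

# g of x = 2 through 4, holding y = 1
my_function_2(x = 2:4, y = 1)

# leaving out y uses the default y = 0
my_function_2(2)
```

With these inputs the three calls return 22, then 18 23 26, then 16.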

Graphing Mathematical Functions

In ggplot2 there is a dedicated stat_function() layer (used like a geom_ layer) to graph mathematical and statistical functions. All it needs is a data.frame containing a range of x values to serve as the data source, with x mapped to those values in the aesthetics.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ tibble  3.0.4     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x mosaic::count()            masks dplyr::count()
## x purrr::cross()             masks mosaic::cross()
## x mosaic::do()               masks dplyr::do()
## x tidyr::expand()            masks Matrix::expand()
## x dplyr::filter()            masks stats::filter()
## x ggstance::geom_errorbarh() masks ggplot2::geom_errorbarh()
## x dplyr::lag()               masks stats::lag()
## x tidyr::pack()              masks Matrix::pack()
## x mosaic::stat()             masks ggplot2::stat()
## x mosaic::tally()            masks dplyr::tally()
## x tidyr::unpack()            masks Matrix::unpack()

# x values are integers 1 through 10
ggplot(data = data.frame(x = 1:10))+
  aes(x = x)

Then we add the stat_function() layer, where fun = is the most important argument: it names the function to graph, for example the my_function we created above.

ggplot(data = data.frame(x = 1:10))+
  aes(x = x)+
  stat_function(fun = my_function)

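You can also adjust things like size, color, and line type. (In ggplot2 3.4.0 and later, line thickness is set with linewidth; size still works for lines but gives a deprecation warning.)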

ggplot(data = data.frame(x = 1:10))+
  aes(x = x)+
  stat_function(fun = my_function, color = "blue", size = 2, linetype = "dashed")
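
To see roughly what stat_function() is doing behind the scenes, here is a sketch in base R. It evaluates the function on an evenly spaced grid across the x range (101 points is the default n in stat_function()). The names x_grid and curve_points are made up for this illustration, and my_function is redefined so the block runs on its own.

```r
# same function as defined earlier in these notes
my_function <- function(x){10*x - x^2}

# 101 evenly spaced x values from 1 to 10 (stat_function's default n is 101)
x_grid <- seq(from = 1, to = 10, length.out = 101)

# evaluate the function at each grid point
curve_points <- data.frame(x = x_grid, y = my_function(x_grid))

# look at the first few points
head(curve_points, 3)

# grid point where the curve is highest (close to x = 5)
curve_points[which.max(curve_points$y), ]
```

Because the curve is only evaluated at the grid points, a very wiggly function may need a larger n to look smooth.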

Built-in Statistical Functions

There are some standard statistical distributions built into R. They require a combination of a specific prefix and a distribution.

Prefixes:

Action/Type	Prefix
random draw	`r`
density (pdf)	`d`
cumulative distribution (cdf)	`p`
quantile (inverse cdf)	`q`

Distributions:

Distribution	Name in R
Normal	`norm`
Uniform	`unif`
Student’s t	`t`
Binomial	`binom`
Negative binomial	`nbinom`
Hypergeometric	`hyper`
Weibull	`weibull`
Beta	`beta`
Gamma	`gamma`

Thus, what you want is a combination of the prefix and the distribution.
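
For instance, here is a short sketch of how the prefix and the distribution name paste together for a few non-normal distributions. The particular parameter values are arbitrary and chosen only for illustration.

```r
# d + unif: density of a Uniform(0, 2) at x = 0.5 (equals 1/(2-0) = 0.5)
dunif(x = 0.5, min = 0, max = 2)

# p + binom: P(X <= 3) for a Binomial with 10 trials and success probability 0.5
pbinom(q = 3, size = 10, prob = 0.5)

# q + t: 97.5th percentile of a Student's t with 10 degrees of freedom
qt(p = 0.975, df = 10)

# r + binom: 5 random draws from the same Binomial (results vary run to run)
rbinom(n = 5, size = 10, prob = 0.5)
```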

Some common examples:

Take random draws from a normal distribution:

rnorm(n = 10, # take 10 draws from a normal distribution with:
      mean = 2, # mean of 2
      sd = 1) # sd of 1

##  [1] 2.6070850 2.4363768 2.4640655 1.6825498 0.1926861 1.8054952 0.4379302
##  [8] 2.9041258 2.5405607 1.3065504
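
Random draws change every time you run the code. If you want reproducible draws, one common approach is to call set.seed() first. A minimal sketch (the seed value 20 is arbitrary, and draws_1 and draws_2 are names made up for illustration):

```r
set.seed(20) # hypothetical seed; any integer works
draws_1 <- rnorm(n = 10, mean = 2, sd = 1)

set.seed(20) # reset to the same seed
draws_2 <- rnorm(n = 10, mean = 2, sd = 1)

# the two sets of draws are exactly the same
identical(draws_1, draws_2)
```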

Get probability of a random variable being less than or equal to a value (cdf) from a normal distribution:

# find probability of area to the LEFT of a number on pdf (note this = cdf of that number!)
pnorm(q = 80, # number is 80 from a distribution where: 
      mean = 200, # mean is 200
      sd = 100, # sd is 100
      lower.tail = TRUE) # looking to the LEFT in lower tail

## [1] 0.1150697
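
The same function handles areas to the right of a number, or between two numbers. A short sketch using the same distribution (mean 200, sd 100); the cutoffs 80 and 300 and the names p_right and p_between are chosen only for illustration:

```r
# probability of area to the RIGHT of 80: P(X > 80)
p_right <- pnorm(q = 80, mean = 200, sd = 100, lower.tail = FALSE)
p_right

# same thing as 1 minus the area to the left
1 - pnorm(q = 80, mean = 200, sd = 100)

# probability of area BETWEEN 80 and 300: subtract one cdf from the other
p_between <- pnorm(q = 300, mean = 200, sd = 100) - pnorm(q = 80, mean = 200, sd = 100)
p_between
```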

Find the value at a specified percentile of a distribution (the quantile, or inverse cdf):

# find the 38th percentile value
qnorm(p = 0.38, # 38th percentile from a distribution where:
      mean = 200, # mean is 200
      sd = 100) # sd is 100

## [1] 169.4519
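
Since the quantile function is the inverse of the cdf, feeding the result back into pnorm() should return the original percentile. A quick sketch of that check, along with the equivalent calculation from the standard normal (the names value and z are made up for illustration):

```r
# the 38th percentile value from above
value <- qnorm(p = 0.38, mean = 200, sd = 100)

# plug it back into the cdf: returns 0.38
pnorm(q = value, mean = 200, sd = 100)

# same value built from the standard normal: mean + sd * z
z <- qnorm(p = 0.38) # defaults to mean = 0, sd = 1
200 + 100 * z
```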

Graphing Statistical Functions

You can also graph these commonly used statistical functions by setting fun = the named functions in your stat_function() layer. If you want to specify the mean and standard deviation, use args = list() to include the required arguments from the named function above (e.g. dnorm needs mean and sd).

ggplot(data = data.frame(x = -400:600))+
  aes(x = x)+
  stat_function(fun = dnorm, args = list(mean = 200, sd = 200), color = "blue", size = 2, linetype = "dashed")

If you don’t include args, it will graph the standard normal distribution (mean of 0 and sd of 1):

ggplot(data = data.frame(x = -4:4))+
  aes(x = x)+
  stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed")

To add shading under a distribution, simply add a duplicate of the stat_function() layer, but add geom="area" to indicate the area beneath the function should be filled, and you can limit the domain of the fill with xlim=c(start,end), where start and end are the x-values for the endpoints of the fill.

# graph normal distribution and shade area between -2 and 2
ggplot(data = data.frame(x = -4:4))+
  aes(x = x)+
  stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed")+
  stat_function(fun = dnorm, xlim = c(-2,2), geom = "area", fill = "green", alpha=0.5)

Hence, here is one graph from my slides:

ggplot(data = tibble(x=35:115))+
  aes(x = x)+
  stat_function(fun = dnorm, args = list(mean = 75, sd = 10), size = 2, color="blue")+
  stat_function(fun = dnorm, args = list(mean = 75, sd = 10), geom = "area", xlim = c(65,85), fill="blue", alpha=0.5)+
  labs(x = "X",
       y = "Probability")+
  scale_x_continuous(breaks = seq(35,115,5))+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=20)
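
The shaded area in a graph like this is a probability, and you can compute it directly. As a sketch, here are two ways to get the area between 65 and 85 under a normal pdf with mean 75 and sd 10: differencing the cdf, and numerically integrating the pdf with base R's integrate(). The names area_cdf and area_integral are made up for illustration.

```r
# area between 65 and 85 using the cdf
area_cdf <- pnorm(q = 85, mean = 75, sd = 10) - pnorm(q = 65, mean = 75, sd = 10)
area_cdf

# same area by numerically integrating the pdf (extra arguments are passed on to dnorm)
area_integral <- integrate(dnorm, lower = 65, upper = 85, mean = 75, sd = 10)
area_integral$value
```

Both approaches should give about 0.68, since the interval covers one standard deviation on either side of the mean.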