R is a free software environment for statistical computing and graphics. It is supported by the R Foundation for Statistical Computing and was initially developed in the early 1990s. As a full programming language, it is more flexible and capable than alternatives for statistical computing such as Stata. Because it is designed for data analysis, it is particularly well suited to the needs of data scientists compared to most other programming languages. Because it is open source, it is a customizable and extensible tool, for which thousands of useful and tailored packages have been written by users worldwide. Finally, it is widely used and a large community has grown around R, meaning that help is always around the corner.
For more information, see the R project website. You can get a taste for the R community at R-bloggers, the RStudio community forum, R-Ladies, and Stack Overflow.
Here are a few things to remember about R:
- "R" refers to both the language and the environment (application).
- R runs in memory: objects are loaded into memory.
- It's expected that you'll install and use additional packages.
- Packages are open source and user contributed, so use established packages or evaluate quality.
- There are multiple ways to do most things. Some ways are better than others, but sometimes it is a question of style and preference.
- You can, and often will, have more than one dataset open in R at the same time.
RStudio is a software program that makes working in R easier, developed and maintained by a company of the same name. It provides an integrated development environment, or IDE, for R. RStudio helps you organize your workflow and keep track of your work. The top-left pane is where you open, work on, and save script files. The bottom-left pane includes the console, which is where your code actually "runs" – you can run code here directly, or from a script file. The right-hand side panes include tools for managing your environment, workspace, and packages; for plotting and graphics; for accessing help files; and more.
At MSiA, you have access to RStudio Server, which lets you use RStudio through a web browser and do computation on a server rather than on your personal computer.
Much of R’s power comes from contributed packages. You can install and manage packages using the Packages tab in the bottom right pane in RStudio. Or you can install packages with a command:
install.packages("tidyverse")
When you install R and RStudio Desktop on your local computer, "base R" packages are already installed and you will need to install any other packages you want to use. When you access RStudio Server Pro through the MSiA server, some common packages are already installed at the system level, so you will not actually need to do this step. You can easily check which packages have been installed already in the "Packages" tab on the bottom-right pane.
Note that tidyverse
is a composite package that will
install multiple component packages and their dependencies. It includes
dplyr
and ggplot2
, which we will use
extensively on Day 3 of the boot camp. You’ll get a lot of messages as
the installation happens.
CRAN (Comprehensive R Archive Network) is the name of the package
repository. There are mirrors around the
world. You can also install packages that are not on CRAN using the
devtools
package.
If you have trouble or get errors when trying to install a package, you might need to specify the repository mirror to download from:
install.packages("tidyverse", repos="http://cran.wustl.edu/")
After you install a package, you have to load it with the
library
function to actually use it.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Functions are called with functionName(parameters)
.
Multiple parameters are comma-separated. Parameters can be named. For
unnamed parameters, the order matters. R functions don’t change the
objects passed to them (more on this later). Instead, to store the
result of a function, you need to assign its output to an object using
the assignment operator <-
. For example:
object <- functionName(parameter1, parameter2)
.
Remember that all functions come with help files. Well-written packages will include extensive help files which explain the kinds of inputs the function takes, the purpose of each parameter or “argument” in addition to the inputs, and the kinds of output the function produces. They will also often include examples.
To access a help file, enter ?functionName
into the
console or use the Help tab in the bottom-right pane of RStudio.
Sometimes different packages will include functions with the same
name. To make sure you are using the function from the right package,
you can use the following syntax:
packageName::functionName()
.
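For example, to be explicit about which package a function comes from (a small illustration using base R's mean function):
?mean
base::mean(c(1, 2, 3))
## [1] 2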
A good reference for R Basics is the Base R Cheat Sheet.
Remember, to run a line of code from a script file in RStudio, put the cursor on the line and press:
Cmd + Return (Mac)
Ctrl + Enter (Windows/Linux)
Here are the basic arithmetic operators in R:
# Addition
2+2
## [1] 4
# Subtraction
6-2
## [1] 4
# Multiplication
3.452*6
## [1] 20.712
# Division
5/2
## [1] 2.5
# Exponent
2^4
## [1] 16
# Modulo (returns remainder after division)
5%%2
## [1] 1
You can use ?Arithmetic
to pull up the help for
arithmetic operators.
Functions are called with functionName(parameters)
.
Multiple parameters are comma-separated. Parameters can be named. For
unnamed parameters, the order matters. R functions don’t change the
objects passed to them (more on this later).
log(10)
## [1] 2.302585
log(16, base=2)
## [1] 4
log10(10)
## [1] 1
sqrt(10)
## [1] 3.162278
exp(10)
## [1] 22026.47
sin(1)
## [1] 0.841471
1 < 2
## [1] TRUE
TRUE == FALSE
## [1] FALSE
'a' != "Boy" # not equal
## [1] TRUE
Note that character vectors/strings can use single or double quotes.
& means "and", | means "or", and ! means "not":
TRUE & FALSE
## [1] FALSE
!TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE
(2 > 1) & (3 > 2)
## [1] TRUE
You use these to join together conditions.
Use the <-
operator to assign values to variables.
=
also works but is bad practice and less common.
The right-hand side of the assignment can be any valid R expression, and it is fully evaluated before the assignment occurs.
Variable names can contain letters, numbers, underscores, and periods. They cannot start with a number; they should not start with an underscore or period in regular use. They cannot contain spaces. Different people use different conventions for long variable names; these include camelCase, words separated by underscores (snake_case), and words separated by periods.
x <- 4
x
## [1] 4
y <- 3/10
y
## [1] 0.3
x + y
## [1] 4.3
myVariable <- x <- 3 + 4 + 7
Note that when you create a variable in RStudio, it shows up in the environment tab in the top-right pane.
There are a few basic types of data in R:
- logical: TRUE or FALSE
- numeric: numbers such as 2 or 4.3 (integers are a related type)
- complex: 2+5i
- character: "text data", denoted with single or double quotes
You can check the type of a value with the class() function:
class(TRUE)
## [1] "logical"
class("foo")
## [1] "character"
Vectors store multiple values of a single data type. You can create a
vector by combining values with the c()
function.
x<-c(1,2,3,4,5)
x<-1:5
Vectors can only have one type of value in them. If there are multiple types, everything in the vector is coerced to the most flexible type present, following the hierarchy logical, then integer, then numeric, then character:
x<-c(TRUE, 2, 4.3)
x
## [1] 1.0 2.0 4.3
x<-c(4, "alpha", TRUE)
x
## [1] "4" "alpha" "TRUE"
Functions and arithmetic operators can apply to vectors as well:
x <- c(1,2,3,4,5)
x+1
## [1] 2 3 4 5 6
x*2
## [1] 2 4 6 8 10
x*x
## [1] 1 4 9 16 25
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
x < 5
## [1] TRUE TRUE TRUE TRUE FALSE
Some functions will apply to each element of a vector, but others take a vector as a parameter:
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
sum(x)
## [1] 15
Vectors are one-dimensional and can’t be nested:
c(c(1,2,3), 4, 5)
## [1] 1 2 3 4 5
Vector indexes (and all other indexes in R) start with 1, not 0:
x <- c('a', 'b', 'c', 'd', 'e')
x[1]
## [1] "a"
You can take “slices” of vectors using index brackets:
x[1:3]
## [1] "a" "b" "c"
Or exclude values with a negative sign:
x[-1]
## [1] "b" "c" "d" "e"
Elements are returned in the order that the indices are supplied:
y <- c(5,1)
y
## [1] 5 1
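For example, indexing x with the vector y defined above returns the fifth element first, then the first:
x[y]
## [1] "e" "a"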
You can use a vector of integers or booleans to select from a vector as well:
x[x<'c']
## [1] "a" "b"
x[c(1,3,5)]
## [1] "a" "c" "e"
Get the length of a vector with length
:
length(x)
## [1] 5
See if a value is in a vector with the %in%
operator:
'b' %in% x
## [1] TRUE
Or get the first position of one or more elements in a vector with
the match
function:
match(c('b', 'd', 'k'), x)
## [1] 2 4 NA
Use which
to find all positions:
y <- c('a','b','c','a','b','c')
which(y == 'c')
## [1] 3 6
You can also name the elements of a vector:
x<-1:5
names(x)<-c("Ohio","Illinois","Indiana","Michigan","Wisconsin")
x
## Ohio Illinois Indiana Michigan Wisconsin
## 1 2 3 4 5
Which allows you to select values from the vector using the names:
x["Ohio"]
## Ohio
## 1
x[c("Illinois", "Indiana")]
## Illinois Indiana
## 2 3
Missing Values (NA)
Before we move on to other data structures, let's pause to consider how to deal with missing values in a vector (or, later, a data frame). Missing data in R is encoded as NA. Some functions will ignore NA when doing computations. Others will ignore missing values if you tell them to. Others will process NA and give you a result of NA.
tmp <- c(1,2,5,NA,6,NA,2,5,1,1,NA,5)
You can test for NA
(is.na
). Or you can get
the index location of the missing observations within the vector (useful
for later selecting observations in your dataset).
is.na(tmp)
## [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
which(is.na(tmp))
## [1] 4 6 11
It can also be useful to count the number of NA
s in a
vector:
sum(is.na(tmp))
## [1] 3
Why does this work? How can you sum logical values? This takes advantage of the trick that TRUE=1 and FALSE=0. The function call tries to convert the logicals to numeric, and this is how the conversion works:
as.numeric(c(TRUE, FALSE))
## [1] 1 0
Remember that different functions treat NA
differently.
With an input vector that includes NA
values,
mean
results in NA
. It has an option to
exclude missing:
mean(tmp)
## [1] NA
mean(tmp, na.rm=TRUE)
## [1] 3.111111
table
behaves differently. It excludes NA
by default. You have to tell it to include NA
.
table(tmp)
## tmp
## 1 2 5 6
## 3 2 3 1
table(tmp, useNA = "ifany")
## tmp
## 1 2 5 6 <NA>
## 3 2 3 1 3
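If you want to drop the missing values entirely, you can index with !is.na() (the na.omit() function does something similar):
tmp[!is.na(tmp)]
## [1] 1 2 5 6 2 5 1 1 5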
NULL
is another special type. NULL
is
usually used to mean undefined. You might get it when a function can’t
compute a result. NULL
is a single value and can’t be in a
vector. (NA
s can be in vectors and data.frames.)
c()
## NULL
c(NULL, NULL)
## NULL
The above somewhat surprisingly gives a single NULL
because of the restrictions on how it’s used.
NULL
should not be used for missing data.
NaN, Inf
NaN means "not a number".
0/0
## [1] NaN
Inf
and -Inf
are “infinity” and “negative
infinity”.
1/0
## [1] Inf
-1/0
## [1] -Inf
Factors are a special type of vector that can be used for categorical
variables. Why would we need them? Consider that the values of character
vectors sometimes have an order, and we may want to store this
information in R. For example, consider a vector containing month names.
When we use table()
, R arranges the values in alphabetical
order.
months<-c("January","March","February","December","January","March")
table(months)
## months
## December February January March
## 1 1 2 2
The factor()
function converts a vector into a factor.
Without supplying any additional information, the function infers the
possible “levels” that the vector takes.
months_fac <- factor(months)
levels(months_fac)
## [1] "December" "February" "January" "March"
Factors can be ordered, which is useful when you
have categorical variables in your data. Let’s create an ordered factor
from the months variable. Using the table()
function on the
factor, we can see one of the benefits of using factors for categorical
variables: the values are ordered in a meaningful way rather than alphabetically.
months_fac <- factor(months, levels=c("January","February","March","December"))
table(months_fac)
## months_fac
## January February March December
## 2 1 2 1
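Strictly speaking, the factor above just has a custom level order. To make R treat the levels as ordered for comparisons, add ordered=TRUE when creating the factor (a small illustration):
months_ord <- factor(months, levels=c("January","February","March","December"), ordered=TRUE)
months_ord[1] < months_ord[2]
## [1] TRUE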
Note that you cannot add values to a factor that are not included as one of the levels.
months_fac[5] <- "April"
The best solution to this is to remake the factor with a set of levels that includes "April". (When you pass an existing factor to factor(), it is converted back to character data before the new factor is created; the example below simply rebuilds the factor from the original months vector.)
months_fac <- factor(months, levels=c("January","February","March","April","December"))
months_fac[5] <- "April"
months_fac
## [1] January March February December April March
## Levels: January February March April December
Alternatively, when you create the factor for the first time, you can
include all possible levels of the factor. This has the added benefit of
producing even more meaningful results when using functions such as
table()
.
months_fac <- factor(months, levels=c("January","February","March","April","May","June","July","August","September","October","November","December"))
table(months_fac)
## months_fac
## January February March April May June July August
## 2 1 2 0 0 0 0 0
## September October November December
## 0 0 0 1
Under the hood, factors are stored as integers, with the (ordered) levels attribute providing information about the character value associated with each integer.
typeof(months_fac)
## [1] "integer"
Even if you don’t plan to use categorical data, you should know that factors exist because when reading data into R, text strings can be loaded as factors.
Lists are a bit like complex vectors. An element of a list can hold any other object, including another list. You can keep multi-dimensional and ragged data in R using lists.
l1 <- list(1, "a", TRUE, 1+4i)
l1
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
l2 <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
l2
## $title
## [1] "Research Bazaar"
##
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## [1] TRUE
Indexing lists is a little different. [[1]]
is the first
element of the list as whatever type it was. [1]
is a
subset of the list – the first element of the list as a list. You can
also access list elements by name using the $
operator.
l2[[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
l2[2]
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
l2$numbers
## [1] 1 2 3 4 5 6 7 8 9 10
Matrices in R are two-dimensional arrays. All values of a matrix must
be of the same type. You can initialize a matrix using the
matrix()
function.
matrix(c('a', 'b', 'c', 'd'), nrow=2)
## [,1] [,2]
## [1,] "a" "c"
## [2,] "b" "d"
y<-matrix(1:25, nrow=5, byrow=TRUE)
y
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
Matrices are used sparingly in R, primarily for numerical calculations or explicit matrix manipulation. You can attach names to rows and columns.
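For example, working on a copy of y so the later examples are unchanged (the names here are invented for illustration):
y_named <- y
rownames(y_named) <- paste0("row", 1:5)
colnames(y_named) <- paste0("col", 1:5)
y_named["row1", "col2"]
## [1] 2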
Matrix algebra functions are available:
y%*%y
## [,1] [,2] [,3] [,4] [,5]
## [1,] 215 230 245 260 275
## [2,] 490 530 570 610 650
## [3,] 765 830 895 960 1025
## [4,] 1040 1130 1220 1310 1400
## [5,] 1315 1430 1545 1660 1775
x<-1:5
y%*%x
## [,1]
## [1,] 55
## [2,] 130
## [3,] 205
## [4,] 280
## [5,] 355
y^-1 # element-wise reciprocal of each entry, NOT matrix inversion (use solve() for that)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.50000000 0.33333333 0.25000000 0.20000000
## [2,] 0.16666667 0.14285714 0.12500000 0.11111111 0.10000000
## [3,] 0.09090909 0.08333333 0.07692308 0.07142857 0.06666667
## [4,] 0.06250000 0.05882353 0.05555556 0.05263158 0.05000000
## [5,] 0.04761905 0.04545455 0.04347826 0.04166667 0.04000000
y * -1
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1 -2 -3 -4 -5
## [2,] -6 -7 -8 -9 -10
## [3,] -11 -12 -13 -14 -15
## [4,] -16 -17 -18 -19 -20
## [5,] -21 -22 -23 -24 -25
Elements in a matrix are indexed like
mat[row number, col number]
. Omitting a value for row or
column will give you the entire column or row, respectively.
y[1,1]
## [1] 1
y[1,]
## [1] 1 2 3 4 5
y[,1]
## [1] 1 6 11 16 21
y[1:2,3:4]
## [,1] [,2]
## [1,] 3 4
## [2,] 8 9
y[,c(1,4)]
## [,1] [,2]
## [1,] 1 4
## [2,] 6 9
## [3,] 11 14
## [4,] 16 19
## [5,] 21 24
Using just a single index will get the element from the specified position, as if the matrix were turned into a vector first:
w<-matrix(5:29, nrow=5)
w[7]
## [1] 11
as.vector(w)[7]
## [1] 11
Data frames are the core data structure in R. A data frame is a list of named vectors with the same length. Columns are typically variables and rows are observations. Different columns can have different types of data:
id<-1:20
id
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
color<-c(rep("red", 3), rep("green",10), rep("blue", 7))
color
## [1] "red" "red" "red" "green" "green" "green" "green" "green" "green"
## [10] "green" "green" "green" "green" "blue" "blue" "blue" "blue" "blue"
## [19] "blue" "blue"
score<-runif(20)
score
## [1] 0.26746528 0.88039204 0.29705474 0.33920258 0.20098507 0.78476251
## [7] 0.08509029 0.84760003 0.96860436 0.58015890 0.38231632 0.01595548
## [13] 0.85073792 0.02826909 0.77014412 0.47813959 0.01802027 0.77637716
## [19] 0.75218987 0.81090626
df<-data.frame(id, color, score)
df
## id color score
## 1 1 red 0.26746528
## 2 2 red 0.88039204
## 3 3 red 0.29705474
## 4 4 green 0.33920258
## 5 5 green 0.20098507
## 6 6 green 0.78476251
## 7 7 green 0.08509029
## 8 8 green 0.84760003
## 9 9 green 0.96860436
## 10 10 green 0.58015890
## 11 11 green 0.38231632
## 12 12 green 0.01595548
## 13 13 green 0.85073792
## 14 14 blue 0.02826909
## 15 15 blue 0.77014412
## 16 16 blue 0.47813959
## 17 17 blue 0.01802027
## 18 18 blue 0.77637716
## 19 19 blue 0.75218987
## 20 20 blue 0.81090626
Instead of making individual objects first, we could do it all together:
df<-data.frame(id=1:20,
color=c(rep("red", 3), rep("green",10), rep("blue", 7)),
score=runif(20))
Data frames can be indexed like matrices to retrieve the values.
df[2,2]
## [1] "red"
df[1,]
## id color score
## 1 1 red 0.1362236
df[,3]
## [1] 0.13622360 0.66215314 0.74072325 0.98937198 0.36876626 0.29211562
## [7] 0.83333151 0.44268778 0.31209898 0.63400344 0.08188263 0.57687832
## [13] 0.94952485 0.25048447 0.44971459 0.63450290 0.73108527 0.80173578
## [19] 0.06149398 0.39271307
You can use negative values when indexing to exclude values:
df[,-2]
## id score
## 1 1 0.13622360
## 2 2 0.66215314
## 3 3 0.74072325
## 4 4 0.98937198
## 5 5 0.36876626
## 6 6 0.29211562
## 7 7 0.83333151
## 8 8 0.44268778
## 9 9 0.31209898
## 10 10 0.63400344
## 11 11 0.08188263
## 12 12 0.57687832
## 13 13 0.94952485
## 14 14 0.25048447
## 15 15 0.44971459
## 16 16 0.63450290
## 17 17 0.73108527
## 18 18 0.80173578
## 19 19 0.06149398
## 20 20 0.39271307
df[-1:-10,]
## id color score
## 11 11 green 0.08188263
## 12 12 green 0.57687832
## 13 13 green 0.94952485
## 14 14 blue 0.25048447
## 15 15 blue 0.44971459
## 16 16 blue 0.63450290
## 17 17 blue 0.73108527
## 18 18 blue 0.80173578
## 19 19 blue 0.06149398
## 20 20 blue 0.39271307
You can also use the names of the columns after a $
or
in the indexing:
df$color
## [1] "red" "red" "red" "green" "green" "green" "green" "green" "green"
## [10] "green" "green" "green" "green" "blue" "blue" "blue" "blue" "blue"
## [19] "blue" "blue"
Indexing into a data frame with a single integer or name of the column will give you the column(s) specified as a new data frame.
df['color']
## color
## 1 red
## 2 red
## 3 red
## 4 green
## 5 green
## 6 green
## 7 green
## 8 green
## 9 green
## 10 green
## 11 green
## 12 green
## 13 green
## 14 blue
## 15 blue
## 16 blue
## 17 blue
## 18 blue
## 19 blue
## 20 blue
df[2:3]
## color score
## 1 red 0.13622360
## 2 red 0.66215314
## 3 red 0.74072325
## 4 green 0.98937198
## 5 green 0.36876626
## 6 green 0.29211562
## 7 green 0.83333151
## 8 green 0.44268778
## 9 green 0.31209898
## 10 green 0.63400344
## 11 green 0.08188263
## 12 green 0.57687832
## 13 green 0.94952485
## 14 blue 0.25048447
## 15 blue 0.44971459
## 16 blue 0.63450290
## 17 blue 0.73108527
## 18 blue 0.80173578
## 19 blue 0.06149398
## 20 blue 0.39271307
Instead of index numbers or names, you can also select values using logical statements. This is usually done to select rows.
df[df$color == "green",]
## id color score
## 4 4 green 0.98937198
## 5 5 green 0.36876626
## 6 6 green 0.29211562
## 7 7 green 0.83333151
## 8 8 green 0.44268778
## 9 9 green 0.31209898
## 10 10 green 0.63400344
## 11 11 green 0.08188263
## 12 12 green 0.57687832
## 13 13 green 0.94952485
df[df$score > .5,]
## id color score
## 2 2 red 0.6621531
## 3 3 red 0.7407232
## 4 4 green 0.9893720
## 7 7 green 0.8333315
## 10 10 green 0.6340034
## 12 12 green 0.5768783
## 13 13 green 0.9495248
## 16 16 blue 0.6345029
## 17 17 blue 0.7310853
## 18 18 blue 0.8017358
df[df$score > .5 & df$color == "blue",]
## id color score
## 16 16 blue 0.6345029
## 17 17 blue 0.7310853
## 18 18 blue 0.8017358
You can assign names to the rows of a data frame as well as to the columns, and then use those names for indexing and selecting data.
rownames(df)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20"
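For example, we could replace the default row names and then select rows by name (the names here are invented for illustration; output omitted since the score column is random):
df2 <- df   # work on a copy so df itself is unchanged
rownames(df2) <- paste0("obs", 1:20)
df2["obs3", ]
df2[c("obs1", "obs5"), "color"]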
You can add columns or rows simply by assigning values to them. There
are also rbind
and cbind
(for row bind and
column bind) functions that can be useful.
df$year<-1901:1920
df
## id color score year
## 1 1 red 0.13622360 1901
## 2 2 red 0.66215314 1902
## 3 3 red 0.74072325 1903
## 4 4 green 0.98937198 1904
## 5 5 green 0.36876626 1905
## 6 6 green 0.29211562 1906
## 7 7 green 0.83333151 1907
## 8 8 green 0.44268778 1908
## 9 9 green 0.31209898 1909
## 10 10 green 0.63400344 1910
## 11 11 green 0.08188263 1911
## 12 12 green 0.57687832 1912
## 13 13 green 0.94952485 1913
## 14 14 blue 0.25048447 1914
## 15 15 blue 0.44971459 1915
## 16 16 blue 0.63450290 1916
## 17 17 blue 0.73108527 1917
## 18 18 blue 0.80173578 1918
## 19 19 blue 0.06149398 1919
## 20 20 blue 0.39271307 1920
df[22,]<-list(21, "green", 0.4, 1921)
Note that we had to use a list for adding a row because there are different data types.
Before reading or writing files, it's often useful to set the working directory first so that you don't have to specify complete file paths.
You can go to the Files tab in the bottom right window in RStudio and
find the directory you want. Then under the More menu, there is an
option to set the current directory as the working directory. Or you can
use the setwd
command like:
setwd("~/training/intror") # ~ stands for your home directory
setwd("/Users/username/Documents/workshop") # mac, absolute path example
setwd("C:\Users\username\Documents\workshop") # windows, absolute path example
In our case, we are working out of the directory team/bootcamp/2018 in the base directory. So we can set our working directory as follows:
setwd("~/team/bootcamp/2018/R session materials")
To check where your working directory is, use
getwd()
:
getwd()
## [1] "/Users/alice/bootcamp-2023"
Read in a CSV file and save it as a data frame with a name. Below are two examples, using a URL and a local file stored in the working directory, respectively:
# Using a URL
schooldata <- read.csv("https://goo.gl/f4UhMX")
# Using a local file
gapminder <- read.csv("data/gapminder5.csv")
You can view the data frames in RStudio using the View()
function.
View(schooldata)
View(gapminder)
You could also use the Import Dataset option in the Environment tab in the top right window in RStudio.
Looking at the help for read.csv, there are a number of different options and different function calls. read.table, read.csv, and read.delim all work in the same basic way and take the same set of arguments, but they have different defaults. Key options to pay attention to include:
- header: whether the first row of the file has the names of the columns
- sep: the separator used in the file (comma, tab (enter as \t), etc.)
- na.strings: how missing data is encoded in your file. "NA" is treated as missing by default; blanks are treated as missing by default in everything but character data.
- stringsAsFactors: should strings (text data) be converted to factors or kept as is? Example of this below.
Let's redo the above with a better set of options:
gapminder <- read.csv("data/gapminder5.csv",
stringsAsFactors=FALSE,
strip.white=TRUE,
na.strings=c("NA", ""))
The option na.strings
is needed now because while blanks
are treated as missing by default in numeric fields (which includes
factors), they aren’t by default missing for character data.
The readr Package
Does all of the above seem annoying or unnecessarily complicated? Others have thought so too. Look at the readr package (part of the tidyverse), which attempts to smooth over some of the annoyances of reading files into R.
The main source of potential problems when using readr
functions is that it guesses variable types from a subset of the
observations, so if you have a strange value further down in your
dataset, you might get an error or an unexpected value conversion.
To read in the same data with the same settings as above, using
readr
(note similar function name, with _
instead of .
):
library(readr)
gapminder <- read_csv("data/gapminder5.csv")
## Rows: 1704 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, continent
## dbl (4): year, pop, lifeExp, gdpPercap
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
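If you want to quiet that message, you can either set show_col_types = FALSE or spell out the column types yourself. The compact specification below assumes the column order country, year, pop, continent, lifeExp, gdpPercap:
gapminder <- read_csv("data/gapminder5.csv", show_col_types = FALSE)
gapminder <- read_csv("data/gapminder5.csv", col_types = "cddcdd")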
Options used above are defaults in readr
. You get a long
message about the column types.
Learn more at the readr
website.
For Stata, SAS, or SPSS files, try the haven
or
foreign
packages. For Excel files, use the
readxl
package.
The data.table
package also has functions for reading in
data, which you will learn about on Day 3 of the boot camp. The
fread
function is relatively fast for reading a rectangular
standardized data file into R.
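A minimal example with fread, assuming the same local file as before:
library(data.table)
gapminder_dt <- fread("data/gapminder5.csv")   # returns a data.table, which is also a data.frame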
R also has packages for reading other structured files like XML and JSON, or interfacing with databases. For more on using R with databases, see the R section of the Databases workshop materials from NUIT Research Computing Services.
There are also multiple packages that make collecting data from APIs (either in general or specific APIs like the Census Bureau) easier. There are also packages that interface with Google docs/drive and Dropbox, although those APIs change frequently, so beware when using those packages if they haven’t been updated recently.
In the previous section, we imported two datasets. For the rest of
today, we will focus on the Gapminder data, which is stored in our
environment as gapminder
. To refresh yourself, you can view
the data frame in R using the View()
function.
View(gapminder)
You can also see a list of variables using names()
.
names(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
Other useful functions are dim() which shows the dimensions of the data frame, str() which shows the dimensions of the data frame along with the names of variables and the first few values in each variable, nrow() and ncol() which show the number of rows and columns, and head() which shows the first few rows of the data frame (6 rows by default).
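For example (output omitted here; try these yourself):
dim(gapminder)    # number of rows and columns
str(gapminder)    # structure: variable names, types, and first values
nrow(gapminder)
ncol(gapminder)
head(gapminder)   # first 6 rows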
When applied to a data frame, the summary()
function
provides useful summary statistics for each variable (i.e. column) in
the data frame. Let’s try it with the Gapminder data:
summary(gapminder)
## country year pop continent
## Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
## Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
## Mode :character Median :1980 Median :7.024e+06 Mode :character
## Mean :1980 Mean :2.960e+07
## 3rd Qu.:1993 3rd Qu.:1.959e+07
## Max. :2007 Max. :1.319e+09
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
We can also use functions like mean()
,
median()
, var()
, sd()
, and
quantile()
to calculate other summary statistics for
individual variables. For example, let’s calculate the mean of life
expectancy. Recall that we can use the $
operator to call
up a variable within a data frame using its name.
mean(gapminder$lifeExp)
## [1] 59.47444
A useful way to examine a discrete or categorical variable is to use
a frequency table. These are easy to make in R, using the
table()
function:
table(gapminder$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
prop.table() is a useful companion to table(): applied to a table of counts, it shows the proportion of rows in each category:
prop.table(table(gapminder$continent))
##
## Africa Americas Asia Europe Oceania
## 0.36619718 0.17605634 0.23239437 0.21126761 0.01408451
You can generate a frequency table with more than one variable as well:
table(gapminder$continent, gapminder$year)
##
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## Africa 52 52 52 52 52 52 52 52 52 52 52 52
## Americas 25 25 25 25 25 25 25 25 25 25 25 25
## Asia 33 33 33 33 33 33 33 33 33 33 33 33
## Europe 30 30 30 30 30 30 30 30 30 30 30 30
## Oceania 2 2 2 2 2 2 2 2 2 2 2 2
Notice that each row in the data frame represents one country in a given year. Perhaps we are interested in analyzing only data from one year. To do this, we will have to “subset” our data frame to include only those rows that we want to keep.
The subset()
function lets you select rows and columns
you want to keep. You can either name columns or rows, or include a
logical statement such that only rows/columns where the statement is
true are retained.
subset(data.frame,
subset=condition indicating rows to keep,
select=condition indicating columns to keep)
For example, let's create a new data frame containing only 2007 data by subsetting the original data frame.
gapminder07 <- subset(gapminder, subset = year==2007)
Look at the number of rows in the new data frame: it is only 142, whereas the original data frame has 1704 rows.
nrow(gapminder07)
## [1] 142
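The select argument works the same way; for example, to keep only a couple of columns (the object name here is just for illustration):
gapminder07_small <- subset(gapminder, subset = year==2007, select = c(country, lifeExp))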
The sort()
function reorders elements, in ascending
order by default. You can flip the order by using the
decreasing = TRUE
argument.
sort(gapminder07$lifeExp)
## [1] 39.613 42.082 42.384 42.568 42.592 42.731 43.487 43.828 44.741 45.678
## [11] 46.242 46.388 46.462 46.859 48.159 48.303 48.328 49.339 49.580 50.430
## [21] 50.651 50.728 51.542 51.579 52.295 52.517 52.906 52.947 54.110 54.467
## [31] 54.791 55.322 56.007 56.728 56.735 56.867 58.040 58.420 58.556 59.443
## [41] 59.448 59.545 59.723 60.022 60.916 62.069 62.698 63.062 63.785 64.062
## [51] 64.164 64.698 65.152 65.483 65.528 65.554 66.803 67.297 69.819 70.198
## [61] 70.259 70.616 70.650 70.964 71.164 71.338 71.421 71.688 71.752 71.777
## [71] 71.878 71.993 72.235 72.301 72.390 72.396 72.476 72.535 72.567 72.777
## [81] 72.801 72.889 72.899 72.961 73.005 73.338 73.422 73.747 73.923 73.952
## [91] 74.002 74.143 74.241 74.249 74.543 74.663 74.852 74.994 75.320 75.537
## [101] 75.563 75.635 75.640 75.748 76.195 76.384 76.423 76.442 76.486 77.588
## [111] 77.926 78.098 78.242 78.273 78.332 78.400 78.553 78.623 78.746 78.782
## [121] 78.885 79.313 79.406 79.425 79.441 79.483 79.762 79.829 79.972 80.196
## [131] 80.204 80.546 80.653 80.657 80.745 80.884 80.941 81.235 81.701 81.757
## [141] 82.208 82.603
sort(gapminder07$lifeExp, decreasing=TRUE)
## [1] 82.603 82.208 81.757 81.701 81.235 80.941 80.884 80.745 80.657 80.653
## [11] 80.546 80.204 80.196 79.972 79.829 79.762 79.483 79.441 79.425 79.406
## [21] 79.313 78.885 78.782 78.746 78.623 78.553 78.400 78.332 78.273 78.242
## [31] 78.098 77.926 77.588 76.486 76.442 76.423 76.384 76.195 75.748 75.640
## [41] 75.635 75.563 75.537 75.320 74.994 74.852 74.663 74.543 74.249 74.241
## [51] 74.143 74.002 73.952 73.923 73.747 73.422 73.338 73.005 72.961 72.899
## [61] 72.889 72.801 72.777 72.567 72.535 72.476 72.396 72.390 72.301 72.235
## [71] 71.993 71.878 71.777 71.752 71.688 71.421 71.338 71.164 70.964 70.650
## [81] 70.616 70.259 70.198 69.819 67.297 66.803 65.554 65.528 65.483 65.152
## [91] 64.698 64.164 64.062 63.785 63.062 62.698 62.069 60.916 60.022 59.723
## [101] 59.545 59.448 59.443 58.556 58.420 58.040 56.867 56.735 56.728 56.007
## [111] 55.322 54.791 54.467 54.110 52.947 52.906 52.517 52.295 51.579 51.542
## [121] 50.728 50.651 50.430 49.580 49.339 48.328 48.303 48.159 46.859 46.462
## [131] 46.388 46.242 45.678 44.741 43.828 43.487 42.731 42.592 42.568 42.384
## [141] 42.082 39.613
The order()
function gives you the index positions in
sorted order:
order(gapminder07$lifeExp)
## [1] 122 87 141 113 74 4 142 1 22 75 108 53 28 95 117 78 31 118
## [19] 18 20 23 14 133 41 17 127 89 43 69 80 36 29 52 11 46 94
## [37] 42 129 121 77 47 62 19 49 54 88 140 111 90 9 81 59 27 98
## [55] 109 12 84 70 130 55 51 128 60 61 86 39 101 102 100 132 40 73
## [73] 37 3 15 120 107 68 66 110 82 26 93 25 16 57 139 137 131 76
## [91] 112 125 79 138 85 115 13 38 5 99 103 8 97 32 83 136 2 106
## [109] 34 72 116 104 135 33 35 126 24 71 105 30 63 44 48 134 10 50
## [127] 91 7 114 96 92 65 21 45 64 123 119 6 124 58 56 67
order() is useful for arranging data frames. Combined with head(), which shows the first few rows of a data frame (6 by default), we can use this to view the rows of the data frame with the highest life expectancy:
head(gapminder07[order(gapminder07$lifeExp, decreasing=TRUE),])
## # A tibble: 6 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Japan 2007 127467972 Asia 82.6 31656.
## 2 Hong Kong China 2007 6980412 Asia 82.2 39725.
## 3 Iceland 2007 301931 Europe 81.8 36181.
## 4 Switzerland 2007 7554661 Europe 81.7 37506.
## 5 Australia 2007 20434176 Oceania 81.2 34435.
## 6 Spain 2007 40448191 Europe 80.9 28821.
Sorting a table is often useful. For example:
sort(table(gapminder07$continent))
##
## Oceania Americas Europe Asia Africa
## 2 25 30 33 52
You can add variables to a data frame in several ways. Here, we will
show two standard methods using base R. On Day 3, you will learn about
alternatives using the data.table
and dplyr
approaches.
To demonstrate, let’s first create a vector with the same number of
values as the number of rows in the data frame. If you want to learn
what is going on in this code, look at the help file for the function
sample()
.
newvar <- sample(1:5000, 1704, replace = FALSE)
You can add a variable/column by using the cbind()
function:
gapminder <- cbind(gapminder, newvar)
You can add a variable/column by assigning it to the data frame directly:
gapminder$newvar <- newvar
To remove a variable/column from a data frame, you can assign a
NULL
value to the variable:
gapminder$newvar <- NULL
You can also remove a variable/column by keeping every column except the ones you want to drop (note that negative indexing works with column numbers, not with column names):
gapminder <- gapminder[, names(gapminder) != "newvar"]
gapminder <- gapminder[, !(names(gapminder) %in% c("newvar"))]
# The second form is equivalent to the first, but can be used to remove multiple columns at the same time.
To add rows, you can use the function rbind()
. Remember
that rows may include different data types, in which case you would need
to use the function list()
.
To recode a variable, you could make a new column, or overwrite the
existing one entirely. For example, let’s create a new variable for life
expectancy containing rounded values, using the round()
function.
gapminder07$lifeExp_rounded <- round(gapminder07$lifeExp)
If you just want to replace part of a column (or vector), you can assign to a subset. For example, let’s say we want to create a new variable that marks all cases where life expectancy is higher than the mean as “High” and those where it is lower than the mean as “Low”.
# Start by creating a new variable with all missing values
gapminder07$lifeExp_highlow <- NA
# Replace higher-than-mean values with "High"
gapminder07$lifeExp_highlow[gapminder07$lifeExp>mean(gapminder07$lifeExp)] <- "High"
# Replace lower-than-mean values with "Low"
gapminder07$lifeExp_highlow[gapminder07$lifeExp<mean(gapminder07$lifeExp)] <- "Low"
There’s also a recode()
function in the
dplyr
library. You specify the reassignment of values. For
example, let’s create a new variable with abbreviated continent
names.
library(dplyr)
gapminder07$continent_abrv <- recode(gapminder07$continent,
`Africa`="AF",
`Americas`="AM",
`Asia`="AS",
`Europe`="EU",
`Oceania`="OC")
table(gapminder07$continent_abrv)
##
## AF AM AS EU OC
## 52 25 33 30 2
We will return to recode()
and other dplyr
functions on Day 3. The ifelse()
function, covered in Day
2, is also useful for recoding.
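As a quick preview, the high/low variable created above in several steps could be built in one line with ifelse() (essentially equivalent; the variable name here is just for illustration):
gapminder07$lifeExp_highlow2 <- ifelse(gapminder07$lifeExp > mean(gapminder07$lifeExp), "High", "Low")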
To compute summary statistics by groups in the data, one option is to
use the aggregate()
function. For example, we can calculate
the mean of life expectancy for each continent:
aggregate(gapminder07$lifeExp ~ gapminder07$continent, FUN=mean)
## gapminder07$continent gapminder07$lifeExp
## 1 Africa 54.80604
## 2 Americas 73.60812
## 3 Asia 70.72848
## 4 Europe 77.64860
## 5 Oceania 80.71950
The ~
operator can be read as “by” or “as a function
of”, and is used in many contexts. A construction such as
y ~ x1 + x2
is referred to as a formula.
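A cleaner way to write the same aggregation is to use the data argument, so the variable names don't need the gapminder07$ prefix (same numbers as above, with tidier column names):
aggregate(lifeExp ~ continent, data=gapminder07, FUN=mean)
##   continent  lifeExp
## 1    Africa 54.80604
## 2  Americas 73.60812
## 3      Asia 70.72848
## 4    Europe 77.64860
## 5   Oceania 80.71950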
We can also aggregate by two variables. For example, let’s use the original Gapminder data (not just the 2007 data) and aggregate by continent and year.
aggregate(gapminder$lifeExp ~ gapminder$year + gapminder$continent, FUN=mean)
## gapminder$year gapminder$continent gapminder$lifeExp
## 1 1952 Africa 39.13550
## 2 1957 Africa 41.26635
## 3 1962 Africa 43.31944
## 4 1967 Africa 45.33454
## 5 1972 Africa 47.45094
## 6 1977 Africa 49.58042
## 7 1982 Africa 51.59287
## 8 1987 Africa 53.34479
## 9 1992 Africa 53.62958
## 10 1997 Africa 53.59827
## 11 2002 Africa 53.32523
## 12 2007 Africa 54.80604
## 13 1952 Americas 53.27984
## 14 1957 Americas 55.96028
## 15 1962 Americas 58.39876
## 16 1967 Americas 60.41092
## 17 1972 Americas 62.39492
## 18 1977 Americas 64.39156
## 19 1982 Americas 66.22884
## 20 1987 Americas 68.09072
## 21 1992 Americas 69.56836
## 22 1997 Americas 71.15048
## 23 2002 Americas 72.42204
## 24 2007 Americas 73.60812
## 25 1952 Asia 46.31439
## 26 1957 Asia 49.31854
## 27 1962 Asia 51.56322
## 28 1967 Asia 54.66364
## 29 1972 Asia 57.31927
## 30 1977 Asia 59.61056
## 31 1982 Asia 62.61794
## 32 1987 Asia 64.85118
## 33 1992 Asia 66.53721
## 34 1997 Asia 68.02052
## 35 2002 Asia 69.23388
## 36 2007 Asia 70.72848
## 37 1952 Europe 64.40850
## 38 1957 Europe 66.70307
## 39 1962 Europe 68.53923
## 40 1967 Europe 69.73760
## 41 1972 Europe 70.77503
## 42 1977 Europe 71.93777
## 43 1982 Europe 72.80640
## 44 1987 Europe 73.64217
## 45 1992 Europe 74.44010
## 46 1997 Europe 75.50517
## 47 2002 Europe 76.70060
## 48 2007 Europe 77.64860
## 49 1952 Oceania 69.25500
## 50 1957 Oceania 70.29500
## 51 1962 Oceania 71.08500
## 52 1967 Oceania 71.31000
## 53 1972 Oceania 71.91000
## 54 1977 Oceania 72.85500
## 55 1982 Oceania 74.29000
## 56 1987 Oceania 75.32000
## 57 1992 Oceania 76.94500
## 58 1997 Oceania 78.19000
## 59 2002 Oceania 79.74000
## 60 2007 Oceania 80.71950
Now that we have a dataset … we can do statistics! You will learn more about particular statistical models and methods over the course of your program. For now, let’s do some basic things to get a feel for how R handles statistical analysis.
You can use the cor()
function to calculate correlation
(Pearson’s r):
cor(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 0.6786624
You can also find the covariance:
cov(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 105368
Do countries with high or low life expectancy have different GDP per capita? Apart from simply comparing the means for the two groups, we can use a T-test to evaluate the likelihood that these means are significantly different from each other.
t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
##
## Welch Two Sample t-test
##
## data: gapminder07$gdpPercap by gapminder07$lifeExp_highlow
## t = 10.564, df = 95.704, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
## 12674.02 18539.14
## sample estimates:
## mean in group High mean in group Low
## 17944.685 2338.104
Remember: you can read ~
as “as a function of”. So the
above code reads “GDP per capita as a function of life expectancy”,
meaning grouped by or explained by.
We don’t have to use the formula syntax. We can specify data for two different groups. Let’s see if GDP per capita is different when comparing the Americas and Asia.
t.test(gapminder07$gdpPercap[gapminder07$continent=="Asia"], gapminder07$gdpPercap[gapminder07$continent=="Americas"])
##
## Welch Two Sample t-test
##
## data: gapminder07$gdpPercap[gapminder07$continent == "Asia"] and gapminder07$gdpPercap[gapminder07$continent == "Americas"]
## t = 0.46849, df = 55.535, p-value = 0.6413
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4816.823 7756.813
## sample estimates:
## mean of x mean of y
## 12473.03 11003.03
By storing the output of the T-test (which is a list) as its own object, we can access different parts of the results.
t1 <- t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
names(t1)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
t1$p.value
## [1] 9.507438e-18
Of course, the two life expectancy “groups” we used above to conduct a T-test are based on a continuous variable indicating life expectancy. We may be more interested in whether this variable predicts GDP per capita rather than the two “groups” that we created using an arbitrary threshold.
The basic syntax for a linear regression is shown below. Note that
instead of repeating df$variablename
several times, we can
indicate the data frame name using the data =
argument and
simply use variable names.
lm(y ~ x1 + x2 + x3, data=df_name)
Example:
lm(gdpPercap ~ lifeExp, data=gapminder07)
##
## Call:
## lm(formula = gdpPercap ~ lifeExp, data = gapminder07)
##
## Coefficients:
## (Intercept) lifeExp
## -36759.4 722.9
The default output isn’t much. You get a lot more with
summary()
:
r1 <- lm(gdpPercap ~ lifeExp, data=gapminder07)
summary(r1)
##
## Call:
## lm(formula = gdpPercap ~ lifeExp, data = gapminder07)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14473 -7840 -2145 6159 28143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36759.42 4501.25 -8.166 1.67e-13 ***
## lifeExp 722.90 66.12 10.933 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9479 on 140 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.4567
## F-statistic: 119.5 on 1 and 140 DF, p-value: < 2.2e-16
Note that a constant (Intercept) term was added automatically.
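As with the t.test result, the fitted model object is a list, so you can pull out individual pieces (output omitted here):
names(r1)               # components of the lm object
coef(r1)                # just the coefficients
summary(r1)$r.squared   # R-squared from the summary object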
Let's try another regression with two independent variables. This time, we will predict life expectancy as a function of GDP per capita and population.
r2 <- lm(lifeExp ~ gdpPercap + pop, data=gapminder07)
summary(r2)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder07)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.496 -6.119 1.899 7.018 13.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.921e+01 1.040e+00 56.906 <2e-16 ***
## gdpPercap 6.416e-04 5.818e-05 11.029 <2e-16 ***
## pop 7.001e-09 5.068e-09 1.381 0.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.87 on 139 degrees of freedom
## Multiple R-squared: 0.4679, Adjusted R-squared: 0.4602
## F-statistic: 61.11 on 2 and 139 DF, p-value: < 2.2e-16
You will often want to save your work in R as well. There are a few different ways to save: you can write the data you've created out to plain-text files such as CSVs, or save R objects themselves to an .RData file.
We imported the gapminder data earlier in CSV format, and manipulated it in several ways: we subsetted the 2007 data and added the variables lifeExp_rounded, lifeExp_highlow, and continent_abrv.
The best method for making your workflow and analysis reproducible is to write any data sets you create to plain text files.
Let’s try to save our subsetted and manipulated
gapminder07
data frame as a CSV. To write a CSV, there are
write.csv
and write.table
functions, similar
to their read
counterparts. The one trick is that you usually do not want to write row names, so set row.names=FALSE:
write.csv(gapminder07, file="data/gapminder_2007_edited.csv",
row.names=FALSE)
Or using the readr package's equivalent, write_csv (which never writes row names):
write_csv(gapminder07, "data/gapminder_2007_edited.csv")
You can use the save
function to save multiple objects
together in a file. The standard file extension to use is
.RData
. Example:
save(schooldata, gapminder,
file = "workshopobjects.RData")
To later load in saved data, use the load
function:
load("workshopobjects.RData")
This can be useful if you’re working with multiple objects and want
to be able to pick up your work easily later. But .RData
files generally aren’t portable to other programs, so think of them only
as internal R working files – not the format you want to keep data in
long-term. Loading a .RData
file will overwrite objects
with the same name already in the environment.
You can also save all the objects in your environment by using the
save.image()
function, or by clicking the “Save” icon in
the Environment pane in RStudio.
We will spend a lot more time later in the boot camp on data
visualization, but today we will briefly introduce some functions for
visualization that are included in base R. These functions are useful to
quickly visualize data in early phases of analysis, and their syntax is
often incorporated into other packages. For more advanced and
aesthetically pleasing data visualization, you will want to use the
ggplot2
package, which we will go over in detail on Day
3.
Histograms are a simple and useful way to visualize the distribution
of a variable. For example, let’s plot a histogram of life expectancy
from the gapminder07
data frame:
hist(gapminder07$lifeExp)
By reading the help file for the hist()
function, we can
identify several arguments that can change the aesthetics of the plot.
The breaks =
argument controls the number of breaks on the
x-axis.
hist(gapminder07$lifeExp, breaks=20,
main="Life expectancy (2007 data)", ylab="Frequency", xlab="Life expectancy")
The simplest way to plot the relationship between two variables is a
scatterplot. If you provide two variables to the plot()
function in R, it produces a scatterplot. Let’s try it with life
expectancy and GDP per capita in the gapminder07
data
frame. Recall that ~
means “a function of”, so we will put
the y-axis variable on the left and the x-axis variable on the
right.
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap)
Again, we can add axes and labels:
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap, main="Life expectancy as a function of GDP per capita (2007 data)", xlab="GDP per capita", ylab="Life expectancy")
Perhaps we want to add a line indicating the mean value of life
expectancy. We can do this by using the abline()
function
to add a line after creating a plot. Adding multiple layers to a plot is
much more intuitive and flexible with the ggplot2
package,
which we will explore on Day 3.
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap, main="Life expectancy as a function of GDP per capita (2007 data)", xlab="GDP per capita", ylab="Life expectancy")
abline(h = mean(gapminder07$lifeExp))
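You can also overlay a fitted regression line by passing an lm object to abline(). Note that this refits the model with life expectancy as the outcome (the reverse of r1 above) so that the line matches the plot's axes:
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap, main="Life expectancy as a function of GDP per capita (2007 data)", xlab="GDP per capita", ylab="Life expectancy")
abline(lm(lifeExp ~ gdpPercap, data=gapminder07))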