Part of the tidyverse, ggplot2 is an R package that implements the “grammar of graphics”.
In ggplot, graphics are built by supplying data and a mapping of data values to aesthetics, and then adding layers that build geometric objects, scales, labels, and more.
It works well with dplyr and other tidyverse packages.
Before starting, make sure that imports, generation, merged_energy, long_gen and long_merged_energy are in your Environment, then read in gapminder and recreate gapminder07.
A basic plot combines:
The ggplot() function
geom functions, such as geom_point() or geom_histogram()
The aes() function nested within the chosen geom function
The + operator
Task:
Create a scatterplot of lifeExp against gdpPercap from the gapminder07 data.
Steps:
Pass the gapminder07 data frame as input to the ggplot() function.
Choose a geom; in this case we choose geom_point() to generate points.
Supply the aesthetic mappings, i.e. the x and y values, to geom_point().
ggplot(gapminder07) +
geom_point(aes(x = gdpPercap, y = lifeExp)) +
labs(title = "Relationship between life expectancy and GDP per capita in 2007",
x = "GDP per capita", y = "Life expectancy")
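The building blocks above (ggplot(), a geom, aes(), and +) can also be tried in isolation. The following is a minimal sketch on a toy data frame; `toy` and its column names are invented for illustration and are not part of the workshop data.

```r
library(ggplot2)

# Toy data, invented for illustration (not the gapminder data)
toy <- data.frame(
  income = c(1, 2, 4, 8, 16),
  years  = c(50, 58, 66, 72, 79)
)

# ggplot() takes the data, geom_point() draws the points, aes() maps
# columns to the x and y positions, and + chains the layers together
p <- ggplot(toy) +
  geom_point(aes(x = income, y = years)) +
  labs(title = "Toy scatterplot", x = "Income", y = "Years")

# The plot is only drawn when the object is printed
print(p)
```

Storing the plot in an object, as here, makes it possible to add further layers later with `p + ...`.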
Using the gapminder07 data, create a scatterplot of the natural log of gdpPercap as a function of the natural log of pop. Give it a title and axis labels.
Remember, you will need three functions: ggplot(), geom_point(), and labs().
ggplot(gapminder07) +
geom_point(aes(x = log(pop), y = log(gdpPercap))) +
labs(title = "Relationship between GDP per capita and population in 2007", x = "Logged population", y = "Logged GDP per capita")
Recall the pipe operator (%>%) in dplyr.
dplyr and ggplot2 are designed to work well together.
Task:
Plot a column chart of total energy generated over time.
Steps:
Group and summarise long_gen to get the total output at each datetime.
Pipe the result to ggplot(), add geom_col(), and map appropriate x and y variables.
long_gen %>%
group_by(datetime) %>%
summarise(output=sum(output)) %>%
ggplot() +
geom_col(aes(x=datetime, y=output)) +
labs(title="Total energy generated, by hour", x="Hour", y="Output (MW)")
Task:
Plot a column chart of hydroelectric power generated over time.
Hint: There are two types of hydroelectric sources in the data: large_hydro and small_hydro.
long_gen %>%
filter(source=="large_hydro"|source=="small_hydro") %>%
group_by(datetime) %>%
summarise(output=sum(output)) %>%
ggplot() +
geom_col(aes(x=datetime, y=output)) +
labs(title="Total hydro power generated, by hour", x="Hour", y="Output (MW)")
generation %>%
mutate(hydro=large_hydro+small_hydro) %>%
ggplot() +
geom_col(aes(x=datetime, y=hydro)) +
labs(title="Total hydro power generated, by hour", x="Hour", y="Output (MW)")
We have already seen examples of two kinds of plots:
geom_point()
geom_col()
Let’s see a few more.
geom_line()
imports %>%
ggplot() +
geom_line(aes(x=datetime, y=imports)) +
labs(title="Energy imports over time", x="Hour", y="Amount imported (MW)")
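Aesthetics that vary with the data are mapped inside the mapping function (aes()).
Aesthetics set to one fixed value for the whole layer go outside the aes() function.
Common fixed settings include size= and col=.
imports %>%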
ggplot() +
geom_line(aes(x=datetime, y=imports), size=1.2, col="red") +
labs(title="Energy imports over time", x="Hour", y="Amount imported (MW)")
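The difference between mapping and setting is easiest to see when it goes wrong. In this sketch (toy data with invented column names), the same `col = "red"` is placed once outside and once inside aes(); `layer_data()` is used only to inspect the colour that ggplot2 ends up drawing.

```r
library(ggplot2)

# Toy data, invented for illustration
toy <- data.frame(hour = 1:6, mw = c(3, 5, 4, 6, 8, 7))

# Set: a constant outside aes() is applied directly, so the line is red
p_set <- ggplot(toy) +
  geom_line(aes(x = hour, y = mw), col = "red")

# Mapped: inside aes(), "red" is treated as a one-level variable, so
# ggplot2 picks a colour from its default palette and adds a legend
p_mapped <- ggplot(toy) +
  geom_line(aes(x = hour, y = mw, col = "red"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

If a line comes out in an unexpected colour with a one-entry legend, a constant inside aes() is a likely cause.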
geom_area()
generation %>%
ggplot() +
geom_area(aes(x=datetime, y=wind), fill="darkblue") +
labs(title="Hourly wind power generation, Sept 3-9", x="Hour", y="Output (MW)")
geom_boxplot()
long_gen %>%
ggplot() +
geom_boxplot(aes(x=source, y=output)) +
labs(title="Amount of energy generated by each source, Sept 3-9", x="Source type", y="Output (MW)")
Task:
Plot a line of large hydro generation over time, and a smoothed line of the same relationship on top of it.
Steps:
Pipe the generation data to ggplot().
Add geom_line() with appropriate x and y aesthetics. Make it turquoise for fun.
Add geom_smooth() with the same x and y aesthetics.
geom_smooth() plots smoothed conditional means (estimated here using a loess regression).
generation %>%
ggplot() +
geom_line(aes(x=datetime, y=large_hydro), col="turquoise3") +
geom_smooth(aes(x=datetime, y=large_hydro)) +
labs(title="Hydroelectric (large) generation per hour, Sept 3-9", x="Hour", y="Output (MW)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
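The message above reports the defaults that geom_smooth() chose. As a hedged sketch on simulated toy data (the data frame and its columns are invented here), the method and formula can be spelled out, which avoids the message, and the `span` and `se` arguments can be used to adjust the loess fit and hide the confidence band.

```r
library(ggplot2)

# Simulated toy series, invented for illustration
set.seed(1)
toy <- data.frame(hour = 1:48)
toy$mw <- 100 + 20 * sin(toy$hour / 4) + rnorm(48, sd = 5)

p <- ggplot(toy) +
  geom_line(aes(x = hour, y = mw), col = "turquoise3") +
  # Naming method and formula explicitly avoids the informational message;
  # a smaller span follows the data more closely, se = FALSE drops the band
  geom_smooth(aes(x = hour, y = mw),
              method = "loess", formula = y ~ x, span = 0.5, se = FALSE)

# The smoother is evaluated on a grid of 80 x values by default
smoothed <- layer_data(p, 2)
nrow(smoothed)
```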
Task:
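Create a column chart that shows the total output per source.
Change the color of the columns to "darkred".
Add a horizontal line at the mean output across sources; identify the geom function that you need.
Add a meaningful title and axis labels using labs().
long_merged_energy %>%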
group_by(source) %>%
summarise(output=sum(output)) %>%
ggplot() +
geom_col(aes(x=source, y=output), fill="darkred") +
geom_hline(aes(yintercept=mean(output))) +
labs(title="Total output per energy source over Sept 3-9", y="Output (MW)", x="Source")
The labs() layer can add titles (title=) and axis labels (x= and y=) to a plot.
It can also add subtitles (subtitle=) and captions (caption=).
imports %>%
ggplot() +
geom_line(aes(x=datetime, y=imports), col="red") +
labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2019",
caption="Source: California Energy Commission", x="Hour", y="Amount imported (MW)")
scale_alpha: control transparency settings
scale_color and scale_fill: control color palettes and other aspects of data mapped to colors
scale_x and scale_y: control the position of axis markers
Task:
Recreate the line plot of imports over time, but label the x-axis with hours rather than dates.
Steps:
Pipe the imports data to ggplot().
Add a geom_line() layer and map appropriate x and y variables.
Add a scale_x_datetime() layer and use the date_labels= and date_breaks= arguments to modify x-axis labels and breaks.
Add a labs() layer.
imports %>%
ggplot() +
geom_line(aes(x=datetime, y=imports), col="red") +
scale_x_datetime(date_labels="%H:%M", date_breaks="12 hours") +
labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Amount imported (MW)")
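The date_labels= argument takes strftime-style format codes, the same ones used by base R's format(). A quick way to preview what a code will print on the axis is to apply it to a single timestamp; the timestamp below is invented for illustration.

```r
# One timestamp, invented for illustration
stamp <- as.POSIXct("2019-09-03 14:30:00", tz = "UTC")

format(stamp, "%H:%M")     # hour and minute on a 24-hour clock
format(stamp, "%d %H:%M")  # day of month, then hour and minute
format(stamp, "%b %d")     # abbreviated month name and day (locale-dependent)
```

Any of these strings can be passed as date_labels= in scale_x_datetime().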
ggplot comes with several preset themes that can be added as layers, including:
theme_grey(): default
theme_bw(): strip colors, including grey gradients
theme_dark() and theme_light(): change the background of the coordinate system
theme_minimal() and theme_void(): see reference guide for details
imports %>%
ggplot() + geom_line(aes(x=datetime, y=imports), col="red") +
scale_x_datetime(date_labels="%H:%M", date_breaks="12 hours") +
labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Amount imported (MW)") +
theme_minimal()
The theme() layer lets you modulate many components of the theme, including:
Axis tick labels (axis.text.x =).
Legend position (legend.position =).
Plot background (plot.background =).
For the full list, see the help page for ggplot2::theme().
imports %>%
ggplot() + geom_line(aes(x=datetime, y=imports), col="red") +
scale_x_datetime(date_labels="%H:%M", date_breaks="12 hours") +
labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Amount imported (MW)") +
theme(axis.text.x=element_text(angle=45, hjust=1, size=12))
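Several theme() components can be changed in one call. The sketch below uses a toy data frame (invented for illustration) to combine the rotated axis text from the example above with a legend position and a plot background; the specific values are arbitrary choices.

```r
library(ggplot2)

# Toy data, invented for illustration
toy <- data.frame(
  hour   = rep(1:4, times = 2),
  mw     = c(3, 5, 4, 6, 2, 2, 3, 5),
  source = rep(c("a", "b"), each = 4)
)

p <- ggplot(toy) +
  geom_line(aes(x = hour, y = mw, col = source)) +
  theme(
    axis.text.x     = element_text(angle = 45, hjust = 1),  # rotate tick labels
    legend.position = "bottom",                             # move the legend
    plot.background = element_rect(fill = "grey95")         # area behind the panel
  )

print(p)
```

Text elements are modified with element_text() and rectangular elements such as backgrounds with element_rect().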
coord_cartesian() is the default Cartesian coordinate system used; by explicitly calling this function, you can change the limits of the x and y axes from their defaults
coord_fixed() sets a fixed aspect ratio between the x and y axes
coord_flip() swaps the x and y axes
coord_transform() lets you transform the Cartesian coordinates using functions like sqrt or log
coord_polar() changes the coordinate system to polar coordinates rather than a Cartesian system
long_gen %>%
mutate(date=lubridate::date(datetime)) %>%
group_by(date) %>% summarise(output=sum(output)) %>%
ggplot() + geom_col(aes(x=date, y=output)) +
labs(title="Total energy generated, by day", x="Day", y="Output (MW)")
long_gen %>%
mutate(date=lubridate::date(datetime)) %>%
group_by(date) %>% summarise(output=sum(output)) %>%
ggplot() + geom_col(aes(x=date, y=output)) +
labs(title="Total energy generated, by day", x="Day", y="Output (MW)") +
coord_flip()
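Besides flipping, a coordinate function can be used to change the visible axis range. A minimal sketch, assuming a toy data frame invented for illustration: coord_cartesian() zooms the view without discarding any rows, which can be checked with layer_data().

```r
library(ggplot2)

# Toy data, invented for illustration
toy <- data.frame(day = 1:5, mw = c(10, 40, 25, 60, 35))

base_plot <- ggplot(toy) +
  geom_col(aes(x = day, y = mw))

# coord_cartesian() only changes the visible window; all rows are kept,
# and columns taller than the upper limit are simply cut off at the edge
zoomed <- base_plot + coord_cartesian(ylim = c(0, 30))

nrow(layer_data(zoomed))
```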
There are also stat functions in ggplot that enable you to conduct statistical transformations of your data prior to plotting.
stat and geom layers can often be used interchangeably, as each stat layer has a geom argument and each geom layer has a stat argument.
You can split the data into groups within a geom using the group = argument, and demarcate the distinct objects created per group in some way.
Task:
Create a line plot of energy output over time, with separate lines for each source.
Steps:
Pipe long_merged_energy to ggplot().
Add a geom_line() layer and specify x=datetime and y=output.
Inside aes(), also specify group=source and col=source.
Add a labs() layer.
long_merged_energy %>%
ggplot() +
geom_line(aes(x=datetime, y=output, group=source, col=source)) +
labs(title="Output by energy source over time", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Output (MW)")
Task:
Create a line plot that compares generation of wind, solar,
and geothermal energy over time.
Bonus: Set the line size to 1.5.
long_merged_energy %>%
filter(source=="wind"|source=="solar"|source=="geothermal") %>%
ggplot() +
geom_line(aes(x=datetime, y=output, group=source, col=source), size=1.5) +
labs(title="Wind vs. Solar vs. Geothermal generation", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Output (MW)")
long_merged_energy %>% filter(source=="wind"|source=="solar"|source=="geothermal") %>%
ggplot() +
geom_line(aes(x=datetime, y=output, group=source, col=source), size=1.5) +
scale_color_brewer(palette="Accent", name="Energy source") +
labs(title="Wind vs. Solar vs. Geothermal generation", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Output (MW)")
col= is used to color objects like lines and points
fill= is used to color objects like columns and histograms
col= will create colored outlines for such objects
By grouping within a geom function, we generate multiple geometric objects on the same plot.
Task:
Create a column chart of energy use by day, grouped by source.
Steps:
Modify long_merged_energy to summarize output by date and source.
  Use mutate() to create a date variable from datetime.
  Use group_by() and summarize() to calculate total output per date per source.
Pipe the result to ggplot().
In the geom_col() layer, supply x and y aesthetics.
Specify group=source and fill=source.
Add a labs() layer.
long_merged_energy %>%
mutate(date=lubridate::date(datetime)) %>%
group_by(date, source) %>%
summarize(output=sum(output)) %>%
ggplot() +
geom_col(aes(x=date, y=output, group=source, fill=source)) +
labs(title="Energy use by day", x="Day", y="Output (MW)")
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
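The message above is informational: after summarising by two variables, dplyr keeps the result grouped by the first one. A minimal sketch, on a toy data frame invented for illustration, of using the .groups argument mentioned in the message to return an ungrouped result:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  date   = as.Date(c("2019-09-03", "2019-09-03", "2019-09-03",
                     "2019-09-04", "2019-09-04", "2019-09-04")),
  source = c("wind", "wind", "solar", "wind", "solar", "solar"),
  output = c(10, 5, 20, 30, 40, 1)
)

daily <- toy %>%
  group_by(date, source) %>%
  # .groups = "drop" returns an ungrouped result and silences the message
  summarise(output = sum(output), .groups = "drop")

is_grouped_df(daily)
```

For plotting, the grouped and ungrouped results behave the same; dropping the groups mainly matters if more dplyr steps follow.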
Set position="dodge" in geom_col() to ‘unstack’ the columns for each source.
Set position="fill" to normalize the height of stacked columns.
long_merged_energy %>%
mutate(date=lubridate::date(datetime)) %>%
group_by(date, source) %>%
summarize(output=sum(output)) %>%
ggplot() +
geom_col(aes(x=date, y=output, group=source, fill=source), position="dodge") +
labs(title="Energy use by day", x="Day", y="Output (MW)")
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
long_merged_energy %>%
mutate(date=lubridate::date(datetime)) %>%
group_by(date, source) %>%
summarize(output=sum(output)) %>%
ggplot() +
geom_col(aes(x=date, y=output, group=source, fill=source), position="fill") +
labs(title="Energy use by day", x="Day", y="Share of daily output")
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
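To see what position="fill" does to the values, the sketch below builds a small stacked chart from a toy data frame (invented for illustration) and inspects the column heights that ggplot2 computes.

```r
library(ggplot2)

# Toy data, invented for illustration
toy <- data.frame(
  day    = rep(c("Mon", "Tue"), each = 2),
  source = rep(c("solar", "wind"), times = 2),
  output = c(10, 30, 20, 20)
)

p_fill <- ggplot(toy) +
  geom_col(aes(x = day, y = output, fill = source), position = "fill")

# After position = "fill", each stacked column runs from 0 to 1, so the
# y-axis reads as a share of the daily total rather than the raw values
bars <- layer_data(p_fill)
range(c(bars$ymin, bars$ymax))
```

Because the heights become shares, a y-axis label in original units no longer describes the axis.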
Task:
Create a line graph of the output by day, with a different line for each regrouped group (renewable, hydro, etc.).
Steps:
Merge long_merged_energy with regroup.
Pipe the summarized data to ggplot().
Try geom_line() and geom_point() with shape=group.
Try geom_line() with linetype=group.
# Prepare data
long_merged_energy_regroup <- long_merged_energy %>%
rename(type = source) %>%
merge(regroup, by = "type") %>%
mutate(date=lubridate::date(datetime)) %>%
group_by(date, group) %>%
summarise(output=sum(output))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: date [1]
## date group output
## <date> <chr> <dbl>
## 1 2019-09-03 hydro 93919.
## 2 2019-09-03 imports 169826.
## 3 2019-09-03 nuclear 53926.
## 4 2019-09-03 other 0
## 5 2019-09-03 renewable 183304.
## 6 2019-09-03 thermal 313436.
long_merged_energy_regroup %>%
ggplot() +
geom_line(aes(x=date, y=output, group=group, col=group), size=0.8) +
geom_point(aes(x=date, y=output, group=group, shape=group)) +
labs(title="Output by source group over time", subtitle="Data collected during September 3-9, 2019", x="Date", y="Output (MW)")
long_merged_energy_regroup %>%
ggplot() +
geom_line(aes(x=date, y=output, group=group, linetype=group), size=1) +
labs(title="Output by source group over time", subtitle="Data collected during September 3-9, 2019", x="Date", y="Output (MW)")
size= and alpha= inside aes() modulate the size and transparency of geom objects based on some data value.
gapminder07 %>%
ggplot() +
geom_point(aes(x=log(gdpPercap), y=lifeExp, size=pop, col=continent)) +
scale_size_continuous(name="Population") + scale_color_discrete(name="Continent") +
labs(title="Life expectancy as a function of GDP per capita in 2007", x="Logged GDP per capita", y="Life expectancy")
Task:
Visualize the average output for each hour of the day, grouped by source.
You need to identify the output per source per hour (e.g. 01:00, 02:00, etc.) averaged over all days.
Prepare the data using dplyr and lubridate functions.
Decide which geom(s) to use, and how to demarcate groups.
Use a Brewer color palette (e.g. "Set3") and change the legend name.
Don't forget labs()!
ex5 <- long_merged_energy %>%
mutate(hour=lubridate::hour(datetime)) %>%
group_by(hour, source) %>%
summarize(output=mean(output)) %>%
ggplot() +
geom_area(aes(x=hour, y=output, fill=factor(source))) +
scale_fill_brewer(palette="Set3", name="Source") +
labs(title="Average hourly output by source",
subtitle="Data collected during September 3-9",
x="Hour of the day", y="Output (MW)") +
theme_bw()
## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.
Task:
Compare energy generation over time, across sources.
How do we do this? Using what we’ve learned so far:
Pipe long_gen to ggplot().
Add a geom_line() layer, setting x=datetime and y=output in aes().
Set group=source and col=source in aes().
long_gen %>%
ggplot() +
geom_line(aes(x=datetime, y=output, group=source, col=source), size=1) +
labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Output (MW)")
Is this a helpful plot?
Instead of separating sources with col= in aes(), let’s add a facet_wrap() layer.
Specify facet_wrap(~source), i.e. tell ggplot to “facet by source”.
long_gen %>%
ggplot() +
geom_line(aes(x = datetime, y = output)) +
facet_wrap(~source) +
labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Output (MW)")
long_gen %>%
ggplot() +
geom_line(aes(x = datetime, y = output)) +
facet_wrap(~source, scales="free") +
labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Output (MW)")
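The scales= argument does not have to free both axes. As a sketch on a toy data frame (invented for illustration), scales="free_y" gives each panel its own y-axis range while keeping a shared x-axis, which is often enough when sources differ greatly in magnitude but cover the same time span.

```r
library(ggplot2)

# Toy data, invented for illustration: two series on very different scales
toy <- data.frame(
  hour   = rep(1:4, times = 2),
  source = rep(c("small", "large"), each = 4),
  output = c(1, 2, 1, 3, 100, 300, 200, 400)
)

# With scales = "free_y", the small series is no longer flattened by the
# large one, and the x-axis stays comparable across panels
p <- ggplot(toy) +
  geom_line(aes(x = hour, y = output)) +
  facet_wrap(~source, scales = "free_y")

# One panel is created per level of the faceting variable
length(unique(layer_data(p)$PANEL))
```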
Task:
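Alter the facet plot by:
Coloring each line by source
Setting the line size to 1
Moving the legend to the bottom of the plot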
long_gen %>%
ggplot() +
geom_line(aes(x = datetime, y = output, col=source), size=1) +
facet_wrap(~source, scales="free") +
labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2019", x="Hour", y="Output (MW)") +
theme(legend.position = "bottom")
Three ways to save images created by ggplot:
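Call ggsave() after your plotting code; by default it saves the last plot displayed. It is a regular function call rather than a layer, so put it on its own line instead of adding it with +.
This is the end of the R lecture sessions.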
After a short break, we will return for a cumulative exercise that combines the skills you have learned since Monday.
The instructions are in a markdown file in the exercises folder. Create a new RMarkdown file (save it using this naming convention: FinalRExercise_LastnameFirstname.Rmd), in which you will complete the exercise.
You can work in small groups, but write up your code separately.
Raise your hand if you need help!
You’ve created several RMarkdown files over the past three days. Since you stored these in a forked repo, it is possible to create a pull request and ‘submit’ these changes to the base repo. This is optional, but gives you a chance to explore GitHub functionality and share your work with your classmates.
Before you submit your completed exercises, move every new file in your repo to the submissions folder. This ensures that we won’t inadvertently make changes to the session materials. Then, create a new pull request, asking to merge changes from your fork to the base repository.
Comment your code
Always remember to comment your code!
When writing particularly complex code in dplyr or ggplot2, this includes commenting within a flow of %>% or + operators. See the lecture notes for some examples of this.
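As one possible illustration of commenting inside a chain, the sketch below places a short comment before each step of a pipeline that runs from dplyr into ggplot2. The data frame `toy` and its columns are invented for this example.

```r
library(dplyr)
library(ggplot2)

# Toy data, invented for illustration
toy <- data.frame(
  hour   = rep(0:3, times = 2),
  source = rep(c("solar", "wind"), each = 4),
  output = c(0, 0, 5, 9, 4, 6, 5, 3)
)

p <- toy %>%
  # keep a single source
  filter(source == "wind") %>%
  # total output per hour
  group_by(hour) %>%
  summarise(output = sum(output), .groups = "drop") %>%
  # hand the summarised data to ggplot; from here on, layers use +
  ggplot() +
  # one column per hour
  geom_col(aes(x = hour, y = output)) +
  # titles and axis labels go last
  labs(title = "Toy wind output", x = "Hour", y = "Output")

print(p)
```

R ignores comment lines between steps as long as each step still ends with %>% or +, so the chain is not interrupted.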