Introduction to ggplot

What is ggplot?

The ggplot2 package is part of the tidyverse, a suite of packages maintained by the RStudio company (now called Posit). In ggplot, graphics are built by supplying data and a mapping of data values to aesthetics, and then adding layers that build geometric objects, scales, labels, and more.

Before using ggplot, make sure to load the package. You can either load tidyverse or ggplot2 directly.

library(tidyverse)
library(ggplot2)

Why ggplot?

It is highly customizable and extensible.
It is designed to work well with dplyr and other tidyverse packages.
It has been in use for over 10 years, meaning that help is widely available.
Its wide usage has also given rise to a wide variety of available extensions, including those that create interactive graphics (ggiraph), animated graphics (gganimate), alluvial diagrams (ggalluvial), maps (ggmap), networks (ggraph), and more.
It uses a consistent “grammar” across different packages and visualization classes.

Some useful resources

The ggplot cheatsheet maintained by RStudio is an invaluable resource. I always have it open when using ggplot2.
The ggplot reference guide maintained by RStudio. In your early days using ggplot2, this will be more useful in conjunction with the cheatsheet since it contains more detail on syntax and links to full help files.
The Data Visualization chapter in the R for Data Science book by Garrett Grolemund and Hadley Wickham. The rest of the book explores other parts of the tidyverse.
ggedit is an extension package for ggplot2 that enables users to launch a graphical user interface (GUI) with which to build ggplot graphics layer-by-layer. When completed, the user can copy the code required to generate the plot and return to their script file.
- There is some controversy over whether GUIs diminish creativity and flexibility in data science, but nevertheless, this tool may help early learners get used to the grammar of ggplot.
And, of course, the help files that come with the package itself.

Basics: the grammar of graphics

ggplot works by combining several functions using the + operator. This creates layers in an additive fashion. You can start a new line after each + operator for readability if you would like (I do so throughout these notes.) Each function does something specific: provide a dataset, create a geometric object, add labels, add scales, change the coordinate system or layout, change the color palette, etc.

Word of warning!! Don’t confuse the pipe %>% from dplyr with the + from ggplot2!
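To make the difference concrete, here is a minimal sketch using the built-in mtcars data (chosen only for illustration; it is not one of the workshop datasets). The pipe passes a data frame from one function to the next, while + adds layers to a plot that already exists.

```r
library(ggplot2)
library(dplyr)

# mtcars ships with R; it stands in for any data frame here
p_pipe <- mtcars %>%
  filter(cyl == 4) %>%   # %>% hands the filtered data frame to the next function
  ggplot() +             # once ggplot() is called, switch to + to add layers
  geom_point(aes(x = wt, y = mpg))

p_pipe
```

If you accidentally use %>% after ggplot(), R will stop with an error, so this mistake is usually easy to spot.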

Each graphic generated by ggplot requires at least three basic components:

data: a data frame containing the data that you want to visualize, provided in the ggplot() function’s required data = argument.
geometric objects: the objects/shapes that you want to plot, indicated through one of the many available geom functions, such as geom_point() or geom_histogram().
aesthetic mapping: the mapping from the data to the geometric objects, provided in an aes() function nested within the mapping = argument of the geom function.
- You can also provide the aes() function to the mapping = argument of ggplot(). If you do so, all geom functions will pull from the aesthetic mapping provided in this initial ggplot() function.

Together, the basic structure looks like this:

ggplot(data = <DATA FRAME>) + 
  <GEOM_FUNCTION>(mapping = aes(<VARIABLES>))

To demonstrate with an example, let’s re-create the scatterplot we made in Day 1 using data from Gapminder. Recall that we used the 2007 data subset and plotted the relationship between life expectancy on the y-axis and GDP per capita on the x-axis. Using the structure above, we would proceed as follows:

Provide the data in the data = argument of the initial ggplot() function.
Select geom_point() as our second function, since the geometric object we want to generate is points.
Within the mapping = argument of geom_point(), we provide our aesthetic mapping. In this case, we need to provide an x and y variable for the points.

Altogether, it looks like this:

ggplot(data = gapminder07) + 
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp))

## Or, you can provide the mapping in the call to `ggplot`

# ggplot(data = gapminder07, mapping = aes(x = gdpPercap, y = lifeExp)) + 
#  geom_point()

Recall how we added a horizontal line to the scatterplot on Day 1. We can replicate that in ggplot by adding another layer using the geom_hline() function, which generates another geometric object (specifically, a horizontal line). Notice that data = and mapping = have been dropped here: as with all functions in R, since these are the first arguments, R assumes that the first inputs provided to the function are for these first arguments.

ggplot(gapminder07) + 
  geom_point(aes(x = gdpPercap, y = lifeExp)) + 
  geom_hline(aes(yintercept = mean(lifeExp)))

Remember that you can also specify the aesthetics in the call to ggplot(). This allows you to use the same aesthetic mapping across a number of different geoms.

# Doesn't work!
# ggplot(data = gapminder07) + 
#  geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
#  geom_smooth()

# Works!
ggplot(data = gapminder07, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can also add a title and axis labels as a layer using the labs() function.

ggplot(data = gapminder07) + 
  geom_point(aes(x = gdpPercap, y = lifeExp)) + 
  geom_hline(aes(yintercept = mean(lifeExp))) + 
  labs(title = "Relationship between life expectancy and GDP per capita in 2007", 
       x = "GDP per capita", y = "Life expectancy")

As you can see, the grammar of graphics used in ggplot2 breaks down the information that goes into each graphic into several layers, each of which you can customize.

Preparing your data

Recall the role of pipes (%>%) in dplyr this morning. Since dplyr and ggplot2 are both part of tidyverse, they are designed to work well together. You can use a series of pipes to prepare your data to be in the appropriate format, subset, etc before plotting.

For example, consider the above plot. Instead of specifying the data in the ggplot() function, we can supply the data through a pipe. Instead of using the subsetted gapminder07 data, we can use the original gapminder and use a filter() function to select the 2007 observations only.

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot() + 
  geom_point(aes(x = gdpPercap, y = lifeExp)) + 
  geom_hline(aes(yintercept = mean(lifeExp))) + 
  labs(title = "Relationship between life expectancy and GDP per capita in 2007", 
       x = "GDP per capita", y = "Life expectancy")

For another example of preparing data for plotting, let’s turn to the California energy data, which we will use for the rest of the ggplot session. Recall that in the advanced manipulation session this morning, you used dplyr to create long-format data frames long_gen (all generation data) and long_merged_energy (all generation + imports data). Say we wanted to visualize the total energy generated in the state over time. Starting with long_gen, and using group_by() and summarize(), we can sum the output for each date-time value (i.e. each hour) and feed this directly into ggplot(). Note that this kind of graphing task is one of the uses of having a long-format version of your data.

long_gen %>% 
  group_by(datetime) %>% 
  summarise(output=sum(output)) %>% 
  ggplot() + 
  geom_col(aes(x=datetime, y=output)) + 
  labs(title="Total energy generated, by hour", x="Hour", y="Output (MW)")

Perhaps we are concerned that the above plot has too much granularity, and we want to plot the total output per day instead of per hour. Again, we can achieve this by manipulating our data prior to piping it into ggplot(). Note that we use the date() function in lubridate, which takes a date-time value and returns the date portion only.

long_gen %>% 
  mutate(date=lubridate::date(datetime)) %>% 
  group_by(date) %>% 
  summarise(output=sum(output)) %>% 
  ggplot() + 
  geom_col(aes(x=date, y=output)) + 
  labs(title="Total energy generated, by day", x="Day", y="Output (MW)")

Geometric objects

We have already seen two kinds of geometries: geom_point() which generates points using x and y values, and geom_col() which generates columns for a bar chart. Let’s explore a few more, and learn how to modulate the appearance of the geometric objects created.

Line plot

Let’s say we’d like to plot the amount of energy imported over time. We can use geom_line to generate a line connecting all the data points. To do so, we need to provide x and y values in the aesthetic mapping function nested within geom_line(). Note that we are using the imports data frame here, but we could easily use the wide- or long-format merged data frames, as long as we use the appropriate filtering functions.

imports %>%
  ggplot() + 
  geom_line(aes(x=datetime, y=imports)) + 
  labs(title="Energy imports over time", x="Hour", y="Amount imported (MW)")

Changing size and color

Once we have created the geometric object we want, using the appropriate aesthetic mapping, we may want to change the appearance of the object. The features that we can modify vary depending on the geom object (the cheatsheet and each function’s reference files are helpful here). For most functions, we can modify the size and color of the object(s) created. Let’s try to increase the size of the line in the plot above and make it red in color. Note that the col = and size = arguments here are outside the aes() function since we are not mapping anything from the data frame here.

imports %>%
  ggplot() + 
  geom_line(aes(x=datetime, y=imports), size=1.2, col="red") + 
  labs(title="Energy imports over time", x="Hour", y="Amount imported (MW)")

To learn more about the color inputs you can provide in R, see the R colors cheatsheet.
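A note on versions: in ggplot2 3.4.0 and later, the thickness of lines is set with linewidth = rather than size =. Using size = for lines still works but produces a deprecation warning. The sketch below assumes ggplot2 3.4.0 or newer and uses the built-in economics data (an illustrative stand-in, not the imports data).

```r
library(ggplot2)

# economics ships with ggplot2; unemploy is the number of unemployed (thousands)
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  # linewidth controls line thickness in ggplot2 >= 3.4.0
  geom_line(linewidth = 1.2, colour = "red")

p_line
```

For points, size = remains the right argument; the change only affects lines.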

Area plot

In addition to lines and columns, we can also generate an area plot with geom_area(). Let’s try it with a plot of wind power generation over time. Note that to change the color on this plot, we use fill = rather than col =. This is because we want to fill the geometric object with a color; using col = would create an outline around the plot (try it out to see the difference).

generation %>% 
  ggplot() + 
  geom_area(aes(x=datetime, y=wind), fill="darkblue") + 
  labs(title="Hourly wind power generation, Sept 3-9", x="Hour", y="Output (MW)")

Box plot

Let’s explore one more plot that visualizes a different kind of relationship. Instead of plotting trends over time, let’s plot the distribution of each source’s output using a box plot (aka a box-and-whisker plot), which shows the 25% quantile, median, and 75% quantile in a “box”, with “whiskers” that extend to the most extreme values within 1.5 times the interquartile range of the box. Any points beyond the whiskers are drawn individually as outliers. We can do so using the function geom_boxplot(); note again that this is a case where having long-format data is useful.

long_gen %>% 
  ggplot() + 
  geom_boxplot(aes(x=source, y=output)) + 
  labs(title="Amount of energy generated by each source, Sept 3-9", x="Source type", y="Output (MW)")
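If you want to check exactly which numbers a box plot is drawing, one option is to inspect the statistics ggplot computes. The sketch below uses the built-in mtcars data (an assumption for illustration, not the energy data) and ggplot_build(), which returns the computed values for each layer.

```r
library(ggplot2)

# One box of fuel economy (mpg) per number of cylinders
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Extract the values behind each box: whisker ends, box edges, and middle line
box_stats <- ggplot_build(p_box)$data[[1]][, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats

# The middle line is the median of each group
tapply(mtcars$mpg, mtcars$cyl, median)
```

Comparing the two outputs shows that the middle column matches the group medians, not the means.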

Multiple geometries in one plot

The above examples show cases where one geometric object is plotted per graphic. ggplot allows you to add multiple geometric objects as layers. Below is a plot of large hydro power generation over time, first shown using a line with geom_line(). On top of that line plot, we can also add a smoothed line with geom_smooth(), which plots smoothed conditional means (estimated using a loess regression) in order to aid observation of trends in cases of overplotting.

generation %>% 
  ggplot() + 
  geom_line(aes(x=datetime, y=large_hydro), col="turquoise3") + 
  geom_smooth(aes(x=datetime, y=large_hydro)) + 
  labs(title="Hydroelectric (large) generation per hour, Sept 3-9", x="Hour", y="Output (MW)")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

An alternative syntax is to do the aesthetic mapping in the ggplot() function, after which all geom functions will adopt the previously defined aesthetic mapping. The below code generates a plot identical to the above.

merged_energy %>% 
  ggplot(aes(x=datetime, y=large_hydro)) + 
  geom_line(col="turquoise3") + 
  geom_smooth() + 
  labs(title="Hydroelectric (large) generation per hour, Sept 3-9", x="Hour", y="Output (MW)")

Usually when we plot multiple geometric objects in one graphic, we are trying to plot the same relationship across different groups in the data. We will explore how this works in the below section on visualizing grouped data.

Other geometric objects

There are many, many more geometric objects that you can create using ggplot. For a full list, see the cheatsheet and reference guide linked above. Several extension packages add even more geometric objects.

Additional layers

Labels

We have already seen how we can specify the title and axis labels of a graphic with the labs() layer. Note that you can also specify a subtitle and caption for the graphic in the same layer. Let’s see this in action using the line plot of imports data we created earlier.

imports %>%
  ggplot() + 
  geom_line(aes(x=datetime, y=imports), col="red") + 
  labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2018",
       caption="Source: California Energy Commission",
       x="Hour", y="Amount imported (MW)")

For more advanced labeling, you can use the annotate() layer, which generates text according to specifications at any point on the coordinate system. See the cheatsheet and reference guide for more details.
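As a quick illustration of annotate(), here is a minimal sketch using the built-in mtcars data. The label text and its position are made up for this example; the x and y values are given in the units of the plot’s axes.

```r
library(ggplot2)

p_annot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Place a text label at data coordinates (x = 4.5, y = 30)
  annotate("text", x = 4.5, y = 30, label = "Heavier cars use more fuel")

p_annot
```

Unlike a geom with aes(), annotate() does not read from the data frame, so it draws the label once rather than once per row.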

Scales

Scales allow you to manipulate the relationship between data values and aesthetics beyond aesthetic mapping. In other words, they “control the details of how data values are translated to visual properties” (quote from reference guide). You can set the transparency of objects using scale_alpha functions, modify color palettes using scale_color and scale_fill functions, adjust the position of axis markers using scale_x and scale_y functions, and more. The cheatsheet helpfully organizes scale functions based on the type of task they are suited for, and you can see a full list of scale functions in the reference guide.

To see an example at work, let’s use the scale_x_datetime() function to manipulate the scaling and display of datetime values of the x-axis. Specifically, we will use date_labels = to define which labels to use for the date values and date_breaks = to define how far apart the breaks between the x-axis ticks should be.

(Note: %H:%M means the datetime should be displayed in 24-hour clock format without any date information; you can see a list of datetime conversion specifications in R by looking up ?strptime.)

imports %>%
  ggplot() + 
  geom_line(aes(x=datetime, y=imports), col="red") + 
  scale_x_datetime(date_labels="%H:%M", date_breaks="12 hours") +
  labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2018",
       x="Hour", y="Amount imported (MW)")

The most common use of scale layers is to specify and adjust color palettes used when plotting grouped data. We will see some examples of this in the below section on that topic.

Themes

Preset themes

ggplot comes with several preset themes, including:

theme_grey(): this is the default
theme_bw(): strip colors, including grey gradients
theme_dark() and theme_light(): these change the backgrounds of the coordinate system
theme_minimal() and theme_void(): these are what they sound like. Try them out!

For a full list, see the reference guide file on complete themes (looking up the help file for any of the above functions will open this file).

There are also many ggplot extension packages that provide additional themes.

Manually modify theme components

Instead of using a pre-set theme, you can modulate many components of the default theme using the theme() function. In this manually controlled theme layer, some of the things you can modify include:

Angle, position, and other characteristics of axis labels (e.g. using the axis.text.x = element_text() argument in the theme() function).
Angle, position, and other characteristics of legends (this is useful when plotting grouped data).
Characteristics of the background of the plot (e.g. using the plot.background = argument).

For a full list of the components that can be modified in the theme layer, see the help file for ggplot2::theme().
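Here is a small sketch that combines a preset theme with manual adjustments, using the mpg data that ships with ggplot2 (an illustrative stand-in for the energy data). Note that the theme() call comes after the preset theme, since a preset theme added later would overwrite the manual settings.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = class, y = hwy, fill = drv)) +
  geom_boxplot() +
  # Start from a preset theme
  theme_minimal() +
  # Then adjust individual components
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # rotate x-axis tick labels
        legend.position = "bottom")                         # move the legend below the plot

p_theme
```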

Coordinate system adjustment

A common coordinate system adjustment layer is coord_flip(), which rotates the plot such that the x and y axes are flipped. Let’s see an example with the bar chart that we generated with geom_col() earlier in the session.

long_gen %>% 
  mutate(date=lubridate::date(datetime)) %>% 
  group_by(date) %>% 
  summarise(output=sum(output)) %>% 
  ggplot() + 
  geom_col(aes(x=date, y=output)) + 
  labs(title="Total energy generated, by day", x="Day", y="Output (MW)") + 
  coord_flip()

You can manipulate the coordinate system in other ways:

coord_cartesian() is the default Cartesian coordinate system used; by explicitly calling this function, you can change the limits of the x and y axes from their defaults
coord_fixed() sets a fixed aspect ratio between the x and y axes
coord_trans() (renamed coord_transform() in newer versions of ggplot2) lets you transform the Cartesian coordinates using functions like sqrt or log
coord_polar() changes the coordinate system to polar coordinates rather than a Cartesian system
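As an example of changing axis limits through the coordinate system, the sketch below calls coord_cartesian() on a scatterplot of the built-in mtcars data (used here only for illustration). Setting limits this way zooms in on a region of the plot without removing any observations from the data.

```r
library(ggplot2)

p_zoom <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Zoom in on cars weighing 2-4 (thousand lbs) with 15-30 mpg
  coord_cartesian(xlim = c(2, 4), ylim = c(15, 30))

p_zoom

# All 32 rows are still part of the plot data; only the visible window changed
nrow(ggplot_build(p_zoom)$data[[1]])
```

This matters most when a layer computes something from the data, such as geom_smooth(), because the computation still uses every observation.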

Stats

There are several stat functions in ggplot that enable you to conduct statistical transformations of your data prior to plotting. stat and geom layers can be used interchangeably, as each stat layer has a geom argument and vice versa. Some reasons that you may want to use a stat layer instead of a geom layer, or specify the stat argument in a geom layer include:

overriding the default setting, e.g. using stat="identity" to specify y values in geom_bar() instead of the default stat="count" which uses counts of each x value for the y-axis.
using a transformed version of a variable for aesthetic mapping, e.g. proportions instead of counts in a bar chart.
drawing attention to statistical transformations in your code.

Since these are relatively rare and more advanced use cases of ggplot, we will not explore them in detail here. If you are interested, see Chapter 3.7 of the R for data science online textbook.
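For a brief taste, here is a sketch of the first two use cases listed above, using the built-in mtcars data. The small counts data frame is made up for this example, and after_stat() assumes ggplot2 3.3.0 or newer.

```r
library(ggplot2)

# Default stat = "count": bar heights are the number of rows per x value
p_count <- ggplot(mtcars) +
  geom_bar(aes(x = factor(cyl)))

# stat = "identity": bar heights come from a y variable we supply ourselves
counts <- data.frame(cyl = c("4", "6", "8"), n = c(11, 7, 14))
p_identity <- ggplot(counts) +
  geom_bar(aes(x = cyl, y = n), stat = "identity")

# Proportions instead of counts, using a variable computed by the stat
p_prop <- ggplot(mtcars) +
  geom_bar(aes(x = factor(cyl), y = after_stat(prop), group = 1))
```

The first two plots show the same bars. Note that geom_col(), which we used earlier, is a shortcut for geom_bar() with stat = "identity".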

Visualizing grouped data

A common task in data visualization is plotting variables that are grouped, in order to make comparisons across groups or to show how a particular trend breaks down on a group level. For example, instead of visualizing energy generation as a whole, we may want to visualize generation from each source relative to other sources.

There are two broad ways to visualize grouped data. One is to generate multiple geometric objects of the same type for each group by indicating the grouping variable using the group = argument in the aes() function. Another is to use facets. We will examine both below. Throughout these examples, you will see one of the uses of converting data to long-format.

Colors and fill

Note that the col = argument is specified within aes() here, as our goal is to map data values to the color aesthetics. When supplied outside the aes() function, the role of col= is to modify the color of the geometric object unrelated to any data values.

Although ggplot will generally interpret a col = argument inside aes() as providing a grouping variable, it is good practice to specify group = anyway.

long_merged_energy %>%
  ggplot() + 
  geom_line(aes(x=datetime, y=output, group=source, col=source)) + 
  labs(title="Output by energy source over time", subtitle="Hourly data from September 3-9, 2018", 
       x="Hour", y="Output (MW)")

Let’s take a look at a more complex plot that uses grouping to generate multiple lines. For example, we might look at the above plot and think that the repeated patterns over multiple days are too noisy. Perhaps we are interested in the average trend over the course of a single day. To visualize this, we can plot the average output of each energy source per hour over the 7 days in the data. Comments are included in the code below describing what each step does. Note how the code for this plot combines several of the topics we have discussed so far (data manipulation before plotting, scales, themes, and labels).

long_merged_energy %>% 
  # Create a variable indicating hour only
  mutate(hour=lubridate::hour(datetime)) %>% 
  # Group data by hour and source
  group_by(hour, source) %>% 
  # Compute mean output for each hour-source unit
  summarise(output=mean(output)) %>% 
  # Pipe data into ggplot
  ggplot() + 
  # Plot lines for output over hour, grouped by source
  geom_line(aes(x=hour, y=output, group=source, col=source), size=1) + 
  # Use Set3 color palette to distinguish lines from each other, and give legend a title
  scale_color_brewer(palette="Set3", name="Energy Source") + 
  # Use dark theme to make colors more visible
  theme_dark() + 
  # Add labels
  labs(title="Average hourly output by energy source", subtitle="Data collected during Sept 3-9, 2018",
       x="Hour", y="Mean output (MW)")

## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.

Remember from earlier that we may need to use either col = or fill = depending on the type of geometric object whose appearance we are trying to change. This is the same when we are mapping data to the colors. For example, in order to change the color of the objects in an area plot or bar chart, we will need to use fill = rather than col =.

Position adjustment

In the above example, all groups are plotted overlaying each other. One way to visualize data more clearly, especially when we are dealing with bars or area plots rather than lines, is to use position adjustment.

For example, let’s plot a column chart with the total energy output per day. Note that we use the fill = argument here to demarcate each group in a different color. By default, ggplot stacks all the groups on top of each other.

long_merged_energy %>% 
  mutate(date=lubridate::date(datetime)) %>% 
  group_by(date, source) %>% 
  summarize(output=sum(output)) %>% 
  ggplot() + 
  geom_col(aes(x=date, y=output, group=source, fill=source)) + 
  labs(title="Energy use by day", x="Day", y="Output (MW)")

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

Instead of stacking the groups on top of each other, we can also use the position = "dodge" argument to create a column per group and arrange them next to each other horizontally.

long_merged_energy %>% 
  mutate(date=lubridate::date(datetime)) %>% 
  group_by(date, source) %>% 
  summarize(output=sum(output)) %>% 
  ggplot() + 
  geom_col(aes(x=date, y=output, group=source, fill=source), position="dodge") + 
  labs(title="Energy use by day", x="Day", y="Output (MW)")

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

A third alternative is to use position = "fill", which normalizes the height of each bar and as such shows the proportion of the data in each bin that consists of each group.

long_merged_energy %>% 
  mutate(date=lubridate::date(datetime)) %>% 
  group_by(date, source) %>% 
  summarize(output=sum(output)) %>% 
  ggplot() + 
  geom_col(aes(x=date, y=output, group=source, fill=source), position="fill") + 
  labs(title="Energy use by day", x="Day", y="Output (MW)")

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

Let’s see another example of how position adjustment works with an area plot, of energy output over time by source. geom_area() defaults to position="stack" when using groups, but I have explicitly defined it here for illustration.

long_merged_energy %>% 
  ggplot() + 
  geom_area(aes(x=datetime, y=output, group=source, fill=source), position="stack") + 
  labs(title="Energy use over time", subtitle="Data collected during September 3-9, 2018", 
       x="Hour", y="Output (MW)")

We might look at this plot and think it’s not particularly helpful. To simplify the above plot, we can do two things: (1) use the categories contained in the regroup data frame to reduce the number of categories; and (2) switch to position="fill" instead of position="stack".

# Make sure regroup data has been imported
regroup <- read_csv("data/ca_energy_regroup.csv")

## Rows: 12 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): type, group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Prepare data frame
long_merged_energy %>% 
  rename(type = source) %>% 
  merge(regroup, by = "type") %>% 
  group_by(datetime, group) %>%
  summarise(output=sum(output)) %>% 
  # Pipe data into ggplot
  ggplot() + 
  geom_area(aes(x=datetime, y=output, group=group, fill=group), position="fill") + 
  labs(title="Energy use over time", subtitle="Data collected during September 3-9, 2018", 
       x="Hour", y="Output (MW)")

## `summarise()` has grouped output by 'datetime'. You can override using the
## `.groups` argument.

Shapes and linetypes

For certain geom functions, such as geom_point(), we can demarcate groups by shape. This is especially helpful in conjunction with line plots. For example, let’s plot total output by day, grouped using the grouping variable from regroup. First, we will plot lines and use colors to demarcate groups. Then, we will plot points and use shapes to demarcate groups.

# Prepare data
long_merged_energy %>%
  rename(type = source) %>% 
  merge(regroup, by = "type") %>% 
  mutate(date=lubridate::date(datetime)) %>% 
  group_by(date, group) %>%
  summarise(output=sum(output)) %>% 
  # Pipe data into ggplot
  ggplot() + 
  geom_line(aes(x=date, y=output, group=group, col=group), size=0.8) +
  geom_point(aes(x=date, y=output, group=group, shape=group)) + 
  labs(title="Output by source group over time", subtitle="Data collected during September 3-9, 2018",
       x="Date", y="Output (MW)")

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

The shape = approach works well for points, but when working with line plots, we can demarcate groups by linetype = instead. For example, instead of combining colored lines and shapes in the above plot, let’s use linetypes to demarcate groups.

# Prepare data
long_merged_energy %>%
  rename(type = source) %>% 
  merge(regroup, by = "type") %>% 
  mutate(date=lubridate::date(datetime)) %>% 
  group_by(date, group) %>%
  summarise(output=sum(output)) %>% 
  # Pipe data into ggplot
  ggplot() + 
  geom_line(aes(x=date, y=output, group=group, linetype=group), size=1) +
  labs(title="Output by source group over time", subtitle="Data collected during September 3-9, 2018",
       x="Date", y="Output (MW)")

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

Sizes and alpha

You can also map the size and alpha (transparency) of geometric objects to data values, which is especially useful for continuous variables. Since the grouping variables we have been working with in the California energy data are categorical, let us briefly return to the gapminder data to show an example. You could create a scatterplot of the relationship between life expectancy and logged GDP per capita using the gapminder data, and map the values of population to the size of the points by using aes(size=pop) in geom_point(). To show how you can incorporate multiple grouping variables, I have also added aes(col=continent).

gapminder07 %>% 
  ggplot() + 
  geom_point(aes(x=log(gdpPercap), y=lifeExp, size=pop, col=continent)) + 
  scale_size_continuous(name="Population") + 
  scale_color_discrete(name="Continent") +
  labs(title="Life expectancy as a function of GDP per capita in 2007", 
       x="Logged GDP per capita", y="Life expectancy")

Facets

While color, fill, shapes, and linetypes are useful ways to plot grouped data, we sometimes want to separate the visualization for each group even more clearly. To do so, we can use faceting, i.e. plot each group on a separate coordinate system. The simplest way to create facets using a grouping variable is facet_wrap(), but you can also control the arrangement of facets or use two grouping variables with facet_grid().

Let’s say we are interested in examining trends in energy generation by source. Using the skills we’ve learned so far, we could generate a line plot as follows.

long_gen %>%
  ggplot() + 
  geom_line(aes(x=datetime, y=output, group=source, col=source), size=1) + 
  labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2018", 
       x="Hour", y="Output (MW)")

But this feels too noisy, especially if our primary goal is to visualize the trend for each source. An alternative strategy is to facet by source rather than using a color aesthetic to separate each source.

long_gen %>% 
  ggplot() + 
  geom_line(aes(x = datetime, y = output)) + 
  facet_wrap(~source) + 
  labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2018", 
       x="Hour", y="Output (MW)")

This is a little better! But note that the scales for each coordinate system are fixed, i.e. the scale limits are the same for each plot. If our main goal is to examine the patterns over time for each source, rather than comparing the sources to each other, we can specify scales="free" in facet_wrap().

long_gen %>% 
  ggplot() + 
  geom_line(aes(x = datetime, y = output)) + 
  facet_wrap(~source, scales="free") + 
  labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2018", 
       x="Hour", y="Output (MW)")

That’s much better! Note that we can combine faceting with the previous skills we’ve learned to incorporate multiple groupings. For example, we can map the source groups in regroup to the color aesthetic in geom_line() as follows. Since this is a fairly complicated mix of data manipulation and visualization, comments are included describing each step.

# Begin data preparation
long_gen %>% 
  # Rename "source" variable to "type", to prepare for merging
  rename(type = source) %>%
  # Merge energy generation data with regroup data by "type" variable
  merge(regroup, by = "type") %>% 
  # Pipe data into ggplot
  ggplot() + 
  # Generate lines of output ~ datetime, color based on "group" variable, and increase size
  geom_line(aes(x=datetime, y=output, group=group, col=group), size=1) + 
  # Adjust color palette for "group" colors and give legend a better name
  scale_color_brewer(palette="Set1", name="Type of energy source") +
  # Create facets by source, with free scales
  facet_wrap(~type, scales="free") + 
  # Add labels
  labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2018", 
       x="Hour", y="Output (MW)") + 
  # Use the black-and-white theme
  theme_bw() + 
  # Customize theme to move the legend to the bottom
  theme(legend.position = "bottom")
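The facet_grid() function mentioned at the start of this section arranges facets in rows and columns defined by two grouping variables. Here is a minimal sketch using the mpg data that ships with ggplot2 (chosen for illustration because it has two convenient categorical variables, drv and cyl).

```r
library(ggplot2)

p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # Rows are defined by drive type (drv), columns by number of cylinders (cyl)
  facet_grid(drv ~ cyl)

p_grid
```

Unlike facet_wrap(), facet_grid() draws a panel for every combination of the two variables, so some panels may be empty when a combination does not occur in the data.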

Final notes

Comment your code

In some of the examples above, you have seen that we can write comments within a flow of %>% or + operators. It is highly recommended that you use such comments when doing complex data manipulation or data visualization tasks, or indeed when doing both together. This enables not only others to read your code, but also aids your own understanding of the code when returning to it at a later time.

Saving images

Graphics generated using ggplot2 can be saved as images in three ways:

Write the result of the code to a named object, which will live in your environment. You can save objects in your environment to an image as usual.
Once the plot has been generated in the “Plots” tab in the bottom-right pane, you can use the “save” icon in that tab to save the image file. You will be able to change the directory and modify the dimensions of the image when you do this.
Use the ggsave() function after your plotting code. By default, ggsave() saves the last plot that was displayed, or you can pass a saved plot object to its plot = argument. You must specify the filename when using ggsave(), and can additionally specify the filepath (if different from working directory) and dimensions of the image, along with many other settings.

Let’s see an example where we combine methods 1 and 3. First, we will save a graphic as an object, and then pass that object to ggsave() to create a .png file in our directory. Note that ggsave() is a regular function call rather than a layer, so it is called on its own rather than added to the plot with +.

# Save a line chart of imports over time
plot_importsovertime <- ggplot(imports) + 
  geom_line(aes(x=datetime, y=imports), col="red") + 
  scale_x_datetime(date_labels="%H:%M", date_breaks="12 hours") +
  labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2018",
       x="Hour", y="Amount imported (MW)")

# Save the plot as an image
ggsave("importsovertime.png", plot = plot_importsovertime, width=5, height=3)

R7: Data visualization with ggplot2

Kumar Ramanathan (with revisions by Richard Paquin Morel)

September 2023