In this tutorial you learn the steps needed to create basic figures in R. You can use these skills to create a poster where you analyze empirical evidence on a policy intervention in the context of the IS-LM model.
You learn how to import data into R, how to filter and select the data that you need, and how to make basic figures using the ggplot2 package.
This tutorial assumes that you have already installed R and RStudio, and have read sections 2, 3.1-3.4, and 4-6 of A (very) short introduction to R and sections 1-4 of How to make any plot in ggplot2?.
It is convenient to create an R project for each assignment that you are working on. A project is basically a folder that stores all files related to the assignment.
You can create a project as follows:
After a project is created, there are two easy ways of accessing it. You can either use the same dropdown window in the top right corner of RStudio that you used to create the project, and click on the name of the project there, or you can find the project folder within your files and click on the file with the .Rproj extension.
tidyverse
Next, you need to install and load the tidyverse package. tidyverse is a collection of packages that allow clean workflows in R. The ggplot2 package that you will use to create figures is one of these included packages.
You need to install the package only once, but you need to load it every time you open and use R. It is good practice to load the package at the top of each script.
install.packages("tidyverse") # install the package
library(tidyverse) # load the package
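Because installation is needed only once, a script that you run repeatedly can check whether the package is already present before installing it. The following is a minimal sketch of that idea using base R's requireNamespace(); it is one possible pattern, not a required step of this tutorial.

```r
# Install tidyverse only if it is not already available on this machine
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Loading is still needed in every new R session
library(tidyverse)
```

With this pattern you can rerun the whole script without reinstalling the package each time.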
Next, you need to import your data, which is stored as a CSV (comma-separated values) file. Normally you would download a file to your project folder and load it in R from there. In this case, the data is available on GitHub, so you can read it directly from the URL below, without having to download it first.
You can import a file with the read_csv() function and assign the result to an object.
data <- read_csv("https://github.com/ucrdatacenter/projects/raw/main/SSCECON207/2022h1/workshop%201/ISLM_data.csv") # import the data from Github
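To see what read_csv() returns without depending on an internet connection, you can try it on a few lines of literal CSV text. The sketch below assumes readr 2.0 or later (where I() marks a string as literal data); the object toy and its values are hypothetical and only borrow a few column names from this tutorial.

```r
library(tidyverse)

# Hypothetical three-row CSV, written out as text for illustration only
toy_csv <- "country,code,year,rate
United States,USA,2006,2.1
United States,USA,2007,1.9
Netherlands,NLD,2006,1.5"

# I() tells read_csv() that the string is the data itself, not a file path
toy <- read_csv(I(toy_csv), show_col_types = FALSE)

glimpse(toy) # column names and column types
head(toy)    # first rows of the data frame
```

The same glimpse() and head() calls work on the real data object and are a quick alternative to opening it from the Environment tab.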
The IS-LM dataset contains the real interest rate, the CPI-based inflation rate, and growth rates in GDP, consumption, investment, and government spending.
For your project you will only need to use part of the data. To see what the data frame looks like, you can view it by clicking on the name of the object in the Environment tab in the top right corner of RStudio.
The file “available_data.csv” (view here) provides a list of the countries and time periods available in the dataset. You can use this file to select a policy intervention to study. For each country in the list, data is available in the period between the years in the “first” and “last” columns.
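If you want to check the available period for a country yourself, you can compute the first and last year per country code. The sketch below uses a small hypothetical data frame (toy) in place of the real data; the column names first and last are chosen to mirror the description above and are an assumption about how the file is laid out.

```r
library(tidyverse)

# Hypothetical data: two countries with different coverage
toy <- tibble(
  code = c("USA", "USA", "USA", "NLD", "NLD"),
  year = c(2005, 2006, 2007, 2010, 2011)
)

# One row per country with the first and last available year
coverage <- toy %>%
  group_by(code) %>%
  summarize(first = min(year), last = max(year))

coverage
```

Replacing toy with the imported data object should give a comparable overview, assuming the data has one row per country and year.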
Once you decide which country and time period you would like to use in your project, you need to filter the data to remove the observations that you do not need. You can do that with the filter() function. For example, if you are looking for US data for 2006-2013, you can filter as follows:
data_US <- data %>%
  filter(year >= 2006, year <= 2013, code == "USA")
The pipe operator (%>%) means that the next function uses the previous result as an input: in this case, it takes the object data as the first argument of the filter() function.
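To make the role of the pipe concrete, the sketch below writes the same filter twice: once with the pipe and once with the data frame passed explicitly as the first argument. The data frame toy is hypothetical.

```r
library(tidyverse)

# Hypothetical data for illustration
toy <- tibble(
  code = c("USA", "USA", "NLD"),
  year = c(2005, 2010, 2010)
)

# With the pipe: toy becomes the first argument of filter()
piped <- toy %>% filter(year >= 2006, code == "USA")

# Without the pipe: the same call written out in full
nested <- filter(toy, year >= 2006, code == "USA")

identical(piped, nested) # the two forms give the same result
```

The piped form becomes more useful as you chain several steps, because the code reads from top to bottom in the order the steps are applied.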
The double equal sign in code == "USA" means that instead of assigning the string "USA" to a variable named code (which is what code = "USA" would do), you are testing whether the contents of the variable code and the string "USA" are equal to each other. Similarly, the less-than-or-equal (<=) and greater-than-or-equal (>=) signs test the relationship between the contents of the variable year and the specified year. If the result of the test is true, the function keeps the observation, and if the result is false, the observation is removed. You can filter by country using either the variable country, which has full country names, or the variable code, which has country codes; the two approaches give the same result. Remember that R is case-sensitive, so a filter set as code == "usa" will not match any observations.
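The tests described above can be tried on a small example. The sketch below uses a hypothetical data frame (toy) with the same country and code columns to show the logical values behind a filter, the equivalence of filtering by name or by code, and the effect of using the wrong case.

```r
library(tidyverse)

# Hypothetical data for illustration
toy <- tibble(
  country = c("United States", "United States", "Netherlands"),
  code    = c("USA", "USA", "NLD"),
  year    = c(2005, 2010, 2010)
)

# The comparison returns one TRUE or FALSE per row; filter() keeps the TRUE rows
toy$code == "USA"

# Filtering by country code or by full country name selects the same rows
by_code <- toy %>% filter(code == "USA")
by_name <- toy %>% filter(country == "United States")
identical(by_code, by_name)

# R is case-sensitive: no code equals "usa", so no rows are kept
wrong_case <- toy %>% filter(code == "usa")
nrow(wrong_case)
```

An unexpectedly empty result after filtering is therefore a good reason to check the spelling and capitalization of the value you are comparing against.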
In some cases it is easier to work with a data frame that only includes the variables that you need. You can select these variables using the select() function. You can either list the variables that you want to keep as the arguments of the function, or you can list the variables you want to remove, using a - sign in front of the variable name.
For example, you can select only the year and the components of GDP:
data_US %>%
  select(year, consumption, investment, govt_spending) # only keep these four variables
## # A tibble: 8 × 4
##    year consumption investment govt_spending
##   <dbl>       <dbl>      <dbl>         <dbl>
## 1  2006       2.59        3.06        1.22
## 2  2007       2.28       -1.62        1.60
## 3  2008       0.576      -6.21        2.45
## 4  2009      -0.251     -16.6         4.26
## 5  2010       1.53       10.5         0.0111
## 6  2011       0.761       4.22       -3.07
## 7  2012       0.831       7.66       -1.46
## 8  2013       0.858       4.67       -1.89
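The minus-sign form of select() mentioned earlier works the other way around: you name the variables to drop rather than the ones to keep. A minimal sketch on a hypothetical data frame (toy):

```r
library(tidyverse)

# Hypothetical data for illustration
toy <- tibble(
  year        = c(2006, 2007),
  code        = c("USA", "USA"),
  consumption = c(2.6, 2.3),
  investment  = c(3.1, -1.6)
)

# Keep variables by listing them ...
kept <- toy %>% select(year, consumption, investment)

# ... or drop the one variable you do not need with a minus sign
dropped <- toy %>% select(-code)

identical(kept, dropped) # both approaches leave the same three columns
```

Dropping is usually shorter when you want to keep most columns, and listing is clearer when you only need a few.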
If you want to plot these variables on a single figure, it helps to turn the data into long format.
Currently each row of the data corresponds to one year, and three columns show the values of three variables; the data is in wide format. In long format you would have three rows corresponding to one year, a single column of values, and a column specifying which component of GDP that value corresponds to.
You can convert between these two forms using the pivot_longer() and pivot_wider() functions.
Below is an example of using pivot_longer(). If you need more help on the function arguments, the help files of the function provide a good explanation. You can access these help files by running ?pivot_longer() and ?pivot_wider().
data_US %>%
  select(year, consumption, investment, govt_spending) %>%
  pivot_longer(cols = -year, names_to = "component", values_to = "value") # convert to long format
## # A tibble: 24 × 3
##     year component      value
##    <dbl> <chr>          <dbl>
##  1  2006 consumption    2.59
##  2  2006 investment     3.06
##  3  2006 govt_spending  1.22
##  4  2007 consumption    2.28
##  5  2007 investment    -1.62
##  6  2007 govt_spending  1.60
##  7  2008 consumption    0.576
##  8  2008 investment    -6.21
##  9  2008 govt_spending  2.45
## 10  2009 consumption   -0.251
## # … with 14 more rows
The code cols = -year means that you want to transform all columns of the data frame except for year to long format. This notation is equivalent to cols = c(consumption, investment, govt_spending), but in this case cols = -year is shorter and more convenient.
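The equivalence of the two cols specifications, and the way pivot_wider() reverses the operation, can be checked on a small example. The sketch below uses a hypothetical wide data frame (wide) with the same column names as the tutorial data.

```r
library(tidyverse)

# Hypothetical wide data: one row per year, one column per GDP component
wide <- tibble(
  year          = c(2006, 2007),
  consumption   = c(2.6, 2.3),
  investment    = c(3.1, -1.6),
  govt_spending = c(1.2, 1.6)
)

# Two equivalent ways of choosing the columns to pivot
long_a <- wide %>%
  pivot_longer(cols = -year, names_to = "component", values_to = "value")
long_b <- wide %>%
  pivot_longer(cols = c(consumption, investment, govt_spending),
               names_to = "component", values_to = "value")
identical(long_a, long_b)

# pivot_wider() takes the long data back to one row per year
back <- long_a %>%
  pivot_wider(names_from = component, values_from = value)
back
```

Here long_a has six rows (two years times three components), and back has the same columns and values as wide.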
ggplot figures
Figures made with ggplot are built from several layers. You always use the same basic code structure to create a wide range of figures:
- The ggplot() function creates a blank canvas for you to work on.
- Layers such as geoms, labels, and themes are added to the canvas and connected with + signs.
The variables that you want to display on the graph must always be wrapped in an aes() function, which stands for aesthetics. This specification tells R to determine the value of the aesthetic (x and y axes, colors, groups, line types, etc.) based on the value of the variable. aes() can be specified both in the main ggplot() function (in which case it will apply to all geoms) or within a geom_...() function (then it only applies to that geom).
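The difference between setting aes() in ggplot() and inside a geom can be seen by building the same figure both ways. This is a minimal sketch on hypothetical data (toy); the object names p_shared and p_local are made up for the illustration.

```r
library(tidyverse)

# Hypothetical data for illustration
toy <- tibble(
  year = 2006:2010,
  rate = c(2.1, 1.9, 0.4, -0.2, 0.3)
)

# aes() in ggplot(): the mapping is shared by every geom that follows
p_shared <- ggplot(toy, aes(x = year, y = rate)) +
  geom_line() +
  geom_point()

# aes() inside each geom: every geom needs its own mapping
p_local <- ggplot(toy) +
  geom_line(aes(x = year, y = rate)) +
  geom_point(aes(x = year, y = rate))

p_shared # displays the same figure as p_local
```

Putting the mapping in ggplot() avoids repetition when all geoms use the same variables, while geom-level mappings are useful when only one layer needs an extra aesthetic.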
The following code creates a time-series plot of the US interest rate. Comments explain what each line of code does.
data_US %>% # feed the previously created data frame into the ggplot function call
  ggplot() + # create the ggplot "blank canvas"
  geom_line(aes(x = year, y = rate)) + # set the variables to be displayed on each axis
  labs(title = "Interest rate in the US", # add a title to the plot
       x = "Year", # set the x axis label
       y = "Interest rate (%)") + # set the y axis label
  theme_light() # change the color scheme and layout of the plot to a different theme
If you are happy with how a figure looks, you can save it using the ggsave() function, which by default saves the most recently displayed plot. As the function argument you need to specify the file name that you want to use:
ggsave("interest_rate.jpg") # save the plot
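ggsave() also accepts a specific plot object and the size of the output, which can help when a figure has to fit a poster layout. The sketch below uses hypothetical data and writes to a temporary folder; the dimensions and resolution are arbitrary example values, not recommendations from this tutorial.

```r
library(tidyverse)

# Hypothetical data and plot for illustration
toy <- tibble(
  year = 2006:2010,
  rate = c(2.1, 1.9, 0.4, -0.2, 0.3)
)
p <- ggplot(toy, aes(x = year, y = rate)) +
  geom_line() +
  theme_light()

# Save a named plot object with an explicit size and resolution
out_file <- file.path(tempdir(), "interest_rate_toy.png")
ggsave(out_file, plot = p, width = 16, height = 10, units = "cm", dpi = 300)

file.exists(out_file) # check that the file was written
```

Passing the plot explicitly means the right figure is saved even if you have displayed other plots in the meantime.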
If instead of a time series plot, you want to make a scatterplot, you would use geom_point() instead of geom_line(). Additionally, you can use geom_text() to display labels next to the points. In that case, you need to set an additional aesthetic: the label you want displayed. Other settings are also possible within each geom, such as size, color, transparency, etc.
data_US %>%
  ggplot(aes(x = GDP, y = rate)) + # these aesthetics apply to all geoms
  geom_point() + # make a scatterplot
  # add the years as labels next to each point; nudge_y shifts them slightly up so they don't overlap the points
  geom_text(aes(label = year), nudge_y = 0.25, size = 3) +
  labs(title = "Output and interest rate in the US (2006-2013)",
       x = "GDP growth (%)",
       y = "Interest rate (%)") +
  theme_light()
ggsave("scatterplot.jpg")
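Settings such as size, color, and transparency are written outside aes() when they should be the same for every point, in contrast to aesthetics inside aes(), which vary with a variable. A minimal sketch on hypothetical data (toy), with arbitrary example values for the settings:

```r
library(tidyverse)

# Hypothetical data for illustration
toy <- tibble(
  year = 2006:2010,
  GDP  = c(2.8, 2.0, 0.1, -2.6, 2.7),
  rate = c(2.1, 1.9, 0.4, -0.2, 0.3)
)

p <- ggplot(toy, aes(x = GDP, y = rate)) +
  # fixed settings go outside aes(): every point gets the same look
  geom_point(color = "steelblue", size = 3, alpha = 0.6) +
  # the label varies by row, so it is mapped inside aes()
  geom_text(aes(label = year), nudge_y = 0.25, size = 3)

p
```

If you placed color = "steelblue" inside aes() instead, ggplot would treat the text as a data value and add a legend rather than coloring the points blue.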
You can also connect ggplot figures to data cleaning/manipulation processes using the %>% operator. The code below connects the data manipulation shown previously to a time-series plot of the components of GDP:
data_US %>%
  select(year, consumption, investment, govt_spending) %>%
  pivot_longer(cols = -year, names_to = "component", values_to = "value") %>%
  ggplot() +
  geom_line(aes(x = year, y = value, color = component)) +
  labs(title = "Components of GDP in the US",
       x = "", # an empty string means that no axis title will be added
       y = "Percentage change") +
  theme_light()
ggsave("GDP.jpg")
You can find additional materials, including useful data sources, and online materials for using R and ggplot on the Data Center website.