In this tutorial you learn the steps needed to create basic figures in R. You will need these skills in the poster and presentation assignments in the course.
You learn how to find and import data into R, how to clean the data that you need, and how to make basic figures using the ggplot2 package.
This tutorial assumes that you have already installed R and RStudio. If you have not done so yet, please follow this installation tutorial.
Before you proceed, read and follow along with sections 2.3-2.5, 3.1-3.4, and 4-6 of A (very) short introduction to R.
In addition, create a new project for the Macroeconomics course, and install the tidyverse package, as explained here.
In each section of this tutorial you will find videos guiding you through the processes explained in the text. You should first read the relevant section as an introduction to the contents, then watch and follow along with the video presenting the steps in more detail.
If you have any questions after watching a video, check the help files of the relevant functions (access them by running ?functionname), look at the more extensive Data Center tutorials, try googling your question, or email datacenter@ucr.nl to attend office hours. The code shown in the videos is available on GitHub.
Once you have a research question or topic you would like to look into, you need to determine what data you need: establish a list of the indicators and countries, and the time period. Once you know what you’re looking for, you can search for a source.
For macroeconomic indicators, reliable data sources are the World Bank Databank and the OECD database. In addition, you can find a more extensive list of data sources here.
In the following we will work with GDP data from the OECD (link) and CO2 emissions data from the World Bank (link).
Before importing data to R, you need to download it to your computer and save it in your project folder.
CSV files are a convenient and simple way of storing data. CSV stands for “comma-separated values”: the raw data is a text file where each line of text is a row of data, and values within a row are separated by commas. In most cases your computer will automatically open CSV files in Excel, where they are displayed as a table. CSV is the most common data format and also one of the easiest to import into R.
In order to import a CSV file to R, open your project in RStudio. In the top left corner of RStudio find and click File → Import Dataset → From Text (readr). In the data import pop-up window click “Browse” and find the GDP data that you moved to your project folder. RStudio will try to automatically detect the format of your data: the result is shown in the Data preview window. If something looks wrong, try changing some of the settings below the data preview until the preview looks correct. Additionally, you should change the name of the data to something sensible, e.g. GDP. Once all the settings are ready, you can copy the contents of the “Code preview” window into your script, and use it to import your data. As long as you start your script by loading the tidyverse library, you don’t need to copy library(readr), as readr is a part of tidyverse.
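The import code that ends up in your script typically looks like the sketch below. So that the example runs on its own, it first writes a tiny stand-in CSV file to a temporary location; the column names and values are made up for illustration. In your own script you would instead point read_csv() to the file in your project folder.

```r
library(tidyverse)

# Stand-in for a downloaded file: a tiny CSV with made-up values,
# written to a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("country,year,gdp",
    "Netherlands,2020,100",
    "Netherlands,2021,105"),
  csv_path
)

# Import the CSV file and store it under a sensible name
GDP <- read_csv(csv_path)

GDP
```

read_csv() guesses the type of each column (here text for country and numbers for year and gdp); the Data preview window lets you check and adjust those guesses.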
Importing an Excel file follows the same process as importing a CSV file, but you should choose From Excel instead of From Text (readr) when importing the dataset. Every following step is the same, except that you do need to load the readxl package separately, as it is not loaded by library(tidyverse).
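A minimal sketch of an Excel import is shown below. It uses an example workbook that ships with the readxl package, so it does not depend on any downloaded file; in your own script you would use the name of the Excel file in your project folder, and the object name example_data is only a placeholder.

```r
library(tidyverse)
library(readxl)  # not loaded by library(tidyverse), so load it separately

# readxl_example() returns the path to an example workbook bundled with readxl;
# replace it with your own file name, e.g. "my_data.xlsx" (hypothetical)
xlsx_path <- readxl_example("datasets.xlsx")

# Import the first sheet of the workbook
example_data <- read_excel(xlsx_path)

example_data
```

If your workbook has several sheets, the sheet argument of read_excel() selects which one to import.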
If you are interested, this tutorial gives more information on importing files in other data formats and on writing your own import code.
Data files downloaded from online sources are not always in a convenient format for analysis in R: variable names are not always intuitive or consistent, you might need to make some additional calculations, recode variables, or remove some variables/observations. You can make a list of steps you need to take by observing the structure of your data.
You can view your data with the function View(dataname), while the output of summary(dataname) shows you all the variables you have, including their type and some details about their content. In case you need more information about what your variables mean, you can look at the documentation on the website of your data source.
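As a small illustration, the sketch below inspects a toy dataset with made-up values. The name gdp_data and its columns are assumptions for this example only. glimpse() is not mentioned above; it is an additional tidyverse function that prints a compact overview.

```r
library(tidyverse)

# Toy dataset with made-up values
gdp_data <- tibble(
  country = c("Netherlands", "Netherlands", "Belgium", "Belgium"),
  year    = c(2020, 2021, 2020, 2021),
  gdp     = c(100, 105, 50, 52)
)

# View(gdp_data) opens the data as a table in RStudio (interactive use only)

# Overview of every variable: its type and basic summary statistics
gdp_summary <- summary(gdp_data)
gdp_summary

# Compact overview of variable names, types, and the first few values
glimpse(gdp_data)
```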
A useful tool to keep in mind when planning your data cleaning is the pipe operator %>% (keyboard shortcut: Ctrl/Command+Shift+M). Using it in a workflow means that the next function uses the previous result as an input, and helps you work in a linear fashion. E.g. “first I need to filter observations for the Netherlands, then I need to use the resulting dataset to calculate GDP growth rates”.
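The workflow in this example can be sketched as follows, once with nested function calls and once with the pipe. The dataset and its values are made up, and the growth rate formula (percentage change from the previous row, using lag()) is one common way to compute it, shown here as an assumption rather than the course's prescribed method.

```r
library(tidyverse)

# Toy dataset with made-up values, already sorted by year within each country
gdp_data <- tibble(
  country = c("Netherlands", "Netherlands", "Netherlands",
              "Belgium", "Belgium", "Belgium"),
  year    = c(2020, 2021, 2022, 2020, 2021, 2022),
  gdp     = c(100, 110, 121, 50, 55, 66)
)

# Without the pipe: nested calls have to be read from the inside out
nl_nested <- mutate(
  filter(gdp_data, country == "Netherlands"),
  growth = (gdp - lag(gdp)) / lag(gdp) * 100
)

# With the pipe: first filter, then use the result to calculate growth rates
nl_piped <- gdp_data %>%
  filter(country == "Netherlands") %>%
  mutate(growth = (gdp - lag(gdp)) / lag(gdp) * 100)

nl_piped
```

Both versions give the same result; the piped version simply lists the steps in the order in which they happen. The first growth rate is missing (NA) because there is no earlier year to compare with.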
Often you don’t need all variables included in your downloaded dataset. Then you can select the subset of variables you need (or the subset you would like to remove). The function for doing so is select(), and the arguments of the function are your dataset, followed by the names of the variables you would like to keep (or remove, if the variable names are preceded by -). The following examples show how to use the function without and with the pipe operator.
# keep only variables x and y
select(data, x, y)
# remove variables x and y
data %>%
select(-x, -y)
Especially if you want to combine datasets from different sources, you may want variable names to be consistent across datasets or convenient to work with. This problem can be easily fixed using the rename() function, which has the format:
rename(data, "new_name" = "old_name")
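A runnable sketch combining select() and rename() is shown below. The raw column names (LOCATION, TIME, Value, Flag Codes) are hypothetical examples of the kind of names a downloaded file might have, and the values are made up.

```r
library(tidyverse)

# Toy dataset with made-up values and inconvenient (hypothetical) variable names
raw_data <- tibble(
  LOCATION     = c("NLD", "BEL"),
  TIME         = c(2020, 2020),
  Value        = c(100, 50),
  `Flag Codes` = c(NA, NA)
)

clean_data <- raw_data %>%
  # keep only the variables needed for the analysis
  select(LOCATION, TIME, Value) %>%
  # new name on the left, old name on the right
  rename(country = LOCATION, year = TIME, gdp = Value)

clean_data
```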
Often you only need a subset of your data, e.g. observations from a particular country or after a given year. You can filter your dataset using the filter() function and logical expressions (e.g. keep if the value for the variable year is greater than 2000, or keep if the value for the variable country is “Netherlands”). You need to use the logical operators: == means equal to, != means not equal to, >=, <=, > and < define comparisons, %in% means “is one of”, and you can combine multiple expressions with AND (&) and OR (|). If the result of the test is true, the filter() function keeps the observation, and if the result is false, the observation is removed. For instance, to filter for Dutch or Belgian observations between the years 2000 and 2020:
data %>%
filter((country == "Netherlands" | country == "Belgium") & year >= 2000 & year <= 2020)
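The same filter can be written more compactly with %in%. The sketch below runs both versions on a toy dataset with made-up values and shows that they keep the same observations; separating conditions with commas inside filter() has the same effect as combining them with &.

```r
library(tidyverse)

# Toy dataset with made-up values
gdp_data <- tibble(
  country = c("Netherlands", "Netherlands", "Belgium", "Belgium",
              "Germany", "Germany"),
  year    = c(1995, 2010, 2010, 2021, 2010, 2021),
  gdp     = c(80, 100, 50, 60, 300, 330)
)

# Version with OR and AND, as in the text
with_or <- gdp_data %>%
  filter((country == "Netherlands" | country == "Belgium") &
           year >= 2000 & year <= 2020)

# Equivalent version with %in%; commas between conditions mean AND
with_in <- gdp_data %>%
  filter(country %in% c("Netherlands", "Belgium"),
         year >= 2000,
         year <= 2020)

with_in
```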
In some cases you might need to do additional calculations with your data. For example you may want to calculate GDP growth rates from annual GDP observations, calculate averages over time, or treat a number as a character string. Helpful functions in this case are the following:
- mutate(): to create new variables (or modify existing variables) using functions or calculations - think of it as adding a new column to your data frame.
- summarize(): to create new variables using functions, using all rows from your data frame (or from a part of your data frame) - think e.g. if you have a data frame of GDP data from 20 years, and you want to calculate the average value of GDP in this dataset.
- group_by(): to specify grouping variables before using mutate() or summarize() - think e.g. if you have GDP data from 20 years from two countries, and you want to calculate average GDP over time separately for the two countries.

Compare the following two simple datasets:
Data A:
year | country | value |
---|---|---|
2022 | countryA | 10 |
2022 | countryB | 12 |
2023 | countryA | 14 |
2023 | countryB | 15 |
Data B:
year | countryA | countryB |
---|---|---|
2022 | 10 | 12 |
2023 | 14 | 15 |
Data A is in long format, and Data B is in wide format. The tables contain the same information, but sometimes one format is more convenient than the other.
You can convert between these two forms using the pivot_longer() (wide to long) and pivot_wider() (long to wide) functions. To use pivot_longer() you need to specify which columns you’d like to turn into a single column: e.g. to go from Data B to Data A, you’d use the argument cols = c(countryA, countryB) (or equivalently, cols = -year). To use pivot_wider(), you need to specify which column to use for variable names, and which column for variable values: going from Data A to Data B would use the arguments names_from = country, values_from = value.
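The conversion between Data A and Data B can be sketched as follows. The arguments names_to and values_to are not mentioned above; they are used here to give the new columns the same names as in Data A (without them, pivot_longer() uses the default names "name" and "value").

```r
library(tidyverse)

# Data A (long format) and Data B (wide format) from the tables above
data_a <- tibble(
  year    = c(2022, 2022, 2023, 2023),
  country = c("countryA", "countryB", "countryA", "countryB"),
  value   = c(10, 12, 14, 15)
)
data_b <- tibble(
  year     = c(2022, 2023),
  countryA = c(10, 14),
  countryB = c(12, 15)
)

# Wide to long: turn the two country columns into a single column
long_from_b <- data_b %>%
  pivot_longer(cols = c(countryA, countryB),
               names_to = "country",
               values_to = "value")

# Long to wide: one column per country
wide_from_a <- data_a %>%
  pivot_wider(names_from = country, values_from = value)

long_from_b
wide_from_a
```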
If you want to work with variables from multiple dataframes (e.g. plotting them on the same plot), you need to combine those dataframes.
There are multiple ways to combine data frames. The simplest is row-binding: you take two data frames that have the same variables and place one below the other. To do this, use the bind_rows() function, with the data frames to combine as its arguments.
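A minimal sketch of row-binding, using two toy data frames with made-up names and the values from Data A above:

```r
library(tidyverse)

# Two data frames with the same variables, one per year
data_2022 <- tibble(
  year    = 2022,
  country = c("countryA", "countryB"),
  value   = c(10, 12)
)
data_2023 <- tibble(
  year    = 2023,
  country = c("countryA", "countryB"),
  value   = c(14, 15)
)

# Place the second data frame below the first
stacked <- bind_rows(data_2022, data_2023)

stacked
```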
However, most of the time you need something more complicated than row-binding. E.g. if in both data frames your observations are identified by the country and the year, you want to match up the observations so that each row contains data from only one country-year combination. If your identifying variables have the same names in both datasets, you can replace the bind_rows() function with full_join() and specify the data frames to combine as the arguments in the same way.
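The sketch below joins two toy data frames (made-up names and values) by country and year. The by argument is optional when the identifying variables have the same names in both data frames, but spelling it out makes the matching explicit.

```r
library(tidyverse)

# Toy data frames with made-up values; each lacks one country-year combination
gdp_data <- tibble(
  country = c("countryA", "countryB", "countryA"),
  year    = c(2022, 2022, 2023),
  gdp     = c(10, 12, 14)
)
co2_data <- tibble(
  country = c("countryA", "countryB", "countryB"),
  year    = c(2022, 2022, 2023),
  co2     = c(5, 6, 7)
)

# Match observations on country and year; keep all rows from both data frames
combined <- full_join(gdp_data, co2_data, by = c("country", "year"))

combined
```

Because full_join() keeps every country-year combination that appears in either data frame, combinations that are missing from one of them get NA for that data frame's variables.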
ggplot figures
ggplot2 is a powerful R package for creating figures. Figures made with ggplot are built from several layers. You always use the same basic code structure to create a wide range of figures:

- The ggplot() function creates a blank canvas for you to work on.
- Layers are added to the canvas one by one, connected with + signs.

The variables that you want to display on the graph must always be wrapped in an aes() function, which stands for aesthetics. This specification tells R to determine the value of the aesthetic (x and y axes, colors, groups, line types, etc.) based on the value of the variable. aes() can be specified either in the main ggplot() function (in which case it will apply to all geoms) or within a geom_...() function (then it only applies to that geom).
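The basic structure can be sketched as follows, using a toy dataset with made-up values. The example uses geom_line() and geom_point() as layers and shows the two places where aes() can be specified.

```r
library(tidyverse)

# Toy dataset with made-up values
gdp_data <- tibble(
  year = c(2019, 2020, 2021, 2022),
  gdp  = c(100, 96, 101, 105)
)

# aes() in ggplot(): the mapping applies to every geom added with +
p_global <- ggplot(gdp_data, aes(x = year, y = gdp)) +
  geom_line() +
  geom_point()

# aes() inside a geom: the mapping applies to that geom only
p_local <- ggplot(gdp_data) +
  geom_line(aes(x = year, y = gdp))

p_global
```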
For line charts you use geom_line(), and for scatterplots geom_point(). You can fit a line on your scatterplot with geom_smooth(). If you want to fit a straight line, add method = "lm" as an argument inside geom_smooth().
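A scatterplot with a fitted straight line might look like the sketch below. The dataset, its variable names, and its values are made up for illustration.

```r
library(tidyverse)

# Toy dataset with made-up values
emissions <- tibble(
  gdp_per_capita = c(10, 20, 30, 40, 50, 60),
  co2_per_capita = c(2.1, 3.9, 5.2, 6.8, 8.1, 9.7)
)

p_scatter <- ggplot(emissions, aes(x = gdp_per_capita, y = co2_per_capita)) +
  geom_point() +                # one point per observation
  geom_smooth(method = "lm")    # straight line fitted through the points

p_scatter
```

Without method = "lm", geom_smooth() fits a flexible curve instead of a straight line.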
If you want to change the color of a point/line based on the value of a variable, specify color = variable inside the aes() function. If you would like all points/lines to be a particular color, specify color = "blue" outside the aes() function but still inside the geom you’d like to modify. Each geom’s help file lists all characteristics that you can modify.
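The difference between the two ways of setting colors can be sketched as follows, again with a toy dataset of made-up values. In the second plot, group = country is added so that each country still gets its own line even though all lines share one color.

```r
library(tidyverse)

# Toy dataset with made-up values
gdp_data <- tibble(
  country = c("Netherlands", "Netherlands", "Belgium", "Belgium"),
  year    = c(2020, 2021, 2020, 2021),
  gdp     = c(100, 105, 50, 52)
)

# Color depends on the value of a variable: inside aes()
p_by_country <- ggplot(gdp_data, aes(x = year, y = gdp, color = country)) +
  geom_line()

# One fixed color for all lines: outside aes(), inside the geom
p_blue <- ggplot(gdp_data, aes(x = year, y = gdp, group = country)) +
  geom_line(color = "blue")

p_by_country
```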
Once you have a base plot, you can change the title and axis labels (always make sure to use clear labels!). Once you’re happy with your plot, you can save it in your project folder by using ggsave("filename.jpg"). This function saves the last plot you created, and you can also use other file formats such as .png or .pdf.
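One way to set the title and axis labels is the labs() function, as in the sketch below (toy data with made-up values). To keep the example from writing into your project folder, it saves the plot as a PDF in a temporary folder; in your own script you would pass a file name such as "filename.jpg" instead. The plot, width, and height arguments are optional.

```r
library(tidyverse)

# Toy dataset with made-up values
gdp_data <- tibble(
  year = c(2019, 2020, 2021, 2022),
  gdp  = c(100, 96, 101, 105)
)

p <- ggplot(gdp_data, aes(x = year, y = gdp)) +
  geom_line() +
  labs(
    title = "GDP over time (made-up data)",
    x = "Year",
    y = "GDP"
  )

# Save the plot; the file format follows from the file extension
out_path <- file.path(tempdir(), "gdp_plot.pdf")
ggsave(out_path, plot = p, width = 6, height = 4)
```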
If you would like to see more ways of plotting with ggplot, check out the R Graph Gallery or some of the other useful links here.
Sections 1-4 of How to make any plot in ggplot2? give particularly good explanations for beginners.