R can handle practically any type of data, from simple text files to files used by other (not necessarily open-source) software and complex databases. This gives users a lot of flexibility in terms of data sources and formats.
In addition to using your own data (e.g. as exported from a survey), the Data Center keeps a continuously updated list of useful datasets by discipline, accessible here.
In the following, we’ll discuss how to import and export from and to various file formats, and discuss a number of available packages to do so.
Start by reading the introduction to importing the most common file types (text files, Excel, SPSS, Stata) here. It is good to be aware of the Import dataset button and use it when needed, but in the long run it is easier to be aware of the available import functions and use them directly. The rest of this section gives more information and examples of importing data from different file formats and different levels of tidiness.
We will use the following packages for importing different file formats:
library(tidyverse)
library(readxl) # for Excel
library(haven) # for SPSS, Stata, SAS
In addition, the rio
package provides a generic import function,
however, it is simply a wrapper for many of the other import functions
shown below. While it may be easier to use the same import function for
many file formats, rio
redirects you to the original functions if you
look for the possible function arguments, therefore for debugging it is
better practice to use the “original” functions. In some cases, rio
can read URL file paths that readr
and readxl
can’t.
In the following, we’ll work with some example data of student characteristics and grades. First, download this zip-file from GitHub, and extract it into a data folder within your apprenticeship project directory. We now import each file, explaining the packages, functions, and function arguments used. These files are all different subsets of the same dataset on student characteristics and grades at a university (original source).
student1.csv
is a comma-separated text file. Opening it in a notepad
or Excel shows that the column separators are commas (,). The
read_csv()
function from the previous workshop expects commas as a
separator, while read_csv2()
expects semicolons (common with
e.g. Dutch language settings). Since CSV files are a form of delimited
text files, we can also use the more versatile read_delim()
function
specifying the delimiter as the argument.
In the following, we use the read_csv()
(and read_delim()
) function
to import the data, and assign the resulting object to an object called
student1
with the assignment operator <-
. The student1
object is
now a tibble in the R environment: you can find the object in the
Environment tab in RStudio, and view the data by clicking on the object
name or running View(student1)
in the Console.
student1 <- read_csv("data/student1.csv")
student1 <- read_delim("data/student1.csv", delim = ",")
In this case, we used read_csv()
only specifying its one mandatory
argument: the file path. When using read_delim()
, we also specified an
optional argument: we defined the delimiter as a comma, thereby
overriding the default function behavior. To learn more about the
mandatory and optional arguments of a function, and find out what the
default behaviors are, you can use the ?
operator followed by the
function name, e.g. ?read_delim
in the Console to open the help file
of a function (or use the search bar of the Help tab).
student2.tab
is also a delimited text file. Opening it in a notepad
shows that the delimiter is a tab. The notation for tab whitespace is
\t
, which we can specify in the delim
argument. Like the “.tab” file
extension, “.tsv” is also a tab-separated text file, so the more
specialized read_tsv()
function also works.
If you load this data without optional arguments, you’ll see that
instead of the correct number of columns, we get a single column, with
the variable name containing some metadata. To get the correct number of
columns, we need to skip the first row of the data that contains this
metadata, as the actual data starts from the second row. We can use the
skip
argument to skip the first row. If we use one of the relevant
import functions, and assign the outcome to the student2
object, we
can see that the data is now correctly imported and shows up in the
environment next to student1
as another tibble.
student2 <- read_delim("data/student2.tab", delim = "\t", skip = 1)
student2 <- read_tsv("data/student2.tab", skip = 1)
student3.xlsx
is an Excel file. To import Excel file we need the
read_excel()
function from the readxl
package (the readxl
package
is one of packages that is not a part of core tidyverse
but uses the
same principles). With the read_excel()
function you can specify which
sheet to use in addition to similar arguments as for delimited text
files. Notice that by default R imports the “Metadata” sheet, so we can
use the sheet
argument to specify which sheet to import. In addition,
the first two rows contain introductory text, not the data, so we can
use the skip
argument to skip those rows. You may also notice that
previously the variable name for age was Student_Age
and now it is
Student_age
. You can rename the variable either by giving a full list
of column names in the import function, but often it is easier to use
the rename()
function after importing.
student3 <- read_excel("data/student3.xlsx", sheet = "Data")
The haven
package (also not core tidyverse
but same principles)
reads files in the data formats of SPSS (.sav) and Stata (.dta). It can
also extract variable and value labels from these files; here we can use
the read_spss()
to import student4.sav
.
student4 <- read_spss("data/student4.sav")
RDS is an R-specific file format that saves all attributes of the
dataframe (e.g. grouping, factor levels). It is particularly useful for
saving intermediate data files, e.g. saving the cleaned data before
analysis to avoid needing to run the data cleaning script repeatedly. To
import an RDS file such as student5.rds
, use the read_rds()
function.
student5 <- read_rds("data/student5.rds")
A few notes regarding importing and exporting data:
- Always make sure you know your current working directory and the
relative path to your data directory. It is better to use relative
rather than absolute file paths (i.e.
data/data.csv
instead ofC:/User/Project/data/data.csv
). - Note that if you are using Windows, you may need to replace the backslashes (\ in the file path with forward slashes (/) to avoid errors.
- You can import files directly from URLs, although you usually need the
URL of a raw file. If a file downloads immediately instead of opening
in a raw format, you can try to copy that download link by
right-clicking and selecting “Copy link address”; the
import()
function fromrio
might be successful with those links. - To export data from R, you can almost always use the
write_...()
function corresponding to the desired file format, e.g.write_csv()
. For Excel files the preferred export function iswrite_xlsx()
, and for SPSS’s .sav files it iswrite_sav()
. - For other file formats, the generic
write()
function is useful; you can specify any file format, and if your input data is readable in the chosen format, the file will export properly. - In all
write_()
functions you need to specify the data you’d like to save and the output file path (absolute or relative) including chosen file extension.
# example export code
write_csv(student1, "data/new_data.csv")