Import Datasets of Other Formats into R

Learn how to import datasets of other formats into R.

The Penn World Table 7.0 dataset example illustrates how R imports a comma-delimited file. In practice, we will always come across datasets in many other formats. R is powerful in terms of data importation. Below we will present some sample code for importing datasets of several common formats. All the code below will assume we have executed setwd() function and specified where the datasets are stored.

  1. Tab-delimited file: The R code for a tab-delimited file is very similar to those for the comma-delimited file, except for how to denote that a file is tab-delimited.
dataframe.name <- read.table("filename.txt", header=TRUE, sep="\t", na.strings=".", strip.white=TRUE,
                   stringsAsFactors=False)

The read.table() function imports a tab-delimited file called filename.txt, in which the header row contains variable names, the columns are separated by tabs, and the observation values denoted by "." are treated as missing. The output from the read.table function is assigned to a data object arbitrarily called dataframe.name.

We use the option sep="\t" to indicate the file is tab-delimited. The read.table function can also import the comma delimited file by specifying the option sep="," instead.

  1. Stata file: To import stata files into R, we may use one of the following several ways.
# install.packages("foreign") # install.packages("Hmisc") library(foreign) library(Hmisc)
dataframe.name <- stata.get("filename.dta")

In the example above, we first use the library function to load into R two add-on packages, foreign and Hmisc. Note that the code assumes that the packages have been installed; if they have not been, we will get an error message when trying to load the packages. Then we use the stata.get function to read the stata file and create a data object called dataframe.name. In a second way to import a stata file, we first use the library function to load into R the add-on package, foreign. Then we use the read.dta function to read the stata file, and create a data object dataframe.name.

library(foreign)
dataframe.name <- read.dta("filename.dta")

There are three important caveats to remember when we import stata files into R. First, many variable names in stata files contain underscores like in variable_name, which R does not like. We could use the option convert. underscore = TRUE to replace an underscore with a period. Second, many variables in stata files are often read into R as factor variables, which could be hard to work with for beginners. We could use the option convert.factors = FALSE to tell R to import numerical variables as numerical ones only. With these options added, the R code may appear as follows:

dataframe.name <- stata.get("filename.dta", convert.underscore=TRUE, convert.factors=FALSE)
dataframe.name <- read.dta("filename.dta", convert.underscore=TRUE, convert.factors=FALSE)

Finally, both stata.get and read.dta functions can not import files in the STATA13 format. A new package for importing Stata 13 data files is now available. The following code will allow us to do that.

 # install.packages('readstata13')
library(readstata13)
dataframe.name <- read.dta13("filename.dta")

We first install and load the readstata13 package, then use the read.dta13 function to import the stata file, and assign it to a data object in R.

Stata files often contain very informative variable labels. Here is one way to preserve those variable labels in a separate file, which we refer to as “codebook” in the R code below.

dataframe.name <- read.dta("filename.dta")
var.labels <- attr(dataframe.name,"var.labels") codebook <- data.frame(var.name=names(dataframe.name),
var.labels)
  1. SPSS file: Similar to stata files, we can use two alternative functions for importing SPSS files: read.spss in the foreign package, and spss.get in the Hmisc package. The latter is preferable because it automatically takes care of many options, making data importation smoother. The R code for the latter is as follows:
library(foreign)
library(Hmisc)
dataframe.name <- spss.get("filename.sav")

In many SPSS files, variables often come with value labels. If we want to keep the variables with those same levels defined by value labels, we can add an option to the argument of the function and use the following code instead.

dataframe.name <- 
spss.get("filename.sav", 
use.value.labels=TRUE)
  1. SAS file: SAS files can be read into R using the foreign and haven packages among others. We can use the read.xport() function in the foreign package and the read_sas() function in the haven package to read two different SAS data formats, respectively.
    library(foreign)
    dataframe.name <- read.xport("filename.xport")
    
    library(haven)
    dataframe.name <- read_sas("filename.sas7bdat")
    
  2. Excel file: Most frequently students work with Excel spreadsheet files. The best data importing method, recommended in many books on R, is to export those files from within Excel to comma-delimited or tab-delimited files and then import them into R using the code described earlier. We encourage R beginners to follow this advice.

However, R does have the ability to directly communicate with Excel spread- sheets for data importation. The Excel file needs some preparation before we read it into R. For example, keep the variable names in the first row of the spreadsheet, and give the sheet to be read into R a worksheet name so that we can refer to it in R. Here is the R code:

# install.packages('RODBC')
library(RODBC)
channel <- 
odbcConnectExcel("filename.xls")
dataframe1 <- sqlFetch(channel, 
"worsheet1")
odbcClose(channel)

In the code, we first install and load the RODBC package into R; ignore the installation step if the package has been installed. The odbcConnectExcel() function reads an excel file called filename.xls, and returns an RODBC connection object called channel. Then the sqlFetch() function uses the channel object, imports the Excel worksheet labeled worksheet1, and assigns the output to an R data object called dataframe1. Finally, the odbcClose() function removes the RODBC connection object called channel.

Sometimes, we would like to import multiple worksheets from an Excel file. Suppose we have three worksheets, labeled worksheet1, worksheet2, and worksheet3. The code below shows how we can import the three worksheets into three R data objects called dataframe1, dataframe2, and dataframe3, respectively.

library(RODBC)
channel <- odbcConnectExcel("filename.xls") dataframe1 <- sqlFetch(channel, "worsheet1") dataframe2 <- sqlFetch(channel, "worsheet2") dataframe3 <- sqlFetch(channel, "worsheet3") odbcClose(channel)

A couple of other packages also can help import Excel files into R, including gdata and XLConnect. To use these packages, one may try the following code:

library(gdata)
dataframe <- read.xls("datafile.xls", sheet = 1)
library(XLConnet)
workbook <- loadWorkbook("datafile.xls")
dataframe <- readWorksheet(workbook, sheet = "Sheet1")
  1. R data format: If a data object is saved in R data format, it can be directly loaded into R using the load() function. The R code below shows how to save a data object in the workspace, and then load it into R.
save(data.object, file = 
"datafile.RData") load("datafile.RData")

Get hands-on with 1200+ tech skills courses.