Using R in a Research Project

Get a brief overview of how to use R in a research project.

When to use R in a research project

There are several stages in an empirical research project. These stages often start with the identification of a research problem and end with a report containing the findings and implications. Let’s review these stages:

  1. Identify a research problem.
  2. Survey the literature (find out what’s known about the problem).
  3. Formulate a theoretical argument and testable hypotheses.
  4. Measure concepts.
  5. Collect data.
  6. Prepare data.
  7. Analyze data.
  8. Report findings and implications.

Tasks (1) to (5), namely identifying a significant and interesting research problem, surveying the extant literature, formulating a coherent theoretical argument and testable hypotheses that explain the research puzzle, measuring the concepts in the theory empirically, and collecting data for the empirical indicators of those concepts, are generally covered in substantive and research design courses in a field. Those topics are beyond the scope of this course. Tasks (6) to (8), however, may all involve R as a research instrument. Specifically, in actual research projects R is used to analyze particular research problems, such as evaluating the impact of a policy or testing the effect of a causal factor (an independent variable) on an outcome of interest (a dependent variable), as postulated by pre-specified theoretical expectations. Later sections of the course illustrate how to accomplish tasks (6) to (8).

A research project of this type presents at least two challenges, for which R is helpful.

First, a project involves a range of tasks, such as the following:

  • Import data into software.
  • Merge different datasets.
  • Verify data.
  • Create new variables.
  • Recode and rename variables.
  • Visualize data.
  • Run statistical estimation procedures.
  • Carry out diagnostic tests.

Second, an analyst needs to reproduce their analysis, including dataset construction and estimation results, even years later. The first challenge concerns the efficiency of analysis, whereas the second concerns the reproducibility and integrity of the analysis.
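The tasks listed under the first challenge can be sketched in a single short R program. This is only a minimal illustration using base R. The data values, variable names (income, educ_years, region), and file names are made up for the example and do not come from any real project.

```r
# Hypothetical toy data standing in for two real source files.
survey_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:6,
             income = c(21, 35, 48, 52, 67, 80),      # made-up values
             educ_years = c(10, 12, 12, 14, 16, 18)), # made-up values
  survey_file, row.names = FALSE
)
region_info <- data.frame(id = 1:6,
                          region = c("N", "S", "N", "S", "N", "S"))

# Import data into software.
survey <- read.csv(survey_file)

# Merge different datasets on the shared key.
dat <- merge(survey, region_info, by = "id")

# Verify data: expected number of rows and no missing values.
stopifnot(nrow(dat) == 6, !anyNA(dat))

# Create a new variable.
dat$log_income <- log(dat$income)

# Recode and rename variables.
dat$region <- factor(dat$region, levels = c("N", "S"),
                     labels = c("North", "South"))
names(dat)[names(dat) == "educ_years"] <- "education"

# Visualize data (the plot is written to a temporary PDF file).
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(dat$education, dat$log_income,
     xlab = "Years of education", ylab = "Log income")
dev.off()

# Run a statistical estimation procedure.
fit <- lm(log_income ~ education + region, data = dat)
print(summary(fit))

# Carry out a simple diagnostic check: inspect the residuals.
print(round(residuals(fit), 3))
```

Because every step is recorded in the program, the whole sequence can be rerun from the raw files at any time, which is what the second challenge requires.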

To achieve both efficiency and reproducibility, experienced analysts typically write their computing code in one or more programs. Doing so makes it possible to submit, revise, and resubmit the code, and thus to reproduce an analysis quickly whenever necessary. In this course, we focus on how to write and submit R programs for specific tasks rather than on the interactive or menu-driven use of R. For most practical purposes, the programming approach is more efficient and consistent than the interactive or menu-driven approach.
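As a rough illustration of the programming approach, the sketch below saves a few lines of code to a program file and then submits that file twice with source(). The file name analysis.R and the simulated data are assumptions made for this example. Fixing the random seed inside the program is one common way to make simulated results repeatable.

```r
# Write a small program to a file (file name and contents are hypothetical).
script_file <- file.path(tempdir(), "analysis.R")
writeLines(c(
  "set.seed(123)                 # fix the random seed",
  "x <- rnorm(50)                # simulated predictor",
  "y <- 1 + 2 * x + rnorm(50)    # simulated outcome",
  "fit <- lm(y ~ x)              # estimation step",
  "estimates <- coef(fit)        # keep the estimated coefficients"
), script_file)

# Submit the program twice, each time in a fresh environment.
run1 <- new.env()
run2 <- new.env()
source(script_file, local = run1)
source(script_file, local = run2)

# Compare the estimates from the two runs.
print(identical(run1$estimates, run2$estimates))
```

The same file could also be submitted from a terminal with Rscript, for example Rscript analysis.R, without opening an interactive session.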

Essentials about R

R is a computer language and an environment for statistical computing and graphics, and it offers several important advantages. Started by Robert Gentleman and Ross Ihaka at the University of Auckland in the early 1990s and released as free software in 1995, it is now maintained by the R Core Team of volunteer developers. R is referred to as a computer language because, as a dialect of the S language developed at AT&T’s Bell Laboratories beginning in the 1970s, it allows users to follow the algorithms, define and add new functions, and write new analytic methods rather than merely supplying canned routines. R is also a coherent system that provides an environment with an integrated suite of software facilities for data storage, manipulation, analysis, and visualization. In addition, R is flexible. It runs on Windows, UNIX, and macOS. It can be easily extended with new functions and state-of-the-art statistical methods; the more than 10,000 add-on packages available through the CRAN family of internet sites by the end of January 2017 testify to this fact. Last but not least, R is free, as are its numerous add-on packages. Hence, R is popular among practitioners in many fields and scholars in many disciplines, including the social sciences.
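To make the point about extensibility concrete, here is a small sketch of defining a new function and loading a package. The function name se_mean and the numbers are invented for this illustration. The package ggplot2 appears only as an example of a CRAN package name, and the lines that use it are commented out because installation requires internet access.

```r
# Define a new function: the standard error of the mean.
# (The name se_mean is our own choice, not part of base R.)
se_mean <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]   # optionally drop missing values
  sd(x) / sqrt(length(x))
}

values <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up numbers
print(se_mean(values))

# Add-on packages from CRAN are installed once and then loaded in each
# session in which they are needed.
# install.packages("ggplot2")
# library(ggplot2)

# Packages distributed with R itself can be loaded directly.
library(stats4)
```

Once defined, se_mean() can be called like any built-in function, and a loaded package makes its functions available for the rest of the session.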

To learn more about how to set up R and organize program code, you can refer to Appendix B. Here, we offer a brief introduction to writing and executing R programs, installing and loading add-on packages, and producing graphical and numerical output, and then turn to essential reference information about important symbols and common coding errors.