The R Data Analysis System

Graphics

Stemplots

Boxplots

Quantile Plots

Histograms

Bar Charts and Pie Charts

Scatterplots and Linear Regression Lines

Multiple Scatterplots

3-D Scatterplots

Smooth Curves

Copying and Saving Plots

This document focuses on some of the graphical procedures available in R. R’s graphics capabilities are extensive, but it cannot be said that they are user friendly. R actually has two graphics packages, a basic package and a fancier package called lattice. Everything you will want to do right now can be accomplished with the basic package. This document describes only the most often used graphics functions. Much more information about these functions and others is available in the R help files.

Stemplots

Stemplots are also called stem-and-leaf diagrams. In R they are printed as text in the console window by the command

> stem(data)

where data is the name of the numeric vector you want to plot. The stem function has optional arguments, such as scale, for controlling how many stems and leaves you see. The default values of these arguments are usually satisfactory. Stemplots are not very useful for large data sets.
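As a small illustration, the sketch below makes a stemplot of simulated data and then repeats it with a larger value of scale, which roughly doubles the length of the plot. The data vector and the seed are made up for the example.

```r
# Simulated exam-like scores (hypothetical data, for illustration only)
set.seed(1)
data <- round(rnorm(30, mean = 50, sd = 10))

# Default stemplot, printed in the console
stem(data)

# Same data with the plot stretched to about twice as many stems
stem(data, scale = 2)
```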

Boxplots

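A boxplot is also called a box-and-whisker diagram. It visually depicts the five-number summary of a numeric data set, i.e., the minimum, the lower quartile, the median, the upper quartile, and the maximum. It also shows outliers: by default, points that lie more than 1.5 times the interquartile range beyond the box are plotted individually, and the whiskers then end at the most extreme values that are not outliers. To make a boxplot of a vector called data, type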

> boxplot(data)

The boxplot function has optional arguments for controlling the page layout of the plot and fine details of the boxes and whiskers in the plot. When you are ready to experiment with them, you can read about these options by calling

> help(boxplot)

Side-by-side boxplots are useful for comparing the distributions of several data vectors. If data1, data2, data3, etc. are the names of the data vectors, a side-by-side boxplot can be produced by

> boxplot(data1, data2, data3)

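Side-by-side boxplots of data grouped by the levels of a factor can be produced using formula syntax, as in

> boxplot(data~factor, data=dataframe)

where data and factor are two variables in the data frame dataframe. If they are variables in the top level of your workspace, you don't need the data argument.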

An optional argument that is sometimes helpful is the logical varwidth argument. It makes the width of each box proportional to the square root of the sample size of its data set. It is invoked as follows.

> boxplot(data1, data2, data3, varwidth=TRUE)
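The boxplot commands above can be tried out on simulated data. The following is only a sketch: the data frame scores, its columns value and group, and the group sizes are invented for the example. It prints the five-number summary with fivenum and then draws side-by-side boxplots whose widths reflect the unequal group sizes.

```r
# Hypothetical data: three groups of unequal size
set.seed(2)
scores <- data.frame(
  value = c(rnorm(20, mean = 10), rnorm(50, mean = 12), rnorm(10, mean = 11)),
  group = factor(rep(c("a", "b", "c"), times = c(20, 50, 10)))
)

# Five-number summary of all the values:
# minimum, lower hinge, median, upper hinge, maximum
fivenum(scores$value)

# Side-by-side boxplots by group; the data argument tells boxplot
# where to find value and group
boxplot(value ~ group, data = scores, varwidth = TRUE)
```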

Quantile Plots

Suppose a numeric data vector is a sample of size n from some distribution. A normal quantile plot of the data compares the order statistics (ordered data values) to the expected order statistics from a standard normal distribution. The vertical coordinates of the points are the ordered data values and the horizontal coordinates are the expected standard normal order statistics. If the data is a sample from the normal distribution with mean µ and standard deviation σ, the points of the normal quantile plot will lie close to a straight line with intercept µ and slope σ. To make a normal quantile plot, type

> qqnorm(data)

> qqline(data)

The second command above draws a line through two points: the one whose coordinates are the first quartiles of the standard normal distribution and of the data, and the corresponding one for the third quartiles. This line helps you assess the straightness of the set of points, and thus the departure from normality of the data.
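To see the connection between the line and the parameters of the distribution, you can simulate normal data and compute the line through the quartile points by hand. This is a sketch with made-up values (mean 5, standard deviation 2). The intercept and slope it prints should be close to, but not exactly, 5 and 2.

```r
# Simulated sample from a normal distribution (assumed mean 5, sd 2)
set.seed(3)
data <- rnorm(200, mean = 5, sd = 2)

qqnorm(data)
qqline(data)

# The same line computed by hand from the first and third quartiles
qy <- quantile(data, c(0.25, 0.75), names = FALSE)  # data quartiles
qx <- qnorm(c(0.25, 0.75))                          # standard normal quartiles
slope <- diff(qy) / diff(qx)
intercept <- qy[1] - slope * qx[1]
c(intercept = intercept, slope = slope)
```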

There is another type of quantile plot for comparing the order statistics from two independent samples to assess whether or not they come from the same distribution. The two samples do not have to be of the same size. If the data from the two samples are the vectors data1 and data2, the command is

> qqplot(data1,data2)

The interpretation of this plot is similar to that of the plot produced by qqnorm. If the two samples are from distributions that differ only in location and scale, the points of the plot should be close to a straight line. Reliable interpretation of quantile plots takes practice.
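Here is a minimal sketch of a two-sample quantile plot with samples of different sizes. The two vectors are simulated so that they differ only in location and scale, which means the points should fall near a straight line. The sample sizes and parameters are arbitrary choices for the example.

```r
# Two simulated samples of different sizes (hypothetical data)
set.seed(4)
data1 <- rnorm(100)              # standard normal
data2 <- 3 + 2 * rnorm(60)       # shifted and rescaled normal

# qqplot returns the plotted coordinates invisibly
qq <- qqplot(data1, data2)

# The number of plotted points equals the size of the smaller sample
length(qq$x)
```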

Histograms

A basic, no-frills histogram of a numeric vector data can be produced with the command

> hist(data)

There are lots of optional arguments to the hist function for controlling the colors and appearance of the bars in the histogram, the titles, and the axis labels. The default bins or class intervals are chosen on the basis of the range of the data and the size of the data set and are always of equal length. The argument breaks allows you to specify your own class intervals. Here breaks is an increasing numeric vector that gives the end-points of the intervals, and it must span the whole range of the data. By default the intervals are closed on the right, and the first one also includes its left end-point. For example, to make a histogram with bins [0,1], (1,2], ..., (9,10] use

> hist(data, breaks=0:10)

To make one with bins [0,2], (2,6], (6,8], (8,10] use

> hist(data, breaks=c(0,2,6,8,10))

Notice that the bin widths are not all the same in the last example.

When the bins are of equal length, the default vertical scale of an R histogram shows the counts of the data values that fall in each of the class intervals. Histograms are often used as approximations of a density function, and in that role they should be density functions themselves. That is, the sum of the areas of the histogram bars should be 1. This can be accomplished by setting the logical probability argument, which can be abbreviated to prob, to TRUE, as in

> hist(data, prob=TRUE)

With bins of equal length, the shape of the histogram is the same when prob = TRUE as when prob = FALSE. The only difference is in the scale of the vertical axis. With bins of unequal length, hist uses the density scale by default, because bar heights that show counts would misrepresent the distribution.
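You can check the claim about areas directly, because hist returns its bins and bar heights as a list. The sketch below uses simulated data on the interval from 0 to 10 and unequal bins. The probability argument is written out in full here, and the names data and h are just for the example.

```r
# Simulated data between 0 and 10 (hypothetical)
set.seed(5)
data <- runif(100, min = 0, max = 10)

# Histogram on the density scale with unequal bins;
# the returned object holds the breaks, counts and densities
h <- hist(data, breaks = c(0, 2, 6, 8, 10), probability = TRUE)

# Area of each bar is height times width; the areas should sum to 1
sum(h$density * diff(h$breaks))
```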

Bar Charts and Pie Charts

Bar charts are different from histograms. Histograms are for numeric data whereas bar charts and pie charts are for counts of levels of a factor. Usually, the counts must be tabulated first with the table function.

> counts <- table(factor)

This produces a vector of the counts of the various levels of the factor variable factor. This is a vector with named components, the names being the factor levels. The bar chart is then given by

> barplot(counts)

and the pie chart by

> pie(counts)

Optional arguments to barplot allow horizontal bars, multicolored bars, stacked bars, labels and legends.
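As a short illustration of the steps above, the following sketch tabulates a small made-up factor and draws both charts. The factor answers and the colors are assumptions chosen for the example. The horiz and col arguments give horizontal, multicolored bars.

```r
# A small hypothetical factor of survey answers
answers <- factor(c("yes", "no", "yes", "maybe", "no", "yes"))

# Tabulate the counts of each level (levels are in alphabetical order)
counts <- table(answers)
counts

# Horizontal bar chart with one color per level
barplot(counts, horiz = TRUE, col = c("grey", "red", "blue"))

# Pie chart of the same counts
pie(counts)
```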

Scatterplots and Linear Regression Lines

Scatterplots or scatter diagrams are probably encountered more than any other kind of plot in elementary statistics. A scatterplot is just a plot of a finite set of points (x_i, y_i) in a Cartesian plane. The usual reason for making a scatterplot is to conjecture or investigate a noisy functional relationship between y and x. If the values of x are in a numeric vector xdata and the values of y in a numeric vector ydata, the command

> plot(xdata,ydata)

gives you the scatterplot. It is important that the lengths of xdata and ydata be the same. If they are not, you will get an error message. If xdata and ydata are the columns of a two-column matrix or data frame xyframe, the plot is even easier.

> plot(xyframe)

Sometimes it is more convenient to use formula syntax to create a scatterplot.

> plot(ydata~xdata)

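If either xdata or ydata is non-numeric, the plot function gives a different kind of plot, depending on the types of the variables. For example, if xdata is a factor and ydata is numeric, plot(ydata~xdata) produces side-by-side boxplots of ydata for the levels of xdata. All of these kinds of plots are useful in different ways.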

In order to superimpose the least squares regression line on a scatterplot, you must first calculate the regression coefficients. There are lots of ways of doing this. The most general and useful is with the lm (for linear model) function. First, create the fitted linear model with a command such as

> xymodel <- lm(ydata~xdata, data=xyframe)

The argument "data = xyframe" is needed only if xdata and ydata are variables in a data frame xyframe and you need to tell lm where they are. If they are variables in the top level of your workspace you don't need this argument. After creating the fitted model, the regression line is superimposed on the scatterplot as follows.

> plot(ydata~xdata)

> abline(coef(xymodel))

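The function coef extracts the least squares regression coefficients (the intercept and the slope) from the fitted model object. abline is a general-purpose function for adding a straight line to an existing plot. It also accepts the fitted model itself, as in abline(xymodel).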
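The whole sequence can be tried on simulated data. In this sketch the data frame xyframe is generated from a straight line with intercept 3 and slope 0.5 plus noise. These values are assumptions for the example, and the estimated coefficients should come out close to them.

```r
# Simulated data frame: y is a noisy linear function of x (hypothetical)
set.seed(6)
xyframe <- data.frame(xdata = 1:50)
xyframe$ydata <- 3 + 0.5 * xyframe$xdata + rnorm(50)

# Fit the least squares line
xymodel <- lm(ydata ~ xdata, data = xyframe)

# Scatterplot with the fitted line superimposed
plot(ydata ~ xdata, data = xyframe)
abline(coef(xymodel))

# Estimated intercept and slope
coef(xymodel)
```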

Multiple Scatterplots

You can get simultaneous scatterplots of all the pairs of variables in a data frame by either of the commands

> pairs(dataframe)

or

> plot(dataframe)

This is a very good way to look at how several variables interact in pairs. Factors in the data frame are converted to their integer level codes and plotted as if they were numeric. Thus, pairs involving factors may not tell you anything useful.
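One way to deal with a factor is to leave it out of the scatterplot matrix and use it only to color the points. The sketch below does this with iris, a data frame that ships with R and has four numeric variables and one factor, Species. Coloring by species is a suggestion for the example, not something pairs does on its own.

```r
# iris has four numeric measurements and the factor Species
str(iris)

# The integer codes that stand in for the factor levels
table(as.integer(iris$Species))

# Scatterplot matrix of the four numeric variables only
pairs(iris[, 1:4])

# The same plot with points colored by species
pairs(iris[, 1:4], col = as.integer(iris$Species))
```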

3-D Scatterplots

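Three-dimensional scatterplots cannot be produced with the basic graphics package. A convenient way to do them is through R Commander, a menu-driven interface provided by the contributed package Rcmdr. The package must be installed before its first use, for example with install.packages("Rcmdr"). After that, you load the R Commander package by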

> library(Rcmdr)

It takes a few seconds for the package to be loaded and for the R Commander window to open. Suppose that in your R workspace you have a data frame named dataframe, with variables x1, x2 and y, among other variables. First, make dataframe the active data set in R Commander by clicking on the button labelled "Data set" and then selecting dataframe from the menu. After that, click on the "Graphs" drop-down menu and select "3D Graphs - 3D Scatterplot". In the dialog box that pops up, choose y as the dependent variable and x1 and x2 as the independent variables. Select or deselect any options you want or don't want and then click "OK". The three-dimensional scatterplot can be rotated with your mouse, points can be identified, and the picture can be saved to an external file.

Smooth Curves

Suppose funct is the name of a previously defined function of one numeric variable. It can be either a built-in R function or a function that you have defined yourself. To plot the graph of the function from the lower limit a to the upper limit b, type

> curve(funct, from=a, to=b)

For example, the sine function is graphed by

> curve(sin, from=-pi, to=pi)

There are optional arguments for adjusting certain features of the plot. If the plotted function has a simple formula, the formula, written as an expression in the variable x, can be used in place of the name of a function. This eliminates the need to define the function prior to calling curve. For example,

> curve(-2*x^2+1,from=-2,to=2)

plots part of the parabola y = -2x²+1. You can superimpose a curve on a pre-existing plot by using the logical "add" argument to the curve function.

> curve(-2*x^2+1, add=TRUE)

Notice that the from and to arguments were not used here because, presumably, you want the added curve to extend across the range of the previous plot.
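A common use of the add argument is to lay a density curve over a histogram drawn on the density scale. The following sketch does that with simulated normal data, and also plots a user-defined function. The function funct and all the numbers are made up for the example.

```r
# Simulated normal data (hypothetical mean and sd)
set.seed(7)
data <- rnorm(500, mean = 10, sd = 3)

# Histogram on the density scale, then the fitted normal density on top;
# no from/to are given, so the curve spans the existing plot
hist(data, probability = TRUE)
curve(dnorm(x, mean = mean(data), sd = sd(data)), add = TRUE)

# A user-defined function of one numeric variable, plotted on [0, 3]
funct <- function(x) exp(-x) * sin(2 * pi * x)
cv <- curve(funct, from = 0, to = 3)

# curve returns the plotted points invisibly (101 of them by default)
length(cv$x)
```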

Copying and Saving Plots

In the R GUI for Windows, you can copy and paste a plot into another application, such as a Microsoft Word document, by right-clicking in the plot area and selecting either "Copy as metafile" or "Copy as bitmap" from the popup menu. Then you may simply paste it into your other application. To save a plot, right-click again and select "Save as metafile" or "Save as bitmap". These can be converted to JPEG or GIF files externally if you like. You may find resolution to be improved by resizing the plot inside R before copying it. The menu also allows you to print the plot to a printer or to PDF format, if you have the right software.
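Plots can also be saved from the command line, which works the same way on every platform. The sketch below uses the pdf function to open a file as the graphics device, draws the plot into it, and closes the device with dev.off. The file name myplot.pdf and the temporary directory are assumptions for the example, so substitute your own path. Similar device functions, such as png, exist for bitmap formats.

```r
# Hypothetical output file in R's temporary directory
out <- file.path(tempdir(), "myplot.pdf")

# Open a PDF file as the graphics device (size is in inches)
pdf(out, width = 6, height = 4)

# Anything plotted now goes into the file instead of the screen
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")

# Close the device so that the file is written completely
dev.off()

file.exists(out)
```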