The R Data Analysis System

Linear Regression and ANOVA

Extracting Information from a Fitted Model

Hypothesis Tests and Confidence Intervals

This document describes R functions for simple and multiple linear regression, analysis of variance, and linear models in general. R Commander has an extensive menu of functions for creating, graphing and analyzing linear models, but it works only on data frames. There are some tasks that are easier to accomplish by entering commands at the command line in the R console. The R Commander menus are fairly straightforward, so we will describe Commander functions only when they offer significant advantages over typing commands in the basic console.

If *x* and *y* are numeric vectors of the same length, a
scatterplot of the points (x_i, y_i) is
produced by

> plot(x, y)

If *xandy* is a two-column matrix or data frame with the values of *x*
in the first column and the values of *y* in the second, type

> plot(xandy)

For a numeric matrix or data frame with more than two columns, this same command will give a scatterplot matrix, i.e., an array of scatterplots of all the pairs of variables in the matrix or data frame. This is a useful way of looking at data with many variables. The command

> pairs(xandy)

will do the same thing.

Once you have created a scatterplot, or any picture in R, you can save it by right-clicking on it and choosing "Save as bitmap" or "Save as metafile". It can then be converted to another graphics format outside of R.

Interactive 3-d scatterplots are best done with R Commander. First, make sure the 3 variables you want to plot are in a data frame. Then in Commander make that data frame the active data set. In the "Graphs" menu, select "3D Graph" and then "3D Scatterplot". You can rotate the picture, and add a regression surface and residual line segments if you like.

The least squares fit of a straight line to bivariate data can be
accomplished in several ways. Generally, the most useful is to apply the *lm*
function to create a linear model object. Suppose you want to name the
linear model object *modelA.lm* and the variables *x* and *y*
are in a data frame *xyframe*.

> modelA.lm=lm(y~x, data=xyframe)

The "data" argument tells R where to find the variables *x*
and *y*. If they are stand-alone numeric vectors of the same length
in your workspace, the data argument is not necessary. The expression
"y ~ x" is an R __formula__ and can be read as "y is modeled
as a function of x". The formula argument is always required in a
call of *lm*.
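As a minimal worked sketch, the following generates simulated data and fits the model just described. The names *modelA.lm* and *xyframe* match the text, but the data here are invented for illustration.

```r
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)     # true intercept 2, true slope 3
xyframe <- data.frame(x = x, y = y)

modelA.lm <- lm(y ~ x, data = xyframe)
coef(modelA.lm)   # estimated intercept and slope, close to 2 and 3
```

Because the noise is small relative to the spread of x, the estimates land near the true values.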

The *lm* function is used for fitting linear models of all kinds:
simple linear regression models, multiple regression models, polynomial
regression models, and analysis of variance and covariance models.

A summary of the important information in a fitted model is given by the *summary*
function.

> summary(modelA.lm)

This gives details of the call to the *lm* function, a summary of the distribution of the residuals, the coefficients of the fitted regression equation, and much more. Individual attributes of the fitted model can be extracted separately. For example,

> coef(modelA.lm)

gives the regression coefficients alone. Similarly,

> resid(modelA.lm)

gives the vector of residuals, and

> fitted(modelA.lm)

gives the fitted (predicted) values of y for each of the input values of x.
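The extractors fit together: each observed response is exactly its fitted value plus its residual. A small check with simulated data (the data frame and model name here are hypothetical):

```r
set.seed(2)
xyframe <- data.frame(x = 1:20)
xyframe$y <- 5 - 0.5 * xyframe$x + rnorm(20, sd = 0.3)
m <- lm(y ~ x, data = xyframe)

# Observed responses decompose into fitted values plus residuals.
all.equal(unname(fitted(m) + resid(m)), xyframe$y)
```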

Suppose the scatterplot of values of x and y is produced by

> plot(x,y)

and the fitted model is *modelA.lm*. The least squares regression
line may be added to the plot by

> abline(coef(modelA.lm))

The function *abline* is the generic function for adding a line to a
plot. Other plots that may be helpful, especially for multiple
regression, are

> plot(fitted(modelA.lm),y)

> abline(0,1)

where abline adds a line with intercept 0 and slope 1, and

> plot(fitted(modelA.lm), resid(modelA.lm))

> abline(h=0)

where abline adds a horizontal line with ordinate 0.

In multiple linear regression, the expected value of the response variable y
is modeled as a function of the form E[y] = b_0 + b_1 x_1 + ... + b_p x_p. Estimating
the regression coefficients b_0, b_1, ..., b_p by least squares
involves solving a system of equations known as the normal equations. In
R, the same function *lm* is used for both single and multiple
regression. The only difference is in the complexity of the formula
argument. If p=3 and the names of the variables in R are *y*, *x1*,
*x2*, *x3*, the command

> modelB.lm=lm(y~x1+x2+x3, data=xyframe)

creates the linear model object containing all the information about the least squares fit. The "data" argument is not needed if the variables are vectors in R's search path. Otherwise, it must be included with the name of the data frame containing those variables, as in the example above.
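Here is a self-contained sketch of the multiple regression fit, again with simulated data (the names are the ones used in the text but the numbers are invented):

```r
set.seed(3)
n <- 100
xyframe <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
xyframe$y <- 1 + 2 * xyframe$x1 - xyframe$x2 + 0.5 * xyframe$x3 +
  rnorm(n, sd = 0.5)

modelB.lm <- lm(y ~ x1 + x2 + x3, data = xyframe)
coef(modelB.lm)   # four coefficients: intercept plus one slope per predictor
```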

Information about the least squares estimates, residuals, fitted values, and so on, can be extracted as before.

> coef(modelB.lm)

> resid(modelB.lm)

> fitted(modelB.lm)

Most of the information you might want can be seen all at once with

> summary(modelB.lm)

Fitting a polynomial function to bivariate data by least squares is a
special case of linear model fitting and is also accomplished with *lm*.
The term "linear" applies because the model expression is linear in
the unknown coefficients even if it is not linear in the independent variable *x*.
To fit a cubic polynomial, use

> modelC.lm=lm(y~x+I(x^2)+I(x^3), data=xyframe)

Polynomials of other degrees are fitted similarly. The expression "I(x^2)" in the model formula is necessary to tell R to treat x^2 as a mathematical construct (i.e., the square of x) and not to interpret it according to the rules of formula syntax. If you simply write "y~x+x^2+x^3" R will misunderstand you. Also, do not type "y~I(x+x^2+x^3)" because then R will treat x+x^2+x^3 as a single variable and will calculate only one regression coefficient.
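You can see the pitfall directly: without I(), formula syntax absorbs the powers of x into x itself, and the cubic collapses to a straight-line fit. A small demonstration with simulated data:

```r
set.seed(4)
x <- seq(-2, 2, length.out = 40)
y <- 1 + x - 2 * x^2 + 0.5 * x^3 + rnorm(40, sd = 0.2)

good <- lm(y ~ x + I(x^2) + I(x^3))   # cubic fit, as intended
bad  <- lm(y ~ x + x^2 + x^3)         # formula syntax reduces x^2 and x^3 to x

length(coef(good))   # 4 coefficients: intercept, x, x^2, x^3
length(coef(bad))    # only 2: intercept and x
```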

If there are two independent variables *x1* and *x2*, a quadratic
function of these two variables can be fitted by

> modelD.lm=lm(y~x1*x2+I(x1^2)+I(x2^2), data=xyframe)

In the model formula, "x1*x2" expands to "x1 + x2 + x1:x2": it implicitly
includes the first degree terms as well as the product term "x1:x2",
which is called an __interaction__ term, and the intercept is included by
default. Model formulas for higher degree polynomials in any number of
variables are constructed similarly, but they quickly become very complex as
the number of variables increases. R has a built-in function *poly*
for assisting in polynomial fitting. It calculates orthogonal polynomials,
which are data-dependent, so the coefficients can be difficult to interpret.
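Although the orthogonal-polynomial coefficients from *poly* look nothing like the raw ones, the two parameterizations span the same column space, so the fitted values agree. A quick check with simulated data:

```r
set.seed(5)
x <- runif(60, 0, 5)
y <- 2 + x - 0.3 * x^2 + rnorm(60, sd = 0.4)

raw.fit  <- lm(y ~ x + I(x^2))    # raw polynomial basis
orth.fit <- lm(y ~ poly(x, 2))    # orthogonal polynomial basis

# Different coefficients, identical fitted values.
all.equal(fitted(raw.fit), fitted(orth.fit))
```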

R can fit many other kinds of response models: splines, loess models, generalized additive models, generalized linear models, nonlinear least squares, and others. There are far too many of them to discuss here, but they are thoroughly documented in R's help files.

The summary of a linear model object reports the least squares estimates of the coefficients in the model, the Student t statistic computed from each of them for testing the null hypothesis that that coefficient is equal to 0, and the two-sided p-value for the alternative hypothesis that it is not 0. In addition, an F statistic and its p-value are reported for the omnibus test of the null hypothesis that all of the regression coefficients (other than the intercept) are 0, against the alternative that at least one differs from 0. It is quite possible that none of the individual p-values is smaller than some prescribed significance level, and yet collectively the terms in the model contribute significantly.
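If you want these quantities programmatically rather than printed, applying *coef* to the summary returns the coefficient table as a numeric matrix. A small sketch with simulated data (names are hypothetical):

```r
set.seed(6)
d <- data.frame(x = rnorm(30))
d$y <- 1 + 2 * d$x + rnorm(30)
m <- lm(y ~ x, data = d)

ctab <- coef(summary(m))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
colnames(ctab)
ctab["x", "Pr(>|t|)"]      # two-sided p-value for the slope
```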

To get 95% confidence intervals for the regression coefficients, type, for example,

> confint(modelD.lm)

The 95% level is the default. If you want a different confidence level, e.g. 99%, specify it as follows.

> confint(modelD.lm, level=.99)

For numeric bivariate data the correlation between the variables may be more
meaningful than the details of a linear least squares fit of one to the
other. If *x* and *y* are numeric variables of the same length,

> cor(x,y)

returns the sample Pearson product-moment correlation coefficient. To test the null hypothesis that the true correlation is 0 and to find a confidence interval for the true correlation, use

> cor.test(x,y)

The confidence level can be adjusted with the "conf.level"
argument. *cor.test* also offers three options for the measure of
association: Kendall's tau, Spearman's rho, and Pearson's correlation.
The confidence interval is reported only for the Pearson method.
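For example, with simulated data, the default Pearson test and a rank-based Spearman test can be run side by side:

```r
set.seed(7)
x <- rnorm(40)
y <- x + rnorm(40)

pearson  <- cor.test(x, y)                       # default: Pearson correlation
spearman <- cor.test(x, y, method = "spearman")  # rank-based Spearman rho

pearson$estimate   # sample correlation coefficient
pearson$conf.int   # 95% confidence interval for the true correlation
```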

One-way analysis of variance is just a linear model of the expected response
*y* as a function of a categorical variable, or treatment variable, *x*.
In R, categorical variables are called __factors__. The fitting is
done with the same *lm* function used for regression.

> modelF.lm = lm(y~x, data=xyframe)

> summary(modelF.lm)

The summary looks the same as it does with linear regression models.
The biggest difference is in interpreting the coefficients. If the
treatment levels are denoted by L1, L2,..., Lp, the intercept parameter is the
mean response for level L1. The next parameter estimate reported is the
difference in mean responses between levels L2 and L1. The third
parameter estimate reported is the difference in mean response between levels
L3 and L1, and so on. Thus, L1 serves as a "control" treatment
level and the parameters of interest are the differences between the other
treatment levels and the control. These differences are called __contrasts__.
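This interpretation of the coefficients can be verified directly. In the sketch below (simulated data, hypothetical names), the intercept reproduces the mean response at level L1 and the second coefficient reproduces the L2 minus L1 difference:

```r
set.seed(8)
x <- factor(rep(c("L1", "L2", "L3"), each = 10))
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 7), rnorm(10, mean = 9))
xyframe <- data.frame(x = x, y = y)

modelF.lm <- lm(y ~ x, data = xyframe)
grp.means <- tapply(xyframe$y, xyframe$x, mean)

# With default (treatment) contrasts:
all.equal(unname(coef(modelF.lm)[1]), unname(grp.means["L1"]))
all.equal(unname(coef(modelF.lm)[2]), unname(grp.means["L2"] - grp.means["L1"]))
```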

The foregoing is not true if x is an ordered factor for which the ordering
of the treatment levels is important. In this case the coefficients in
the least squares fit by *lm* are based on orthogonal polynomial contrasts
and have no simple interpretation. In most cases, they are simply ignored
since estimation of parameters is not usually the purpose of anova.

Near the bottom of the summary the value of the F statistic is given for testing the null hypothesis that all treatment means are the same, as well as its p-value. The summary contains all the information in a traditional one-way anova table. Such a table can be produced in R by

> anova(modelF.lm)

If the anova table shows that there are significant differences in mean
response among the treatment levels, the next question to consider is which
pairs of treatment levels differ significantly. Neither the output of *lm*
nor that of *anova* answers this question. The function
*pairwise.t.test* does answer it while adjusting p-values to compensate for
spurious significances that might arise simply because so many comparisons are
being made. It can be invoked by

> attach(xyframe)

> pairwise.t.test(y,x)

The result does not depend on whether x is an ordered factor or an unordered factor.
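A self-contained sketch with simulated data; here *with* is used in place of attach/detach, which avoids leaving the data frame on the search path. By default the p-values are adjusted by the Holm method.

```r
set.seed(9)
xyframe <- data.frame(
  x = factor(rep(c("A", "B", "C"), each = 12)),
  y = c(rnorm(12, mean = 10), rnorm(12, mean = 10.2), rnorm(12, mean = 13))
)

pt <- with(xyframe, pairwise.t.test(y, x))
pt$p.value   # matrix of adjusted p-values for each pair of levels
```

With these group means, the A-B comparison is not significant while C differs clearly from both.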

Suppose that in the data frame *xyframe*,
*y* is the numeric response variable and *x1* and *x2* are
factors. For a 2-way analysis of variance model without interactions,
type

> modelG.lm=lm(y~x1+x2, data = xyframe)

If there are multiple observations for each combination of levels of factors
*x1* and *x2*, you can incorporate interactions into the model via

> modelG.lm=lm(y~x1*x2, data = xyframe)

Then

> anova(modelG.lm)

will give you a traditional 2-way anova table with p-values for both the main effects and the interaction between the two factors. If different combinations of factor levels have different numbers of observations, the design is unbalanced. The results are not invalidated, but the sums of squares in the table are sequential (they depend on the order of the terms in the formula), so interpretation is more difficult.
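A balanced two-way sketch with simulated data (hypothetical names) shows the shape of the resulting table: one row per main effect, one for the interaction, and one for the residuals.

```r
set.seed(10)
g <- expand.grid(x1  = factor(c("a", "b")),
                 x2  = factor(c("u", "v", "w")),
                 rep = 1:5)                      # 5 replicates per cell
g$y <- rnorm(nrow(g), mean = 10) +
  ifelse(g$x1 == "b", 2, 0) +                    # main effect of x1
  ifelse(g$x2 == "w", 3, 0)                      # main effect of x2

modelG.lm <- lm(y ~ x1 * x2, data = g)
anova(modelG.lm)   # rows: x1, x2, x1:x2, Residuals
```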