The R Data Analysis System
Linear Regression and ANOVA
This document describes R functions for simple and multiple linear regression, analysis of variance, and linear models in general. R Commander has an extensive menu of functions for creating, graphing and analyzing linear models, but it works only on data frames. There are some tasks that are easier to accomplish by entering commands at the command line in the R console. The R Commander menus are fairly straightforward, so we will describe Commander functions only when they offer significant advantages over typing commands in the basic console.
If x and y are numeric vectors of the same length, a scatterplot of the points (xi, yi) is produced by
> plot(x, y)
If xandy is a two-column matrix or data frame with the values of x in the first column and the values of y in the second, type
For a numeric matrix or data frame with more than two columns, this same command will give a scatterplot matrix, i.e., an array of scatterplots of all the pairs of variables in the matrix or data frame. This is a useful way of looking at data with many variables. The command
will do the same thing.
Once you have created a scatterplot, or any picture in R, you can save it by right clicking on it and choosing "Save as bitmap" or "Save as metafile". It can be converted to another graphics format outside of R.
Interactive 3-d scatterplots are best done with R Commander. First, make sure the 3 variables you want to plot are a data frame. Then in Commander make that data frame the active data set. In the "Graphs" menu, select "3D Graph" and then "3D Scatterplot". You can rotate the picture, add a plane regression surface and residual line segments if you like.
The least squares fit of a straight line to bivariate data can be accomplished in several ways. Generally, the most useful is to apply the lm function to create a linear model object. Suppose you want to name the linear model object modelA.lm and the variables x and y are in a data frame xyframe.
> modelA.lm=lm(y~x, data= xyframe)
The "data" argument tells R where to find the variables x and y. If they are stand-alone numeric vectors of the same length in your workspace, the data argument is not necessary. The expression "y ~ x" is an R formula and can be read as " y is modeled as a function of x". The formula argument is always required in a call of lm.
The lm function is used for fitting linear models of all kinds: simple linear regression models, multiple regression models, polynomial regression models, and analysis of variance and covariance models.
A summary of the important information in a fitted model is given by the summary function.
This gives details of the call to the lm function, a summary of the distribution of the residuals, the coefficients of the fitted regression equation, and much more. Individual attributes of the fitted model can be extracted separately. For example,
gives the regression coefficients alone. Similarly,
gives the vector of residuals, and
gives the fitted (predicted) values of y for each of the input values of x.
Suppose the scatterplot of values of x and y is produced by
and the fitted model is modelA.lm. The least squares regression line may be added to the plot by
The function abline is the generic function for adding a line to a plot. Other plots that may be helpful, especially for multiple regression, are
where abline adds a line with intercept 0 and slope 1, and
> plot(fitted(modelA.lm), resid(modelA.lm))
where abline adds a horizontal line with ordinate 0.
In multiple linear regression, the expected value of the response variable y is modeled as a function of the form E[y]=b0+b1x1+...+bpxp. Estimating the regression coefficients b1, ..., bp by least squares involves solving a system of equations known as the normal equations. In R, the same function lm is used for both single and multiple regression. The only difference is in the complexity of the formula argument. If p=3 and the names of the variables in R are y, x1, x2, x3, the command
> modelB.lm=lm(y~x1+x2+x3, data= xyframe)
creates the linear model object containing all the information about the least squares fit. The "data" argument is not needed if the variables are vectors in R's search path. Otherwise, it must be included with the name of the data frame containing those variables, as in the example above.
Information about the least squares estimates, residuals, fitted values, and so on, can be extracted as before.
Most of the information you might want can be seen all at once with
Fitting a polynomial function to bivariate data by least squares is a special case of linear model fitting and is also accomplished with lm. The term "linear" applies because the model expression is linear in the unknown coefficients even if it is not linear in the independent variable x. To fit a cubic polynomial, use
> modelC.lm=lm(y~x+I(x^2)+I(x^3), data=xyframe)
Polynomials of other degrees are fitted similarly. The expression "I(x^2)" in the model formula is necessary to tell R to treat x^2 as a mathematical construct (i.e., the square of x) and not to interpret it according to the rules of formula syntax. If you simply write "y~x+x^2+x^3" R will misunderstand you. Also, do not type "y~I(x+x^2+x^3)" because then R will treat x+x^2+x^3 as a single variable and will calculate only one regression coefficient.
If there are two independent variables x1 and x2, a quadratic function of these two variables can be fitted by
> modelD.lm=lm(y~x1*x2+I(x1^2)+I(x2^2), data=xyframe)
The term "x1*x2" in the model formula is called an interaction term and implicitly includes the intercept and the first degree terms in the quadratic function as well as the product of the two independent variables. Model formulas for higher degree polynomials in any number of variables are constructed similarly, but they quickly become very complex as the number of variables increases. R has a built-in function poly for assisting in polynomial fitting. It calculates orthogonal polynomials which are data-dependent, so the results can be difficult to interpret.
R can fit many other kinds of response models: splines, loess models, generalized additive models, generalized linear models, nonlinear least squares, and others. There are far too many of them to discuss here, but they are thoroughly documented in R's help files.
The summary of a linear model object reports the least squares estimates of the coefficients in the model, the student-t statistics computed from each of them for testing the null hypothesis that that coefficient is equal to 0, and the two sided p-value for the alternative hypothesis that it is not 0. In addition, an F statistic and its p-value is reported for the omnibus hypothesis that some regression coefficient is different from 0. It is quite possible that none of the individual p-values is smaller than some prescribed significance level, but collectively all of the terms in the model contribute significantly.
To get 95% confidence intervals for the regression coefficients, type, for example,
The 95% level is the default. If you want a different confidence level, e.g. 99%, specify it as follows.
> confint(modelD.lm, level=.99)
For numeric bivariate data the correlation between the variables may be more meaningful that the details of a linear least squares fit of one to the other. If x and y are numeric variables of the same length,
returns the sample Pearson product-moment correlation coefficient. To test the null hypothesis that the true correlation is 0 and to find a confidence interval for the true correlation, use
The confidence level can be adjusted with the "conf.level" argument. cor.test also gives you three options for a measure of association, Kendall's tau, Spearman's rho, and Pearson's correlation.
One-way analysis of variance is just a linear model of the expected response y as a function of a categorical variable or treatment variable x. In R, categorical variables are called factors. The fitting is done with the same lm program used for regression.
> modelF.lm = lm(y~x, data=xyframe)
The summary looks the same as it does with linear regression models. The biggest difference is in interpreting the coefficients. If the treatment levels are denoted by L1, L2,..., Lp, the intercept parameter is the mean response for level L1. The next parameter estimate reported is the difference in mean responses between levels L2 and L1. The third parameter estimate reported is the difference in mean response between levels L3 and L1, and so on. Thus, L1 serves as a "control" treatment level and the parameters of interest are the differences between the other treatment levels and the control. These differences are called contrasts.
The foregoing is not true if x is an ordered factor for which the ordering of the treatment levels is important. In this case the coefficients in the least squares fit by lm are based on orthogonal polynomial contrasts and have no simple interpretation. In most cases, they are simply ignored since estimation of parameters is not usually the purpose of anova.
Near the bottom of the summary the value of the F statistic is given for testing the null hypothesis that all treatment means are the same, as well as its p-value. The summary contains all the information in a traditional one-way anova table. Such a table can be produced in R by
If the anova table shows that there are significant differences in mean response among the treatment levels, the next question to consider is which pairs of treatment levels differ significantly. Neither the output of lm nor that of anova answers this question. The function pairwise.t.test does answer it while adjusting p-values to compensate for spurious significances that might arise simply because so many comparisons are being made. It can be invoked by
The result does not depend on whether x is an ordered factor or an unordered factor.
Suppose that in the data frame xyframe, y is the numeric response variable and x1 and x2 are factors. For a 2-way analysis of variance model without interactions, type
> modelG.lm=lm(y~x1+x2, data = xyframe)
If there are multiple observations for each combination of levels of factors x1 and x2, you can incorporate interactions into the model via
> modelG.lm=lm(y~x1*x2, data = xyframe)
will give you a traditional 2-way anova table with p-values for both the main effects and the interaction between the two factors. If different combinations of factor levels have different numbers of observations, you will get a warning that the design is unbalanced, but the results are not invalidated. It simply means that interpretation is more difficult.