Multiple Linear Regression


The multiple linear regression model describes the relationship between a dependent variable y = (y1, y2, ..., yn) and p independent variables x1, x2, ..., xp, where xi = (xi1, xi2, ..., xin), i = 1, ..., p, for p > 1, as

y = a + b1x1 + b2x2 + ... + bpxp + e,

where e = (e1, e2, ..., en) is the error term vector and a, b1, b2, ..., bp are unknown parameters to be estimated. The terminology for the regression diagnostics mirrors that of simple linear regression, with just a few exceptions.
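The model above can be sketched in code. This is a minimal illustration, assuming synthetic data with p = 2 predictors and made-up true parameters (a = 1, b1 = 2, b2 = -3); it fits the coefficients by ordinary least squares using numpy.

```python
import numpy as np

# Synthetic data (an assumption for illustration): n = 50 observations,
# two predictors, true parameters a = 1, b1 = 2, b2 = -3.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
e = rng.normal(scale=0.5, size=n)          # error term e
y = 1.0 + 2.0 * x1 - 3.0 * x2 + e

# Design matrix with a leading column of ones for the intercept a.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of (a, b1, b2).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b1_hat, b2_hat = coef
```

With this sample size and error scale, the estimates land close to the true parameter values.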


The assumptions of the multiple linear regression model are the same as the assumptions on the errors in Linear Models, namely
1. e1, e2,..., en are random and independent,
2. e1, e2,..., en  all have mean 0,
3. e1, e2,..., en  all have the same variance (homoscedasticity),
4. e1, e2,..., en  are normally distributed.
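Some consequences of these assumptions can be checked directly on fitted residuals. The sketch below, on synthetic data that satisfies the assumptions by construction, verifies two exact algebraic properties of least squares: when an intercept is included, the residuals average to zero and are orthogonal to every predictor column. (Normality and homoscedasticity are usually assessed separately, e.g. with residual plots or formal tests.)

```python
import numpy as np

# Synthetic data satisfying the four assumptions by construction.
rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# With an intercept column, OLS residuals sum to exactly zero (up to
# floating-point error), consistent with the mean-zero assumption ...
mean_resid = resid.mean()

# ... and are orthogonal to every column of the design matrix.
orth = X.T @ resid
```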


Residual: The difference between the actual, observed value and the value predicted by the regression equation.

Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.
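A common way to flag outliers is through internally studentized residuals, r_i = e_i / (s * sqrt(1 - h_ii)), where s is the residual standard error and h_ii the leverage. The sketch below uses synthetic data (an assumption) and deliberately shifts one response value so that its y is unusual given its predictors.

```python
import numpy as np

# Synthetic data; observation 0's response is shifted to create an outlier.
rng = np.random.default_rng(4)
n, p = 40, 3                               # p = coefficients incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=n)
y[0] += 5.0                                # unusual y for ordinary x-values

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
s = np.sqrt(resid @ resid / (n - p))            # residual standard error

# Internally studentized residuals; |r_i| > 2 is a common rule of thumb.
r = resid / (s * np.sqrt(1 - h))
outliers = np.flatnonzero(np.abs(r) > 2)
```

The shifted observation produces by far the largest studentized residual and is the one flagged.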

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High-leverage points can have a large effect on the estimates of the regression coefficients.
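Leverage is usually computed as the i-th diagonal entry h_ii of the hat matrix H = X(X'X)^{-1}X'. In this sketch (synthetic data, an assumption), one observation is placed far from the others in predictor space; its leverage dominates, and the leverages sum to the number of fitted coefficients, so the average leverage is (p+1)/n.

```python
import numpy as np

# Synthetic predictors; observation 0 is extreme in predictor space.
rng = np.random.default_rng(2)
n = 25
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x1[0], x2[0] = 8.0, -8.0                   # far from the predictor means
X = np.column_stack([np.ones(n), x1, x2])

# Hat matrix and its diagonal (the leverages).
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
# The leverages sum to the number of fitted coefficients (3 here),
# so the extreme point stands out against the average 3/n.
```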

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients.  Influence can be thought of as the product of leverage and outlierness.

Cook's distance (or Cook's D): A measure that combines the leverage and residual information of an observation.
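Cook's D can be written in closed form from the residual and leverage, D_i = (e_i^2 / (p s^2)) * h_ii / (1 - h_ii)^2, which equals the deletion definition: the scaled total change in fitted values when observation i is removed. The sketch below (synthetic data, an assumption) computes it both ways for one observation and confirms they agree.

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(1)
n, p = 30, 3                               # p = coefficients incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
s2 = resid @ resid / (n - p)                    # mean squared error

# Closed form: combines each observation's residual and leverage.
cooks = resid**2 / (p * s2) * h / (1 - h)**2

# Deletion definition for observation 0: refit without it and measure
# the scaled change in all fitted values.
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
D0 = np.sum((X @ beta - X @ beta_i) ** 2) / (p * s2)
```

The two computations are algebraically identical, which is why Cook's D can be read as "leverage times outlierness."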


Click on one of the links below to see how to perform a multiple linear regression with the package of your choice.


R SAS Minitab