Statistics – Multiple Linear Regression
Hi, here is one long blog post this week, prompted by a question I got. The reader had tried to find a step-by-step instruction for multiple linear regression, but it is hard to find on the web. This is normally a full day of training, but I will do my best to cover the most important steps.
There are some rules for doing linear regression:
1. The y variable (dependent variable) should be normally distributed. In this example I use salary as my y variable, and since it is not normally distributed I have selected only people with lower salaries and skipped the others, to make it closer to normal. To check for normal distribution, have a look at my other blog post on Statistics Normal Distribution.
2. The x variables should be 0/1 coded or numeric. In my example gender is coded 1 and 2, so I have to recode it to 0 and 1. Age I can keep as it is: numeric.
3. The x variables should not be too highly correlated with each other (investigate with Pearson correlations or a factor analysis beforehand). Check with the multicollinearity statistics within the regression command.
4. Also check that you don't have any outliers that influence the regression model. You can choose "casewise diagnostics" within linear regression by clicking the "Statistics" button. You can also save Cook's distance from the "Save" button within the regression command and see if any case has a high value compared to the rest.
5. The final model should have residuals (ZRESID) with the same variation along the prediction line (ZPRED). It is not good if, for example, the variation increases with increasing predicted y-values.
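The Cook's distance check from rule 4 can also be done by hand. Here is a minimal numpy sketch on made-up data (the variable names and numbers are my own, not from the post's data set): one clearly deviating salary is planted, and its Cook's distance stands out from the rest.

```python
import numpy as np

# Hypothetical toy data: salary explained by age. One planted outlier.
rng = np.random.default_rng(0)
age = rng.uniform(25, 60, size=30)
salary = 20000 + 300 * age + rng.normal(0, 500, size=30)
salary[0] = 60000  # plant one clear outlier

X = np.column_stack([np.ones_like(age), age])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
resid = salary - X @ coef

n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverage of each case
mse = resid @ resid / (n - p)
cooks = resid**2 / (p * mse) * h / (1 - h) ** 2  # Cook's distance per case

print(int(np.argmax(cooks)))  # the planted outlier (case 0) stands out
```

In SPSS the "Save" button stores the same quantity as a new variable, so you would sort or plot that column instead of computing the hat matrix yourself.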
Checking the multicollinearity
If you are unsure about the correlations, run a regression model: choose Analyze – Regression – Linear. Put all the x variables you want into your regression model plus the dependent variable (y), salary, then click the "Statistics" button and tick the box at "Collinearity diagnostics" (see below):
Then check the table in the output. You can see that two variables have lower values (red circles around the tolerance values), because these two variables are highly correlated with each other (r = -0.68). That doesn't have to be a problem in my model; it becomes more serious when the tolerance value is under 0.1, and especially around 0.01. What you do then is delete one of the variables with a low tolerance value from the model.
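If you want to see what the tolerance value actually is, here is a sketch on synthetic data (variable names are placeholders, not the post's data): each predictor is regressed on the other predictors, and tolerance is 1 minus the R-square of that auxiliary regression (VIF is 1/tolerance).

```python
import numpy as np

# Made-up predictors; x2 is deliberately correlated with x1.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = -0.8 * x1 + rng.normal(scale=0.6, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def tolerance(X, j):
    """Tolerance of predictor j = 1 - R^2 from regressing it on the others."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    ss_res = np.sum((X[:, j] - A @ coef) ** 2)
    ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return ss_res / ss_tot  # = 1 - R^2

for j in range(3):
    print(f"tolerance of x{j + 1}: {tolerance(X, j):.2f}")
```

The two correlated predictors get noticeably lower tolerance than the independent one, which is exactly the pattern circled in the SPSS output above.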
If you are unsure of how to build a regression, a stepwise regression can help you. The program will then choose only the best x variables for you, step by step: the best variable is included in step 1, the next best in step 2, and so on. Keep an eye on the "Adjusted R-square", as it should increase every time.
Below you can see the results from my stepwise regression:
In the result above you can see in footnote a. that "working_overtime" is the first x variable to be chosen, so it is the most important variable. The adjusted R-square says that this variable alone can explain 55% (0.55) of the salary variation. It is the adjusted R-square that can be compared between the models.
In step 2, the variable "working in office" comes into the model, but it only improves the model by 2.8%, to adjusted R-square = 0.578. The small increase is probably due to the correlation between these two x variables. In step 3, the variable "male" comes in, and this variable improves the model a lot, by nearly 6% (up to 0.639) in the adjusted R-square.
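The forward-stepwise idea can be sketched in a few lines of numpy. This is a simplified version (SPSS also applies entry/removal significance tests): at each step the predictor that raises adjusted R-square the most is added. The data and effect sizes are synthetic, loosely echoing the post's variables.

```python
import numpy as np

# Synthetic salary data; names are illustrative only.
rng = np.random.default_rng(2)
n = 300
overtime = rng.integers(0, 2, n)   # 0/1 coded, as in rule 2
male = rng.integers(0, 2, n)
age = rng.uniform(25, 60, n)
salary = 25000 - 6400 * overtime + 1000 * male + 30 * age + rng.normal(0, 800, n)

predictors = {"working_overtime": overtime, "male": male, "age": age}

def adj_r2(X, y):
    """Adjusted R-square of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    n_, p_ = A.shape
    return 1 - (1 - r2) * (n_ - 1) / (n_ - p_)

chosen = []
remaining = list(predictors)
while remaining:
    scores = {name: adj_r2(np.column_stack([predictors[c] for c in chosen + [name]]),
                           salary)
              for name in remaining}
    best = max(scores, key=scores.get)
    chosen.append(best)
    remaining.remove(best)
    print(f"step {len(chosen)}: add {best}, adjusted R^2 = {scores[best]:.3f}")
```

Because the simulated overtime effect dominates the salary variation, it enters first here too, mirroring the SPSS output described above.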
Check the residual plot
You can create that plot within the regression command if you click the "Plots" button, and then choose to plot ZRESID (Y) against ZPRED (X), see below:
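To make the plot less of a black box, here is how the two standardized quantities can be computed by hand on made-up data. With constant error variance (which is what rule 5 asks for), the spread of ZRESID stays roughly the same across ZPRED, with no fan shape.

```python
import numpy as np

# Homoscedastic toy data: constant error variance by construction.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 5 + 2 * x + rng.normal(0, 1, 200)

A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
resid = y - pred

zpred = (pred - pred.mean()) / pred.std(ddof=1)     # standardized predictions
zresid = (resid - resid.mean()) / resid.std(ddof=1) # standardized residuals

# Compare residual spread in the lower and upper halves of ZPRED:
low = zresid[zpred < 0].std(ddof=1)
high = zresid[zpred >= 0].std(ddof=1)
print(f"spread low: {low:.2f}, spread high: {high:.2f}")
```

If the high-ZPRED spread were clearly larger than the low-ZPRED spread, that would be the fan shape the post warns about.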
Understand the regression model’s result
If I do a multiple linear regression that is not stepwise, I choose the default method: Enter. I include all x variables that were chosen by the stepwise regression, but I skip "working in office" in this example. (I could have kept it, though.)
Look at the adjusted R-square to see if it is OK (it is good if it is over 60%).
Then look at the coefficient matrix, in the “Sig” column (see the red arrow below).
Here every x variable is important for the salary, as they all have significance values below 0.05.
Also have a look at the Beta coefficient column to compare which variable is most important for the model (see the green arrow below). We can see that "working_overtime" is by far the most important, with a value of -0.699 (the minus or plus sign doesn't matter here). In second place we have "male" with a value of 0.146.
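The Beta column is the unstandardized coefficient rescaled to one common scale, beta_j = b_j * sd(x_j) / sd(y), which is why it can be used for this kind of comparison. A sketch on synthetic data (names illustrative, effects chosen so overtime dominates):

```python
import numpy as np

# Made-up salary data with a big overtime effect and a small gender effect.
rng = np.random.default_rng(5)
n = 400
overtime = rng.integers(0, 2, n)
male = rng.integers(0, 2, n)
salary = 25000 - 6400 * overtime + 1000 * male + rng.normal(0, 800, n)

X = np.column_stack([np.ones(n), overtime, male])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)

# Standardized betas: unstandardized b rescaled by sd(x)/sd(y).
beta_overtime = b[1] * overtime.std(ddof=1) / salary.std(ddof=1)
beta_male = b[2] * male.std(ddof=1) / salary.std(ddof=1)
print(f"beta overtime: {beta_overtime:.2f}, beta male: {beta_male:.2f}")
```

As in the SPSS table above, the overtime beta is negative but much larger in magnitude than the male beta, and it is the magnitude that ranks importance.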
Then, to understand the effect each variable has on salary, look in the blue marked square.
Take for example "Age of employee": if we control for all other x variables, then for each year older you get 31 more "money" in monthly salary ("money" in this case is Swedish SEK).
Look at the variable "male": if we control for all other x variables, then if you are male you earn 1030 more in monthly salary than if you are female.
Look at the variable "working_overtime": if we control for all other x variables, then if you work overtime you earn 6434 less than the employees who do not work overtime.
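A good way to internalize what the unstandardized B column means is to simulate salaries with known effects close to the ones quoted above, refit the model, and read the effects back. Everything here is synthetic; the post's real data set is not available.

```python
import numpy as np

# Simulated salaries with known effects: +31 SEK per year of age,
# +1030 SEK for men, -6434 SEK for overtime workers.
rng = np.random.default_rng(4)
n = 500
age = rng.uniform(25, 60, n)
male = rng.integers(0, 2, n)
overtime = rng.integers(0, 2, n)
salary = 24000 + 31 * age + 1030 * male - 6434 * overtime + rng.normal(0, 300, n)

X = np.column_stack([np.ones(n), age, male, overtime])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)

# b[1], b[2], b[3] land close to 31, 1030 and -6434: each coefficient is the
# change in monthly salary for a one-unit change in that x, holding the
# other predictors fixed.
print(np.round(b[1:], 1))
```

This is exactly how the blue-marked B values in the SPSS table should be read: one-unit effects with the other variables held constant.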
So to understand the result of the regression, you must be aware of the variables you have put in, and remember that the model can look quite different if you take out one variable or bring in an extra one. So be careful and plan your work step by step when working with regression analysis.
Some people like to use stepwise regression as a help, but then you must know your variables so that you understand what happens in the different steps.
If you are interested in further training please get in touch!
That was all for now!