

EDU5950

SEM2 2010-11
CORRELATION &
SIMPLE REGRESSION

Correlation - Test of association

A correlation measures the "degree of association" between two variables (interval or ordinal).

Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).

Correlation is measured by "r" (parametric, Pearson's) or "ρ" (non-parametric, Spearman's).
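Both coefficients can be computed directly from their definitions. A minimal Python sketch (the attitude/frequency numbers below are invented for illustration, not from the slides; the Spearman version here assumes untied values):

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation "r" (parametric)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's "ρ" (non-parametric): Pearson's r computed on the ranks.
    (No tie handling; assumes all values are distinct.)"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    return pearson_r(ranks(x), ranks(y))

# Positive association: as attitude score rises, behavioural frequency rises
attitude = [1, 2, 3, 4, 5, 6]
frequency = [2, 4, 5, 6, 7, 9]
r = pearson_r(attitude, frequency)   # close to +1
```

Because Spearman's ρ works on ranks, it only asks whether the relationship is monotone, which is why it suits ordinal data.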


Test of association - Correlation

Compare two continuous variables in terms of degree of association
  e.g. attitude scale vs behavioural frequency

[Two scatterplots: one showing a positive association (points rising from left to right, labelled "Positive") and one showing a negative association (points falling from left to right, labelled "Negative").]

Test of association - Correlation

Test statistic is "r" (parametric) or "ρ" (non-parametric)
  0 (random distribution, zero correlation)
  1 (perfect correlation)

[Two scatterplots: one with points tightly clustered around a line (labelled "High") and one with points loosely scattered (labelled "Low").]

[Two further scatterplots for the test statistic: one labelled "High" and one labelled "Zero" (zero correlation vs perfect correlation).]

Regression & Correlation

A correlation measures the "degree of association" between two variables: interval (50, 100, 150, …) or ordinal (1, 2, 3, …).

Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).

Example: Height vs. Weight

Graph One: Relationship between Height and Weight [scatterplot, Weight (kgs) against Height (cms)]

Strong positive correlation between height and weight.
Can see how the relationship works, but cannot predict one from the other.
If 120cm tall, then how heavy?

Example: Symptom Index vs Drug A

Graph Two: Relationship between Symptom Index and Drug A [scatterplot, Symptom Index against Drug A (dose in mg)]

Strong negative correlation.
Can see how the relationship works, but cannot make predictions.
What Symptom Index might we predict for a standard dose of 150mg?

Example: Symptom Index vs Drug A

Graph Three: Relationship between Symptom Index and Drug A (with best-fit line)

The "best fit line" allows us to describe the relationship between the variables more accurately.
We can now predict specific values of one variable from knowledge of the other.
All points are close to the line.

Example: Symptom Index vs Drug B

Graph Four: Relationship between Symptom Index and Drug B (with best-fit line)

We can still predict specific values of one variable from knowledge of the other.
Will predictions be as accurate? Why not? "Residuals"
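The "predict from the best-fit line" step can be sketched numerically. The dose/response pairs below are invented to mimic the Drug A picture (the slides show only graphs), with the Symptom Index falling as dose rises:

```python
def fit_line(x, y):
    """Least-squares best-fit line ŷ = a + b·x:
    b = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)²,  a = ȳ - b·x̄."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Hypothetical dose/response data in the spirit of the Drug A example
dose = [50, 100, 150, 200, 250]      # Drug A dose in mg
symptom = [120, 100, 75, 55, 30]     # Symptom Index falls as dose rises

a, b = fit_line(dose, symptom)
predicted = a + b * 150              # predicted Symptom Index at a 150 mg dose
```

With a correlation alone we can only describe the trend; once the line is fitted, `a + b * 150` answers the "what would we predict at 150mg?" question directly.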

Correlation examples

Regression

Regression analysis procedures have as their primary purpose the development of an equation that can be used for predicting values on some DV for all members of a population.
A secondary purpose is to use regression analysis as a means of explaining causal relationships among variables.

The most basic application of regression analysis is the bivariate situation, which is referred to as simple linear regression, or just simple regression.
Simple regression involves a single IV and a single DV.
Goal: to obtain a linear equation so that we can predict the value of the DV if we have the value of the IV.
Simple regression capitalizes on the correlation between the DV and IV in order to make specific predictions about the DV.
The correlation tells us how much information about the DV is contained in the IV.
If the correlation is perfect (i.e. r = ±1.00), the IV contains everything we need to know about the DV, and we will be able to predict one from the other perfectly.
Regression analysis is the means by which we determine the best-fitting line, called the regression line.
The regression line is the straight line that lies closest to all points in a given scatterplot; this line sometimes passes through the centroid of the scatterplot.

3 important facts about the regression line must be known:
  - The extent to which points are scattered around the line
  - The slope of the regression line
  - The point at which the line crosses the Y-axis

The extent to which the points are scattered around the line is typically indicated by the degree of relationship between the IV (X) and DV (Y). This relationship is measured by a correlation coefficient: the stronger the relationship, the higher the degree of predictability between X and Y.

The degree of slope is determined by the amount of change in Y that accompanies a unit change in X. It is the slope that largely determines the predicted values of Y from known values for X.

It is important to determine exactly where the regression line crosses the Y-axis (this value is known as the Y-intercept).

Ø  Ø  Ø  Ø  Ø  The regression line is essentially an equation that express Y as a function of X. Ø  b is the slope of the regression line Ø  a is the Y-intercept Simple Linear Regression ♠ Purpose ► determine relationship between two metric variables ► predict value of the dependent variable (Y) based on value of independent variable (X) ♠ Requirement : ► DV Interval / Ratio ► IV Internal / Ratio ♠ Requirement : Ø  The independent and dependent variables are normally distributed in the population Ø The cases represents a random sample from the population 9 . The basic equation for simple regression is: Y = a + bX where Y is the predicted value for the DV. X is the known raw score value on the IV.

Simple Regression

How best to summarise the data? Adding a best-fit line allows us to describe the data simply.

[Two scatterplots of Symptom Index against Drug A (dose in mg), without and with a best-fit line.]

General Linear Model (GLM)

Establish the equation for the best-fit line:

  Y = a + bX

where:
  a = Y-intercept (constant)
  b = slope of best-fit line
  Y = dependent variable
  X = independent variable

Simple Regression

R² - "Goodness of fit"

For simple regression, R² is the square of the correlation coefficient.
  - Reflects the variance accounted for in the data by the best-fit line
  - Takes values between 0 (0%) and 1 (100%)
  - Frequently expressed as a percentage rather than a decimal
  - High values show good fit; low values show poor fit

Low values of R²:
  R² = 0 (0%): randomly scattered points, no apparent relationship between X and Y.
  Implies that a best-fit line will be a very poor description of the data.

[Scatterplot of DV against IV (regressor, predictor) with randomly scattered points.]
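The "variance accounted for" reading of R² can be made concrete by computing it as 1 minus the residual share of the total variance. A small sketch with illustrative data (not from the slides):

```python
def fit_line(x, y):
    # least-squares slope and intercept for the best-fit line ŷ = a + b·x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((p - mx) * (q - my) for p, q in zip(x, y)) / \
        sum((p - mx) ** 2 for p in x)
    return my - b * mx, b

def r_squared(x, y):
    """R² = 1 - SS_residual / SS_total: the share of the variance in y
    accounted for by the best-fit line."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_tot = sum((q - my) ** 2 for q in y)                          # total
    ss_res = sum((q - (a + b * p)) ** 2 for p, q in zip(x, y))      # residual
    return 1 - ss_res / ss_tot

# Points lying exactly on a line -> R² = 1 (100% of variance explained)
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))   # 1.0
```

For scattered data the residual sum of squares grows relative to the total, pulling R² down toward 0, which is exactly the "poor fit" case in the slide.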

Simple Regression

High values of R²:
  R² = 1 (100%): points lie directly on the line, a perfect relationship between X and Y.
  Implies that a best-fit line will be a very good description of the data.

[Scatterplot of DV against IV with all points on the fitted line.]

R² - "Goodness of fit"

[Two scatterplots: Symptom Index against Drug A (dose in mg), good fit ⇒ R² high, high variance explained; Symptom Index against Drug B (dose in mg), moderate fit ⇒ R² lower, less variance explained.]

Problem: to draw a straight line through the points that best explains the variance. The line can then be used to predict Y from X.

Example: Symptom Index vs Drug A

Graph Three: Relationship between Symptom Index and Drug A (with best-fit line)

The "best fit line" allows us to describe the relationship between the variables more accurately. We can now predict specific values of one variable from knowledge of the other. All points are close to the line.

Regression

Establish the equation for the best-fit line: Y = a + bX
  - The best-fit line is the same as the regression line
  - b is the regression coefficient for x
  - x is the predictor or regressor variable for y

Regression - Types

Linear Regression - Model

Population model:  Yi = β0 + β1·Xi + εi   (β0 the constant, β1 the population regression coefficient)

Sample estimate:   Ŷ = a + bX

The population parameters β0 and β1 are simply the least-squares estimates computed on all the members of the population, not just the sample.
  - Population parameters: β0 and β1
  - Sample statistics: a and b
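The parameter/statistic distinction can be illustrated by simulation: choose β0 and β1, generate data from Yi = β0 + β1·Xi + εi, and watch the sample statistics a and b land near the population parameters. A sketch with arbitrarily chosen values:

```python
import random

random.seed(0)                      # reproducible illustration
beta0, beta1 = 10.0, 2.5            # population parameters (chosen arbitrarily)

# Draw a sample from the population model  Yi = β0 + β1·Xi + εi
x = [random.uniform(0, 10) for _ in range(500)]
y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]

# Least-squares sample statistics a and b estimate β0 and β1
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
```

With a larger sample (or smaller error variance ε) the estimates a and b cluster ever more tightly around β0 and β1.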

Inference About the Population Slope and Intercept

Y = β0 + β1·X + ε

The line β0 + β1·X is the mean of Y for those whose independent variable is X.

If β1 > 0, the line slopes upward (the mean of Y increases with X); if β1 < 0, the line slopes downward (the mean of Y decreases with X).

Inference About the Population Slope and Intercept

Y = β0 + β1·X + ε

If β1 = 0, the line β0 + β1·X is flat: the mean of Y does not depend on X, so Y and X are independent.

Copyright (c) Bani K. Mallick

Linear Regression and Correlation

Y = β0 + β1·X + ε

If β1 = 0 then Y and X are independent. So we can test the null hypothesis H0: that Y and X are independent by testing H0: β1 = 0. The p-value in regression tables tests this hypothesis.
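The test of H0: β1 = 0 uses the t statistic t = b / SE(b). A minimal sketch (the six data points are invented for illustration); the resulting |t| is compared with the critical t on n-2 degrees of freedom:

```python
import math

def slope_t_statistic(x, y):
    """t = b / SE(b) for testing H0: β1 = 0 (Y and X independent).
    SE(b) = s / sqrt(Σ(x-x̄)²), where s² is the residual variance on n-2 df."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))          # residual standard error
    return b / (s / math.sqrt(sxx))

# Illustrative data with a clear positive trend
t = slope_t_statistic([1, 2, 3, 4, 5, 6], [2, 4, 5, 8, 9, 13])
# compare |t| with the critical value on n-2 = 4 df (2.776 at α = .05, two-tailed)
```

Statistical packages convert this t into the p-value shown in their regression tables; a small p-value leads us to reject H0: β1 = 0.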

Ice Cream Example

[Table and scatterplot of Ice Cream Sales (Y) against Temperature (X) for 21 observations, with the fitted simple regression line Ŷ = a + bX.]

Simple Regression Line

TWO STEPS TO SIMPLE LINEAR REGRESSION

Descriptive:
  - Regression equation: Ŷ = a + bX
  - Correlation coefficient (r)
  - Coefficient of Determination (r²)

Inferential:
  - Hypothesis test: 1. Regression model  2. Slope

First Step - Descriptive

Derive the regression / prediction equation Ŷ = a + bX: calculate a and b, with a = ȳ - b·x̄.

Example 1: Data were collected from a randomly selected sample to determine the relationship between average assignment scores and test scores in statistics. The distribution for the data is presented in the table below.
  1. Determine the prediction equation.
  2. Calculate the coefficient of determination and the correlation coefficient.
  3. Test the hypothesis for the slope at the 0.05 level of significance.

Data set:

  ID      1    2   3   4   5   6   7   8   9    10
  Assign  8.5  6   9   10  8   7   5   6   7.5  5
  Test    88   66  94  98  87  72  45  63  85   77

1. Derive the regression / prediction equation

Summary statistics:
  n = 10, ΣX = 72, ΣY = 775, ΣX² = 544.5, ΣY² = 62,441, ΣXY = 5,795.5
  x̄ = 7.2, ȳ = 77.5

  b = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²) = 2155 / 261 = 8.257
  a = ȳ - b·x̄ = 77.5 - 8.257(7.2) = 18.05

Prediction equation: Ŷ = 18.05 + 8.257X

Interpretation of the regression equation (b = ΔY/ΔX): for every 1-unit change in X, Y will change by 8.257 units.
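The derivation above can be checked numerically from the Example 1 data with the same computational formula:

```python
# Example 1 data from the slides: assignment scores (X) and test scores (Y)
X = [8.5, 6, 9, 10, 8, 7, 5, 6, 7.5, 5]
Y = [88, 66, 94, 98, 87, 72, 45, 63, 85, 77]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

# computational formula for the slope
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)          # a = ȳ - b·x̄

print(round(b, 3), round(a, 2))          # 8.257 18.05
```

This reproduces the slide's slope and intercept, confirming the summary statistics.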

Example 2: Marital Satisfaction

  Parents (X):   1  3  7  9  8  4  5
  Children (Y):  3  2  6  7  8  6  3

  No. of pairs = 7, ΣX = 37, ΣY = 35, ΣX² = 245, ΣXY = 217
  x̄ = 5.29, ȳ = 5.00

1. Derive the regression / prediction equation

  b = .65
  a = ȳ - b·x̄ = 5.00 - .65(5.29) = 1.56

Prediction equation: Ŷ = 1.56 + .65X
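The Example 2 equation can be checked the same way (carrying b at full precision gives an intercept near 1.58; rounding b to .65 before computing a gives 1.56):

```python
# Example 2 data from the slides: parents' scores (X) and children's scores (Y)
X = [1, 3, 7, 9, 8, 4, 5]
Y = [3, 2, 6, 7, 8, 6, 3]

n = len(X)
b = (n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)) / \
    (n * sum(x * x for x in X) - sum(X) ** 2)
a = sum(Y) / n - b * (sum(X) / n)        # a = ȳ - b·x̄

print(round(b, 2), round(a, 2))          # 0.65 1.58
```

The small difference in the intercept (1.56 vs 1.58) is purely a rounding effect in the hand calculation.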

Interpretation of the regression equation Ŷ = 1.56 + .65X (b = ΔY/ΔX): for every 1-unit change in X, Y will change by .65 units.

SPSS Example: Grade PMR MATH regressed on TEACHER_FACTOR (N = 62)

Descriptive Statistics
  Grade PMR MATH: Mean 3.53, Std. Deviation 1.9643, N = 62
  TEACHER_FACTOR: N = 62

Correlations
  Pearson correlation, Grade PMR MATH with TEACHER_FACTOR: r = .571, Sig. (1-tailed) = .000, N = 62

Model Summary
  R = .571, R Square = .326, Adjusted R Square = .315, Std. Error of the Estimate = 1.91443
  a. Predictors: (Constant), TEACHER_FACTOR
  b. Dependent Variable: Grade PMR MATH

ANOVA
  Regression SS = 42.848 (df 1), Residual SS = 88.588 (df 60), Total SS = 131.435 (df 61); the regression model is significant (Sig. = .000).

Coefficients
  TEACHER_FACTOR: standardized Beta = .571, Sig. = .000; the slope differs significantly from zero.

A second model adds Race as a predictor (Predictors: (Constant), TEACHER_FACTOR, Race):
  R = .572, R Square = .327 (Regression SS = 42.939 on df 2; Total SS = 131.435 on df 61).
  Race adds almost no explained variance, and its coefficient is not significant (Sig. = .806).
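With a single predictor, the SPSS output is internally linked: R Square is just the squared correlation, and the significance of the slope can be recovered from r and N alone via t = r·√(n-2)/√(1-r²). A quick consistency check on the reported values:

```python
import math

# Values reported in the SPSS Model Summary on the slides
r, n = 0.571, 62

R2 = r ** 2                                        # R Square = r²
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t statistic for H0: β1 = 0

print(round(R2, 3))                                # 0.326, the reported R Square
```

The computed t is well above the two-tailed critical value of about 2.0 on 60 df, matching the reported Sig. = .000 for the TEACHER_FACTOR slope.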