Highline Class, BI 348 Basic Business Analytics using Excel Chapter 04: Linear Regression 1 Topics 1. Decisions Based on Relationship Between Two or More Variables 2. Regression Analysis 3. Scatter Chart 4. Types of Relationships 5. Scatter Chart and Ybar and X Bar Lines 6. Covariance and Correlation 7. Simple Liner Regression Model 8. Assumptions about Error in Model 9. Simple Liner Regression Equation
10. Estimated Simple Linear Regression Equation 11. Calculating Slope & Y-Intercept using the Least Squares Method 12. Experimental Region 13. How to Interpret Slope and Y-Intercept 14. Prediction with Estimated Simple Liner Regression Equation 15. Residuals 16. Coefficient of Determination or R Squared 17. SST = SSR + SSE 18. Standard Error 19. Data Analysis Regression feature 20. LINEST Function 21. Multiple Regression 22. How to Interpret Slope and Y-Intercept in Multiple Regression 23. Testing Significance of Slope and Y-Intercept (Inference) 24. Multicollinearity 25. Categorical Variable
26. Inference in Vary Large Samples 2 Decisions Based on Relationship Between Two or More Variables Managerial decisions are often based on the relationship between two or more variables Predict/Estimate Sales (Y) based on: Advertising Expenditures (x1) Predict/Estimate Annual Amount Spent on Credit Card (Y) based on: Household Annual Income (x1) Education (x2). Predict/Estimate Bike Price (Y) based on: Bike Weight (x1)
Predict/Estimate Stroke (Y) based on: Age (x1) Blood Pressure (x2) Smoking (x3) 3 X Y Data Independent Variable = x Predictor variable Dependent Variable = y = f(x) Variable that is predicted or estimated Response variable
4 Regression Analysis Regression Analysis A statistical procedure used to develop an equation showing how two or more variables are related. Allows us to Build Model/Equation to help Estimate and Predict. The entire process will take us from: Taking an initial look at data to see if there is a relationship. Creating an equation to help us estimate/predict. Assessing whether equation fits the sample data. Use Statistical Inference to see if there is a significant relationship. Predict with the equation.
Regression Analysis does not prove a cause and effect relationship, but rather it helps us to create a model (equation) that can help us to estimate or make predictions. Simple Regression Regression analysis involving one independent variable (x) and one dependent variable (y). Linear Regression Regression analysis in which relationships between the independent variables and the dependent variable are approximated by a straight line. Simple Linear Regression
Relationship between one independent variable and one dependent variable that is approximated by a straight line, with slope and intercept. Multiple Linear Regression Regression analysis involving two or more independent variables to create straight line model/ equation. Curvilinear Relationships (not covered in this class) Relationships that are not linear. 5 Scatter Chart to See If There Is a Relationship
Graphical method to investigate if there is a relationship between 2 quantitative variables Excel Charting: Independent Variable = x Horizontal Axis Left most column in data set Dependent Variable = y = f(x) Vertical Axis Column to right of x data column Always label the x and y axes. Use an informative chart title. Goal of chart: Visually, we are looking to see if there is a relationship pattern.
For our Sales (x) Ad Expense (y) data we see a direct relationship. To get estimated line & equation and r^2, rightclick markers in chart and click on Add Trendline. Then click dialog button for Linear and the checkboxes for Display equation on chart & Display R^2 on chart. Learn about equation & r^2 later 6 Types of Relationships Investigate if there is a relationship: With the Scatter Chart, you look to see if there is a relationship.
Looks like As x increases, y increases. Direct or Positive Relationship Looks like As x increases, y decreases. Inverse or Indirect or Negative Relationship Looks like No Relationship 7 Baseball Data Scatter Charts 8 Covariance and Correlation: Numerical Measures to
Investigate if There is a Relationship These numerical measures will be more precise that the Positive, Negative No Relationship (also Little Relationship) categories that the Scatter Chart gave us. Numerical measures to investigate if there is a relationship between two quantitative variables. 9 Scatter Chart and Ybar and X Bar Lines Scatter Charts are graphical means to find a relationship between 2 quantitative variables. We need a numerical measure that is more precise than our Scatter Chart. To understand how the numerical measure can do this, we plot a Ybar line and Xbar line on our chart. 10
Covariance Measure of the linear relationship between two quantitative variables. 1. Positive values indicate a positive relationship; negative, a negative relationship. 2. Close to zero means there is not much of a relationship. 3. The magnitude of covariance is difficult to interpret. 4. Covariance has problems with units (like feet compared to inches). 5. We can standardize covariance by dividing it by sx*sy to get Coefficient of Correlation.
In Excel use COVARIANCE.S function for sample data: Y data first, x data second 11 12 Coefficient of Correlation (rxy) Measure the strength and direction of the linear relationship between two quantitative variables. A relative measure of strength of association (relationship) between 2 variables or a measure of strength per unit of standard deviation, sx * sy . Solves Covariance units/ magnitude problem.
In Excel use CORREL or PEARSON functions. Investigate if there is a relationship: We will have a number answer that indicates the strength and direction: 1. 2. 3. 4. 5. 6. Always a number between -1 and 1. 0 = No correlation Near 0.5 of -0.5 = moderate correlation Near -1 or 1 = strong correlation
Does not have problems with units like Covariance does. Can only be used for one independent variable to measure a linear relationship As opposed to Coefficient of Determination (r squared or Goodness of Fit Test), which can be used for 1 or more independent variables and for linear or non-liner relationships Note: Because the Correlation Coefficient measure the strength and direction of a LINEAR relationship, not nonlinear relationships. If you get a correlation measure near zero, it may be true that there is a very weak linear relationship, but that does not say that there is not some other sort of non-linear relationship. 13
14 Ad Expenditures / Sales Example: Covariance and Correlation in Excel Covariance and Correlation look very strong. 15 Analyst Wants to See if There is a Relationship Between BMX Racing Bike Weight and Price. Negative/Inverse Relationship: As Bike Weight increases, Price decreases. Covariance & Correlation are negative & strong.
16 Ad Expenditures / Sales Example: Chart, Covariance and Correlation Indicate a Relationship Now that we see that there is a relationship between the two variables, Weekly Ad Expense and Weekly Sales, through our Scatter Chart, Covariance & Correlation: We choose the simple linear regression to create an equation that will allow use to predict and estimate sales based on advertising expenditures. 17 Overview: Simple Linear Regression Algebra: f(x) = y = m*x + b Statistics: Yhat = = b1*x + b0 (sample statistics)
y = 1x + 0 (population parameters) Slope = m = b1 = 1 = For every one unit of x, how much does y change? Intercept = b = b0 = 0 = At what point does line cross y-axis? or what is y, when x = 0? The equation describes a straight line. 18 Simple Liner Regression Model with Population Parameters 1. Simple Linear Regression Model: y = 1 x + 0 +
y = Predicted value 1 = Slope 1 = Beta sub 1 x = Value you put into the equation to try and predict the y value. 0 = Y-Intercept 0 = Beta sub 0 = Error Value = random variable that accounts for the variability in y that cannot be explained by the liner relationship between x and y. = Epsilon 19 Because Not All Sample Points Are On The Estimated Line
We Will Get Some Error () 20 Assumptions About The Error Value () Necessary for the Least Squares Method of calculating b1 and b0. 1. The assumption of bell shape for errors, indicates that right on the line, the mean of the error value at any particular x is zero. E() = 0. This means that we can use the slope and intercept (1 & 0) as constants. 2. Total population can be thought of as having sub-populations.
For each x value there is a range of possible y values (sub-population). The Bell Shaped distribution is an assumption about the possibility of getting a y value above or below the line for a given x value. 3. The error () variation will be constant E(y|x) = 1x + 0 Describes the line down the middle, where = 0. Is the mean of all the y values and sits exactly on the line.
21 Simple Liner Regression Equation with Population Parameters 2. Simple Linear Regression Equation: E(y|x) = 1x + 0 E(y|x) = Expected Value or Mean of all the y values at a particular x value. E(y|x) = 1x + 0 describes s straight line down the middle, where = 0. 22 Sample Slope and Y-Intercept Because population parameters for slope & intercept are not usually known, we
estimate them using sample data in order to calculate sample statistics for slope and y-intercept. 23 Estimated Simple Linear Regression Equation with Sample Statistics Estimated Simple Linear Regression Equation: 3. yhat = = b1x + b0 = y hat b1 = Slope = a sample statistic that estimates the population parameter 1 = slope of estimated regression line.
b0 = y-intercept = a sample statistic that estimates the population parameter 0 = y-intercept of estimates regression line. gives us two possible interpretations: 1. = Point estimator of E(y|x) = Estimates mean of all y values for a given x in population. or 2. = Can Predict individual y value for a particular business situation. Graph of the estimated simple linear regression equation is called estimated regression line. 24 Estimation Process for Simple Linear Regression 25
Overview: Least Squares Method to Derive Formula for b1 & b0 **For proof of formulas, see downloadable pdf file. 26 Formulas for estimated Slope (b1) Y-intercept (b0) The formulas used to estimate b1 and b0 can be derived using Differential Calculus in what is called the Least Squares Method.
**For proof of formulas, see downloadable pdf file. 27 Ad Expenditures / Sales Example: Calculate Slope and Y-intercept 28 Experimental Region We are not sure if the relationship is linear outside our x sample data range. It is best to make predictions over the range of the min and max of the sample x data. This range is called the
Experimental Region When you make predictions outside the Experimental Region, it is called Extrapolation The y-intercept is often estimated using Extrapolation We do not have empirical evidence that the relationship holds outside the experimental region; an estimate created outside the experimental region is an unreliable estimate. 29 Bike Weight/ Bike Price Example:
Calculate Slope and Y-intercept from Sample Data, Make a Prediction Can we expect that a bike can have a price of $1,732.40 (x = 0)? The y-intercept was estimated using Extrapolation The range for these entry level racing bikes is about 20 to 27 lbs. There are specialized pro BMX racing bikes that weigh less than 20 lbs., but they can cost upwards of $3,000. Our equation would predict that a bike weighing 17 lbs. would be $736.14 It is best to not use equation outside our range of min & max x sample values, our Experimental Region. 30 How to Interpret Slope and Y-Intercept Slope What is the correct interpretation of slope? For every $1 increase in Ad Expense, the
equation predicts that Weekly Sales will increase by $8.24. Y-Intercept What is the correct interpretation of the yintercept? When x = 0, the equation predicts that the Weekly Sales will be $20,406. Because 0 is not in our Experimental Region, extrapolating what the sales will be at x = 0 yields an unreliable estimate. We still can use it to make predictions over the Experimental Region. 31 Prediction with Estimated Simple Liner Regression Equation
= 8.24 * x + 36857.04 Manager plans to spend $43,000 on ads next week. X = 43,000 = 8.24 * 43000 + 36857.04 = $391,243.63 Our prediction of $391,243.63 is reasonable because our model is based on sample evidence and we predicted using an x value that is within the Experimental Region. 32 Does the Equation Predict Perfectly?
We need to examine something called Residuals. 33 Residuals = Y1 = Particular Y value Predicted Y Value Sometimes the Equation Underpredicts Sometimes the Equation Overpredicts Vertical lines on chart represnet residuals or
Errors in using Predicted Value as compared to Actual Y Value. 34 Residuals = Predicted Values = Calculate predicted values using Estimated Equation at each FORECAST function in Excel can be used for predicted
values. All it needs is an x value and the known y and x values and it will calculate the predicted value using the Estimated Regression Equation Slope and Intercept. Residuals = Particualr value Predicted value Distance that Original Y Sample Value is above or below the Estimated Line Vertical lines on chart represnet residuals Note 1: Sum of Yi values = Sum of predicted values (all i ) Note 2:
Sum of residuals = 0 Note 3: Residuals Squared are minimized becasue we calculated predicted values with b1 & b0 35 Ybar Line vs. Estimated Simple Linear Equation Line for Making Predictions Ybar Model Yhat Model These errors are called Deviations
A Lot of Error when Making Predictions These errors are called Residuals Much Less Error. Manager Now Has a Better Model For Predicting As Compared to a Ybar Model. 36 How Well Does Estimated Equation/Line Fit the Sample Data? The measure we will use is called:
Coefficient of Determination or R Squared. It is a measure of Goodness of Fit. 37 If All Sample Points Fell on Estimated Line, There Would Be No Errors 38 Error in Predicting Y Using Equation 39 Total Error if we used just Ybar
40 Two Parts in Total Error 41 Total Error = Explained +Unexplained Total Error = Regression +Error 42 Comparing Regression or Error to Total to measure Goodness of Fit If we would like to compare the total Regression or Error to the Total Error, the problem is: If we want to add up all Total Error or all Unexplained or all Regression:
We would get zero!!!! No problem: We square each before adding!!!! 43 Squaring Regression or Error to Total to measure Goodness of Fit Helps with the Zeros. 44 Squaring & then Summing Total, Regression & Error
=1 =1 =1 2 2 2 ^ ^ = ( ) = ( ) = ( )
Sum of Squares Total. Measure of error involved in using Ybar to make a prediction. How well sample points cluster around Ybar line. Sum of Squares due to Regression.
Measure of how far Predicted Value is away from Ybar. Amount of SST that is explained by the Estimated Regression Equation/Line. Explained part of SST. If all Sample Points fall on Estimated Line, SSR = SST.
Sum of Squares due to Error. Measure of how far away Particular Value is away from Predicted Value. Amount of SST that is unexplained by the Estimated Regression Equation/Line. Unexplained part of
SST. How well sample points cluster around line If all Sample Points fall on Estimated Line, SSE 45 How to Think About SST and SSE SST SST = Measure of Total Error in using Ybar Line as Compared to Sample Data Points. SSE
SSE = Total Amount of Error in using Estimated Simple Linear Equation Line as Compared to Sample Data Points. 46 How to Think About SSR SSR = Total Amount that Estimated Simple Linear Equation Explains Over Ybar Line. Measure of How Much Better Using Yhat is for Making Predictions than
using Ybar. 47 Relationship Between SST, SSR and SSE =1 =1 =1
2 2 2 ^ ^ = ( ) = ( ) = ( ) SST = SSR + SSE Because SST = SSR + SSE, we can compare the parts to the whole. SSR/SST = Proportion of SST that Estimated Regression Equation/Line Explains. SSE/SST = Proportion of SST that is NOT Explained by the Estimated Regression Equation/Line.
48 Coefficient of Determination Coefficient of Determination = = = 1 - = r^2 results in a number between 0 and 1 The closer to 1, the better the fit of the Estimated Equation to the X-Y Sample Data Points. r^2 = 1 all X-Y sample markers fall on the Estimated Regression Line. SST = SSR SSE = 0
r^2can be thought of as: Measure of Goodness of fit of Estimated Regression Equation/Line to the X-Y Sample Data Points. The proportion of the Sum of Squared Total that can be explained by using the Estimated Regression Equation. The proportion of the variability in the dependent variable (y) that is explained by the Estimated Regression Equation/Line. In Excel the RSQ function can be used to calculate r^2 using just the sample x & y data.
49 Ad Expenditures / Sales Example: Calculate Coefficient Of Determination 50 The Closer to 1, the Better the Fit. 51 Coefficient of Determination From page 137 in our Essentials of Business Analytics textbook (ISBN10: 1-285-18726-1): For typical data in the social and behavioral sciences, values of r^2 as low as 0.25 are often considered useful.
For data in physical and life sciences, r^2 values of 0.60 or greater are often found, and in some cases , r^2 values greater than 0.90 can be found. In business applications, r^2 values vary greatly, depending on the unique characteristics of each applications. 52 Compare Coefficient Of Determination & Coefficient Of Correlation C. Of Correlation = rxy C. Of Determination = r^2 rxy = (Sign of b1)*SQRT(r^2). Number between -1 and 1. Measures strength and direction of
liner relationship between one independent variable and one dependent variable. Only for liner relationships. Only for one independent variable. r^2 = (rxy)^2. Number between 0 and 1. Measure strength and goodness of fit of relationship. Can be used on linear or nonlinear relationships. Can be used for one or more independent variables. Referred to as R^2 in Multiple Regression
53 Estimates for Variance & Standard Deviation of the Estimated Regression Equation 2 = ( ^ ) = = Estimate of Variance for Regression Equation = Estimate of Standard Deviation for Regression Equation, called Standard Error of Estimate or Standard Error of y If you already have the residual values calculated, you can use the Excel function STEYX to calculate Standard Error of Estimate. All it needs are the x and y sample point values.
q = # of independent variables. 54 Standard Error of the Estimate 55 Bike Weight/ Bike Price Example: Calculate Coefficient Of Determination & Standard Error 56 Remember Busn 210: How Fairly A Statistic Represents Its Data Points.
Measuring How Fairly The Mean Represents Its Data Points with Standard Deviation 2 ( ) 1 Measuring How Fairly the Estimated Regression Equation Represents Its Data Points with Standard Deviation of the Estimated Line or Standard Error of the Estimate
^ ( ) 1 2 57 Degrees of Freedom, page 150 in textbook Degrees of freedom represents the number of independent units of information in a calculation. In general, Degrees of Freedom = df = n - # of estimated parameters. Both of these would become:
= 2 ( ) 2 ^
( ) = 58 Data Analysis, Regression feature Step 1: Dialog Box 59 Data Analysis, Regression feature Step 2: Output 60
What Regression Output Means 61 LINEST Array Function to deliver 10 statistics for Linear Regression Highlight one more column than there are independent variables and 5 rows. Remember to verify that curly brackets are present after you used Ctrl + Shift + Enter. Enter Array Function with
keystrokes Ctrl + Shift + Enter. 62 Multiple Regression We predict a y value based on two or more independent variables For example: Predict/Estimate Annual Amount Spent on Credit Card (Y) based on: Household Annual Income (x1) Education (x2). Predict/Estimate Stroke (Y) based on: Age (x1) Blood Pressure (x2) Smoking (x3)
63 Multiple Regression Model & Equation y = 0 + 1x1 + 2x2 + + + + + qxq + Equation E(y|x1,x2,xq) = 0 + 1x1 + 2x2 + + + + + qxq Model y = dependent variable Equation x1, x2, . . . , xq = independent variables
0, 1, 2, . . . , q = parameters = error term (variability in y that cannot be explained by the linear effect of the q independent variables) For the Model, y is a linear function of x1, x2, . . . , xq and the error term . 64 Estimated Multiple Regression Equation Estimated Multiple Regression Equation: = b1x1 + b2x2 + + + + + bqxq + b0 q = Number of independent variables b0 = Point estimate of 0 = Y-intercept = estimate of expected value (mean) of dependent variable y when all the independent variables are equal to zero. b1, b2, bq = Point estimate of Slope 1 , 2 , q = the change in mean value of the y (dependent variable), that corresponds to a one unit increase in the independent variable
xi , while holding all of the other independent variables constant (fixed). = estimated mean value of the dependent variable 65 Coefficient of Determination Coefficient of Determination (R^2) can determine the goodness of fit of the Regression Equation to the Sample data for two or more independent variables. When the number of independent variables is increased, R^2 can never decrease. SSR cannot become smaller. SSE cannot become bigger. Coefficient of Determination Simple Linear Regression use: r^2 Multiple Regression use: R^2
66 Calculating Slope and Y-Intercept for Multiple Regression In Excel we can use: Data Analysis, Regression feature Can be used for Simple and Multiple Linear Regression LINEST Array Function Can be used for Simple and Multiple Linear Regression 67 Multiple Regression for Credit Card Company Example: Goal: Credit Card company wants to analyze new customers to
see how much they will spend on a credit card each year. Predict/Estimate Annual Amount Spent on Credit Card (Y) based on: Household Annual Income (x1) Education (x2). 68 If We Just Tried To Build Equation Based On Annual Income (X1) r^2 Goodness of Fit of Equation to Data
69 Now We Add More Independent Variables r^2 increases 70 Estimated Multiple Regression Equation For Predicting: = Annual Income.
= # Years Post High School Education. = Predicted Annual Credit Card Charges. b1 =Holding # Years Post High School Education fixed, for every $1 of Annual Income Increase, Annual Credit Card Charges increase by $0.1468. b2 =Holding Annual Income fixed, for every 1 year of education past HS, Annual Credit Card Charges increase by $771.25. Experimental Region for x1 is $9,900 to $134,000. Experimental Region for x2 is 1 to 8 Reliable predictions come when x values are entered into equation only over the Experimental Region for each x. = 0.1468*54,000+771.253*2+1943.074 = $7,529.09 Interpreting Slopes & Intercept in Multiple Regression
b1 =Holding # Years Post High School Education fixed, for every $1 of Annual Income Increase, Annual Credit Card Charges increase by $0.1468. b2 =Holding Annual Income fixed, for every 1 year of education past HS, Annual Credit Card Charges increase by $771.25. b0 = Y-intercept. Outside our Experimental Region. Extrapolation does not make sense. 72 Statistical Inference Take sample from population. Calculate a statistic that estimates a population parameter. From the sample evidence makes reasonable statements about the population parameter. Examples from Busn 210: Hypothesis Testing we made statements like: At alpha of 0.05, it is reasonable to assume that the mean annual cost of an insurance policy is $724.
Confidence Intervals we made statements like: Because the sample statistic was between our Upper and Lower Limit for our 95% Confidence Interval, it is reasonable to assume that the mean annual cost of an insurance policy is $724. Statistical Inference (textbook): Process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of a sample drawn from the population. 73 Inference and Regression In Regression Analysis, we took a sample and estimated the parameters for slope and y-intercept. Our sample statistics for slope and y-intercept are our point estimates of the population parameters, slope and y-intercept. Now we need to test to see whether or not our point estimates are reasonable or
not. However: Before we can test the reasonableness of the slope and y-intercept, we must check to see if the assumptions necessary to use the Least Squares Regression Model are valid 74 Conditions Necessary for Valid Inference with the Least Squares Model For any given combination of values of the independent variables x1, x2,, xq, the population of potential error terms must be: 1. Normally Distributed (Bell Shaped)
2. Has a mean of zero 3. Has a constant variance 4. The values of are statistically independent In general, this is a concern only when we collect data from single entity over time, like in a time series We can visually test assumptions by examining Residual Plots 75 Implication of Assumptions Normally Distributed (Bell Shaped)
Implication: Because y is a linear function of , y is normally distributed random variable for all values of x. has a mean of zero, E() = 0 Implication: The slopes and intercepts are constants and we can use the line to estimate or predict Has a constant variance Implication: variance is same for all x The values of are statistically independent Linear model can be used without an adjustment for seasonality or other cyclical
patterns 76 Visually Testing Assumptions: Plot Residuals Against X Values Markers above Zero Line indicate a Positive Residual, which is a Sample Point that is above the Predicted Value. The equation under estimated as compared to the actual Sample Point.
Markers below Zero Line indicate a Negative Residual, which is a Sample Point that is below the Predicted Value. The equation over estimated as compared to the actual Sample Point. If assumptions are met we should see consistent markers above and below the Zero Line with a higher frequency near the Zero Line. This plot does not supply evidence that would support rejecting the assumptions. 77 You can manually plot Residuals Against x or use Data Analysis, Regression feature Manual
Data Analysis, Regression feature 78 Visually Testing Assumptions: Plot Residuals Against X Values 1. term (and thus dependent variable y) is Normally Distributed. If this assumption is met we should see: The frequency of values near Zero Line should be greater than frequency of values away from Zero Line. 2. Has a mean of zero. If this assumption is met we should see: For a given x value the center of the spread of the errors should
be near the Zero Line. About same number of values above and below the Zero Line. 3. Has a constant variance. If this assumption is met we should see: Errors are symmetrically distributed above and below and along the Zero Line across the x values. Spread of errors looks similar for each x. This plot does NOT provide evidence of a violation of the conditions necessary for valid inference in regression. The implication of 1) bell shaped errors, 2) mean = 0 & 3) constant variance is that the point estimates are unbiased (do not tend to underpredict or overpredict), for any given combination of values of the independent variables x1, x2,, xq .
79 If assumptions are met, point estimates tend to not underpredict or overpredict (unbiased estimate) 80 Inference in Regression is Generally Valid Unless You See Marked Violations Such as These: These plots provide strong evidence of a violation of the conditions necessary for valid inference in regression.
Equation does not reflect relationship between x & y Not consistent Variance Collect data from single entity over time. Time on horizontal axis. Residuals not Independent. Skewness indicates Not Normal Distributed
81 * From Essentials of Business Analytics textbook ISBN10: 1-285-18726-1 You Can Also Plot Predicted Values Against X Values To Check Assumptions About (Manual Method Only) 82 If Residual Plots Show Assumptions Are Met: We can run Hypothesis test to check reasonableness of regression parameters 0, 1, 2, . . . , q . We can create Confidence Intervals for our predicted values of our dependent variable. Hypothesis Testing (Busn 210) A Statistical procedure that uses sample evidence & probability theory to determine whether a statement about the values of a parameter are reasonable (reliable) or
not. Confidence Intervals (Busn 210) From sample data we calculate a lower & upper limit to make probability statement about how sure we are that the population parameter will lie between the lower and upper limit. 95% Confidence Intervals mean that if we constructed 200 similar intervals, about 95 would contain the population parameter and 5 would not. 83 The Logical Element to Test in Linear Regression: Slope If the slope is zero, there is probably NOT a relationship If the slope is NOT zero, there is probably a relationship
84 Hypothesis Test to Check if Slope/s Are Equal to Zero If slope/s are ALL equal to zero, then: E(y|x1,x2,xq) = 0 + 1x1 + 2x2 + + + + + qxq becomes: E(y|x1,x2,xq) = 0 + 0*xx1 + 0*xx2 + + + + + 0*xxq becomes: E(y|x1,x2,xq) = 0 Not a linear function of x1, x2,, xq For our Hypothesis Test, our goal is to Reject the hypothesis ALL Slope/s = 0. If ALL Slope/s = 0, then model would be no better than the Ybar line for making predictions.
85 Steps For Hypothesis Testing 1. State The Null and Alternative Hypotheses. Null Hypothesis = H0 = All Slope/s = 0 Alternative Hypothesis = Ha = All Slope/s <> 0 At Least One Slope is NOT Equal to Zero 2. Set Level of Significance = alpha. Alpha = risk of rejecting H0 when it is TRUE. Alpha determines the hurdle for whether or not the test statistic just represents sample error or there is a true statistically significant difference (past the hurdle). Alpha is used to compare against p-value. P-value <= Alpha, we Reject H0 and accept Ha Alpha is often 0.05 or 0.01. When testing the slope, when we get a statistically significant difference, it will mean: It is reasonable to assume that at least one of the slopes is not zero.
It is reasonable to assume that there is a statistically significant relationship. 3. Rejection Rule: If p-value is less than our alpha, we reject H0 and accept Ha , otherwise, we fail to reject H0 . 86 Steps For Hypothesis Testing 4. From Sample Data calculate the Test Statistic and then calculate the p-value of the Test Statistic. Use F Test Statistic (and F Distribution) for Testing Overall Significance: F Test Statistic is: = = p-value: = F.DIST.RT(F Test Statistic, q , n-q-1)
Use t Test Statistic (and t Distribution) for testing individual Slope and Y-Intercept: t Test Statistic for Slope: t= t Test Statistic for Y-Intercept: t= p-value for t Test Statistic: =T.DIST.2T(ABS(t Test Statistic), n-q-1) Data Analysis Regression Output provides F & t Test Statistics, & p-values. The key is: If p-value is less than Alpha, Reject H0 and Accept Ha 87
Steps For Hypothesis Testing 5. From the sample evidence makes reasonable statements about the population parameter. If we reject H0 and accept Ha we will say: The sample evidence suggests that at least one slope is not equal to zero. It is reasonable to assume that there is a significant relationship at the given level of significance. xi and y are related and a linear relationship explains a statistically significant portion of the variability in y over the Experimental Region. If we fail to reject H0 we will say: The sample evidence suggests that all slope/s are equal to zero. It is reasonable to assume that there is NOT a significant relationship at the given level of significance. 88 F
Distribution for Hypothesis Test 89 F Test Statistic for Hypothesis Test Use the F Distribution F Test Statistic is: = = SSR = Sum of squares due to regression (explained variation) SSE = Sum of squares due to error (unexplained variation) q = the number of independent variables in the regression model
n = the number of observations in the sample SSR/q = MSR = Mean Square Regression = test statistic that measures variability in the dependent variable y that is explained by the independent variables (x1, x2 xq) SSE/(n-q-1) = MSE = Mean Square Error = measure of variability that is not explained by x1, x2 xq . df = degrees of freedom = term used in Excel ANOVA output. The larger the F value, the stronger the evidence that there is an overall regression relationship. P-value = Probability of getting the F Test Statistic or greater in the F Distribution.
The smaller the p-value the stronger the evidence that there is an overall regression relationship (stronger the evidence against the Null that all slopes are zero). If p-value is smaller than alpha, we reject H0 . Data Analysis Regression Output provides F and p-value 90 Formulas for testing individual estimates of parameters: Sum of Squares of Error (Residuals) = SSE = Estimate of Variance of Estimated Regression Line: s
2 = MSE = SSE/(n - 2) Estimate of Standard Deviation of Estimated Regression Line = Standard Error Of The Estimate = s = SQRT(MSE) Estimated Standard Deviation of a particular slope = Confidence interval for 1 is = , where ta/2 is the t value providing an area of a/2 in the upper tail of a t distribution with n - 2 degrees of freedom. Estimated Standard Deviation of y-intercept = s* Confidence interval for 0 is = , where ta/2 is the t value providing an area of a/2 in the upper tail of a t distribution with n - 2 degrees of freedom. 91
Testing Individual Regression Parameters If F Test Statistic indicates that at least one of the slope/s are not zero, then we can test if there is a statistically significant relationship between the dependent variable y and each of the independent variables by testing each slope. We use the t Distribution (Bell Shaped Probability Distribution from Busn 210) We use t Test Statistic to test whether the slope is zero: t= b1 = slope = estimated standard error of slope t = # of standard deviations. If t is past our hurdle, we reject H0 and accept Ha. H0 : Slope = 0 Ha : Slope <> 0 Alpha of 0.5 or 0.01 are often used. Alpha determines the hurdle or is used to compare against p-value.
This is a two-tail test. If t is past hurdle in either direction, reject H0 and accept Ha. It seems reasonable that the slope is not zero. If the p-value is less than alpha, it seems reasonable that the slope is not zero. The smaller the p-value, the stronger the evidence that the slope is not zero and the more evidence we have that a relationship exists between y and x. In Simple Linear Regression, t test and F test will yield same p-value. 92 t Distribution for Hypothesis Test 93 Hypothesis Test For Weekly Ad Expense and Sales Example:
First we look at the residual plot to see if the assumptions of the Least Squares Method are met. It appears that the assumptions are met. The plot does NOT provide evidence of a violation of the conditions necessary for valid inference in regression. Because the p-value for the F Test Statistic is less than 0.01, we reject H0 and accept Ha. It is reasonable to assume that the slope is not zero and that there is a significant relationship between x and y. A linear relationship explains a statistically significant portion of the variability in y over the Experimental Region. Similarly, p-value for Y-Intercept is less than 0.01 and so we conclude it is not zero. However, the Y-Intercept value is not in our Experimental Region. 94
Hypothesis Test For Credit Card Example: The plots does NOT provide evidence of a violation of the conditions necessary for valid inference in regression. From our p-value, it is reasonable to assume that at least one slope is not zero and that there is a significant relationship and a linear relationship explains a statistically significant portion of the variability in y over the Experimental Region. The p-value for the individual slopes are less than 0.01 and therefore each slope appears to not be zero. The Y-Intercept p-value is much larger than 0.01 and so we fail to reject H0. We fail to reject the statement that the Y-Intercept is 95
What the F Statistic Hypothesis Test Looks Like 96 Confidence Intervals to Test if Slope 1 & Y-Intercept 0 Are Equal to 0 Excel Data Analysis, Regression tool calculates upper and lower limit for a Confidence Interval Interval does not contain 0: conclude Y-Intercept (0) is not zero (when all x are set to zero). Interval does not contain 0: conclude Slope (1) is not zero (there is a linear relationship) Found an overall regression relationship at both alpha = 0.05 &
alpha = 0.01 97 Nonsignificant Variables: Reassess Whole Model/Equation Slope If Slope not significant (Do not reject H0 : Slope = 0) If practical experience suggests that the nonsignificant x (independent variable) has a relationship with the y variable, consider leaving the x in the model/equation. Business example: # of deliveries for a truck route had insignificant slope, but was clearly related to total time.
If the model/equation adequately explains the y variable without the nonsignificant x independent variable, try rerunning the regression process without the nonsignificant x variable, but be aware that the calculations for the remaining variables may change. Key is that you may have to run the regression tool in Excel a number of times over various variables to try and get the best slopes and yintercept for the equation. Intercept If Y-intercept not significant The decision to include or not include the calculated y-intercept may require special consideration because setting Constant is
Zero in Data Analysis Regression tool will set the equation intercept equal to zero and may dramatically change the slope values. Business example when you might want the equation to go through the origin (x=0,y=0): labor hours = x and Output = y. 98 Multicollinearity Multicollinearity Correlation among the independent variables when performing multiple regression. In Multiple Regression when you have more than one x, each x should be related to the y value, but in general, no two x values should be related to each other. For example, if we have y = time for truck deliveries in a day, x1 = number of miles, x2 = amount of gas, because number of miles is related to gas, the
resulting multiple regression process may have problems. Use PEARSON or CORREL to analyze any 2 x variables Rule of thumb: if absolute value is greater than 0.7, there is potential problem. Problems with correlation among the independent variables is that it increases the variances & standard errors of the estimated parameters (0, 1, 2, . . . , q ) and predicted values of y, and so inference based on these estimates is not as precise than it should be. For example, if t test or confidence intervals lead us to reject a variable as nonsignificant, it may be because there is too much variation and thus the interval is too wide (or t stat not past hurdle). We may incorrectly conclude that the variable is not significantly different from zero when the independent variable actually has a
strong relationship with the dependent variable. If inference is a primary goal, we should avoid variables that are highly correlated. If two variables are highly correlated, consider removing one. If predicting is primary goal, multicollinearity is not necessarily a concern. Note: If any statistic (b0, b1, b2, . . . , bq ) or p-value changes significantly when a new x variable is added or removed, we must suspect that multicollinearity is at play. Checking correlation between pairs of variables does not always uncover multicollinearity. Variable might be correlated with multiple other variables. To check: 1) treat x1 as dependent and the rest of the x as independent and run ANOVA table to see if R^2 is big to see if there strong relationship. R^2 > 0.5, rule of thumb that there might be multicollinearity. 99 Categorical Independent Variables Convert categorical variables to dummy variables k = number of categorical variables
k 1 - Number of dummy variables Examples: Methods of payment = Credit Card or PayPal x1 = 1 if Credit Card, 0 = PayPal Definition becomes: x1 = 1, then Credit Card x1 = 0, then PayPal Methods of payment = Credit Card or PayPal or Cash 3 = k = number of categorical variables Dummy Variables = k 1 = 3 1 = 2: x1 = 1 if Credit Card, 0 = anything else x2 = 1 if PayPal, 0 = anything else Definition becomes: x1 = 1 AND x2 = 0, then Credit Card
x1 = 0 AND x2 = 1, then PayPal x1 = 0 AND x2 = 0, then Cash 100 Categorical Independent Variables Must convert original categorical variable fields in data set to new dummy data IF or VLOOKUP function. Histograms for count of residuals can be used for residuals with and without the Categorical Variable. Higher frequencies on positive side means equation is underpredicting Higher frequencies on negative side means equation is overpredicting Detecting multicollinearity with a dummy variable is difficult because we dont have a quantitative variable. We can get around this by estimating the regression model twice: once
with dummy variable and once without. If estimated slopes and the associated p-values dont change much, we can assume that there is not strong multicollinearity. 101 Multiple Regression with Categorical Variable Must convert Smoker Data to 0 and 1 Data 102 Multiple Regression Estimated Equation with Categorical Variable = 0.835*Age-x1 + 0.228*BloodPressure-x2 + 10.61*Smoker-x3 -72.51
Smoker = 0.835*Age-x1 + 0.228*BloodPressure-x2 + 10.61 -72.51 Smoker = 0.835*Age-x1 + 0.228*BloodPressure-x2 -61.9 Non-Smoker = 0.884*Age-x1 + 0.243*BloodPressure-x2 -76.1 103 Inference and Very Large Samples When sample size is large: 1. Estimates of Variance and Standard Error (# Standard Deviations) are calculated with sample size in the denominator. As sample size increases, Estimates of Variance and Standard Error decrease. 2. Law of Large Numbers (Large Sample Size) says that as sample size gets bigger, statistic approaches parameter. As statistic approaches parameter, variation between the two decreases. As the variation between the two decreases, Estimates of Variance and Standard Error decrease. 3. As Estimates of Variance and Standard Error decrease, the intervals used in
inference (Hypothesis Testing and Confidences Intervals) decrease, p-values get smaller, and almost all relationships will seem significant (meaningful and specious). You cant really tell from the small p-value if the relationship is meaningful or specious (deceptively attractive) 4. Multicollinearity can still be an issue. 104 Small Sample Size Maybe be hard to test assumptions for inference in regression, like with a Residual Plot (because not enough sample points). Assessing multicollinearity is difficult. 105