Carlos Guzman-Zavala
The University of Memphis
INSE 4642
LINK TO INSE-4642 page to get complete Word and Excel files
username:     ut24hrmachine
password:   inse
Link to Activity Log

Chapter 1
 Data Analysis and Statistical Description
-Purposes of Data Analysis    ………………………………… done 09/15/99
-Description of Data Sets Available for Analysis……………… done 09/16/99
-Description of One Variable…………………………………09/19/99
-Description of 2 or More Variables…………………………09/22/99
-Age as an Independent Variable…………………………….09/25/99
-Logarithms and Multiplicative Effects…………………….09/28/99

-Exercises

-Links
Statistical information for interpretation and forecasting for investors.
-Articles
09/30/99-------------------------------------------------------------------------------------------------

Chapter 2
 Sampling and Statistical Inference

-Introduction……………………………………………………10/01/99
-Sampling in the Real World…………………………………10/04/99
-Exercises…………………………………10/07/99

-Articles…………………………………10/09/99

Chapter 3
Time Series

-Introduction…………………………………10/12/99

-Concepts used in data generation rules…………………………………10/15/99

-Two data-generating rules…………………………………10/17/99

-Exercises…………………………………10/21/99

-Articles…………………………………10/23/99 

MIDTERM EXAM by OCT 25

100 pages of Book
 
 Chapter 4
Regression Analysis

-Introduction................................. 10/30/99 

-A Regression Model.........................11/02/99

-Forecasts.................................11/06/99 

-Measurable Goodness of Fit...................11/10/99 

-Transformed Values............................11/14/99 

-Using Regression Utility......................11/18/99 

-Exercises.................................11/22/99

-Articles............................11/26/99 

Chapter 5
Causal Inference

-Introduction & Concepts............................11/30/99 

-Observational Data..................................12/01/99 

-Which Independent Variables Should be Included?......................12/02/99 

-How to Identify the Relevant Independent Variables................12/03/99 

-Exercises....................................12/05/99 

Final Examination   12/10/99

Sources of additional uncertainty in a regression forecast include:

- possible uncertainty in the values of the independent variables used in a forecast;

- possible uncertainty about which one of several competing regression models is correct.

A regression forecast is most trustworthy when:

- the values of the independent variables used in the forecast are well within the range of the independent variables in the data from which the regression model was constructed;

- the values of the independent variables used in the forecast are known with certainty;

- the regression model has no plausible competitors;

- the model fits the data well;

- the sample residuals are approximately normally distributed.
 

MEASURES OF GOODNESS OF FIT

Regression programs provide various measures of goodness of fit, of which the two most important are the residual standard deviation (RSD) and the coefficient of determination, or R^2.

Residual Standard Deviation

The residual standard deviation is just what it sounds like: the estimated standard deviation of the residuals, computed as the square root of the sum of squared residuals divided by the degrees of freedom. When we partitioned the data into "look-alike" cells, the degrees of freedom were the number of observations (n) less the number of cells. In regression, the degrees of freedom are the number of observations less the number of regression coefficients (including the constant term). Thus, in Model 2, there are n = 10 observations and three regression coefficients, so that there are seven degrees of freedom. The sum of squared residuals is 116,080,000, so the RSD is the square root of 116,080,000/7, or about 4,072.

In comparing regressions having the same observations and the same dependent variable, the one with a lower RSD indicates a better fit. As you add independent variables to a regression model, the sum of squared residuals almost always decreases (at worst, it doesn't change), but the degrees of freedom also decrease. If adding a variable causes the RSD to increase, you almost surely have too many independent variables in your model.
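As a quick check on that arithmetic, the short Python sketch below (illustrative only, not part of the regression utility) recomputes the degrees of freedom and the RSD from the figures quoted above.

import math

# Figures quoted above for Model 2 of the housing data
sum_sq_residuals = 116_080_000
n_observations = 10
n_coefficients = 3                     # two slopes plus the constant term

degrees_of_freedom = n_observations - n_coefficients     # 7
rsd = math.sqrt(sum_sq_residuals / degrees_of_freedom)   # about 4,072

print(degrees_of_freedom, round(rsd))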

R-Squared (R^2)

R^2 measures the percent improvement in fit that a regression provides relative to a base case which assumes that the values of the dependent variable are indistinguishable. For the base case we compute the sum of squared residuals about the mean of the dependent variable; let's call this the "base-case sum of squares." If the sum of squared residuals from the regression (the "regression sum of squares") is much less, the regression has explained much of the variability in the dependent variable, an indication of a good fit. If the regression sum of squares is almost as large as the base-case sum of squares, the regression has not explained much of the variability in the dependent variable: the fit is bad. R^2 measures the percent reduction in the base-case sum of squares achieved by the regression:

R^2 = (base-case sum of squares - regression sum of squares)/(base-case sum of squares)

In the housing data, the base-case sum of squares for selling prices is 1,385,300,000. We have already seen that the regression sum of squares is 116,080,000, so R^2 = (1,385,300,000 - 116,080,000)/1,385,300,000 = 0.9162. An alternative way of defining R^2 is as the square of the correlation between the estimated values (yest) and the true values (y) of the dependent variable. In the housing data, the correlation between the true selling price and the price estimated by Model 2 is 0.9572; its square is 0.9162.

Because the regression sum of squares will almost always decrease as you add independent variables, R^2 will always increase (or at worst remain unchanged) as you add variables. "Adjusted" R^2 corrects for this by taking degrees of freedom into account.

Specifically, instead of using the raw base-case and regression sums of squares, it first divides each by its respective degrees of freedom. Adjusted R^2 = (1,385,300,000/9 - 116,080,000/7)/(1,385,300,000/9) = 0.8923.

(The first term, 1,385,300,000/9, is the square of the sample standard deviation of the dependent variable; the second is the square of the RSD.)

As you add independent variables, it may or may not increase; if it decreases, that suggests you are using too many independent variables in your model. If you have few degrees of freedom and a poor fit, adjusted R^2 may be negative.
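The same housing-data figures can be used to verify both R^2 and adjusted R^2; the Python sketch below simply restates the two formulas given above (it assumes, as in the text, 10 observations and 3 estimated coefficients).

base_case_ss = 1_385_300_000     # sum of squared deviations about the mean of y
regression_ss = 116_080_000      # sum of squared residuals from the regression
n, k = 10, 3                     # observations; coefficients including the constant

r_squared = (base_case_ss - regression_ss) / base_case_ss
adj_r_squared = (base_case_ss / (n - 1) - regression_ss / (n - k)) / (base_case_ss / (n - 1))

print(round(r_squared, 4), round(adj_r_squared, 4))   # 0.9162  0.8923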

When the base case is a very poor standard of comparison, for example when the dependent variable is a time series with a pronounced trend, virtually any regression model is likely to produce a high value of R^2.

You have data on the Standard and Poor's 500 stock index monthly closing price from January 1968 through March 1993. If you use time as an independent variable (January 1968 = 1, February 1968 = 2, etc.), R^2 = 0.7586; if you use last month's closing price as an independent variable, R^2 = 0.9927. You get high values of R^2 because the base case provides a nonsensical way of forecasting any particular month's S&P value. The "base-case" assumption is that the values of the S&P 500 are indistinguishable over this twenty-five-year period, when in fact they fluctuated between 63.54 and 451.67, had a pronounced upward trend, and a given month's price was generally much closer to the previous month's price than to a price chosen at random. The result is a base-case sum of squares which even the simplest regression models can reduce enormously.

If you had a regression model for forecasting monthly changes in the S&P whose R^2 was only 0.05, you could, over the long run, do very well. Great value will accrue to an investor who can achieve even a small improvement over the base-case forecast, which assumes that future changes will vary in the same way that past changes did.
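The point can be illustrated with entirely synthetic data (the sketch below does not use the actual S&P figures): a random walk with an upward drift, regressed on time or on its own lagged value, yields a high R^2, while the month-to-month changes are nearly unforecastable.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "price" series: a random walk with upward drift
changes = rng.normal(loc=1.0, scale=5.0, size=300)
price = 100 + np.cumsum(changes)
t = np.arange(1, len(price) + 1)

# For a one-variable regression, R^2 is the squared correlation
r2_vs_time = np.corrcoef(t, price)[0, 1] ** 2
r2_vs_lag = np.corrcoef(price[:-1], price[1:])[0, 1] ** 2

# Forecasting the changes themselves is the hard problem: R^2 is near zero
r2_changes = np.corrcoef(changes[:-1], changes[1:])[0, 1] ** 2

print(round(r2_vs_time, 3), round(r2_vs_lag, 3), round(r2_changes, 3))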

TRANSFORMED VARIABLES

Transformations greatly increase your ability to specify relationships between a dependent variable and a number of independent variables. They:

- permit independent variables to be used that are not contemporaneous with the dependent variable in a time series.

- permit you to use ordinal and categorical variables as independent variables.

- permit you to express relationships between a dependent and independent variable that are curvilinear.

Lagged Variables in Time Series for
Modeling Noncontemporaneous Effects

Suppose we believe that advertising affects sales. If we have a time series of a company's monthly advertising expenditures and unit sales, we could perform a regression with sales as the dependent and advertising as the independent variable, and see if there was any apparent relationship. To perform a regression in which sales depend on current and past values of advertising expenditure and on current price, we would first compute lagged transformations xt-1, xt-2, etc., of the advertising variable xt, and then run a regression with sales yt as the dependent variable, and xt, xt-1, xt-2, the price pt, and additional lagged x's if appropriate, as independent variables.

Notice that each time you lag a variable an additional period, you create a missing value (denoted by #N/A in the spreadsheet). Since an observation used in regression must be complete, i.e., no value missing for any variable used in the regression, lagging a variable results in lost observations. A regression with xt, xt-1, and xt-2 as independent variables has four fewer degrees of freedom than one with just xt as an independent variable: two are lost because there are two more variables; two more are lost because there are two fewer observations.

[Table: Month; Unit Sales (000); Advertising Expenditures ($000) in the current and lagged months; Average Unit Price]

We can lag not only values of an independent variable, but also values of the dependent variable. We might believe that sales levels tend to persist: this month's level is more likely to be high if last month's was high than if last month's was low. In that case, we might include yt-1 as an independent variable. If we believe that levels of sales in the more remote past tend to persist into the present, we could include values of yt-2, yt-3, etc. as independent variables as well. As before, each additional lag "costs" two degrees of freedom, one for the additional independent variable in the model, one for the observation lost in creating the lag.
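As an illustration of how lag and difference transformations behave in practice, here is a small Python sketch using pandas; the column names and numbers are hypothetical, not taken from the advertising data above. Note how each additional period of lag creates missing values at the top of the series, so that complete observations are lost.

import pandas as pd

# Hypothetical monthly data
df = pd.DataFrame({
    "sales":       [210, 225, 240, 238, 255, 270],
    "advertising": [ 30,  35,  32,  40,  38,  45],
    "price":       [9.9, 9.9, 9.5, 9.5, 9.5, 9.0],
})

# Lag transformations: advertising one and two months back, and last month's sales
df["adv_lag1"]   = df["advertising"].shift(1)
df["adv_lag2"]   = df["advertising"].shift(2)
df["sales_lag1"] = df["sales"].shift(1)

# Difference transformation: the month-to-month change in advertising
df["adv_change"] = df["advertising"].diff(1)

# Rows with any missing value must be dropped before running the regression
complete = df.dropna()
print(len(df), len(complete))   # 6 rows before, 4 rows after the two-period lag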

Dummy Variables for Modeling Effects of
Ordinal or Categorical Variables

Sometimes ordinal or categorical variables may be plausible explanatory variables in a regression. If x represents such a variable, say a rating of a house's condition with possible values of 1 through 5, just including x in the model implies that, everything else being equal, the average difference in selling price between two houses rated 1 and 2 ("poor" and "fair") will be the same as the average difference between two houses rated 4 and 5 ("good" and "excellent"). If we believe those differences may in fact be quite different, just using the rating scale x as an independent variable misspecifies the relationship between condition and selling price.

An even more serious problem occurs with categorical data. You might have a variable representing quality of construction, in one of three categories: frame, mixed frame and brick, and all brick. If we coded those categories 1, 2, and 3 respectively, could we just use the coded variable as an independent variable in our regression? Clearly not: the codes would impose an arbitrary ordering and spacing on categories that have no numerical meaning. Dummy variables provide a way of specifying ordinal and categorical relationships. Suppose for the moment that there were only two categories, frame and brick, and that we defined a dummy variable with value 1 if a house is brick and 0 if it is frame. If we include this dummy variable as an independent variable in a regression, the corresponding regression coefficient tells us by how much brick houses differ from frame houses in selling price, on average, when the other independent variables included in the regression are held constant. A regression coefficient of 1,234, for example, implies that brick houses sell, on average, for $1,234 more than frame houses that are alike in all other respects measured by the other independent variables in the model. If the regression coefficient were (3,456), it would imply that brick houses sell for $3,456 less than frame houses.
 

To handle all three construction categories, we will create two dummy variables. One dummy will have value 1 if the house is mixed frame and brick, 0 if it is not (i.e., if it is either frame or all brick); the other will have value 1 if the house is all brick, 0 if it is not. Suppose the regression coefficients for the first and second dummies are 1,234 and (3,456) respectively. Remembering that we are talking about average relationships, with the other variables included in the model held constant, these regression coefficients imply that, relative to the base case (frame houses), mixed frame and brick houses sell for $1,234 more, while all-brick houses sell for $3,456 less.

Suppose we had selected some other category, say all brick, as the base case. Then the regression coefficient for frame houses would be 3,456 and for mixed frame and brick 4,690, and we would reach the same conclusions as before: mixed frame and brick houses still sell for $1,234 more than frame houses, and frame houses sell for $3,456 more than all-brick houses. Although the choice of base case affects the values of the regression coefficients, it does not affect their interpretation.
 

In general, when a categorical or an ordinal variable has c categories, you can represent the effect of each category by defining c-1 dummy variables, use any one of the categories as a base case, and then use the c-1 dummy variables, along with whatever other independent variables are appropriate, in the regression.
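A minimal Python sketch of the c-1 dummy-variable scheme, using invented house data; pandas drops one category, which then plays the role of the base case.

import pandas as pd

# Hypothetical houses with a three-category construction variable
houses = pd.DataFrame({
    "construction": ["frame", "all brick", "mixed", "all brick", "frame"],
    "price":        [91000, 118000, 104000, 121000, 88000],
})

# c = 3 categories -> c - 1 = 2 dummies; the dropped category is the base case
dummies = pd.get_dummies(houses["construction"], prefix="is", drop_first=True)
model_data = pd.concat([houses[["price"]], dummies], axis=1)
print(model_data)

# In a regression of price on these dummies (plus any other independent variables),
# each dummy's coefficient is the average price difference from the base case.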

Detecting, Specifying, and Interpreting
Curvilinear Relationships: Exploratory Analysis

Suppose you suspect that in a model with more than one independent variable, the relationship between a particular independent variable (x) and the dependent variable (y) is curvilinear rather than linear. A simple scatter diagram of y against x can be misleading, because the other independent variables are not held constant. A better way to detect curvilinearity under these circumstances would be to perform a regression using all the variables, compute the residuals, and plot the residuals (on the vertical axis) against x (on the horizontal). If this plot looks curvilinear, it suggests that the relationship between y and x, when all the other independent variables in the model are held constant, is curvilinear.

To capture the curvilinearity, we include both x and its square, x^2, along with all the other independent variables, in the regression model. Thus, the model is specified as:

yest = b0 + a1*x + a2*x^2 + b1*x1 + b2*x2 + ...

Adding a squared transformation of an independent variable to the model thus provides an easy way of detecting curvilinearity, but understanding in what way the relationship between y and x is curvilinear is a trickier matter. Holding the other independent variables constant, yest traces a parabola as x varies. If a2 (the regression coefficient for x^2 in the previous equation) is negative, the parabola rises to a peak and then descends, while if a2 is positive, the parabola first descends to a trough and then rises again. Table 4.9 summarizes how yest behaves, depending on the sign of a2 and on where the values of x in the data fall relative to the peak or trough.

Table 4.9

 CASE #  VALUES OF X               VALUE OF a2   BEHAVIOR OF yest

 1       x < -a1/(2*a2)            a2 < 0        Increases at a decreasing rate
                                                 (decreasing returns to scale)

 2       x > -a1/(2*a2)            a2 < 0        Decreases at an increasing rate

 3       x straddles -a1/(2*a2)    a2 < 0        Increases to a maximum, then decreases

 4       x < -a1/(2*a2)            a2 > 0        Decreases at a decreasing rate

 5       x > -a1/(2*a2)            a2 > 0        Increases at an increasing rate
                                                 (increasing returns to scale)

Which case applies depends on the location of -a1/(2*a2) relative to the values of x in the data.
If, for example, a1 = 17.43 and a2 = -2.367, then -a1/(2*a2) = 3.682, and if most of the values of x are above 3.682, Case 2 applies: yest decreases at an increasing rate as x increases.
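A two-line Python check of that arithmetic (illustrative only):

a1, a2 = 17.43, -2.367            # coefficients from the example above
vertex = -a1 / (2 * a2)           # peak of the parabola, since a2 < 0
print(round(vertex, 3))           # 3.682; if most x values lie above this,
                                  # yest decreases at an increasing rate (Case 2)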

For details, see Multiplicative Regression Models, Chapter 6.
 

USING THE REGRESSION UTILITY

The regression utility supplied with this text lets you perform regressions on data in Excel files, view various outputs, and make forecasts of values of the dependent variable when corresponding values of the independent variables are known.

Performing Regressions

Opening the Data File. Use the File Open commands in Excel to open the relevant data file.

Activating the Regression Utility. Follow the instructions distributed with the regression utility to activate it.

1. Setting Up the Data Range

Highlight the entire block of data you want to analyze. This block of data should include all observations of the variables you want to analyze. Column/variable labels appear in the regression output and are a convenient way of describing data. If your labels extend over several rows, you may want to insert a new row with abbreviated labels just above the first row of data.

To select the data range, click on the cell in the upper-left corner of the block and highlight the entire block of data. Choose Data from the menu bar and Regression from the pull-down menu, then Set Data Range from the Regression menu. When prompted, "Does the top row of your data range contain column (or variable) labels?" click Yes if in fact you have included the labels in your data range, as suggested earlier.

2. Setting Up the Dependent Variable Column

Click any cell in the column you want to be your dependent variable. Again choose Data from the menu bar and Regression from the pull-down menu, then Set dependent variable column from the Regression menu. You should then see the values in the dependent variable column in bold type.

3. Setting Up the Independent Variable Column(s)

Click any cell in the first column you want to select as an independent variable. If you want to select more than one independent variable and they are in adjacent columns, you can highlight any row containing the range of columns. If you want to select nonadjacent columns, hold down the [Ctrl] key as you click a cell in each column you want to select as an independent variable. Once you have selected all the columns containing independent variables, choose Data from the menu bar and Regression from the pull-down menu, then Set independent variable columns from the Regression menu. You should see the values in the independent variable columns with a shaded background.

4. Performing the Regression

Before initiating the regression calculations, you may need to exclude observations.

The Regression utility will not calculate if you have observations with any values missing, so you will need to exclude observations containing such missing values. If, after examining the data, you want to exclude observations from the database, hold down the [Ctrl] key and click a cell in each row you want to exclude. Choose Data from the menu bar and Regression from the pull-down menu, and choose Set excluded observation(s) from the Regression menu. Values of the variables in rows that were excluded will appear with strikeout lines through them.

Performing the regression. Choose Data from the menu bar and Regression from the pull-down menu, then Perform regression from the Regression menu.

Outputs

This output was generated from the data set HTWT.XLS. Our model used weight as the dependent variable, and height and gender as the independent or explanatory variables.

Regression Number 1
Dependent Variable: WT

 Regr. Coef.
 Std. Error   0.210   14.9    1.74
 t value      20.0    (9.0)   (10.3)
 R-squared = 0.

The "constant" term will always appear in the dependent variable column.

5. Calculate Yest and Residual Values

Once you have looked at the regression statistics, you may want to calculate Yest (the regression estimates of the dependent variable) and residual values for the observations in your data range. This procedure can be time-consuming for large data sets. To initiate this process, choose Data from the menu bar and Regression from the pull-down menu, and Calculate yest and residual values from the Regression menu. The values of Yest and the residuals appear in two columns to the right of the data range you selected.
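Outside the utility, the same quantities can be computed with ordinary least squares. The Python sketch below uses made-up data and is only meant to mirror the Yest-and-residuals step; it is not the utility's own code.

import numpy as np

# Hypothetical data: dependent variable y and two independent variables x1, x2
y  = np.array([152.0, 168.0, 141.0, 175.0, 160.0, 149.0])
x1 = np.array([ 64.0,  70.0,  62.0,  72.0,  68.0,  65.0])
x2 = np.array([  0.0,   1.0,   0.0,   1.0,   1.0,   0.0])

# Design matrix with a column of ones for the constant term
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

yest = X @ coef                     # regression estimates of the dependent variable
residuals = y - yest                # actual minus estimated values

df = len(y) - X.shape[1]            # observations minus estimated coefficients
rsd = np.sqrt(np.sum(residuals**2) / df)
print(coef, round(rsd, 2))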

6. View Charts

For charts involving yest and/or residual values, you must of course calculate those values first, using the procedure described above. The charts available are:

- Yest vs. Yact (the actual value of the dependent variable)

- Yest vs. Residuals

- Any X vs. Residuals

To view the charts, choose Data from the menu bar and Regression from the pull-down menu, and View charts from the Regression menu.

Forecasting

If you have a number of past observations on a dependent variable and a set of independent variables, and you want to forecast values of the dependent variable on future observations given specified values of the independent variables, do this:

1. Be sure that the values of the independent variables on future observations are appended to the past observations in the data file.

2. Leave the values of the dependent variable on these observations blank.
3. Set the data range to include both the past and future observations.

4. Set the dependent variable, then set the independent variables.

5. Exclude the future observations, using the Set excluded observation(s) option from the Regression menu.

6. Perform the regression.
7. Calculate yest and Residual Values.

The values of Yest on the future observations are point forecasts for those observations. (Because the future observations were excluded, values of all the variables on these observations, including Yest, will appear with a strikeout line.)
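The same logic can be sketched outside the utility. In the hypothetical Python fragment below, the future observation is simply the row whose dependent-variable value is missing: it is dropped before fitting (the equivalent of excluding it) and then used for the point forecast.

import numpy as np
import pandas as pd

# Hypothetical data: past observations plus one future row with known x but no y
data = pd.DataFrame({
    "x": [10.0, 12.0, 15.0, 14.0, 18.0, 20.0],
    "y": [31.0, 36.0, 44.0, 42.0, 55.0, np.nan],
})

past = data.dropna()                                   # exclude the future observation
X = np.column_stack([np.ones(len(past)), past["x"]])
coef, _, _, _ = np.linalg.lstsq(X, past["y"], rcond=None)

future = data[data["y"].isna()]                        # the appended future observation
forecast = coef[0] + coef[1] * future["x"]             # point forecast of y
print(forecast.values)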

Performing Another Regression

Once you have finished reviewing the output from your first regression, you may want to run a second regression. Choose Data/Regression, then Reset current settings from the Regression menu. Often you will want to keep the data range but select different dependent and independent variables, clearing the following settings:

- Dependent variable column

- Independent variable columns

- Excluded observations

Once you have cleared the ranges, set your new variables and perform the regression again. Your regression output will again appear below your data set, but above your first regression. It will be numbered Regression Number 2. Again, if you want to calculate values of Yest and residuals, or view charts, use the same menu choices as before.

Old values of Yest and residuals will be overwritten.

Return to Prior Regressions. If you want to return to a prior regression, choose Data/Regression, then Prior regressions from the Regression menu. Keep in mind that you will need to choose Calculate Yest and residual values again if you want to look at charts for these data.

View Statistics. Just choose Data/Regression and then View statistics from the Regression menu.

Print Statistics. This choice allows you to print the regression statistics. Choose Data/Regression and then Print statistics from the Regression menu.
 

DOING REGRESSION ANALYSIS

We will now use the regression utility to analyze three data sets. You will learn how to perform a regression, understand regression output, diagnose violations of regression assumptions, transform variables, and make forecasts.

A good precursor to any regression analysis consists of doing the kinds of graphical analysis (scatter diagrams and time-series charts) discussed in Chapter 1. Many of these graphical analyses can be done within the regression utility.

- Example 1: Burlington Press

The Burlington Press publishes textbooks, primarily texts for junior high schools (seventh and eighth grade).

Preliminary Analysis

The data file contains values of Year, Number of Purchased Texts, and Number of Students. The data in the file run from 1967 through 1990. A model for predicting purchases of texts might be:

Texts = B0 + B1*Year + B2*Students + error    (Model R1)

Enter 1991 in the cell beneath "1990" (cell C31), and 2600 in cell E31. Leave cell D31, which represents the unknown Number of Purchased Texts for 1991, blank. The observation in row 31 is incomplete; we must remember to exclude it when we run the regression.

Dependent and Independent Variables. Clearly, we are interested in forecasting Number of Purchased Texts; this is, therefore, the dependent variable. We might initially believe that the Number of Purchased Texts will increase or decrease with time, with number of students, or both. As a preliminary step, and especially to activate the graphics capabilities of the regression utility, let's designate School Year and Number of Students as independent variables.

Running the Regression Utility. First, open REGRUTIL.XLS, click on the Burlington-Press tab, and activate the regression utility. Click Data/Regression and then Set data range. Reply "Yes" to the next question; row 6 contains variable/column labels. Indicate the column containing the dependent variable by clicking any value in column D (Purchased Texts), then clicking Data/Regression, and then Set dependent variable column from the Regression menu. To indicate the columns containing the independent variables, click on any value in column C (School Year). Then, while holding down [Ctrl], move to any value in column E (Students) and click on that value. Now click on Data/Regression and then on the Set independent variable columns option. The final step before performing the regression is to indicate the observation that is to be excluded. Move the cursor to row 31 (the observation for 1991), click any value in that row, click Data/Regression, and then click the Set excluded observation(s) option. You have now told the regression utility everything it needs to know; once again, click Data/Regression and then click Perform regression. After a moment you will see the regression output.

Charts. Before trying to understand the regression output, let's look at what graphics capabilities are now available. Click Data/Regression and then click View charts. The available charts are:

- Yest vs. Yact

- Yest vs. Residuals

- Any X vs. Yact

- Any X vs. Residuals

- Any X vs. Any other X

If you take the third option, you can, for example, look at a time-series chart of Year vs. Number of Purchased Texts, or a scatter diagram of Number of Students vs. Number of Purchased Texts. If you take the fifth option, you can look at a time-series chart of Year vs. Number of Students.

Computing Yest, Residuals, and a Point Forecast. Click Data/Regression again, then click the Calculate Yest and residual values option. After a short time two new columns (F and G) will be filled in with values of Yest and residuals. Compare the values of Yest with the values of Texts, year by year. The difference between Texts purchased and Yest is the residual. If the regression estimated values of the dependent variable perfectly, all values of Yest would match the corresponding values of Y, and the residuals would all be zero.

Scroll down to row 31, the incomplete observation for 1991. You will see a value of 2,814 (displayed with a strikeout line) in column F. Given the regression model implied by the choice of dependent and independent variables (Model R1), this is the point forecast for the number of texts to be purchased in 1991.

Regression Output

Let's turn now to the regression output. You can scroll to it, or click Data/Regression and then click the View statistics option. The output reports, for each independent variable and for the constant term, a regression coefficient ("Regr. Coef."), its standard error ("Std. Error"), and its "t value."

Table 4.10

Regression Number 1
Dependent Variable: TEXTS

                 Year        Constant      Students
 Regr. Coef.     8.128       (15,664)      0.8824
 Std. Error      16.454      31,289        0.5803
 t value         0.5         (0.5)         1.5

 R-squared = 0.7638    Resid SD = 135.6
 Observations = 24     Degrees of freedom = 21
 
 
 

Regression Coefficients. The numbers in columns C and E ("Year" and "Students") are associated with the independent variables. The numbers in column D are associated with the "constant term" in the regression equation; these constant-term values will always appear in the dependent-variable column, which is simply a convenient location in which to display them.

To interpret the first three lines of output, we must start with the realization that the regression that we performed assumed that the 24 observations in the data file were generated by a model of the form:

Texts = B0 + B1*Year + B2*Students + error    (Model R1)

where the B's are constant but unobservable regression coefficients. From our sample of 24 observations, estimated values of the B's (denoted by lower-case b's) are obtained: b0 = (15,664), b1 = 8.128, and b2 = 0.8824. Given these estimated regression coefficients, you should verify that the values of Textsest can be computed by the formula:

Textsest = -15,664 + 8.128*Year + 0.8824*Students
For example, the value of Textsest for 1967, when there were 2,000 students, is:

Textsest = -15,664 + 8.128*1967 + 0.8824*2,000 = 2,089 (approximately)

Similarly, the values of the residuals can be computed from the formula:

Residuals = Texts - Textsest

For example, the value of the residual for 1967 is:

Residual = 2,111 - 2,089 = 22
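These calculations can be packaged in a few lines of Python; the sketch below uses the Model R1 coefficients reported above together with the 1967 values (2,000 students, and 2,111 texts actually purchased).

# Estimated Model R1 coefficients from the regression output
b0, b1, b2 = -15_664, 8.128, 0.8824

def texts_est(year, students):
    """Estimated number of purchased texts for a given year and enrollment."""
    return b0 + b1 * year + b2 * students

est_1967 = texts_est(1967, 2000)        # about 2,089
residual_1967 = 2111 - est_1967         # about 22
print(round(est_1967), round(residual_1967))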

You can add the 24 residuals in column G, using the =SUM function, and verify that their sum is 0. If, in column H, you compute the values of the squared residuals and add them, you will find that they add to 386,276.3. If the estimated regression coefficients had any values other than the ones displayed, this sum would be higher; in that sense, the coefficients are estimated by least squares.

The estimated regression coefficients are derived from a sample and are therefore subject to sampling error. Using the standard error to construct confidence limits, we can say, for example, that, with 68% confidence, the true value of the Year coefficient lies within one standard error of its estimate, i.e., between 8.128 - 16.454 = -8.326 and 8.128 + 16.454 = 24.582. To some degree, this uncertainty is reflected in the low t value of 0.5. Therefore, we cannot be at all confident that the sign of the true regression coefficient is positive: from the data alone, it is far from certain that the number of texts purchased will increase over time unless the number of students increases.

Observations and Degrees of Freedom. Turning to the last four outputs, there were 24 observations used to estimate the two regression coefficients and the constant term; since three estimates were generated from the data, this leaves only 24 - 3 = 21 degrees of freedom.

Residual Standard Deviation. The residual standard deviation (often abbreviated as RSD) of 135.6 is an estimate of the standard deviation of the errors. It is computed as the standard deviation of the residuals in column G, "corrected for degrees of freedom." If we divide this sum by the number of degrees of freedom, 21, and take the square root, we get 135.6, the value given in the regression output.

R-squared (R^2). Finally, compute the standard deviations of Yest and of Texts; they are 228.16 and 261.06, respectively, and the square of their ratio, (228.16/261.06)^2 = 0.7638, is the value of R^2 reported in the regression output (Table 4.10). If Yest perfectly predicted Y on all observations, then Yest and Y would have identical standard deviations and R^2 = 1. If the regression has absolutely no predictive power, the values of Yest will be the same on every observation: they will all equal the mean of Y and their standard deviation will be 0. In comparing competing regression models having the same dependent variable, a model with higher R^2 and lower RSD certainly indicates a better fit, and may indicate a better model, although it is certainly possible, by "fishing" through the data, to find independent variables that have no real relationship to the dependent variable but by chance are correlated with it in the data. Adding an independent variable, no matter how unrelated to the dependent variable, will never cause R^2 to decrease, but because the RSD is "corrected for degrees of freedom" and one more variable uses up one more degree of freedom, the RSD may increase. If the RSD increases when you add a new variable, you usually have evidence of "overfitting."

You should remember that the base case against which goodness of fit is measured is just using the mean of the dependent variable as an estimate of its value on all observations.

Transformations

A more plausible model might relate the number of texts purchased this year to two derived variables:

- students last year

- new (additional) students this year

In regression parlance, such derived variables are called transformations. In a time series, a transformation that creates a variable whose values are those of some other variable in a prior period is called a lag transformation. One that creates a new variable that represents the change in the value of some other variable over some period of time is called a difference transformation. Thus the variable Students Last Year is derived by lagging Students This Year by one year; the variable New Students is derived as the difference between Students This Year and Students Last Year.

To create these variables in Excel, first start with a clean slate by clicking on Data/Regression, then on the option Reset current settings, and then on Clear all ranges when the dialog box appears. We will create the lagged variable Students Last Year in column F, and the difference variable New Students in column G. Enter column labels of "Last Year" and "New Students" in cells F6 and G6, respectively. Next, click on cell F8 and set the value in that cell equal to the number of students in the preceding year; i.e., enter the formula =E7. The cell should now contain the value 2,000, the number of students in 1967. Similarly, click on cell G8 and enter the formula =E8-E7, the difference between this year's and last year's number of students; the cell should contain the value 27, the number of new students in 1968. Finally, copy the formulas in F8 and G8 down through the last row of the data range (row 31).

Regression Using Transformed Variables. We are now ready to run a new regression. The only trick is that in addition to excluding row 31 (the forecast), you must now exclude row 7 (the data for 1967, which is incomplete because we have no values for the newly created variables.) Before invoking the Set excluded observations option, click on any value in row 7, scroll down to row 31 and, while holding down [Ctrl], click on any value in that row. To run the regression, you need to:

1. Set Data range (C6:G31)

2. Set Dependent variable column (column D)

3. Set Independent variable column(s) (columns F and G)

4. Set Excluded observations (rows 7 and 31)

5. Perform regression
The model we are now estimating is:

Texts = B0 + B1*Students Last Year + B2*New Students + error    (Model R2)

The output appears in Table 4.11. Comparing it with the output of Regression 1, shown in Table 4.10:

- the value of R^2 is higher (0.8479 vs. 0.7638),
- the RSD is lower (108.4 vs. 135.6), and
- the t values associated with the independent variables are much higher.

Table 4.11

Regression Number 2
Dependent Variable: TEXTS

                 Constant    Last Year    New Students
 Std. Error      274.75      0.1259       1.311
 t value         0.2         7.9          4.5

 R-squared = 0.8479    Resid SD = 108.4
 Observations = 23     Degrees of freedom = 20
 
 

Also, the t value for the constant term suggests that the true value of the constant could easily be zero or negative, even though its sample value is positive. Nevertheless, the regression appears to fit the data much better than Regression 1, and it tells a simple story: on average, about one text is added to the set left by each of last year's students, and each new student acquires roughly six new texts; these two factors account for much of the variability in the number of texts purchased.

Before making a forecast, we might ask whether we should have included Year as one of the independent variables in Regression 2. We can easily add a variable by clicking on the columns for Year, Students Last Year, and New Students, invoking the Set independent variable columns option, and performing the regression. The model we are now estimating is:

Texts = B0 + B1*Year + B2*Students Last Year + B3*New Students + error.

(Model R3)

Compared with Table 4.11 (Regression 2):

- RSD has increased (from 108.4 to 110.6);

- because one more independent variable was included, the degrees of freedom decreased (from 20 to 19);

- the t values associated with the independent variables are much lower.

We can conclude that including Year as an independent variable has resulted in overfitting.

Chapter 5
Causal Inference
Introduction & Concepts

An assertion that reducing the price of a product from $18 per unit to $17 causes demand to increase from 22,000 units per month to 23,000 makes sense only in terms of an ideal experiment-one that can never be performed in practice. The experiment involves two scenarios. In the first, price is set at $18. The second scenario is identical to the first in all respects except that the price is set at $17. Under both scenarios, the monthly demand is observed. If demand was 22,000 units per month when price was $18, and 23,000 when price was $17, we can conclude that the price change caused the change in demand.

"Identical in all respects . . ." means just that. The two scenarios cannot take place in different periods of time, or in different geographical regions. Although it is common practice to assert that an increase in demand that followed a reduction in price was "caused" by the price reduction, such an assertion involves measurement of demand in two different periods, between which other factors that affect demand may have changed.

Given that the "ideal" experiment can never be performed in practice, we can never measure exactly how much a change in the value of one variable causes the value of another variable to change. Our challenge is to find methods that come as close as possible to mimicking the ideal experiment. But before discussing such methods, let's explore what we would learn if we could actually carry out this two-scenario experiment.

The price reduction-the variable whose value we deliberately change-is called a treatment or an intervention. What is the effect of that treatment? Although we have focused on just one effect that was of particular interest to us-the change in demand-it should be clear that the price change causes not just this one effect, but many effects. An increase in demand of 1,000 units next month means that some customers who would not have purchased at the old price decided to purchase as a result of the intervention. This, in turn, may mean that some of them will not purchase some other product, and that some will reduce their savings or increase their debt. Perhaps a competitor will respond to our price reduction by reducing his price as well (something he would not have done had we not lowered our price), and this too might have an impact on the demand for our product.

Some of those effects may have little to do with our "bottom line," but others may. If lowering the price of one item in our product line diverts demand away from other items, then the total effect of the intervention is the increase in demand for the item whose price was reduced, less the decrease in demand for substitute products. This is more appropriately measured in dollars than in units. A single intervention usually can change the values of many variables, but one of them-the net change in dollar sales across all items in our product line-is the one we select to be the dependent variable, the one that most appropriately measures the total effect of the intervention.

WHAT IS AN EXPERIMENT?
The nearest we can come in the real world to measuring the true effect of a treatment is to conduct an experiment by finding "matched pairs" (pairs of individuals, or experimental units, that are as alike as possible in all respects) and to apply the treatment to one member of the pair and not to the other. If the pairs were truly matched in all respects, we could achieve with matched pairs what the "ideal experiment" does with a single individual or experimental unit: measure the effect by measuring the difference in their responses. Unfortunately, from a practical point of view, it is not possible to find individuals who are exactly alike in all respects. (Even identical twins have almost certainly been exposed to different environmental influences.) Thus "matched pairs" may be alike in many respects, but they may differ with respect to other, unmeasured variables that have an effect on the dependent variable. If these unmeasured variables happen to be correlated with the treatment, the observed treatment effect will include a proxy effect for these other variables.

Even the choice of which experimental unit gets the treatment and which does not may make it difficult to sort out effects. For example, the average effect on longevity of giving up smoking (the "treatment") may be different for people who voluntarily give it up than for people who are forced to give it up. In the extreme, we could imagine a situation in which those who voluntarily give up smoking are the only ones whose longevity is increased. If this were true, we would observe that those who gave up smoking lived longer than those who did not, but we would also discover that applying coercion or providing incentives to nonvolunteers to give up smoking would provide no benefits to them.

Unwanted proxy effects of unmeasured variables can be eliminated by using a random device to choose which experimental unit in a matched pair receives the treatment. Random assignment of treatments assures that, on average, whatever unmeasured variables affect the dependent variable will not be correlated with the treatment. Thus, the treatment will not capture unwanted proxy effects.

There are situations where this random assignment can be carried out relatively easily. Returning to our price-reduction problem, suppose the context is that of a direct-mail company. The company could prepare two sets of catalogs, both identical except for the item whose price reduction was under consideration. One set of catalogs would show the standard price; the other set, the reduced price. Matched pairs of customers could be selected based on the recency, frequency, and monetary value of their previous purchases, and for each pair the catalog with the reduced price could be assigned at random. In drug testing, it is routine to assign the drug to one member of a matched pair, and a placebo to the other, with the determination of who gets what decided by a randomizing device (e.g., the flip of a coin).

There are other situations where random assignment is virtually impossible, either because it is too hard to implement or because it is socially unacceptable. In the smoking experiment, it would be impossible to justify and enforce a policy in which some people, chosen at random, were instructed to keep smoking, while others were told to stop smoking. In dealing with the economy, we cannot segment the population into two groups, and take measures to increase unemployment among only one group, then observe the difference in the rate of inflation in the two segments. Even in the pricing example, the mail-order company's management might find it unacceptable to have two catalogs with different prices in circulation. We are often reduced to relying on observational data.

OBSERVATIONAL DATA

When we seek to estimate from observational data the effect of a "treatment" (an independent variable whose value we will be able to manipulate in the future) on a dependent variable, the estimation problem is made more difficult by the presence of other independent variables that may also affect the dependent variable, and may be correlated with the treatment variable.

An independent variable may be correlated with the treatment for one of four reasons:

1. There may be no causal relationship between the two variables, but they might be correlated by chance alone.

2. The independent variable may affect the treatment.
3. It may be affected by the treatment.

4. It and the treatment may both be affected by some other variable: they may be correlated due to a "common cause."

An Example

We may, for example, be interested in learning by how much changes in the posted speed limit on highways (the treatment) affect motor-vehicle death rates-deaths per thousand drivers per year (the dependent variable). The reason for our interest is that if we discover that reducing the speed limit reduces death rates, we might want to propose legislation to lower the speed limit.

Suppose we have a cross section of the fifty states in the United States, with data on each state's maximum speed limit and motor-vehicle death rate. Even if lowering the speed limit really reduced the death rate, a scatter diagram of speed limit vs. death rate might show that, in the data, death rate declines as speed limit increases. How could this be? It might be that states with very high death rates are states where bad weather makes driving conditions hazardous, where drivers drive long distances, where driving under the influence of alcohol is prevalent, etc. These states may have lowered the speed limit to reduce the carnage, but still have higher death rates than states in which driving is safer, but the speed limit remains high. If this story is correct, then low speed limits may reduce the death rate but also proxy for variables that increase the death rate.

If those other variables-weather conditions, miles driven per capita per year, alcohol consumption, etc.-are included, along with speed limit, as independent variables in a regression model, then the regression coefficient on speed limit will show how death rate varies with speed limit when the other variables in the model are held constant. Speed limit will no longer proxy for these other variables. If lower speed limits reduce the death rate, then the regression coefficient on speed limit will be negative. Weather conditions, miles driven, and alcohol consumption are examples of other independent variables that affect death rate and that are correlated with speed limit. The correlation occurs because these variables have caused the speed limit to be lowered in states where they are major contributors to highway deaths. These variables should be included in the model to eliminate their unwanted proxy effects on the treatment variable.

Suppose we discover that states whose citizens have pronounced concerns for public safety tend to have both low speed limits and rigorous automobile-inspection standards. This is a case where the treatment and another variable (automobile inspection) that may affect death rates are correlated because they are both affected by another variable that is a common cause: concern for public safety. Clearly, a measure of the rigor of inspection standards should be included as an independent variable; otherwise, speed limit will capture the unwanted proxy effect of inspection on death rate.

Now suppose that reduction in the posted speed limit causes drivers to drive slower, on average, and it is the actual reduction in speed driven, not the posted speed limit, that causes the death rate to decrease. Should we include average speed driven as an independent variable in our model? Clearly not: if we did, the regression coefficient on posted speed limit would show its relationship to death rate when average speed driven remained constant. Since actual speed driven, not posted speed limit, causes fatal accidents, the coefficient would indicate that the effect of posted speed limit on death rate was zero, and thus make it appear that changing the posted speed limit has no effect on death rate. To assure that the regression coefficient correctly captures the causal relationship, we want posted speed limit to include the "good" proxy effect of driving speed, and thus we want to exclude driving speed from the regression model. Driving speed is an example of a variable that affects the dependent variable but is affected by the treatment variable. Such a variable should be excluded from the model, so that the treatment variable will capture its proxy effects.

Finally, consider the case where some other variable, say the average age of cars in the various states, affects death rates, but has no causal relationship to posted speed limits. Nonetheless, average age may be correlated with speed limits in the sample data: even variables that have nothing to do with one another are seldom perfectly uncorrelated in observational data. In this case, failure to include average age of cars as an independent variable in the model will cause speed limit to carry an unwanted proxy effect for age of cars. We should, therefore, include age of cars as an independent variable.
 

WHICH INDEPENDENT VARIABLES SHOULD BE INCLUDED?

Based on this example, we can state the following rules. When you want to estimate the effect that a treatment or intervention will have on a dependent variable, you should:

- include as an independent variable any variable that you believe might affect the dependent variable and that is correlated with the treatment variable because (a) it affects the treatment variable, or (b) it and the treatment variable are both affected by a common cause, or (c) the correlation occurred purely by chance.

- exclude from the model any variable that you believe might affect the dependent variable and that is correlated with the treatment variable because it is affected by the treatment variable.

If a variable affects the dependent variable but is uncorrelated with the treatment variable, whether you include it or not makes no difference in the regression coefficient for the treatment variable: there are no proxy effects from an uncorrelated variable. As a matter of practice, you should include any variable that affects the dependent variable; at the very least, it will improve the fit of the model. In any case, in a sample of observational data only rarely are two variables completely uncorrelated.

The consequences of these rules may seem counterintuitive. A variable that affects the dependent variable and is correlated with the treatment variable must be included in the model; the higher the correlation, the more important it is to include it. Including such a variable will not necessarily improve the fit very much (increase R^2, decrease RSD). Nevertheless, omitting a variable like this distorts the apparent effect of the treatment variable by causing it to pick up the unwanted proxy effects of the omitted variable.

On the other hand, omitting an independent variable that affects the dependent variable and is affected by the treatment variable assures that the treatment variable captures its "good" proxy effects. Nevertheless, omitting such a variable invariably results in a poorer fit (lower R^2, higher RSD).

In both cases, what is good for proper causal inference is bad for forecasting. This seemingly counterintuitive result is resolved by recognizing that causal inference involves correctly estimating a particular regression coefficient, while forecasting involves providing a good fit to past data. What is good practice for dealing with one of these problems is not necessarily good practice for dealing with the other.
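The two rules can be made concrete with a small simulation on entirely synthetic data. In the Python sketch below, z is a common cause of both the treatment and the outcome, and m is affected by the treatment and in turn affects the outcome; by construction, the true total effect of the treatment is 3.0.

import numpy as np

rng = np.random.default_rng(1)
n = 5000

z = rng.normal(size=n)                              # common cause
treatment = 0.8 * z + rng.normal(size=n)
m = 1.5 * treatment + rng.normal(size=n)            # affected by the treatment
outcome = 2.0 * m + 1.0 * z + rng.normal(size=n)    # total treatment effect = 3.0

def ols(y, *columns):
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(outcome, treatment)[1])        # z omitted: coefficient biased above 3.0
print(ols(outcome, treatment, z)[1])     # z included: close to the true 3.0
print(ols(outcome, treatment, z, m)[1])  # mediator m included: close to 0.0

Including z removes its unwanted proxy effect; including m strips the treatment coefficient of the "good" proxy effect it should carry, even though the fit improves.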
 

HOW TO IDENTIFY THE RELEVANT INDEPENDENT VARIABLES

Given that you should include any independent variable that might affect the dependent variable and that is correlated with the treatment variable, but is not affected by it, how do you decide what variables to include? The answer depends on your understanding of what causes what, and on your ingenuity in finding ways of measuring crucial variables. Sometimes you have to settle for a variable that is correlated with such a crucial variable. In our speed-limit example, you may have trouble obtaining data on drunk-driving convictions in a state, but probably can easily get statistics on alcohol consumption. For many reasons this may be an imperfect measure of driving under the influence, but it may be good enough for our purposes. Think about how you would measure weather conditions that are dangerous for drivers, or amount of driving per person.

A useful tool for depicting causal relationships is the influence diagram. Figure 5.1 shows such a diagram schematically. A treatment variable and a dependent variable are shown. Other variables are classified by type. Type A variables affect the dependent variable directly as well as indirectly through their effect on the treatment variable. Type B variables affect the dependent variable and are correlated with the treatment variable by virtue of a common cause. Type C variables affect the dependent variable and are correlated with the treatment variable by chance. Type D variables affect the dependent variable but are uncorrelated with the treatment variable. All of these variables should be included as independent variables in the regression model.

 Type E variables, on the other hand, affect the dependent variable directly, but are affected in turn by the treatment variable. They should not be included in the model.

It is not always clear which way the causation goes. Advertising expenditures by your competitor may be correlated with your advertising. The correlation may be due to a common cause (seasonality, business conditions), in which case your competitor's advertising should be included as an independent variable. On the other hand, your competitor may simply be reacting to your advertising, raising expenditures when you raise yours, and vice versa, in which case it should be excluded. About all you can do under such circumstances is perform the regression with and without the competitive-advertising variable, and weight the resulting regression coefficients on the treatment variable by the probability you assign to the two competing causal models.