How to compare two regression models in Stata

by Jeff Meyer

An “estimation command” in Stata is a generic term for a command that fits a statistical model. Examples are regress, anova, poisson, logit, and mixed.

Stata has more than 100 estimation commands.

Creating the “best” model requires trying alternative models. There are a number of different model-building approaches, but regardless of the strategy you take, you’re going to need to compare the models you fit.

Running all these models can generate a fair amount of output to compare and contrast. How can you view and keep track of all of the results?

You could scroll through the results window on your screen. But this method makes it difficult to compare differences.

You could copy and paste the results into a Word document or spreadsheet. Or, better yet, use the community-contributed “esttab” command to output your results. But both of these require a number of time-consuming steps.

But Stata makes it easy: my suggestion is to use the post-estimation command “estimates”.

What is a post-estimation command? A post-estimation command analyzes the stored results of an estimation command (regress, anova, etc.).

As long as you give each model a different name, you can store many sets of results (Stata keeps stored results in memory for the current session). You can then use post-estimation commands to dig deeper into the results of that specific estimation.

Here is an example. I will run four regression models to examine the impact several factors have on one’s mental health, measured by the Mental Composite Score (MCS). I will then store the results of each one.

regress MCS  weeks_unemployed   i.marital_status
estimates store model_1

regress MCS  weeks_unemployed   i.marital_status   kids_in_house
estimates store model_2

regress MCS  weeks_unemployed   i.marital_status   kids_in_house   religious_attend
estimates store model_3

regress MCS  weeks_unemployed   i.marital_status   kids_in_house  religious_attend    income
estimates store model_4
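
Once results are stored, a few other estimates subcommands help keep track of them. The sketch below is only an illustration: it uses the auto dataset that ships with Stata as a stand-in for the mental health data, and the names m_small, m_large, and price_hat are made up for the example.

```stata
* Load the example dataset bundled with Stata (a stand-in for the article's data)
sysuse auto, clear

* Fit two nested models and store each under its own (hypothetical) name
regress price mpg
estimates store m_small

regress price mpg weight
estimates store m_large

* List every set of results currently stored
estimates dir

* Redisplay a stored model's output without re-estimating it
estimates replay m_small

* Make m_small the active results again so that post-estimation
* commands such as predict apply to it rather than to m_large
estimates restore m_small
predict double price_hat, xb
```

The point of estimates restore is that post-estimation commands always act on the active results, which by default are those of the last model fitted.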

To view the results of the four models in one table my code can be as simple as:

estimates table model_1 model_2 model_3 model_4

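But I want to format the table, so I use the following, which shows variable labels in a wider first column, formats the coefficients to three decimal places, adds significance stars, and reports the sample size and adjusted R-squared: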

estimates table model_1 model_2 model_3 model_4, varlabel varwidth(25) b(%6.3f) ///
    star(0.05 0.01 0.001) stats(N r2_a)

Here are my results:

[Table 1: output of estimates table for model_1 through model_4; image not reproduced]

My base category for marital status was “widowed”. Is “widowed” the base category I want to use in my final analysis? I can easily re-run model 4, using a different base category each time.

Putting the results into one table will make it easier for me to determine which category to use as the base.
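
One way to do this is with the ib#. factor-variable prefix, which sets the base level explicitly. The sketch below assumes marital_status is coded with integer levels that include 1, 2, and 3; the actual codes are not given here, and the stored names base_1 to base_3 are made up for the example.

```stata
* Assumption: marital_status has integer levels 1, 2 and 3 (among others).
* Check the real codes first with:  tabulate marital_status, nolabel

* Model 4 with level 1 as the base category
regress MCS weeks_unemployed ib1.marital_status kids_in_house religious_attend income
estimates store base_1

* Model 4 with level 2 as the base category
regress MCS weeks_unemployed ib2.marital_status kids_in_house religious_attend income
estimates store base_2

* Model 4 with level 3 as the base category
regress MCS weeks_unemployed ib3.marital_status kids_in_house religious_attend income
estimates store base_3

* Same model under each base category, side by side
estimates table base_1 base_2 base_3, b(%6.3f) star(0.05 0.01 0.001) stats(N r2_a)
```

Changing the base category does not change the fit of the model, so N and adjusted R-squared should be identical across the columns; only the marital status coefficients and the constant are re-expressed.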

[Table: model 4 re-estimated with different base categories for marital status; image not reproduced]

Note that in table 1 the sample size changes from model 2 (2,070) to model 3 (2,067) to model 4 (1,682). In the next article we will explore how to use post-estimation data to fit every model on the same sample.
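
As a preview, one common way to hold the sample fixed relies on e(sample), which flags the observations used by the most recent estimation command. This is a minimal sketch of that idea and not necessarily the approach the next article takes; the variable name in_model4 and the stored names ending in _s are made up for the example.

```stata
* Fit the model with the most covariates first; it has the smallest usable sample
regress MCS weeks_unemployed i.marital_status kids_in_house religious_attend income

* e(sample) equals 1 for observations used in the estimation above, 0 otherwise
generate byte in_model4 = e(sample)
estimates store model_4_s

* Refit the smaller models on exactly those observations
regress MCS weeks_unemployed i.marital_status if in_model4
estimates store model_1_s

regress MCS weeks_unemployed i.marital_status kids_in_house if in_model4
estimates store model_2_s

regress MCS weeks_unemployed i.marital_status kids_in_house religious_attend if in_model4
estimates store model_3_s

* N should now be the same in every column
estimates table model_1_s model_2_s model_3_s model_4_s, b(%6.3f) stats(N r2_a)
```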

Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor.

Sometimes your research may predict that the size of a regression coefficient should be bigger for one group than for another. For example, you might believe that the regression coefficient of height predicting weight would be higher for men than for women. Below, we have a data file with 10 fictional females and 10 fictional males, along with their height in inches and their weight in pounds.

id  gender  height  weight
 1  F       56      117
 2  F       60      125
 3  F       64      133
 4  F       68      141
 5  F       72      149
 6  F       54      109
 7  F       62      128
 8  F       65      131
 9  F       65      131
10  F       70      145
11  M       64      211
12  M       68      223
13  M       72      235
14  M       76      247
15  M       80      259
16  M       62      201
17  M       69      228
18  M       74      245
19  M       75      241
20  M       82      269

We analyzed their data separately using the regress command below after first sorting by gender.

use https://stats.idre.ucla.edu/stat/stata/faq/compreg2, clear
sort gender
by gender: regress weight height

The parameter estimates (coefficients) for females and males are shown below, and the results do seem to suggest that height is a stronger predictor of weight for males (3.19) than for females (2.10).

-> gender=F

  Source |       SS       df       MS                  Number of obs =      10
---------+------------------------------               F(  1,     8) =  359.81
   Model |  1319.56112     1  1319.56112               Prob > F      =  0.0000
Residual |  29.3388815     8  3.66736019               R-squared     =  0.9782
---------+------------------------------               Adj R-squared =  0.9755
   Total |     1348.90     9  149.877778               Root MSE      =   1.915

------------------------------------------------------------------------------
  weight |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  height |   2.095872    .110491     18.969   0.000        1.84108    2.350665
   _cons |   -2.39747   7.053272     -0.340   0.743      -18.66234     13.8674
------------------------------------------------------------------------------

-> gender=M

  Source |       SS       df       MS                  Number of obs =      10
---------+------------------------------               F(  1,     8) =  669.93
   Model |  3882.53627     1  3882.53627               Prob > F      =  0.0000
Residual |  46.3637317     8  5.79546646               R-squared     =  0.9882
---------+------------------------------               Adj R-squared =  0.9867
   Total |     3928.90     9  436.544444               Root MSE      =  2.4074

------------------------------------------------------------------------------
  weight |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  height |   3.189727   .1232367     25.883   0.000       2.905543    3.473912
   _cons |   5.601677   8.930197      0.627   0.548      -14.99139    26.19475
------------------------------------------------------------------------------

We can compare the regression coefficients of males with females to test the null hypothesis Ho: Bf = Bm, where Bf is the regression coefficient for females, and Bm is the regression coefficient for males. To do this analysis, we first make a dummy variable called female that is coded 1 for female and 0 for male, and a variable femht that is the product of female and height. We then use female, height, and femht as predictors in the regression equation.

generate female = .
replace female = 1 if gender == "F"
replace female = 0 if gender == "M"
generate femht = female*height
regress weight female height femht

The output is shown below.

  Source |       SS       df       MS                  Number of obs =      20
---------+------------------------------               F(  3,    16) = 4250.11
   Model |  60327.0974     3  20109.0325               Prob > F      =  0.0000
Residual |  75.7026131    16  4.73141332               R-squared     =  0.9987
---------+------------------------------               Adj R-squared =  0.9985
   Total |    60402.80    19  3179.09474               Root MSE      =  2.1752

------------------------------------------------------------------------------
  weight |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  female |  -7.999147   11.37055     -0.703   0.492      -32.10363    16.10533
  height |   3.189727   .1113503     28.646   0.000       2.953675    3.425779
   femht |  -1.093855   .1677774     -6.520   0.000      -1.449528   -.7381831
   _cons |   5.601677   8.068862      0.694   0.497      -11.50355     22.7069
------------------------------------------------------------------------------

The term femht tests the null hypothesis Ho: Bf = Bm. The t value is -6.52 and is significant (p < 0.001), indicating that the regression coefficient Bf is significantly different from Bm.
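
To see why the femht coefficient carries this test, it can help to write the pooled model out separately for each group. The symbols b_0 through b_3 and e are introduced here only for illustration; they do not appear in the output.

```latex
\begin{align*}
\text{pooled model:}\quad \text{weight} &= b_0 + b_1\,\text{female} + b_2\,\text{height} + b_3\,\text{femht} + e \\
\text{males (female = 0):}\quad \text{weight} &= b_0 + b_2\,\text{height} + e
  &&\Rightarrow\; B_m = b_2 \\
\text{females (female = 1):}\quad \text{weight} &= (b_0 + b_1) + (b_2 + b_3)\,\text{height} + e
  &&\Rightarrow\; B_f = b_2 + b_3 \\
B_f - B_m &= b_3
\end{align*}
```

So Ho: Bf = Bm is the same statement as b_3 = 0, which is exactly what the t test on femht evaluates.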

Let’s look at the parameter estimates to get a better understanding of what they mean and how they are interpreted.

First, recall that our dummy variable female is 1 for females and 0 for males, so males are the omitted group. This is needed for proper interpretation of the estimates.

Variable     Parameter Estimate
INTERCEPT     5.601677 : This is the intercept for the males (omitted group). This corresponds to the intercept for males in the separate groups analysis.
FEMALE       -7.999147 : Intercept females - intercept males. This corresponds to the difference of the intercepts from the separate groups analysis, and is indeed -2.397470040 - 5.601677149.
HEIGHT        3.189727 : Slope for males (omitted group), i.e. Bm.
FEMHT        -1.093855 : Slope for females - slope for males (i.e. Bf - Bm). From the separate groups, this is indeed 2.095872170 - 3.189727463.

Note that we constructed all of the variables manually to make it very clear what each variable represented.  However, in day-to-day use, you would probably be more likely to use factor variable notation to generate the dummy variables and interactions for you.  For example,

regress weight i.female##c.height

      Source |       SS           df       MS      Number of obs   =        20
-------------+----------------------------------   F(3, 16)        =   4250.11
       Model |  60327.0974         3  20109.0325   Prob > F        =    0.0000
    Residual |  75.7026131        16  4.73141332   R-squared       =    0.9987
-------------+----------------------------------   Adj R-squared   =    0.9985
       Total |     60402.8        19  3179.09474   Root MSE        =    2.1752

---------------------------------------------------------------------------------
         weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
       1.female |  -7.999147   11.37055    -0.70   0.492    -32.10363    16.10533
         height |   3.189727   .1113503    28.65   0.000     2.953675    3.425779
                |
female#c.height |
             1  |  -1.093855   .1677774    -6.52   0.000    -1.449528   -.7381831
                |
          _cons |   5.601677   8.068862     0.69   0.497    -11.50355     22.7069
---------------------------------------------------------------------------------
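
With factor variable notation, the slope and intercept for females can be recovered directly from the pooled model with lincom. The following is a minimal sketch rather than part of the original analysis; it builds the female dummy in one step, which is a shortcut that assumes gender is never missing.

```stata
* Load the data and build the dummy (one-line shortcut; assumes no missing gender)
use https://stats.idre.ucla.edu/stat/stata/faq/compreg2, clear
generate female = (gender == "F")

* Pooled model with the interaction generated by factor-variable notation
regress weight i.female##c.height

* Slope for females: male slope plus the interaction term, Bf = Bm + (Bf - Bm)
lincom height + 1.female#c.height

* Intercept for females: male intercept plus the female shift
lincom _cons + 1.female
```

The point estimates from the two lincom commands should reproduce the female slope (2.095872) and intercept (-2.39747) from the separate regressions. The standard errors will differ, because the pooled model uses a single residual variance for both groups.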