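Heckman's sample selection model is based on the following two latent variable models:
Y1 = β'X + U1,
Y2 = γ'Z + U2,
where the error terms U1 and U2 are jointly normally distributed and independent of X and Z.
The first model is the model we are interested in. However, the latent variable Y1 is only observed if Y2 > 0. Thus, the actual dependent variable is:
Y = Y1 if Y2 > 0, and Y is a missing value if Y2 ≤ 0.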
Only the sign of the latent variable Y2 is observable, not Y2 itself: we know that Y2 > 0 if Y is observed, and Y2 ≤ 0 if not. Consequently, we may without loss of generality normalize U2 such that its variance is equal to 1.
As has been explained in detail in HECKMAN.PDF, if you ignore the sample selection problem and regress Y on X using the observed Y's only, then the OLS estimator of β will be biased, due to the fact that
E[U1 | Y2 > 0, X, Z] = ρσ φ(γ'Z) / Φ(γ'Z),
where Φ is the cumulative distribution function of the standard normal distribution, φ is the corresponding density, σ² is the variance of U1, and ρ is the correlation between U1 and U2. Hence,
E[Y | Y2 > 0, X, Z] = β'X + ρσ φ(γ'Z) / Φ(γ'Z).
The latter term causes the sample selection bias if ρ is non-zero.
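To make the size of this bias term concrete, here is a minimal Python sketch (standard library only). It evaluates ρσ φ(γ'Z)/Φ(γ'Z) for a given value of γ'Z and compares it with the average of U1 over simulated observations that pass the selection rule Y2 > 0. The function names and the parameter values in the usage note are illustrative assumptions; they are not the values used to generate HECKMANDATA.TXT.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal distribution


def selection_bias_term(gz, rho, sigma):
    """Return rho * sigma * phi(gz) / Phi(gz), where gz stands for gamma'Z."""
    return rho * sigma * STD_NORMAL.pdf(gz) / STD_NORMAL.cdf(gz)


def simulated_bias(gz, rho, sigma, n=200_000, seed=0):
    """Average of U1 over simulated draws with Y2 = gz + U2 > 0."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        u2 = rng.gauss(0.0, 1.0)  # Var(U2) is normalized to 1
        # U1 has standard deviation sigma and correlation rho with U2.
        u1 = sigma * (rho * u2 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0))
        if gz + u2 > 0:  # only these observations would have an observed Y
            kept.append(u1)
    return fmean(kept)


if __name__ == "__main__":
    # Hypothetical values: gamma'Z = 0, rho = 0.5, sigma = 1.
    print(selection_bias_term(0.0, 0.5, 1.0))
    print(simulated_bias(0.0, 0.5, 1.0))
```

With ρ = 0 the term vanishes for every value of γ'Z, which is why OLS on the observed Y's is only biased when U1 and U2 are correlated.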
In order to avoid the sample selection problem, and to get asymptotically efficient estimators, one has to estimate the model parameters by maximum likelihood. In order to do so, we need to derive the conditional density h(y|X,Z,β,γ,ρ,σ), say, of Y1 given Y2 > 0, X and Z:
h(y|X,Z,β,γ,ρ,σ) = (1/σ) φ((y − β'X)/σ) Φ([γ'Z + ρ(y − β'X)/σ] / √(1 − ρ²)) / Φ(γ'Z),
where Φ and φ are the same as before. See HECKMAN.PDF. Next, let Yj, Xj, Zj, j = 1,2,...,n, be the observations on Y, X, and Z, respectively, where some of the Yj's are missing values. Define the dummy variable Dj = 0 if Yj is a missing value, and Dj = 1 if not. Then the log-likelihood function takes the form:
ln L(β,γ,ρ,σ) = Σj { Dj [ln h(Yj|Xj,Zj,β,γ,ρ,σ) + ln Φ(γ'Zj)] + (1 − Dj) ln(1 − Φ(γ'Zj)) }.
The details of the maximum likelihood estimation procedure can be found in HECKMAN.PDF.
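As an illustration of how such a log-likelihood can be evaluated, the following Python sketch implements the standard form for Heckman's model: an observed Yj contributes the log of (1/σ) φ(e) Φ((γ'Zj + ρe)/√(1 − ρ²)) with e = (Yj − β'Xj)/σ, and a missing Yj contributes ln(1 − Φ(γ'Zj)). The function name and the convention of passing the already computed values β'Xj and γ'Zj are my own assumptions; HECKMAN.PDF remains the authoritative derivation, and this is not EasyReg's code.

```python
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def heckman_loglik(y, xb, zg, rho, sigma):
    """Log-likelihood of the sample selection model.

    y   : list of Y values, with None for a missing value
    xb  : list of beta'X_j values
    zg  : list of gamma'Z_j values
    rho : correlation of U1 and U2 (strictly between -1 and 1)
    sigma : standard deviation of U1 (positive)
    """
    total = 0.0
    for yj, xbj, zgj in zip(y, xb, zg):
        if yj is None:
            # Y is missing: probability that Y2 <= 0.
            total += log(1.0 - STD_NORMAL.cdf(zgj))
        else:
            # Y is observed: density of Y1 times P(Y2 > 0 | Y1).
            e = (yj - xbj) / sigma
            arg = (zgj + rho * e) / sqrt(1.0 - rho * rho)
            total += log(STD_NORMAL.pdf(e) / sigma) + log(STD_NORMAL.cdf(arg))
    return total


if __name__ == "__main__":
    # Two hypothetical observations: one observed Y, one missing.
    print(heckman_loglik([0.0, None], [0.0, 0.0], [0.0, 0.0], 0.0, 1.0))
```

For extreme values of γ'Zj the plain `log(1 - cdf)` form can lose precision; a production implementation would guard against that.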
The data have been generated artificially as follows. First, the independent variables X1,j and X2,j for j = 1,...,n, with n = 500, have been generated as:
The data file involved is HECKMANDATA.TXT, in EasyReg space-delimited text format, containing the variables Y, X1 and X2.
In order to demonstrate the effect of the sample selection bias, regress Y on X1, X2 and the constant 1. EasyReg will automatically skip the 256 observations (500 minus the effective sample size of 244) for which Y is a missing value. Then you will get the following results:
OLS estimation results
Parameters   Estimate   t-value   H.C. t-value(*)
b(1)         0.88716    9.064      9.088
b(2)         0.89251    9.814     10.802
b(3)         0.35509    3.476      3.810
(*) Based on White's heteroskedasticity consistent variance matrix.
Effective sample size (n) = 244
Standard error of the residuals = 0.970034
R-square = 0.596347
Adjusted R-square = 0.592997
Recall that the true values of the parameters are:
Wald test statistic: 12.60
Asymptotic null distribution: Chi-square(3)
p-value = 0.00557
Significance levels:   10%    5%
Critical values:      6.25   7.81
Conclusions:        reject reject

Wald test statistic: 14.91
Asymptotic null distribution: Chi-square(3)
p-value = 0.00190
Significance levels:   10%    5%
Critical values:      6.25   7.81
Conclusions:        reject reject
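The reported p-values can be checked by hand. The sketch below uses the closed form of the upper tail probability of the Chi-square distribution with 3 degrees of freedom, P(W > w) = erfc(√(w/2)) + √(2w/π) exp(−w/2). The function name is my own, and small differences from the EasyReg output are to be expected because the test statistics are displayed rounded to two decimals.

```python
from math import erfc, exp, pi, sqrt


def chi2_3df_pvalue(w):
    """Upper tail probability of a Chi-square(3) variable at w >= 0."""
    return erfc(sqrt(w / 2.0)) + sqrt(2.0 * w / pi) * exp(-w / 2.0)


if __name__ == "__main__":
    for statistic in (12.60, 14.91):
        print(statistic, round(chi2_3df_pvalue(statistic), 5))
```

The same function evaluated at the critical values 6.25 and 7.81 returns approximately 0.10 and 0.05, respectively.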
As said before, when you select the model variables EasyReg automatically skips the observations containing missing values. In order to get around this problem, you have to transform the Y variable first: Open "Menu > Input > Transform variables" in the EasyReg main window.
Click the "x is missing value -> dummy = 0, x = 0" button and then double click the variable Y:
Click "Selection OK". Then the following two new variables will be created: "Missing is zero[Y]" and "Dummy not missing[Y]".
Click "OK":
Click "Done".
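For readers who prepare their data outside EasyReg, the effect of this transformation can be sketched in a few lines of Python. I assume here that missing values are represented by `None`; the function name is hypothetical and this is not EasyReg's own code.

```python
def missing_to_zero(values):
    """Mimic "x is missing value -> dummy = 0, x = 0".

    Returns the series with missing values replaced by 0, and a dummy
    that is 0 where the value was missing and 1 otherwise.
    """
    filled = [0.0 if v is None else v for v in values]
    dummy = [0 if v is None else 1 for v in values]
    return filled, dummy


if __name__ == "__main__":
    y = [1.5, None, -0.2, None]  # hypothetical data
    filled, dummy = missing_to_zero(y)
    print(filled)
    print(dummy)
    # Percentage of non-missing observations.
    print(100.0 * sum(dummy) / len(dummy))
```

Applied to the 500 observations in this tour, of which 244 have an observed Y, the percentage of ones in the dummy is 48.80.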
The variable "Missing is zero[Y]" will now be used as the dependent variable in the sample selection model instead of Y, and the dummy variable "Dummy not missing[Y]" indicates whether Y is observed.
Do not rename these variables, because EasyReg will automatically select a matching pair of variables of the type "Missing is zero[x]" and "Dummy not missing[x]".
Now open "Menu > Single equation models > Sample selection (cross-section) models" in the EasyReg main window:
The model variables are X1, X2, "Missing is zero[Y]" and "Dummy not missing[Y]".
Selecting a subset of observations usually makes no sense for sample selection models. Therefore, click "No" and then "Continue":
As said before, EasyReg will automatically select a matching pair of variables of the type "Missing is zero[x]" and "Dummy not missing[x]".
Click "Continue":
Next, you have to select the X-variables. EasyReg automatically includes the constant 1 in the list of potential X variables, and preselects it, i.e., the window opens with "* 1". Select the additional X variables X1 and X2, and click "Selection OK". Then the window changes to:
In our case the X and Z variables are the same. Again, EasyReg automatically includes the constant 1 in the list of potential Z variables, and preselects it. Thus, select the additional Z variables X1 and X2, and click "Selection OK". Then the window changes to:
Click "Continue". Then the window changes to:
We are now going to estimate the Probit model for the dummy variable "Dummy not missing[Y]", i.e., latent variable model 2.
Note that EasyReg displays model 2 with coefficient vector c instead of γ. Click "Continue":
Click "Start SIMPLEX iteration":
Following the advice of EasyReg, leave "Auto restart" checked, and click "Restart SIMPLEX iteration".
Then click "Done with SIMPLEX iteration".
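To give an idea of what the simplex iteration and its restart do, here is a minimal Python sketch of the Nelder-Mead simplex method with a restart from the point found so far. It is a simplified textbook variant written for illustration, not the Visual Basic routine used by EasyReg; the function names, the step size and the stopping rule are my own assumptions. The sketch minimizes a function, so a log-likelihood would be maximized by passing its negative.

```python
def nelder_mead(f, start, step=0.5, tol=1e-8, max_iter=5000):
    """Minimize f from `start` with a basic Nelder-Mead simplex search."""
    n = len(start)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for i in range(n):
        point = list(start)
        point[i] += step
        simplex.append(point)
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Stop once the simplex has collapsed to (almost) a single point.
        size = max(abs(p[i] - best[i]) for p in simplex for i in range(n))
        if size < tol:
            break
        # Centroid of all vertices except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [2 * c - w for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            # Reflection gave a new best point: try to go further.
            expanded = [3 * c - 2 * w for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [(c + w) / 2 for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway towards the best one.
                simplex = [best] + [[(b + x) / 2 for b, x in zip(best, p)]
                                    for p in simplex[1:]]
    return min(simplex, key=f)


def minimize_with_restart(f, start, restarts=1):
    """Run the simplex search, then restart it from the point it found."""
    x = list(start)
    for _ in range(restarts + 1):
        x = nelder_mead(f, x)
    return x


if __name__ == "__main__":
    # Hypothetical test function with its minimum at (1, -2).
    print(minimize_with_restart(lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2,
                                [0.0, 0.0]))
```

A restart rebuilds a full-sized simplex around the current point, which helps when the previous simplex has collapsed before reaching the optimum.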
The asymptotic variance matrix of the ML estimates is usually not of interest, but if you want to print it to the output file, uncheck the box involved.
If some of the parameter estimates involve very large or small numbers (in absolute value), check the box "Display the estimation results in floating point format", otherwise leave it unchecked.
When you click "Continue", the t and p values of the ML estimates will be computed, and if everything goes well (i.e., if the estimated Fisher information matrix is nonsingular), the output will be displayed.
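The two-sided p-values in the output are based on the normal approximation, so they can be recomputed from the t-values as 2(1 − Φ(|t|)). A minimal Python sketch follows; the function name is my own, and because the t-values are displayed rounded, the results agree with the EasyReg output only up to about three decimals.

```python
from statistics import NormalDist


def two_sided_p(t):
    """Two-sided p-value of a t-value under the standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    # t-values of b(3) and r in the full information ML results.
    for t in (-0.621, 0.786):
        print(t, round(two_sided_p(t), 4))
```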
Click "Continue". Then module "NEXTMENU" is activated, which in this case will enable you to conduct the Wald test of linear parameter restrictions, and to append the output shown below to the output file OUTPUT.TXT:
Recall that the true values of the parameters are:
Since you have seen the Wald test option before in conducting OLS, I will not discuss it here.
Click "Continue":
Given the Probit results, these parameters have been estimated by OLS. The initial estimates involved will now be used as starting values for full information maximum likelihood. See HECKMAN.PDF.
The estimated parameters are pretty close to the true values (and not significantly different from the true values at the 5% significance level), but actually all but one of the initial estimates are closer. However, this seems coincidental.
The output
Heckman's sample selection model:
Latent variable model 1:
Y1 = b'X + U1,
where Y1 = Y
Latent variable model 2:
Y2 = c'Z + U2,
where only the sign of Y2 is observed and Y1 is only observed if Y2 > 0:
Dummy not missing[Y] = 1 if Y2 > 0, else
Dummy not missing[Y] = 0.
The error terms U1 and U2 are jointly normally distributed, and are
independent of X and Z. Moreover, Var(U2) = 1.
Next to the components of b and c there are two additional parameters:
r = the correlation coefficient of U1 and U2
s = the square root of the variance of U1
X variables:
X(1)=X1
X(2)=X2
X(3)=1
Z variables:
Z(1)=X1
Z(2)=X2
Z(3)=1
Chosen sample: Observations 1 to 500
Effective sample size: 500
Frequency of Dummy not missing[Y] = 1: 48.80
Initial Probit estimates of c in latent variable model 2:
Newton iteration succesfully completed after 8 iterations
Last absolute parameter change = 0.0000
Last percentage change of the likelihood = 0.0000
Maximum likelihood estimation results:
Z variables   c(.)            (t-value)   [p-value]
Z(1)=X1        2.101653E+00   (7.52)      [0.00000]
Z(2)=X2        2.007804E+00   (7.42)      [0.00000]
Z(3)=1        -8.367128E-02   (-0.74)     [0.45698]
[The two-sided p-values are based on the normal approximation]
Log likelihood: -8.01301859191E+001
Sample size (n): 500
Initial parameter estimates:
b(1) = 1.068566E+00
b(2) = 1.038234E+00
b(3) = -9.298452E-02
c(1) = 2.101653E+00
c(2) = 2.007804E+00
c(3) = -8.367128E-02
r = 0.000000E+00
s = 1.000785E+00
The Log-likelihood has been maximized using the simplex method of Nelder
and Mead. The algorithm involved is a Visual Basic translation of the
Fortran algorithm involved in:
Press, W.H., B.P.Flannery, S.A.Teukolsky and W.T.Vetterling (1986):
'Numerical Recipes', Cambridge University Press, pp. 292-293
Full information maximum likelihood estimation results:
Parameters   ML estimates   t-value   [p-value]
b(1)          1.096625       9.679    [0.00000]
b(2)          1.025825       9.175    [0.00000]
b(3)         -0.112768      -0.621    [0.53480]
c(1)          1.870422       4.812    [0.00000]
c(2)          2.035303       5.583    [0.00000]
c(3)         -0.063557      -0.381    [0.70318]
r             0.945032       0.786    [0.43205]
s             0.998827      16.626    [0.00000]
[The two-sided p-values are based on the normal approximation]
Log-Likelihood = -401.49387556627
n = 500
Information criteria:
Akaike: 1.637975502
Hannan-Quinn: 1.664436388
Schwarz: 1.705409232
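The three information criteria above can be reproduced from the reported log-likelihood, the number of parameters k = 8 (three components of b, three of c, plus r and s) and n = 500. The formulas in the Python sketch below are the usual per-observation versions; that EasyReg uses exactly these is my inference from the fact that they match the reported values, not something stated in the output.

```python
from math import log


def information_criteria(loglik, k, n):
    """Per-observation Akaike, Hannan-Quinn and Schwarz criteria."""
    akaike = (-2.0 * loglik + 2.0 * k) / n
    hannan_quinn = (-2.0 * loglik + 2.0 * k * log(log(n))) / n
    schwarz = (-2.0 * loglik + k * log(n)) / n
    return akaike, hannan_quinn, schwarz


if __name__ == "__main__":
    for value in information_criteria(-401.49387556627, 8, 500):
        print(round(value, 9))
```

Smaller values indicate a better trade-off between fit and the number of parameters, which makes the criteria useful for comparing alternative specifications estimated on the same sample.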
This is the end of the guided tour on Heckman's sample selection model.