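I. Core Principles of the Lasso
Lasso (Least Absolute Shrinkage and Selection Operator), proposed by Tibshirani in 1996, is a regression method that combines feature selection with regularization. It adds an L1-norm penalty to the least-squares objective, so that variables are selected and coefficients are shrunk at the same time:
$$
\hat\beta = \arg\min_{\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i'\beta\bigr)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
$$
where λ ≥ 0 is the penalty parameter that controls the strength of the penalty. When λ is large enough, some coefficients are shrunk exactly to zero, which is what performs the variable selection.
This property makes the Lasso particularly well suited to high-dimensional data, where the number of candidate regressors is large relative to the sample size.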
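To see why a large enough λ sets coefficients exactly to zero, it helps to look at a special case. The following is a sketch under two assumptions that the text does not state: the objective is scaled as $\frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$, and the regressors are orthonormal, $X'X/n = I_p$. The problem then separates coordinate by coordinate, and the Lasso solution is the soft-thresholded OLS estimate:

$$
\hat\beta_j^{\text{lasso}} = \operatorname{sign}\bigl(\hat\beta_j^{\text{OLS}}\bigr)\,\max\bigl(\lvert\hat\beta_j^{\text{OLS}}\rvert - \lambda,\ 0\bigr),
\qquad \hat\beta^{\text{OLS}} = X'y/n .
$$

Any coefficient whose OLS estimate is smaller than λ in absolute value is set to zero, and the remaining ones are pulled toward zero by λ. With correlated regressors there is no closed form, and software may scale λ differently, but the same thresholding mechanism is at work.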
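II. Two Uses of Variable Selection
(1) Selecting control variables: the pdslasso command
When a model has many candidate controls, including all of them risks overfitting, while hand-picking a subset risks omitted-variable bias. Post-double-selection (PDS) Lasso selects controls in the following steps:
- First-stage selection: run separate Lasso regressions of the dependent variable y and of the key regressor d on the full set of controls
- Union: take the union of the controls selected in the two regressions
- Estimation: regress y on d and the union of selected controls by OLS
Because a control is kept if it predicts either y or d, the procedure avoids ad hoc choices by the researcher and, under approximate sparsity, yields a consistent estimator of the coefficient on d.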
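The steps above can be reproduced by hand with rlasso, which may help in seeing what pdslasso automates. This is a minimal sketch, assuming the AJR dataset and variable names used in the examples of this article and the default rlasso penalty settings; the local macro names (controls, sel_y, sel_d, pds_set) are illustrative only.

```stata
* Hand-rolled post-double-selection (illustrative sketch, not pdslasso's own code)
use https://statalasso.github.io/dta/AJR.dta, clear

* Candidate high-dimensional controls (same list as in the pdslasso example)
local controls lat_abst edes1975 avelf temp* humid* steplow-oilres

* Step 1a: Lasso of the dependent variable on the controls
rlasso logpgp95 `controls'
local sel_y `e(selected)'

* Step 1b: Lasso of the key regressor on the controls
rlasso avexpr `controls'
local sel_d `e(selected)'

* Step 2: union of the two selected sets (macro list union operator)
local pds_set : list sel_y | sel_d
display "PDS-selected controls: `pds_set'"

* Step 3: OLS of y on d plus the union of selected controls
regress logpgp95 avexpr `pds_set'
```

Only the coefficient on avexpr should be interpreted. The standard errors reported by plain regress may differ slightly from those of pdslasso, which handles the inference step itself.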
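(2) Selecting instruments: the ivlasso command
In instrumental-variables (IV) estimation with many candidate instruments, Lasso can screen them in the following steps:
- First-stage selection: run a Lasso regression of the endogenous regressor on the candidate instruments and keep those that predict it strongly
- Estimation: use the selected instruments to estimate the structural equation by IV
This approach is most useful when there are many candidate instruments, where it can improve the efficiency of the estimator. When the instruments may be weak, the sup-score test described below provides weak-identification-robust inference.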
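The two steps above can be sketched by hand as follows. This is an illustration only, under the assumption that the candidate instrument list from the ivlasso example in this article (logem4 euro1900-cons00a) is available in the AJR data; ivlasso itself builds optimal instruments and handles inference differently, so the numbers need not match its output. The macro names are hypothetical.

```stata
* Hand-rolled Lasso instrument selection (illustrative sketch only)
use https://statalasso.github.io/dta/AJR.dta, clear

* Candidate instruments for the endogenous regressor avexpr
local candidates logem4 euro1900-cons00a

* Step 1: Lasso first stage, keeping instruments that predict avexpr
rlasso avexpr `candidates'
local sel_iv `e(selected)'
display "Selected instruments: `sel_iv'"

* Step 2: IV estimation with the selected instruments only
if "`sel_iv'" != "" {
    ivregress 2sls logpgp95 (avexpr = `sel_iv')
}
else {
    * Lasso may select nothing; IV is then not identified
    display as text "No instruments selected; the model is not identified."
}
```

The guard on an empty selection matters in practice: with a strict penalty, Lasso can return no instruments at all.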
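III. Implementation in Stata
pdslasso and ivlasso are tools for post-selection and post-regularization OLS/IV estimation and inference in high-dimensional linear models. They implement the PDS (post-double-selection) and CHS (Chernozhukov-Hansen-Spindler) methodologies, using Lasso or square-root Lasso for variable selection.
(1) Installation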
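ssc install pdslasso, replace // main package (pdslasso and ivlasso)
ssc install ftools // optional dependency, used for fixed effects
ssc install lassopack // provides rlasso, which pdslasso and ivlasso call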
(2) Syntax
pdslasso
pdslasso depvar regressors (hd_controls) [weight] [if exp] [in range] [,
partial(varlist) pnotpen(varlist) aset(varlist) post(method) robust
cluster(var) fe noftools rlasso[(name)] sqrt noisily
loptions(options) olsoptions(options) noconstant ]
ivlasso
ivlasso depvar regressors [(hd_controls)] (endog=instruments) [if exp] [in range] [,
partial(varlist) pnotpen(varlist) aset(varlist) post(method) robust
cluster(var) fe noftools rlasso[(name)] sqrt noisily
loptions(options) ivoptions(options) first idstats sscset
ssgamma(real) ssgridmin(real) ssgridmax(real) ssgridpoints(integer 100)
ssgridmat(name) noconstant ]
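Common options
| Option | Description |
|---|---|
| partial(varlist) | Variables partialled out before Lasso estimation (an application of the Frisch-Waugh-Lovell theorem) |
| pnotpen(varlist) | Variables that Lasso leaves unpenalized |
| aset(varlist) | Amelioration set: variables always included in the post-Lasso/PDS estimation |
| post(method) | Which set of results to post: pds (default), lasso, plasso |
| robust | Heteroskedasticity-robust penalty loadings and standard errors |
| cluster(var) | Cluster-robust penalty loadings and standard errors |
| fe | Fixed-effects model |
| rlasso[(name)] | Store and display the intermediate rlasso results, with an optional name prefix |
| sqrt | Use square-root Lasso instead of the standard Lasso |
| loptions(options) | Options passed through to rlasso (e.g. prestd, center) |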
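Options specific to ivlasso
| Option | Description |
|---|---|
| first | Report and store the first-stage results |
| idstats | Report weak-identification statistics |
| sscset | Report the sup-score confidence set |
| ssgamma(#) | Significance level for the sup-score test (default 0.05) |
| ssgridmin/max | Lower and upper bounds of the grid used for the sup-score confidence set |
| ssgridpoints(#) | Number of grid points (default 100) |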
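Methodological notes
PDS method
- Double selection: run Lasso separately for the dependent variable and for the key regressor, then take the union of the two selected sets
- Orthogonalization: partialling out the high-dimensional controls addresses omitted-variable bias
CHS method
- Post-regularization inference: after Lasso selection, construct orthogonalized versions of the dependent variable and the regressors
- Amelioration set: variables not selected by Lasso can still be included in the post-estimation via aset()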
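The orthogonalization behind the CHS approach can be sketched by hand: residualize both the dependent variable and the regressor of interest on the controls using Lasso, then regress one residual on the other. This is a rough illustration, assuming the AJR variables used elsewhere in this article and assuming that rlasso's predict accepts the resid and lasso options as I understand them from its help file (worth checking with help rlasso); the variable names ry and rd are hypothetical.

```stata
* Hand-rolled CHS-style orthogonalization (illustrative sketch only)
use https://statalasso.github.io/dta/AJR.dta, clear
local controls lat_abst edes1975 avelf temp* humid* steplow-oilres

* Residualize the dependent variable on the controls with Lasso coefficients
rlasso logpgp95 `controls'
predict double ry, resid lasso

* Residualize the regressor of interest in the same way
rlasso avexpr `controls'
predict double rd, resid lasso

* Regress the orthogonalized outcome on the orthogonalized regressor
regress ry rd
```

Using post-Lasso (OLS) coefficients for the residuals instead would correspond to the "post-lasso-orthogonalized" variant that pdslasso reports separately.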
Sup-score test
- A high-dimensional analogue of the Anderson-Rubin test
- Supports three computation methods: asymptotic bound (abound), simulation (simulate), and selection (select)
Example application
Application to the AJR data
pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid*), rob
ivlasso logpgp95 (avexpr=logem4 euro1900-cons00a), partial(lat_abst) idstats sscset
Take the study of how colonial origins affect economic development as an example:
* Load the data
use "AJR.dta", clear
* Conventional OLS estimate
reg logpgp95 avexpr
* Lasso selection of control variables
pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid*), post(pds)
* Lasso selection of instruments
ivlasso logpgp95 (avexpr=logem4) (lat_abst edes1975), first

1. Load the data
use https://statalasso.github.io/dta/AJR.dta
2. Replicate the OLS results (Panel C, col. 9)
reg logpgp95 avexpr lat_abst edes1975 avelf temp* humid* steplow-oilres
Output:
reg logpgp95 avexpr lat_abst edes1975 avelf temp* humid* steplow-oilres
Source | SS df MS Number of obs = 64
-------------+---------------------------------- F(25, 38) = 8.56
Model | 58.2413524 25 2.32965409 Prob > F = 0.0000
Residual | 10.3403661 38 .272114898 R-squared = 0.8492
-------------+---------------------------------- Adj R-squared = 0.7500
Total | 68.5817185 63 1.08859871 Root MSE = .52165
------------------------------------------------------------------------------
logpgp95 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avexpr | .3720903 .064387 5.78 0.000 .2417456 .502435
lat_abst | -1.647298 1.179221 -1.40 0.171 -4.034506 .73991
edes1975 | .0073935 .0043338 1.71 0.096 -.0013798 .0161668
avelf | -1.277516 .320108 -3.99 0.000 -1.92554 -.6294907
temp1 | .2088959 .1075967 1.94 0.060 -.0089221 .4267139
temp2 | -.0621873 .0374028 -1.66 0.105 -.1379053 .0135306
temp3 | -.0562697 .046481 -1.21 0.234 -.1503655 .0378261
temp4 | -.0814298 .045515 -1.79 0.082 -.1735701 .0107105
temp5 | -.0138574 .0263971 -0.52 0.603 -.0672955 .0395807
humid1 | .0000699 .0171774 0.00 0.997 -.0347039 .0348437
humid2 | .0136996 .026547 0.52 0.609 -.0400419 .0674411
humid3 | .0178939 .0163351 1.10 0.280 -.0151748 .0509626
humid4 | -.0144873 .0188511 -0.77 0.447 -.0526492 .0236747
steplow | -.3340291 .1788555 -1.87 0.070 -.6961031 .028045
deslow | .4256796 .2544295 1.67 0.103 -.089386 .9407453
stepmid | .2600089 .4920797 0.53 0.600 -.7361543 1.256172
desmid | .8471518 .6259677 1.35 0.184 -.4200535 2.114357
drystep | -.0009126 .2713833 -0.00 0.997 -.5502994 .5484741
drywint | -.6415219 4.435143 -0.14 0.886 -9.62 8.336956
landlock | .336199 .3189248 1.05 0.298 -.3094304 .9818284
goldm | .0554595 .0942704 0.59 0.560 -.1353809 .2462999
iron | -.0655436 .0744225 -0.88 0.384 -.2162041 .0851169
silv | -.0154063 .057096 -0.27 0.789 -.1309911 .1001785
zinc | .0397732 .0976334 0.41 0.686 -.1578753 .2374218
oilres | 1.49e-07 1.94e-07 0.77 0.447 -2.44e-07 5.43e-07
_cons | 5.073594 1.326987 3.82 0.000 2.38725 7.759939
------------------------------------------------------------------------------
.
3. Basic usage: select from the high-dimensional controls
pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres)
Output:
. pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres)
1. (PDS/CHS) Selecting HD controls for dep var logpgp95...
Selected: edes1975 avelf
2. (PDS/CHS) Selecting HD controls for exog regressor avexpr...
Selected: edes1975 zinc
Estimation results:
Specification:
Regularization method: lasso
Penalty loadings: homoskedastic
Number of observations: 64
Exogenous (1): avexpr
High-dim controls (24): lat_abst edes1975 avelf temp1 temp2 temp3 temp4
temp5 humid1 humid2 humid3 humid4 steplow deslow
stepmid desmid drystep drywint landlock goldm iron
silv zinc oilres
Selected controls (3): edes1975 avelf zinc
Unpenalized controls (1): _cons
Structural equation:
OLS using CHS lasso-orthogonalized vars
------------------------------------------------------------------------------
logpgp95 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avexpr | .4262511 .0540552 7.89 0.000 .3203049 .5321974
------------------------------------------------------------------------------
OLS using CHS post-lasso-orthogonalized vars
------------------------------------------------------------------------------
logpgp95 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avexpr | .391257 .0574894 6.81 0.000 .2785799 .503934
------------------------------------------------------------------------------
OLS with PDS-selected variables and full regressor set
------------------------------------------------------------------------------
logpgp95 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avexpr | .3913455 .0561862 6.97 0.000 .2812225 .5014684
edes1975 | .0091289 .003184 2.87 0.004 .0028883 .0153694
avelf | -.9974943 .2474453 -4.03 0.000 -1.482478 -.5125104
zinc | -.0079226 .0280604 -0.28 0.778 -.0629201 .0470748
_cons | 5.764133 .3773706 15.27 0.000 5.024501 6.503766
------------------------------------------------------------------------------
Standard errors and test statistics valid for the following variables only:
avexpr
------------------------------------------------------------------------------
.
4. Replicate the IV results (Panels A & B, col. 9)
ivreg logpgp95 (avexpr=logem4) lat_abst edes1975 avelf temp* humid* steplow-oilres, first
5. Select controls; specify that logem4 is an unpenalized instrument to be partialled out
ivlasso logpgp95 (avexpr=logem4) (lat_abst edes1975 avelf temp* humid* steplow-oilres), partial(logem4)
Output:
. ivlasso logpgp95 (avexpr=logem4) (lat_abst edes1975 avelf temp* humid* steplow-oilres), pa
> rtial(logem4)
1. (PDS/CHS) Selecting HD controls for dep var logpgp95...
Selected: lat_abst edes1975 avelf temp3 humid2 humid3 humid4
3. (PDS) Selecting HD controls for endog regressor avexpr...
Selected: lat_abst edes1975 temp3 humid2 humid4 zinc
4. (PDS) Selecting HD controls for IV logem4...
Selected: edes1975 avelf temp2 humid2
5. (CHS) Selecting HD controls and IVs for endog regressor avexpr...
Selected:
Also inc: logem4
6a. (CHS) Selecting lasso HD controls and creating optimal IV for endog regressor avexpr...
Selected: lat_abst edes1975 temp3 humid2
6b. (CHS) Selecting post-lasso HD controls and creating optimal IV for endog regressor avexp
> r...
Selected: lat_abst edes1975 temp3 humid2
7. (CHS) Creating orthogonalized endogenous regressor avexpr...
Estimation results:
Specification:
Regularization method: lasso
Penalty loadings: homoskedastic
Number of observations: 64
Endogenous (1): avexpr
High-dim controls (24): lat_abst edes1975 avelf temp1 temp2 temp3 temp4
temp5 humid1 humid2 humid3 humid4 steplow deslow
stepmid desmid drystep drywint landlock goldm iron
silv zinc oilres
Selected controls, PDS (9): lat_abst edes1975 avelf temp2 temp3 humid2 humid3
humid4 zinc
Selected controls, CHS-L (7): lat_abst edes1975 avelf temp3 humid2 humid3 humid4
Selected controls, CHS-PL (7): lat_abst edes1975 avelf temp3 humid2 humid3 humid4
Partialled-out instruments (1): logem4
Structural equation:
IV using CHS lasso-orthogonalized vars
------------------------------------------------------------------------------
logpgp95 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avexpr | .9569281 .1517884 6.30 0.000 .6594284 1.254428
------------------------------------------------------------------------------
IV using CHS post-lasso-orthogonalized vars
------------------------------------------------------------------------------
logpgp95 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avexpr | .8809197 .1921417 4.58 0.000 .5043289 1.257511
------------------------------------------------------------------------------
IV with PDS-selected variables and full regressor set
------------------------------------------------------------------------------
logpgp95 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avexpr | .8518871 .2521345 3.38 0.001 .3577126 1.346062
lat_abst | -.7342439 1.427484 -0.51 0.607 -3.532061 2.063573
edes1975 | .0028945 .0056097 0.52 0.606 -.0081003 .0138893
avelf | -.9375327 .3875679 -2.42 0.016 -1.697152 -.1779136
temp2 | -.0040378 .0270368 -0.15 0.881 -.057029 .0489535
temp3 | .0110046 .0279139 0.39 0.693 -.0437056 .0657149
humid2 | .0100696 .0154822 0.65 0.515 -.020275 .0404142
humid3 | .0041323 .0112036 0.37 0.712 -.0178263 .0260908
humid4 | -.0123506 .0168476 -0.73 0.464 -.0453714 .0206701
zinc | -.0662942 .0569847 -1.16 0.245 -.1779822 .0453937
_cons | 2.527712 2.259827 1.12 0.263 -1.901468 6.956891
------------------------------------------------------------------------------
Standard errors and test statistics valid for the following variables only:
avexpr
------------------------------------------------------------------------------
.