*DATA MINING SAS NOTES
*
*Box-Cox transformations and variable selection;

*Creating example data;
data regression_example;
    call streaminit(5366);
    do i=1 to 1000;
        u=rand('UNIFORM');
        X1=10+20*rand('UNIFORM');
        X2=40+10*rand('UNIFORM');
        X3=500+100*rand('UNIFORM');
        epsilon=.1*rand("NORMAL");
        W=200+3*X1+4*X2+0.7*X3+epsilon;
        Y=W**2;
        output;
    end;
run;

*Checking example data with glmselect;
proc glmselect data=regression_example;
    model W=X1 X1*X1 X2 X3;
run;

*Performing a Box-Cox transformation on Y
*In general, Ytilde = (Y^lambda-1)/lambda, if lambda != 0
*            Ytilde = log(Y),              if lambda = 0;
ods graphics on;
proc transreg data=regression_example;
    model boxcox(Y/lambda = -2 to 2 by .01) = Identity(X1 X2 X3);
    output out=myoutput coefficients;
run;
ods graphics off;

proc print data=myoutput;
run;

*Transforming Y using the lambda obtained from the Box-Cox procedure
*(lambda = 0.5 is used here, which matches Y=W**2 by construction);
data regression_example;
    call streaminit(5366);
    do i=1 to 1000;
        u=rand('UNIFORM');
        X1=10+20*rand('UNIFORM');
        X2=40+10*rand('UNIFORM');
        X3=500+100*rand('UNIFORM');
        epsilon=.1*rand("NORMAL");
        W=200+3*X1+4*X2+0.7*X3+epsilon;
        Y=W**2;
        Ytilde=(Y**0.5-1)/0.5;
        output;
    end;
run;

*Refitting the model to the transformed Y;
proc glmselect data=regression_example;
    model Ytilde=X1 X1*X1 X2 X3;
run;

*Creating data for performing a stepwise regression
*(only X1, X2 and X3 enter the true model, X4 to X20 are noise);
data regression_example;
    call streaminit(5366);
    do i=1 to 1000;
        u=rand('UNIFORM');
        X1=10+20*rand('UNIFORM');
        X2=40+10*rand('UNIFORM');
        X3=500+100*rand('UNIFORM');
        X4=500+100*rand('UNIFORM');
        X5=500+100*rand('UNIFORM');
        X6=500+100*rand('UNIFORM');
        X7=500+100*rand('UNIFORM');
        X8=500+100*rand('UNIFORM');
        X9=500+100*rand('UNIFORM');
        X10=500+100*rand('UNIFORM');
        X11=500+100*rand('UNIFORM');
        X12=500+100*rand('UNIFORM');
        X13=500+100*rand('UNIFORM');
        X14=500+100*rand('UNIFORM');
        X15=500+100*rand('UNIFORM');
        X16=500+100*rand('UNIFORM');
        X17=500+100*rand('UNIFORM');
        X18=500+100*rand('UNIFORM');
        X19=500+100*rand('UNIFORM');
        X20=500+100*rand('UNIFORM');
        epsilon=.1*rand("NORMAL");
        Y=200+3*X1+4*X2+0.7*X3+epsilon;
        output;
    end;
run;

*Typical regression;
proc reg data=regression_example;
    model Y=X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20;
run;

*To generate the model statement, use the R code
temp=""
for(i in 1:20){
  temp=paste(temp,paste0("X",i))
}
temp
;

*Regression with stepwise variable selection;
proc glmselect data=regression_example;
    model Y=X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20/selection=stepwise;
run;
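
*A possible SAS-only alternative to the R snippet above (a sketch added
*for illustration, not part of the original notes). The numbered range
*list X1-X20 expands to X1 X2 ... X20, so the variable list does not
*have to be generated outside SAS. This assumes the predictors keep
*the names X1 through X20 used in the data step above;
proc reg data=regression_example;
    model Y=X1-X20;
run;
quit;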