/*LAB 3: ADJUSTED MEANS, INTERACTIONS, BOX-COX TRANSFORMATIONS*/ /*Part 1: ADJUSTED MEANS*/ data senic; input id length age risk culture xray beds msch region census nurses svcs; datalines; 1 7.13 55.7 4.1 9.0 39.6 279 2 4 207 241 60.0 2 8.82 58.2 1.6 3.8 51.7 80 2 2 51 52 40.0 3 8.34 56.9 2.7 8.1 74.0 107 2 3 82 54 20.0 4 8.95 53.7 5.6 18.9 122.8 147 2 4 53 148 40.0 5 11.20 56.5 5.7 34.5 88.9 180 2 1 134 151 40.0 6 9.76 50.9 5.1 21.9 97.0 150 2 2 147 106 40.0 7 9.68 57.8 4.6 16.7 79.0 186 2 3 151 129 40.0 8 11.18 45.7 5.4 60.5 85.8 640 1 2 399 360 60.0 9 8.67 48.2 4.3 24.4 90.8 182 2 3 130 118 40.0 10 8.84 56.3 6.3 29.6 82.6 85 2 1 59 66 40.0 11 11.07 53.2 4.9 28.5 122.0 768 1 1 591 656 80.0 12 8.30 57.2 4.3 6.8 83.8 167 2 3 105 59 40.0 13 12.78 56.8 7.7 46.0 116.9 322 1 1 252 349 57.1 14 7.58 56.7 3.7 20.8 88.0 97 2 2 59 79 37.1 15 9.00 56.3 4.2 14.6 76.4 72 2 3 61 38 17.1 16 11.08 50.2 5.5 18.6 63.6 387 2 3 326 405 57.1 17 8.28 48.1 4.5 26.0 101.8 108 2 4 84 73 37.1 18 11.62 53.9 6.4 25.5 99.2 133 2 1 113 101 37.1 19 9.06 52.8 4.2 6.9 75.9 134 2 2 103 125 37.1 20 9.35 53.8 4.1 15.9 80.9 833 2 3 547 519 77.1 21 7.53 42.0 4.2 23.1 98.9 95 2 4 47 49 17.1 22 10.24 49.0 4.8 36.3 112.6 195 2 2 163 170 37.1 23 9.78 52.3 5.0 17.6 95.9 270 1 1 240 198 57.1 24 9.84 62.2 4.8 12.0 82.3 600 2 3 468 497 57.1 25 9.20 52.2 4.0 17.5 71.1 298 1 4 244 236 57.1 26 8.28 49.5 3.9 12.0 113.1 546 1 2 413 436 57.1 27 9.31 47.2 4.5 30.2 101.3 170 2 1 124 173 37.1 28 8.19 52.1 3.2 10.8 59.2 176 2 1 156 88 37.1 29 11.65 54.5 4.4 18.6 96.1 248 2 1 217 189 37.1 30 9.89 50.5 4.9 17.7 103.6 167 2 2 113 106 37.1 31 11.03 49.9 5.0 19.7 102.1 318 2 1 270 335 57.1 32 9.84 53.0 5.2 17.7 72.6 210 2 2 200 239 54.3 33 11.77 54.1 5.3 17.3 56.0 196 2 1 164 165 34.3 34 13.59 54.0 6.1 24.2 111.7 312 2 1 258 169 54.3 35 9.74 54.4 6.3 11.4 76.1 221 2 2 170 172 54.3 36 10.33 55.8 5.0 21.2 104.3 266 2 1 181 149 54.3 37 9.97 58.2 2.8 16.5 76.5 90 2 2 69 42 34.3 38 7.84 49.1 4.6 7.1 87.9 60 2 3 50 45 34.3 39 10.47 53.2 4.1 5.7 69.1 196 2 2 168 153 54.3 40 8.16 60.9 1.3 1.9 58.0 73 2 3 49 21 14.3 41 8.48 51.1 3.7 12.1 92.8 166 2 3 145 118 34.3 42 10.72 53.8 4.7 23.2 94.1 113 2 3 90 107 34.3 43 11.20 45.0 3.0 7.0 78.9 130 2 3 95 56 34.3 44 10.12 51.7 5.6 14.9 79.1 362 1 3 313 264 54.3 45 8.37 50.7 5.5 15.1 84.8 115 2 2 96 88 34.3 46 10.16 54.2 4.6 8.4 51.5 831 1 4 581 629 74.3 47 19.56 59.9 6.5 17.2 113.7 306 2 1 273 172 51.4 48 10.90 57.2 5.5 10.6 71.9 593 2 2 446 211 51.4 49 7.67 51.7 1.8 2.5 40.4 106 2 3 93 35 11.4 50 8.88 51.5 4.2 10.1 86.9 305 2 3 238 197 51.4 51 11.48 57.6 5.6 20.3 82.0 252 2 1 207 251 51.4 52 9.23 51.6 4.3 11.6 42.6 620 2 2 413 420 71.4 53 11.41 61.1 7.6 16.6 97.9 535 2 3 330 273 51.4 54 12.07 43.7 7.8 52.4 105.3 157 2 2 115 76 31.4 55 8.63 54.0 3.1 8.4 56.2 76 2 1 39 44 31.4 56 11.15 56.5 3.9 7.7 73.9 281 2 1 217 199 51.4 57 7.14 59.0 3.7 2.6 75.8 70 2 4 37 35 31.4 58 7.65 47.1 4.3 16.4 65.7 318 2 4 265 314 51.4 59 10.73 50.6 3.9 19.3 101.0 445 1 2 374 345 51.4 60 11.46 56.9 4.5 15.6 97.7 191 2 3 153 132 31.4 61 10.42 58.0 3.4 8.0 59.0 119 2 1 67 64 31.4 62 11.18 51.0 5.7 18.8 55.9 595 1 2 546 392 68.6 63 7.93 64.1 5.4 7.5 98.1 68 2 4 42 49 28.6 64 9.66 52.1 4.4 9.9 98.3 83 2 2 66 95 28.6 65 7.78 45.5 5.0 20.9 71.6 489 2 3 391 329 48.6 66 9.42 50.6 4.3 24.8 62.8 508 2 1 421 528 48.6 67 10.02 49.5 4.4 8.3 93.0 265 2 2 191 202 48.6 68 8.58 55.0 3.7 7.4 95.9 304 2 3 248 218 48.6 69 9.61 52.4 4.5 6.9 87.2 487 2 3 404 220 48.6 70 8.03 54.2 3.5 24.3 87.3 97 2 1 65 55 28.6 71 7.39 51.0 4.2 14.6 88.4 72 2 2 38 67 28.6 72 7.08 52.0 2.0 12.3 56.4 87 2 3 52 57 28.6 73 9.53 51.5 5.2 15.0 65.7 298 2 3 241 193 48.6 74 10.05 52.0 4.5 36.7 87.5 184 1 1 144 151 68.6 75 8.45 38.8 3.4 12.9 85.0 235 2 2 143 124 48.6 76 6.70 48.6 4.5 13.0 80.8 76 2 4 51 79 28.6 77 8.90 49.7 2.9 12.7 86.9 52 2 1 37 35 28.6 78 10.23 53.2 4.9 9.9 77.9 752 1 2 595 446 68.6 79 8.88 55.8 4.4 14.1 76.8 237 2 2 165 182 48.6 80 10.30 59.6 5.1 27.8 88.9 175 2 2 113 73 45.7 81 10.79 44.2 2.9 2.6 56.6 461 1 2 320 196 65.7 82 7.94 49.5 3.5 6.2 92.3 195 2 2 139 116 45.7 83 7.63 52.1 5.5 11.6 61.1 197 2 4 109 110 45.7 84 8.77 54.5 4.7 5.2 47.0 143 2 4 85 87 25.7 85 8.09 56.9 1.7 7.6 56.9 92 2 3 61 61 45.7 86 9.05 51.2 4.1 20.5 79.8 195 2 3 127 112 45.7 87 7.91 52.8 2.9 11.9 79.5 477 2 3 349 188 65.7 88 10.39 54.6 4.3 14.0 88.3 353 2 2 223 200 65.7 89 9.36 54.1 4.8 18.3 90.6 165 2 1 127 158 45.7 90 11.41 50.4 5.8 23.8 73.0 424 1 3 359 335 45.7 91 8.86 51.3 2.9 9.5 87.5 100 2 3 65 53 25.7 92 8.93 56.0 2.0 6.2 72.5 95 2 3 59 56 25.7 93 8.92 53.9 1.3 2.2 79.5 56 2 2 40 14 5.7 94 8.15 54.9 5.3 12.3 79.8 99 2 4 55 71 25.7 95 9.77 50.2 5.3 15.7 89.7 154 2 2 123 148 25.7 96 8.54 56.1 2.5 27.0 82.5 98 2 1 57 75 45.7 97 8.66 52.8 3.8 6.8 69.5 246 2 3 178 177 45.7 98 12.01 52.8 4.8 10.8 96.9 298 2 1 237 115 45.7 99 7.95 51.8 2.3 4.6 54.9 163 2 3 128 93 42.9 100 10.15 51.9 6.2 16.4 59.2 568 1 3 452 371 62.9 101 9.76 53.2 2.6 6.9 80.1 64 2 4 47 55 22.9 102 9.89 45.2 4.3 11.8 108.7 190 2 1 141 112 42.9 103 7.14 57.6 2.7 13.1 92.6 92 2 4 40 50 22.9 104 13.95 65.9 6.6 15.6 133.5 356 2 1 308 182 62.9 105 9.44 52.5 4.5 10.9 58.5 297 2 3 230 263 42.9 106 10.80 63.9 2.9 1.6 57.4 130 2 3 69 62 22.9 107 7.14 51.7 1.4 4.1 45.7 115 2 3 90 19 22.9 108 8.02 55.0 2.1 3.8 46.5 91 2 2 44 32 22.9 109 11.80 53.8 5.7 9.1 116.9 571 1 2 441 469 62.9 110 9.50 49.3 5.8 42.0 70.9 98 2 3 68 46 22.9 111 7.70 56.9 4.4 12.2 67.9 129 2 4 85 136 62.9 112 17.94 56.2 5.9 26.4 91.8 835 1 1 791 407 62.9 113 9.41 59.5 3.1 20.6 91.7 29 2 3 20 22 22.9 run; data senic; set senic; occpct = census/beds * 100; cens100 = census/100; /* region with NE as reference */ if region=2 then regnc=1; else regnc=0; if region=3 then regs=1; else regs=0; if region=4 then regw=1; else regw=0; /* region with NC as reference */ if region=1 then dum1=1; else dum1=0; if region=3 then dum3=1; else dum3=0; if region=4 then dum4=1; else dum4=0; /* define census x region interaction for both codings */ nccen=cens100*regnc; scen=cens100*regs; wcen=cens100*regw; d1cen=cens100*dum1; d3cen=cens100*dum3; d4cen=cens100*dum4; run; /* look at the models with just region and no interaction */ /* compare the parameter estimates and F test results */ proc reg data=senic; model risk = age length cens100 occpct regnc regs regw; NEref: test regnc=0, regs=0, regw=0; run; proc reg data=senic; model risk = age length cens100 occpct dum1 dum3 dum4; NCref: test dum1=0, dum3=0, dum4=0; run; /* look at the models with region by census interaction */ /* compare the two codings and model estimates */ proc reg data=senic; model risk = age length cens100 occpct regnc regs regw nccen scen wcen; NErefInteract: test nccen=0, scen=0, wcen=0; run; proc reg data=senic; model risk = age length cens100 occpct dum1 dum3 dum4 d1cen d3cen d4cen; NCrefInteract: test d1cen=0, d3cen=0, d4cen=0; run; /* computed (unadjusted) means of risk by region */ proc means data=senic; class region; var risk; run; /* NE has highest risk */ /* computed (unadjusted) means of predictors by region */ proc means data=senic; class region; var age length census occpct; run; /* NE has oldest patients, longest average length of stay, biggest hospitals, highest occpct */ /* all of these are positively associated with risk */ /* maybe differences in regional means of risk are explained by differences in these predictors */ /* get overall means of predictors in preparation for calculating adjusted means */ proc means data =senic; var age length cens100 beds occpct; run; /* means are age = 53.23 length = 9.65 cens100 = 1.91 beds = 252.17 occpct = 73.45 */ /* make a data set to get adjusted means (i.e., predicted value of Y when predictors are at their means) */ data senicpred; input label $ age length cens100 beds occpct region; datalines; NEAdjMean 53.23 9.65 1.91 252.17 73.45 1 NCAdjMean 53.23 9.65 1.91 252.17 73.45 2 SAdjMean 53.23 9.65 1.91 252.17 73.45 3 WAdjMean 53.23 9.65 1.91 252.17 73.45 4 run; data senicpred; set senicpred; /* region with NE as reference */ if region=2 then regnc=1; else regnc=0; if region=3 then regs=1; else regs=0; if region=4 then regw=1; else regw=0; /* region with NC as reference */ if region=1 then dum1=1; else dum1=0; if region=3 then dum3=1; else dum3=0; if region=4 then dum4=1; else dum4=0; /* census by region interaction with both codings */ nccen=cens100*regnc; scen=cens100*regs; wcen=cens100*regw; d1cen=cens100*dum1; d3cen=cens100*dum3; d4cen=cens100*dum4; run; /* put data sets together */ data senicbig; set senic senicpred; run; /* take look in log and in Explorer to make sure it worked. We added some "observations" to the data set because we want to predict Y for these predictor values. The observations are missing risk so they will not be used in when the regression is estimated.*/ /* now get adjusted means for model with and without interactions. Look at the last 4 predicted values in each regression. */ proc reg data=senicbig; model risk = age length cens100 occpct regnc regs regw / p; NCvsSDiff: test regnc - regs = 0; /* test difference of non-reference regions; nothing to do with adj means, just here for illustration */ model risk = age length cens100 occpct dum1 dum3 dum4 / p; /*compare predicted values from Models 1 and 2*/ model risk = age length cens100 occpct regnc regs regw nccen scen wcen / p; model risk = age length cens100 occpct dum1 dum3 dum4 d1cen d3cen d4cen / p; NEvsNCDiff: test dum1 + 1.91*d1cen = 0; /* test difference of reference vs. non-reference adjusted means */ NEvsSDiff: test dum1-dum3 + 1.91*d1cen - 1.91*d3cen= 0; /* test difference of two non-reference adjusted means */ run; /* adjusted means are of course exactly the same for the two codings */ /* the differences in adjusted means between models with and without interaction are small */ /* only some of the difference in risk for NE versus other regions seems to be explained by differences in age, length, census, occpct */ /* Estimated Regression Lines for Each of the Regions */ /* Only consider census, region, & interaction of census by region in the model */ proc reg data=senic; model risk = cens100 regnc regs regw nccen scen wcen/p; output out= pred p = yhat; run; data senic_graph; set pred; if region = 1 then yhat1 = yhat; if region = 2 then yhat2 = yhat; if region = 3 then yhat3 = yhat; if region = 4 then yhat4 = yhat; run; goptions reset all; symbol1 i=join c=black; symbol2 i=join c=red; symbol3 i=join c=blue; symbol4 i=join c=green; axis1 label=(a=90 'Predicted Length'); axis2 label=("Cens100"); proc gplot data=senic_graph; title "Predicted Length by Census and Region"; plot yhat1*cens100=1 yhat2*cens100=2 yhat3*cens100=3 yhat4*cens100=4/overlay vaxis=axis1 haxis=axis2; run; quit; /*********************************************************************************** /* PART 2: BOX-COX TRANSFORMATIONS /*********************************************************************************** /* We will get some experience with Box-Cox transformations two ways, using a macro and using the SAS PROC TRANSREG procedure. The TRANSREG procedure does not provide good graphics; however, the SAS support website provides code to do this (pasted below). The macro, created by Michael Friendly and downloaded from his website, is much easier to use. For more info on the SAS procedure, see the website http://support.sas.com/rnd/app/da/new/802ce/stat/chap15/sect9.htm Both use the method of maximum likelihood to find an optimal power transformation for the response variable Y. What our book calls "p" is called "lambda." /* paste the macro code into an editor window, then RUN it, then proceed */ /* The BOXCOX macro refuses to work if any variable names are longer than 8 chars!! */ %boxcox(data=senic, resp=risk, model=age length cens100 occpct, gplot=RMSE, lopower=-2, hipower=2, conf=.95); run; /*try some other variable as the response variable just for illustration*/ %boxcox(data=senic, resp=length, model=age risk cens100 occpct, gplot=RMSE, lopower=-2, hipower=2, conf=.95); run; /*what transformation of length is suggested? how would we interpret this transformed variable?*/ proc reg data=senic; model nurses=risk; plot nurses*risk; run; %boxcox(data=senic, resp=nurses, model= risk , gplot=RMSE, lopower=-2, hipower=2, conf=.95); run; /*what transformation is suggested?*/ /*Now for TRANSREG. This example shows how to make graphical displays of the Box-Cox transformation results. Plots include the log likelihood function with the confidence interval, root mean squared error as a function of the power parameter, R2 as a function of the power parameter, the Box-Cox transformation of the variable y, the original scatter plot based on the untransformed data, and the new scatter plot based on the transformed data. Also, a condensed version of the log likelihood table with the confidence interval is printed. */ title h=1 'Box-Cox Graphical Displays'; data x; input y x @@; datalines; 10.0 3.0 72.6 8.3 59.7 8.1 20.1 4.8 90.1 9.8 1.1 0.9 78.2 8.5 87.4 9.0 9.5 3.4 0.1 1.4 0.1 1.1 42.5 5.1 57.0 7.5 9.9 1.9 0.5 1.0 121.1 9.9 37.5 5.9 49.5 6.7 8.3 1.8 0.6 1.8 53.0 6.7 112.8 10.0 40.7 6.4 5.1 2.4 73.3 9.5 122.4 9.9 87.2 9.4 121.2 9.9 23.1 4.3 7.1 3.5 12.4 3.3 5.6 2.7 113.0 9.6 110.5 10.0 3.1 1.5 52.4 7.9 80.4 8.1 0.6 1.6 115.1 9.1 15.9 3.1 56.5 7.3 85.4 9.8 32.5 5.8 43.0 6.2 0.1 0.8 21.8 5.2 15.2 3.5 5.2 3.0 0.2 0.8 73.5 8.2 4.9 3.2 0.2 0.3 69.0 9.2 3.6 3.5 0.2 0.9 101.3 9.9 10.0 3.7 16.9 3.0 11.2 5.0 0.2 0.4 80.8 9.4 24.9 5.7 113.5 9.7 6.2 2.1 12.5 3.2 4.8 1.8 80.1 8.3 26.4 4.8 13.4 3.8 99.8 9.7 44.1 6.2 15.3 3.8 2.2 1.5 10.3 2.7 13.8 4.7 38.6 4.5 79.1 9.8 33.6 5.8 9.1 4.5 89.3 9.1 5.5 2.6 20.0 4.8 2.9 2.9 82.9 8.4 7.0 3.5 14.5 2.9 16.0 3.7 29.3 6.1 48.9 6.3 1.6 1.9 34.7 6.2 33.5 6.5 26.0 5.6 12.7 3.1 0.1 0.3 15.4 4.2 2.6 1.8 58.6 7.9 81.2 8.1 37.2 6.9 ; /*Data supplied by SAS website -- not sure what it is, but it is good for illustrating Box-Cox. Let's try a SLR and see what the data look like.*/ proc reg data=x; model y=x; plot y*x; run; quit; /*The TRANSREG procedure is run to find the Box-Cox transformation. The lambda list is -2 TO 2 BY 0.01, which produces 401 lambdas. This many power parameters makes a nice graphical display with plenty of detail around the confidence interval. However, 401 values is a lot to print, so for this reason, the usual Box-Cox transformation information table is excluded from the printed output. Instead, it is output to a SAS data set using ODS so a sample of it can be printed. Just the confidence interval and the rows corresponding to power parameters that are multiples of 0.5 are printed. Null labels are provided for the columns that need to be printed without headers. The details table is also output to a SAS data set using ODS, since it contains information that will be incorporated into some of the plots. */ * Fit Box-Cox model, output results to output data sets; ods output boxcox=b details=d; ods exclude boxcox; proc transreg details data=x; model boxcox(y / convenient lambda=-2 to 2 by 0.01) = identity(x); output out=trans; run; proc print noobs label data=b(drop=rmse); title2 'Confidence Interval'; where ci ne ' ' or abs(lambda - round(lambda, 0.5)) < 1e-6; label convenient = '00'x ci = '00'x; run; /* These next steps extract information from the Box-Cox transformation and details tables, and store the information in macro variables. The confidence interval limit from the details table provides a vertical axis reference line for the log likelihood plot. The convenient power parameter ('Lambda Used') is extracted from the footnote. The confidence interval is extracted from the confidence interval observations of the Box-Cox transformation table and will be used in the footnote and for horizontal axis reference lines in the log likelihood plot. */ * Store values for reference lines; data _null_; set d; if description = 'CI Limit' then call symput('vref', formattedvalue); if description = 'Lambda Used' then call symput('lambda', formattedvalue); run; data _null_; set b end=eof; where ci ne ' '; if _n_ = 1 then call symput('href1', compress(put(lambda, best12.))); if ci = '<' then call symput('href2', compress(put(lambda, best12.))); if eof then call symput('href3', compress(put(lambda, best12.))); run; /*These steps plot the log likelihood, root mean square error, and R2. The input data set is the Box-Cox transformation table, which was output using ODS. */ * Plot log likelihood, confidence interval; axis1 label=(angle=90 rotate=0) minor=none; axis2 minor=none; proc gplot data=b; title2 'Log Likelihood'; plot loglike * lambda / vref=&vref href=&href1 &href2 &href3 vaxis=axis1 haxis=axis2 frame cframe=ligr; footnote "Confidence Interval: &href1 - &href2 - &href3, " "Lambda = &lambda"; symbol v=none i=spline c=blue; run; footnote; title2 'RMSE'; plot rmse * lambda / vaxis=axis1 haxis=axis2 frame cframe=ligr; run; title2 'R-Square'; plot rsquare * lambda / vaxis=axis1 haxis=axis2 frame cframe=ligr; axis1 order=(0 to 1 by 0.1) label=(angle=90 rotate=0) minor=none; run; quit; /*The optimal power parameter is 0.46, but since 0.5 is in the confidence interval, and since the CONVENIENT option was specified, the procedure chooses a square root transformation. The next steps plot the transformation of y, the original scatter plot based on the untransformed data, and the new scatter plot based on the transformed data. The input data set is the ordinary output data set from PROC TRANSREG. The transformation of the variable y by default is ty. */ axis1 label=(angle=90 rotate=0) minor=none; axis2 minor=none; proc gplot data=trans; title2 'Transformation'; symbol i=splines v=star c=blue; plot ty * y / vaxis=axis1 haxis=axis2 frame cframe=ligr; run; title2 'Original Scatter Plot'; symbol i=none v=star c=blue; plot y * x / vaxis=axis1 haxis=axis2 frame cframe=ligr; run; title2 'Transformed Scatter Plot'; symbol i=none v=star c=blue; plot ty * x / vaxis=axis1 haxis=axis2 frame cframe=ligr; run; quit;