IBMPredictiveAnalytics/STATS_EARTH

STATS_EARTH

The STATS EARTH Extension command

This procedure estimates a Multivariate Adaptive Regression Splines (MARS) model. The MARS algorithm first converts categorical variables into a set of zero-one variables (dummies), a scheme referred to as one-hot coding. It then fits piecewise linear segments along with the dummies, using hinge functions and adding one term at each step so as to maximally reduce the residual sum of squares. A hinge is defined by a variable and a knot.

A hinge function for variable x has the form
h(x,c) = max(0, x - c)
or
h(x,c) = max(0, c - x)
where c is a constant (the knot).
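
As a minimal illustration of the two hinge forms defined above, the following Python sketch evaluates both at a few points. The function names `hinge` and `mirrored_hinge` and the knot value are chosen here for illustration only; they are not part of the STATS EARTH command or the R earth package.

```python
def hinge(x, c):
    """Right-opening hinge: zero up to the knot c, then rising with slope 1."""
    return max(0.0, x - c)


def mirrored_hinge(x, c):
    """Left-opening hinge: falling with slope 1 up to the knot c, then zero."""
    return max(0.0, c - x)


knot = 5.0                 # illustrative knot value
xs = [2.0, 5.0, 7.5]
right = [hinge(x, knot) for x in xs]
left = [mirrored_hinge(x, knot) for x in xs]
print(right)  # [0.0, 0.0, 2.5]
print(left)   # [3.0, 0.0, 0.0]
```

Each function is zero on one side of the knot and linear on the other, which is what produces the hockey stick shape.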

Since h is zero over part of its range, it partitions a variable into two disjoint regions, giving a hockey stick or hinge shape, which might be upside down. The process of adding hinge functions and one-hot terms (the forward pass) continues until no nontrivial further improvement can be found. The result is a set of effects that are linear splines plus the dummies. This will be obvious from the variables plot in the output. A variable may be split multiple times. That is, it may have multiple knots, which will be evident from the plot and the coefficients table.
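
The idea of one forward-pass step can be sketched in Python as follows. This is a deliberately simplified illustration, not the earth implementation: it considers a single variable, a single hinge direction, and an otherwise empty model, whereas the real search covers all variables, both hinge directions, and the terms already in the model. The helper names `fit_hinge` and `best_knot` are hypothetical.

```python
def fit_hinge(xs, ys, c):
    """Least-squares fit of y = b0 + b1 * max(0, x - c); returns (rss, b0, b1)."""
    hs = [max(0.0, x - c) for x in xs]
    n = len(xs)
    h_mean = sum(hs) / n
    y_mean = sum(ys) / n
    sxx = sum((h - h_mean) ** 2 for h in hs)
    if sxx == 0.0:
        # The hinge is constant on this sample, so only the intercept is fitted.
        b1 = 0.0
    else:
        b1 = sum((h - h_mean) * (y - y_mean) for h, y in zip(hs, ys)) / sxx
    b0 = y_mean - b1 * h_mean
    rss = sum((y - (b0 + b1 * h)) ** 2 for h, y in zip(hs, ys))
    return rss, b0, b1


def best_knot(xs, ys):
    """Try each observed x value as a knot and keep the one with the lowest RSS."""
    return min(sorted(set(xs)), key=lambda c: fit_hinge(xs, ys, c)[0])


# Synthetic data that is flat up to x = 4 and then rises with slope 3.
xs = [float(i) for i in range(11)]
ys = [2.0 + 3.0 * max(0.0, x - 4.0) for x in xs]
knot = best_knot(xs, ys)
rss, b0, b1 = fit_hinge(xs, ys, knot)
```

On this noise-free example the search recovers the knot at 4 with an intercept of 2 and a slope of 3, and the residual sum of squares is essentially zero.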

This, of course, is likely to overfit the data, leading to a lack of generalizability to new data. The next step, then, is to remove terms one by one until the best reduced model is found (the backward pass), as measured by the GCV (Generalized Cross Validation) statistic, which is similar to the AIC (Akaike Information Criterion). At the other extreme, variables can be constrained to enter linearly or not at all if the linear term does not improve the GCV. The discovery process can also include interaction terms among the hinges and one-hot variables if the degree parameter is larger than one.
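
To make the role of the GCV concrete, here is a small Python sketch of the statistic in the form commonly given for MARS: the residual sum of squares divided by N(1 - C/N)^2, where C is an effective number of parameters that grows with the number of terms. The exact penalty used by the procedure is not stated in this help, so the `penalty=2.0` default below is an assumption, as are the function name and the example numbers.

```python
def gcv(rss, n_cases, n_terms, penalty=2.0):
    """Generalized Cross Validation score; smaller is better.

    The effective parameter count charges each term plus an extra
    penalty per knot (assumed form; the penalty value is an assumption).
    """
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    if effective_params >= n_cases:
        # The model is too large for the data; treat it as infinitely bad.
        return float("inf")
    return rss / (n_cases * (1.0 - effective_params / n_cases) ** 2)


# A larger model with a slightly lower RSS can still score worse.
small = gcv(rss=120.0, n_cases=100, n_terms=3)
large = gcv(rss=118.0, n_cases=100, n_terms=9)
print(small < large)  # True
```

This is why the backward pass can prefer a reduced model: a small gain in fit does not offset the charge for the extra terms.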

This process, thus, detects nonlinearities and interactions while selecting the best functions of the explanatory variables for prediction purposes. It does not produce traditional significance levels, since the model has been found by heavily mining the data.

A good source for an introductory exposition of the MARS algorithm and its strengths and weaknesses is the Wikipedia article, Multivariate Adaptive Regression Spline.
Specific details on the earth procedure and definitions of the output can be found here.

The term MARS is trademarked by the inventor of the algorithms, Jerry Friedman; hence the R procedure uses the name earth, and the SPSS procedure is named STATS EARTH.

See the help for the dialog box for more information about the procedure. This help only defines the syntax.

Multivariate Adaptive Regression Spline Syntax

STATS EARTH
DEPVAR = dependent variable *
INDVARS = variable list
LINEARVARS = variable list
The union of INDVARS and LINEARVARS must have at least one variable.
IDVAR = variable - required if predicting
ESTIMATE = YES** or NO
PREDICT = NO** or YES
FAMILY = GAUSSIAN** or BINOMIAL or POISSON or GAMMA or NONE
PREDDATASET = name for prediction dataset. Required if predicting
PREDTYPE = LINK** or RESPONSE or CLASS
MODELSOURCE = "filespec" for a saved model to use instead of estimating
DATASOURCE = TRAINING** or NEWDATA
SAVEMODEL = "filespec" for saving the estimated model

/OPTIONS
NFOLD = number of folds for cross validation
DEGREE = order of interaction terms to look for
MAXTERMS = maximum number of terms allowed in the model

/DISPLAY
MODELPLOTS = YES** or NO
RESPONSEPLOTS = YES** or NO
VARIMPPLOT = YES** or NO
HEIGHT = plot height in inches
WIDTH = plot width in inches
FONTSIZE = integer for plot font in points

* Required
** Default

STATS EARTH /HELP. prints this information and does nothing else.


STATS EARTH DEPVAR=salary INDVARS=jobcat jobtime prevexp minority 
IDVAR=id ESTIMATE = YES
/DISPLAY HEIGHT=8 WIDTH=8 FONTSIZE=12.

Details

SPLIT FILES is not supported.

Case weights are not supported.

With large datasets (millions of cases and lots of variables), this procedure can take a long time, and there is no way to interrupt it. Setting a small maximum terms limit can dramatically reduce the time and memory requirements.

Estimation and Prediction

ESTIMATE specifies whether an equation should be estimated. If NO, a previously estimated model can be specified via MODELSOURCE for use in plots and predictions.

DEPVAR specifies the dependent variable for the equation. The measurement level determines the type of equation, but the appropriate family should still be chosen. Only the distinction between categorical and scale matters.

INDVARS is the list of categorical and scale independent variables. Categorical variables will be recoded into a set of one-hot dummy variables. The names of the recoded variables consist of the variable name followed by the category. For example, jobcat3 refers to category 3 of the jobcat variable.
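
The recoding and naming convention can be sketched in Python as follows. This is only an illustration of one-hot coding with names built from the variable name and category, as in jobcat3. The helper `one_hot` is hypothetical, and whether the procedure keeps a column for every category is not stated here, so emitting all categories is an assumption.

```python
def one_hot(name, values):
    """Return {dummy_name: [0 or 1, ...]} with one column per distinct category."""
    columns = {}
    for category in sorted(set(values)):
        # Column name is the variable name followed by the category value.
        columns[f"{name}{category}"] = [1 if v == category else 0 for v in values]
    return columns


jobcat = [1, 3, 2, 3]          # illustrative category codes
dummies = one_hot("jobcat", jobcat)
print(dummies["jobcat3"])      # [0, 1, 0, 1]
```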

LINEARVARS optionally specifies variables that should only be entered as simple linear terms. They can appear in the model separately or in interaction terms if selected. They cannot be categorical variables.

IDVAR specifies an ID variable, whose values must be distinct. It is required if making predictions and must be specified in the estimation phase even if predictions are to be made later. The variable can be used in merging the prediction dataset back with the input dataset with MATCH FILES or similar commands.

PREDICT specifies whether predictions should be made. The equation to use can be either what is estimated if ESTIMATE = YES, or a previously estimated model can be used. The most recently estimated model is automatically available in the same session when the command is reinvoked.

MODELSOURCE specifies a previously saved model file to be used for prediction. If none is specified, the in-memory model is used.

DATASOURCE specifies the input data for predictions. TRAINING uses the training or estimation data, while NEWDATA specifies that data from the current active dataset be used. The new data must have the same variables, coding, and measurement levels as the estimation data, but only the explanatory variables actually selected in building the model are required. Typical usage might be to run with ESTIMATE=YES and PREDICT=NO, then activate the dataset for which predictions are to be made or change the case selection, and then run with ESTIMATE=NO and PREDICT=YES.

SAVEMODEL specifies a file for saving the estimated model.

NFOLD specifies the number of folds for cross validation. The default is 0, which means no cross validation. Typical values would be 5 or 10.
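
For readers unfamiliar with the term, the following Python sketch shows what splitting cases into folds means. It is a generic illustration of k-fold partitioning, not the fold assignment used by the earth package; the function `make_folds` and the seed are assumptions made for the example.

```python
import random


def make_folds(n_cases, n_folds, seed=0):
    """Shuffle case indices and deal them into n_folds nearly equal groups."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


folds = make_folds(n_cases=23, n_folds=5)
split_sizes = []
for k, held_out in enumerate(folds):
    # Every fold except fold k forms the training set for this round.
    training = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # A model would be estimated on `training` and scored on `held_out`.
    split_sizes.append((len(training), len(held_out)))
```

Each case is held out exactly once, so the out-of-fold scores together cover the whole dataset.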

DEGREE specifies the interaction order to be considered. The default is 1, which means no interactions.
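
With a degree of 2, an interaction term in a MARS model is typically the product of two basis functions on different variables. The Python sketch below illustrates this with two hinges; the function names, variables, and knot values are hypothetical and chosen only for the example.

```python
def hinge(x, c):
    """max(0, x - c): zero at or below the knot c, linear above it."""
    return max(0.0, x - c)


def interaction_term(x1, c1, x2, c2):
    """Degree-2 basis function: the product of two hinges on different variables."""
    return hinge(x1, c1) * hinge(x2, c2)


# The term is nonzero only when both variables are above their knots.
both_above = interaction_term(x1=8.0, c1=5.0, x2=30.0, c2=20.0)
one_below = interaction_term(x1=4.0, c1=5.0, x2=30.0, c2=20.0)
print(both_above)  # 30.0
print(one_below)   # 0.0
```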

MAXTERMS specifies the maximum number of terms, i.e., coefficients, allowed in the final model. By default, no limit is imposed other than an overall maximum of 1000 terms. For large problems, setting a limit can speed things up considerably.

Display

MODELPLOTS specifies whether the four plots for the overall model are displayed.

RESPONSEPLOTS specifies whether the response effect plots are displayed. They show the shape of the response for each selected variable, whether categorical or scale level.

VARIMPPLOT specifies whether a plot of the three variable importance measures is displayed.

HEIGHT and WIDTH specify the dimensions of the plots in inches.

FONTSIZE specifies the desired font size in points. It may, however, not be effective in some plots.

Acknowledgements

The R package citation is
S. Milborrow. Derived from mda:mars by T. Hastie and R. Tibshirani. earth: Multivariate Adaptive Regression Splines, 2011. R package.

The R module used by this procedure is the work of Stephen Milborrow, Trevor Hastie, Rob Tibshirani, Alan Miller, and Thomas Lumley based on Friedman's papers "Fast MARS" and "Multivariate Adaptive Regression Splines".

These authors were not involved in the development of the SPSS procedure and bear no responsibility for any issues. The SPSS procedure is the work of Jon K. Peck.

© Copyright Jon K. Peck 2024
