Class RegressionTree

  • All Implemented Interfaces:
    java.io.Serializable, Regression<double[]>

    public class RegressionTree
    extends java.lang.Object
    implements Regression<double[]>
    Decision tree for regression. A decision tree can be learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning.

    Classification and Regression Tree techniques have a number of advantages over many alternative techniques:

    - Simple to understand and interpret. In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.
    - Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable.
    - Nonparametric and nonlinear. The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or that they are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge and no coherent set of theories or predictions regarding which variables are related and how. In these types of data analysis, tree methods can often reveal simple relationships between just a few variables that could easily have gone unnoticed using other analytic techniques.
    One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. In addition, decision-tree learners can create over-complex trees that overfit the data; mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.

    Some techniques such as bagging, boosting, and random forest use more than one decision tree for their analysis.
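
    For illustration, a minimal usage sketch built only from the constructors and methods documented below; the data values are made up and serve no purpose beyond showing the call pattern.

        import smile.regression.RegressionTree;

        public class RegressionTreeExample {
            public static void main(String[] args) {
                // Toy training data (illustrative values only).
                double[][] x = {
                    {1.0, 2.0}, {2.0, 1.0}, {3.0, 4.0}, {4.0, 3.0},
                    {5.0, 6.0}, {6.0, 5.0}, {7.0, 8.0}, {8.0, 7.0}
                };
                double[] y = {1.1, 1.9, 3.2, 4.1, 5.0, 6.2, 6.8, 8.1};

                // Grow a tree with at most 4 leaf nodes; all attributes are treated as numeric.
                RegressionTree tree = new RegressionTree(x, y, 4);

                // Predict the response for a new instance.
                double yhat = tree.predict(new double[] {4.5, 4.0});
                System.out.println("prediction = " + yhat);
            }
        }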

    Author:
    Haifeng Li
    See Also:
    GradientTreeBoost, RandomForest, Serialized Form
    • Method Summary

      Modifier and Type Method Description
      java.lang.String dot()
      Returns the graphic representation in Graphviz dot format.
      smile.regression.RegressionTree.Node getRoot()
      Returns the root node.
      double[] importance()
      Returns the variable importance.
      int maxDepth()
      Returns the maximum depth of the tree -- the number of nodes along the longest path from the root node down to the farthest leaf node.
      double predict​(double[] x)
      Predicts the dependent variable of an instance.
      double predict​(int[] x)
      Predicts the dependent variable of an instance with sparse binary features.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • RegressionTree

        public RegressionTree​(double[][] x,
                              double[] y,
                              int maxNodes)
        Constructor. Learns a regression tree with at most the given number of leaves. All attributes are assumed to be numeric.
        Parameters:
        x - the training instances.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
      • RegressionTree

        public RegressionTree​(double[][] x,
                              double[] y,
                              int maxNodes,
                              int nodeSize)
        Constructor. Learns a regression tree with at most the given number of leaves. All attributes are assumed to be numeric.
        Parameters:
        x - the training instances.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
        nodeSize - the number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results.
      • RegressionTree

        public RegressionTree​(Attribute[] attributes,
                              double[][] x,
                              double[] y,
                              int maxNodes)
        Constructor. Learns a regression tree with at most the given number of leaves.
        Parameters:
        attributes - the attribute properties.
        x - the training instances.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
      • RegressionTree

        public RegressionTree​(AttributeDataset data,
                              int maxNodes)
        Constructor. Learns a regression tree for random forest and gradient tree boosting.
        Parameters:
        data - the dataset.
        maxNodes - the maximum number of leaf nodes in the tree.
      • RegressionTree

        public RegressionTree​(Attribute[] attributes,
                              double[][] x,
                              double[] y,
                              int maxNodes,
                              int nodeSize)
        Constructor. Learns a regression tree with at most the given number of leaves.
        Parameters:
        attributes - the attribute properties.
        x - the training instances.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
        nodeSize - the number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results.
      • RegressionTree

        public RegressionTree​(AttributeDataset data,
                              int maxNodes,
                              int nodeSize)
        Constructor. Learns a regression tree for random forest and gradient tree boosting.
        Parameters:
        data - the dataset.
        maxNodes - the maximum number of leaf nodes in the tree.
        nodeSize - the number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results.
      • RegressionTree

        public RegressionTree​(Attribute[] attributes,
                              double[][] x,
                              double[] y,
                              int maxNodes,
                              int nodeSize,
                              int mtry,
                              int[][] order,
                              int[] samples,
                              RegressionTree.NodeOutput output)
        Constructor. Learns a regression tree for random forest and gradient tree boosting.
        Parameters:
        attributes - the attribute properties.
        x - the training instances.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
        nodeSize - the number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results.
        mtry - the number of input variables to consider when splitting at each node; p/3 generally gives good performance, where p is the number of variables.
        order - the index of training values in ascending order. Note that only numeric attributes need be sorted.
        samples - the sample set of instances for stochastic learning. samples[i] should be 0 or 1 to indicate if the instance is used for training.
      • RegressionTree

        public RegressionTree​(AttributeDataset data,
                              int maxNodes,
                              int nodeSize,
                              int mtry,
                              int[][] order,
                              int[] samples,
                              RegressionTree.NodeOutput output)
        Constructor. Learns a regression tree for random forest and gradient tree boosting.
        Parameters:
        data - the dataset.
        maxNodes - the maximum number of leaf nodes in the tree.
        nodeSize - the number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results.
        mtry - the number of input variables to consider when splitting at each node; p/3 generally gives good performance, where p is the number of variables.
        order - the index of training values in ascending order. Note that only numeric attributes need be sorted.
        samples - the sample set of instances for stochastic learning. samples[i] should be 0 or 1 to indicate if the instance is used for training.
      • RegressionTree

        public RegressionTree​(AttributeDataset data,
                              int maxNodes,
                              int nodeSize,
                              int mtry,
                              int[][] order,
                              int[] samples,
                              RegressionTree.NodeOutput output,
                              double[] monotonicRegression)
      • RegressionTree

        public RegressionTree​(Attribute[] attributes,
                              double[][] x,
                              double[] y,
                              int maxNodes,
                              int nodeSize,
                              int mtry,
                              int[][] order,
                              int[] samples,
                              RegressionTree.NodeOutput output,
                              double[] monotonicRegression)
      • RegressionTree

        public RegressionTree​(int numFeatures,
                              int[][] x,
                              double[] y,
                              int maxNodes)
        Constructor. Learns a regression tree on sparse binary samples.
        Parameters:
        numFeatures - the number of sparse binary features.
        x - the training instances of sparse binary features.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
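
        For illustration, a hedged sketch of this constructor, assuming each sparse instance is encoded as the array of indices of its active (non-zero) binary features; the data values are made up.

            import smile.regression.RegressionTree;

            public class SparseRegressionTreeExample {
                public static void main(String[] args) {
                    int numFeatures = 6;                 // total number of binary features
                    // Each instance lists the indices of its active features.
                    int[][] x = {
                        {0, 2},
                        {1, 3},
                        {0, 4, 5},
                        {2, 3, 5}
                    };
                    double[] y = {1.0, 2.0, 3.0, 2.5};   // illustrative responses

                    RegressionTree tree = new RegressionTree(numFeatures, x, y, 4);

                    // Prediction also takes the active feature indices of the instance.
                    System.out.println(tree.predict(new int[] {0, 2, 5}));
                }
            }
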
      • RegressionTree

        public RegressionTree​(int numFeatures,
                              int[][] x,
                              double[] y,
                              int maxNodes,
                              int nodeSize)
        Constructor. Learns a regression tree on sparse binary samples.
        Parameters:
        numFeatures - the number of sparse binary features.
        x - the training instances of sparse binary features.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
        nodeSize - the number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results.
      • RegressionTree

        public RegressionTree​(int numFeatures,
                              int[][] x,
                              double[] y,
                              int maxNodes,
                              int nodeSize,
                              int[] samples,
                              RegressionTree.NodeOutput output)
        Constructor. Learns a regression tree on sparse binary samples.
        Parameters:
        numFeatures - the number of sparse binary features.
        x - the training instances.
        y - the response variable.
        maxNodes - the maximum number of leaf nodes in the tree.
        nodeSize - the number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results.
        samples - the sample set of instances for stochastic learning. samples[i] should be 0 or 1 to indicate if the instance is used for training.
    • Method Detail

      • importance

        public double[] importance()
        Returns the variable importance. Every time a split of a node is made on a variable, the impurity criterion for the two descendant nodes is less than that of the parent node. Adding up the decreases for each individual variable over the tree gives a simple measure of variable importance.
        Returns:
        the variable importance
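
        For example, the returned array is indexed by variable (column) and can be printed directly. A minimal sketch, assuming tree is a trained RegressionTree:

            double[] imp = tree.importance();
            for (int i = 0; i < imp.length; i++) {
                System.out.printf("variable %d: importance %.4f%n", i, imp[i]);
            }
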
      • predict

        public double predict​(double[] x)
        Description copied from interface: Regression
        Predicts the dependent variable of an instance.
        Specified by:
        predict in interface Regression<double[]>
        Parameters:
        x - the instance.
        Returns:
        the predicted value of dependent variable.
      • predict

        public double predict​(int[] x)
        Predicts the dependent variable of an instance with sparse binary features.
        Parameters:
        x - the instance.
        Returns:
        the predicted value of dependent variable.
      • maxDepth

        public int maxDepth()
        Returns the maximum depth of the tree -- the number of nodes along the longest path from the root node down to the farthest leaf node.
      • dot

        public java.lang.String dot()
        Returns the graphic representation in Graphviz dot format. Try http://viz-js.com/ to visualize the returned string.
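
        For example, the returned string can be written to a file and rendered with Graphviz. A minimal sketch, assuming tree is a trained RegressionTree and the calling method handles java.io.IOException:

            // Render later with: dot -Tpng tree.dot -o tree.png
            java.nio.file.Files.write(
                java.nio.file.Paths.get("tree.dot"),
                tree.dot().getBytes(java.nio.charset.StandardCharsets.UTF_8));
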
      • getRoot

        public smile.regression.RegressionTree.Node getRoot()
        Returns the root node.
        Returns:
        root node.