---
Hi! My name is Elie Kawerk, I'm a Data Scientist and I'll be your instructor. In this course, you'll be learning about tree-based models for classification and regression.
In chapter 1, you'll be introduced to a set of supervised learning models known as Classification-And-Regression-Trees, or CART.
In chapter 2, you'll understand the notions of bias-variance trade-off and model ensembling.
Chapter 3 introduces you to Bagging and Random Forests.
Chapter 4 deals with boosting, specifically with AdaBoost and Gradient Boosting.
Finally, in chapter 5, you'll understand how to get the most out of your models through hyperparameter tuning.
Given a labeled dataset, a classification tree learns a sequence of if-else questions about individual features in order to infer the labels.
In contrast to linear models, trees are able to capture non-linear relationships between features and labels. In addition, trees don't require the features to be on the same scale, through standardization for example.
To understand trees more concretely, we'll try to predict whether a tumor is malignant or benign in the Wisconsin Breast Cancer dataset using only 2 features.
The figure here shows a scatterplot of two cancerous cell features with malignant-tumors in blue and benign-tumors in red.
When a classification tree is trained on this dataset, the tree learns a sequence of if-else questions with each question involving one feature and one split-point.
Take a look at the tree diagram here. At the top, the tree asks whether the concave-points mean of an instance is smaller than or equal to 0.051. If it is, the instance traverses the True branch; otherwise, it traverses the False branch. Similarly, the instance keeps traversing the internal branches until it reaches an end. The label of the instance is then predicted to be that of the prevailing class at that end.
The maximum number of branches separating the top from an extreme-end is known as the maximum depth which is equal to 2 here.
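The sequence of if-else questions described above can be written out by hand. In this sketch, the root threshold of 0.051 on the concave-points mean comes from the lesson; the second-level feature (radius mean) and its thresholds are hypothetical stand-ins for whatever splits the trained tree actually learns.

```python
# Hand-coded sketch of a depth-2 classification tree.
# Root threshold (0.051) is from the lesson; the second-level
# feature and thresholds are hypothetical.
def predict_tumor(concave_points_mean, radius_mean):
    if concave_points_mean <= 0.051:        # root question (True branch)
        if radius_mean <= 15.0:             # hypothetical split
            return "benign"
        return "malignant"
    if radius_mean <= 12.0:                 # hypothetical split (False branch)
        return "benign"
    return "malignant"

print(predict_tumor(0.03, 14.0))   # traverses two True branches
```

Each instance traverses at most two branches before reaching a leaf, which is exactly what a maximum depth of 2 means.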
Now that you know what a classification tree is, let's fit one with scikit-learn.
First, import DecisionTreeClassifier from sklearn.tree. Also, import the functions train_test_split() from sklearn.model_selection and accuracy_score() from sklearn.metrics.
In order to obtain an unbiased estimate of a model's performance, you must evaluate it on an unseen test set. To do so, first split the data into 80% train and 20% test using train_test_split(). Set the parameter stratify to y in order for the train and test sets to have the same proportion of class labels as the unsplit dataset.
You can now use DecisionTreeClassifier() to instantiate a tree classifier, dt with a maximum depth of 2 by setting the parameter max_depth to 2. Note that the parameter random_state is set to 1 for reproducibility.
Then call the fit method on dt and pass X_train and y_train. To predict the labels of the test-set, call the predict method on dt.
Finally, print the accuracy of the test set using accuracy_score().
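The steps above can be sketched end to end. Assuming the Wisconsin Breast Cancer dataset bundled with scikit-learn stands in for the course's data (the course uses only 2 features; all features are used here for simplicity):

```python
# Sketch of the workflow described above: split, fit, predict, score.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# 80%/20% split; stratify=y preserves the class proportions
# of the unsplit dataset in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Instantiate a tree with a maximum depth of 2;
# random_state=1 for reproducibility
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

dt.fit(X_train, y_train)        # train on the training set
y_pred = dt.predict(X_test)     # predict the test-set labels

acc = accuracy_score(y_test, y_pred)
print(acc)
```

Evaluating on the held-out test set, rather than the training set, is what makes the reported accuracy an unbiased estimate of the model's performance.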
To understand the tree's predictions more concretely, let's see how it classifies instances in the feature-space.
A classification-model divides the feature-space into regions where all instances in one region are assigned to only one class-label. These regions are known as decision-regions.
Decision-regions are separated by surfaces called decision-boundaries.
The figure here shows the decision-regions of a linear-classifier. Note how the boundary is a straight-line.
In contrast, as shown here on the right, a classification-tree produces rectangular decision-regions in the feature-space. This happens because, at each split made by the tree, only one feature is involved.
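The rectangular shape of the regions can be checked numerically. In this sketch (synthetic data, not the course's dataset), a depth-1 tree splits on a single feature, so its prediction over a grid changes only across one axis-aligned boundary:

```python
# Illustration of axis-aligned decision regions: a depth-1 tree
# involves only one feature per split, so predictions over a grid
# are constant along the other feature's axis. Synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] > 0.5).astype(int)    # label depends on feature 0 only

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# Evaluate on a 5x5 grid: preds[i, j] has feature 0 = x[j] and
# feature 1 = y[i], so moving down a column changes only feature 1
xx, yy = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
grid = np.c_[xx.ravel(), yy.ravel()]
preds = stump.predict(grid).reshape(xx.shape)
print(preds)
```

Because the stump splits only on feature 0, every row of `preds` is identical: the boundary is a vertical line in the feature-space, and deeper trees stack such axis-aligned cuts into rectangles.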
Now let's practice!