Wednesday, March 24, 2021

Decision Trees Basics

Decision tree learning is one of the predictive modeling approaches used in statistics, data mining, and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.

The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems.

The goal of using a Decision Tree is to create a model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).
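As a quick illustration, a classification tree can be trained in a few lines with scikit-learn. This is a minimal sketch assuming scikit-learn is installed; the toy data and feature meanings are made up for this example:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [salary_in_thousands, office_near_house (1/0)]
X = [[55, 1], [70, 1], [40, 0], [90, 0], [60, 1], [30, 1]]
# Target variable: 1 = accepts the offer, 0 = rejects it
y = [1, 1, 0, 0, 1, 0]

# Fit a shallow tree; the learned rules are simple threshold tests
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict for a new candidate: salary 65k, office near the house
print(clf.predict([[65, 1]]))
```

The fitted tree can also be inspected with `sklearn.tree.export_text(clf)`, which prints the learned decision rules.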




Let us take an example: a man gets an offer from a company, and we are building a decision tree to decide whether he accepts the offer or not.

Suppose the first feature is 'salary': how much salary is offered? In the above diagram, our first feature is salary, and it ranges between $50k and $80k. The first condition is based on this salary range: if the offered salary falls within the range, we move on to the next feature; otherwise, he rejects the offer.

So far, we have satisfied one condition, but there are more features to check. Our second feature is 'office near the house': is the office near the house or not? If yes, he accepts the offer; if not, we check the next feature.

The third feature is whether the company provides a 'cab facility' or not. If yes, he accepts the offer; otherwise, he rejects it.

Therefore, based on these features, we created a tree-based structure and decided whether the offer was accepted or not.
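One possible reading of the rules above can be sketched as a small Python function (a hypothetical helper written for illustration, not part of the original post):

```python
def accepts_offer(salary_k, office_near_house, cab_facility):
    """Decide the job offer by walking the tree's conditions in order."""
    if not (50 <= salary_k <= 80):
        return False          # salary outside the $50k-$80k range -> reject
    if office_near_house:
        return True           # office near the house -> accept
    return cab_facility       # otherwise the cab facility decides

# Salary in range and office near the house -> accept
print(accepts_offer(60, True, False))
# Office far away, but the company provides a cab -> accept
print(accepts_offer(60, False, True))
# Salary out of range -> reject regardless of the other features
print(accepts_offer(40, True, True))
```

Each `if` corresponds to one internal node of the tree, and each `return` to one leaf.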


Important Terminology related to Decision Trees



Root Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.

Splitting: It is a process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.

Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can think of it as the opposite of splitting.

Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.

Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
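The terms above can be illustrated with a tiny sketch (a hypothetical `Node` class, purely for illustration):

```python
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # no children -> leaf / terminal node

    def is_leaf(self):
        return not self.children

# The root node splits into a decision node (a parent of two
# children) and a leaf; the decision node's subtree is a branch.
root = Node("root", [
    Node("decision", [Node("leaf_1"), Node("leaf_2")]),
    Node("leaf_3"),
])

def leaves(node):
    """Collect the names of the leaf/terminal nodes of a (sub-)tree."""
    if node.is_leaf():
        return [node.name]
    return [name for child in node.children for name in leaves(child)]

print(leaves(root))

# Pruning: removing a decision node's sub-nodes turns it into a leaf.
root.children[0].children = []
print(leaves(root))
```

After pruning, the former decision node itself shows up as a terminal node, which is exactly the "opposite of splitting" described above.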


So far, we have covered the basic understanding of the decision tree algorithm and its terminology. In an upcoming blog, we will discuss advanced topics such as how to split decision trees, different splitting criteria, how to optimize the performance of decision trees, and more.


Happy Learning :-)



References - 

image reference - img_ref

Wikipedia reference - wiki



 


