What Are the Disadvantages of Using a Decision Tree for Classification?


Question

What are the disadvantages of using a decision tree for classification?

Answers

02/23/2022

Pall Fijon

1) Decision trees are best suited to classifying data that are inherently categorical, such as information about sports matches, medical diagnoses, and security alerts. However, if there is already strong evidence that the underlying dataset has a genuine statistical ordering, simpler methods (regression techniques, for example) may be preferable.

2) The accuracy of any model generated by this type of algorithm is not guaranteed. If a given attribute has little impact on the final results, the classification algorithm working within the decision tree may ignore it altogether, so a manual review by an analyst will probably be needed before deployment or final use. This can be overcome with continuous ...

To conclude: although a decision tree may not be as accurate or precise as other classification methods, it offers many advantages in ease of use and analytical flexibility. Overall, this should allow an analyst who understands the methodology well to build a model that is simple yet effective.

04/07/2022

Kathrine

3 Problems with Decision Trees

I illustrate by fitting a decision tree model in R to the iris dataset, which collects measurement data on 3 species of flowers. I focus on two of those measurements: sepal length and sepal width.

library(rpart)
library(rpart.plot)
model1 <- rpart(Species ~ Sepal.Length + Sepal.Width, iris)
prp(model1, digits = 3)

[Figure: decision tree fitted to the original sepal measurements]

Now, I will perturb the data by adding 0.1 to each datapoint with probability 0.25, and subtracting 0.1 from each datapoint with probability 0.25. The remaining half of the datapoints are left unchanged.

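set.seed(1)
# tmp() draws one fair coin flip (0 or 1) per row
tmp <- function() rbinom(nrow(iris), size = 1, prob = 0.5)
# The difference of two flips is -1, 0, or +1 with probability 0.25, 0.5, 0.25
perturb <- function() (tmp() - tmp()) / 10
# Store the perturbed measurements in new columns, leaving the originals intact
iris$Sepal.Length.Perturbed <- iris$Sepal.Length + perturb()
iris$Sepal.Width.Perturbed <- iris$Sepal.Width + perturb()
model2 <- rpart(Species ~ Sepal.Length.Perturbed +
                  Sepal.Width.Perturbed, iris)
prp(model2, digits = 3)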

[Figure: decision tree fitted to the perturbed sepal measurements]

Key observation - Just by perturbing the data a little bit, I got a different-looking decision tree.
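To see how often the tree changes, the perturbation can be repeated with several seeds. The following is a minimal sketch, not part of the original answer: `root_split` is a hypothetical helper, the seeds 1 to 10 are arbitrary, and it assumes that the first row of `fit$splits` in an `rpart` object is the primary split at the root node.

```r
library(rpart)

# Hypothetical helper: perturb both sepal measurements with the scheme
# described above, refit the tree, and report the root split.
root_split <- function(seed) {
  set.seed(seed)
  n <- nrow(iris)
  # -0.1, 0, or +0.1 with probability 0.25, 0.5, 0.25
  noise <- function() (rbinom(n, 1, 0.5) - rbinom(n, 1, 0.5)) / 10
  d <- data.frame(
    Species = iris$Species,
    Sepal.Length = iris$Sepal.Length + noise(),
    Sepal.Width = iris$Sepal.Width + noise()
  )
  fit <- rpart(Species ~ Sepal.Length + Sepal.Width, data = d)
  # Assumption: row 1 of fit$splits is the primary split of the root node.
  data.frame(
    seed = seed,
    variable = rownames(fit$splits)[1],
    threshold = unname(fit$splits[1, "index"]),
    n_leaves = sum(fit$frame$var == "<leaf>")
  )
}

# One row per perturbed dataset; compare thresholds and leaf counts.
results <- do.call(rbind, lapply(1:10, root_split))
print(results)
```

If the thresholds or the number of leaves differ from row to row, that is the instability described here, measured rather than eyeballed.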

To get a better look at what's happening, I plot the decision tree boundaries and the actual data points on a scatter plot. I color each region by the plurality class.

[Figures: decision regions and data points for the original and perturbed trees]
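A plot of this kind can be approximated in base R by predicting the class on a dense grid and drawing the data on top. This is a sketch under assumptions, not the code behind the original figures: the grid resolution (200 per axis) and the colors are arbitrary choices.

```r
library(rpart)

fit <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)

# Dense grid covering the observed range of both features.
grid <- expand.grid(
  Sepal.Length = seq(min(iris$Sepal.Length), max(iris$Sepal.Length),
                     length.out = 200),
  Sepal.Width = seq(min(iris$Sepal.Width), max(iris$Sepal.Width),
                    length.out = 200)
)
# Predicted class at every grid point.
grid$pred <- predict(fit, newdata = grid, type = "class")

# Arbitrary color per species.
cols <- c(setosa = "tomato", versicolor = "seagreen", virginica = "steelblue")

# Pale grid points paint each region with its predicted class.
plot(grid$Sepal.Length, grid$Sepal.Width,
     col = adjustcolor(cols[as.character(grid$pred)], alpha.f = 0.15),
     pch = 15, cex = 0.4,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
# Actual observations, colored by their true species.
points(iris$Sepal.Length, iris$Sepal.Width,
       col = cols[as.character(iris$Species)], pch = 19)
legend("topright", legend = names(cols), col = cols, pch = 19)
```

The regions come out as rectangles because each split tests a single feature against a threshold.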

Some problems we see here when we apply a decision tree to continuous data:

Instability - The decision tree changes when I perturb the dataset a bit. This is not desirable, as we want our classification algorithm to be robust to noise and to generalize well to future observations. It can undercut confidence in the tree and hurt the ability to learn from it. One solution is to switch to a tree-ensemble method that combines many decision trees fitted on slightly different versions of the dataset.
Classification Plateaus - There's a very big difference between being on the left side of a boundary and being on the right side. Two flowers with similar characteristics could be classified very differently. Some sort of rolling-hill classification, where the prediction changes gradually, could work better than a plateau scheme. One solution, as above, is to switch to a tree-ensemble method that combines many decision trees fitted on slightly different versions of the dataset.
Decision boundaries are parallel to the axes - We could imagine diagonal decision boundaries that would perform better, e.g. for separating the setosa flowers from the versicolor flowers.

One very good method to reduce the instability is to rely on an ensemble of decision trees, by trying some sort of random forest or boosted decision tree algorithm. This also helps smooth out a classification plateau. An ensemble of slightly different trees will usually outperform a single decision tree.
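The ensemble idea can be sketched with plain bagging built on `rpart` alone. This is an illustration under assumptions rather than a full random forest: the 45-row hold-out, the seed, and `n_trees = 100` are arbitrary, and nothing here guarantees that the bagged model wins on this particular split.

```r
library(rpart)

set.seed(42)  # arbitrary seed, only for reproducibility

# Hold out 45 random rows (30%) for evaluation.
test_idx <- sample(nrow(iris), size = 45)
train <- iris[-test_idx, ]
test <- iris[test_idx, ]

form <- Species ~ Sepal.Length + Sepal.Width

# Baseline: a single tree fitted on the full training set.
single <- rpart(form, data = train)
single_pred <- predict(single, test, type = "class")

# Bagging: fit each tree on a bootstrap resample of the training rows.
n_trees <- 100
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  as.character(predict(rpart(form, data = boot), test, type = "class"))
})

# votes is a 45 x 100 character matrix; take the majority vote per test row.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Fraction of trees voting for each species: a graded score, not a plateau.
vote_share <- t(apply(votes, 1, function(v) {
  table(factor(v, levels = levels(iris$Species))) / n_trees
}))

cat("single tree accuracy:", mean(single_pred == test$Species), "\n")
cat("bagged accuracy:     ", mean(bagged_pred == test$Species), "\n")
```

Rows of `vote_share` near a boundary tend to be split between species, which is the smoothing effect on classification plateaus mentioned above.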

If you prefer classification boundaries that aren't as rigid, you would also be interested in tree ensembles or something like K-Nearest-Neighbors.

If you're looking for decision boundaries that are NOT parallel to the axes, you would want to try an SVM or Logistic Regression. See "What are the advantages of logistic regression over decision trees? Are there any cases where it's better to use logistic regression instead of decision trees?"
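As a rough illustration of a boundary that is not parallel to the axes, a logistic regression on two species yields a straight line with a non-zero slope. This sketch uses versicolor versus virginica rather than setosa versus versicolor, on the assumption that the latter pair is close to linearly separable in these two features, which would make an unregularized `glm` fit numerically unstable.

```r
# Two-class subset; droplevels() removes the unused "setosa" level.
two <- droplevels(subset(iris, Species != "setosa"))

# Binomial logistic regression: the first factor level (versicolor) is
# treated as failure, the second (virginica) as success.
fit <- glm(Species ~ Sepal.Length + Sepal.Width, data = two,
           family = binomial)

b <- coef(fit)
# The boundary is where the linear predictor is zero:
#   b0 + b1 * Sepal.Length + b2 * Sepal.Width = 0
# Solved for Sepal.Width it is a line with this slope and intercept.
slope <- unname(-b["Sepal.Length"] / b["Sepal.Width"])
intercept <- unname(-b["(Intercept)"] / b["Sepal.Width"])
cat("boundary: Sepal.Width =", intercept, "+", slope, "* Sepal.Length\n")

# Draw the two species and the fitted boundary.
plot(two$Sepal.Length, two$Sepal.Width,
     col = ifelse(two$Species == "virginica", "steelblue", "seagreen"),
     pch = 19, xlab = "Sepal.Length", ylab = "Sepal.Width")
abline(a = intercept, b = slope)
```

A single tree can only approximate such a line with a staircase of axis-parallel splits.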

For the other side of decision trees, see "What are the advantages of using a decision tree for classification?"
