I am working on a project at my job and I am running into some problems with a decision tree analysis. This is not homework. Sample dataset:
PRODUCT_SUB_LINE_DESCR  MAJOR_CATEGORY_DESCR  CUST_REGION_DESCR      Sales  QtySold  MFGCOST  MarginDollars  new_ProductName
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     209.97        3   134.55          72.72  no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     -76.15       -1   -44.85          -30.4  no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION      275.6        2    162.5         109.84  no
SUNDRY                  SMALL EQUIP           NORTH EAST REGION      138.7        1    81.25          55.82  no
SUNDRY                  PREVENTIVE            SOUTH CENTRAL REGION     226        2      136          87.28  no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION        115        1       68          45.64  no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION      210.7        2      136          71.98  no
SUNDRY                  SMALL EQUIP           NORTH CENTRAL REGION      29        1    18.85           9.77  no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION      29        1    18.85           9.77  no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   46.32        2     37.7           7.86  no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  159.86        1    132.4          24.81  no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION   441.3        2    264.8          171.2  no
SUNDRY                  COMPOSITE             OHIO VALLEY REGION    209.62        1    132.4          74.57  no
SUNDRY                  COMPOSITE             NORTH EAST REGION     209.62        1    132.4          74.57  no
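In case it helps anyone reproduce this, the first few sample rows can be entered directly in R (values copied from the sample data above; `stringsAsFactors = TRUE` so the character columns become factors, as in my real data):

```r
# Small reproducible subset of the sample dataset above
sample_data = data.frame(
  PRODUCT_SUB_LINE_DESCR = rep("SUNDRY", 4),
  MAJOR_CATEGORY_DESCR   = rep("SMALL EQUIP", 4),
  CUST_REGION_DESCR      = c("NORTH EAST REGION", "SOUTH EAST REGION",
                             "SOUTH EAST REGION", "NORTH EAST REGION"),
  Sales           = c(209.97, -76.15, 275.6, 138.7),
  QtySold         = c(3L, -1L, 2L, 1L),
  MFGCOST         = c(134.55, -44.85, 162.5, 81.25),
  MarginDollars   = c(72.72, -30.4, 109.84, 55.82),
  new_ProductName = rep("no", 4),
  stringsAsFactors = TRUE  # character columns become factors
)
str(sample_data)
```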
1) My tree has only two terminal nodes; here is the summary output:
> summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes: 2
Residual mean deviance: 0 = 0 / 41140
Misclassification error rate: 0 = 0 / 41146
2) I created a new data frame that keeps only the factors with fewer than 22 levels. One factor has 25 levels, but tree() does not throw an error, so I assume the algorithm accepts 25 levels.
> str(new_Dataset)
'data.frame': 51433 obs. of 7 variables:
 $ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MAJOR_CATEGORY_DESCR  : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
 $ CUST_REGION_DESCR     : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
 $ Sales                 : num  210 -76.2 275.6 138.7 226 ...
 $ QtySold               : int  3 -1 2 1 2 1 2 1 1 2 ...
 $ MFGCOST               : num  134.6 -44.9 162.5 81.2 136 ...
 $ MarginDollars         : num  72.7 -30.4 109.8 55.8 87.3 ...
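As a quick check of the level counts (if I read the `tree` package documentation correctly, factor predictors may have at most 32 levels, so 25 should be acceptable), this one-liner counts the levels of each column, assuming `new_Dataset` is already loaded:

```r
# Count the number of levels of each factor column; non-factor columns give NA
sapply(new_Dataset, function(col) if (is.factor(col)) nlevels(col) else NA)
```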
3) Here is how I set up my analysis:
# I chose product name as my main attribute (maybe that is why it appears at
# the root node?)
new_ProductName = factor(ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL",
                                "yes", "no")) # factor so tree() fits a classification tree
data = data.frame(new_Dataset, new_ProductName)
set.seed(100)
train = sample(1:nrow(data), 0.8*nrow(data)) # training row indices
training_data = data[train,] # training data
testing_data = data[-train,] # testing data
# fit the tree model using training data
library(tree)
tree_model = tree(new_ProductName ~ ., data = training_data)
summary(tree_model)
plot(tree_model)
text(tree_model, pretty = 0)
out = predict(tree_model) # predict the training data
# actuals
input.newproduct = as.character(training_data$new_ProductName)
# predicted
pred.newproduct = colnames(out)[max.col(out,ties.method = c("first"))]
mean(input.newproduct != pred.newproduct) # misclassification rate
# Cross Validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass) # run cross validation
plot(cv_Tree) # plot the CV results
plot(cv_Tree$size, cv_Tree$dev, type = "b") # no need to attach() the list
# set best to the size corresponding to the lowest deviance in the plot above
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)
out = predict(treePruneMod) # predict the training data with the pruned tree
# Predicted
pred.newproduct = colnames(out)[max.col(out,ties.method = c("random"))]
# calculate Mis-classification error
mean(training_data$new_ProductName != pred.newproduct)
# Predict testData with Pruned tree
out = predict(treePruneMod, testing_data, type = "class")
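To gauge how the pruned tree generalizes, the class predictions on the test set can be compared against the actual labels. This is a sketch using the objects created above:

```r
# Confusion matrix of predicted vs. actual classes on the held-out data
conf = table(Predicted = out, Actual = testing_data$new_ProductName)
print(conf)

# Test-set misclassification rate
mean(out != testing_data$new_ProductName)
```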
4) I have never done this before. I watched a couple of YouTube videos and got started. I welcome any tips, explanations, and criticism, and I would appreciate help walking through this process. It has been challenging for me.
> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)
no yes
Handpieces 164 0
PRIVATE LABEL 0 14802
SUNDRY 36467 0