CS 7646: Decision Trees Part 2

  • Published 19 Sep 2024

Comments • 16

  • @radiazwahl 2 years ago +74

    02:00: lecture theme: how to build a decision tree
    05:10: gap (skip)
    05:40: decision tree example
    08:09: tabular view of decision tree - use a NumPy 2D array
    09:00: if a row represents a leaf node, its SplitVal holds the predicted y value rather than a split value
    10:30: student question about binary tree efficiency
    13:00: algorithm for building decision tree
    15:00: decision tree algorithm (JR Quinlan)
    16:30: base cases: IF only one row remains (or all y values are the same), RETURN a leaf row
    17:00: description of recursive algorithm
    22:40: student questions
    24:50: how to determine the "best" feature
    29:10: student questions
    • Check which axis your tree rows are being appended along
    30:45: ndarray representation of decision tree (see the code sketch after this comment)
    33:30: Use highest absolute value of correlation to select optimal 'splitting' factor
    • NumPy has a built-in correlation function (np.corrcoef)
    • Split value should be the median value
    35:30: gap (skip)
    36:20: ndarray representation of decision tree (cont.)
    39:00: when multiple factors have the same correlation with y, use a deterministic tie-break to select the "best split factor" (e.g. the lowest-indexed factor) rather than a random choice -- this makes results more reproducible, whichever measure (correlation, entropy, Gini impurity, etc.) you use
    43:30: which steps in JR Quinlan decision tree algorithm are the most computationally expensive?
    44:30: A: determining the best feature to split on
    45:30: Most time-intensive parts of the JR Quinlan Decision Tree algorithm are:
    1. Determining the best feature to split on
    2. Calculating the median for SplitVal
    • Random trees can help with this
    46:30: Random Tree Algorithm (A Cutler)
    50:30: You can create a random tree by randomly selecting the features to split on OR by building trees on randomly selected subsets of the training data (see the random-tree sketch after this comment)
    52:30: Strengths & weaknesses of Decision Tree learners
    • Cost of learning: decision trees are more expensive than parametric models or KNN
    • Cost of querying: linear regression is fastest, but decision trees are faster than KNN
    • Decision Trees: No need to normalize data
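
A minimal sketch of the Quinlan-style builder outlined in the timestamps above, assuming the lecture's tabular layout where each row is [factor, SplitVal, relative left offset, relative right offset] and a factor of -1 flags a leaf. The function name build_tree and the exact details below are illustrative, not the course's reference implementation:

    import numpy as np

    LEAF = -1  # marker in the factor column flagging a leaf row

    def build_tree(data):
        # data: 2D ndarray; last column is y, the rest are features
        x, y = data[:, :-1], data[:, -1]

        # Base cases: a single row, or all y values identical -> leaf
        if data.shape[0] == 1 or np.all(y == y[0]):
            return np.array([[LEAF, y.mean(), np.nan, np.nan]])

        # Best feature = highest |correlation| with y; np.argmax breaks
        # ties by taking the lowest index, so the choice is deterministic
        corrs = [abs(np.corrcoef(x[:, i], y)[0, 1]) if np.std(x[:, i]) > 0
                 else 0.0 for i in range(x.shape[1])]
        best = int(np.argmax(corrs))
        split_val = np.median(x[:, best])

        mask = x[:, best] <= split_val
        if mask.all() or (~mask).all():  # degenerate split -> leaf
            return np.array([[LEAF, y.mean(), np.nan, np.nan]])

        left = build_tree(data[mask])
        right = build_tree(data[~mask])
        # Left child is the next row; right child starts after the
        # whole left subtree. Rows are appended along axis 0.
        root = np.array([[best, split_val, 1, left.shape[0] + 1]])
        return np.vstack([root, left, right])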
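
For the random tree at 46:30, only the two expensive steps change: the split feature is chosen at random, and one simple variant takes the split value as the mean of that feature's values at two randomly chosen rows, avoiding both the correlation scan and the median computation. A sketch of just that step (choose_random_split is an illustrative name; the rest of the builder stays the same):

    import numpy as np

    rng = np.random.default_rng()

    def choose_random_split(x):
        # Random-tree variant (A. Cutler style): random feature, and a
        # split value taken as the mean of the feature's values at two
        # randomly chosen rows
        feature = int(rng.integers(x.shape[1]))
        r1, r2 = rng.integers(x.shape[0], size=2)
        split_val = (x[r1, feature] + x[r2, feature]) / 2.0
        return feature, split_val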

  • @claybuxton8167 3 years ago +26

    DT Learner Algorithm Explanation starts at 14:30
    RT Learner Algorithm Explanation starts at 47:00

  • @mikedunn597 5 years ago +35

    Algorithm explained starting at 14:27

  • @yogeshluthra123 7 years ago +3

    At around 11:50, there was a question about why binary trees are more efficient and why not use higher branching.
    In fact, higher-branching trees, such as 2-3 trees, red-black trees, and even B-trees, are built from the same basic BST structure. For example, to answer 'yes, no, or maybe', the left branch could just be 'yes' and the right branch delegates the answer to a subtree (which could be a leaf node) -- see the toy sketch below.
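
A toy illustration of that delegation (purely illustrative code): the three-way answer is realized with two stacked binary decisions.

    def classify(ans_yes, ans_no):
        # First binary split answers 'yes' directly...
        if ans_yes:
            return "yes"
        # ...the right branch delegates to a second binary node
        if ans_no:
            return "no"
        return "maybe"  # leaf reached after two binary decisions

    print(classify(False, False))  # -> 'maybe'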

  • @balance3826 1 year ago +5

    These requirements require a law degree to understand...

  • @merckens 4 years ago

    FYI Factor 2 was "volatile acidity", negatively correlated with the rating at around -0.391 (almost as strongly correlated as alcohol content). Volatile acidity is basically the presence of acidic aromas in the wine, which can smell similar to vinegar or even nail polish remover (which is why the correlation is negative - more is worse). Factor 10 was the amount of sulphates in the wine, which had a positive correlation with the score of around 0.251. (I actually think they meant "sulfites", or possibly it's just a translation issue.) Sulfites occur in wine both naturally and via additives; their purpose is to prevent spoilage and oxidation during fermentation, either of which obviously weakens (or outright ruins) the wine.

  • @frankhahn1384 8 years ago +1

    You're probably going to use random values when using Cutler's model.

  • @MohammedHamadii 8 years ago

    When we split based on the median (half the data is greater than median and the other half is less), do we assume that the column is always sorted?

    • @MohammedHamadii 8 years ago +8

      The answer is yes, the column is sorted. I should have continued watching.
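
Worth noting for anyone implementing this: if the split value is computed with np.median, the column does not need to be pre-sorted, since NumPy sorts (partitions) the data internally. A quick check:

    import numpy as np

    col = np.array([10.5, 9.9, 11.8, 10.0, 10.9])  # unsorted feature column
    print(np.median(col))  # 10.5 - no pre-sorting required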

  • @zhuzhuqing4862 5 years ago +1

    42:44 The answer is WRONG!!!!!!!!!!!!!!!!!!!!!!!

    • @zhuzhuqing4862 5 years ago

      The value 10.9 should be in the right-right subtree.

    • @merckens 4 years ago

      @zhuzhuqing4862 Whaddya mean? 10.9 is the median of 10.0 and 11.8 in the right-left subtree. 10.7 is the median of 10.5 and 10.9 in the right-right subtree. Looks fine to me.

    • @michaelgentry7534 4 years ago +7

      Shouldn't the top right value be 8 instead of 7?

    • @J3Compton 4 years ago

      @michaelgentry7534 Yes.