Q&A - Hierarchical Softmax in word2vec

  • Published Dec 3, 2024
  • What is the "Hierarchical Softmax" option of a word2vec model? What problems does it address, and how does it differ from Negative Sampling? How is Hierarchical Softmax implemented?
    For more insights into word2vec, check out my full online course on word2vec here:
    www.chrismccor...

Comments • 36

  • @abhijeetsharma5715
    @abhijeetsharma5715 3 years ago +2

    This was the best explanation of HS that I've seen! Very clearly explained.
    In my opinion, the most essential point is that even with HS we still have |V|-1 output units, but only log|V| of them need to be computed during training; the rest are "don't-cares", and the loss can be computed from those log|V| outputs alone.
    At test time, however, we would certainly need to compute all |V| softmax probabilities to make a prediction. But we don't really care about testing/predicting here, since our aim is just to train the embeddings.

    • @gemini_537
      @gemini_537 11 months ago

      I like your comments, but I don't quite understand why only log|V| units need to be computed during training. Could you give an example?
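
      An illustrative example (my numbers, not from the video): with a vocabulary of |V| = 1,000,000 words, a full softmax scores all one million output units for every training pair, while in HS the target word sits at a leaf of a binary tree of depth about log2(1,000,000) ≈ 20, so only the ~20 inner nodes on its path are evaluated. Below is a minimal sketch, assuming one sigmoid unit per inner node; the path, node indices, and sizes are made up:

      ```python
      import numpy as np

      # Toy sketch: the HS loss for one (input, target) pair touches only
      # the inner nodes on the path to the target's leaf, not all V outputs.
      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      V, d = 10_000, 100                           # vocab size, embedding size
      rng = np.random.default_rng(0)
      node_vecs = rng.normal(0, 0.01, (V - 1, d))  # one vector per inner node
      h = rng.normal(0, 0.01, d)                   # input word's embedding

      # Hypothetical path to the target's leaf: (inner-node index, branch label).
      # Its length is ~log2(V) (about 13 here), vs. V terms for full softmax.
      path = [(0, 1), (3, 0), (7, 1)]

      loss = 0.0
      for node, label in path:
          p = sigmoid(node_vecs[node] @ h)         # P(branch = 1) at this node
          loss += -(label * np.log(p) + (1 - label) * np.log(1 - p))
      print(loss)                                  # len(path) dot products total
      ```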

  • @vgkk5637
    @vgkk5637 4 years ago +10

    Thank you, Chris. Well explained, and just perfect for people like me who are interested in understanding the concepts and usage rather than the academic math behind it.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI 4 years ago

      Thanks VeniVig! That's nice to hear, and I'm glad it provided a practical understanding.

  • @stasbekman8852
    @stasbekman8852 4 years ago +14

    There is a small typo at 13:50 - it should be .72 instead of .62, so that it adds up to 1.
    And thank you!
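
    For context on why the pair must sum to 1: at each inner node, HS gives the two branches complementary sigmoid probabilities, so

    $$\sigma(x) + \sigma(-x) = \frac{1}{1 + e^{-x}} + \frac{1}{1 + e^{x}} = 1, \qquad \text{e.g. } 0.28 + 0.72 = 1.$$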

  • @ariverhorse
    @ariverhorse 11 months ago

    Best explanation of HS I have seen!

  • @자연어천재만재
    @자연어천재만재 2 years ago

    This is amazing. Although I am Korean and bad at English, your lecture made me smart.

  • @doctorshadow2482
    @doctorshadow2482 7 months ago

    Thank you. Good explanation.
    Some questions:
    1. At 2:40: why do we need the outputs to sum to 1, which is what softmax provides? What's wrong with using the raw output values? The weight for 8 is already higher than the others, so we have the answer. Why do we need the extra work at all?
    2. At 9:49: what is this "word vector"? Is it still the one-hot vector for the word from the dictionary, or something else? How is this vector represented in this case?
    3. At 15:00: that's fine, but after training for "chupacabra", what happens to those weights when we train on the other words? Wouldn't it just blend or "blur" the coefficients, pushing them toward white noise?

  • @user-re1bi2bc8b
    @user-re1bi2bc8b 4 years ago +1

    Incredibly easy to understand thanks to your explanation. Thank you Chris!

  • @joyli9106
    @joyli9106 3 years ago

    Thank you Chris! I would say it's the best explanation I've ever seen about HS.

  • @hamzaleb9215
    @hamzaleb9215 5 years ago

    Always a clear explanation, right to the point. Thanks Chris. Waiting for the next videos. Your two articles explaining Word2Vec were just perfect.

  • @Dao007forever
    @Dao007forever 2 years ago

    Great explanation!

  • @j045ua
    @j045ua 4 years ago

    These videos have been a great help for my thesis! Thank you Chris!

  • @amratanshu99
    @amratanshu99 4 years ago +1

    Nicely explained. Thanks man!

  • @8g8819
    @8g8819 5 years ago

    Great video series, keep it going !!!!

  • @guaguago2583
    @guaguago2583 5 years ago

    Your fan here, second comment :) I am a Chinese PhD student, really looking forward to the next videos :D

    • @ChrisMcCormickAI
      @ChrisMcCormickAI 5 years ago

      Thank you! I'm hoping to upload a new video about every week or so.

  • @utubemaloy
    @utubemaloy 4 years ago

    Thank you!!! I loved this.

  • @corgirun7892
    @corgirun7892 several months ago

    super amazing

  • @samba789
    @samba789 4 years ago +1

    Great videos Chris! I absolutely love your content!
    Just a quick clarification: is the output matrix also a set of weights that we need to learn?

  • @nikolaoskaragkiozis5330
    @nikolaoskaragkiozis5330 3 years ago

    Hi Chris, thank you for the video. So, if I understand correctly, there are two things being learned here: 1) the word embeddings, and 2) the output matrix, which contains the weights associated with the output layer?
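
    For what it's worth, in the standard skip-gram formulation both of those are learned jointly; a shape-level sketch (sizes are illustrative, not from the video):

    ```python
    import numpy as np

    # The two trainable parameter sets in skip-gram with hierarchical softmax
    # (illustrative sizes):
    V, d = 10_000, 300
    embeddings = np.zeros((V, d))      # 1) input word embeddings (what we keep)
    node_vecs  = np.zeros((V - 1, d))  # 2) "output matrix": one learned vector
                                       #    per inner node of the binary tree
    ```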

  • @souvikjana2048
    @souvikjana2048 4 years ago +1

    Hi Chris. Great video. Could you explain how the binary tree is trained? I can't seem to understand, for the input-context pair (chupacabra, active), how we select 0/1 at the root node or at the subsequent nodes below it.

    • @haardshah1676
      @haardshah1676 4 years ago

      I think you already know the sequence of 0s and 1s for the context word. So for each node you have a logistic regression model that takes the embedding of the input word as input and outputs the probability of 0/1 for that node. For the example you describe, we know the true label for the root node should be "1", for the 4th node "0", and for the 3rd node "1".
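
      A minimal sketch of one such training step, assuming the label sequence 1, 0, 1 described above (the node indices, sizes, and learning rate are made up):

      ```python
      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      d, lr = 100, 0.025
      rng = np.random.default_rng(0)
      h = rng.normal(0, 0.01, d)               # embedding of the input word
      node_vecs = rng.normal(0, 0.01, (9, d))  # inner-node vectors of a toy tree

      # Path to the context word's leaf: (inner-node index, true branch label),
      # mirroring the root=1, 4th-node=0, 3rd-node=1 example above.
      path = [(0, 1), (4, 0), (3, 1)]

      grad_h = np.zeros(d)
      for node, label in path:
          p = sigmoid(node_vecs[node] @ h)     # node's logistic-regression output
          err = p - label                      # d(log-loss)/d(score)
          grad_h += err * node_vecs[node]      # accumulate gradient for embedding
          node_vecs[node] -= lr * err * h      # update this node's weights
      h -= lr * grad_h                         # update the input word's embedding
      ```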

  • @libzz6133
    @libzz6133 2 years ago

    At 10:53 we got the label 0 for node 4; what about the other labels, like the label for node 1?

  • @mahdiamrollahi8456
    @mahdiamrollahi8456 3 years ago

    Hello dear Chris,
    Hope all is well.
    Thanks for your lecture; it was fabulous.
    I have some small questions:
    - For negative sampling, it is said that the negative samples are selected randomly. Does that mean we only need to update the parameters for those samples, instead of for all possible words as in softmax? (So in softmax we update the parameters for both the correct and incorrect classes, true?)
    - How do we calculate the output matrix? Where does it come from?
    - If we want to calculate the probabilities of all context words, we need to traverse the whole tree, right?
    Best wishes, Mahdi
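
    On the first point, an illustrative sketch of which output-side parameters a single negative-sampling step touches (the indices and sizes are made up):

    ```python
    import numpy as np

    # With negative sampling, one training pair updates only the output rows
    # of the positive word and the k sampled negatives, not all V rows.
    V, d, k = 10_000, 300, 5
    rng = np.random.default_rng(0)
    out = rng.normal(0, 0.01, (V, d))   # output matrix: one row per word

    positive = 42                       # index of the true context word
    negatives = rng.choice(V, size=k)   # k randomly drawn negative words

    touched = [positive, *negatives]    # only these k + 1 rows get gradients;
    rows = out[touched]                 # the other V - k - 1 rows are untouched
    ```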

  • @yuantao563
    @yuantao563 4 years ago

    The video is great! Are there any rules for why each blue node corresponds to a given row in the output matrix? For example, why is the first blue node row 6? How is that determined?

    • @ChrisMcCormickAI
      @ChrisMcCormickAI 4 years ago

      Hi Yuan,
      It's just a byproduct of the Huffman tree building algorithm. If I recall correctly, I think it does result in the rows being sorted relative to the tree depth (the frequency of the word). This isn't important to the implementation, though.

    • @abhijeetsharma5715
      @abhijeetsharma5715 3 years ago

      You can assign each blue node to any row of the output matrix. The order in which rows are assigned is unimportant, since this is not a sequential output like an RNN's, just as the input units of a vanilla neural net can be in any order.
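
      For the curious: word2vec builds this tree with the classic Huffman construction over word counts, which is what gives frequent words the shorter paths Chris mentions. A toy sketch (the words and counts are made up):

      ```python
      import heapq
      import itertools

      # Toy Huffman-tree construction from word counts. More frequent words
      # end up closer to the root, i.e. with shorter codes.
      counts = {"the": 50, "of": 30, "active": 4, "chupacabra": 1}

      tiebreak = itertools.count()  # so heapq never has to compare dicts
      heap = [(c, next(tiebreak), {"word": w}) for w, c in counts.items()]
      heapq.heapify(heap)

      while len(heap) > 1:
          c1, _, left = heapq.heappop(heap)   # two least-frequent subtrees...
          c2, _, right = heapq.heappop(heap)
          heapq.heappush(heap, (c1 + c2, next(tiebreak),
                                {"left": left, "right": right}))  # ...merged

      def print_codes(node, prefix=""):
          if "word" in node:
              print(node["word"], prefix)     # "the" gets the shortest code
          else:
              print_codes(node["left"], prefix + "0")
              print_codes(node["right"], prefix + "1")

      print_codes(heap[0][2])
      ```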

  • @gavin8535
    @gavin8535 4 years ago

    Nice. What does vector 6 look like, and where is it? Is it in the output layer?

  • @mikemihay
    @mikemihay 5 years ago

    Waiting for more videos

  • @maliozers
    @maliozers 5 years ago

    First like, first comment :) Thanks for sharing, Chris.

  • @anoop5611
    @anoop5611 3 years ago

    What does the output vector list in blue contain?
    Something from the hidden-to-output weights?

    • @anoop5611
      @anoop5611 3 years ago

      Okay, I missed the part that answers it. So a particular row of the output matrix corresponds to one of those non-leaf nodes, and the size of the row equals the number of units in the hidden layer?
      Thank you Chris!

  • @ANSHULJAINist
    @ANSHULJAINist 4 years ago

    How do you implement hierarchical softmax for an arbitrary model? Do frameworks like PyTorch or TensorFlow have built-in implementations? If not, how can it be built to work with any model?
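
    For reference, gensim's Word2Vec exposes hierarchical softmax through its hs flag; a minimal usage sketch (the toy corpus is made up):

    ```python
    from gensim.models import Word2Vec

    # hs=1 enables hierarchical softmax; negative=0 disables negative
    # sampling, since the two are alternative output layers.
    sentences = [["the", "chupacabra", "is", "active", "at", "night"]]
    model = Word2Vec(sentences, vector_size=100, hs=1, negative=0, min_count=1)
    print(model.wv["chupacabra"].shape)  # (100,)
    ```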