The Best Way to do Topic Modeling in Python - Top2Vec Introduction and Tutorial

  • Published on Nov 21, 2024

Comments • 98

  • @justinhuang8034
    @justinhuang8034 2 years ago +9

    You're killing it lately with these videos. Keep up the great work.

    • @python-programming
      @python-programming 2 years ago +4

      Thanks! That is great to hear. I am trying out a new style this month to see if subscribers like it.

    • @jesusmtz29
      @jesusmtz29 2 years ago

      @@python-programming New subscriber here. Love your style of presentation

  • @Adrian_Marmy
    @Adrian_Marmy 1 year ago +2

    Dude, this video is awesome. Breaking things down seems to be your super power.... 👌

    • @python-programming
      @python-programming 1 year ago +1

      Thanks so much! I always wanted a super power. Since this video came out, I think BERTopic is a bit better. It has more features and is a bit more accessible now to beginners too. It also has a thriving community.

    • @Adrian_Marmy
      @Adrian_Marmy 1 year ago

      @@python-programming Wow, awesome of you to comment this. I will have a look at it :-)

  • @dankchan420
    @dankchan420 2 years ago +8

    I am a new subscriber and this .. was .. simply .. great! I wish there were more Top2Vec videos (ranging from beginner to advanced). Keep up the excellent work. *hint* *hint* 🙂

    • @python-programming
      @python-programming 2 years ago +2

      Thanks! Great to hear! I will be making more in the future.

  • @abasisadegh
    @abasisadegh 9 months ago +1

    Thank you very much for this video, man. Is there a way to use pyLDAvis visualizations with top2vec?

  • @sjoerdbraaksma9358
    @sjoerdbraaksma9358 1 year ago +1

    This is such a great find! What I am wondering is: can you train a BERT sentence transformer on a large set of documents spanning several projects, then have top2vec use these embeddings to make a topic model for each project (so basically, for each subset of the larger corpus)?

  • @rush19772112
    @rush19772112 2 years ago +1

    Dr Mattingly, I wish the best for your channel, and CONGRATULATIONS:
    you've been a GREAT help with your videos in understanding Pandas. Topic modeling is an area of INTEREST to me, especially everything related to the social sciences and especially LDA. Looking forward to seeing your video tutorial.
    Needless to say how grateful I am to you, because you HELPED ME to UNDERSTAND by showing Pandas step by step in your tutorials. If you could do the same with the Latent Dirichlet Allocation algorithm, that would be marvelous. Even though you already have some code written in a past video tutorial, I am still not quite there in how to apply it in a project with texts in languages other than English, such as Greek or Hebrew.
    Looking forward to seeing your video tutorial.
    Kind wishes,
    Christos Bardas

    • @python-programming
      @python-programming 2 years ago

      Thank you so much for your very kind comment! It means a lot to me. I will see if I can put together a video for topic modeling with non-English texts. Would Latin be alright? I don't have Greek or Hebrew unfortunately.

    • @rush19772112
      @rush19772112 2 years ago

      @@python-programming Any non-English language would be OK. What I can't work out in the video about LDA is how to transform the data: for instance, how to build a data set of texts (from historical data) in PDF format, tokenize the words, and take all the necessary steps to run the LDA algorithm. Please build everything from scratch as you did in the Pandas series. Your videos in the Pandas series have proved an inspiration for me, therefore I'm truly grateful, Dr Mattingly!
      I'm looking forward to an LDA one as well!
      ..Kind regards..
      ..christos bardas..

  • @prabhacar
    @prabhacar 2 years ago +1

    Brilliant stuff! Thanks! Just a small comment: I am quite visual and I learn better with pictures. In future, if possible, please include some visualizations of the topic modeling.

  • @saranshtiwari8543
    @saranshtiwari8543 25 days ago

    Hey YouTube, why am I seeing this after 2 years? Recommend videos like these as soon as they get uploaded!

  • @kosemekars
    @kosemekars 2 years ago +2

    Great vid, as always. I'm interested in creating my own WordNet dataset, any ideas where I should start?

    • @python-programming
      @python-programming 2 years ago +1

      You are too kind! Thanks. That is a very interesting question that I have never gotten before. I have never attempted something precisely like that before (so take what I say with a grain of salt), but I have worked with similar problems that were very domain-specific. I used a combination of heuristics and FastText embeddings to generate a sort of weakly supervised approach to forming a knowledge tree based on semantic and syntactic meaning. Does this help?

    • @kosemekars
      @kosemekars 2 years ago +1

      @@python-programming Thanks for the illuminating answer. Do you think that a graph-based approach (using something like NetworkX) could be helpful? Basically starting from a lexicon or dictionary and mapping the relations.

    • @python-programming
      @python-programming 2 years ago +1

      No problem! Indeed I do. That was actually exactly how I graphed them out. Also use word vectors, and use the similarity between them to calculate the weights of the edges in the graph. That may help.
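
A minimal sketch of the idea in this reply: treat each word as a node and use cosine similarity between word vectors as the edge weight. The three-dimensional vectors, the `threshold` parameter, and the function names are all hypothetical placeholders; real vectors would come from a trained embedding model such as FastText.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_similarity_graph(vectors, threshold=0.5):
    """Return an adjacency dict {word: {neighbor: weight}}.

    An undirected edge is kept only when the similarity reaches the
    (hypothetical) threshold; the similarity itself is the edge weight.
    """
    graph = {word: {} for word in vectors}
    for w1, w2 in combinations(vectors, 2):
        weight = cosine_similarity(vectors[w1], vectors[w2])
        if weight >= threshold:
            graph[w1][w2] = weight
            graph[w2][w1] = weight
    return graph


# Toy vectors invented for illustration only.
toy_vectors = {
    "arma": [0.9, 0.1, 0.0],
    "bellum": [0.8, 0.2, 0.1],
    "amor": [0.0, 0.1, 0.9],
}

graph = build_similarity_graph(toy_vectors)
for word, neighbors in graph.items():
    print(word, neighbors)
```

The same edges could be loaded into NetworkX with `G.add_edge(w1, w2, weight=weight)` if graph algorithms or plotting are needed.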

    • @kosemekars
      @kosemekars 2 years ago +1

      @@python-programming Very interesting. Thanks.

    • @python-programming
      @python-programming 2 years ago

      No problem!

  • @sarasharick5209
    @sarasharick5209 2 years ago +1

    I just started my first data science role and there’s a project coming up with a topic modeling aspect to it. Looking forward to this video.

    • @python-programming
      @python-programming 2 years ago

      Awesome! Glad to hear it. I hope it helps out a lot.

  • @dankchan420
    @dankchan420 2 years ago +4

    can you show how to compare it to lda with topic information gain? or coherence score? something i’m curious to see

    • @edadila
      @edadila 2 years ago +2

      I need this too! Thanks for the great video by the way👏

    • @python-programming
      @python-programming 2 years ago +2

      Great idea for a new video! Thanks!

  • @sinabaghaei3504
    @sinabaghaei3504 1 year ago +1

    So do you suggest working with Top2Vec rather than LDA? I mean, do you think making those manual changes in implementing LDA and preprocessing the data is worth it, or should we stick to Top2Vec? By the way, your videos are awesome and I am really interested in going deep into topic modeling.

    • @python-programming
      @python-programming 1 year ago +1

      For most tasks, it makes more sense to use the newer methods applied in Top2Vec, BERTopic, or LeetTopic than doing traditional LDA topic modeling. That said, there are times that LDA may make more sense. It just depends on the problem that you are trying to solve. I have not had to use LDA in a while because the results from transformer-based topic modeling are far superior.

  • @amrmoursi7303
    @amrmoursi7303 2 years ago

    Thanks, I wish the best for your channel, and CONGRATULATIONS.
    How can we evaluate a topic modeling algorithm like Top2Vec or BERTopic?
    Thanks in advance.

  • @AndrewPeverells
    @AndrewPeverells 2 years ago +1

    Hello Dr Mattingly, great guide as always!
    I'm in need of help though. I'll post this here, so maybe other people who have the same issue can solve it, but if you prefer I can send you a pm.
    I'm trying to feed my model 2 kinds of lists:
    1. ["arma", "virumque", "cano", "troiae"...]
    2. ["arma virumque cano troiae qui primus ab oris..."]
    (from what i get in the documentation, the first one should be the way to go, as it processes lists of strings)
    When trying to build my model, I get these two types of errors; for the first one:
    "Exception in thread Thread-171:
    Traceback (most recent call last):
    File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
    File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
    File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/word2vec.py", line 1163, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, alpha, thread_private_mem)
    File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 424, in _do_train_job
    tally += train_document_dbow(
    File "gensim/models/doc2vec_inner.pyx", line 358, in gensim.models.doc2vec_inner.train_document_dbow
    TypeError: Cannot convert list to numpy.ndarray" (and it gets stuck loading)
    For the second one:
    "hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
    hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
    sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._kd_tree.BinaryTree.query()
    k must be less than or equal to the number of training points"
    Do you know how to solve it?

    • @python-programming
      @python-programming 2 years ago

      Thanks!
      And great question. I think we can work on it here, that way others can get the benefit of hearing about the issue and potential solutions. First, unlike other topic modeling approaches, with top2vec, you do not need to tokenize your text, so a list of docs (as strings) is what you want to give the model. I have not tried to give it a list of lists, yet, but from what I can see from your first example, you appear to just be giving the model a list of words.
      In this scenario, you would typically want to give it a list of lists with each sublist being the tokens (words) from each document. Does that make sense?
      I suspect this is the origin of the error, but I would need to see your code more to address it properly. If you want, DM me on Twitter with a larger snippet and I will respond here with a better answer.

    • @AndrewPeverells
      @AndrewPeverells 2 years ago +1

      @@python-programming Thank you for your quick answer!
      Yes, it does make sense. As with a good share of coding-related problems, it's an issue of data types and how to properly handle them.
      Now though I'm a bit lost. As a test, I'm trying to feed my model this list of strings:
      " lst = [["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"],
      ["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"],
      ["uiuamus", "mea", "lesbia", "atque", "amemus"]] "
      (yes, I'm working with latin!)
      The error for model = Top2Vec(lst) now is: "ValueError: Documents need to be a list of strings"
      Isn't it, like you said, a list of lists, with each sublist being strings (the tokens)? Am I missing something terribly basic, because I'm a complete beginner at coding?

    • @python-programming
      @python-programming 2 years ago

      @@AndrewPeverells no problem! Happy to help. Can you try and give it a list of sentences rather than a list of lists of tokens and see if that helps? Also can you paste your whole code here so I can see how you are loading your data? Also what OS are you using?

    • @AndrewPeverells
      @AndrewPeverells 2 years ago

      @@python-programming Now I tried with a simple list of sentences, and it gave me another error.
      I'll paste the whole code, although it's very short:
      >> from top2vec import Top2Vec
      >> lst = ["arma virumque cano troiae qui primus ab oris", "nunc est bibendum, nunc pede libero pulsare tellus", "uiuamus mea lesbia atque amemus"]
      >> model = Top2Vec(lst)
      Error:
      "RuntimeError Traceback (most recent call last)
      /tmp/ipykernel_1573/2552625371.py in
      ----> 1 model = Top2Vec(lst)
      ~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
      524 logger.info('Creating joint document/word embedding')
      525 self.embedding_model = 'doc2vec'
      --> 526 self.model = Doc2Vec(**doc2vec_args)
      527
      528 self.word_vectors = self.model.wv.get_normed_vectors()
      [...]
      RuntimeError: you must first build vocabulary before training the model"
      I'm working on jupyter notebook, from an Ubuntu terminal environment for Windows.

    • @AndrewPeverells
      @AndrewPeverells 2 years ago +1

      Update
      I think I found the issue for this. It's the size of your corpus. If I raise the number of documents (being whole sentences) in my corpus, it stops giving me the error. I went for at least 15 documents.
      Now it gives me another error though, and I'm quite lost.
      Code:
      >> from top2vec import Top2Vec
      >> lst = ["document1", "document2", "document3", ... "document17"]
      >> model = Top2Vec(lst)
      Error:
      "~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
      682
      683 # create topic vectors
      --> 684 self._create_topic_vectors(cluster.labels_)
      685
      686 # deduplicate topics
      ~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in _create_topic_vectors(self, cluster_labels)
      857 unique_labels.remove(-1)
      858 self.topic_vectors = self._l2_normalize(
      --> 859 np.vstack([self.document_vectors[np.where(cluster_labels == label)[0]]
      860 .mean(axis=0) for label in unique_labels]))
      861
      in vstack(*args, **kwargs)
      ~/.local/lib/python3.8/site-packages/numpy/core/shape_base.py in vstack(tup)
      280 if not isinstance(arrs, list):
      281 arrs = [arrs]
      --> 282 return _nx.concatenate(arrs, 0)
      283
      284
      in concatenate(*args, **kwargs)
      ValueError: need at least one array to concatenate"
      I really don't know what this is all about.
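
A minimal sketch of the input shape this thread converges on: Top2Vec rejects a list of token lists ("Documents need to be a list of strings"), so tokenized documents have to be joined back into one string each. The helper name `to_document_strings` is hypothetical and not part of Top2Vec. The thread also suggests that a three-sentence corpus is too small to train on, so the model call is left as a comment.

```python
def to_document_strings(corpus):
    """Return a flat list of strings, joining any tokenized documents."""
    documents = []
    for item in corpus:
        if isinstance(item, str):
            # Already a plain document string.
            documents.append(item)
        elif isinstance(item, (list, tuple)) and all(isinstance(t, str) for t in item):
            # A tokenized document: rebuild the text with single spaces.
            documents.append(" ".join(item))
        else:
            raise TypeError(f"Unsupported document type: {type(item).__name__}")
    return documents


tokenized = [
    ["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"],
    ["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"],
    ["uiuamus", "mea", "lesbia", "atque", "amemus"],
]

documents = to_document_strings(tokenized)
print(documents)

# With a realistically sized corpus, the list of strings is what gets passed in:
# from top2vec import Top2Vec
# model = Top2Vec(documents)
```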

  • @juanmanuelaguiar3368
    @juanmanuelaguiar3368 2 years ago

    Great video, very clear!
    Do you know how Top2Vec deals with outliers? There is no 'outlier topic' at the end and all the documents seem to be assigned a topic. (I have BERTopic in mind, where there is a -1 topic with the outliers.)

  • @fetchthebattleaxe
    @fetchthebattleaxe 2 years ago +2

    Great video! Do you know if top2vec has options for when you have a dataset too large to fit into RAM? I have a dataset that is something like 9gb of text that I've been trying to topic model with different methods, so I'd be curious to try this out. But I probably can't just load the whole thing into a list and pass it in

    • @python-programming
      @python-programming 2 years ago +1

      Thanks! Great question. I have not personally tried it with a dataset that large just yet. What are your computer's specs? Do you have a Cuda-accelerated GPU?

    • @fetchthebattleaxe
      @fetchthebattleaxe 2 years ago

      @@python-programming
      CPU: AMD Ryzen 7 3700X 8 core
      16 gb ram
      GPU: RTX 2070 super
      The GPU does have Cuda installed and I've used it for deep learning a bit. But the GPU itself only has 8gb vram and i've run into cuda memory issues before. Though admittedly I have no idea how memory needs are shared between CPU and GPU.
      Either way, I'll probably try this library on a random slice of the full data to see if it shows promise. Thanks for drawing my attention to it!
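
One way to take the "random slice" mentioned above without loading the full data into memory is reservoir sampling, which keeps a fixed-size uniform sample while streaming through the documents once. This is a general-purpose sketch, not something Top2Vec provides; the function name, the `seed` parameter, and the generated stand-in documents are hypothetical.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = item
    return sample


# Stand-in for a large corpus; a generator never holds all documents at once.
stream = (f"document {i}" for i in range(10_000))
subset = reservoir_sample(stream, k=100)
print(len(subset))
```

With a real corpus stored one document per line, the generator could be replaced by an open file handle, which Python also reads lazily line by line.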

  • @Kylbigel
    @Kylbigel 1 year ago +1

    Exactly what I needed thank you!

  • @jubinamarie
    @jubinamarie 1 year ago +1

    This and your other top2vec videos are awesome! This is exactly what I needed. I have a question for you. I use other tools (e.g., Tableau) and would want to export topic data from Jupyter Notebooks to use elsewhere. I figured out how to export the DF to Excel with a column added for the topic numbers, but can't for the life of me figure out how to get columns with the other information, such as the document scores, maybe the top 10 words for each topic. The inability to move all the data out is holding me back. Hope you can help. Thank you!

    • @python-programming
      @python-programming 1 year ago

      I ran into these same issues; that's why I created LeetTopic with a colleague. It does a lot of the same things as Top2Vec but returns a df with all the data you want.
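
For the export question above, a rough sketch of flattening topic output into one row per document and writing it as CSV, which Tableau or Excel can open. Everything here is an assumption for illustration: the placeholder lists stand in for real model output (Top2Vec exposes methods such as `get_topics` and `get_documents_topics`, whose exact return values should be checked against its documentation), and the sketch assumes topic numbers index directly into `topic_words`.

```python
import csv
import io


def topic_rows(documents, doc_topics, doc_scores, topic_words, top_n=10):
    """Build one flat row per document: text, topic number, score, top words."""
    rows = []
    for text, topic, score in zip(documents, doc_topics, doc_scores):
        rows.append({
            "document": text,
            "topic": int(topic),
            "score": float(score),
            "top_words": ", ".join(topic_words[int(topic)][:top_n]),
        })
    return rows


# Placeholder values invented for illustration, not real model output.
documents = ["arma virumque cano", "nunc est bibendum"]
doc_topics = [0, 1]
doc_scores = [0.82, 0.67]
topic_words = [["arma", "bellum", "troiae"], ["bibendum", "uinum", "tellus"]]

rows = topic_rows(documents, doc_topics, doc_scores, topic_words, top_n=2)

# Write to an in-memory buffer; swap in open("topics.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["document", "topic", "score", "top_words"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```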

  • @cuneyttyler4922
    @cuneyttyler4922 1 year ago

    Nice video. But when I listed the words for each topic, it showed stop words only. Isn't it supposed to remove them in the preprocessing stage?

  • @lukechen8015
    @lukechen8015 1 month ago

    Hi there, how would this work if there's multiple topics tagged to one line? Is it all mutually exclusive?

  • @SonnyGeorgeVlogs
    @SonnyGeorgeVlogs 2 years ago +1

    Great video. Glad to have stumbled on it.

  • @TheAbdallahk
    @TheAbdallahk 2 years ago +1

    Wow, this is amazing. Thank you so much!

  • @gwaliwamashaka8724
    @gwaliwamashaka8724 3 months ago

    Awesome, thank you very much.

  • @JayShankarpure
    @JayShankarpure 2 years ago +1

    Hi sir, I checked out your NER playlist and had a doubt. How can we calculate the accuracy of an NER model?

    • @python-programming
      @python-programming 2 years ago

      Hi. I am glad you are watching the video. You analyze the Precision, Recall, and F-Score during training, but this only lets you know how the model is performing during the training process. To gather proper metrics, you need to structure a formal test with a held-out set and monitor the results. I have a video on it here: th-cam.com/video/k1FtpADlusE/w-d-xo.html
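
A minimal sketch of the held-out evaluation described in this reply, assuming entities are compared as exact `(start, end, label)` matches. The spans below are invented for illustration; in practice `gold` would come from annotated held-out texts and `predicted` from running the trained model on those same texts.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall, and F1 using exact span-and-label matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start, end, label) annotations for one held-out document.
gold = {(0, 4, "PERSON"), (10, 15, "GPE"), (20, 26, "ORG")}
predicted = {(0, 4, "PERSON"), (10, 15, "ORG")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here only the PERSON span matches exactly: the second prediction has the right boundaries but the wrong label, so it counts as both a false positive and a missed gold entity.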

    • @JayShankarpure
      @JayShankarpure 2 years ago +1

      @@python-programming Got it, thanks sir. Actually, I am making a stock research platform called Shodh, which involves some advanced NLP. I would love to get your guidance on a few of the topics I am working on. Can we connect sometime soon? Thanks

    • @python-programming
      @python-programming 2 years ago

      Sure! I do consultation, just fill out the form on my website wjbmattingly.com

  • @dynahmhyte
    @dynahmhyte 10 months ago +1

    ValueError: Documents need to be a list of strings (I get this when I type model = Top2Vec(docs))

    • @python-programming
      @python-programming 10 months ago

      Perhaps a few of your items are NaN values or ints or floats?
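
A quick way to test this suggestion is to separate the string items from everything else before building the model. This is a plain-Python sketch; the function name and the sample list are hypothetical, and it also drops empty strings, which is an extra choice rather than something the error message requires.

```python
def split_valid_documents(raw_items):
    """Return (clean, rejected): non-empty strings, and (index, type name) of the rest."""
    clean, rejected = [], []
    for index, item in enumerate(raw_items):
        if isinstance(item, str) and item.strip():
            clean.append(item)
        else:
            # NaN shows up here as 'float', missing values as 'NoneType'.
            rejected.append((index, type(item).__name__))
    return clean, rejected


# Invented sample resembling a text column with missing or numeric cells.
raw_items = ["arma virumque cano", float("nan"), 42, None, "", "nunc est bibendum"]

docs, rejected = split_valid_documents(raw_items)
print(docs)
print(rejected)
```

The `rejected` list shows which positions were not usable strings and what type they had, which helps trace the problem back to the source data.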

  • @RedCloudServices
    @RedCloudServices 1 year ago

    How do you filter stop words, and how does this compare to BERTopic?

  • @speedTurtle
    @speedTurtle 1 year ago +1

    Bro is the NLP Gawd.

  • @BispensGipsGebis
    @BispensGipsGebis 2 years ago +1

    You my Sir are awesome

  • @TC-bv4on
    @TC-bv4on 2 years ago +1

    Working on topic modeling for legal opinions. Have you tried Bert?

    • @python-programming
      @python-programming 2 years ago +1

      I have. It works very nicely. There is a library that wraps around Hugging Face BERT models. It is called BERTopic, but top2vec does the same actions and a bit more. Just specify the BERT model.

    • @TC-bv4on
      @TC-bv4on 2 years ago +2

      @@python-programming awesome! Thanks. I know there is a Legal Bert that is pretrained on legal materials so idk if there is a way to specify it. Also hoping to supplement it with a citation network because you really can’t understand an opinion without understanding its citations. If you have any ideas I’m all ears!
      Btw your channel is so needed. Hope it keeps growing while staying helpful and non-youtubey.

    • @python-programming
      @python-programming 2 years ago

      @@TC-bv4on you should be able to point to the legal BERT. Thanks so much for your kind words about my channel! If you want to see some legal content, let me know.

    • @TC-bv4on
      @TC-bv4on 2 years ago

      I personally would but idk I might be the only one. Law is super far behind as far as technology goes

  • @thepresistence5935
    @thepresistence5935 2 years ago

    Try BERTopic

  • @patrykkoakowski4357
    @patrykkoakowski4357 1 year ago

    How did you force the code to run on CPU?

  • @j0shm0o1
    @j0shm0o1 2 years ago +1

    Thanks for a great video! I installed top2vec and tried importing it. I get the following error: "No module named 'wordcloud.query_integral_image'". Any ideas?

    • @python-programming
      @python-programming 2 years ago

      Thanks! Interesting question. Did you create a new environment? I am wondering if you have an older version of wordcloud installed in your base?

    • @j0shm0o1
      @j0shm0o1 2 years ago +1

      This got resolved when I created a new environment

    • @python-programming
      @python-programming 2 years ago

      @@j0shm0o1 excellent!

  • @jordoobodi
    @jordoobodi 1 year ago

    4:20
    which is it!?
    "Each word in that document, type, all th the items of that vector, all the documents.."

  • @malikrumi1206
    @malikrumi1206 2 years ago +1

    Do you mean that Top2Vec requires *actual sentences*? What about paragraphs? Paragraphs with more than one topic inside them?

    • @python-programming
      @python-programming 2 years ago +1

      Great question. You can use any length text, but if you are using BERT, you want to keep it under 512 tokens. (Double check my number.) If your texts have frequently overlapping subjects, you can plot the texts, see where that overlap occurs visually, and assign labels accordingly. Say topic 3 shares features of topics 1 and 2. It would theoretically be plotted between the two, with a pull towards the one it shares the most overlap with. But yes, you can use sentences or paragraphs. Either should be fine.
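
A rough sketch of keeping texts under a length limit by splitting on whitespace. Note the assumption: BERT-style models count subword tokens, not words, and a text usually has more subword tokens than words, so the hypothetical `max_words=300` default is deliberately well below 512. The Top2Vec constructor signature quoted in a traceback earlier in this thread also lists `split_documents` and `chunk_length` parameters, which may handle this inside the library.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Invented 700-word text for illustration.
long_text = " ".join(f"w{i}" for i in range(700))

chunks = chunk_words(long_text)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk can then be treated as its own document, at the cost of losing the link between chunks unless the original document ID is tracked alongside.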

    • @malikrumi1206
      @malikrumi1206 2 years ago +1

      Great! Thanks.

    • @python-programming
      @python-programming 2 years ago

      No problem!

  • @tonyberber
    @tonyberber 1 year ago

    I'm getting this error:
    from top2vec import Top2Vec
    ImportError: cannot import name 'Top2Vec' from partially initialized module 'top2vec' (most likely due to a circular import)
    any ideas are appreciated.
    I'm using an M1 Mac

  • @AlexAlexanderIII
    @AlexAlexanderIII 2 years ago +1

    Great video.

  • @wenqianzhou9174
    @wenqianzhou9174 2 years ago +1

    How about BERTopic?

    • @python-programming
      @python-programming 2 years ago +2

      I will do a video on that

    • @khalifakhalifa610
      @khalifakhalifa610 2 years ago

      @@python-programming Can't wait for your BERTopic video. Your style is just amazing, Kudos!!!

  • @boubacarbah1455
    @boubacarbah1455 2 years ago

    Hello, I'm trying to reproduce your exercise, but I ran into a problem when I tried to import Top2Vec with "from top2vec import Top2Vec". I get this error: 'no module named "llvmlite.binding.dylib"', and I could not fix it. Do you have a solution?

    • @bben4507
      @bben4507 2 years ago

      similar here, but I got: OSError: Could not load shared object file: libllvmlite.dylib

  • @moemarocha3893
    @moemarocha3893 1 year ago

    Hi! Is anyone here having trouble importing Top2Vec due to problems with the NumPy version? I tried most of the possible solutions I found on Stack Overflow but nothing works...

  • @babyroo555
    @babyroo555 1 year ago +1

    Any R coders here?