How to Create an LDA Topic Model in Python with Gensim (Topic Modeling for DH 03.03)

  • Published on Oct 8, 2024
  • Notebook: github.com/wjb...
    In this video, we use Gensim and Python to create an LDA Topic Model. As with other text analysis methods, most time is spent preparing the data and getting it into a form readable by the ML system.
    If you enjoy this video, please subscribe. I provide all my content at no cost. If you want to support my channel, please donate via
    PayPal: www.paypal.com...
    Patreon: / wjbmattingly (it's my www.themedievalworld.com account as well).
    If there's a specific video or tutorial series you would like to see, let me know in the comments and I will try to make it.
    If you liked this video, check out www.PythonHumanities.com, where I have Coding Exercises, Lessons, on-site Python shells where you can experiment with code, and a text version of the material discussed here.
    You can follow me at:
    / wjb_mattingly

Comments • 62

  • @python-programming
    @python-programming 3 years ago +8

    Notebook: github.com/wjbmattingly/topic_modeling_textbook/blob/main/03_03_lda_model_demo.ipynb

    • @vt-fc6gq
      @vt-fc6gq 3 years ago +2

      First: thank you VERY MUCH for this video.
      I was still getting an error message saying:
      module 'pyLDAvis' has no attribute 'gensim'
      This happens because the module was renamed.
      You now have to write:
      import pyLDAvis
      import pyLDAvis.gensim_models as gensimvis
      pyLDAvis.enable_notebook()
      and then :
      # feed the LDA model into the pyLDAvis instance
      lda_viz = gensimvis.prepare(ldamodel, corpus, dictionary)
      (source : script_kitty : stackoverflow.com/questions/66759852/no-module-named-pyldavis)
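
Code that has to run against both older and newer pyLDAvis releases can try the new module name first and fall back to the old one. A minimal sketch, assuming only the two module names quoted in this thread; `first_importable` is a hypothetical helper name, not part of pyLDAvis:

```python
import importlib


def first_importable(candidates):
    """Import and return the first module in `candidates` that is available."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError; try the next name.
            continue
    raise ImportError(f"none of these modules could be imported: {list(candidates)}")


# Intended use (requires pyLDAvis to be installed, so it is left commented out):
# gensimvis = first_importable(["pyLDAvis.gensim_models", "pyLDAvis.gensim"])
# lda_viz = gensimvis.prepare(ldamodel, corpus, dictionary)
```

Listing the newer name first means a current install never touches the removed module.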

    • @krishnazanwar4709
      @krishnazanwar4709 2 years ago

      Hey man, this Jupyter notebook apparently isn't working. It gives the error below.
      Unreadable Notebook: C:\Users\PT\Downloads\topic_modelling.ipynb NotJSONError("Notebook does not appear to be JSON: '\

  • @HungNguyen-te9dd
    @HungNguyen-te9dd 3 years ago +34

    Literally one of the most underrated channels for NLP, keep up the good work!!!

  • @victorevelandiasuarez2447
    @victorevelandiasuarez2447 a year ago +1

    I usually watch tutorial videos at 1.5x speed, but this man speaks at 2x XD. Thanks for the video!

    • @python-programming
      @python-programming a year ago

      Haha! No problem. I spent a lot of time trying to learn to speak this slowly. =)

  • @purposeoriented6094
    @purposeoriented6094 2 years ago +3

    Really!!! One of the many underrated channels for NLP on YouTube. Keep up your good work, the rewards will follow. Thank you.

  • @ShahanShawkat
    @ShahanShawkat 2 years ago

    Good day, Dr. I highly appreciate the time and effort you put into creating these videos and making them available on YouTube! Best wishes.

  • @domagojpalenkas4437
    @domagojpalenkas4437 2 years ago +5

    Thanks for the great tutorial on topic modeling in general, very valuable material here.
    In this particular video, did you forget to exclude the stop_words? I skimmed through the code but can't find the place where they are used.
    Keep up the good work (Y)

  • @miladrogha4904
    @miladrogha4904 a year ago

    Thank you for this great tutorial!

  • @zaireenabdulrahman9900
    @zaireenabdulrahman9900 a year ago

    Wow, so easy to understand. Thank you so much.

  • @gisleberge4363
    @gisleberge4363 3 years ago +1

    Very clear tutorial...easily understood 🙂

  • @KT-st5gw
    @KT-st5gw 3 years ago +5

    After identifying topics, how do we assign them to each record in the DataFrame? We identify 2 clusters and their relevant words. How do we identify which documents belong to cluster 1 and which to cluster 2?
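
One way to approach this is to ask the trained model for each document's topic distribution and keep the most probable topic. A minimal sketch, assuming `lda_model` is a gensim `LdaModel` and `corpus` is the list of bag-of-words vectors built for it; `dominant_topic`, `assign_topics` and the column name `topic` are hypothetical names:

```python
def dominant_topic(lda_model, bow):
    """Return (topic_id, probability) for the most probable topic of one document."""
    # minimum_probability=0.0 asks gensim to keep all but vanishingly small topics.
    topics = lda_model.get_document_topics(bow, minimum_probability=0.0)
    return max(topics, key=lambda pair: pair[1])


def assign_topics(lda_model, corpus):
    """Return one dominant topic id per document, in corpus order."""
    return [dominant_topic(lda_model, bow)[0] for bow in corpus]


# With a pandas DataFrame whose rows are in the same order as `corpus`:
# df["topic"] = assign_topics(lda_model, corpus)
```

LDA gives every document a mixture of topics, so the dominant topic is a simplification; keeping the probability alongside the id shows how strong the assignment is.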

  • @quarkplankton
    @quarkplankton 2 years ago

    Great tutorial overall, but there are a few things I would love clarification on. As a few others have mentioned, where are you removing the stopwords? And what is "glob" used for, and when is it used?

  • @MariaLuizaCarvalhoMLCAP
    @MariaLuizaCarvalhoMLCAP 3 years ago +3

    I am trying to run LDA on my texts, which are almost 100 words per line, but when I pass them to the model they become one single document. Is there a tutorial anywhere that uses a pandas DataFrame where the texts are cells in a column? Please help...
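
A common cause of the "one single document" symptom is passing one long string, or the joined column, instead of one token list per row. A minimal sketch, assuming the texts sit in a column hypothetically named `text`; the regex tokenizer is a simple stand-in for whatever preprocessing is actually used:

```python
import re


def rows_to_documents(texts):
    """Turn an iterable of strings into a list of token lists, one per row."""
    documents = []
    for text in texts:
        # Non-string cells (for example missing values) become empty documents.
        tokens = re.findall(r"[a-z]+", text.lower()) if isinstance(text, str) else []
        documents.append(tokens)
    return documents


# With pandas: documents = rows_to_documents(df["text"])
# Each inner list is then one document for gensim's Dictionary and doc2bow.
```

The key point is the shape: a list of lists, with one inner list per DataFrame row.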

  • @wasgeht2409
    @wasgeht2409 3 years ago +1

    wow... very good!

  • @Chanezk
    @Chanezk a year ago

    Thanks !

  • @rahulmukerjee477
    @rahulmukerjee477 a year ago

    @17:10 doc2bow should give the frequency of words in the doc, not the corpus. Please confirm this.
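
That reading matches how gensim's `doc2bow` behaves: it is called on one tokenized document and returns `(token_id, count)` pairs for that document only. A stdlib-only sketch of the same idea (not gensim's implementation; `build_vocab` and this `doc2bow` are hypothetical stand-ins, and gensim assigns ids differently):

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def doc2bow(doc, vocab):
    """Count how often each known token occurs in this single document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [["camp", "camp", "train"], ["train", "camp"]]
vocab = build_vocab(docs)
corpus = [doc2bow(doc, vocab) for doc in docs]
# "camp" is counted twice in the first document and once in the second.
```

The corpus is therefore a list of per-document counts, and corpus-wide frequencies only appear if those counts are summed.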

  • @ElvisSCL
    @ElvisSCL 9 months ago

    Could you please explain how, after identifying topics, we assign them to each record in the DataFrame?

  • @maximviner4809
    @maximviner4809 2 years ago

    Very helpful

  • @leo_dr8198
    @leo_dr8198 2 years ago

    Amazing tutorial!!

  • @pravinmhaske
    @pravinmhaske 3 months ago

    My output shows one huge cluster and many tiny clusters. What does that mean?
    Note: The first cluster clearly represents the topic. All the other clusters contain irrelevant words, generally with a frequency of 1. Should I remove all those tokens from the corpus?
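
Pruning very rare and very common tokens before building the corpus is one thing to try; whether it changes the one-big-cluster picture depends on the data and on the number of topics. gensim's `Dictionary.filter_extremes(no_below=..., no_above=...)` does this kind of pruning. The sketch below re-implements the idea with the standard library so the effect is easy to see; `prune_by_doc_freq` is a hypothetical name and the thresholds are illustrative, not recommendations:

```python
from collections import Counter


def prune_by_doc_freq(docs, no_below=2, no_above=0.5):
    """Drop tokens found in fewer than `no_below` documents or in more than
    the fraction `no_above` of all documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[token for token in doc if token in keep] for doc in docs]
```

Tokens that occur in a single document carry little information about shared topics, which is why they are the usual first candidates for removal.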

  • @fefefefezzz
    @fefefefezzz 2 years ago

    Very good video!

  • @xinyipeng9337
    @xinyipeng9337 2 years ago +2

    It seems that import pyLDAvis.gensim no longer works. I changed it to import pyLDAvis.gensim_models

    • @python-programming
      @python-programming 2 years ago

      Thanks for updating us! I have liked this so others can see it higher in the comments.

  • @hardtoplaygamesti9592
    @hardtoplaygamesti9592 3 years ago +1

    I need to create a file containing a matrix in which every row corresponds to a topic and every column corresponds to a word. The value at each row and column is the probability that the word belongs to the topic. Any help on how to do this?
    Thank you
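
gensim's `LdaModel.get_topics()` returns a topics-by-vocabulary matrix of word probabilities, which is the layout described here. A minimal sketch that writes such a matrix to CSV (a format Excel can open); `write_topic_word_matrix` is a hypothetical helper, and `lda_model` and `id2word` in the comments are assumed to be the trained model and its dictionary:

```python
import csv


def write_topic_word_matrix(matrix, vocab, handle):
    """Write one row per topic and one column per word to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(["topic"] + list(vocab))
    for topic_id, row in enumerate(matrix):
        writer.writerow([topic_id] + [f"{prob:.6f}" for prob in row])


# Intended use with gensim (left commented out because it needs a trained model):
# matrix = lda_model.get_topics()
# vocab = [id2word[i] for i in range(len(id2word))]
# with open("topic_word_matrix.csv", "w", newline="", encoding="utf-8") as fh:
#     write_topic_word_matrix(matrix, vocab, fh)
```

With a large vocabulary the file gets very wide, so restricting the columns to the top words per topic may be more practical.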

  • @hamidrezamohammadzadeh6807
    @hamidrezamohammadzadeh6807 3 years ago

    Hey, many thanks for this tutorial! But I have a question:
    How can I export the final result (viz) to an Excel file?
    BR,
    Hamidreza

  • @tomasmoyashowtime
    @tomasmoyashowtime a year ago

    Hey!! I stumbled upon this video while looking for NLP methods to analyze text PDFs.
    I want to analyze over 2000 files. Is there a fast way to do so?
    The clustering would help me analyze topics within one archive. Is there a way to do it on the 2000 files automatically?

  • @woodgeorge5585
    @woodgeorge5585 3 years ago +2

    After trying data = load_data("data/ushmm_dn.json")["texts"]
    I get the error JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    Could anyone show me how to solve this?
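
`Expecting value: line 1 column 1 (char 0)` means the parser found no JSON at the very start of the file, which typically happens when the file is empty or is not JSON at all (for example, a saved web page rather than the raw file). A minimal sketch of a loader that reports which case applies; this `load_data` is an assumption about what the notebook's helper does, not a copy of it:

```python
import json


def load_data(path):
    """Load a JSON file, with clearer errors for two common failure cases."""
    with open(path, "r", encoding="utf-8") as fh:
        raw = fh.read()
    if not raw.strip():
        raise ValueError(f"{path} is empty; the download may have failed")
    if raw.lstrip().startswith("<"):
        raise ValueError(f"{path} looks like HTML, not JSON; download the raw file")
    return json.loads(raw)
```

Opening the file in a text editor and checking that it starts with `{` or `[` is a quick manual version of the same check.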

  • @dataruncoach
    @dataruncoach a year ago +1

    Can someone please help me? I am getting this error at the end when I try to show the vis:
    TypeError: Object of type complex is not JSON serializable
    Does anyone know how to fix this? This is my last line of code:
    import pyLDAvis
    import pyLDAvis.gensim_models
    pyLDAvis.enable_notebook()
    vis=pyLDAvis.gensim_models.prepare(lda_model,corpus,id2word)
    vis
    I think it has to do with how the JSON file was read and formatted?

  • @xiaohanhannahwen8181
    @xiaohanhannahwen8181 2 years ago

    May I ask which Python version we should use for this? I keep getting a ModuleNotFoundError...
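
A `ModuleNotFoundError` usually says more about which packages are installed in the running interpreter than about the Python version. A minimal sketch that lists which top-level packages are missing; `missing_modules` is a hypothetical helper, and the package list in the comment is an assumption about what the notebook imports:

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Shows which interpreter is running, useful when several Pythons are installed.
print(sys.executable)

# Intended use (package names are an assumption about the notebook's imports):
# print(missing_modules(["gensim", "spacy", "pyLDAvis"]))
```

Installing with `python -m pip install <package>` from the same interpreter that `sys.executable` reports avoids installing into a different environment by accident.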

  • @LorenzoFilitti
    @LorenzoFilitti 4 months ago

    At minute 6:48, how can I make it work if I have a txt file and not a JSON one?
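
For plain-text input, one option is to treat each .txt file as one document and collect the files with a glob pattern, which matches many file paths at once. A minimal sketch, assuming one document per file in a single folder; `load_txt_documents` and the folder name in the comment are hypothetical:

```python
from pathlib import Path


def load_txt_documents(folder, pattern="*.txt", encoding="utf-8"):
    """Read every file matching `pattern` in `folder`, one string per file."""
    # Sorting keeps the document order stable between runs.
    paths = sorted(Path(folder).glob(pattern))
    return [path.read_text(encoding=encoding) for path in paths]


# Intended use: data = load_txt_documents("data/my_texts")
# The resulting list of strings can then go through the same preprocessing
# steps as the texts loaded from JSON, assuming those are also plain strings.
```

If a single large .txt file holds many documents, splitting it on blank lines or another delimiter would be needed instead.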

  • @woodgeorge5585
    @woodgeorge5585 3 years ago +4

    ❌ import pyLDAvis.gensim
    ✅ import pyLDAvis.gensim_models

  • @rrrajat04
    @rrrajat04 3 years ago

    Can you make a video on Guided LDA or CorEx LDA for semi-supervised LDA?

  • @THECORNFACTORY
    @THECORNFACTORY 2 years ago

    Hey! What version of spaCy are you using?

  • @VengefulSpace
    @VengefulSpace 3 years ago +1

    Hi Dr. Mattingly, is the ushmm_dn.json file available as well through the GitHub repo ...?

    • @python-programming
      @python-programming 3 years ago

      I have a meeting today in which I will be asking permission to share that file. Thanks for reminding me.

    • @hizzuhishaam9392
      @hizzuhishaam9392 2 years ago

      @python-programming Can I get that file?

  • @DonJuan247
    @DonJuan247 2 years ago

    Hey man, I'm having an issue with this line:
    vis = prepare(lda_model,corpus,id2word,mds="mmds",R=20)
    When I run it, it says
    TypeError: prepare() missing 2 required positional arguments: 'vocab' and 'term_frequency'
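
This error usually means that `prepare` was imported from the top-level `pyLDAvis` package, whose generic `prepare` expects `topic_term_dists`, `doc_topic_dists`, `doc_lengths`, `vocab` and `term_frequency`, rather than from `pyLDAvis.gensim_models`, whose `prepare` takes the model, corpus and dictionary. A minimal sketch of a wrapper that always uses the gensim-specific function; `prepare_gensim_vis` is a hypothetical name:

```python
def prepare_gensim_vis(lda_model, corpus, id2word, **kwargs):
    """Build pyLDAvis data from a gensim model via the gensim-specific helper."""
    # Imported lazily so the module name is explicit at the point of use.
    import pyLDAvis.gensim_models as gensimvis

    # Extra keyword arguments such as mds="mmds" or R=20 are passed through.
    return gensimvis.prepare(lda_model, corpus, id2word, **kwargs)


# Intended use, mirroring the line quoted above:
# vis = prepare_gensim_vis(lda_model, corpus, id2word, mds="mmds", R=20)
```

The equivalent one-line fix is to change the import to `from pyLDAvis.gensim_models import prepare`.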

  • @RobertOSullivan
    @RobertOSullivan 3 years ago +1

    If ushmm_dn.json is not available can you point us towards data that we can use?

    • @python-programming
      @python-programming 3 years ago +3

      I got the okay to share it. Here it is: github.com/wjbmattingly/topic_modeling_textbook/tree/main/data

  • @premdayal8175
    @premdayal8175 3 years ago +1

    Can you please help me? I am getting this error: ModuleNotFoundError: No module named 'pyLDAvis.gensim'

    • @maniac123ful
      @maniac123ful 3 years ago +3

      import pyLDAvis.gensim_models

  • @alexwinquist8092
    @alexwinquist8092 3 years ago +1

    Any suggestions on visualizing the output in an IDE such as PyCharm? It seems like there are some issues using pyLDAvis in PyCharm.

    • @python-programming
      @python-programming 3 years ago

      Unfortunately, I do not know of any. PyLDAvis was designed with Jupyter in mind, I believe. You can, however, save it as an html and open it externally. Here's how => stackoverflow.com/questions/41936775/export-pyldavis-graphs-as-standalone-webpage

    • @alexwinquist8092
      @alexwinquist8092 3 years ago +1

      @python-programming Yeah, that's what I have been reading as well, thanks!

    • @python-programming
      @python-programming 3 years ago

      @alexwinquist8092 No problem! Wish I had better news.
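
The save-as-HTML route mentioned in this thread can be sketched with `pyLDAvis.save_html`, which writes the prepared visualization to a standalone page. A minimal sketch, assuming `vis` is the object returned by `prepare`; `save_visualization` and the default file name are hypothetical:

```python
import webbrowser
from pathlib import Path


def save_visualization(vis, out_path="lda_vis.html", open_in_browser=True):
    """Write a prepared pyLDAvis object to HTML and optionally open it."""
    # Imported lazily so the helper can be defined without pyLDAvis installed.
    import pyLDAvis

    out_path = Path(out_path)
    pyLDAvis.save_html(vis, str(out_path))
    if open_in_browser:
        # A file:// URI opens the page in the system's default browser.
        webbrowser.open(out_path.resolve().as_uri())
    return out_path


# Intended use from a plain script or an IDE run configuration:
# save_visualization(vis, "lda_vis.html")
```

Because the result is an ordinary HTML file, it can also be shared with people who do not have Python installed.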

  • @nanyinyang7994
    @nanyinyang7994 3 years ago

    When I do the visualization part, it tells me "module 'pyLDAvis' has no attribute 'gensim'". I'm not sure how to deal with it.

    • @philippplazibat397
      @philippplazibat397 3 years ago +2

      Try gensim_models. If I got that right, they changed it in a later version, so it no longer works with just gensim.

    • @talhamasood0000
      @talhamasood0000 3 years ago +3

      import pyLDAvis
      import pyLDAvis.gensim_models
      pyLDAvis.enable_notebook()
      vis=pyLDAvis.gensim_models.prepare(lda_model,corpus,id2word)

  • @janni7439
    @janni7439 2 years ago +2

    Using final and new as variable names makes me a little angry as a Java developer :D

    • @python-programming
      @python-programming 2 years ago

      Haha! I know, it is such a bad habit. I forgot about the final keyword in Java.

  • @Gama1939
    @Gama1939 2 years ago

    Uhmm. I think you forgot to remove the stopwords.

    • @Gama1939
      @Gama1939 2 years ago

      As far as I know, "simple_preprocess" doesn't remove stopwords.
      Anyway, please correct me if I'm wrong.
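
gensim's `simple_preprocess` lowercases and tokenizes text and drops tokens that are too short or too long, but it does not filter stop words, so that has to be a separate step. A stdlib-only sketch of such a step; the tiny `STOP_WORDS` set is illustrative only, and `tokenize` is a rough stand-in for `simple_preprocess`, not its implementation:

```python
import re

# Tiny illustrative list (an assumption); in practice use a fuller list such as
# gensim.parsing.preprocessing.STOPWORDS or NLTK's English stop words.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "was", "is", "it"})


def tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within the length bounds."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if min_len <= len(token) <= max_len]


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


cleaned = remove_stopwords(tokenize("The camp was in the north of Poland"))
```

Filtering before building the dictionary keeps frequent function words out of the topics and out of the visualization.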

  • @mouwersor
    @mouwersor 2 years ago

    "No module named gensim"

  • @circlezsquare1626
    @circlezsquare1626 6 months ago

    pip install pandas

  • @hunaydahsaeid1609
    @hunaydahsaeid1609 2 years ago

    Thanks 😊