How to Create an LDA Topic Model in Python with Gensim (Topic Modeling for DH 03.03)

Python Tutorials for Digital Humanities

Views: 62,828

Published on 8 Oct 2024
Notebook: github.com/wjb...
In this video, we use Gensim and Python to create an LDA Topic Model. As with other text analysis methods, most time is spent preparing the data and getting it into a form readable by the ML system.
If you enjoy this video, please subscribe. I provide all my content at no cost. If you want to support my channel, please donate via
PayPal: www.paypal.com...
Patreon: / wjbmattingly (it's my www.themedievalworld.com account as well).
If there's a specific video you would like to see or a tutorial series, let me know in the comments and I will try and make it.
If you liked this video, check out www.PythonHumanities.com, where I have Coding Exercises, Lessons, on-site Python shells where you can experiment with code, and a text version of the material discussed here.
You can follow me at:
/ wjb_mattingly

Comments • 62

@python-programming 3 years ago ⁺⁸
Notebook: github.com/wjbmattingly/topic_modeling_textbook/blob/main/03_03_lda_model_demo.ipynb
@vt-fc6gq 3 years ago ⁺²
First: thank you VERY MUCH for this video.
I was still facing an error message mentioning that:
module 'pyLDAvis' has no attribute 'gensim'
It comes from the fact that the module name changed.
You now have to write:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
and then:
# feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(ldamodel, corpus, dictionary)
(source: script_kitty, stackoverflow.com/questions/66759852/no-module-named-pyldavis)
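A notebook that has to run against both older and newer pyLDAvis releases could try the new module name first and fall back to the old one. The helper name `first_importable` is made up for this sketch; only `importlib.import_module` and the two module names quoted in the comment above are real.

```python
import importlib


def first_importable(module_names):
    """Return the first module in module_names that imports cleanly, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Newer pyLDAvis releases expose the Gensim helpers as pyLDAvis.gensim_models;
# older ones used pyLDAvis.gensim. Try the new name first.
gensimvis = first_importable(["pyLDAvis.gensim_models", "pyLDAvis.gensim"])

if gensimvis is None:
    print("pyLDAvis is not installed, or neither module name is available")
else:
    print("using", gensimvis.__name__)
    # lda_viz = gensimvis.prepare(ldamodel, corpus, dictionary)
```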
@krishnazanwar4709 2 years ago
Hey man, this Jupyter notebook apparently isn't working. It gives the error below.
Unreadable Notebook: C:\Users\PT\Downloads\topic_modelling.ipynb NotJSONError("Notebook does not appear to be JSON: '\...
@HungNguyen-te9dd 3 years ago ⁺³⁴
Literally one of the most underrated channels for NLP, keep up the good work!!!
@python-programming 3 years ago
You are too kind! Thanks! And I will
@victorevelandiasuarez2447 1 year ago ⁺¹
I usually watch tutorial videos at 1.5x speed, but this man speaks at 2x XD. Thanks for the video
@python-programming 1 year ago
Haha! No problem. I spent a lot of time trying to learn to speak this slowly. =)
@purposeoriented6094 2 years ago ⁺³
Really!!! One of the many underrated channels for NLP on YouTube. Keep up your good work; the rewards will follow. Thank you
@python-programming 2 years ago ⁺¹
Thanks for the kind words!
@ShahanShawkat 2 years ago
Good day, Dr. I highly appreciate the time and effort you put into creating these videos and making them available on YouTube! Best wishes
@domagojpalenkas4437 2 years ago ⁺⁵
Thanks for the great tutorial on topic modeling in general, very valuable material here.
In this particular video, have you forgotten to exclude the stop_words? I skimmed through the code but can't find the place where they were used.
Keep up the good work (Y)
@miladrogha4904 1 year ago
Thank you for this great tutorial!
@zaireenabdulrahman9900 1 year ago
Wow so easy to understand. Tq so much.
@gisleberge4363 3 years ago ⁺¹
Very clear tutorial...easily understood 🙂
@python-programming 3 years ago
Thanks! Glad you found it useful.
@KT-st5gw 3 years ago ⁺⁵
After identifying topics, how do we assign them to each record in the DF? We identify 2 clusters and their relevant words. How do we identify which document belongs to cluster 1 and which to cluster 2?
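One way to approach this question, sketched under assumptions: Gensim's `LdaModel.get_document_topics(bow)` returns a list of `(topic_id, probability)` pairs for a single document, so the most probable pair can be taken as that document's topic. The helper name `dominant_topic` and the example numbers below are invented for illustration, not taken from the video.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability.

    doc_topics is a list of (topic_id, probability) pairs, the shape that
    Gensim's LdaModel.get_document_topics(bow) returns for one document.
    """
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


# Hypothetical per-document output, standing in for
# [lda_model.get_document_topics(bow) for bow in corpus].
example_doc_topics = [
    [(0, 0.91), (1, 0.09)],
    [(0, 0.22), (1, 0.78)],
    [(1, 0.99)],
]

assignments = [dominant_topic(topics) for topics in example_doc_topics]
for doc_index, (topic_id, prob) in enumerate(assignments):
    print(f"document {doc_index}: topic {topic_id} (p={prob:.2f})")
```

If the corpus was built in the same order as the DataFrame rows, the resulting topic ids can be attached as a new column of the same length.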
@quarkplankton 2 years ago
Great tutorial overall, but there are a few things I would love clarification on: as a few others have mentioned, where are you removing the stopwords, and what is "glob" used for and when is it used?
@MariaLuizaCarvalhoMLCAP 3 years ago ⁺³
I am trying to do LDA where my texts, almost 100 words per line, become one single document when fed to the model. Is there a tutorial anywhere that uses a pandas DataFrame where the texts are cells in a column? Please help...
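A likely cause of everything collapsing into one document is passing a single joined string, or one flat token list, instead of one token list per row. The sketch below keeps each cell as its own document. The column name `"text"` is an assumption, and the whitespace split is a naive stand-in for the preprocessing used in the video.

```python
def rows_to_documents(rows):
    """Turn an iterable of text cells into a list of token lists, one per row."""
    documents = []
    for cell in rows:
        if cell is None:
            continue  # skip missing cells
        tokens = str(cell).lower().split()
        if tokens:
            documents.append(tokens)
    return documents


# Stand-in for df["text"].dropna().tolist(); "text" is an assumed column name.
rows = [
    "The ship left the harbour at dawn",
    "Letters from the camp were censored",
    None,
]

documents = rows_to_documents(rows)
print(len(documents), "documents")
```

The resulting list of token lists is the shape that Gensim's `Dictionary` constructor and `doc2bow` expect, one inner list per document.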
@wasgeht2409 3 years ago ⁺¹
wow... very good!
@Chanezk 1 year ago
Thanks !
@rahulmukerjee477 1 year ago
@17:10 doc2bow should give the frequency of words in the doc, not the corpus. Please confirm this.
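The point can be checked with a small pure-Python stand-in for `doc2bow`: counts are taken inside one document at a time, and the corpus is just the list of those per-document results. The function names here are invented, and real Gensim may assign token ids in a different order.

```python
from collections import Counter


def build_vocab(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for ONE document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


documents = [
    ["camp", "train", "camp"],
    ["train", "letter"],
]
vocab = build_vocab(documents)
corpus = [doc_to_bow(doc, vocab) for doc in documents]
print(corpus)
```

Here "camp" is counted twice in the first document and not at all in the second, so each count reflects a single document, not the corpus as a whole.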
@ElvisSCL 9 months ago
Could you please explain, after identifying topics, how do we assign them to each record in the DF?
@maximviner4809 2 years ago
Very helpful
@leo_dr8198 2 years ago
Amazing tutorial!!
@pravinmhaske 3 months ago
My output shows one huge cluster and many tiny clusters. What does that mean?
Note: First cluster clearly represents the topic. All other clusters have the other irrelevant words with frequency generally 1. Shall I remove all those tokens from the corpus?
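Dropping very rare tokens before training is a common preprocessing step, and Gensim's `Dictionary.filter_extremes(no_below=..., no_above=...)` does it on a dictionary. The pure-Python sketch below shows the same idea on token lists. The function name and the threshold of 2 are illustrative choices, not a recommendation from the video.

```python
from collections import Counter


def filter_rare_tokens(documents, no_below=2):
    """Drop tokens that appear in fewer than no_below documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    return [[t for t in doc if doc_freq[t] >= no_below] for doc in documents]


documents = [
    ["camp", "train", "guard"],
    ["camp", "letter"],
    ["train", "camp"],
]

filtered = filter_rare_tokens(documents, no_below=2)
print(filtered)
```

Whether this fixes the "one huge cluster" picture depends on the data; the number of topics chosen may matter just as much.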
@fefefefezzz 2 years ago
Very good video!
@xinyipeng9337 2 years ago ⁺²
It seems like import pyLDAvis.gensim no longer works; I changed it to import pyLDAvis.gensim_models
@python-programming 2 years ago
Thanks for updating us! I have liked this so others can see it higher in the comments.
@hardtoplaygamesti9592 3 years ago ⁺¹
I need to create a file as a matrix in which every row corresponds to a topic and every column corresponds to a word. Each cell holds the probability that the word belongs to the topic. Any help on how to do this?
Thank you
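A possible approach to the matrix request above: Gensim's `LdaModel.get_topics()` returns a topics-by-vocabulary array of word probabilities, which can be written out row by row. The sketch uses a toy matrix and an invented helper name, so it runs without a trained model.

```python
import csv
import os
import tempfile


def write_topic_word_csv(topic_word, vocab, path):
    """Write one row per topic and one column per word."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["topic"] + list(vocab))
        for topic_id, row in enumerate(topic_word):
            writer.writerow([topic_id] + [float(p) for p in row])


# Toy stand-ins. With a trained model these would be, under the assumption
# that id2word is the Gensim Dictionary used for training:
#   topic_word = lda_model.get_topics()
#   vocab = [id2word[i] for i in range(len(id2word))]
vocab = ["camp", "train", "letter"]
topic_word = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

out_path = os.path.join(tempfile.gettempdir(), "topic_word_matrix.csv")
write_topic_word_csv(topic_word, vocab, out_path)
print("wrote", out_path)
```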
@hamidrezamohammadzadeh6807 3 years ago
Hey, many thanks for this tutorial! But I have a question:
How can I export the final result (viz) to an Excel file?!
BR,
Hamidreza
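The interactive visualization itself is not a spreadsheet, but the top terms per topic can be flattened into rows and saved as CSV, which Excel opens directly. This sketch assumes the input has the shape returned by `lda_model.show_topics(formatted=False)`, a list of `(topic_id, [(word, probability), ...])` pairs; the example values and the helper name are made up.

```python
import csv
import io


def topics_to_rows(topics):
    """Flatten (topic_id, [(word, prob), ...]) pairs into table rows."""
    rows = []
    for topic_id, terms in topics:
        for rank, (word, prob) in enumerate(terms, start=1):
            rows.append((topic_id, rank, word, float(prob)))
    return rows


# Hypothetical values standing in for lda_model.show_topics(formatted=False).
example_topics = [
    (0, [("camp", 0.05), ("train", 0.03)]),
    (1, [("letter", 0.04), ("family", 0.02)]),
]

rows = topics_to_rows(example_topics)

# Written to an in-memory buffer here; pass a real file handle to save it.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["topic", "rank", "word", "probability"])
writer.writerows(rows)
print(buffer.getvalue())
```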
@tomasmoyashowtime 1 year ago
Hey!! I stumbled upon this video while looking for NLP methods to analyze text PDFs.
I want to analyze over 2000 files; is there a fast way to do so?
The clustering would help me analyze topics within 1 archive. Is there a way to do it on the 2000 files automatically?
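For many files, the standard-library `glob` module can collect every path matching a pattern so each file becomes one document. The sketch assumes the PDFs have already been converted to plain-text files, since text extraction is outside its scope. The demo files and the helper name are invented.

```python
import glob
import os
import tempfile


def load_text_files(pattern):
    """Read every file matching a glob pattern; return (path, text) pairs."""
    documents = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as handle:
            documents.append((path, handle.read()))
    return documents


# Demo on two throwaway files; replace the pattern with e.g. "data/*.txt".
demo_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "first archive text"), ("b.txt", "second archive text")]:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        handle.write(text)

documents = load_text_files(os.path.join(demo_dir, "*.txt"))
print(len(documents), "files loaded")
```

Keeping the path next to each text makes it possible to trace a topic assignment back to its source file later.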
@woodgeorge5585 3 years ago ⁺²
After trying data = load_data("data/ushmm_dn.json")["texts"]
an error occurs: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Could anyone teach me how to solve this?
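"Expecting value: line 1 column 1 (char 0)" means the parser found nothing it could read at the very first character. That often points to an empty file or to a file that is not JSON at all, for example an HTML page saved in place of the raw data. The checker below is a hypothetical helper, not the `load_data` function from the video; it reports what the file actually starts with.

```python
import json
import os
import tempfile


def load_data_checked(path):
    """Load JSON, with a clearer message when the file is empty or not JSON."""
    with open(path, "r", encoding="utf-8") as handle:
        raw = handle.read()
    stripped = raw.lstrip()
    if not stripped:
        raise ValueError(f"{path} is empty")
    if stripped[0] not in "{[":
        raise ValueError(
            f"{path} does not start like a JSON object or array; "
            f"it begins with {stripped[:30]!r}"
        )
    return json.loads(stripped)


# Demo files: one valid, one that is really an HTML page.
demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "good.json")
bad_path = os.path.join(demo_dir, "bad.json")
with open(good_path, "w", encoding="utf-8") as handle:
    json.dump({"texts": ["doc one", "doc two"]}, handle)
with open(bad_path, "w", encoding="utf-8") as handle:
    handle.write("<!DOCTYPE html><html>not the raw file</html>")

data = load_data_checked(good_path)["texts"]
print(data)

try:
    load_data_checked(bad_path)
except ValueError as error:
    print("problem:", error)
```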
@dataruncoach 1 year ago ⁺¹
Can someone please help me. I am getting this error at the end when I try to show the vis:
TypeError: Object of type complex is not JSON serializable
Does anyone know how to fix this? This is my last line of code:
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis=pyLDAvis.gensim_models.prepare(lda_model,corpus,id2word)
vis
I think it has to do with how the JSON file was read and formatted?
@xiaohanhannahwen8181 2 years ago
May I ask which Python version we should use for this? I keep getting the ModuleNotFoundError...
@LorenzoFilitti 4 months ago
At minute 6:48, how can I make it work if I have a txt file and not a JSON one?
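With a plain-text file, the main decision is what counts as one document. The sketch below assumes one document per non-empty line; that is an assumption about the file layout, and splitting on blank lines would suit paragraph-style files better. The helper name and demo content are invented.

```python
import os
import tempfile


def load_txt_documents(path):
    """Treat each non-empty line of a text file as one document."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo file with a blank line in the middle.
demo_path = os.path.join(tempfile.mkdtemp(), "texts.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("first testimony text\n\nsecond testimony text\n")

data = load_txt_documents(demo_path)
print(data)
```

The resulting list of strings can then stand in for the list that the video reads from the JSON file's "texts" key.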
@woodgeorge5585 3 years ago ⁺⁴
❌ import pyLDAvis.gensim
✅import pyLDAvis.gensim_models
@rrrajat04 3 years ago
can you make a video on Guided LDA or Corex LDA for Semi Supervised LDA?
@THECORNFACTORY 2 years ago
Hey! What version of spaCy are you using?
@VengefulSpace 3 years ago ⁺¹
Hi Dr. Mattingly, is the ushmm_dn.json file available as well through the GitHub repo ...?
@python-programming 3 years ago
I have a meeting today in which I will be asking permission to share that file. Thanks for reminding me.
@hizzuhishaam9392 2 years ago
@python-programming Can I get that file?
@DonJuan247 2 years ago
Hey man I'm having an issue with this line
vis = prepare(lda_model,corpus,id2word,mds="mmds",R=20)
When I run it, it says
TypeError: prepare() missing 2 required positional arguments: 'vocab' and 'term_frequency'
@RobertOSullivan 3 years ago ⁺¹
If ushmm_dn.json is not available can you point us towards data that we can use?
@python-programming 3 years ago ⁺³
I got the okay to share it. Here it is: github.com/wjbmattingly/topic_modeling_textbook/tree/main/data
@premdayal8175 3 years ago ⁺¹
Can you please help me? I am getting this error: ModuleNotFoundError: No module named 'pyLDAvis.gensim'
@maniac123ful 3 years ago ⁺³
import pyLDAvis.gensim_models
@alexwinquist8092 3 years ago ⁺¹
Any suggestions on visualizing the output in an IDE such as PyCharm? It seems like there are some issues using pyLDAvis in PyCharm
@python-programming 3 years ago
Unfortunately, I do not know of any. PyLDAvis was designed with Jupyter in mind, I believe. You can, however, save it as an html and open it externally. Here's how => stackoverflow.com/questions/41936775/export-pyldavis-graphs-as-standalone-webpage
@alexwinquist8092 3 years ago ⁺¹
@python-programming Yeah, that's what I have been reading as well, thanks!
@python-programming 3 years ago
@alexwinquist8092 No problem! Wish I had better news.
@nanyinyang7994 3 years ago
When I do the visualization part, it says "module 'pyLDAvis' has no attribute 'gensim'". Not sure how to deal with it.
@philippplazibat397 3 years ago ⁺²
Try gensim_models. If I got that right, they changed it in a later version, so it no longer works with just gensim.
@talhamasood0000 3 years ago ⁺³
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis=pyLDAvis.gensim_models.prepare(lda_model,corpus,id2word)
@janni7439 2 years ago ⁺²
Using final and new as variable names makes me a little angry as a Java developer :D
@python-programming 2 years ago
Haha! I know, it is such a bad habit. I forgot about the final keyword in Java.
@Gama1939 2 years ago
Uhmm. I think you forgot to remove the stopwords.
@Gama1939 2 years ago
As far as I know, "simple_preprocess" doesn't remove stopwords.
Anyway, please correct me if I'm wrong.
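If an explicit stopword step is wanted after tokenizing, a filter over each token list is enough. The short stopword set below is purely illustrative; Gensim ships a much larger one as `gensim.parsing.preprocessing.STOPWORDS`, which could be passed in its place. The function name is made up for this sketch.

```python
# A deliberately tiny, illustrative stopword set; swap in a fuller list
# such as gensim.parsing.preprocessing.STOPWORDS for real use.
ILLUSTRATIVE_STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "was", "were"}
)


def remove_stopwords_from_tokens(tokens, stopwords=ILLUSTRATIVE_STOPWORDS):
    """Return the tokens that are not in the stopword set, order preserved."""
    return [token for token in tokens if token not in stopwords]


tokens = ["the", "train", "was", "late", "and", "crowded"]
filtered = remove_stopwords_from_tokens(tokens)
print(filtered)
```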
@mouwersor 2 years ago
"No module named gensim"
@circlezsquare1626 6 months ago
pip install "pandas
@hunaydahsaeid1609 2 years ago
Thanks 😊

Next

How to Create Bigrams and Trigrams and Remove Frequent Words (Topic Modeling for DH 03.04)