Topic modeling with R and tidy data principles

Julia Silge

Views: 62,486

Published on 24 Nov 2024

Comments • 93

@learningstuffs5718 4 years ago ⁺³
I am learning R, have just got past the basics, and am trying to apply it in projects. Your channel is a fantastic place for people like me to learn; please keep teaching. Thank you.
@XJRULO 4 years ago ⁺¹
I took one or two of your DataCamp courses, but making this available for free is remarkable and generous work. Thanks a lot!!!
@toshiyukihasumi825 2 years ago
Thank you so much for your video. It's the ONLY tutorial I've found that talks about STM! Please keep them coming and truly appreciate your video!
@JuliaSilge 2 years ago ⁺¹
I've also got this blog/screencast that demonstrates how to use STM: juliasilge.com/blog/spice-girls/
@lightspd714 5 years ago ⁺²
Julia you are a great teacher. I love your text mining with R book but it is nice to see the concepts come to life in video.
@samuelholt7775 4 years ago ⁺²⁰
Please do more! This was a brilliant introduction with perfect pace, I learned so much in less than 30 min! Hopefully this tip helps you as much as this demonstration helped me: ctrl+shift+m (or cmd+shift+m) is a handy dplyr shortcut. Thank me later ;)
@donataamato3418 6 months ago
THANK YOU so much!!!
@hesamseraj 2 years ago
It is again very helpful. I hope you keep sharing more videos on any new topic that interests you.
@happylearning-gp 2 years ago
Excellent contribution, so fast, very clear, error-free, well explained
@Mrsandis89 3 years ago
Julia, you’re an angel. I have to do my dissertation with STM, and, thanks to you, I can literally complete it in 2 weeks, before my April 4th deadline.
@djcfb2889 3 years ago
Wow! This is probably the best R tutorial I've seen like forever!
@jadesweeney1690 3 years ago
This was so helpful to me during my research placement on tidytext data mining, thank you!
@Dawgs10100 3 years ago ⁺¹
Thank you for this great video. I hope there are more to come! :)
@DrGabriella.K 3 years ago
You directed me to topic modeling after I asked a question on Stack Overflow, thank you so much! Thank you for this amazing, amazing resource!
@terraflops 3 years ago
****PLEASE ZOOM IN **** for the future, please! I _Love_ this, thank you so much!
@RosieOutdoors 5 years ago
Thank you so much for this video. As a complete newcomer to R and topic modelling, this was so well explained.
@mxm8900 5 months ago
Wow great video. I have nothing to do with text analysis, but I still watched the whole video
@DanTaninecz 6 years ago
Great work. Very clear video. This type of solid instruction is all too rare in data science. Generally this type of stuff is just dumped on the user.
@Mrsandis89 3 years ago
And of course, I’ve read your work. You’re brilliant.
@robertc2121 6 years ago
Julia, this is amazing. I love your book, and I had been tempted by DataCamp for months before finally signing up because of your course. What a help they both have been. Thank you!!
@JuliaSilge 6 years ago ⁺²
HA you are so welcome! I'm really glad these resources are helpful. 👍
@jianzhang9157 4 years ago
I really like your Introduction! It's great.
@edutimqiu1168 3 years ago
Amazing work, incredibly helpful. All the best!
@prabhacar 2 years ago
thanks for such a nice explanation. loved the demo!
@englianhu 6 years ago
I used to use quanteda for my professional certificate few years ago.
The tidytext and stm packages that you introduce will be more suitable for natural language processing. 😉
@RajatSrivatava 6 years ago
Hi ma'am, your presentation and teaching skills are so good. Thanks so much
@swazy1777 4 years ago
You are an amazing teacher!
@morzaq123 6 years ago
Amazing Video. Looking Forward to more videos on Text Mining
@entrepreneuriatrecherchesetcon 2 years ago
Nice presentation. I suggest increasing the font size via Tools, general settings, Appearance, choosing for instance 16 or 18. The code will be clearer.
@kaswin6527 6 years ago ⁺¹
The most fabulous explanation I have ever seen.
Thank you sooooo much
@avijitnandy6662 6 years ago
Ma'am, we need more videos like this.
@stewartli5395 6 years ago
great insights in a tidy way. like it very much. thanks.
@vikrantnag86 5 years ago ⁺¹
Thank you Julia. Can you please share some knowledge on how to do sentiment analysis in R? It would be very helpful.
@vm2321 3 years ago
She's written a book about it bro lol
Here's the link www.tidytextmining.com/
5 years ago ⁺¹
That was an awesome teaching, thanks so much!
@TerezaS 4 years ago
Thank you so much for this video! And I love your book :)) If you considered doing more videos, I would love aspect-based sentiment analysis as a topic :))))
@entrepreneuriatrecherchesetcon 2 years ago
@Tereza S look at my video on sentiment analysis of many documents th-cam.com/video/rU97L9Tu7Dg/w-d-xo.html
@lrschm 3 years ago
Awesome video - super helpful! :)
@GustavoMontanha 4 years ago
thanks julia, loved it
@paulmm6878 3 years ago
I love your videos 😃 greetings from Ecuador ✌️
@janidelemmanuelcastaneda8318 4 years ago
Awesome content
@2108966 6 years ago
Julia you are amazing!!! Thanks!!!
@ai_refrains 6 years ago ⁺¹
Great video! I am hoping to do some topic modeling on some 19th-century German texts with your approach. I still am unsure what I will do to import German stop words, but I will do some digging.
One critique: it is difficult to type along while you are talking, especially when you are entering things into the console so quickly. Maybe slow down by 5%.
Thanks a lot for the great website and video.
@ai_refrains 6 years ago
Thanks a lot for the quick reply and very useful info!
@hkia7893 4 years ago
You can reduce the playback speed
@delando983 5 years ago ⁺¹
Nice video!! I am getting an error, not sure if it's me... more likely it is :(
sherlock_tf_idf %>%
  mutate(word = reorder(word, tf_idf, story)) %>%
  ggplot(aes(word, tf_idf, fill = story)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ story, scales = "free", ncol = 3) +
  scale_x_reordered() +
  coord_flip() +
  theme(strip.text = element_text(size = 11)) +
  labs(x = NULL, y = "tf-idf",
       title = "Highest tf-idf words in Sherlock Holmes short stories",
       subtitle = "Individual stories focus on different characters and narrative elements")
Error in mutate_impl(.data, dots) :
  Evaluation error: object 'FUN' of mode 'function' was not found.
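Editor's note: the error in this comment most likely comes from base R's reorder(), whose third positional argument is FUN, so passing story there makes R look for a function and fail. Since the pipeline already calls scale_x_reordered(), the intended helper was probably tidytext's reorder_within(). A minimal sketch of that fix follows; the toy data frame and its values are made up here to stand in for sherlock_tf_idf.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Toy stand-in for sherlock_tf_idf (hypothetical values, not from the video)
toy_tf_idf <- tibble(
  story  = c("A", "A", "B", "B"),
  word   = c("hat", "goose", "hat", "snake"),
  tf_idf = c(0.02, 0.05, 0.01, 0.04)
)

p <- toy_tf_idf %>%
  # reorder_within() orders words separately inside each story facet
  mutate(word = reorder_within(word, tf_idf, story)) %>%
  ggplot(aes(word, tf_idf, fill = story)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ story, scales = "free", ncol = 3) +
  # scale_x_reordered() strips the suffix that reorder_within() appends
  scale_x_reordered() +
  coord_flip() +
  labs(x = NULL, y = "tf-idf")
```

Printing p draws the faceted chart; the same two-line change should apply to the original sherlock_tf_idf pipeline, assuming a tidytext version that exports reorder_within().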
@celloharper 6 years ago
Thanks for the video. Please post more. How does one find your blog?
@knowledgeispower7007 4 years ago ⁺¹
Thank you so much for this video. I’m very new to R and to STM. I’m working on a paper and trying to analyze press releases to formulate my hypotheses and find relevant topics. The press releases are stored in a Word document. Could you please help/guide me on where to start and how to go about this? I’m trying to find latent variables and I heard that STM is a great model to use for this purpose. I appreciate your help 🙏
@JuliaSilge 4 years ago
The first thing you need to do is read the Word files into R, because Word files are a special format that requires specific handling. One package I like for dealing with Word and other Office files is officer: davidgohel.github.io/officer/
You can look at some of the other options folks use here: stackoverflow.com/questions/50439684/how-to-extract-plain-text-from-docx-file-using-r
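Editor's note: a minimal sketch of reading Word text with officer, as suggested in the reply above. To stay self-contained it first writes a tiny .docx to a temporary file; the paragraph text is invented, and in practice path would point at your own document.

```r
library(officer)

# Build a small .docx so the example runs anywhere (hypothetical content)
path <- tempfile(fileext = ".docx")
doc <- read_docx()
doc <- body_add_par(doc, "First press release paragraph.")
doc <- body_add_par(doc, "Second press release paragraph.")
print(doc, target = path)

# Read the file back and keep the non-empty paragraph text
content <- docx_summary(read_docx(path))
paragraphs <- content$text[content$content_type == "paragraph"]
paragraphs <- paragraphs[!is.na(paragraphs) & nzchar(paragraphs)]

# One row per paragraph, ready for tokenizing
press_releases <- data.frame(text = paragraphs, stringsAsFactors = FALSE)
```

From here, a data frame like press_releases can be tokenized with tidytext::unnest_tokens() in the same way the video treats the Sherlock Holmes text.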
@knowledgeispower7007 4 years ago
@@JuliaSilge thank you so much for your prompt response and for the resources you provided 🙏 I will definitely try them
@bbbbraveheart 2 years ago
thank you so much~~~~
@shilpasuresh641 4 years ago
How do you text mine a lot of URLs stored in a CSV file? Or, in other words, how do you apply topic modeling to them?
@abdulrahmanabdulkadri4825 4 years ago ⁺¹
This is great and very helpful! I would like to ask, how might we know which documents fall under which topic? Might there also be a data visualization for this? We only see how many documents fall under which topic, but not specifically which document.
@JuliaSilge 4 years ago ⁺¹
Yes, check out the topic modeling section of the workshop I taught at rstudio::conf this year:
github.com/rstudio-conf-2020/text-mining
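Editor's note: one way to see which document falls under which topic is to take the highest gamma per document. The sketch below uses invented gamma values in the shape that tidy(topic_model, matrix = "gamma") from tidytext returns for an stm model (one row per document-topic pair); the document names are hypothetical.

```r
library(dplyr)

# Hypothetical gamma values: one row per document-topic pair
td_gamma <- tibble(
  document = rep(c("story_1", "story_2", "story_3"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.9, 0.1, 0.2, 0.8, 0.55, 0.45)
)

# Keep the single most probable topic for each document
top_topic <- td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()
```

With a real model, passing document_names to tidy() should give readable labels instead of row indices, assuming the document order matches the matrix the model was fitted on.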
@abdulrahmanabdulkadri4825 4 years ago
@@JuliaSilge Amazing! Thank you very much!
@odhiambogigs2829 5 years ago
nice work....this was very helpful
@biaoyang6207 5 years ago
Great! Thanks for sharing!
@pe66o 5 years ago
Dear Julia, how can I create a topic model when I have a dataset as follows: column 1 is the word, column 2 the frequency of the word in the texts, column 3 the main class, and column 4 the subclass? The topics should be the classes and subclasses. I have already made a dictionary with the classes and subclasses. Thank you
@sonabaghdasaryan1198 6 years ago ⁺²
Hi, an amazing video. But still I have a problem from the very beginning: I get an error while downloading gutenbergr. Error: No package with the name gutengergr. Which RStudio version do you use in this video? Thanks in advance ^^
@sonabaghdasaryan1198 6 years ago ⁺¹
Everything is fine, thx. After restarting my computer my code is running ^^ Julia, u r great ^^u inspired me to do TM ..
@dr.tarunsengupta6248 2 years ago
The gutenbergr package is not available in the new version of R. Please change the code accordingly so that the analysis can be done from any text or PDF document.
@dianaszabo3875 3 years ago
Thank you :)
@hkia7893 4 years ago
Thanks Julia for this interesting implementation of topic modelling
So in the end we get 6 topics with probability of 7 words each. And we do not know which story belongs to which topics.... 🤔
@JuliaSilge 4 years ago ⁺¹
If you look at the gamma probabilities, you can see how the stories are related to topics. Check out the plot "Distribution of document probabilities for each topic" here: juliasilge.com/blog/sherlock-holmes-stm/
@hkia7893 4 years ago
@@JuliaSilge thanks, I'm gonna check that out...
@davidizquierdogomez 5 years ago
Hello Julia... very nice video, thanks a lot. I have a question: in my network graph of bigrams, I get nodes without names. Does it mean that I haven't cleaned the whitespace properly? Thank you very much.
@davidizquierdogomez 5 years ago
Thanks for the response... I double-checked and it is not a problem related to whitespace. I wrote code to get an igraph of bigrams, and I get bigrams that sit alone in two-node associations. Instead of a bigram, there is a number on the empty node...
@srisreshtan1471 4 years ago
When I am trying to install the 'Guttenberger' package, I am getting a message package ‘guttenberger’ is not available (for R version 3.6.3)
@JuliaSilge 4 years ago ⁺¹
I think you're dealing with some typos there; there's just one "t" and no "e" at the end: cran.r-project.org/package=gutenbergr
@srisreshtan1471 4 years ago
Yes. My mistake. Apologies. Thanks for correcting it.
@jacobbonsell4776 6 years ago
Is there a way to get the frequency counts next to the betas in the topic-word distribution? I wanted to either use mutate or join somehow but I don't know where to retrieve the counts.
@jacobbonsell4776 6 years ago
Thank you
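Editor's note: the reply to this question is not preserved in the thread. One possible approach, sketched here under the assumption that the counts come from the same one-token-per-row data used to build the model, is to count words and join them onto the tidied beta table. All values below are invented.

```r
library(dplyr)

# Hypothetical tidy beta output: one row per topic-term pair
td_beta <- tibble(
  topic = c(1L, 1L, 2L, 2L),
  term  = c("holmes", "hat", "holmes", "snake"),
  beta  = c(0.03, 0.01, 0.02, 0.04)
)

# Hypothetical one-token-per-row data, as unnest_tokens() would produce
tidy_words <- tibble(
  story = c("A", "A", "A", "B", "B"),
  word  = c("holmes", "hat", "holmes", "snake", "holmes")
)

# Corpus-wide frequency of each word
word_counts <- count(tidy_words, word, name = "n")

# Attach the counts next to each beta value
beta_with_counts <- td_beta %>%
  left_join(word_counts, by = c("term" = "word"))
```

Counting by story as well (count(tidy_words, story, word)) would give per-document frequencies instead, at the cost of a longer joined table.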
@ilCapotasto 6 years ago ⁺¹
cast_dfm has been moved from quanteda to tidytext, correct?
@justinwallace1304 6 years ago
Ol
@bistanz 3 years ago
Thanks for the video! One small question. Don't you need Sherlock %>% filter(!is.na(story)) to remove all NA rows?
@JuliaSilge 3 years ago
It's been a while since I looked at this, but I don't believe there are any NA rows, at least as of how the data was formatted when I originally created this video/post back in 2018. You can see that in the tf-idf plot, no NA story facet: juliasilge.com/blog/sherlock-holmes-stm/
@bistanz 3 years ago
@@JuliaSilge Thanks for replying. Don't we select only the top 10 words in each document to plot tf-idf? Oh! Evidently NA is not that frequent. You are right, we may not need to remove NAs. Thanks again for the amazing material.
@PaulYoung-r8g 1 year ago
Great
@Yi-cu7ie 4 years ago
Hi, thank you for your video, which helps me a lot. I have a question. I have raw text in PDF and Word form; how could I convert it to a data frame like sherlock_raw and sherlock in the program? Thank you so much for your time and consideration!!!
@JuliaSilge 4 years ago ⁺¹
For PDFs, my favorite tool for reading text into R is the pdftools package: docs.ropensci.org/pdftools/
I have less experience reading in .docx files, but I have occasionally used the textreadr package: github.com/trinker/textreadr
Good luck!
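Editor's note: pdftools::pdf_text() returns a character vector with one element per page. The sketch below shows one way to reshape such a vector into a one-line-per-row data frame similar in spirit to sherlock_raw. The helper name pages_to_lines and the sample pages are made up for illustration, and the pdf_text() call is left as a comment because it needs a real file.

```r
library(tibble)

# Reshape a vector of page strings into one row per line of text
pages_to_lines <- function(pages, doc_id) {
  lines_per_page <- strsplit(pages, "\n", fixed = TRUE)
  tibble(
    doc  = doc_id,
    page = rep(seq_along(pages), lengths(lines_per_page)),
    text = unlist(lines_per_page)
  )
}

# Stand-in for: pages <- pdftools::pdf_text("my_report.pdf")  # hypothetical file
pages <- c("First line\nSecond line", "Third line")
report <- pages_to_lines(pages, doc_id = "my_report")
```

The text column can then go straight into tidytext::unnest_tokens(word, text), as in the video.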
@emilierademakers70 6 years ago
Hi Julia, thanks for sharing this tutorial! It was exactly what I needed. I am working on recovering latent dimensions in job descriptions and I am using R topic modelling to gain insight. I have two questions.
1. I first started working on my data using Text Mining with R and got acquainted with the LDA methods. I see there are similarities with the stm package; however, the documentation states that without covariates (which is what I am doing at the moment), STM reduces to a logistic-normal topic model, often called the Correlated Topic Model. What would you say are the main differences between CTM and LDA? And apart from it being fast (indeed!), what would you say is the main motivation for using the stm package (with spectral initialization)?
2. Would you recommend first filtering out synonyms using e.g. the wordnet package in R? Or should the co-occurrence of these words with other words in documents solve this more or less?
Many many thanks!
Emilie
@JuliaSilge 6 years ago ⁺²
I don't think you need to filter out synonyms before implementing topic modeling, because that is one of the things topic modeling is doing, during the modeling process, finding the latent topics. Related, you might want to even consider whether stemming is useful for your domain space: transacl.org/ojs/index.php/tacl/article/view/868
I have had consistent, excellent results with STM, which is one of the reasons I recommend it to folks. LDA models are based on the Dirichlet distribution (if you draw a sample from a Dirichlet distribution, you get a positive vector that sums to one); these models are based on priors over topics/words, then you solve for (approximate) posterior. CTM is a different approach, which models that one topic can be correlated with another (LDA assumes they are independent). Instead of Dirichlet, it uses the logistic normal distribution, as I understand it. If you want to read the original paper for CTM, it is here:
arxiv.org/pdf/0708.3601.pdf
As far as spectral initialization, it is a good place to start and nice for getting quick and reasonable results. If I need something very robust, then I do all the work that is laid out in the stm package vignette. I am working on some tidy tooling around that, and hope to get it out sometime soon!
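Editor's note: to make the Dirichlet versus logistic-normal contrast in the reply above concrete, here is a small base R sketch that draws one vector of topic proportions from each. It is an illustration only, not how stm estimates anything; the covariance matrix, the concentration parameters, and the helper name rdirichlet_one are invented.

```r
set.seed(123)
K <- 3  # number of topics

# Dirichlet draw (LDA-style prior): normalise independent gamma variates
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}
theta_lda <- rdirichlet_one(rep(1, K))

# Logistic-normal draw (CTM-style prior): softmax of a correlated Gaussian.
# Topics 1 and 2 are given a positive correlation of 0.8 here.
Sigma <- matrix(c(1.0, 0.8, 0.0,
                  0.8, 1.0, 0.0,
                  0.0, 0.0, 1.0), nrow = K)
eta <- drop(t(chol(Sigma)) %*% rnorm(K))
theta_ctm <- exp(eta) / sum(exp(eta))
```

Both vectors are positive and sum to one; the difference is that the off-diagonal entries of Sigma let the logistic-normal version encode that some topics tend to appear together, which a Dirichlet cannot express.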
@Jaji1948 2 years ago
Resolution too low. Can’t read the screen. Can you send me a link to a higher res version?
@dinohadjiyannis3225 5 months ago
Julia, if I'm using a topic model on YouTube comments to determine which video best explains topic modeling, how can I decide if your video or another video should be suggested? I see the model ranks comments with "gamma." If each comment is linked to a video ID, and based on gamma some or all comments rank highly in a hypothetical "topic modeling" topic, what then? Can we infer that your video is the best?
@JuliaSilge 5 months ago ⁺¹
HAHA I can't tell if this is serious or not 🙈
In case it is, I will say that since topic modeling is unsupervised ML, it can't be used in a straightforward way to evaluate better/worse (you are not predicting a label). Instead, like you say, you could compare the relative proportion of certain topics (like, say, a topic that seems to be mostly about topic modeling) in one video's comments compared to others, and make an evaluation of videos based on that.
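Editor's note: a minimal sketch of the comparison described in the reply above, averaging the gamma of one topic over each video's comments. The video IDs and gamma values are invented, and the topic number standing for "topic modeling" is an assumption.

```r
library(dplyr)

# Hypothetical per-comment gamma for the topic of interest (topic 1)
comment_gamma <- tibble(
  video_id = c("vid_a", "vid_a", "vid_b", "vid_b"),
  topic    = c(1L, 1L, 1L, 1L),
  gamma    = c(0.8, 0.6, 0.3, 0.1)
)

# Average topic proportion per video, highest first
video_topic <- comment_gamma %>%
  group_by(video_id, topic) %>%
  summarise(mean_gamma = mean(gamma), n_comments = n(), .groups = "drop") %>%
  arrange(desc(mean_gamma))
```

This ranks videos by how much their comments discuss the topic, which is a measure of relevance rather than quality, in line with the caveat that an unsupervised model is not predicting a label.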
@dinohadjiyannis3225 5 months ago
@@JuliaSilge
If I can "cluster" comments related to topic modeling and find that the most relevant ones are linked to your video ID (based on beta, which will give you the top word probabilities), your video will appear with the highest relevance to that topic (based on gamma). This means your video is the most representative of that specific topic. But wait..
Then, if I manually compare, say, the top 10 most relevant videos and see that your video (which is at the top) also has a lot of likes, comments, engagement, and perhaps a great sentiment (after computing it) compared to the other 9, I can conclude that your video is the "best" and would recommend it.
Does this make sense, or am I misinterpreting the gamma/beta?
***Assume I have concatenated all comments for a video into one corpus. Each corpus is linked to a video ID.
@JuliaSilge 5 months ago
@@dinohadjiyannis3225 I think that makes sense! Sounds to me like you are interpreting correctly. 👍
@dinohadjiyannis3225 5 months ago
@@JuliaSilge A big thanks to you for replying, given that this video is 6 years old. 🥇
@PatriciaRiosblog 6 years ago
Hi Julia, would stm work nowadays for Twitter or Facebook content? Thanks
@JuliaSilge 6 years ago
Yep! This example shows using stm for topic modeling with long documents (books) but this approach also works with shorter documents. If you want to see an example of this, I have a blog here implementing topic modeling with Hacker News posts: juliasilge.com/blog/evaluating-stm/
@puspa_indah 5 years ago
How do you calculate theta and beta in structural topic modeling manually? Does anyone know the formula or concept?
@puspa_indah 5 years ago
@Julia Silge yes, I've already checked that paper but I can't find specific information related to the formula I mentioned. Is the algorithm for estimating the theta and beta matrices similar across topic modeling methods (i.e. LDA, CTM, STM, etc.)? Thanks for the previous reply btw :)
@renatacavalcanti8297 4 years ago
A more than perfect video