Thanks for sharing! The best tutorials I've watched. No fancy slides, but very very useful code line by line.
This was really awesome! I am going to try to implement this on a survey here in New Zealand 🇳🇿
Nice! If you're working with survey data, you might also want to look into structural topic modeling (which has a great R package). That was originally also developed for open-ended survey questions, and allows you to model covariates. doi.org/10.1111/ajps.12103
@@kasperwelbers I have a feeling that these are outdated and would not be as good as current libraries like spacy. Is that right?
How good are these compared to phrase extraction using logic/rules on spacy's POS-tagging + dependency-parser data?
@@slkslk7841 I'd say the goal is quite different. Spacy is a great tool for preprocessing data, which can enhance various types of analysis (including topic modeling). One approach is indeed to use rules on dependency trees, for instance to extract phrases and semantic patterns such as who does what (coincidentally, we developed the rsyntax package for working with dependency trees in R: computationalcommunication.org/ccr/article/view/51/30 ).
But if you want to automatically classify documents in latent classes, you'll need some form of dimensionality reduction, and topic modeling is still a great way to do this. That said, it is certainly true that the vanilla LDA approach discussed here is getting older (though as I think I mention in the video, it's still a nice place to start). In terms of more state-of-the-art alternatives, I've seen some topic models that use contextual word embeddings (which you could for instance obtain using spacy).
@@kasperwelbers Great. Many thanks for the reply, Kasper!
Do you mind elaborating a bit more on the method mentioned in your last sentence?
Thanks again.
@@slkslk7841 I'll try, though it's hard to summarize. Classic LDA uses a document term matrix, which tells us nothing about the semantic similarities between words. With just this data, our model doesn't understand that the columns "cat" and "dog" are more similar than "cat" and "airplane". Given enough data, the topic model can learn that "cat" and "dog" are more similar if they often co-occur (perhaps in a "pets" topic). But wouldn't it be nice if we could already infuse our model with some general information about these kinds of semantic relations between words beforehand?
Here's where word embeddings come into play. These are lower-dimensional representations of words that are typically learned by training a deep learning model on a huuuuuge number of texts. This is great for all sorts of machine learning tasks, because it means our model knows more than just what it can learn from our training data. Even if "dog" and "cat" never co-occur in our corpus, the word embeddings convey that they are more similar than "cat" and "airplane".
One final cool thing to mention is that there are also aligned word embeddings across languages. This way our model even 'knows' which words are similar across languages, which is paving the way for multilingual topic models.
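To make that intuition concrete, here is a toy sketch in R. The three-dimensional 'embedding' vectors are completely made up for illustration (real embeddings have hundreds of dimensions and are learned from huge corpora), and cosine_sim is just a small helper:
# toy, hand-made word vectors; purely illustrative
emb <- rbind(
  cat      = c(0.9, 0.8, 0.1),
  dog      = c(0.8, 0.9, 0.2),
  airplane = c(0.1, 0.2, 0.9)
)
# cosine similarity between two vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim(emb["cat", ], emb["dog", ])       # high: cat and dog are 'close'
cosine_sim(emb["cat", ], emb["airplane", ])  # much lower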
Many thanks for putting this all together. Very helpful.
Thank you tons, sir, very helpful video!
Thanks for such good content!!!
Thank you so much, this helps a lot
wow! very helpful!
Hi Kasper, nice explanation of topic modeling. I am not able to figure out how to plot the latent topics to visualise the evolution of topics by year.
This is a great tutorial. I have a quick question. Which file type do I have to convert my current data set (an Excel file) to?
How do I add my own csv file as the corpus?
Kasper, your videos are so helpful! Do you know of any good videos explaining how to use Wordfish or Latent Semantic Scaling? I'm struggling with those.
Hi Sam! I don't know of any videos, but have you already checked out the Quanteda tutorials on the topic? tutorials.quanteda.io/machine-learning/. Quanteda has great support for word scaling methods, and some of the lead developers contributed greatly to this field. For more background on wordfish and wordscores, I recommend looking up some of the work of Ken Benoit. For latent semantic scaling, see Kohei Watanabe's LSX package (github.com/koheiw/LSX) and his excellent recent paper about this method (linked on the GitHub page).
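If it helps, a minimal wordfish sketch with the quanteda.textmodels package could look roughly like this (assuming dfmat is a quanteda dfm you already built; the dir argument just sets which two documents anchor the direction of the scale):
library(quanteda.textmodels)
wf <- textmodel_wordfish(dfmat, dir = c(1, 2))  # unsupervised scaling on the dfm
summary(wf)   # estimated document positions and word parameters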
Hello, it was good to learn LDA from this video, but could you make a video with a full explanation of structural topic modelling?
Hi @guaravsutar9080, I'm afraid I haven't planned anything of the sort. It's been a while since I used topic modeling (simply because my research took me elsewhere), so I'm not fully up to speed on the current state of the field.
@@kasperwelbers oh yes thank you so much
Actually, I'm going through it, but some of the code I'm not able to interpret in R.
Hi again, I also have one question: how do I add Slovenian stopwords in R? Do you perhaps know? Thank you so much.
Great video! Although I found that wearing a wig was a bit over the top.
hahahaha, how did I miss this comment!! I'm afraid it's my actual hair though.
Kasper, how would this work for a correlated topic model heat map with topic rows/topic columns?
If I recall correctly, the correlated topic model mostly differs in that it takes the correlations between topics into account when fitting the model. It probably adds a covariance matrix, but there should still be posterior distributions for document-topic and topic-word, so you should still be able to visualize the correlations of topics and documents (or topics with topics) in a heatmap. Though depending on what package you use to compute them, extracting the posteriors might work differently.
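As a rough sketch with the topicmodels package (assuming dtm is your document-term matrix, and 10 topics purely as an example):
library(topicmodels)
ctm <- CTM(dtm, k = 10)           # correlated topic model
theta <- posterior(ctm)$topics    # document-topic proportions
heatmap(cor(theta))               # topic-by-topic correlations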
@@kasperwelbers Thank you! I tried this code, it seems to have worked for basic LDA:
beta_x %
arrange(term) %>%
select(-term) %>%
rename_all(~paste0("topic", .))
}
beta_w
Thank you for this great tutorial! How do we order the heat map if we aggregate by date instead of by president? It is not sorted, so visually we cannot retrieve information.
Ah right! So the thing with the heatmap is that by default it creates a dendrogram (the tree thing on the top and left) that shows hierarchical clustering. The rows and columns are reordered so that similar rows/columns are closer together. If you want to use a specific order (like Year), you can order the matrix that you pass to the heatmap function and disable the clustering feature.
If you look at the documentation of the heatmap function (run ?heatmap), you'll see that the Rowv and Colv arguments control the dendrogram and re-ordering. You can turn this off by passing the NA value.
heatmap(as.matrix(tpp[-1]), Rowv = NA)
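For example, a sketch assuming tpp is a data frame whose first column is the date and the remaining columns are the topic proportions:
tpp <- tpp[order(tpp$date), ]            # put rows in chronological order
m <- as.matrix(tpp[-1])
rownames(m) <- as.character(tpp$date)    # label rows with the date
heatmap(m, Rowv = NA, Colv = NA)         # no reordering of rows or columns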
@@kasperwelbers just saw the answer. Thank you very much, it helps a lot! 🙏🏿🙏🏿
Nice! Can you try running the other topic modelling method, BTM?
Are you referring to biterm topic modelling? I haven't yet used it, but I know that Jan Wijffels (who also wrote the udpipe package) wrote a package for it. The documentation on the GitHub page makes it look pretty easy to implement: github.com/bnosac/BTM
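Going by its README (so treat this as an untested sketch), fitting one looks roughly like this, where tokens_df is a data frame with one row per token and the columns doc_id and token:
library(BTM)
set.seed(42)
btm <- BTM(tokens_df, k = 5, iter = 1000)  # biterm topic model with 5 topics
terms(btm, top_n = 10)                     # top 10 terms per topic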
For the corpus, can I use a large number of tweets? If so, what should the data type be?
Sure, though of course it depends on how large 'large' is. The main limitation is your computer's memory, and the more text you have, the longer it will take. If you have lots of data, you might want to limit your vocabulary size by dropping very rare (or very common) words. The data type of the tweets should just be text (character, string). The main topic modeling packages in R (topicmodels, stm) take a quanteda dfm as input, so if you learn how to convert tweets to this dfm you're good to go.
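A rough sketch of that pipeline (assuming your tweets sit in a data frame called tweets with a text column, and 20 topics purely as an example):
library(quanteda)
library(topicmodels)
corp <- corpus(tweets, text_field = "text")
toks <- tokens(corp, remove_punct = TRUE)
d <- dfm(toks)
d <- dfm_remove(d, stopwords("en"))     # drop stopwords
d <- dfm_trim(d, min_docfreq = 5)       # drop very rare words
dtm <- convert(d, to = "topicmodels")
lda <- LDA(dtm, k = 20)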
Hello! I have a question: is there a way to implement LDA in other languages? I'm trying to apply it to Italian reviews from the web.
Hi Brian! LDA itself does not care about language, because it only looks at word occurrences in documents. Simply put, as long as you can preprocess the text and represent it as a document term matrix, you can apply LDA.
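For Italian, the main language-specific step is the preprocessing, for instance removing Italian stopwords. A minimal sketch with quanteda (assuming texts is a character vector of your reviews):
library(quanteda)
corp <- corpus(texts)
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("it"))   # Italian stopwords
d <- dfm(toks)   # document-term matrix, ready for LDA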
@@kasperwelbers Thanks a lot for your fast reply. And of course thanks for the high quality content videos.
Hi Kasper, when I try using the quanteda package I am getting an error: Error: package or namespace load failed for ‘quanteda’:
Hi Ronan, that type of error could have many reasons. What often helps is updating R to the latest version (I at least remember that R 4.0.0 had some issues with packages that use Rcpp, which quanteda certainly does).
Thanks! Would you recommend the latest R version, 4.0.5?
@@ronandunne1097 In general, just always go for the latest. The issue with Rcpp was solved in 4.0.2 I think.
Hi Kasper. I find this video very useful. I have one question. In my research, I am analysing comments from social media, but I organised the data as follows and would be happy if you could help. In the Excel document, I have the authors of the comments written in one column and the content of the comments written in the other column. So I have approx. 4,000 rows, and each row has two columns: one for the author and one for the comment. I had all of these comments “separate” in my document, but I wanted to combine them. I obtained each group of comments from individual web portals (e.g. Facebook posts, comments under articles, Reddit debates, ...), and I combined all these documents of comments into two columns. So now all comments are written in one column; that's my corpus (it is binary now - all the comments in one row). Can I use LDA in R on this data set? Or do the comment groups need to be separated into individual documents for the LDA method? I hope my question is clear, thank you so much.
When I type the first command, I get this message: Error: corpus_reshape() only works on corpus objects. So I guess I did not prepare the data correctly. How is data prepared correctly for LDA?
@@nejc8316 Maybe install.packages("reshape2") will solve the problem
How do I connect with you?
I'm not very well hidden online, but I tend to prefer contact via my university email (research.vu.nl/en/persons/kasper-welbers).