Beautiful lecture on topic modeling. Thanks Prof Blei and the University of Edinburgh for making this lecture available.
link for pdf of the presentation - www.cs.columbia.edu/~blei/talks/Blei_User_Behavior.pdf
A very informative session
Very beautiful! Thanks for sharing.
If LDA has any semantic meaning, I think it's because of the Gibbs sampling step, which tries to push a word into a topic that its neighboring words are already in. In a broader sense, what Gibbs sampling computes is P(selected word | each topic) x P(neighboring words | each topic).
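That "pull toward the topics the document's other words use" can be seen directly in the collapsed Gibbs update for LDA. Below is a minimal toy sketch (not the lecture's code; the corpus, K, alpha and beta are made-up illustrative values): each token's topic is resampled in proportion to how popular the topic is in its document times how much the topic likes the word.

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 0]]   # documents as lists of word ids (toy data)
V = 4                                  # vocabulary size
K = 2                                  # number of topics
alpha, beta = 0.1, 0.01                # Dirichlet hyperparameters (assumed values)

# Random initial topic assignment for every token.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]

# Count matrices maintained by the sampler.
ndk = np.zeros((len(docs), K))         # topic counts per document
nkw = np.zeros((K, V))                 # word counts per topic
nk = np.zeros(K)                       # total tokens per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1
        nkw[k, w] += 1
        nk[k] += 1

def gibbs_sweep():
    """Resample every token's topic from its full conditional."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove this token's current assignment from the counts.
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # P(topic | rest) is proportional to
            # (how much this doc uses the topic) * (how much the topic likes this word)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(50):
    gibbs_sweep()

print(ndk)  # per-document topic counts after burn-in
```

The first factor in `p` is exactly the neighbor effect described above: words whose document-mates are concentrated in a topic get pulled into that topic.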
Trivial contrivance
I think the weakness of LDA is that it conflates semantics with words. Meaning arises via the relations between words, which entirely escape LDA's analysis. All LDA is good for is estimating word proximity between documents; it's effectively incapable of extracting precise topics from documents, only generic ones.
It's good enough if you have to deal with hundreds of documents containing thousands of words each.
Sure, but what is it good at? What is the semantic value of the (let's call it Cartesian) distance between two LDA signatures? I know what I'm talking about: I worked for a couple of years on an LDA-based classification project, and the semantic value of the topics extracted from the documents was too general to be truly useful. I think Blei et al. have found an interesting statistical method and a cool idea, but what they fail to express in this entire approach is precisely in what way their metric, and the methods by which they choose words, yields any meaningful insight into the analyzed texts. I find the whole thing very superficial. Without connecting your word net to some semantic ontology, you are doing nothing but an arbitrary match; arbitrary in the sense that meaning in language occurs in more complex ways than through individual nouns, verbs and adjectives.
I'm a noob at this, a few weeks into NLP, and I'm trying to solve a use case and hitting exactly this issue. Ultimately LDA just gives me a bunch of topic ids with words that don't mean anything together. I read that I have to name the topics myself! So I landed here looking for a 'solution'... hmm, I'm not the only one. Meanwhile I found something interesting, though I don't know its worth: ieeexplore.ieee.org/document/6405699/. It introduces the term 'concept' between topic and word. I could not find any implementations as yet.
Pritish N I applied LDA to public speeches and was able to compare the results to manual ones (i.e. people read the speeches and identified the main topics), and LDA performed rather well, discovering 12 out of 15 distinct topics. For instance, the health care topic had words like health, care, afford, insurance, cost at the top, so you won't confuse it with anything else. I also have a few topics that are hard to interpret, but it gave me the main topics I needed across 6,000 documents. I should mention that in addition to stopwords I had to exclude about 30 other words that were frequent but uninformative, such as year, state, always, because, etc. These will depend on your area, of course, but they pollute the results, and excluding them helped a lot.
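For anyone trying to reproduce this, the extra-exclusion step is just a second stopword list applied before fitting LDA. Here is a minimal sketch; the word lists and example sentence are illustrative placeholders, not the commenter's actual lists.

```python
# Generic English stopwords (abbreviated here for illustration).
GENERIC_STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "will"}

# ~30 corpus-specific words that are frequent but uninformative,
# found by inspecting the corpus (illustrative sample).
DOMAIN_STOPWORDS = {"year", "state", "always", "because", "people"}

STOPWORDS = GENERIC_STOPWORDS | DOMAIN_STOPWORDS

def preprocess(speech: str) -> list[str]:
    """Lowercase, tokenize on whitespace, and drop both stopword lists."""
    tokens = speech.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

doc = "The state will always expand health care because insurance cost is rising"
print(preprocess(doc))
# The filtered token lists can then be fed to any LDA implementation.
```

The point is that the domain list is corpus-dependent: you find it by looking at the highest-frequency words that survive generic stopword removal and judging which carry no topical signal.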
The problem is that the whole concept of the "topic" is grossly inflated. It has very shallow semantic value. A topic is a broad and ambiguous category.
16:36 add one for Tomotopy
Does anyone know where his other talk is that describes how to perform inference? 16:12
How do you make the graph at 2:30 in R?
Hello professor, can LDA be used to categorize documents into strict categories? Your video suggests otherwise, but I wanted to confirm.
I think you should use a hard clustering algorithm like k-means or hierarchical clustering for strict topics, because topic modelling is a soft clustering approach.
Thank you Manish for the reply, but could you elaborate further on what is meant by soft and hard techniques?
@@HarpreetKaur-qq8rx To my understanding, hard clustering assumes each document in a corpus exhibits exactly one topic, and all the words in that document are assumed to express that topic. Soft clustering assumes each document has its own probabilities of exhibiting each of the topics, so a document is a mixture of all the topics rather than belonging to just one. Each word in a document is assigned to one of the topics, and different words in the same document may be assigned to different topics.
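The distinction can be shown with a tiny made-up example (the numbers below are illustrative, not output from any real model): a hard clusterer emits one label per document, while a soft model like LDA emits a probability distribution over all K topics, which can be collapsed to a hard label via argmax at the cost of the mixture information.

```python
import numpy as np

# Hard clustering (e.g. k-means): each document gets exactly one cluster id.
hard_assignment = {"doc1": 0, "doc2": 2, "doc3": 0}

# Soft clustering (e.g. LDA): each document gets a probability
# distribution over all K topics (here K = 3, values made up).
soft_assignment = {
    "doc1": np.array([0.70, 0.25, 0.05]),
    "doc2": np.array([0.10, 0.15, 0.75]),
    "doc3": np.array([0.55, 0.30, 0.15]),
}

# A hard label can always be recovered from a soft one by taking the
# most probable topic, but the mixture information is thrown away.
hardened = {d: int(p.argmax()) for d, p in soft_assignment.items()}
print(hardened)  # each document collapsed to its single most likely topic
```

This is why LDA answers "strict category" questions only after an explicit argmax step, and why two documents with the same argmax topic can still have very different topic mixtures.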